author     Ingo Molnar <mingo@kernel.org>   2016-06-30 02:27:41 -0400
committer  Ingo Molnar <mingo@kernel.org>   2016-06-30 02:27:41 -0400
commit     54d5f16e55a7cdd64e0f6bcadf2b5f871f94bb83 (patch)
tree       169537619e16c6a6d802585ba97aae405642233a
parent     4c2e07c6a29e0129e975727b9f57eede813eea85 (diff)
parent     4d03754f04247bc4d469b78b61cac942df37445d (diff)
Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU changes from Paul E. McKenney:

 - Documentation updates.  Just some simple changes, no design-level
   additions.

 - Miscellaneous fixes.

 - Torture-test updates.

Signed-off-by: Ingo Molnar <mingo@kernel.org>
-rw-r--r--  Documentation/RCU/Design/Requirements/Requirements.html    35
-rw-r--r--  Documentation/RCU/stallwarn.txt                             2
-rw-r--r--  Documentation/RCU/whatisRCU.txt                             3
-rw-r--r--  Documentation/sysctl/kernel.txt                            12
-rw-r--r--  include/linux/kernel.h                                      1
-rw-r--r--  include/linux/rcupdate.h                                   23
-rw-r--r--  include/linux/torture.h                                     4
-rw-r--r--  init/Kconfig                                                1
-rw-r--r--  kernel/rcu/rcuperf.c                                       25
-rw-r--r--  kernel/rcu/rcutorture.c                                     9
-rw-r--r--  kernel/rcu/tree.c                                         586
-rw-r--r--  kernel/rcu/tree.h                                          15
-rw-r--r--  kernel/rcu/tree_exp.h                                     656
-rw-r--r--  kernel/rcu/tree_plugin.h                                   95
-rw-r--r--  kernel/rcu/update.c                                         7
-rw-r--r--  kernel/sysctl.c                                            11
-rw-r--r--  kernel/torture.c                                          176
-rw-r--r--  lib/Kconfig.debug                                          33
-rw-r--r--  tools/testing/selftests/rcutorture/bin/functions.sh        12
-rwxr-xr-x  tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh   34
-rwxr-xr-x  tools/testing/selftests/rcutorture/bin/kvm.sh               2
-rwxr-xr-x  tools/testing/selftests/rcutorture/bin/parse-console.sh     7
-rw-r--r--  tools/testing/selftests/rcutorture/doc/initrd.txt          22
23 files changed, 978 insertions, 793 deletions
diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html
index e7e24b3e86e2..ece410f40436 100644
--- a/Documentation/RCU/Design/Requirements/Requirements.html
+++ b/Documentation/RCU/Design/Requirements/Requirements.html
@@ -2391,6 +2391,41 @@ and <tt>RCU_NONIDLE()</tt> on the other while inspecting
2391idle-loop code. 2391idle-loop code.
2392Steven Rostedt supplied <tt>_rcuidle</tt> event tracing, 2392Steven Rostedt supplied <tt>_rcuidle</tt> event tracing,
2393which is used quite heavily in the idle loop. 2393which is used quite heavily in the idle loop.
2394However, there are some restrictions on the code placed within
2395<tt>RCU_NONIDLE()</tt>:
2396
2397<ol>
2398<li> Blocking is prohibited.
2399 In practice, this is not a serious restriction given that idle
2400 tasks are prohibited from blocking to begin with.
2401<li>	Although nesting <tt>RCU_NONIDLE()</tt> is permitted, they cannot
2402 nest indefinitely deeply.
2403 However, given that they can be nested on the order of a million
2404 deep, even on 32-bit systems, this should not be a serious
2405 restriction.
2406 This nesting limit would probably be reached long after the
2407 compiler OOMed or the stack overflowed.
2408<li> Any code path that enters <tt>RCU_NONIDLE()</tt> must sequence
2409 out of that same <tt>RCU_NONIDLE()</tt>.
2410 For example, the following is grossly illegal:
2411
2412 <blockquote>
2413 <pre>
2414 1 RCU_NONIDLE({
2415 2 do_something();
2416 3 goto bad_idea; /* BUG!!! */
2417 4 do_something_else();});
2418 5 bad_idea:
2419 </pre>
2420 </blockquote>
2421
2422 <p>
2423 It is just as illegal to transfer control into the middle of
2424 <tt>RCU_NONIDLE()</tt>'s argument.
2425 Yes, in theory, you could transfer in as long as you also
2426 transferred out, but in practice you could also expect to get sharply
2427 worded review comments.
2428</ol>
2394 2429
2395<p> 2430<p>
2396It is similarly socially unacceptable to interrupt an 2431It is similarly socially unacceptable to interrupt an
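
For contrast with the "grossly illegal" example just added above, a legal use of RCU_NONIDLE() simply enters and leaves the macro through ordinary statement flow. The sketch below is illustrative only; report_from_idle() is a made-up caller, and do_something_with_RCU() is the placeholder helper name used in the RCU_NONIDLE() kerneldoc comment further down in this patch:

	/* Legal: control enters and exits RCU_NONIDLE() sequentially. */
	static void report_from_idle(void)
	{
		/*
		 * Any short, non-blocking RCU-using statement is fine here;
		 * RCU_NONIDLE() wakes RCU up, runs the statement, and then
		 * tells RCU to go back to ignoring this idle CPU.
		 */
		RCU_NONIDLE(do_something_with_RCU());
	}
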
diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
index 0f7fb4298e7e..e93d04133fe7 100644
--- a/Documentation/RCU/stallwarn.txt
+++ b/Documentation/RCU/stallwarn.txt
@@ -49,7 +49,7 @@ rcupdate.rcu_task_stall_timeout
49 This boot/sysfs parameter controls the RCU-tasks stall warning 49 This boot/sysfs parameter controls the RCU-tasks stall warning
50 interval. A value of zero or less suppresses RCU-tasks stall 50 interval. A value of zero or less suppresses RCU-tasks stall
51 warnings. A positive value sets the stall-warning interval 51 warnings. A positive value sets the stall-warning interval
52 in jiffies. An RCU-tasks stall warning starts wtih the line: 52 in jiffies. An RCU-tasks stall warning starts with the line:
53 53
54 INFO: rcu_tasks detected stalls on tasks: 54 INFO: rcu_tasks detected stalls on tasks:
55 55
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 111770ffa10e..204422719197 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -5,6 +5,9 @@ to start learning about RCU:
52. What is RCU? Part 2: Usage http://lwn.net/Articles/263130/ 52. What is RCU? Part 2: Usage http://lwn.net/Articles/263130/
63. RCU part 3: the RCU API http://lwn.net/Articles/264090/ 63. RCU part 3: the RCU API http://lwn.net/Articles/264090/
74. The RCU API, 2010 Edition http://lwn.net/Articles/418853/ 74. The RCU API, 2010 Edition http://lwn.net/Articles/418853/
8 2010 Big API Table http://lwn.net/Articles/419086/
95. The RCU API, 2014 Edition http://lwn.net/Articles/609904/
10 2014 Big API Table http://lwn.net/Articles/609973/
8 11
9 12
10What is RCU? 13What is RCU?
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt
index a3683ce2a2f3..33204604de6c 100644
--- a/Documentation/sysctl/kernel.txt
+++ b/Documentation/sysctl/kernel.txt
@@ -58,6 +58,7 @@ show up in /proc/sys/kernel:
58- panic_on_stackoverflow 58- panic_on_stackoverflow
59- panic_on_unrecovered_nmi 59- panic_on_unrecovered_nmi
60- panic_on_warn 60- panic_on_warn
61- panic_on_rcu_stall
61- perf_cpu_time_max_percent 62- perf_cpu_time_max_percent
62- perf_event_paranoid 63- perf_event_paranoid
63- perf_event_max_stack 64- perf_event_max_stack
@@ -618,6 +619,17 @@ a kernel rebuild when attempting to kdump at the location of a WARN().
618 619
619============================================================== 620==============================================================
620 621
622panic_on_rcu_stall:
623
624When set to 1, calls panic() after RCU stall detection messages. This
625is useful to determine the root cause of RCU stalls using a vmcore.
626
6270: do not panic() when an RCU stall takes place; this is the default behavior.
628
6291: panic() after printing RCU stall messages.
630
631==============================================================
632
621perf_cpu_time_max_percent: 633perf_cpu_time_max_percent:
622 634
623Hints to the kernel how much CPU time it should be allowed to 635Hints to the kernel how much CPU time it should be allowed to
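
The sysctl text above corresponds one-to-one with the helper this series adds to kernel/rcu/tree.c (see the tree.c and kernel/sysctl.c hunks below); condensed, the wiring looks like this:

	/* Exposed as /proc/sys/kernel/panic_on_rcu_stall via kernel/sysctl.c. */
	int sysctl_panic_on_rcu_stall __read_mostly;

	static inline void panic_on_rcu_stall(void)
	{
		if (sysctl_panic_on_rcu_stall)
			panic("RCU Stall\n");	/* leave a vmcore for post-mortem analysis */
	}

Both print_cpu_stall() and print_other_cpu_stall() invoke panic_on_rcu_stall() only after printing their diagnostics, so the stall messages reach the console before any panic.
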
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 94aa10ffe156..c42082112ec8 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -451,6 +451,7 @@ extern int panic_on_oops;
451extern int panic_on_unrecovered_nmi; 451extern int panic_on_unrecovered_nmi;
452extern int panic_on_io_nmi; 452extern int panic_on_io_nmi;
453extern int panic_on_warn; 453extern int panic_on_warn;
454extern int sysctl_panic_on_rcu_stall;
454extern int sysctl_panic_on_stackoverflow; 455extern int sysctl_panic_on_stackoverflow;
455 456
456extern bool crash_kexec_post_notifiers; 457extern bool crash_kexec_post_notifiers;
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 5f1533e3d032..3bc5de08c0b7 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -45,6 +45,7 @@
45#include <linux/bug.h> 45#include <linux/bug.h>
46#include <linux/compiler.h> 46#include <linux/compiler.h>
47#include <linux/ktime.h> 47#include <linux/ktime.h>
48#include <linux/irqflags.h>
48 49
49#include <asm/barrier.h> 50#include <asm/barrier.h>
50 51
@@ -379,12 +380,13 @@ static inline void rcu_init_nohz(void)
379 * in the inner idle loop. 380 * in the inner idle loop.
380 * 381 *
381 * This macro provides the way out: RCU_NONIDLE(do_something_with_RCU()) 382 * This macro provides the way out: RCU_NONIDLE(do_something_with_RCU())
382 * will tell RCU that it needs to pay attending, invoke its argument 383 * will tell RCU that it needs to pay attention, invoke its argument
383 * (in this example, a call to the do_something_with_RCU() function), 384 * (in this example, calling the do_something_with_RCU() function),
384 * and then tell RCU to go back to ignoring this CPU. It is permissible 385 * and then tell RCU to go back to ignoring this CPU. It is permissible
385 * to nest RCU_NONIDLE() wrappers, but the nesting level is currently 386 * to nest RCU_NONIDLE() wrappers, but not indefinitely (but the limit is
386 * quite limited. If deeper nesting is required, it will be necessary 387 * on the order of a million or so, even on 32-bit systems). It is
387 * to adjust DYNTICK_TASK_NESTING_VALUE accordingly. 388 * not legal to block within RCU_NONIDLE(), nor is it permissible to
389 * transfer control either into or out of RCU_NONIDLE()'s statement.
388 */ 390 */
389#define RCU_NONIDLE(a) \ 391#define RCU_NONIDLE(a) \
390 do { \ 392 do { \
@@ -649,7 +651,16 @@ static inline void rcu_preempt_sleep_check(void)
649 * please be careful when making changes to rcu_assign_pointer() and the 651 * please be careful when making changes to rcu_assign_pointer() and the
650 * other macros that it invokes. 652 * other macros that it invokes.
651 */ 653 */
652#define rcu_assign_pointer(p, v) smp_store_release(&p, RCU_INITIALIZER(v)) 654#define rcu_assign_pointer(p, v) \
655({ \
656 uintptr_t _r_a_p__v = (uintptr_t)(v); \
657 \
658 if (__builtin_constant_p(v) && (_r_a_p__v) == (uintptr_t)NULL) \
659 WRITE_ONCE((p), (typeof(p))(_r_a_p__v)); \
660 else \
661 smp_store_release(&p, RCU_INITIALIZER((typeof(p))_r_a_p__v)); \
662 _r_a_p__v; \
663})
653 664
654/** 665/**
655 * rcu_access_pointer() - fetch RCU pointer with no dereferencing 666 * rcu_access_pointer() - fetch RCU pointer with no dereferencing
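
The reworked rcu_assign_pointer() above special-cases a compile-time NULL so that pointer retraction no longer pays for a release barrier. A brief usage sketch; gp, struct my_type, publish() and retract() are illustrative names, not part of this patch:

	struct my_type {
		int a;
	};
	static struct my_type __rcu *gp;	/* RCU-protected global pointer */

	static void publish(struct my_type *p)
	{
		p->a = 1;			/* fully initialize before publication */
		rcu_assign_pointer(gp, p);	/* non-NULL: smp_store_release() orders the init */
	}

	static void retract(void)
	{
		rcu_assign_pointer(gp, NULL);	/* constant NULL: plain WRITE_ONCE(), no barrier */
	}

Readers that observe the NULL have nothing to dereference, so there is nothing for a memory barrier to order; that is why the constant-NULL case can safely degrade to WRITE_ONCE().
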
diff --git a/include/linux/torture.h b/include/linux/torture.h
index 7759fc3c622d..6685a73736a2 100644
--- a/include/linux/torture.h
+++ b/include/linux/torture.h
@@ -50,6 +50,10 @@
50 do { if (verbose) pr_alert("%s" TORTURE_FLAG "!!! %s\n", torture_type, s); } while (0) 50 do { if (verbose) pr_alert("%s" TORTURE_FLAG "!!! %s\n", torture_type, s); } while (0)
51 51
52/* Definitions for online/offline exerciser. */ 52/* Definitions for online/offline exerciser. */
53bool torture_offline(int cpu, long *n_onl_attempts, long *n_onl_successes,
54 unsigned long *sum_offl, int *min_onl, int *max_onl);
55bool torture_online(int cpu, long *n_onl_attempts, long *n_onl_successes,
56 unsigned long *sum_onl, int *min_onl, int *max_onl);
53int torture_onoff_init(long ooholdoff, long oointerval); 57int torture_onoff_init(long ooholdoff, long oointerval);
54void torture_onoff_stats(void); 58void torture_onoff_stats(void);
55bool torture_onoff_failures(void); 59bool torture_onoff_failures(void);
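
The torture_offline()/torture_online() declarations above expose the two halves of the CPU-hotplug exerciser so that they can be driven from contexts other than the internal onoff loop. Roughly how a caller feeds them, as a sketch assuming the counter bookkeeping done in kernel/torture.c; the local names below are illustrative:

	/* One pass of a CPU-hotplug torture loop: toggle a randomly chosen CPU. */
	static void onoff_one_pass(struct torture_random_state *trsp, int maxcpu)
	{
		static long n_offl_attempts, n_offl_successes;
		static long n_onl_attempts, n_onl_successes;
		static unsigned long sum_offl, sum_onl;
		static int min_offl = INT_MAX, max_offl, min_onl = INT_MAX, max_onl;
		int cpu = torture_random(trsp) % (maxcpu + 1);

		if (cpu_online(cpu))
			torture_offline(cpu, &n_offl_attempts, &n_offl_successes,
					&sum_offl, &min_offl, &max_offl);
		else
			torture_online(cpu, &n_onl_attempts, &n_onl_successes,
					&sum_onl, &min_onl, &max_onl);
	}
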
diff --git a/init/Kconfig b/init/Kconfig
index f755a602d4a1..a068265fbcaf 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -517,6 +517,7 @@ config SRCU
517config TASKS_RCU 517config TASKS_RCU
518 bool 518 bool
519 default n 519 default n
520 depends on !UML
520 select SRCU 521 select SRCU
521 help 522 help
522 This option enables a task-based RCU implementation that uses 523 This option enables a task-based RCU implementation that uses
diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 3cee0d8393ed..d38ab08a3fe7 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -58,7 +58,7 @@ MODULE_AUTHOR("Paul E. McKenney <paulmck@linux.vnet.ibm.com>");
58#define VERBOSE_PERFOUT_ERRSTRING(s) \ 58#define VERBOSE_PERFOUT_ERRSTRING(s) \
59 do { if (verbose) pr_alert("%s" PERF_FLAG "!!! %s\n", perf_type, s); } while (0) 59 do { if (verbose) pr_alert("%s" PERF_FLAG "!!! %s\n", perf_type, s); } while (0)
60 60
61torture_param(bool, gp_exp, true, "Use expedited GP wait primitives"); 61torture_param(bool, gp_exp, false, "Use expedited GP wait primitives");
62torture_param(int, holdoff, 10, "Holdoff time before test start (s)"); 62torture_param(int, holdoff, 10, "Holdoff time before test start (s)");
63torture_param(int, nreaders, -1, "Number of RCU reader threads"); 63torture_param(int, nreaders, -1, "Number of RCU reader threads");
64torture_param(int, nwriters, -1, "Number of RCU updater threads"); 64torture_param(int, nwriters, -1, "Number of RCU updater threads");
@@ -96,12 +96,7 @@ static int rcu_perf_writer_state;
96#define MAX_MEAS 10000 96#define MAX_MEAS 10000
97#define MIN_MEAS 100 97#define MIN_MEAS 100
98 98
99#if defined(MODULE) || defined(CONFIG_RCU_PERF_TEST_RUNNABLE) 99static int perf_runnable = IS_ENABLED(MODULE);
100#define RCUPERF_RUNNABLE_INIT 1
101#else
102#define RCUPERF_RUNNABLE_INIT 0
103#endif
104static int perf_runnable = RCUPERF_RUNNABLE_INIT;
105module_param(perf_runnable, int, 0444); 100module_param(perf_runnable, int, 0444);
106MODULE_PARM_DESC(perf_runnable, "Start rcuperf at boot"); 101MODULE_PARM_DESC(perf_runnable, "Start rcuperf at boot");
107 102
@@ -363,8 +358,6 @@ rcu_perf_writer(void *arg)
363 u64 *wdpp = writer_durations[me]; 358 u64 *wdpp = writer_durations[me];
364 359
365 VERBOSE_PERFOUT_STRING("rcu_perf_writer task started"); 360 VERBOSE_PERFOUT_STRING("rcu_perf_writer task started");
366 WARN_ON(rcu_gp_is_expedited() && !rcu_gp_is_normal() && !gp_exp);
367 WARN_ON(rcu_gp_is_normal() && gp_exp);
368 WARN_ON(!wdpp); 361 WARN_ON(!wdpp);
369 set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids)); 362 set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
370 sp.sched_priority = 1; 363 sp.sched_priority = 1;
@@ -631,12 +624,24 @@ rcu_perf_init(void)
631 firsterr = -ENOMEM; 624 firsterr = -ENOMEM;
632 goto unwind; 625 goto unwind;
633 } 626 }
627 if (rcu_gp_is_expedited() && !rcu_gp_is_normal() && !gp_exp) {
628 VERBOSE_PERFOUT_ERRSTRING("All grace periods expedited, no normal ones to measure!");
629 firsterr = -EINVAL;
630 goto unwind;
631 }
632 if (rcu_gp_is_normal() && gp_exp) {
633 VERBOSE_PERFOUT_ERRSTRING("All grace periods normal, no expedited ones to measure!");
634 firsterr = -EINVAL;
635 goto unwind;
636 }
634 for (i = 0; i < nrealwriters; i++) { 637 for (i = 0; i < nrealwriters; i++) {
635 writer_durations[i] = 638 writer_durations[i] =
636 kcalloc(MAX_MEAS, sizeof(*writer_durations[i]), 639 kcalloc(MAX_MEAS, sizeof(*writer_durations[i]),
637 GFP_KERNEL); 640 GFP_KERNEL);
638 if (!writer_durations[i]) 641 if (!writer_durations[i]) {
642 firsterr = -ENOMEM;
639 goto unwind; 643 goto unwind;
644 }
640 firsterr = torture_create_kthread(rcu_perf_writer, (void *)i, 645 firsterr = torture_create_kthread(rcu_perf_writer, (void *)i,
641 writer_tasks[i]); 646 writer_tasks[i]);
642 if (firsterr) 647 if (firsterr)
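
The perf_runnable change above (mirrored for torture_runnable in rcutorture.c just below) collapses a four-line preprocessor conditional into a single IS_ENABLED() test; the *_TEST_RUNNABLE Kconfig knob it used to consult appears to be retired by the lib/Kconfig.debug hunk counted in the diffstat. The pattern in isolation:

	/* Before: open-coded #if/#else selecting the initial value. */
	#if defined(MODULE) || defined(CONFIG_RCU_PERF_TEST_RUNNABLE)
	#define RCUPERF_RUNNABLE_INIT 1
	#else
	#define RCUPERF_RUNNABLE_INIT 0
	#endif
	static int perf_runnable = RCUPERF_RUNNABLE_INIT;

	/*
	 * After: IS_ENABLED(MODULE) is 1 when building as a module and 0
	 * otherwise, so a modular rcuperf still starts at load time while
	 * the built-in variant waits for the perf_runnable boot parameter.
	 */
	static int perf_runnable = IS_ENABLED(MODULE);
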
diff --git a/kernel/rcu/rcutorture.c b/kernel/rcu/rcutorture.c
index 084a28a732eb..971e2b138063 100644
--- a/kernel/rcu/rcutorture.c
+++ b/kernel/rcu/rcutorture.c
@@ -182,12 +182,7 @@ static const char *rcu_torture_writer_state_getname(void)
182 return rcu_torture_writer_state_names[i]; 182 return rcu_torture_writer_state_names[i];
183} 183}
184 184
185#if defined(MODULE) || defined(CONFIG_RCU_TORTURE_TEST_RUNNABLE) 185static int torture_runnable = IS_ENABLED(MODULE);
186#define RCUTORTURE_RUNNABLE_INIT 1
187#else
188#define RCUTORTURE_RUNNABLE_INIT 0
189#endif
190static int torture_runnable = RCUTORTURE_RUNNABLE_INIT;
191module_param(torture_runnable, int, 0444); 186module_param(torture_runnable, int, 0444);
192MODULE_PARM_DESC(torture_runnable, "Start rcutorture at boot"); 187MODULE_PARM_DESC(torture_runnable, "Start rcutorture at boot");
193 188
@@ -1476,7 +1471,7 @@ static int rcu_torture_barrier_cbs(void *arg)
1476 break; 1471 break;
1477 /* 1472 /*
1478 * The above smp_load_acquire() ensures barrier_phase load 1473 * The above smp_load_acquire() ensures barrier_phase load
1479 * is ordered before the folloiwng ->call(). 1474 * is ordered before the following ->call().
1480 */ 1475 */
1481 local_irq_disable(); /* Just to test no-irq call_rcu(). */ 1476 local_irq_disable(); /* Just to test no-irq call_rcu(). */
1482 cur_ops->call(&rcu, rcu_torture_barrier_cbf); 1477 cur_ops->call(&rcu, rcu_torture_barrier_cbf);
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c7f1bc4f817c..f433959e9322 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -125,12 +125,14 @@ int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
125/* Number of rcu_nodes at specified level. */ 125/* Number of rcu_nodes at specified level. */
126static int num_rcu_lvl[] = NUM_RCU_LVL_INIT; 126static int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
127int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */ 127int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
128/* panic() on RCU Stall sysctl. */
129int sysctl_panic_on_rcu_stall __read_mostly;
128 130
129/* 131/*
130 * The rcu_scheduler_active variable transitions from zero to one just 132 * The rcu_scheduler_active variable transitions from zero to one just
131 * before the first task is spawned. So when this variable is zero, RCU 133 * before the first task is spawned. So when this variable is zero, RCU
132 * can assume that there is but one task, allowing RCU to (for example) 134 * can assume that there is but one task, allowing RCU to (for example)
133 * optimize synchronize_sched() to a simple barrier(). When this variable 135 * optimize synchronize_rcu() to a simple barrier(). When this variable
134 * is one, RCU must actually do all the hard work required to detect real 136 * is one, RCU must actually do all the hard work required to detect real
135 * grace periods. This variable is also used to suppress boot-time false 137 * grace periods. This variable is also used to suppress boot-time false
136 * positives from lockdep-RCU error checking. 138 * positives from lockdep-RCU error checking.
@@ -159,6 +161,7 @@ static void invoke_rcu_core(void);
159static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp); 161static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp);
160static void rcu_report_exp_rdp(struct rcu_state *rsp, 162static void rcu_report_exp_rdp(struct rcu_state *rsp,
161 struct rcu_data *rdp, bool wake); 163 struct rcu_data *rdp, bool wake);
164static void sync_sched_exp_online_cleanup(int cpu);
162 165
163/* rcuc/rcub kthread realtime priority */ 166/* rcuc/rcub kthread realtime priority */
164#ifdef CONFIG_RCU_KTHREAD_PRIO 167#ifdef CONFIG_RCU_KTHREAD_PRIO
@@ -1284,9 +1287,9 @@ static void rcu_dump_cpu_stacks(struct rcu_state *rsp)
1284 rcu_for_each_leaf_node(rsp, rnp) { 1287 rcu_for_each_leaf_node(rsp, rnp) {
1285 raw_spin_lock_irqsave_rcu_node(rnp, flags); 1288 raw_spin_lock_irqsave_rcu_node(rnp, flags);
1286 if (rnp->qsmask != 0) { 1289 if (rnp->qsmask != 0) {
1287 for (cpu = 0; cpu <= rnp->grphi - rnp->grplo; cpu++) 1290 for_each_leaf_node_possible_cpu(rnp, cpu)
1288 if (rnp->qsmask & (1UL << cpu)) 1291 if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
1289 dump_cpu_task(rnp->grplo + cpu); 1292 dump_cpu_task(cpu);
1290 } 1293 }
1291 raw_spin_unlock_irqrestore_rcu_node(rnp, flags); 1294 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
1292 } 1295 }
@@ -1311,6 +1314,12 @@ static void rcu_stall_kick_kthreads(struct rcu_state *rsp)
1311 } 1314 }
1312} 1315}
1313 1316
1317static inline void panic_on_rcu_stall(void)
1318{
1319 if (sysctl_panic_on_rcu_stall)
1320 panic("RCU Stall\n");
1321}
1322
1314static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum) 1323static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
1315{ 1324{
1316 int cpu; 1325 int cpu;
@@ -1351,10 +1360,9 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
1351 raw_spin_lock_irqsave_rcu_node(rnp, flags); 1360 raw_spin_lock_irqsave_rcu_node(rnp, flags);
1352 ndetected += rcu_print_task_stall(rnp); 1361 ndetected += rcu_print_task_stall(rnp);
1353 if (rnp->qsmask != 0) { 1362 if (rnp->qsmask != 0) {
1354 for (cpu = 0; cpu <= rnp->grphi - rnp->grplo; cpu++) 1363 for_each_leaf_node_possible_cpu(rnp, cpu)
1355 if (rnp->qsmask & (1UL << cpu)) { 1364 if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) {
1356 print_cpu_stall_info(rsp, 1365 print_cpu_stall_info(rsp, cpu);
1357 rnp->grplo + cpu);
1358 ndetected++; 1366 ndetected++;
1359 } 1367 }
1360 } 1368 }
@@ -1390,6 +1398,8 @@ static void print_other_cpu_stall(struct rcu_state *rsp, unsigned long gpnum)
1390 1398
1391 rcu_check_gp_kthread_starvation(rsp); 1399 rcu_check_gp_kthread_starvation(rsp);
1392 1400
1401 panic_on_rcu_stall();
1402
1393 force_quiescent_state(rsp); /* Kick them all. */ 1403 force_quiescent_state(rsp); /* Kick them all. */
1394} 1404}
1395 1405
@@ -1430,6 +1440,8 @@ static void print_cpu_stall(struct rcu_state *rsp)
1430 jiffies + 3 * rcu_jiffies_till_stall_check() + 3); 1440 jiffies + 3 * rcu_jiffies_till_stall_check() + 3);
1431 raw_spin_unlock_irqrestore_rcu_node(rnp, flags); 1441 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
1432 1442
1443 panic_on_rcu_stall();
1444
1433 /* 1445 /*
1434 * Attempt to revive the RCU machinery by forcing a context switch. 1446 * Attempt to revive the RCU machinery by forcing a context switch.
1435 * 1447 *
@@ -1989,8 +2001,7 @@ static bool rcu_gp_init(struct rcu_state *rsp)
1989 * of the tree within the rsp->node[] array. Note that other CPUs 2001 * of the tree within the rsp->node[] array. Note that other CPUs
1990 * will access only the leaves of the hierarchy, thus seeing that no 2002 * will access only the leaves of the hierarchy, thus seeing that no
1991 * grace period is in progress, at least until the corresponding 2003 * grace period is in progress, at least until the corresponding
1992 * leaf node has been initialized. In addition, we have excluded 2004 * leaf node has been initialized.
1993 * CPU-hotplug operations.
1994 * 2005 *
1995 * The grace period cannot complete until the initialization 2006 * The grace period cannot complete until the initialization
1996 * process finishes, because this kthread handles both. 2007 * process finishes, because this kthread handles both.
@@ -2872,7 +2883,6 @@ static void force_qs_rnp(struct rcu_state *rsp,
2872 unsigned long *maxj), 2883 unsigned long *maxj),
2873 bool *isidle, unsigned long *maxj) 2884 bool *isidle, unsigned long *maxj)
2874{ 2885{
2875 unsigned long bit;
2876 int cpu; 2886 int cpu;
2877 unsigned long flags; 2887 unsigned long flags;
2878 unsigned long mask; 2888 unsigned long mask;
@@ -2907,9 +2917,8 @@ static void force_qs_rnp(struct rcu_state *rsp,
2907 continue; 2917 continue;
2908 } 2918 }
2909 } 2919 }
2910 cpu = rnp->grplo; 2920 for_each_leaf_node_possible_cpu(rnp, cpu) {
2911 bit = 1; 2921 unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
2912 for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
2913 if ((rnp->qsmask & bit) != 0) { 2922 if ((rnp->qsmask & bit) != 0) {
2914 if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj)) 2923 if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
2915 mask |= bit; 2924 mask |= bit;
@@ -3448,549 +3457,6 @@ static bool rcu_seq_done(unsigned long *sp, unsigned long s)
3448 return ULONG_CMP_GE(READ_ONCE(*sp), s); 3457 return ULONG_CMP_GE(READ_ONCE(*sp), s);
3449} 3458}
3450 3459
3451/* Wrapper functions for expedited grace periods. */
3452static void rcu_exp_gp_seq_start(struct rcu_state *rsp)
3453{
3454 rcu_seq_start(&rsp->expedited_sequence);
3455}
3456static void rcu_exp_gp_seq_end(struct rcu_state *rsp)
3457{
3458 rcu_seq_end(&rsp->expedited_sequence);
3459 smp_mb(); /* Ensure that consecutive grace periods serialize. */
3460}
3461static unsigned long rcu_exp_gp_seq_snap(struct rcu_state *rsp)
3462{
3463 unsigned long s;
3464
3465 smp_mb(); /* Caller's modifications seen first by other CPUs. */
3466 s = rcu_seq_snap(&rsp->expedited_sequence);
3467 trace_rcu_exp_grace_period(rsp->name, s, TPS("snap"));
3468 return s;
3469}
3470static bool rcu_exp_gp_seq_done(struct rcu_state *rsp, unsigned long s)
3471{
3472 return rcu_seq_done(&rsp->expedited_sequence, s);
3473}
3474
3475/*
3476 * Reset the ->expmaskinit values in the rcu_node tree to reflect any
3477 * recent CPU-online activity. Note that these masks are not cleared
3478 * when CPUs go offline, so they reflect the union of all CPUs that have
3479 * ever been online. This means that this function normally takes its
3480 * no-work-to-do fastpath.
3481 */
3482static void sync_exp_reset_tree_hotplug(struct rcu_state *rsp)
3483{
3484 bool done;
3485 unsigned long flags;
3486 unsigned long mask;
3487 unsigned long oldmask;
3488 int ncpus = READ_ONCE(rsp->ncpus);
3489 struct rcu_node *rnp;
3490 struct rcu_node *rnp_up;
3491
3492 /* If no new CPUs onlined since last time, nothing to do. */
3493 if (likely(ncpus == rsp->ncpus_snap))
3494 return;
3495 rsp->ncpus_snap = ncpus;
3496
3497 /*
3498 * Each pass through the following loop propagates newly onlined
3499 * CPUs for the current rcu_node structure up the rcu_node tree.
3500 */
3501 rcu_for_each_leaf_node(rsp, rnp) {
3502 raw_spin_lock_irqsave_rcu_node(rnp, flags);
3503 if (rnp->expmaskinit == rnp->expmaskinitnext) {
3504 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3505 continue; /* No new CPUs, nothing to do. */
3506 }
3507
3508 /* Update this node's mask, track old value for propagation. */
3509 oldmask = rnp->expmaskinit;
3510 rnp->expmaskinit = rnp->expmaskinitnext;
3511 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3512
3513 /* If was already nonzero, nothing to propagate. */
3514 if (oldmask)
3515 continue;
3516
3517 /* Propagate the new CPU up the tree. */
3518 mask = rnp->grpmask;
3519 rnp_up = rnp->parent;
3520 done = false;
3521 while (rnp_up) {
3522 raw_spin_lock_irqsave_rcu_node(rnp_up, flags);
3523 if (rnp_up->expmaskinit)
3524 done = true;
3525 rnp_up->expmaskinit |= mask;
3526 raw_spin_unlock_irqrestore_rcu_node(rnp_up, flags);
3527 if (done)
3528 break;
3529 mask = rnp_up->grpmask;
3530 rnp_up = rnp_up->parent;
3531 }
3532 }
3533}
3534
3535/*
3536 * Reset the ->expmask values in the rcu_node tree in preparation for
3537 * a new expedited grace period.
3538 */
3539static void __maybe_unused sync_exp_reset_tree(struct rcu_state *rsp)
3540{
3541 unsigned long flags;
3542 struct rcu_node *rnp;
3543
3544 sync_exp_reset_tree_hotplug(rsp);
3545 rcu_for_each_node_breadth_first(rsp, rnp) {
3546 raw_spin_lock_irqsave_rcu_node(rnp, flags);
3547 WARN_ON_ONCE(rnp->expmask);
3548 rnp->expmask = rnp->expmaskinit;
3549 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3550 }
3551}
3552
3553/*
3554 * Return non-zero if there is no RCU expedited grace period in progress
3555 * for the specified rcu_node structure, in other words, if all CPUs and
3556 * tasks covered by the specified rcu_node structure have done their bit
3557 * for the current expedited grace period. Works only for preemptible
3558 * RCU -- other RCU implementation use other means.
3559 *
3560 * Caller must hold the rcu_state's exp_mutex.
3561 */
3562static int sync_rcu_preempt_exp_done(struct rcu_node *rnp)
3563{
3564 return rnp->exp_tasks == NULL &&
3565 READ_ONCE(rnp->expmask) == 0;
3566}
3567
3568/*
3569 * Report the exit from RCU read-side critical section for the last task
3570 * that queued itself during or before the current expedited preemptible-RCU
3571 * grace period. This event is reported either to the rcu_node structure on
3572 * which the task was queued or to one of that rcu_node structure's ancestors,
3573 * recursively up the tree. (Calm down, calm down, we do the recursion
3574 * iteratively!)
3575 *
3576 * Caller must hold the rcu_state's exp_mutex and the specified rcu_node
3577 * structure's ->lock.
3578 */
3579static void __rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
3580 bool wake, unsigned long flags)
3581 __releases(rnp->lock)
3582{
3583 unsigned long mask;
3584
3585 for (;;) {
3586 if (!sync_rcu_preempt_exp_done(rnp)) {
3587 if (!rnp->expmask)
3588 rcu_initiate_boost(rnp, flags);
3589 else
3590 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3591 break;
3592 }
3593 if (rnp->parent == NULL) {
3594 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3595 if (wake) {
3596 smp_mb(); /* EGP done before wake_up(). */
3597 swake_up(&rsp->expedited_wq);
3598 }
3599 break;
3600 }
3601 mask = rnp->grpmask;
3602 raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled */
3603 rnp = rnp->parent;
3604 raw_spin_lock_rcu_node(rnp); /* irqs already disabled */
3605 WARN_ON_ONCE(!(rnp->expmask & mask));
3606 rnp->expmask &= ~mask;
3607 }
3608}
3609
3610/*
3611 * Report expedited quiescent state for specified node. This is a
3612 * lock-acquisition wrapper function for __rcu_report_exp_rnp().
3613 *
3614 * Caller must hold the rcu_state's exp_mutex.
3615 */
3616static void __maybe_unused rcu_report_exp_rnp(struct rcu_state *rsp,
3617 struct rcu_node *rnp, bool wake)
3618{
3619 unsigned long flags;
3620
3621 raw_spin_lock_irqsave_rcu_node(rnp, flags);
3622 __rcu_report_exp_rnp(rsp, rnp, wake, flags);
3623}
3624
3625/*
3626 * Report expedited quiescent state for multiple CPUs, all covered by the
3627 * specified leaf rcu_node structure. Caller must hold the rcu_state's
3628 * exp_mutex.
3629 */
3630static void rcu_report_exp_cpu_mult(struct rcu_state *rsp, struct rcu_node *rnp,
3631 unsigned long mask, bool wake)
3632{
3633 unsigned long flags;
3634
3635 raw_spin_lock_irqsave_rcu_node(rnp, flags);
3636 if (!(rnp->expmask & mask)) {
3637 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3638 return;
3639 }
3640 rnp->expmask &= ~mask;
3641 __rcu_report_exp_rnp(rsp, rnp, wake, flags); /* Releases rnp->lock. */
3642}
3643
3644/*
3645 * Report expedited quiescent state for specified rcu_data (CPU).
3646 */
3647static void rcu_report_exp_rdp(struct rcu_state *rsp, struct rcu_data *rdp,
3648 bool wake)
3649{
3650 rcu_report_exp_cpu_mult(rsp, rdp->mynode, rdp->grpmask, wake);
3651}
3652
3653/* Common code for synchronize_{rcu,sched}_expedited() work-done checking. */
3654static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat,
3655 unsigned long s)
3656{
3657 if (rcu_exp_gp_seq_done(rsp, s)) {
3658 trace_rcu_exp_grace_period(rsp->name, s, TPS("done"));
3659 /* Ensure test happens before caller kfree(). */
3660 smp_mb__before_atomic(); /* ^^^ */
3661 atomic_long_inc(stat);
3662 return true;
3663 }
3664 return false;
3665}
3666
3667/*
3668 * Funnel-lock acquisition for expedited grace periods. Returns true
3669 * if some other task completed an expedited grace period that this task
3670 * can piggy-back on, and with no mutex held. Otherwise, returns false
3671 * with the mutex held, indicating that the caller must actually do the
3672 * expedited grace period.
3673 */
3674static bool exp_funnel_lock(struct rcu_state *rsp, unsigned long s)
3675{
3676 struct rcu_data *rdp = per_cpu_ptr(rsp->rda, raw_smp_processor_id());
3677 struct rcu_node *rnp = rdp->mynode;
3678 struct rcu_node *rnp_root = rcu_get_root(rsp);
3679
3680 /* Low-contention fastpath. */
3681 if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s) &&
3682 (rnp == rnp_root ||
3683 ULONG_CMP_LT(READ_ONCE(rnp_root->exp_seq_rq), s)) &&
3684 !mutex_is_locked(&rsp->exp_mutex) &&
3685 mutex_trylock(&rsp->exp_mutex))
3686 goto fastpath;
3687
3688 /*
3689 * Each pass through the following loop works its way up
3690 * the rcu_node tree, returning if others have done the work or
3691 * otherwise falls through to acquire rsp->exp_mutex. The mapping
3692 * from CPU to rcu_node structure can be inexact, as it is just
3693 * promoting locality and is not strictly needed for correctness.
3694 */
3695 for (; rnp != NULL; rnp = rnp->parent) {
3696 if (sync_exp_work_done(rsp, &rdp->exp_workdone1, s))
3697 return true;
3698
3699 /* Work not done, either wait here or go up. */
3700 spin_lock(&rnp->exp_lock);
3701 if (ULONG_CMP_GE(rnp->exp_seq_rq, s)) {
3702
3703 /* Someone else doing GP, so wait for them. */
3704 spin_unlock(&rnp->exp_lock);
3705 trace_rcu_exp_funnel_lock(rsp->name, rnp->level,
3706 rnp->grplo, rnp->grphi,
3707 TPS("wait"));
3708 wait_event(rnp->exp_wq[(s >> 1) & 0x3],
3709 sync_exp_work_done(rsp,
3710 &rdp->exp_workdone2, s));
3711 return true;
3712 }
3713 rnp->exp_seq_rq = s; /* Followers can wait on us. */
3714 spin_unlock(&rnp->exp_lock);
3715 trace_rcu_exp_funnel_lock(rsp->name, rnp->level, rnp->grplo,
3716 rnp->grphi, TPS("nxtlvl"));
3717 }
3718 mutex_lock(&rsp->exp_mutex);
3719fastpath:
3720 if (sync_exp_work_done(rsp, &rdp->exp_workdone3, s)) {
3721 mutex_unlock(&rsp->exp_mutex);
3722 return true;
3723 }
3724 rcu_exp_gp_seq_start(rsp);
3725 trace_rcu_exp_grace_period(rsp->name, s, TPS("start"));
3726 return false;
3727}
3728
3729/* Invoked on each online non-idle CPU for expedited quiescent state. */
3730static void sync_sched_exp_handler(void *data)
3731{
3732 struct rcu_data *rdp;
3733 struct rcu_node *rnp;
3734 struct rcu_state *rsp = data;
3735
3736 rdp = this_cpu_ptr(rsp->rda);
3737 rnp = rdp->mynode;
3738 if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
3739 __this_cpu_read(rcu_sched_data.cpu_no_qs.b.exp))
3740 return;
3741 if (rcu_is_cpu_rrupt_from_idle()) {
3742 rcu_report_exp_rdp(&rcu_sched_state,
3743 this_cpu_ptr(&rcu_sched_data), true);
3744 return;
3745 }
3746 __this_cpu_write(rcu_sched_data.cpu_no_qs.b.exp, true);
3747 resched_cpu(smp_processor_id());
3748}
3749
3750/* Send IPI for expedited cleanup if needed at end of CPU-hotplug operation. */
3751static void sync_sched_exp_online_cleanup(int cpu)
3752{
3753 struct rcu_data *rdp;
3754 int ret;
3755 struct rcu_node *rnp;
3756 struct rcu_state *rsp = &rcu_sched_state;
3757
3758 rdp = per_cpu_ptr(rsp->rda, cpu);
3759 rnp = rdp->mynode;
3760 if (!(READ_ONCE(rnp->expmask) & rdp->grpmask))
3761 return;
3762 ret = smp_call_function_single(cpu, sync_sched_exp_handler, rsp, 0);
3763 WARN_ON_ONCE(ret);
3764}
3765
3766/*
3767 * Select the nodes that the upcoming expedited grace period needs
3768 * to wait for.
3769 */
3770static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
3771 smp_call_func_t func)
3772{
3773 int cpu;
3774 unsigned long flags;
3775 unsigned long mask;
3776 unsigned long mask_ofl_test;
3777 unsigned long mask_ofl_ipi;
3778 int ret;
3779 struct rcu_node *rnp;
3780
3781 sync_exp_reset_tree(rsp);
3782 rcu_for_each_leaf_node(rsp, rnp) {
3783 raw_spin_lock_irqsave_rcu_node(rnp, flags);
3784
3785 /* Each pass checks a CPU for identity, offline, and idle. */
3786 mask_ofl_test = 0;
3787 for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++) {
3788 struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
3789 struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
3790
3791 if (raw_smp_processor_id() == cpu ||
3792 !(atomic_add_return(0, &rdtp->dynticks) & 0x1))
3793 mask_ofl_test |= rdp->grpmask;
3794 }
3795 mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
3796
3797 /*
3798 * Need to wait for any blocked tasks as well. Note that
3799 * additional blocking tasks will also block the expedited
3800 * GP until such time as the ->expmask bits are cleared.
3801 */
3802 if (rcu_preempt_has_tasks(rnp))
3803 rnp->exp_tasks = rnp->blkd_tasks.next;
3804 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3805
3806 /* IPI the remaining CPUs for expedited quiescent state. */
3807 mask = 1;
3808 for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
3809 if (!(mask_ofl_ipi & mask))
3810 continue;
3811retry_ipi:
3812 ret = smp_call_function_single(cpu, func, rsp, 0);
3813 if (!ret) {
3814 mask_ofl_ipi &= ~mask;
3815 continue;
3816 }
3817 /* Failed, raced with offline. */
3818 raw_spin_lock_irqsave_rcu_node(rnp, flags);
3819 if (cpu_online(cpu) &&
3820 (rnp->expmask & mask)) {
3821 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3822 schedule_timeout_uninterruptible(1);
3823 if (cpu_online(cpu) &&
3824 (rnp->expmask & mask))
3825 goto retry_ipi;
3826 raw_spin_lock_irqsave_rcu_node(rnp, flags);
3827 }
3828 if (!(rnp->expmask & mask))
3829 mask_ofl_ipi &= ~mask;
3830 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
3831 }
3832 /* Report quiescent states for those that went offline. */
3833 mask_ofl_test |= mask_ofl_ipi;
3834 if (mask_ofl_test)
3835 rcu_report_exp_cpu_mult(rsp, rnp, mask_ofl_test, false);
3836 }
3837}
3838
3839static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
3840{
3841 int cpu;
3842 unsigned long jiffies_stall;
3843 unsigned long jiffies_start;
3844 unsigned long mask;
3845 int ndetected;
3846 struct rcu_node *rnp;
3847 struct rcu_node *rnp_root = rcu_get_root(rsp);
3848 int ret;
3849
3850 jiffies_stall = rcu_jiffies_till_stall_check();
3851 jiffies_start = jiffies;
3852
3853 for (;;) {
3854 ret = swait_event_timeout(
3855 rsp->expedited_wq,
3856 sync_rcu_preempt_exp_done(rnp_root),
3857 jiffies_stall);
3858 if (ret > 0 || sync_rcu_preempt_exp_done(rnp_root))
3859 return;
3860 if (ret < 0) {
3861 /* Hit a signal, disable CPU stall warnings. */
3862 swait_event(rsp->expedited_wq,
3863 sync_rcu_preempt_exp_done(rnp_root));
3864 return;
3865 }
3866 pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {",
3867 rsp->name);
3868 ndetected = 0;
3869 rcu_for_each_leaf_node(rsp, rnp) {
3870 ndetected += rcu_print_task_exp_stall(rnp);
3871 mask = 1;
3872 for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
3873 struct rcu_data *rdp;
3874
3875 if (!(rnp->expmask & mask))
3876 continue;
3877 ndetected++;
3878 rdp = per_cpu_ptr(rsp->rda, cpu);
3879 pr_cont(" %d-%c%c%c", cpu,
3880 "O."[!!cpu_online(cpu)],
3881 "o."[!!(rdp->grpmask & rnp->expmaskinit)],
3882 "N."[!!(rdp->grpmask & rnp->expmaskinitnext)]);
3883 }
3884 mask <<= 1;
3885 }
3886 pr_cont(" } %lu jiffies s: %lu root: %#lx/%c\n",
3887 jiffies - jiffies_start, rsp->expedited_sequence,
3888 rnp_root->expmask, ".T"[!!rnp_root->exp_tasks]);
3889 if (ndetected) {
3890 pr_err("blocking rcu_node structures:");
3891 rcu_for_each_node_breadth_first(rsp, rnp) {
3892 if (rnp == rnp_root)
3893 continue; /* printed unconditionally */
3894 if (sync_rcu_preempt_exp_done(rnp))
3895 continue;
3896 pr_cont(" l=%u:%d-%d:%#lx/%c",
3897 rnp->level, rnp->grplo, rnp->grphi,
3898 rnp->expmask,
3899 ".T"[!!rnp->exp_tasks]);
3900 }
3901 pr_cont("\n");
3902 }
3903 rcu_for_each_leaf_node(rsp, rnp) {
3904 mask = 1;
3905 for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask <<= 1) {
3906 if (!(rnp->expmask & mask))
3907 continue;
3908 dump_cpu_task(cpu);
3909 }
3910 }
3911 jiffies_stall = 3 * rcu_jiffies_till_stall_check() + 3;
3912 }
3913}
3914
3915/*
3916 * Wait for the current expedited grace period to complete, and then
3917 * wake up everyone who piggybacked on the just-completed expedited
3918 * grace period. Also update all the ->exp_seq_rq counters as needed
3919 * in order to avoid counter-wrap problems.
3920 */
3921static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
3922{
3923 struct rcu_node *rnp;
3924
3925 synchronize_sched_expedited_wait(rsp);
3926 rcu_exp_gp_seq_end(rsp);
3927 trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
3928
3929 /*
3930 * Switch over to wakeup mode, allowing the next GP, but -only- the
3931 * next GP, to proceed.
3932 */
3933 mutex_lock(&rsp->exp_wake_mutex);
3934 mutex_unlock(&rsp->exp_mutex);
3935
3936 rcu_for_each_node_breadth_first(rsp, rnp) {
3937 if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
3938 spin_lock(&rnp->exp_lock);
3939 /* Recheck, avoid hang in case someone just arrived. */
3940 if (ULONG_CMP_LT(rnp->exp_seq_rq, s))
3941 rnp->exp_seq_rq = s;
3942 spin_unlock(&rnp->exp_lock);
3943 }
3944 wake_up_all(&rnp->exp_wq[(rsp->expedited_sequence >> 1) & 0x3]);
3945 }
3946 trace_rcu_exp_grace_period(rsp->name, s, TPS("endwake"));
3947 mutex_unlock(&rsp->exp_wake_mutex);
3948}
3949
3950/**
3951 * synchronize_sched_expedited - Brute-force RCU-sched grace period
3952 *
3953 * Wait for an RCU-sched grace period to elapse, but use a "big hammer"
3954 * approach to force the grace period to end quickly. This consumes
3955 * significant time on all CPUs and is unfriendly to real-time workloads,
3956 * so is thus not recommended for any sort of common-case code. In fact,
3957 * if you are using synchronize_sched_expedited() in a loop, please
3958 * restructure your code to batch your updates, and then use a single
3959 * synchronize_sched() instead.
3960 *
3961 * This implementation can be thought of as an application of sequence
3962 * locking to expedited grace periods, but using the sequence counter to
3963 * determine when someone else has already done the work instead of for
3964 * retrying readers.
3965 */
3966void synchronize_sched_expedited(void)
3967{
3968 unsigned long s;
3969 struct rcu_state *rsp = &rcu_sched_state;
3970
3971 /* If only one CPU, this is automatically a grace period. */
3972 if (rcu_blocking_is_gp())
3973 return;
3974
3975 /* If expedited grace periods are prohibited, fall back to normal. */
3976 if (rcu_gp_is_normal()) {
3977 wait_rcu_gp(call_rcu_sched);
3978 return;
3979 }
3980
3981 /* Take a snapshot of the sequence number. */
3982 s = rcu_exp_gp_seq_snap(rsp);
3983 if (exp_funnel_lock(rsp, s))
3984 return; /* Someone else did our work for us. */
3985
3986 /* Initialize the rcu_node tree in preparation for the wait. */
3987 sync_rcu_exp_select_cpus(rsp, sync_sched_exp_handler);
3988
3989 /* Wait and clean up, including waking everyone. */
3990 rcu_exp_wait_wake(rsp, s);
3991}
3992EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
3993
3994/* 3460/*
3995 * Check to see if there is any immediate RCU-related work to be done 3461 * Check to see if there is any immediate RCU-related work to be done
3996 * by the current CPU, for the specified type of RCU, returning 1 if so. 3462 * by the current CPU, for the specified type of RCU, returning 1 if so.
@@ -4281,7 +3747,7 @@ rcu_boot_init_percpu_data(int cpu, struct rcu_state *rsp)
4281 3747
4282 /* Set up local state, ensuring consistent view of global state. */ 3748 /* Set up local state, ensuring consistent view of global state. */
4283 raw_spin_lock_irqsave_rcu_node(rnp, flags); 3749 raw_spin_lock_irqsave_rcu_node(rnp, flags);
4284 rdp->grpmask = 1UL << (cpu - rdp->mynode->grplo); 3750 rdp->grpmask = leaf_node_cpu_bit(rdp->mynode, cpu);
4285 rdp->dynticks = &per_cpu(rcu_dynticks, cpu); 3751 rdp->dynticks = &per_cpu(rcu_dynticks, cpu);
4286 WARN_ON_ONCE(rdp->dynticks->dynticks_nesting != DYNTICK_TASK_EXIT_IDLE); 3752 WARN_ON_ONCE(rdp->dynticks->dynticks_nesting != DYNTICK_TASK_EXIT_IDLE);
4287 WARN_ON_ONCE(atomic_read(&rdp->dynticks->dynticks) != 1); 3753 WARN_ON_ONCE(atomic_read(&rdp->dynticks->dynticks) != 1);
@@ -4364,9 +3830,6 @@ static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
4364 struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu); 3830 struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
4365 struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */ 3831 struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
4366 3832
4367 if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
4368 return;
4369
4370 /* Remove outgoing CPU from mask in the leaf rcu_node structure. */ 3833 /* Remove outgoing CPU from mask in the leaf rcu_node structure. */
4371 mask = rdp->grpmask; 3834 mask = rdp->grpmask;
4372 raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */ 3835 raw_spin_lock_irqsave_rcu_node(rnp, flags); /* Enforce GP memory-order guarantee. */
@@ -4751,4 +4214,5 @@ void __init rcu_init(void)
4751 rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu); 4214 rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
4752} 4215}
4753 4216
4217#include "tree_exp.h"
4754#include "tree_plugin.h" 4218#include "tree_plugin.h"
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index e3959f5e6ddf..f714f873bf9d 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -254,6 +254,13 @@ struct rcu_node {
254} ____cacheline_internodealigned_in_smp; 254} ____cacheline_internodealigned_in_smp;
255 255
256/* 256/*
257 * Bitmasks in an rcu_node cover the interval [grplo, grphi] of CPU IDs, and
258 * are indexed relative to this interval rather than the global CPU ID space.
259 * This generates the bit for a CPU in node-local masks.
260 */
261#define leaf_node_cpu_bit(rnp, cpu) (1UL << ((cpu) - (rnp)->grplo))
262
263/*
257 * Do a full breadth-first scan of the rcu_node structures for the 264 * Do a full breadth-first scan of the rcu_node structures for the
258 * specified rcu_state structure. 265 * specified rcu_state structure.
259 */ 266 */
@@ -281,6 +288,14 @@ struct rcu_node {
281 (rnp) < &(rsp)->node[rcu_num_nodes]; (rnp)++) 288 (rnp) < &(rsp)->node[rcu_num_nodes]; (rnp)++)
282 289
283/* 290/*
291 * Iterate over all possible CPUs in a leaf RCU node.
292 */
293#define for_each_leaf_node_possible_cpu(rnp, cpu) \
294 for ((cpu) = cpumask_next(rnp->grplo - 1, cpu_possible_mask); \
295 cpu <= rnp->grphi; \
296 cpu = cpumask_next((cpu), cpu_possible_mask))
297
298/*
284 * Union to allow "aggregate OR" operation on the need for a quiescent 299 * Union to allow "aggregate OR" operation on the need for a quiescent
285 * state by the normal and expedited grace periods. 300 * state by the normal and expedited grace periods.
286 */ 301 */
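
Together, leaf_node_cpu_bit() and for_each_leaf_node_possible_cpu() replace the open-coded grplo/bit-shift scans deleted from kernel/rcu/tree.c above. Using the rcu_dump_cpu_stacks() hunk as the example, the transformation is:

	/* Old pattern: iterate CPU offsets within the leaf, shifting a mask bit. */
	for (cpu = 0; cpu <= rnp->grphi - rnp->grplo; cpu++)
		if (rnp->qsmask & (1UL << cpu))
			dump_cpu_task(rnp->grplo + cpu);

	/* New pattern: iterate real CPU numbers and derive the node-local bit. */
	for_each_leaf_node_possible_cpu(rnp, cpu)
		if (rnp->qsmask & leaf_node_cpu_bit(rnp, cpu))
			dump_cpu_task(cpu);

Because the new iterator walks cpu_possible_mask, it also skips CPU numbers that can never come online, which the old offset loop visited unconditionally.
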
diff --git a/kernel/rcu/tree_exp.h b/kernel/rcu/tree_exp.h
new file mode 100644
index 000000000000..d400434af6b2
--- /dev/null
+++ b/kernel/rcu/tree_exp.h
@@ -0,0 +1,656 @@
1/*
2 * RCU expedited grace periods
3 *
4 * This program is free software; you can redistribute it and/or modify
5 * it under the terms of the GNU General Public License as published by
6 * the Free Software Foundation; either version 2 of the License, or
7 * (at your option) any later version.
8 *
9 * This program is distributed in the hope that it will be useful,
10 * but WITHOUT ANY WARRANTY; without even the implied warranty of
11 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12 * GNU General Public License for more details.
13 *
14 * You should have received a copy of the GNU General Public License
15 * along with this program; if not, you can access it online at
16 * http://www.gnu.org/licenses/gpl-2.0.html.
17 *
18 * Copyright IBM Corporation, 2016
19 *
20 * Authors: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
21 */
22
23/* Wrapper functions for expedited grace periods. */
24static void rcu_exp_gp_seq_start(struct rcu_state *rsp)
25{
26 rcu_seq_start(&rsp->expedited_sequence);
27}
28static void rcu_exp_gp_seq_end(struct rcu_state *rsp)
29{
30 rcu_seq_end(&rsp->expedited_sequence);
31 smp_mb(); /* Ensure that consecutive grace periods serialize. */
32}
33static unsigned long rcu_exp_gp_seq_snap(struct rcu_state *rsp)
34{
35 unsigned long s;
36
37 smp_mb(); /* Caller's modifications seen first by other CPUs. */
38 s = rcu_seq_snap(&rsp->expedited_sequence);
39 trace_rcu_exp_grace_period(rsp->name, s, TPS("snap"));
40 return s;
41}
42static bool rcu_exp_gp_seq_done(struct rcu_state *rsp, unsigned long s)
43{
44 return rcu_seq_done(&rsp->expedited_sequence, s);
45}
46
47/*
48 * Reset the ->expmaskinit values in the rcu_node tree to reflect any
49 * recent CPU-online activity. Note that these masks are not cleared
50 * when CPUs go offline, so they reflect the union of all CPUs that have
51 * ever been online. This means that this function normally takes its
52 * no-work-to-do fastpath.
53 */
54static void sync_exp_reset_tree_hotplug(struct rcu_state *rsp)
55{
56 bool done;
57 unsigned long flags;
58 unsigned long mask;
59 unsigned long oldmask;
60 int ncpus = READ_ONCE(rsp->ncpus);
61 struct rcu_node *rnp;
62 struct rcu_node *rnp_up;
63
64 /* If no new CPUs onlined since last time, nothing to do. */
65 if (likely(ncpus == rsp->ncpus_snap))
66 return;
67 rsp->ncpus_snap = ncpus;
68
69 /*
70 * Each pass through the following loop propagates newly onlined
71 * CPUs for the current rcu_node structure up the rcu_node tree.
72 */
73 rcu_for_each_leaf_node(rsp, rnp) {
74 raw_spin_lock_irqsave_rcu_node(rnp, flags);
75 if (rnp->expmaskinit == rnp->expmaskinitnext) {
76 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
77 continue; /* No new CPUs, nothing to do. */
78 }
79
80 /* Update this node's mask, track old value for propagation. */
81 oldmask = rnp->expmaskinit;
82 rnp->expmaskinit = rnp->expmaskinitnext;
83 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
84
85 /* If was already nonzero, nothing to propagate. */
86 if (oldmask)
87 continue;
88
89 /* Propagate the new CPU up the tree. */
90 mask = rnp->grpmask;
91 rnp_up = rnp->parent;
92 done = false;
93 while (rnp_up) {
94 raw_spin_lock_irqsave_rcu_node(rnp_up, flags);
95 if (rnp_up->expmaskinit)
96 done = true;
97 rnp_up->expmaskinit |= mask;
98 raw_spin_unlock_irqrestore_rcu_node(rnp_up, flags);
99 if (done)
100 break;
101 mask = rnp_up->grpmask;
102 rnp_up = rnp_up->parent;
103 }
104 }
105}
106
107/*
108 * Reset the ->expmask values in the rcu_node tree in preparation for
109 * a new expedited grace period.
110 */
111static void __maybe_unused sync_exp_reset_tree(struct rcu_state *rsp)
112{
113 unsigned long flags;
114 struct rcu_node *rnp;
115
116 sync_exp_reset_tree_hotplug(rsp);
117 rcu_for_each_node_breadth_first(rsp, rnp) {
118 raw_spin_lock_irqsave_rcu_node(rnp, flags);
119 WARN_ON_ONCE(rnp->expmask);
120 rnp->expmask = rnp->expmaskinit;
121 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
122 }
123}
124
125/*
126 * Return non-zero if there is no RCU expedited grace period in progress
127 * for the specified rcu_node structure, in other words, if all CPUs and
128 * tasks covered by the specified rcu_node structure have done their bit
129 * for the current expedited grace period. Works only for preemptible
130 * RCU -- other RCU implementations use other means.
131 *
132 * Caller must hold the rcu_state's exp_mutex.
133 */
134static int sync_rcu_preempt_exp_done(struct rcu_node *rnp)
135{
136 return rnp->exp_tasks == NULL &&
137 READ_ONCE(rnp->expmask) == 0;
138}
139
140/*
141 * Report the exit from RCU read-side critical section for the last task
142 * that queued itself during or before the current expedited preemptible-RCU
143 * grace period. This event is reported either to the rcu_node structure on
144 * which the task was queued or to one of that rcu_node structure's ancestors,
145 * recursively up the tree. (Calm down, calm down, we do the recursion
146 * iteratively!)
147 *
148 * Caller must hold the rcu_state's exp_mutex and the specified rcu_node
149 * structure's ->lock.
150 */
151static void __rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
152 bool wake, unsigned long flags)
153 __releases(rnp->lock)
154{
155 unsigned long mask;
156
157 for (;;) {
158 if (!sync_rcu_preempt_exp_done(rnp)) {
159 if (!rnp->expmask)
160 rcu_initiate_boost(rnp, flags);
161 else
162 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
163 break;
164 }
165 if (rnp->parent == NULL) {
166 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
167 if (wake) {
168 smp_mb(); /* EGP done before wake_up(). */
169 swake_up(&rsp->expedited_wq);
170 }
171 break;
172 }
173 mask = rnp->grpmask;
174 raw_spin_unlock_rcu_node(rnp); /* irqs remain disabled */
175 rnp = rnp->parent;
176 raw_spin_lock_rcu_node(rnp); /* irqs already disabled */
177 WARN_ON_ONCE(!(rnp->expmask & mask));
178 rnp->expmask &= ~mask;
179 }
180}
181
182/*
183 * Report expedited quiescent state for specified node. This is a
184 * lock-acquisition wrapper function for __rcu_report_exp_rnp().
185 *
186 * Caller must hold the rcu_state's exp_mutex.
187 */
188static void __maybe_unused rcu_report_exp_rnp(struct rcu_state *rsp,
189 struct rcu_node *rnp, bool wake)
190{
191 unsigned long flags;
192
193 raw_spin_lock_irqsave_rcu_node(rnp, flags);
194 __rcu_report_exp_rnp(rsp, rnp, wake, flags);
195}
196
197/*
198 * Report expedited quiescent state for multiple CPUs, all covered by the
199 * specified leaf rcu_node structure. Caller must hold the rcu_state's
200 * exp_mutex.
201 */
202static void rcu_report_exp_cpu_mult(struct rcu_state *rsp, struct rcu_node *rnp,
203 unsigned long mask, bool wake)
204{
205 unsigned long flags;
206
207 raw_spin_lock_irqsave_rcu_node(rnp, flags);
208 if (!(rnp->expmask & mask)) {
209 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
210 return;
211 }
212 rnp->expmask &= ~mask;
213 __rcu_report_exp_rnp(rsp, rnp, wake, flags); /* Releases rnp->lock. */
214}
215
216/*
217 * Report expedited quiescent state for specified rcu_data (CPU).
218 */
219static void rcu_report_exp_rdp(struct rcu_state *rsp, struct rcu_data *rdp,
220 bool wake)
221{
222 rcu_report_exp_cpu_mult(rsp, rdp->mynode, rdp->grpmask, wake);
223}
224
225/* Common code for synchronize_{rcu,sched}_expedited() work-done checking. */
226static bool sync_exp_work_done(struct rcu_state *rsp, atomic_long_t *stat,
227 unsigned long s)
228{
229 if (rcu_exp_gp_seq_done(rsp, s)) {
230 trace_rcu_exp_grace_period(rsp->name, s, TPS("done"));
231 /* Ensure test happens before caller kfree(). */
232 smp_mb__before_atomic(); /* ^^^ */
233 atomic_long_inc(stat);
234 return true;
235 }
236 return false;
237}
238
239/*
240 * Funnel-lock acquisition for expedited grace periods. Returns true
241 * if some other task completed an expedited grace period that this task
242 * can piggy-back on, and with no mutex held. Otherwise, returns false
243 * with the mutex held, indicating that the caller must actually do the
244 * expedited grace period.
245 */
246static bool exp_funnel_lock(struct rcu_state *rsp, unsigned long s)
247{
248 struct rcu_data *rdp = per_cpu_ptr(rsp->rda, raw_smp_processor_id());
249 struct rcu_node *rnp = rdp->mynode;
250 struct rcu_node *rnp_root = rcu_get_root(rsp);
251
252 /* Low-contention fastpath. */
253 if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s) &&
254 (rnp == rnp_root ||
255 ULONG_CMP_LT(READ_ONCE(rnp_root->exp_seq_rq), s)) &&
256 !mutex_is_locked(&rsp->exp_mutex) &&
257 mutex_trylock(&rsp->exp_mutex))
258 goto fastpath;
259
260 /*
261 * Each pass through the following loop works its way up
262 * the rcu_node tree, returning if others have done the work or
263 * otherwise falls through to acquire rsp->exp_mutex. The mapping
264 * from CPU to rcu_node structure can be inexact, as it is just
265 * promoting locality and is not strictly needed for correctness.
266 */
267 for (; rnp != NULL; rnp = rnp->parent) {
268 if (sync_exp_work_done(rsp, &rdp->exp_workdone1, s))
269 return true;
270
271 /* Work not done, either wait here or go up. */
272 spin_lock(&rnp->exp_lock);
273 if (ULONG_CMP_GE(rnp->exp_seq_rq, s)) {
274
275 /* Someone else doing GP, so wait for them. */
276 spin_unlock(&rnp->exp_lock);
277 trace_rcu_exp_funnel_lock(rsp->name, rnp->level,
278 rnp->grplo, rnp->grphi,
279 TPS("wait"));
280 wait_event(rnp->exp_wq[(s >> 1) & 0x3],
281 sync_exp_work_done(rsp,
282 &rdp->exp_workdone2, s));
283 return true;
284 }
285 rnp->exp_seq_rq = s; /* Followers can wait on us. */
286 spin_unlock(&rnp->exp_lock);
287 trace_rcu_exp_funnel_lock(rsp->name, rnp->level, rnp->grplo,
288 rnp->grphi, TPS("nxtlvl"));
289 }
290 mutex_lock(&rsp->exp_mutex);
291fastpath:
292 if (sync_exp_work_done(rsp, &rdp->exp_workdone3, s)) {
293 mutex_unlock(&rsp->exp_mutex);
294 return true;
295 }
296 rcu_exp_gp_seq_start(rsp);
297 trace_rcu_exp_grace_period(rsp->name, s, TPS("start"));
298 return false;
299}
300
301/* Invoked on each online non-idle CPU for expedited quiescent state. */
302static void sync_sched_exp_handler(void *data)
303{
304 struct rcu_data *rdp;
305 struct rcu_node *rnp;
306 struct rcu_state *rsp = data;
307
308 rdp = this_cpu_ptr(rsp->rda);
309 rnp = rdp->mynode;
310 if (!(READ_ONCE(rnp->expmask) & rdp->grpmask) ||
311 __this_cpu_read(rcu_sched_data.cpu_no_qs.b.exp))
312 return;
313 if (rcu_is_cpu_rrupt_from_idle()) {
314 rcu_report_exp_rdp(&rcu_sched_state,
315 this_cpu_ptr(&rcu_sched_data), true);
316 return;
317 }
318 __this_cpu_write(rcu_sched_data.cpu_no_qs.b.exp, true);
319 resched_cpu(smp_processor_id());
320}
321
322/* Send IPI for expedited cleanup if needed at end of CPU-hotplug operation. */
323static void sync_sched_exp_online_cleanup(int cpu)
324{
325 struct rcu_data *rdp;
326 int ret;
327 struct rcu_node *rnp;
328 struct rcu_state *rsp = &rcu_sched_state;
329
330 rdp = per_cpu_ptr(rsp->rda, cpu);
331 rnp = rdp->mynode;
332 if (!(READ_ONCE(rnp->expmask) & rdp->grpmask))
333 return;
334 ret = smp_call_function_single(cpu, sync_sched_exp_handler, rsp, 0);
335 WARN_ON_ONCE(ret);
336}
337
338/*
339 * Select the nodes that the upcoming expedited grace period needs
340 * to wait for.
341 */
342static void sync_rcu_exp_select_cpus(struct rcu_state *rsp,
343 smp_call_func_t func)
344{
345 int cpu;
346 unsigned long flags;
347 unsigned long mask_ofl_test;
348 unsigned long mask_ofl_ipi;
349 int ret;
350 struct rcu_node *rnp;
351
352 sync_exp_reset_tree(rsp);
353 rcu_for_each_leaf_node(rsp, rnp) {
354 raw_spin_lock_irqsave_rcu_node(rnp, flags);
355
356 /* Each pass checks a CPU for identity, offline, and idle. */
357 mask_ofl_test = 0;
358 for_each_leaf_node_possible_cpu(rnp, cpu) {
359 struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
360 struct rcu_dynticks *rdtp = &per_cpu(rcu_dynticks, cpu);
361
362 if (raw_smp_processor_id() == cpu ||
363 !(atomic_add_return(0, &rdtp->dynticks) & 0x1))
364 mask_ofl_test |= rdp->grpmask;
365 }
366 mask_ofl_ipi = rnp->expmask & ~mask_ofl_test;
367
368 /*
369 * Need to wait for any blocked tasks as well. Note that
370 * additional blocking tasks will also block the expedited
371 * GP until such time as the ->expmask bits are cleared.
372 */
373 if (rcu_preempt_has_tasks(rnp))
374 rnp->exp_tasks = rnp->blkd_tasks.next;
375 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
376
377 /* IPI the remaining CPUs for expedited quiescent state. */
378 for_each_leaf_node_possible_cpu(rnp, cpu) {
379 unsigned long mask = leaf_node_cpu_bit(rnp, cpu);
380 if (!(mask_ofl_ipi & mask))
381 continue;
382retry_ipi:
383 ret = smp_call_function_single(cpu, func, rsp, 0);
384 if (!ret) {
385 mask_ofl_ipi &= ~mask;
386 continue;
387 }
388 /* Failed, raced with offline. */
389 raw_spin_lock_irqsave_rcu_node(rnp, flags);
390 if (cpu_online(cpu) &&
391 (rnp->expmask & mask)) {
392 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
393 schedule_timeout_uninterruptible(1);
394 if (cpu_online(cpu) &&
395 (rnp->expmask & mask))
396 goto retry_ipi;
397 raw_spin_lock_irqsave_rcu_node(rnp, flags);
398 }
399 if (!(rnp->expmask & mask))
400 mask_ofl_ipi &= ~mask;
401 raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
402 }
403 /* Report quiescent states for those that went offline. */
404 mask_ofl_test |= mask_ofl_ipi;
405 if (mask_ofl_test)
406 rcu_report_exp_cpu_mult(rsp, rnp, mask_ofl_test, false);
407 }
408}
409
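The idle test in the selection loop above assumes the usual ->dynticks convention: the per-CPU counter is incremented on every transition into and out of dyntick-idle state, so an even value means the CPU is idle and an odd value means it is not, and atomic_add_return(0, ...) returns that value with full memory ordering. A minimal sketch of the convention, using C11 atomics rather than the kernel's atomic_t API:

/*
 * Even/odd dyntick-idle convention assumed by sync_rcu_exp_select_cpus():
 * the counter is incremented on every idle entry and exit, so an odd
 * value means "not idle".  Illustration only.
 */
#include <stdatomic.h>
#include <stdbool.h>

struct dynticks_sketch {
	atomic_ulong counter;		/* even: idle, odd: non-idle */
};

static void idle_exit(struct dynticks_sketch *d)
{
	atomic_fetch_add(&d->counter, 1);	/* even -> odd */
}

static void idle_enter(struct dynticks_sketch *d)
{
	atomic_fetch_add(&d->counter, 1);	/* odd -> even */
}

static bool needs_ipi(struct dynticks_sketch *d)
{
	/* Non-idle CPUs must be interrupted to report a quiescent state. */
	return atomic_load(&d->counter) & 0x1;
}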
410static void synchronize_sched_expedited_wait(struct rcu_state *rsp)
411{
412 int cpu;
413 unsigned long jiffies_stall;
414 unsigned long jiffies_start;
415 unsigned long mask;
416 int ndetected;
417 struct rcu_node *rnp;
418 struct rcu_node *rnp_root = rcu_get_root(rsp);
419 int ret;
420
421 jiffies_stall = rcu_jiffies_till_stall_check();
422 jiffies_start = jiffies;
423
424 for (;;) {
425 ret = swait_event_timeout(
426 rsp->expedited_wq,
427 sync_rcu_preempt_exp_done(rnp_root),
428 jiffies_stall);
429 if (ret > 0 || sync_rcu_preempt_exp_done(rnp_root))
430 return;
431 if (ret < 0) {
432 /* Hit a signal, disable CPU stall warnings. */
433 swait_event(rsp->expedited_wq,
434 sync_rcu_preempt_exp_done(rnp_root));
435 return;
436 }
437 pr_err("INFO: %s detected expedited stalls on CPUs/tasks: {",
438 rsp->name);
439 ndetected = 0;
440 rcu_for_each_leaf_node(rsp, rnp) {
441 ndetected += rcu_print_task_exp_stall(rnp);
442 for_each_leaf_node_possible_cpu(rnp, cpu) {
443 struct rcu_data *rdp;
444
445 mask = leaf_node_cpu_bit(rnp, cpu);
446 if (!(rnp->expmask & mask))
447 continue;
448 ndetected++;
449 rdp = per_cpu_ptr(rsp->rda, cpu);
450 pr_cont(" %d-%c%c%c", cpu,
451 "O."[!!cpu_online(cpu)],
452 "o."[!!(rdp->grpmask & rnp->expmaskinit)],
453 "N."[!!(rdp->grpmask & rnp->expmaskinitnext)]);
454 }
455 }
456 pr_cont(" } %lu jiffies s: %lu root: %#lx/%c\n",
457 jiffies - jiffies_start, rsp->expedited_sequence,
458 rnp_root->expmask, ".T"[!!rnp_root->exp_tasks]);
459 if (ndetected) {
460 pr_err("blocking rcu_node structures:");
461 rcu_for_each_node_breadth_first(rsp, rnp) {
462 if (rnp == rnp_root)
463 continue; /* printed unconditionally */
464 if (sync_rcu_preempt_exp_done(rnp))
465 continue;
466 pr_cont(" l=%u:%d-%d:%#lx/%c",
467 rnp->level, rnp->grplo, rnp->grphi,
468 rnp->expmask,
469 ".T"[!!rnp->exp_tasks]);
470 }
471 pr_cont("\n");
472 }
473 rcu_for_each_leaf_node(rsp, rnp) {
474 for_each_leaf_node_possible_cpu(rnp, cpu) {
475 mask = leaf_node_cpu_bit(rnp, cpu);
476 if (!(rnp->expmask & mask))
477 continue;
478 dump_cpu_task(cpu);
479 }
480 }
481 jiffies_stall = 3 * rcu_jiffies_till_stall_check() + 3;
482 }
483}
484
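The per-CPU stall printout above uses the "O."[!!cond] idiom: indexing a two-character string literal with a boolean picks the flag letter when the condition is false (index 0) and '.' when it is true (index 1), so letters mark the suspicious cases. A tiny standalone illustration with made-up values:

/*
 * The "O."[!!cond] idiom from the expedited stall printout: string
 * indexing yields the flag letter when the condition is false and
 * '.' when it is true.  CPU number and flags below are made up.
 */
#include <stdio.h>

int main(void)
{
	int cpu = 3;
	int online = 0;			/* offline, so flag it */
	int in_initial_mask = 1;
	int in_next_mask = 1;

	printf(" %d-%c%c%c\n", cpu,
	       "O."[!!online],		/* 'O': CPU is offline */
	       "o."[!!in_initial_mask],	/* '.': bit set, nothing to flag */
	       "N."[!!in_next_mask]);	/* '.': bit set, nothing to flag */
	return 0;
}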
485/*
486 * Wait for the current expedited grace period to complete, and then
487 * wake up everyone who piggybacked on the just-completed expedited
488 * grace period. Also update all the ->exp_seq_rq counters as needed
489 * in order to avoid counter-wrap problems.
490 */
491static void rcu_exp_wait_wake(struct rcu_state *rsp, unsigned long s)
492{
493 struct rcu_node *rnp;
494
495 synchronize_sched_expedited_wait(rsp);
496 rcu_exp_gp_seq_end(rsp);
497 trace_rcu_exp_grace_period(rsp->name, s, TPS("end"));
498
499 /*
500 * Switch over to wakeup mode, allowing the next GP, but -only- the
501 * next GP, to proceed.
502 */
503 mutex_lock(&rsp->exp_wake_mutex);
504 mutex_unlock(&rsp->exp_mutex);
505
506 rcu_for_each_node_breadth_first(rsp, rnp) {
507 if (ULONG_CMP_LT(READ_ONCE(rnp->exp_seq_rq), s)) {
508 spin_lock(&rnp->exp_lock);
509 /* Recheck, avoid hang in case someone just arrived. */
510 if (ULONG_CMP_LT(rnp->exp_seq_rq, s))
511 rnp->exp_seq_rq = s;
512 spin_unlock(&rnp->exp_lock);
513 }
514 wake_up_all(&rnp->exp_wq[(rsp->expedited_sequence >> 1) & 0x3]);
515 }
516 trace_rcu_exp_grace_period(rsp->name, s, TPS("endwake"));
517 mutex_unlock(&rsp->exp_wake_mutex);
518}
519
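The wakeup above and the wait_event() in exp_funnel_lock() both hash the expedited sequence number down to one of four wait queues with (seq >> 1) & 0x3. A waiter's snapshot s is the value ->expedited_sequence will have once the grace period that satisfies it has ended, so waiter and waker compute the same index and only the waiters for the just-completed grace period are awakened. A sketch of the arithmetic follows; the snapshot formula mirrors the usual "(seq + 3) & ~1" style computation and is an assumption here, since rcu_exp_gp_seq_snap() is not part of this hunk.

/*
 * Wait-queue indexing used by rcu_exp_wait_wake() and exp_funnel_lock().
 * The low bit of the sequence means "grace period in progress".
 */
#include <stdio.h>

static unsigned long seq_snap(unsigned long seq)
{
	return (seq + 3) & ~0x1UL;	/* end of first GP starting after seq */
}

static unsigned int wq_index(unsigned long seq)
{
	return (seq >> 1) & 0x3;	/* one of four wait queues */
}

int main(void)
{
	unsigned long seq = 5;			/* odd: a GP is in progress */
	unsigned long s = seq_snap(seq);	/* 8: waiter needs seq >= 8 */

	/* The waiter sleeps on queue wq_index(s); after that GP ends the
	 * counter is 8, so the waker computes the same queue index. */
	printf("waiter %u waker %u\n", wq_index(s), wq_index(8UL));
	return 0;
}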
520/**
521 * synchronize_sched_expedited - Brute-force RCU-sched grace period
522 *
523 * Wait for an RCU-sched grace period to elapse, but use a "big hammer"
524 * approach to force the grace period to end quickly. This consumes
525 * significant time on all CPUs and is unfriendly to real-time workloads,
526 * so is thus not recommended for any sort of common-case code. In fact,
527 * if you are using synchronize_sched_expedited() in a loop, please
528 * restructure your code to batch your updates, and then use a single
529 * synchronize_sched() instead.
530 *
531 * This implementation can be thought of as an application of sequence
532 * locking to expedited grace periods, but using the sequence counter to
533 * determine when someone else has already done the work instead of for
534 * retrying readers.
535 */
536void synchronize_sched_expedited(void)
537{
538 unsigned long s;
539 struct rcu_state *rsp = &rcu_sched_state;
540
541 /* If only one CPU, this is automatically a grace period. */
542 if (rcu_blocking_is_gp())
543 return;
544
545 /* If expedited grace periods are prohibited, fall back to normal. */
546 if (rcu_gp_is_normal()) {
547 wait_rcu_gp(call_rcu_sched);
548 return;
549 }
550
551 /* Take a snapshot of the sequence number. */
552 s = rcu_exp_gp_seq_snap(rsp);
553 if (exp_funnel_lock(rsp, s))
554 return; /* Someone else did our work for us. */
555
556 /* Initialize the rcu_node tree in preparation for the wait. */
557 sync_rcu_exp_select_cpus(rsp, sync_sched_exp_handler);
558
559 /* Wait and clean up, including waking everyone. */
560 rcu_exp_wait_wake(rsp, s);
561}
562EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
563
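The kernel-doc above describes the implementation as sequence locking turned inside out: take a snapshot of a counter, then use it to detect that somebody else's grace period has already done the work. A simplified, non-kernel sketch of that pattern, with names, locking, and memory ordering reduced to the bare minimum:

/*
 * "Sequence counter as work-already-done detector" sketch.  The low
 * bit of op_seq is set while an operation runs; a snapshot is the
 * first even value reached only after a full operation that begins
 * after the snapshot was taken.  Not the kernel's implementation.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t op_mutex = PTHREAD_MUTEX_INITIALIZER;
static atomic_ulong op_seq;

static unsigned long op_seq_snap(void)
{
	return (atomic_load(&op_seq) + 3) & ~0x1UL;
}

static bool op_done(unsigned long s)
{
	return atomic_load(&op_seq) - s <= ~0UL / 2;	/* "seq >= s", wrap-safe */
}

static void do_operation_once(void)
{
	unsigned long s = op_seq_snap();

	if (op_done(s))
		return;				/* someone else covered us */
	pthread_mutex_lock(&op_mutex);
	if (!op_done(s)) {			/* recheck under the lock */
		atomic_fetch_add(&op_seq, 1);	/* odd: operation running */
		/* ... the expensive operation itself would go here ... */
		atomic_fetch_add(&op_seq, 1);	/* even: operation complete */
	}
	pthread_mutex_unlock(&op_mutex);
}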
564#ifdef CONFIG_PREEMPT_RCU
565
566/*
567 * Remote handler for smp_call_function_single(). If there is an
568 * RCU read-side critical section in effect, request that the
569 * next rcu_read_unlock() record the quiescent state up the
570 * ->expmask fields in the rcu_node tree. Otherwise, immediately
571 * report the quiescent state.
572 */
573static void sync_rcu_exp_handler(void *info)
574{
575 struct rcu_data *rdp;
576 struct rcu_state *rsp = info;
577 struct task_struct *t = current;
578
579 /*
580 * Within an RCU read-side critical section, request that the next
581 * rcu_read_unlock() report. Unless this RCU read-side critical
582 * section has already blocked, in which case it is already set
583 * up for the expedited grace period to wait on it.
584 */
585 if (t->rcu_read_lock_nesting > 0 &&
586 !t->rcu_read_unlock_special.b.blocked) {
587 t->rcu_read_unlock_special.b.exp_need_qs = true;
588 return;
589 }
590
591 /*
592 * We are either exiting an RCU read-side critical section (negative
593 * values of t->rcu_read_lock_nesting) or are not in one at all
594 * (zero value of t->rcu_read_lock_nesting). Or we are in an RCU
595 * read-side critical section that blocked before this expedited
596 * grace period started. Either way, we can immediately report
597 * the quiescent state.
598 */
599 rdp = this_cpu_ptr(rsp->rda);
600 rcu_report_exp_rdp(rsp, rdp, true);
601}
602
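The handler's reasoning rests on the sign convention of t->rcu_read_lock_nesting: positive inside a read-side critical section, zero outside one, and briefly negative while the outermost rcu_read_unlock() runs its exit work. The sketch below illustrates that convention; the INT_MIN bias is an assumption made for illustration, and the real bookkeeping lives in the preemptible-RCU code in kernel/rcu/tree_plugin.h.

/*
 * Sketch of the rcu_read_lock_nesting sign convention assumed by
 * sync_rcu_exp_handler().  Illustration only; the kernel keeps this
 * counter per task.
 */
#include <limits.h>

static int nesting;			/* >0: inside, 0: outside, <0: exiting */

static void read_lock_sketch(void)
{
	nesting++;
}

static void read_unlock_sketch(void)
{
	if (nesting != 1) {
		nesting--;		/* inner unlock: just decrement */
		return;
	}
	nesting = INT_MIN;		/* outermost: mark "exiting" */
	/* ... report any deferred quiescent state requested by an IPI ... */
	nesting = 0;			/* fully outside again */
}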
603/**
604 * synchronize_rcu_expedited - Brute-force RCU grace period
605 *
606 * Wait for an RCU-preempt grace period, but expedite it. The basic
607 * idea is to IPI all non-idle non-nohz online CPUs. The IPI handler
608 * checks whether the CPU is in an RCU-preempt critical section, and
609 * if so, it sets a flag that causes the outermost rcu_read_unlock()
610 * to report the quiescent state. On the other hand, if the CPU is
611 * not in an RCU read-side critical section, the IPI handler reports
612 * the quiescent state immediately.
613 *
614 * Although this is a great improvement over previous expedited
615 * implementations, it is still unfriendly to real-time workloads, and is
616 * thus not recommended for any sort of common-case code. In fact, if
617 * you are using synchronize_rcu_expedited() in a loop, please restructure
618 * your code to batch your updates, and then use a single synchronize_rcu()
619 * instead.
620 */
621void synchronize_rcu_expedited(void)
622{
623 struct rcu_state *rsp = rcu_state_p;
624 unsigned long s;
625
626 /* If expedited grace periods are prohibited, fall back to normal. */
627 if (rcu_gp_is_normal()) {
628 wait_rcu_gp(call_rcu);
629 return;
630 }
631
632 s = rcu_exp_gp_seq_snap(rsp);
633 if (exp_funnel_lock(rsp, s))
634 return; /* Someone else did our work for us. */
635
636 /* Initialize the rcu_node tree in preparation for the wait. */
637 sync_rcu_exp_select_cpus(rsp, sync_rcu_exp_handler);
638
639 /* Wait for ->blkd_tasks lists to drain, then wake everyone up. */
640 rcu_exp_wait_wake(rsp, s);
641}
642EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
643
644#else /* #ifdef CONFIG_PREEMPT_RCU */
645
646/*
647 * Wait for an rcu-preempt grace period, but make it happen quickly.
648 * But because preemptible RCU does not exist, map to rcu-sched.
649 */
650void synchronize_rcu_expedited(void)
651{
652 synchronize_sched_expedited();
653}
654EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
655
656#endif /* #else #ifdef CONFIG_PREEMPT_RCU */
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index ff1cd4e1188d..0082fce402a0 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -79,8 +79,6 @@ static void __init rcu_bootup_announce_oddness(void)
79 pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n"); 79 pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n");
80 if (IS_ENABLED(CONFIG_PROVE_RCU)) 80 if (IS_ENABLED(CONFIG_PROVE_RCU))
81 pr_info("\tRCU lockdep checking is enabled.\n"); 81 pr_info("\tRCU lockdep checking is enabled.\n");
82 if (IS_ENABLED(CONFIG_RCU_TORTURE_TEST_RUNNABLE))
83 pr_info("\tRCU torture testing starts during boot.\n");
84 if (RCU_NUM_LVLS >= 4) 82 if (RCU_NUM_LVLS >= 4)
85 pr_info("\tFour(or more)-level hierarchy is enabled.\n"); 83 pr_info("\tFour(or more)-level hierarchy is enabled.\n");
86 if (RCU_FANOUT_LEAF != 16) 84 if (RCU_FANOUT_LEAF != 16)
@@ -681,84 +679,6 @@ void synchronize_rcu(void)
681} 679}
682EXPORT_SYMBOL_GPL(synchronize_rcu); 680EXPORT_SYMBOL_GPL(synchronize_rcu);
683 681
684/*
685 * Remote handler for smp_call_function_single(). If there is an
686 * RCU read-side critical section in effect, request that the
687 * next rcu_read_unlock() record the quiescent state up the
688 * ->expmask fields in the rcu_node tree. Otherwise, immediately
689 * report the quiescent state.
690 */
691static void sync_rcu_exp_handler(void *info)
692{
693 struct rcu_data *rdp;
694 struct rcu_state *rsp = info;
695 struct task_struct *t = current;
696
697 /*
698 * Within an RCU read-side critical section, request that the next
699 * rcu_read_unlock() report. Unless this RCU read-side critical
700 * section has already blocked, in which case it is already set
701 * up for the expedited grace period to wait on it.
702 */
703 if (t->rcu_read_lock_nesting > 0 &&
704 !t->rcu_read_unlock_special.b.blocked) {
705 t->rcu_read_unlock_special.b.exp_need_qs = true;
706 return;
707 }
708
709 /*
710 * We are either exiting an RCU read-side critical section (negative
711 * values of t->rcu_read_lock_nesting) or are not in one at all
712 * (zero value of t->rcu_read_lock_nesting). Or we are in an RCU
713 * read-side critical section that blocked before this expedited
714 * grace period started. Either way, we can immediately report
715 * the quiescent state.
716 */
717 rdp = this_cpu_ptr(rsp->rda);
718 rcu_report_exp_rdp(rsp, rdp, true);
719}
720
721/**
722 * synchronize_rcu_expedited - Brute-force RCU grace period
723 *
724 * Wait for an RCU-preempt grace period, but expedite it. The basic
725 * idea is to IPI all non-idle non-nohz online CPUs. The IPI handler
726 * checks whether the CPU is in an RCU-preempt critical section, and
727 * if so, it sets a flag that causes the outermost rcu_read_unlock()
728 * to report the quiescent state. On the other hand, if the CPU is
729 * not in an RCU read-side critical section, the IPI handler reports
730 * the quiescent state immediately.
731 *
732 * Although this is a great improvement over previous expedited
733 * implementations, it is still unfriendly to real-time workloads, and is
734 * thus not recommended for any sort of common-case code. In fact, if
735 * you are using synchronize_rcu_expedited() in a loop, please restructure
736 * your code to batch your updates, and then use a single synchronize_rcu()
737 * instead.
738 */
739void synchronize_rcu_expedited(void)
740{
741 struct rcu_state *rsp = rcu_state_p;
742 unsigned long s;
743
744 /* If expedited grace periods are prohibited, fall back to normal. */
745 if (rcu_gp_is_normal()) {
746 wait_rcu_gp(call_rcu);
747 return;
748 }
749
750 s = rcu_exp_gp_seq_snap(rsp);
751 if (exp_funnel_lock(rsp, s))
752 return; /* Someone else did our work for us. */
753
754 /* Initialize the rcu_node tree in preparation for the wait. */
755 sync_rcu_exp_select_cpus(rsp, sync_rcu_exp_handler);
756
757 /* Wait for ->blkd_tasks lists to drain, then wake everyone up. */
758 rcu_exp_wait_wake(rsp, s);
759}
760EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
761
762/** 682/**
763 * rcu_barrier - Wait until all in-flight call_rcu() callbacks complete. 683 * rcu_barrier - Wait until all in-flight call_rcu() callbacks complete.
764 * 684 *
@@ -883,16 +803,6 @@ static void rcu_preempt_check_callbacks(void)
883} 803}
884 804
885/* 805/*
886 * Wait for an rcu-preempt grace period, but make it happen quickly.
887 * But because preemptible RCU does not exist, map to rcu-sched.
888 */
889void synchronize_rcu_expedited(void)
890{
891 synchronize_sched_expedited();
892}
893EXPORT_SYMBOL_GPL(synchronize_rcu_expedited);
894
895/*
896 * Because preemptible RCU does not exist, rcu_barrier() is just 806 * Because preemptible RCU does not exist, rcu_barrier() is just
897 * another name for rcu_barrier_sched(). 807 * another name for rcu_barrier_sched().
898 */ 808 */
@@ -1254,8 +1164,9 @@ static void rcu_boost_kthread_setaffinity(struct rcu_node *rnp, int outgoingcpu)
1254 return; 1164 return;
1255 if (!zalloc_cpumask_var(&cm, GFP_KERNEL)) 1165 if (!zalloc_cpumask_var(&cm, GFP_KERNEL))
1256 return; 1166 return;
1257 for (cpu = rnp->grplo; cpu <= rnp->grphi; cpu++, mask >>= 1) 1167 for_each_leaf_node_possible_cpu(rnp, cpu)
1258 if ((mask & 0x1) && cpu != outgoingcpu) 1168 if ((mask & leaf_node_cpu_bit(rnp, cpu)) &&
1169 cpu != outgoingcpu)
1259 cpumask_set_cpu(cpu, cm); 1170 cpumask_set_cpu(cpu, cm);
1260 if (cpumask_weight(cm) == 0) 1171 if (cpumask_weight(cm) == 0)
1261 cpumask_setall(cm); 1172 cpumask_setall(cm);
diff --git a/kernel/rcu/update.c b/kernel/rcu/update.c
index 3e888cd5a594..f0d8322bc3ec 100644
--- a/kernel/rcu/update.c
+++ b/kernel/rcu/update.c
@@ -528,6 +528,7 @@ static int rcu_task_stall_timeout __read_mostly = HZ * 60 * 10;
528module_param(rcu_task_stall_timeout, int, 0644); 528module_param(rcu_task_stall_timeout, int, 0644);
529 529
530static void rcu_spawn_tasks_kthread(void); 530static void rcu_spawn_tasks_kthread(void);
531static struct task_struct *rcu_tasks_kthread_ptr;
531 532
532/* 533/*
533 * Post an RCU-tasks callback. First call must be from process context 534 * Post an RCU-tasks callback. First call must be from process context
@@ -537,6 +538,7 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
537{ 538{
538 unsigned long flags; 539 unsigned long flags;
539 bool needwake; 540 bool needwake;
541 bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
540 542
541 rhp->next = NULL; 543 rhp->next = NULL;
542 rhp->func = func; 544 rhp->func = func;
@@ -545,7 +547,9 @@ void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
545 *rcu_tasks_cbs_tail = rhp; 547 *rcu_tasks_cbs_tail = rhp;
546 rcu_tasks_cbs_tail = &rhp->next; 548 rcu_tasks_cbs_tail = &rhp->next;
547 raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags); 549 raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
548 if (needwake) { 550 /* We can't create the thread unless interrupts are enabled. */
551 if ((needwake && havetask) ||
552 (!havetask && !irqs_disabled_flags(flags))) {
549 rcu_spawn_tasks_kthread(); 553 rcu_spawn_tasks_kthread();
550 wake_up(&rcu_tasks_cbs_wq); 554 wake_up(&rcu_tasks_cbs_wq);
551 } 555 }
@@ -790,7 +794,6 @@ static int __noreturn rcu_tasks_kthread(void *arg)
790static void rcu_spawn_tasks_kthread(void) 794static void rcu_spawn_tasks_kthread(void)
791{ 795{
792 static DEFINE_MUTEX(rcu_tasks_kthread_mutex); 796 static DEFINE_MUTEX(rcu_tasks_kthread_mutex);
793 static struct task_struct *rcu_tasks_kthread_ptr;
794 struct task_struct *t; 797 struct task_struct *t;
795 798
796 if (READ_ONCE(rcu_tasks_kthread_ptr)) { 799 if (READ_ONCE(rcu_tasks_kthread_ptr)) {
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 87b2fc38398b..35f0dcb1cb4f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1205,6 +1205,17 @@ static struct ctl_table kern_table[] = {
1205 .extra2 = &one, 1205 .extra2 = &one,
1206 }, 1206 },
1207#endif 1207#endif
1208#if defined(CONFIG_TREE_RCU) || defined(CONFIG_PREEMPT_RCU)
1209 {
1210 .procname = "panic_on_rcu_stall",
1211 .data = &sysctl_panic_on_rcu_stall,
1212 .maxlen = sizeof(sysctl_panic_on_rcu_stall),
1213 .mode = 0644,
1214 .proc_handler = proc_dointvec_minmax,
1215 .extra1 = &zero,
1216 .extra2 = &one,
1217 },
1218#endif
1208 { } 1219 { }
1209}; 1220};
1210 1221
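The sysctl entry above only registers the knob; its consumer sits in the RCU CPU stall-warning code in kernel/rcu/tree.c, which is not part of the hunks shown here. The sketch below is therefore an assumed shape of that check, written as a standalone program, not a quote of the kernel's code:

/*
 * Assumed consumer of the new panic_on_rcu_stall sysctl: when the knob
 * is set, a detected stall crashes the system so a dump can be taken.
 */
#include <stdio.h>
#include <stdlib.h>

static int sysctl_panic_on_rcu_stall;	/* toggled via /proc/sys/kernel/... */

static void report_rcu_stall(const char *rcu_flavor)
{
	fprintf(stderr, "INFO: %s detected stall\n", rcu_flavor);
	if (sysctl_panic_on_rcu_stall)
		abort();		/* stand-in for the kernel's panic() */
}

int main(void)
{
	sysctl_panic_on_rcu_stall = 0;
	report_rcu_stall("rcu_sched");	/* warn only */
	return 0;
}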
diff --git a/kernel/torture.c b/kernel/torture.c
index fa0bdeee17ac..75961b3decfe 100644
--- a/kernel/torture.c
+++ b/kernel/torture.c
@@ -82,6 +82,104 @@ static int min_online = -1;
82static int max_online; 82static int max_online;
83 83
84/* 84/*
85 * Attempt to take a CPU offline. Return false if the CPU is already
86 * offline or if it is not subject to CPU-hotplug operations. The
87 * caller can detect other failures by looking at the statistics.
88 */
89bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
90 unsigned long *sum_offl, int *min_offl, int *max_offl)
91{
92 unsigned long delta;
93 int ret;
94 unsigned long starttime;
95
96 if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
97 return false;
98
99 if (verbose)
100 pr_alert("%s" TORTURE_FLAG
101 "torture_onoff task: offlining %d\n",
102 torture_type, cpu);
103 starttime = jiffies;
104 (*n_offl_attempts)++;
105 ret = cpu_down(cpu);
106 if (ret) {
107 if (verbose)
108 pr_alert("%s" TORTURE_FLAG
109 "torture_onoff task: offline %d failed: errno %d\n",
110 torture_type, cpu, ret);
111 } else {
112 if (verbose)
113 pr_alert("%s" TORTURE_FLAG
114 "torture_onoff task: offlined %d\n",
115 torture_type, cpu);
116 (*n_offl_successes)++;
117 delta = jiffies - starttime;
118		*sum_offl += delta;
119 if (*min_offl < 0) {
120 *min_offl = delta;
121 *max_offl = delta;
122 }
123 if (*min_offl > delta)
124 *min_offl = delta;
125 if (*max_offl < delta)
126 *max_offl = delta;
127 }
128
129 return true;
130}
131EXPORT_SYMBOL_GPL(torture_offline);
132
133/*
134 * Attempt to bring a CPU online. Return false if the CPU is already
135 * online or if it is not subject to CPU-hotplug operations. The
136 * caller can detect other failures by looking at the statistics.
137 */
138bool torture_online(int cpu, long *n_onl_attempts, long *n_onl_successes,
139 unsigned long *sum_onl, int *min_onl, int *max_onl)
140{
141 unsigned long delta;
142 int ret;
143 unsigned long starttime;
144
145 if (cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
146 return false;
147
148 if (verbose)
149 pr_alert("%s" TORTURE_FLAG
150 "torture_onoff task: onlining %d\n",
151 torture_type, cpu);
152 starttime = jiffies;
153 (*n_onl_attempts)++;
154 ret = cpu_up(cpu);
155 if (ret) {
156 if (verbose)
157 pr_alert("%s" TORTURE_FLAG
158 "torture_onoff task: online %d failed: errno %d\n",
159 torture_type, cpu, ret);
160 } else {
161 if (verbose)
162 pr_alert("%s" TORTURE_FLAG
163 "torture_onoff task: onlined %d\n",
164 torture_type, cpu);
165 (*n_onl_successes)++;
166 delta = jiffies - starttime;
167 *sum_onl += delta;
168 if (*min_onl < 0) {
169 *min_onl = delta;
170 *max_onl = delta;
171 }
172 if (*min_onl > delta)
173 *min_onl = delta;
174 if (*max_onl < delta)
175 *max_onl = delta;
176 }
177
178 return true;
179}
180EXPORT_SYMBOL_GPL(torture_online);
181
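A minimal sketch of how a caller drives the two helpers above, and why the min_* statistics start at -1 (see min_online earlier in this file): the first successful operation initializes both the minimum and the maximum duration. All variable names below are local to this sketch; the real caller is the refactored torture_onoff() later in this patch.

/*
 * Caller sketch for torture_offline()/torture_online(), using the
 * prototypes of the functions defined above.
 */
#include <stdbool.h>

bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
		     unsigned long *sum_offl, int *min_offl, int *max_offl);
bool torture_online(int cpu, long *n_onl_attempts, long *n_onl_successes,
		    unsigned long *sum_onl, int *min_onl, int *max_onl);

static long n_off_tries, n_off_ok, n_on_tries, n_on_ok;
static unsigned long off_jiffies, on_jiffies;
static int off_min = -1, off_max, on_min = -1, on_max;

static void hotplug_one_pass(int cpu)
{
	/* Offline if possible; if the CPU was already offline, online it. */
	if (!torture_offline(cpu, &n_off_tries, &n_off_ok,
			     &off_jiffies, &off_min, &off_max))
		torture_online(cpu, &n_on_tries, &n_on_ok,
			       &on_jiffies, &on_min, &on_max);
}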
182/*
85 * Execute random CPU-hotplug operations at the interval specified 183 * Execute random CPU-hotplug operations at the interval specified
86 * by the onoff_interval. 184 * by the onoff_interval.
87 */ 185 */
@@ -89,16 +187,19 @@ static int
89torture_onoff(void *arg) 187torture_onoff(void *arg)
90{ 188{
91 int cpu; 189 int cpu;
92 unsigned long delta;
93 int maxcpu = -1; 190 int maxcpu = -1;
94 DEFINE_TORTURE_RANDOM(rand); 191 DEFINE_TORTURE_RANDOM(rand);
95 int ret;
96 unsigned long starttime;
97 192
98 VERBOSE_TOROUT_STRING("torture_onoff task started"); 193 VERBOSE_TOROUT_STRING("torture_onoff task started");
99 for_each_online_cpu(cpu) 194 for_each_online_cpu(cpu)
100 maxcpu = cpu; 195 maxcpu = cpu;
101 WARN_ON(maxcpu < 0); 196 WARN_ON(maxcpu < 0);
197
198 if (maxcpu == 0) {
199 VERBOSE_TOROUT_STRING("Only one CPU, so CPU-hotplug testing is disabled");
200 goto stop;
201 }
202
102 if (onoff_holdoff > 0) { 203 if (onoff_holdoff > 0) {
103 VERBOSE_TOROUT_STRING("torture_onoff begin holdoff"); 204 VERBOSE_TOROUT_STRING("torture_onoff begin holdoff");
104 schedule_timeout_interruptible(onoff_holdoff); 205 schedule_timeout_interruptible(onoff_holdoff);
@@ -106,69 +207,16 @@ torture_onoff(void *arg)
106 } 207 }
107 while (!torture_must_stop()) { 208 while (!torture_must_stop()) {
108 cpu = (torture_random(&rand) >> 4) % (maxcpu + 1); 209 cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
109 if (cpu_online(cpu) && cpu_is_hotpluggable(cpu)) { 210 if (!torture_offline(cpu,
110 if (verbose) 211 &n_offline_attempts, &n_offline_successes,
111 pr_alert("%s" TORTURE_FLAG 212 &sum_offline, &min_offline, &max_offline))
112 "torture_onoff task: offlining %d\n", 213 torture_online(cpu,
113 torture_type, cpu); 214 &n_online_attempts, &n_online_successes,
114 starttime = jiffies; 215 &sum_online, &min_online, &max_online);
115 n_offline_attempts++;
116 ret = cpu_down(cpu);
117 if (ret) {
118 if (verbose)
119 pr_alert("%s" TORTURE_FLAG
120 "torture_onoff task: offline %d failed: errno %d\n",
121 torture_type, cpu, ret);
122 } else {
123 if (verbose)
124 pr_alert("%s" TORTURE_FLAG
125 "torture_onoff task: offlined %d\n",
126 torture_type, cpu);
127 n_offline_successes++;
128 delta = jiffies - starttime;
129 sum_offline += delta;
130 if (min_offline < 0) {
131 min_offline = delta;
132 max_offline = delta;
133 }
134 if (min_offline > delta)
135 min_offline = delta;
136 if (max_offline < delta)
137 max_offline = delta;
138 }
139 } else if (cpu_is_hotpluggable(cpu)) {
140 if (verbose)
141 pr_alert("%s" TORTURE_FLAG
142 "torture_onoff task: onlining %d\n",
143 torture_type, cpu);
144 starttime = jiffies;
145 n_online_attempts++;
146 ret = cpu_up(cpu);
147 if (ret) {
148 if (verbose)
149 pr_alert("%s" TORTURE_FLAG
150 "torture_onoff task: online %d failed: errno %d\n",
151 torture_type, cpu, ret);
152 } else {
153 if (verbose)
154 pr_alert("%s" TORTURE_FLAG
155 "torture_onoff task: onlined %d\n",
156 torture_type, cpu);
157 n_online_successes++;
158 delta = jiffies - starttime;
159 sum_online += delta;
160 if (min_online < 0) {
161 min_online = delta;
162 max_online = delta;
163 }
164 if (min_online > delta)
165 min_online = delta;
166 if (max_online < delta)
167 max_online = delta;
168 }
169 }
170 schedule_timeout_interruptible(onoff_interval); 216 schedule_timeout_interruptible(onoff_interval);
171 } 217 }
218
219stop:
172 torture_kthread_stopping("torture_onoff"); 220 torture_kthread_stopping("torture_onoff");
173 return 0; 221 return 0;
174} 222}
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index b9cfdbfae9aa..805b7048a1bd 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1307,22 +1307,6 @@ config RCU_PERF_TEST
1307 Say M if you want the RCU performance tests to build as a module. 1307 Say M if you want the RCU performance tests to build as a module.
1308 Say N if you are unsure. 1308 Say N if you are unsure.
1309 1309
1310config RCU_PERF_TEST_RUNNABLE
1311 bool "performance tests for RCU runnable by default"
1312 depends on RCU_PERF_TEST = y
1313 default n
1314 help
1315 This option provides a way to build the RCU performance tests
1316 directly into the kernel without them starting up at boot time.
1317 You can use /sys/module to manually override this setting.
1318 This /proc file is available only when the RCU performance
1319 tests have been built into the kernel.
1320
1321 Say Y here if you want the RCU performance tests to start during
1322 boot (you probably don't).
1323 Say N here if you want the RCU performance tests to start only
1324 after being manually enabled via /sys/module.
1325
1326config RCU_TORTURE_TEST 1310config RCU_TORTURE_TEST
1327 tristate "torture tests for RCU" 1311 tristate "torture tests for RCU"
1328 depends on DEBUG_KERNEL 1312 depends on DEBUG_KERNEL
@@ -1340,23 +1324,6 @@ config RCU_TORTURE_TEST
1340 Say M if you want the RCU torture tests to build as a module. 1324 Say M if you want the RCU torture tests to build as a module.
1341 Say N if you are unsure. 1325 Say N if you are unsure.
1342 1326
1343config RCU_TORTURE_TEST_RUNNABLE
1344 bool "torture tests for RCU runnable by default"
1345 depends on RCU_TORTURE_TEST = y
1346 default n
1347 help
1348 This option provides a way to build the RCU torture tests
1349 directly into the kernel without them starting up at boot
1350 time. You can use /proc/sys/kernel/rcutorture_runnable
1351 to manually override this setting. This /proc file is
1352 available only when the RCU torture tests have been built
1353 into the kernel.
1354
1355 Say Y here if you want the RCU torture tests to start during
1356 boot (you probably don't).
1357 Say N here if you want the RCU torture tests to start only
1358 after being manually enabled via /proc.
1359
1360config RCU_TORTURE_TEST_SLOW_PREINIT 1327config RCU_TORTURE_TEST_SLOW_PREINIT
1361 bool "Slow down RCU grace-period pre-initialization to expose races" 1328 bool "Slow down RCU grace-period pre-initialization to expose races"
1362 depends on RCU_TORTURE_TEST 1329 depends on RCU_TORTURE_TEST
diff --git a/tools/testing/selftests/rcutorture/bin/functions.sh b/tools/testing/selftests/rcutorture/bin/functions.sh
index b325470c01b3..1426a9b97494 100644
--- a/tools/testing/selftests/rcutorture/bin/functions.sh
+++ b/tools/testing/selftests/rcutorture/bin/functions.sh
@@ -99,8 +99,9 @@ configfrag_hotplug_cpu () {
99# identify_boot_image qemu-cmd 99# identify_boot_image qemu-cmd
100# 100#
101# Returns the relative path to the kernel build image. This will be 101# Returns the relative path to the kernel build image. This will be
102# arch/<arch>/boot/bzImage unless overridden with the TORTURE_BOOT_IMAGE 102# arch/<arch>/boot/bzImage or vmlinux if bzImage is not a target for the
103# environment variable. 103# architecture, unless overridden with the TORTURE_BOOT_IMAGE environment
104# variable.
104identify_boot_image () { 105identify_boot_image () {
105 if test -n "$TORTURE_BOOT_IMAGE" 106 if test -n "$TORTURE_BOOT_IMAGE"
106 then 107 then
@@ -110,11 +111,8 @@ identify_boot_image () {
110 qemu-system-x86_64|qemu-system-i386) 111 qemu-system-x86_64|qemu-system-i386)
111 echo arch/x86/boot/bzImage 112 echo arch/x86/boot/bzImage
112 ;; 113 ;;
113 qemu-system-ppc64)
114 echo arch/powerpc/boot/bzImage
115 ;;
116 *) 114 *)
117 echo "" 115 echo vmlinux
118 ;; 116 ;;
119 esac 117 esac
120 fi 118 fi
@@ -175,7 +173,7 @@ identify_qemu_args () {
175 qemu-system-x86_64|qemu-system-i386) 173 qemu-system-x86_64|qemu-system-i386)
176 ;; 174 ;;
177 qemu-system-ppc64) 175 qemu-system-ppc64)
178 echo -enable-kvm -M pseries -cpu POWER7 -nodefaults 176 echo -enable-kvm -M pseries -nodefaults
179 echo -device spapr-vscsi 177 echo -device spapr-vscsi
180 if test -n "$TORTURE_QEMU_INTERACTIVE" -a -n "$TORTURE_QEMU_MAC" 178 if test -n "$TORTURE_QEMU_INTERACTIVE" -a -n "$TORTURE_QEMU_MAC"
181 then 179 then
diff --git a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
index 4109f306d855..ea6e373edc27 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm-test-1-run.sh
@@ -8,9 +8,9 @@
8# 8#
9# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args 9# Usage: kvm-test-1-run.sh config builddir resdir seconds qemu-args boot_args
10# 10#
11# qemu-args defaults to "-enable-kvm -soundhw pcspk -nographic", along with 11# qemu-args defaults to "-enable-kvm -nographic", along with arguments
12# arguments specifying the number of CPUs and other 12# specifying the number of CPUs and other options
13# options generated from the underlying CPU architecture. 13# generated from the underlying CPU architecture.
14# boot_args defaults to value returned by the per_version_boot_params 14# boot_args defaults to value returned by the per_version_boot_params
15# shell function. 15# shell function.
16# 16#
@@ -96,7 +96,8 @@ if test "$base_resdir" != "$resdir" -a -f $base_resdir/bzImage -a -f $base_resdi
96then 96then
97 # Rerunning previous test, so use that test's kernel. 97 # Rerunning previous test, so use that test's kernel.
98 QEMU="`identify_qemu $base_resdir/vmlinux`" 98 QEMU="`identify_qemu $base_resdir/vmlinux`"
99 KERNEL=$base_resdir/bzImage 99 BOOT_IMAGE="`identify_boot_image $QEMU`"
100 KERNEL=$base_resdir/${BOOT_IMAGE##*/} # use the last component of ${BOOT_IMAGE}
100 ln -s $base_resdir/Make*.out $resdir # for kvm-recheck.sh 101 ln -s $base_resdir/Make*.out $resdir # for kvm-recheck.sh
101 ln -s $base_resdir/.config $resdir # for kvm-recheck.sh 102 ln -s $base_resdir/.config $resdir # for kvm-recheck.sh
102elif kvm-build.sh $config_template $builddir $T 103elif kvm-build.sh $config_template $builddir $T
@@ -110,7 +111,7 @@ then
110 if test -n "$BOOT_IMAGE" 111 if test -n "$BOOT_IMAGE"
111 then 112 then
112 cp $builddir/$BOOT_IMAGE $resdir 113 cp $builddir/$BOOT_IMAGE $resdir
113 KERNEL=$resdir/bzImage 114 KERNEL=$resdir/${BOOT_IMAGE##*/}
114 else 115 else
115 echo No identifiable boot image, not running KVM, see $resdir. 116 echo No identifiable boot image, not running KVM, see $resdir.
116 echo Do the torture scripts know about your architecture? 117 echo Do the torture scripts know about your architecture?
@@ -147,7 +148,7 @@ then
147fi 148fi
148 149
149# Generate -smp qemu argument. 150# Generate -smp qemu argument.
150qemu_args="-enable-kvm -soundhw pcspk -nographic $qemu_args" 151qemu_args="-enable-kvm -nographic $qemu_args"
151cpu_count=`configNR_CPUS.sh $config_template` 152cpu_count=`configNR_CPUS.sh $config_template`
152cpu_count=`configfrag_boot_cpus "$boot_args" "$config_template" "$cpu_count"` 153cpu_count=`configfrag_boot_cpus "$boot_args" "$config_template" "$cpu_count"`
153vcpus=`identify_qemu_vcpus` 154vcpus=`identify_qemu_vcpus`
@@ -229,6 +230,7 @@ fi
229if test $commandcompleted -eq 0 -a -n "$qemu_pid" 230if test $commandcompleted -eq 0 -a -n "$qemu_pid"
230then 231then
231 echo Grace period for qemu job at pid $qemu_pid 232 echo Grace period for qemu job at pid $qemu_pid
233 oldline="`tail $resdir/console.log`"
232 while : 234 while :
233 do 235 do
234 kruntime=`awk 'BEGIN { print systime() - '"$kstarttime"' }' < /dev/null` 236 kruntime=`awk 'BEGIN { print systime() - '"$kstarttime"' }' < /dev/null`
@@ -238,13 +240,29 @@ then
238 else 240 else
239 break 241 break
240 fi 242 fi
241 if test $kruntime -ge $((seconds + $TORTURE_SHUTDOWN_GRACE)) 243 must_continue=no
244 newline="`tail $resdir/console.log`"
245 if test "$newline" != "$oldline" && echo $newline | grep -q ' [0-9]\+us : '
246 then
247 must_continue=yes
248 fi
249 last_ts="`tail $resdir/console.log | grep '^\[ *[0-9]\+\.[0-9]\+]' | tail -1 | sed -e 's/^\[ *//' -e 's/\..*$//'`"
250 if test -z "$last_ts"
251 then
252 last_ts=0
253 fi
254 if test "$newline" != "$oldline" -a "$last_ts" -lt $((seconds + $TORTURE_SHUTDOWN_GRACE))
255 then
256 must_continue=yes
257 fi
258 if test $must_continue = no -a $kruntime -ge $((seconds + $TORTURE_SHUTDOWN_GRACE))
242 then 259 then
243 echo "!!! PID $qemu_pid hung at $kruntime vs. $seconds seconds" >> $resdir/Warnings 2>&1 260 echo "!!! PID $qemu_pid hung at $kruntime vs. $seconds seconds" >> $resdir/Warnings 2>&1
244 kill -KILL $qemu_pid 261 kill -KILL $qemu_pid
245 break 262 break
246 fi 263 fi
247 sleep 1 264 oldline=$newline
265 sleep 10
248 done 266 done
249elif test -z "$qemu_pid" 267elif test -z "$qemu_pid"
250then 268then
diff --git a/tools/testing/selftests/rcutorture/bin/kvm.sh b/tools/testing/selftests/rcutorture/bin/kvm.sh
index 0d598145873e..0aed965f0062 100755
--- a/tools/testing/selftests/rcutorture/bin/kvm.sh
+++ b/tools/testing/selftests/rcutorture/bin/kvm.sh
@@ -48,7 +48,7 @@ resdir=""
48configs="" 48configs=""
49cpus=0 49cpus=0
50ds=`date +%Y.%m.%d-%H:%M:%S` 50ds=`date +%Y.%m.%d-%H:%M:%S`
51jitter=0 51jitter="-1"
52 52
53. functions.sh 53. functions.sh
54 54
diff --git a/tools/testing/selftests/rcutorture/bin/parse-console.sh b/tools/testing/selftests/rcutorture/bin/parse-console.sh
index 5eb49b7f864c..08aa7d50ae0e 100755
--- a/tools/testing/selftests/rcutorture/bin/parse-console.sh
+++ b/tools/testing/selftests/rcutorture/bin/parse-console.sh
@@ -33,7 +33,7 @@ if grep -Pq '\x00' < $file
33then 33then
34 print_warning Console output contains nul bytes, old qemu still running? 34 print_warning Console output contains nul bytes, old qemu still running?
35fi 35fi
36egrep 'Badness|WARNING:|Warn|BUG|===========|Call Trace:|Oops:|detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state' < $file | grep -v 'ODEBUG: ' | grep -v 'Warning: unable to open an initial console' > $1.diags 36egrep 'Badness|WARNING:|Warn|BUG|===========|Call Trace:|Oops:|detected stalls on CPUs/tasks:|self-detected stall on CPU|Stall ended before state dump start|\?\?\? Writer stall state|rcu_.*kthread starved for' < $file | grep -v 'ODEBUG: ' | grep -v 'Warning: unable to open an initial console' > $1.diags
37if test -s $1.diags 37if test -s $1.diags
38then 38then
39 print_warning Assertion failure in $file $title 39 print_warning Assertion failure in $file $title
@@ -69,6 +69,11 @@ then
69 then 69 then
70 summary="$summary Stalls: $n_stalls" 70 summary="$summary Stalls: $n_stalls"
71 fi 71 fi
72 n_starves=`grep -c 'rcu_.*kthread starved for' $1`
73 if test "$n_starves" -ne 0
74 then
75 summary="$summary Starves: $n_starves"
76 fi
72 print_warning Summary: $summary 77 print_warning Summary: $summary
73else 78else
74 rm $1.diags 79 rm $1.diags
diff --git a/tools/testing/selftests/rcutorture/doc/initrd.txt b/tools/testing/selftests/rcutorture/doc/initrd.txt
index 4170e714f044..833f826d6ec2 100644
--- a/tools/testing/selftests/rcutorture/doc/initrd.txt
+++ b/tools/testing/selftests/rcutorture/doc/initrd.txt
@@ -13,6 +13,22 @@ cd initrd
13cpio -id < /tmp/initrd.img.zcat 13cpio -id < /tmp/initrd.img.zcat
14------------------------------------------------------------------------ 14------------------------------------------------------------------------
15 15
16Another way to create an initramfs image is to use "dracut"[1], which is
17available on many distros. However, the initramfs that dracut generates is a
18cpio archive with another cpio archive inside it, so an extra step is needed
19to create the initrd directory hierarchy.
20
21Here are the commands to create an initrd directory for rcutorture using
22dracut:
23
24------------------------------------------------------------------------
25dracut --no-hostonly --no-hostonly-cmdline --module "base bash shutdown" /tmp/initramfs.img
26cd tools/testing/selftests/rcutorture
27mkdir initrd
28cd initrd
29/usr/lib/dracut/skipcpio /tmp/initramfs.img | zcat | cpio -id
30------------------------------------------------------------------------
31
16Interestingly enough, if you are running rcutorture, you don't really 32Interestingly enough, if you are running rcutorture, you don't really
17need userspace in many cases. Running without userspace has the 33need userspace in many cases. Running without userspace has the
18advantage of allowing you to test your kernel independently of the 34advantage of allowing you to test your kernel independently of the
@@ -89,3 +105,9 @@ while :
89do 105do
90 sleep 10 106 sleep 10
91done 107done
108------------------------------------------------------------------------
109
110References:
111[1]: https://dracut.wiki.kernel.org/index.php/Main_Page
112[2]: http://blog.elastocloud.org/2015/06/rapid-linux-kernel-devtest-with-qemu.html
113[3]: https://www.centos.org/forums/viewtopic.php?t=51621