diff options
author | Ingo Molnar <mingo@kernel.org> | 2013-04-30 04:49:04 -0400 |
---|---|---|
committer | Ingo Molnar <mingo@kernel.org> | 2013-04-30 04:49:04 -0400 |
commit | fd29f424d458118f02e89596505c68a63dcb3007 (patch) | |
tree | b52470ff7fe7a9f29260afe4a9f22a80fc900140 /Documentation | |
parent | c1be5a5b1b355d40e6cf79cc979eb66dafa24ad1 (diff) | |
parent | 49717cb40410fe4b563968680ff7c513967504c6 (diff) |
Merge branch 'rcu/doc' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/urgent
Pull RCU documentation update for reducing OS jitter due to
per-CPU kthreads, from Paul McKenney.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/RCU/checklist.txt | 26 | ||||
-rw-r--r-- | Documentation/RCU/lockdep.txt | 5 | ||||
-rw-r--r-- | Documentation/RCU/rcubarrier.txt | 15 | ||||
-rw-r--r-- | Documentation/RCU/stallwarn.txt | 33 | ||||
-rw-r--r-- | Documentation/RCU/whatisRCU.txt | 4 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 35 | ||||
-rw-r--r-- | Documentation/kernel-per-CPU-kthreads.txt | 202 |
7 files changed, 287 insertions, 33 deletions
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt index 31ef8fe07f82..79e789b8b8ea 100644 --- a/Documentation/RCU/checklist.txt +++ b/Documentation/RCU/checklist.txt | |||
@@ -217,9 +217,14 @@ over a rather long period of time, but improvements are always welcome! | |||
217 | whether the increased speed is worth it. | 217 | whether the increased speed is worth it. |
218 | 218 | ||
219 | 8. Although synchronize_rcu() is slower than is call_rcu(), it | 219 | 8. Although synchronize_rcu() is slower than is call_rcu(), it |
220 | usually results in simpler code. So, unless update performance | 220 | usually results in simpler code. So, unless update performance is |
221 | is critically important or the updaters cannot block, | 221 | critically important, the updaters cannot block, or the latency of |
222 | synchronize_rcu() should be used in preference to call_rcu(). | 222 | synchronize_rcu() is visible from userspace, synchronize_rcu() |
223 | should be used in preference to call_rcu(). Furthermore, | ||
224 | kfree_rcu() usually results in even simpler code than does | ||
225 | synchronize_rcu() without synchronize_rcu()'s multi-millisecond | ||
226 | latency. So please take advantage of kfree_rcu()'s "fire and | ||
227 | forget" memory-freeing capabilities where it applies. | ||
223 | 228 | ||
224 | An especially important property of the synchronize_rcu() | 229 | An especially important property of the synchronize_rcu() |
225 | primitive is that it automatically self-limits: if grace periods | 230 | primitive is that it automatically self-limits: if grace periods |
@@ -268,7 +273,8 @@ over a rather long period of time, but improvements are always welcome! | |||
268 | e. Periodically invoke synchronize_rcu(), permitting a limited | 273 | e. Periodically invoke synchronize_rcu(), permitting a limited |
269 | number of updates per grace period. | 274 | number of updates per grace period. |
270 | 275 | ||
271 | The same cautions apply to call_rcu_bh() and call_rcu_sched(). | 276 | The same cautions apply to call_rcu_bh(), call_rcu_sched(), |
277 | call_srcu(), and kfree_rcu(). | ||
272 | 278 | ||
273 | 9. All RCU list-traversal primitives, which include | 279 | 9. All RCU list-traversal primitives, which include |
274 | rcu_dereference(), list_for_each_entry_rcu(), and | 280 | rcu_dereference(), list_for_each_entry_rcu(), and |
@@ -296,9 +302,9 @@ over a rather long period of time, but improvements are always welcome! | |||
296 | all currently executing rcu_read_lock()-protected RCU read-side | 302 | all currently executing rcu_read_lock()-protected RCU read-side |
297 | critical sections complete. It does -not- necessarily guarantee | 303 | critical sections complete. It does -not- necessarily guarantee |
298 | that all currently running interrupts, NMIs, preempt_disable() | 304 | that all currently running interrupts, NMIs, preempt_disable() |
299 | code, or idle loops will complete. Therefore, if you do not have | 305 | code, or idle loops will complete. Therefore, if your |
300 | rcu_read_lock()-protected read-side critical sections, do -not- | 306 | read-side critical sections are protected by something other |
301 | use synchronize_rcu(). | 307 | than rcu_read_lock(), do -not- use synchronize_rcu(). |
302 | 308 | ||
303 | Similarly, disabling preemption is not an acceptable substitute | 309 | Similarly, disabling preemption is not an acceptable substitute |
304 | for rcu_read_lock(). Code that attempts to use preemption | 310 | for rcu_read_lock(). Code that attempts to use preemption |
@@ -401,9 +407,9 @@ over a rather long period of time, but improvements are always welcome! | |||
401 | read-side critical sections. It is the responsibility of the | 407 | read-side critical sections. It is the responsibility of the |
402 | RCU update-side primitives to deal with this. | 408 | RCU update-side primitives to deal with this. |
403 | 409 | ||
404 | 17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and | 410 | 17. Use CONFIG_PROVE_RCU, CONFIG_DEBUG_OBJECTS_RCU_HEAD, and the |
405 | the __rcu sparse checks to validate your RCU code. These | 411 | __rcu sparse checks (enabled by CONFIG_SPARSE_RCU_POINTER) to |
406 | can help find problems as follows: | 412 | validate your RCU code. These can help find problems as follows: |
407 | 413 | ||
408 | CONFIG_PROVE_RCU: check that accesses to RCU-protected data | 414 | CONFIG_PROVE_RCU: check that accesses to RCU-protected data |
409 | structures are carried out under the proper RCU | 415 | structures are carried out under the proper RCU |
diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.txt index a102d4b3724b..cd83d2348fef 100644 --- a/Documentation/RCU/lockdep.txt +++ b/Documentation/RCU/lockdep.txt | |||
@@ -64,6 +64,11 @@ checking of rcu_dereference() primitives: | |||
64 | but retain the compiler constraints that prevent duplicating | 64 | but retain the compiler constraints that prevent duplicating |
65 | or coalescsing. This is useful when when testing the | 65 | or coalescsing. This is useful when when testing the |
66 | value of the pointer itself, for example, against NULL. | 66 | value of the pointer itself, for example, against NULL. |
67 | rcu_access_index(idx): | ||
68 | Return the value of the index and omit all barriers, but | ||
69 | retain the compiler constraints that prevent duplicating | ||
70 | or coalescsing. This is useful when when testing the | ||
71 | value of the index itself, for example, against -1. | ||
67 | 72 | ||
68 | The rcu_dereference_check() check expression can be any boolean | 73 | The rcu_dereference_check() check expression can be any boolean |
69 | expression, but would normally include a lockdep expression. However, | 74 | expression, but would normally include a lockdep expression. However, |
diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt index 38428c125135..2e319d1b9ef2 100644 --- a/Documentation/RCU/rcubarrier.txt +++ b/Documentation/RCU/rcubarrier.txt | |||
@@ -79,7 +79,20 @@ complete. Pseudo-code using rcu_barrier() is as follows: | |||
79 | 2. Execute rcu_barrier(). | 79 | 2. Execute rcu_barrier(). |
80 | 3. Allow the module to be unloaded. | 80 | 3. Allow the module to be unloaded. |
81 | 81 | ||
82 | The rcutorture module makes use of rcu_barrier in its exit function | 82 | There are also rcu_barrier_bh(), rcu_barrier_sched(), and srcu_barrier() |
83 | functions for the other flavors of RCU, and you of course must match | ||
84 | the flavor of rcu_barrier() with that of call_rcu(). If your module | ||
85 | uses multiple flavors of call_rcu(), then it must also use multiple | ||
86 | flavors of rcu_barrier() when unloading that module. For example, if | ||
87 | it uses call_rcu_bh(), call_srcu() on srcu_struct_1, and call_srcu() on | ||
88 | srcu_struct_2(), then the following three lines of code will be required | ||
89 | when unloading: | ||
90 | |||
91 | 1 rcu_barrier_bh(); | ||
92 | 2 srcu_barrier(&srcu_struct_1); | ||
93 | 3 srcu_barrier(&srcu_struct_2); | ||
94 | |||
95 | The rcutorture module makes use of rcu_barrier() in its exit function | ||
83 | as follows: | 96 | as follows: |
84 | 97 | ||
85 | 1 static void | 98 | 1 static void |
diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt index 1927151b386b..e38b8df3d727 100644 --- a/Documentation/RCU/stallwarn.txt +++ b/Documentation/RCU/stallwarn.txt | |||
@@ -92,14 +92,14 @@ If the CONFIG_RCU_CPU_STALL_INFO kernel configuration parameter is set, | |||
92 | more information is printed with the stall-warning message, for example: | 92 | more information is printed with the stall-warning message, for example: |
93 | 93 | ||
94 | INFO: rcu_preempt detected stall on CPU | 94 | INFO: rcu_preempt detected stall on CPU |
95 | 0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 | 95 | 0: (63959 ticks this GP) idle=241/3fffffffffffffff/0 softirq=82/543 |
96 | (t=65000 jiffies) | 96 | (t=65000 jiffies) |
97 | 97 | ||
98 | In kernels with CONFIG_RCU_FAST_NO_HZ, even more information is | 98 | In kernels with CONFIG_RCU_FAST_NO_HZ, even more information is |
99 | printed: | 99 | printed: |
100 | 100 | ||
101 | INFO: rcu_preempt detected stall on CPU | 101 | INFO: rcu_preempt detected stall on CPU |
102 | 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 drain=0 . timer not pending | 102 | 0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 softirq=82/543 last_accelerate: a345/d342 nonlazy_posted: 25 .D |
103 | (t=65000 jiffies) | 103 | (t=65000 jiffies) |
104 | 104 | ||
105 | The "(64628 ticks this GP)" indicates that this CPU has taken more | 105 | The "(64628 ticks this GP)" indicates that this CPU has taken more |
@@ -116,13 +116,28 @@ number between the two "/"s is the value of the nesting, which will | |||
116 | be a small positive number if in the idle loop and a very large positive | 116 | be a small positive number if in the idle loop and a very large positive |
117 | number (as shown above) otherwise. | 117 | number (as shown above) otherwise. |
118 | 118 | ||
119 | For CONFIG_RCU_FAST_NO_HZ kernels, the "drain=0" indicates that the CPU is | 119 | The "softirq=" portion of the message tracks the number of RCU softirq |
120 | not in the process of trying to force itself into dyntick-idle state, the | 120 | handlers that the stalled CPU has executed. The number before the "/" |
121 | "." indicates that the CPU has not given up forcing RCU into dyntick-idle | 121 | is the number that had executed since boot at the time that this CPU |
122 | mode (it would be "H" otherwise), and the "timer not pending" indicates | 122 | last noted the beginning of a grace period, which might be the current |
123 | that the CPU has not recently forced RCU into dyntick-idle mode (it | 123 | (stalled) grace period, or it might be some earlier grace period (for |
124 | would otherwise indicate the number of microseconds remaining in this | 124 | example, if the CPU might have been in dyntick-idle mode for an extended |
125 | forced state). | 125 | time period. The number after the "/" is the number that have executed |
126 | since boot until the current time. If this latter number stays constant | ||
127 | across repeated stall-warning messages, it is possible that RCU's softirq | ||
128 | handlers are no longer able to execute on this CPU. This can happen if | ||
129 | the stalled CPU is spinning with interrupts are disabled, or, in -rt | ||
130 | kernels, if a high-priority process is starving RCU's softirq handler. | ||
131 | |||
132 | For CONFIG_RCU_FAST_NO_HZ kernels, the "last_accelerate:" prints the | ||
133 | low-order 16 bits (in hex) of the jiffies counter when this CPU last | ||
134 | invoked rcu_try_advance_all_cbs() from rcu_needs_cpu() or last invoked | ||
135 | rcu_accelerate_cbs() from rcu_prepare_for_idle(). The "nonlazy_posted:" | ||
136 | prints the number of non-lazy callbacks posted since the last call to | ||
137 | rcu_needs_cpu(). Finally, an "L" indicates that there are currently | ||
138 | no non-lazy callbacks ("." is printed otherwise, as shown above) and | ||
139 | "D" indicates that dyntick-idle processing is enabled ("." is printed | ||
140 | otherwise, for example, if disabled via the "nohz=" kernel boot parameter). | ||
126 | 141 | ||
127 | 142 | ||
128 | Multiple Warnings From One Stall | 143 | Multiple Warnings From One Stall |
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt index 0cc7820967f4..10df0b82f459 100644 --- a/Documentation/RCU/whatisRCU.txt +++ b/Documentation/RCU/whatisRCU.txt | |||
@@ -265,9 +265,9 @@ rcu_dereference() | |||
265 | rcu_read_lock(); | 265 | rcu_read_lock(); |
266 | p = rcu_dereference(head.next); | 266 | p = rcu_dereference(head.next); |
267 | rcu_read_unlock(); | 267 | rcu_read_unlock(); |
268 | x = p->address; | 268 | x = p->address; /* BUG!!! */ |
269 | rcu_read_lock(); | 269 | rcu_read_lock(); |
270 | y = p->data; | 270 | y = p->data; /* BUG!!! */ |
271 | rcu_read_unlock(); | 271 | rcu_read_unlock(); |
272 | 272 | ||
273 | Holding a reference from one RCU read-side critical section | 273 | Holding a reference from one RCU read-side critical section |
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 8ccbf27aead4..52ecc9b84673 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -2484,9 +2484,12 @@ bytes respectively. Such letter suffixes can also be entirely omitted. | |||
2484 | In kernels built with CONFIG_RCU_NOCB_CPU=y, set | 2484 | In kernels built with CONFIG_RCU_NOCB_CPU=y, set |
2485 | the specified list of CPUs to be no-callback CPUs. | 2485 | the specified list of CPUs to be no-callback CPUs. |
2486 | Invocation of these CPUs' RCU callbacks will | 2486 | Invocation of these CPUs' RCU callbacks will |
2487 | be offloaded to "rcuoN" kthreads created for | 2487 | be offloaded to "rcuox/N" kthreads created for |
2488 | that purpose. This reduces OS jitter on the | 2488 | that purpose, where "x" is "b" for RCU-bh, "p" |
2489 | for RCU-preempt, and "s" for RCU-sched, and "N" | ||
2490 | is the CPU number. This reduces OS jitter on the | ||
2489 | offloaded CPUs, which can be useful for HPC and | 2491 | offloaded CPUs, which can be useful for HPC and |
2492 | |||
2490 | real-time workloads. It can also improve energy | 2493 | real-time workloads. It can also improve energy |
2491 | efficiency for asymmetric multiprocessors. | 2494 | efficiency for asymmetric multiprocessors. |
2492 | 2495 | ||
@@ -2510,6 +2513,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted. | |||
2510 | leaf rcu_node structure. Useful for very large | 2513 | leaf rcu_node structure. Useful for very large |
2511 | systems. | 2514 | systems. |
2512 | 2515 | ||
2516 | rcutree.jiffies_till_first_fqs= [KNL,BOOT] | ||
2517 | Set delay from grace-period initialization to | ||
2518 | first attempt to force quiescent states. | ||
2519 | Units are jiffies, minimum value is zero, | ||
2520 | and maximum value is HZ. | ||
2521 | |||
2522 | rcutree.jiffies_till_next_fqs= [KNL,BOOT] | ||
2523 | Set delay between subsequent attempts to force | ||
2524 | quiescent states. Units are jiffies, minimum | ||
2525 | value is one, and maximum value is HZ. | ||
2526 | |||
2513 | rcutree.qhimark= [KNL,BOOT] | 2527 | rcutree.qhimark= [KNL,BOOT] |
2514 | Set threshold of queued | 2528 | Set threshold of queued |
2515 | RCU callbacks over which batch limiting is disabled. | 2529 | RCU callbacks over which batch limiting is disabled. |
@@ -2524,16 +2538,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted. | |||
2524 | rcutree.rcu_cpu_stall_timeout= [KNL,BOOT] | 2538 | rcutree.rcu_cpu_stall_timeout= [KNL,BOOT] |
2525 | Set timeout for RCU CPU stall warning messages. | 2539 | Set timeout for RCU CPU stall warning messages. |
2526 | 2540 | ||
2527 | rcutree.jiffies_till_first_fqs= [KNL,BOOT] | 2541 | rcutree.rcu_idle_gp_delay= [KNL,BOOT] |
2528 | Set delay from grace-period initialization to | 2542 | Set wakeup interval for idle CPUs that have |
2529 | first attempt to force quiescent states. | 2543 | RCU callbacks (RCU_FAST_NO_HZ=y). |
2530 | Units are jiffies, minimum value is zero, | ||
2531 | and maximum value is HZ. | ||
2532 | 2544 | ||
2533 | rcutree.jiffies_till_next_fqs= [KNL,BOOT] | 2545 | rcutree.rcu_idle_lazy_gp_delay= [KNL,BOOT] |
2534 | Set delay between subsequent attempts to force | 2546 | Set wakeup interval for idle CPUs that have |
2535 | quiescent states. Units are jiffies, minimum | 2547 | only "lazy" RCU callbacks (RCU_FAST_NO_HZ=y). |
2536 | value is one, and maximum value is HZ. | 2548 | Lazy RCU callbacks are those which RCU can |
2549 | prove do nothing more than free memory. | ||
2537 | 2550 | ||
2538 | rcutorture.fqs_duration= [KNL,BOOT] | 2551 | rcutorture.fqs_duration= [KNL,BOOT] |
2539 | Set duration of force_quiescent_state bursts. | 2552 | Set duration of force_quiescent_state bursts. |
diff --git a/Documentation/kernel-per-CPU-kthreads.txt b/Documentation/kernel-per-CPU-kthreads.txt new file mode 100644 index 000000000000..cbf7ae412da4 --- /dev/null +++ b/Documentation/kernel-per-CPU-kthreads.txt | |||
@@ -0,0 +1,202 @@ | |||
1 | REDUCING OS JITTER DUE TO PER-CPU KTHREADS | ||
2 | |||
3 | This document lists per-CPU kthreads in the Linux kernel and presents | ||
4 | options to control their OS jitter. Note that non-per-CPU kthreads are | ||
5 | not listed here. To reduce OS jitter from non-per-CPU kthreads, bind | ||
6 | them to a "housekeeping" CPU dedicated to such work. | ||
7 | |||
8 | |||
9 | REFERENCES | ||
10 | |||
11 | o Documentation/IRQ-affinity.txt: Binding interrupts to sets of CPUs. | ||
12 | |||
13 | o Documentation/cgroups: Using cgroups to bind tasks to sets of CPUs. | ||
14 | |||
15 | o man taskset: Using the taskset command to bind tasks to sets | ||
16 | of CPUs. | ||
17 | |||
18 | o man sched_setaffinity: Using the sched_setaffinity() system | ||
19 | call to bind tasks to sets of CPUs. | ||
20 | |||
21 | o /sys/devices/system/cpu/cpuN/online: Control CPU N's hotplug state, | ||
22 | writing "0" to offline and "1" to online. | ||
23 | |||
24 | o In order to locate kernel-generated OS jitter on CPU N: | ||
25 | |||
26 | cd /sys/kernel/debug/tracing | ||
27 | echo 1 > max_graph_depth # Increase the "1" for more detail | ||
28 | echo function_graph > current_tracer | ||
29 | # run workload | ||
30 | cat per_cpu/cpuN/trace | ||
31 | |||
32 | |||
33 | KTHREADS | ||
34 | |||
35 | Name: ehca_comp/%u | ||
36 | Purpose: Periodically process Infiniband-related work. | ||
37 | To reduce its OS jitter, do any of the following: | ||
38 | 1. Don't use eHCA Infiniband hardware, instead choosing hardware | ||
39 | that does not require per-CPU kthreads. This will prevent these | ||
40 | kthreads from being created in the first place. (This will | ||
41 | work for most people, as this hardware, though important, is | ||
42 | relatively old and is produced in relatively low unit volumes.) | ||
43 | 2. Do all eHCA-Infiniband-related work on other CPUs, including | ||
44 | interrupts. | ||
45 | 3. Rework the eHCA driver so that its per-CPU kthreads are | ||
46 | provisioned only on selected CPUs. | ||
47 | |||
48 | |||
49 | Name: irq/%d-%s | ||
50 | Purpose: Handle threaded interrupts. | ||
51 | To reduce its OS jitter, do the following: | ||
52 | 1. Use irq affinity to force the irq threads to execute on | ||
53 | some other CPU. | ||
54 | |||
55 | Name: kcmtpd_ctr_%d | ||
56 | Purpose: Handle Bluetooth work. | ||
57 | To reduce its OS jitter, do one of the following: | ||
58 | 1. Don't use Bluetooth, in which case these kthreads won't be | ||
59 | created in the first place. | ||
60 | 2. Use irq affinity to force Bluetooth-related interrupts to | ||
61 | occur on some other CPU and furthermore initiate all | ||
62 | Bluetooth activity on some other CPU. | ||
63 | |||
64 | Name: ksoftirqd/%u | ||
65 | Purpose: Execute softirq handlers when threaded or when under heavy load. | ||
66 | To reduce its OS jitter, each softirq vector must be handled | ||
67 | separately as follows: | ||
68 | TIMER_SOFTIRQ: Do all of the following: | ||
69 | 1. To the extent possible, keep the CPU out of the kernel when it | ||
70 | is non-idle, for example, by avoiding system calls and by forcing | ||
71 | both kernel threads and interrupts to execute elsewhere. | ||
72 | 2. Build with CONFIG_HOTPLUG_CPU=y. After boot completes, force | ||
73 | the CPU offline, then bring it back online. This forces | ||
74 | recurring timers to migrate elsewhere. If you are concerned | ||
75 | with multiple CPUs, force them all offline before bringing the | ||
76 | first one back online. Once you have onlined the CPUs in question, | ||
77 | do not offline any other CPUs, because doing so could force the | ||
78 | timer back onto one of the CPUs in question. | ||
79 | NET_TX_SOFTIRQ and NET_RX_SOFTIRQ: Do all of the following: | ||
80 | 1. Force networking interrupts onto other CPUs. | ||
81 | 2. Initiate any network I/O on other CPUs. | ||
82 | 3. Once your application has started, prevent CPU-hotplug operations | ||
83 | from being initiated from tasks that might run on the CPU to | ||
84 | be de-jittered. (It is OK to force this CPU offline and then | ||
85 | bring it back online before you start your application.) | ||
86 | BLOCK_SOFTIRQ: Do all of the following: | ||
87 | 1. Force block-device interrupts onto some other CPU. | ||
88 | 2. Initiate any block I/O on other CPUs. | ||
89 | 3. Once your application has started, prevent CPU-hotplug operations | ||
90 | from being initiated from tasks that might run on the CPU to | ||
91 | be de-jittered. (It is OK to force this CPU offline and then | ||
92 | bring it back online before you start your application.) | ||
93 | BLOCK_IOPOLL_SOFTIRQ: Do all of the following: | ||
94 | 1. Force block-device interrupts onto some other CPU. | ||
95 | 2. Initiate any block I/O and block-I/O polling on other CPUs. | ||
96 | 3. Once your application has started, prevent CPU-hotplug operations | ||
97 | from being initiated from tasks that might run on the CPU to | ||
98 | be de-jittered. (It is OK to force this CPU offline and then | ||
99 | bring it back online before you start your application.) | ||
100 | TASKLET_SOFTIRQ: Do one or more of the following: | ||
101 | 1. Avoid use of drivers that use tasklets. (Such drivers will contain | ||
102 | calls to things like tasklet_schedule().) | ||
103 | 2. Convert all drivers that you must use from tasklets to workqueues. | ||
104 | 3. Force interrupts for drivers using tasklets onto other CPUs, | ||
105 | and also do I/O involving these drivers on other CPUs. | ||
106 | SCHED_SOFTIRQ: Do all of the following: | ||
107 | 1. Avoid sending scheduler IPIs to the CPU to be de-jittered, | ||
108 | for example, ensure that at most one runnable kthread is present | ||
109 | on that CPU. If a thread that expects to run on the de-jittered | ||
110 | CPU awakens, the scheduler will send an IPI that can result in | ||
111 | a subsequent SCHED_SOFTIRQ. | ||
112 | 2. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y, | ||
113 | CONFIG_NO_HZ_FULL=y, and, in addition, ensure that the CPU | ||
114 | to be de-jittered is marked as an adaptive-ticks CPU using the | ||
115 | "nohz_full=" boot parameter. This reduces the number of | ||
116 | scheduler-clock interrupts that the de-jittered CPU receives, | ||
117 | minimizing its chances of being selected to do the load balancing | ||
118 | work that runs in SCHED_SOFTIRQ context. | ||
119 | 3. To the extent possible, keep the CPU out of the kernel when it | ||
120 | is non-idle, for example, by avoiding system calls and by | ||
121 | forcing both kernel threads and interrupts to execute elsewhere. | ||
122 | This further reduces the number of scheduler-clock interrupts | ||
123 | received by the de-jittered CPU. | ||
124 | HRTIMER_SOFTIRQ: Do all of the following: | ||
125 | 1. To the extent possible, keep the CPU out of the kernel when it | ||
126 | is non-idle. For example, avoid system calls and force both | ||
127 | kernel threads and interrupts to execute elsewhere. | ||
128 | 2. Build with CONFIG_HOTPLUG_CPU=y. Once boot completes, force the | ||
129 | CPU offline, then bring it back online. This forces recurring | ||
130 | timers to migrate elsewhere. If you are concerned with multiple | ||
131 | CPUs, force them all offline before bringing the first one | ||
132 | back online. Once you have onlined the CPUs in question, do not | ||
133 | offline any other CPUs, because doing so could force the timer | ||
134 | back onto one of the CPUs in question. | ||
135 | RCU_SOFTIRQ: Do at least one of the following: | ||
136 | 1. Offload callbacks and keep the CPU in either dyntick-idle or | ||
137 | adaptive-ticks state by doing all of the following: | ||
138 | a. Build with CONFIG_RCU_NOCB_CPU=y, CONFIG_RCU_NOCB_CPU_ALL=y, | ||
139 | CONFIG_NO_HZ_FULL=y, and, in addition ensure that the CPU | ||
140 | to be de-jittered is marked as an adaptive-ticks CPU using | ||
141 | the "nohz_full=" boot parameter. Bind the rcuo kthreads | ||
142 | to housekeeping CPUs, which can tolerate OS jitter. | ||
143 | b. To the extent possible, keep the CPU out of the kernel | ||
144 | when it is non-idle, for example, by avoiding system | ||
145 | calls and by forcing both kernel threads and interrupts | ||
146 | to execute elsewhere. | ||
147 | 2. Enable RCU to do its processing remotely via dyntick-idle by | ||
148 | doing all of the following: | ||
149 | a. Build with CONFIG_NO_HZ=y and CONFIG_RCU_FAST_NO_HZ=y. | ||
150 | b. Ensure that the CPU goes idle frequently, allowing other | ||
151 | CPUs to detect that it has passed through an RCU quiescent | ||
152 | state. If the kernel is built with CONFIG_NO_HZ_FULL=y, | ||
153 | userspace execution also allows other CPUs to detect that | ||
154 | the CPU in question has passed through a quiescent state. | ||
155 | c. To the extent possible, keep the CPU out of the kernel | ||
156 | when it is non-idle, for example, by avoiding system | ||
157 | calls and by forcing both kernel threads and interrupts | ||
158 | to execute elsewhere. | ||
159 | |||
160 | Name: rcuc/%u | ||
161 | Purpose: Execute RCU callbacks in CONFIG_RCU_BOOST=y kernels. | ||
162 | To reduce its OS jitter, do at least one of the following: | ||
163 | 1. Build the kernel with CONFIG_PREEMPT=n. This prevents these | ||
164 | kthreads from being created in the first place, and also obviates | ||
165 | the need for RCU priority boosting. This approach is feasible | ||
166 | for workloads that do not require high degrees of responsiveness. | ||
167 | 2. Build the kernel with CONFIG_RCU_BOOST=n. This prevents these | ||
168 | kthreads from being created in the first place. This approach | ||
169 | is feasible only if your workload never requires RCU priority | ||
170 | boosting, for example, if you ensure frequent idle time on all | ||
171 | CPUs that might execute within the kernel. | ||
172 | 3. Build with CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y, | ||
173 | which offloads all RCU callbacks to kthreads that can be moved | ||
174 | off of CPUs susceptible to OS jitter. This approach prevents the | ||
175 | rcuc/%u kthreads from having any work to do, so that they are | ||
176 | never awakened. | ||
177 | 4. Ensure that the CPU never enters the kernel, and, in particular, | ||
178 | avoid initiating any CPU hotplug operations on this CPU. This is | ||
179 | another way of preventing any callbacks from being queued on the | ||
180 | CPU, again preventing the rcuc/%u kthreads from having any work | ||
181 | to do. | ||
182 | |||
183 | Name: rcuob/%d, rcuop/%d, and rcuos/%d | ||
184 | Purpose: Offload RCU callbacks from the corresponding CPU. | ||
185 | To reduce its OS jitter, do at least one of the following: | ||
186 | 1. Use affinity, cgroups, or other mechanism to force these kthreads | ||
187 | to execute on some other CPU. | ||
188 | 2. Build with CONFIG_RCU_NOCB_CPUS=n, which will prevent these | ||
189 | kthreads from being created in the first place. However, please | ||
190 | note that this will not eliminate OS jitter, but will instead | ||
191 | shift it to RCU_SOFTIRQ. | ||
192 | |||
193 | Name: watchdog/%u | ||
194 | Purpose: Detect software lockups on each CPU. | ||
195 | To reduce its OS jitter, do at least one of the following: | ||
196 | 1. Build with CONFIG_LOCKUP_DETECTOR=n, which will prevent these | ||
197 | kthreads from being created in the first place. | ||
198 | 2. Echo a zero to /proc/sys/kernel/watchdog to disable the | ||
199 | watchdog timer. | ||
200 | 3. Echo a large number of /proc/sys/kernel/watchdog_thresh in | ||
201 | order to reduce the frequency of OS jitter due to the watchdog | ||
202 | timer down to a level that is acceptable for your workload. | ||