-rw-r--r-- | Documentation/timers/NO_HZ.txt | 273 |
1 files changed, 273 insertions, 0 deletions
diff --git a/Documentation/timers/NO_HZ.txt b/Documentation/timers/NO_HZ.txt new file mode 100644 index 000000000000..5b5322024067 --- /dev/null +++ b/Documentation/timers/NO_HZ.txt | |||
@@ -0,0 +1,273 @@ | |||
1 | NO_HZ: Reducing Scheduling-Clock Ticks | ||
2 | |||
3 | |||
4 | This document describes Kconfig options and boot parameters that can | ||
5 | reduce the number of scheduling-clock interrupts, thereby improving energy | ||
6 | efficiency and reducing OS jitter. Reducing OS jitter is important for | ||
7 | some types of computationally intensive high-performance computing (HPC) | ||
8 | applications and for real-time applications. | ||
9 | |||
10 | There are two main contexts in which the number of scheduling-clock | ||
11 | interrupts can be reduced compared to the old-school approach of sending | ||
12 | a scheduling-clock interrupt to all CPUs every jiffy whether they need | ||
13 | it or not (CONFIG_HZ_PERIODIC=y or CONFIG_NO_HZ=n for older kernels): | ||
14 | |||
15 | 1. Idle CPUs (CONFIG_NO_HZ_IDLE=y or CONFIG_NO_HZ=y for older kernels). | ||
16 | |||
17 | 2. CPUs having only one runnable task (CONFIG_NO_HZ_FULL=y). | ||
18 | |||
19 | These two cases are described in the following two sections, followed | ||
20 | by a third section on RCU-specific considerations and a fourth and final | ||
21 | section listing known issues. | ||
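
As a rough illustration of how the Kconfig options above relate to one another, the following sketch classifies a kernel configuration by its tick mode. The function name tick_mode() and the labels it returns are hypothetical and chosen only for this example; the option names are the ones given above.

```python
def tick_mode(config_text):
    """Classify a kernel .config text as "full", "idle", or "periodic".

    Hypothetical helper: the labels are illustrative, not kernel terms.
    """
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Only "CONFIG_FOO=y" lines count; "# CONFIG_FOO is not set" does not.
        if line.startswith("CONFIG_") and line.endswith("=y"):
            enabled.add(line[:-2])
    if "CONFIG_NO_HZ_FULL" in enabled:
        return "full"      # adaptive ticks (which also implies dyntick-idle)
    if "CONFIG_NO_HZ_IDLE" in enabled or "CONFIG_NO_HZ" in enabled:
        return "idle"      # dyntick-idle only
    return "periodic"      # CONFIG_HZ_PERIODIC=y, or CONFIG_NO_HZ=n on older kernels
```

The text of a real configuration could be passed in from wherever the running kernel's configuration is available on a given system.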
22 | |||
23 | |||
24 | IDLE CPUs | ||
25 | |||
26 | If a CPU is idle, there is little point in sending it a scheduling-clock | ||
27 | interrupt. After all, the primary purpose of a scheduling-clock interrupt | ||
28 | is to force a busy CPU to shift its attention among multiple duties, | ||
29 | and an idle CPU has no duties to shift its attention among. | ||
30 | |||
31 | The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending | ||
32 | scheduling-clock interrupts to idle CPUs, which is critically important | ||
33 | both to battery-powered devices and to highly virtualized mainframes. | ||
34 | A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would | ||
35 | drain its battery very quickly, easily 2-3 times as fast as would the | ||
36 | same device running a CONFIG_NO_HZ_IDLE=y kernel. A mainframe running | ||
37 | 1,500 OS instances might find that half of its CPU time was consumed by | ||
38 | unnecessary scheduling-clock interrupts. In these situations, there | ||
39 | is strong motivation to avoid sending scheduling-clock interrupts to | ||
40 | idle CPUs. That said, dyntick-idle mode is not free: | ||
41 | |||
42 | 1. It increases the number of instructions executed on the path | ||
43 | to and from the idle loop. | ||
44 | |||
45 | 2. On many architectures, dyntick-idle mode also increases the | ||
46 | number of expensive clock-reprogramming operations. | ||
47 | |||
48 | Therefore, systems with aggressive real-time response constraints often | ||
49 | run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels) | ||
50 | in order to avoid degrading from-idle transition latencies. | ||
51 | |||
52 | An idle CPU that is not receiving scheduling-clock interrupts is said to | ||
53 | be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running | ||
54 | tickless". The remainder of this document will use "dyntick-idle mode". | ||
55 | |||
56 | There is also a boot parameter "nohz=" that can be used to disable | ||
57 | dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off". | ||
58 | By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling | ||
59 | dyntick-idle mode. | ||
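
The following minimal sketch shows how the "nohz=" setting might be read back from a kernel command line on a CONFIG_NO_HZ_IDLE=y kernel. The helper name is hypothetical, and the rule that the last "nohz=" argument wins is an assumption made for the example.

```python
def dyntick_idle_enabled(cmdline):
    """Return False if the command line disables dyntick-idle mode.

    Hypothetical helper. Assumes a CONFIG_NO_HZ_IDLE=y kernel, whose
    default is "nohz=on", and assumes the last "nohz=" argument wins.
    """
    enabled = True
    for arg in cmdline.split():
        if arg == "nohz=off":
            enabled = False
        elif arg == "nohz=on":
            enabled = True
    return enabled
```

On a running system, the string could come from /proc/cmdline, for example dyntick_idle_enabled(open("/proc/cmdline").read()).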
60 | |||
61 | |||
62 | CPUs WITH ONLY ONE RUNNABLE TASK | ||
63 | |||
64 | If a CPU has only one runnable task, there is little point in sending it | ||
65 | a scheduling-clock interrupt because there is no other task to switch to. | ||
66 | |||
67 | The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid | ||
68 | sending scheduling-clock interrupts to CPUs with a single runnable task, | ||
69 | and such CPUs are said to be "adaptive-ticks CPUs". This is important | ||
70 | for applications with aggressive real-time response constraints because | ||
71 | it allows them to improve their worst-case response times by the maximum | ||
72 | duration of a scheduling-clock interrupt. It is also important for | ||
73 | computationally intensive short-iteration workloads: If any CPU is | ||
74 | delayed during a given iteration, all the other CPUs will be forced to | ||
75 | wait idle while the delayed CPU finishes. Thus, the delay is multiplied | ||
76 | by one less than the number of CPUs. In these situations, there is | ||
77 | again strong motivation to avoid sending scheduling-clock interrupts. | ||
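
The multiplication described above can be worked through with a small calculation. The function name and the sample numbers are illustrative assumptions, not measurements.

```python
def wasted_idle_time_us(delay_us, nr_cpus):
    """CPU time spent waiting when one CPU is delayed during an iteration.

    Each of the other (nr_cpus - 1) CPUs sits idle for the whole delay.
    """
    return delay_us * (nr_cpus - 1)

# Illustrative numbers only: a 10-microsecond delay on one CPU of a
# 64-CPU job costs 63 other CPUs 10 microseconds each.
print(wasted_idle_time_us(10, 64))  # prints 630
```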
78 | |||
79 | By default, no CPU will be an adaptive-ticks CPU. The "nohz_full=" | ||
80 | boot parameter specifies the adaptive-ticks CPUs. For example, | ||
81 | "nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks | ||
82 | CPUs. Note that you are prohibited from marking all of the CPUs as | ||
83 | adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain | ||
84 | online to handle timekeeping tasks in order to ensure that system calls | ||
85 | like gettimeofday() return accurate values on adaptive-tick CPUs. | ||
86 | (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no | ||
87 | running user processes to observe slight drifts in clock rate.) | ||
88 | Therefore, the boot CPU is prohibited from entering adaptive-ticks | ||
89 | mode. Specifying a "nohz_full=" mask that includes the boot CPU will | ||
90 | result in a boot-time error message, and the boot CPU will be removed | ||
91 | from the mask. | ||
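
The CPU-list syntax and the boot-CPU rule described above can be sketched as follows. This illustrates the semantics only and is not the kernel's parser; both function names are hypothetical, and the boot CPU is assumed to be CPU 0.

```python
def parse_cpu_list(spec):
    """Expand a list such as "1,6-8" into the set {1, 6, 7, 8}."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus

def adaptive_tick_cpus(spec, boot_cpu=0):
    """Return the sorted adaptive-ticks CPUs for a "nohz_full=" value.

    The boot CPU (assumed here to be CPU 0) is removed from the mask,
    mirroring the behavior described in the text.
    """
    cpus = parse_cpu_list(spec)
    cpus.discard(boot_cpu)
    return sorted(cpus)
```

For example, adaptive_tick_cpus("1,6-8") yields [1, 6, 7, 8], and adaptive_tick_cpus("0-3") yields [1, 2, 3] because the boot CPU is dropped.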
92 | |||
93 | Alternatively, the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter specifies | ||
94 | that all CPUs other than the boot CPU are adaptive-ticks CPUs. This | ||
95 | Kconfig parameter will be overridden by the "nohz_full=" boot parameter, | ||
96 | so that if both the CONFIG_NO_HZ_FULL_ALL=y Kconfig parameter and | ||
97 | the "nohz_full=1" boot parameter are specified, the boot parameter will | ||
98 | prevail so that only CPU 1 will be an adaptive-ticks CPU. | ||
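
The precedence between CONFIG_NO_HZ_FULL_ALL=y and the "nohz_full=" boot parameter can be modeled as below. This is a sketch of the rule as stated in the text; the function and parameter names are hypothetical, and the boot CPU is again assumed to be CPU 0.

```python
def effective_adaptive_tick_cpus(nr_cpus, full_all, nohz_full=None, boot_cpu=0):
    """Return the sorted list of adaptive-ticks CPUs.

    nr_cpus   -- number of CPUs in the system
    full_all  -- True if built with CONFIG_NO_HZ_FULL_ALL=y
    nohz_full -- iterable of CPUs from the boot parameter, or None if absent
    boot_cpu  -- assumed to be CPU 0 in this sketch
    """
    if nohz_full is not None:
        cpus = set(nohz_full)          # the boot parameter prevails
    elif full_all:
        cpus = set(range(nr_cpus))     # all CPUs other than the boot CPU
    else:
        cpus = set()                   # default: no adaptive-ticks CPUs
    cpus.discard(boot_cpu)             # the boot CPU is never adaptive-ticks
    return sorted(cpus)
```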
99 | |||
100 | Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded. | ||
101 | This is covered in the "RCU IMPLICATIONS" section below. | ||
102 | |||
103 | Normally, a CPU remains in adaptive-ticks mode as long as possible. | ||
104 | In particular, transitioning to kernel mode does not automatically change | ||
105 | the mode. Instead, the CPU will exit adaptive-ticks mode only if needed, | ||
106 | for example, if that CPU enqueues an RCU callback. | ||
107 | |||
108 | Just as with dyntick-idle mode, the benefits of adaptive-tick mode do | ||
109 | not come for free: | ||
110 | |||
111 | 1. CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run | ||
112 | adaptive ticks without also running dyntick idle. This dependency | ||
113 | extends down into the implementation, so that all of the costs | ||
114 | of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL. | ||
115 | |||
116 | 2. The user/kernel transitions are slightly more expensive due | ||
117 | to the need to inform kernel subsystems (such as RCU) about | ||
118 | the change in mode. | ||
119 | |||
120 | 3. POSIX CPU timers on adaptive-tick CPUs may miss their deadlines | ||
121 | (perhaps indefinitely) because they currently rely on | ||
122 | scheduling-tick interrupts. This will likely be fixed in | ||
123 | one of two ways: (1) Prevent CPUs with POSIX CPU timers from | ||
124 | entering adaptive-tick mode, or (2) Use hrtimers or other | ||
125 | adaptive-ticks-immune mechanism to cause the POSIX CPU timer to | ||
126 | fire properly. | ||
127 | |||
128 | 4. If there are more perf events pending than the hardware can | ||
129 | accommodate, they are normally round-robined so as to collect | ||
130 | all of them over time. Adaptive-tick mode may prevent this | ||
131 | round-robining from happening. This will likely be fixed by | ||
132 | preventing CPUs with large numbers of perf events pending from | ||
133 | entering adaptive-tick mode. | ||
134 | |||
135 | 5. Scheduler statistics for adaptive-tick CPUs may be computed | ||
136 | slightly differently than those for non-adaptive-tick CPUs. | ||
137 | This might in turn perturb load-balancing of real-time tasks. | ||
138 | |||
139 | 6. The LB_BIAS scheduler feature is disabled by adaptive ticks. | ||
140 | |||
141 | Although improvements are expected over time, adaptive ticks is quite | ||
142 | useful for many types of real-time and compute-intensive applications. | ||
143 | However, the drawbacks listed above mean that adaptive ticks should not | ||
144 | (yet) be enabled by default. | ||
145 | |||
146 | |||
147 | RCU IMPLICATIONS | ||
148 | |||
149 | There are situations in which idle CPUs cannot be permitted to | ||
150 | enter either dyntick-idle mode or adaptive-tick mode, the most | ||
151 | common being when those CPUs have RCU callbacks pending. | ||
152 | |||
153 | The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs | ||
154 | to enter dyntick-idle mode or adaptive-tick mode anyway. In this case, | ||
155 | a timer will awaken these CPUs every four jiffies in order to ensure | ||
156 | that the RCU callbacks are processed in a timely fashion. | ||
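
Because a jiffy lasts 1/HZ seconds, the wall-clock length of the four-jiffy wakeup interval depends on the kernel's HZ setting. The following arithmetic sketch uses a hypothetical function name and common HZ values chosen only as examples.

```python
def wakeup_interval_ms(hz, jiffies=4):
    """Convert a wakeup interval in jiffies to milliseconds for a given HZ."""
    return jiffies * 1000 / hz

for hz in (100, 250, 1000):
    print(hz, wakeup_interval_ms(hz))
# prints:
# 100 40.0
# 250 16.0
# 1000 4.0
```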
157 | |||
158 | Another approach is to offload RCU callback processing to "rcuo" kthreads | ||
159 | using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to | ||
160 | offload may be selected via several methods: | ||
161 | |||
162 | 1. One of three mutually exclusive Kconfig options specifies a | ||
163 | build-time default for the CPUs to offload: | ||
164 | |||
165 | a. The CONFIG_RCU_NOCB_CPU_NONE=y Kconfig option results in | ||
166 | no CPUs being offloaded. | ||
167 | |||
168 | b. The CONFIG_RCU_NOCB_CPU_ZERO=y Kconfig option causes | ||
169 | CPU 0 to be offloaded. | ||
170 | |||
171 | c. The CONFIG_RCU_NOCB_CPU_ALL=y Kconfig option causes all | ||
172 | CPUs to be offloaded. Note that the callbacks will be | ||
173 | offloaded to "rcuo" kthreads, and that those kthreads | ||
174 | will in fact run on some CPU. However, this approach | ||
175 | gives fine-grained control on exactly which CPUs the | ||
176 | callbacks run on, along with their scheduling priority | ||
177 | (including the default of SCHED_OTHER), and it further | ||
178 | allows this control to be varied dynamically at runtime. | ||
179 | |||
180 | 2. The "rcu_nocbs=" kernel boot parameter takes a comma-separated | ||
181 | list of CPUs and CPU ranges; for example, "1,3-5" selects CPUs 1, | ||
182 | 3, 4, and 5. The specified CPUs will be offloaded in addition to | ||
183 | any CPUs specified as offloaded by CONFIG_RCU_NOCB_CPU_ZERO=y or | ||
184 | CONFIG_RCU_NOCB_CPU_ALL=y. This means that the "rcu_nocbs=" boot | ||
185 | parameter has no effect for kernels built with CONFIG_RCU_NOCB_CPU_ALL=y. | ||
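
The way the build-time default and the "rcu_nocbs=" boot parameter combine can be sketched as a set union. The function name and the "none"/"zero"/"all" labels are hypothetical stand-ins for the three Kconfig options listed above.

```python
def offloaded_cpus(nr_cpus, build_default, rcu_nocbs=()):
    """Return the sorted list of CPUs whose RCU callbacks are offloaded.

    build_default -- "none", "zero", or "all", standing in for
                     CONFIG_RCU_NOCB_CPU_NONE/ZERO/ALL=y (labels are
                     illustrative only)
    rcu_nocbs     -- CPUs named by the "rcu_nocbs=" boot parameter
    """
    defaults = {
        "none": set(),
        "zero": {0},
        "all": set(range(nr_cpus)),
    }
    # Boot-parameter CPUs are offloaded in addition to the build-time default.
    return sorted(defaults[build_default] | set(rcu_nocbs))
```

With the "all" default the union already contains every CPU, which is why the boot parameter has no effect in that case.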
186 | |||
187 | The offloaded CPUs will never queue RCU callbacks, and therefore RCU | ||
188 | never prevents offloaded CPUs from entering either dyntick-idle mode | ||
189 | or adaptive-tick mode. That said, note that it is up to userspace to | ||
190 | pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the | ||
191 | scheduler will decide where to run them, which might or might not be | ||
192 | where you want them to run. | ||
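
One possible way for userspace to pin the "rcuo" kthreads is sketched below. It assumes that the kthreads can be recognized by a command name beginning with "rcuo" in /proc/<pid>/comm and that the caller has sufficient privilege to change their affinity; both function names are hypothetical.

```python
import os

def find_rcuo_pids(proc_root="/proc"):
    """Return sorted PIDs whose command name starts with "rcuo".

    Assumption: the offload kthreads are identifiable by this prefix.
    """
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                comm = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if comm.startswith("rcuo"):
            pids.append(int(entry))
    return sorted(pids)

def pin_rcuo_kthreads(cpus, proc_root="/proc"):
    """Restrict every "rcuo" kthread to the given set of CPUs."""
    for pid in find_rcuo_pids(proc_root):
        os.sched_setaffinity(pid, cpus)  # typically requires root
```

For example, pin_rcuo_kthreads({0}) would confine the callback processing to CPU 0, leaving the remaining CPUs free of it.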
193 | |||
194 | |||
195 | KNOWN ISSUES | ||
196 | |||
197 | o Dyntick-idle slows transitions to and from idle slightly. | ||
198 | In practice, this has not been a problem except for the most | ||
199 | aggressive real-time workloads, which have the option of disabling | ||
200 | dyntick-idle mode, an option that most of them take. However, | ||
201 | some workloads will no doubt want to use adaptive ticks to | ||
202 | eliminate scheduling-clock interrupt latencies. Here are some | ||
203 | options for these workloads: | ||
204 | |||
205 | a. Use PMQOS from userspace to inform the kernel of your | ||
206 | latency requirements (preferred). | ||
207 | |||
208 | b. On x86 systems, use the "idle=mwait" boot parameter. | ||
209 | |||
210 | c. On x86 systems, use the "intel_idle.max_cstate=" boot parameter | ||
211 | to limit the maximum C-state depth. | ||
212 | |||
213 | d. On x86 systems, use the "idle=poll" boot parameter. | ||
214 | However, please note that use of this parameter can cause | ||
215 | your CPU to overheat, which may cause thermal throttling | ||
216 | to degrade your latencies -- and that this degradation can | ||
217 | be even worse than that of dyntick-idle. Furthermore, | ||
218 | this parameter effectively disables Turbo Mode on Intel | ||
219 | CPUs, which can significantly reduce maximum performance. | ||
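
Option (a) above might look like the following from userspace. This sketch assumes the PM QoS character device /dev/cpu_dma_latency, which accepts a 32-bit latency bound in microseconds and honors it for as long as the file descriptor stays open; the function names are hypothetical.

```python
import os
import struct

def encode_latency_us(latency_us):
    """Encode a latency bound in microseconds as a 32-bit signed integer."""
    return struct.pack("=i", latency_us)

def request_cpu_latency(latency_us, path="/dev/cpu_dma_latency"):
    """Register a latency request and return the open file descriptor.

    Assumption: the request remains in force only while the descriptor
    is held open, so the caller must keep it and close it when done.
    Opening the device typically requires root.
    """
    fd = os.open(path, os.O_WRONLY)
    os.write(fd, encode_latency_us(latency_us))
    return fd
```

A latency-sensitive application would call request_cpu_latency() once at startup and call os.close() on the returned descriptor at exit.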
220 | |||
221 | o Adaptive-ticks slows user/kernel transitions slightly. | ||
222 | This is not expected to be a problem for computationally intensive | ||
223 | workloads, which have few such transitions. Careful benchmarking | ||
224 | will be required to determine whether or not other workloads | ||
225 | are significantly affected by this effect. | ||
226 | |||
227 | o Adaptive-ticks does not do anything unless there is only one | ||
228 | runnable task for a given CPU, even though there are a number | ||
229 | of other situations where the scheduling-clock tick is not | ||
230 | needed. To give but one example, consider a CPU that has one | ||
231 | runnable high-priority SCHED_FIFO task and an arbitrary number | ||
232 | of low-priority SCHED_OTHER tasks. In this case, the CPU is | ||
233 | required to run the SCHED_FIFO task until it either blocks or | ||
234 | some other higher-priority task awakens on (or is assigned to) | ||
235 | this CPU, so there is no point in sending a scheduling-clock | ||
236 | interrupt to this CPU. However, the current implementation | ||
237 | nevertheless sends scheduling-clock interrupts to CPUs having a | ||
238 | single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER | ||
239 | tasks, even though these interrupts are unnecessary. | ||
240 | |||
241 | Better handling of these sorts of situations is future work. | ||
242 | |||
243 | o A reboot is required to reconfigure both adaptive ticks and RCU | ||
244 | callback offloading. Runtime reconfiguration could be provided | ||
245 | if needed; however, due to the complexity of reconfiguring RCU at | ||
246 | runtime, there would need to be an earthshakingly good reason, | ||
247 | especially given that you have the straightforward option of | ||
248 | simply offloading RCU callbacks from all CPUs and pinning them | ||
249 | where you want them whenever you want them pinned. | ||
250 | |||
251 | o Additional configuration is required to deal with other sources | ||
252 | of OS jitter, including interrupts and system-utility tasks | ||
253 | and processes. This configuration normally involves binding | ||
254 | interrupts and tasks to particular CPUs. | ||
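
Binding an interrupt to particular CPUs is commonly done by writing a hexadecimal CPU mask to /proc/irq/<irq>/smp_affinity. The sketch below computes such a mask; the function names are hypothetical, and the example is limited to CPUs 0 through 31 so that the mask fits in a single 32-bit group.

```python
def cpu_mask_hex(cpus):
    """Return the hexadecimal affinity mask for a collection of CPU numbers.

    Limited to CPUs 0-31 in this sketch so that the mask fits in one
    32-bit group.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch handles only CPUs 0-31")
        mask |= 1 << cpu
    return format(mask, "x")

def bind_irq(irq, cpus):
    """Write the mask for the given CPUs to the IRQ's smp_affinity file."""
    # Typically requires root; some interrupts cannot be moved.
    with open("/proc/irq/%d/smp_affinity" % irq, "w") as f:
        f.write(cpu_mask_hex(cpus))
```

For example, cpu_mask_hex([0, 1]) gives "3", which would keep an interrupt on CPUs 0 and 1 and away from adaptive-ticks CPUs 2 and above.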
255 | |||
256 | o Some sources of OS jitter can currently be eliminated only by | ||
257 | constraining the workload. For example, the only way to eliminate | ||
258 | OS jitter due to global TLB shootdowns is to avoid the unmapping | ||
259 | operations (such as kernel module unload operations) that | ||
260 | result in these shootdowns. For another example, page faults | ||
261 | and TLB misses can be reduced (and in some cases eliminated) by | ||
262 | using huge pages and by constraining the amount of memory used | ||
263 | by the application. Pre-faulting the working set can also be | ||
264 | helpful, especially when combined with the mlock() and mlockall() | ||
265 | system calls. | ||
266 | |||
267 | o Unless all CPUs are idle, at least one CPU must keep the | ||
268 | scheduling-clock interrupt going in order to support accurate | ||
269 | timekeeping. | ||
270 | |||
271 | o If there are adaptive-ticks CPUs, there will be at least one | ||
272 | CPU keeping the scheduling-clock interrupt going, even if all | ||
273 | CPUs are otherwise idle. | ||