author | Rafael J. Wysocki <rjw@rjwysocki.net> | 2017-03-13 18:59:57 -0400 |
---|---|---|
committer | Jonathan Corbet <corbet@lwn.net> | 2017-03-13 19:08:42 -0400 |
commit | 2a0e49279850d28c450f27e51b419ce90bacdcdc (patch) | |
tree | 96e995e194a1bb9926a4f1c4fa01571bf218e148 | |
parent | 8fa1bb506fc9b5b0f7b5e42cee4f8213325a98ee (diff) |
cpufreq: User/admin documentation update and consolidation
The user/admin documentation of cpufreq is badly outdated. It
contains stale and/or inaccurate information along with things
that are not particularly useful. Also, some of the important
pieces are missing from it.
For this reason, add a new user/admin document for cpufreq
containing current information to admin-guide and drop the old
outdated .txt documents it is replacing.
Since there will be more PM documents in admin-guide going forward,
create a separate directory for them and put the cpufreq document
in there right away.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-rw-r--r-- | Documentation/admin-guide/index.rst | 1 | ||||
-rw-r--r-- | Documentation/admin-guide/pm/cpufreq.rst | 700 | ||||
-rw-r--r-- | Documentation/admin-guide/pm/index.rst | 15 | ||||
-rw-r--r-- | Documentation/cpu-freq/boost.txt | 93 | ||||
-rw-r--r-- | Documentation/cpu-freq/governors.txt | 301 | ||||
-rw-r--r-- | Documentation/cpu-freq/index.txt | 7 | ||||
-rw-r--r-- | Documentation/cpu-freq/user-guide.txt | 228 |
7 files changed, 716 insertions, 629 deletions
diff --git a/Documentation/admin-guide/index.rst b/Documentation/admin-guide/index.rst index 8ddae4e4299a..8c60a8a32a1a 100644 --- a/Documentation/admin-guide/index.rst +++ b/Documentation/admin-guide/index.rst | |||
@@ -60,6 +60,7 @@ configure specific aspects of kernel behavior to your liking. | |||
60 | mono | 60 | mono |
61 | java | 61 | java |
62 | ras | 62 | ras |
63 | pm/index | ||
63 | 64 | ||
64 | .. only:: subproject and html | 65 | .. only:: subproject and html |
65 | 66 | ||
diff --git a/Documentation/admin-guide/pm/cpufreq.rst b/Documentation/admin-guide/pm/cpufreq.rst new file mode 100644 index 000000000000..289c80f7760e --- /dev/null +++ b/Documentation/admin-guide/pm/cpufreq.rst | |||
@@ -0,0 +1,700 @@ | |||
1 | .. |struct cpufreq_policy| replace:: :c:type:`struct cpufreq_policy <cpufreq_policy>` | ||
2 | |||
3 | ======================= | ||
4 | CPU Performance Scaling | ||
5 | ======================= | ||
6 | |||
7 | :: | ||
8 | |||
9 | Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> | ||
10 | |||
11 | The Concept of CPU Performance Scaling | ||
12 | ====================================== | ||
13 | |||
14 | The majority of modern processors are capable of operating in a number of | ||
15 | different clock frequency and voltage configurations, often referred to as | ||
16 | Operating Performance Points or P-states (in ACPI terminology). As a rule, | ||
17 | the higher the clock frequency and the higher the voltage, the more instructions | ||
18 | can be retired by the CPU over a unit of time, but also the higher the clock | ||
19 | frequency and the higher the voltage, the more energy is consumed over a unit of | ||
20 | time (or the more power is drawn) by the CPU in the given P-state. Therefore | ||
21 | there is a natural tradeoff between the CPU capacity (the number of instructions | ||
22 | that can be executed over a unit of time) and the power drawn by the CPU. | ||
23 | |||
24 | In some situations it is desirable or even necessary to run the program as fast | ||
25 | as possible and then there is no reason to use any P-states different from the | ||
26 | highest one (i.e. the highest-performance frequency/voltage configuration | ||
27 | available). In some other cases, however, it may not be necessary to execute | ||
28 | instructions so quickly and maintaining the highest available CPU capacity for a | ||
29 | relatively long time without utilizing it entirely may be regarded as wasteful. | ||
30 | It also may not be physically possible to maintain maximum CPU capacity for too | ||
31 | long for thermal or power supply capacity reasons or similar. To cover those | ||
32 | cases, there are hardware interfaces allowing CPUs to be switched between | ||
33 | different frequency/voltage configurations or (in the ACPI terminology) to be | ||
34 | put into different P-states. | ||
35 | |||
36 | Typically, they are used along with algorithms to estimate the required CPU | ||
37 | capacity, so as to decide which P-states to put the CPUs into. Of course, since | ||
38 | the utilization of the system generally changes over time, that has to be done | ||
39 | repeatedly on a regular basis. The activity by which this happens is referred | ||
40 | to as CPU performance scaling or CPU frequency scaling (because it involves | ||
41 | adjusting the CPU clock frequency). | ||
42 | |||
43 | |||
44 | CPU Performance Scaling in Linux | ||
45 | ================================ | ||
46 | |||
47 | The Linux kernel supports CPU performance scaling by means of the ``CPUFreq`` | ||
48 | (CPU Frequency scaling) subsystem that consists of three layers of code: the | ||
49 | core, scaling governors and scaling drivers. | ||
50 | |||
51 | The ``CPUFreq`` core provides the common code infrastructure and user space | ||
52 | interfaces for all platforms that support CPU performance scaling. It defines | ||
53 | the basic framework in which the other components operate. | ||
54 | |||
55 | Scaling governors implement algorithms to estimate the required CPU capacity. | ||
56 | As a rule, each governor implements one, possibly parametrized, scaling | ||
57 | algorithm. | ||
58 | |||
59 | Scaling drivers talk to the hardware. They provide scaling governors with | ||
60 | information on the available P-states (or P-state ranges in some cases) and | ||
61 | access platform-specific hardware interfaces to change CPU P-states as requested | ||
62 | by scaling governors. | ||
63 | |||
64 | In principle, all available scaling governors can be used with every scaling | ||
65 | driver. That design is based on the observation that the information used by | ||
66 | performance scaling algorithms for P-state selection can be represented in a | ||
67 | platform-independent form in the majority of cases, so it should be possible | ||
68 | to use the same performance scaling algorithm implemented in exactly the same | ||
69 | way regardless of which scaling driver is used. Consequently, the same set of | ||
70 | scaling governors should be suitable for every supported platform. | ||
71 | |||
72 | However, that observation may not hold for performance scaling algorithms | ||
73 | based on information provided by the hardware itself, for example through | ||
74 | feedback registers, as that information is typically specific to the hardware | ||
75 | interface it comes from and may not be easily represented in an abstract, | ||
76 | platform-independent way. For this reason, ``CPUFreq`` allows scaling drivers | ||
77 | to bypass the governor layer and implement their own performance scaling | ||
78 | algorithms. That is done by the ``intel_pstate`` scaling driver. | ||
79 | |||
80 | |||
81 | ``CPUFreq`` Policy Objects | ||
82 | ========================== | ||
83 | |||
84 | In some cases the hardware interface for P-state control is shared by multiple | ||
85 | CPUs. That is, for example, the same register (or set of registers) is used to | ||
86 | control the P-state of multiple CPUs at the same time and writing to it affects | ||
87 | all of those CPUs simultaneously. | ||
88 | |||
89 | Sets of CPUs sharing hardware P-state control interfaces are represented by | ||
90 | ``CPUFreq`` as |struct cpufreq_policy| objects. For consistency, | ||
91 | |struct cpufreq_policy| is also used when there is only one CPU in the given | ||
92 | set. | ||
93 | |||
94 | The ``CPUFreq`` core maintains a pointer to a |struct cpufreq_policy| object for | ||
95 | every CPU in the system, including CPUs that are currently offline. If multiple | ||
96 | CPUs share the same hardware P-state control interface, all of the pointers | ||
97 | corresponding to them point to the same |struct cpufreq_policy| object. | ||
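
The relationship between CPUs and policy objects described above can be pictured with a small Python model. This is only an illustrative sketch, not kernel code: the ``Policy`` class and the ``build_policy_pointers`` helper are hypothetical names, and the grouping of CPUs is supplied by hand rather than discovered from the hardware.

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Toy stand-in for struct cpufreq_policy (hypothetical, for illustration)."""
    cpus: list = field(default_factory=list)  # CPUs sharing one P-state interface


def build_policy_pointers(groups):
    """Map every CPU number to the policy object of its group.

    ``groups`` is a list of CPU-number lists, one list per shared hardware
    P-state control interface.
    """
    pointers = {}
    for group in groups:
        policy = Policy(cpus=list(group))
        for cpu in group:
            pointers[cpu] = policy  # all CPUs in the group share one object
    return pointers


# CPUs 0 and 1 share one interface, CPU 2 has its own.
pointers = build_policy_pointers([[0, 1], [2]])
print(pointers[0] is pointers[1])  # True: same policy object
print(pointers[0] is pointers[2])  # False: separate policy object
```

A policy with a single CPU is represented the same way as a shared one, which mirrors the consistency argument made in the text.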
98 | |||
99 | ``CPUFreq`` uses |struct cpufreq_policy| as its basic data type and the design | ||
100 | of its user space interface is based on the policy concept. | ||
101 | |||
102 | |||
103 | CPU Initialization | ||
104 | ================== | ||
105 | |||
106 | First of all, a scaling driver has to be registered for ``CPUFreq`` to work. | ||
107 | It is only possible to register one scaling driver at a time, so the scaling | ||
108 | driver is expected to be able to handle all CPUs in the system. | ||
109 | |||
110 | The scaling driver may be registered before or after CPU registration. If | ||
111 | CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to | ||
112 | take note of all of the already registered CPUs during the registration of the | ||
113 | scaling driver. In turn, if any CPUs are registered after the registration of | ||
114 | the scaling driver, the ``CPUFreq`` core will be invoked to take note of them | ||
115 | at their registration time. | ||
116 | |||
117 | In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it | ||
118 | has not seen so far as soon as it is ready to handle that CPU. [Note that the | ||
119 | logical CPU may be a physical single-core processor, or a single core in a | ||
120 | multicore processor, or a hardware thread in a physical processor or processor | ||
121 | core. In what follows "CPU" always means "logical CPU" unless explicitly stated | ||
122 | otherwise and the word "processor" is used to refer to the physical part | ||
123 | possibly including multiple logical CPUs.] | ||
124 | |||
125 | Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set | ||
126 | for the given CPU and if so, it skips the policy object creation. Otherwise, | ||
127 | a new policy object is created and initialized, which involves the creation of | ||
128 | a new policy directory in ``sysfs``, and the policy pointer corresponding to | ||
129 | the given CPU is set to the new policy object's address in memory. | ||
130 | |||
131 | Next, the scaling driver's ``->init()`` callback is invoked with the policy | ||
132 | pointer of the new CPU passed to it as the argument. That callback is expected | ||
133 | to initialize the performance scaling hardware interface for the given CPU (or, | ||
134 | more precisely, for the set of CPUs sharing the hardware interface it belongs | ||
135 | to, represented by its policy object) and, if the policy object it has been | ||
136 | called for is new, to set parameters of the policy, like the minimum and maximum | ||
137 | frequencies supported by the hardware, the table of available frequencies (if | ||
138 | the set of supported P-states is not a continuous range), and the mask of CPUs | ||
139 | that belong to the same policy (including both online and offline CPUs). That | ||
140 | mask is then used by the core to populate the policy pointers for all of the | ||
141 | CPUs in it. | ||
142 | |||
143 | The next major initialization step for a new policy object is to attach a | ||
144 | scaling governor to it (to begin with, that is the default scaling governor | ||
145 | determined by the kernel configuration, but it may be changed later | ||
146 | via ``sysfs``). First, a pointer to the new policy object is passed to the | ||
147 | governor's ``->init()`` callback which is expected to initialize all of the | ||
148 | data structures necessary to handle the given policy and, possibly, to add | ||
149 | a governor ``sysfs`` interface to it. Next, the governor is started by | ||
150 | invoking its ``->start()`` callback. | ||
151 | |||
152 | That callback is expected to register per-CPU utilization update callbacks for | ||
153 | all of the online CPUs belonging to the given policy with the CPU scheduler. | ||
154 | The utilization update callbacks will be invoked by the CPU scheduler on | ||
155 | important events, like task enqueue and dequeue, on every iteration of the | ||
156 | scheduler tick or generally whenever the CPU utilization may change (from the | ||
157 | scheduler's perspective). They are expected to carry out computations needed | ||
158 | to determine the P-state to use for the given policy going forward and to | ||
159 | invoke the scaling driver to make changes to the hardware in accordance with | ||
160 | the P-state selection. The scaling driver may be invoked directly from | ||
161 | scheduler context or asynchronously, via a kernel thread or workqueue, depending | ||
162 | on the configuration and capabilities of the scaling driver and the governor. | ||
163 | |||
164 | Similar steps are taken for policy objects that are not new, but were "inactive" | ||
165 | previously, meaning that all of the CPUs belonging to them were offline. The | ||
166 | only practical difference in that case is that the ``CPUFreq`` core will attempt | ||
167 | to use the scaling governor previously used with the policy that became | ||
168 | "inactive" (and is re-initialized now) instead of the default governor. | ||
169 | |||
170 | In turn, if a previously offline CPU is being brought back online, but some | ||
171 | other CPUs sharing the policy object with it are online already, there is no | ||
172 | need to re-initialize the policy object at all. In that case, it only is | ||
173 | necessary to restart the scaling governor so that it can take the new online CPU | ||
174 | into account. That is achieved by invoking the governor's ``->stop()`` and | ||
175 | ``->start()`` callbacks, in this order, for the entire policy. | ||
176 | |||
177 | As mentioned before, the ``intel_pstate`` scaling driver bypasses the scaling | ||
178 | governor layer of ``CPUFreq`` and provides its own P-state selection algorithms. | ||
179 | Consequently, if ``intel_pstate`` is used, scaling governors are not attached to | ||
180 | new policy objects. Instead, the driver's ``->setpolicy()`` callback is invoked | ||
181 | to register per-CPU utilization update callbacks for each policy. These | ||
182 | callbacks are invoked by the CPU scheduler in the same way as for scaling | ||
183 | governors, but in the ``intel_pstate`` case they both determine the P-state to | ||
184 | use and change the hardware configuration accordingly in one go from scheduler | ||
185 | context. | ||
186 | |||
187 | The policy objects created during CPU initialization and other data structures | ||
188 | associated with them are torn down when the scaling driver is unregistered | ||
189 | (which happens when the kernel module containing it is unloaded, for example) or | ||
190 | when the last CPU belonging to the given policy is unregistered. | ||
191 | |||
192 | |||
193 | Policy Interface in ``sysfs`` | ||
194 | ============================= | ||
195 | |||
196 | During the initialization of the kernel, the ``CPUFreq`` core creates a | ||
197 | ``sysfs`` directory (kobject) called ``cpufreq`` under | ||
198 | :file:`/sys/devices/system/cpu/`. | ||
199 | |||
200 | That directory contains a ``policyX`` subdirectory (where ``X`` represents an | ||
201 | integer number) for every policy object maintained by the ``CPUFreq`` core. | ||
202 | Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links | ||
203 | under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer | ||
204 | that may be different from the one represented by ``X``) for all of the CPUs | ||
205 | associated with (or belonging to) the given policy. The ``policyX`` directories | ||
206 | in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific | ||
207 | attributes (files) to control ``CPUFreq`` behavior for the corresponding policy | ||
208 | objects (that is, for all of the CPUs associated with them). | ||
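
The symbolic-link layout described above can be inspected from user space. The following Python sketch resolves the ``cpufreq`` link of a given CPU to the name of its ``policyX`` directory. The function name ``policy_of_cpu`` and the ``cpu_root`` parameter are illustrative assumptions; the parameter exists only so that the code can be pointed at a directory other than the real ``sysfs`` tree.

```python
import os
from pathlib import Path


def policy_of_cpu(cpu, cpu_root="/sys/devices/system/cpu"):
    """Return the name of the policy directory that cpuY/cpufreq points to.

    Returns None if the CPU has no ``cpufreq`` link, for example when no
    scaling driver is registered.
    """
    link = Path(cpu_root) / f"cpu{cpu}" / "cpufreq"
    if not link.exists():
        return None
    # Follow the symbolic link and keep only the last path component.
    return Path(os.path.realpath(link)).name


if __name__ == "__main__":
    print(policy_of_cpu(0))
```

CPUs that share a policy resolve to the same ``policyX`` name, and the number in that name need not match the CPU number.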
209 | |||
210 | Some of those attributes are generic. They are created by the ``CPUFreq`` core | ||
211 | and their behavior generally does not depend on what scaling driver is in use | ||
212 | and what scaling governor is attached to the given policy. Some scaling drivers | ||
213 | also add driver-specific attributes to the policy directories in ``sysfs`` to | ||
214 | control policy-specific aspects of driver behavior. | ||
215 | |||
216 | The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/` | ||
217 | are the following: | ||
218 | |||
219 | ``affected_cpus`` | ||
220 | List of online CPUs belonging to this policy (i.e. sharing the hardware | ||
221 | performance scaling interface represented by the ``policyX`` policy | ||
222 | object). | ||
223 | |||
224 | ``bios_limit`` | ||
225 | If the platform firmware (BIOS) tells the OS to apply an upper limit to | ||
226 | CPU frequencies, that limit will be reported through this attribute (if | ||
227 | present). | ||
228 | |||
229 | The existence of the limit may be a result of some (often unintentional) | ||
230 | BIOS settings, restrictions coming from a service processor or other | ||
231 | BIOS/HW-based mechanisms. | ||
232 | |||
233 | This does not cover ACPI thermal limitations which can be discovered | ||
234 | through a generic thermal driver. | ||
235 | |||
236 | This attribute is not present if the scaling driver in use does not | ||
237 | support it. | ||
238 | |||
239 | ``cpuinfo_max_freq`` | ||
240 | Maximum possible operating frequency the CPUs belonging to this policy | ||
241 | can run at (in kHz). | ||
242 | |||
243 | ``cpuinfo_min_freq`` | ||
244 | Minimum possible operating frequency the CPUs belonging to this policy | ||
245 | can run at (in kHz). | ||
246 | |||
247 | ``cpuinfo_transition_latency`` | ||
248 | The time it takes to switch the CPUs belonging to this policy from one | ||
249 | P-state to another, in nanoseconds. | ||
250 | |||
251 | If unknown or if known to be so high that the scaling driver does not | ||
252 | work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`) | ||
253 | will be returned by reads from this attribute. | ||
254 | |||
255 | ``related_cpus`` | ||
256 | List of all (online and offline) CPUs belonging to this policy. | ||
257 | |||
258 | ``scaling_available_governors`` | ||
259 | List of ``CPUFreq`` scaling governors present in the kernel that can | ||
260 | be attached to this policy or (if the ``intel_pstate`` scaling driver is | ||
261 | in use) list of scaling algorithms provided by the driver that can be | ||
262 | applied to this policy. | ||
263 | |||
264 | [Note that some governors are modular and it may be necessary to load a | ||
265 | kernel module for the governor held by it to become available and be | ||
266 | listed by this attribute.] | ||
267 | |||
268 | ``scaling_cur_freq`` | ||
269 | Current frequency of all of the CPUs belonging to this policy (in kHz). | ||
270 | |||
271 | For the majority of scaling drivers, this is the frequency of the last | ||
272 | P-state requested by the driver from the hardware using the scaling | ||
273 | interface provided by it, which may or may not reflect the frequency | ||
274 | the CPU is actually running at (due to hardware design and other | ||
275 | limitations). | ||
276 | |||
277 | Some scaling drivers (e.g. ``intel_pstate``) attempt to provide | ||
278 | information more precisely reflecting the current CPU frequency through | ||
279 | this attribute, but that still may not be the exact current CPU | ||
280 | frequency as seen by the hardware at the moment. | ||
281 | |||
282 | ``scaling_driver`` | ||
283 | The scaling driver currently in use. | ||
284 | |||
285 | ``scaling_governor`` | ||
286 | The scaling governor currently attached to this policy or (if the | ||
287 | ``intel_pstate`` scaling driver is in use) the scaling algorithm | ||
288 | provided by the driver that is currently applied to this policy. | ||
289 | |||
290 | This attribute is read-write and writing to it will cause a new scaling | ||
291 | governor to be attached to this policy or a new scaling algorithm | ||
292 | provided by the scaling driver to be applied to it (in the | ||
293 | ``intel_pstate`` case), as indicated by the string written to this | ||
294 | attribute (which must be one of the names listed by the | ||
295 | ``scaling_available_governors`` attribute described above). | ||
296 | |||
297 | ``scaling_max_freq`` | ||
298 | Maximum frequency the CPUs belonging to this policy are allowed to be | ||
299 | running at (in kHz). | ||
300 | |||
301 | This attribute is read-write and writing a string representing an | ||
302 | integer to it will cause a new limit to be set (it must not be lower | ||
303 | than the value of the ``scaling_min_freq`` attribute). | ||
304 | |||
305 | ``scaling_min_freq`` | ||
306 | Minimum frequency the CPUs belonging to this policy are allowed to be | ||
307 | running at (in kHz). | ||
308 | |||
309 | This attribute is read-write and writing a string representing a | ||
310 | non-negative integer to it will cause a new limit to be set (it must not | ||
311 | be higher than the value of the ``scaling_max_freq`` attribute). | ||
312 | |||
313 | ``scaling_setspeed`` | ||
314 | This attribute is functional only if the `userspace`_ scaling governor | ||
315 | is attached to the given policy. | ||
316 | |||
317 | It returns the last frequency requested by the governor (in kHz) or can | ||
318 | be written to in order to set a new frequency for the policy. | ||
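
As a rough illustration of how the generic attributes listed above might be consumed, the Python sketch below reads a few of them from one policy directory. The helper names ``read_policy`` and ``limits_are_consistent`` are hypothetical, and the sketch assumes each frequency attribute holds a single decimal number in kHz.

```python
from pathlib import Path

# Attributes assumed to hold one integer value in kHz.
KHZ_ATTRS = ("bios_limit", "cpuinfo_min_freq", "cpuinfo_max_freq",
             "scaling_min_freq", "scaling_max_freq")
# Attributes holding a plain name.
NAME_ATTRS = ("scaling_driver", "scaling_governor")


def read_policy(policy_dir):
    """Read a few generic attributes from one policyX directory.

    Attributes that are absent (``bios_limit`` is optional, for instance)
    are skipped instead of raising an error.
    """
    policy = Path(policy_dir)
    info = {}
    for name in KHZ_ATTRS:
        path = policy / name
        if path.is_file():
            info[name] = int(path.read_text().strip())
    for name in NAME_ATTRS:
        path = policy / name
        if path.is_file():
            info[name] = path.read_text().strip()
    return info


def limits_are_consistent(info):
    """Check that scaling_min_freq does not exceed scaling_max_freq."""
    return info["scaling_min_freq"] <= info["scaling_max_freq"]


if __name__ == "__main__":
    print(read_policy("/sys/devices/system/cpu/cpufreq/policy0"))
```

Reading these files normally needs no special privileges, whereas writing to the read-write attributes typically requires root.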
319 | |||
320 | |||
321 | Generic Scaling Governors | ||
322 | ========================= | ||
323 | |||
324 | ``CPUFreq`` provides generic scaling governors that can be used with all | ||
325 | scaling drivers. As stated before, each of them implements a single, possibly | ||
326 | parametrized, performance scaling algorithm. | ||
327 | |||
328 | Scaling governors are attached to policy objects and different policy objects | ||
329 | can be handled by different scaling governors at the same time (although that | ||
330 | may lead to suboptimal results in some cases). | ||
331 | |||
332 | The scaling governor for a given policy object can be changed at any time with | ||
333 | the help of the ``scaling_governor`` policy attribute in ``sysfs``. | ||
334 | |||
335 | Some governors expose ``sysfs`` attributes to control or fine-tune the scaling | ||
336 | algorithms implemented by them. Those attributes, referred to as governor | ||
337 | tunables, can be either global (system-wide) or per-policy, depending on the | ||
338 | scaling driver in use. If the driver requires governor tunables to be | ||
339 | per-policy, they are located in a subdirectory of each policy directory. | ||
340 | Otherwise, they are located in a subdirectory under | ||
341 | :file:`/sys/devices/system/cpu/cpufreq/`. In either case the name of the | ||
342 | subdirectory containing the governor tunables is the name of the governor | ||
343 | providing them. | ||
344 | |||
345 | ``performance`` | ||
346 | --------------- | ||
347 | |||
348 | When attached to a policy object, this governor causes the highest frequency, | ||
349 | within the ``scaling_max_freq`` policy limit, to be requested for that policy. | ||
350 | |||
351 | The request is made once at the time the governor for the policy is set to | ||
352 | ``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq`` | ||
353 | policy limits change after that. | ||
354 | |||
355 | ``powersave`` | ||
356 | ------------- | ||
357 | |||
358 | When attached to a policy object, this governor causes the lowest frequency, | ||
359 | within the ``scaling_min_freq`` policy limit, to be requested for that policy. | ||
360 | |||
361 | The request is made once at the time the governor for the policy is set to | ||
362 | ``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq`` | ||
363 | policy limits change after that. | ||
364 | |||
365 | ``userspace`` | ||
366 | ------------- | ||
367 | |||
368 | This governor does not do anything by itself. Instead, it allows user space | ||
369 | to set the CPU frequency for the policy it is attached to by writing to the | ||
370 | ``scaling_setspeed`` attribute of that policy. | ||
371 | |||
372 | ``schedutil`` | ||
373 | ------------- | ||
374 | |||
375 | This governor uses CPU utilization data available from the CPU scheduler. It | ||
376 | generally is regarded as a part of the CPU scheduler, so it can access the | ||
377 | scheduler's internal data structures directly. | ||
378 | |||
379 | It runs entirely in scheduler context, although in some cases it may need to | ||
380 | invoke the scaling driver asynchronously when it decides that the CPU frequency | ||
381 | should be changed for a given policy (that depends on whether or not the driver | ||
382 | is capable of changing the CPU frequency from scheduler context). | ||
383 | |||
384 | The actions of this governor for a particular CPU depend on the scheduling class | ||
385 | invoking its utilization update callback for that CPU. If it is invoked by the | ||
386 | RT or deadline scheduling classes, the governor will increase the frequency to | ||
387 | the allowed maximum (that is, the ``scaling_max_freq`` policy limit). In turn, | ||
388 | if it is invoked by the CFS scheduling class, the governor will use the | ||
389 | Per-Entity Load Tracking (PELT) metric for the root control group of the | ||
390 | given CPU as the CPU utilization estimate (see the `Per-entity load tracking`_ | ||
391 | LWN.net article for a description of the PELT mechanism). Then, the new | ||
392 | CPU frequency to apply is computed in accordance with the formula | ||
393 | |||
394 | f = 1.25 * ``f_0`` * ``util`` / ``max`` | ||
395 | |||
396 | where ``util`` is the PELT number, ``max`` is the theoretical maximum of | ||
397 | ``util``, and ``f_0`` is either the maximum possible CPU frequency for the given | ||
398 | policy (if the PELT number is frequency-invariant), or the current CPU frequency | ||
399 | (otherwise). | ||
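
The formula above can be written out as a short Python function. This is only a sketch of the arithmetic, not of the governor itself: the name ``schedutil_target`` is hypothetical, and the result is neither clamped to the policy limits nor mapped to a frequency the hardware actually supports.

```python
def schedutil_target(f_0, util, max_util):
    """Return 1.25 * f_0 * util / max_util.

    ``f_0`` is a frequency in kHz; ``util`` and ``max_util`` use the same
    (arbitrary) utilization scale.
    """
    if max_util <= 0:
        raise ValueError("max_util must be positive")
    return 1.25 * f_0 * util / max_util


# With f_0 = 2 GHz, a utilization of 80% of the maximum maps back to f_0,
# because 1.25 * 0.8 = 1.
print(schedutil_target(2_000_000, 800, 1000))   # 2000000.0
print(schedutil_target(2_000_000, 512, 1024))   # 1250000.0
```

The factor of 1.25 therefore leaves some headroom: the computed frequency reaches ``f_0`` before the utilization reaches its theoretical maximum.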
400 | |||
401 | This governor also employs a mechanism allowing it to temporarily bump up the | ||
402 | CPU frequency for tasks that have been waiting on I/O most recently, called | ||
403 | "IO-wait boosting". That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag | ||
404 | is passed by the scheduler to the governor callback which causes the frequency | ||
405 | to go up to the allowed maximum immediately and then draw back to the value | ||
406 | returned by the above formula over time. | ||
407 | |||
408 | This governor exposes only one tunable: | ||
409 | |||
410 | ``rate_limit_us`` | ||
411 | Minimum time (in microseconds) that has to pass between two consecutive | ||
412 | runs of governor computations (default: 1000 times the scaling driver's | ||
413 | transition latency). | ||
414 | |||
415 | The purpose of this tunable is to reduce the scheduler context overhead | ||
416 | of the governor which might be excessive without it. | ||
417 | |||
418 | This governor generally is regarded as a replacement for the older `ondemand`_ | ||
419 | and `conservative`_ governors (described below), as it is simpler and more | ||
420 | tightly integrated with the CPU scheduler, its overhead in terms of CPU context | ||
421 | switches and similar is less significant, and it uses the scheduler's own CPU | ||
422 | utilization metric, so in principle its decisions should not contradict the | ||
423 | decisions made by the other parts of the scheduler. | ||
424 | |||
425 | ``ondemand`` | ||
426 | ------------ | ||
427 | |||
428 | This governor uses CPU load as a CPU frequency selection metric. | ||
429 | |||
430 | In order to estimate the current CPU load, it measures the time elapsed between | ||
431 | consecutive invocations of its worker routine and computes the fraction of that | ||
432 | time in which the given CPU was not idle. The ratio of the non-idle (active) | ||
433 | time to the total CPU time is taken as an estimate of the load. | ||
434 | |||
435 | If this governor is attached to a policy shared by multiple CPUs, the load is | ||
436 | estimated for all of them and the greatest result is taken as the load estimate | ||
437 | for the entire policy. | ||
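
The load estimate described above amounts to the following computation, shown here as a Python sketch. The function names and the representation of a sample as an ``(elapsed_us, idle_us)`` pair are assumptions made for illustration; the kernel obtains these times from its own accounting.

```python
def cpu_load(elapsed_us, idle_us):
    """Fraction of the elapsed interval during which the CPU was not idle."""
    if elapsed_us <= 0:
        raise ValueError("elapsed_us must be positive")
    busy_us = max(elapsed_us - idle_us, 0)
    return busy_us / elapsed_us


def policy_load(samples):
    """Greatest per-CPU load among the CPUs of one policy.

    ``samples`` is an iterable of (elapsed_us, idle_us) pairs, one per CPU.
    """
    return max(cpu_load(elapsed, idle) for elapsed, idle in samples)


# Two CPUs sampled over 10 ms: one idle for 7.5 ms, the other for 2.5 ms.
print(policy_load([(10_000, 7_500), (10_000, 2_500)]))  # 0.75
```

Taking the maximum means that a single busy CPU is enough to keep the frequency of the whole policy up.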
438 | |||
439 | The worker routine of this governor has to run in process context, so it is | ||
440 | invoked asynchronously (via a workqueue) and CPU P-states are updated from | ||
441 | there if necessary. As a result, the scheduler context overhead from this | ||
442 | governor is minimal, but it causes additional CPU context switches to happen | ||
443 | relatively often and the CPU P-state updates triggered by it can be relatively | ||
444 | irregular. Also, it affects its own CPU load metric by running code that | ||
445 | reduces the CPU idle time (even though the CPU idle time is only reduced very | ||
446 | slightly by it). | ||
447 | |||
448 | It generally selects CPU frequencies proportional to the estimated load, so that | ||
449 | the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of | ||
450 | 1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute | ||
451 | corresponds to the load of 0, unless the load exceeds a (configurable) | ||
452 | speedup threshold, in which case it will go straight for the highest frequency | ||
453 | it is allowed to use (the ``scaling_max_freq`` policy limit). | ||
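
The selection rule in the preceding paragraph can be sketched in Python as a linear interpolation with a jump to the maximum above the threshold. The function name ``ondemand_target`` is hypothetical, the load is taken as a fraction between 0 and 1, and the threshold as a percentage. The sketch ignores the ``scaling_min_freq`` limit and any rounding to frequencies the hardware supports.

```python
def ondemand_target(load, up_threshold, cpuinfo_min, cpuinfo_max, scaling_max):
    """Pick a frequency (in kHz) for an estimated load between 0 and 1.

    Above ``up_threshold`` (in percent) go straight to ``scaling_max``;
    otherwise interpolate linearly between ``cpuinfo_min`` (load 0) and
    ``cpuinfo_max`` (load 1).
    """
    if load * 100 > up_threshold:
        return scaling_max
    return cpuinfo_min + load * (cpuinfo_max - cpuinfo_min)


# Hardware range 800 MHz to 2.8 GHz, policy limit 2.4 GHz, threshold 95%.
print(ondemand_target(0.5, 95, 800_000, 2_800_000, 2_400_000))   # 1800000.0
print(ondemand_target(0.97, 95, 800_000, 2_800_000, 2_400_000))  # 2400000
```

Note that the interpolation uses the ``cpuinfo_*`` hardware limits, while the shortcut above the threshold uses the ``scaling_max_freq`` policy limit, as stated in the text.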
454 | |||
455 | This governor exposes the following tunables: | ||
456 | |||
457 | ``sampling_rate`` | ||
458 | This is how often the governor's worker routine should run, in | ||
459 | microseconds. | ||
460 | |||
461 | Typically, it is set to values of the order of 10000 (10 ms). Its | ||
462 | default value is equal to the value of ``cpuinfo_transition_latency`` | ||
463 | for each policy this governor is attached to (but since the unit here | ||
464 | is greater by 1000, this means that the time represented by | ||
465 | ``sampling_rate`` is 1000 times greater than the transition latency by | ||
466 | default). | ||
467 | |||
468 | If this tunable is per-policy, the following shell command sets the time | ||
469 | represented by it to be 750 times as high as the transition latency:: | ||
470 | |||
471 | # echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate | ||
472 | |||
473 | |||
474 | ``min_sampling_rate`` | ||
475 | The minimum value of ``sampling_rate``. | ||
476 | |||
477 | Equal to 10000 (10 ms) if :c:macro:`CONFIG_NO_HZ_COMMON` and | ||
478 | :c:data:`tick_nohz_active` are both set or to 20 times the value of | ||
479 | :c:data:`jiffies` in microseconds otherwise. | ||
480 | |||
481 | ``up_threshold`` | ||
482 | If the estimated CPU load is above this value (in percent), the governor | ||
483 | will set the frequency to the maximum value allowed for the policy. | ||
484 | Otherwise, the selected frequency will be proportional to the estimated | ||
485 | CPU load. | ||
486 | |||
487 | ``ignore_nice_load`` | ||
488 | If set to 1 (default 0), it will cause the CPU load estimation code to | ||
489 | treat the CPU time spent on executing tasks with "nice" levels greater | ||
490 | than 0 as CPU idle time. | ||
491 | |||
492 | This may be useful if there are tasks in the system that should not be | ||
493 | taken into account when deciding what frequency to run the CPUs at. | ||
494 | Then, to make that happen it is sufficient to increase the "nice" level | ||
495 | of those tasks above 0 and set this attribute to 1. | ||
496 | |||
497 | ``sampling_down_factor`` | ||
498 | Temporary multiplier, between 1 (default) and 100 inclusive, to apply to | ||
499 | the ``sampling_rate`` value if the CPU load goes above ``up_threshold``. | ||
500 | |||
501 | This causes the next execution of the governor's worker routine (after | ||
502 | setting the frequency to the allowed maximum) to be delayed, so the | ||
503 | frequency stays at the maximum level for a longer time. | ||
504 | |||
505 | Frequency fluctuations in some bursty workloads may be avoided this way | ||
506 | at the cost of additional energy spent on maintaining the maximum CPU | ||
507 | capacity. | ||
508 | |||
509 | ``powersave_bias`` | ||
510 | Reduction factor to apply to the original frequency target of the | ||
511 | governor (including the maximum value used when the ``up_threshold`` | ||
512 | value is exceeded by the estimated CPU load) or sensitivity threshold | ||
513 | for the AMD frequency sensitivity powersave bias driver | ||
514 | (:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000 | ||
515 | inclusive. | ||
516 | |||
517 | If the AMD frequency sensitivity powersave bias driver is not loaded, | ||
518 | the effective frequency to apply is given by | ||
519 | |||
520 | f * (1 - ``powersave_bias`` / 1000) | ||
521 | |||
522 | where f is the governor's original frequency target. The default value | ||
523 | of this attribute is 0 in that case. | ||
524 | |||
525 | If the AMD frequency sensitivity powersave bias driver is loaded, the | ||
526 | value of this attribute is 400 by default and it is used in a different | ||
527 | way. | ||
528 | |||
529 | On Family 16h (and later) AMD processors there is a mechanism to get a | ||
530 | measured workload sensitivity, between 0 and 100% inclusive, from the | ||
531 | hardware. That value can be used to estimate how the performance of the | ||
532 | workload running on a CPU will change in response to frequency changes. | ||
533 | |||
534 | The performance of a workload with the sensitivity of 0 (memory-bound or | ||
535 | IO-bound) is not expected to increase at all as a result of increasing | ||
536 | the CPU frequency, whereas workloads with the sensitivity of 100% | ||
537 | (CPU-bound) are expected to perform much better if the CPU frequency is | ||
538 | increased. | ||
539 | |||
540 | If the workload sensitivity is less than the threshold represented by | ||
541 | the ``powersave_bias`` value, the sensitivity powersave bias driver | ||
542 | will cause the governor to select a frequency lower than its original | ||
543 | target, so as to avoid over-provisioning workloads that will not benefit | ||
544 | from running at higher CPU frequencies. | ||
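
As an illustration of how ``up_threshold`` and ``powersave_bias`` combine when the AMD frequency sensitivity driver is not loaded, the following shell sketch computes a frequency target from an estimated load. It is not the governor's code: the helper name ``ondemand_target`` and its arguments are hypothetical, and reading "proportional to the estimated CPU load" as linear interpolation between the policy limits is an assumption.

```shell
# Hypothetical helper (not part of the kernel or any tool):
#   ondemand_target LOAD UP_THRESHOLD MIN_KHZ MAX_KHZ POWERSAVE_BIAS
# LOAD and UP_THRESHOLD are percentages, frequencies are in kHz,
# POWERSAVE_BIAS is between 0 and 1000.
ondemand_target() (
    load=$1 up=$2 min=$3 max=$4 bias=$5

    if [ "$load" -gt "$up" ]; then
        # Load above up_threshold: go straight to the policy maximum.
        target=$max
    else
        # Assumption: "proportional" is taken to mean linear
        # interpolation between the policy limits.
        target=$(( min + load * (max - min) / 100 ))
    fi

    # f * (1 - powersave_bias / 1000), in integer arithmetic.
    echo $(( target - target * bias / 1000 ))
)
```

For example, ``ondemand_target 100 95 500000 1000000 100`` prints 900000, that is, a 1000 MHz target reduced by 10%. The sketch does not round the result to a frequency supported by the hardware.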
545 | |||
546 | ``conservative`` | ||
547 | ---------------- | ||
548 | |||
549 | This governor uses CPU load as a CPU frequency selection metric. | ||
550 | |||
551 | It estimates the CPU load in the same way as the `ondemand`_ governor described | ||
552 | above, but the CPU frequency selection algorithm implemented by it is different. | ||
553 | |||
554 | Namely, it avoids changing the frequency significantly over short time intervals | ||
555 | which may not be suitable for systems with limited power supply capacity (e.g. | ||
556 | battery-powered). To achieve that, it changes the frequency in relatively | ||
557 | small steps, one step at a time, up or down - depending on whether or not a | ||
558 | (configurable) threshold has been exceeded by the estimated CPU load. | ||
559 | |||
560 | This governor exposes the following tunables: | ||
561 | |||
562 | ``freq_step`` | ||
563 | Frequency step in percent of the maximum frequency the governor is | ||
564 | allowed to set (the ``scaling_max_freq`` policy limit), between 0 and | ||
565 | 100 (5 by default). | ||
566 | |||
567 | This is how much the frequency is allowed to change in one go. Setting | ||
568 | it to 0 will cause the default frequency step (5 percent) to be used | ||
569 | and setting it to 100 effectively causes the governor to periodically | ||
570 | switch the frequency between the ``scaling_min_freq`` and | ||
571 | ``scaling_max_freq`` policy limits. | ||
572 | |||
573 | ``down_threshold`` | ||
574 | Threshold value (in percent, 20 by default) used to determine the | ||
575 | frequency change direction. | ||
576 | |||
577 | If the estimated CPU load is less than this value (and the | ||
578 | ``sampling_down_factor`` mechanism is not in effect), the frequency will | ||
579 | go down (by ``freq_step``). If the load is greater than the governor's | ||
580 | ``up_threshold`` value, the frequency will go up. Otherwise, the frequency will not be changed. | ||
581 | |||
582 | ``sampling_down_factor`` | ||
583 | Frequency decrease deferral factor, between 1 (default) and 10 | ||
584 | inclusive. | ||
585 | |||
586 | It effectively causes the frequency to go down ``sampling_down_factor`` | ||
587 | times slower than it ramps up. | ||
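
The stepping behavior described for ``freq_step`` can be sketched in shell as follows. This is only an illustration under stated assumptions: the helper name ``conservative_next`` and its arguments are hypothetical, the change direction is passed in by the caller instead of being derived from the load, and the ``sampling_down_factor`` deferral is not modeled.

```shell
# Hypothetical helper (not part of the kernel or any tool):
#   conservative_next CUR_KHZ DIRECTION FREQ_STEP MIN_KHZ MAX_KHZ
# DIRECTION is "up", "down" or anything else for "no change";
# FREQ_STEP is a percentage of MAX_KHZ (scaling_max_freq).
conservative_next() (
    cur=$1 dir=$2 step_pct=$3 min=$4 max=$5

    # A freq_step of 0 selects the default step of 5 percent.
    if [ "$step_pct" -eq 0 ]; then
        step_pct=5
    fi
    step=$(( max * step_pct / 100 ))

    case $dir in
        up)   next=$(( cur + step )) ;;
        down) next=$(( cur - step )) ;;
        *)    next=$cur ;;
    esac

    # Stay within the scaling_min_freq/scaling_max_freq policy limits.
    if [ "$next" -gt "$max" ]; then next=$max; fi
    if [ "$next" -lt "$min" ]; then next=$min; fi
    echo "$next"
)
```

With limits of 800000 and 2000000 kHz, a ``freq_step`` of 5 gives steps of 100000 kHz, while a ``freq_step`` of 100 moves the frequency from one policy limit to the other in a single step.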
588 | |||
589 | |||
590 | Frequency Boost Support | ||
591 | ======================= | ||
592 | |||
593 | Background | ||
594 | ---------- | ||
595 | |||
596 | Some processors support a mechanism to raise the operating frequency of some | ||
597 | cores in a multicore package temporarily (and above the sustainable frequency | ||
598 | threshold for the whole package) under certain conditions, for example if the | ||
599 | whole chip is not fully utilized and below its intended thermal or power budget. | ||
600 | |||
601 | Different names are used by different vendors to refer to this functionality. | ||
602 | For Intel processors it is referred to as "Turbo Boost", AMD calls it | ||
603 | "Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on. | ||
604 | As a rule, it also is implemented differently by different vendors. The simple | ||
605 | term "frequency boost" is used here for brevity to refer to all of those | ||
606 | implementations. | ||
607 | |||
608 | The frequency boost mechanism may be either hardware-based or software-based. | ||
609 | If it is hardware-based (e.g. on x86), the decision to trigger the boosting is | ||
610 | made by the hardware (although in general it requires the hardware to be put | ||
611 | into a special state in which it can control the CPU frequency within certain | ||
612 | limits). If it is software-based (e.g. on ARM), the scaling driver decides | ||
613 | whether or not to trigger boosting and when to do that. | ||
614 | |||
615 | The ``boost`` File in ``sysfs`` | ||
616 | ------------------------------- | ||
617 | |||
618 | This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls | ||
619 | the "boost" setting for the whole system. It is not present if the underlying | ||
620 | scaling driver does not support the frequency boost mechanism (or supports it, | ||
621 | but provides a driver-specific interface for controlling it, like | ||
622 | ``intel_pstate``). | ||
623 | |||
624 | If the value in this file is 1, the frequency boost mechanism is enabled. This | ||
625 | means that either the hardware can be put into states in which it is able to | ||
626 | trigger boosting (in the hardware-based case), or the software is allowed to | ||
627 | trigger boosting (in the software-based case). It does not mean that boosting | ||
628 | is actually in use at the moment on any CPUs in the system. It only means a | ||
629 | permission to use the frequency boost mechanism (which still may never be used | ||
630 | for other reasons). | ||
631 | |||
632 | If the value in this file is 0, the frequency boost mechanism is disabled and | ||
633 | cannot be used at all. | ||
634 | |||
635 | The only values that can be written to this file are 0 and 1. | ||
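
A small wrapper around this file might look like the sketch below. The function name ``set_boost`` and the ``BOOST_FILE`` override are hypothetical and exist only for illustration; the path itself is the one given above. Writing to the real file requires root privileges.

```shell
# Assumption: BOOST_FILE may be overridden (for example, for testing);
# by default it points at the global boost knob described above.
BOOST_FILE=${BOOST_FILE:-/sys/devices/system/cpu/cpufreq/boost}

# Hypothetical helper: set_boost 0|1
set_boost() {
    # Only 0 and 1 can be written to the boost file.
    case $1 in
        0|1) ;;
        *) echo "boost value must be 0 or 1" >&2; return 1 ;;
    esac

    # The file is absent if the scaling driver does not support boost
    # or provides its own interface for it (like intel_pstate).
    if [ ! -e "$BOOST_FILE" ]; then
        echo "no global boost file at $BOOST_FILE" >&2
        return 1
    fi

    echo "$1" > "$BOOST_FILE"
}
```

Running ``set_boost 0`` before a benchmark and ``set_boost 1`` afterwards toggles the permission to boost without restarting the system. As noted above, a value of 1 only permits boosting and does not mean that any CPU is boosted at the moment.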
636 | |||
637 | Rationale for Boost Control Knob | ||
638 | -------------------------------- | ||
639 | |||
640 | The frequency boost mechanism is generally intended to help to achieve optimum | ||
641 | CPU performance on time scales below software resolution (e.g. below the | ||
642 | scheduler tick interval) and it is demonstrably suitable for many workloads, but | ||
643 | it may lead to problems in certain situations. | ||
644 | |||
645 | For this reason, many systems make it possible to disable the frequency boost | ||
646 | mechanism in the platform firmware (BIOS) setup, but that requires the system to | ||
647 | be restarted for the setting to be adjusted as desired, which may not be | ||
648 | practical at least in some cases. For example: | ||
649 | |||
650 | 1. Boosting means overclocking the processor, although under controlled | ||
651 | conditions. Generally, the processor's energy consumption increases | ||
652 | as a result of increasing its frequency and voltage, even temporarily. | ||
653 | That may not be desirable on systems that switch to power sources of | ||
654 | limited capacity, such as batteries, so the ability to disable the boost | ||
655 | mechanism while the system is running may help there (but that depends on | ||
656 | the workload too). | ||
657 | |||
658 | 2. In some situations deterministic behavior is more important than | ||
659 | performance or energy consumption (or both) and the ability to disable | ||
660 | boosting while the system is running may be useful then. | ||
661 | |||
662 | 3. To examine the impact of the frequency boost mechanism itself, it is useful | ||
663 | to be able to run tests with and without boosting, preferably without | ||
664 | restarting the system in the meantime. | ||
665 | |||
666 | 4. Reproducible results are important when running benchmarks. Since | ||
667 | the boosting functionality depends on the load of the whole package, | ||
668 | single-thread performance may vary because of it which may lead to | ||
669 | unreproducible results sometimes. That can be avoided by disabling the | ||
670 | frequency boost mechanism before running benchmarks sensitive to that | ||
671 | issue. | ||
672 | |||
673 | Legacy AMD ``cpb`` Knob | ||
674 | ----------------------- | ||
675 | |||
676 | The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to | ||
677 | the global ``boost`` one. It is used for disabling/enabling the "Core | ||
678 | Performance Boost" feature of some AMD processors. | ||
679 | |||
680 | If present, that knob is located in every ``CPUFreq`` policy directory in | ||
681 | ``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called | ||
682 | ``cpb``, which indicates a more fine grained control interface. The actual | ||
683 | implementation, however, works on the system-wide basis and setting that knob | ||
684 | for one policy causes the same value of it to be set for all of the other | ||
685 | policies at the same time. | ||
686 | |||
687 | That knob is still supported on AMD processors that support its underlying | ||
688 | hardware feature, but it may be configured out of the kernel (via the | ||
689 | :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global | ||
690 | ``boost`` knob is present regardless. Thus it is always possible to use the | ||
691 | ``boost`` knob instead of the ``cpb`` one which is highly recommended, as that | ||
692 | is more consistent with what all of the other systems do (and the ``cpb`` knob | ||
693 | may not be supported any more in the future). | ||
694 | |||
695 | The ``cpb`` knob is never present for any processors without the underlying | ||
696 | hardware feature (e.g. all Intel ones), even if the | ||
697 | :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set. | ||
698 | |||
699 | |||
700 | .. _Per-entity load tracking: https://lwn.net/Articles/531853/ | ||
diff --git a/Documentation/admin-guide/pm/index.rst b/Documentation/admin-guide/pm/index.rst new file mode 100644 index 000000000000..c80f087321fc --- /dev/null +++ b/Documentation/admin-guide/pm/index.rst | |||
@@ -0,0 +1,15 @@ | |||
1 | ================ | ||
2 | Power Management | ||
3 | ================ | ||
4 | |||
5 | .. toctree:: | ||
6 | :maxdepth: 2 | ||
7 | |||
8 | cpufreq | ||
9 | |||
10 | .. only:: subproject and html | ||
11 | |||
12 | Indices | ||
13 | ======= | ||
14 | |||
15 | * :ref:`genindex` | ||
diff --git a/Documentation/cpu-freq/boost.txt b/Documentation/cpu-freq/boost.txt deleted file mode 100644 index dd62e1334f0a..000000000000 --- a/Documentation/cpu-freq/boost.txt +++ /dev/null | |||
@@ -1,93 +0,0 @@ | |||
1 | Processor boosting control | ||
2 | |||
3 | - information for users - | ||
4 | |||
5 | Quick guide for the impatient: | ||
6 | -------------------- | ||
7 | /sys/devices/system/cpu/cpufreq/boost | ||
8 | controls the boost setting for the whole system. You can read and write | ||
9 | that file with either "0" (boosting disabled) or "1" (boosting allowed). | ||
10 | Reading or writing 1 does not mean that the system is boosting at this | ||
11 | very moment, but only that the CPU _may_ raise the frequency at its | ||
12 | discretion. | ||
13 | -------------------- | ||
14 | |||
15 | Introduction | ||
16 | ------------- | ||
17 | Some CPUs support a functionality to raise the operating frequency of | ||
18 | some cores in a multi-core package if certain conditions apply, mostly | ||
19 | if the whole chip is not fully utilized and below its intended thermal | ||
20 | budget. The decision about boost disable/enable is made either at hardware | ||
21 | (e.g. x86) or software (e.g. ARM). | ||
22 | On Intel CPUs this is called "Turbo Boost", AMD calls it "Turbo-Core", | ||
23 | in technical documentation "Core performance boost". In Linux we use | ||
24 | the term "boost" for convenience. | ||
25 | |||
26 | Rationale for disable switch | ||
27 | ---------------------------- | ||
28 | |||
29 | Though the idea is to just give better performance without any user | ||
30 | intervention, sometimes the need arises to disable this functionality. | ||
31 | Most systems offer a switch in the (BIOS) firmware to disable the | ||
32 | functionality at all, but a more fine-grained and dynamic control would | ||
33 | be desirable: | ||
34 | 1. While running benchmarks, reproducible results are important. Since | ||
35 | the boosting functionality depends on the load of the whole package, | ||
36 | single thread performance can vary. By explicitly disabling the boost | ||
37 | functionality at least for the benchmark's run-time the system will run | ||
38 | at a fixed frequency and results are reproducible again. | ||
39 | 2. To examine the impact of the boosting functionality it is helpful | ||
40 | to do tests with and without boosting. | ||
41 | 3. Boosting means overclocking the processor, though under controlled | ||
42 | conditions. By raising the frequency and the voltage the processor | ||
43 | will consume more power than without the boosting, which may be | ||
44 | undesirable for instance for mobile users. Disabling boosting may | ||
45 | save power here, though this depends on the workload. | ||
46 | |||
47 | |||
48 | User controlled switch | ||
49 | ---------------------- | ||
50 | |||
51 | To allow the user to toggle the boosting functionality, the cpufreq core | ||
52 | driver exports a sysfs knob to enable or disable it. There is a file: | ||
53 | /sys/devices/system/cpu/cpufreq/boost | ||
54 | which can either read "0" (boosting disabled) or "1" (boosting enabled). | ||
55 | The file is exported only when cpufreq driver supports boosting. | ||
56 | Explicitly changing the permissions and writing to that file anyway will | ||
57 | return EINVAL. | ||
58 | |||
59 | On supported CPUs one can write either a "0" or a "1" into this file. | ||
60 | This will either disable the boost functionality on all cores in the | ||
61 | whole system (0) or will allow the software or hardware to boost at will | ||
62 | (1). | ||
63 | |||
64 | Writing a "1" does not explicitly boost the system, but just allows the | ||
65 | CPU to boost at their discretion. Some implementations take external | ||
66 | factors like the chip's temperature into account, so boosting once does | ||
67 | not necessarily mean that it will occur every time even using the exact | ||
68 | same software setup. | ||
69 | |||
70 | |||
71 | AMD legacy cpb switch | ||
72 | --------------------- | ||
73 | The AMD powernow-k8 driver used to support a very similar switch to | ||
74 | disable or enable the "Core Performance Boost" feature of some AMD CPUs. | ||
75 | This switch was instantiated in each CPU's cpufreq directory | ||
76 | (/sys/devices/system/cpu[0-9]*/cpufreq) and was called "cpb". | ||
77 | Though the per CPU existence hints at a more fine grained control, the | ||
78 | actual implementation only supported a system-global switch semantics, | ||
79 | which was simply reflected into each CPU's file. Writing a 0 or 1 into it | ||
80 | would pull the other CPUs to the same state. | ||
81 | For compatibility reasons this file and its behavior is still supported | ||
82 | on AMD CPUs, though it is now protected by a config switch | ||
83 | (X86_ACPI_CPUFREQ_CPB). On Intel CPUs this file will never be created, | ||
84 | even with the config option set. | ||
85 | This functionality is considered legacy and will be removed in some future | ||
86 | kernel version. | ||
87 | |||
88 | More fine grained boosting control | ||
89 | ---------------------------------- | ||
90 | |||
91 | Technically it is possible to switch the boosting functionality at least | ||
92 | on a per package basis, for some CPUs even per core. Currently the driver | ||
93 | does not support it, but this may be implemented in the future. | ||
diff --git a/Documentation/cpu-freq/governors.txt b/Documentation/cpu-freq/governors.txt deleted file mode 100644 index 61b3184b6c24..000000000000 --- a/Documentation/cpu-freq/governors.txt +++ /dev/null | |||
@@ -1,301 +0,0 @@ | |||
1 | CPU frequency and voltage scaling code in the Linux(TM) kernel | ||
2 | |||
3 | |||
4 | L i n u x C P U F r e q | ||
5 | |||
6 | C P U F r e q G o v e r n o r s | ||
7 | |||
8 | - information for users and developers - | ||
9 | |||
10 | |||
11 | Dominik Brodowski <linux@brodo.de> | ||
12 | some additions and corrections by Nico Golde <nico@ngolde.de> | ||
13 | Rafael J. Wysocki <rafael.j.wysocki@intel.com> | ||
14 | Viresh Kumar <viresh.kumar@linaro.org> | ||
15 | |||
16 | |||
17 | |||
18 | Clock scaling allows you to change the clock speed of the CPUs on the | ||
19 | fly. This is a nice method to save battery power, because the lower | ||
20 | the clock speed, the less power the CPU consumes. | ||
21 | |||
22 | |||
23 | Contents: | ||
24 | --------- | ||
25 | 1. What is a CPUFreq Governor? | ||
26 | |||
27 | 2. Governors In the Linux Kernel | ||
28 | 2.1 Performance | ||
29 | 2.2 Powersave | ||
30 | 2.3 Userspace | ||
31 | 2.4 Ondemand | ||
32 | 2.5 Conservative | ||
33 | 2.6 Schedutil | ||
34 | |||
35 | 3. The Governor Interface in the CPUfreq Core | ||
36 | |||
37 | 4. References | ||
38 | |||
39 | |||
40 | 1. What Is A CPUFreq Governor? | ||
41 | ============================== | ||
42 | |||
43 | Most cpufreq drivers (except the intel_pstate and longrun) or even most | ||
44 | cpu frequency scaling algorithms only allow the CPU frequency to be set | ||
45 | to predefined fixed values. In order to offer dynamic frequency | ||
46 | scaling, the cpufreq core must be able to tell these drivers of a | ||
47 | "target frequency". So these specific drivers will be transformed to | ||
48 | offer a "->target/target_index/fast_switch()" call instead of the | ||
49 | "->setpolicy()" call. For set_policy drivers, all stays the same, | ||
50 | though. | ||
51 | |||
52 | How to decide what frequency within the CPUfreq policy should be used? | ||
53 | That's done using "cpufreq governors". | ||
54 | |||
55 | Basically, it's the following flow graph: | ||
56 | |||
57 | CPU can be set to switch independently | CPU can only be set | ||
58 | within specific "limits" | to specific frequencies | ||
59 | |||
60 | "CPUfreq policy" | ||
61 | consists of frequency limits (policy->{min,max}) | ||
62 | and CPUfreq governor to be used | ||
63 | / \ | ||
64 | / \ | ||
65 | / the cpufreq governor decides | ||
66 | / (dynamically or statically) | ||
67 | / what target_freq to set within | ||
68 | / the limits of policy->{min,max} | ||
69 | / \ | ||
70 | / \ | ||
71 | Using the ->setpolicy call, Using the ->target/target_index/fast_switch call, | ||
72 | the limits and the the frequency closest | ||
73 | "policy" is set. to target_freq is set. | ||
74 | It is assured that it | ||
75 | is within policy->{min,max} | ||
76 | |||
77 | |||
78 | 2. Governors In the Linux Kernel | ||
79 | ================================ | ||
80 | |||
81 | 2.1 Performance | ||
82 | --------------- | ||
83 | |||
84 | The CPUfreq governor "performance" sets the CPU statically to the | ||
85 | highest frequency within the borders of scaling_min_freq and | ||
86 | scaling_max_freq. | ||
87 | |||
88 | |||
89 | 2.2 Powersave | ||
90 | ------------- | ||
91 | |||
92 | The CPUfreq governor "powersave" sets the CPU statically to the | ||
93 | lowest frequency within the borders of scaling_min_freq and | ||
94 | scaling_max_freq. | ||
95 | |||
96 | |||
97 | 2.3 Userspace | ||
98 | ------------- | ||
99 | |||
100 | The CPUfreq governor "userspace" allows the user, or any userspace | ||
101 | program running with UID "root", to set the CPU to a specific frequency | ||
102 | by making a sysfs file "scaling_setspeed" available in the CPU-device | ||
103 | directory. | ||
104 | |||
105 | |||
106 | 2.4 Ondemand | ||
107 | ------------ | ||
108 | |||
109 | The CPUfreq governor "ondemand" sets the CPU frequency depending on the | ||
110 | current system load. Load estimation is triggered by the scheduler | ||
111 | through the update_util_data->func hook; when triggered, cpufreq checks | ||
112 | the CPU-usage statistics over the last period and the governor sets the | ||
113 | CPU accordingly. The CPU must have the capability to switch the | ||
114 | frequency very quickly. | ||
115 | |||
116 | Sysfs files: | ||
117 | |||
118 | * sampling_rate: | ||
119 | |||
120 | Measured in uS (10^-6 seconds), this is how often you want the kernel | ||
121 | to look at the CPU usage and to make decisions on what to do about the | ||
122 | frequency. Typically this is set to values of around '10000' or more. | ||
123 | Its default value is (cmp. with users-guide.txt): transition_latency | ||
124 | * 1000. Be aware that transition latency is in ns and sampling_rate | ||
125 | is in us, so you get the same sysfs value by default. Sampling rate | ||
126 | should always get adjusted considering the transition latency to set | ||
127 | the sampling rate 750 times as high as the transition latency in the | ||
128 | bash (as said, 1000 is default), do: | ||
129 | |||
130 | $ echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate | ||
131 | |||
132 | * sampling_rate_min: | ||
133 | |||
134 | The sampling rate is limited by the HW transition latency: | ||
135 | transition_latency * 100 | ||
136 | |||
137 | Or by kernel restrictions: | ||
138 | - If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed. | ||
139 | - If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is | ||
140 | used, the limits depend on the CONFIG_HZ option: | ||
141 | HZ=1000: min=20000us (20ms) | ||
142 | HZ=250: min=80000us (80ms) | ||
143 | HZ=100: min=200000us (200ms) | ||
144 | |||
145 | The highest value of kernel and HW latency restrictions is shown and | ||
146 | used as the minimum sampling rate. | ||
147 | |||
148 | * up_threshold: | ||
149 | |||
150 | This defines what the average CPU usage between the samplings of | ||
151 | 'sampling_rate' needs to be for the kernel to make a decision on | ||
152 | whether it should increase the frequency. For example when it is set | ||
153 | to its default value of '95' it means that between the checking | ||
154 | intervals the CPU needs to be on average more than 95% in use to then | ||
155 | decide that the CPU frequency needs to be increased. | ||
156 | |||
157 | * ignore_nice_load: | ||
158 | |||
159 | This parameter takes a value of '0' or '1'. When set to '0' (its | ||
160 | default), all processes are counted towards the 'cpu utilisation' | ||
161 | value. When set to '1', the processes that are run with a 'nice' | ||
162 | value will not count (and thus be ignored) in the overall usage | ||
163 | calculation. This is useful if you are running a CPU intensive | ||
164 | calculation on your laptop that you do not care how long it takes to | ||
165 | complete as you can 'nice' it and prevent it from taking part in the | ||
166 | deciding process of whether to increase your CPU frequency. | ||
167 | |||
168 | * sampling_down_factor: | ||
169 | |||
170 | This parameter controls the rate at which the kernel makes a decision | ||
171 | on when to decrease the frequency while running at top speed. When set | ||
172 | to 1 (the default) decisions to reevaluate load are made at the same | ||
173 | interval regardless of current clock speed. But when set to greater | ||
174 | than 1 (e.g. 100) it acts as a multiplier for the scheduling interval | ||
175 | for reevaluating load when the CPU is at its top speed due to high | ||
176 | load. This improves performance by reducing the overhead of load | ||
177 | evaluation and helping the CPU stay at its top speed when truly busy, | ||
178 | rather than shifting back and forth in speed. This tunable has no | ||
179 | effect on behavior at lower speeds/lower CPU loads. | ||
180 | |||
181 | * powersave_bias: | ||
182 | |||
183 | This parameter takes a value between 0 to 1000. It defines the | ||
184 | percentage (times 10) value of the target frequency that will be | ||
185 | shaved off of the target. For example, when set to 100 -- 10%, when | ||
186 | ondemand governor would have targeted 1000 MHz, it will target | ||
187 | 1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is set to 0 | ||
188 | (disabled) by default. | ||
189 | |||
190 | When AMD frequency sensitivity powersave bias driver -- | ||
191 | drivers/cpufreq/amd_freq_sensitivity.c is loaded, this parameter | ||
192 | defines the workload frequency sensitivity threshold in which a lower | ||
193 | frequency is chosen instead of ondemand governor's original target. | ||
194 | The frequency sensitivity is a hardware reported (on AMD Family 16h | ||
195 | Processors and above) value between 0 to 100% that tells software how | ||
196 | the performance of the workload running on a CPU will change when | ||
197 | frequency changes. A workload with sensitivity of 0% (memory/IO-bound) | ||
198 | will not perform any better on higher core frequency, whereas a | ||
199 | workload with sensitivity of 100% (CPU-bound) will perform better | ||
200 | higher the frequency. When the driver is loaded, this is set to 400 by | ||
201 | default -- for CPUs running workloads with sensitivity value below | ||
202 | 40%, a lower frequency is chosen. Unloading the driver or writing 0 | ||
203 | will disable this feature. | ||
204 | |||
205 | |||
206 | 2.5 Conservative | ||
207 | ---------------- | ||
208 | |||
209 | The CPUfreq governor "conservative", much like the "ondemand" | ||
210 | governor, sets the CPU frequency depending on the current usage. It | ||
211 | differs in behaviour in that it gracefully increases and decreases the | ||
212 | CPU speed rather than jumping to max speed the moment there is any load | ||
213 | on the CPU. This behaviour is more suitable in a battery powered | ||
214 | environment. The governor is tweaked in the same manner as the | ||
215 | "ondemand" governor through sysfs with the addition of: | ||
216 | |||
217 | * freq_step: | ||
218 | |||
219 | This describes what percentage steps the cpu freq should be increased | ||
220 | and decreased smoothly by. By default the cpu frequency will increase | ||
221 | in 5% chunks of your maximum cpu frequency. You can change this value | ||
222 | to anywhere between 0 and 100 where '0' will effectively lock your CPU | ||
223 | at a speed regardless of its load whilst '100' will, in theory, make | ||
224 | it behave identically to the "ondemand" governor. | ||
225 | |||
226 | * down_threshold: | ||
227 | |||
228 | Same as the 'up_threshold' found for the "ondemand" governor but for | ||
229 | the opposite direction. For example when set to its default value of | ||
230 | '20' it means that the CPU usage needs to be below 20% between | ||
231 | samples to have the frequency decreased. | ||
232 | |||
233 | * sampling_down_factor: | ||
234 | |||
235 | Similar functionality as in "ondemand" governor. But in | ||
236 | "conservative", it controls the rate at which the kernel makes a | ||
237 | decision on when to decrease the frequency while running in any speed. | ||
238 | Load for frequency increase is still evaluated every sampling rate. | ||
239 | |||
240 | |||
241 | 2.6 Schedutil | ||
242 | ------------- | ||
243 | |||
244 | The "schedutil" governor aims at better integration with the Linux | ||
245 | kernel scheduler. Load estimation is achieved through the scheduler's | ||
246 | Per-Entity Load Tracking (PELT) mechanism, which also provides | ||
247 | information about the recent load [1]. This governor currently does | ||
248 | load based DVFS only for tasks managed by CFS. RT and DL scheduler tasks | ||
249 | are always run at the highest frequency. Unlike all the other | ||
250 | governors, the code is located under the kernel/sched/ directory. | ||
251 | |||
252 | Sysfs files: | ||
253 | |||
254 | * rate_limit_us: | ||
255 | |||
256 | This contains a value in microseconds. The governor waits for | ||
257 | rate_limit_us time before reevaluating the load again, after it has | ||
258 | evaluated the load once. | ||
259 | |||
260 | For an in-depth comparison with the other governors refer to [2]. | ||
261 | |||
262 | |||
263 | 3. The Governor Interface in the CPUfreq Core | ||
264 | ============================================= | ||
265 | |||
266 | A new governor must register itself with the CPUfreq core using | ||
267 | "cpufreq_register_governor". The struct cpufreq_governor, which has to | ||
268 | be passed to that function, must contain the following values: | ||
269 | |||
270 | governor->name - A unique name for this governor. | ||
271 | governor->owner - .THIS_MODULE for the governor module (if appropriate). | ||
272 | |||
273 | plus a set of hooks to the functions implementing the governor's logic. | ||
274 | |||
275 | The CPUfreq governor may call the CPU processor driver using one of | ||
276 | these two functions: | ||
277 | |||
278 | int cpufreq_driver_target(struct cpufreq_policy *policy, | ||
279 | unsigned int target_freq, | ||
280 | unsigned int relation); | ||
281 | |||
282 | int __cpufreq_driver_target(struct cpufreq_policy *policy, | ||
283 | unsigned int target_freq, | ||
284 | unsigned int relation); | ||
285 | |||
286 | target_freq must be within policy->min and policy->max, of course. | ||
287 | What's the difference between these two functions? When your governor is | ||
288 | in a direct code path of a call to governor callbacks, like | ||
289 | governor->start(), the policy->rwsem is still held in the cpufreq core, | ||
290 | and there's no need to lock it again (in fact, this would cause a | ||
291 | deadlock). So use __cpufreq_driver_target only in these cases. In all | ||
292 | other cases (for example, when there's a "daemonized" function that | ||
293 | wakes up every second), use cpufreq_driver_target to take policy->rwsem | ||
294 | before the command is passed to the cpufreq driver. | ||
295 | |||
296 | 4. References | ||
297 | ============= | ||
298 | |||
299 | [1] Per-entity load tracking: https://lwn.net/Articles/531853/ | ||
300 | [2] Improvements in CPU frequency management: https://lwn.net/Articles/682391/ | ||
301 | |||
diff --git a/Documentation/cpu-freq/index.txt b/Documentation/cpu-freq/index.txt index ef1d39247b05..03a7cee6ac73 100644 --- a/Documentation/cpu-freq/index.txt +++ b/Documentation/cpu-freq/index.txt | |||
@@ -21,8 +21,6 @@ Documents in this directory: | |||
21 | 21 | ||
22 | amd-powernow.txt - AMD powernow driver specific file. | 22 | amd-powernow.txt - AMD powernow driver specific file. |
23 | 23 | ||
24 | boost.txt - Frequency boosting support. | ||
25 | |||
26 | core.txt - General description of the CPUFreq core and | 24 | core.txt - General description of the CPUFreq core and |
27 | of CPUFreq notifiers. | 25 | of CPUFreq notifiers. |
28 | 26 | ||
@@ -32,17 +30,12 @@ cpufreq-nforce2.txt - nVidia nForce2 platform specific file. | |||
32 | 30 | ||
33 | cpufreq-stats.txt - General description of sysfs cpufreq stats. | 31 | cpufreq-stats.txt - General description of sysfs cpufreq stats. |
34 | 32 | ||
35 | governors.txt - What are cpufreq governors and how to | ||
36 | implement them? | ||
37 | |||
38 | index.txt - File index, Mailing list and Links (this document) | 33 | index.txt - File index, Mailing list and Links (this document) |
39 | 34 | ||
40 | intel-pstate.txt - Intel pstate cpufreq driver specific file. | 35 | intel-pstate.txt - Intel pstate cpufreq driver specific file. |
41 | 36 | ||
42 | pcc-cpufreq.txt - PCC cpufreq driver specific file. | 37 | pcc-cpufreq.txt - PCC cpufreq driver specific file. |
43 | 38 | ||
44 | user-guide.txt - User Guide to CPUFreq | ||
45 | |||
46 | 39 | ||
47 | Mailing List | 40 | Mailing List |
48 | ------------ | 41 | ------------ |
diff --git a/Documentation/cpu-freq/user-guide.txt b/Documentation/cpu-freq/user-guide.txt deleted file mode 100644 index 391da64e9492..000000000000 --- a/Documentation/cpu-freq/user-guide.txt +++ /dev/null | |||
@@ -1,228 +0,0 @@ | |||
1 | CPU frequency and voltage scaling code in the Linux(TM) kernel | ||
2 | |||
3 | |||
4 | L i n u x C P U F r e q | ||
5 | |||
6 | U S E R G U I D E | ||
7 | |||
8 | |||
9 | Dominik Brodowski <linux@brodo.de> | ||
10 | |||
11 | |||
12 | |||
13 | Clock scaling allows you to change the clock speed of the CPUs on the | ||
14 | fly. This is a nice method to save battery power, because the lower | ||
15 | the clock speed, the less power the CPU consumes. | ||
16 | |||
17 | |||
18 | Contents: | ||
19 | --------- | ||
20 | 1. Supported Architectures and Processors | ||
21 | 1.1 ARM and ARM64 | ||
22 | 1.2 x86 | ||
23 | 1.3 sparc64 | ||
24 | 1.4 ppc | ||
25 | 1.5 SuperH | ||
26 | 1.6 Blackfin | ||
27 | |||
28 | 2. "Policy" / "Governor"? | ||
29 | 2.1 Policy | ||
30 | 2.2 Governor | ||
31 | |||
32 | 3. How to change the CPU cpufreq policy and/or speed | ||
33 | 3.1 Preferred interface: sysfs | ||
34 | |||
35 | |||
36 | |||
37 | 1. Supported Architectures and Processors | ||
38 | ========================================= | ||
39 | |||
40 | 1.1 ARM and ARM64 | ||
41 | ----------------- | ||
42 | |||
43 | Almost all ARM and ARM64 platforms support CPU frequency scaling. | ||
44 | |||
45 | 1.2 x86 | ||
46 | ------- | ||
47 | |||
48 | The following processors for the x86 architecture are supported by cpufreq: | ||
49 | |||
50 | AMD Elan - SC400, SC410 | ||
51 | AMD mobile K6-2+ | ||
52 | AMD mobile K6-3+ | ||
53 | AMD mobile Duron | ||
54 | AMD mobile Athlon | ||
55 | AMD Opteron | ||
56 | AMD Athlon 64 | ||
57 | Cyrix Media GXm | ||
58 | Intel mobile PIII and Intel mobile PIII-M on certain chipsets | ||
59 | Intel Pentium 4, Intel Xeon | ||
60 | Intel Pentium M (Centrino) | ||
61 | National Semiconductor Geode GX | ||
62 | Transmeta Crusoe | ||
63 | Transmeta Efficeon | ||
64 | VIA Cyrix 3 / C3 | ||
65 | various processors on some ACPI 2.0-compatible systems [*] | ||
66 | And many more | ||
67 | |||
68 | [*] Only if "ACPI Processor Performance States" are available | ||
69 | to the ACPI<->BIOS interface. | ||
70 | |||
71 | |||
72 | 1.3 sparc64 | ||
73 | ----------- | ||
74 | |||
75 | The following processors for the sparc64 architecture are supported by | ||
76 | cpufreq: | ||
77 | |||
78 | UltraSPARC-III | ||
79 | |||
80 | |||
81 | 1.4 ppc | ||
82 | ------- | ||
83 | |||
84 | Several "PowerBook" and "iBook2" notebooks are supported. | ||
85 | The following POWER processors are supported in powernv mode: | ||
86 | POWER8 | ||
87 | POWER9 | ||
88 | |||
89 | 1.5 SuperH | ||
90 | ---------- | ||
91 | |||
92 | All SuperH processors supporting rate rounding through the clock | ||
93 | framework are supported by cpufreq. | ||
94 | |||
95 | 1.6 Blackfin | ||
96 | ------------ | ||
97 | |||
98 | The following Blackfin processors are supported by cpufreq: | ||
99 | |||
100 | BF522, BF523, BF524, BF525, BF526, BF527, Rev 0.1 or higher | ||
101 | BF531, BF532, BF533, Rev 0.3 or higher | ||
102 | BF534, BF536, BF537, Rev 0.2 or higher | ||
103 | BF561, Rev 0.3 or higher | ||
104 | BF542, BF544, BF547, BF548, BF549, Rev 0.1 or higher | ||
105 | |||
106 | |||
107 | 2. "Policy" / "Governor"? | ||
108 | ========================= | ||
109 | |||
110 | Some CPU frequency scaling-capable processors switch between various | ||
111 | frequencies and operating voltages "on the fly" without any kernel or | ||
112 | user involvement. This guarantees very fast switching to a frequency | ||
113 | which is high enough to serve the user's needs, but low enough to save | ||
114 | power. | ||
115 | |||
116 | |||
117 | 2.1 Policy | ||
118 | ---------- | ||
119 | |||
120 | On these systems, all you can do is select the lower and upper | ||
121 | frequency limit as well as whether you want more aggressive | ||
122 | power-saving or more instantly available processing power. | ||
123 | |||
124 | |||
125 | 2.2 Governor | ||
126 | ------------ | ||
127 | |||
128 | On all other cpufreq implementations, these boundaries still need to | ||
129 | be set. Then, a "governor" must be selected. Such a "governor" decides | ||
130 | at what speed the processor shall run within the boundaries. One such | ||
131 | "governor" is the "userspace" governor. This one allows the user - or | ||
132 | a yet-to-be-implemented userspace program - to decide what specific | ||
133 | speed the processor shall run at. | ||
134 | |||
135 | |||
136 | 3. How to change the CPU cpufreq policy and/or speed | ||
137 | ==================================================== | ||
138 | |||
139 | 3.1 Preferred Interface: sysfs | ||
140 | ------------------------------ | ||
141 | |||
142 | The preferred interface is located in the sysfs filesystem. If you | ||
143 | mounted it at /sys, the cpufreq interface is located in a subdirectory | ||
144 | "cpufreq" within the cpu-device directory | ||
145 | (e.g. /sys/devices/system/cpu/cpu0/cpufreq/ for the first CPU). | ||
146 | |||
147 | affected_cpus : List of Online CPUs that require software | ||
148 | coordination of frequency. | ||
149 | |||
150 | cpuinfo_cur_freq : Current frequency of the CPU as obtained from | ||
151 | the hardware, in kHz. This is the frequency | ||
152 | the CPU actually runs at. | ||
153 | |||
154 | cpuinfo_min_freq : this file shows the minimum operating | ||
155 | frequency the processor can run at (in kHz) | ||
156 | |||
157 | cpuinfo_max_freq : this file shows the maximum operating | ||
158 | frequency the processor can run at (in kHz) | ||
159 | |||
160 | cpuinfo_transition_latency The time it takes on this CPU to | ||
161 | switch between two frequencies, in | ||
162 | nanoseconds. If unknown or known to be | ||
163 | so high that the driver does not | ||
164 | work with the ondemand governor, -1 | ||
165 | (CPUFREQ_ETERNAL) will be returned. | ||
166 | This information can be useful | ||
167 | for choosing an appropriate polling | ||
168 | frequency for a kernel governor or | ||
169 | userspace daemon. Avoid switching | ||
170 | the frequency too often, as that | ||
171 | results in performance loss. | ||
172 | |||
173 | related_cpus : List of Online + Offline CPUs that need software | ||
174 | coordination of frequency. | ||
175 | |||
176 | scaling_available_frequencies : List of available frequencies, in kHz. | ||
177 | |||
178 | scaling_available_governors : this file shows the CPUfreq governors | ||
179 | available in this kernel. The currently active | ||
180 | governor is shown in scaling_governor. | ||
181 | |||
182 | scaling_cur_freq : Current frequency of the CPU as determined by | ||
183 | the governor and cpufreq core, in kHz. This is | ||
184 | the frequency the kernel thinks the CPU runs | ||
185 | at. | ||
186 | |||
187 | scaling_driver : this file shows what cpufreq driver is | ||
188 | used to set the frequency on this CPU | ||
189 | |||
190 | scaling_governor : this file shows the currently active governor. | ||
191 | By "echoing" the name of another governor into | ||
192 | it, you can change the governor. Please note that | ||
193 | some governors won't load - they only work on | ||
194 | some specific architectures or processors. | ||
195 | |||
196 | scaling_min_freq and | ||
197 | scaling_max_freq show the current "policy limits" (in | ||
198 | kHz). By echoing new values into these | ||
199 | files, you can change these limits. | ||
200 | NOTE: when setting a policy you need to | ||
201 | first set scaling_max_freq, then | ||
202 | scaling_min_freq. | ||
203 | |||
204 | scaling_setspeed Reading this file returns the frequency currently | ||
205 | programmed by the governor. Writing to it changes | ||
206 | the current frequency for the group of CPUs | ||
207 | represented by a policy. Writing is currently | ||
208 | supported only by the userspace governor. | ||
209 | |||
210 | bios_limit : If the BIOS tells the OS to limit a CPU to | ||
211 | lower frequencies, the user can read out the | ||
212 | maximum available frequency from this file. | ||
213 | This typically can happen through (often not | ||
214 | intended) BIOS settings, restrictions | ||
215 | triggered through a service processor or other | ||
216 | BIOS/HW based implementations. | ||
217 | This does not cover thermal ACPI limitations | ||
218 | which can be detected through the generic | ||
219 | thermal driver. | ||
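As a rough illustration of the attribute files listed above, the following sketch reads a few of them and changes the policy limits in the documented order (scaling_max_freq first, then scaling_min_freq). The helper names `read_attr()`, `write_attr()`, `set_policy_limits()` and `summary()` are hypothetical, sysfs is assumed to be mounted at /sys, and writing normally requires root. Which attributes exist depends on the driver, so missing files are reported as `None`.

```python
from pathlib import Path

# Assumed location for the first CPU; pass another directory via `base`.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")


def read_attr(name, base=CPUFREQ_DIR):
    """Return the stripped contents of one attribute file, or None if unreadable."""
    try:
        return (Path(base) / name).read_text().strip()
    except OSError:
        return None


def write_attr(name, value, base=CPUFREQ_DIR):
    """Equivalent of "echoing" a value into an attribute file."""
    (Path(base) / name).write_text(f"{value}\n")


def set_policy_limits(min_khz, max_khz, base=CPUFREQ_DIR):
    """Set the policy limits, writing scaling_max_freq before scaling_min_freq."""
    if min_khz > max_khz:
        raise ValueError("min_khz must not exceed max_khz")
    write_attr("scaling_max_freq", max_khz, base)
    write_attr("scaling_min_freq", min_khz, base)


def summary(base=CPUFREQ_DIR):
    """Collect a few of the attributes described above into a dict."""
    names = ("scaling_driver", "scaling_governor", "scaling_cur_freq",
             "scaling_min_freq", "scaling_max_freq",
             "cpuinfo_min_freq", "cpuinfo_max_freq")
    return {name: read_attr(name, base) for name in names}
```

Calling `summary()` with no arguments inspects the first CPU; the `base` parameter exists so that the same helpers can be pointed at another CPU's directory or at a scratch directory for testing.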
220 | |||
221 | If you have selected the "userspace" governor which allows you to | ||
222 | set the CPU operating frequency to a specific value, you can read out | ||
223 | the current frequency in | ||
224 | |||
225 | scaling_setspeed. By "echoing" a new frequency into this | ||
226 | you can change the speed of the CPU, | ||
227 | but only within the limits of | ||
228 | scaling_min_freq and scaling_max_freq. | ||
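A minimal sketch of the scaling_setspeed procedure, assuming sysfs is mounted at /sys and the caller has permission to write the file. `set_speed()` is a hypothetical helper, not part of any kernel or library interface. It checks that the "userspace" governor is selected and that the requested value lies within scaling_min_freq and scaling_max_freq before "echoing" it.

```python
from pathlib import Path


def set_speed(freq_khz, base="/sys/devices/system/cpu/cpu0/cpufreq"):
    """Write freq_khz to scaling_setspeed after checking governor and limits."""
    base = Path(base)

    def read(name):
        return (base / name).read_text().strip()

    if read("scaling_governor") != "userspace":
        raise RuntimeError("scaling_setspeed requires the userspace governor")

    lo = int(read("scaling_min_freq"))
    hi = int(read("scaling_max_freq"))
    if not lo <= freq_khz <= hi:
        raise ValueError(f"{freq_khz} kHz is outside the policy limits {lo}..{hi}")

    (base / "scaling_setspeed").write_text(f"{freq_khz}\n")
    # Read the value back, since the file reports the currently programmed speed.
    return int(read("scaling_setspeed"))
```

On a real system the value read back may differ from the requested one if the driver only supports a discrete set of frequencies.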