author     Linus Torvalds <torvalds@linux-foundation.org>    2011-10-25 09:18:39 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>    2011-10-25 09:18:39 -0400
commit     7e0bb71e75020348bee523720a0c2f04cc72f540 (patch)
tree       1a22d65bbce34e8cc0f82c543c9486ffb58332f7 /Documentation/power
parent     b9e2780d576a010d4aba1e69f247170bf3718d6b (diff)
parent     0ab1e79b825a5cd8aeb3b34d89c9a89dea900056 (diff)
Merge branch 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
* 'pm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (63 commits)
PM / Clocks: Remove redundant NULL checks before kfree()
PM / Documentation: Update docs about suspend and CPU hotplug
ACPI / PM: Add Sony VGN-FW21E to nonvs blacklist.
ARM: mach-shmobile: sh7372 A4R support (v4)
ARM: mach-shmobile: sh7372 A3SP support (v4)
PM / Sleep: Mark devices involved in wakeup signaling during suspend
PM / Hibernate: Improve performance of LZO/plain hibernation, checksum image
PM / Hibernate: Do not initialize static and extern variables to 0
PM / Freezer: Make fake_signal_wake_up() wake TASK_KILLABLE tasks too
PM / Hibernate: Add resumedelay kernel param in addition to resumewait
MAINTAINERS: Update linux-pm list address
PM / ACPI: Blacklist Vaio VGN-FW520F machine known to require acpi_sleep=nonvs
PM / ACPI: Blacklist Sony Vaio known to require acpi_sleep=nonvs
PM / Hibernate: Add resumewait param to support MMC-like devices as resume file
PM / Hibernate: Fix typo in a kerneldoc comment
PM / Hibernate: Freeze kernel threads after preallocating memory
PM: Update the policy on default wakeup settings
PM / VT: Cleanup #if defined uglyness and fix compile error
PM / Suspend: Off by one in pm_suspend()
PM / Hibernate: Include storage keys in hibernation image on s390
...
Diffstat (limited to 'Documentation/power')
 Documentation/power/00-INDEX                   |   2
 Documentation/power/basic-pm-debugging.txt     |  24
 Documentation/power/devices.txt                |   8
 Documentation/power/pm_qos_interface.txt       |  92
 Documentation/power/runtime_pm.txt             |  21
 Documentation/power/suspend-and-cpuhotplug.txt | 275
 6 files changed, 405 insertions, 17 deletions
diff --git a/Documentation/power/00-INDEX b/Documentation/power/00-INDEX
index 45e9d4a91284..a4d682f54231 100644
--- a/Documentation/power/00-INDEX
+++ b/Documentation/power/00-INDEX
@@ -26,6 +26,8 @@ s2ram.txt
 	- How to get suspend to ram working (and debug it when it isn't)
 states.txt
 	- System power management states
+suspend-and-cpuhotplug.txt
+	- Explains the interaction between Suspend-to-RAM (S3) and CPU hotplug
 swsusp-and-swap-files.txt
 	- Using swap files with software suspend (to disk)
 swsusp-dmcrypt.txt
diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt
index 05a7fe76232d..40a4c65f380a 100644
--- a/Documentation/power/basic-pm-debugging.txt
+++ b/Documentation/power/basic-pm-debugging.txt
@@ -201,3 +201,27 @@ case, you may be able to search for failing drivers by following the procedure
 analogous to the one described in section 1. If you find some failing drivers,
 you will have to unload them every time before an STR transition (ie. before
 you run s2ram), and please report the problems with them.
+
+There is a debugfs entry which shows the suspend to RAM statistics. Here is an
+example of its output.
+	# mount -t debugfs none /sys/kernel/debug
+	# cat /sys/kernel/debug/suspend_stats
+	success: 20
+	fail: 5
+	failed_freeze: 0
+	failed_prepare: 0
+	failed_suspend: 5
+	failed_suspend_noirq: 0
+	failed_resume: 0
+	failed_resume_noirq: 0
+	failures:
+	  last_failed_dev:	alarm
+				adc
+	  last_failed_errno:	-16
+				-16
+	  last_failed_step:	suspend
+				suspend
+Field success means the success number of suspend to RAM, and field fail means
+the failure number. Others are the failure number of different steps of suspend
+to RAM. suspend_stats just lists the last 2 failed devices, error number and
+failed step of suspend.
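
For illustration only (this example is not part of the patch above), such a failure is typically produced when a device's ->suspend callback returns an error; a -EBUSY (-16) return, for instance, would be accounted as a failed_suspend, with the device name and error number shown under "failures:". A minimal sketch, using hypothetical foo_* names and driver state:

#include <linux/device.h>
#include <linux/errno.h>
#include <linux/pm.h>

struct foo_data {
	bool transfer_in_progress;	/* hypothetical driver state */
};

static int foo_suspend(struct device *dev)
{
	struct foo_data *data = dev_get_drvdata(dev);

	/* Refusing to suspend here is counted as "failed_suspend", with
	 * the device name and -EBUSY (-16) recorded under "failures:". */
	if (data->transfer_in_progress)
		return -EBUSY;

	return 0;
}

static const struct dev_pm_ops foo_pm_ops = {
	.suspend = foo_suspend,
};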
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt
index 3384d5996be2..646a89e0c07d 100644
--- a/Documentation/power/devices.txt
+++ b/Documentation/power/devices.txt
@@ -152,7 +152,9 @@ try to use its wakeup mechanism. device_set_wakeup_enable() affects this flag;
 for the most part drivers should not change its value. The initial value of
 should_wakeup is supposed to be false for the majority of devices; the major
 exceptions are power buttons, keyboards, and Ethernet adapters whose WoL
-(wake-on-LAN) feature has been set up with ethtool.
+(wake-on-LAN) feature has been set up with ethtool. It should also default
+to true for devices that don't generate wakeup requests on their own but merely
+forward wakeup requests from one bus to another (like PCI bridges).
 
 Whether or not a device is capable of issuing wakeup events is a hardware
 matter, and the kernel is responsible for keeping track of it. By contrast,
@@ -279,10 +281,6 @@ When the system goes into the standby or memory sleep state, the phases are:
 	time.) Unlike the other suspend-related phases, during the prepare
 	phase the device tree is traversed top-down.
 
-	In addition to that, if device drivers need to allocate additional
-	memory to be able to hadle device suspend correctly, that should be
-	done in the prepare phase.
-
 	After the prepare callback method returns, no new children may be
 	registered below the device. The method may also prepare the device or
 	driver in some way for the upcoming system power transition (for
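
For illustration only (not part of the patch), the wakeup flags described above are typically configured from a driver's probe routine; a minimal sketch with a hypothetical foo_probe() might be:

#include <linux/device.h>
#include <linux/pm_wakeup.h>

static int foo_probe(struct device *dev)
{
	/* The hardware is able to signal wakeup events... */
	device_set_wakeup_capable(dev, true);

	/* ...but, like most devices, it should not do so by default;
	 * user space (e.g. ethtool for WoL) may enable it later. */
	device_set_wakeup_enable(dev, false);

	return 0;
}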
diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt
index bfed898a03fc..17e130a80347 100644
--- a/Documentation/power/pm_qos_interface.txt
+++ b/Documentation/power/pm_qos_interface.txt
@@ -4,14 +4,19 @@ This interface provides a kernel and user mode interface for registering
 performance expectations by drivers, subsystems and user space applications on
 one of the parameters.
 
-Currently we have {cpu_dma_latency, network_latency, network_throughput} as the
-initial set of pm_qos parameters.
+Two different PM QoS frameworks are available:
+1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput.
+2. the per-device PM QoS framework provides the API to manage the per-device latency
+constraints.
 
 Each parameters have defined units:
 * latency: usec
 * timeout: usec
 * throughput: kbs (kilo bit / sec)
 
+
+1. PM QoS framework
+
 The infrastructure exposes multiple misc device nodes one per implemented
 parameter. The set of parameters implement is defined by pm_qos_power_init()
 and pm_qos_params.h. This is done because having the available parameters
@@ -23,14 +28,18 @@ an aggregated target value. The aggregated target value is updated with
 changes to the request list or elements of the list. Typically the
 aggregated target value is simply the max or min of the request values held
 in the parameter list elements.
+Note: the aggregated target value is implemented as an atomic variable so that
+reading the aggregated value does not require any locking mechanism.
+
 
 From kernel mode the use of this interface is simple:
 
-handle = pm_qos_add_request(param_class, target_value):
-Will insert an element into the list for that identified PM_QOS class with the
+void pm_qos_add_request(handle, param_class, target_value):
+Will insert an element into the list for that identified PM QoS class with the
 target value. Upon change to this list the new target is recomputed and any
 registered notifiers are called only if the target value is now different.
-Clients of pm_qos need to save the returned handle.
+Clients of pm_qos need to save the returned handle for future use in other
+pm_qos API functions.
 
 void pm_qos_update_request(handle, new_target_value):
 Will update the list element pointed to by the handle with the new target value
@@ -42,6 +51,20 @@ Will remove the element. After removal it will update the aggregate target and
 call the notification tree if the target was changed as a result of removing
 the request.
 
+int pm_qos_request(param_class):
+Returns the aggregated value for a given PM QoS class.
+
+int pm_qos_request_active(handle):
+Returns if the request is still active, i.e. it has not been removed from a
+PM QoS class constraints list.
+
+int pm_qos_add_notifier(param_class, notifier):
+Adds a notification callback function to the PM QoS class. The callback is
+called when the aggregated value for the PM QoS class is changed.
+
+int pm_qos_remove_notifier(int param_class, notifier):
+Removes the notification callback function for the PM QoS class.
+
 
 From user mode:
 Only processes can register a pm_qos request. To provide for automatic
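
For illustration only (not part of the patch), kernel-mode use of the PM QoS class API described above might look like the sketch below. The handle type (struct pm_qos_request) and the header name are assumptions based on the prototypes listed in this document; check linux/pm_qos.h (or pm_qos_params.h on older kernels) in the tree you are working with.

#include <linux/pm_qos.h>

static struct pm_qos_request foo_latency_req;	/* the "handle" */

static void foo_start_low_latency(void)
{
	/* Ask that CPU DMA latency stay at or below 20 usec. */
	pm_qos_add_request(&foo_latency_req, PM_QOS_CPU_DMA_LATENCY, 20);
}

static void foo_relax_latency(void)
{
	/* Loosen the constraint without dropping the request. */
	pm_qos_update_request(&foo_latency_req, 100);
}

static void foo_stop_low_latency(void)
{
	/* Drop the request entirely. */
	pm_qos_remove_request(&foo_latency_req);
}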
@@ -63,4 +86,63 @@ To remove the user mode request for a target value simply close the device
 node.
 
 
+2. PM QoS per-device latency framework
+
+For each device a list of performance requests is maintained along with
+an aggregated target value. The aggregated target value is updated with
+changes to the request list or elements of the list. Typically the
+aggregated target value is simply the max or min of the request values held
+in the parameter list elements.
+Note: the aggregated target value is implemented as an atomic variable so that
+reading the aggregated value does not require any locking mechanism.
+
+
+From kernel mode the use of this interface is the following:
+
+int dev_pm_qos_add_request(device, handle, value):
+Will insert an element into the list for that identified device with the
+target value. Upon change to this list the new target is recomputed and any
+registered notifiers are called only if the target value is now different.
+Clients of dev_pm_qos need to save the handle for future use in other
+dev_pm_qos API functions.
+
+int dev_pm_qos_update_request(handle, new_value):
+Will update the list element pointed to by the handle with the new target value
+and recompute the new aggregated target, calling the notification trees if the
+target is changed.
+
+int dev_pm_qos_remove_request(handle):
+Will remove the element. After removal it will update the aggregate target and
+call the notification trees if the target was changed as a result of removing
+the request.
+
+s32 dev_pm_qos_read_value(device):
+Returns the aggregated value for a given device's constraints list.
+
+
+Notification mechanisms:
+The per-device PM QoS framework has 2 different and distinct notification trees:
+a per-device notification tree and a global notification tree.
+
+int dev_pm_qos_add_notifier(device, notifier):
+Adds a notification callback function for the device.
+The callback is called when the aggregated value of the device constraints list
+is changed.
+
+int dev_pm_qos_remove_notifier(device, notifier):
+Removes the notification callback function for the device.
+
+int dev_pm_qos_add_global_notifier(notifier):
+Adds a notification callback function in the global notification tree of the
+framework.
+The callback is called when the aggregated value for any device is changed.
+
+int dev_pm_qos_remove_global_notifier(notifier):
+Removes the notification callback function from the global notification tree
+of the framework.
+
+
+From user mode:
+No API for user space access to the per-device latency constraints is provided
+yet - still under discussion.
 
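
Similarly, and again only as an illustration with assumed names (struct dev_pm_qos_request as the handle type, declared in linux/pm_qos.h in the corresponding kernel tree), the per-device API listed above could be used roughly like this:

#include <linux/device.h>
#include <linux/pm_qos.h>

static struct dev_pm_qos_request foo_dev_req;	/* the per-device "handle" */

static int foo_constrain_device(struct device *dev)
{
	int ret;

	/* Ask that this device's latency constraint not exceed 100 usec. */
	ret = dev_pm_qos_add_request(dev, &foo_dev_req, 100);
	if (ret < 0)
		return ret;

	/* ... later, tighten the constraint ... */
	dev_pm_qos_update_request(&foo_dev_req, 50);

	/* ... and finally drop it. */
	dev_pm_qos_remove_request(&foo_dev_req);

	return 0;
}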
diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt
index 6066e3a6b9a9..0e856088db7c 100644
--- a/Documentation/power/runtime_pm.txt
+++ b/Documentation/power/runtime_pm.txt
@@ -43,13 +43,18 @@ struct dev_pm_ops {
 	...
 };
 
-The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks are
-executed by the PM core for either the device type, or the class (if the device
-type's struct dev_pm_ops object does not exist), or the bus type (if the
-device type's and class' struct dev_pm_ops objects do not exist) of the given
-device (this allows device types to override callbacks provided by bus types or
-classes if necessary). The bus type, device type and class callbacks are
-referred to as subsystem-level callbacks in what follows.
+The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks
+are executed by the PM core for either the power domain, or the device type
+(if the device power domain's struct dev_pm_ops does not exist), or the class
+(if the device power domain's and type's struct dev_pm_ops object does not
+exist), or the bus type (if the device power domain's, type's and class'
+struct dev_pm_ops objects do not exist) of the given device, so the priority
+order of callbacks from high to low is that power domain callbacks, device
+type callbacks, class callbacks and bus type callbacks, and the high priority
+one will take precedence over low priority one. The bus type, device type and
+class callbacks are referred to as subsystem-level callbacks in what follows,
+and generally speaking, the power domain callbacks are used for representing
+power domains within a SoC.
 
 By default, the callbacks are always invoked in process context with interrupts
 enabled. However, subsystems can use the pm_runtime_irq_safe() helper function
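
As an illustration (not from the patch), a driver or subsystem supplies these callbacks through its struct dev_pm_ops; whether they are actually used depends on the power domain/device type/class/bus type priority order described above. The foo_* names below are hypothetical:

#include <linux/device.h>
#include <linux/pm.h>
#include <linux/pm_runtime.h>

static int foo_runtime_suspend(struct device *dev)
{
	/* Put the device into a low-power state. */
	return 0;
}

static int foo_runtime_resume(struct device *dev)
{
	/* Bring the device back to full power. */
	return 0;
}

static const struct dev_pm_ops foo_pm_ops = {
	SET_RUNTIME_PM_OPS(foo_runtime_suspend, foo_runtime_resume, NULL)
};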
@@ -477,12 +482,14 @@ pm_runtime_autosuspend_expiration()
 If pm_runtime_irq_safe() has been called for a device then the following helper
 functions may also be used in interrupt context:
 
+pm_runtime_idle()
 pm_runtime_suspend()
 pm_runtime_autosuspend()
 pm_runtime_resume()
 pm_runtime_get_sync()
 pm_runtime_put_sync()
 pm_runtime_put_sync_suspend()
+pm_runtime_put_sync_autosuspend()
 
 5. Runtime PM Initialization, Device Probing and Removal
 
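
For illustration only (not part of the patch), a driver that needs these helpers from interrupt context first marks itself IRQ-safe, e.g. in a hypothetical foo_probe(), and may then call pm_runtime_get_sync() from its interrupt handler:

#include <linux/device.h>
#include <linux/interrupt.h>
#include <linux/pm_runtime.h>

static int foo_probe(struct device *dev)
{
	pm_runtime_enable(dev);
	pm_runtime_irq_safe(dev);  /* callbacks may now be run with interrupts off */
	return 0;
}

static irqreturn_t foo_irq_handler(int irq, void *data)
{
	struct device *dev = data;

	pm_runtime_get_sync(dev);  /* allowed here because of pm_runtime_irq_safe() */
	/* ... handle the interrupt with the device powered up ... */
	pm_runtime_put_sync(dev);

	return IRQ_HANDLED;
}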
diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt
new file mode 100644
index 000000000000..f28f9a6f0347
--- /dev/null
+++ b/Documentation/power/suspend-and-cpuhotplug.txt
@@ -0,0 +1,275 @@
+Interaction of Suspend code (S3) with the CPU hotplug infrastructure
+
+     (C) 2011 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
+
+
+I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM
+   infrastructure uses it internally? And where do they share common code?
+
+Well, a picture is worth a thousand words... So ASCII art follows :-)
+
+[This depicts the current design in the kernel, and focusses only on the
+interactions involving the freezer and CPU hotplug and also tries to explain
+the locking involved. It outlines the notifications involved as well.
+But please note that here, only the call paths are illustrated, with the aim
+of describing where they take different paths and where they share code.
+What happens when regular CPU hotplug and Suspend-to-RAM race with each other
+is not depicted here.]
+
+On a high level, the suspend-resume cycle goes like this:
+
+|Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw |
+|tasks |    |     cpus      |    |          |    |     cpus     |    |tasks|
+
+
+More details follow:
+
+                        Suspend call path
+                        -----------------
+
+                                  Write 'mem' to
+                                /sys/power/state
+                                    syfs file
+                                        |
+                                        v
+                               Acquire pm_mutex lock
+                                        |
+                                        v
+                             Send PM_SUSPEND_PREPARE
+                                   notifications
+                                        |
+                                        v
+                                   Freeze tasks
+                                        |
+                                        |
+                                        v
+                              disable_nonboot_cpus()
+                                   /* start */
+                                        |
+                                        v
+                           Acquire cpu_add_remove_lock
+                                        |
+                                        v
+                            Iterate over CURRENTLY
+                                  online CPUs
+                                        |
+                                        |
+                                        |                ----------
+                                        v                          | L
+                             ======>   _cpu_down()                 |
+                            |          [This takes cpuhotplug.lock |
+                  Common    |          before taking down the CPU  |
+                   code     |          and releases it when done]  | O
+                            |          While it is at it, notifications
+                            |          are sent when notable events occur,
+                             ======>   by running all registered callbacks.
+                                        |                           | O
+                                        |                           |
+                                        |                           |
+                                        v                           |
+                            Note down these cpus in                 | P
+                               frozen_cpus mask          ----------
+                                        |
+                                        v
+                           Disable regular cpu hotplug
+                        by setting cpu_hotplug_disabled=1
+                                        |
+                                        v
+                           Release cpu_add_remove_lock
+                                        |
+                                        v
+                       /* disable_nonboot_cpus() complete */
+                                        |
+                                        v
+                                   Do suspend
+
+
+
+Resuming back is likewise, with the counterparts being (in the order of
+execution during resume):
+* enable_nonboot_cpus() which involves:
+   |  Acquire cpu_add_remove_lock
+   |  Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug
+   |  Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop]
+   |  Release cpu_add_remove_lock
+   v
+
+* thaw tasks
+* send PM_POST_SUSPEND notifications
+* Release pm_mutex lock.
+
+
+It is to be noted here that the pm_mutex lock is acquired at the very
+beginning, when we are just starting out to suspend, and then released only
+after the entire cycle is complete (i.e., suspend + resume).
+
+
+
+                       Regular CPU hotplug call path
+                       -----------------------------
+
+                                Write 0 (or 1) to
+                       /sys/devices/system/cpu/cpu*/online
+                                   sysfs file
+                                        |
+                                        |
+                                        v
+                                    cpu_down()
+                                        |
+                                        v
+                           Acquire cpu_add_remove_lock
+                                        |
+                                        v
+                          If cpu_hotplug_disabled is 1
+                                return gracefully
+                                        |
+                                        |
+                                        v
+                             ======>   _cpu_down()
+                            |          [This takes cpuhotplug.lock
+                  Common    |          before taking down the CPU
+                   code     |          and releases it when done]
+                            |          While it is at it, notifications
+                            |          are sent when notable events occur,
+                             ======>   by running all registered callbacks.
+                                        |
+                                        |
+                                        v
+                           Release cpu_add_remove_lock
+                                  [That's it!, for
+                                  regular CPU hotplug]
+
+
+
+So, as can be seen from the two diagrams (the parts marked as "Common code"),
+regular CPU hotplug and the suspend code path converge at the _cpu_down() and
+_cpu_up() functions. They differ in the arguments passed to these functions,
+in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen'
+argument. But during suspend, since the tasks are already frozen by the time
+the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called
+with the 'tasks_frozen' argument set to 1.
+[See below for some known issues regarding this.]
+
+
+Important files and functions/entry points:
+------------------------------------------
+
+kernel/power/process.c : freeze_processes(), thaw_processes()
+kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish()
+kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus()
+
+
+
+II. What are the issues involved in CPU hotplug?
+    -------------------------------------------
+
+There are some interesting situations involving CPU hotplug and microcode
+update on the CPUs, as discussed below:
+
+[Please bear in mind that the kernel requests the microcode images from
+userspace, using the request_firmware() function defined in
+drivers/base/firmware_class.c]
+
+
+a. When all the CPUs are identical:
+
+   This is the most common situation and it is quite straightforward: we want
+   to apply the same microcode revision to each of the CPUs.
+   To give an example of x86, the collect_cpu_info() function defined in
+   arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU
+   and thereby in applying the correct microcode revision to it.
+   But note that the kernel does not maintain a common microcode image for the
+   all CPUs, in order to handle case 'b' described below.
+
+
+b. When some of the CPUs are different than the rest:
+
+   In this case since we probably need to apply different microcode revisions
+   to different CPUs, the kernel maintains a copy of the correct microcode
+   image for each CPU (after appropriate CPU type/model discovery using
+   functions such as collect_cpu_info()).
+
+
+c. When a CPU is physically hot-unplugged and a new (and possibly different
+   type of) CPU is hot-plugged into the system:
+
+   In the current design of the kernel, whenever a CPU is taken offline during
+   a regular CPU hotplug operation, upon receiving the CPU_DEAD notification
+   (which is sent by the CPU hotplug code), the microcode update driver's
+   callback for that event reacts by freeing the kernel's copy of the
+   microcode image for that CPU.
+
+   Hence, when a new CPU is brought online, since the kernel finds that it
+   doesn't have the microcode image, it does the CPU type/model discovery
+   afresh and then requests the userspace for the appropriate microcode image
+   for that CPU, which is subsequently applied.
+
+   For example, in x86, the mc_cpu_callback() function (which is the microcode
+   update driver's callback registered for CPU hotplug events) calls
+   microcode_update_cpu() which would call microcode_init_cpu() in this case,
+   instead of microcode_resume_cpu() when it finds that the kernel doesn't
+   have a valid microcode image. This ensures that the CPU type/model
+   discovery is performed and the right microcode is applied to the CPU after
+   getting it from userspace.
+
+
+d. Handling microcode update during suspend/hibernate:
+
+   Strictly speaking, during a CPU hotplug operation which does not involve
+   physically removing or inserting CPUs, the CPUs are not actually powered
+   off during a CPU offline. They are just put to the lowest C-states possible.
+   Hence, in such a case, it is not really necessary to re-apply microcode
+   when the CPUs are brought back online, since they wouldn't have lost the
+   image during the CPU offline operation.
+
+   This is the usual scenario encountered during a resume after a suspend.
+   However, in the case of hibernation, since all the CPUs are completely
+   powered off, during restore it becomes necessary to apply the microcode
+   images to all the CPUs.
+
+   [Note that we don't expect someone to physically pull out nodes and insert
+   nodes with a different type of CPUs in-between a suspend-resume or a
+   hibernate/restore cycle.]
+
+   In the current design of the kernel however, during a CPU offline operation
+   as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN notification),
+   the existing copy of microcode image in the kernel is not freed up.
+   And during the CPU online operations (during resume/restore), since the
+   kernel finds that it already has copies of the microcode images for all the
+   CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU
+   type/model and the need for validating whether the microcode revisions are
+   right for the CPUs or not (due to the above assumption that physical CPU
+   hotplug will not be done in-between suspend/resume or hibernate/restore
+   cycles).
+
+
+III. Are there any known problems when regular CPU hotplug and suspend race
+     with each other?
+
+Yes, they are listed below:
+
+1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to
+   the _cpu_down() and _cpu_up() functions is *always* 0.
+   This might not reflect the true current state of the system, since the
+   tasks could have been frozen by an out-of-band event such as a suspend
+   operation in progress. Hence, it will lead to wrong notifications being
+   sent during the cpu online/offline events (eg, CPU_ONLINE notification
+   instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of
+   inappropriate code by the callbacks registered for such CPU hotplug events.
+
+2. If a regular CPU hotplug stress test happens to race with the freezer due
+   to a suspend operation in progress at the same time, then we could hit the
+   situation described below:
+
+    * A regular cpu online operation continues its journey from userspace
+      into the kernel, since the freezing has not yet begun.
+    * Then freezer gets to work and freezes userspace.
+    * If cpu online has not yet completed the microcode update stuff by now,
+      it will now start waiting on the frozen userspace in the
+      TASK_UNINTERRUPTIBLE state, in order to get the microcode image.
+    * Now the freezer continues and tries to freeze the remaining tasks. But
+      due to this wait mentioned above, the freezer won't be able to freeze
+      the cpu online hotplug task and hence freezing of tasks fails.
+
+   As a result of this task freezing failure, the suspend operation gets
+   aborted.