diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/power/00-INDEX | 2 | ||||
-rw-r--r-- | Documentation/power/suspend-and-cpuhotplug.txt | 275 |
2 files changed, 277 insertions, 0 deletions
diff --git a/Documentation/power/00-INDEX b/Documentation/power/00-INDEX index 45e9d4a91284..a4d682f54231 100644 --- a/Documentation/power/00-INDEX +++ b/Documentation/power/00-INDEX | |||
@@ -26,6 +26,8 @@ s2ram.txt | |||
26 | - How to get suspend to ram working (and debug it when it isn't) | 26 | - How to get suspend to ram working (and debug it when it isn't) |
27 | states.txt | 27 | states.txt |
28 | - System power management states | 28 | - System power management states |
29 | suspend-and-cpuhotplug.txt | ||
30 | - Explains the interaction between Suspend-to-RAM (S3) and CPU hotplug | ||
29 | swsusp-and-swap-files.txt | 31 | swsusp-and-swap-files.txt |
30 | - Using swap files with software suspend (to disk) | 32 | - Using swap files with software suspend (to disk) |
31 | swsusp-dmcrypt.txt | 33 | swsusp-dmcrypt.txt |
diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt new file mode 100644 index 000000000000..f28f9a6f0347 --- /dev/null +++ b/Documentation/power/suspend-and-cpuhotplug.txt | |||
@@ -0,0 +1,275 @@ | |||
1 | Interaction of Suspend code (S3) with the CPU hotplug infrastructure | ||
2 | |||
3 | (C) 2011 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> | ||
4 | |||
5 | |||
6 | I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM | ||
7 | infrastructure uses it internally? And where do they share common code? | ||
8 | |||
9 | Well, a picture is worth a thousand words... So ASCII art follows :-) | ||
10 | |||
11 | [This depicts the current design in the kernel, and focusses only on the | ||
12 | interactions involving the freezer and CPU hotplug and also tries to explain | ||
13 | the locking involved. It outlines the notifications involved as well. | ||
14 | But please note that here, only the call paths are illustrated, with the aim | ||
15 | of describing where they take different paths and where they share code. | ||
16 | What happens when regular CPU hotplug and Suspend-to-RAM race with each other | ||
17 | is not depicted here.] | ||
18 | |||
19 | On a high level, the suspend-resume cycle goes like this: | ||
20 | |||
21 | |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | | ||
22 | |tasks | | cpus | | | | cpus | |tasks| | ||
23 | |||
24 | |||
25 | More details follow: | ||
26 | |||
27 | Suspend call path | ||
28 | ----------------- | ||
29 | |||
30 | Write 'mem' to | ||
31 | /sys/power/state | ||
32 | syfs file | ||
33 | | | ||
34 | v | ||
35 | Acquire pm_mutex lock | ||
36 | | | ||
37 | v | ||
38 | Send PM_SUSPEND_PREPARE | ||
39 | notifications | ||
40 | | | ||
41 | v | ||
42 | Freeze tasks | ||
43 | | | ||
44 | | | ||
45 | v | ||
46 | disable_nonboot_cpus() | ||
47 | /* start */ | ||
48 | | | ||
49 | v | ||
50 | Acquire cpu_add_remove_lock | ||
51 | | | ||
52 | v | ||
53 | Iterate over CURRENTLY | ||
54 | online CPUs | ||
55 | | | ||
56 | | | ||
57 | | ---------- | ||
58 | v | L | ||
59 | ======> _cpu_down() | | ||
60 | | [This takes cpuhotplug.lock | | ||
61 | Common | before taking down the CPU | | ||
62 | code | and releases it when done] | O | ||
63 | | While it is at it, notifications | | ||
64 | | are sent when notable events occur, | | ||
65 | ======> by running all registered callbacks. | | ||
66 | | | O | ||
67 | | | | ||
68 | | | | ||
69 | v | | ||
70 | Note down these cpus in | P | ||
71 | frozen_cpus mask ---------- | ||
72 | | | ||
73 | v | ||
74 | Disable regular cpu hotplug | ||
75 | by setting cpu_hotplug_disabled=1 | ||
76 | | | ||
77 | v | ||
78 | Release cpu_add_remove_lock | ||
79 | | | ||
80 | v | ||
81 | /* disable_nonboot_cpus() complete */ | ||
82 | | | ||
83 | v | ||
84 | Do suspend | ||
85 | |||
86 | |||
87 | |||
88 | Resuming back is likewise, with the counterparts being (in the order of | ||
89 | execution during resume): | ||
90 | * enable_nonboot_cpus() which involves: | ||
91 | | Acquire cpu_add_remove_lock | ||
92 | | Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug | ||
93 | | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] | ||
94 | | Release cpu_add_remove_lock | ||
95 | v | ||
96 | |||
97 | * thaw tasks | ||
98 | * send PM_POST_SUSPEND notifications | ||
99 | * Release pm_mutex lock. | ||
100 | |||
101 | |||
102 | It is to be noted here that the pm_mutex lock is acquired at the very | ||
103 | beginning, when we are just starting out to suspend, and then released only | ||
104 | after the entire cycle is complete (i.e., suspend + resume). | ||
105 | |||
106 | |||
107 | |||
108 | Regular CPU hotplug call path | ||
109 | ----------------------------- | ||
110 | |||
111 | Write 0 (or 1) to | ||
112 | /sys/devices/system/cpu/cpu*/online | ||
113 | sysfs file | ||
114 | | | ||
115 | | | ||
116 | v | ||
117 | cpu_down() | ||
118 | | | ||
119 | v | ||
120 | Acquire cpu_add_remove_lock | ||
121 | | | ||
122 | v | ||
123 | If cpu_hotplug_disabled is 1 | ||
124 | return gracefully | ||
125 | | | ||
126 | | | ||
127 | v | ||
128 | ======> _cpu_down() | ||
129 | | [This takes cpuhotplug.lock | ||
130 | Common | before taking down the CPU | ||
131 | code | and releases it when done] | ||
132 | | While it is at it, notifications | ||
133 | | are sent when notable events occur, | ||
134 | ======> by running all registered callbacks. | ||
135 | | | ||
136 | | | ||
137 | v | ||
138 | Release cpu_add_remove_lock | ||
139 | [That's it!, for | ||
140 | regular CPU hotplug] | ||
141 | |||
142 | |||
143 | |||
144 | So, as can be seen from the two diagrams (the parts marked as "Common code"), | ||
145 | regular CPU hotplug and the suspend code path converge at the _cpu_down() and | ||
146 | _cpu_up() functions. They differ in the arguments passed to these functions, | ||
147 | in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' | ||
148 | argument. But during suspend, since the tasks are already frozen by the time | ||
149 | the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called | ||
150 | with the 'tasks_frozen' argument set to 1. | ||
151 | [See below for some known issues regarding this.] | ||
152 | |||
153 | |||
154 | Important files and functions/entry points: | ||
155 | ------------------------------------------ | ||
156 | |||
157 | kernel/power/process.c : freeze_processes(), thaw_processes() | ||
158 | kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() | ||
159 | kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() | ||
160 | |||
161 | |||
162 | |||
163 | II. What are the issues involved in CPU hotplug? | ||
164 | ------------------------------------------- | ||
165 | |||
166 | There are some interesting situations involving CPU hotplug and microcode | ||
167 | update on the CPUs, as discussed below: | ||
168 | |||
169 | [Please bear in mind that the kernel requests the microcode images from | ||
170 | userspace, using the request_firmware() function defined in | ||
171 | drivers/base/firmware_class.c] | ||
172 | |||
173 | |||
174 | a. When all the CPUs are identical: | ||
175 | |||
176 | This is the most common situation and it is quite straightforward: we want | ||
177 | to apply the same microcode revision to each of the CPUs. | ||
178 | To give an example of x86, the collect_cpu_info() function defined in | ||
179 | arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU | ||
180 | and thereby in applying the correct microcode revision to it. | ||
181 | But note that the kernel does not maintain a common microcode image for the | ||
182 | all CPUs, in order to handle case 'b' described below. | ||
183 | |||
184 | |||
185 | b. When some of the CPUs are different than the rest: | ||
186 | |||
187 | In this case since we probably need to apply different microcode revisions | ||
188 | to different CPUs, the kernel maintains a copy of the correct microcode | ||
189 | image for each CPU (after appropriate CPU type/model discovery using | ||
190 | functions such as collect_cpu_info()). | ||
191 | |||
192 | |||
193 | c. When a CPU is physically hot-unplugged and a new (and possibly different | ||
194 | type of) CPU is hot-plugged into the system: | ||
195 | |||
196 | In the current design of the kernel, whenever a CPU is taken offline during | ||
197 | a regular CPU hotplug operation, upon receiving the CPU_DEAD notification | ||
198 | (which is sent by the CPU hotplug code), the microcode update driver's | ||
199 | callback for that event reacts by freeing the kernel's copy of the | ||
200 | microcode image for that CPU. | ||
201 | |||
202 | Hence, when a new CPU is brought online, since the kernel finds that it | ||
203 | doesn't have the microcode image, it does the CPU type/model discovery | ||
204 | afresh and then requests the userspace for the appropriate microcode image | ||
205 | for that CPU, which is subsequently applied. | ||
206 | |||
207 | For example, in x86, the mc_cpu_callback() function (which is the microcode | ||
208 | update driver's callback registered for CPU hotplug events) calls | ||
209 | microcode_update_cpu() which would call microcode_init_cpu() in this case, | ||
210 | instead of microcode_resume_cpu() when it finds that the kernel doesn't | ||
211 | have a valid microcode image. This ensures that the CPU type/model | ||
212 | discovery is performed and the right microcode is applied to the CPU after | ||
213 | getting it from userspace. | ||
214 | |||
215 | |||
216 | d. Handling microcode update during suspend/hibernate: | ||
217 | |||
218 | Strictly speaking, during a CPU hotplug operation which does not involve | ||
219 | physically removing or inserting CPUs, the CPUs are not actually powered | ||
220 | off during a CPU offline. They are just put to the lowest C-states possible. | ||
221 | Hence, in such a case, it is not really necessary to re-apply microcode | ||
222 | when the CPUs are brought back online, since they wouldn't have lost the | ||
223 | image during the CPU offline operation. | ||
224 | |||
225 | This is the usual scenario encountered during a resume after a suspend. | ||
226 | However, in the case of hibernation, since all the CPUs are completely | ||
227 | powered off, during restore it becomes necessary to apply the microcode | ||
228 | images to all the CPUs. | ||
229 | |||
230 | [Note that we don't expect someone to physically pull out nodes and insert | ||
231 | nodes with a different type of CPUs in-between a suspend-resume or a | ||
232 | hibernate/restore cycle.] | ||
233 | |||
234 | In the current design of the kernel however, during a CPU offline operation | ||
235 | as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN notification), | ||
236 | the existing copy of microcode image in the kernel is not freed up. | ||
237 | And during the CPU online operations (during resume/restore), since the | ||
238 | kernel finds that it already has copies of the microcode images for all the | ||
239 | CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU | ||
240 | type/model and the need for validating whether the microcode revisions are | ||
241 | right for the CPUs or not (due to the above assumption that physical CPU | ||
242 | hotplug will not be done in-between suspend/resume or hibernate/restore | ||
243 | cycles). | ||
244 | |||
245 | |||
246 | III. Are there any known problems when regular CPU hotplug and suspend race | ||
247 | with each other? | ||
248 | |||
249 | Yes, they are listed below: | ||
250 | |||
251 | 1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to | ||
252 | the _cpu_down() and _cpu_up() functions is *always* 0. | ||
253 | This might not reflect the true current state of the system, since the | ||
254 | tasks could have been frozen by an out-of-band event such as a suspend | ||
255 | operation in progress. Hence, it will lead to wrong notifications being | ||
256 | sent during the cpu online/offline events (eg, CPU_ONLINE notification | ||
257 | instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of | ||
258 | inappropriate code by the callbacks registered for such CPU hotplug events. | ||
259 | |||
260 | 2. If a regular CPU hotplug stress test happens to race with the freezer due | ||
261 | to a suspend operation in progress at the same time, then we could hit the | ||
262 | situation described below: | ||
263 | |||
264 | * A regular cpu online operation continues its journey from userspace | ||
265 | into the kernel, since the freezing has not yet begun. | ||
266 | * Then freezer gets to work and freezes userspace. | ||
267 | * If cpu online has not yet completed the microcode update stuff by now, | ||
268 | it will now start waiting on the frozen userspace in the | ||
269 | TASK_UNINTERRUPTIBLE state, in order to get the microcode image. | ||
270 | * Now the freezer continues and tries to freeze the remaining tasks. But | ||
271 | due to this wait mentioned above, the freezer won't be able to freeze | ||
272 | the cpu online hotplug task and hence freezing of tasks fails. | ||
273 | |||
274 | As a result of this task freezing failure, the suspend operation gets | ||
275 | aborted. | ||