Diffstat (limited to 'Documentation/power')
-rw-r--r-- | Documentation/power/00-INDEX | 2
-rw-r--r-- | Documentation/power/basic-pm-debugging.txt | 26
-rw-r--r-- | Documentation/power/devices.txt | 8
-rw-r--r-- | Documentation/power/freezing-of-tasks.txt | 8
-rw-r--r-- | Documentation/power/pm_qos_interface.txt | 92
-rw-r--r-- | Documentation/power/regulator/machine.txt | 19
-rw-r--r-- | Documentation/power/runtime_pm.txt | 31
-rw-r--r-- | Documentation/power/suspend-and-cpuhotplug.txt | 275
-rw-r--r-- | Documentation/power/userland-swsusp.txt | 3
9 files changed, 435 insertions, 29 deletions
diff --git a/Documentation/power/00-INDEX b/Documentation/power/00-INDEX
index 45e9d4a91284..a4d682f54231 100644
--- a/Documentation/power/00-INDEX
+++ b/Documentation/power/00-INDEX
@@ -26,6 +26,8 @@ s2ram.txt | |||
26 | - How to get suspend to ram working (and debug it when it isn't) | 26 | - How to get suspend to ram working (and debug it when it isn't) |
27 | states.txt | 27 | states.txt |
28 | - System power management states | 28 | - System power management states |
29 | suspend-and-cpuhotplug.txt | ||
30 | - Explains the interaction between Suspend-to-RAM (S3) and CPU hotplug | ||
29 | swsusp-and-swap-files.txt | 31 | swsusp-and-swap-files.txt |
30 | - Using swap files with software suspend (to disk) | 32 | - Using swap files with software suspend (to disk) |
31 | swsusp-dmcrypt.txt | 33 | swsusp-dmcrypt.txt |
diff --git a/Documentation/power/basic-pm-debugging.txt b/Documentation/power/basic-pm-debugging.txt
index ddd78172ef73..40a4c65f380a 100644
--- a/Documentation/power/basic-pm-debugging.txt
+++ b/Documentation/power/basic-pm-debugging.txt
@@ -173,7 +173,7 @@ kernel messages using the serial console. This may provide you with some | |||
173 | information about the reasons of the suspend (resume) failure. Alternatively, | 173 | information about the reasons of the suspend (resume) failure. Alternatively, |
174 | it may be possible to use a FireWire port for debugging with firescope | 174 | it may be possible to use a FireWire port for debugging with firescope |
175 | (ftp://ftp.firstfloor.org/pub/ak/firescope/). On x86 it is also possible to | 175 | (ftp://ftp.firstfloor.org/pub/ak/firescope/). On x86 it is also possible to |
176 | use the PM_TRACE mechanism documented in Documentation/s2ram.txt . | 176 | use the PM_TRACE mechanism documented in Documentation/power/s2ram.txt . |
177 | 177 | ||
178 | 2. Testing suspend to RAM (STR) | 178 | 2. Testing suspend to RAM (STR) |
179 | 179 | ||
@@ -201,3 +201,27 @@ case, you may be able to search for failing drivers by following the procedure | |||
201 | analogous to the one described in section 1. If you find some failing drivers, | 201 | analogous to the one described in section 1. If you find some failing drivers, |
202 | you will have to unload them every time before an STR transition (ie. before | 202 | you will have to unload them every time before an STR transition (ie. before |
203 | you run s2ram), and please report the problems with them. | 203 | you run s2ram), and please report the problems with them. |
204 | |||
205 | There is a debugfs entry which shows the suspend to RAM statistics. Here is an | ||
206 | example of its output. | ||
207 | # mount -t debugfs none /sys/kernel/debug | ||
208 | # cat /sys/kernel/debug/suspend_stats | ||
209 | success: 20 | ||
210 | fail: 5 | ||
211 | failed_freeze: 0 | ||
212 | failed_prepare: 0 | ||
213 | failed_suspend: 5 | ||
214 | failed_suspend_noirq: 0 | ||
215 | failed_resume: 0 | ||
216 | failed_resume_noirq: 0 | ||
217 | failures: | ||
218 | last_failed_dev: alarm | ||
219 | adc | ||
220 | last_failed_errno: -16 | ||
221 | -16 | ||
222 | last_failed_step: suspend | ||
223 | suspend | ||
224 | Field success means the success number of suspend to RAM, and field fail means | ||
225 | the failure number. Others are the failure number of different steps of suspend | ||
226 | to RAM. suspend_stats just lists the last 2 failed devices, error number and | ||
227 | failed step of suspend. | ||
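The counters in the suspend_stats output above are plain "name: value" lines, so a test script or small tool can extract them mechanically. The following is a minimal user-space sketch, assuming the text has already been read into a string; the function name suspend_stat_lookup() is hypothetical and is not part of the kernel or of this patch.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): find a "field: number" line in
 * text shaped like the suspend_stats output and store the number in *out.
 * Returns 0 on success, -1 if the field is missing or has no numeric value.
 */
int suspend_stat_lookup(const char *text, const char *field, long *out)
{
    size_t len = strlen(field);
    const char *line;

    for (line = text; line && *line; ) {
        const char *p = line;
        const char *next = strchr(line, '\n');

        /* Skip indentation, as used under the "failures:" heading. */
        while (*p == ' ' || *p == '\t')
            p++;

        /* Require the ':' right after the name so "fail" != "failed_freeze". */
        if (strncmp(p, field, len) == 0 && p[len] == ':') {
            const char *v = p + len + 1;
            char *end;
            long val;

            while (*v == ' ' || *v == '\t')
                v++;
            if (*v == '\n' || *v == '\0')
                return -1;      /* heading line without a value */
            val = strtol(v, &end, 10);
            if (end == v)
                return -1;      /* value is not a number */
            *out = val;
            return 0;
        }
        line = next ? next + 1 : NULL;
    }
    return -1;
}
```

With the sample output shown above, looking up "success" would yield 20 and "failed_suspend" would yield 5, while "failures" is rejected because it is only a heading.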
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt
index 3384d5996be2..646a89e0c07d 100644
--- a/Documentation/power/devices.txt
+++ b/Documentation/power/devices.txt
@@ -152,7 +152,9 @@ try to use its wakeup mechanism. device_set_wakeup_enable() affects this flag; | |||
152 | for the most part drivers should not change its value. The initial value of | 152 | for the most part drivers should not change its value. The initial value of |
153 | should_wakeup is supposed to be false for the majority of devices; the major | 153 | should_wakeup is supposed to be false for the majority of devices; the major |
154 | exceptions are power buttons, keyboards, and Ethernet adapters whose WoL | 154 | exceptions are power buttons, keyboards, and Ethernet adapters whose WoL |
155 | (wake-on-LAN) feature has been set up with ethtool. | 155 | (wake-on-LAN) feature has been set up with ethtool. It should also default |
156 | to true for devices that don't generate wakeup requests on their own but merely | ||
157 | forward wakeup requests from one bus to another (like PCI bridges). | ||
156 | 158 | ||
157 | Whether or not a device is capable of issuing wakeup events is a hardware | 159 | Whether or not a device is capable of issuing wakeup events is a hardware |
158 | matter, and the kernel is responsible for keeping track of it. By contrast, | 160 | matter, and the kernel is responsible for keeping track of it. By contrast, |
@@ -279,10 +281,6 @@ When the system goes into the standby or memory sleep state, the phases are: | |||
279 | time.) Unlike the other suspend-related phases, during the prepare | 281 | time.) Unlike the other suspend-related phases, during the prepare |
280 | phase the device tree is traversed top-down. | 282 | phase the device tree is traversed top-down. |
281 | 283 | ||
282 | In addition to that, if device drivers need to allocate additional | ||
283 | memory to be able to hadle device suspend correctly, that should be | ||
284 | done in the prepare phase. | ||
285 | |||
286 | After the prepare callback method returns, no new children may be | 284 | After the prepare callback method returns, no new children may be |
287 | registered below the device. The method may also prepare the device or | 285 | registered below the device. The method may also prepare the device or |
288 | driver in some way for the upcoming system power transition (for | 286 | driver in some way for the upcoming system power transition (for |
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt
index 38b57248fd61..316c2ba187f4 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -22,12 +22,12 @@ try_to_freeze_tasks() that sets TIF_FREEZE for all of the freezable tasks and | |||
22 | either wakes them up, if they are kernel threads, or sends fake signals to them, | 22 | either wakes them up, if they are kernel threads, or sends fake signals to them, |
23 | if they are user space processes. A task that has TIF_FREEZE set, should react | 23 | if they are user space processes. A task that has TIF_FREEZE set, should react |
24 | to it by calling the function called refrigerator() (defined in | 24 | to it by calling the function called refrigerator() (defined in |
25 | kernel/power/process.c), which sets the task's PF_FROZEN flag, changes its state | 25 | kernel/freezer.c), which sets the task's PF_FROZEN flag, changes its state |
26 | to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is cleared for it. | 26 | to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is cleared for it. |
27 | Then, we say that the task is 'frozen' and therefore the set of functions | 27 | Then, we say that the task is 'frozen' and therefore the set of functions |
28 | handling this mechanism is referred to as 'the freezer' (these functions are | 28 | handling this mechanism is referred to as 'the freezer' (these functions are |
29 | defined in kernel/power/process.c and include/linux/freezer.h). User space | 29 | defined in kernel/power/process.c, kernel/freezer.c & include/linux/freezer.h). |
30 | processes are generally frozen before kernel threads. | 30 | User space processes are generally frozen before kernel threads. |
31 | 31 | ||
32 | It is not recommended to call refrigerator() directly. Instead, it is | 32 | It is not recommended to call refrigerator() directly. Instead, it is |
33 | recommended to use the try_to_freeze() function (defined in | 33 | recommended to use the try_to_freeze() function (defined in |
@@ -95,7 +95,7 @@ after the memory for the image has been freed, we don't want tasks to allocate | |||
95 | additional memory and we prevent them from doing that by freezing them earlier. | 95 | additional memory and we prevent them from doing that by freezing them earlier. |
96 | [Of course, this also means that device drivers should not allocate substantial | 96 | [Of course, this also means that device drivers should not allocate substantial |
97 | amounts of memory from their .suspend() callbacks before hibernation, but this | 97 | amounts of memory from their .suspend() callbacks before hibernation, but this |
98 | is e separate issue.] | 98 | is a separate issue.] |
99 | 99 | ||
100 | 3. The third reason is to prevent user space processes and some kernel threads | 100 | 3. The third reason is to prevent user space processes and some kernel threads |
101 | from interfering with the suspending and resuming of devices. A user space | 101 | from interfering with the suspending and resuming of devices. A user space |
diff --git a/Documentation/power/pm_qos_interface.txt b/Documentation/power/pm_qos_interface.txt
index bfed898a03fc..17e130a80347 100644
--- a/Documentation/power/pm_qos_interface.txt
+++ b/Documentation/power/pm_qos_interface.txt
@@ -4,14 +4,19 @@ This interface provides a kernel and user mode interface for registering | |||
4 | performance expectations by drivers, subsystems and user space applications on | 4 | performance expectations by drivers, subsystems and user space applications on |
5 | one of the parameters. | 5 | one of the parameters. |
6 | 6 | ||
7 | Currently we have {cpu_dma_latency, network_latency, network_throughput} as the | 7 | Two different PM QoS frameworks are available: |
8 | initial set of pm_qos parameters. | 8 | 1. PM QoS classes for cpu_dma_latency, network_latency, network_throughput. |
9 | 2. the per-device PM QoS framework provides the API to manage the per-device latency | ||
10 | constraints. | ||
9 | 11 | ||
10 | Each parameter has defined units: | 12 | Each parameter has defined units: |
11 | * latency: usec | 13 | * latency: usec |
12 | * timeout: usec | 14 | * timeout: usec |
13 | * throughput: kbs (kilo bit / sec) | 15 | * throughput: kbs (kilo bit / sec) |
14 | 16 | ||
17 | |||
18 | 1. PM QoS framework | ||
19 | |||
15 | The infrastructure exposes multiple misc device nodes one per implemented | 20 | The infrastructure exposes multiple misc device nodes one per implemented |
16 | parameter. The set of parameters implemented is defined by pm_qos_power_init() | 21 | parameter. The set of parameters implemented is defined by pm_qos_power_init() |
17 | and pm_qos_params.h. This is done because having the available parameters | 22 | and pm_qos_params.h. This is done because having the available parameters |
@@ -23,14 +28,18 @@ an aggregated target value. The aggregated target value is updated with | |||
23 | changes to the request list or elements of the list. Typically the | 28 | changes to the request list or elements of the list. Typically the |
24 | aggregated target value is simply the max or min of the request values held | 29 | aggregated target value is simply the max or min of the request values held |
25 | in the parameter list elements. | 30 | in the parameter list elements. |
31 | Note: the aggregated target value is implemented as an atomic variable so that | ||
32 | reading the aggregated value does not require any locking mechanism. | ||
33 | |||
26 | 34 | ||
27 | From kernel mode the use of this interface is simple: | 35 | From kernel mode the use of this interface is simple: |
28 | 36 | ||
29 | handle = pm_qos_add_request(param_class, target_value): | 37 | void pm_qos_add_request(handle, param_class, target_value): |
30 | Will insert an element into the list for that identified PM_QOS class with the | 38 | Will insert an element into the list for that identified PM QoS class with the |
31 | target value. Upon change to this list the new target is recomputed and any | 39 | target value. Upon change to this list the new target is recomputed and any |
32 | registered notifiers are called only if the target value is now different. | 40 | registered notifiers are called only if the target value is now different. |
33 | Clients of pm_qos need to save the returned handle. | 41 | Clients of pm_qos need to save the returned handle for future use in other |
42 | pm_qos API functions. | ||
34 | 43 | ||
35 | void pm_qos_update_request(handle, new_target_value): | 44 | void pm_qos_update_request(handle, new_target_value): |
36 | Will update the list element pointed to by the handle with the new target value | 45 | Will update the list element pointed to by the handle with the new target value |
@@ -42,6 +51,20 @@ Will remove the element. After removal it will update the aggregate target and | |||
42 | call the notification tree if the target was changed as a result of removing | 51 | call the notification tree if the target was changed as a result of removing |
43 | the request. | 52 | the request. |
44 | 53 | ||
54 | int pm_qos_request(param_class): | ||
55 | Returns the aggregated value for a given PM QoS class. | ||
56 | |||
57 | int pm_qos_request_active(handle): | ||
58 | Returns if the request is still active, i.e. it has not been removed from a | ||
59 | PM QoS class constraints list. | ||
60 | |||
61 | int pm_qos_add_notifier(param_class, notifier): | ||
62 | Adds a notification callback function to the PM QoS class. The callback is | ||
63 | called when the aggregated value for the PM QoS class is changed. | ||
64 | |||
65 | int pm_qos_remove_notifier(int param_class, notifier): | ||
66 | Removes the notification callback function for the PM QoS class. | ||
67 | |||
45 | 68 | ||
46 | From user mode: | 69 | From user mode: |
47 | Only processes can register a pm_qos request. To provide for automatic | 70 | Only processes can register a pm_qos request. To provide for automatic |
@@ -63,4 +86,63 @@ To remove the user mode request for a target value simply close the device | |||
63 | node. | 86 | node. |
64 | 87 | ||
65 | 88 | ||
89 | 2. PM QoS per-device latency framework | ||
90 | |||
91 | For each device a list of performance requests is maintained along with | ||
92 | an aggregated target value. The aggregated target value is updated with | ||
93 | changes to the request list or elements of the list. Typically the | ||
94 | aggregated target value is simply the max or min of the request values held | ||
95 | in the parameter list elements. | ||
96 | Note: the aggregated target value is implemented as an atomic variable so that | ||
97 | reading the aggregated value does not require any locking mechanism. | ||
98 | |||
99 | |||
100 | From kernel mode the use of this interface is the following: | ||
101 | |||
102 | int dev_pm_qos_add_request(device, handle, value): | ||
103 | Will insert an element into the list for that identified device with the | ||
104 | target value. Upon change to this list the new target is recomputed and any | ||
105 | registered notifiers are called only if the target value is now different. | ||
106 | Clients of dev_pm_qos need to save the handle for future use in other | ||
107 | dev_pm_qos API functions. | ||
108 | |||
109 | int dev_pm_qos_update_request(handle, new_value): | ||
110 | Will update the list element pointed to by the handle with the new target value | ||
111 | and recompute the new aggregated target, calling the notification trees if the | ||
112 | target is changed. | ||
113 | |||
114 | int dev_pm_qos_remove_request(handle): | ||
115 | Will remove the element. After removal it will update the aggregate target and | ||
116 | call the notification trees if the target was changed as a result of removing | ||
117 | the request. | ||
118 | |||
119 | s32 dev_pm_qos_read_value(device): | ||
120 | Returns the aggregated value for a given device's constraints list. | ||
121 | |||
122 | |||
123 | Notification mechanisms: | ||
124 | The per-device PM QoS framework has 2 different and distinct notification trees: | ||
125 | a per-device notification tree and a global notification tree. | ||
126 | |||
127 | int dev_pm_qos_add_notifier(device, notifier): | ||
128 | Adds a notification callback function for the device. | ||
129 | The callback is called when the aggregated value of the device constraints list | ||
130 | is changed. | ||
131 | |||
132 | int dev_pm_qos_remove_notifier(device, notifier): | ||
133 | Removes the notification callback function for the device. | ||
134 | |||
135 | int dev_pm_qos_add_global_notifier(notifier): | ||
136 | Adds a notification callback function in the global notification tree of the | ||
137 | framework. | ||
138 | The callback is called when the aggregated value for any device is changed. | ||
139 | |||
140 | int dev_pm_qos_remove_global_notifier(notifier): | ||
141 | Removes the notification callback function from the global notification tree | ||
142 | of the framework. | ||
143 | |||
144 | |||
145 | From user mode: | ||
146 | No API for user space access to the per-device latency constraints is provided | ||
147 | yet - still under discussion. | ||
66 | 148 | ||
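Both PM QoS frameworks documented above keep a list of requests and an aggregated target value that is "simply the max or min of the request values". The following is a minimal user-space sketch of that aggregation, not the kernel implementation: every sketch_* name is hypothetical, the fixed-size array stands in for the kernel's request list, handles are plain slot indices, and the atomic storage and notifier calls are left out. A return value of 1 from update or remove marks the point where registered notifiers would be called.

```c
#include <string.h>

#define SKETCH_MAX_REQUESTS 8

enum sketch_agg { SKETCH_AGG_MIN, SKETCH_AGG_MAX };

/* Hypothetical stand-in for one PM QoS class or one device's constraints. */
struct sketch_qos {
    enum sketch_agg type;              /* min for latency, max for throughput */
    int default_value;                 /* target when no request is active */
    int value[SKETCH_MAX_REQUESTS];
    int active[SKETCH_MAX_REQUESTS];
    int target;                        /* cached aggregated target value */
};

/* Recompute the target; return 1 if it changed, 0 otherwise. */
static int sketch_recompute(struct sketch_qos *q)
{
    int i, found = 0, agg = q->default_value, changed;

    for (i = 0; i < SKETCH_MAX_REQUESTS; i++) {
        if (!q->active[i])
            continue;
        if (!found ||
            (q->type == SKETCH_AGG_MIN ? q->value[i] < agg
                                       : q->value[i] > agg))
            agg = q->value[i];
        found = 1;
    }
    changed = (agg != q->target);
    q->target = agg;
    return changed;
}

void sketch_qos_init(struct sketch_qos *q, enum sketch_agg type, int def)
{
    memset(q, 0, sizeof(*q));
    q->type = type;
    q->default_value = def;
    q->target = def;
}

/* Insert a request; returns a handle (slot index) or -1 if the list is full. */
int sketch_qos_add(struct sketch_qos *q, int value)
{
    int i;

    for (i = 0; i < SKETCH_MAX_REQUESTS; i++) {
        if (!q->active[i]) {
            q->active[i] = 1;
            q->value[i] = value;
            sketch_recompute(q);
            return i;
        }
    }
    return -1;
}

/* Returns 1 if the target changed, 0 if not, -1 for an inactive handle. */
int sketch_qos_update(struct sketch_qos *q, int handle, int value)
{
    if (handle < 0 || handle >= SKETCH_MAX_REQUESTS || !q->active[handle])
        return -1;
    q->value[handle] = value;
    return sketch_recompute(q);
}

/* Returns 1 if the target changed, 0 if not, -1 for an inactive handle. */
int sketch_qos_remove(struct sketch_qos *q, int handle)
{
    if (handle < 0 || handle >= SKETCH_MAX_REQUESTS || !q->active[handle])
        return -1;
    q->active[handle] = 0;
    return sketch_recompute(q);
}
```

For a latency-style class (SKETCH_AGG_MIN), adding requests of 100 and 50 gives a target of 50; raising the second request to 200 moves the target back to 100, and removing every request restores the default.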
diff --git a/Documentation/power/regulator/machine.txt b/Documentation/power/regulator/machine.txt
index b42419b52e44..ce63af0a8e35 100644
--- a/Documentation/power/regulator/machine.txt
+++ b/Documentation/power/regulator/machine.txt
@@ -16,7 +16,7 @@ initialisation code by creating a struct regulator_consumer_supply for | |||
16 | each regulator. | 16 | each regulator. |
17 | 17 | ||
18 | struct regulator_consumer_supply { | 18 | struct regulator_consumer_supply { |
19 | struct device *dev; /* consumer */ | 19 | const char *dev_name; /* consumer dev_name() */ |
20 | const char *supply; /* consumer supply - e.g. "vcc" */ | 20 | const char *supply; /* consumer supply - e.g. "vcc" */ |
21 | }; | 21 | }; |
22 | 22 | ||
@@ -24,13 +24,13 @@ e.g. for the machine above | |||
24 | 24 | ||
25 | static struct regulator_consumer_supply regulator1_consumers[] = { | 25 | static struct regulator_consumer_supply regulator1_consumers[] = { |
26 | { | 26 | { |
27 | .dev = &platform_consumerB_device.dev, | 27 | .dev_name = "dev_name(consumer B)", |
28 | .supply = "Vcc", | 28 | .supply = "Vcc", |
29 | },}; | 29 | },}; |
30 | 30 | ||
31 | static struct regulator_consumer_supply regulator2_consumers[] = { | 31 | static struct regulator_consumer_supply regulator2_consumers[] = { |
32 | { | 32 | { |
33 | .dev = &platform_consumerA_device.dev, | 33 | .dev_name = "dev_name(consumer A)", |
34 | .supply = "Vcc", | 34 | .supply = "Vcc", |
35 | },}; | 35 | },}; |
36 | 36 | ||
@@ -43,6 +43,7 @@ to their supply regulator :- | |||
43 | 43 | ||
44 | static struct regulator_init_data regulator1_data = { | 44 | static struct regulator_init_data regulator1_data = { |
45 | .constraints = { | 45 | .constraints = { |
46 | .name = "Regulator-1", | ||
46 | .min_uV = 3300000, | 47 | .min_uV = 3300000, |
47 | .max_uV = 3300000, | 48 | .max_uV = 3300000, |
48 | .valid_modes_mask = REGULATOR_MODE_NORMAL, | 49 | .valid_modes_mask = REGULATOR_MODE_NORMAL, |
@@ -51,13 +52,19 @@ static struct regulator_init_data regulator1_data = { | |||
51 | .consumer_supplies = regulator1_consumers, | 52 | .consumer_supplies = regulator1_consumers, |
52 | }; | 53 | }; |
53 | 54 | ||
55 | The name field should be set to something that is usefully descriptive | ||
56 | for the board for configuration of supplies for other regulators and | ||
57 | for use in logging and other diagnostic output. Normally the name | ||
58 | used for the supply rail in the schematic is a good choice. If no | ||
59 | name is provided then the subsystem will choose one. | ||
60 | |||
54 | Regulator-1 supplies power to Regulator-2. This relationship must be registered | 61 | Regulator-1 supplies power to Regulator-2. This relationship must be registered |
55 | with the core so that Regulator-1 is also enabled when Consumer A enables its | 62 | with the core so that Regulator-1 is also enabled when Consumer A enables its |
56 | supply (Regulator-2). The supply regulator is set by the supply_regulator | 63 | supply (Regulator-2). The supply regulator is set by the supply_regulator |
57 | field below:- | 64 | field below and co:- |
58 | 65 | ||
59 | static struct regulator_init_data regulator2_data = { | 66 | static struct regulator_init_data regulator2_data = { |
60 | .supply_regulator = "regulator_name", | 67 | .supply_regulator = "Regulator-1", |
61 | .constraints = { | 68 | .constraints = { |
62 | .min_uV = 1800000, | 69 | .min_uV = 1800000, |
63 | .max_uV = 2000000, | 70 | .max_uV = 2000000, |
diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt
index 6066e3a6b9a9..5336149f831b 100644
--- a/Documentation/power/runtime_pm.txt
+++ b/Documentation/power/runtime_pm.txt
@@ -43,13 +43,18 @@ struct dev_pm_ops { | |||
43 | ... | 43 | ... |
44 | }; | 44 | }; |
45 | 45 | ||
46 | The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks are | 46 | The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks |
47 | executed by the PM core for either the device type, or the class (if the device | 47 | are executed by the PM core for either the power domain, or the device type |
48 | type's struct dev_pm_ops object does not exist), or the bus type (if the | 48 | (if the device power domain's struct dev_pm_ops does not exist), or the class |
49 | device type's and class' struct dev_pm_ops objects do not exist) of the given | 49 | (if the device power domain's and type's struct dev_pm_ops object does not |
50 | device (this allows device types to override callbacks provided by bus types or | 50 | exist), or the bus type (if the device power domain's, type's and class' |
51 | classes if necessary). The bus type, device type and class callbacks are | 51 | struct dev_pm_ops objects do not exist) of the given device, so the priority |
52 | referred to as subsystem-level callbacks in what follows. | 52 | order of callbacks from high to low is that power domain callbacks, device |
53 | type callbacks, class callbacks and bus type callbacks, and the high priority | ||
54 | one will take precedence over low priority one. The bus type, device type and | ||
55 | class callbacks are referred to as subsystem-level callbacks in what follows, | ||
56 | and generally speaking, the power domain callbacks are used for representing | ||
57 | power domains within a SoC. | ||
53 | 58 | ||
54 | By default, the callbacks are always invoked in process context with interrupts | 59 | By default, the callbacks are always invoked in process context with interrupts |
55 | enabled. However, subsystems can use the pm_runtime_irq_safe() helper function | 60 | enabled. However, subsystems can use the pm_runtime_irq_safe() helper function |
@@ -477,12 +482,14 @@ pm_runtime_autosuspend_expiration() | |||
477 | If pm_runtime_irq_safe() has been called for a device then the following helper | 482 | If pm_runtime_irq_safe() has been called for a device then the following helper |
478 | functions may also be used in interrupt context: | 483 | functions may also be used in interrupt context: |
479 | 484 | ||
485 | pm_runtime_idle() | ||
480 | pm_runtime_suspend() | 486 | pm_runtime_suspend() |
481 | pm_runtime_autosuspend() | 487 | pm_runtime_autosuspend() |
482 | pm_runtime_resume() | 488 | pm_runtime_resume() |
483 | pm_runtime_get_sync() | 489 | pm_runtime_get_sync() |
484 | pm_runtime_put_sync() | 490 | pm_runtime_put_sync() |
485 | pm_runtime_put_sync_suspend() | 491 | pm_runtime_put_sync_suspend() |
492 | pm_runtime_put_sync_autosuspend() | ||
486 | 493 | ||
487 | 5. Runtime PM Initialization, Device Probing and Removal | 494 | 5. Runtime PM Initialization, Device Probing and Removal |
488 | 495 | ||
@@ -782,6 +789,16 @@ will behave normally, not taking the autosuspend delay into account. | |||
782 | Similarly, if the power.use_autosuspend field isn't set then the autosuspend | 789 | Similarly, if the power.use_autosuspend field isn't set then the autosuspend |
783 | helper functions will behave just like the non-autosuspend counterparts. | 790 | helper functions will behave just like the non-autosuspend counterparts. |
784 | 791 | ||
792 | Under some circumstances a driver or subsystem may want to prevent a device | ||
793 | from autosuspending immediately, even though the usage counter is zero and the | ||
794 | autosuspend delay time has expired. If the ->runtime_suspend() callback | ||
795 | returns -EAGAIN or -EBUSY, and if the next autosuspend delay expiration time is | ||
796 | in the future (as it normally would be if the callback invoked | ||
797 | pm_runtime_mark_last_busy()), the PM core will automatically reschedule the | ||
798 | autosuspend. The ->runtime_suspend() callback can't do this rescheduling | ||
799 | itself because no suspend requests of any kind are accepted while the device is | ||
800 | suspending (i.e., while the callback is running). | ||
801 | |||
785 | The implementation is well suited for asynchronous use in interrupt contexts. | 802 | The implementation is well suited for asynchronous use in interrupt contexts. |
786 | However such use inevitably involves races, because the PM core can't | 803 | However such use inevitably involves races, because the PM core can't |
787 | synchronize ->runtime_suspend() callbacks with the arrival of I/O requests. | 804 | synchronize ->runtime_suspend() callbacks with the arrival of I/O requests. |
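The first runtime_pm.txt hunk above states a priority order for the runtime PM callbacks: power domain, then device type, then class, then bus type, each level being used only if no struct dev_pm_ops object exists at the higher levels. The following is a minimal user-space sketch of that selection rule. The sketch_* types and the function are hypothetical stand-ins, not the PM core's data structures or code.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct dev_pm_ops, reduced to one callback. */
struct sketch_pm_ops {
    int (*runtime_suspend)(void);
};

/* Hypothetical stand-in for a device with its four possible ops sources. */
struct sketch_device {
    const struct sketch_pm_ops *pm_domain;  /* highest priority */
    const struct sketch_pm_ops *type;
    const struct sketch_pm_ops *class_ops;
    const struct sketch_pm_ops *bus;        /* lowest priority */
};

/*
 * Return the ops object whose callbacks would be used for the device,
 * following the documented order, or NULL if none exists.
 */
const struct sketch_pm_ops *sketch_pick_ops(const struct sketch_device *dev)
{
    if (dev->pm_domain)
        return dev->pm_domain;
    if (dev->type)
        return dev->type;
    if (dev->class_ops)
        return dev->class_ops;
    return dev->bus;
}
```

A device that has both device type and bus type ops therefore gets the device type callbacks, and attaching a power domain overrides all three subsystem-level sources.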
diff --git a/Documentation/power/suspend-and-cpuhotplug.txt b/Documentation/power/suspend-and-cpuhotplug.txt
new file mode 100644
index 000000000000..f28f9a6f0347
--- /dev/null
+++ b/Documentation/power/suspend-and-cpuhotplug.txt
@@ -0,0 +1,275 @@ | |||
1 | Interaction of Suspend code (S3) with the CPU hotplug infrastructure | ||
2 | |||
3 | (C) 2011 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> | ||
4 | |||
5 | |||
6 | I. How does the regular CPU hotplug code differ from how the Suspend-to-RAM | ||
7 | infrastructure uses it internally? And where do they share common code? | ||
8 | |||
9 | Well, a picture is worth a thousand words... So ASCII art follows :-) | ||
10 | |||
11 | [This depicts the current design in the kernel, and focusses only on the | ||
12 | interactions involving the freezer and CPU hotplug and also tries to explain | ||
13 | the locking involved. It outlines the notifications involved as well. | ||
14 | But please note that here, only the call paths are illustrated, with the aim | ||
15 | of describing where they take different paths and where they share code. | ||
16 | What happens when regular CPU hotplug and Suspend-to-RAM race with each other | ||
17 | is not depicted here.] | ||
18 | |||
19 | On a high level, the suspend-resume cycle goes like this: | ||
20 | |||
21 | |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | | ||
22 | |tasks | | cpus | | | | cpus | |tasks| | ||
23 | |||
24 | |||
25 | More details follow: | ||
26 | |||
27 | Suspend call path | ||
28 | ----------------- | ||
29 | |||
30 | Write 'mem' to | ||
31 | /sys/power/state | ||
32 | sysfs file | ||
33 | | | ||
34 | v | ||
35 | Acquire pm_mutex lock | ||
36 | | | ||
37 | v | ||
38 | Send PM_SUSPEND_PREPARE | ||
39 | notifications | ||
40 | | | ||
41 | v | ||
42 | Freeze tasks | ||
43 | | | ||
44 | | | ||
45 | v | ||
46 | disable_nonboot_cpus() | ||
47 | /* start */ | ||
48 | | | ||
49 | v | ||
50 | Acquire cpu_add_remove_lock | ||
51 | | | ||
52 | v | ||
53 | Iterate over CURRENTLY | ||
54 | online CPUs | ||
55 | | | ||
56 | | | ||
57 | | ---------- | ||
58 | v | L | ||
59 | ======> _cpu_down() | | ||
60 | | [This takes cpuhotplug.lock | | ||
61 | Common | before taking down the CPU | | ||
62 | code | and releases it when done] | O | ||
63 | | While it is at it, notifications | | ||
64 | | are sent when notable events occur, | | ||
65 | ======> by running all registered callbacks. | | ||
66 | | | O | ||
67 | | | | ||
68 | | | | ||
69 | v | | ||
70 | Note down these cpus in | P | ||
71 | frozen_cpus mask ---------- | ||
72 | | | ||
73 | v | ||
74 | Disable regular cpu hotplug | ||
75 | by setting cpu_hotplug_disabled=1 | ||
76 | | | ||
77 | v | ||
78 | Release cpu_add_remove_lock | ||
79 | | | ||
80 | v | ||
81 | /* disable_nonboot_cpus() complete */ | ||
82 | | | ||
83 | v | ||
84 | Do suspend | ||
85 | |||
86 | |||
87 | |||
88 | Resuming back is likewise, with the counterparts being (in the order of | ||
89 | execution during resume): | ||
90 | * enable_nonboot_cpus() which involves: | ||
91 | | Acquire cpu_add_remove_lock | ||
92 | | Reset cpu_hotplug_disabled to 0, thereby enabling regular cpu hotplug | ||
93 | | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] | ||
94 | | Release cpu_add_remove_lock | ||
95 | v | ||
96 | |||
97 | * thaw tasks | ||
98 | * send PM_POST_SUSPEND notifications | ||
99 | * Release pm_mutex lock. | ||
100 | |||
101 | |||
102 | It is to be noted here that the pm_mutex lock is acquired at the very | ||
103 | beginning, when we are just starting out to suspend, and then released only | ||
104 | after the entire cycle is complete (i.e., suspend + resume). | ||
105 | |||
106 | |||
107 | |||
108 | Regular CPU hotplug call path | ||
109 | ----------------------------- | ||
110 | |||
111 | Write 0 (or 1) to | ||
112 | /sys/devices/system/cpu/cpu*/online | ||
113 | sysfs file | ||
114 | | | ||
115 | | | ||
116 | v | ||
117 | cpu_down() | ||
118 | | | ||
119 | v | ||
120 | Acquire cpu_add_remove_lock | ||
121 | | | ||
122 | v | ||
123 | If cpu_hotplug_disabled is 1 | ||
124 | return gracefully | ||
125 | | | ||
126 | | | ||
127 | v | ||
128 | ======> _cpu_down() | ||
129 | | [This takes cpuhotplug.lock | ||
130 | Common | before taking down the CPU | ||
131 | code | and releases it when done] | ||
132 | | While it is at it, notifications | ||
133 | | are sent when notable events occur, | ||
134 | ======> by running all registered callbacks. | ||
135 | | | ||
136 | | | ||
137 | v | ||
138 | Release cpu_add_remove_lock | ||
139 | [That's it!, for | ||
140 | regular CPU hotplug] | ||
141 | |||
142 | |||
143 | |||
144 | So, as can be seen from the two diagrams (the parts marked as "Common code"), | ||
145 | regular CPU hotplug and the suspend code path converge at the _cpu_down() and | ||
146 | _cpu_up() functions. They differ in the arguments passed to these functions, | ||
147 | in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' | ||
148 | argument. But during suspend, since the tasks are already frozen by the time | ||
149 | the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called | ||
150 | with the 'tasks_frozen' argument set to 1. | ||
151 | [See below for some known issues regarding this.] | ||
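The two call paths described above can be modelled in user space to show where they share code and where they differ. The following is a minimal sketch, not kernel code: all sketch_* names are hypothetical, CPU 0 is assumed to be the boot CPU, CPUs are bits in a 32-bit mask, and the locking (pm_mutex, cpu_add_remove_lock, cpuhotplug.lock) and notifications are omitted.

```c
#include <stdint.h>

#define SKETCH_NR_CPUS 4

struct sketch_hotplug {
    uint32_t online_mask;       /* bit n set: CPU n is online */
    uint32_t frozen_cpus;       /* CPUs taken down by the suspend path */
    int cpu_hotplug_disabled;   /* 1 while regular hotplug is blocked */
    int last_tasks_frozen;      /* argument seen by the common code */
};

/* Common code: stands in for _cpu_down(). */
static int sketch_do_cpu_down(struct sketch_hotplug *s, int cpu, int tasks_frozen)
{
    s->last_tasks_frozen = tasks_frozen;
    s->online_mask &= ~(UINT32_C(1) << cpu);
    return 0;
}

/* Common code: stands in for _cpu_up(). */
static int sketch_do_cpu_up(struct sketch_hotplug *s, int cpu, int tasks_frozen)
{
    s->last_tasks_frozen = tasks_frozen;
    s->online_mask |= UINT32_C(1) << cpu;
    return 0;
}

/* Regular hotplug path: refuses while disabled, passes tasks_frozen = 0. */
int sketch_cpu_down(struct sketch_hotplug *s, int cpu)
{
    if (s->cpu_hotplug_disabled)
        return -1;
    return sketch_do_cpu_down(s, cpu, 0);
}

/* Suspend path: offline every online non-boot CPU with tasks_frozen = 1. */
void sketch_disable_nonboot_cpus(struct sketch_hotplug *s)
{
    int cpu;

    s->frozen_cpus = 0;
    for (cpu = 1; cpu < SKETCH_NR_CPUS; cpu++) {    /* CPU 0: assumed boot CPU */
        if (!(s->online_mask & (UINT32_C(1) << cpu)))
            continue;
        sketch_do_cpu_down(s, cpu, 1);
        s->frozen_cpus |= UINT32_C(1) << cpu;       /* note it down */
    }
    s->cpu_hotplug_disabled = 1;
}

/* Resume path: re-enable regular hotplug, then online the noted CPUs. */
void sketch_enable_nonboot_cpus(struct sketch_hotplug *s)
{
    int cpu;

    s->cpu_hotplug_disabled = 0;
    for (cpu = 1; cpu < SKETCH_NR_CPUS; cpu++) {
        if (s->frozen_cpus & (UINT32_C(1) << cpu))
            sketch_do_cpu_up(s, cpu, 1);
    }
    s->frozen_cpus = 0;
}
```

In this model a CPU that was already offline before suspend is not recorded in frozen_cpus, so the resume path leaves it offline, and a regular hotplug request made between the two suspend-path calls is rejected.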
152 | |||
153 | |||
154 | Important files and functions/entry points: | ||
155 | ------------------------------------------ | ||
156 | |||
157 | kernel/power/process.c : freeze_processes(), thaw_processes() | ||
158 | kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() | ||
159 | kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), [disable|enable]_nonboot_cpus() | ||
160 | |||
161 | |||
162 | |||
163 | II. What are the issues involved in CPU hotplug? | ||
164 | ------------------------------------------- | ||
165 | |||
166 | There are some interesting situations involving CPU hotplug and microcode | ||
167 | update on the CPUs, as discussed below: | ||
168 | |||
169 | [Please bear in mind that the kernel requests the microcode images from | ||
170 | userspace, using the request_firmware() function defined in | ||
171 | drivers/base/firmware_class.c] | ||
172 | |||
173 | |||
174 | a. When all the CPUs are identical: | ||
175 | |||
176 | This is the most common situation and it is quite straightforward: we want | ||
177 | to apply the same microcode revision to each of the CPUs. | ||
178 | To give an example of x86, the collect_cpu_info() function defined in | ||
179 | arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU | ||
180 | and thereby in applying the correct microcode revision to it. | ||
181 | But note that the kernel does not maintain a common microcode image for | ||
182 | all the CPUs, in order to handle case 'b' described below. | ||
183 | |||
184 | |||
185 | b. When some of the CPUs are different from the rest: | ||
186 | |||
187 | In this case since we probably need to apply different microcode revisions | ||
188 | to different CPUs, the kernel maintains a copy of the correct microcode | ||
189 | image for each CPU (after appropriate CPU type/model discovery using | ||
190 | functions such as collect_cpu_info()). | ||
191 | |||
192 | |||
193 | c. When a CPU is physically hot-unplugged and a new (and possibly different | ||
194 | type of) CPU is hot-plugged into the system: | ||
195 | |||
196 | In the current design of the kernel, whenever a CPU is taken offline during | ||
197 | a regular CPU hotplug operation, upon receiving the CPU_DEAD notification | ||
198 | (which is sent by the CPU hotplug code), the microcode update driver's | ||
199 | callback for that event reacts by freeing the kernel's copy of the | ||
200 | microcode image for that CPU. | ||
201 | |||
202 | Hence, when a new CPU is brought online, since the kernel finds that it | ||
203 | doesn't have the microcode image, it does the CPU type/model discovery | ||
204 | afresh and then asks userspace for the appropriate microcode image | ||
205 | for that CPU, which is subsequently applied. | ||
206 | |||
207 | For example, in x86, the mc_cpu_callback() function (which is the microcode | ||
208 | update driver's callback registered for CPU hotplug events) calls | ||
209 | microcode_update_cpu() which would call microcode_init_cpu() in this case, | ||
210 | instead of microcode_resume_cpu() when it finds that the kernel doesn't | ||
211 | have a valid microcode image. This ensures that the CPU type/model | ||
212 | discovery is performed and the right microcode is applied to the CPU after | ||
213 | getting it from userspace. | ||
214 | |||
215 | |||
216 | d. Handling microcode update during suspend/hibernate: | ||
217 | |||
218 | Strictly speaking, during a CPU hotplug operation which does not involve | ||
219 | physically removing or inserting CPUs, the CPUs are not actually powered | ||
220 | off during a CPU offline. They are just put into the lowest possible C-states. | ||
221 | Hence, in such a case, it is not really necessary to re-apply microcode | ||
222 | when the CPUs are brought back online, since they wouldn't have lost the | ||
223 | image during the CPU offline operation. | ||
224 | |||
225 | This is the usual scenario encountered during a resume after a suspend. | ||
226 | However, in the case of hibernation, since all the CPUs are completely | ||
227 | powered off, during restore it becomes necessary to apply the microcode | ||
228 | images to all the CPUs. | ||
229 | |||
230 | [Note that we don't expect someone to physically pull out nodes and insert | ||
231 | nodes with a different type of CPUs in-between a suspend-resume or a | ||
232 | hibernate/restore cycle.] | ||
233 | |||
234 | In the current design of the kernel however, during a CPU offline operation | ||
235 | as part of the suspend/hibernate cycle (the CPU_DEAD_FROZEN notification), | ||
236 | the existing copy of microcode image in the kernel is not freed up. | ||
237 | And during the CPU online operations (during resume/restore), since the | ||
238 | kernel finds that it already has copies of the microcode images for all the | ||
239 | CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU | ||
240 | type/model and the need for validating whether the microcode revisions are | ||
241 | right for the CPUs or not (due to the above assumption that physical CPU | ||
242 | hotplug will not be done in-between suspend/resume or hibernate/restore | ||
243 | cycles). | ||
244 | |||
245 | |||
246 | III. Are there any known problems when regular CPU hotplug and suspend race | ||
247 | with each other? | ||
248 | |||
249 | Yes, they are listed below: | ||
250 | |||
251 | 1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to | ||
252 | the _cpu_down() and _cpu_up() functions is *always* 0. | ||
253 | This might not reflect the true current state of the system, since the | ||
254 | tasks could have been frozen by an out-of-band event such as a suspend | ||
255 | operation in progress. Hence, it will lead to wrong notifications being | ||
256 | sent during the cpu online/offline events (e.g., CPU_ONLINE notification | ||
257 | instead of CPU_ONLINE_FROZEN) which in turn will lead to execution of | ||
258 | inappropriate code by the callbacks registered for such CPU hotplug events. | ||
259 | |||
260 | 2. If a regular CPU hotplug stress test happens to race with the freezer due | ||
261 | to a suspend operation in progress at the same time, then we could hit the | ||
262 | situation described below: | ||
263 | |||
264 | * A regular cpu online operation continues its journey from userspace | ||
265 | into the kernel, since the freezing has not yet begun. | ||
266 | * Then the freezer gets to work and freezes userspace. | ||
267 | * If cpu online has not yet completed the microcode update stuff by now, | ||
268 | it will now start waiting on the frozen userspace in the | ||
269 | TASK_UNINTERRUPTIBLE state, in order to get the microcode image. | ||
270 | * Now the freezer continues and tries to freeze the remaining tasks. But | ||
271 | due to this wait mentioned above, the freezer won't be able to freeze | ||
272 | the cpu online hotplug task and hence freezing of tasks fails. | ||
273 | |||
274 | As a result of this task freezing failure, the suspend operation gets | ||
275 | aborted. | ||
diff --git a/Documentation/power/userland-swsusp.txt b/Documentation/power/userland-swsusp.txt index 1101bee4e822..0e870825c1b9 100644 --- a/Documentation/power/userland-swsusp.txt +++ b/Documentation/power/userland-swsusp.txt | |||
@@ -77,7 +77,8 @@ SNAPSHOT_SET_SWAP_AREA - set the resume partition and the offset (in <PAGE_SIZE> | |||
77 | resume_swap_area, as defined in kernel/power/suspend_ioctls.h, | 77 | resume_swap_area, as defined in kernel/power/suspend_ioctls.h, |
78 | containing the resume device specification and the offset); for swap | 78 | containing the resume device specification and the offset); for swap |
79 | partitions the offset is always 0, but it is different from zero for | 79 | partitions the offset is always 0, but it is different from zero for |
80 | swap files (see Documentation/swsusp-and-swap-files.txt for details). | 80 | swap files (see Documentation/power/swsusp-and-swap-files.txt for |
81 | details). | ||
81 | 82 | ||
82 | SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support, | 83 | SNAPSHOT_PLATFORM_SUPPORT - enable/disable the hibernation platform support, |
83 | depending on the argument value (enable, if the argument is nonzero) | 84 | depending on the argument value (enable, if the argument is nonzero) |