author Rafael J. Wysocki <rjw@sisk.pl> 2010-03-26 18:53:42 -0400
committer Rafael J. Wysocki <rjw@sisk.pl> 2010-05-10 17:08:16 -0400
commit 624f6ec871886525ca19cf7841f918da91d4315e
tree 9728d0ab5f3715cb4567069553690916e567d985 /Documentation/power/devices.txt
parent 240c7337a4cd3d91b196c5ef97ad461b3a22fa09
PM: Update device power management document

The device PM document, Documentation/power/devices.txt, is badly
outdated and requires total rework to fit the current design of the PM
framework. Make it more up to date.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Reviewed-by: Randy Dunlap <randy.dunlap@oracle.com>
Diffstat (limited to 'Documentation/power/devices.txt')
-rw-r--r-- Documentation/power/devices.txt 698
1 files changed, 431 insertions, 267 deletions
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt
index c9abbd86bc18..10018d19e0bf 100644
--- a/Documentation/power/devices.txt
+++ b/Documentation/power/devices.txt
@@ -1,3 +1,7 @@
1Device Power Management
2
3(C) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc.
4
1Most of the code in Linux is device drivers, so most of the Linux power 5Most of the code in Linux is device drivers, so most of the Linux power
2management code is also driver-specific. Most drivers will do very little; 6management code is also driver-specific. Most drivers will do very little;
3others, especially for platforms with small batteries (like cell phones), 7others, especially for platforms with small batteries (like cell phones),
@@ -25,31 +29,39 @@ states:
25 them without loss of data. 29 them without loss of data.
26 30
27 Some drivers can manage hardware wakeup events, which make the system 31 Some drivers can manage hardware wakeup events, which make the system
28 leave that low-power state. This feature may be disabled using the 32 leave that low-power state. This feature may be enabled or disabled
29 relevant /sys/devices/.../power/wakeup file; enabling it may cost some 33 using the relevant /sys/devices/.../power/wakeup file (for Ethernet
30 power usage, but let the whole system enter low power states more often. 34 drivers the ioctl interface used by ethtool may also be used for this
35 purpose); enabling it may cost some power usage, but let the whole
36 system enter low power states more often.
31 37
32 Runtime Power Management model: 38 Runtime Power Management model:
33 Drivers may also enter low power states while the system is running, 39 Devices may also be put into low power states while the system is
34 independently of other power management activity. Upstream drivers 40 running, independently of other power management activity in principle.
35 will normally not know (or care) if the device is in some low power 41 However, devices are not generally independent of each other (for
36 state when issuing requests; the driver will auto-resume anything 42 example, parent device cannot be suspended unless all of its child
37 that's needed when it gets a request. 43 devices have been suspended). Moreover, depending on the bus type the
38 44 device is on, it may be necessary to carry out some bus-specific
39 This doesn't have, or need much infrastructure; it's just something you 45 operations on the device for this purpose. Also, devices put into low
40 should do when writing your drivers. For example, clk_disable() unused 46 power states at run time may require special handling during system-wide
41 clocks as part of minimizing power drain for currently-unused hardware. 47 power transitions, like suspend to RAM.
42 Of course, sometimes clusters of drivers will collaborate with each 48
43 other, which could involve task-specific power management. 49 For these reasons not only the device driver itself, but also the
50 appropriate subsystem (bus type, device type or device class) driver
51 and the PM core are involved in the runtime power management of devices.
52 Like in the system sleep power management case, they need to collaborate
53 by implementing various role-specific suspend and resume methods, so
54 that the hardware is cleanly powered down and reactivated without data
55 or service loss.
44 56
45There's not a lot to be said about those low power states except that they 57There's not a lot to be said about those low power states except that they
46are very system-specific, and often device-specific. Also, that if enough 58are very system-specific, and often device-specific. Also, that if enough
47drivers put themselves into low power states (at "runtime"), the effect may be 59devices have been put into low power states (at "run time"), the effect may be
48the same as entering some system-wide low-power state (system sleep) ... and 60very similar to entering some system-wide low-power state (system sleep) ... and
49that synergies exist, so that several drivers using runtime pm might put the 61that synergies exist, so that several drivers using runtime PM might put the
50system into a state where even deeper power saving options are available. 62system into a state where even deeper power saving options are available.
51 63
52Most suspended devices will have quiesced all I/O: no more DMA or irqs, no 64Most suspended devices will have quiesced all I/O: no more DMA or IRQs, no
53more data read or written, and requests from upstream drivers are no longer 65more data read or written, and requests from upstream drivers are no longer
54accepted. A given bus or platform may have different requirements though. 66accepted. A given bus or platform may have different requirements though.
55 67
@@ -60,34 +72,67 @@ or removal (for PCMCIA, MMC/SD, USB, and so on).
60 72
61Interfaces for Entering System Sleep States 73Interfaces for Entering System Sleep States
62=========================================== 74===========================================
63Most of the programming interfaces a device driver needs to know about 75There are programming interfaces provided for subsystem (bus type, device type,
64relate to that first model: entering a system-wide low power state, 76device class) and device drivers in order to allow them to participate in the
65rather than just minimizing power consumption by one device. 77power management of devices they are concerned with. They cover the system
78sleep power management as well as the runtime power management of devices.
79
80
81Device Power Management Operations
82----------------------------------
83Device power management operations, at the subsystem level as well as at the
84device driver level, are implemented by defining and populating objects of type
85struct dev_pm_ops:
86
87struct dev_pm_ops {
88 int (*prepare)(struct device *dev);
89 void (*complete)(struct device *dev);
90 int (*suspend)(struct device *dev);
91 int (*resume)(struct device *dev);
92 int (*freeze)(struct device *dev);
93 int (*thaw)(struct device *dev);
94 int (*poweroff)(struct device *dev);
95 int (*restore)(struct device *dev);
96 int (*suspend_noirq)(struct device *dev);
97 int (*resume_noirq)(struct device *dev);
98 int (*freeze_noirq)(struct device *dev);
99 int (*thaw_noirq)(struct device *dev);
100 int (*poweroff_noirq)(struct device *dev);
101 int (*restore_noirq)(struct device *dev);
102 int (*runtime_suspend)(struct device *dev);
103 int (*runtime_resume)(struct device *dev);
104 int (*runtime_idle)(struct device *dev);
105};
66 106
107This structure is defined in include/linux/pm.h and the methods included in it
108are also described in that file. Their roles will be explained in what follows.
109For now, it should be sufficient to remember that the last three of them are
110specific to runtime power management, while the remaining ones are used during
111system-wide power transitions.
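
As a rough illustration of how a driver might populate such an object, here is a minimal userland sketch. It is not kernel code: struct device and struct dev_pm_ops below are cut-down stand-ins, and foo_suspend(), foo_resume(), foo_pm_ops and call_pm_op() are hypothetical names invented for this example. It assumes that a callback left as NULL is simply treated as "nothing to do" by the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Userland stand-in for the kernel's struct device; NOT the real definition. */
struct device {
	const char *name;
	int suspended;		/* hypothetical driver state, for illustration only */
};

/* Cut-down stand-in holding a subset of the callbacks listed above. */
struct dev_pm_ops {
	int (*suspend)(struct device *dev);
	int (*resume)(struct device *dev);
	int (*runtime_suspend)(struct device *dev);
	int (*runtime_resume)(struct device *dev);
};

/* Hypothetical driver callback: quiesce the device, then mark it suspended. */
static int foo_suspend(struct device *dev)
{
	dev->suspended = 1;
	return 0;
}

/* Hypothetical driver callback: bring the device back to full function. */
static int foo_resume(struct device *dev)
{
	dev->suspended = 0;
	return 0;
}

/* Only the system sleep callbacks are set; the runtime ones stay NULL. */
static const struct dev_pm_ops foo_pm_ops = {
	.suspend = foo_suspend,
	.resume  = foo_resume,
};

/* Invoke an optional callback; a NULL callback counts as success (assumption). */
static int call_pm_op(int (*op)(struct device *), struct device *dev)
{
	assert(dev != NULL);
	return op ? op(dev) : 0;
}
```

A caller would then use, for example, call_pm_op(foo_pm_ops.suspend, &dev) and get 0 back on success. Calling it with the unset foo_pm_ops.runtime_suspend also returns 0 and leaves the device untouched.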
67 112
68Bus Driver Methods 113There also is an "old" or "legacy", deprecated way of implementing power
69------------------ 114management operations available at least for some subsystems. This approach
70The core methods to suspend and resume devices reside in struct bus_type. 115does not use struct dev_pm_ops objects and it only is suitable for implementing
71These are mostly of interest to people writing infrastructure for busses 116system sleep power management methods. Therefore it is not described in this
72like PCI or USB, or because they define the primitives that device drivers 117document, so please refer directly to the source code for more information about
73may need to apply in domain-specific ways to their devices: 118it.
74 119
75struct bus_type { 120
76 ... 121Subsystem-Level Methods
77 int (*suspend)(struct device *dev, pm_message_t state); 122-----------------------
78 int (*resume)(struct device *dev); 123The core methods to suspend and resume devices reside in struct dev_pm_ops
79}; 124pointed to by the pm member of struct bus_type, struct device_type and
125struct class. They are mostly of interest to the people writing infrastructure
126for buses, like PCI or USB, or device type and device class drivers.
80 127
81Bus drivers implement those methods as appropriate for the hardware and 128Bus drivers implement these methods as appropriate for the hardware and
82the drivers using it; PCI works differently from USB, and so on. Not many 129the drivers using it; PCI works differently from USB, and so on. Not many
83people write bus drivers; most driver code is a "device driver" that 130people write subsystem-level drivers; most driver code is a "device driver" that
84builds on top of bus-specific framework code. 131builds on top of bus-specific framework code.
85 132
86For more information on these driver calls, see the description later; 133For more information on these driver calls, see the description later;
87they are called in phases for every device, respecting the parent-child 134they are called in phases for every device, respecting the parent-child
88sequencing in the driver model tree. Note that as this is being written, 135sequencing in the driver model tree.
89only the suspend() and resume() are widely available; not many bus drivers
90leverage all of those phases, or pass them down to lower driver levels.
91 136
92 137
93/sys/devices/.../power/wakeup files 138/sys/devices/.../power/wakeup files
@@ -95,7 +140,7 @@ leverage all of those phases, or pass them down to lower driver levels.
95All devices in the driver model have two flags to control handling of 140All devices in the driver model have two flags to control handling of
96wakeup events, which are hardware signals that can force the device and/or 141wakeup events, which are hardware signals that can force the device and/or
97system out of a low power state. These are initialized by bus or device 142system out of a low power state. These are initialized by bus or device
98driver code using device_init_wakeup(dev,can_wakeup). 143driver code using device_init_wakeup().
99 144
100The "can_wakeup" flag just records whether the device (and its driver) can 145The "can_wakeup" flag just records whether the device (and its driver) can
101physically support wakeup events. When that flag is clear, the sysfs 146physically support wakeup events. When that flag is clear, the sysfs
@@ -103,64 +148,44 @@ physically support wakeup events. When that flag is clear, the sysfs
103 148
104For devices that can issue wakeup events, a separate flag controls whether 149For devices that can issue wakeup events, a separate flag controls whether
105that device should try to use its wakeup mechanism. The initial value of 150that device should try to use its wakeup mechanism. The initial value of
106device_may_wakeup() will be true, so that the device's "wakeup" file holds 151device_may_wakeup() will be false for the majority of devices, except for
107the value "enabled". Userspace can change that to "disabled" so that 152power buttons, keyboards, and Ethernet adapters whose WoL (wake-on-LAN) feature
108device_may_wakeup() returns false; or change it back to "enabled" (so that 153has been set up with ethtool. Thus in the majority of cases the device's
109it returns true again). 154"wakeup" file will initially hold the value "disabled". Userspace can change
110 155that to "enabled", so that device_may_wakeup() returns true, or change it back
111 156to "disabled", so that it returns false again.
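
The relationship between the two flags and the "wakeup" file can be sketched in userland as follows. This is only a model under stated assumptions: struct device is a two-bit stand-in, and mock_may_wakeup(), mock_wakeup_store() and mock_wakeup_show() are hypothetical helpers, not the kernel's sysfs code. The sketch assumes that writes to a device that cannot wake the system are rejected, models only wakeup-capable devices on the read side, and ignores the trailing newline a real sysfs write would carry.

```c
#include <assert.h>
#include <string.h>

/* Two-flag stand-in for the wakeup state kept per device; NOT the kernel type. */
struct device {
	unsigned int can_wakeup:1;	/* hardware and driver can signal wakeup */
	unsigned int should_wakeup:1;	/* policy: wakeup is enabled */
};

/* Mirrors the idea behind device_may_wakeup(): capable and enabled. */
static int mock_may_wakeup(const struct device *dev)
{
	return dev->can_wakeup && dev->should_wakeup;
}

/* Model of user space writing "enabled" or "disabled" to the wakeup file. */
static int mock_wakeup_store(struct device *dev, const char *buf)
{
	assert(dev != NULL && buf != NULL);
	if (!dev->can_wakeup)
		return -1;	/* assumption: reject writes for incapable devices */
	if (strcmp(buf, "enabled") == 0)
		dev->should_wakeup = 1;
	else if (strcmp(buf, "disabled") == 0)
		dev->should_wakeup = 0;
	else
		return -1;	/* unknown keyword */
	return 0;
}

/* Model of reading the wakeup file of a wakeup-capable device. */
static const char *mock_wakeup_show(const struct device *dev)
{
	assert(dev != NULL && dev->can_wakeup);
	return dev->should_wakeup ? "enabled" : "disabled";
}
```

In this model a capable device starts out reading "disabled", and mock_may_wakeup() returns true only after "enabled" has been stored.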
112EXAMPLE: PCI Device Driver Methods 157
113----------------------------------- 158
114PCI framework software calls these methods when the PCI device driver bound 159/sys/devices/.../power/control files
115to a device device has provided them: 160------------------------------------
116 161All devices in the driver model have a flag to control the desired behavior of
117struct pci_driver { 162its driver with respect to runtime power management. This flag, called
118 ... 163runtime_auto, is initialized by the bus type (or generally subsystem) code using
119 int (*suspend)(struct pci_device *pdev, pm_message_t state); 164pm_runtime_allow() or pm_runtime_forbid(), depending on whether or not the
120 int (*suspend_late)(struct pci_device *pdev, pm_message_t state); 165driver is supposed to power manage the device at run time by default,
121 166respectively.
122 int (*resume_early)(struct pci_device *pdev); 167
123 int (*resume)(struct pci_device *pdev); 168This setting may be adjusted by user space by writing either "on" or "auto" to
124}; 169the device's "control" file. If "auto" is written, the device's runtime_auto
125 170flag will be set and the driver will be allowed to power manage the device if
126Drivers will implement those methods, and call PCI-specific procedures 171capable of doing that. If "on" is written, the driver is not allowed to power
127like pci_set_power_state(), pci_enable_wake(), pci_save_state(), and 172manage the device which in turn is supposed to remain in the full power state at
128pci_restore_state() to manage PCI-specific mechanisms. (PCI config space 173run time. User space can check the current value of the runtime_auto flag by
129could be saved during driver probe, if it weren't for the fact that some 174reading from the device's "control" file.
130systems rely on userspace tweaking using setpci.) Devices are suspended 175
131before their bridges enter low power states, and likewise bridges resume 176The device's runtime_auto flag has no effect on the handling of system-wide
132before their devices. 177power transitions by its driver. In particular, the device can (and in the
133 178majority of cases should and will) be put into a low power state during a
134 179system-wide transition to a sleep state (like "suspend-to-RAM") even though its
135Upper Layers of Driver Stacks 180runtime_auto flag is unset (in which case its "control" file contains "on").
136----------------------------- 181
137Device drivers generally have at least two interfaces, and the methods 182For more information about the runtime power management framework for devices
138sketched above are the ones which apply to the lower level (nearer PCI, USB, 183refer to Documentation/power/runtime_pm.txt.
139or other bus hardware). The network and block layers are examples of upper
140level interfaces, as is a character device talking to userspace.
141
142Power management requests normally need to flow through those upper levels,
143which often use domain-oriented requests like "blank that screen". In
144some cases those upper levels will have power management intelligence that
145relates to end-user activity, or other devices that work in cooperation.
146
147When those interfaces are structured using class interfaces, there is a
148standard way to have the upper layer stop issuing requests to a given
149class device (and restart later):
150
151struct class {
152 ...
153 int (*suspend)(struct device *dev, pm_message_t state);
154 int (*resume)(struct device *dev);
155};
156
157Those calls are issued in specific phases of the process by which the
158system enters a low power "suspend" state, or resumes from it.
159 184
160 185
161Calling Drivers to Enter System Sleep States 186Calling Drivers to Enter System Sleep States
162============================================ 187============================================
163When the system enters a low power state, each device's driver is asked 188When the system goes into a sleep state, each device's driver is asked
164to suspend the device by putting it into state compatible with the target 189to suspend the device by putting it into state compatible with the target
165system state. That's usually some version of "off", but the details are 190system state. That's usually some version of "off", but the details are
166system-specific. Also, wakeup-enabled devices will usually stay partly 191system-specific. Also, wakeup-enabled devices will usually stay partly
@@ -175,14 +200,13 @@ and then turn its hardware as "off" as possible with late_suspend. The
175matching resume calls would then completely reinitialize the hardware 200matching resume calls would then completely reinitialize the hardware
176before reactivating its class I/O queues. 201before reactivating its class I/O queues.
177 202
178More power-aware drivers drivers will use more than one device low power 203More power-aware drivers might prepare the devices for triggering system wakeup
179state, either at runtime or during system sleep states, and might trigger 204events.
180system wakeup events.
181 205
182 206
183Call Sequence Guarantees 207Call Sequence Guarantees
184------------------------ 208------------------------
185To ensure that bridges and similar links needed to talk to a device are 209To ensure that bridges and similar links needing to talk to a device are
186available when the device is suspended or resumed, the device tree is 210available when the device is suspended or resumed, the device tree is
187walked in a bottom-up order to suspend devices. A top-down order is 211walked in a bottom-up order to suspend devices. A top-down order is
188used to resume those devices. 212used to resume those devices.
@@ -194,7 +218,7 @@ its parent; and can't be removed or suspended after that parent.
194The policy is that the device tree should match hardware bus topology. 218The policy is that the device tree should match hardware bus topology.
195(Or at least the control bus, for devices which use multiple busses.) 219(Or at least the control bus, for devices which use multiple busses.)
196In particular, this means that a device registration may fail if the parent of 220In particular, this means that a device registration may fail if the parent of
197the device is suspending (ie. has been chosen by the PM core as the next 221the device is suspending (i.e. has been chosen by the PM core as the next
198device to suspend) or has already suspended, as well as after all of the other 222device to suspend) or has already suspended, as well as after all of the other
199devices have been suspended. Device drivers must be prepared to cope with such 223devices have been suspended. Device drivers must be prepared to cope with such
200situations. 224situations.
@@ -207,54 +231,166 @@ system always includes every phase, executing calls for every device
207before the next phase begins. Not all busses or classes support all 231before the next phase begins. Not all busses or classes support all
208these callbacks; and not all drivers use all the callbacks. 232these callbacks; and not all drivers use all the callbacks.
209 233
210The phases are seen by driver notifications issued in this order: 234Generally, different callbacks are used depending on whether the system is
235going to the standby or memory sleep state ("suspend-to-RAM") or it is going to
236be hibernated ("suspend-to-disk").
237
238If the system goes to the standby or memory sleep state the phases are seen by
239driver notifications issued in this order:
240
241 1 bus->pm.prepare(dev) is called after tasks are frozen and it is supposed
242 to call the device driver's ->pm.prepare() method.
243
244 The purpose of this method is mainly to prevent new children of the
245 device from being registered after it has returned. It also may be used
246 to generally prepare the device for the upcoming system transition, but
247 it should not put the device into a low power state.
211 248
212 1 class.suspend(dev, message) is called after tasks are frozen, for 249 2 class->pm.suspend(dev) is called if dev is associated with a class that
213 devices associated with a class that has such a method. This 250 has such a method. It may invoke the device driver's ->pm.suspend()
214 method may sleep. 251 method, unless type->pm.suspend(dev) or bus->pm.suspend() does that.
215 252
216 Since I/O activity usually comes from such higher layers, this is 253 3 type->pm.suspend(dev) is called if dev is associated with a device type
217 a good place to quiesce all drivers of a given type (and keep such 254 that has such a method. It may invoke the device driver's
218 code out of those drivers). 255 ->pm.suspend() method, unless class->pm.suspend(dev) or
256 bus->pm.suspend() does that.
219 257
220 2 bus.suspend(dev, message) is called next. This method may sleep, 258 4 bus->pm.suspend(dev) is called, if implemented. It usually calls the
221 and is often morphed into a device driver call with bus-specific 259 device driver's ->pm.suspend() method.
222 parameters and/or rules.
223 260
224 This call should handle parts of device suspend logic that require 261 This call should generally quiesce the device so that it doesn't do any
225 sleeping. It probably does work to quiesce the device which hasn't 262 I/O after the call has returned. It also may save the device registers
226 been abstracted into class.suspend(). 263 and put it into the appropriate low power state, depending on the bus
264 type the device is on.
227 265
228The pm_message_t parameter is currently used to refine those semantics 266 5 bus->pm.suspend_noirq(dev) is called, if implemented. It may call the
229(described later). 267 device driver's ->pm.suspend_noirq() method, depending on the bus type
268 in question.
269
270 This method is invoked after device interrupts have been suspended,
271 which means that the driver's interrupt handler will not be called
272 while it is running. It should save the values of the device's
273 registers that weren't saved previously and finally put the device into
274 the appropriate low power state.
275
276 The majority of subsystems and device drivers need not implement this
277 method. However, bus types allowing devices to share interrupt vectors,
278 like PCI, generally need to use it to prevent interrupt handling issues
279 from happening during suspend.
230 280
231At the end of those phases, drivers should normally have stopped all I/O 281At the end of those phases, drivers should normally have stopped all I/O
232transactions (DMA, IRQs), saved enough state that they can re-initialize 282transactions (DMA, IRQs), saved enough state that they can re-initialize
233or restore previous state (as needed by the hardware), and placed the 283or restore previous state (as needed by the hardware), and placed the
234device into a low-power state. On many platforms they will also use 284device into a low-power state. On many platforms they will also
235clk_disable() to gate off one or more clock sources; sometimes they will 285gate off one or more clock sources; sometimes they will also switch off power
236also switch off power supplies, or reduce voltages. Drivers which have 286supplies, or reduce voltages. [Drivers supporting runtime PM may already have
237runtime PM support may already have performed some or all of the steps 287performed some or all of the steps needed to prepare for the upcoming system
238needed to prepare for the upcoming system sleep state. 288state transition.]
289
290If device_may_wakeup(dev) returns true, the device should be prepared for
291generating hardware wakeup signals when the system is in the sleep state to
292trigger a system wakeup event. For example, enable_irq_wake() might identify
293GPIO signals hooked up to a switch or other external hardware, and
294pci_enable_wake() does something similar for the PCI PME signal.
295
296If a driver (or subsystem) fails its suspend method, the system won't enter the
297desired low power state; it will resume all the devices it's suspended so far.
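
The failure handling just described, combined with the bottom-up suspend order from "Call Sequence Guarantees", can be modelled in userland as below. This is a simplified sketch, not the PM core: it collapses all suspend phases into one step, keeps devices in a flat array ordered parents first, and every name in it (struct device fields, mock_suspend(), mock_resume(), suspend_all()) is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Stand-in device record; NOT the kernel's struct device. */
struct device {
	const char *name;
	int fail_suspend;	/* hypothetical knob: make the suspend callback fail */
	int suspended;
	int resumes;		/* how many times the device was resumed */
};

/* Hypothetical suspend callback covering all suspend phases at once. */
static int mock_suspend(struct device *dev)
{
	if (dev->fail_suspend)
		return -EBUSY;
	dev->suspended = 1;
	return 0;
}

/* Hypothetical resume callback. */
static void mock_resume(struct device *dev)
{
	dev->suspended = 0;
	dev->resumes++;
}

/*
 * devs[] is ordered parents first, so suspending walks it from the end
 * (children before parents).  If one device fails, the devices already
 * suspended are resumed in the opposite order (parents before children)
 * and the error is returned.
 */
static int suspend_all(struct device *devs, size_t n)
{
	size_t i = n;
	int error = 0;

	assert(devs != NULL || n == 0);

	while (i > 0) {
		error = mock_suspend(&devs[i - 1]);
		if (error)
			break;
		i--;
	}
	if (!error)
		return 0;

	/* devs[i - 1] failed; devs[i] .. devs[n - 1] were already suspended. */
	for (; i < n; i++)
		mock_resume(&devs[i]);
	return error;
}
```

With three devices where the middle one refuses to suspend, suspend_all() returns -EBUSY, the last device is suspended and then resumed again, and the first one is never touched.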
298
299
300Hibernation Phases
301------------------
302Hibernating the system is more complicated than putting it into the standby or
303memory sleep state, because it involves creating a system image and saving it.
304Therefore there are more phases of hibernation and special device PM methods are
305used in this case.
306
307First, it is necessary to prepare the system for creating a hibernation image.
308This is similar to putting the system into the standby or memory sleep state,
309although it generally doesn't require that devices be put into low power states
310(that is not even desirable at this point). Driver notifications are then
311issued in the following order:
312
313 1 bus->pm.prepare(dev) is called after tasks have been frozen and enough
314 memory has been freed.
315
316 2 class->pm.freeze(dev) is called if implemented. It may invoke the
317 device driver's ->pm.freeze() method, unless type->pm.freeze(dev) or
318 bus->pm.freeze() does that.
319
320 3 type->pm.freeze(dev) is called if implemented. It may invoke the device
321 driver's ->pm.freeze() method, unless class->pm.freeze(dev) or
322 bus->pm.freeze() does that.
239 323
240When any driver sees that its device_can_wakeup(dev), it should make sure 324 4 bus->pm.freeze(dev) is called, if implemented. It usually calls the
241to use the relevant hardware signals to trigger a system wakeup event. 325 device driver's ->pm.freeze() method.
242For example, enable_irq_wake() might identify GPIO signals hooked up to
243a switch or other external hardware, and pci_enable_wake() does something
244similar for PCI's PME# signal.
245 326
246If a driver (or bus, or class) fails it suspend method, the system won't 327 5 bus->pm.freeze_noirq(dev) is called, if implemented. It may call the
247enter the desired low power state; it will resume all the devices it's 328 device driver's ->pm.freeze_noirq() method, depending on the bus type
248suspended so far. 329 in question.
249 330
250Note that drivers may need to perform different actions based on the target 331The difference between ->pm.freeze() and the corresponding ->pm.suspend() (and
251system lowpower/sleep state. At this writing, there are only platform 332similarly for the "noirq" variants) is that the former should avoid preparing
252specific APIs through which drivers could determine those target states. 333devices to trigger system wakeup events and putting devices into low power
334states, although they generally have to save the values of device registers
335so that it's possible to restore them during system resume.
336
337Second, after the system image has been created, the functionality of devices
338has to be restored so that the image can be saved. That is similar to resuming
339devices after the system has been woken up from the standby or memory sleep
340state, which is described below, and causes the following device notifications
341to be issued:
342
343 1 bus->pm.thaw_noirq(dev), if implemented; may call the device driver's
344 ->pm.thaw_noirq() method, depending on the bus type in question.
345
346 2 bus->pm.thaw(dev), if implemented; usually calls the device driver's
347 ->pm.thaw() method.
348
349 3 type->pm.thaw(dev), if implemented; may call the device driver's
350 ->pm.thaw() method if not called by the bus type or class.
351
352 4 class->pm.thaw(dev), if implemented; may call the device driver's
353 ->pm.thaw() method if not called by the bus type or device type.
354
355 5 bus->pm.complete(dev), if implemented; may call the device driver's
356 ->pm.complete() method.
357
358Generally, the role of the ->pm.thaw() methods (including the "noirq" variants)
359is to bring the device back to the fully functional state, so that it may be
360used for saving the image, if necessary. The role of bus->pm.complete() is to
361reverse whatever bus->pm.prepare() did (likewise for the analogous device driver
362callbacks).
363
364After the image has been saved, the devices need to be prepared for putting the
365system into the low power state. That is analogous to suspending them before
366putting the system into the standby or memory sleep state and involves the
367following device notifications:
368
369 1 bus->pm.prepare(dev).
370
371 2 class->pm.poweroff(dev), if implemented; may invoke the device driver's
372 ->pm.poweroff() method if not called by the bus type or device type.
373
374 3 type->pm.poweroff(dev), if implemented; may invoke the device driver's
375 ->pm.poweroff() method if not called by the bus type or device class.
376
377 4 bus->pm.poweroff(dev), if implemented; usually calls the device driver's
378 ->pm.poweroff() method (if not called by the device class or type).
379
380 5 bus->pm.poweroff_noirq(dev), if implemented; may call the device
381 driver's ->pm.poweroff_noirq() method, depending on the bus type
382 in question.
383
384The difference between ->pm.poweroff() and the corresponding ->pm.suspend() (and
385analogously for the "noirq" variants) is that the former need not save the
386device's registers. Still, they should prepare the device for triggering
387system wakeup events if necessary and finally put it into the appropriate low
388power state.
253 389
254 390
255Device Low Power (suspend) States 391Device Low Power (suspend) States
256--------------------------------- 392---------------------------------
257Device low-power states aren't very standard. One device might only handle 393Device low-power states aren't standard. One device might only handle
258"on" and "off", while another might support a dozen different versions of 394"on" and "off", while another might support a dozen different versions of
259"on" (how many engines are active?), plus a state that gets back to "on" 395"on" (how many engines are active?), plus a state that gets back to "on"
260faster than from a full "off". 396faster than from a full "off".
@@ -265,7 +401,7 @@ PCI device may not perform DMA or issue IRQs, and any wakeup events it
265issues would be issued through the PME# bus signal. Plus, there are 401issues would be issued through the PME# bus signal. Plus, there are
266several PCI-standard device states, some of which are optional. 402several PCI-standard device states, some of which are optional.
267 403
268In contrast, integrated system-on-chip processors often use irqs as the 404In contrast, integrated system-on-chip processors often use IRQs as the
269wakeup event sources (so drivers would call enable_irq_wake) and might 405wakeup event sources (so drivers would call enable_irq_wake) and might
270be able to treat DMA completion as a wakeup event (sometimes DMA can stay 406be able to treat DMA completion as a wakeup event (sometimes DMA can stay
271active too, it'd only be the CPU and some peripherals that sleep). 407active too, it'd only be the CPU and some peripherals that sleep).
@@ -284,84 +420,86 @@ ways; the aforementioned LCD might be active in one product's "standby",
284but a different product using the same SOC might work differently. 420but a different product using the same SOC might work differently.
285 421
286 422
287Meaning of pm_message_t.event 423Resuming Devices
288----------------------------- 424----------------
289Parameters to suspend calls include the device affected and a message of 425Resuming is done in multiple phases, much like suspending, with all
290type pm_message_t, which has one field: the event. If driver does not 426devices processing each phase's calls before the next phase begins.
291recognize the event code, suspend calls may abort the request and return
292a negative errno. However, most drivers will be fine if they implement
293PM_EVENT_SUSPEND semantics for all messages.
294 427
295The event codes are used to refine the goal of suspending the device, and 428Again, however, different callbacks are used depending on whether the system is
296mostly matter when creating or resuming system memory image snapshots, as 429waking up from the standby or memory sleep state ("suspend-to-RAM") or from
297used with suspend-to-disk: 430hibernation ("suspend-to-disk").
298 431
299 PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power 432If the system is waking up from the standby or memory sleep state, the phases
300 state. When used with system sleep states like "suspend-to-RAM" or 433are seen by driver notifications issued in this order:
301 "standby", the upcoming resume() call will often be able to rely on
302 state kept in hardware, or issue system wakeup events.
303 434
304 PM_EVENT_HIBERNATE -- Put hardware into a low-power state and enable wakeup 435 1 bus->pm.resume_noirq(dev) is called, if implemented. It may call the
305 events as appropriate. It is only used with hibernation 436 device driver's ->pm.resume_noirq() method, depending on the bus type in
306 (suspend-to-disk) and few devices are able to wake up the system from 437 question.
307 this state; most are completely powered off.
308 438
309 PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into 439 The role of this method is to perform actions that need to be performed
310 any low power mode. A system snapshot is about to be taken, often 440 before device drivers' interrupt handlers are allowed to be invoked. If
311 followed by a call to the driver's resume() method. Neither wakeup 441 the given bus type permits devices to share interrupt vectors, like PCI,
312 events nor DMA are allowed. 442 this method should bring the device and its driver into a state in which
443 the driver can recognize if the device is the source of incoming
444 interrupts, if any, and handle them correctly.
313 445
314 PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume() 446 For example, the PCI bus type's ->pm.resume_noirq() puts the device into
315 will restore a suspend-to-disk snapshot from a different kernel image. 447 the full power state (D0 in the PCI terminology) and restores the
316 Drivers that are smart enough to look at their hardware state during 448 standard configuration registers of the device. Then, it calls the
317 resume() processing need that state to be correct ... a PRETHAW could 449 device driver's ->pm.resume_noirq() method to perform device-specific
318 be used to invalidate that state (by resetting the device), like a 450 actions needed at this stage of resume.
319 shutdown() invocation would before a kexec() or system halt. Other
320 drivers might handle this the same way as PM_EVENT_FREEZE. Neither
321 wakeup events nor DMA are allowed.
322 451
323To enter "standby" (ACPI S1) or "Suspend to RAM" (STR, ACPI S3) states, or 452 2 bus->pm.resume(dev) is called, if implemented. It usually calls the
324the similarly named APM states, only PM_EVENT_SUSPEND is used; the other event 453 device driver's ->pm.resume() method.
325codes are used for hibernation ("Suspend to Disk", STD, ACPI S4).
326 454
327There's also PM_EVENT_ON, a value which never appears as a suspend event 455 This call should generally bring the device back to the working
328but is sometimes used to record the "not suspended" device state. 456 state, so that it can do I/O as requested after the call has returned.
457 However, it may be more convenient to use the device class or device
458 type ->pm.resume() for this purpose, in which case the bus type's
459 ->pm.resume() method need not be implemented at all.
329 460
461 3 type->pm.resume(dev) is called, if implemented. It may invoke the
462 device driver's ->pm.resume() method, unless class->pm.resume(dev) or
463 bus->pm.resume() does that.
330 464
331Resuming Devices 465 For devices that are not associated with any bus type or device class
332---------------- 466 this method plays the role of bus->pm.resume().
333Resuming is done in multiple phases, much like suspending, with all
334devices processing each phase's calls before the next phase begins.
335 467
336The phases are seen by driver notifications issued in this order: 468 4 class->pm.resume(dev) is called, if implemented. It may invoke the
469 device driver's ->pm.resume() method, unless bus->pm.resume(dev) or
470 type->pm.resume() does that.
337 471
338 1 bus.resume(dev) reverses the effects of bus.suspend(). This may 472 For devices that are not associated with any bus type or device type
339 be morphed into a device driver call with bus-specific parameters; 473 this method plays the role of bus->pm.resume().
340 implementations may sleep.
341 474
342 2 class.resume(dev) is called for devices associated with a class 475 5 bus->pm.complete(dev) is called, if implemented. It is supposed to
343 that has such a method. Implementations may sleep. 476 invoke the device driver's ->pm.complete() method.
344 477
345 This reverses the effects of class.suspend(), and would usually 478 The role of this method is to reverse whatever bus->pm.prepare(dev)
346 reactivate the device's I/O queue. 479 (or the driver's ->pm.prepare()) did during suspend, if necessary.
347 480
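The ordering of the resume notifications listed above can be sketched as a
small user-space model. This is only an illustration under stated assumptions:
the model_* names are hypothetical and are not the kernel's data structures,
return values of callbacks are ignored for brevity, and items 2 to 4 of the
list are assumed to run back to back for each device within a single pass.

```c
#include <stddef.h>
#include <stdio.h>

struct model_device;

/* Hypothetical stand-in for the callbacks discussed above (not kernel code). */
struct model_pm_ops {
	int (*resume_noirq)(struct model_device *dev);
	int (*resume)(struct model_device *dev);
	void (*complete)(struct model_device *dev);
};

struct model_device {
	const char *name;
	const struct model_pm_ops *bus;       /* may be NULL */
	const struct model_pm_ops *type;      /* may be NULL */
	const struct model_pm_ops *class_ops; /* may be NULL */
};

/* Text trace of the callbacks invoked so far, for illustration only. */
static char model_trace[256];
static size_t model_trace_len;

static void model_log(const char *level, const char *cb,
		      const struct model_device *dev)
{
	size_t room = sizeof(model_trace) - model_trace_len;
	int n = snprintf(model_trace + model_trace_len, room, "%s.%s(%s);",
			 level, cb, dev->name);

	if (n > 0 && (size_t)n < room)
		model_trace_len += (size_t)n;
}

/* Invoke ->resume() at one level (bus, type or class), if implemented. */
static void model_resume_level(const struct model_pm_ops *ops,
			       const char *level, struct model_device *dev)
{
	if (ops && ops->resume) {
		model_log(level, "resume", dev);
		ops->resume(dev);
	}
}

/* Every device finishes one phase before the next phase begins. */
static void model_resume_all(struct model_device *devs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)		/* item 1: resume_noirq */
		if (devs[i].bus && devs[i].bus->resume_noirq) {
			model_log("bus", "resume_noirq", &devs[i]);
			devs[i].bus->resume_noirq(&devs[i]);
		}

	for (i = 0; i < n; i++) {	/* items 2-4: bus, type, class */
		model_resume_level(devs[i].bus, "bus", &devs[i]);
		model_resume_level(devs[i].type, "type", &devs[i]);
		model_resume_level(devs[i].class_ops, "class", &devs[i]);
	}

	for (i = 0; i < n; i++)		/* item 5: complete */
		if (devs[i].bus && devs[i].bus->complete) {
			model_log("bus", "complete", &devs[i]);
			devs[i].bus->complete(&devs[i]);
		}
}

/* Trivial example callbacks that always succeed. */
static int model_cb_ok(struct model_device *dev) { (void)dev; return 0; }
static void model_cb_done(struct model_device *dev) { (void)dev; }
```

Calling model_resume_all() on an array of devices leaves the sequence of
invoked callbacks in model_trace; callbacks that are not implemented (NULL)
are simply skipped, as in the "if implemented" wording of the list.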
348At the end of those phases, drivers should normally be as functional as 481At the end of those phases, drivers should normally be as functional as
349they were before suspending: I/O can be performed using DMA and IRQs, and 482they were before suspending: I/O can be performed using DMA and IRQs, and
350the relevant clocks are gated on. The device need not be "fully on"; it 483the relevant clocks are gated on. In principle the device need not be
351might be in a runtime lowpower/suspend state that acts as if it were. 484"fully on"; it might be in a runtime lowpower/suspend state during suspend and
485the resume callbacks may try to restore that state, but that need not be
486desirable from the user's point of view. In fact, there are multiple reasons
487why it's better to always put devices into the "fully working" state in the
488system sleep resume callbacks and they are discussed in more detail in
489Documentation/power/runtime_pm.txt.
352 490
353However, the details here may again be platform-specific. For example, 491However, the details here may again be platform-specific. For example,
354some systems support multiple "run" states, and the mode in effect at 492some systems support multiple "run" states, and the mode in effect at
355the end of resume() might not be the one which preceded suspension. 493the end of resume might not be the one which preceded suspension.
356That means availability of certain clocks or power supplies changed, 494That means availability of certain clocks or power supplies changed,
357which could easily affect how a driver works. 495which could easily affect how a driver works.
358 496
359
360Drivers need to be able to handle hardware which has been reset since the 497Drivers need to be able to handle hardware which has been reset since the
361suspend methods were called, for example by complete reinitialization. 498suspend methods were called, for example by complete reinitialization.
362This may be the hardest part, and the one most protected by NDA'd documents 499This may be the hardest part, and the one most protected by NDA'd documents
363and chip errata. It's simplest if the hardware state hasn't changed since 500and chip errata. It's simplest if the hardware state hasn't changed since
364the suspend() was called, but that can't always be guaranteed. 501the suspend was carried out, but that can't be guaranteed (in fact, it usually
502is not the case).
365 503
366Drivers must also be prepared to notice that the device has been removed 504Drivers must also be prepared to notice that the device has been removed
367while the system was powered off, whenever that's physically possible. 505while the system was powered off, whenever that's physically possible.
@@ -371,11 +509,76 @@ will notice and handle such removals are currently bus-specific, and often
371involve a separate thread. 509involve a separate thread.
372 510
373 511
374Note that the bus-specific runtime PM wakeup mechanism can exist, and might 512Resume From Hibernation
375be defined to share some of the same driver code as for system wakeup. For 513-----------------------
376example, a bus-specific device driver's resume() method might be used there, 514Resuming from hibernation is, again, more complicated than resuming from a sleep
377so it wouldn't only be called from bus.resume() during system-wide wakeup. 515state in which the contents of main memory are preserved, because it requires
378See bus-specific information about how runtime wakeup events are handled. 516a system image to be loaded into memory and the pre-hibernation memory contents
517to be restored before control can be passed back to the image kernel.
518
519In principle, the image might be loaded into memory and the pre-hibernation
520memory contents might be restored by the boot loader. For this purpose,
521however, the boot loader would need to know the image kernel's entry point and
522there's no protocol defined for passing that information to boot loaders. As
523a workaround, the boot loader loads a fresh instance of the kernel, called the
 524boot kernel, into memory and passes control to it in the usual way. Then, the
525boot kernel reads the hibernation image, restores the pre-hibernation memory
526contents and passes control to the image kernel. Thus, in fact, two different
527kernels are involved in resuming from hibernation and in general they are not
528only different because they play different roles in this operation. Actually,
529the boot kernel may be completely different from the image kernel. Not only
530the configuration of it, but also the version of it may be different.
531The consequences of this are important to device drivers and their subsystems
532(bus types, device classes and device types) too.
533
534Namely, to be able to load the hibernation image into memory, the boot kernel
535needs to include at least the subset of device drivers allowing it to access the
536storage medium containing the image, although it generally doesn't need to
 537include all of the drivers included in the image kernel. After the image has
538been loaded the devices handled by those drivers need to be prepared for passing
539control back to the image kernel. This is very similar to the preparation of
540devices for creating a hibernation image described above. In fact, it is done
541in the same way, with the help of the ->pm.prepare(), ->pm.freeze() and
542->pm.freeze_noirq() callbacks, but only for device drivers included in the boot
543kernel (whose versions may generally be different from the versions of the
544analogous drivers from the image kernel).
545
546Should the restoration of the pre-hibernation memory contents fail, the boot
547kernel would carry out the procedure of "thawing" devices described above, using
548the ->pm.thaw_noirq(), ->pm.thaw(), and ->pm.complete() callbacks provided by
549subsystems and device drivers. This, however, is a very rare condition. Most
550often the pre-hibernation memory contents are restored successfully and control
551is passed to the image kernel that is now responsible for bringing the system
552back to the working state.
553
554To achieve this goal, among other things, the image kernel restores the
555pre-hibernation functionality of devices. This operation is analogous to the
556resuming of devices after waking up from the memory sleep state, although it
557involves different device notifications which are the following:
558
559 1 bus->pm.restore_noirq(dev), if implemented; may call the device driver's
560 ->pm.restore_noirq() method, depending on the bus type in question.
561
562 2 bus->pm.restore(dev), if implemented; usually calls the device driver's
563 ->pm.restore() method.
564
565 3 type->pm.restore(dev), if implemented; may call the device driver's
566 ->pm.restore() method if not called by the bus type or class.
567
568 4 class->pm.restore(dev), if implemented; may call the device driver's
569 ->pm.restore() method if not called by the bus type or device type.
570
571 5 bus->pm.complete(dev), if implemented; may call the device driver's
572 ->pm.complete() method.
573
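The two possible outcomes of resuming from hibernation described in this
section can be summarized as a callback sequence. The sketch below is a
user-space illustration only; struct model_step and
model_hibernation_resume_steps() are hypothetical names, and the sequences are
taken from the text above rather than from kernel code.

```c
#include <stdbool.h>
#include <stddef.h>

/* One step of the sequence: which kernel runs which callback. */
struct model_step {
	const char *kernel;	/* "boot" or "image" */
	const char *callback;	/* callback name, without the "->pm." prefix */
};

/*
 * Return the NULL-terminated callback sequence for a resume from
 * hibernation, depending on whether the pre-hibernation memory contents
 * were restored successfully.
 */
static const struct model_step *model_hibernation_resume_steps(bool image_restored)
{
	/* Usual case: control is passed to the image kernel. */
	static const struct model_step restored[] = {
		{ "boot", "prepare" },
		{ "boot", "freeze" },
		{ "boot", "freeze_noirq" },
		{ "image", "restore_noirq" },
		{ "image", "restore" },
		{ "image", "complete" },
		{ NULL, NULL },
	};
	/* Rare case: restoration failed, the boot kernel thaws devices. */
	static const struct model_step failed[] = {
		{ "boot", "prepare" },
		{ "boot", "freeze" },
		{ "boot", "freeze_noirq" },
		{ "boot", "thaw_noirq" },
		{ "boot", "thaw" },
		{ "boot", "complete" },
		{ NULL, NULL },
	};

	return image_restored ? restored : failed;
}
```

The first three steps are identical in both cases; only the kernel that runs
the remaining steps, and the callbacks it uses, differ.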
574The roles of the ->pm.restore_noirq() and ->pm.restore() callbacks are analogous
575to the roles of the corresponding resume callbacks, but they must assume that
576the device may have been accessed before by the boot kernel. Consequently, the
577state of the device before they are called may be different from the state of it
578right prior to calling the resume callbacks. That difference usually doesn't
579matter, so the majority of device drivers can set their resume and restore
580callback pointers to the same routine. Nevertheless, different callback
581pointers are used in case there is a situation where it actually matters.
379 582
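A driver that does not depend on the state in which the boot kernel left the
device can point both callbacks at one routine. The following is a minimal
user-space sketch of that idea, not kernel code: struct model_dev_pm_ops is a
hypothetical, trimmed-down stand-in for the structure holding the callbacks,
and the "foo" device and its fields are invented for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-in holding only the two callbacks of interest. */
struct model_dev_pm_ops {
	int (*resume)(void *dev);
	int (*restore)(void *dev);
};

/* Invented example device: one register plus a driver-side copy of it. */
struct foo_device {
	unsigned int hw_reg;	/* models a hardware register */
	unsigned int saved_reg;	/* value the driver saved before sleeping */
};

/*
 * Reprogram the device from the driver's saved copy.  The routine never
 * reads hw_reg, so it works whether the register kept its value (resume)
 * or was changed by the boot kernel in the meantime (restore).
 */
static int foo_resume(void *dev)
{
	struct foo_device *foo = dev;

	foo->hw_reg = foo->saved_reg;
	return 0;
}

/* Both callback pointers refer to the same routine. */
static const struct model_dev_pm_ops foo_pm_ops = {
	.resume = foo_resume,
	.restore = foo_resume,
};
```

The design choice here is that the routine treats the hardware state as
unknown and rebuilds it from saved data, which is what makes sharing it
between the two callbacks safe in this sketch.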
380 583
381System Devices 584System Devices
@@ -389,10 +592,13 @@ System devices will only be suspended with interrupts disabled, and after
389all other devices have been suspended. On resume, they will be resumed 592all other devices have been suspended. On resume, they will be resumed
390before any other devices, and also with interrupts disabled. 593before any other devices, and also with interrupts disabled.
391 594
392That is, IRQs are disabled, the suspend_late() phase begins, then the 595That is, when the non-boot CPUs are all offline and IRQs are disabled on the
393sysdev_driver.suspend() phase, and the system enters a sleep state. Then 596remaining online CPU, then the sysdev_driver.suspend() phase is carried out, and
394the sysdev_driver.resume() phase begins, followed by the resume_early() 597the system enters a sleep state (or a hibernation image is created). During
395phase, after which IRQs are enabled. 598resume (or after the image has been created) the sysdev_driver.resume() phase
599is carried out, IRQs are enabled on the only online CPU, the non-boot CPUs are
600enabled and that is followed by the "early resume" phase (in which the "noirq"
601callbacks provided by subsystems and device drivers are invoked).
396 602
397Code to actually enter and exit the system-wide low power state sometimes 603Code to actually enter and exit the system-wide low power state sometimes
398involves hardware details that are only known to the boot firmware, and 604involves hardware details that are only known to the boot firmware, and
@@ -400,6 +606,22 @@ may leave a CPU running software (from SRAM or flash memory) that monitors
400the system and manages its wakeup sequence. 606the system and manages its wakeup sequence.
401 607
402 608
609Power Management Notifiers
610--------------------------
611As stated in Documentation/power/notifiers.txt, there are some operations that
612cannot be carried out by the power management callbacks discussed above, because
613carrying them out at these points would be too late or too early. To handle
614these cases subsystems and device drivers may register power management
615notifiers that are called before tasks are frozen and after they have been
616thawed.
617
618Generally speaking, the PM notifiers are suitable for performing actions that
619either require user space to be available, or at least won't interfere with user
620space in a wrong way.
621
622For details refer to Documentation/power/notifiers.txt.
623
624
403Runtime Power Management 625Runtime Power Management
404======================== 626========================
405Many devices are able to dynamically power down while the system is still 627Many devices are able to dynamically power down while the system is still
@@ -410,79 +632,21 @@ as "off", "sleep", "idle", "active", and so on. Those states will in some
410cases (like PCI) be partially constrained by a bus the device uses, and will 632cases (like PCI) be partially constrained by a bus the device uses, and will
411usually include hardware states that are also used in system sleep states. 633usually include hardware states that are also used in system sleep states.
412 634
413However, note that if a driver puts a device into a runtime low power state 635Note, however, that a system-wide power transition can be started while some
414and the system then goes into a system-wide sleep state, it normally ought 636devices are in low power states due to the runtime power management. The system
415to resume into that runtime low power state rather than "full on". Such 637sleep PM callbacks should generally recognize such situations and react to them
416distinctions would be part of the driver-internal state machine for that 638appropriately, but the recommended actions to be taken in such cases are
417hardware; the whole point of runtime power management is to be sure that 639subsystem-specific.
418drivers are decoupled in that way from the state machine governing phases 640
419of the system-wide power/sleep state transitions. 641In some cases the decision may be made at the subsystem level while in some
420 642other cases the device driver may be left to decide. In some cases it may be
421 643desirable to leave a suspended device in that state during system-wide power
422Power Saving Techniques 644transition, but in some other cases the device ought to be put back into the
423----------------------- 645full power state, for example to be configured for system wakeup or so that its
424Normally runtime power management is handled by the drivers without specific 646system wakeup capability can be disabled. That all depends on the hardware
425userspace or kernel intervention, by device-aware use of techniques like: 647and the design of the subsystem and device driver in question.
426 648
427 Using information provided by other system layers 649During system-wide resume from a sleep state it's better to put devices into
428 - stay deeply "off" except between open() and close() 650the full power state, as explained in Documentation/power/runtime_pm.txt. Refer
429 - if transceiver/PHY indicates "nobody connected", stay "off" 651to that document for more information regarding this particular issue as well as
430 - application protocols may include power commands or hints 652for information on the device runtime power management framework in general.
431
432 Using fewer CPU cycles
433 - using DMA instead of PIO
434 - removing timers, or making them lower frequency
435 - shortening "hot" code paths
436 - eliminating cache misses
437 - (sometimes) offloading work to device firmware
438
439 Reducing other resource costs
440 - gating off unused clocks in software (or hardware)
441 - switching off unused power supplies
442 - eliminating (or delaying/merging) IRQs
443 - tuning DMA to use word and/or burst modes
444
445 Using device-specific low power states
446 - using lower voltages
447 - avoiding needless DMA transfers
448
449Read your hardware documentation carefully to see the opportunities that
450may be available. If you can, measure the actual power usage and check
451it against the budget established for your project.
452
453
454Examples: USB hosts, system timer, system CPU
455----------------------------------------------
456USB host controllers make interesting, if complex, examples. In many cases
457these have no work to do: no USB devices are connected, or all of them are
458in the USB "suspend" state. Linux host controller drivers can then disable
459periodic DMA transfers that would otherwise be a constant power drain on the
460memory subsystem, and enter a suspend state. In power-aware controllers,
461entering that suspend state may disable the clock used with USB signaling,
462saving a certain amount of power.
463
464The controller will be woken from that state (with an IRQ) by changes to the
465signal state on the data lines of a given port, for example by an existing
466peripheral requesting "remote wakeup" or by plugging a new peripheral. The
467same wakeup mechanism usually works from "standby" sleep states, and on some
468systems also from "suspend to RAM" (or even "suspend to disk") states.
469(Except that ACPI may be involved instead of normal IRQs, on some hardware.)
470
471System devices like timers and CPUs may have special roles in the platform
472power management scheme. For example, system timers using a "dynamic tick"
473approach don't just save CPU cycles (by eliminating needless timer IRQs),
474but they may also open the door to using lower power CPU "idle" states that
475cost more than a jiffie to enter and exit. On x86 systems these are states
476like "C3"; note that periodic DMA transfers from a USB host controller will
477also prevent entry to a C3 state, much like a periodic timer IRQ.
478
479That kind of runtime mechanism interaction is common. "System On Chip" (SOC)
480processors often have low power idle modes that can't be entered unless
481certain medium-speed clocks (often 12 or 48 MHz) are gated off. When the
482drivers gate those clocks effectively, then the system idle task may be able
483to use the lower power idle modes and thereby increase battery life.
484
485If the CPU can have a "cpufreq" driver, there also may be opportunities
486to shift to lower voltage settings and reduce the power cost of executing
487a given number of instructions. (Without voltage adjustment, it's rare
488for cpufreq to save much power; the cost-per-instruction must go down.)