diff options
-rw-r--r-- | Documentation/power/pci.txt | 1258 |
1 files changed, 992 insertions, 266 deletions
diff --git a/Documentation/power/pci.txt b/Documentation/power/pci.txt index dd8fe43888d3..62328d76b55b 100644 --- a/Documentation/power/pci.txt +++ b/Documentation/power/pci.txt | |||
@@ -1,299 +1,1025 @@ | |||
1 | |||
2 | PCI Power Management | 1 | PCI Power Management |
3 | ~~~~~~~~~~~~~~~~~~~~ | ||
4 | 2 | ||
5 | An overview of the concepts and the related functions in the Linux kernel | 3 | Copyright (c) 2010 Rafael J. Wysocki <rjw@sisk.pl>, Novell Inc. |
4 | |||
5 | An overview of concepts and the Linux kernel's interfaces related to PCI power | ||
6 | management. Based on previous work by Patrick Mochel <mochel@transmeta.com> | ||
7 | (and others). | ||
6 | 8 | ||
7 | Patrick Mochel <mochel@transmeta.com> | 9 | This document only covers the aspects of power management specific to PCI |
8 | (and others) | 10 | devices. For general description of the kernel's interfaces related to device |
11 | power management refer to Documentation/power/devices.txt and | ||
12 | Documentation/power/runtime_pm.txt. | ||
9 | 13 | ||
10 | --------------------------------------------------------------------------- | 14 | --------------------------------------------------------------------------- |
11 | 15 | ||
12 | 1. Overview | 16 | 1. Hardware and Platform Support for PCI Power Management |
13 | 2. How the PCI Subsystem Does Power Management | 17 | 2. PCI Subsystem and Device Power Management |
14 | 3. PCI Utility Functions | 18 | 3. PCI Device Drivers and Power Management |
15 | 4. PCI Device Drivers | 19 | 4. Resources |
16 | 5. Resources | 20 | |
17 | 21 | ||
18 | 1. Overview | 22 | 1. Hardware and Platform Support for PCI Power Management |
19 | ~~~~~~~~~~~ | 23 | ========================================================= |
20 | 24 | ||
21 | The PCI Power Management Specification was introduced between the PCI 2.1 and | 25 | 1.1. Native and Platform-Based Power Management |
22 | PCI 2.2 Specifications. It a standard interface for controlling various | 26 | ----------------------------------------------- |
23 | power management operations. | 27 | In general, power management is a feature allowing one to save energy by putting |
24 | 28 | devices into states in which they draw less power (low-power states) at the | |
25 | Implementation of the PCI PM Spec is optional, as are several sub-components of | 29 | price of reduced functionality or performance. |
26 | it. If a device supports the PCI PM Spec, the device will have an 8 byte | 30 | |
27 | capability field in its PCI configuration space. This field is used to describe | 31 | Usually, a device is put into a low-power state when it is underutilized or |
28 | and control the standard PCI power management features. | 32 | completely inactive. However, when it is necessary to use the device once |
29 | 33 | again, it has to be put back into the "fully functional" state (full-power | |
30 | The PCI PM spec defines 4 operating states for devices (D0 - D3) and for buses | 34 | state). This may happen when there are some data for the device to handle or |
31 | (B0 - B3). The higher the number, the less power the device consumes. However, | 35 | as a result of an external event requiring the device to be active, which may |
32 | the higher the number, the longer the latency is for the device to return to | 36 | be signaled by the device itself. |
33 | an operational state (D0). | 37 | |
34 | 38 | PCI devices may be put into low-power states in two ways, by using the device | |
35 | There are actually two D3 states. When someone talks about D3, they usually | 39 | capabilities introduced by the PCI Bus Power Management Interface Specification, |
36 | mean D3hot, which corresponds to an ACPI D2 state (power is reduced, the | 40 | or with the help of platform firmware, such as an ACPI BIOS. In the first |
37 | device may lose some context). But they may also mean D3cold, which is an | 41 | approach, that is referred to as the native PCI power management (native PCI PM) |
38 | ACPI D3 state (power is fully off, all state was discarded); or both. | 42 | in what follows, the device power state is changed as a result of writing a |
39 | 43 | specific value into one of its standard configuration registers. The second | |
40 | Bus power management is not covered in this version of this document. | 44 | approach requires the platform firmware to provide special methods that may be |
41 | 45 | used by the kernel to change the device's power state. | |
42 | Note that all PCI devices support D0 and D3cold by default, regardless of | 46 | |
43 | whether or not they implement any of the PCI PM spec. | 47 | Devices supporting the native PCI PM usually can generate wakeup signals called |
44 | 48 | Power Management Events (PMEs) to let the kernel know about external events | |
45 | The possible state transitions that a device can undergo are: | 49 | requiring the device to be active. After receiving a PME the kernel is supposed |
46 | 50 | to put the device that sent it into the full-power state. However, the PCI Bus | |
47 | +---------------------------+ | 51 | Power Management Interface Specification doesn't define any standard method of |
48 | | Current State | New State | | 52 | delivering the PME from the device to the CPU and the operating system kernel. |
49 | +---------------------------+ | 53 | It is assumed that the platform firmware will perform this task and therefore, |
50 | | D0 | D1, D2, D3| | 54 | even though a PCI device is set up to generate PMEs, it also may be necessary to |
51 | +---------------------------+ | 55 | prepare the platform firmware for notifying the CPU of the PMEs coming from the |
52 | | D1 | D2, D3 | | 56 | device (e.g. by generating interrupts). |
53 | +---------------------------+ | 57 | |
54 | | D2 | D3 | | 58 | In turn, if the methods provided by the platform firmware are used for changing |
55 | +---------------------------+ | 59 | the power state of a device, usually the platform also provides a method for |
56 | | D1, D2, D3 | D0 | | 60 | preparing the device to generate wakeup signals. In that case, however, it |
57 | +---------------------------+ | 61 | often also is necessary to prepare the device for generating PMEs using the |
58 | 62 | native PCI PM mechanism, because the method provided by the platform depends on | |
59 | Note that when the system is entering a global suspend state, all devices will | 63 | that. |
60 | be placed into D3 and when resuming, all devices will be placed into D0. | 64 | |
61 | However, when the system is running, other state transitions are possible. | 65 | Thus in many situations both the native and the platform-based power management |
62 | 66 | mechanisms have to be used simultaneously to obtain the desired result. | |
63 | 2. How The PCI Subsystem Handles Power Management | 67 | |
64 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 68 | 1.2. Native PCI Power Management |
65 | 69 | -------------------------------- | |
66 | The PCI suspend/resume functionality is accessed indirectly via the Power | 70 | The PCI Bus Power Management Interface Specification (PCI PM Spec) was |
67 | Management subsystem. At boot, the PCI driver registers a power management | 71 | introduced between the PCI 2.1 and PCI 2.2 Specifications. It defined a |
68 | callback with that layer. Upon entering a suspend state, the PM layer iterates | 72 | standard interface for performing various operations related to power |
69 | through all of its registered callbacks. This currently takes place only during | 73 | management. |
70 | APM state transitions. | 74 | |
71 | 75 | The implementation of the PCI PM Spec is optional for conventional PCI devices, | |
72 | Upon going to sleep, the PCI subsystem walks its device tree twice. Both times, | 76 | but it is mandatory for PCI Express devices. If a device supports the PCI PM |
73 | it does a depth first walk of the device tree. The first walk saves each of the | 77 | Spec, it has an 8 byte power management capability field in its PCI |
74 | device's state and checks for devices that will prevent the system from entering | 78 | configuration space. This field is used to describe and control the standard |
75 | a global power state. The next walk then places the devices in a low power | 79 | features related to the native PCI power management. |
80 | |||
81 | The PCI PM Spec defines 4 operating states for devices (D0-D3) and for buses | ||
82 | (B0-B3). The higher the number, the less power is drawn by the device or bus | ||
83 | in that state. However, the higher the number, the longer the latency for | ||
84 | the device or bus to return to the full-power state (D0 or B0, respectively). | ||
85 | |||
86 | There are two variants of the D3 state defined by the specification. The first | ||
87 | one is D3hot, referred to as the software accessible D3, because devices can be | ||
88 | programmed to go into it. The second one, D3cold, is the state that PCI devices | ||
89 | are in when the supply voltage (Vcc) is removed from them. It is not possible | ||
90 | to program a PCI device to go into D3cold, although there may be a programmable | ||
91 | interface for putting the bus the device is on into a state in which Vcc is | ||
92 | removed from all devices on the bus. | ||
93 | |||
94 | PCI bus power management, however, is not supported by the Linux kernel at the | ||
95 | time of this writing and therefore it is not covered by this document. | ||
96 | |||
97 | Note that every PCI device can be in the full-power state (D0) or in D3cold, | ||
98 | regardless of whether or not it implements the PCI PM Spec. In addition to | ||
99 | that, if the PCI PM Spec is implemented by the device, it must support D3hot | ||
100 | as well as D0. The support for the D1 and D2 power states is optional. | ||
101 | |||
102 | PCI devices supporting the PCI PM Spec can be programmed to go to any of the | ||
103 | supported low-power states (except for D3cold). While in D1-D3hot the | ||
104 | standard configuration registers of the device must be accessible to software | ||
105 | (i.e. the device is required to respond to PCI configuration accesses), although | ||
106 | its I/O and memory spaces are then disabled. This allows the device to be | ||
107 | programmatically put into D0. Thus the kernel can switch the device back and | ||
108 | forth between D0 and the supported low-power states (except for D3cold) and the | ||
109 | possible power state transitions the device can undergo are the following: | ||
110 | |||
111 | +----------------------------+ | ||
112 | | Current State | New State | | ||
113 | +----------------------------+ | ||
114 | | D0 | D1, D2, D3 | | ||
115 | +----------------------------+ | ||
116 | | D1 | D2, D3 | | ||
117 | +----------------------------+ | ||
118 | | D2 | D3 | | ||
119 | +----------------------------+ | ||
120 | | D1, D2, D3 | D0 | | ||
121 | +----------------------------+ | ||
122 | |||
123 | The transition from D3cold to D0 occurs when the supply voltage is provided to | ||
124 | the device (i.e. power is restored). In that case the device returns to D0 with | ||
125 | a full power-on reset sequence and the power-on defaults are restored to the | ||
126 | device by hardware just as at initial power up. | ||
127 | |||
128 | PCI devices supporting the PCI PM Spec can be programmed to generate PMEs | ||
129 | while in a low-power state (D1-D3), but they are not required to be capable | ||
130 | of generating PMEs from all supported low-power states. In particular, the | ||
131 | capability of generating PMEs from D3cold is optional and depends on the | ||
132 | presence of additional voltage (3.3Vaux) allowing the device to remain | ||
133 | sufficiently active to generate a wakeup signal. | ||
134 | |||
135 | 1.3. ACPI Device Power Management | ||
136 | --------------------------------- | ||
137 | The platform firmware support for the power management of PCI devices is | ||
138 | system-specific. However, if the system in question is compliant with the | ||
139 | Advanced Configuration and Power Interface (ACPI) Specification, like the | ||
140 | majority of x86-based systems, it is supposed to implement device power | ||
141 | management interfaces defined by the ACPI standard. | ||
142 | |||
143 | For this purpose the ACPI BIOS provides special functions called "control | ||
144 | methods" that may be executed by the kernel to perform specific tasks, such as | ||
145 | putting a device into a low-power state. These control methods are encoded | ||
146 | using special byte-code language called the ACPI Machine Language (AML) and | ||
147 | stored in the machine's BIOS. The kernel loads them from the BIOS and executes | ||
148 | them as needed using an AML interpreter that translates the AML byte code into | ||
149 | computations and memory or I/O space accesses. This way, in theory, a BIOS | ||
150 | writer can provide the kernel with a means to perform actions depending | ||
151 | on the system design in a system-specific fashion. | ||
152 | |||
153 | ACPI control methods may be divided into global control methods, that are not | ||
154 | associated with any particular devices, and device control methods, that have | ||
155 | to be defined separately for each device supposed to be handled with the help of | ||
156 | the platform. This means, in particular, that ACPI device control methods can | ||
157 | only be used to handle devices that the BIOS writer knew about in advance. The | ||
158 | ACPI methods used for device power management fall into that category. | ||
159 | |||
160 | The ACPI specification assumes that devices can be in one of four power states | ||
161 | labeled as D0, D1, D2, and D3 that roughly correspond to the native PCI PM | ||
162 | D0-D3 states (although the difference between D3hot and D3cold is not taken | ||
163 | into account by ACPI). Moreover, for each power state of a device there is a | ||
164 | set of power resources that have to be enabled for the device to be put into | ||
165 | that state. These power resources are controlled (i.e. enabled or disabled) | ||
166 | with the help of their own control methods, _ON and _OFF, that have to be | ||
167 | defined individually for each of them. | ||
168 | |||
169 | To put a device into the ACPI power state Dx (where x is a number between 0 and | ||
170 | 3 inclusive) the kernel is supposed to (1) enable the power resources required | ||
171 | by the device in this state using their _ON control methods and (2) execute the | ||
172 | _PSx control method defined for the device. In addition to that, if the device | ||
173 | is going to be put into a low-power state (D1-D3) and is supposed to generate | ||
174 | wakeup signals from that state, the _DSW (or _PSW, replaced with _DSW by ACPI | ||
175 | 3.0) control method defined for it has to be executed before _PSx. Power | ||
176 | resources that are not required by the device in the target power state and are | ||
177 | not required any more by any other device should be disabled (by executing their | ||
178 | _OFF control methods). If the current power state of the device is D3, it can | ||
179 | only be put into D0 this way. | ||
180 | |||
181 | However, quite often the power states of devices are changed during a | ||
182 | system-wide transition into a sleep state or back into the working state. ACPI | ||
183 | defines four system sleep states, S1, S2, S3, and S4, and denotes the system | ||
184 | working state as S0. In general, the target system sleep (or working) state | ||
185 | determines the highest power (lowest number) state the device can be put | ||
186 | into and the kernel is supposed to obtain this information by executing the | ||
187 | device's _SxD control method (where x is a number between 0 and 4 inclusive). | ||
188 | If the device is required to wake up the system from the target sleep state, the | ||
189 | lowest power (highest number) state it can be put into is also determined by the | ||
190 | target state of the system. The kernel is then supposed to use the device's | ||
191 | _SxW control method to obtain the number of that state. It also is supposed to | ||
192 | use the device's _PRW control method to learn which power resources need to be | ||
193 | enabled for the device to be able to generate wakeup signals. | ||
194 | |||
195 | 1.4. Wakeup Signaling | ||
196 | --------------------- | ||
197 | Wakeup signals generated by PCI devices, either as native PCI PMEs, or as | ||
198 | a result of the execution of the _DSW (or _PSW) ACPI control method before | ||
199 | putting the device into a low-power state, have to be caught and handled as | ||
200 | appropriate. If they are sent while the system is in the working state | ||
201 | (ACPI S0), they should be translated into interrupts so that the kernel can | ||
202 | put the devices generating them into the full-power state and take care of the | ||
203 | events that triggered them. In turn, if they are sent while the system is | ||
204 | sleeping, they should cause the system's core logic to trigger wakeup. | ||
205 | |||
206 | On ACPI-based systems wakeup signals sent by conventional PCI devices are | ||
207 | converted into ACPI General-Purpose Events (GPEs) which are hardware signals | ||
208 | from the system core logic generated in response to various events that need to | ||
209 | be acted upon. Every GPE is associated with one or more sources of potentially | ||
210 | interesting events. In particular, a GPE may be associated with a PCI device | ||
211 | capable of signaling wakeup. The information on the connections between GPEs | ||
212 | and event sources is recorded in the system's ACPI BIOS from where it can be | ||
213 | read by the kernel. | ||
214 | |||
215 | If a PCI device known to the system's ACPI BIOS signals wakeup, the GPE | ||
216 | associated with it (if there is one) is triggered. The GPEs associated with PCI | ||
217 | bridges may also be triggered in response to a wakeup signal from one of the | ||
218 | devices below the bridge (this also is the case for root bridges) and, for | ||
219 | example, native PCI PMEs from devices unknown to the system's ACPI BIOS may be | ||
220 | handled this way. | ||
221 | |||
222 | A GPE may be triggered when the system is sleeping (i.e. when it is in one of | ||
223 | the ACPI S1-S4 states), in which case system wakeup is started by its core logic | ||
224 | (the device that was the source of the signal causing the system wakeup to occur | ||
225 | may be identified later). The GPEs used in such situations are referred to as | ||
226 | wakeup GPEs. | ||
227 | |||
228 | Usually, however, GPEs are also triggered when the system is in the working | ||
229 | state (ACPI S0) and in that case the system's core logic generates a System | ||
230 | Control Interrupt (SCI) to notify the kernel of the event. Then, the SCI | ||
231 | handler identifies the GPE that caused the interrupt to be generated which, | ||
232 | in turn, allows the kernel to identify the source of the event (that may be | ||
233 | a PCI device signaling wakeup). The GPEs used for notifying the kernel of | ||
234 | events occurring while the system is in the working state are referred to as | ||
235 | runtime GPEs. | ||
236 | |||
237 | Unfortunately, there is no standard way of handling wakeup signals sent by | ||
238 | conventional PCI devices on systems that are not ACPI-based, but there is one | ||
239 | for PCI Express devices. Namely, the PCI Express Base Specification introduced | ||
240 | a native mechanism for converting native PCI PMEs into interrupts generated by | ||
241 | root ports. For conventional PCI devices native PMEs are out-of-band, so they | ||
242 | are routed separately and they need not pass through bridges (in principle they | ||
243 | may be routed directly to the system's core logic), but for PCI Express devices | ||
244 | they are in-band messages that have to pass through the PCI Express hierarchy, | ||
245 | including the root port on the path from the device to the Root Complex. Thus | ||
246 | it was possible to introduce a mechanism by which a root port generates an | ||
247 | interrupt whenever it receives a PME message from one of the devices below it. | ||
248 | The PCI Express Requester ID of the device that sent the PME message is then | ||
249 | recorded in one of the root port's configuration registers from where it may be | ||
250 | read by the interrupt handler allowing the device to be identified. [PME | ||
251 | messages sent by PCI Express endpoints integrated with the Root Complex don't | ||
252 | pass through root ports, but instead they cause a Root Complex Event Collector | ||
253 | (if there is one) to generate interrupts.] | ||
254 | |||
255 | In principle the native PCI Express PME signaling may also be used on ACPI-based | ||
256 | systems along with the GPEs, but to use it the kernel has to ask the system's | ||
257 | ACPI BIOS to release control of root port configuration registers. The ACPI | ||
258 | BIOS, however, is not required to allow the kernel to control these registers | ||
259 | and if it doesn't do that, the kernel must not modify their contents. Of course | ||
260 | the native PCI Express PME signaling cannot be used by the kernel in that case. | ||
261 | |||
262 | |||
263 | 2. PCI Subsystem and Device Power Management | ||
264 | ============================================ | ||
265 | |||
266 | 2.1. Device Power Management Callbacks | ||
267 | -------------------------------------- | ||
268 | The PCI Subsystem participates in the power management of PCI devices in a | ||
269 | number of ways. First of all, it provides an intermediate code layer between | ||
270 | the device power management core (PM core) and PCI device drivers. | ||
271 | Specifically, the pm field of the PCI subsystem's struct bus_type object, | ||
272 | pci_bus_type, points to a struct dev_pm_ops object, pci_dev_pm_ops, containing | ||
273 | pointers to several device power management callbacks: | ||
274 | |||
275 | const struct dev_pm_ops pci_dev_pm_ops = { | ||
276 | .prepare = pci_pm_prepare, | ||
277 | .complete = pci_pm_complete, | ||
278 | .suspend = pci_pm_suspend, | ||
279 | .resume = pci_pm_resume, | ||
280 | .freeze = pci_pm_freeze, | ||
281 | .thaw = pci_pm_thaw, | ||
282 | .poweroff = pci_pm_poweroff, | ||
283 | .restore = pci_pm_restore, | ||
284 | .suspend_noirq = pci_pm_suspend_noirq, | ||
285 | .resume_noirq = pci_pm_resume_noirq, | ||
286 | .freeze_noirq = pci_pm_freeze_noirq, | ||
287 | .thaw_noirq = pci_pm_thaw_noirq, | ||
288 | .poweroff_noirq = pci_pm_poweroff_noirq, | ||
289 | .restore_noirq = pci_pm_restore_noirq, | ||
290 | .runtime_suspend = pci_pm_runtime_suspend, | ||
291 | .runtime_resume = pci_pm_runtime_resume, | ||
292 | .runtime_idle = pci_pm_runtime_idle, | ||
293 | }; | ||
294 | |||
295 | These callbacks are executed by the PM core in various situations related to | ||
296 | device power management and they, in turn, execute power management callbacks | ||
297 | provided by PCI device drivers. They also perform power management operations | ||
298 | involving some standard configuration registers of PCI devices that device | ||
299 | drivers need not know or care about. | ||
300 | |||
301 | The structure representing a PCI device, struct pci_dev, contains several fields | ||
302 | that these callbacks operate on: | ||
303 | |||
304 | struct pci_dev { | ||
305 | ... | ||
306 | pci_power_t current_state; /* Current operating state. */ | ||
307 | int pm_cap; /* PM capability offset in the | ||
308 | configuration space */ | ||
309 | unsigned int pme_support:5; /* Bitmask of states from which PME# | ||
310 | can be generated */ | ||
311 | unsigned int pme_interrupt:1;/* Is native PCIe PME signaling used? */ | ||
312 | unsigned int d1_support:1; /* Low power state D1 is supported */ | ||
313 | unsigned int d2_support:1; /* Low power state D2 is supported */ | ||
314 | unsigned int no_d1d2:1; /* D1 and D2 are forbidden */ | ||
315 | unsigned int wakeup_prepared:1; /* Device prepared for wake up */ | ||
316 | unsigned int d3_delay; /* D3->D0 transition time in ms */ | ||
317 | ... | ||
318 | }; | ||
319 | |||
320 | They also indirectly use some fields of the struct device that is embedded in | ||
321 | struct pci_dev. | ||
322 | |||
323 | 2.2. Device Initialization | ||
324 | -------------------------- | ||
325 | The PCI subsystem's first task related to device power management is to | ||
326 | prepare the device for power management and initialize the fields of struct | ||
327 | pci_dev used for this purpose. This happens in two functions defined in | ||
328 | drivers/pci/pci.c, pci_pm_init() and platform_pci_wakeup_init(). | ||
329 | |||
330 | The first of these functions checks if the device supports native PCI PM | ||
331 | and if that's the case the offset of its power management capability structure | ||
332 | in the configuration space is stored in the pm_cap field of the device's struct | ||
333 | pci_dev object. Next, the function checks which PCI low-power states are | ||
334 | supported by the device and from which low-power states the device can generate | ||
335 | native PCI PMEs. The power management fields of the device's struct pci_dev and | ||
336 | the struct device embedded in it are updated accordingly and the generation of | ||
337 | PMEs by the device is disabled. | ||
338 | |||
339 | The second function checks if the device can be prepared to signal wakeup with | ||
340 | the help of the platform firmware, such as the ACPI BIOS. If that is the case, | ||
341 | the function updates the wakeup fields in struct device embedded in the | ||
342 | device's struct pci_dev and uses the firmware-provided method to prevent the | ||
343 | device from signaling wakeup. | ||
344 | |||
345 | At this point the device is ready for power management. For driverless devices, | ||
346 | however, this functionality is limited to a few basic operations carried out | ||
347 | during system-wide transitions to a sleep state and back to the working state. | ||
348 | |||
349 | 2.3. Runtime Device Power Management | ||
350 | ------------------------------------ | ||
351 | The PCI subsystem plays a vital role in the runtime power management of PCI | ||
352 | devices. For this purpose it uses the general runtime power management | ||
353 | (runtime PM) framework described in Documentation/power/runtime_pm.txt. | ||
354 | Namely, it provides subsystem-level callbacks: | ||
355 | |||
356 | pci_pm_runtime_suspend() | ||
357 | pci_pm_runtime_resume() | ||
358 | pci_pm_runtime_idle() | ||
359 | |||
360 | that are executed by the core runtime PM routines. It also implements the | ||
361 | entire mechanics necessary for handling runtime wakeup signals from PCI devices | ||
362 | in low-power states, which at the time of this writing works for both the native | ||
363 | PCI Express PME signaling and the ACPI GPE-based wakeup signaling described in | ||
364 | Section 1. | ||
365 | |||
366 | First, a PCI device is put into a low-power state, or suspended, with the help | ||
367 | of pm_schedule_suspend() or pm_runtime_suspend() which for PCI devices call | ||
368 | pci_pm_runtime_suspend() to do the actual job. For this to work, the device's | ||
369 | driver has to provide a pm->runtime_suspend() callback (see below), which is | ||
370 | run by pci_pm_runtime_suspend() as the first action. If the driver's callback | ||
371 | returns successfully, the device's standard configuration registers are saved, | ||
372 | the device is prepared to generate wakeup signals and, finally, it is put into | ||
373 | the target low-power state. | ||
374 | |||
375 | The low-power state to put the device into is the lowest-power (highest number) | ||
376 | state from which it can signal wakeup. The exact method of signaling wakeup is | ||
377 | system-dependent and is determined by the PCI subsystem on the basis of the | ||
378 | reported capabilities of the device and the platform firmware. To prepare the | ||
379 | device for signaling wakeup and put it into the selected low-power state, the | ||
380 | PCI subsystem can use the platform firmware as well as the device's native PCI | ||
381 | PM capabilities, if supported. | ||
382 | |||
383 | It is expected that the device driver's pm->runtime_suspend() callback will | ||
384 | not attempt to prepare the device for signaling wakeup or to put it into a | ||
385 | low-power state. The driver ought to leave these tasks to the PCI subsystem | ||
386 | that has all of the information necessary to perform them. | ||
387 | |||
388 | A suspended device is brought back into the "active" state, or resumed, | ||
389 | with the help of pm_request_resume() or pm_runtime_resume() which both call | ||
390 | pci_pm_runtime_resume() for PCI devices. Again, this only works if the device's | ||
391 | driver provides a pm->runtime_resume() callback (see below). However, before | ||
392 | the driver's callback is executed, pci_pm_runtime_resume() brings the device | ||
393 | back into the full-power state, prevents it from signaling wakeup while in that | ||
394 | state and restores its standard configuration registers. Thus the driver's | ||
395 | callback need not worry about the PCI-specific aspects of the device resume. | ||
396 | |||
397 | Note that generally pci_pm_runtime_resume() may be called in two different | ||
398 | situations. First, it may be called at the request of the device's driver, for | ||
399 | example if there are some data for it to process. Second, it may be called | ||
400 | as a result of a wakeup signal from the device itself (this sometimes is | ||
401 | referred to as "remote wakeup"). Of course, for this purpose the wakeup signal | ||
402 | is handled in one of the ways described in Section 1 and finally converted into | ||
403 | a notification for the PCI subsystem after the source device has been | ||
404 | identified. | ||
405 | |||
406 | The pci_pm_runtime_idle() function, called for PCI devices by pm_runtime_idle() | ||
407 | and pm_request_idle(), executes the device driver's pm->runtime_idle() | ||
408 | callback, if defined, and if that callback doesn't return error code (or is not | ||
409 | present at all), suspends the device with the help of pm_runtime_suspend(). | ||
410 | Sometimes pci_pm_runtime_idle() is called automatically by the PM core (for | ||
411 | example, it is called right after the device has just been resumed), in which | ||
412 | cases it is expected to suspend the device if that makes sense. Usually, | ||
413 | however, the PCI subsystem doesn't really know if the device really can be | ||
414 | suspended, so it lets the device's driver decide by running its | ||
415 | pm->runtime_idle() callback. | ||
416 | |||
417 | 2.4. System-Wide Power Transitions | ||
418 | ---------------------------------- | ||
419 | There are a few different types of system-wide power transitions, described in | ||
420 | Documentation/power/devices.txt. Each of them requires devices to be handled | ||
421 | in a specific way and the PM core executes subsystem-level power management | ||
422 | callbacks for this purpose. They are executed in phases such that each phase | ||
423 | involves executing the same subsystem-level callback for every device belonging | ||
424 | to the given subsystem before the next phase begins. These phases always run | ||
425 | after tasks have been frozen. | ||
426 | |||
427 | 2.4.1. System Suspend | ||
428 | |||
429 | When the system is going into a sleep state in which the contents of memory will | ||
430 | be preserved, such as one of the ACPI sleep states S1-S3, the phases are: | ||
431 | |||
432 | prepare, suspend, suspend_noirq. | ||
433 | |||
434 | The following PCI bus type's callbacks, respectively, are used in these phases: | ||
435 | |||
436 | pci_pm_prepare() | ||
437 | pci_pm_suspend() | ||
438 | pci_pm_suspend_noirq() | ||
439 | |||
440 | The pci_pm_prepare() routine first puts the device into the "fully functional" | ||
441 | state with the help of pm_runtime_resume(). Then, it executes the device | ||
442 | driver's pm->prepare() callback if defined (i.e. if the driver's struct | ||
443 | dev_pm_ops object is present and the prepare pointer in that object is valid). | ||
444 | |||
445 | The pci_pm_suspend() routine first checks if the device's driver implements | ||
446 | legacy PCI suspend routines (see Section 3), in which case the driver's legacy | ||
447 | suspend callback is executed, if present, and its result is returned. Next, if | ||
448 | the device's driver doesn't provide a struct dev_pm_ops object (containing | ||
449 | pointers to the driver's callbacks), pci_pm_default_suspend() is called, which | ||
450 | simply turns off the device's bus master capability and runs | ||
451 | pcibios_disable_device() to disable it, unless the device is a bridge (PCI | ||
452 | bridges are ignored by this routine). Next, the device driver's pm->suspend() | ||
453 | callback is executed, if defined, and its result is returned if it fails. | ||
454 | Finally, pci_fixup_device() is called to apply hardware suspend quirks related | ||
455 | to the device if necessary. | ||
456 | |||
457 | Note that the suspend phase is carried out asynchronously for PCI devices, so | ||
458 | the pci_pm_suspend() callback may be executed in parallel for any pair of PCI | ||
459 | devices that don't depend on each other in a known way (i.e. none of the paths | ||
460 | in the device tree from the root bridge to a leaf device contains both of them). | ||
461 | |||
462 | The pci_pm_suspend_noirq() routine is executed after suspend_device_irqs() has | ||
463 | been called, which means that the device driver's interrupt handler won't be | ||
464 | invoked while this routine is running. It first checks if the device's driver | ||
465 | implements legacy PCI suspends routines (Section 3), in which case the legacy | ||
466 | late suspend routine is called and its result is returned (the standard | ||
467 | configuration registers of the device are saved if the driver's callback hasn't | ||
468 | done that). Second, if the device driver's struct dev_pm_ops object is not | ||
469 | present, the device's standard configuration registers are saved and the routine | ||
470 | returns success. Otherwise the device driver's pm->suspend_noirq() callback is | ||
471 | executed, if present, and its result is returned if it fails. Next, if the | ||
472 | device's standard configuration registers haven't been saved yet (one of the | ||
473 | device driver's callbacks executed before might do that), pci_pm_suspend_noirq() | ||
474 | saves them, prepares the device to signal wakeup (if necessary) and puts it into | ||
475 | a low-power state. | ||
476 | |||
477 | The low-power state to put the device into is the lowest-power (highest number) | ||
478 | state from which it can signal wakeup while the system is in the target sleep | ||
479 | state. Just like in the runtime PM case described above, the mechanism of | ||
480 | signaling wakeup is system-dependent and determined by the PCI subsystem, which | ||
481 | is also responsible for preparing the device to signal wakeup from the system's | ||
482 | target sleep state as appropriate. | ||
483 | |||
484 | PCI device drivers (that don't implement legacy power management callbacks) are | ||
485 | generally not expected to prepare devices for signaling wakeup or to put them | ||
486 | into low-power states. However, if one of the driver's suspend callbacks | ||
487 | (pm->suspend() or pm->suspend_noirq()) saves the device's standard configuration | ||
488 | registers, pci_pm_suspend_noirq() will assume that the device has been prepared | ||
489 | to signal wakeup and put into a low-power state by the driver (the driver is | ||
490 | then assumed to have used the helper functions provided by the PCI subsystem for | ||
491 | this purpose). PCI device drivers are not encouraged to do that, but in some | ||
492 | rare cases doing that in the driver may be the optimum approach. | ||
493 | |||
494 | 2.4.2. System Resume | ||
495 | |||
496 | When the system is undergoing a transition from a sleep state in which the | ||
497 | contents of memory have been preserved, such as one of the ACPI sleep states | ||
498 | S1-S3, into the working state (ACPI S0), the phases are: | ||
499 | |||
500 | resume_noirq, resume, complete. | ||
501 | |||
502 | The following PCI bus type's callbacks, respectively, are executed in these | ||
503 | phases: | ||
504 | |||
505 | pci_pm_resume_noirq() | ||
506 | pci_pm_resume() | ||
507 | pci_pm_complete() | ||
508 | |||
509 | The pci_pm_resume_noirq() routine first puts the device into the full-power | ||
510 | state, restores its standard configuration registers and applies early resume | ||
511 | hardware quirks related to the device, if necessary. This is done | ||
512 | unconditionally, regardless of whether or not the device's driver implements | ||
513 | legacy PCI power management callbacks (this way all PCI devices are in the | ||
514 | full-power state and their standard configuration registers have been restored | ||
515 | when their interrupt handlers are invoked for the first time during resume, | ||
516 | which allows the kernel to avoid problems with the handling of shared interrupts | ||
517 | by drivers whose devices are still suspended). If legacy PCI power management | ||
518 | callbacks (see Section 3) are implemented by the device's driver, the legacy | ||
519 | early resume callback is executed and its result is returned. Otherwise, the | ||
520 | device driver's pm->resume_noirq() callback is executed, if defined, and its | ||
521 | result is returned. | ||
522 | |||
523 | The pci_pm_resume() routine first checks if the device's standard configuration | ||
524 | registers have been restored and restores them if that's not the case (this | ||
525 | only is necessary in the error path during a failing suspend). Next, resume | ||
526 | hardware quirks related to the device are applied, if necessary, and if the | ||
527 | device's driver implements legacy PCI power management callbacks (see | ||
528 | Section 3), the driver's legacy resume callback is executed and its result is | ||
529 | returned. Otherwise, the device's wakeup signaling mechanisms are blocked and | ||
530 | its driver's pm->resume() callback is executed, if defined (the callback's | ||
531 | result is then returned). | ||
532 | |||
533 | The resume phase is carried out asynchronously for PCI devices, like the | ||
534 | suspend phase described above, which means that if two PCI devices don't depend | ||
535 | on each other in a known way, the pci_pm_resume() routine may be executed for | ||
536 | the both of them in parallel. | ||
537 | |||
538 | The pci_pm_complete() routine only executes the device driver's pm->complete() | ||
539 | callback, if defined. | ||
540 | |||
541 | 2.4.3. System Hibernation | ||
542 | |||
543 | System hibernation is more complicated than system suspend, because it requires | ||
544 | a system image to be created and written into a persistent storage medium. The | ||
545 | image is created atomically and all devices are quiesced, or frozen, before that | ||
546 | happens. | ||
547 | |||
548 | The freezing of devices is carried out after enough memory has been freed (at | ||
549 | the time of this writing the image creation requires at least 50% of system RAM | ||
550 | to be free) in the following three phases: | ||
551 | |||
552 | prepare, freeze, freeze_noirq | ||
553 | |||
554 | that correspond to the PCI bus type's callbacks: | ||
555 | |||
556 | pci_pm_prepare() | ||
557 | pci_pm_freeze() | ||
558 | pci_pm_freeze_noirq() | ||
559 | |||
560 | This means that the prepare phase is exactly the same as for system suspend. | ||
561 | The other two phases, however, are different. | ||
562 | |||
563 | The pci_pm_freeze() routine is quite similar to pci_pm_suspend(), but it runs | ||
564 | the device driver's pm->freeze() callback, if defined, instead of pm->suspend(), | ||
565 | and it doesn't apply the suspend-related hardware quirks. It is executed | ||
566 | asynchronously for different PCI devices that don't depend on each other in a | ||
567 | known way. | ||
568 | |||
569 | The pci_pm_freeze_noirq() routine, in turn, is similar to | ||
570 | pci_pm_suspend_noirq(), but it calls the device driver's pm->freeze_noirq() | ||
571 | routine instead of pm->suspend_noirq(). It also doesn't attempt to prepare the | ||
572 | device for signaling wakeup and put it into a low-power state. Still, it saves | ||
573 | the device's standard configuration registers if they haven't been saved by one | ||
574 | of the driver's callbacks. | ||
575 | |||
576 | Once the image has been created, it has to be saved. However, at this point all | ||
577 | devices are frozen and they cannot handle I/O, while their ability to handle | ||
578 | I/O is obviously necessary for the image saving. Thus they have to be brought | ||
579 | back to the fully functional state and this is done in the following phases: | ||
580 | |||
581 | thaw_noirq, thaw, complete | ||
582 | |||
583 | using the following PCI bus type's callbacks: | ||
584 | |||
585 | pci_pm_thaw_noirq() | ||
586 | pci_pm_thaw() | ||
587 | pci_pm_complete() | ||
588 | |||
589 | respectively. | ||
590 | |||
591 | The first of them, pci_pm_thaw_noirq(), is analogous to pci_pm_resume_noirq(), | ||
592 | but it doesn't put the device into the full power state and doesn't attempt to | ||
593 | restore its standard configuration registers. It also executes the device | ||
594 | driver's pm->thaw_noirq() callback, if defined, instead of pm->resume_noirq(). | ||
595 | |||
596 | The pci_pm_thaw() routine is similar to pci_pm_resume(), but it runs the device | ||
597 | driver's pm->thaw() callback instead of pm->resume(). It is executed | ||
598 | asynchronously for different PCI devices that don't depend on each other in a | ||
599 | known way. | ||
600 | |||
601 | The complete phase it the same as for system resume. | ||
602 | |||
603 | After saving the image, devices need to be powered down before the system can | ||
604 | enter the target sleep state (ACPI S4 for ACPI-based systems). This is done in | ||
605 | three phases: | ||
606 | |||
607 | prepare, poweroff, poweroff_noirq | ||
608 | |||
609 | where the prepare phase is exactly the same as for system suspend. The other | ||
610 | two phases are analogous to the suspend and suspend_noirq phases, respectively. | ||
611 | The PCI subsystem-level callbacks they correspond to | ||
612 | |||
613 | pci_pm_poweroff() | ||
614 | pci_pm_poweroff_noirq() | ||
615 | |||
616 | work in analogy with pci_pm_suspend() and pci_pm_poweroff_noirq(), respectively, | ||
617 | although they don't attempt to save the device's standard configuration | ||
618 | registers. | ||
619 | |||
620 | 2.4.4. System Restore | ||
621 | |||
622 | System restore requires a hibernation image to be loaded into memory and the | ||
623 | pre-hibernation memory contents to be restored before the pre-hibernation system | ||
624 | activity can be resumed. | ||
625 | |||
626 | As described in Documentation/power/devices.txt, the hibernation image is loaded | ||
627 | into memory by a fresh instance of the kernel, called the boot kernel, which in | ||
628 | turn is loaded and run by a boot loader in the usual way. After the boot kernel | ||
629 | has loaded the image, it needs to replace its own code and data with the code | ||
630 | and data of the "hibernated" kernel stored within the image, called the image | ||
631 | kernel. For this purpose all devices are frozen just like before creating | ||
632 | the image during hibernation, in the | ||
633 | |||
634 | prepare, freeze, freeze_noirq | ||
635 | |||
636 | phases described above. However, the devices affected by these phases are only | ||
637 | those having drivers in the boot kernel; other devices will still be in whatever | ||
638 | state the boot loader left them. | ||
639 | |||
640 | Should the restoration of the pre-hibernation memory contents fail, the boot | ||
641 | kernel would go through the "thawing" procedure described above, using the | ||
642 | thaw_noirq, thaw, and complete phases (that will only affect the devices having | ||
643 | drivers in the boot kernel), and then continue running normally. | ||
644 | |||
645 | If the pre-hibernation memory contents are restored successfully, which is the | ||
646 | usual situation, control is passed to the image kernel, which then becomes | ||
647 | responsible for bringing the system back to the working state. To achieve this, | ||
648 | it must restore the devices' pre-hibernation functionality, which is done much | ||
649 | like waking up from the memory sleep state, although it involves different | ||
650 | phases: | ||
651 | |||
652 | restore_noirq, restore, complete | ||
653 | |||
654 | The first two of these are analogous to the resume_noirq and resume phases | ||
655 | described above, respectively, and correspond to the following PCI subsystem | ||
656 | callbacks: | ||
657 | |||
658 | pci_pm_restore_noirq() | ||
659 | pci_pm_restore() | ||
660 | |||
661 | These callbacks work in analogy with pci_pm_resume_noirq() and pci_pm_resume(), | ||
662 | respectively, but they execute the device driver's pm->restore_noirq() and | ||
663 | pm->restore() callbacks, if available. | ||
664 | |||
665 | The complete phase is carried out in exactly the same way as during system | ||
666 | resume. | ||
667 | |||
668 | |||
669 | 3. PCI Device Drivers and Power Management | ||
670 | ========================================== | ||
671 | |||
672 | 3.1. Power Management Callbacks | ||
673 | ------------------------------- | ||
674 | PCI device drivers participate in power management by providing callbacks to be | ||
675 | executed by the PCI subsystem's power management routines described above and by | ||
676 | controlling the runtime power management of their devices. | ||
677 | |||
678 | At the time of this writing there are two ways to define power management | ||
679 | callbacks for a PCI device driver, the recommended one, based on using a | ||
680 | dev_pm_ops structure described in Documentation/power/devices.txt, and the | ||
681 | "legacy" one, in which the .suspend(), .suspend_late(), .resume_early(), and | ||
682 | .resume() callbacks from struct pci_driver are used. The legacy approach, | ||
683 | however, doesn't allow one to define runtime power management callbacks and is | ||
684 | not really suitable for any new drivers. Therefore it is not covered by this | ||
685 | document (refer to the source code to learn more about it). | ||
686 | |||
687 | It is recommended that all PCI device drivers define a struct dev_pm_ops object | ||
688 | containing pointers to power management (PM) callbacks that will be executed by | ||
689 | the PCI subsystem's PM routines in various circumstances. A pointer to the | ||
690 | driver's struct dev_pm_ops object has to be assigned to the driver.pm field in | ||
691 | its struct pci_driver object. Once that has happened, the "legacy" PM callbacks | ||
692 | in struct pci_driver are ignored (even if they are not NULL). | ||
693 | |||
694 | The PM callbacks in struct dev_pm_ops are not mandatory and if they are not | ||
695 | defined (i.e. the respective fields of struct dev_pm_ops are unset) the PCI | ||
696 | subsystem will handle the device in a simplified default manner. If they are | ||
697 | defined, though, they are expected to behave as described in the following | ||
698 | subsections. | ||
699 | |||
700 | 3.1.1. prepare() | ||
701 | |||
702 | The prepare() callback is executed during system suspend, during hibernation | ||
703 | (when a hibernation image is about to be created), during power-off after | ||
704 | saving a hibernation image and during system restore, when a hibernation image | ||
705 | has just been loaded into memory. | ||
706 | |||
707 | This callback is only necessary if the driver's device has children that in | ||
708 | general may be registered at any time. In that case the role of the prepare() | ||
709 | callback is to prevent new children of the device from being registered until | ||
710 | one of the resume_noirq(), thaw_noirq(), or restore_noirq() callbacks is run. | ||
711 | |||
712 | In addition to that the prepare() callback may carry out some operations | ||
713 | preparing the device to be suspended, although it should not allocate memory | ||
714 | (if additional memory is required to suspend the device, it has to be | ||
715 | preallocated earlier, for example in a suspend/hibernate notifier as described | ||
716 | in Documentation/power/notifiers.txt). | ||
717 | |||
718 | 3.1.2. suspend() | ||
719 | |||
720 | The suspend() callback is only executed during system suspend, after prepare() | ||
721 | callbacks have been executed for all devices in the system. | ||
722 | |||
723 | This callback is expected to quiesce the device and prepare it to be put into a | ||
724 | low-power state by the PCI subsystem. It is not required (in fact it even is | ||
725 | not recommended) that a PCI driver's suspend() callback save the standard | ||
726 | configuration registers of the device, prepare it for waking up the system, or | ||
727 | put it into a low-power state. All of these operations can very well be taken | ||
728 | care of by the PCI subsystem, without the driver's participation. | ||
729 | |||
730 | However, in some rare case it is convenient to carry out these operations in | ||
731 | a PCI driver. Then, pci_save_state(), pci_prepare_to_sleep(), and | ||
732 | pci_set_power_state() should be used to save the device's standard configuration | ||
733 | registers, to prepare it for system wakeup (if necessary), and to put it into a | ||
734 | low-power state, respectively. Moreover, if the driver calls pci_save_state(), | ||
735 | the PCI subsystem will not execute either pci_prepare_to_sleep(), or | ||
736 | pci_set_power_state() for its device, so the driver is then responsible for | ||
737 | handling the device as appropriate. | ||
738 | |||
739 | While the suspend() callback is being executed, the driver's interrupt handler | ||
740 | can be invoked to handle an interrupt from the device, so all suspend-related | ||
741 | operations relying on the driver's ability to handle interrupts should be | ||
742 | carried out in this callback. | ||
743 | |||
744 | 3.1.3. suspend_noirq() | ||
745 | |||
746 | The suspend_noirq() callback is only executed during system suspend, after | ||
747 | suspend() callbacks have been executed for all devices in the system and | ||
748 | after device interrupts have been disabled by the PM core. | ||
749 | |||
750 | The difference between suspend_noirq() and suspend() is that the driver's | ||
751 | interrupt handler will not be invoked while suspend_noirq() is running. Thus | ||
752 | suspend_noirq() can carry out operations that would cause race conditions to | ||
753 | arise if they were performed in suspend(). | ||
754 | |||
755 | 3.1.4. freeze() | ||
756 | |||
757 | The freeze() callback is hibernation-specific and is executed in two situations, | ||
758 | during hibernation, after prepare() callbacks have been executed for all devices | ||
759 | in preparation for the creation of a system image, and during restore, | ||
760 | after a system image has been loaded into memory from persistent storage and the | ||
761 | prepare() callbacks have been executed for all devices. | ||
762 | |||
763 | The role of this callback is analogous to the role of the suspend() callback | ||
764 | described above. In fact, they only need to be different in the rare cases when | ||
765 | the driver takes the responsibility for putting the device into a low-power | ||
76 | state. | 766 | state. |
77 | 767 | ||
78 | The first walk allows a graceful recovery in the event of a failure, since none | 768 | In that cases the freeze() callback should not prepare the device system wakeup |
79 | of the devices have actually been powered down. | 769 | or put it into a low-power state. Still, either it or freeze_noirq() should |
80 | 770 | save the device's standard configuration registers using pci_save_state(). | |
81 | In both walks, in particular the second, all children of a bridge are touched | ||
82 | before the actual bridge itself. This allows the bridge to retain power while | ||
83 | its children are being accessed. | ||
84 | |||
85 | Upon resuming from sleep, just the opposite must be true: all bridges must be | ||
86 | powered on and restored before their children are powered on. This is easily | ||
87 | accomplished with a breadth-first walk of the PCI device tree. | ||
88 | |||
89 | |||
90 | 3. PCI Utility Functions | ||
91 | ~~~~~~~~~~~~~~~~~~~~~~~~ | ||
92 | |||
93 | These are helper functions designed to be called by individual device drivers. | ||
94 | Assuming that a device behaves as advertised, these should be applicable in most | ||
95 | cases. However, results may vary. | ||
96 | |||
97 | Note that these functions are never implicitly called for the driver. The driver | ||
98 | is always responsible for deciding when and if to call these. | ||
99 | |||
100 | |||
101 | pci_save_state | ||
102 | -------------- | ||
103 | |||
104 | Usage: | ||
105 | pci_save_state(struct pci_dev *dev); | ||
106 | |||
107 | Description: | ||
108 | Save first 64 bytes of PCI config space, along with any additional | ||
109 | PCI-Express or PCI-X information. | ||
110 | |||
111 | |||
112 | pci_restore_state | ||
113 | ----------------- | ||
114 | |||
115 | Usage: | ||
116 | pci_restore_state(struct pci_dev *dev); | ||
117 | |||
118 | Description: | ||
119 | Restore previously saved config space. | ||
120 | |||
121 | |||
122 | pci_set_power_state | ||
123 | ------------------- | ||
124 | |||
125 | Usage: | ||
126 | pci_set_power_state(struct pci_dev *dev, pci_power_t state); | ||
127 | |||
128 | Description: | ||
129 | Transition device to low power state using PCI PM Capabilities | ||
130 | registers. | ||
131 | |||
132 | Will fail under one of the following conditions: | ||
133 | - If state is less than current state, but not D0 (illegal transition) | ||
134 | - Device doesn't support PM Capabilities | ||
135 | - Device does not support requested state | ||
136 | |||
137 | |||
138 | pci_enable_wake | ||
139 | --------------- | ||
140 | |||
141 | Usage: | ||
142 | pci_enable_wake(struct pci_dev *dev, pci_power_t state, int enable); | ||
143 | |||
144 | Description: | ||
145 | Enable device to generate PME# during low power state using PCI PM | ||
146 | Capabilities. | ||
147 | |||
148 | Checks whether if device supports generating PME# from requested state | ||
149 | and fail if it does not, unless enable == 0 (request is to disable wake | ||
150 | events, which is implicit if it doesn't even support it in the first | ||
151 | place). | ||
152 | |||
153 | Note that the PMC Register in the device's PM Capabilities has a bitmask | ||
154 | of the states it supports generating PME# from. D3hot is bit 3 and | ||
155 | D3cold is bit 4. So, while a value of 4 as the state may not seem | ||
156 | semantically correct, it is. | ||
157 | |||
158 | |||
159 | 4. PCI Device Drivers | ||
160 | ~~~~~~~~~~~~~~~~~~~~~ | ||
161 | |||
162 | These functions are intended for use by individual drivers, and are defined in | ||
163 | struct pci_driver: | ||
164 | |||
165 | int (*suspend) (struct pci_dev *dev, pm_message_t state); | ||
166 | int (*resume) (struct pci_dev *dev); | ||
167 | |||
168 | |||
169 | suspend | ||
170 | ------- | ||
171 | |||
172 | Usage: | ||
173 | |||
174 | if (dev->driver && dev->driver->suspend) | ||
175 | dev->driver->suspend(dev,state); | ||
176 | |||
177 | A driver uses this function to actually transition the device into a low power | ||
178 | state. This should include disabling I/O, IRQs, and bus-mastering, as well as | ||
179 | physically transitioning the device to a lower power state; it may also include | ||
180 | calls to pci_enable_wake(). | ||
181 | |||
182 | Bus mastering may be disabled by doing: | ||
183 | |||
184 | pci_disable_device(dev); | ||
185 | |||
186 | For devices that support the PCI PM Spec, this may be used to set the device's | ||
187 | power state to match the suspend() parameter: | ||
188 | |||
189 | pci_set_power_state(dev,state); | ||
190 | |||
191 | The driver is also responsible for disabling any other device-specific features | ||
192 | (e.g blanking screen, turning off on-card memory, etc). | ||
193 | |||
194 | The driver should be sure to track the current state of the device, as it may | ||
195 | obviate the need for some operations. | ||
196 | |||
197 | The driver should update the current_state field in its pci_dev structure in | ||
198 | this function, except for PM-capable devices when pci_set_power_state is used. | ||
199 | |||
200 | resume | ||
201 | ------ | ||
202 | |||
203 | Usage: | ||
204 | |||
205 | if (dev->driver && dev->driver->resume) | ||
206 | dev->driver->resume(dev) | ||
207 | 771 | ||
208 | The resume callback may be called from any power state, and is always meant to | 772 | 3.1.5. freeze_noirq() |
209 | transition the device to the D0 state. | ||
210 | 773 | ||
211 | The driver is responsible for reenabling any features of the device that had | 774 | The freeze_noirq() callback is hibernation-specific. It is executed during |
212 | been disabled during previous suspend calls, such as IRQs and bus mastering, | 775 | hibernation, after prepare() and freeze() callbacks have been executed for all |
213 | as well as calling pci_restore_state(). | 776 | devices in preparation for the creation of a system image, and during restore, |
777 | after a system image has been loaded into memory and after prepare() and | ||
778 | freeze() callbacks have been executed for all devices. It is always executed | ||
779 | after device interrupts have been disabled by the PM core. | ||
214 | 780 | ||
215 | If the device is currently in D3, it may need to be reinitialized in resume(). | 781 | The role of this callback is analogous to the role of the suspend_noirq() |
782 | callback described above and it very rarely is necessary to define | ||
783 | freeze_noirq(). | ||
216 | 784 | ||
217 | * Some types of devices, like bus controllers, will preserve context in D3hot | 785 | The difference between freeze_noirq() and freeze() is analogous to the |
218 | (using Vcc power). Their drivers will often want to avoid re-initializing | 786 | difference between suspend_noirq() and suspend(). |
219 | them after re-entering D0 (perhaps to avoid resetting downstream devices). | ||
220 | 787 | ||
221 | * Other kinds of devices in D3hot will discard device context as part of a | 788 | 3.1.6. poweroff() |
222 | soft reset when re-entering the D0 state. | ||
223 | |||
224 | * Devices resuming from D3cold always go through a power-on reset. Some | ||
225 | device context can also be preserved using Vaux power. | ||
226 | 789 | ||
227 | * Some systems hide D3cold resume paths from drivers. For example, on PCs | 790 | The poweroff() callback is hibernation-specific. It is executed when the system |
228 | the resume path for suspend-to-disk often runs BIOS powerup code, which | 791 | is about to be powered off after saving a hibernation image to a persistent |
229 | will sometimes re-initialize the device. | 792 | storage. prepare() callbacks are executed for all devices before poweroff() is |
793 | called. | ||
230 | 794 | ||
231 | To handle resets during D3 to D0 transitions, it may be convenient to share | 795 | The role of this callback is analogous to the role of the suspend() and freeze() |
232 | device initialization code between probe() and resume(). Device parameters | 796 | callbacks described above, although it does not need to save the contents of |
233 | can also be saved before the driver suspends into D3, avoiding re-probe. | 797 | the device's registers. In particular, if the driver wants to put the device |
798 | into a low-power state itself instead of allowing the PCI subsystem to do that, | ||
799 | the poweroff() callback should use pci_prepare_to_sleep() and | ||
800 | pci_set_power_state() to prepare the device for system wakeup and to put it | ||
801 | into a low-power state, respectively, but it need not save the device's standard | ||
802 | configuration registers. | ||
234 | 803 | ||
235 | If the device supports the PCI PM Spec, it can use this to physically transition | 804 | 3.1.7. poweroff_noirq() |
236 | the device to D0: | ||
237 | 805 | ||
238 | pci_set_power_state(dev,0); | 806 | The poweroff_noirq() callback is hibernation-specific. It is executed after |
807 | poweroff() callbacks have been executed for all devices in the system. | ||
239 | 808 | ||
240 | Note that if the entire system is transitioning out of a global sleep state, all | 809 | The role of this callback is analogous to the role of the suspend_noirq() and |
241 | devices will be placed in the D0 state, so this is not necessary. However, in | 810 | freeze_noirq() callbacks described above, but it does not need to save the |
242 | the event that the device is placed in the D3 state during normal operation, | 811 | contents of the device's registers. |
243 | this call is necessary. It is impossible to determine which of the two events is | ||
244 | taking place in the driver, so it is always a good idea to make that call. | ||
245 | 812 | ||
246 | The driver should take note of the state that it is resuming from in order to | 813 | The difference between poweroff_noirq() and poweroff() is analogous to the |
247 | ensure correct (and speedy) operation. | 814 | difference between suspend_noirq() and suspend(). |
248 | 815 | ||
249 | The driver should update the current_state field in its pci_dev structure in | 816 | 3.1.8. resume_noirq() |
250 | this function, except for PM-capable devices when pci_set_power_state is used. | ||
251 | 817 | ||
818 | The resume_noirq() callback is only executed during system resume, after the | ||
819 | PM core has enabled the non-boot CPUs. The driver's interrupt handler will not | ||
820 | be invoked while resume_noirq() is running, so this callback can carry out | ||
821 | operations that might race with the interrupt handler. | ||
252 | 822 | ||
823 | Since the PCI subsystem unconditionally puts all devices into the full power | ||
824 | state in the resume_noirq phase of system resume and restores their standard | ||
825 | configuration registers, resume_noirq() is usually not necessary. In general | ||
826 | it should only be used for performing operations that would lead to race | ||
827 | conditions if carried out by resume(). | ||
253 | 828 | ||
254 | A reference implementation | 829 | 3.1.9. resume() |
255 | ------------------------- | ||
256 | .suspend() | ||
257 | { | ||
258 | /* driver specific operations */ | ||
259 | 830 | ||
260 | /* Disable IRQ */ | 831 | The resume() callback is only executed during system resume, after |
261 | free_irq(); | 832 | resume_noirq() callbacks have been executed for all devices in the system and |
262 | /* If using MSI */ | 833 | device interrupts have been enabled by the PM core. |
263 | pci_disable_msi(); | ||
264 | 834 | ||
265 | pci_save_state(); | 835 | This callback is responsible for restoring the pre-suspend configuration of the |
266 | pci_enable_wake(); | 836 | device and bringing it back to the fully functional state. The device should be |
267 | /* Disable IO/bus master/irq router */ | 837 | able to process I/O in a usual way after resume() has returned. |
268 | pci_disable_device(); | ||
269 | pci_set_power_state(pci_choose_state()); | ||
270 | } | ||
271 | 838 | ||
272 | .resume() | 839 | 3.1.10. thaw_noirq() |
273 | { | ||
274 | pci_set_power_state(PCI_D0); | ||
275 | pci_restore_state(); | ||
276 | /* device's irq possibly is changed, driver should take care */ | ||
277 | pci_enable_device(); | ||
278 | pci_set_master(); | ||
279 | 840 | ||
280 | /* if using MSI, device's vector possibly is changed */ | 841 | The thaw_noirq() callback is hibernation-specific. It is executed after a |
281 | pci_enable_msi(); | 842 | system image has been created and the non-boot CPUs have been enabled by the PM |
843 | core, in the thaw_noirq phase of hibernation. It also may be executed if the | ||
844 | loading of a hibernation image fails during system restore (it is then executed | ||
845 | after enabling the non-boot CPUs). The driver's interrupt handler will not be | ||
846 | invoked while thaw_noirq() is running. | ||
282 | 847 | ||
283 | request_irq(); | 848 | The role of this callback is analogous to the role of resume_noirq(). The |
284 | /* driver specific operations; */ | 849 | difference between these two callbacks is that thaw_noirq() is executed after |
285 | } | 850 | freeze() and freeze_noirq(), so in general it does not need to modify the |
851 | contents of the device's registers. | ||
286 | 852 | ||
287 | This is a typical implementation. Drivers can slightly change the order | 853 | 3.1.11. thaw() |
288 | of the operations in the implementation, ignore some operations or add | ||
289 | more driver specific operations in it, but drivers should do something like | ||
290 | this on the whole. | ||
291 | 854 | ||
292 | 5. Resources | 855 | The thaw() callback is hibernation-specific. It is executed after thaw_noirq() |
293 | ~~~~~~~~~~~~ | 856 | callbacks have been executed for all devices in the system and after device |
857 | interrupts have been enabled by the PM core. | ||
294 | 858 | ||
295 | PCI Local Bus Specification | 859 | This callback is responsible for restoring the pre-freeze configuration of |
296 | PCI Bus Power Management Interface Specification | 860 | the device, so that it will work in a usual way after thaw() has returned. |
297 | 861 | ||
298 | http://www.pcisig.com | 862 | 3.1.12. restore_noirq() |
299 | 863 | ||
864 | The restore_noirq() callback is hibernation-specific. It is executed in the | ||
865 | restore_noirq phase of hibernation, when the boot kernel has passed control to | ||
866 | the image kernel and the non-boot CPUs have been enabled by the image kernel's | ||
867 | PM core. | ||
868 | |||
869 | This callback is analogous to resume_noirq() with the exception that it cannot | ||
870 | make any assumption on the previous state of the device, even if the BIOS (or | ||
871 | generally the platform firmware) is known to preserve that state over a | ||
872 | suspend-resume cycle. | ||
873 | |||
874 | For the vast majority of PCI device drivers there is no difference between | ||
875 | resume_noirq() and restore_noirq(). | ||
876 | |||
877 | 3.1.13. restore() | ||
878 | |||
879 | The restore() callback is hibernation-specific. It is executed after | ||
880 | restore_noirq() callbacks have been executed for all devices in the system and | ||
881 | after the PM core has enabled device drivers' interrupt handlers to be invoked. | ||
882 | |||
883 | This callback is analogous to resume(), just like restore_noirq() is analogous | ||
884 | to resume_noirq(). Consequently, the difference between restore_noirq() and | ||
885 | restore() is analogous to the difference between resume_noirq() and resume(). | ||
886 | |||
887 | For the vast majority of PCI device drivers there is no difference between | ||
888 | resume() and restore(). | ||
889 | |||
890 | 3.1.14. complete() | ||
891 | |||
892 | The complete() callback is executed in the following situations: | ||
893 | - during system resume, after resume() callbacks have been executed for all | ||
894 | devices, | ||
895 | - during hibernation, before saving the system image, after thaw() callbacks | ||
896 | have been executed for all devices, | ||
897 | - during system restore, when the system is going back to its pre-hibernation | ||
898 | state, after restore() callbacks have been executed for all devices. | ||
899 | It also may be executed if the loading of a hibernation image into memory fails | ||
900 | (in that case it is run after thaw() callbacks have been executed for all | ||
901 | devices that have drivers in the boot kernel). | ||
902 | |||
903 | This callback is entirely optional, although it may be necessary if the | ||
904 | prepare() callback performs operations that need to be reversed. | ||
905 | |||
906 | 3.1.15. runtime_suspend() | ||
907 | |||
908 | The runtime_suspend() callback is specific to device runtime power management | ||
909 | (runtime PM). It is executed by the PM core's runtime PM framework when the | ||
910 | device is about to be suspended (i.e. quiesced and put into a low-power state) | ||
911 | at run time. | ||
912 | |||
913 | This callback is responsible for freezing the device and preparing it to be | ||
914 | put into a low-power state, but it must allow the PCI subsystem to perform all | ||
915 | of the PCI-specific actions necessary for suspending the device. | ||
916 | |||
917 | 3.1.16. runtime_resume() | ||
918 | |||
919 | The runtime_resume() callback is specific to device runtime PM. It is executed | ||
920 | by the PM core's runtime PM framework when the device is about to be resumed | ||
921 | (i.e. put into the full-power state and programmed to process I/O normally) at | ||
922 | run time. | ||
923 | |||
924 | This callback is responsible for restoring the normal functionality of the | ||
925 | device after it has been put into the full-power state by the PCI subsystem. | ||
926 | The device is expected to be able to process I/O in the usual way after | ||
927 | runtime_resume() has returned. | ||
928 | |||
929 | 3.1.17. runtime_idle() | ||
930 | |||
931 | The runtime_idle() callback is specific to device runtime PM. It is executed | ||
932 | by the PM core's runtime PM framework whenever it may be desirable to suspend | ||
933 | the device according to the PM core's information. In particular, it is | ||
934 | automatically executed right after runtime_resume() has returned in case the | ||
935 | resume of the device has happened as a result of a spurious event. | ||
936 | |||
937 | This callback is optional, but if it is not implemented or if it returns 0, the | ||
938 | PCI subsystem will call pm_runtime_suspend() for the device, which in turn will | ||
939 | cause the driver's runtime_suspend() callback to be executed. | ||
940 | |||
941 | 3.1.18. Pointing Multiple Callback Pointers to One Routine | ||
942 | |||
943 | Although in principle each of the callbacks described in the previous | ||
944 | subsections can be defined as a separate function, it often is convenient to | ||
945 | point two or more members of struct dev_pm_ops to the same routine. There are | ||
946 | a few convenience macros that can be used for this purpose. | ||
947 | |||
948 | The SIMPLE_DEV_PM_OPS macro declares a struct dev_pm_ops object with one | ||
949 | suspend routine pointed to by the .suspend(), .freeze(), and .poweroff() | ||
950 | members and one resume routine pointed to by the .resume(), .thaw(), and | ||
951 | .restore() members. The other function pointers in this struct dev_pm_ops are | ||
952 | unset. | ||
953 | |||
954 | The UNIVERSAL_DEV_PM_OPS macro is similar to SIMPLE_DEV_PM_OPS, but it | ||
955 | additionally sets the .runtime_resume() pointer to the same value as | ||
956 | .resume() (and .thaw(), and .restore()) and the .runtime_suspend() pointer to | ||
957 | the same value as .suspend() (and .freeze() and .poweroff()). | ||
958 | |||
959 | The SET_SYSTEM_SLEEP_PM_OPS can be used inside of a declaration of struct | ||
960 | dev_pm_ops to indicate that one suspend routine is to be pointed to by the | ||
961 | .suspend(), .freeze(), and .poweroff() members and one resume routine is to | ||
962 | be pointed to by the .resume(), .thaw(), and .restore() members. | ||
963 | |||
964 | 3.2. Device Runtime Power Management | ||
965 | ------------------------------------ | ||
966 | In addition to providing device power management callbacks PCI device drivers | ||
967 | are responsible for controlling the runtime power management (runtime PM) of | ||
968 | their devices. | ||
969 | |||
970 | The PCI device runtime PM is optional, but it is recommended that PCI device | ||
971 | drivers implement it at least in the cases where there is a reliable way of | ||
972 | verifying that the device is not used (like when the network cable is detached | ||
973 | from an Ethernet adapter or there are no devices attached to a USB controller). | ||
974 | |||
975 | To support the PCI runtime PM the driver first needs to implement the | ||
976 | runtime_suspend() and runtime_resume() callbacks. It also may need to implement | ||
977 | the runtime_idle() callback to prevent the device from being suspended again | ||
978 | every time right after the runtime_resume() callback has returned | ||
979 | (alternatively, the runtime_suspend() callback will have to check if the | ||
980 | device should really be suspended and return -EAGAIN if that is not the case). | ||
981 | |||
982 | The runtime PM of PCI devices is disabled by default. It is also blocked by | ||
983 | pci_pm_init() that runs the pm_runtime_forbid() helper function. If a PCI | ||
984 | driver implements the runtime PM callbacks and intends to use the runtime PM | ||
985 | framework provided by the PM core and the PCI subsystem, it should enable this | ||
986 | feature by executing the pm_runtime_enable() helper function. However, the | ||
987 | driver should not call the pm_runtime_allow() helper function unblocking | ||
988 | the runtime PM of the device. Instead, it should allow user space or some | ||
989 | platform-specific code to do that (user space can do it via sysfs), although | ||
990 | once it has called pm_runtime_enable(), it must be prepared to handle the | ||
991 | runtime PM of the device correctly as soon as pm_runtime_allow() is called | ||
992 | (which may happen at any time). [It also is possible that user space causes | ||
993 | pm_runtime_allow() to be called via sysfs before the driver is loaded, so in | ||
994 | fact the driver has to be prepared to handle the runtime PM of the device as | ||
995 | soon as it calls pm_runtime_enable().] | ||
996 | |||
997 | The runtime PM framework works by processing requests to suspend or resume | ||
998 | devices, or to check if they are idle (in which cases it is reasonable to | ||
999 | subsequently request that they be suspended). These requests are represented | ||
1000 | by work items put into the power management workqueue, pm_wq. Although there | ||
1001 | are a few situations in which power management requests are automatically | ||
1002 | queued by the PM core (for example, after processing a request to resume a | ||
1003 | device the PM core automatically queues a request to check if the device is | ||
1004 | idle), device drivers are generally responsible for queuing power management | ||
1005 | requests for their devices. For this purpose they should use the runtime PM | ||
1006 | helper functions provided by the PM core, discussed in | ||
1007 | Documentation/power/runtime_pm.txt. | ||
1008 | |||
1009 | Devices can also be suspended and resumed synchronously, without placing a | ||
1010 | request into pm_wq. In the majority of cases this also is done by their | ||
1011 | drivers that use helper functions provided by the PM core for this purpose. | ||
1012 | |||
1013 | For more information on the runtime PM of devices refer to | ||
1014 | Documentation/power/runtime_pm.txt. | ||
1015 | |||
1016 | |||
1017 | 4. Resources | ||
1018 | ============ | ||
1019 | |||
1020 | PCI Local Bus Specification, Rev. 3.0 | ||
1021 | PCI Bus Power Management Interface Specification, Rev. 1.2 | ||
1022 | Advanced Configuration and Power Interface (ACPI) Specification, Rev. 3.0b | ||
1023 | PCI Express Base Specification, Rev. 2.0 | ||
1024 | Documentation/power/devices.txt | ||
1025 | Documentation/power/runtime_pm.txt | ||