3 files changed, 193 insertions, 122 deletions
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt
index 646a89e0c07d..20af7def23c8 100644
--- a/Documentation/power/devices.txt
+++ b/Documentation/power/devices.txt
@@ -123,9 +123,12 @@ please refer directly to the source code for more information about it.
 Subsystem-Level Methods
 -----------------------
 The core methods to suspend and resume devices reside in struct dev_pm_ops
-pointed to by the pm member of struct bus_type, struct device_type and
+pointed to by the ops member of struct dev_pm_domain, or by the pm member of
-struct class.  They are mostly of interest to the people writing infrastructure
+struct bus_type, struct device_type and struct class.  They are mostly of
-for buses, like PCI or USB, or device type and device class drivers.
+interest to the people writing infrastructure for platforms and buses, like PCI
+or USB, or device type and device class drivers.  They are also relevant to the
+writers of device drivers whose subsystems (PM domains, device types, device
+classes and bus types) don't provide all power management methods.
 Bus drivers implement these methods as appropriate for the hardware and the
 drivers using it; PCI works differently from USB, and so on.  Not many people
@@ -139,41 +142,57 @@ sequencing in the driver model tree.
 /sys/devices/.../power/wakeup files
 -----------------------------------
-All devices in the driver model have two flags to control handling of wakeup
+All device objects in the driver model contain fields that control the handling
-events (hardware signals that can force the device and/or system out of a low
+of system wakeup events (hardware signals that can force the system out of a
-power state).  These flags are initialized by bus or device driver code using
+sleep state).  These fields are initialized by bus or device driver code using
 device_set_wakeup_capable() and device_set_wakeup_enable(), defined in
 include/linux/pm_wakeup.h.
-The "can_wakeup" flag just records whether the device (and its driver) can
+The "power.can_wakeup" flag just records whether the device (and its driver) can
 physically support wakeup events.  The device_set_wakeup_capable() routine
-affects this flag.  The "should_wakeup" flag controls whether the device should
+affects this flag.  The "power.wakeup" field is a pointer to an object of type
-try to use its wakeup mechanism.  device_set_wakeup_enable() affects this flag;
+struct wakeup_source used for controlling whether or not the device should use
-for the most part drivers should not change its value.  The initial value of
+its system wakeup mechanism and for notifying the PM core of system wakeup
-should_wakeup is supposed to be false for the majority of devices; the major
+events signaled by the device.  This object is only present for wakeup-capable
-exceptions are power buttons, keyboards, and Ethernet adapters whose WoL
+devices (i.e. devices whose "can_wakeup" flags are set) and is created (or
-(wake-on-LAN) feature has been set up with ethtool.  It should also default
+removed) by device_set_wakeup_capable().
-to true for devices that don't generate wakeup requests on their own but merely
-forward wakeup requests from one bus to another (like PCI bridges).
 Whether or not a device is capable of issuing wakeup events is a hardware
 matter, and the kernel is responsible for keeping track of it.  By contrast,
 whether or not a wakeup-capable device should issue wakeup events is a policy
 decision, and it is managed by user space through a sysfs attribute: the
-power/wakeup file.  User space can write the strings "enabled" or "disabled" to
+"power/wakeup" file.  User space can write the strings "enabled" or "disabled"
-set or clear the "should_wakeup" flag, respectively.  This file is only present
+to it to indicate whether or not, respectively, the device is supposed to signal
-for wakeup-capable devices (i.e. devices whose "can_wakeup" flags are set)
+system wakeup.  This file is only present if the "power.wakeup" object exists
-and is created (or removed) by device_set_wakeup_capable().  Reads from the
+for the given device and is created (or removed) along with that object, by
-file will return the corresponding string.
+device_set_wakeup_capable().  Reads from the file will return the corresponding
+string.
-The device_may_wakeup() routine returns true only if both flags are set.
+The "power/wakeup" file is supposed to contain the "disabled" string initially
+for the majority of devices; the major exceptions are power buttons, keyboards,
+and Ethernet adapters whose WoL (wake-on-LAN) feature has been set up with
+ethtool.  It should also default to "enabled" for devices that don't generate
+wakeup requests on their own but merely forward wakeup requests from one bus to
+another (like PCI Express ports).
+The device_may_wakeup() routine returns true only if the "power.wakeup" object
+exists and the corresponding "power/wakeup" file contains the string "enabled".
 This information is used by subsystems, like the PCI bus type code, to see
 whether or not to enable the devices' wakeup mechanisms.  If device wakeup
 mechanisms are enabled or disabled directly by drivers, they also should use
 device_may_wakeup() to decide what to do during a system sleep transition.
-However for runtime power management, wakeup events should be enabled whenever
+Device drivers, however, are not supposed to call device_set_wakeup_enable()
-the device and driver both support them, regardless of the should_wakeup flag.
+directly in any case.
+It ought to be noted that system wakeup is conceptually different from "remote
+wakeup" used by runtime power management, although it may be supported by the
+same physical mechanism.  Remote wakeup is a feature allowing devices in
+low-power states to trigger specific interrupts to signal conditions in which
+they should be put into the full-power state.  Those interrupts may or may not
+be used to signal system wakeup events, depending on the hardware design.  On
+some systems it is impossible to trigger them from system sleep states.  In any
+case, remote wakeup should always be enabled for runtime power management for
+all devices and drivers that support it.
 /sys/devices/.../power/control files
 ------------------------------------
@@ -249,23 +268,37 @@ for every device before the next phase begins.  Not all busses or classes
 support all these callbacks and not all drivers use all the callbacks.  The
 various phases always run after tasks have been frozen and before they are
 unfrozen.  Furthermore, the *_noirq phases run at a time when IRQ handlers have
-been disabled (except for those marked with the IRQ_WAKEUP flag).
+been disabled (except for those marked with the IRQF_NO_SUSPEND flag).
+All phases use PM domain, bus, type, class or driver callbacks (that is, methods
+defined in dev->pm_domain->ops, dev->bus->pm, dev->type->pm, dev->class->pm or
+dev->driver->pm).  These callbacks are regarded by the PM core as mutually
+exclusive.  Moreover, PM domain callbacks always take precedence over all of the
+other callbacks and, for example, type callbacks take precedence over bus, class
+and driver callbacks.  To be precise, the following rules are used to determine
+which callback to execute in the given phase:
+    1.  If dev->pm_domain is present, the PM core will choose the callback
+        included in dev->pm_domain->ops for execution.
+    2.  Otherwise, if both dev->type and dev->type->pm are present, the callback
+        included in dev->type->pm will be chosen for execution.
+    3.  Otherwise, if both dev->class and dev->class->pm are present, the
+        callback included in dev->class->pm will be chosen for execution.
+    4.  Otherwise, if both dev->bus and dev->bus->pm are present, the callback
+        included in dev->bus->pm will be chosen for execution.
+This allows PM domains and device types to override callbacks provided by bus
+types or device classes if necessary.
-All phases use bus, type, or class callbacks (that is, methods defined in
+The PM domain, type, class and bus callbacks may in turn invoke device- or
-dev->bus->pm, dev->type->pm, or dev->class->pm).  These callbacks are mutually
+driver-specific methods stored in dev->driver->pm, but they don't have to do
-exclusive, so if the device type provides a struct dev_pm_ops object pointed to
+that.
-by its pm field (i.e. both dev->type and dev->type->pm are defined), the
-callbacks included in that object (i.e. dev->type->pm) will be used.  Otherwise,
-if the class provides a struct dev_pm_ops object pointed to by its pm field
-(i.e. both dev->class and dev->class->pm are defined), the PM core will use the
-callbacks from that object (i.e. dev->class->pm).  Finally, if the pm fields of
-both the device type and class objects are NULL (or those objects do not exist),
-the callbacks provided by the bus (that is, the callbacks from dev->bus->pm)
-will be used (this allows device types to override callbacks provided by bus
-types or classes if necessary).
-These callbacks may in turn invoke device- or driver-specific methods stored in
+If the subsystem callback chosen for execution is not present, the PM core will
-dev->driver->pm, but they don't have to.
+execute the corresponding method from dev->driver->pm instead if there is one.
 Entering System Suspend
@@ -283,9 +316,8 @@ When the system goes into the standby or memory sleep state, the phases are:
        After the prepare callback method returns, no new children may be
        registered below the device.  The method may also prepare the device or
-        driver in some way for the upcoming system power transition (for
+        driver in some way for the upcoming system power transition, but it
-        example, by allocating additional memory required for this purpose), but
+        should not put the device into a low-power state.
-        it should not put the device into a low-power state.
    2.  The suspend methods should quiesce the device to stop it from performing
        I/O.  They also may save the device registers and put it into the
diff --git a/Documentation/power/freezing-of-tasks.txt b/Documentation/power/freezing-of-tasks.txt
index 316c2ba187f4..6ccb68f68da6 100644
--- a/Documentation/power/freezing-of-tasks.txt
+++ b/Documentation/power/freezing-of-tasks.txt
@@ -21,7 +21,7 @@ freeze_processes() (defined in kernel/power/process.c) is called.  It executes
 try_to_freeze_tasks() that sets TIF_FREEZE for all of the freezable tasks and
 either wakes them up, if they are kernel threads, or sends fake signals to them,
 if they are user space processes.  A task that has TIF_FREEZE set, should react
-to it by calling the function called refrigerator() (defined in
+to it by calling the function called __refrigerator() (defined in
 kernel/freezer.c), which sets the task's PF_FROZEN flag, changes its state
 to TASK_UNINTERRUPTIBLE and makes it loop until PF_FROZEN is cleared for it.
 Then, we say that the task is 'frozen' and therefore the set of functions
@@ -29,10 +29,10 @@ handling this mechanism is referred to as 'the freezer' (these functions are
 defined in kernel/power/process.c, kernel/freezer.c & include/linux/freezer.h).
 User space processes are generally frozen before kernel threads.
-It is not recommended to call refrigerator() directly.  Instead, it is
+__refrigerator() must not be called directly.  Instead, use the
-recommended to use the try_to_freeze() function (defined in
+try_to_freeze() function (defined in include/linux/freezer.h), that checks
-include/linux/freezer.h), that checks the task's TIF_FREEZE flag and makes the
+the task's TIF_FREEZE flag and makes the task enter __refrigerator() if the
-task enter refrigerator() if the flag is set.
+flag is set.
 For user space processes try_to_freeze() is called automatically from the
 signal-handling code, but the freezable kernel threads need to call it
@@ -61,13 +61,13 @@ wait_event_freezable() and wait_event_freezable_timeout() macros.
 After the system memory state has been restored from a hibernation image and
 devices have been reinitialized, the function thaw_processes() is called in
 order to clear the PF_FROZEN flag for each frozen task.  Then, the tasks that
-have been frozen leave refrigerator() and continue running.
+have been frozen leave __refrigerator() and continue running.
 III. Which kernel threads are freezable?
 Kernel threads are not freezable by default.  However, a kernel thread may clear
 PF_NOFREEZE for itself by calling set_freezable() (the resetting of PF_NOFREEZE
-directly is strongly discouraged).  From this point it is regarded as freezable
+directly is not allowed).  From this point it is regarded as freezable
 and must call try_to_freeze() in a suitable place.
 IV. Why do we do that?
@@ -176,3 +176,28 @@ tasks, since it generally exists anyway.
 A driver must have all firmwares it may need in RAM before suspend() is called.
 If keeping them is not practical, for example due to their size, they must be
 requested early enough using the suspend notifier API described in notifiers.txt.
+VI. Are there any precautions to be taken to prevent freezing failures?
+Yes, there are.
+First of all, grabbing the 'pm_mutex' lock to mutually exclude a piece of code
+from system-wide sleep such as suspend/hibernation is not encouraged.
+If possible, that piece of code should instead hook onto the suspend/hibernation
+notifiers to achieve mutual exclusion. Look at the CPU-Hotplug code
+(kernel/cpu.c) for an example.
+However, if that is not feasible, and grabbing 'pm_mutex' is deemed necessary,
+it is strongly discouraged to directly call mutex_[un]lock(&pm_mutex), since
+that could lead to freezing failures.  If the suspend/hibernate code has
+already acquired the 'pm_mutex' lock, any other task that tries to acquire
+it will fail to do so and get blocked in the TASK_UNINTERRUPTIBLE state.
+As a consequence, the freezer will not be able to freeze that task, which
+leads to a freezing failure.
+However, the [un]lock_system_sleep() APIs are safe to use in this scenario,
+because they ask the freezer to skip freezing this task: it is "frozen
+enough" anyway, as it is blocked on 'pm_mutex', which will be released
+only after the entire suspend/hibernation sequence is complete.
+So, to summarize, use [un]lock_system_sleep() instead of directly using
+mutex_[un]lock(&pm_mutex). That would prevent freezing failures.
diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt
index 5336149f831b..4abe83e1045a 100644
--- a/Documentation/power/runtime_pm.txt
+++ b/Documentation/power/runtime_pm.txt
@@ -44,98 +44,112 @@ struct dev_pm_ops {
 };
 The ->runtime_suspend(), ->runtime_resume() and ->runtime_idle() callbacks
-are executed by the PM core for either the power domain, or the device type
+are executed by the PM core for the device's subsystem that may be either of
-(if the device power domain's struct dev_pm_ops does not exist), or the class
+the following:
-(if the device power domain's and type's struct dev_pm_ops object does not
-exist), or the bus type (if the device power domain's, type's and class'
+  1. PM domain of the device, if the device's PM domain object, dev->pm_domain,
-struct dev_pm_ops objects do not exist) of the given device, so the priority
+     is present.
-order of callbacks from high to low is that power domain callbacks, device
-type callbacks, class callbacks and bus type callbacks, and the high priority
+  2. Device type of the device, if both dev->type and dev->type->pm are present.
-one will take precedence over low priority one. The bus type, device type and
-class callbacks are referred to as subsystem-level callbacks in what follows,
+  3. Device class of the device, if both dev->class and dev->class->pm are
-and generally speaking, the power domain callbacks are used for representing
+     present.
-power domains within a SoC.
+  4. Bus type of the device, if both dev->bus and dev->bus->pm are present.
+If the subsystem chosen by applying the above rules doesn't provide the relevant
+callback, the PM core will invoke the corresponding driver callback stored in
+dev->driver->pm directly (if present).
+The PM core always checks which callback to use in the order given above, so the
+priority order of callbacks from high to low is: PM domain, device type, class
+and bus type.  Moreover, a higher-priority callback always takes precedence over
+a lower-priority one.  The PM domain, bus type, device type and class callbacks
+are referred to as subsystem-level callbacks in what follows.
 By default, the callbacks are always invoked in process context with interrupts
-enabled.  However, subsystems can use the pm_runtime_irq_safe() helper function
+enabled.  However, the pm_runtime_irq_safe() helper function can be used to tell
-to tell the PM core that a device's ->runtime_suspend() and ->runtime_resume()
+the PM core that it is safe to run the ->runtime_suspend(), ->runtime_resume()
-callbacks should be invoked in atomic context with interrupts disabled.
+and ->runtime_idle() callbacks for the given device in atomic context with
-This implies that these callback routines must not block or sleep, but it also
+interrupts disabled.  This implies that the callback routines in question must
-means that the synchronous helper functions listed at the end of Section 4 can
+not block or sleep, but it also means that the synchronous helper functions
-be used within an interrupt handler or in an atomic context.
+listed at the end of Section 4 may be used for that device within an interrupt
+handler or generally in an atomic context.
-The subsystem-level suspend callback is _entirely_ _responsible_ for handling
-the suspend of the device as appropriate, which may, but need not include
+The subsystem-level suspend callback, if present, is _entirely_ _responsible_
-executing the device driver's own ->runtime_suspend() callback (from the
+for handling the suspend of the device as appropriate, which may, but need not
+include executing the device driver's own ->runtime_suspend() callback (from the
 PM core's point of view it is not necessary to implement a ->runtime_suspend()
 callback in a device driver as long as the subsystem-level suspend callback
 knows what to do to handle the device).
-  * Once the subsystem-level suspend callback has completed successfully
+  * Once the subsystem-level suspend callback (or the driver suspend callback,
-    for given device, the PM core regards the device as suspended, which need
+    if invoked directly) has completed successfully for the given device, the PM
-    not mean that the device has been put into a low power state.  It is
+    core regards the device as suspended, which need not mean that it has been
-    supposed to mean, however, that the device will not process data and will
+    put into a low power state.  It is supposed to mean, however, that the
-    not communicate with the CPU(s) and RAM until the subsystem-level resume
+    device will not process data and will not communicate with the CPU(s) and
-    callback is executed for it.  The runtime PM status of a device after
+    RAM until the appropriate resume callback is executed for it.  The runtime
-    successful execution of the subsystem-level suspend callback is 'suspended'.
+    PM status of a device after successful execution of the suspend callback is
+    'suspended'.
-  * If the subsystem-level suspend callback returns -EBUSY or -EAGAIN,
-    the device's runtime PM status is 'active', which means that the device
+  * If the suspend callback returns -EBUSY or -EAGAIN, the device's runtime PM
-    _must_ be fully operational afterwards.
+    status remains 'active', which means that the device _must_ be fully
+    operational afterwards.
-  * If the subsystem-level suspend callback returns an error code different
-    from -EBUSY or -EAGAIN, the PM core regards this as a fatal error and will
+  * If the suspend callback returns an error code different from -EBUSY and
-    refuse to run the helper functions described in Section 4 for the device,
+    -EAGAIN, the PM core regards this as a fatal error and will refuse to run
-    until the status of it is directly set either to 'active', or to 'suspended'
+    the helper functions described in Section 4 for the device until its status
-    (the PM core provides special helper functions for this purpose).
+    is directly set to either 'active' or 'suspended' (the PM core provides
+    special helper functions for this purpose).
-In particular, if the driver requires remote wake-up capability (i.e. hardware
+In particular, if the driver requires remote wakeup capability (i.e. hardware
 mechanism allowing the device to request a change of its power state, such as
 PCI PME) for proper functioning and device_run_wake() returns 'false' for the
 device, then ->runtime_suspend() should return -EBUSY.  On the other hand, if
-device_run_wake() returns 'true' for the device and the device is put into a low
+device_run_wake() returns 'true' for the device and the device is put into a
-power state during the execution of the subsystem-level suspend callback, it is
+low-power state during the execution of the suspend callback, it is expected
-expected that remote wake-up will be enabled for the device.  Generally, remote
+that remote wakeup will be enabled for the device.  Generally, remote wakeup
-wake-up should be enabled for all input devices put into a low power state at
+should be enabled for all input devices put into low-power states at run time.
-run time.
+The subsystem-level resume callback, if present, is _entirely_ _responsible_ for
-The subsystem-level resume callback is _entirely_ _responsible_ for handling the
+handling the resume of the device as appropriate, which may, but need not
-resume of the device as appropriate, which may, but need not include executing
+include executing the device driver's own ->runtime_resume() callback (from the
-the device driver's own ->runtime_resume() callback (from the PM core's point of
+PM core's point of view it is not necessary to implement a ->runtime_resume()
-view it is not necessary to implement a ->runtime_resume() callback in a device
+callback in a device driver as long as the subsystem-level resume callback knows
-driver as long as the subsystem-level resume callback knows what to do to handle
+what to do to handle the device).
-the device).
+  * Once the subsystem-level resume callback (or the driver resume callback, if
-  * Once the subsystem-level resume callback has completed successfully, the PM
+    invoked directly) has completed successfully, the PM core regards the device
-    core regards the device as fully operational, which means that the device
+    as fully operational, which means that the device _must_ be able to complete
-    _must_ be able to complete I/O operations as needed.  The runtime PM status
+    I/O operations as needed.  The runtime PM status of the device is then
-    of the device is then 'active'.
+    'active'.
-  * If the subsystem-level resume callback returns an error code, the PM core
+  * If the resume callback returns an error code, the PM core regards this as a
-    regards this as a fatal error and will refuse to run the helper functions
+    fatal error and will refuse to run the helper functions described in Section
-    described in Section 4 for the device, until its status is directly set
+    4 for the device, until its status is directly set to either 'active', or
-    either to 'active' or to 'suspended' (the PM core provides special helper
+    'suspended' (by means of special helper functions provided by the PM core
-    functions for this purpose).
+    for this purpose).
-The subsystem-level idle callback is executed by the PM core whenever the device
+The idle callback (a subsystem-level one, if present, or the driver one) is
-appears to be idle, which is indicated to the PM core by two counters, the
+executed by the PM core whenever the device appears to be idle, which is
-device's usage counter and the counter of 'active' children of the device.
+indicated to the PM core by two counters, the device's usage counter and the
+counter of 'active' children of the device.
  * If any of these counters is decreased using a helper function provided by
    the PM core and it turns out to be equal to zero, the other counter is
    checked.  If that counter also is equal to zero, the PM core executes the
-    subsystem-level idle callback with the device as an argument.
+    idle callback with the device as its argument.
-The action performed by a subsystem-level idle callback is totally dependent on
+The action performed by the idle callback is totally dependent on the subsystem
-the subsystem in question, but the expected and recommended action is to check
+(or driver) in question, but the expected and recommended action is to check
 if the device can be suspended (i.e. if all of the conditions necessary for
 suspending the device are satisfied) and to queue up a suspend request for the
 device in that case.  The value returned by this callback is ignored by the PM
 core.
 The helper functions provided by the PM core, described in Section 4, guarantee
-that the following constraints are met with respect to the bus type's runtime
+that the following constraints are met with respect to runtime PM callbacks for
-PM callbacks:
+one device:
 (1) The callbacks are mutually exclusive (e.g. it is forbidden to execute
    ->runtime_suspend() in parallel with ->runtime_resume() or with another