| author | Steven Whitehouse <swhiteho@redhat.com> | 2006-09-28 08:29:59 -0400 |
|---|---|---|
| committer | Steven Whitehouse <swhiteho@redhat.com> | 2006-09-28 08:29:59 -0400 |
| commit | 185a257f2f73bcd89050ad02da5bedbc28fc43fa (patch) | |
| tree | 5e32586114534ed3f2165614cba3d578f5d87307 /Documentation | |
| parent | 3f1a9aaeffd8d1cbc5ab9776c45cbd66af1c9699 (diff) | |
| parent | a77c64c1a641950626181b4857abb701d8f38ccc (diff) | |
Merge branch 'master' into gfs2
Diffstat (limited to 'Documentation')
24 files changed, 1383 insertions, 368 deletions
diff --git a/Documentation/ABI/obsolete/devfs b/Documentation/ABI/removed/devfs
index b8b87399bc8f..8195c4e0d0a1 100644
--- a/Documentation/ABI/obsolete/devfs
+++ b/Documentation/ABI/removed/devfs
| @@ -1,13 +1,12 @@ | |||
| 1 | What: devfs | 1 | What: devfs |
| 2 | Date: July 2005 | 2 | Date: July 2005 (scheduled), finally removed in kernel v2.6.18 |
| 3 | Contact: Greg Kroah-Hartman <gregkh@suse.de> | 3 | Contact: Greg Kroah-Hartman <gregkh@suse.de> |
| 4 | Description: | 4 | Description: |
| 5 | devfs has been unmaintained for a number of years, has unfixable | 5 | devfs has been unmaintained for a number of years, has unfixable |
| 6 | races, contains a naming policy within the kernel that is | 6 | races, contains a naming policy within the kernel that is |
| 7 | against the LSB, and can be replaced by using udev. | 7 | against the LSB, and can be replaced by using udev. |
| 8 | The files fs/devfs/*, include/linux/devfs_fs*.h will be removed, | 8 | The files fs/devfs/*, include/linux/devfs_fs*.h were removed, |
| 9 | along with the assorted devfs function calls throughout the | 9 | along with the assorted devfs function calls throughout the |
| 10 | kernel tree. | 10 | kernel tree. |
| 11 | 11 | ||
| 12 | Users: | 12 | Users: |
| 13 | |||
diff --git a/Documentation/ABI/testing/sysfs-power b/Documentation/ABI/testing/sysfs-power
new file mode 100644
index 000000000000..d882f8093871
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-power
| @@ -0,0 +1,88 @@ | |||
| 1 | What: /sys/power/ | ||
| 2 | Date: August 2006 | ||
| 3 | Contact: Rafael J. Wysocki <rjw@sisk.pl> | ||
| 4 | Description: | ||
| 5 | The /sys/power directory will contain files that will | ||
| 6 | provide a unified interface to the power management | ||
| 7 | subsystem. | ||
| 8 | |||
| 9 | What: /sys/power/state | ||
| 10 | Date: August 2006 | ||
| 11 | Contact: Rafael J. Wysocki <rjw@sisk.pl> | ||
| 12 | Description: | ||
| 13 | The /sys/power/state file controls the system power state. | ||
| 14 | Reading from this file returns what states are supported, | ||
| 15 | which is hard-coded to 'standby' (Power-On Suspend), 'mem' | ||
| 16 | (Suspend-to-RAM), and 'disk' (Suspend-to-Disk). | ||
| 17 | |||
| 18 | Writing to this file one of these strings causes the system to | ||
| 19 | transition into that state. Please see the file | ||
| 20 | Documentation/power/states.txt for a description of each of | ||
| 21 | these states. | ||
| 22 | |||
| 23 | What: /sys/power/disk | ||
| 24 | Date: August 2006 | ||
| 25 | Contact: Rafael J. Wysocki <rjw@sisk.pl> | ||
| 26 | Description: | ||
| 27 | The /sys/power/disk file controls the operating mode of the | ||
| 28 | suspend-to-disk mechanism. Reading from this file returns | ||
| 29 | the name of the method by which the system will be put to | ||
| 30 | sleep on the next suspend. There are four methods supported: | ||
| 31 | 'firmware' - means that the memory image will be saved to disk | ||
| 32 | by some firmware, in which case we also assume that the | ||
| 33 | firmware will handle the system suspend. | ||
| 34 | 'platform' - the memory image will be saved by the kernel and | ||
| 35 | the system will be put to sleep by the platform driver (e.g. | ||
| 36 | ACPI or other PM registers). | ||
| 37 | 'shutdown' - the memory image will be saved by the kernel and | ||
| 38 | the system will be powered off. | ||
| 39 | 'reboot' - the memory image will be saved by the kernel and | ||
| 40 | the system will be rebooted. | ||
| 41 | |||
| 42 | The suspend-to-disk method may be chosen by writing to this | ||
| 43 | file one of the accepted strings: | ||
| 44 | |||
| 45 | 'firmware' | ||
| 46 | 'platform' | ||
| 47 | 'shutdown' | ||
| 48 | 'reboot' | ||
| 49 | |||
| 50 | It will only change to 'firmware' or 'platform' if the system | ||
| 51 | supports that. | ||
| 52 | |||
| 53 | What: /sys/power/image_size | ||
| 54 | Date: August 2006 | ||
| 55 | Contact: Rafael J. Wysocki <rjw@sisk.pl> | ||
| 56 | Description: | ||
| 57 | The /sys/power/image_size file controls the size of the image | ||
| 58 | created by the suspend-to-disk mechanism. It can be written a | ||
| 59 | string representing a non-negative integer that will be used | ||
| 60 | as an upper limit of the image size, in bytes. The kernel's | ||
| 61 | suspend-to-disk code will do its best to ensure the image size | ||
| 62 | will not exceed this number. However, if it turns out to be | ||
| 63 | impossible, the kernel will try to suspend anyway using the | ||
| 64 | smallest image possible. In particular, if "0" is written to | ||
| 65 | this file, the suspend image will be as small as possible. | ||
| 66 | |||
| 67 | Reading from this file will display the current image size | ||
| 68 | limit, which is set to 500 MB by default. | ||
| 69 | |||
| 70 | What: /sys/power/pm_trace | ||
| 71 | Date: August 2006 | ||
| 72 | Contact: Rafael J. Wysocki <rjw@sisk.pl> | ||
| 73 | Description: | ||
| 74 | The /sys/power/pm_trace file controls the code which saves the | ||
| 75 | last PM event point in the RTC across reboots, so that you can | ||
| 76 | debug a machine that just hangs during suspend (or more | ||
| 77 | commonly, during resume). Namely, the RTC is only used to save | ||
| 78 | the last PM event point if this file contains '1'. Initially | ||
| 79 | it contains '0' which may be changed to '1' by writing a | ||
| 80 | string representing a nonzero integer into it. | ||
| 81 | |||
| 82 | To use this debugging feature you should attempt to suspend | ||
| 83 | the machine, then reboot it and run | ||
| 84 | |||
| 85 | dmesg -s 1000000 | grep 'hash matches' | ||
| 86 | |||
| 87 | CAUTION: Using it will cause your machine's real-time (CMOS) | ||
| 88 | clock to be set to a random invalid time after a resume. | ||
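The /sys/power/state entry above says that reading the file returns the supported states and that writing one of those strings triggers the transition. A minimal sketch of the check a userspace tool might make before writing is shown here. It assumes the file's contents are a whitespace-separated list such as "standby mem disk", which the text above does not spell out, and `pm_state_supported` is a hypothetical helper name, not a kernel or libc function.

```c
#include <string.h>

/*
 * Return 1 if `state` appears as a whole token in `list`, 0 otherwise.
 * `list` is assumed to be the text read from /sys/power/state, with
 * tokens separated by spaces, tabs, or newlines (an assumption).
 */
int pm_state_supported(const char *list, const char *state)
{
    size_t want = strlen(state);
    const char *p = list;

    if (want == 0)
        return 0;

    while (*p) {
        size_t len;

        /* Skip separators, including the trailing newline. */
        p += strspn(p, " \t\n");
        len = strcspn(p, " \t\n");

        /* Compare whole tokens only, so "me" does not match "mem". */
        if (len == want && strncmp(p, state, len) == 0)
            return 1;
        p += len;
    }
    return 0;
}
```

A caller would read /sys/power/state into a buffer, pass it as `list`, and only write the requested string back to the file when the helper returns 1.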
diff --git a/Documentation/DocBook/usb.tmpl b/Documentation/DocBook/usb.tmpl
index 320af25de3a2..3608472d7b74 100644
--- a/Documentation/DocBook/usb.tmpl
+++ b/Documentation/DocBook/usb.tmpl
| @@ -43,59 +43,52 @@ | |||
| 43 | 43 | ||
| 44 | <para>A Universal Serial Bus (USB) is used to connect a host, | 44 | <para>A Universal Serial Bus (USB) is used to connect a host, |
| 45 | such as a PC or workstation, to a number of peripheral | 45 | such as a PC or workstation, to a number of peripheral |
| 46 | devices. USB uses a tree structure, with the host at the | 46 | devices. USB uses a tree structure, with the host as the |
| 47 | root (the system's master), hubs as interior nodes, and | 47 | root (the system's master), hubs as interior nodes, and |
| 48 | peripheral devices as leaves (and slaves). | 48 | peripherals as leaves (and slaves). |
| 49 | Modern PCs support several such trees of USB devices, usually | 49 | Modern PCs support several such trees of USB devices, usually |
| 50 | one USB 2.0 tree (480 Mbit/sec each) with | 50 | one USB 2.0 tree (480 Mbit/sec each) with |
| 51 | a few USB 1.1 trees (12 Mbit/sec each) that are used when you | 51 | a few USB 1.1 trees (12 Mbit/sec each) that are used when you |
| 52 | connect a USB 1.1 device directly to the machine's "root hub". | 52 | connect a USB 1.1 device directly to the machine's "root hub". |
| 53 | </para> | 53 | </para> |
| 54 | 54 | ||
| 55 | <para>That master/slave asymmetry was designed in part for | 55 | <para>That master/slave asymmetry was designed-in for a number of |
| 56 | ease of use. It is not physically possible to assemble | 56 | reasons, one being ease of use. It is not physically possible to |
| 57 | (legal) USB cables incorrectly: all upstream "to-the-host" | 57 | assemble (legal) USB cables incorrectly: all upstream "to the host" |
| 58 | connectors are the rectangular type, matching the sockets on | 58 | connectors are the rectangular type (matching the sockets on |
| 59 | root hubs, and the downstream type are the squarish type | 59 | root hubs), and all downstream connectors are the squarish type |
| 60 | (or they are built in to the peripheral). | 60 | (or they are built into the peripheral). |
| 61 | Software doesn't need to deal with distributed autoconfiguration | 61 | Also, the host software doesn't need to deal with distributed |
| 62 | since the pre-designated master node manages all that. | 62 | auto-configuration since the pre-designated master node manages all that. |
| 63 | At the electrical level, bus protocol overhead is reduced by | 63 | And finally, at the electrical level, bus protocol overhead is reduced by |
| 64 | eliminating arbitration and moving scheduling into host software. | 64 | eliminating arbitration and moving scheduling into the host software. |
| 65 | </para> | 65 | </para> |
| 66 | 66 | ||
| 67 | <para>USB 1.0 was announced in January 1996, and was revised | 67 | <para>USB 1.0 was announced in January 1996 and was revised |
| 68 | as USB 1.1 (with improvements in hub specification and | 68 | as USB 1.1 (with improvements in hub specification and |
| 69 | support for interrupt-out transfers) in September 1998. | 69 | support for interrupt-out transfers) in September 1998. |
| 70 | USB 2.0 was released in April 2000, including high speed | 70 | USB 2.0 was released in April 2000, adding high-speed |
| 71 | transfers and transaction translating hubs (used for USB 1.1 | 71 | transfers and transaction-translating hubs (used for USB 1.1 |
| 72 | and 1.0 backward compatibility). | 72 | and 1.0 backward compatibility). |
| 73 | </para> | 73 | </para> |
| 74 | 74 | ||
| 75 | <para>USB support was added to Linux early in the 2.2 kernel series | 75 | <para>Kernel developers added USB support to Linux early in the 2.2 kernel |
| 76 | shortly before the 2.3 development forked off. Updates | 76 | series, shortly before 2.3 development forked. Updates from 2.3 were |
| 77 | from 2.3 were regularly folded back into 2.2 releases, bringing | 77 | regularly folded back into 2.2 releases, which improved reliability and |
| 78 | new features such as <filename>/sbin/hotplug</filename> support, | 78 | brought <filename>/sbin/hotplug</filename> support as well as more drivers. |
| 79 | more drivers, and more robustness. | 79 | Such improvements were continued in the 2.5 kernel series, where they added |
| 80 | The 2.5 kernel series continued such improvements, and also | 80 | USB 2.0 support, improved performance, and made the host controller drivers |
| 81 | worked on USB 2.0 support, | 81 | (HCDs) more consistent. They also simplified the API (to make bugs less |
| 82 | higher performance, | 82 | likely) and added internal "kerneldoc" documentation. |
| 83 | better consistency between host controller drivers, | ||
| 84 | API simplification (to make bugs less likely), | ||
| 85 | and providing internal "kerneldoc" documentation. | ||
| 86 | </para> | 83 | </para> |
| 87 | 84 | ||
| 88 | <para>Linux can run inside USB devices as well as on | 85 | <para>Linux can run inside USB devices as well as on |
| 89 | the hosts that control the devices. | 86 | the hosts that control the devices. |
| 90 | Because the Linux 2.x USB support evolved to support mass market | 87 | But USB device drivers running inside those peripherals |
| 91 | platforms such as Apple Macintosh or PC-compatible systems, | ||
| 92 | it didn't address design concerns for those types of USB systems. | ||
| 93 | So it can't be used inside mass-market PDAs, or other peripherals. | ||
| 94 | USB device drivers running inside those Linux peripherals | ||
| 95 | don't do the same things as the ones running inside hosts, | 88 | don't do the same things as the ones running inside hosts, |
| 96 | and so they've been given a different name: | 89 | so they've been given a different name: |
| 97 | they're called <emphasis>gadget drivers</emphasis>. | 90 | <emphasis>gadget drivers</emphasis>. |
| 98 | This document does not present gadget drivers. | 91 | This document does not cover gadget drivers. |
| 99 | </para> | 92 | </para> |
| 100 | 93 | ||
| 101 | </chapter> | 94 | </chapter> |
| @@ -103,17 +96,14 @@ | |||
| 103 | <chapter id="host"> | 96 | <chapter id="host"> |
| 104 | <title>USB Host-Side API Model</title> | 97 | <title>USB Host-Side API Model</title> |
| 105 | 98 | ||
| 106 | <para>Within the kernel, | 99 | <para>Host-side drivers for USB devices talk to the "usbcore" APIs. |
| 107 | host-side drivers for USB devices talk to the "usbcore" APIs. | 100 | There are two. One is intended for |
| 108 | There are two types of public "usbcore" APIs, targeted at two different | 101 | <emphasis>general-purpose</emphasis> drivers (exposed through |
| 109 | layers of USB driver. Those are | 102 | driver frameworks), and the other is for drivers that are |
| 110 | <emphasis>general purpose</emphasis> drivers, exposed through | 103 | <emphasis>part of the core</emphasis>. |
| 111 | driver frameworks such as block, character, or network devices; | 104 | Such core drivers include the <emphasis>hub</emphasis> driver |
| 112 | and drivers that are <emphasis>part of the core</emphasis>, | 105 | (which manages trees of USB devices) and several different kinds |
| 113 | which are involved in managing a USB bus. | 106 | of <emphasis>host controller drivers</emphasis>, |
| 114 | Such core drivers include the <emphasis>hub</emphasis> driver, | ||
| 115 | which manages trees of USB devices, and several different kinds | ||
| 116 | of <emphasis>host controller driver (HCD)</emphasis>, | ||
| 117 | which control individual busses. | 107 | which control individual busses. |
| 118 | </para> | 108 | </para> |
| 119 | 109 | ||
| @@ -122,21 +112,21 @@ | |||
| 122 | 112 | ||
| 123 | <itemizedlist> | 113 | <itemizedlist> |
| 124 | 114 | ||
| 125 | <listitem><para>USB supports four kinds of data transfer | 115 | <listitem><para>USB supports four kinds of data transfers |
| 126 | (control, bulk, interrupt, and isochronous). Two transfer | 116 | (control, bulk, interrupt, and isochronous). Two of them (control |
| 127 | types use bandwidth as it's available (control and bulk), | 117 | and bulk) use bandwidth as it's available, |
| 128 | while the other two types of transfer (interrupt and isochronous) | 118 | while the other two (interrupt and isochronous) |
| 129 | are scheduled to provide guaranteed bandwidth. | 119 | are scheduled to provide guaranteed bandwidth. |
| 130 | </para></listitem> | 120 | </para></listitem> |
| 131 | 121 | ||
| 132 | <listitem><para>The device description model includes one or more | 122 | <listitem><para>The device description model includes one or more |
| 133 | "configurations" per device, only one of which is active at a time. | 123 | "configurations" per device, only one of which is active at a time. |
| 134 | Devices that are capable of high speed operation must also support | 124 | Devices that are capable of high-speed operation must also support |
| 135 | full speed configurations, along with a way to ask about the | 125 | full-speed configurations, along with a way to ask about the |
| 136 | "other speed" configurations that might be used. | 126 | "other speed" configurations which might be used. |
| 137 | </para></listitem> | 127 | </para></listitem> |
| 138 | 128 | ||
| 139 | <listitem><para>Configurations have one or more "interface", each | 129 | <listitem><para>Configurations have one or more "interfaces", each |
| 140 | of which may have "alternate settings". Interfaces may be | 130 | of which may have "alternate settings". Interfaces may be |
| 141 | standardized by USB "Class" specifications, or may be specific to | 131 | standardized by USB "Class" specifications, or may be specific to |
| 142 | a vendor or device.</para> | 132 | a vendor or device.</para> |
| @@ -162,7 +152,7 @@ | |||
| 162 | </para></listitem> | 152 | </para></listitem> |
| 163 | 153 | ||
| 164 | <listitem><para>The Linux USB API supports synchronous calls for | 154 | <listitem><para>The Linux USB API supports synchronous calls for |
| 165 | control and bulk messaging. | 155 | control and bulk messages. |
| 166 | It also supports asynchronous calls for all kinds of data transfer, | 156 | It also supports asynchronous calls for all kinds of data transfer, |
| 167 | using request structures called "URBs" (USB Request Blocks). | 157 | using request structures called "URBs" (USB Request Blocks). |
| 168 | </para></listitem> | 158 | </para></listitem> |
| @@ -463,14 +453,25 @@ | |||
| 463 | file in your Linux kernel sources. | 453 | file in your Linux kernel sources. |
| 464 | </para> | 454 | </para> |
| 465 | 455 | ||
| 466 | <para>Otherwise the main use for this file from programs | 456 | <para>This file, in combination with the poll() system call, can |
| 467 | is to poll() it to get notifications of usb devices | 457 | also be used to detect when devices are added or removed: |
| 468 | as they're plugged or unplugged. | 458 | <programlisting>int fd; |
| 469 | To see what changed, you'd need to read the file and | 459 | struct pollfd pfd; |
| 470 | compare "before" and "after" contents, scan the filesystem, | 460 | |
| 471 | or see its hotplug event. | 461 | fd = open("/proc/bus/usb/devices", O_RDONLY); |
| 462 | pfd = (struct pollfd){ fd, POLLIN, 0 }; | ||
| 463 | for (;;) { | ||
| 464 | /* The first time through, this call will return immediately. */ | ||
| 465 | poll(&pfd, 1, -1); | ||
| 466 | |||
| 467 | /* To see what's changed, compare the file's previous and current | ||
| 468 | contents or scan the filesystem. (Scanning is more precise.) */ | ||
| 469 | }</programlisting> | ||
| 470 | Note that this behavior is intended to be used for informational | ||
| 471 | and debug purposes. It would be more appropriate to use programs | ||
| 472 | such as udev or HAL to initialize a device or start a user-mode | ||
| 473 | helper program, for instance. | ||
| 472 | </para> | 474 | </para> |
| 473 | |||
| 474 | </sect1> | 475 | </sect1> |
| 475 | 476 | ||
| 476 | <sect1> | 477 | <sect1> |
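The poll() listing added to usb.tmpl above can be factored into a small helper that hides the pollfd setup. This is a minimal sketch that assumes only POSIX poll(). `wait_readable` is a hypothetical name, and whether /proc/bus/usb/devices signals a change as POLLIN is taken from the listing rather than verified here.

```c
#include <poll.h>

/*
 * Wait up to timeout_ms milliseconds for fd to become readable
 * (-1 blocks indefinitely, as in the listing above).
 * Returns 1 if POLLIN was reported, 0 if it was not (timeout, or only
 * error/hangup events), and -1 if poll() itself failed.
 */
int wait_readable(int fd, int timeout_ms)
{
    struct pollfd pfd;
    int rc;

    pfd.fd = fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;
    return (pfd.revents & POLLIN) ? 1 : 0;
}
```

In the loop from the listing, the body would call `wait_readable(fd, -1)` and then compare the file's previous and current contents, or scan the filesystem, to find out what changed.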
diff --git a/Documentation/HOWTO b/Documentation/HOWTO
index 915ae8c986c6..1d6560413cc5 100644
--- a/Documentation/HOWTO
+++ b/Documentation/HOWTO
| @@ -358,7 +358,8 @@ Here is a list of some of the different kernel trees available: | |||
| 358 | quilt trees: | 358 | quilt trees: |
| 359 | - USB, PCI, Driver Core, and I2C, Greg Kroah-Hartman <gregkh@suse.de> | 359 | - USB, PCI, Driver Core, and I2C, Greg Kroah-Hartman <gregkh@suse.de> |
| 360 | kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/ | 360 | kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/ |
| 361 | 361 | - x86-64, partly i386, Andi Kleen <ak@suse.de> | |
| 362 | ftp.firstfloor.org:/pub/ak/x86_64/quilt/ | ||
| 362 | 363 | ||
| 363 | Bug Reporting | 364 | Bug Reporting |
| 364 | ------------- | 365 | ------------- |
diff --git a/Documentation/devices.txt b/Documentation/devices.txt
index 66c725f530f3..addc67b1d770 100644
--- a/Documentation/devices.txt
+++ b/Documentation/devices.txt
| @@ -2543,6 +2543,9 @@ Your cooperation is appreciated. | |||
| 2543 | 64 = /dev/usb/rio500 Diamond Rio 500 | 2543 | 64 = /dev/usb/rio500 Diamond Rio 500 |
| 2544 | 65 = /dev/usb/usblcd USBLCD Interface (info@usblcd.de) | 2544 | 65 = /dev/usb/usblcd USBLCD Interface (info@usblcd.de) |
| 2545 | 66 = /dev/usb/cpad0 Synaptics cPad (mouse/LCD) | 2545 | 66 = /dev/usb/cpad0 Synaptics cPad (mouse/LCD) |
| 2546 | 67 = /dev/usb/adutux0 1st Ontrak ADU device | ||
| 2547 | ... | ||
| 2548 | 76 = /dev/usb/adutux10 10th Ontrak ADU device | ||
| 2546 | 96 = /dev/usb/hiddev0 1st USB HID device | 2549 | 96 = /dev/usb/hiddev0 1st USB HID device |
| 2547 | ... | 2550 | ... |
| 2548 | 111 = /dev/usb/hiddev15 16th USB HID device | 2551 | 111 = /dev/usb/hiddev15 16th USB HID device |
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 552507fe9a7e..436697cb9388 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
| @@ -6,6 +6,21 @@ be removed from this file. | |||
| 6 | 6 | ||
| 7 | --------------------------- | 7 | --------------------------- |
| 8 | 8 | ||
| 9 | What: /sys/devices/.../power/state | ||
| 10 | dev->power.power_state | ||
| 11 | dpm_runtime_{suspend,resume}() | ||
| 12 | When: July 2007 | ||
| 13 | Why: Broken design for runtime control over driver power states, confusing | ||
| 14 | driver-internal runtime power management with: mechanisms to support | ||
| 15 | system-wide sleep state transitions; event codes that distinguish | ||
| 16 | different phases of swsusp "sleep" transitions; and userspace policy | ||
| 17 | inputs. This framework was never widely used, and most attempts to | ||
| 18 | use it were broken. Drivers should instead be exposing domain-specific | ||
| 19 | interfaces either to kernel or to userspace. | ||
| 20 | Who: Pavel Machek <pavel@suse.cz> | ||
| 21 | |||
| 22 | --------------------------- | ||
| 23 | |||
| 9 | What: RAW driver (CONFIG_RAW_DRIVER) | 24 | What: RAW driver (CONFIG_RAW_DRIVER) |
| 10 | When: December 2005 | 25 | When: December 2005 |
| 11 | Why: declared obsolete since kernel 2.6.3 | 26 | Why: declared obsolete since kernel 2.6.3 |
| @@ -55,6 +70,18 @@ Who: Mauro Carvalho Chehab <mchehab@brturbo.com.br> | |||
| 55 | 70 | ||
| 56 | --------------------------- | 71 | --------------------------- |
| 57 | 72 | ||
| 73 | What: sys_sysctl | ||
| 74 | When: January 2007 | ||
| 75 | Why: The same information is available through /proc/sys and that is the | ||
| 76 | interface user space prefers to use. And there do not appear to be | ||
| 77 | any existing users of sys_sysctl in user space. The additional | ||
| 78 | maintenance overhead of keeping a set of binary names gets | ||
| 79 | in the way of doing a good job of maintaining this interface. | ||
| 80 | |||
| 81 | Who: Eric Biederman <ebiederm@xmission.com> | ||
| 82 | |||
| 83 | --------------------------- | ||
| 84 | |||
| 58 | What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) | 85 | What: PCMCIA control ioctl (needed for pcmcia-cs [cardmgr, cardctl]) |
| 59 | When: November 2005 | 86 | When: November 2005 |
| 60 | Files: drivers/pcmcia/: pcmcia_ioctl.c | 87 | Files: drivers/pcmcia/: pcmcia_ioctl.c |
| @@ -202,14 +229,6 @@ Who: Nick Piggin <npiggin@suse.de> | |||
| 202 | 229 | ||
| 203 | --------------------------- | 230 | --------------------------- |
| 204 | 231 | ||
| 205 | What: Support for the MIPS EV96100 evaluation board | ||
| 206 | When: September 2006 | ||
| 207 | Why: Does no longer build since at least November 15, 2003, apparently | ||
| 208 | no userbase left. | ||
| 209 | Who: Ralf Baechle <ralf@linux-mips.org> | ||
| 210 | |||
| 211 | --------------------------- | ||
| 212 | |||
| 213 | What: Support for the Momentum / PMC-Sierra Jaguar ATX evaluation board | 232 | What: Support for the Momentum / PMC-Sierra Jaguar ATX evaluation board |
| 214 | When: September 2006 | 233 | When: September 2006 |
| 215 | Why: Has not built for quite some time, and was never popular, | 234 | Why: Has not built for quite some time, and was never popular, |
| @@ -294,3 +313,24 @@ Why: The frame diverter is included in most distribution kernels, but is | |||
| 294 | It is not clear if anyone is still using it. | 313 | It is not clear if anyone is still using it. |
| 295 | Who: Stephen Hemminger <shemminger@osdl.org> | 314 | Who: Stephen Hemminger <shemminger@osdl.org> |
| 296 | 315 | ||
| 316 | --------------------------- | ||
| 317 | |||
| 318 | |||
| 319 | What: PHYSDEVPATH, PHYSDEVBUS, PHYSDEVDRIVER in the uevent environment | ||
| 320 | When: October 2008 | ||
| 321 | Why: The stacking of class devices makes these values misleading and | ||
| 322 | inconsistent. | ||
| 323 | Class devices should not carry any of these properties, and bus | ||
| 324 | devices have SUBSYSTEM and DRIVER as a replacement. | ||
| 325 | Who: Kay Sievers <kay.sievers@suse.de> | ||
| 326 | |||
| 327 | --------------------------- | ||
| 328 | |||
| 329 | What: i2c-isa | ||
| 330 | When: December 2006 | ||
| 331 | Why: i2c-isa makes no sense and doesn't fit in the device driver | ||
| 332 | model. Drivers relying on it are better implemented as platform | ||
| 333 | drivers. | ||
| 334 | Who: Jean Delvare <khali@linux-fr.org> | ||
| 335 | |||
| 336 | --------------------------- | ||
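The sys_sysctl entry above points to /proc/sys as the interface user space prefers. As a rough illustration of that route, the sketch below turns a dotted name into the matching /proc/sys path. The dotted-name convention comes from the sysctl(8) tool rather than from this patch, and `sysctl_name_to_path` is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Convert a dotted name such as "kernel.hostname" into the matching
 * /proc/sys path. Returns 0 on success, -1 if `out` is too small.
 * Limitation of this sketch: a name component that itself contains a
 * dot (some network interface names do) is not handled.
 */
int sysctl_name_to_path(const char *name, char *out, size_t outlen)
{
    static const char prefix[] = "/proc/sys/";
    int n = snprintf(out, outlen, "%s%s", prefix, name);
    char *p;

    if (n < 0 || (size_t)n >= outlen)
        return -1;

    /* Rewrite only the part after the fixed prefix. */
    for (p = out + strlen(prefix); *p; p++)
        if (*p == '.')
            *p = '/';
    return 0;
}
```

The resulting path can then be opened and read or written with ordinary file I/O, which is the access pattern the removal notice recommends over the binary sys_sysctl call.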
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 99902ae6804e..7db71d6fba82 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
| @@ -1124,11 +1124,15 @@ debugging information is displayed on console. | |||
| 1124 | NMI switch that most IA32 servers have fires unknown NMI up, for example. | 1124 | NMI switch that most IA32 servers have fires unknown NMI up, for example. |
| 1125 | If a system hangs up, try pressing the NMI switch. | 1125 | If a system hangs up, try pressing the NMI switch. |
| 1126 | 1126 | ||
| 1127 | [NOTE] | 1127 | nmi_watchdog |
| 1128 | This function and oprofile share a NMI callback. Therefore this function | 1128 | ------------ |
| 1129 | cannot be enabled when oprofile is activated. | 1129 | |
| 1130 | And NMI watchdog will be disabled when the value in this file is set to | 1130 | Enables/Disables the NMI watchdog on x86 systems. When the value is non-zero |
| 1131 | non-zero. | 1131 | the NMI watchdog is enabled and will continuously test all online cpus to |
| 1132 | determine whether or not they are still functioning properly. | ||
| 1133 | |||
| 1134 | Because the NMI watchdog shares registers with oprofile, by disabling the NMI | ||
| 1135 | watchdog, oprofile may have more registers to utilize. | ||
| 1132 | 1136 | ||
| 1133 | 1137 | ||
| 1134 | 2.4 /proc/sys/vm - The virtual memory subsystem | 1138 | 2.4 /proc/sys/vm - The virtual memory subsystem |
diff --git a/Documentation/i2c/busses/i2c-viapro b/Documentation/i2c/busses/i2c-viapro
index 16775663b9f5..25680346e0ac 100644
--- a/Documentation/i2c/busses/i2c-viapro
+++ b/Documentation/i2c/busses/i2c-viapro
| @@ -7,9 +7,12 @@ Supported adapters: | |||
| 7 | * VIA Technologies, Inc. VT82C686A/B | 7 | * VIA Technologies, Inc. VT82C686A/B |
| 8 | Datasheet: Sometimes available at the VIA website | 8 | Datasheet: Sometimes available at the VIA website |
| 9 | 9 | ||
| 10 | * VIA Technologies, Inc. VT8231, VT8233, VT8233A, VT8235, VT8237R | 10 | * VIA Technologies, Inc. VT8231, VT8233, VT8233A |
| 11 | Datasheet: available on request from VIA | 11 | Datasheet: available on request from VIA |
| 12 | 12 | ||
| 13 | * VIA Technologies, Inc. VT8235, VT8237R, VT8237A, VT8251 | ||
| 14 | Datasheet: available on request and under NDA from VIA | ||
| 15 | |||
| 13 | Authors: | 16 | Authors: |
| 14 | Kyösti Mälkki <kmalkki@cc.hut.fi>, | 17 | Kyösti Mälkki <kmalkki@cc.hut.fi>, |
| 15 | Mark D. Studebaker <mdsxyz123@yahoo.com>, | 18 | Mark D. Studebaker <mdsxyz123@yahoo.com>, |
| @@ -39,6 +42,8 @@ Your lspci -n listing must show one of these : | |||
| 39 | device 1106:8235 (VT8231 function 4) | 42 | device 1106:8235 (VT8231 function 4) |
| 40 | device 1106:3177 (VT8235) | 43 | device 1106:3177 (VT8235) |
| 41 | device 1106:3227 (VT8237R) | 44 | device 1106:3227 (VT8237R) |
| 45 | device 1106:3337 (VT8237A) | ||
| 46 | device 1106:3287 (VT8251) | ||
| 42 | 47 | ||
| 43 | If none of these show up, you should look in the BIOS for settings like | 48 | If none of these show up, you should look in the BIOS for settings like |
| 44 | enable ACPI / SMBus or even USB. | 49 | enable ACPI / SMBus or even USB. |
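The lspci -n list in the hunk above pairs PCI IDs with chip names. A lookup over those pairs could be sketched as follows. The table holds only the IDs visible in this hunk (the full file lists more), and `via_smbus_chip` is a hypothetical helper, not part of the i2c-viapro driver.

```c
#include <stddef.h>

struct via_smbus_id {
    unsigned short device;  /* PCI device ID; the vendor ID is 0x1106 */
    const char *chip;       /* chip name as printed in the document */
};

/* Only the entries shown in the hunk above. */
static const struct via_smbus_id via_smbus_ids[] = {
    { 0x8235, "VT8231 function 4" },
    { 0x3177, "VT8235" },
    { 0x3227, "VT8237R" },
    { 0x3337, "VT8237A" },
    { 0x3287, "VT8251" },
};

/* Return the chip name for a vendor:device pair, or NULL if unknown. */
const char *via_smbus_chip(unsigned short vendor, unsigned short device)
{
    size_t i;

    if (vendor != 0x1106)
        return NULL;

    for (i = 0; i < sizeof via_smbus_ids / sizeof via_smbus_ids[0]; i++)
        if (via_smbus_ids[i].device == device)
            return via_smbus_ids[i].chip;
    return NULL;
}
```

A NULL result corresponds to the "none of these show up" case, where the document suggests checking the BIOS settings.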
diff --git a/Documentation/i2c/i2c-stub b/Documentation/i2c/i2c-stub
index d6dcb138abf5..9cc081e69764 100644
--- a/Documentation/i2c/i2c-stub
+++ b/Documentation/i2c/i2c-stub
| @@ -6,9 +6,12 @@ This module is a very simple fake I2C/SMBus driver. It implements four | |||
| 6 | types of SMBus commands: write quick, (r/w) byte, (r/w) byte data, and | 6 | types of SMBus commands: write quick, (r/w) byte, (r/w) byte data, and |
| 7 | (r/w) word data. | 7 | (r/w) word data. |
| 8 | 8 | ||
| 9 | You need to provide a chip address as a module parameter when loading | ||
| 10 | this driver, which will then only react to SMBus commands to this address. | ||
| 11 | |||
| 9 | No hardware is needed nor associated with this module. It will accept write | 12 | No hardware is needed nor associated with this module. It will accept write |
| 10 | quick commands to all addresses; it will respond to the other commands (also | 13 | quick commands to one address; it will respond to the other commands (also |
| 11 | to all addresses) by reading from or writing to an array in memory. It will | 14 | to one address) by reading from or writing to an array in memory. It will |
| 12 | also spam the kernel logs for every command it handles. | 15 | also spam the kernel logs for every command it handles. |
| 13 | 16 | ||
| 14 | A pointer register with auto-increment is implemented for all byte | 17 | A pointer register with auto-increment is implemented for all byte |
| @@ -21,6 +24,11 @@ The typical use-case is like this: | |||
| 21 | 3. load the target sensors chip driver module | 24 | 3. load the target sensors chip driver module |
| 22 | 4. observe its behavior in the kernel log | 25 | 4. observe its behavior in the kernel log |
| 23 | 26 | ||
| 27 | PARAMETERS: | ||
| 28 | |||
| 29 | int chip_addr: | ||
| 30 | The SMBus address to emulate a chip at. | ||
| 31 | |||
| 24 | CAVEATS: | 32 | CAVEATS: |
| 25 | 33 | ||
| 26 | There are independent arrays for byte/data and word/data commands. Depending | 34 | There are independent arrays for byte/data and word/data commands. Depending |
| @@ -33,6 +41,9 @@ If the hardware for your driver has banked registers (e.g. Winbond sensors | |||
| 33 | chips) this module will not work well - although it could be extended to | 41 | chips) this module will not work well - although it could be extended to |
| 34 | support that pretty easily. | 42 | support that pretty easily. |
| 35 | 43 | ||
| 44 | Only one chip address is supported - although this module could be | ||
| 45 | extended to support more. | ||
| 46 | |||
| 36 | If you spam it hard enough, printk can be lossy. This module really wants | 47 | If you spam it hard enough, printk can be lossy. This module really wants |
| 37 | something like relayfs. | 48 | something like relayfs. |
| 38 | 49 | ||
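The behavior described above (a single emulated address, with independent in-memory arrays for byte data and word data commands) can be modeled in a few lines of userspace C. This is an illustrative sketch only, not the module's code. Every name in it (`stub_chip`, `stub_init`, `stub_read_byte_data`, and so on) is hypothetical, and the -1 return for a non-matching address is an assumption standing in for whatever error the real module reports.

```c
#include <string.h>

struct stub_chip {
    int chip_addr;              /* the one address that is emulated */
    unsigned char bytes[256];   /* backing store for byte data commands */
    unsigned short words[256];  /* separate store for word data commands */
};

void stub_init(struct stub_chip *chip, int chip_addr)
{
    memset(chip, 0, sizeof *chip);
    chip->chip_addr = chip_addr;
}

/* Each accessor returns -1 when addr is not the emulated address. */
int stub_write_byte_data(struct stub_chip *chip, int addr,
                         unsigned char command, unsigned char value)
{
    if (addr != chip->chip_addr)
        return -1;
    chip->bytes[command] = value;
    return 0;
}

int stub_read_byte_data(const struct stub_chip *chip, int addr,
                        unsigned char command)
{
    if (addr != chip->chip_addr)
        return -1;
    return chip->bytes[command];
}

int stub_write_word_data(struct stub_chip *chip, int addr,
                         unsigned char command, unsigned short value)
{
    if (addr != chip->chip_addr)
        return -1;
    chip->words[command] = value;
    return 0;
}

int stub_read_word_data(const struct stub_chip *chip, int addr,
                        unsigned char command)
{
    if (addr != chip->chip_addr)
        return -1;
    return chip->words[command];
}
```

Because the two arrays are independent, a byte written at a given command code is not visible through a word read at the same code, which is the caveat the document raises for drivers that mix access widths.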
diff --git a/Documentation/kbuild/makefiles.txt b/Documentation/kbuild/makefiles.txt
index b7d6abb501a6..e2cbd59cf2d0 100644
--- a/Documentation/kbuild/makefiles.txt
+++ b/Documentation/kbuild/makefiles.txt
| @@ -421,6 +421,11 @@ more details, with real examples. | |||
| 421 | The second argument is optional, and if supplied will be used | 421 | The second argument is optional, and if supplied will be used |
| 422 | if first argument is not supported. | 422 | if first argument is not supported. |
| 423 | 423 | ||
| 424 | as-instr | ||
| 425 | as-instr checks if the assembler supports a specific instruction | ||
| 426 | and then outputs either option1 or option2. | ||
| 427 | C escapes are supported in the test instruction. | ||
| 428 | |||
| 424 | cc-option | 429 | cc-option |
| 425 | cc-option is used to check if $(CC) supports a given option and, if it | 430 | cc-option is used to check if $(CC) supports a given option and, if it |
| 426 | is not supported, to use an optional second option. | 431 | is not supported, to use an optional second option. |
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 71d05f481727..54983246930d 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
| @@ -573,8 +573,6 @@ running once the system is up. | |||
| 573 | gscd= [HW,CD] | 573 | gscd= [HW,CD] |
| 574 | Format: <io> | 574 | Format: <io> |
| 575 | 575 | ||
| 576 | gt96100eth= [NET] MIPS GT96100 Advanced Communication Controller | ||
| 577 | |||
| 578 | gus= [HW,OSS] | 576 | gus= [HW,OSS] |
| 579 | Format: <io>,<irq>,<dma>,<dma16> | 577 | Format: <io>,<irq>,<dma>,<dma16> |
| 580 | 578 | ||
| @@ -1240,7 +1238,11 @@ running once the system is up. | |||
| 1240 | bootloader. This is currently used on | 1238 | bootloader. This is currently used on |
| 1241 | IXP2000 systems where the bus has to be | 1239 | IXP2000 systems where the bus has to be |
| 1242 | configured a certain way for adjunct CPUs. | 1240 | configured a certain way for adjunct CPUs. |
| 1243 | 1241 | noearly [X86] Don't do any early type 1 scanning. | |
| 1242 | This might help on some broken boards which | ||
| 1243 | machine check when some devices' config space | ||
| 1244 | is read. But various workarounds are disabled | ||
| 1245 | and some IOMMU drivers will not work. | ||
| 1244 | pcmv= [HW,PCMCIA] BadgePAD 4 | 1246 | pcmv= [HW,PCMCIA] BadgePAD 4 |
| 1245 | 1247 | ||
| 1246 | pd. [PARIDE] | 1248 | pd. [PARIDE] |
| @@ -1363,6 +1365,14 @@ running once the system is up. | |||
| 1363 | 1365 | ||
| 1364 | reserve= [KNL,BUGS] Force the kernel to ignore some iomem area | 1366 | reserve= [KNL,BUGS] Force the kernel to ignore some iomem area |
| 1365 | 1367 | ||
| 1368 | reservetop= [IA-32] | ||
| 1369 | Format: nn[KMG] | ||
| 1370 | Reserves a hole at the top of the kernel virtual | ||
| 1371 | address space. | ||
| 1372 | |||
| 1373 | reset_devices [KNL] Force drivers to reset the underlying device | ||
| 1374 | during initialization. | ||
| 1375 | |||
| 1366 | resume= [SWSUSP] | 1376 | resume= [SWSUSP] |
| 1367 | Specify the partition device for software suspend | 1377 | Specify the partition device for software suspend |
| 1368 | 1378 | ||
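
The "nn[KMG]" format used by reservetop= above can be illustrated with a small userspace parser. This is only a sketch: parse_size_kmg() is a hypothetical helper, not the kernel's own parsing routine, and accepting lowercase suffixes and C-style number prefixes (such as 0x) is an assumption not stated in the text.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse a size written as "nn[KMG]" into bytes.
 * Returns 0 on success and stores the result in *bytes,
 * or returns -1 if the text is malformed or the value overflows.
 */
int parse_size_kmg(const char *text, unsigned long long *bytes)
{
    char *end;
    unsigned long long value;
    unsigned shift = 0;

    assert(text != NULL && bytes != NULL);

    /* Require a leading digit so that "-1" or a bare "M" is rejected. */
    if (!isdigit((unsigned char)text[0]))
        return -1;

    errno = 0;
    value = strtoull(text, &end, 0);
    if (end == text || errno != 0)
        return -1;

    /* Optional binary suffix: K = 2^10, M = 2^20, G = 2^30. */
    switch (*end) {
    case 'K': case 'k': shift = 10; end++; break;
    case 'M': case 'm': shift = 20; end++; break;
    case 'G': case 'g': shift = 30; end++; break;
    default: break;
    }

    if (*end != '\0')
        return -1;                      /* trailing characters */
    if (value > (ULLONG_MAX >> shift))
        return -1;                      /* would overflow when scaled */

    *bytes = value << shift;
    return 0;
}
```

With this sketch, a value such as "16M" would be read as 16 * 2^20 bytes, while "12X" would be rejected.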
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index afac780445cd..dc942eaf490f 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt | |||
| @@ -192,6 +192,17 @@ or, for backwards compatibility, the option value. E.g., | |||
| 192 | arp_interval | 192 | arp_interval |
| 193 | 193 | ||
| 194 | Specifies the ARP link monitoring frequency in milliseconds. | 194 | Specifies the ARP link monitoring frequency in milliseconds. |
| 195 | |||
| 196 | The ARP monitor works by periodically checking the slave | ||
| 197 | devices to determine whether they have sent or received | ||
| 198 | traffic recently (the precise criteria depend upon the | ||
| 199 | bonding mode, and the state of the slave). Regular traffic is | ||
| 200 | generated via ARP probes issued for the addresses specified by | ||
| 201 | the arp_ip_target option. | ||
| 202 | |||
| 203 | This behavior can be modified by the arp_validate option, | ||
| 204 | below. | ||
| 205 | |||
| 195 | If ARP monitoring is used in an etherchannel compatible mode | 206 | If ARP monitoring is used in an etherchannel compatible mode |
| 196 | (modes 0 and 2), the switch should be configured in a mode | 207 | (modes 0 and 2), the switch should be configured in a mode |
| 197 | that evenly distributes packets across all links. If the | 208 | that evenly distributes packets across all links. If the |
| @@ -213,6 +224,54 @@ arp_ip_target | |||
| 213 | maximum number of targets that can be specified is 16. The | 224 | maximum number of targets that can be specified is 16. The |
| 214 | default value is no IP addresses. | 225 | default value is no IP addresses. |
| 215 | 226 | ||
| 227 | arp_validate | ||
| 228 | |||
| 229 | Specifies whether or not ARP probes and replies should be | ||
| 230 | validated in the active-backup mode. This causes the ARP | ||
| 231 | monitor to examine the incoming ARP requests and replies, and | ||
| 232 | only consider a slave to be up if it is receiving the | ||
| 233 | appropriate ARP traffic. | ||
| 234 | |||
| 235 | Possible values are: | ||
| 236 | |||
| 237 | none or 0 | ||
| 238 | |||
| 239 | No validation is performed. This is the default. | ||
| 240 | |||
| 241 | active or 1 | ||
| 242 | |||
| 243 | Validation is performed only for the active slave. | ||
| 244 | |||
| 245 | backup or 2 | ||
| 246 | |||
| 247 | Validation is performed only for backup slaves. | ||
| 248 | |||
| 249 | all or 3 | ||
| 250 | |||
| 251 | Validation is performed for all slaves. | ||
| 252 | |||
| 253 | For the active slave, the validation checks ARP replies to | ||
| 254 | confirm that they were generated by an arp_ip_target. Since | ||
| 255 | backup slaves do not typically receive these replies, the | ||
| 256 | validation performed for backup slaves is on the ARP request | ||
| 257 | sent out via the active slave. It is possible that some | ||
| 258 | switch or network configurations may result in situations | ||
| 259 | wherein the backup slaves do not receive the ARP requests; in | ||
| 260 | such a situation, validation of backup slaves must be | ||
| 261 | disabled. | ||
| 262 | |||
| 263 | This option is useful in network configurations in which | ||
| 264 | multiple bonding hosts are concurrently issuing ARPs to one or | ||
| 265 | more targets beyond a common switch. Should the link between | ||
| 266 | the switch and target fail (but not the switch itself), the | ||
| 267 | probe traffic generated by the multiple bonding instances will | ||
| 268 | fool the standard ARP monitor into considering the links as | ||
| 269 | still up. Use of the arp_validate option can resolve this, as | ||
| 270 | the ARP monitor will only consider ARP requests and replies | ||
| 271 | associated with its own instance of bonding. | ||
| 272 | |||
| 273 | This option was added in bonding version 3.1.0. | ||
| 274 | |||
| 216 | downdelay | 275 | downdelay |
| 217 | 276 | ||
| 218 | Specifies the time, in milliseconds, to wait before disabling | 277 | Specifies the time, in milliseconds, to wait before disabling |
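
The arp_validate values listed above (none/0, active/1, backup/2, all/3) behave like a two-bit mask, one bit for the active slave and one for backup slaves. The following userspace sketch models that reading; the enum and both function names are hypothetical and are not taken from the bonding driver source.

```c
#include <assert.h>
#include <string.h>

/* Numeric values as listed in the option description. */
enum arp_validate_mode {
    ARP_VALIDATE_NONE   = 0,
    ARP_VALIDATE_ACTIVE = 1,
    ARP_VALIDATE_BACKUP = 2,
    ARP_VALIDATE_ALL    = 3     /* same as ACTIVE | BACKUP */
};

/*
 * Hypothetical parser: accept either the option name or its numeric
 * value, as the documentation allows. Returns the mode, or -1 if the
 * string is not recognized.
 */
int parse_arp_validate(const char *value)
{
    static const char *const names[] = { "none", "active", "backup", "all" };
    int i;

    assert(value != NULL);

    for (i = 0; i < 4; i++) {
        if (strcmp(value, names[i]) == 0)
            return i;
        if (value[0] == '0' + i && value[1] == '\0')
            return i;
    }
    return -1;
}

/* Decide whether a slave in the given role has its ARP traffic validated. */
int slave_is_validated(int mode, int is_active_slave)
{
    int bit = is_active_slave ? ARP_VALIDATE_ACTIVE : ARP_VALIDATE_BACKUP;

    return (mode & bit) != 0;
}
```

Under this model, "backup" validates only backup slaves, which matches the description of value 2, and "all" validates both roles.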
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index c45daabd3bfe..74563b38ffd9 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt | |||
| @@ -1,7 +1,6 @@ | |||
| 1 | DCCP protocol | 1 | DCCP protocol |
| 2 | ============ | 2 | ============ |
| 3 | 3 | ||
| 4 | Last updated: 10 November 2005 | ||
| 5 | 4 | ||
| 6 | Contents | 5 | Contents |
| 7 | ======== | 6 | ======== |
| @@ -42,8 +41,11 @@ Socket options | |||
| 42 | DCCP_SOCKOPT_PACKET_SIZE is used for CCID3 to set default packet size for | 41 | DCCP_SOCKOPT_PACKET_SIZE is used for CCID3 to set default packet size for |
| 43 | calculations. | 42 | calculations. |
| 44 | 43 | ||
| 45 | DCCP_SOCKOPT_SERVICE sets the service. This is compulsory as per the | 44 | DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of |
| 46 | specification. If you don't set it you will get EPROTO. | 45 | service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, |
| 46 | the socket will fall back to 0 (which means that no meaningful service code | ||
| 47 | is present). Connecting sockets set at most one service option; for | ||
| 48 | listening sockets, multiple service codes can be specified. | ||
| 47 | 49 | ||
| 48 | Notes | 50 | Notes |
| 49 | ===== | 51 | ===== |
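
Setting a service code with DCCP_SOCKOPT_SERVICE might look like the sketch below. The wrapper dccp_set_service() is a hypothetical name. The fallback constant values are assumptions for systems whose headers do not define them, and passing the code in network byte order is likewise an assumption, since the text above does not state the expected byte order.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <sys/socket.h>

/* Fallback definitions; assumed values, used only if the headers lack them. */
#ifndef SOL_DCCP
#define SOL_DCCP 269
#endif
#ifndef DCCP_SOCKOPT_SERVICE
#define DCCP_SOCKOPT_SERVICE 2
#endif

/*
 * Hypothetical wrapper: set a single service code on a DCCP socket.
 * Returns the setsockopt() result: 0 on success, -1 on error with
 * errno set.
 */
int dccp_set_service(int fd, uint32_t service_code)
{
    /* Assumption: the kernel expects the code in network byte order. */
    uint32_t wire = htonl(service_code);

    return setsockopt(fd, SOL_DCCP, DCCP_SOCKOPT_SERVICE,
                      &wire, sizeof(wire));
}
```

A connecting socket would call this once before connect(). How a listening socket passes several service codes is not covered by this sketch.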
diff --git a/Documentation/nommu-mmap.txt b/Documentation/nommu-mmap.txt index b88ebe4d808c..7714f57caad5 100644 --- a/Documentation/nommu-mmap.txt +++ b/Documentation/nommu-mmap.txt | |||
| @@ -116,6 +116,9 @@ FURTHER NOTES ON NO-MMU MMAP | |||
| 116 | (*) A list of all the mappings on the system is visible through /proc/maps in | 116 | (*) A list of all the mappings on the system is visible through /proc/maps in |
| 117 | no-MMU mode. | 117 | no-MMU mode. |
| 118 | 118 | ||
| 119 | (*) A list of all the mappings in use by a process is visible through | ||
| 120 | /proc/<pid>/maps in no-MMU mode. | ||
| 121 | |||
| 119 | (*) Supplying MAP_FIXED or requesting a particular mapping address will | 122 | (*) Supplying MAP_FIXED or requesting a particular mapping address will |
| 120 | result in an error. | 123 | result in an error. |
| 121 | 124 | ||
| @@ -125,6 +128,49 @@ FURTHER NOTES ON NO-MMU MMAP | |||
| 125 | error will result if they don't. This is most likely to be encountered | 128 | error will result if they don't. This is most likely to be encountered |
| 126 | with character device files, pipes, fifos and sockets. | 129 | with character device files, pipes, fifos and sockets. |
| 127 | 130 | ||
| 131 | |||
| 132 | ========================== | ||
| 133 | INTERPROCESS SHARED MEMORY | ||
| 134 | ========================== | ||
| 135 | |||
| 136 | Both SYSV IPC SHM shared memory and POSIX shared memory are supported in NOMMU | ||
| 137 | mode. The former through the usual mechanism, the latter through files created | ||
| 138 | on ramfs or tmpfs mounts. | ||
| 139 | |||
| 140 | |||
| 141 | ======= | ||
| 142 | FUTEXES | ||
| 143 | ======= | ||
| 144 | |||
| 145 | Futexes are supported in NOMMU mode if the arch supports them. An error will | ||
| 146 | be given if an address passed to the futex system call lies outside the | ||
| 147 | mappings made by a process or if the mapping in which the address lies does not | ||
| 148 | support futexes (such as an I/O chardev mapping). | ||
| 149 | |||
| 150 | |||
| 151 | ============= | ||
| 152 | NO-MMU MREMAP | ||
| 153 | ============= | ||
| 154 | |||
| 155 | The mremap() function is partially supported. It may change the size of a | ||
| 156 | mapping, and may move it[*] if MREMAP_MAYMOVE is specified and if the new size | ||
| 157 | of the mapping exceeds the size of the slab object currently occupied by the | ||
| 158 | memory to which the mapping refers, or if a smaller slab object could be used. | ||
| 159 | |||
| 160 | MREMAP_FIXED is not supported, though it is ignored if there's no change of | ||
| 161 | address and the object does not need to be moved. | ||
| 162 | |||
| 163 | Shared mappings may not be moved. Shareable mappings may not be moved either, | ||
| 164 | even if they are not currently shared. | ||
| 165 | |||
| 166 | The mremap() function must be given an exact match for base address and size of | ||
| 167 | a previously mapped object. It may not be used to create holes in existing | ||
| 168 | mappings, move parts of existing mappings or resize parts of mappings. It must | ||
| 169 | act on a complete mapping. | ||
| 170 | |||
| 171 | [*] Not currently supported. | ||
| 172 | |||
| 173 | |||
| 128 | ============================================ | 174 | ============================================ |
| 129 | PROVIDING SHAREABLE CHARACTER DEVICE SUPPORT | 175 | PROVIDING SHAREABLE CHARACTER DEVICE SUPPORT |
| 130 | ============================================ | 176 | ============================================ |
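
The no-MMU mremap() rule above, that a request must match the base address and size of a previously mapped object exactly, can be modelled in a few lines. This is a userspace illustration only: struct mapping and find_exact_mapping() are hypothetical and do not reflect how the kernel tracks mappings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one complete mapping. */
struct mapping {
    uintptr_t base;
    size_t size;
};

/*
 * Return the index of the mapping that matches (base, size) exactly,
 * or -1 if there is none. A request that covers only part of a
 * mapping, or starts inside one, is refused, mirroring the rule that
 * mremap() must act on a complete mapping.
 */
int find_exact_mapping(const struct mapping *maps, size_t count,
                       uintptr_t base, size_t size)
{
    size_t i;

    assert(maps != NULL || count == 0);

    for (i = 0; i < count; i++) {
        if (maps[i].base == base && maps[i].size == size)
            return (int)i;
    }
    return -1;
}
```

Given mappings at 0x1000 (size 0x2000) and 0x8000 (size 0x1000), a request for (0x1000, 0x1000) fails in this model because it would resize only part of the first mapping.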
diff --git a/Documentation/pcieaer-howto.txt b/Documentation/pcieaer-howto.txt new file mode 100644 index 000000000000..16c251230c82 --- /dev/null +++ b/Documentation/pcieaer-howto.txt | |||
| @@ -0,0 +1,253 @@ | |||
| 1 | The PCI Express Advanced Error Reporting Driver Guide HOWTO | ||
| 2 | T. Long Nguyen <tom.l.nguyen@intel.com> | ||
| 3 | Yanmin Zhang <yanmin.zhang@intel.com> | ||
| 4 | 07/29/2006 | ||
| 5 | |||
| 6 | |||
| 7 | 1. Overview | ||
| 8 | |||
| 9 | 1.1 About this guide | ||
| 10 | |||
| 11 | This guide describes the basics of the PCI Express Advanced Error | ||
| 12 | Reporting (AER) driver and provides information on how to use it, as | ||
| 13 | well as how to enable the drivers of endpoint devices to conform with | ||
| 14 | PCI Express AER driver. | ||
| 15 | |||
| 16 | 1.2 Copyright © Intel Corporation 2006. | ||
| 17 | |||
| 18 | 1.3 What is the PCI Express AER Driver? | ||
| 19 | |||
| 20 | PCI Express error signaling can occur on the PCI Express link itself | ||
| 21 | or on behalf of transactions initiated on the link. PCI Express | ||
| 22 | defines two error reporting paradigms: the baseline capability and | ||
| 23 | the Advanced Error Reporting capability. The baseline capability is | ||
| 24 | required of all PCI Express components providing a minimum defined | ||
| 25 | set of error reporting requirements. Advanced Error Reporting | ||
| 26 | capability is implemented with a PCI Express advanced error reporting | ||
| 27 | extended capability structure providing more robust error reporting. | ||
| 28 | |||
| 29 | The PCI Express AER driver provides the infrastructure to support PCI | ||
| 30 | Express Advanced Error Reporting capability. The PCI Express AER | ||
| 31 | driver provides three basic functions: | ||
| 32 | |||
| 33 | - Gathers the comprehensive error information if errors occurred. | ||
| 34 | - Reports error to the users. | ||
| 35 | - Performs error recovery actions. | ||
| 36 | |||
| 37 | The AER driver only attaches to root ports which support PCI-Express AER | ||
| 38 | capability. | ||
| 39 | |||
| 40 | |||
| 41 | 2. User Guide | ||
| 42 | |||
| 43 | 2.1 Include the PCI Express AER Root Driver into the Linux Kernel | ||
| 44 | |||
| 45 | The PCI Express AER Root driver is a Root Port service driver attached | ||
| 46 | to the PCI Express Port Bus driver. If a user wants to use it, the driver | ||
| 47 | has to be compiled. Option CONFIG_PCIEAER supports this capability. It | ||
| 48 | depends on CONFIG_PCIEPORTBUS, so please set CONFIG_PCIEPORTBUS=y and | ||
| 49 | CONFIG_PCIEAER=y. | ||
| 50 | |||
| 51 | 2.2 Load PCI Express AER Root Driver | ||
| 52 | There is a case where a system has AER support in BIOS. Enabling the AER | ||
| 53 | Root driver and having AER support in BIOS may result in unpredictable | ||
| 54 | behavior. To avoid this conflict, a successful load of the AER Root driver | ||
| 55 | requires ACPI _OSC support in the BIOS to allow the AER Root driver to | ||
| 56 | request native control of AER. See the PCI FW 3.0 Specification for | ||
| 57 | details regarding _OSC usage. Currently, much firmware doesn't provide | ||
| 58 | _OSC support even though it uses PCI Express. To support such firmware, | ||
| 59 | forceload, a parameter of type bool, lets AER initialization continue | ||
| 60 | even though the firmware has no _OSC support. To enable the | ||
| 61 | workaround, please add aerdriver.forceload=y to the kernel boot parameter | ||
| 62 | line when booting the kernel. Note that forceload=n by default. | ||
| 63 | |||
| 64 | 2.3 AER error output | ||
| 65 | When a PCI-E AER error is captured, an error message will be output to | ||
| 66 | the console. If it's a correctable error, it is output as a warning. | ||
| 67 | Otherwise, it is printed as an error. So users can choose a different | ||
| 68 | log level to filter out correctable error messages. | ||
| 69 | |||
| 70 | An example is shown below. | ||
| 71 | +------ PCI-Express Device Error -----+ | ||
| 72 | Error Severity : Uncorrected (Fatal) | ||
| 73 | PCIE Bus Error type : Transaction Layer | ||
| 74 | Unsupported Request : First | ||
| 75 | Requester ID : 0500 | ||
| 76 | VendorID=8086h, DeviceID=0329h, Bus=05h, Device=00h, Function=00h | ||
| 77 | TLB Header: | ||
| 78 | 04000001 00200a03 05010000 00050100 | ||
| 79 | |||
| 80 | In the example, 'Requester ID' means the ID of the device that sends | ||
| 81 | the error message to the root port. Please refer to the PCI Express | ||
| 82 | specifications for the other fields. | ||
| 83 | |||
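
The relationship between the 'Requester ID' value and the Bus/Device/Function fields in the sample output can be sketched as below. The layout used here (8-bit bus, 5-bit device, 3-bit function) is the conventional PCI encoding and is consistent with the example, where 0500 corresponds to Bus=05h, Device=00h, Function=00h. struct pci_bdf and decode_requester_id() are hypothetical names, not part of the AER driver.

```c
#include <stdint.h>

/* Hypothetical holder for the decoded fields. */
struct pci_bdf {
    unsigned bus;       /* bits 15..8 */
    unsigned device;    /* bits 7..3  */
    unsigned function;  /* bits 2..0  */
};

/* Split a 16-bit requester ID into bus, device and function numbers. */
struct pci_bdf decode_requester_id(uint16_t id)
{
    struct pci_bdf bdf;

    bdf.bus = (id >> 8) & 0xffu;
    bdf.device = (id >> 3) & 0x1fu;
    bdf.function = id & 0x07u;
    return bdf;
}
```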
| 84 | |||
| 85 | 3. Developer Guide | ||
| 86 | |||
| 87 | Enabling AER-aware support requires a software driver to configure | ||
| 88 | the AER capability structure within its device and to provide callbacks. | ||
| 89 | |||
| 90 | To support AER well, developers first need to understand how AER | ||
| 91 | works. | ||
| 92 | |||
| 93 | PCI Express errors are classified into two types: correctable errors | ||
| 94 | and uncorrectable errors. This classification is based on the impacts | ||
| 95 | of those errors, which may result in degraded performance or function | ||
| 96 | failure. | ||
| 97 | |||
| 98 | Correctable errors pose no impacts on the functionality of the | ||
| 99 | interface. The PCI Express protocol can recover without any software | ||
| 100 | intervention or any loss of data. These errors are detected and | ||
| 101 | corrected by hardware. Unlike correctable errors, uncorrectable | ||
| 102 | errors impact functionality of the interface. Uncorrectable errors | ||
| 103 | can cause a particular transaction or a particular PCI Express link | ||
| 104 | to be unreliable. Depending on those error conditions, uncorrectable | ||
| 105 | errors are further classified into non-fatal errors and fatal errors. | ||
| 106 | Non-fatal errors cause the particular transaction to be unreliable, | ||
| 107 | but the PCI Express link itself is fully functional. Fatal errors, on | ||
| 108 | the other hand, cause the link to be unreliable. | ||
| 109 | |||
| 110 | When AER is enabled, a PCI Express device will automatically send an | ||
| 111 | error message to the PCIE root port above it when the device captures | ||
| 112 | an error. The Root Port, upon receiving an error reporting message, | ||
| 113 | internally processes and logs the error message in its PCI Express | ||
| 114 | capability structure. Error information being logged includes storing | ||
| 115 | the error reporting agent's requestor ID into the Error Source | ||
| 116 | Identification Registers and setting the error bits of the Root Error | ||
| 117 | Status Register accordingly. If AER error reporting is enabled in Root | ||
| 118 | Error Command Register, the Root Port generates an interrupt if an | ||
| 119 | error is detected. | ||
| 120 | |||
| 121 | Note that the errors as described above are related to the PCI Express | ||
| 122 | hierarchy and links. These errors do not include any device specific | ||
| 123 | errors because device specific errors will still get sent directly to | ||
| 124 | the device driver. | ||
| 125 | |||
| 126 | 3.1 Configure the AER capability structure | ||
| 127 | |||
| 128 | AER-aware drivers of PCI Express components need to change the device | ||
| 129 | control registers to enable AER. They could also change AER registers, | ||
| 130 | including mask and severity registers. The helper function | ||
| 131 | pci_enable_pcie_error_reporting could be used to enable AER. See | ||
| 132 | section 3.3. | ||
| 133 | |||
| 134 | 3.2. Provide callbacks | ||
| 135 | |||
| 136 | 3.2.1 callback reset_link to reset pci express link | ||
| 137 | |||
| 138 | This callback is used to reset the pci express physical link when a | ||
| 139 | fatal error happens. The root port aer service driver provides a | ||
| 140 | default reset_link function, but different upstream ports might | ||
| 141 | have different specifications to reset pci express link, so all | ||
| 142 | upstream ports should provide their own reset_link functions. | ||
| 143 | |||
| 144 | In struct pcie_port_service_driver, a new pointer, reset_link, is | ||
| 145 | added. | ||
| 146 | |||
| 147 | pci_ers_result_t (*reset_link) (struct pci_dev *dev); | ||
| 148 | |||
| 149 | Section 3.2.2.2 provides more detailed info on when to call | ||
| 150 | reset_link. | ||
| 151 | |||
| 152 | 3.2.2 PCI error-recovery callbacks | ||
| 153 | |||
| 154 | The PCI Express AER Root driver uses error callbacks to coordinate | ||
| 155 | with downstream device drivers associated with a hierarchy in question | ||
| 156 | when performing error recovery actions. | ||
| 157 | |||
| 158 | Data struct pci_driver has a pointer, err_handler, to point to | ||
| 159 | pci_error_handlers, which consists of a couple of callback function | ||
| 160 | pointers. The AER driver follows the rules defined in | ||
| 161 | pci-error-recovery.txt except for PCI Express specific parts (e.g. | ||
| 162 | reset_link). Please refer to pci-error-recovery.txt for detailed | ||
| 163 | definitions of the callbacks. | ||
| 164 | |||
| 165 | The sections below specify when to call the error callback functions. | ||
| 166 | |||
| 167 | 3.2.2.1 Correctable errors | ||
| 168 | |||
| 169 | Correctable errors pose no impacts on the functionality of | ||
| 170 | the interface. The PCI Express protocol can recover without any | ||
| 171 | software intervention or any loss of data. These errors do not | ||
| 172 | require any recovery actions. The AER driver clears the device's | ||
| 173 | correctable error status register accordingly and logs these errors. | ||
| 174 | |||
| 175 | 3.2.2.2 Non-correctable (non-fatal and fatal) errors | ||
| 176 | |||
| 177 | If an error message indicates a non-fatal error, performing link reset | ||
| 178 | at upstream is not required. The AER driver calls error_detected(dev, | ||
| 179 | pci_channel_io_normal) to all drivers associated within a hierarchy in | ||
| 180 | question. For example, | ||
| 181 | EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort. | ||
| 182 | If Upstream port A captures an AER error, the hierarchy consists of | ||
| 183 | Downstream port B and EndPoint. | ||
| 184 | |||
| 185 | A driver may return PCI_ERS_RESULT_CAN_RECOVER, | ||
| 186 | PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on | ||
| 187 | whether it can recover or the AER driver calls mmio_enabled as next. | ||
| 188 | |||
| 189 | If an error message indicates a fatal error, kernel will broadcast | ||
| 190 | error_detected(dev, pci_channel_io_frozen) to all drivers within | ||
| 191 | a hierarchy in question. Then, performing link reset at upstream is | ||
| 192 | necessary. As different kinds of devices might use different approaches | ||
| 193 | to reset link, AER port service driver is required to provide the | ||
| 194 | function to reset link. First, the kernel checks whether the upstream | ||
| 195 | component has an aer driver. If it has, the kernel uses the reset_link | ||
| 196 | callback of the aer driver. If the upstream component has no aer driver | ||
| 197 | and the port is a downstream port, we will use the aer driver of the | ||
| 198 | root port that reports the AER error. As for upstream ports, | ||
| 199 | they should provide their own aer service drivers with reset_link | ||
| 200 | function. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER and | ||
| 201 | reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes | ||
| 202 | to mmio_enabled. | ||
| 203 | |||
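
The handling described in sections 3.2.2.1 and 3.2.2.2 can be summarized as a small decision table. The sketch below is a userspace model only: the enum, struct and function names are hypothetical, and it covers just the first step of recovery (what is passed to error_detected and whether a link reset follows), not the later mmio_enabled stage.

```c
/* Hypothetical severity classes, following the text's classification. */
enum aer_severity { AER_CORRECTABLE, AER_NONFATAL, AER_FATAL };

/* Channel state that would be passed to error_detected(), if called. */
enum channel_state { CHANNEL_NONE, CHANNEL_IO_NORMAL, CHANNEL_IO_FROZEN };

struct recovery_plan {
    int notify_drivers;         /* call error_detected on the hierarchy? */
    enum channel_state state;   /* state argument for error_detected */
    int reset_link;             /* is a link reset at upstream needed? */
};

/* Map an error severity to the first recovery actions described above. */
struct recovery_plan plan_recovery(enum aer_severity severity)
{
    struct recovery_plan plan = { 0, CHANNEL_NONE, 0 };

    switch (severity) {
    case AER_CORRECTABLE:
        /* Status is cleared and the error logged; nothing to recover. */
        break;
    case AER_NONFATAL:
        plan.notify_drivers = 1;
        plan.state = CHANNEL_IO_NORMAL;
        break;
    case AER_FATAL:
        plan.notify_drivers = 1;
        plan.state = CHANNEL_IO_FROZEN;
        plan.reset_link = 1;
        break;
    }
    return plan;
}
```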
| 204 | 3.3 helper functions | ||
| 205 | |||
| 206 | 3.3.1 int pci_find_aer_capability(struct pci_dev *dev); | ||
| 207 | pci_find_aer_capability locates the PCI Express AER capability | ||
| 208 | in the device configuration space. If the device doesn't support | ||
| 209 | PCI-Express AER, the function returns 0. | ||
| 210 | |||
| 211 | 3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); | ||
| 212 | pci_enable_pcie_error_reporting enables the device to send error | ||
| 213 | messages to root port when an error is detected. Note that devices | ||
| 214 | don't enable error reporting by default, so device drivers need to | ||
| 215 | call this function to enable it. | ||
| 216 | |||
| 217 | 3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); | ||
| 218 | pci_disable_pcie_error_reporting stops the device from sending error | ||
| 219 | messages to the root port when an error is detected. | ||
| 220 | |||
| 221 | 3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); | ||
| 222 | pci_cleanup_aer_uncorrect_error_status cleans up the uncorrectable | ||
| 223 | error status register. | ||
| 224 | |||
| 225 | 3.4 Frequently Asked Questions | ||
| 226 | |||
| 227 | Q: What happens if a PCI Express device driver does not provide an | ||
| 228 | error recovery handler (pci_driver->err_handler is equal to NULL)? | ||
| 229 | |||
| 230 | A: The devices attached to the driver won't be recovered. If the | ||
| 231 | error is fatal, the kernel will print out warning messages. Please refer | ||
| 232 | to section 3 for more information. | ||
| 233 | |||
| 234 | Q: What happens if an upstream port service driver does not provide | ||
| 235 | callback reset_link? | ||
| 236 | |||
| 237 | A: Fatal error recovery will fail if the errors are reported by the | ||
| 238 | upstream ports to which the service driver is attached. | ||
| 239 | |||
| 240 | Q: How does this infrastructure deal with a driver that is not PCI | ||
| 241 | Express aware? | ||
| 242 | |||
| 243 | A: This infrastructure calls the error callback functions of the | ||
| 244 | driver when an error happens. But if the driver is not aware of | ||
| 245 | PCI Express, the device might not report its own errors to root | ||
| 246 | port. | ||
| 247 | |||
| 248 | Q: What modifications will that driver need to make it compatible | ||
| 249 | with the PCI Express AER Root driver? | ||
| 250 | |||
| 251 | A: It could call the helper functions to enable AER in devices and | ||
| 252 | clean up the uncorrectable status register. Please refer to section 3.3. | ||
| 253 | |||
diff --git a/Documentation/power/devices.txt b/Documentation/power/devices.txt index fba1e05c47c7..d0e79d5820a5 100644 --- a/Documentation/power/devices.txt +++ b/Documentation/power/devices.txt | |||
| @@ -1,208 +1,553 @@ | |||
| 1 | Most of the code in Linux is device drivers, so most of the Linux power | ||
| 2 | management code is also driver-specific. Most drivers will do very little; | ||
| 3 | others, especially for platforms with small batteries (like cell phones), | ||
| 4 | will do a lot. | ||
| 5 | |||
| 6 | This writeup gives an overview of how drivers interact with system-wide | ||
| 7 | power management goals, emphasizing the models and interfaces that are | ||
| 8 | shared by everything that hooks up to the driver model core. Read it as | ||
| 9 | background for the domain-specific work you'd do with any specific driver. | ||
| 10 | |||
| 11 | |||
| 12 | Two Models for Device Power Management | ||
| 13 | ====================================== | ||
| 14 | Drivers will use one or both of these models to put devices into low-power | ||
| 15 | states: | ||
| 16 | |||
| 17 | System Sleep model: | ||
| 18 | Drivers can enter low power states as part of entering system-wide | ||
| 19 | low-power states like "suspend-to-ram", or (mostly for systems with | ||
| 20 | disks) "hibernate" (suspend-to-disk). | ||
| 21 | |||
| 22 | This is something that device, bus, and class drivers collaborate on | ||
| 23 | by implementing various role-specific suspend and resume methods to | ||
| 24 | cleanly power down hardware and software subsystems, then reactivate | ||
| 25 | them without loss of data. | ||
| 26 | |||
| 27 | Some drivers can manage hardware wakeup events, which make the system | ||
| 28 | leave that low-power state. This feature may be disabled using the | ||
| 29 | relevant /sys/devices/.../power/wakeup file; enabling it may cost some | ||
| 30 | power usage, but let the whole system enter low power states more often. | ||
| 31 | |||
| 32 | Runtime Power Management model: | ||
| 33 | Drivers may also enter low power states while the system is running, | ||
| 34 | independently of other power management activity. Upstream drivers | ||
| 35 | will normally not know (or care) if the device is in some low power | ||
| 36 | state when issuing requests; the driver will auto-resume anything | ||
| 37 | that's needed when it gets a request. | ||
| 38 | |||
| 39 | This doesn't have, or need, much infrastructure; it's just something you | ||
| 40 | should do when writing your drivers. For example, clk_disable() unused | ||
| 41 | clocks as part of minimizing power drain for currently-unused hardware. | ||
| 42 | Of course, sometimes clusters of drivers will collaborate with each | ||
| 43 | other, which could involve task-specific power management. | ||
| 44 | |||
| 45 | There's not a lot to be said about those low power states except that they | ||
| 46 | are very system-specific, and often device-specific. Also, that if enough | ||
| 47 | drivers put themselves into low power states (at "runtime"), the effect may be | ||
| 48 | the same as entering some system-wide low-power state (system sleep) ... and | ||
| 49 | that synergies exist, so that several drivers using runtime pm might put the | ||
| 50 | system into a state where even deeper power saving options are available. | ||
| 51 | |||
| 52 | Most suspended devices will have quiesced all I/O: no more DMA or irqs, no | ||
| 53 | more data read or written, and requests from upstream drivers are no longer | ||
| 54 | accepted. A given bus or platform may have different requirements though. | ||
| 55 | |||
| 56 | Examples of hardware wakeup events include an alarm from a real time clock, | ||
| 57 | network wake-on-LAN packets, keyboard or mouse activity, and media insertion | ||
| 58 | or removal (for PCMCIA, MMC/SD, USB, and so on). | ||
| 59 | |||
| 60 | |||
| 61 | Interfaces for Entering System Sleep States | ||
| 62 | =========================================== | ||
| 63 | Most of the programming interfaces a device driver needs to know about | ||
| 64 | relate to that first model: entering a system-wide low power state, | ||
| 65 | rather than just minimizing power consumption by one device. | ||
| 66 | |||
| 67 | |||
| 68 | Bus Driver Methods | ||
| 69 | ------------------ | ||
| 70 | The core methods to suspend and resume devices reside in struct bus_type. | ||
| 71 | These are mostly of interest to people writing infrastructure for busses | ||
| 72 | like PCI or USB, or because they define the primitives that device drivers | ||
| 73 | may need to apply in domain-specific ways to their devices: | ||
| 1 | 74 | ||
| 2 | Device Power Management | 75 | struct bus_type { |
| 76 | ... | ||
| 77 | int (*suspend)(struct device *dev, pm_message_t state); | ||
| 78 | int (*suspend_late)(struct device *dev, pm_message_t state); | ||
| 3 | 79 | ||
| 80 | int (*resume_early)(struct device *dev); | ||
| 81 | int (*resume)(struct device *dev); | ||
| 82 | }; | ||
| 4 | 83 | ||
| 5 | Device power management encompasses two areas - the ability to save | 84 | Bus drivers implement those methods as appropriate for the hardware and |
| 6 | state and transition a device to a low-power state when the system is | 85 | the drivers using it; PCI works differently from USB, and so on. Not many |
| 7 | entering a low-power state; and the ability to transition a device to | 86 | people write bus drivers; most driver code is a "device driver" that |
| 8 | a low-power state while the system is running (and independently of | 87 | builds on top of bus-specific framework code. |
| 9 | any other power management activity). | 88 | |
| 89 | For more information on these driver calls, see the description later; | ||
| 90 | they are called in phases for every device, respecting the parent-child | ||
| 91 | sequencing in the driver model tree. Note that as this is being written, | ||
| 92 | only the suspend() and resume() are widely available; not many bus drivers | ||
| 93 | leverage all of those phases, or pass them down to lower driver levels. | ||
| 94 | |||
| 95 | |||
| 96 | /sys/devices/.../power/wakeup files | ||
| 97 | ----------------------------------- | ||
| 98 | All devices in the driver model have two flags to control handling of | ||
| 99 | wakeup events, which are hardware signals that can force the device and/or | ||
| 100 | system out of a low power state. These are initialized by bus or device | ||
| 101 | driver code using device_init_wakeup(dev,can_wakeup). | ||
| 102 | |||
| 103 | The "can_wakeup" flag just records whether the device (and its driver) can | ||
| 104 | physically support wakeup events. When that flag is clear, the sysfs | ||
| 105 | "wakeup" file is empty, and device_may_wakeup() returns false. | ||
| 106 | |||
| 107 | For devices that can issue wakeup events, a separate flag controls whether | ||
| 108 | that device should try to use its wakeup mechanism. The initial value of | ||
| 109 | device_may_wakeup() will be true, so that the device's "wakeup" file holds | ||
| 110 | the value "enabled". Userspace can change that to "disabled" so that | ||
| 111 | device_may_wakeup() returns false; or change it back to "enabled" (so that | ||
| 112 | it returns true again). | ||
| 113 | |||
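The two wakeup flags described above can be modelled in a few lines of userspace C. This is only a sketch of the behaviour stated in the text, not the kernel's implementation: the `model_` names, the struct layout, and the return codes are all assumptions made for illustration.

```c
#include <stdbool.h>
#include <string.h>

/* Userspace model only; these are NOT the kernel's definitions. */
struct model_dev_wakeup {
	bool can_wakeup;	/* hardware and driver can signal wakeup events */
	bool should_wakeup;	/* policy flag that userspace may toggle */
};

/* Models device_init_wakeup(dev, can_wakeup): a capable device starts enabled. */
static void model_init_wakeup(struct model_dev_wakeup *w, bool can_wakeup)
{
	w->can_wakeup = can_wakeup;
	w->should_wakeup = can_wakeup;
}

/* Models device_may_wakeup(): true only when capable AND enabled. */
static bool model_may_wakeup(const struct model_dev_wakeup *w)
{
	return w->can_wakeup && w->should_wakeup;
}

/* Models reading the sysfs "wakeup" file; empty when wakeup is unsupported. */
static const char *model_wakeup_show(const struct model_dev_wakeup *w)
{
	if (!w->can_wakeup)
		return "";
	return w->should_wakeup ? "enabled" : "disabled";
}

/* Models writing the sysfs "wakeup" file; returns 0, or -1 if rejected. */
static int model_wakeup_store(struct model_dev_wakeup *w, const char *val)
{
	if (!w->can_wakeup)
		return -1;
	if (strcmp(val, "enabled") == 0)
		w->should_wakeup = true;
	else if (strcmp(val, "disabled") == 0)
		w->should_wakeup = false;
	else
		return -1;
	return 0;
}
```

In this model a driver would consult model_may_wakeup() rather than the raw capability flag, so that a userspace write of "disabled" is honoured.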
| 114 | |||
| 115 | EXAMPLE: PCI Device Driver Methods | ||
| 116 | ----------------------------------- | ||
| 117 | PCI framework software calls these methods when the PCI device driver bound | ||
| 118 | to a device has provided them: | ||
| 119 | |||
| 120 | struct pci_driver { | ||
| 121 | ... | ||
| 122 | int (*suspend)(struct pci_dev *pdev, pm_message_t state); | ||
| 123 | int (*suspend_late)(struct pci_dev *pdev, pm_message_t state); | ||
| 124 | |||
| 125 | int (*resume_early)(struct pci_dev *pdev); | ||
| 126 | int (*resume)(struct pci_dev *pdev); | ||
| 127 | }; | ||
| 10 | 128 | ||
| 129 | Drivers will implement those methods, and call PCI-specific procedures | ||
| 130 | like pci_set_power_state(), pci_enable_wake(), pci_save_state(), and | ||
| 131 | pci_restore_state() to manage PCI-specific mechanisms. (PCI config space | ||
| 132 | could be saved during driver probe, if it weren't for the fact that some | ||
| 133 | systems rely on userspace tweaking using setpci.) Devices are suspended | ||
| 134 | before their bridges enter low power states, and likewise bridges resume | ||
| 135 | before their devices. | ||
| 136 | |||
| 137 | |||
| 138 | Upper Layers of Driver Stacks | ||
| 139 | ----------------------------- | ||
| 140 | Device drivers generally have at least two interfaces, and the methods | ||
| 141 | sketched above are the ones which apply to the lower level (nearer PCI, USB, | ||
| 142 | or other bus hardware). The network and block layers are examples of upper | ||
| 143 | level interfaces, as is a character device talking to userspace. | ||
| 144 | |||
| 145 | Power management requests normally need to flow through those upper levels, | ||
| 146 | which often use domain-oriented requests like "blank that screen". In | ||
| 147 | some cases those upper levels will have power management intelligence that | ||
| 148 | relates to end-user activity, or other devices that work in cooperation. | ||
| 149 | |||
| 150 | When those interfaces are structured using class interfaces, there is a | ||
| 151 | standard way to have the upper layer stop issuing requests to a given | ||
| 152 | class device (and restart later): | ||
| 153 | |||
| 154 | struct class { | ||
| 155 | ... | ||
| 156 | int (*suspend)(struct device *dev, pm_message_t state); | ||
| 157 | int (*resume)(struct device *dev); | ||
| 158 | }; | ||
| 11 | 159 | ||
| 12 | Methods | 160 | Those calls are issued in specific phases of the process by which the |
| 161 | system enters a low power "suspend" state, or resumes from it. | ||
| 162 | |||
| 163 | |||
| 164 | Calling Drivers to Enter System Sleep States | ||
| 165 | ============================================ | ||
| 166 | When the system enters a low power state, each device's driver is asked | ||
| 167 | to suspend the device by putting it into a state compatible with the target | ||
| 168 | system state. That's usually some version of "off", but the details are | ||
| 169 | system-specific. Also, wakeup-enabled devices will usually stay partly | ||
| 170 | functional in order to wake the system. | ||
| 171 | |||
| 172 | When the system leaves that low power state, the device's driver is asked | ||
| 173 | to resume it. The suspend and resume operations always go together, and | ||
| 174 | both are multi-phase operations. | ||
| 175 | |||
| 176 | For simple drivers, suspend might quiesce the device using the class code | ||
| 177 | and then turn its hardware as "off" as possible with suspend_late. The | ||
| 178 | matching resume calls would then completely reinitialize the hardware | ||
| 179 | before reactivating its class I/O queues. | ||
| 180 | |||
| 181 | More power-aware drivers will use more than one device low power | ||
| 182 | state, either at runtime or during system sleep states, and might trigger | ||
| 183 | system wakeup events. | ||
| 184 | |||
| 185 | |||
| 186 | Call Sequence Guarantees | ||
| 187 | ------------------------ | ||
| 188 | To ensure that bridges and similar links needed to talk to a device are | ||
| 189 | available when the device is suspended or resumed, the device tree is | ||
| 190 | walked in a bottom-up order to suspend devices. A top-down order is | ||
| 191 | used to resume those devices. | ||
| 192 | |||
| 193 | The ordering of the device tree is defined by the order in which devices | ||
| 194 | get registered: a child can never be registered, probed or resumed before | ||
| 195 | its parent; and can't be removed or suspended after that parent. | ||
| 196 | |||
| 197 | The policy is that the device tree should match hardware bus topology. | ||
| 198 | (Or at least the control bus, for devices which use multiple busses.) | ||
| 199 | |||
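The ordering guarantee above can be sketched with a plain registration list. This is a simplified userspace model under stated assumptions: the `model_` names and the fixed-size array are hypothetical, and the real driver core keeps its own list structures. The only property it illustrates is that suspend walks the registration order backwards (children first) and resume walks it forwards (parents first).

```c
#include <stddef.h>

#define MODEL_MAX_DEVS 8

/* Hypothetical registration list; a parent is always registered first. */
struct model_dev_list {
	const char *name[MODEL_MAX_DEVS];
	size_t count;
};

/* Append a device to the back of the list; returns 0, or -1 when full. */
static int model_register(struct model_dev_list *l, const char *name)
{
	if (l->count >= MODEL_MAX_DEVS)
		return -1;
	l->name[l->count++] = name;
	return 0;
}

/* Fill out[] with the suspend order: reverse of registration order. */
static size_t model_suspend_order(const struct model_dev_list *l,
				  const char **out)
{
	size_t i;

	for (i = 0; i < l->count; i++)
		out[i] = l->name[l->count - 1 - i];
	return l->count;
}

/* Fill out[] with the resume order: same as registration order. */
static size_t model_resume_order(const struct model_dev_list *l,
				 const char **out)
{
	size_t i;

	for (i = 0; i < l->count; i++)
		out[i] = l->name[i];
	return l->count;
}
```

With a bridge, a host controller behind it, and a disk behind that registered in that order, the disk is suspended first and resumed last, so the links needed to reach it are still powered whenever its driver runs.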
| 200 | |||
| 201 | Suspending Devices | ||
| 202 | ------------------ | ||
| 203 | Suspending a given device is done in several phases. Suspending the | ||
| 204 | system always includes every phase, executing calls for every device | ||
| 205 | before the next phase begins. Not all busses or classes support all | ||
| 206 | these callbacks; and not all drivers use all the callbacks. | ||
| 207 | |||
| 208 | The phases are seen by driver notifications issued in this order: | ||
| 209 | |||
| 210 | 1 class.suspend(dev, message) is called after tasks are frozen, for | ||
| 211 | devices associated with a class that has such a method. This | ||
| 212 | method may sleep. | ||
| 213 | |||
| 214 | Since I/O activity usually comes from such higher layers, this is | ||
| 215 | a good place to quiesce all drivers of a given type (and keep such | ||
| 216 | code out of those drivers). | ||
| 217 | |||
| 218 | 2 bus.suspend(dev, message) is called next. This method may sleep, | ||
| 219 | and is often morphed into a device driver call with bus-specific | ||
| 220 | parameters and/or rules. | ||
| 221 | |||
| 222 | This call should handle parts of device suspend logic that require | ||
| 223 | sleeping. It probably does work to quiesce the device which hasn't | ||
| 224 | been abstracted into class.suspend() or bus.suspend_late(). | ||
| 225 | |||
| 226 | 3 bus.suspend_late(dev, message) is called with IRQs disabled, and | ||
| 227 | with only one CPU active. Until the bus.resume_early() phase | ||
| 228 | completes (see later), IRQs are not enabled again. This method | ||
| 229 | won't be exposed by all busses; for message based busses like USB, | ||
| 230 | I2C, or SPI, device interactions normally require IRQs. This bus | ||
| 231 | call may be morphed into a driver call with bus-specific parameters. | ||
| 232 | |||
| 233 | This call might save low level hardware state that might otherwise | ||
| 234 | be lost in the upcoming low power state, and actually put the | ||
| 235 | device into a low power state ... so that in some cases the device | ||
| 236 | may stay partly usable until this late. This "late" call may also | ||
| 237 | help when coping with hardware that behaves badly. | ||
| 238 | |||
| 239 | The pm_message_t parameter is currently used to refine those semantics | ||
| 240 | (described later). | ||
| 241 | |||
| 242 | At the end of those phases, drivers should normally have stopped all I/O | ||
| 243 | transactions (DMA, IRQs), saved enough state that they can re-initialize | ||
| 244 | or restore previous state (as needed by the hardware), and placed the | ||
| 245 | device into a low-power state. On many platforms they will also use | ||
| 246 | clk_disable() to gate off one or more clock sources; sometimes they will | ||
| 247 | also switch off power supplies, or reduce voltages. Drivers which have | ||
| 248 | runtime PM support may already have performed some or all of the steps | ||
| 249 | needed to prepare for the upcoming system sleep state. | ||
| 250 | |||
| 251 | When any driver sees that its device_can_wakeup(dev), it should make sure | ||
| 252 | to use the relevant hardware signals to trigger a system wakeup event. | ||
| 253 | For example, enable_irq_wake() might identify GPIO signals hooked up to | ||
| 254 | a switch or other external hardware, and pci_enable_wake() does something | ||
| 255 | similar for PCI's PME# signal. | ||
| 256 | |||
| 257 | If a driver (or bus, or class) fails its suspend method, the system won't | ||
| 258 | enter the desired low power state; it will resume all the devices it's | ||
| 259 | suspended so far. | ||
| 260 | |||
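The failure rule above (a failed suspend method causes every device suspended so far to be resumed) can be sketched as follows. This is a single-phase userspace model, not the PM core's code: `struct model_dev`, its fields, and the helper names are assumptions made for illustration, and the real sequence runs several phases per device.

```c
#include <stddef.h>

/* Hypothetical device record for this model. */
struct model_dev {
	const char *name;
	int fail_suspend;	/* nonzero: suspend reports this negative errno */
	int suspended;		/* 1 while the device is suspended */
};

static int model_suspend_one(struct model_dev *d)
{
	if (d->fail_suspend)
		return d->fail_suspend;
	d->suspended = 1;
	return 0;
}

static void model_resume_one(struct model_dev *d)
{
	d->suspended = 0;
}

/*
 * devs[] is in registration order (parents first).  Suspend from the
 * back of the array; on the first failure, resume everything that was
 * already suspended, parents before children, and report the error.
 */
static int model_suspend_all(struct model_dev *devs, size_t n)
{
	size_t i, j;
	int err;

	for (i = n; i-- > 0; ) {
		err = model_suspend_one(&devs[i]);
		if (err) {
			for (j = i + 1; j < n; j++)
				model_resume_one(&devs[j]);
			return err;
		}
	}
	return 0;
}
```

The caller sees either a fully suspended set of devices or, on error, the same running state it started with.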
| 261 | Note that drivers may need to perform different actions based on the target | ||
| 262 | system lowpower/sleep state. At this writing, there are only platform | ||
| 263 | specific APIs through which drivers could determine those target states. | ||
| 264 | |||
| 265 | |||
| 266 | Device Low Power (suspend) States | ||
| 267 | --------------------------------- | ||
| 268 | Device low-power states aren't very standard. One device might only handle | ||
| 269 | "on" and "off", while another might support a dozen different versions of | ||
| 270 | "on" (how many engines are active?), plus a state that gets back to "on" | ||
| 271 | faster than from a full "off". | ||
| 272 | |||
| 273 | Some busses define rules about what different suspend states mean. PCI | ||
| 274 | gives one example: after the suspend sequence completes, a non-legacy | ||
| 275 | PCI device may not perform DMA or issue IRQs, and any wakeup events it | ||
| 276 | issues would be issued through the PME# bus signal. Plus, there are | ||
| 277 | several PCI-standard device states, some of which are optional. | ||
| 278 | |||
| 279 | In contrast, integrated system-on-chip processors often use irqs as the | ||
| 280 | wakeup event sources (so drivers would call enable_irq_wake) and might | ||
| 281 | be able to treat DMA completion as a wakeup event (sometimes DMA can stay | ||
| 282 | active too; it'd only be the CPU and some peripherals that sleep). | ||
| 283 | |||
| 284 | Some details here may be platform-specific. Systems may have devices that | ||
| 285 | can be fully active in certain sleep states, such as an LCD display that's | ||
| 286 | refreshed using DMA while most of the system is sleeping lightly ... and | ||
| 287 | its frame buffer might even be updated by a DSP or other non-Linux CPU while | ||
| 288 | the Linux control processor stays idle. | ||
| 289 | |||
| 290 | Moreover, the specific actions taken may depend on the target system state. | ||
| 291 | One target system state might allow a given device to be very operational; | ||
| 292 | another might require a hard shut down with re-initialization on resume. | ||
| 293 | And two different target systems might use the same device in different | ||
| 294 | ways; the aforementioned LCD might be active in one product's "standby", | ||
| 295 | but a different product using the same SOC might work differently. | ||
| 296 | |||
| 297 | |||
| 298 | Meaning of pm_message_t.event | ||
| 299 | ----------------------------- | ||
| 300 | Parameters to suspend calls include the device affected and a message of | ||
| 301 | type pm_message_t, which has one field: the event. If a driver does not | ||
| 302 | recognize the event code, suspend calls may abort the request and return | ||
| 303 | a negative errno. However, most drivers will be fine if they implement | ||
| 304 | PM_EVENT_SUSPEND semantics for all messages. | ||
| 305 | |||
| 306 | The event codes are used to refine the goal of suspending the device, and | ||
| 307 | mostly matter when creating or resuming system memory image snapshots, as | ||
| 308 | used with suspend-to-disk: | ||
| 309 | |||
| 310 | PM_EVENT_SUSPEND -- quiesce the driver and put hardware into a low-power | ||
| 311 | state. When used with system sleep states like "suspend-to-RAM" or | ||
| 312 | "standby", the upcoming resume() call will often be able to rely on | ||
| 313 | state kept in hardware, or issue system wakeup events. When used | ||
| 314 | instead with suspend-to-disk, few devices support this capability; | ||
| 315 | most are completely powered off. | ||
| 316 | |||
| 317 | PM_EVENT_FREEZE -- quiesce the driver, but don't necessarily change into | ||
| 318 | any low power mode. A system snapshot is about to be taken, often | ||
| 319 | followed by a call to the driver's resume() method. Neither wakeup | ||
| 320 | events nor DMA are allowed. | ||
| 321 | |||
| 322 | PM_EVENT_PRETHAW -- quiesce the driver, knowing that the upcoming resume() | ||
| 323 | will restore a suspend-to-disk snapshot from a different kernel image. | ||
| 324 | Drivers that are smart enough to look at their hardware state during | ||
| 325 | resume() processing need that state to be correct ... a PRETHAW could | ||
| 326 | be used to invalidate that state (by resetting the device), like a | ||
| 327 | shutdown() invocation would before a kexec() or system halt. Other | ||
| 328 | drivers might handle this the same way as PM_EVENT_FREEZE. Neither | ||
| 329 | wakeup events nor DMA are allowed. | ||
| 330 | |||
| 331 | To enter "standby" (ACPI S1) or "Suspend to RAM" (STR, ACPI S3) states, or | ||
| 332 | the similarly named APM states, only PM_EVENT_SUSPEND is used; for "Suspend | ||
| 333 | to Disk" (STD, hibernate, ACPI S4), all of those event codes are used. | ||
| 334 | |||
| 335 | There's also PM_EVENT_ON, a value which never appears as a suspend event | ||
| 336 | but is sometimes used to record the "not suspended" device state. | ||
| 337 | |||
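A driver's handling of the event codes described above might be sketched like this. It is a userspace model, not kernel code: the `MODEL_` enum only mirrors the names in the text (its numeric values are arbitrary here), and `struct model_hw` with its fields is a hypothetical stand-in for driver-private state.

```c
#include <errno.h>

/* Local stand-ins for the event codes named in the text. */
enum model_pm_event {
	MODEL_PM_EVENT_ON,
	MODEL_PM_EVENT_FREEZE,
	MODEL_PM_EVENT_SUSPEND,
	MODEL_PM_EVENT_PRETHAW,
};

struct model_pm_message {
	int event;
};

/* Hypothetical driver-private view of the hardware. */
struct model_hw {
	int quiesced;		/* no more I/O, DMA, or IRQs */
	int low_power;		/* device placed in a low power state */
	int state_valid;	/* resume() may trust the hardware state */
};

static int model_driver_suspend(struct model_hw *hw,
				struct model_pm_message msg)
{
	switch (msg.event) {
	case MODEL_PM_EVENT_FREEZE:
		/* Snapshot is about to be taken: quiesce only. */
		hw->quiesced = 1;
		return 0;
	case MODEL_PM_EVENT_PRETHAW:
		/* Another kernel image resumes next: invalidate state. */
		hw->quiesced = 1;
		hw->state_valid = 0;
		return 0;
	case MODEL_PM_EVENT_SUSPEND:
		/* Quiesce and enter a low power state. */
		hw->quiesced = 1;
		hw->low_power = 1;
		return 0;
	default:
		/* Unrecognized event (including ON): abort the request. */
		return -EINVAL;
	}
}
```

A simpler driver could instead route every recognized event to the SUSPEND case, which matches the text's remark that most drivers are fine implementing PM_EVENT_SUSPEND semantics for all messages.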
| 338 | |||
| 339 | Resuming Devices | ||
| 340 | ---------------- | ||
| 341 | Resuming is done in multiple phases, much like suspending, with all | ||
| 342 | devices processing each phase's calls before the next phase begins. | ||
| 343 | |||
| 344 | The phases are seen by driver notifications issued in this order: | ||
| 345 | |||
| 346 | 1 bus.resume_early(dev) is called with IRQs disabled, and with | ||
| 347 | only one CPU active. As with bus.suspend_late(), this method | ||
| 348 | won't be supported on busses that require IRQs in order to | ||
| 349 | interact with devices. | ||
| 350 | |||
| 351 | This reverses the effects of bus.suspend_late(). | ||
| 352 | |||
| 353 | 2 bus.resume(dev) is called next. This may be morphed into a device | ||
| 354 | driver call with bus-specific parameters; implementations may sleep. | ||
| 355 | |||
| 356 | This reverses the effects of bus.suspend(). | ||
| 357 | |||
| 358 | 3 class.resume(dev) is called for devices associated with a class | ||
| 359 | that has such a method. Implementations may sleep. | ||
| 360 | |||
| 361 | This reverses the effects of class.suspend(), and would usually | ||
| 362 | reactivate the device's I/O queue. | ||
| 363 | |||
| 364 | At the end of those phases, drivers should normally be as functional as | ||
| 365 | they were before suspending: I/O can be performed using DMA and IRQs, and | ||
| 366 | the relevant clocks are gated on. The device need not be "fully on"; it | ||
| 367 | might be in a runtime lowpower/suspend state that acts as if it were. | ||
| 368 | |||
| 369 | However, the details here may again be platform-specific. For example, | ||
| 370 | some systems support multiple "run" states, and the mode in effect at | ||
| 371 | the end of resume() might not be the one which preceded suspension. | ||
| 372 | That means availability of certain clocks or power supplies changed, | ||
| 373 | which could easily affect how a driver works. | ||
| 374 | |||
| 375 | |||
| 376 | Drivers need to be able to handle hardware which has been reset since the | ||
| 377 | suspend methods were called, for example by complete reinitialization. | ||
| 378 | This may be the hardest part, and the one most protected by NDA'd documents | ||
| 379 | and chip errata. It's simplest if the hardware state hasn't changed since | ||
| 380 | the suspend() was called, but that can't always be guaranteed. | ||
| 381 | |||
| 382 | Drivers must also be prepared to notice that the device has been removed | ||
| 383 | while the system was powered off, whenever that's physically possible. | ||
| 384 | PCMCIA, MMC, USB, Firewire, SCSI, and even IDE are common examples of busses | ||
| 385 | where common Linux platforms will see such removal. Details of how drivers | ||
| 386 | will notice and handle such removals are currently bus-specific, and often | ||
| 387 | involve a separate thread. | ||
| 13 | 388 | ||
| 14 | The methods to suspend and resume devices reside in struct bus_type: | ||
| 15 | 389 | ||
| 16 | struct bus_type { | 390 | Note that the bus-specific runtime PM wakeup mechanism can exist, and might |
| 17 | ... | 391 | be defined to share some of the same driver code as for system wakeup. For |
| 18 | int (*suspend)(struct device * dev, pm_message_t state); | 392 | example, a bus-specific device driver's resume() method might be used there, |
| 19 | int (*resume)(struct device * dev); | 393 | so it wouldn't only be called from bus.resume() during system-wide wakeup. |
| 20 | }; | 394 | See bus-specific information about how runtime wakeup events are handled. |
| 21 | 395 | ||
| 22 | Each bus driver is responsible implementing these methods, translating | ||
| 23 | the call into a bus-specific request and forwarding the call to the | ||
| 24 | bus-specific drivers. For example, PCI drivers implement suspend() and | ||
| 25 | resume() methods in struct pci_driver. The PCI core is simply | ||
| 26 | responsible for translating the pointers to PCI-specific ones and | ||
| 27 | calling the low-level driver. | ||
| 28 | |||
| 29 | This is done to a) ease transition to the new power management methods | ||
| 30 | and leverage the existing PM code in various bus drivers; b) allow | ||
| 31 | buses to implement generic and default PM routines for devices, and c) | ||
| 32 | make the flow of execution obvious to the reader. | ||
| 33 | |||
| 34 | |||
| 35 | System Power Management | ||
| 36 | |||
| 37 | When the system enters a low-power state, the device tree is walked in | ||
| 38 | a depth-first fashion to transition each device into a low-power | ||
| 39 | state. The ordering of the device tree is guaranteed by the order in | ||
| 40 | which devices get registered - children are never registered before | ||
| 41 | their ancestors, and devices are placed at the back of the list when | ||
| 42 | registered. By walking the list in reverse order, we are guaranteed to | ||
| 43 | suspend devices in the proper order. | ||
| 44 | |||
| 45 | Devices are suspended once with interrupts enabled. Drivers are | ||
| 46 | expected to stop I/O transactions, save device state, and place the | ||
| 47 | device into a low-power state. Drivers may sleep, allocate memory, | ||
| 48 | etc. at will. | ||
| 49 | |||
| 50 | Some devices are broken and will inevitably have problems powering | ||
| 51 | down or disabling themselves with interrupts enabled. For these | ||
| 52 | special cases, they may return -EAGAIN. This will put the device on a | ||
| 53 | list to be taken care of later. When interrupts are disabled, before | ||
| 54 | we enter the low-power state, their drivers are called again to put | ||
| 55 | their device to sleep. | ||
| 56 | |||
| 57 | On resume, the devices that returned -EAGAIN will be called to power | ||
| 58 | themselves back on with interrupts disabled. Once interrupts have been | ||
| 59 | re-enabled, the rest of the drivers will be called to resume their | ||
| 60 | devices. On resume, a driver is responsible for powering back on each | ||
| 61 | device, restoring state, and re-enabling I/O transactions for that | ||
| 62 | device. | ||
| 63 | 396 | ||
| 397 | System Devices | ||
| 398 | -------------- | ||
| 64 | System devices follow a slightly different API, which can be found in | 399 | System devices follow a slightly different API, which can be found in |
| 65 | 400 | ||
| 66 | include/linux/sysdev.h | 401 | include/linux/sysdev.h |
| 67 | drivers/base/sys.c | 402 | drivers/base/sys.c |
| 68 | 403 | ||
| 69 | System devices will only be suspended with interrupts disabled, and | 404 | System devices will only be suspended with interrupts disabled, and after |
| 70 | after all other devices have been suspended. On resume, they will be | 405 | all other devices have been suspended. On resume, they will be resumed |
| 71 | resumed before any other devices, and also with interrupts disabled. | 406 | before any other devices, and also with interrupts disabled. |
| 72 | 407 | ||
| 408 | That is, IRQs are disabled, the suspend_late() phase begins, then the | ||
| 409 | sysdev_driver.suspend() phase, and the system enters a sleep state. Then | ||
| 410 | the sysdev_driver.resume() phase begins, followed by the resume_early() | ||
| 411 | phase, after which IRQs are enabled. | ||
| 73 | 412 | ||
| 74 | Runtime Power Management | 413 | Code to actually enter and exit the system-wide low power state sometimes |
| 75 | 414 | involves hardware details that are only known to the boot firmware, and | |
| 76 | Many devices are able to dynamically power down while the system is | 415 | may leave a CPU running software (from SRAM or flash memory) that monitors |
| 77 | still running. This feature is useful for devices that are not being | 416 | the system and manages its wakeup sequence. |
| 78 | used, and can offer significant power savings on a running system. | ||
| 79 | |||
| 80 | In each device's directory, there is a 'power' directory, which | ||
| 81 | contains at least a 'state' file. Reading from this file displays what | ||
| 82 | power state the device is currently in. Writing to this file initiates | ||
| 83 | a transition to the specified power state, which must be a decimal in | ||
| 84 | the range 1-3, inclusive; or 0 for 'On'. | ||
| 85 | 417 | ||
| 86 | The PM core will call the ->suspend() method in the bus_type object | ||
| 87 | that the device belongs to if the specified state is not 0, or | ||
| 88 | ->resume() if it is. | ||
| 89 | 418 | ||
| 90 | Nothing will happen if the specified state is the same state the | 419 | Runtime Power Management |
| 91 | device is currently in. | 420 | ======================== |
| 92 | 421 | Many devices are able to dynamically power down while the system is still | |
| 93 | If the device is already in a low-power state, and the specified state | 422 | running. This feature is useful for devices that are not being used, and |
| 94 | is another, but different, low-power state, the ->resume() method will | 423 | can offer significant power savings on a running system. These devices |
| 95 | first be called to power the device back on, then ->suspend() will be | 424 | often support a range of runtime power states, which might use names such |
| 96 | called again with the new state. | 425 | as "off", "sleep", "idle", "active", and so on. Those states will in some |
| 97 | 426 | cases (like PCI) be partially constrained by a bus the device uses, and will | |
| 98 | The driver is responsible for saving the working state of the device | 427 | usually include hardware states that are also used in system sleep states. |
| 99 | and putting it into the low-power state specified. If this was | 428 | |
| 100 | successful, it returns 0, and the device's power_state field is | 429 | However, note that if a driver puts a device into a runtime low power state |
| 101 | updated. | 430 | and the system then goes into a system-wide sleep state, it normally ought |
| 102 | 431 | to resume into that runtime low power state rather than "full on". Such | |
| 103 | The driver must take care to know whether or not it is able to | 432 | distinctions would be part of the driver-internal state machine for that |
| 104 | properly resume the device, including all step of reinitialization | 433 | hardware; the whole point of runtime power management is to be sure that |
| 105 | necessary. (This is the hardest part, and the one most protected by | 434 | drivers are decoupled in that way from the state machine governing phases |
| 106 | NDA'd documents). | 435 | of the system-wide power/sleep state transitions. |
| 107 | 436 | ||
| 108 | The driver must also take care not to suspend a device that is | 437 | |
| 109 | currently in use. It is their responsibility to provide their own | 438 | Power Saving Techniques |
| 110 | exclusion mechanisms. | 439 | ----------------------- |
| 111 | 440 | Normally runtime power management is handled by the drivers without specific | |
| 112 | The runtime power transition happens with interrupts enabled. If a | 441 | userspace or kernel intervention, by device-aware use of techniques like: |
| 113 | device cannot support being powered down with interrupts, it may | 442 | |
| 114 | return -EAGAIN (as it would during a system power management | 443 | Using information provided by other system layers |
| 115 | transition), but it will _not_ be called again, and the transaction | 444 | - stay deeply "off" except between open() and close() |
| 116 | will fail. | 445 | - if transceiver/PHY indicates "nobody connected", stay "off" |
| 117 | 446 | - application protocols may include power commands or hints | |
| 118 | There is currently no way to know what states a device or driver | 447 | |
| 119 | supports a priori. This will change in the future. | 448 | Using fewer CPU cycles |
| 120 | 449 | - using DMA instead of PIO | |
| 121 | pm_message_t meaning | 450 | - removing timers, or making them lower frequency |
| 122 | 451 | - shortening "hot" code paths | |
| 123 | pm_message_t has two fields. event ("major"), and flags. If driver | 452 | - eliminating cache misses |
| 124 | does not know event code, it aborts the request, returning error. Some | 453 | - (sometimes) offloading work to device firmware |
| 125 | drivers may need to deal with special cases based on the actual type | 454 | |
| 126 | of suspend operation being done at the system level. This is why | 455 | Reducing other resource costs |
| 127 | there are flags. | 456 | - gating off unused clocks in software (or hardware) |
| 128 | 457 | - switching off unused power supplies | |
| 129 | Event codes are: | 458 | - eliminating (or delaying/merging) IRQs |
| 130 | 459 | - tuning DMA to use word and/or burst modes | |
| 131 | ON -- no need to do anything except special cases like broken | 460 | |
| 132 | HW. | 461 | Using device-specific low power states |
| 133 | 462 | - using lower voltages | |
| 134 | # NOTIFICATION -- pretty much same as ON? | 463 | - avoiding needless DMA transfers |
| 135 | 464 | ||
| 136 | FREEZE -- stop DMA and interrupts, and be prepared to reinit HW from | 465 | Read your hardware documentation carefully to see the opportunities that |
| 137 | scratch. That probably means stop accepting upstream requests, the | 466 | may be available. If you can, measure the actual power usage and check |
| 138 | actual policy of what to do with them being specific to a given | 467 | it against the budget established for your project. |
| 139 | driver. It's acceptable for a network driver to just drop packets | 468 | |
| 140 | while a block driver is expected to block the queue so no request is | 469 | |
| 141 | lost. (Use IDE as an example on how to do that). FREEZE requires no | 470 | Examples: USB hosts, system timer, system CPU |
| 142 | power state change, and it's expected for drivers to be able to | 471 | ---------------------------------------------- |
| 143 | quickly transition back to operating state. | 472 | USB host controllers make interesting, if complex, examples. In many cases |
| 144 | 473 | these have no work to do: no USB devices are connected, or all of them are | |
| 145 | SUSPEND -- like FREEZE, but also put hardware into low-power state. If | 474 | in the USB "suspend" state. Linux host controller drivers can then disable |
| 146 | there's need to distinguish several levels of sleep, additional flag | 475 | periodic DMA transfers that would otherwise be a constant power drain on the |
| 147 | is probably best way to do that. | 476 | memory subsystem, and enter a suspend state. In power-aware controllers, |
| 148 | 477 | entering that suspend state may disable the clock used with USB signaling, | |
| 149 | Transitions are only from a resumed state to a suspended state, never | 478 | saving a certain amount of power. |
| 150 | between 2 suspended states. (ON -> FREEZE or ON -> SUSPEND can happen, | 479 | |
| 151 | FREEZE -> SUSPEND or SUSPEND -> FREEZE can not). | 480 | The controller will be woken from that state (with an IRQ) by changes to the |
| 152 | 481 | signal state on the data lines of a given port, for example by an existing | |
| 153 | All events are: | 482 | peripheral requesting "remote wakeup" or by plugging a new peripheral. The |
| 154 | 483 | same wakeup mechanism usually works from "standby" sleep states, and on some | |
| 155 | [NOTE NOTE NOTE: If you are driver author, you should not care; you | 484 | systems also from "suspend to RAM" (or even "suspend to disk") states. |
| 156 | should only look at event, and ignore flags.] | 485 | (Except that ACPI may be involved instead of normal IRQs, on some hardware.) |
| 157 | 486 | ||
| 158 | #Prepare for suspend -- userland is still running but we are going to | 487 | System devices like timers and CPUs may have special roles in the platform |
| 159 | #enter suspend state. This gives drivers chance to load firmware from | 488 | power management scheme. For example, system timers using a "dynamic tick" |
| 160 | #disk and store it in memory, or do other activities that require | 489 | approach don't just save CPU cycles (by eliminating needless timer IRQs), |
| 161 | #operating userland, ability to kmalloc GFP_KERNEL, etc... All of these | 490 | but they may also open the door to using lower power CPU "idle" states that |
| 162 | #are forbidden once the suspend dance is started.. event = ON, flags = | 491 | cost more than a jiffy to enter and exit. On x86 systems these are states |
| 163 | #PREPARE_TO_SUSPEND | 492 | like "C3"; note that periodic DMA transfers from a USB host controller will |
| 164 | 493 | also prevent entry to a C3 state, much like a periodic timer IRQ. | |
| 165 | Apm standby -- prepare for APM event. Quiesce devices to make life | 494 | |
| 166 | easier for APM BIOS. event = FREEZE, flags = APM_STANDBY | 495 | That kind of runtime mechanism interaction is common. "System On Chip" (SOC) |
| 167 | 496 | processors often have low power idle modes that can't be entered unless | |
| 168 | Apm suspend -- same as APM_STANDBY, but we should probably avoid | 497 | certain medium-speed clocks (often 12 or 48 MHz) are gated off. When the |
| 169 | spinning down disks. event = FREEZE, flags = APM_SUSPEND | 498 | drivers gate those clocks effectively, then the system idle task may be able |
| 170 | 499 | to use the lower power idle modes and thereby increase battery life. | |
| 171 | System halt, reboot -- quiesce devices to make life easier for BIOS. event | 500 | |
| 172 | = FREEZE, flags = SYSTEM_HALT or SYSTEM_REBOOT | 501 | If the CPU can have a "cpufreq" driver, there also may be opportunities |
| 173 | 502 | to shift to lower voltage settings and reduce the power cost of executing | |
| 174 | System shutdown -- at least disks need to be spun down, or data may be | 503 | a given number of instructions. (Without voltage adjustment, it's rare |
| 175 | lost. Quiesce devices, just to make life easier for BIOS. event = | 504 | for cpufreq to save much power; the cost-per-instruction must go down.) |
| 176 | FREEZE, flags = SYSTEM_SHUTDOWN | 505 | |
| 177 | 506 | ||
| 178 | Kexec -- turn off DMAs and put hardware into some state where new | 507 | /sys/devices/.../power/state files |
| 179 | kernel can take over. event = FREEZE, flags = KEXEC | 508 | ================================== |
| 180 | 509 | For now you can also test some of this functionality using sysfs. | |
| 181 | Powerdown at end of swsusp -- very similar to SYSTEM_SHUTDOWN, except wake | 510 | |
| 182 | may need to be enabled on some devices. This actually has at least 3 | 511 | DEPRECATED: USE "power/state" ONLY FOR DRIVER TESTING, AND |
| 183 | subtypes, system can reboot, enter S4 and enter S5 at the end of | 512 | AVOID USING dev->power.power_state IN DRIVERS. |
| 184 | swsusp. event = FREEZE, flags = SWSUSP and one of SYSTEM_REBOOT, | 513 | |
| 185 | SYSTEM_SHUTDOWN, SYSTEM_S4 | 514 | THESE WILL BE REMOVED. IF THE "power/state" FILE GETS REPLACED, |
| 186 | 515 | IT WILL BECOME SOMETHING COUPLED TO THE BUS OR DRIVER. | |
| 187 | Suspend to ram -- put devices into low power state. event = SUSPEND, | 516 | |
| 188 | flags = SUSPEND_TO_RAM | 517 | In each device's directory, there is a 'power' directory, which contains |
| 189 | 518 | at least a 'state' file. The value of this field is effectively boolean, | |
| 190 | Freeze for swsusp snapshot -- stop DMA and interrupts. No need to put | 519 | PM_EVENT_ON or PM_EVENT_SUSPEND. |
| 191 | devices into low power mode, but you must be able to reinitialize | 520 | |
| 192 | device from scratch in resume method. This has two flavors, its done | 521 | * Reading from this file displays a value corresponding to |
| 193 | once on suspending kernel, once on resuming kernel. event = FREEZE, | 522 | the power.power_state.event field. All nonzero values are |
| 194 | flags = DURING_SUSPEND or DURING_RESUME | 523 | displayed as "2", corresponding to a low power state; zero |
| 195 | 524 | is displayed as "0", corresponding to normal operation. | |
| 196 | Device detach requested from /sys -- deinitialize device; probably same as | 525 | |
| 197 | SYSTEM_SHUTDOWN, I do not understand this one too much. probably event | 526 | * Writing to this file initiates a transition using the |
| 198 | = FREEZE, flags = DEV_DETACH. | 527 | specified event code number; only '0', '2', and '3' are |
| 199 | 528 | accepted (without a newline); '2' and '3' are both | |
| 200 | #These are not really events sent: | 529 | mapped to PM_EVENT_SUSPEND. |
| 201 | # | 530 | |
| 202 | #System fully on -- device is working normally; this is probably never | 531 | On writes, the PM core relies on that recorded event code and the device/bus |
| 203 | #passed to suspend() method... event = ON, flags = 0 | 532 | capabilities to determine whether it uses a partial suspend() or resume() |
| 204 | # | 533 | sequence to change things so that the recorded event corresponds to the |
| 205 | #Ready after resume -- userland is now running, again. Time to free any | 534 | numeric parameter. |
| 206 | #memory you ate during prepare to suspend... event = ON, flags = | 535 | |
| 207 | #READY_AFTER_RESUME | 536 | - If the bus requires the irqs-disabled suspend_late()/resume_early() |
| 208 | # | 537 | phases, writes fail because those operations are not supported here. |
| 538 | |||
| 539 | - If the recorded value is the expected value, nothing is done. | ||
| 540 | |||
| 541 | - If the recorded value is nonzero, the device is partially resumed, | ||
| 542 | using the bus.resume() and/or class.resume() methods. | ||
| 543 | |||
| 544 | - If the target value is nonzero, the device is partially suspended, | ||
| 545 | using the class.suspend() and/or bus.suspend() methods and the | ||
| 546 | PM_EVENT_SUSPEND message. | ||
| 547 | |||
| 548 | Drivers have no way to tell whether their suspend() and resume() calls | ||
| 549 | have come through the sysfs power/state file or as part of entering a | ||
| 550 | system sleep state, except that when accessed through sysfs the normal | ||
| 551 | parent/child sequencing rules are ignored. Drivers (such as bus, bridge, | ||
| 552 | or hub drivers) which expose child devices may need to enforce those rules | ||
| 553 | on their own. | ||
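
The write handling described for the power/state file above can be modelled in a few lines. This is a user-space sketch in Python, not kernel code: the function name `plan_state_write`, the `bus_needs_late_early` flag, the returned step names, and the numeric value chosen for PM_EVENT_SUSPEND are all assumptions made for illustration.

```python
PM_EVENT_ON = 0        # normal operation
PM_EVENT_SUSPEND = 2   # assumed value; the text only says nonzero reads back as "2"


def plan_state_write(text, recorded_event, bus_needs_late_early=False):
    """Return the steps a write of `text` to power/state would trigger.

    Hypothetical model of the rules listed in the documentation text.
    """
    # Only '0', '2' and '3' are accepted, and without a trailing newline.
    if text not in ("0", "2", "3"):
        raise ValueError("only '0', '2' and '3' are accepted, without a newline")
    # Buses that need suspend_late()/resume_early() cannot be driven this way.
    if bus_needs_late_early:
        raise OSError("suspend_late()/resume_early() are not supported here")
    # '2' and '3' are both mapped to PM_EVENT_SUSPEND.
    target = PM_EVENT_ON if text == "0" else PM_EVENT_SUSPEND
    if recorded_event == target:
        return []                      # already in the requested state
    steps = []
    if recorded_event != PM_EVENT_ON:
        steps.append("resume")         # bus.resume() and/or class.resume()
    if target != PM_EVENT_ON:
        steps.append("suspend")        # class.suspend() and/or bus.suspend()
    return steps
```

Under these assumptions, writing "3" to a device recorded as on yields a single suspend step, and writing "0" to a suspended device yields a single resume step.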
diff --git a/Documentation/power/interface.txt b/Documentation/power/interface.txt index 4117802af0f8..a66bec222b16 100644 --- a/Documentation/power/interface.txt +++ b/Documentation/power/interface.txt | |||
| @@ -52,3 +52,18 @@ suspend image will be as small as possible. | |||
| 52 | 52 | ||
| 53 | Reading from this file will display the current image size limit, which | 53 | Reading from this file will display the current image size limit, which |
| 54 | is set to 500 MB by default. | 54 | is set to 500 MB by default. |
| 55 | |||
| 56 | /sys/power/pm_trace controls the code which saves the last PM event point in | ||
| 57 | the RTC across reboots, so that you can debug a machine that just hangs | ||
| 58 | during suspend (or more commonly, during resume). Namely, the RTC is only | ||
| 59 | used to save the last PM event point if this file contains '1'. Initially it | ||
| 60 | contains '0' which may be changed to '1' by writing a string representing a | ||
| 61 | nonzero integer into it. | ||
| 62 | |||
| 63 | To use this debugging feature you should attempt to suspend the machine, then | ||
| 64 | reboot it and run | ||
| 65 | |||
| 66 | dmesg -s 1000000 | grep 'hash matches' | ||
| 67 | |||
| 68 | CAUTION: Using it will cause your machine's real-time (CMOS) clock to be | ||
| 69 | set to a random invalid time after a resume. | ||
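
The pm_trace description above has two parts: a write of a nonzero integer turns tracing on, and after the reboot the kernel log is searched for 'hash matches'. A rough Python sketch of both follows. The helper names are hypothetical, the treatment of a zero or unparsable write is an assumption (the text only covers the nonzero case), and the log lines used in the test are placeholders rather than real kernel output.

```python
def pm_trace_after_write(text, current="0"):
    """Model the value of /sys/power/pm_trace after writing `text`.

    Assumptions: writing zero turns tracing off, and text that is not
    an integer leaves the current value unchanged.
    """
    try:
        value = int(text)
    except ValueError:
        return current
    return "1" if value != 0 else "0"


def hash_match_lines(log_text):
    """Keep only the lines that grep 'hash matches' would print."""
    return [line for line in log_text.splitlines() if "hash matches" in line]
```

The second helper does the same job as the grep stage of the dmesg pipeline shown in the text; it is only useful when the log has already been captured to a string.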
diff --git a/Documentation/sh/new-machine.txt b/Documentation/sh/new-machine.txt index eb2dd2e6993b..73988e0d112b 100644 --- a/Documentation/sh/new-machine.txt +++ b/Documentation/sh/new-machine.txt | |||
| @@ -41,11 +41,6 @@ Board-specific code: | |||
| 41 | | | 41 | | |
| 42 | .. more boards here ... | 42 | .. more boards here ... |
| 43 | 43 | ||
| 44 | It should also be noted that each board is required to have some certain | ||
| 45 | headers. At the time of this writing, io.h is the only thing that needs | ||
| 46 | to be provided for each board, and can generally just reference generic | ||
| 47 | functions (with the exception of isa_port2addr). | ||
| 48 | |||
| 49 | Next, for companion chips: | 44 | Next, for companion chips: |
| 50 | . | 45 | . |
| 51 | `-- arch | 46 | `-- arch |
| @@ -104,12 +99,13 @@ and then populate that with sub-directories for each member of the family. | |||
| 104 | Both the Solution Engine and the hp6xx boards are an example of this. | 99 | Both the Solution Engine and the hp6xx boards are an example of this. |
| 105 | 100 | ||
| 106 | After you have setup your new arch/sh/boards/ directory, remember that you | 101 | After you have setup your new arch/sh/boards/ directory, remember that you |
| 107 | also must add a directory in include/asm-sh for headers localized to this | 102 | should also add a directory in include/asm-sh for headers localized to this |
| 108 | board. In order to interoperate seamlessly with the build system, it's best | 103 | board (if there are going to be more than one). In order to interoperate |
| 109 | to have this directory the same as the arch/sh/boards/ directory name, | 104 | seamlessly with the build system, it's best to have this directory the same |
| 110 | though if your board is again part of a family, the build system has ways | 105 | as the arch/sh/boards/ directory name, though if your board is again part of |
| 111 | of dealing with this, and you can feel free to name the directory after | 106 | a family, the build system has ways of dealing with this (via incdir-y |
| 112 | the family member itself. | 107 | overloading), and you can feel free to name the directory after the family |
| 108 | member itself. | ||
| 113 | 109 | ||
| 114 | There are a few things that each board is required to have, both in the | 110 | There are a few things that each board is required to have, both in the |
| 115 | arch/sh/boards and the include/asm-sh/ hierarchy. In order to better | 111 | arch/sh/boards and the include/asm-sh/ hierarchy. In order to better |
| @@ -122,6 +118,7 @@ might look something like: | |||
| 122 | * arch/sh/boards/vapor/setup.c - Setup code for imaginary board | 118 | * arch/sh/boards/vapor/setup.c - Setup code for imaginary board |
| 123 | */ | 119 | */ |
| 124 | #include <linux/init.h> | 120 | #include <linux/init.h> |
| 121 | #include <asm/rtc.h> /* for board_time_init() */ | ||
| 125 | 122 | ||
| 126 | const char *get_system_type(void) | 123 | const char *get_system_type(void) |
| 127 | { | 124 | { |
| @@ -152,79 +149,57 @@ int __init platform_setup(void) | |||
| 152 | } | 149 | } |
| 153 | 150 | ||
| 154 | Our new imaginary board will also have to tie into the machvec in order for it | 151 | Our new imaginary board will also have to tie into the machvec in order for it |
| 155 | to be of any use. Currently the machvec is slowly on its way out, but is still | 152 | to be of any use. |
| 156 | required for the time being. As such, let us take a look at what needs to be | ||
| 157 | done for the machvec assignment. | ||
| 158 | 153 | ||
| 159 | machvec functions fall into a number of categories: | 154 | machvec functions fall into a number of categories: |
| 160 | 155 | ||
| 161 | - I/O functions to IO memory (inb etc) and PCI/main memory (readb etc). | 156 | - I/O functions to IO memory (inb etc) and PCI/main memory (readb etc). |
| 162 | - I/O remapping functions (ioremap etc) | 157 | - I/O mapping functions (ioport_map, ioport_unmap, etc). |
| 163 | - some initialisation functions | 158 | - a 'heartbeat' function. |
| 164 | - a 'heartbeat' function | 159 | - PCI and IRQ initialization routines. |
| 165 | - some miscellaneous flags | 160 | - Consistent allocators (for boards that need special allocators, |
| 166 | 161 | particularly for allocating out of some board-specific SRAM for DMA | |
| 167 | The tree can be built in two ways: | 162 | handles). |
| 168 | - as a fully generic build. All drivers are linked in, and all functions | 163 | |
| 169 | go through the machvec | 164 | There are machvec functions added and removed over time, so always be sure to |
| 170 | - as a machine specific build. In this case only the required drivers | 165 | consult include/asm-sh/machvec.h for the current state of the machvec. |
| 171 | will be linked in, and some macros may be redefined to not go through | 166 | |
| 172 | the machvec where performance is important (in particular IO functions). | 167 | The kernel will automatically wrap in generic routines for undefined function |
| 173 | 168 | pointers in the machvec at boot time, as machvec functions are referenced | |
| 174 | There are three ways in which IO can be performed: | 169 | unconditionally throughout most of the tree. Some boards have incredibly |
| 175 | - none at all. This is really only useful for the 'unknown' machine type, | 170 | sparse machvecs (such as the dreamcast and sh03), whereas others must define |
| 176 | which is designed to run on a machine about which we know nothing, and | 171 | virtually everything (rts7751r2d). |
| 177 | so all IO instructions do nothing. | 172 | |
| 178 | - fully custom. In this case all IO functions go to a machine specific | 173 | Adding a new machine is relatively trivial (using vapor as an example): |
| 179 | set of functions which can do what they like | 174 | |
| 180 | - a generic set of functions. These will cope with most situations, | 175 | If the board-specific definitions are quite minimalistic, as is the case for |
| 181 | and rely on a single function, mv_port2addr, which is called through the | 176 | the vast majority of boards, simply having a single board-specific header is |
| 182 | machine vector, and converts an IO address into a memory address, which | 177 | sufficient. |
| 183 | can be read from/written to directly. | 178 | |
| 184 | 179 | - add a new file include/asm-sh/vapor.h which contains prototypes for | |
| 185 | Thus adding a new machine involves the following steps (I will assume I am | ||
| 186 | adding a machine called vapor): | ||
| 187 | |||
| 188 | - add a new file include/asm-sh/vapor/io.h which contains prototypes for | ||
| 189 | any machine specific IO functions prefixed with the machine name, for | 180 | any machine specific IO functions prefixed with the machine name, for |
| 190 | example vapor_inb. These will be needed when filling out the machine | 181 | example vapor_inb. These will be needed when filling out the machine |
| 191 | vector. | 182 | vector. |
| 192 | 183 | ||
| 193 | This is the minimum that is required, however there are ample | 184 | Note that these prototypes are generated automatically by setting |
| 194 | opportunities to optimise this. In particular, by making the prototypes | 185 | __IO_PREFIX to something sensible. A typical example would be: |
| 195 | inline function definitions, it is possible to inline the function when | 186 | |
| 196 | building machine specific versions. Note that the machine vector | 187 | #define __IO_PREFIX vapor |
| 197 | functions will still be needed, so that a module built for a generic | 188 | #include <asm/io_generic.h> |
| 198 | setup can be loaded. | 189 | |
| 199 | 190 | somewhere in the board-specific header. Any boards being ported that still | |
| 200 | - add a new file arch/sh/boards/vapor/mach.c. This contains the definition | 191 | have a legacy io.h should remove it entirely and switch to the new model. |
| 201 | of the machine vector. When building the machine specific version, this | 192 | |
| 202 | will be the real machine vector (via an alias), while in the generic | 193 | - Add machine vector definitions to the board's setup.c. At a bare minimum, |
| 203 | version is used to initialise the machine vector, and then freed, by | 194 | this must be defined as something like: |
| 204 | making it initdata. This should be defined as: | 195 | |
| 205 | 196 | struct sh_machine_vector mv_vapor __initmv = { | |
| 206 | struct sh_machine_vector mv_vapor __initmv = { | 197 | .mv_name = "vapor", |
| 207 | .mv_name = "vapor", | 198 | }; |
| 208 | } | 199 | ALIAS_MV(vapor) |
| 209 | ALIAS_MV(vapor) | 200 | |
| 210 | 201 | - finally add a file arch/sh/boards/vapor/io.c, which contains definitions of | |
| 211 | - finally add a file arch/sh/boards/vapor/io.c, which contains | 202 | the machine specific io functions (if there are enough to warrant it). |
| 212 | definitions of the machine specific io functions. | ||
| 213 | |||
| 214 | A note about initialisation functions. Three initialisation functions are | ||
| 215 | provided in the machine vector: | ||
| 216 | - mv_arch_init - called very early on from setup_arch | ||
| 217 | - mv_init_irq - called from init_IRQ, after the generic SH interrupt | ||
| 218 | initialisation | ||
| 219 | - mv_init_pci - currently not used | ||
| 220 | |||
| 221 | Any other remaining functions which need to be called at start up can be | ||
| 222 | added to the list using the __initcalls macro (or module_init if the code | ||
| 223 | can be built as a module). Many generic drivers probe to see if the device | ||
| 224 | they are targeting is present, however this may not always be appropriate, | ||
| 225 | so a flag can be added to the machine vector which will be set on those | ||
| 226 | machines which have the hardware in question, reducing the probe to a | ||
| 227 | single conditional. | ||
| 228 | 203 | ||
| 229 | 3. Hooking into the Build System | 204 | 3. Hooking into the Build System |
| 230 | ================================ | 205 | ================================ |
| @@ -303,4 +278,3 @@ which will in turn copy the defconfig for this board, run it through | |||
| 303 | oldconfig (prompting you for any new options since the time of creation), | 278 | oldconfig (prompting you for any new options since the time of creation), |
| 304 | and start you on your way to having a functional kernel for your new | 279 | and start you on your way to having a functional kernel for your new |
| 305 | board. | 280 | board. |
| 306 | |||
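
The new-machine.txt changes above say that the kernel wraps in generic routines for any machvec function pointers a board leaves undefined, which is why a board such as vapor can get away with setting little more than .mv_name. The following Python model illustrates that defaulting step only. It is not how the kernel implements it: the dictionary representation, `fill_machvec`, and the stand-in routine bodies are assumptions for illustration.

```python
def generic_inb(port):
    """Stand-in for a generic port read; returns 0 in this model."""
    return 0


def generic_heartbeat():
    """Stand-in for a generic heartbeat; does nothing in this model."""
    return None


# Hypothetical table of generic fallbacks, keyed by machvec field name.
GENERIC_ROUTINES = {
    "mv_inb": generic_inb,
    "mv_heartbeat": generic_heartbeat,
}


def fill_machvec(board_mv):
    """Return a copy of board_mv with undefined entries set to generics."""
    mv = dict(board_mv)
    for name, routine in GENERIC_ROUTINES.items():
        if mv.get(name) is None:
            mv[name] = routine
    return mv


def vapor_inb(port):
    """Board-specific port read for the imaginary vapor board."""
    return 0xFF


# A sparse machvec: only the name and one I/O routine are provided.
mv_vapor = {"mv_name": "vapor", "mv_inb": vapor_inb}
mv = fill_machvec(mv_vapor)
```

After the fill, callers can invoke every entry unconditionally, which mirrors the text's point that machvec functions are referenced unconditionally throughout most of the tree.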
diff --git a/Documentation/sh/register-banks.txt b/Documentation/sh/register-banks.txt new file mode 100644 index 000000000000..a6719f2f6594 --- /dev/null +++ b/Documentation/sh/register-banks.txt | |||
| @@ -0,0 +1,33 @@ | |||
| 1 | Notes on register bank usage in the kernel | ||
| 2 | ========================================== | ||
| 3 | |||
| 4 | Introduction | ||
| 5 | ------------ | ||
| 6 | |||
| 7 | The SH-3 and SH-4 CPU families traditionally include a single partial register | ||
| 8 | bank (selected by SR.RB, only r0 ... r7 are banked), whereas other families | ||
| 9 | may have more full-featured banking or simply no such capabilities at all. | ||
| 10 | |||
| 11 | SR.RB banking | ||
| 12 | ------------- | ||
| 13 | |||
| 14 | In the case of this type of banking, banked registers are mapped directly to | ||
| 15 | r0 ... r7 if SR.RB is set to the bank we are interested in, otherwise ldc/stc | ||
| 16 | can still be used to reference the banked registers (as r0_bank ... r7_bank) | ||
| 17 | when in the context of another bank. The developer must keep the SR.RB value | ||
| 18 | in mind when writing code that utilizes these banked registers, for obvious | ||
| 19 | reasons. Userspace is also not able to poke at the bank1 values, so these can | ||
| 20 | be used rather effectively as scratch registers by the kernel. | ||
| 21 | |||
| 22 | Presently the kernel uses several of these registers. | ||
| 23 | |||
| 24 | - r0_bank, r1_bank (referenced as k0 and k1, used for scratch | ||
| 25 | registers when doing exception handling). | ||
| 26 | - r2_bank (used to track the EXPEVT/INTEVT code) | ||
| 27 | - Used by do_IRQ() and friends for doing irq mapping based off | ||
| 28 | of the interrupt exception vector jump table offset | ||
| 29 | - r6_bank (global interrupt mask) | ||
| 30 | - The SR.IMASK interrupt handler makes use of this to set the | ||
| 31 | interrupt priority level (used by local_irq_enable()) | ||
| 32 | - r7_bank (current) | ||
| 33 | |||
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 7cee90223d3a..20d0d797f539 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt | |||
| @@ -29,6 +29,7 @@ Currently, these files are in /proc/sys/vm: | |||
| 29 | - drop-caches | 29 | - drop-caches |
| 30 | - zone_reclaim_mode | 30 | - zone_reclaim_mode |
| 31 | - min_unmapped_ratio | 31 | - min_unmapped_ratio |
| 32 | - min_slab_ratio | ||
| 32 | - panic_on_oom | 33 | - panic_on_oom |
| 33 | 34 | ||
| 34 | ============================================================== | 35 | ============================================================== |
| @@ -138,7 +139,6 @@ This is value ORed together of | |||
| 138 | 1 = Zone reclaim on | 139 | 1 = Zone reclaim on |
| 139 | 2 = Zone reclaim writes dirty pages out | 140 | 2 = Zone reclaim writes dirty pages out |
| 140 | 4 = Zone reclaim swaps pages | 141 | 4 = Zone reclaim swaps pages |
| 141 | 8 = Also do a global slab reclaim pass | ||
| 142 | 142 | ||
| 143 | zone_reclaim_mode is set during bootup to 1 if it is determined that pages | 143 | zone_reclaim_mode is set during bootup to 1 if it is determined that pages |
| 144 | from remote zones will cause a measurable performance reduction. The | 144 | from remote zones will cause a measurable performance reduction. The |
| @@ -162,18 +162,13 @@ Allowing regular swap effectively restricts allocations to the local | |||
| 162 | node unless explicitly overridden by memory policies or cpuset | 162 | node unless explicitly overridden by memory policies or cpuset |
| 163 | configurations. | 163 | configurations. |
| 164 | 164 | ||
| 165 | It may be advisable to allow slab reclaim if the system makes heavy | ||
| 166 | use of files and builds up large slab caches. However, the slab | ||
| 167 | shrink operation is global, may take a long time and free slabs | ||
| 168 | in all nodes of the system. | ||
| 169 | |||
| 170 | ============================================================= | 165 | ============================================================= |
| 171 | 166 | ||
| 172 | min_unmapped_ratio: | 167 | min_unmapped_ratio: |
| 173 | 168 | ||
| 174 | This is available only on NUMA kernels. | 169 | This is available only on NUMA kernels. |
| 175 | 170 | ||
| 176 | A percentage of the file backed pages in each zone. Zone reclaim will only | 171 | A percentage of the total pages in each zone. Zone reclaim will only |
| 177 | occur if more than this percentage of pages are file backed and unmapped. | 172 | occur if more than this percentage of pages are file backed and unmapped. |
| 178 | This is to ensure that a minimal amount of local pages is still available for | 173 | This is to ensure that a minimal amount of local pages is still available for |
| 179 | file I/O even if the node is overallocated. | 174 | file I/O even if the node is overallocated. |
| @@ -182,6 +177,24 @@ The default is 1 percent. | |||
| 182 | 177 | ||
| 183 | ============================================================= | 178 | ============================================================= |
| 184 | 179 | ||
| 180 | min_slab_ratio: | ||
| 181 | |||
| 182 | This is available only on NUMA kernels. | ||
| 183 | |||
| 184 | A percentage of the total pages in each zone. During zone reclaim | ||
| 185 | (when fallback from the local zone occurs), slabs will be reclaimed if more | ||
| 186 | than this percentage of pages in a zone are reclaimable slab pages. | ||
| 187 | This ensures that slab growth stays under control even in NUMA | ||
| 188 | systems that rarely perform global reclaim. | ||
| 189 | |||
| 190 | The default is 5 percent. | ||
| 191 | |||
| 192 | Note that slab reclaim is triggered in a per zone / node fashion. | ||
| 193 | The process of reclaiming slab memory is currently not node specific | ||
| 194 | and may not be fast. | ||
| 195 | |||
| 196 | ============================================================= | ||
| 197 | |||
| 185 | panic_on_oom | 198 | panic_on_oom |
| 186 | 199 | ||
| 187 | This enables or disables panic on out-of-memory feature. If this is set to 1, | 200 | This enables or disables panic on out-of-memory feature. If this is set to 1, |
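
The vm.txt changes above describe two things that are easy to check numerically: zone_reclaim_mode is a value ORed together from the bits 1, 2 and 4 (the former 8 bit is removed by this change), and min_slab_ratio is a percentage of the total pages in a zone. The sketch below follows the wording of the text. The function names are hypothetical, and the kernel's own arithmetic may round differently.

```python
# Bit meanings as listed in the documentation text.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(mode):
    """List the behaviours enabled by a zone_reclaim_mode value."""
    return [text for bit, text in ZONE_RECLAIM_BITS.items() if mode & bit]


def slab_reclaim_wanted(zone_pages, reclaimable_slab_pages, min_slab_ratio=5):
    """True if more than min_slab_ratio percent of the zone's pages
    are reclaimable slab pages (default 5 percent, as in the text)."""
    return reclaimable_slab_pages * 100 > zone_pages * min_slab_ratio
```

For a zone of 1000 pages with the default ratio, 50 reclaimable slab pages is exactly 5 percent and does not trigger slab reclaim, while 51 does.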
diff --git a/Documentation/usb/error-codes.txt b/Documentation/usb/error-codes.txt index 867f4c38f356..39c68f8c4e6c 100644 --- a/Documentation/usb/error-codes.txt +++ b/Documentation/usb/error-codes.txt | |||
| @@ -98,13 +98,13 @@ one or more packets could finish before an error stops further endpoint I/O. | |||
| 98 | error, a failure to respond (often caused by | 98 | error, a failure to respond (often caused by |
| 99 | device disconnect), or some other fault. | 99 | device disconnect), or some other fault. |
| 100 | 100 | ||
| 101 | -ETIMEDOUT (**) No response packet received within the prescribed | 101 | -ETIME (**) No response packet received within the prescribed |
| 102 | bus turn-around time. This error may instead be | 102 | bus turn-around time. This error may instead be |
| 103 | reported as -EPROTO or -EILSEQ. | 103 | reported as -EPROTO or -EILSEQ. |
| 104 | 104 | ||
| 105 | Note that the synchronous USB message functions | 105 | -ETIMEDOUT Synchronous USB message functions use this code |
| 106 | also use this code to indicate timeout expired | 106 | to indicate timeout expired before the transfer |
| 107 | before the transfer completed. | 107 | completed, and no other error was reported by HC. |
| 108 | 108 | ||
| 109 | -EPIPE (**) Endpoint stalled. For non-control endpoints, | 109 | -EPIPE (**) Endpoint stalled. For non-control endpoints, |
| 110 | reset this status with usb_clear_halt(). | 110 | reset this status with usb_clear_halt(). |
| @@ -163,6 +163,3 @@ usb_get_*/usb_set_*(): | |||
| 163 | usb_control_msg(): | 163 | usb_control_msg(): |
| 164 | usb_bulk_msg(): | 164 | usb_bulk_msg(): |
| 165 | -ETIMEDOUT Timeout expired before the transfer completed. | 165 | -ETIMEDOUT Timeout expired before the transfer completed. |
| 166 | In the future this code may change to -ETIME, | ||
| 167 | whose definition is a closer match to this sort | ||
| 168 | of error. | ||
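
With the error-codes.txt change above, -ETIME and -ETIMEDOUT now mean different things: the first is a bus-level failure to respond within the turn-around time, the second is reported by the synchronous message functions when their timeout expires. A small Python sketch of telling them apart, using the standard `errno` module on a platform that defines both codes (Linux does); the function name and the returned strings are assumptions for illustration.

```python
import errno


def describe_usb_timeout(status):
    """Classify a negative errno-style USB status as described in the text."""
    if status == -errno.ETIME:
        return "no response within the bus turn-around time"
    if status == -errno.ETIMEDOUT:
        return "synchronous message timeout expired before completion"
    return "not a timeout code"
```

Note that the text also says the bus-level case may instead be reported as -EPROTO or -EILSEQ, so a caller cannot rely on seeing -ETIME for every unanswered packet.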
diff --git a/Documentation/usb/usb-serial.txt b/Documentation/usb/usb-serial.txt index 02b0f7beb6d1..a2dee6e6190d 100644 --- a/Documentation/usb/usb-serial.txt +++ b/Documentation/usb/usb-serial.txt | |||
| @@ -433,6 +433,11 @@ Options supported: | |||
| 433 | See http://www.uuhaus.de/linux/palmconnect.html for up-to-date | 433 | See http://www.uuhaus.de/linux/palmconnect.html for up-to-date |
| 434 | information on this driver. | 434 | information on this driver. |
| 435 | 435 | ||
| 436 | AIRcable USB Dongle Bluetooth driver | ||
| 437 | If the cdc_acm driver is loaded in the system, you will find that | ||
| 438 | cdc_acm claims the device before AIRcable can. This is simply corrected | ||
| 439 | by unloading both modules and then loading the aircable module before | ||
| 440 | the cdc_acm module. | ||
| 436 | 441 | ||
| 437 | Generic Serial driver | 442 | Generic Serial driver |
| 438 | 443 | ||
diff --git a/Documentation/x86_64/boot-options.txt b/Documentation/x86_64/boot-options.txt index 6da24e7a56cb..4303e0c12476 100644 --- a/Documentation/x86_64/boot-options.txt +++ b/Documentation/x86_64/boot-options.txt | |||
| @@ -245,6 +245,13 @@ Debugging | |||
| 245 | newfallback: use new unwinder but fall back to old if it gets | 245 | newfallback: use new unwinder but fall back to old if it gets |
| 246 | stuck (default) | 246 | stuck (default) |
| 247 | 247 | ||
| 248 | call_trace=[old|both|newfallback|new] | ||
| 249 | old: use old inexact backtracer | ||
| 250 | new: use new exact dwarf2 unwinder | ||
| 251 | both: print entries from both | ||
| 252 | newfallback: use new unwinder but fall back to old if it gets | ||
| 253 | stuck (default) | ||
| 254 | |||
| 248 | Misc | 255 | Misc |
| 249 | 256 | ||
| 250 | noreplacement Don't replace instructions with more appropriate ones | 257 | noreplacement Don't replace instructions with more appropriate ones |
diff --git a/Documentation/x86_64/kernel-stacks b/Documentation/x86_64/kernel-stacks new file mode 100644 index 000000000000..bddfddd466ab --- /dev/null +++ b/Documentation/x86_64/kernel-stacks | |||
| @@ -0,0 +1,99 @@ | |||
| 1 | Most of the text from Keith Owens, hacked by AK | ||
| 2 | |||
| 3 | x86_64 page size (PAGE_SIZE) is 4K. | ||
| 4 | |||
| 5 | Like all other architectures, x86_64 has a kernel stack for every | ||
| 6 | active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. | ||
| 7 | These stacks contain useful data as long as a thread is alive or a | ||
| 8 | zombie. While the thread is in user space the kernel stack is empty | ||
| 9 | except for the thread_info structure at the bottom. | ||
| 10 | |||
| 11 | In addition to the per thread stacks, there are specialized stacks | ||
| 12 | associated with each cpu. These stacks are only used while the kernel | ||
| 13 | is in control on that cpu; when a cpu returns to user space the | ||
| 14 | specialized stacks contain no useful data. The main cpu stack is | ||
| 15 | |||
| 16 | * Interrupt stack. IRQSTACKSIZE | ||
| 17 | |||
| 18 | Used for external hardware interrupts. If this is the first external | ||
| 19 | hardware interrupt (i.e. not a nested hardware interrupt) then the | ||
| 20 | kernel switches from the current task to the interrupt stack. Like | ||
| 21 | the split thread and interrupt stacks on i386 (with CONFIG_4KSTACKS), | ||
| 22 | this gives more room for kernel interrupt processing without having | ||
| 23 | to increase the size of every per thread stack. | ||
| 24 | |||
| 25 | The interrupt stack is also used when processing a softirq. | ||
| 26 | |||
| 27 | Switching to the kernel interrupt stack is done by software based on a | ||
| 28 | per CPU interrupt nest counter. This is needed because x86-64 "IST" | ||
| 29 | hardware stacks cannot nest without races. | ||
| 30 | |||
| 31 | x86_64 also has a feature which is not available on i386, the ability | ||
| 32 | to automatically switch to a new stack for designated events such as | ||
| 33 | double fault or NMI, which makes it easier to handle these unusual | ||
| 34 | events on x86_64. This feature is called the Interrupt Stack Table | ||
| 35 | (IST). There can be up to 7 IST entries per cpu. The IST code is an | ||
| 36 | index into the Task State Segment (TSS). The IST entries in the TSS | ||
| 37 | point to dedicated stacks; each stack can be a different size. | ||
| 38 | |||
| 39 | An IST is selected by a non-zero value in the IST field of an | ||
| 40 | interrupt-gate descriptor. When an interrupt occurs and the hardware | ||
| 41 | loads such a descriptor, the hardware automatically sets the new stack | ||
| 42 | pointer based on the IST value, then invokes the interrupt handler. If | ||
| 43 | software wants to allow nested IST interrupts then the handler must | ||
| 44 | adjust the IST values on entry to and exit from the interrupt handler. | ||
| 45 | (this is occasionally done, e.g. for debug exceptions) | ||
| 46 | |||
| 47 | Events with different IST codes (i.e. with different stacks) can be | ||
| 48 | nested. For example, a debug interrupt can safely be interrupted by an | ||
| 49 | NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack | ||
| 50 | pointers on entry to and exit from all IST events, in theory allowing | ||
| 51 | IST events with the same code to be nested. However in most cases, the | ||
| 52 | stack size allocated to an IST assumes no nesting for the same code. | ||
| 53 | If that assumption is ever broken then the stacks will become corrupt. | ||
| 54 | |||
| 55 | The currently assigned IST stacks are :- | ||
| 56 | |||
| 57 | * STACKFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). | ||
| 58 | |||
| 59 | Used for interrupt 12 - Stack Fault Exception (#SS). | ||
| 60 | |||
| 61 | This allows the kernel to recover from invalid stack segments. This | ||
| 62 | rarely happens. | ||
| 63 | |||
| 64 | * DOUBLEFAULT_STACK. EXCEPTION_STKSZ (PAGE_SIZE). | ||
| 65 | |||
| 66 | Used for interrupt 8 - Double Fault Exception (#DF). | ||
| 67 | |||
| 68 | Invoked when handling an exception causes another exception. Happens | ||
| 69 | when the kernel is very confused (e.g. kernel stack pointer corrupt). | ||
| 70 | Using a separate stack allows the kernel to recover well enough in many | ||
| 71 | cases to still output an oops. | ||
| 72 | |||
| 73 | * NMI_STACK. EXCEPTION_STKSZ (PAGE_SIZE). | ||
| 74 | |||
| 75 | Used for non-maskable interrupts (NMI). | ||
| 76 | |||
| 77 | NMI can be delivered at any time, including when the kernel is in the | ||
| 78 | middle of switching stacks. Using IST for NMI events avoids making | ||
| 79 | assumptions about the previous state of the kernel stack. | ||
| 80 | |||
| 81 | * DEBUG_STACK. DEBUG_STKSZ | ||
| 82 | |||
| 83 | Used for hardware debug interrupts (interrupt 1) and for software | ||
| 84 | debug interrupts (INT3). | ||
| 85 | |||
| 86 | When debugging a kernel, debug interrupts (both hardware and | ||
| 87 | software) can occur at any time. Using IST for these interrupts | ||
| 88 | avoids making assumptions about the previous state of the kernel | ||
| 89 | stack. | ||
| 90 | |||
| 91 | * MCE_STACK. EXCEPTION_STKSZ (PAGE_SIZE). | ||
| 92 | |||
| 93 | Used for interrupt 18 - Machine Check Exception (#MC). | ||
| 94 | |||
| 95 | MCE can be delivered at any time, including when the kernel is in the | ||
| 96 | middle of switching stacks. Using IST for MCE events avoids making | ||
| 97 | assumptions about the previous state of the kernel stack. | ||
| 98 | |||
| 99 | For more details see the Intel IA32 or AMD AMD64 architecture manuals. | ||
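
The stack sizes and the IST nesting rule from the kernel-stacks text above can be summarised in a short model. This Python sketch only restates what the text gives: the helper `nesting_is_safe` is hypothetical, DEBUG_STACK is left out of the size table because the text does not give a value for DEBUG_STKSZ, and the occasional in-handler IST adjustment mentioned for debug exceptions is not modelled.

```python
PAGE_SIZE = 4096                 # x86_64 page size is 4K
THREAD_SIZE = 2 * PAGE_SIZE      # per-thread kernel stack
EXCEPTION_STKSZ = PAGE_SIZE      # size of most IST stacks
MAX_IST_ENTRIES = 7              # up to 7 IST entries per cpu

# IST stacks whose size the text states explicitly.
IST_STACK_SIZES = {
    "STACKFAULT_STACK": EXCEPTION_STKSZ,
    "DOUBLEFAULT_STACK": EXCEPTION_STKSZ,
    "NMI_STACK": EXCEPTION_STKSZ,
    "MCE_STACK": EXCEPTION_STKSZ,
}


def nesting_is_safe(active_ists, incoming_ist):
    """Events on different IST stacks can nest; a second event on a stack
    that is already in use would start again at the same stack pointer
    and overwrite the frames of the first."""
    return incoming_ist not in active_ists
```

For example, a debug interrupt interrupted by an NMI is safe in this model, whereas an NMI arriving while the NMI stack is already in use is not.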
