diff options
Diffstat (limited to 'Documentation')
30 files changed, 1863 insertions, 76 deletions
diff --git a/Documentation/ABI/stable/sysfs-driver-usb-usbtmc b/Documentation/ABI/stable/sysfs-driver-usb-usbtmc new file mode 100644 index 000000000000..9a75fb22187d --- /dev/null +++ b/Documentation/ABI/stable/sysfs-driver-usb-usbtmc | |||
@@ -0,0 +1,62 @@ | |||
1 | What: /sys/bus/usb/drivers/usbtmc/devices/*/interface_capabilities | ||
2 | What: /sys/bus/usb/drivers/usbtmc/devices/*/device_capabilities | ||
3 | Date: August 2008 | ||
4 | Contact: Greg Kroah-Hartman <gregkh@suse.de> | ||
5 | Description: | ||
6 | These files show the various USB TMC capabilities as described | ||
7 | by the device itself. The full description of the bitfields | ||
8 | can be found in the USB TMC documents from the USB-IF entitled | ||
9 | "Universal Serial Bus Test and Measurement Class Specification | ||
10 | (USBTMC) Revision 1.0" section 4.2.1.8. | ||
11 | |||
12 | The files are read only. | ||
13 | |||
14 | |||
15 | What: /sys/bus/usb/drivers/usbtmc/devices/*/usb488_interface_capabilities | ||
16 | What: /sys/bus/usb/drivers/usbtmc/devices/*/usb488_device_capabilities | ||
17 | Date: August 2008 | ||
18 | Contact: Greg Kroah-Hartman <gregkh@suse.de> | ||
19 | Description: | ||
20 | These files show the various USB TMC capabilities as described | ||
21 | by the device itself. The full description of the bitfields | ||
22 | can be found in the USB TMC documents from the USB-IF entitled | ||
23 | "Universal Serial Bus Test and Measurement Class, Subclass | ||
24 | USB488 Specification (USBTMC-USB488) Revision 1.0" section | ||
25 | 4.2.2. | ||
26 | |||
27 | The files are read only. | ||
28 | |||
29 | |||
30 | What: /sys/bus/usb/drivers/usbtmc/devices/*/TermChar | ||
31 | Date: August 2008 | ||
32 | Contact: Greg Kroah-Hartman <gregkh@suse.de> | ||
33 | Description: | ||
34 | This file is the TermChar value to be sent to the USB TMC | ||
35 | device as described by the document, "Universal Serial Bus Test | ||
36 | and Measurement Class Specification | ||
37 | (USBTMC) Revision 1.0" as published by the USB-IF. | ||
38 | |||
39 | Note that the TermCharEnabled file determines if this value is | ||
40 | sent to the device or not. | ||
41 | |||
42 | |||
43 | What: /sys/bus/usb/drivers/usbtmc/devices/*/TermCharEnabled | ||
44 | Date: August 2008 | ||
45 | Contact: Greg Kroah-Hartman <gregkh@suse.de> | ||
46 | Description: | ||
47 | This file determines if the TermChar is to be sent to the | ||
48 | device on every transaction or not. For more details about | ||
49 | this, please see the document, "Universal Serial Bus Test and | ||
50 | Measurement Class Specification (USBTMC) Revision 1.0" as | ||
51 | published by the USB-IF. | ||
52 | |||
53 | |||
54 | What: /sys/bus/usb/drivers/usbtmc/devices/*/auto_abort | ||
55 | Date: August 2008 | ||
56 | Contact: Greg Kroah-Hartman <gregkh@suse.de> | ||
57 | Description: | ||
58 | This file determines if the the transaction of the USB TMC | ||
59 | device is to be automatically aborted if there is any error. | ||
60 | For more details about this, please see the document, | ||
61 | "Universal Serial Bus Test and Measurement Class Specification | ||
62 | (USBTMC) Revision 1.0" as published by the USB-IF. | ||
diff --git a/Documentation/ABI/testing/sysfs-bus-usb b/Documentation/ABI/testing/sysfs-bus-usb index 11a3c1682cec..df6c8a0159f1 100644 --- a/Documentation/ABI/testing/sysfs-bus-usb +++ b/Documentation/ABI/testing/sysfs-bus-usb | |||
@@ -85,3 +85,19 @@ Description: | |||
85 | Users: | 85 | Users: |
86 | PowerTOP <power@bughost.org> | 86 | PowerTOP <power@bughost.org> |
87 | http://www.lesswatts.org/projects/powertop/ | 87 | http://www.lesswatts.org/projects/powertop/ |
88 | |||
89 | What: /sys/bus/usb/device/<busnum>-<devnum>...:<config num>-<interface num>/supports_autosuspend | ||
90 | Date: January 2008 | ||
91 | KernelVersion: 2.6.27 | ||
92 | Contact: Sarah Sharp <sarah.a.sharp@intel.com> | ||
93 | Description: | ||
94 | When read, this file returns 1 if the interface driver | ||
95 | for this interface supports autosuspend. It also | ||
96 | returns 1 if no driver has claimed this interface, as an | ||
97 | unclaimed interface will not stop the device from being | ||
98 | autosuspended if all other interface drivers are idle. | ||
99 | The file returns 0 if autosuspend support has not been | ||
100 | added to the driver. | ||
101 | Users: | ||
102 | USB PM tool | ||
103 | git://git.moblin.org/users/sarah/usb-pm-tool/ | ||
diff --git a/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg b/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg new file mode 100644 index 000000000000..cb830df8777c --- /dev/null +++ b/Documentation/ABI/testing/sysfs-bus-usb-devices-usbsevseg | |||
@@ -0,0 +1,43 @@ | |||
1 | Where: /sys/bus/usb/.../powered | ||
2 | Date: August 2008 | ||
3 | Kernel Version: 2.6.26 | ||
4 | Contact: Harrison Metzger <harrisonmetz@gmail.com> | ||
5 | Description: Controls whether the device's display will powered. | ||
6 | A value of 0 is off and a non-zero value is on. | ||
7 | |||
8 | Where: /sys/bus/usb/.../mode_msb | ||
9 | Where: /sys/bus/usb/.../mode_lsb | ||
10 | Date: August 2008 | ||
11 | Kernel Version: 2.6.26 | ||
12 | Contact: Harrison Metzger <harrisonmetz@gmail.com> | ||
13 | Description: Controls the devices display mode. | ||
14 | For a 6 character display the values are | ||
15 | MSB 0x06; LSB 0x3F, and | ||
16 | for an 8 character display the values are | ||
17 | MSB 0x08; LSB 0xFF. | ||
18 | |||
19 | Where: /sys/bus/usb/.../textmode | ||
20 | Date: August 2008 | ||
21 | Kernel Version: 2.6.26 | ||
22 | Contact: Harrison Metzger <harrisonmetz@gmail.com> | ||
23 | Description: Controls the way the device interprets its text buffer. | ||
24 | raw: each character controls its segment manually | ||
25 | hex: each character is between 0-15 | ||
26 | ascii: each character is between '0'-'9' and 'A'-'F'. | ||
27 | |||
28 | Where: /sys/bus/usb/.../text | ||
29 | Date: August 2008 | ||
30 | Kernel Version: 2.6.26 | ||
31 | Contact: Harrison Metzger <harrisonmetz@gmail.com> | ||
32 | Description: The text (or data) for the device to display | ||
33 | |||
34 | Where: /sys/bus/usb/.../decimals | ||
35 | Date: August 2008 | ||
36 | Kernel Version: 2.6.26 | ||
37 | Contact: Harrison Metzger <harrisonmetz@gmail.com> | ||
38 | Description: Controls the decimal places on the device. | ||
39 | To set the nth decimal place, give this field | ||
40 | the value of 10 ** n. Assume this field has | ||
41 | the value k and has 1 or more decimal places set, | ||
42 | to set the mth place (where m is not already set), | ||
43 | change this fields value to k + 10 ** m. \ No newline at end of file | ||
diff --git a/Documentation/DocBook/gadget.tmpl b/Documentation/DocBook/gadget.tmpl index ea3bc9565e6a..6ef2f0073e5a 100644 --- a/Documentation/DocBook/gadget.tmpl +++ b/Documentation/DocBook/gadget.tmpl | |||
@@ -557,6 +557,9 @@ Near-term plans include converting all of them, except for "gadgetfs". | |||
557 | </para> | 557 | </para> |
558 | 558 | ||
559 | !Edrivers/usb/gadget/f_acm.c | 559 | !Edrivers/usb/gadget/f_acm.c |
560 | !Edrivers/usb/gadget/f_ecm.c | ||
561 | !Edrivers/usb/gadget/f_subset.c | ||
562 | !Edrivers/usb/gadget/f_obex.c | ||
560 | !Edrivers/usb/gadget/f_serial.c | 563 | !Edrivers/usb/gadget/f_serial.c |
561 | 564 | ||
562 | </sect1> | 565 | </sect1> |
diff --git a/Documentation/DocBook/kernel-hacking.tmpl b/Documentation/DocBook/kernel-hacking.tmpl index 4c63e5864160..ae15d55350ec 100644 --- a/Documentation/DocBook/kernel-hacking.tmpl +++ b/Documentation/DocBook/kernel-hacking.tmpl | |||
@@ -1105,7 +1105,7 @@ static struct block_device_operations opt_fops = { | |||
1105 | </listitem> | 1105 | </listitem> |
1106 | <listitem> | 1106 | <listitem> |
1107 | <para> | 1107 | <para> |
1108 | Function names as strings (__FUNCTION__). | 1108 | Function names as strings (__func__). |
1109 | </para> | 1109 | </para> |
1110 | </listitem> | 1110 | </listitem> |
1111 | <listitem> | 1111 | <listitem> |
diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt index a51f693c1541..256defd7e174 100644 --- a/Documentation/MSI-HOWTO.txt +++ b/Documentation/MSI-HOWTO.txt | |||
@@ -236,10 +236,8 @@ software system can set different pages for controlling accesses to the | |||
236 | MSI-X structure. The implementation of MSI support requires the PCI | 236 | MSI-X structure. The implementation of MSI support requires the PCI |
237 | subsystem, not a device driver, to maintain full control of the MSI-X | 237 | subsystem, not a device driver, to maintain full control of the MSI-X |
238 | table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X | 238 | table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X |
239 | table/MSI-X PBA. A device driver is prohibited from requesting the MMIO | 239 | table/MSI-X PBA. A device driver should not access the MMIO address |
240 | address space of the MSI-X table/MSI-X PBA. Otherwise, the PCI subsystem | 240 | space of the MSI-X table/MSI-X PBA. |
241 | will fail enabling MSI-X on its hardware device when it calls the function | ||
242 | pci_enable_msix(). | ||
243 | 241 | ||
244 | 5.3.2 API pci_enable_msix | 242 | 5.3.2 API pci_enable_msix |
245 | 243 | ||
diff --git a/Documentation/PCI/pci.txt b/Documentation/PCI/pci.txt index 8d4dc6250c58..fd4907a2968c 100644 --- a/Documentation/PCI/pci.txt +++ b/Documentation/PCI/pci.txt | |||
@@ -163,6 +163,10 @@ need pass only as many optional fields as necessary: | |||
163 | o class and classmask fields default to 0 | 163 | o class and classmask fields default to 0 |
164 | o driver_data defaults to 0UL. | 164 | o driver_data defaults to 0UL. |
165 | 165 | ||
166 | Note that driver_data must match the value used by any of the pci_device_id | ||
167 | entries defined in the driver. This makes the driver_data field mandatory | ||
168 | if all the pci_device_id entries have a non-zero driver_data value. | ||
169 | |||
166 | Once added, the driver probe routine will be invoked for any unclaimed | 170 | Once added, the driver probe routine will be invoked for any unclaimed |
167 | PCI devices listed in its (newly updated) pci_ids list. | 171 | PCI devices listed in its (newly updated) pci_ids list. |
168 | 172 | ||
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt index 16c251230c82..ddeb14beacc8 100644 --- a/Documentation/PCI/pcieaer-howto.txt +++ b/Documentation/PCI/pcieaer-howto.txt | |||
@@ -203,22 +203,17 @@ to mmio_enabled. | |||
203 | 203 | ||
204 | 3.3 helper functions | 204 | 3.3 helper functions |
205 | 205 | ||
206 | 3.3.1 int pci_find_aer_capability(struct pci_dev *dev); | 206 | 3.3.1 int pci_enable_pcie_error_reporting(struct pci_dev *dev); |
207 | pci_find_aer_capability locates the PCI Express AER capability | ||
208 | in the device configuration space. If the device doesn't support | ||
209 | PCI-Express AER, the function returns 0. | ||
210 | |||
211 | 3.3.2 int pci_enable_pcie_error_reporting(struct pci_dev *dev); | ||
212 | pci_enable_pcie_error_reporting enables the device to send error | 207 | pci_enable_pcie_error_reporting enables the device to send error |
213 | messages to root port when an error is detected. Note that devices | 208 | messages to root port when an error is detected. Note that devices |
214 | don't enable the error reporting by default, so device drivers need | 209 | don't enable the error reporting by default, so device drivers need |
215 | call this function to enable it. | 210 | call this function to enable it. |
216 | 211 | ||
217 | 3.3.3 int pci_disable_pcie_error_reporting(struct pci_dev *dev); | 212 | 3.3.2 int pci_disable_pcie_error_reporting(struct pci_dev *dev); |
218 | pci_disable_pcie_error_reporting disables the device to send error | 213 | pci_disable_pcie_error_reporting disables the device to send error |
219 | messages to root port when an error is detected. | 214 | messages to root port when an error is detected. |
220 | 215 | ||
221 | 3.3.4 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); | 216 | 3.3.3 int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev); |
222 | pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable | 217 | pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable |
223 | error status register. | 218 | error status register. |
224 | 219 | ||
diff --git a/Documentation/cgroups.txt b/Documentation/cgroups/cgroups.txt index d9014aa0eb68..d9014aa0eb68 100644 --- a/Documentation/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt | |||
diff --git a/Documentation/cgroups/freezer-subsystem.txt b/Documentation/cgroups/freezer-subsystem.txt new file mode 100644 index 000000000000..c50ab58b72eb --- /dev/null +++ b/Documentation/cgroups/freezer-subsystem.txt | |||
@@ -0,0 +1,99 @@ | |||
1 | The cgroup freezer is useful to batch job management system which start | ||
2 | and stop sets of tasks in order to schedule the resources of a machine | ||
3 | according to the desires of a system administrator. This sort of program | ||
4 | is often used on HPC clusters to schedule access to the cluster as a | ||
5 | whole. The cgroup freezer uses cgroups to describe the set of tasks to | ||
6 | be started/stopped by the batch job management system. It also provides | ||
7 | a means to start and stop the tasks composing the job. | ||
8 | |||
9 | The cgroup freezer will also be useful for checkpointing running groups | ||
10 | of tasks. The freezer allows the checkpoint code to obtain a consistent | ||
11 | image of the tasks by attempting to force the tasks in a cgroup into a | ||
12 | quiescent state. Once the tasks are quiescent another task can | ||
13 | walk /proc or invoke a kernel interface to gather information about the | ||
14 | quiesced tasks. Checkpointed tasks can be restarted later should a | ||
15 | recoverable error occur. This also allows the checkpointed tasks to be | ||
16 | migrated between nodes in a cluster by copying the gathered information | ||
17 | to another node and restarting the tasks there. | ||
18 | |||
19 | Sequences of SIGSTOP and SIGCONT are not always sufficient for stopping | ||
20 | and resuming tasks in userspace. Both of these signals are observable | ||
21 | from within the tasks we wish to freeze. While SIGSTOP cannot be caught, | ||
22 | blocked, or ignored it can be seen by waiting or ptracing parent tasks. | ||
23 | SIGCONT is especially unsuitable since it can be caught by the task. Any | ||
24 | programs designed to watch for SIGSTOP and SIGCONT could be broken by | ||
25 | attempting to use SIGSTOP and SIGCONT to stop and resume tasks. We can | ||
26 | demonstrate this problem using nested bash shells: | ||
27 | |||
28 | $ echo $$ | ||
29 | 16644 | ||
30 | $ bash | ||
31 | $ echo $$ | ||
32 | 16690 | ||
33 | |||
34 | From a second, unrelated bash shell: | ||
35 | $ kill -SIGSTOP 16690 | ||
36 | $ kill -SIGCONT 16990 | ||
37 | |||
38 | <at this point 16990 exits and causes 16644 to exit too> | ||
39 | |||
40 | This happens because bash can observe both signals and choose how it | ||
41 | responds to them. | ||
42 | |||
43 | Another example of a program which catches and responds to these | ||
44 | signals is gdb. In fact any program designed to use ptrace is likely to | ||
45 | have a problem with this method of stopping and resuming tasks. | ||
46 | |||
47 | In contrast, the cgroup freezer uses the kernel freezer code to | ||
48 | prevent the freeze/unfreeze cycle from becoming visible to the tasks | ||
49 | being frozen. This allows the bash example above and gdb to run as | ||
50 | expected. | ||
51 | |||
52 | The freezer subsystem in the container filesystem defines a file named | ||
53 | freezer.state. Writing "FROZEN" to the state file will freeze all tasks in the | ||
54 | cgroup. Subsequently writing "THAWED" will unfreeze the tasks in the cgroup. | ||
55 | Reading will return the current state. | ||
56 | |||
57 | * Examples of usage : | ||
58 | |||
59 | # mkdir /containers/freezer | ||
60 | # mount -t cgroup -ofreezer freezer /containers | ||
61 | # mkdir /containers/0 | ||
62 | # echo $some_pid > /containers/0/tasks | ||
63 | |||
64 | to get status of the freezer subsystem : | ||
65 | |||
66 | # cat /containers/0/freezer.state | ||
67 | THAWED | ||
68 | |||
69 | to freeze all tasks in the container : | ||
70 | |||
71 | # echo FROZEN > /containers/0/freezer.state | ||
72 | # cat /containers/0/freezer.state | ||
73 | FREEZING | ||
74 | # cat /containers/0/freezer.state | ||
75 | FROZEN | ||
76 | |||
77 | to unfreeze all tasks in the container : | ||
78 | |||
79 | # echo THAWED > /containers/0/freezer.state | ||
80 | # cat /containers/0/freezer.state | ||
81 | THAWED | ||
82 | |||
83 | This is the basic mechanism which should do the right thing for user space task | ||
84 | in a simple scenario. | ||
85 | |||
86 | It's important to note that freezing can be incomplete. In that case we return | ||
87 | EBUSY. This means that some tasks in the cgroup are busy doing something that | ||
88 | prevents us from completely freezing the cgroup at this time. After EBUSY, | ||
89 | the cgroup will remain partially frozen -- reflected by freezer.state reporting | ||
90 | "FREEZING" when read. The state will remain "FREEZING" until one of these | ||
91 | things happens: | ||
92 | |||
93 | 1) Userspace cancels the freezing operation by writing "THAWED" to | ||
94 | the freezer.state file | ||
95 | 2) Userspace retries the freezing operation by writing "FROZEN" to | ||
96 | the freezer.state file (writing "FREEZING" is not legal | ||
97 | and returns EIO) | ||
98 | 3) The tasks that blocked the cgroup from entering the "FROZEN" | ||
99 | state disappear from the cgroup's set of tasks. | ||
diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt index 9b53d5827361..1c07547d3f81 100644 --- a/Documentation/controllers/memory.txt +++ b/Documentation/controllers/memory.txt | |||
@@ -112,14 +112,22 @@ the per cgroup LRU. | |||
112 | 112 | ||
113 | 2.2.1 Accounting details | 113 | 2.2.1 Accounting details |
114 | 114 | ||
115 | All mapped pages (RSS) and unmapped user pages (Page Cache) are accounted. | 115 | All mapped anon pages (RSS) and cache pages (Page Cache) are accounted. |
116 | RSS pages are accounted at the time of page_add_*_rmap() unless they've already | 116 | (some pages which never be reclaimable and will not be on global LRU |
117 | been accounted for earlier. A file page will be accounted for as Page Cache; | 117 | are not accounted. we just accounts pages under usual vm management.) |
118 | it's mapped into the page tables of a process, duplicate accounting is carefully | 118 | |
119 | avoided. Page Cache pages are accounted at the time of add_to_page_cache(). | 119 | RSS pages are accounted at page_fault unless they've already been accounted |
120 | The corresponding routines that remove a page from the page tables or removes | 120 | for earlier. A file page will be accounted for as Page Cache when it's |
121 | a page from Page Cache is used to decrement the accounting counters of the | 121 | inserted into inode (radix-tree). While it's mapped into the page tables of |
122 | cgroup. | 122 | processes, duplicate accounting is carefully avoided. |
123 | |||
124 | A RSS page is unaccounted when it's fully unmapped. A PageCache page is | ||
125 | unaccounted when it's removed from radix-tree. | ||
126 | |||
127 | At page migration, accounting information is kept. | ||
128 | |||
129 | Note: we just account pages-on-lru because our purpose is to control amount | ||
130 | of used pages. not-on-lru pages are tend to be out-of-control from vm view. | ||
123 | 131 | ||
124 | 2.3 Shared Page Accounting | 132 | 2.3 Shared Page Accounting |
125 | 133 | ||
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index 47e568a9370a..5c86c258c791 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt | |||
@@ -48,7 +48,7 @@ hooks, beyond what is already present, required to manage dynamic | |||
48 | job placement on large systems. | 48 | job placement on large systems. |
49 | 49 | ||
50 | Cpusets use the generic cgroup subsystem described in | 50 | Cpusets use the generic cgroup subsystem described in |
51 | Documentation/cgroup.txt. | 51 | Documentation/cgroups/cgroups.txt. |
52 | 52 | ||
53 | Requests by a task, using the sched_setaffinity(2) system call to | 53 | Requests by a task, using the sched_setaffinity(2) system call to |
54 | include CPUs in its CPU affinity mask, and using the mbind(2) and | 54 | include CPUs in its CPU affinity mask, and using the mbind(2) and |
diff --git a/Documentation/devices.txt b/Documentation/devices.txt index 05c80645e4ee..2be08240ee80 100644 --- a/Documentation/devices.txt +++ b/Documentation/devices.txt | |||
@@ -2571,6 +2571,9 @@ Your cooperation is appreciated. | |||
2571 | 160 = /dev/usb/legousbtower0 1st USB Legotower device | 2571 | 160 = /dev/usb/legousbtower0 1st USB Legotower device |
2572 | ... | 2572 | ... |
2573 | 175 = /dev/usb/legousbtower15 16th USB Legotower device | 2573 | 175 = /dev/usb/legousbtower15 16th USB Legotower device |
2574 | 176 = /dev/usb/usbtmc1 First USB TMC device | ||
2575 | ... | ||
2576 | 192 = /dev/usb/usbtmc16 16th USB TMC device | ||
2574 | 240 = /dev/usb/dabusb0 First daubusb device | 2577 | 240 = /dev/usb/dabusb0 First daubusb device |
2575 | ... | 2578 | ... |
2576 | 243 = /dev/usb/dabusb3 Fourth dabusb device | 2579 | 243 = /dev/usb/dabusb3 Fourth dabusb device |
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 295f26cd895a..9dd2a3bb2acc 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt | |||
@@ -96,6 +96,11 @@ errors=remount-ro(*) Remount the filesystem read-only on an error. | |||
96 | errors=continue Keep going on a filesystem error. | 96 | errors=continue Keep going on a filesystem error. |
97 | errors=panic Panic and halt the machine if an error occurs. | 97 | errors=panic Panic and halt the machine if an error occurs. |
98 | 98 | ||
99 | data_err=ignore(*) Just print an error message if an error occurs | ||
100 | in a file data buffer in ordered mode. | ||
101 | data_err=abort Abort the journal if an error occurs in a file | ||
102 | data buffer in ordered mode. | ||
103 | |||
99 | grpid Give objects the same group ID as their creator. | 104 | grpid Give objects the same group ID as their creator. |
100 | bsdgroups | 105 | bsdgroups |
101 | 106 | ||
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index eb154ef36c2a..174eaff7ded9 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt | |||
@@ -2,19 +2,24 @@ | |||
2 | Ext4 Filesystem | 2 | Ext4 Filesystem |
3 | =============== | 3 | =============== |
4 | 4 | ||
5 | This is a development version of the ext4 filesystem, an advanced level | 5 | Ext4 is an an advanced level of the ext3 filesystem which incorporates |
6 | of the ext3 filesystem which incorporates scalability and reliability | 6 | scalability and reliability enhancements for supporting large filesystems |
7 | enhancements for supporting large filesystems (64 bit) in keeping with | 7 | (64 bit) in keeping with increasing disk capacities and state-of-the-art |
8 | increasing disk capacities and state-of-the-art feature requirements. | 8 | feature requirements. |
9 | 9 | ||
10 | Mailing list: linux-ext4@vger.kernel.org | 10 | Mailing list: linux-ext4@vger.kernel.org |
11 | Web site: http://ext4.wiki.kernel.org | ||
11 | 12 | ||
12 | 13 | ||
13 | 1. Quick usage instructions: | 14 | 1. Quick usage instructions: |
14 | =========================== | 15 | =========================== |
15 | 16 | ||
17 | Note: More extensive information for getting started with ext4 can be | ||
18 | found at the ext4 wiki site at the URL: | ||
19 | http://ext4.wiki.kernel.org/index.php/Ext4_Howto | ||
20 | |||
16 | - Compile and install the latest version of e2fsprogs (as of this | 21 | - Compile and install the latest version of e2fsprogs (as of this |
17 | writing version 1.41) from: | 22 | writing version 1.41.3) from: |
18 | 23 | ||
19 | http://sourceforge.net/project/showfiles.php?group_id=2406 | 24 | http://sourceforge.net/project/showfiles.php?group_id=2406 |
20 | 25 | ||
@@ -36,11 +41,9 @@ Mailing list: linux-ext4@vger.kernel.org | |||
36 | 41 | ||
37 | # mke2fs -t ext4 /dev/hda1 | 42 | # mke2fs -t ext4 /dev/hda1 |
38 | 43 | ||
39 | Or configure an existing ext3 filesystem to support extents and set | 44 | Or to configure an existing ext3 filesystem to support extents: |
40 | the test_fs flag to indicate that it's ok for an in-development | ||
41 | filesystem to touch this filesystem: | ||
42 | 45 | ||
43 | # tune2fs -O extents -E test_fs /dev/hda1 | 46 | # tune2fs -O extents /dev/hda1 |
44 | 47 | ||
45 | If the filesystem was created with 128 byte inodes, it can be | 48 | If the filesystem was created with 128 byte inodes, it can be |
46 | converted to use 256 byte for greater efficiency via: | 49 | converted to use 256 byte for greater efficiency via: |
@@ -104,8 +107,8 @@ exist yet so I'm not sure they're in the near-term roadmap. | |||
104 | The big performance win will come with mballoc, delalloc and flex_bg | 107 | The big performance win will come with mballoc, delalloc and flex_bg |
105 | grouping of bitmaps and inode tables. Some test results available here: | 108 | grouping of bitmaps and inode tables. Some test results available here: |
106 | 109 | ||
107 | - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html | 110 | - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html |
108 | - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html | 111 | - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html |
109 | 112 | ||
110 | 3. Options | 113 | 3. Options |
111 | ========== | 114 | ========== |
@@ -214,9 +217,6 @@ noreservation | |||
214 | bsddf (*) Make 'df' act like BSD. | 217 | bsddf (*) Make 'df' act like BSD. |
215 | minixdf Make 'df' act like Minix. | 218 | minixdf Make 'df' act like Minix. |
216 | 219 | ||
217 | check=none Don't do extra checking of bitmaps on mount. | ||
218 | nocheck | ||
219 | |||
220 | debug Extra debugging information is sent to syslog. | 220 | debug Extra debugging information is sent to syslog. |
221 | 221 | ||
222 | errors=remount-ro(*) Remount the filesystem read-only on an error. | 222 | errors=remount-ro(*) Remount the filesystem read-only on an error. |
@@ -253,8 +253,6 @@ nobh (a) cache disk block mapping information | |||
253 | "nobh" option tries to avoid associating buffer | 253 | "nobh" option tries to avoid associating buffer |
254 | heads (supported only for "writeback" mode). | 254 | heads (supported only for "writeback" mode). |
255 | 255 | ||
256 | mballoc (*) Use the multiple block allocator for block allocation | ||
257 | nomballoc disabled multiple block allocator for block allocation. | ||
258 | stripe=n Number of filesystem blocks that mballoc will try | 256 | stripe=n Number of filesystem blocks that mballoc will try |
259 | to use for allocation size and alignment. For RAID5/6 | 257 | to use for allocation size and alignment. For RAID5/6 |
260 | systems this should be the number of data | 258 | systems this should be the number of data |
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index c032bf39e8b9..bcceb99b81dd 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -1384,15 +1384,18 @@ causes the kernel to prefer to reclaim dentries and inodes. | |||
1384 | dirty_background_ratio | 1384 | dirty_background_ratio |
1385 | ---------------------- | 1385 | ---------------------- |
1386 | 1386 | ||
1387 | Contains, as a percentage of total system memory, the number of pages at which | 1387 | Contains, as a percentage of the dirtyable system memory (free pages + mapped |
1388 | the pdflush background writeback daemon will start writing out dirty data. | 1388 | pages + file cache, not including locked pages and HugePages), the number of |
1389 | pages at which the pdflush background writeback daemon will start writing out | ||
1390 | dirty data. | ||
1389 | 1391 | ||
1390 | dirty_ratio | 1392 | dirty_ratio |
1391 | ----------------- | 1393 | ----------------- |
1392 | 1394 | ||
1393 | Contains, as a percentage of total system memory, the number of pages at which | 1395 | Contains, as a percentage of the dirtyable system memory (free pages + mapped |
1394 | a process which is generating disk writes will itself start writing out dirty | 1396 | pages + file cache, not including locked pages and HugePages), the number of |
1395 | data. | 1397 | pages at which a process which is generating disk writes will itself start |
1398 | writing out dirty data. | ||
1396 | 1399 | ||
1397 | dirty_writeback_centisecs | 1400 | dirty_writeback_centisecs |
1398 | ------------------------- | 1401 | ------------------------- |
@@ -2412,24 +2415,29 @@ will be dumped when the <pid> process is dumped. coredump_filter is a bitmask | |||
2412 | of memory types. If a bit of the bitmask is set, memory segments of the | 2415 | of memory types. If a bit of the bitmask is set, memory segments of the |
2413 | corresponding memory type are dumped, otherwise they are not dumped. | 2416 | corresponding memory type are dumped, otherwise they are not dumped. |
2414 | 2417 | ||
2415 | The following 4 memory types are supported: | 2418 | The following 7 memory types are supported: |
2416 | - (bit 0) anonymous private memory | 2419 | - (bit 0) anonymous private memory |
2417 | - (bit 1) anonymous shared memory | 2420 | - (bit 1) anonymous shared memory |
2418 | - (bit 2) file-backed private memory | 2421 | - (bit 2) file-backed private memory |
2419 | - (bit 3) file-backed shared memory | 2422 | - (bit 3) file-backed shared memory |
2420 | - (bit 4) ELF header pages in file-backed private memory areas (it is | 2423 | - (bit 4) ELF header pages in file-backed private memory areas (it is |
2421 | effective only if the bit 2 is cleared) | 2424 | effective only if the bit 2 is cleared) |
2425 | - (bit 5) hugetlb private memory | ||
2426 | - (bit 6) hugetlb shared memory | ||
2422 | 2427 | ||
2423 | Note that MMIO pages such as frame buffer are never dumped and vDSO pages | 2428 | Note that MMIO pages such as frame buffer are never dumped and vDSO pages |
2424 | are always dumped regardless of the bitmask status. | 2429 | are always dumped regardless of the bitmask status. |
2425 | 2430 | ||
2426 | Default value of coredump_filter is 0x3; this means all anonymous memory | 2431 | Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only |
2427 | segments are dumped. | 2432 | effected by bit 5-6. |
2433 | |||
2434 | Default value of coredump_filter is 0x23; this means all anonymous memory | ||
2435 | segments and hugetlb private memory are dumped. | ||
2428 | 2436 | ||
2429 | If you don't want to dump all shared memory segments attached to pid 1234, | 2437 | If you don't want to dump all shared memory segments attached to pid 1234, |
2430 | write 1 to the process's proc file. | 2438 | write 0x21 to the process's proc file. |
2431 | 2439 | ||
2432 | $ echo 0x1 > /proc/1234/coredump_filter | 2440 | $ echo 0x21 > /proc/1234/coredump_filter |
2433 | 2441 | ||
2434 | When a new process is created, the process inherits the bitmask status from its | 2442 | When a new process is created, the process inherits the bitmask status from its |
2435 | parent. It is useful to set up coredump_filter before the program runs. | 2443 | parent. It is useful to set up coredump_filter before the program runs. |
diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt index 6a0d70a22f05..dd84ea3c10da 100644 --- a/Documentation/filesystems/ubifs.txt +++ b/Documentation/filesystems/ubifs.txt | |||
@@ -86,6 +86,15 @@ norm_unmount (*) commit on unmount; the journal is committed | |||
86 | fast_unmount do not commit on unmount; this option makes | 86 | fast_unmount do not commit on unmount; this option makes |
87 | unmount faster, but the next mount slower | 87 | unmount faster, but the next mount slower |
88 | because of the need to replay the journal. | 88 | because of the need to replay the journal. |
89 | bulk_read read more in one go to take advantage of flash | ||
90 | media that read faster sequentially | ||
91 | no_bulk_read (*) do not bulk-read | ||
92 | no_chk_data_crc skip checking of CRCs on data nodes in order to | ||
93 | improve read performance. Use this option only | ||
94 | if the flash media is highly reliable. The effect | ||
95 | of this option is that corruption of the contents | ||
96 | of a file can go unnoticed. | ||
97 | chk_data_crc (*) do not skip checking CRCs on data nodes | ||
89 | 98 | ||
90 | 99 | ||
91 | Quick usage instructions | 100 | Quick usage instructions |
diff --git a/Documentation/ioctl-number.txt b/Documentation/ioctl-number.txt index 1c6b545635a2..b880ce5dbd33 100644 --- a/Documentation/ioctl-number.txt +++ b/Documentation/ioctl-number.txt | |||
@@ -92,6 +92,7 @@ Code Seq# Include File Comments | |||
92 | 'J' 00-1F drivers/scsi/gdth_ioctl.h | 92 | 'J' 00-1F drivers/scsi/gdth_ioctl.h |
93 | 'K' all linux/kd.h | 93 | 'K' all linux/kd.h |
94 | 'L' 00-1F linux/loop.h | 94 | 'L' 00-1F linux/loop.h |
95 | 'L' 20-2F driver/usb/misc/vstusb.h | ||
95 | 'L' E0-FF linux/ppdd.h encrypted disk device driver | 96 | 'L' E0-FF linux/ppdd.h encrypted disk device driver |
96 | <http://linux01.gwdg.de/~alatham/ppdd.html> | 97 | <http://linux01.gwdg.de/~alatham/ppdd.html> |
97 | 'M' all linux/soundcard.h | 98 | 'M' all linux/soundcard.h |
@@ -110,6 +111,8 @@ Code Seq# Include File Comments | |||
110 | 'W' 00-1F linux/wanrouter.h conflict! | 111 | 'W' 00-1F linux/wanrouter.h conflict! |
111 | 'X' all linux/xfs_fs.h | 112 | 'X' all linux/xfs_fs.h |
112 | 'Y' all linux/cyclades.h | 113 | 'Y' all linux/cyclades.h |
114 | '[' 00-07 linux/usb/usbtmc.h USB Test and Measurement Devices | ||
115 | <mailto:gregkh@suse.de> | ||
113 | 'a' all ATM on linux | 116 | 'a' all ATM on linux |
114 | <http://lrcwww.epfl.ch/linux-atm/magic.html> | 117 | <http://lrcwww.epfl.ch/linux-atm/magic.html> |
115 | 'b' 00-FF bit3 vme host bridge | 118 | 'b' 00-FF bit3 vme host bridge |
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index dd28a0d56981..53ba7c7d82b3 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -101,6 +101,7 @@ parameter is applicable: | |||
101 | X86-64 X86-64 architecture is enabled. | 101 | X86-64 X86-64 architecture is enabled. |
102 | More X86-64 boot options can be found in | 102 | More X86-64 boot options can be found in |
103 | Documentation/x86_64/boot-options.txt . | 103 | Documentation/x86_64/boot-options.txt . |
104 | X86 Either 32bit or 64bit x86 (same as X86-32+X86-64) | ||
104 | 105 | ||
105 | In addition, the following text indicates that the option: | 106 | In addition, the following text indicates that the option: |
106 | 107 | ||
@@ -690,7 +691,7 @@ and is between 256 and 4096 characters. It is defined in the file | |||
690 | See Documentation/block/as-iosched.txt and | 691 | See Documentation/block/as-iosched.txt and |
691 | Documentation/block/deadline-iosched.txt for details. | 692 | Documentation/block/deadline-iosched.txt for details. |
692 | 693 | ||
693 | elfcorehdr= [X86-32, X86_64] | 694 | elfcorehdr= [IA64,PPC,SH,X86-32,X86_64] |
694 | Specifies physical address of start of kernel core | 695 | Specifies physical address of start of kernel core |
695 | image elf header. Generally kexec loader will | 696 | image elf header. Generally kexec loader will |
696 | pass this option to capture kernel. | 697 | pass this option to capture kernel. |
@@ -796,6 +797,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
796 | Defaults to the default architecture's huge page size | 797 | Defaults to the default architecture's huge page size |
797 | if not specified. | 798 | if not specified. |
798 | 799 | ||
800 | hlt [BUGS=ARM,SH] | ||
801 | |||
799 | i8042.debug [HW] Toggle i8042 debug mode | 802 | i8042.debug [HW] Toggle i8042 debug mode |
800 | i8042.direct [HW] Put keyboard port into non-translated mode | 803 | i8042.direct [HW] Put keyboard port into non-translated mode |
801 | i8042.dumbkbd [HW] Pretend that controller can only read data from | 804 | i8042.dumbkbd [HW] Pretend that controller can only read data from |
@@ -1211,6 +1214,10 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1211 | mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel | 1214 | mem=nopentium [BUGS=X86-32] Disable usage of 4MB pages for kernel |
1212 | memory. | 1215 | memory. |
1213 | 1216 | ||
1217 | memchunk=nn[KMG] | ||
1218 | [KNL,SH] Allow user to override the default size for | ||
1219 | per-device physically contiguous DMA buffers. | ||
1220 | |||
1214 | memmap=exactmap [KNL,X86-32,X86_64] Enable setting of an exact | 1221 | memmap=exactmap [KNL,X86-32,X86_64] Enable setting of an exact |
1215 | E820 memory map, as specified by the user. | 1222 | E820 memory map, as specified by the user. |
1216 | Such memmap=exactmap lines can be constructed based on | 1223 | Such memmap=exactmap lines can be constructed based on |
@@ -1393,6 +1400,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1393 | 1400 | ||
1394 | nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. | 1401 | nodisconnect [HW,SCSI,M68K] Disables SCSI disconnects. |
1395 | 1402 | ||
1403 | nodsp [SH] Disable hardware DSP at boot time. | ||
1404 | |||
1396 | noefi [X86-32,X86-64] Disable EFI runtime services support. | 1405 | noefi [X86-32,X86-64] Disable EFI runtime services support. |
1397 | 1406 | ||
1398 | noexec [IA-64] | 1407 | noexec [IA-64] |
@@ -1409,13 +1418,15 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1409 | noexec32=off: disable non-executable mappings | 1418 | noexec32=off: disable non-executable mappings |
1410 | read implies executable mappings | 1419 | read implies executable mappings |
1411 | 1420 | ||
1421 | nofpu [SH] Disable hardware FPU at boot time. | ||
1422 | |||
1412 | nofxsr [BUGS=X86-32] Disables x86 floating point extended | 1423 | nofxsr [BUGS=X86-32] Disables x86 floating point extended |
1413 | register save and restore. The kernel will only save | 1424 | register save and restore. The kernel will only save |
1414 | legacy floating-point registers on task switch. | 1425 | legacy floating-point registers on task switch. |
1415 | 1426 | ||
1416 | noclflush [BUGS=X86] Don't use the CLFLUSH instruction | 1427 | noclflush [BUGS=X86] Don't use the CLFLUSH instruction |
1417 | 1428 | ||
1418 | nohlt [BUGS=ARM] | 1429 | nohlt [BUGS=ARM,SH] |
1419 | 1430 | ||
1420 | no-hlt [BUGS=X86-32] Tells the kernel that the hlt | 1431 | no-hlt [BUGS=X86-32] Tells the kernel that the hlt |
1421 | instruction doesn't work correctly and not to | 1432 | instruction doesn't work correctly and not to |
@@ -1578,7 +1589,7 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1578 | See also Documentation/paride.txt. | 1589 | See also Documentation/paride.txt. |
1579 | 1590 | ||
1580 | pci=option[,option...] [PCI] various PCI subsystem options: | 1591 | pci=option[,option...] [PCI] various PCI subsystem options: |
1581 | off [X86-32] don't probe for the PCI bus | 1592 | off [X86] don't probe for the PCI bus |
1582 | bios [X86-32] force use of PCI BIOS, don't access | 1593 | bios [X86-32] force use of PCI BIOS, don't access |
1583 | the hardware directly. Use this if your machine | 1594 | the hardware directly. Use this if your machine |
1584 | has a non-standard PCI host bridge. | 1595 | has a non-standard PCI host bridge. |
@@ -1586,9 +1597,9 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1586 | hardware access methods are allowed. Use this | 1597 | hardware access methods are allowed. Use this |
1587 | if you experience crashes upon bootup and you | 1598 | if you experience crashes upon bootup and you |
1588 | suspect they are caused by the BIOS. | 1599 | suspect they are caused by the BIOS. |
1589 | conf1 [X86-32] Force use of PCI Configuration | 1600 | conf1 [X86] Force use of PCI Configuration |
1590 | Mechanism 1. | 1601 | Mechanism 1. |
1591 | conf2 [X86-32] Force use of PCI Configuration | 1602 | conf2 [X86] Force use of PCI Configuration |
1592 | Mechanism 2. | 1603 | Mechanism 2. |
1593 | noaer [PCIE] If the PCIEAER kernel config parameter is | 1604 | noaer [PCIE] If the PCIEAER kernel config parameter is |
1594 | enabled, this kernel boot option can be used to | 1605 | enabled, this kernel boot option can be used to |
@@ -1608,37 +1619,37 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1608 | this option if the kernel is unable to allocate | 1619 | this option if the kernel is unable to allocate |
1609 | IRQs or discover secondary PCI buses on your | 1620 | IRQs or discover secondary PCI buses on your |
1610 | motherboard. | 1621 | motherboard. |
1611 | rom [X86-32] Assign address space to expansion ROMs. | 1622 | rom [X86] Assign address space to expansion ROMs. |
1612 | Use with caution as certain devices share | 1623 | Use with caution as certain devices share |
1613 | address decoders between ROMs and other | 1624 | address decoders between ROMs and other |
1614 | resources. | 1625 | resources. |
1615 | norom [X86-32,X86_64] Do not assign address space to | 1626 | norom [X86] Do not assign address space to |
1616 | expansion ROMs that do not already have | 1627 | expansion ROMs that do not already have |
1617 | BIOS assigned address ranges. | 1628 | BIOS assigned address ranges. |
1618 | irqmask=0xMMMM [X86-32] Set a bit mask of IRQs allowed to be | 1629 | irqmask=0xMMMM [X86] Set a bit mask of IRQs allowed to be |
1619 | assigned automatically to PCI devices. You can | 1630 | assigned automatically to PCI devices. You can |
1620 | make the kernel exclude IRQs of your ISA cards | 1631 | make the kernel exclude IRQs of your ISA cards |
1621 | this way. | 1632 | this way. |
1622 | pirqaddr=0xAAAAA [X86-32] Specify the physical address | 1633 | pirqaddr=0xAAAAA [X86] Specify the physical address |
1623 | of the PIRQ table (normally generated | 1634 | of the PIRQ table (normally generated |
1624 | by the BIOS) if it is outside the | 1635 | by the BIOS) if it is outside the |
1625 | F0000h-100000h range. | 1636 | F0000h-100000h range. |
1626 | lastbus=N [X86-32] Scan all buses thru bus #N. Can be | 1637 | lastbus=N [X86] Scan all buses thru bus #N. Can be |
1627 | useful if the kernel is unable to find your | 1638 | useful if the kernel is unable to find your |
1628 | secondary buses and you want to tell it | 1639 | secondary buses and you want to tell it |
1629 | explicitly which ones they are. | 1640 | explicitly which ones they are. |
1630 | assign-busses [X86-32] Always assign all PCI bus | 1641 | assign-busses [X86] Always assign all PCI bus |
1631 | numbers ourselves, overriding | 1642 | numbers ourselves, overriding |
1632 | whatever the firmware may have done. | 1643 | whatever the firmware may have done. |
1633 | usepirqmask [X86-32] Honor the possible IRQ mask stored | 1644 | usepirqmask [X86] Honor the possible IRQ mask stored |
1634 | in the BIOS $PIR table. This is needed on | 1645 | in the BIOS $PIR table. This is needed on |
1635 | some systems with broken BIOSes, notably | 1646 | some systems with broken BIOSes, notably |
1636 | some HP Pavilion N5400 and Omnibook XE3 | 1647 | some HP Pavilion N5400 and Omnibook XE3 |
1637 | notebooks. This will have no effect if ACPI | 1648 | notebooks. This will have no effect if ACPI |
1638 | IRQ routing is enabled. | 1649 | IRQ routing is enabled. |
1639 | noacpi [X86-32] Do not use ACPI for IRQ routing | 1650 | noacpi [X86] Do not use ACPI for IRQ routing |
1640 | or for PCI scanning. | 1651 | or for PCI scanning. |
1641 | use_crs [X86-32] Use _CRS for PCI resource | 1652 | use_crs [X86] Use _CRS for PCI resource |
1642 | allocation. | 1653 | allocation. |
1643 | routeirq Do IRQ routing for all PCI devices. | 1654 | routeirq Do IRQ routing for all PCI devices. |
1644 | This is normally done in pci_enable_device(), | 1655 | This is normally done in pci_enable_device(), |
@@ -1667,6 +1678,12 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1667 | reserved for the CardBus bridge's memory | 1678 | reserved for the CardBus bridge's memory |
1668 | window. The default value is 64 megabytes. | 1679 | window. The default value is 64 megabytes. |
1669 | 1680 | ||
1681 | pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power | ||
1682 | Management. | ||
1683 | off Disable ASPM. | ||
1684 | force Enable ASPM even on devices that claim not to support it. | ||
1685 | WARNING: Forcing ASPM on may cause system lockups. | ||
1686 | |||
1670 | pcmv= [HW,PCMCIA] BadgePAD 4 | 1687 | pcmv= [HW,PCMCIA] BadgePAD 4 |
1671 | 1688 | ||
1672 | pd. [PARIDE] | 1689 | pd. [PARIDE] |
@@ -2253,6 +2270,25 @@ and is between 256 and 4096 characters. It is defined in the file | |||
2253 | autosuspended. Devices for which the delay is set | 2270 | autosuspended. Devices for which the delay is set |
2254 | to a negative value won't be autosuspended at all. | 2271 | to a negative value won't be autosuspended at all. |
2255 | 2272 | ||
2273 | usbcore.usbfs_snoop= | ||
2274 | [USB] Set to log all usbfs traffic (default 0 = off). | ||
2275 | |||
2276 | usbcore.blinkenlights= | ||
2277 | [USB] Set to cycle leds on hubs (default 0 = off). | ||
2278 | |||
2279 | usbcore.old_scheme_first= | ||
2280 | [USB] Start with the old device initialization | ||
2281 | scheme (default 0 = off). | ||
2282 | |||
2283 | usbcore.use_both_schemes= | ||
2284 | [USB] Try the other device initialization scheme | ||
2285 | if the first one fails (default 1 = enabled). | ||
2286 | |||
2287 | usbcore.initial_descriptor_timeout= | ||
2288 | [USB] Specifies timeout for the initial 64-byte | ||
2289 | USB_REQ_GET_DESCRIPTOR request in milliseconds | ||
2290 | (default 5000 = 5.0 seconds). | ||
2291 | |||
2256 | usbhid.mousepoll= | 2292 | usbhid.mousepoll= |
2257 | [USBHID] The interval which mice are to be polled at. | 2293 | [USBHID] The interval which mice are to be polled at. |
2258 | 2294 | ||
diff --git a/Documentation/markers.txt b/Documentation/markers.txt index d9f50a19fa0c..089f6138fcd9 100644 --- a/Documentation/markers.txt +++ b/Documentation/markers.txt | |||
@@ -50,10 +50,12 @@ Connecting a function (probe) to a marker is done by providing a probe (function | |||
50 | to call) for the specific marker through marker_probe_register() and can be | 50 | to call) for the specific marker through marker_probe_register() and can be |
51 | activated by calling marker_arm(). Marker deactivation can be done by calling | 51 | activated by calling marker_arm(). Marker deactivation can be done by calling |
52 | marker_disarm() as many times as marker_arm() has been called. Removing a probe | 52 | marker_disarm() as many times as marker_arm() has been called. Removing a probe |
53 | is done through marker_probe_unregister(); it will disarm the probe and make | 53 | is done through marker_probe_unregister(); it will disarm the probe. |
54 | sure there is no caller left using the probe when it returns. Probe removal is | 54 | marker_synchronize_unregister() must be called before the end of the module exit |
55 | preempt-safe because preemption is disabled around the probe call. See the | 55 | function to make sure there is no caller left using the probe. This, and the |
56 | "Probe example" section below for a sample probe module. | 56 | fact that preemption is disabled around the probe call, make sure that probe |
57 | removal and module unload are safe. See the "Probe example" section below for a | ||
58 | sample probe module. | ||
57 | 59 | ||
58 | The marker mechanism supports inserting multiple instances of the same marker. | 60 | The marker mechanism supports inserting multiple instances of the same marker. |
59 | Markers can be put in inline functions, inlined static functions, and | 61 | Markers can be put in inline functions, inlined static functions, and |
diff --git a/Documentation/mtd/nand_ecc.txt b/Documentation/mtd/nand_ecc.txt new file mode 100644 index 000000000000..bdf93b7f0f24 --- /dev/null +++ b/Documentation/mtd/nand_ecc.txt | |||
@@ -0,0 +1,714 @@ | |||
1 | Introduction | ||
2 | ============ | ||
3 | |||
4 | Having looked at the linux mtd/nand driver and more specific at nand_ecc.c | ||
5 | I felt there was room for optimisation. I bashed the code for a few hours | ||
6 | performing tricks like table lookup removing superfluous code etc. | ||
7 | After that the speed was increased by 35-40%. | ||
8 | Still I was not too happy as I felt there was additional room for improvement. | ||
9 | |||
10 | Bad! I was hooked. | ||
11 | I decided to annotate my steps in this file. Perhaps it is useful to someone | ||
12 | or someone learns something from it. | ||
13 | |||
14 | |||
15 | The problem | ||
16 | =========== | ||
17 | |||
18 | NAND flash (at least SLC one) typically has sectors of 256 bytes. | ||
19 | However NAND flash is not extremely reliable so some error detection | ||
20 | (and sometimes correction) is needed. | ||
21 | |||
22 | This is done by means of a Hamming code. I'll try to explain it in | ||
23 | laymans terms (and apologies to all the pro's in the field in case I do | ||
24 | not use the right terminology, my coding theory class was almost 30 | ||
25 | years ago, and I must admit it was not one of my favourites). | ||
26 | |||
27 | As I said before the ecc calculation is performed on sectors of 256 | ||
28 | bytes. This is done by calculating several parity bits over the rows and | ||
29 | columns. The parity used is even parity which means that the parity bit = 1 | ||
30 | if the data over which the parity is calculated is 1 and the parity bit = 0 | ||
31 | if the data over which the parity is calculated is 0. So the total | ||
32 | number of bits over the data over which the parity is calculated + the | ||
33 | parity bit is even. (see wikipedia if you can't follow this). | ||
34 | Parity is often calculated by means of an exclusive or operation, | ||
35 | sometimes also referred to as xor. In C the operator for xor is ^ | ||
36 | |||
37 | Back to ecc. | ||
38 | Let's give a small figure: | ||
39 | |||
40 | byte 0: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp4 ... rp14 | ||
41 | byte 1: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp2 rp4 ... rp14 | ||
42 | byte 2: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp4 ... rp14 | ||
43 | byte 3: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp4 ... rp14 | ||
44 | byte 4: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp2 rp5 ... rp14 | ||
45 | .... | ||
46 | byte 254: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp0 rp3 rp5 ... rp15 | ||
47 | byte 255: bit7 bit6 bit5 bit4 bit3 bit2 bit1 bit0 rp1 rp3 rp5 ... rp15 | ||
48 | cp1 cp0 cp1 cp0 cp1 cp0 cp1 cp0 | ||
49 | cp3 cp3 cp2 cp2 cp3 cp3 cp2 cp2 | ||
50 | cp5 cp5 cp5 cp5 cp4 cp4 cp4 cp4 | ||
51 | |||
52 | This figure represents a sector of 256 bytes. | ||
53 | cp is my abbreviaton for column parity, rp for row parity. | ||
54 | |||
55 | Let's start to explain column parity. | ||
56 | cp0 is the parity that belongs to all bit0, bit2, bit4, bit6. | ||
57 | so the sum of all bit0, bit2, bit4 and bit6 values + cp0 itself is even. | ||
58 | Similarly cp1 is the sum of all bit1, bit3, bit5 and bit7. | ||
59 | cp2 is the parity over bit0, bit1, bit4 and bit5 | ||
60 | cp3 is the parity over bit2, bit3, bit6 and bit7. | ||
61 | cp4 is the parity over bit0, bit1, bit2 and bit3. | ||
62 | cp5 is the parity over bit4, bit5, bit6 and bit7. | ||
63 | Note that each of cp0 .. cp5 is exactly one bit. | ||
64 | |||
65 | Row parity actually works almost the same. | ||
66 | rp0 is the parity of all even bytes (0, 2, 4, 6, ... 252, 254) | ||
67 | rp1 is the parity of all odd bytes (1, 3, 5, 7, ..., 253, 255) | ||
68 | rp2 is the parity of all bytes 0, 1, 4, 5, 8, 9, ... | ||
69 | (so handle two bytes, then skip 2 bytes). | ||
70 | rp3 is covers the half rp2 does not cover (bytes 2, 3, 6, 7, 10, 11, ...) | ||
71 | for rp4 the rule is cover 4 bytes, skip 4 bytes, cover 4 bytes, skip 4 etc. | ||
72 | so rp4 calculates parity over bytes 0, 1, 2, 3, 8, 9, 10, 11, 16, ...) | ||
73 | and rp5 covers the other half, so bytes 4, 5, 6, 7, 12, 13, 14, 15, 20, .. | ||
74 | The story now becomes quite boring. I guess you get the idea. | ||
75 | rp6 covers 8 bytes then skips 8 etc | ||
76 | rp7 skips 8 bytes then covers 8 etc | ||
77 | rp8 covers 16 bytes then skips 16 etc | ||
78 | rp9 skips 16 bytes then covers 16 etc | ||
79 | rp10 covers 32 bytes then skips 32 etc | ||
80 | rp11 skips 32 bytes then covers 32 etc | ||
81 | rp12 covers 64 bytes then skips 64 etc | ||
82 | rp13 skips 64 bytes then covers 64 etc | ||
83 | rp14 covers 128 bytes then skips 128 | ||
84 | rp15 skips 128 bytes then covers 128 | ||
85 | |||
86 | In the end the parity bits are grouped together in three bytes as | ||
87 | follows: | ||
88 | ECC Bit 7 Bit 6 Bit 5 Bit 4 Bit 3 Bit 2 Bit 1 Bit 0 | ||
89 | ECC 0 rp07 rp06 rp05 rp04 rp03 rp02 rp01 rp00 | ||
90 | ECC 1 rp15 rp14 rp13 rp12 rp11 rp10 rp09 rp08 | ||
91 | ECC 2 cp5 cp4 cp3 cp2 cp1 cp0 1 1 | ||
92 | |||
93 | I detected after writing this that ST application note AN1823 | ||
94 | (http://www.st.com/stonline/books/pdf/docs/10123.pdf) gives a much | ||
95 | nicer picture.(but they use line parity as term where I use row parity) | ||
96 | Oh well, I'm graphically challenged, so suffer with me for a moment :-) | ||
97 | And I could not reuse the ST picture anyway for copyright reasons. | ||
98 | |||
99 | |||
100 | Attempt 0 | ||
101 | ========= | ||
102 | |||
103 | Implementing the parity calculation is pretty simple. | ||
104 | In C pseudocode: | ||
105 | for (i = 0; i < 256; i++) | ||
106 | { | ||
107 | if (i & 0x01) | ||
108 | rp1 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; | ||
109 | else | ||
110 | rp0 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp1; | ||
111 | if (i & 0x02) | ||
112 | rp3 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp3; | ||
113 | else | ||
114 | rp2 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp2; | ||
115 | if (i & 0x04) | ||
116 | rp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp5; | ||
117 | else | ||
118 | rp4 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp4; | ||
119 | if (i & 0x08) | ||
120 | rp7 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp7; | ||
121 | else | ||
122 | rp6 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp6; | ||
123 | if (i & 0x10) | ||
124 | rp9 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp9; | ||
125 | else | ||
126 | rp8 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp8; | ||
127 | if (i & 0x20) | ||
128 | rp11 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp11; | ||
129 | else | ||
130 | rp10 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp10; | ||
131 | if (i & 0x40) | ||
132 | rp13 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp13; | ||
133 | else | ||
134 | rp12 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp12; | ||
135 | if (i & 0x80) | ||
136 | rp15 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp15; | ||
137 | else | ||
138 | rp14 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ bit3 ^ bit2 ^ bit1 ^ bit0 ^ rp14; | ||
139 | cp0 = bit6 ^ bit4 ^ bit2 ^ bit0 ^ cp0; | ||
140 | cp1 = bit7 ^ bit5 ^ bit3 ^ bit1 ^ cp1; | ||
141 | cp2 = bit5 ^ bit4 ^ bit1 ^ bit0 ^ cp2; | ||
142 | cp3 = bit7 ^ bit6 ^ bit3 ^ bit2 ^ cp3 | ||
143 | cp4 = bit3 ^ bit2 ^ bit1 ^ bit0 ^ cp4 | ||
144 | cp5 = bit7 ^ bit6 ^ bit5 ^ bit4 ^ cp5 | ||
145 | } | ||
146 | |||
147 | |||
148 | Analysis 0 | ||
149 | ========== | ||
150 | |||
151 | C does have bitwise operators but not really operators to do the above | ||
152 | efficiently (and most hardware has no such instructions either). | ||
153 | Therefore without implementing this it was clear that the code above was | ||
154 | not going to bring me a Nobel prize :-) | ||
155 | |||
156 | Fortunately the exclusive or operation is commutative, so we can combine | ||
157 | the values in any order. So instead of calculating all the bits | ||
158 | individually, let us try to rearrange things. | ||
159 | For the column parity this is easy. We can just xor the bytes and in the | ||
160 | end filter out the relevant bits. This is pretty nice as it will bring | ||
161 | all cp calculation out of the if loop. | ||
162 | |||
163 | Similarly we can first xor the bytes for the various rows. | ||
164 | This leads to: | ||
165 | |||
166 | |||
167 | Attempt 1 | ||
168 | ========= | ||
169 | |||
170 | const char parity[256] = { | ||
171 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
172 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
173 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
174 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
175 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
176 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
177 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
178 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
179 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
180 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
181 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
182 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
183 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, | ||
184 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
185 | 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, | ||
186 | 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0 | ||
187 | }; | ||
188 | |||
189 | void ecc1(const unsigned char *buf, unsigned char *code) | ||
190 | { | ||
191 | int i; | ||
192 | const unsigned char *bp = buf; | ||
193 | unsigned char cur; | ||
194 | unsigned char rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | ||
195 | unsigned char rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | ||
196 | unsigned char par; | ||
197 | |||
198 | par = 0; | ||
199 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | ||
200 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | ||
201 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | ||
202 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | ||
203 | |||
204 | for (i = 0; i < 256; i++) | ||
205 | { | ||
206 | cur = *bp++; | ||
207 | par ^= cur; | ||
208 | if (i & 0x01) rp1 ^= cur; else rp0 ^= cur; | ||
209 | if (i & 0x02) rp3 ^= cur; else rp2 ^= cur; | ||
210 | if (i & 0x04) rp5 ^= cur; else rp4 ^= cur; | ||
211 | if (i & 0x08) rp7 ^= cur; else rp6 ^= cur; | ||
212 | if (i & 0x10) rp9 ^= cur; else rp8 ^= cur; | ||
213 | if (i & 0x20) rp11 ^= cur; else rp10 ^= cur; | ||
214 | if (i & 0x40) rp13 ^= cur; else rp12 ^= cur; | ||
215 | if (i & 0x80) rp15 ^= cur; else rp14 ^= cur; | ||
216 | } | ||
217 | code[0] = | ||
218 | (parity[rp7] << 7) | | ||
219 | (parity[rp6] << 6) | | ||
220 | (parity[rp5] << 5) | | ||
221 | (parity[rp4] << 4) | | ||
222 | (parity[rp3] << 3) | | ||
223 | (parity[rp2] << 2) | | ||
224 | (parity[rp1] << 1) | | ||
225 | (parity[rp0]); | ||
226 | code[1] = | ||
227 | (parity[rp15] << 7) | | ||
228 | (parity[rp14] << 6) | | ||
229 | (parity[rp13] << 5) | | ||
230 | (parity[rp12] << 4) | | ||
231 | (parity[rp11] << 3) | | ||
232 | (parity[rp10] << 2) | | ||
233 | (parity[rp9] << 1) | | ||
234 | (parity[rp8]); | ||
235 | code[2] = | ||
236 | (parity[par & 0xf0] << 7) | | ||
237 | (parity[par & 0x0f] << 6) | | ||
238 | (parity[par & 0xcc] << 5) | | ||
239 | (parity[par & 0x33] << 4) | | ||
240 | (parity[par & 0xaa] << 3) | | ||
241 | (parity[par & 0x55] << 2); | ||
242 | code[0] = ~code[0]; | ||
243 | code[1] = ~code[1]; | ||
244 | code[2] = ~code[2]; | ||
245 | } | ||
246 | |||
247 | Still pretty straightforward. The last three invert statements are there to | ||
248 | give a checksum of 0xff 0xff 0xff for an empty flash. In an empty flash | ||
249 | all data is 0xff, so the checksum then matches. | ||
250 | |||
251 | I also introduced the parity lookup. I expected this to be the fastest | ||
252 | way to calculate the parity, but I will investigate alternatives later | ||
253 | on. | ||
254 | |||
255 | |||
256 | Analysis 1 | ||
257 | ========== | ||
258 | |||
259 | The code works, but is not terribly efficient. On my system it took | ||
260 | almost 4 times as much time as the linux driver code. But hey, if it was | ||
261 | *that* easy this would have been done long before. | ||
262 | No pain. no gain. | ||
263 | |||
264 | Fortunately there is plenty of room for improvement. | ||
265 | |||
266 | In step 1 we moved from bit-wise calculation to byte-wise calculation. | ||
267 | However in C we can also use the unsigned long data type and virtually | ||
268 | every modern microprocessor supports 32 bit operations, so why not try | ||
269 | to write our code in such a way that we process data in 32 bit chunks. | ||
270 | |||
271 | Of course this means some modification as the row parity is byte by | ||
272 | byte. A quick analysis: | ||
273 | for the column parity we use the par variable. When extending to 32 bits | ||
274 | we can in the end easily calculate p0 and p1 from it. | ||
275 | (because par now consists of 4 bytes, contributing to rp1, rp0, rp1, rp0 | ||
276 | respectively) | ||
277 | also rp2 and rp3 can be easily retrieved from par as rp3 covers the | ||
278 | first two bytes and rp2 the last two bytes. | ||
279 | |||
280 | Note that of course now the loop is executed only 64 times (256/4). | ||
281 | And note that care must taken wrt byte ordering. The way bytes are | ||
282 | ordered in a long is machine dependent, and might affect us. | ||
283 | Anyway, if there is an issue: this code is developed on x86 (to be | ||
284 | precise: a DELL PC with a D920 Intel CPU) | ||
285 | |||
286 | And of course the performance might depend on alignment, but I expect | ||
287 | that the I/O buffers in the nand driver are aligned properly (and | ||
288 | otherwise that should be fixed to get maximum performance). | ||
289 | |||
290 | Let's give it a try... | ||
291 | |||
292 | |||
293 | Attempt 2 | ||
294 | ========= | ||
295 | |||
296 | extern const char parity[256]; | ||
297 | |||
298 | void ecc2(const unsigned char *buf, unsigned char *code) | ||
299 | { | ||
300 | int i; | ||
301 | const unsigned long *bp = (unsigned long *)buf; | ||
302 | unsigned long cur; | ||
303 | unsigned long rp0, rp1, rp2, rp3, rp4, rp5, rp6, rp7; | ||
304 | unsigned long rp8, rp9, rp10, rp11, rp12, rp13, rp14, rp15; | ||
305 | unsigned long par; | ||
306 | |||
307 | par = 0; | ||
308 | rp0 = 0; rp1 = 0; rp2 = 0; rp3 = 0; | ||
309 | rp4 = 0; rp5 = 0; rp6 = 0; rp7 = 0; | ||
310 | rp8 = 0; rp9 = 0; rp10 = 0; rp11 = 0; | ||
311 | rp12 = 0; rp13 = 0; rp14 = 0; rp15 = 0; | ||
312 | |||
313 | for (i = 0; i < 64; i++) | ||
314 | { | ||
315 | cur = *bp++; | ||
316 | par ^= cur; | ||
317 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | ||
318 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | ||
319 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | ||
320 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | ||
321 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | ||
322 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | ||
323 | } | ||
324 | /* | ||
325 | we need to adapt the code generation for the fact that rp vars are now | ||
326 | long; also the column parity calculation needs to be changed. | ||
327 | we'll bring rp4 to 15 back to single byte entities by shifting and | ||
328 | xoring | ||
329 | */ | ||
330 | rp4 ^= (rp4 >> 16); rp4 ^= (rp4 >> 8); rp4 &= 0xff; | ||
331 | rp5 ^= (rp5 >> 16); rp5 ^= (rp5 >> 8); rp5 &= 0xff; | ||
332 | rp6 ^= (rp6 >> 16); rp6 ^= (rp6 >> 8); rp6 &= 0xff; | ||
333 | rp7 ^= (rp7 >> 16); rp7 ^= (rp7 >> 8); rp7 &= 0xff; | ||
334 | rp8 ^= (rp8 >> 16); rp8 ^= (rp8 >> 8); rp8 &= 0xff; | ||
335 | rp9 ^= (rp9 >> 16); rp9 ^= (rp9 >> 8); rp9 &= 0xff; | ||
336 | rp10 ^= (rp10 >> 16); rp10 ^= (rp10 >> 8); rp10 &= 0xff; | ||
337 | rp11 ^= (rp11 >> 16); rp11 ^= (rp11 >> 8); rp11 &= 0xff; | ||
338 | rp12 ^= (rp12 >> 16); rp12 ^= (rp12 >> 8); rp12 &= 0xff; | ||
339 | rp13 ^= (rp13 >> 16); rp13 ^= (rp13 >> 8); rp13 &= 0xff; | ||
340 | rp14 ^= (rp14 >> 16); rp14 ^= (rp14 >> 8); rp14 &= 0xff; | ||
341 | rp15 ^= (rp15 >> 16); rp15 ^= (rp15 >> 8); rp15 &= 0xff; | ||
342 | rp3 = (par >> 16); rp3 ^= (rp3 >> 8); rp3 &= 0xff; | ||
343 | rp2 = par & 0xffff; rp2 ^= (rp2 >> 8); rp2 &= 0xff; | ||
344 | par ^= (par >> 16); | ||
345 | rp1 = (par >> 8); rp1 &= 0xff; | ||
346 | rp0 = (par & 0xff); | ||
347 | par ^= (par >> 8); par &= 0xff; | ||
348 | |||
349 | code[0] = | ||
350 | (parity[rp7] << 7) | | ||
351 | (parity[rp6] << 6) | | ||
352 | (parity[rp5] << 5) | | ||
353 | (parity[rp4] << 4) | | ||
354 | (parity[rp3] << 3) | | ||
355 | (parity[rp2] << 2) | | ||
356 | (parity[rp1] << 1) | | ||
357 | (parity[rp0]); | ||
358 | code[1] = | ||
359 | (parity[rp15] << 7) | | ||
360 | (parity[rp14] << 6) | | ||
361 | (parity[rp13] << 5) | | ||
362 | (parity[rp12] << 4) | | ||
363 | (parity[rp11] << 3) | | ||
364 | (parity[rp10] << 2) | | ||
365 | (parity[rp9] << 1) | | ||
366 | (parity[rp8]); | ||
367 | code[2] = | ||
368 | (parity[par & 0xf0] << 7) | | ||
369 | (parity[par & 0x0f] << 6) | | ||
370 | (parity[par & 0xcc] << 5) | | ||
371 | (parity[par & 0x33] << 4) | | ||
372 | (parity[par & 0xaa] << 3) | | ||
373 | (parity[par & 0x55] << 2); | ||
374 | code[0] = ~code[0]; | ||
375 | code[1] = ~code[1]; | ||
376 | code[2] = ~code[2]; | ||
377 | } | ||
378 | |||
379 | The parity array is not shown any more. Note also that for these | ||
380 | examples I kinda deviated from my regular programming style by allowing | ||
381 | multiple statements on a line, not using { } in then and else blocks | ||
382 | with only a single statement and by using operators like ^= | ||
383 | |||
384 | |||
385 | Analysis 2 | ||
386 | ========== | ||
387 | |||
388 | The code (of course) works, and hurray: we are a little bit faster than | ||
389 | the linux driver code (about 15%). But wait, don't cheer too quickly. | ||
390 | THere is more to be gained. | ||
391 | If we look at e.g. rp14 and rp15 we see that we either xor our data with | ||
392 | rp14 or with rp15. However we also have par which goes over all data. | ||
393 | This means there is no need to calculate rp14 as it can be calculated from | ||
394 | rp15 through rp14 = par ^ rp15; | ||
395 | (or if desired we can avoid calculating rp15 and calculate it from | ||
396 | rp14). That is why some places refer to inverse parity. | ||
397 | Of course the same thing holds for rp4/5, rp6/7, rp8/9, rp10/11 and rp12/13. | ||
398 | Effectively this means we can eliminate the else clause from the if | ||
399 | statements. Also we can optimise the calculation in the end a little bit | ||
400 | by going from long to byte first. Actually we can even avoid the table | ||
401 | lookups | ||
402 | |||
403 | Attempt 3 | ||
404 | ========= | ||
405 | |||
406 | Odd replaced: | ||
407 | if (i & 0x01) rp5 ^= cur; else rp4 ^= cur; | ||
408 | if (i & 0x02) rp7 ^= cur; else rp6 ^= cur; | ||
409 | if (i & 0x04) rp9 ^= cur; else rp8 ^= cur; | ||
410 | if (i & 0x08) rp11 ^= cur; else rp10 ^= cur; | ||
411 | if (i & 0x10) rp13 ^= cur; else rp12 ^= cur; | ||
412 | if (i & 0x20) rp15 ^= cur; else rp14 ^= cur; | ||
413 | with | ||
414 | if (i & 0x01) rp5 ^= cur; | ||
415 | if (i & 0x02) rp7 ^= cur; | ||
416 | if (i & 0x04) rp9 ^= cur; | ||
417 | if (i & 0x08) rp11 ^= cur; | ||
418 | if (i & 0x10) rp13 ^= cur; | ||
419 | if (i & 0x20) rp15 ^= cur; | ||
420 | |||
421 | and outside the loop added: | ||
422 | rp4 = par ^ rp5; | ||
423 | rp6 = par ^ rp7; | ||
424 | rp8 = par ^ rp9; | ||
425 | rp10 = par ^ rp11; | ||
426 | rp12 = par ^ rp13; | ||
427 | rp14 = par ^ rp15; | ||
428 | |||
429 | And after that the code takes about 30% more time, although the number of | ||
430 | statements is reduced. This is also reflected in the assembly code. | ||
431 | |||
432 | |||
433 | Analysis 3 | ||
434 | ========== | ||
435 | |||
436 | Very weird. Guess it has to do with caching or instruction parallellism | ||
437 | or so. I also tried on an eeePC (Celeron, clocked at 900 Mhz). Interesting | ||
438 | observation was that this one is only 30% slower (according to time) | ||
439 | executing the code as my 3Ghz D920 processor. | ||
440 | |||
441 | Well, it was expected not to be easy so maybe instead move to a | ||
442 | different track: let's move back to the code from attempt2 and do some | ||
443 | loop unrolling. This will eliminate a few if statements. I'll try | ||
444 | different amounts of unrolling to see what works best. | ||
445 | |||
446 | |||
447 | Attempt 4 | ||
448 | ========= | ||
449 | |||
450 | Unrolled the loop 1, 2, 3 and 4 times. | ||
451 | For 4 the code starts with: | ||
452 | |||
453 | for (i = 0; i < 4; i++) | ||
454 | { | ||
455 | cur = *bp++; | ||
456 | par ^= cur; | ||
457 | rp4 ^= cur; | ||
458 | rp6 ^= cur; | ||
459 | rp8 ^= cur; | ||
460 | rp10 ^= cur; | ||
461 | if (i & 0x1) rp13 ^= cur; else rp12 ^= cur; | ||
462 | if (i & 0x2) rp15 ^= cur; else rp14 ^= cur; | ||
463 | cur = *bp++; | ||
464 | par ^= cur; | ||
465 | rp5 ^= cur; | ||
466 | rp6 ^= cur; | ||
467 | ... | ||
468 | |||
469 | |||
470 | Analysis 4 | ||
471 | ========== | ||
472 | |||
473 | Unrolling once gains about 15% | ||
474 | Unrolling twice keeps the gain at about 15% | ||
475 | Unrolling three times gives a gain of 30% compared to attempt 2. | ||
476 | Unrolling four times gives a marginal improvement compared to unrolling | ||
477 | three times. | ||
478 | |||
479 | I decided to proceed with a four time unrolled loop anyway. It was my gut | ||
480 | feeling that in the next steps I would obtain additional gain from it. | ||
481 | |||
482 | The next step was triggered by the fact that par contains the xor of all | ||
483 | bytes and rp4 and rp5 each contain the xor of half of the bytes. | ||
484 | So in effect par = rp4 ^ rp5. But as xor is commutative we can also say | ||
485 | that rp5 = par ^ rp4. So no need to keep both rp4 and rp5 around. We can | ||
486 | eliminate rp5 (or rp4, but I already foresaw another optimisation). | ||
487 | The same holds for rp6/7, rp8/9, rp10/11 rp12/13 and rp14/15. | ||
488 | |||
489 | |||
490 | Attempt 5 | ||
491 | ========= | ||
492 | |||
493 | Effectively so all odd digit rp assignments in the loop were removed. | ||
494 | This included the else clause of the if statements. | ||
495 | Of course after the loop we need to correct things by adding code like: | ||
496 | rp5 = par ^ rp4; | ||
497 | Also the initial assignments (rp5 = 0; etc) could be removed. | ||
498 | Along the line I also removed the initialisation of rp0/1/2/3. | ||
499 | |||
500 | |||
501 | Analysis 5 | ||
502 | ========== | ||
503 | |||
504 | Measurements showed this was a good move. The run-time roughly halved | ||
505 | compared with attempt 4 with 4 times unrolled, and we only require 1/3rd | ||
506 | of the processor time compared to the current code in the linux kernel. | ||
507 | |||
508 | However, still I thought there was more. I didn't like all the if | ||
509 | statements. Why not keep a running parity and only keep the last if | ||
510 | statement. Time for yet another version! | ||
511 | |||
512 | |||
513 | Attempt 6 | ||
514 | ========= | ||
515 | |||
516 | THe code within the for loop was changed to: | ||
517 | |||
518 | for (i = 0; i < 4; i++) | ||
519 | { | ||
520 | cur = *bp++; tmppar = cur; rp4 ^= cur; | ||
521 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | ||
522 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
523 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | ||
524 | |||
525 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | ||
526 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
527 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
528 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | ||
529 | |||
530 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; rp8 ^= cur; | ||
531 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; rp8 ^= cur; | ||
532 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp8 ^= cur; | ||
533 | cur = *bp++; tmppar ^= cur; rp8 ^= cur; | ||
534 | |||
535 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; rp6 ^= cur; | ||
536 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
537 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
538 | cur = *bp++; tmppar ^= cur; | ||
539 | |||
540 | par ^= tmppar; | ||
541 | if ((i & 0x1) == 0) rp12 ^= tmppar; | ||
542 | if ((i & 0x2) == 0) rp14 ^= tmppar; | ||
543 | } | ||
544 | |||
545 | As you can see tmppar is used to accumulate the parity within a for | ||
546 | iteration. In the last 3 statements is is added to par and, if needed, | ||
547 | to rp12 and rp14. | ||
548 | |||
549 | While making the changes I also found that I could exploit that tmppar | ||
550 | contains the running parity for this iteration. So instead of having: | ||
551 | rp4 ^= cur; rp6 = cur; | ||
552 | I removed the rp6 = cur; statement and did rp6 ^= tmppar; on next | ||
553 | statement. A similar change was done for rp8 and rp10 | ||
554 | |||
555 | |||
556 | Analysis 6 | ||
557 | ========== | ||
558 | |||
559 | Measuring this code again showed big gain. When executing the original | ||
560 | linux code 1 million times, this took about 1 second on my system. | ||
561 | (using time to measure the performance). After this iteration I was back | ||
562 | to 0.075 sec. Actually I had to decide to start measuring over 10 | ||
563 | million interations in order not to loose too much accuracy. This one | ||
564 | definitely seemed to be the jackpot! | ||
565 | |||
566 | There is a little bit more room for improvement though. There are three | ||
567 | places with statements: | ||
568 | rp4 ^= cur; rp6 ^= cur; | ||
569 | It seems more efficient to also maintain a variable rp4_6 in the while | ||
570 | loop; This eliminates 3 statements per loop. Of course after the loop we | ||
571 | need to correct by adding: | ||
572 | rp4 ^= rp4_6; | ||
573 | rp6 ^= rp4_6 | ||
574 | Furthermore there are 4 sequential assingments to rp8. This can be | ||
575 | encoded slightly more efficient by saving tmppar before those 4 lines | ||
576 | and later do rp8 = rp8 ^ tmppar ^ notrp8; | ||
577 | (where notrp8 is the value of rp8 before those 4 lines). | ||
578 | Again a use of the commutative property of xor. | ||
579 | Time for a new test! | ||
580 | |||
581 | |||
582 | Attempt 7 | ||
583 | ========= | ||
584 | |||
585 | The new code now looks like: | ||
586 | |||
587 | for (i = 0; i < 4; i++) | ||
588 | { | ||
589 | cur = *bp++; tmppar = cur; rp4 ^= cur; | ||
590 | cur = *bp++; tmppar ^= cur; rp6 ^= tmppar; | ||
591 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
592 | cur = *bp++; tmppar ^= cur; rp8 ^= tmppar; | ||
593 | |||
594 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
595 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
596 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
597 | cur = *bp++; tmppar ^= cur; rp10 ^= tmppar; | ||
598 | |||
599 | notrp8 = tmppar; | ||
600 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
601 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
602 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
603 | cur = *bp++; tmppar ^= cur; | ||
604 | rp8 = rp8 ^ tmppar ^ notrp8; | ||
605 | |||
606 | cur = *bp++; tmppar ^= cur; rp4_6 ^= cur; | ||
607 | cur = *bp++; tmppar ^= cur; rp6 ^= cur; | ||
608 | cur = *bp++; tmppar ^= cur; rp4 ^= cur; | ||
609 | cur = *bp++; tmppar ^= cur; | ||
610 | |||
611 | par ^= tmppar; | ||
612 | if ((i & 0x1) == 0) rp12 ^= tmppar; | ||
613 | if ((i & 0x2) == 0) rp14 ^= tmppar; | ||
614 | } | ||
615 | rp4 ^= rp4_6; | ||
616 | rp6 ^= rp4_6; | ||
617 | |||
618 | |||
619 | Not a big change, but every penny counts :-) | ||
620 | |||
621 | |||
622 | Analysis 7 | ||
623 | ========== | ||
624 | |||
625 | Acutally this made things worse. Not very much, but I don't want to move | ||
626 | into the wrong direction. Maybe something to investigate later. Could | ||
627 | have to do with caching again. | ||
628 | |||
629 | Guess that is what there is to win within the loop. Maybe unrolling one | ||
630 | more time will help. I'll keep the optimisations from 7 for now. | ||
631 | |||
632 | |||
633 | Attempt 8 | ||
634 | ========= | ||
635 | |||
636 | Unrolled the loop one more time. | ||
637 | |||
638 | |||
639 | Analysis 8 | ||
640 | ========== | ||
641 | |||
642 | This makes things worse. Let's stick with attempt 6 and continue from there. | ||
643 | Although it seems that the code within the loop cannot be optimised | ||
644 | further there is still room to optimize the generation of the ecc codes. | ||
645 | We can simply calcualate the total parity. If this is 0 then rp4 = rp5 | ||
646 | etc. If the parity is 1, then rp4 = !rp5; | ||
647 | But if rp4 = rp5 we do not need rp5 etc. We can just write the even bits | ||
648 | in the result byte and then do something like | ||
649 | code[0] |= (code[0] << 1); | ||
650 | Lets test this. | ||
651 | |||
652 | |||
653 | Attempt 9 | ||
654 | ========= | ||
655 | |||
656 | Changed the code but again this slightly degrades performance. Tried all | ||
657 | kind of other things, like having dedicated parity arrays to avoid the | ||
658 | shift after parity[rp7] << 7; No gain. | ||
659 | Change the lookup using the parity array by using shift operators (e.g. | ||
660 | replace parity[rp7] << 7 with: | ||
661 | rp7 ^= (rp7 << 4); | ||
662 | rp7 ^= (rp7 << 2); | ||
663 | rp7 ^= (rp7 << 1); | ||
664 | rp7 &= 0x80; | ||
665 | No gain. | ||
666 | |||
667 | The only marginal change was inverting the parity bits, so we can remove | ||
668 | the last three invert statements. | ||
669 | |||
670 | Ah well, pity this does not deliver more. Then again 10 million | ||
671 | iterations using the linux driver code takes between 13 and 13.5 | ||
672 | seconds, whereas my code now takes about 0.73 seconds for those 10 | ||
673 | million iterations. So basically I've improved the performance by a | ||
674 | factor 18 on my system. Not that bad. Of course on different hardware | ||
675 | you will get different results. No warranties! | ||
676 | |||
677 | But of course there is no such thing as a free lunch. The codesize almost | ||
678 | tripled (from 562 bytes to 1434 bytes). Then again, it is not that much. | ||
679 | |||
680 | |||
681 | Correcting errors | ||
682 | ================= | ||
683 | |||
684 | For correcting errors I again used the ST application note as a starter, | ||
685 | but I also peeked at the existing code. | ||
686 | The algorithm itself is pretty straightforward. Just xor the given and | ||
687 | the calculated ecc. If all bytes are 0 there is no problem. If 11 bits | ||
688 | are 1 we have one correctable bit error. If there is 1 bit 1, we have an | ||
689 | error in the given ecc code. | ||
690 | It proved to be fastest to do some table lookups. Performance gain | ||
691 | introduced by this is about a factor 2 on my system when a repair had to | ||
692 | be done, and 1% or so if no repair had to be done. | ||
693 | Code size increased from 330 bytes to 686 bytes for this function. | ||
694 | (gcc 4.2, -O3) | ||
695 | |||
696 | |||
697 | Conclusion | ||
698 | ========== | ||
699 | |||
700 | The gain when calculating the ecc is tremendous. Om my development hardware | ||
701 | a speedup of a factor of 18 for ecc calculation was achieved. On a test on an | ||
702 | embedded system with a MIPS core a factor 7 was obtained. | ||
703 | On a test with a Linksys NSLU2 (ARMv5TE processor) the speedup was a factor | ||
704 | 5 (big endian mode, gcc 4.1.2, -O3) | ||
705 | For correction not much gain could be obtained (as bitflips are rare). Then | ||
706 | again there are also much less cycles spent there. | ||
707 | |||
708 | It seems there is not much more gain possible in this, at least when | ||
709 | programmed in C. Of course it might be possible to squeeze something more | ||
710 | out of it with an assembler program, but due to pipeline behaviour etc | ||
711 | this is very tricky (at least for intel hw). | ||
712 | |||
713 | Author: Frans Meulenbroeks | ||
714 | Copyright (C) 2008 Koninklijke Philips Electronics NV. | ||
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 5ce0952aa065..10a0263ebb3f 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt | |||
@@ -95,7 +95,9 @@ On all - write a character to /proc/sysrq-trigger. e.g.: | |||
95 | 95 | ||
96 | 'p' - Will dump the current registers and flags to your console. | 96 | 'p' - Will dump the current registers and flags to your console. |
97 | 97 | ||
98 | 'q' - Will dump a list of all running timers. | 98 | 'q' - Will dump per CPU lists of all armed hrtimers (but NOT regular |
99 | timer_list timers) and detailed information about all | ||
100 | clockevent devices. | ||
99 | 101 | ||
100 | 'r' - Turns off keyboard raw mode and sets it to XLATE. | 102 | 'r' - Turns off keyboard raw mode and sets it to XLATE. |
101 | 103 | ||
diff --git a/Documentation/tracepoints.txt b/Documentation/tracepoints.txt new file mode 100644 index 000000000000..5d354e167494 --- /dev/null +++ b/Documentation/tracepoints.txt | |||
@@ -0,0 +1,101 @@ | |||
1 | Using the Linux Kernel Tracepoints | ||
2 | |||
3 | Mathieu Desnoyers | ||
4 | |||
5 | |||
6 | This document introduces Linux Kernel Tracepoints and their use. It provides | ||
7 | examples of how to insert tracepoints in the kernel and connect probe functions | ||
8 | to them and provides some examples of probe functions. | ||
9 | |||
10 | |||
11 | * Purpose of tracepoints | ||
12 | |||
13 | A tracepoint placed in code provides a hook to call a function (probe) that you | ||
14 | can provide at runtime. A tracepoint can be "on" (a probe is connected to it) or | ||
15 | "off" (no probe is attached). When a tracepoint is "off" it has no effect, | ||
16 | except for adding a tiny time penalty (checking a condition for a branch) and | ||
17 | space penalty (adding a few bytes for the function call at the end of the | ||
18 | instrumented function and adds a data structure in a separate section). When a | ||
19 | tracepoint is "on", the function you provide is called each time the tracepoint | ||
20 | is executed, in the execution context of the caller. When the function provided | ||
21 | ends its execution, it returns to the caller (continuing from the tracepoint | ||
22 | site). | ||
23 | |||
24 | You can put tracepoints at important locations in the code. They are | ||
25 | lightweight hooks that can pass an arbitrary number of parameters, | ||
26 | which prototypes are described in a tracepoint declaration placed in a header | ||
27 | file. | ||
28 | |||
29 | They can be used for tracing and performance accounting. | ||
30 | |||
31 | |||
32 | * Usage | ||
33 | |||
34 | Two elements are required for tracepoints : | ||
35 | |||
36 | - A tracepoint definition, placed in a header file. | ||
37 | - The tracepoint statement, in C code. | ||
38 | |||
39 | In order to use tracepoints, you should include linux/tracepoint.h. | ||
40 | |||
41 | In include/trace/subsys.h : | ||
42 | |||
43 | #include <linux/tracepoint.h> | ||
44 | |||
45 | DEFINE_TRACE(subsys_eventname, | ||
46 | TPPTOTO(int firstarg, struct task_struct *p), | ||
47 | TPARGS(firstarg, p)); | ||
48 | |||
49 | In subsys/file.c (where the tracing statement must be added) : | ||
50 | |||
51 | #include <trace/subsys.h> | ||
52 | |||
53 | void somefct(void) | ||
54 | { | ||
55 | ... | ||
56 | trace_subsys_eventname(arg, task); | ||
57 | ... | ||
58 | } | ||
59 | |||
60 | Where : | ||
61 | - subsys_eventname is an identifier unique to your event | ||
62 | - subsys is the name of your subsystem. | ||
63 | - eventname is the name of the event to trace. | ||
64 | - TPPTOTO(int firstarg, struct task_struct *p) is the prototype of the function | ||
65 | called by this tracepoint. | ||
66 | - TPARGS(firstarg, p) are the parameters names, same as found in the prototype. | ||
67 | |||
68 | Connecting a function (probe) to a tracepoint is done by providing a probe | ||
69 | (function to call) for the specific tracepoint through | ||
70 | register_trace_subsys_eventname(). Removing a probe is done through | ||
71 | unregister_trace_subsys_eventname(); it will remove the probe sure there is no | ||
72 | caller left using the probe when it returns. Probe removal is preempt-safe | ||
73 | because preemption is disabled around the probe call. See the "Probe example" | ||
74 | section below for a sample probe module. | ||
75 | |||
76 | The tracepoint mechanism supports inserting multiple instances of the same | ||
77 | tracepoint, but a single definition must be made of a given tracepoint name over | ||
78 | all the kernel to make sure no type conflict will occur. Name mangling of the | ||
79 | tracepoints is done using the prototypes to make sure typing is correct. | ||
80 | Verification of probe type correctness is done at the registration site by the | ||
81 | compiler. Tracepoints can be put in inline functions, inlined static functions, | ||
82 | and unrolled loops as well as regular functions. | ||
83 | |||
84 | The naming scheme "subsys_event" is suggested here as a convention intended | ||
85 | to limit collisions. Tracepoint names are global to the kernel: they are | ||
86 | considered as being the same whether they are in the core kernel image or in | ||
87 | modules. | ||
88 | |||
89 | |||
90 | * Probe / tracepoint example | ||
91 | |||
92 | See the example provided in samples/tracepoints/src | ||
93 | |||
94 | Compile them with your kernel. | ||
95 | |||
96 | Run, as root : | ||
97 | modprobe tracepoint-example (insmod order is not important) | ||
98 | modprobe tracepoint-probe-example | ||
99 | cat /proc/tracepoint-example (returns an expected error) | ||
100 | rmmod tracepoint-example tracepoint-probe-example | ||
101 | dmesg | ||
diff --git a/Documentation/tracers/mmiotrace.txt b/Documentation/tracers/mmiotrace.txt index a4afb560a45b..5bbbe2096223 100644 --- a/Documentation/tracers/mmiotrace.txt +++ b/Documentation/tracers/mmiotrace.txt | |||
@@ -36,7 +36,7 @@ $ mount -t debugfs debugfs /debug | |||
36 | $ echo mmiotrace > /debug/tracing/current_tracer | 36 | $ echo mmiotrace > /debug/tracing/current_tracer |
37 | $ cat /debug/tracing/trace_pipe > mydump.txt & | 37 | $ cat /debug/tracing/trace_pipe > mydump.txt & |
38 | Start X or whatever. | 38 | Start X or whatever. |
39 | $ echo "X is up" > /debug/tracing/marker | 39 | $ echo "X is up" > /debug/tracing/trace_marker |
40 | $ echo none > /debug/tracing/current_tracer | 40 | $ echo none > /debug/tracing/current_tracer |
41 | Check for lost events. | 41 | Check for lost events. |
42 | 42 | ||
@@ -59,9 +59,8 @@ The 'cat' process should stay running (sleeping) in the background. | |||
59 | Load the driver you want to trace and use it. Mmiotrace will only catch MMIO | 59 | Load the driver you want to trace and use it. Mmiotrace will only catch MMIO |
60 | accesses to areas that are ioremapped while mmiotrace is active. | 60 | accesses to areas that are ioremapped while mmiotrace is active. |
61 | 61 | ||
62 | [Unimplemented feature:] | ||
63 | During tracing you can place comments (markers) into the trace by | 62 | During tracing you can place comments (markers) into the trace by |
64 | $ echo "X is up" > /debug/tracing/marker | 63 | $ echo "X is up" > /debug/tracing/trace_marker |
65 | This makes it easier to see which part of the (huge) trace corresponds to | 64 | This makes it easier to see which part of the (huge) trace corresponds to |
66 | which action. It is recommended to place descriptive markers about what you | 65 | which action. It is recommended to place descriptive markers about what you |
67 | do. | 66 | do. |
diff --git a/Documentation/usb/anchors.txt b/Documentation/usb/anchors.txt index 5e6b64c20d25..6f24f566955a 100644 --- a/Documentation/usb/anchors.txt +++ b/Documentation/usb/anchors.txt | |||
@@ -52,6 +52,11 @@ Therefore no guarantee is made that the URBs have been unlinked when | |||
52 | the call returns. They may be unlinked later but will be unlinked in | 52 | the call returns. They may be unlinked later but will be unlinked in |
53 | finite time. | 53 | finite time. |
54 | 54 | ||
55 | usb_scuttle_anchored_urbs() | ||
56 | --------------------------- | ||
57 | |||
58 | All URBs of an anchor are unanchored en masse. | ||
59 | |||
55 | usb_wait_anchor_empty_timeout() | 60 | usb_wait_anchor_empty_timeout() |
56 | ------------------------------- | 61 | ------------------------------- |
57 | 62 | ||
@@ -59,4 +64,16 @@ This function waits for all URBs associated with an anchor to finish | |||
59 | or a timeout, whichever comes first. Its return value will tell you | 64 | or a timeout, whichever comes first. Its return value will tell you |
60 | whether the timeout was reached. | 65 | whether the timeout was reached. |
61 | 66 | ||
67 | usb_anchor_empty() | ||
68 | ------------------ | ||
69 | |||
70 | Returns true if no URBs are associated with an anchor. Locking | ||
71 | is the caller's responsibility. | ||
72 | |||
73 | usb_get_from_anchor() | ||
74 | --------------------- | ||
62 | 75 | ||
76 | Returns the oldest anchored URB of an anchor. The URB is unanchored | ||
77 | and returned with a reference. As you may mix URBs to several | ||
78 | destinations in one anchor you have no guarantee the chronologically | ||
79 | first submitted URB is returned. \ No newline at end of file | ||
diff --git a/Documentation/usb/misc_usbsevseg.txt b/Documentation/usb/misc_usbsevseg.txt new file mode 100644 index 000000000000..0f6be4f9930b --- /dev/null +++ b/Documentation/usb/misc_usbsevseg.txt | |||
@@ -0,0 +1,46 @@ | |||
1 | USB 7-Segment Numeric Display | ||
2 | Manufactured by Delcom Engineering | ||
3 | |||
4 | Device Information | ||
5 | ------------------ | ||
6 | USB VENDOR_ID 0x0fc5 | ||
7 | USB PRODUCT_ID 0x1227 | ||
8 | Both the 6 character and 8 character displays have PRODUCT_ID, | ||
9 | and according to Delcom Engineering no queryable information | ||
10 | can be obtained from the device to tell them apart. | ||
11 | |||
12 | Device Modes | ||
13 | ------------ | ||
14 | By default, the driver assumes the display is only 6 characters | ||
15 | The mode for 6 characters is: | ||
16 | MSB 0x06; LSB 0x3f | ||
17 | For the 8 character display: | ||
18 | MSB 0x08; LSB 0xff | ||
19 | The device can accept "text" either in raw, hex, or ascii textmode. | ||
20 | raw controls each segment manually, | ||
21 | hex expects a value between 0-15 per character, | ||
22 | ascii expects a value between '0'-'9' and 'A'-'F'. | ||
23 | The default is ascii. | ||
24 | |||
25 | Device Operation | ||
26 | ---------------- | ||
27 | 1. Turn on the device: | ||
28 | echo 1 > /sys/bus/usb/.../powered | ||
29 | 2. Set the device's mode: | ||
30 | echo $mode_msb > /sys/bus/usb/.../mode_msb | ||
31 | echo $mode_lsb > /sys/bus/usb/.../mode_lsb | ||
32 | 3. Set the textmode: | ||
33 | echo $textmode > /sys/bus/usb/.../textmode | ||
34 | 4. set the text (for example): | ||
35 | echo "123ABC" > /sys/bus/usb/.../text (ascii) | ||
36 | echo "A1B2" > /sys/bus/usb/.../text (ascii) | ||
37 | echo -ne "\x01\x02\x03" > /sys/bus/usb/.../text (hex) | ||
38 | 5. Set the decimal places. | ||
39 | The device has either 6 or 8 decimal points. | ||
40 | to set the nth decimal place calculate 10 ** n | ||
41 | and echo it in to /sys/bus/usb/.../decimals | ||
42 | To set multiple decimals points sum up each power. | ||
43 | For example, to set the 0th and 3rd decimal place | ||
44 | echo 1001 > /sys/bus/usb/.../decimals | ||
45 | |||
46 | |||
diff --git a/Documentation/usb/power-management.txt b/Documentation/usb/power-management.txt index 9d31140e3f5b..e48ea1d51010 100644 --- a/Documentation/usb/power-management.txt +++ b/Documentation/usb/power-management.txt | |||
@@ -350,12 +350,12 @@ without holding the mutex. | |||
350 | 350 | ||
351 | There also are a couple of utility routines drivers can use: | 351 | There also are a couple of utility routines drivers can use: |
352 | 352 | ||
353 | usb_autopm_enable() sets pm_usage_cnt to 1 and then calls | 353 | usb_autopm_enable() sets pm_usage_cnt to 0 and then calls |
354 | usb_autopm_set_interface(), which will attempt an autoresume. | ||
355 | |||
356 | usb_autopm_disable() sets pm_usage_cnt to 0 and then calls | ||
357 | usb_autopm_set_interface(), which will attempt an autosuspend. | 354 | usb_autopm_set_interface(), which will attempt an autosuspend. |
358 | 355 | ||
356 | usb_autopm_disable() sets pm_usage_cnt to 1 and then calls | ||
357 | usb_autopm_set_interface(), which will attempt an autoresume. | ||
358 | |||
359 | The conventional usage pattern is that a driver calls | 359 | The conventional usage pattern is that a driver calls |
360 | usb_autopm_get_interface() in its open routine and | 360 | usb_autopm_get_interface() in its open routine and |
361 | usb_autopm_put_interface() in its close or release routine. But | 361 | usb_autopm_put_interface() in its close or release routine. But |
diff --git a/Documentation/video4linux/CARDLIST.au0828 b/Documentation/video4linux/CARDLIST.au0828 index aa05e5bb22fb..d5cb4ea287b2 100644 --- a/Documentation/video4linux/CARDLIST.au0828 +++ b/Documentation/video4linux/CARDLIST.au0828 | |||
@@ -1,5 +1,5 @@ | |||
1 | 0 -> Unknown board (au0828) | 1 | 0 -> Unknown board (au0828) |
2 | 1 -> Hauppauge HVR950Q (au0828) [2040:7200,2040:7210,2040:7217,2040:721b,2040:721f,2040:7280,0fd9:0008] | 2 | 1 -> Hauppauge HVR950Q (au0828) [2040:7200,2040:7210,2040:7217,2040:721b,2040:721e,2040:721f,2040:7280,0fd9:0008] |
3 | 2 -> Hauppauge HVR850 (au0828) [2040:7240] | 3 | 2 -> Hauppauge HVR850 (au0828) [2040:7240] |
4 | 3 -> DViCO FusionHDTV USB (au0828) [0fe9:d620] | 4 | 3 -> DViCO FusionHDTV USB (au0828) [0fe9:d620] |
5 | 4 -> Hauppauge HVR950Q rev xxF8 (au0828) [2040:7201,2040:7211,2040:7281] | 5 | 4 -> Hauppauge HVR950Q rev xxF8 (au0828) [2040:7201,2040:7211,2040:7281] |
diff --git a/Documentation/video4linux/CARDLIST.tuner b/Documentation/video4linux/CARDLIST.tuner index 30bbdda68d03..691d2f37dc57 100644 --- a/Documentation/video4linux/CARDLIST.tuner +++ b/Documentation/video4linux/CARDLIST.tuner | |||
@@ -75,3 +75,4 @@ tuner=73 - Samsung TCPG 6121P30A | |||
75 | tuner=75 - Philips TEA5761 FM Radio | 75 | tuner=75 - Philips TEA5761 FM Radio |
76 | tuner=76 - Xceive 5000 tuner | 76 | tuner=76 - Xceive 5000 tuner |
77 | tuner=77 - TCL tuner MF02GIP-5N-E | 77 | tuner=77 - TCL tuner MF02GIP-5N-E |
78 | tuner=78 - Philips FMD1216MEX MK3 Hybrid Tuner | ||
diff --git a/Documentation/vm/unevictable-lru.txt b/Documentation/vm/unevictable-lru.txt new file mode 100644 index 000000000000..125eed560e5a --- /dev/null +++ b/Documentation/vm/unevictable-lru.txt | |||
@@ -0,0 +1,615 @@ | |||
1 | |||
2 | This document describes the Linux memory management "Unevictable LRU" | ||
3 | infrastructure and the use of this infrastructure to manage several types | ||
4 | of "unevictable" pages. The document attempts to provide the overall | ||
5 | rationale behind this mechanism and the rationale for some of the design | ||
6 | decisions that drove the implementation. The latter design rationale is | ||
7 | discussed in the context of an implementation description. Admittedly, one | ||
8 | can obtain the implementation details--the "what does it do?"--by reading the | ||
9 | code. One hopes that the descriptions below add value by provide the answer | ||
10 | to "why does it do that?". | ||
11 | |||
12 | Unevictable LRU Infrastructure: | ||
13 | |||
14 | The Unevictable LRU adds an additional LRU list to track unevictable pages | ||
15 | and to hide these pages from vmscan. This mechanism is based on a patch by | ||
16 | Larry Woodman of Red Hat to address several scalability problems with page | ||
17 | reclaim in Linux. The problems have been observed at customer sites on large | ||
18 | memory x86_64 systems. For example, a non-numal x86_64 platform with 128GB | ||
19 | of main memory will have over 32 million 4k pages in a single zone. When a | ||
20 | large fraction of these pages are not evictable for any reason [see below], | ||
21 | vmscan will spend a lot of time scanning the LRU lists looking for the small | ||
22 | fraction of pages that are evictable. This can result in a situation where | ||
23 | all cpus are spending 100% of their time in vmscan for hours or days on end, | ||
24 | with the system completely unresponsive. | ||
25 | |||
26 | The Unevictable LRU infrastructure addresses the following classes of | ||
27 | unevictable pages: | ||
28 | |||
29 | + page owned by ramfs | ||
30 | + page mapped into SHM_LOCKed shared memory regions | ||
31 | + page mapped into VM_LOCKED [mlock()ed] vmas | ||
32 | |||
33 | The infrastructure might be able to handle other conditions that make pages | ||
34 | unevictable, either by definition or by circumstance, in the future. | ||
35 | |||
36 | |||
37 | The Unevictable LRU List | ||
38 | |||
39 | The Unevictable LRU infrastructure consists of an additional, per-zone, LRU list | ||
40 | called the "unevictable" list and an associated page flag, PG_unevictable, to | ||
41 | indicate that the page is being managed on the unevictable list. The | ||
42 | PG_unevictable flag is analogous to, and mutually exclusive with, the PG_active | ||
43 | flag in that it indicates on which LRU list a page resides when PG_lru is set. | ||
44 | The unevictable LRU list is source configurable based on the UNEVICTABLE_LRU | ||
45 | Kconfig option. | ||
46 | |||
47 | The Unevictable LRU infrastructure maintains unevictable pages on an additional | ||
48 | LRU list for a few reasons: | ||
49 | |||
50 | 1) We get to "treat unevictable pages just like we treat other pages in the | ||
51 | system, which means we get to use the same code to manipulate them, the | ||
52 | same code to isolate them (for migrate, etc.), the same code to keep track | ||
53 | of the statistics, etc..." [Rik van Riel] | ||
54 | |||
55 | 2) We want to be able to migrate unevictable pages between nodes--for memory | ||
56 | defragmentation, workload management and memory hotplug. The linux kernel | ||
57 | can only migrate pages that it can successfully isolate from the lru lists. | ||
58 | If we were to maintain pages elsewise than on an lru-like list, where they | ||
59 | can be found by isolate_lru_page(), we would prevent their migration, unless | ||
60 | we reworked migration code to find the unevictable pages. | ||
61 | |||
62 | |||
63 | The unevictable LRU list does not differentiate between file backed and swap | ||
64 | backed [anon] pages. This differentiation is only important while the pages | ||
65 | are, in fact, evictable. | ||
66 | |||
67 | The unevictable LRU list benefits from the "arrayification" of the per-zone | ||
68 | LRU lists and statistics originally proposed and posted by Christoph Lameter. | ||
69 | |||
70 | The unevictable list does not use the lru pagevec mechanism. Rather, | ||
71 | unevictable pages are placed directly on the page's zone's unevictable | ||
72 | list under the zone lru_lock. The reason for this is to prevent stranding | ||
73 | of pages on the unevictable list when one task has the page isolated from the | ||
74 | lru and other tasks are changing the "evictability" state of the page. | ||
75 | |||
76 | |||
77 | Unevictable LRU and Memory Controller Interaction | ||
78 | |||
79 | The memory controller data structure automatically gets a per zone unevictable | ||
80 | lru list as a result of the "arrayification" of the per-zone LRU lists. The | ||
81 | memory controller tracks the movement of pages to and from the unevictable list. | ||
82 | When a memory control group comes under memory pressure, the controller will | ||
83 | not attempt to reclaim pages on the unevictable list. This has a couple of | ||
84 | effects. Because the pages are "hidden" from reclaim on the unevictable list, | ||
85 | the reclaim process can be more efficient, dealing only with pages that have | ||
86 | a chance of being reclaimed. On the other hand, if too many of the pages | ||
87 | charged to the control group are unevictable, the evictable portion of the | ||
88 | working set of the tasks in the control group may not fit into the available | ||
89 | memory. This can cause the control group to thrash or to oom-kill tasks. | ||
90 | |||
91 | |||
92 | Unevictable LRU: Detecting Unevictable Pages | ||
93 | |||
94 | The function page_evictable(page, vma) in vmscan.c determines whether a | ||
95 | page is evictable or not. For ramfs pages and pages in SHM_LOCKed regions, | ||
96 | page_evictable() tests a new address space flag, AS_UNEVICTABLE, in the page's | ||
97 | address space using a wrapper function. Wrapper functions are used to set, | ||
98 | clear and test the flag to reduce the requirement for #ifdef's throughout the | ||
99 | source code. AS_UNEVICTABLE is set on ramfs inode/mapping when it is created. | ||
100 | This flag remains for the life of the inode. | ||
101 | |||
102 | For shared memory regions, AS_UNEVICTABLE is set when an application | ||
103 | successfully SHM_LOCKs the region and is removed when the region is | ||
104 | SHM_UNLOCKed. Note that shmctl(SHM_LOCK, ...) does not populate the page | ||
105 | tables for the region as does, for example, mlock(). So, we make no special | ||
106 | effort to push any pages in the SHM_LOCKed region to the unevictable list. | ||
107 | Vmscan will do this when/if it encounters the pages during reclaim. On | ||
108 | SHM_UNLOCK, shmctl() scans the pages in the region and "rescues" them from the | ||
109 | unevictable list if no other condition keeps them unevictable. If a SHM_LOCKed | ||
110 | region is destroyed, the pages are also "rescued" from the unevictable list in | ||
111 | the process of freeing them. | ||
112 | |||
113 | page_evictable() detects mlock()ed pages by testing an additional page flag, | ||
114 | PG_mlocked via the PageMlocked() wrapper. If the page is NOT mlocked, and a | ||
115 | non-NULL vma is supplied, page_evictable() will check whether the vma is | ||
116 | VM_LOCKED via is_mlocked_vma(). is_mlocked_vma() will SetPageMlocked() and | ||
117 | update the appropriate statistics if the vma is VM_LOCKED. This method allows | ||
118 | efficient "culling" of pages in the fault path that are being faulted in to | ||
119 | VM_LOCKED vmas. | ||
120 | |||
121 | |||
122 | Unevictable Pages and Vmscan [shrink_*_list()] | ||
123 | |||
124 | If unevictable pages are culled in the fault path, or moved to the unevictable | ||
125 | list at mlock() or mmap() time, vmscan will never encounter the pages until | ||
126 | they have become evictable again, for example, via munlock() and have been | ||
127 | "rescued" from the unevictable list. However, there may be situations where we | ||
128 | decide, for the sake of expediency, to leave a unevictable page on one of the | ||
129 | regular active/inactive LRU lists for vmscan to deal with. Vmscan checks for | ||
130 | such pages in all of the shrink_{active|inactive|page}_list() functions and | ||
131 | will "cull" such pages that it encounters--that is, it diverts those pages to | ||
132 | the unevictable list for the zone being scanned. | ||
133 | |||
134 | There may be situations where a page is mapped into a VM_LOCKED vma, but the | ||
135 | page is not marked as PageMlocked. Such pages will make it all the way to | ||
136 | shrink_page_list() where they will be detected when vmscan walks the reverse | ||
137 | map in try_to_unmap(). If try_to_unmap() returns SWAP_MLOCK, shrink_page_list() | ||
138 | will cull the page at that point. | ||
139 | |||
140 | Note that for anonymous pages, shrink_page_list() attempts to add the page to | ||
141 | the swap cache before it tries to unmap the page. To avoid this unnecessary | ||
142 | consumption of swap space, shrink_page_list() calls try_to_munlock() to check | ||
143 | whether any VM_LOCKED vmas map the page without attempting to unmap the page. | ||
144 | If try_to_munlock() returns SWAP_MLOCK, shrink_page_list() will cull the page | ||
145 | without consuming swap space. try_to_munlock() will be described below. | ||
146 | |||
147 | To "cull" an unevictable page, vmscan simply puts the page back on the lru | ||
148 | list using putback_lru_page()--the inverse operation to isolate_lru_page()-- | ||
149 | after dropping the page lock. Because the condition which makes the page | ||
150 | unevictable may change once the page is unlocked, putback_lru_page() will | ||
151 | recheck the unevictable state of a page that it places on the unevictable lru | ||
152 | list. If the page has become unevictable, putback_lru_page() removes it from | ||
153 | the list and retries, including the page_unevictable() test. Because such a | ||
154 | race is a rare event and movement of pages onto the unevictable list should be | ||
155 | rare, these extra evictabilty checks should not occur in the majority of calls | ||
156 | to putback_lru_page(). | ||
157 | |||
158 | |||
159 | Mlocked Page: Prior Work | ||
160 | |||
161 | The "Unevictable Mlocked Pages" infrastructure is based on work originally | ||
162 | posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU". | ||
163 | Nick posted his patch as an alternative to a patch posted by Christoph | ||
164 | Lameter to achieve the same objective--hiding mlocked pages from vmscan. | ||
165 | In Nick's patch, he used one of the struct page lru list link fields as a count | ||
166 | of VM_LOCKED vmas that map the page. This use of the link field for a count | ||
167 | prevented the management of the pages on an LRU list. Thus, mlocked pages were | ||
168 | not migratable as isolate_lru_page() could not find them and the lru list link | ||
169 | field was not available to the migration subsystem. Nick resolved this by | ||
170 | putting mlocked pages back on the lru list before attempting to isolate them, | ||
171 | thus abandoning the count of VM_LOCKED vmas. When Nick's patch was integrated | ||
172 | with the Unevictable LRU work, the count was replaced by walking the reverse | ||
173 | map to determine whether any VM_LOCKED vmas mapped the page. More on this | ||
174 | below. | ||
175 | |||
176 | |||
177 | Mlocked Pages: Basic Management | ||
178 | |||
179 | Mlocked pages--pages mapped into a VM_LOCKED vma--represent one class of | ||
180 | unevictable pages. When such a page has been "noticed" by the memory | ||
181 | management subsystem, the page is marked with the PG_mlocked [PageMlocked()] | ||
182 | flag. A PageMlocked() page will be placed on the unevictable LRU list when | ||
183 | it is added to the LRU. Pages can be "noticed" by memory management in | ||
184 | several places: | ||
185 | |||
186 | 1) in the mlock()/mlockall() system call handlers. | ||
187 | 2) in the mmap() system call handler when mmap()ing a region with the | ||
188 | MAP_LOCKED flag, or mmap()ing a region in a task that has called | ||
189 | mlockall() with the MCL_FUTURE flag. Both of these conditions result | ||
190 | in the VM_LOCKED flag being set for the vma. | ||
191 | 3) in the fault path, if mlocked pages are "culled" in the fault path, | ||
192 | and when a VM_LOCKED stack segment is expanded. | ||
193 | 4) as mentioned above, in vmscan:shrink_page_list() with attempting to | ||
194 | reclaim a page in a VM_LOCKED vma--via try_to_unmap() or try_to_munlock(). | ||
195 | |||
196 | Mlocked pages become unlocked and rescued from the unevictable list when: | ||
197 | |||
198 | 1) mapped in a range unlocked via the munlock()/munlockall() system calls. | ||
199 | 2) munmapped() out of the last VM_LOCKED vma that maps the page, including | ||
200 | unmapping at task exit. | ||
201 | 3) when the page is truncated from the last VM_LOCKED vma of an mmap()ed file. | ||
202 | 4) before a page is COWed in a VM_LOCKED vma. | ||
203 | |||
204 | |||
205 | Mlocked Pages: mlock()/mlockall() System Call Handling | ||
206 | |||
207 | Both [do_]mlock() and [do_]mlockall() system call handlers call mlock_fixup() | ||
208 | for each vma in the range specified by the call. In the case of mlockall(), | ||
209 | this is the entire active address space of the task. Note that mlock_fixup() | ||
210 | is used for both mlock()ing and munlock()ing a range of memory. A call to | ||
211 | mlock() an already VM_LOCKED vma, or to munlock() a vma that is not VM_LOCKED | ||
212 | is treated as a no-op--mlock_fixup() simply returns. | ||
213 | |||
214 | If the vma passes some filtering described in "Mlocked Pages: Filtering Vmas" | ||
215 | below, mlock_fixup() will attempt to merge the vma with its neighbors or split | ||
216 | off a subset of the vma if the range does not cover the entire vma. Once the | ||
217 | vma has been merged or split or neither, mlock_fixup() will call | ||
218 | __mlock_vma_pages_range() to fault in the pages via get_user_pages() and | ||
219 | to mark the pages as mlocked via mlock_vma_page(). | ||
220 | |||
221 | Note that the vma being mlocked might be mapped with PROT_NONE. In this case, | ||
222 | get_user_pages() will be unable to fault in the pages. That's OK. If pages | ||
223 | do end up getting faulted into this VM_LOCKED vma, we'll handle them in the | ||
224 | fault path or in vmscan. | ||
225 | |||
226 | Also note that a page returned by get_user_pages() could be truncated or | ||
227 | migrated out from under us, while we're trying to mlock it. To detect | ||
228 | this, __mlock_vma_pages_range() tests the page_mapping after acquiring | ||
229 | the page lock. If the page is still associated with its mapping, we'll | ||
230 | go ahead and call mlock_vma_page(). If the mapping is gone, we just | ||
231 | unlock the page and move on. Worse case, this results in page mapped | ||
232 | in a VM_LOCKED vma remaining on a normal LRU list without being | ||
233 | PageMlocked(). Again, vmscan will detect and cull such pages. | ||
234 | |||
235 | mlock_vma_page(), called with the page locked [N.B., not "mlocked"], will | ||
236 | TestSetPageMlocked() for each page returned by get_user_pages(). We use | ||
237 | TestSetPageMlocked() because the page might already be mlocked by another | ||
238 | task/vma and we don't want to do extra work. We especially do not want to | ||
239 | count an mlocked page more than once in the statistics. If the page was | ||
240 | already mlocked, mlock_vma_page() is done. | ||
241 | |||
242 | If the page was NOT already mlocked, mlock_vma_page() attempts to isolate the | ||
243 | page from the LRU, as it is likely on the appropriate active or inactive list | ||
244 | at that time. If the isolate_lru_page() succeeds, mlock_vma_page() will | ||
245 | putback the page--putback_lru_page()--which will notice that the page is now | ||
246 | mlocked and divert the page to the zone's unevictable LRU list. If | ||
247 | mlock_vma_page() is unable to isolate the page from the LRU, vmscan will handle | ||
248 | it later if/when it attempts to reclaim the page. | ||
249 | |||
250 | |||
251 | Mlocked Pages: Filtering Special Vmas | ||
252 | |||
253 | mlock_fixup() filters several classes of "special" vmas: | ||
254 | |||
255 | 1) vmas with VM_IO|VM_PFNMAP set are skipped entirely. The pages behind | ||
256 | these mappings are inherently pinned, so we don't need to mark them as | ||
257 | mlocked. In any case, most of the pages have no struct page in which to | ||
258 | so mark the page. Because of this, get_user_pages() will fail for these | ||
259 | vmas, so there is no sense in attempting to visit them. | ||
260 | |||
261 | 2) vmas mapping hugetlbfs page are already effectively pinned into memory. | ||
262 | We don't need nor want to mlock() these pages. However, to preserve the | ||
263 | prior behavior of mlock()--before the unevictable/mlock changes--mlock_fixup() | ||
264 | will call make_pages_present() in the hugetlbfs vma range to allocate the | ||
265 | huge pages and populate the ptes. | ||
266 | |||
267 | 3) vmas with VM_DONTEXPAND|VM_RESERVED are generally user space mappings of | ||
268 | kernel pages, such as the vdso page, relay channel pages, etc. These pages | ||
269 | are inherently unevictable and are not managed on the LRU lists. | ||
270 | mlock_fixup() treats these vmas the same as hugetlbfs vmas. It calls | ||
271 | make_pages_present() to populate the ptes. | ||
272 | |||
273 | Note that for all of these special vmas, mlock_fixup() does not set the | ||
274 | VM_LOCKED flag. Therefore, we won't have to deal with them later during | ||
275 | munlock() or munmap()--for example, at task exit. Neither does mlock_fixup() | ||
276 | account these vmas against the task's "locked_vm". | ||
277 | |||
278 | Mlocked Pages: Downgrading the Mmap Semaphore. | ||
279 | |||
280 | mlock_fixup() must be called with the mmap semaphore held for write, because | ||
281 | it may have to merge or split vmas. However, mlocking a large region of | ||
282 | memory can take a long time--especially if vmscan must reclaim pages to | ||
283 | satisfy the regions requirements. Faulting in a large region with the mmap | ||
284 | semaphore held for write can hold off other faults on the address space, in | ||
285 | the case of a multi-threaded task. It can also hold off scans of the task's | ||
286 | address space via /proc. While testing under heavy load, it was observed that | ||
287 | the ps(1) command could be held off for many minutes while a large segment was | ||
288 | mlock()ed down. | ||
289 | |||
290 | To address this issue, and to make the system more responsive during mlock()ing | ||
291 | of large segments, mlock_fixup() downgrades the mmap semaphore to read mode | ||
292 | during the call to __mlock_vma_pages_range(). This works fine. However, the | ||
293 | callers of mlock_fixup() expect the semaphore to be returned in write mode. | ||
294 | So, mlock_fixup() "upgrades" the semphore to write mode. Linux does not | ||
295 | support an atomic upgrade_sem() call, so mlock_fixup() must drop the semaphore | ||
296 | and reacquire it in write mode. In a multi-threaded task, it is possible for | ||
297 | the task memory map to change while the semaphore is dropped. Therefore, | ||
298 | mlock_fixup() looks up the vma at the range start address after reacquiring | ||
299 | the semaphore in write mode and verifies that it still covers the original | ||
300 | range. If not, mlock_fixup() returns an error [-EAGAIN]. All callers of | ||
301 | mlock_fixup() have been changed to deal with this new error condition. | ||
302 | |||
303 | Note: when munlocking a region, all of the pages should already be resident-- | ||
304 | unless we have racing threads mlocking() and munlocking() regions. So, | ||
305 | unlocking should not have to wait for page allocations nor faults of any kind. | ||
306 | Therefore mlock_fixup() does not downgrade the semaphore for munlock(). | ||
307 | |||
308 | |||
309 | Mlocked Pages: munlock()/munlockall() System Call Handling | ||
310 | |||
311 | The munlock() and munlockall() system calls are handled by the same functions-- | ||
312 | do_mlock[all]()--as the mlock() and mlockall() system calls with the unlock | ||
313 | vs lock operation indicated by an argument. So, these system calls are also | ||
314 | handled by mlock_fixup(). Again, if called for an already munlock()ed vma, | ||
315 | mlock_fixup() simply returns. Because of the vma filtering discussed above, | ||
316 | VM_LOCKED will not be set in any "special" vmas. So, these vmas will be | ||
317 | ignored for munlock. | ||
318 | |||
319 | If the vma is VM_LOCKED, mlock_fixup() again attempts to merge or split off | ||
320 | the specified range. The range is then munlocked via the function | ||
321 | __mlock_vma_pages_range()--the same function used to mlock a vma range-- | ||
322 | passing a flag to indicate that munlock() is being performed. | ||
323 | |||
324 | Because the vma access protections could have been changed to PROT_NONE after | ||
325 | faulting in and mlocking some pages, get_user_pages() was unreliable for visiting | ||
326 | these pages for munlocking. Because we don't want to leave pages mlocked(), | ||
327 | get_user_pages() was enhanced to accept a flag to ignore the permissions when | ||
328 | fetching the pages--all of which should be resident as a result of previous | ||
329 | mlock()ing. | ||
330 | |||
331 | For munlock(), __mlock_vma_pages_range() unlocks individual pages by calling | ||
332 | munlock_vma_page(). munlock_vma_page() unconditionally clears the PG_mlocked | ||
333 | flag using TestClearPageMlocked(). As with mlock_vma_page(), munlock_vma_page() | ||
334 | use the Test*PageMlocked() function to handle the case where the page might | ||
335 | have already been unlocked by another task. If the page was mlocked, | ||
336 | munlock_vma_page() updates that zone statistics for the number of mlocked | ||
337 | pages. Note, however, that at this point we haven't checked whether the page | ||
338 | is mapped by other VM_LOCKED vmas. | ||
339 | |||
340 | We can't call try_to_munlock(), the function that walks the reverse map to check | ||
341 | for other VM_LOCKED vmas, without first isolating the page from the LRU. | ||
342 | try_to_munlock() is a variant of try_to_unmap() and thus requires that the page | ||
343 | not be on an lru list. [More on these below.] However, the call to | ||
344 | isolate_lru_page() could fail, in which case we couldn't try_to_munlock(). | ||
345 | So, we go ahead and clear PG_mlocked up front, as this might be the only chance | ||
346 | we have. If we can successfully isolate the page, we go ahead and | ||
347 | try_to_munlock(), which will restore the PG_mlocked flag and update the zone | ||
348 | page statistics if it finds another vma holding the page mlocked. If we fail | ||
349 | to isolate the page, we'll have left a potentially mlocked page on the LRU. | ||
350 | This is fine, because we'll catch it later when/if vmscan tries to reclaim the | ||
351 | page. This should be relatively rare. | ||
352 | |||
353 | Mlocked Pages: Migrating Them... | ||
354 | |||
355 | A page that is being migrated has been isolated from the lru lists and is | ||
356 | held locked across unmapping of the page, updating the page's mapping | ||
357 | [address_space] entry and copying the contents and state, until the | ||
358 | page table entry has been replaced with an entry that refers to the new | ||
359 | page. Linux supports migration of mlocked pages and other unevictable | ||
360 | pages. This involves simply moving the PageMlocked and PageUnevictable states | ||
361 | from the old page to the new page. | ||
362 | |||
363 | Note that page migration can race with mlocking or munlocking of the same | ||
364 | page. This has been discussed from the mlock/munlock perspective in the | ||
365 | respective sections above. Both processes [migration, m[un]locking], hold | ||
366 | the page locked. This provides the first level of synchronization. Page | ||
367 | migration zeros out the page_mapping of the old page before unlocking it, | ||
368 | so m[un]lock can skip these pages by testing the page mapping under page | ||
369 | lock. | ||
370 | |||
371 | When completing page migration, we place the new and old pages back onto the | ||
372 | lru after dropping the page lock. The "unneeded" page--old page on success, | ||
373 | new page on failure--will be freed when the reference count held by the | ||
374 | migration process is released. To ensure that we don't strand pages on the | ||
375 | unevictable list because of a race between munlock and migration, page | ||
376 | migration uses the putback_lru_page() function to add migrated pages back to | ||
377 | the lru. | ||
378 | |||
379 | |||
380 | Mlocked Pages: mmap(MAP_LOCKED) System Call Handling | ||
381 | |||
382 | In addition the the mlock()/mlockall() system calls, an application can request | ||
383 | that a region of memory be mlocked using the MAP_LOCKED flag with the mmap() | ||
384 | call. Furthermore, any mmap() call or brk() call that expands the heap by a | ||
385 | task that has previously called mlockall() with the MCL_FUTURE flag will result | ||
386 | in the newly mapped memory being mlocked. Before the unevictable/mlock changes, | ||
387 | the kernel simply called make_pages_present() to allocate pages and populate | ||
388 | the page table. | ||
389 | |||
390 | To mlock a range of memory under the unevictable/mlock infrastructure, the | ||
391 | mmap() handler and task address space expansion functions call | ||
392 | mlock_vma_pages_range() specifying the vma and the address range to mlock. | ||
393 | mlock_vma_pages_range() filters vmas like mlock_fixup(), as described above in | ||
394 | "Mlocked Pages: Filtering Vmas". It will clear the VM_LOCKED flag, which will | ||
395 | have already been set by the caller, in filtered vmas. Thus these vma's need | ||
396 | not be visited for munlock when the region is unmapped. | ||
397 | |||
398 | For "normal" vmas, mlock_vma_pages_range() calls __mlock_vma_pages_range() to | ||
399 | fault/allocate the pages and mlock them. Again, like mlock_fixup(), | ||
400 | mlock_vma_pages_range() downgrades the mmap semaphore to read mode before | ||
401 | attempting to fault/allocate and mlock the pages; and "upgrades" the semaphore | ||
402 | back to write mode before returning. | ||
403 | |||
404 | The callers of mlock_vma_pages_range() will have already added the memory | ||
405 | range to be mlocked to the task's "locked_vm". To account for filtered vmas, | ||
406 | mlock_vma_pages_range() returns the number of pages NOT mlocked. All of the | ||
407 | callers then subtract a non-negative return value from the task's locked_vm. | ||
408 | A negative return value represent an error--for example, from get_user_pages() | ||
409 | attempting to fault in a vma with PROT_NONE access. In this case, we leave | ||
410 | the memory range accounted as locked_vm, as the protections could be changed | ||
411 | later and pages allocated into that region. | ||
412 | |||
413 | |||
414 | Mlocked Pages: munmap()/exit()/exec() System Call Handling | ||
415 | |||
416 | When unmapping an mlocked region of memory, whether by an explicit call to | ||
417 | munmap() or via an internal unmap from exit() or exec() processing, we must | ||
418 | munlock the pages if we're removing the last VM_LOCKED vma that maps the pages. | ||
419 | Before the unevictable/mlock changes, mlocking did not mark the pages in any way, | ||
420 | so unmapping them required no processing. | ||
421 | |||
422 | To munlock a range of memory under the unevictable/mlock infrastructure, the | ||
423 | munmap() hander and task address space tear down function call | ||
424 | munlock_vma_pages_all(). The name reflects the observation that one always | ||
425 | specifies the entire vma range when munlock()ing during unmap of a region. | ||
426 | Because of the vma filtering when mlocking() regions, only "normal" vmas that | ||
427 | actually contain mlocked pages will be passed to munlock_vma_pages_all(). | ||
428 | |||
429 | munlock_vma_pages_all() clears the VM_LOCKED vma flag and, like mlock_fixup() | ||
430 | for the munlock case, calls __munlock_vma_pages_range() to walk the page table | ||
431 | for the vma's memory range and munlock_vma_page() each resident page mapped by | ||
432 | the vma. This effectively munlocks the page, only if this is the last | ||
433 | VM_LOCKED vma that maps the page. | ||
434 | |||
435 | |||
436 | Mlocked Page: try_to_unmap() | ||
437 | |||
438 | [Note: the code changes represented by this section are really quite small | ||
439 | compared to the text to describe what happening and why, and to discuss the | ||
440 | implications.] | ||
441 | |||
442 | Pages can, of course, be mapped into multiple vmas. Some of these vmas may | ||
443 | have VM_LOCKED flag set. It is possible for a page mapped into one or more | ||
444 | VM_LOCKED vmas not to have the PG_mlocked flag set and therefore reside on one | ||
445 | of the active or inactive LRU lists. This could happen if, for example, a | ||
446 | task in the process of munlock()ing the page could not isolate the page from | ||
447 | the LRU. As a result, vmscan/shrink_page_list() might encounter such a page | ||
448 | as described in "Unevictable Pages and Vmscan [shrink_*_list()]". To | ||
449 | handle this situation, try_to_unmap() has been enhanced to check for VM_LOCKED | ||
450 | vmas while it is walking a page's reverse map. | ||
451 | |||
452 | try_to_unmap() is always called, by either vmscan for reclaim or for page | ||
453 | migration, with the argument page locked and isolated from the LRU. BUG_ON() | ||
454 | assertions enforce this requirement. Separate functions handle anonymous and | ||
455 | mapped file pages, as these types of pages have different reverse map | ||
456 | mechanisms. | ||
457 | |||
458 | try_to_unmap_anon() | ||
459 | |||
460 | To unmap anonymous pages, each vma in the list anchored in the anon_vma must be | ||
461 | visited--at least until a VM_LOCKED vma is encountered. If the page is being | ||
462 | unmapped for migration, VM_LOCKED vmas do not stop the process because mlocked | ||
463 | pages are migratable. However, for reclaim, if the page is mapped into a | ||
464 | VM_LOCKED vma, the scan stops. try_to_unmap() attempts to acquire the mmap | ||
465 | semphore of the mm_struct to which the vma belongs in read mode. If this is | ||
466 | successful, try_to_unmap() will mlock the page via mlock_vma_page()--we | ||
467 | wouldn't have gotten to try_to_unmap() if the page were already mlocked--and | ||
468 | will return SWAP_MLOCK, indicating that the page is unevictable. If the | ||
469 | mmap semaphore cannot be acquired, we are not sure whether the page is really | ||
470 | unevictable or not. In this case, try_to_unmap() will return SWAP_AGAIN. | ||
471 | |||
472 | try_to_unmap_file() -- linear mappings | ||
473 | |||
474 | Unmapping of a mapped file page works the same, except that the scan visits | ||
475 | all vmas that maps the page's index/page offset in the page's mapping's | ||
476 | reverse map priority search tree. It must also visit each vma in the page's | ||
477 | mapping's non-linear list, if the list is non-empty. As for anonymous pages, | ||
478 | on encountering a VM_LOCKED vma for a mapped file page, try_to_unmap() will | ||
479 | attempt to acquire the associated mm_struct's mmap semaphore to mlock the page, | ||
480 | returning SWAP_MLOCK if this is successful, and SWAP_AGAIN, if not. | ||
481 | |||
482 | try_to_unmap_file() -- non-linear mappings | ||
483 | |||
484 | If a page's mapping contains a non-empty non-linear mapping vma list, then | ||
485 | try_to_un{map|lock}() must also visit each vma in that list to determine | ||
486 | whether the page is mapped in a VM_LOCKED vma. Again, the scan must visit | ||
487 | all vmas in the non-linear list to ensure that the pages is not/should not be | ||
488 | mlocked. If a VM_LOCKED vma is found in the list, the scan could terminate. | ||
489 | However, there is no easy way to determine whether the page is actually mapped | ||
490 | in a given vma--either for unmapping or testing whether the VM_LOCKED vma | ||
491 | actually pins the page. | ||
492 | |||
493 | So, try_to_unmap_file() handles non-linear mappings by scanning a certain | ||
494 | number of pages--a "cluster"--in each non-linear vma associated with the page's | ||
495 | mapping, for each file mapped page that vmscan tries to unmap. If this happens | ||
496 | to unmap the page we're trying to unmap, try_to_unmap() will notice this on | ||
497 | return--(page_mapcount(page) == 0)--and return SWAP_SUCCESS. Otherwise, it | ||
498 | will return SWAP_AGAIN, causing vmscan to recirculate this page. We take | ||
499 | advantage of the cluster scan in try_to_unmap_cluster() as follows: | ||
500 | |||
501 | For each non-linear vma, try_to_unmap_cluster() attempts to acquire the mmap | ||
502 | semaphore of the associated mm_struct for read without blocking. If this | ||
503 | attempt is successful and the vma is VM_LOCKED, try_to_unmap_cluster() will | ||
504 | retain the mmap semaphore for the scan; otherwise it drops it here. Then, | ||
505 | for each page in the cluster, if we're holding the mmap semaphore for a locked | ||
506 | vma, try_to_unmap_cluster() calls mlock_vma_page() to mlock the page. This | ||
507 | call is a no-op if the page is already locked, but will mlock any pages in | ||
508 | the non-linear mapping that happen to be unlocked. If one of the pages so | ||
509 | mlocked is the page passed in to try_to_unmap(), try_to_unmap_cluster() will | ||
510 | return SWAP_MLOCK, rather than the default SWAP_AGAIN. This will allow vmscan | ||
511 | to cull the page, rather than recirculating it on the inactive list. Again, | ||
512 | if try_to_unmap_cluster() cannot acquire the vma's mmap sem, it returns | ||
513 | SWAP_AGAIN, indicating that the page is mapped by a VM_LOCKED vma, but | ||
514 | couldn't be mlocked. | ||
515 | |||
516 | |||
517 | Mlocked pages: try_to_munlock() Reverse Map Scan | ||
518 | |||
519 | TODO/FIXME: a better name might be page_mlocked()--analogous to the | ||
520 | page_referenced() reverse map walker--especially if we continue to call this | ||
521 | from shrink_page_list(). See related TODO/FIXME below. | ||
522 | |||
523 | When munlock_vma_page()--see "Mlocked Pages: munlock()/munlockall() System | ||
524 | Call Handling" above--tries to munlock a page, or when shrink_page_list() | ||
525 | encounters an anonymous page that is not yet in the swap cache, they need to | ||
526 | determine whether or not the page is mapped by any VM_LOCKED vma, without | ||
527 | actually attempting to unmap all ptes from the page. For this purpose, the | ||
528 | unevictable/mlock infrastructure introduced a variant of try_to_unmap() called | ||
529 | try_to_munlock(). | ||
530 | |||
531 | try_to_munlock() calls the same functions as try_to_unmap() for anonymous and | ||
532 | mapped file pages with an additional argument specifing unlock versus unmap | ||
533 | processing. Again, these functions walk the respective reverse maps looking | ||
534 | for VM_LOCKED vmas. When such a vma is found for anonymous pages and file | ||
535 | pages mapped in linear VMAs, as in the try_to_unmap() case, the functions | ||
536 | attempt to acquire the associated mmap semphore, mlock the page via | ||
537 | mlock_vma_page() and return SWAP_MLOCK. This effectively undoes the | ||
538 | pre-clearing of the page's PG_mlocked done by munlock_vma_page() and informs | ||
539 | shrink_page_list() that the anonymous page should be culled rather than added | ||
540 | to the swap cache in preparation for a try_to_unmap() that will almost | ||
541 | certainly fail. | ||
542 | |||
543 | If try_to_unmap() is unable to acquire a VM_LOCKED vma's associated mmap | ||
544 | semaphore, it will return SWAP_AGAIN. This will allow shrink_page_list() | ||
545 | to recycle the page on the inactive list and hope that it has better luck | ||
546 | with the page next time. | ||
547 | |||
548 | For file pages mapped into non-linear vmas, the try_to_munlock() logic works | ||
549 | slightly differently. On encountering a VM_LOCKED non-linear vma that might | ||
550 | map the page, try_to_munlock() returns SWAP_AGAIN without actually mlocking | ||
551 | the page. munlock_vma_page() will just leave the page unlocked and let | ||
552 | vmscan deal with it--the usual fallback position. | ||
553 | |||
554 | Note that try_to_munlock()'s reverse map walk must visit every vma in a pages' | ||
555 | reverse map to determine that a page is NOT mapped into any VM_LOCKED vma. | ||
556 | However, the scan can terminate when it encounters a VM_LOCKED vma and can | ||
557 | successfully acquire the vma's mmap semphore for read and mlock the page. | ||
558 | Although try_to_munlock() can be called many [very many!] times when | ||
559 | munlock()ing a large region or tearing down a large address space that has been | ||
560 | mlocked via mlockall(), overall this is a fairly rare event. In addition, | ||
561 | although shrink_page_list() calls try_to_munlock() for every anonymous page that | ||
562 | it handles that is not yet in the swap cache, on average anonymous pages will | ||
563 | have very short reverse map lists. | ||
564 | |||
565 | Mlocked Page: Page Reclaim in shrink_*_list() | ||
566 | |||
567 | shrink_active_list() culls any obviously unevictable pages--i.e., | ||
568 | !page_evictable(page, NULL)--diverting these to the unevictable lru | ||
569 | list. However, shrink_active_list() only sees unevictable pages that | ||
570 | made it onto the active/inactive lru lists. Note that these pages do not | ||
571 | have PageUnevictable set--otherwise, they would be on the unevictable list and | ||
572 | shrink_active_list would never see them. | ||
573 | |||
574 | Some examples of these unevictable pages on the LRU lists are: | ||
575 | |||
576 | 1) ramfs pages that have been placed on the lru lists when first allocated. | ||
577 | |||
578 | 2) SHM_LOCKed shared memory pages. shmctl(SHM_LOCK) does not attempt to | ||
579 | allocate or fault in the pages in the shared memory region. This happens | ||
580 | when an application accesses the page the first time after SHM_LOCKing | ||
581 | the segment. | ||
582 | |||
583 | 3) Mlocked pages that could not be isolated from the lru and moved to the | ||
584 | unevictable list in mlock_vma_page(). | ||
585 | |||
586 | 3) Pages mapped into multiple VM_LOCKED vmas, but try_to_munlock() couldn't | ||
587 | acquire the vma's mmap semaphore to test the flags and set PageMlocked. | ||
588 | munlock_vma_page() was forced to let the page back on to the normal | ||
589 | LRU list for vmscan to handle. | ||
590 | |||
591 | shrink_inactive_list() also culls any unevictable pages that it finds | ||
592 | on the inactive lists, again diverting them to the appropriate zone's unevictable | ||
593 | lru list. shrink_inactive_list() should only see SHM_LOCKed pages that became | ||
594 | SHM_LOCKed after shrink_active_list() had moved them to the inactive list, or | ||
595 | pages mapped into VM_LOCKED vmas that munlock_vma_page() couldn't isolate from | ||
596 | the lru to recheck via try_to_munlock(). shrink_inactive_list() won't notice | ||
597 | the latter, but will pass on to shrink_page_list(). | ||
598 | |||
599 | shrink_page_list() again culls obviously unevictable pages that it could | ||
600 | encounter for similar reason to shrink_inactive_list(). As already discussed, | ||
601 | shrink_page_list() proactively looks for anonymous pages that should have | ||
602 | PG_mlocked set but don't--these would not be detected by page_evictable()--to | ||
603 | avoid adding them to the swap cache unnecessarily. File pages mapped into | ||
604 | VM_LOCKED vmas but without PG_mlocked set will make it all the way to | ||
605 | try_to_unmap(). shrink_page_list() will divert them to the unevictable list when | ||
606 | try_to_unmap() returns SWAP_MLOCK, as discussed above. | ||
607 | |||
608 | TODO/FIXME: If we can enhance the swap cache to reliably remove entries | ||
609 | with page_count(page) > 2, as long as all ptes are mapped to the page and | ||
610 | not the swap entry, we can probably remove the call to try_to_munlock() in | ||
611 | shrink_page_list() and just remove the page from the swap cache when | ||
612 | try_to_unmap() returns SWAP_MLOCK. Currently, remove_exclusive_swap_page() | ||
613 | doesn't seem to allow that. | ||
614 | |||
615 | |||