Diffstat (limited to 'Documentation')
62 files changed, 4679 insertions, 2356 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX index 2a39aeba1464..d05737aaa84b 100644 --- a/Documentation/00-INDEX +++ b/Documentation/00-INDEX | |||
@@ -86,6 +86,8 @@ cachetlb.txt | |||
86 | - describes the cache/TLB flushing interfaces Linux uses. | 86 | - describes the cache/TLB flushing interfaces Linux uses. |
87 | cdrom/ | 87 | cdrom/ |
88 | - directory with information on the CD-ROM drivers that Linux has. | 88 | - directory with information on the CD-ROM drivers that Linux has. |
89 | cgroups/ | ||
90 | - cgroups features, including cpusets and memory controller. | ||
89 | connector/ | 91 | connector/ |
90 | - docs on the netlink based userspace<->kernel space communication mod. | 92 | - docs on the netlink based userspace<->kernel space communication mod. |
91 | console/ | 93 | console/ |
@@ -98,8 +100,6 @@ cpu-load.txt | |||
98 | - document describing how CPU load statistics are collected. | 100 | - document describing how CPU load statistics are collected. |
99 | cpuidle/ | 101 | cpuidle/ |
100 | - info on CPU_IDLE, CPU idle state management subsystem. | 102 | - info on CPU_IDLE, CPU idle state management subsystem. |
101 | cpusets.txt | ||
102 | - documents the cpusets feature; assign CPUs and Mem to a set of tasks. | ||
103 | cputopology.txt | 103 | cputopology.txt |
104 | - documentation on how CPU topology info is exported via sysfs. | 104 | - documentation on how CPU topology info is exported via sysfs. |
105 | cris/ | 105 | cris/ |
diff --git a/Documentation/ABI/testing/sysfs-bus-pci b/Documentation/ABI/testing/sysfs-bus-pci index e638e15a8895..97ad190e13af 100644 --- a/Documentation/ABI/testing/sysfs-bus-pci +++ b/Documentation/ABI/testing/sysfs-bus-pci | |||
@@ -41,6 +41,49 @@ Description: | |||
41 | for the device and attempt to bind to it. For example: | 41 | for the device and attempt to bind to it. For example: |
42 | # echo "8086 10f5" > /sys/bus/pci/drivers/foo/new_id | 42 | # echo "8086 10f5" > /sys/bus/pci/drivers/foo/new_id |
43 | 43 | ||
44 | What: /sys/bus/pci/drivers/.../remove_id | ||
45 | Date: February 2009 | ||
46 | Contact: Chris Wright <chrisw@sous-sol.org> | ||
47 | Description: | ||
48 | Writing a device ID to this file will remove an ID | ||
49 | that was dynamically added via the new_id sysfs entry. | ||
50 | The format for the device ID is: | ||
51 | VVVV DDDD SVVV SDDD CCCC MMMM. That is Vendor ID, Device | ||
52 | ID, Subsystem Vendor ID, Subsystem Device ID, Class, | ||
53 | and Class Mask. The Vendor ID and Device ID fields are | ||
54 | required, the rest are optional. After successfully | ||
55 | removing an ID, the driver will no longer support the | ||
56 | device. This is useful to ensure auto probing won't | ||
57 | match the driver to the device. For example: | ||
58 | # echo "8086 10f5" > /sys/bus/pci/drivers/foo/remove_id | ||
59 | |||
60 | What: /sys/bus/pci/rescan | ||
61 | Date: January 2009 | ||
62 | Contact: Linux PCI developers <linux-pci@vger.kernel.org> | ||
63 | Description: | ||
64 | Writing a non-zero value to this attribute will | ||
65 | force a rescan of all PCI buses in the system, and | ||
66 | re-discover previously removed devices. | ||
67 | Depends on CONFIG_HOTPLUG. | ||
68 | |||
69 | What: /sys/bus/pci/devices/.../remove | ||
70 | Date: January 2009 | ||
71 | Contact: Linux PCI developers <linux-pci@vger.kernel.org> | ||
72 | Description: | ||
73 | Writing a non-zero value to this attribute will | ||
74 | hot-remove the PCI device and any of its children. | ||
75 | Depends on CONFIG_HOTPLUG. | ||
76 | |||
77 | What: /sys/bus/pci/devices/.../rescan | ||
78 | Date: January 2009 | ||
79 | Contact: Linux PCI developers <linux-pci@vger.kernel.org> | ||
80 | Description: | ||
81 | Writing a non-zero value to this attribute will | ||
82 | force a rescan of the device's parent bus and all | ||
83 | child buses, and re-discover devices removed earlier | ||
84 | from this part of the device tree. | ||
85 | Depends on CONFIG_HOTPLUG. | ||
86 | |||
44 | What: /sys/bus/pci/devices/.../vpd | 87 | What: /sys/bus/pci/devices/.../vpd |
45 | Date: February 2008 | 88 | Date: February 2008 |
46 | Contact: Ben Hutchings <bhutchings@solarflare.com> | 89 | Contact: Ben Hutchings <bhutchings@solarflare.com> |
@@ -52,3 +95,30 @@ Description: | |||
52 | that some devices may have malformatted data. If the | 95 | that some devices may have malformatted data. If the |
53 | underlying VPD has a writable section then the | 96 | underlying VPD has a writable section then the |
54 | corresponding section of this file will be writable. | 97 | corresponding section of this file will be writable. |
98 | |||
99 | What: /sys/bus/pci/devices/.../virtfnN | ||
100 | Date: March 2009 | ||
101 | Contact: Yu Zhao <yu.zhao@intel.com> | ||
102 | Description: | ||
103 | This symbolic link appears when hardware supports the SR-IOV | ||
104 | capability and the Physical Function driver has enabled it. | ||
105 | The symbolic link points to the PCI device sysfs entry of the | ||
106 | Virtual Function whose index is N (0...MaxVFs-1). | ||
107 | |||
108 | What: /sys/bus/pci/devices/.../dep_link | ||
109 | Date: March 2009 | ||
110 | Contact: Yu Zhao <yu.zhao@intel.com> | ||
111 | Description: | ||
112 | This symbolic link appears when hardware supports the SR-IOV | ||
113 | capability and the Physical Function driver has enabled it, | ||
114 | and this device has vendor specific dependencies with others. | ||
115 | The symbolic link points to the PCI device sysfs entry of | ||
116 | Physical Function this device depends on. | ||
117 | |||
118 | What: /sys/bus/pci/devices/.../physfn | ||
119 | Date: March 2009 | ||
120 | Contact: Yu Zhao <yu.zhao@intel.com> | ||
121 | Description: | ||
122 | This symbolic link appears when a device is a Virtual Function. | ||
123 | The symbolic link points to the PCI device sysfs entry of the | ||
124 | Physical Function this device associates with. | ||
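
Illustration (not part of the patch): the ID string described for remove_id above, "VVVV DDDD SVVV SDDD CCCC MMMM", can be assembled programmatically. The following is a minimal sketch. The helper name format_pci_id is hypothetical and is not a kernel or library API. It assumes that the fields are hexadecimal without a "0x" prefix, as in the "8086 10f5" example, and that the optional fields are positional, so a later one cannot be given without the earlier ones.

```python
def format_pci_id(vendor, device, subvendor=None, subdevice=None,
                  pci_class=None, class_mask=None):
    """Build a "VVVV DDDD [SVVV [SDDD [CCCC [MMMM]]]]" string.

    Hypothetical helper: vendor and device are required, the rest are
    optional. Assumption: optional fields are positional, so a gap
    followed by a later value is rejected.
    """
    fields = [vendor, device]
    seen_gap = False
    for value in (subvendor, subdevice, pci_class, class_mask):
        if value is None:
            seen_gap = True
        elif seen_gap:
            raise ValueError("optional fields must be given in order, without gaps")
        else:
            fields.append(value)
    # Lower-case hex, zero-padded to at least four digits, no "0x" prefix.
    return " ".join(format(f, "04x") for f in fields)
```

The resulting string is what would be written to the remove_id (or new_id) file, as the shell example in the description does. The write itself needs root privileges and a loaded driver, and is not shown here.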
diff --git a/Documentation/ABI/testing/sysfs-class-regulator b/Documentation/ABI/testing/sysfs-class-regulator index 873ef1fc1569..e091fa873792 100644 --- a/Documentation/ABI/testing/sysfs-class-regulator +++ b/Documentation/ABI/testing/sysfs-class-regulator | |||
@@ -4,8 +4,8 @@ KernelVersion: 2.6.26 | |||
4 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 4 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
5 | Description: | 5 | Description: |
6 | Some regulator directories will contain a field called | 6 | Some regulator directories will contain a field called |
7 | state. This reports the regulator enable status, for | 7 | state. This reports the regulator enable control, for |
8 | regulators which can report that value. | 8 | regulators which can report that input value. |
9 | 9 | ||
10 | This will be one of the following strings: | 10 | This will be one of the following strings: |
11 | 11 | ||
@@ -14,16 +14,54 @@ Description: | |||
14 | 'unknown' | 14 | 'unknown' |
15 | 15 | ||
16 | 'enabled' means the regulator output is ON and is supplying | 16 | 'enabled' means the regulator output is ON and is supplying |
17 | power to the system. | 17 | power to the system (assuming no error prevents it). |
18 | 18 | ||
19 | 'disabled' means the regulator output is OFF and is not | 19 | 'disabled' means the regulator output is OFF and is not |
20 | supplying power to the system.. | 20 | supplying power to the system (unless some non-Linux |
21 | control has enabled it). | ||
21 | 22 | ||
22 | 'unknown' means software cannot determine the state, or | 23 | 'unknown' means software cannot determine the state, or |
23 | the reported state is invalid. | 24 | the reported state is invalid. |
24 | 25 | ||
25 | NOTE: this field can be used in conjunction with microvolts | 26 | NOTE: this field can be used in conjunction with microvolts |
26 | and microamps to determine regulator output levels. | 27 | or microamps to determine configured regulator output levels. |
28 | |||
29 | |||
30 | What: /sys/class/regulator/.../status | ||
31 | Description: | ||
32 | Some regulator directories will contain a field called | ||
33 | "status". This reports the current regulator status, for | ||
34 | regulators which can report that output value. | ||
35 | |||
36 | This will be one of the following strings: | ||
37 | |||
38 | off | ||
39 | on | ||
40 | error | ||
41 | fast | ||
42 | normal | ||
43 | idle | ||
44 | standby | ||
45 | |||
46 | "off" means the regulator is not supplying power to the | ||
47 | system. | ||
48 | |||
49 | "on" means the regulator is supplying power to the system, | ||
50 | and the regulator can't report a detailed operation mode. | ||
51 | |||
52 | "error" indicates an out-of-regulation status such as being | ||
53 | disabled due to thermal shutdown, or voltage being unstable | ||
54 | because of problems with the input power supply. | ||
55 | |||
56 | "fast", "normal", "idle", and "standby" are all detailed | ||
57 | regulator operation modes (described elsewhere). They | ||
58 | imply "on", but provide more detail. | ||
59 | |||
60 | Note that regulator status is a function of many inputs, | ||
61 | not limited to control inputs from Linux. For example, | ||
62 | the actual load presented may trigger "error" status; or | ||
63 | a regulator may be enabled by another user, even though | ||
64 | Linux did not enable it. | ||
27 | 65 | ||
28 | 66 | ||
29 | What: /sys/class/regulator/.../type | 67 | What: /sys/class/regulator/.../type |
@@ -58,7 +96,7 @@ Description: | |||
58 | Some regulator directories will contain a field called | 96 | Some regulator directories will contain a field called |
59 | microvolts. This holds the regulator output voltage setting | 97 | microvolts. This holds the regulator output voltage setting |
60 | measured in microvolts (i.e. E-6 Volts), for regulators | 98 | measured in microvolts (i.e. E-6 Volts), for regulators |
61 | which can report that voltage. | 99 | which can report the control input for voltage. |
62 | 100 | ||
63 | NOTE: This value should not be used to determine the regulator | 101 | NOTE: This value should not be used to determine the regulator |
64 | output voltage level as this value is the same regardless of | 102 | output voltage level as this value is the same regardless of |
@@ -73,7 +111,7 @@ Description: | |||
73 | Some regulator directories will contain a field called | 111 | Some regulator directories will contain a field called |
74 | microamps. This holds the regulator output current limit | 112 | microamps. This holds the regulator output current limit |
75 | setting measured in microamps (i.e. E-6 Amps), for regulators | 113 | setting measured in microamps (i.e. E-6 Amps), for regulators |
76 | which can report that current. | 114 | which can report the control input for a current limit. |
77 | 115 | ||
78 | NOTE: This value should not be used to determine the regulator | 116 | NOTE: This value should not be used to determine the regulator |
79 | output current level as this value is the same regardless of | 117 | output current level as this value is the same regardless of |
@@ -87,7 +125,7 @@ Contact: Liam Girdwood <lrg@slimlogic.co.uk> | |||
87 | Description: | 125 | Description: |
88 | Some regulator directories will contain a field called | 126 | Some regulator directories will contain a field called |
89 | opmode. This holds the current regulator operating mode, | 127 | opmode. This holds the current regulator operating mode, |
90 | for regulators which can report it. | 128 | for regulators which can report that control input value. |
91 | 129 | ||
92 | The opmode value can be one of the following strings: | 130 | The opmode value can be one of the following strings: |
93 | 131 | ||
@@ -101,7 +139,8 @@ Description: | |||
101 | 139 | ||
102 | NOTE: This value should not be used to determine the regulator | 140 | NOTE: This value should not be used to determine the regulator |
103 | output operating mode as this value is the same regardless of | 141 | output operating mode as this value is the same regardless of |
104 | whether the regulator is enabled or disabled. | 142 | whether the regulator is enabled or disabled. A "status" |
143 | attribute may be available to determine the actual mode. | ||
105 | 144 | ||
106 | 145 | ||
107 | What: /sys/class/regulator/.../min_microvolts | 146 | What: /sys/class/regulator/.../min_microvolts |
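
Illustration (not part of the patch): the new "status" attribute reports one of seven strings, four of which are detailed modes that imply "on". A userspace reader might fold them into a simple answer as sketched below. The function name regulator_supplying_power is hypothetical. Treating "error" as indeterminate is an assumption of this sketch, since the description covers both a thermal shutdown and an unstable but present output.

```python
# Detailed operation modes listed in the "status" description; each implies "on".
DETAILED_MODES = {"fast", "normal", "idle", "standby"}


def regulator_supplying_power(status):
    """Map a "status" string to True, False, or None (cannot tell).

    Hypothetical helper. Assumption: "error" and any unrecognised
    string are reported as None rather than guessed at.
    """
    status = status.strip()  # sysfs reads usually end with a newline
    if status == "on" or status in DETAILED_MODES:
        return True
    if status == "off":
        return False
    return None
```

Note that this reflects the regulator's reported output status, not the "state" control input, which the text above says may differ when something other than Linux has enabled the regulator.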
diff --git a/Documentation/ABI/testing/sysfs-fs-ext4 b/Documentation/ABI/testing/sysfs-fs-ext4 new file mode 100644 index 000000000000..4e79074de282 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-fs-ext4 | |||
@@ -0,0 +1,81 @@ | |||
1 | What: /sys/fs/ext4/<disk>/mb_stats | ||
2 | Date: March 2008 | ||
3 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
4 | Description: | ||
5 | Controls whether the multiblock allocator should | ||
6 | collect statistics, which are shown during the unmount. | ||
7 | 1 means to collect statistics, 0 means not to collect | ||
8 | statistics | ||
9 | |||
10 | What: /sys/fs/ext4/<disk>/mb_group_prealloc | ||
11 | Date: March 2008 | ||
12 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
13 | Description: | ||
14 | The multiblock allocator will round up allocation | ||
15 | requests to a multiple of this tuning parameter if the | ||
16 | stripe size is not set in the ext4 superblock | ||
17 | |||
18 | What: /sys/fs/ext4/<disk>/mb_max_to_scan | ||
19 | Date: March 2008 | ||
20 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
21 | Description: | ||
22 | The maximum number of extents the multiblock allocator | ||
23 | will search to find the best extent | ||
24 | |||
25 | What: /sys/fs/ext4/<disk>/mb_min_to_scan | ||
26 | Date: March 2008 | ||
27 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
28 | Description: | ||
29 | The minimum number of extents the multiblock allocator | ||
30 | will search to find the best extent | ||
31 | |||
32 | What: /sys/fs/ext4/<disk>/mb_order2_req | ||
33 | Date: March 2008 | ||
34 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
35 | Description: | ||
36 | Tuning parameter which controls the minimum size for | ||
37 | requests (as a power of 2) where the buddy cache is | ||
38 | used | ||
39 | |||
40 | What: /sys/fs/ext4/<disk>/mb_stream_req | ||
41 | Date: March 2008 | ||
42 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
43 | Description: | ||
44 | Files which have fewer blocks than this tunable | ||
45 | parameter will have their blocks allocated out of a | ||
46 | block group specific preallocation pool, so that small | ||
47 | files are packed closely together. Each large file | ||
48 | will have its blocks allocated out of its own unique | ||
49 | preallocation pool. | ||
50 | |||
51 | What: /sys/fs/ext4/<disk>/inode_readahead | ||
52 | Date: March 2008 | ||
53 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
54 | Description: | ||
55 | Tuning parameter which controls the maximum number of | ||
56 | inode table blocks that ext4's inode table readahead | ||
57 | algorithm will pre-read into the buffer cache | ||
58 | |||
59 | What: /sys/fs/ext4/<disk>/delayed_allocation_blocks | ||
60 | Date: March 2008 | ||
61 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
62 | Description: | ||
63 | This file is read-only and shows the number of blocks | ||
64 | that are dirty in the page cache, but which do not | ||
65 | have their location in the filesystem allocated yet. | ||
66 | |||
67 | What: /sys/fs/ext4/<disk>/lifetime_write_kbytes | ||
68 | Date: March 2008 | ||
69 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
70 | Description: | ||
71 | This file is read-only and shows the number of kilobytes | ||
72 | of data that have been written to this filesystem since it was | ||
73 | created. | ||
74 | |||
75 | What: /sys/fs/ext4/<disk>/session_write_kbytes | ||
76 | Date: March 2008 | ||
77 | Contact: "Theodore Ts'o" <tytso@mit.edu> | ||
78 | Description: | ||
79 | This file is read-only and shows the number of | ||
80 | kilobytes of data that have been written to this | ||
81 | filesystem since it was mounted. | ||
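
Illustration (not part of the patch): the two read-only counters session_write_kbytes and lifetime_write_kbytes described above can be read together from a filesystem's sysfs directory. This is a minimal sketch. The function name read_write_kbytes is hypothetical, and it assumes that each file holds a single decimal integer followed by a newline, which the description does not state.

```python
import os


def read_write_kbytes(sysfs_dir):
    """Return (session_kbytes, lifetime_kbytes) from an ext4 sysfs directory.

    Hypothetical helper. sysfs_dir is a directory such as
    /sys/fs/ext4/<disk>. Assumption: each file contains one decimal integer.
    """
    values = []
    for name in ("session_write_kbytes", "lifetime_write_kbytes"):
        with open(os.path.join(sysfs_dir, name)) as f:
            values.append(int(f.read().strip()))
    return tuple(values)
```

Because the directory is passed in as a parameter, the function can be exercised against a temporary directory with stand-in files and does not need a mounted ext4 filesystem.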
diff --git a/Documentation/DocBook/.gitignore b/Documentation/DocBook/.gitignore index c102c02ecf89..c6def352fe39 100644 --- a/Documentation/DocBook/.gitignore +++ b/Documentation/DocBook/.gitignore | |||
@@ -4,3 +4,7 @@ | |||
4 | *.html | 4 | *.html |
5 | *.9.gz | 5 | *.9.gz |
6 | *.9 | 6 | *.9 |
7 | *.aux | ||
8 | *.dvi | ||
9 | *.log | ||
10 | *.out | ||
diff --git a/Documentation/DocBook/kernel-api.tmpl b/Documentation/DocBook/kernel-api.tmpl index bc962cda6504..58c194572c76 100644 --- a/Documentation/DocBook/kernel-api.tmpl +++ b/Documentation/DocBook/kernel-api.tmpl | |||
@@ -199,6 +199,7 @@ X!Edrivers/pci/hotplug.c | |||
199 | --> | 199 | --> |
200 | !Edrivers/pci/probe.c | 200 | !Edrivers/pci/probe.c |
201 | !Edrivers/pci/rom.c | 201 | !Edrivers/pci/rom.c |
202 | !Edrivers/pci/iov.c | ||
202 | </sect1> | 203 | </sect1> |
203 | <sect1><title>PCI Hotplug Support Library</title> | 204 | <sect1><title>PCI Hotplug Support Library</title> |
204 | !Edrivers/pci/hotplug/pci_hotplug_core.c | 205 | !Edrivers/pci/hotplug/pci_hotplug_core.c |
diff --git a/Documentation/PCI/MSI-HOWTO.txt b/Documentation/PCI/MSI-HOWTO.txt index 256defd7e174..dcf7acc720e1 100644 --- a/Documentation/PCI/MSI-HOWTO.txt +++ b/Documentation/PCI/MSI-HOWTO.txt | |||
@@ -4,506 +4,356 @@ | |||
4 | Revised Feb 12, 2004 by Martine Silbermann | 4 | Revised Feb 12, 2004 by Martine Silbermann |
5 | email: Martine.Silbermann@hp.com | 5 | email: Martine.Silbermann@hp.com |
6 | Revised Jun 25, 2004 by Tom L Nguyen | 6 | Revised Jun 25, 2004 by Tom L Nguyen |
7 | Revised Jul 9, 2008 by Matthew Wilcox <willy@linux.intel.com> | ||
8 | Copyright 2003, 2008 Intel Corporation | ||
7 | 9 | ||
8 | 1. About this guide | 10 | 1. About this guide |
9 | 11 | ||
10 | This guide describes the basics of Message Signaled Interrupts (MSI), | 12 | This guide describes the basics of Message Signaled Interrupts (MSIs), |
11 | the advantages of using MSI over traditional interrupt mechanisms, | 13 | the advantages of using MSI over traditional interrupt mechanisms, how |
12 | and how to enable your driver to use MSI or MSI-X. Also included is | 14 | to change your driver to use MSI or MSI-X and some basic diagnostics to |
13 | a Frequently Asked Questions (FAQ) section. | 15 | try if a device doesn't support MSIs. |
14 | |||
15 | 1.1 Terminology | ||
16 | |||
17 | PCI devices can be single-function or multi-function. In either case, | ||
18 | when this text talks about enabling or disabling MSI on a "device | ||
19 | function," it is referring to one specific PCI device and function and | ||
20 | not to all functions on a PCI device (unless the PCI device has only | ||
21 | one function). | ||
22 | |||
23 | 2. Copyright 2003 Intel Corporation | ||
24 | |||
25 | 3. What is MSI/MSI-X? | ||
26 | |||
27 | Message Signaled Interrupt (MSI), as described in the PCI Local Bus | ||
28 | Specification Revision 2.3 or later, is an optional feature, and a | ||
29 | required feature for PCI Express devices. MSI enables a device function | ||
30 | to request service by sending an Inbound Memory Write on its PCI bus to | ||
31 | the FSB as a Message Signal Interrupt transaction. Because MSI is | ||
32 | generated in the form of a Memory Write, all transaction conditions, | ||
33 | such as a Retry, Master-Abort, Target-Abort or normal completion, are | ||
34 | supported. | ||
35 | |||
36 | A PCI device that supports MSI must also support pin IRQ assertion | ||
37 | interrupt mechanism to provide backward compatibility for systems that | ||
38 | do not support MSI. In systems which support MSI, the bus driver is | ||
39 | responsible for initializing the message address and message data of | ||
40 | the device function's MSI/MSI-X capability structure during device | ||
41 | initial configuration. | ||
42 | |||
43 | An MSI capable device function indicates MSI support by implementing | ||
44 | the MSI/MSI-X capability structure in its PCI capability list. The | ||
45 | device function may implement both the MSI capability structure and | ||
46 | the MSI-X capability structure; however, the bus driver should not | ||
47 | enable both. | ||
48 | |||
49 | The MSI capability structure contains Message Control register, | ||
50 | Message Address register and Message Data register. These registers | ||
51 | provide the bus driver control over MSI. The Message Control register | ||
52 | indicates the MSI capability supported by the device. The Message | ||
53 | Address register specifies the target address and the Message Data | ||
54 | register specifies the characteristics of the message. To request | ||
55 | service, the device function writes the content of the Message Data | ||
56 | register to the target address. The device and its software driver | ||
57 | are prohibited from writing to these registers. | ||
58 | |||
59 | The MSI-X capability structure is an optional extension to MSI. It | ||
60 | uses an independent and separate capability structure. There are | ||
61 | some key advantages to implementing the MSI-X capability structure | ||
62 | over the MSI capability structure as described below. | ||
63 | |||
64 | - Support a larger maximum number of vectors per function. | ||
65 | |||
66 | - Provide the ability for system software to configure | ||
67 | each vector with an independent message address and message | ||
68 | data, specified by a table that resides in Memory Space. | ||
69 | |||
70 | - MSI and MSI-X both support per-vector masking. Per-vector | ||
71 | masking is an optional extension of MSI but a required | ||
72 | feature for MSI-X. Per-vector masking provides the kernel the | ||
73 | ability to mask/unmask a single MSI while running its | ||
74 | interrupt service routine. If per-vector masking is | ||
75 | not supported, then the device driver should provide the | ||
76 | hardware/software synchronization to ensure that the device | ||
77 | generates MSI when the driver wants it to do so. | ||
78 | |||
79 | 4. Why use MSI? | ||
80 | |||
81 | As a benefit to the simplification of board design, MSI allows board | ||
82 | designers to remove out-of-band interrupt routing. MSI is another | ||
83 | step towards a legacy-free environment. | ||
84 | |||
85 | Due to increasing pressure on chipset and processor packages to | ||
86 | reduce pin count, the need for interrupt pins is expected to | ||
87 | diminish over time. Devices, due to pin constraints, may implement | ||
88 | messages to increase performance. | ||
89 | |||
90 | PCI Express endpoints uses INTx emulation (in-band messages) instead | ||
91 | of IRQ pin assertion. Using INTx emulation requires interrupt | ||
92 | sharing among devices connected to the same node (PCI bridge) while | ||
93 | MSI is unique (non-shared) and does not require BIOS configuration | ||
94 | support. As a result, the PCI Express technology requires MSI | ||
95 | support for better interrupt performance. | ||
96 | |||
97 | Using MSI enables the device functions to support two or more | ||
98 | vectors, which can be configured to target different CPUs to | ||
99 | increase scalability. | ||
100 | |||
101 | 5. Configuring a driver to use MSI/MSI-X | ||
102 | |||
103 | By default, the kernel will not enable MSI/MSI-X on all devices that | ||
104 | support this capability. The CONFIG_PCI_MSI kernel option | ||
105 | must be selected to enable MSI/MSI-X support. | ||
106 | |||
107 | 5.1 Including MSI/MSI-X support into the kernel | ||
108 | |||
109 | To allow MSI/MSI-X capable device drivers to selectively enable | ||
110 | MSI/MSI-X (using pci_enable_msi()/pci_enable_msix() as described | ||
111 | below), the VECTOR based scheme needs to be enabled by setting | ||
112 | CONFIG_PCI_MSI during kernel config. | ||
113 | |||
114 | Since the target of the inbound message is the local APIC, providing | ||
115 | CONFIG_X86_LOCAL_APIC must be enabled as well as CONFIG_PCI_MSI. | ||
116 | |||
117 | 5.2 Configuring for MSI support | ||
118 | |||
119 | Due to the non-contiguous fashion in vector assignment of the | ||
120 | existing Linux kernel, this version does not support multiple | ||
121 | messages regardless of a device function is capable of supporting | ||
122 | more than one vector. To enable MSI on a device function's MSI | ||
123 | capability structure requires a device driver to call the function | ||
124 | pci_enable_msi() explicitly. | ||
125 | |||
126 | 5.2.1 API pci_enable_msi | ||
127 | 16 | ||
128 | int pci_enable_msi(struct pci_dev *dev) | ||
129 | 17 | ||
130 | With this new API, a device driver that wants to have MSI | 18 | 2. What are MSIs? |
131 | enabled on its device function must call this API to enable MSI. | ||
132 | A successful call will initialize the MSI capability structure | ||
133 | with ONE vector, regardless of whether a device function is | ||
134 | capable of supporting multiple messages. This vector replaces the | ||
135 | pre-assigned dev->irq with a new MSI vector. To avoid a conflict | ||
136 | of the new assigned vector with existing pre-assigned vector requires | ||
137 | a device driver to call this API before calling request_irq(). | ||
138 | 19 | ||
139 | 5.2.2 API pci_disable_msi | 20 | A Message Signaled Interrupt is a write from the device to a special |
21 | address which causes an interrupt to be received by the CPU. | ||
140 | 22 | ||
141 | void pci_disable_msi(struct pci_dev *dev) | 23 | The MSI capability was first specified in PCI 2.2 and was later enhanced |
24 | in PCI 3.0 to allow each interrupt to be masked individually. The MSI-X | ||
25 | capability was also introduced with PCI 3.0. It supports more interrupts | ||
26 | per device than MSI and allows interrupts to be independently configured. | ||
142 | 27 | ||
143 | This API should always be used to undo the effect of pci_enable_msi() | 28 | Devices may support both MSI and MSI-X, but only one can be enabled at |
144 | when a device driver is unloading. This API restores dev->irq with | 29 | a time. |
145 | the pre-assigned IOAPIC vector and switches a device's interrupt | ||
146 | mode to PCI pin-irq assertion/INTx emulation mode. | ||
147 | |||
148 | Note that a device driver should always call free_irq() on the MSI vector | ||
149 | that it has done request_irq() on before calling this API. Failure to do | ||
150 | so results in a BUG_ON() and a device will be left with MSI enabled and | ||
151 | leaks its vector. | ||
152 | |||
153 | 5.2.3 MSI mode vs. legacy mode diagram | ||
154 | |||
155 | The below diagram shows the events which switch the interrupt | ||
156 | mode on the MSI-capable device function between MSI mode and | ||
157 | PIN-IRQ assertion mode. | ||
158 | |||
159 | ------------ pci_enable_msi ------------------------ | ||
160 | | | <=============== | | | ||
161 | | MSI MODE | | PIN-IRQ ASSERTION MODE | | ||
162 | | | ===============> | | | ||
163 | ------------ pci_disable_msi ------------------------ | ||
164 | |||
165 | |||
166 | Figure 1. MSI Mode vs. Legacy Mode | ||
167 | |||
168 | In Figure 1, a device operates by default in legacy mode. Legacy | ||
169 | in this context means PCI pin-irq assertion or PCI-Express INTx | ||
170 | emulation. A successful MSI request (using pci_enable_msi()) switches | ||
171 | a device's interrupt mode to MSI mode. A pre-assigned IOAPIC vector | ||
172 | stored in dev->irq will be saved by the PCI subsystem and a new | ||
173 | assigned MSI vector will replace dev->irq. | ||
174 | |||
175 | To return back to its default mode, a device driver should always call | ||
176 | pci_disable_msi() to undo the effect of pci_enable_msi(). Note that a | ||
177 | device driver should always call free_irq() on the MSI vector it has | ||
178 | done request_irq() on before calling pci_disable_msi(). Failure to do | ||
179 | so results in a BUG_ON() and a device will be left with MSI enabled and | ||
180 | leaks its vector. Otherwise, the PCI subsystem restores a device's | ||
181 | dev->irq with a pre-assigned IOAPIC vector and marks the released | ||
182 | MSI vector as unused. | ||
183 | |||
184 | Once being marked as unused, there is no guarantee that the PCI | ||
185 | subsystem will reserve this MSI vector for a device. Depending on | ||
186 | the availability of current PCI vector resources and the number of | ||
187 | MSI/MSI-X requests from other drivers, this MSI may be re-assigned. | ||
188 | |||
189 | For the case where the PCI subsystem re-assigns this MSI vector to | ||
190 | another driver, a request to switch back to MSI mode may result | ||
191 | in being assigned a different MSI vector or a failure if no more | ||
192 | vectors are available. | ||
193 | |||
194 | 5.3 Configuring for MSI-X support | ||
195 | |||
196 | Due to the ability of the system software to configure each vector of | ||
197 | the MSI-X capability structure with an independent message address | ||
198 | and message data, the non-contiguous fashion in vector assignment of | ||
199 | the existing Linux kernel has no impact on supporting multiple | ||
200 | messages on an MSI-X capable device functions. To enable MSI-X on | ||
201 | a device function's MSI-X capability structure requires its device | ||
202 | driver to call the function pci_enable_msix() explicitly. | ||
203 | |||
204 | The function pci_enable_msix(), once invoked, enables either | ||
205 | all or nothing, depending on the current availability of PCI vector | ||
206 | resources. If the PCI vector resources are available for the number | ||
207 | of vectors requested by a device driver, this function will configure | ||
208 | the MSI-X table of the MSI-X capability structure of a device with | ||
209 | requested messages. To emphasize this reason, for example, a device | ||
210 | may be capable for supporting the maximum of 32 vectors while its | ||
211 | software driver usually may request 4 vectors. It is recommended | ||
212 | that the device driver should call this function once during the | ||
213 | initialization phase of the device driver. | ||
214 | |||
215 | Unlike the function pci_enable_msi(), the function pci_enable_msix() | ||
216 | does not replace the pre-assigned IOAPIC dev->irq with a new MSI | ||
217 | vector because the PCI subsystem writes the 1:1 vector-to-entry mapping | ||
218 | into the field vector of each element contained in a second argument. | ||
219 | Note that the pre-assigned IOAPIC dev->irq is valid only if the device | ||
220 | operates in PIN-IRQ assertion mode. In MSI-X mode, any attempt at | ||
221 | using dev->irq by the device driver to request for interrupt service | ||
222 | may result in unpredictable behavior. | ||
223 | |||
224 | For each MSI-X vector granted, a device driver is responsible for calling | ||
225 | other functions like request_irq(), enable_irq(), etc. to enable | ||
226 | this vector with its corresponding interrupt service handler. It is | ||
227 | a device driver's choice to assign all vectors with the same | ||
228 | interrupt service handler or each vector with a unique interrupt | ||
229 | service handler. | ||
230 | |||
231 | 5.3.1 Handling MMIO address space of MSI-X Table | ||
232 | |||
233 | The PCI 3.0 specification has implementation notes that MMIO address | ||
234 | space for a device's MSI-X structure should be isolated so that the | ||
235 | software system can set different pages for controlling accesses to the | ||
236 | MSI-X structure. The implementation of MSI support requires the PCI | ||
237 | subsystem, not a device driver, to maintain full control of the MSI-X | ||
238 | table/MSI-X PBA (Pending Bit Array) and MMIO address space of the MSI-X | ||
239 | table/MSI-X PBA. A device driver should not access the MMIO address | ||
240 | space of the MSI-X table/MSI-X PBA. | ||
241 | |||
242 | 5.3.2 API pci_enable_msix | ||
243 | 30 | ||
244 | int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec) | ||
245 | 31 | ||
246 | This API enables a device driver to request the PCI subsystem | 32 | 3. Why use MSIs? |
247 | to enable MSI-X messages on its hardware device. Depending on | 33 | |
248 | the availability of PCI vectors resources, the PCI subsystem enables | 34 | There are three reasons why using MSIs can give an advantage over |
249 | either all or none of the requested vectors. | 35 | traditional pin-based interrupts. |
36 | |||
37 | Pin-based PCI interrupts are often shared amongst several devices. | ||
38 | To support this, the kernel must call each interrupt handler associated | ||
39 | with an interrupt, which leads to reduced performance for the system as | ||
40 | a whole. MSIs are never shared, so this problem cannot arise. | ||
41 | |||
42 | When a device writes data to memory, then raises a pin-based interrupt, | ||
43 | it is possible that the interrupt may arrive before all the data has | ||
44 | arrived in memory (this becomes more likely with devices behind PCI-PCI | ||
45 | bridges). In order to ensure that all the data has arrived in memory, | ||
46 | the interrupt handler must read a register on the device which raised | ||
47 | the interrupt. PCI transaction ordering rules require that all the data | ||
48 | arrives in memory before the value can be returned from the register. | ||
49 | Using MSIs avoids this problem as the interrupt-generating write cannot | ||
50 | pass the data writes, so by the time the interrupt is raised, the driver | ||
51 | knows that all the data has arrived in memory. | ||
52 | |||
53 | PCI devices can only support a single pin-based interrupt per function. | ||
54 | Often drivers have to query the device to find out what event has | ||
55 | occurred, slowing down interrupt handling for the common case. With | ||
56 | MSIs, a device can support more interrupts, allowing each interrupt | ||
57 | to be specialised to a different purpose. One possible design gives | ||
58 | infrequent conditions (such as errors) their own interrupt which allows | ||
59 | the driver to handle the normal interrupt handling path more efficiently. | ||
60 | Other possible designs include giving one interrupt to each packet queue | ||
61 | in a network card or each port in a storage controller. | ||
62 | |||
63 | |||
64 | 4. How to use MSIs | ||
65 | |||
66 | PCI devices are initialised to use pin-based interrupts. The device | ||
67 | driver has to set up the device to use MSI or MSI-X. Not all machines | ||
68 | support MSIs correctly, and for those machines, the APIs described below | ||
69 | will simply fail and the device will continue to use pin-based interrupts. | ||
70 | |||
71 | 4.1 Include kernel support for MSIs | ||
72 | |||
73 | To support MSI or MSI-X, the kernel must be built with the CONFIG_PCI_MSI | ||
74 | option enabled. This option is only available on some architectures, | ||
75 | and it may depend on some other options also being set. For example, | ||
76 | on x86, you must also enable X86_UP_APIC or SMP in order to see the | ||
77 | CONFIG_PCI_MSI option. | ||
78 | |||
79 | 4.2 Using MSI | ||
80 | |||
81 | Most of the hard work is done for the driver in the PCI layer. It simply | ||
82 | has to request that the PCI layer set up the MSI capability for this | ||
83 | device. | ||
84 | |||
85 | 4.2.1 pci_enable_msi | ||
86 | |||
87 | int pci_enable_msi(struct pci_dev *dev) | ||
88 | |||
89 | A successful call will allocate ONE interrupt to the device, regardless | ||
90 | of how many MSIs the device supports. The device will be switched from | ||
91 | pin-based interrupt mode to MSI mode. The dev->irq number is changed | ||
92 | to a new number which represents the message signaled interrupt. | ||
93 | This function should be called before the driver calls request_irq() | ||
94 | since enabling MSIs disables the pin-based IRQ and the driver will not | ||
95 | receive interrupts on the old interrupt. | ||
96 | |||
97 | 4.2.2 pci_enable_msi_block | ||
98 | |||
99 | int pci_enable_msi_block(struct pci_dev *dev, int count) | ||
100 | |||
101 | This variation on the above call allows a device driver to request multiple | ||
102 | MSIs. The MSI specification only allows interrupts to be allocated in | ||
103 | powers of two, up to a maximum of 2^5 (32). | ||
104 | |||
105 | If this function returns 0, it has succeeded in allocating at least as many | ||
106 | interrupts as the driver requested (it may have allocated more in order | ||
107 | to satisfy the power-of-two requirement). In this case, the function | ||
108 | enables MSI on this device and updates dev->irq to be the lowest of | ||
109 | the new interrupts assigned to it. The other interrupts assigned to | ||
110 | the device are in the range dev->irq to dev->irq + count - 1. | ||
111 | |||
112 | If this function returns a negative number, it indicates an error and | ||
113 | the driver should not attempt to request any more MSI interrupts for | ||
114 | this device. If this function returns a positive number, it will be | ||
115 | less than 'count' and indicate the number of interrupts that could have | ||
116 | been allocated. In neither case will the irq value have been | ||
117 | updated, nor will the device have been switched into MSI mode. | ||
118 | |||
119 | The device driver must decide what action to take if | ||
120 | pci_enable_msi_block() returns a value less than the number asked for. | ||
121 | Some devices can make use of fewer interrupts than the maximum they | ||
122 | request; in this case the driver should call pci_enable_msi_block() | ||
123 | again. Note that it is not guaranteed to succeed, even when the | ||
124 | 'count' has been reduced to the value returned from a previous call to | ||
125 | pci_enable_msi_block(). This is because there are multiple constraints | ||
126 | on the number of vectors that can be allocated; pci_enable_msi_block() | ||
127 | will return as soon as it finds any constraint that doesn't allow the | ||
128 | call to succeed. | ||
129 | |||
130 | 4.2.3 pci_disable_msi | ||
131 | |||
132 | void pci_disable_msi(struct pci_dev *dev) | ||
250 | 133 | ||
251 | Argument 'dev' points to the device (pci_dev) structure. | 134 | This function should be used to undo the effect of pci_enable_msi() or |
135 | pci_enable_msi_block(). Calling it restores dev->irq to the pin-based | ||
136 | interrupt number and frees the previously allocated message signaled | ||
137 | interrupt(s). The interrupt may subsequently be assigned to another | ||
138 | device, so drivers should not cache the value of dev->irq. | ||
252 | 139 | ||
253 | Argument 'entries' is a pointer to an array of msix_entry structs. | 140 | A device driver must always call free_irq() on the interrupt(s) |
254 | The number of entries is indicated in argument 'nvec'. | 141 | for which it has called request_irq() before calling this function. |
255 | struct msix_entry is defined in /driver/pci/msi.h: | 142 | Failure to do so will result in a BUG_ON(), the device will be left with |
143 | MSI enabled and will leak its vector. | ||
144 | |||
145 | 4.3 Using MSI-X | ||
146 | |||
147 | The MSI-X capability is much more flexible than the MSI capability. | ||
148 | It supports up to 2048 interrupts, each of which can be controlled | ||
149 | independently. To support this flexibility, drivers must use an array of | ||
150 | `struct msix_entry': | ||
256 | 151 | ||
257 | struct msix_entry { | 152 | struct msix_entry { |
258 | u16 vector; /* kernel uses to write alloc vector */ | 153 | u16 vector; /* kernel uses to write alloc vector */ |
259 | u16 entry; /* driver uses to specify entry */ | 154 | u16 entry; /* driver uses to specify entry */ |
260 | }; | 155 | }; |
261 | 156 | ||
262 | A device driver is responsible for initializing the field 'entry' of | 157 | This allows for the device to use these interrupts in a sparse fashion; |
263 | each element with a unique entry supported by MSI-X table. Otherwise, | 158 | for example it could use interrupts 3 and 1027 and allocate only a |
264 | -EINVAL will be returned as a result. A successful return of zero | 159 | two-element array. The driver is expected to fill in the 'entry' value |
265 | indicates the PCI subsystem completed initializing each of the requested | 160 | in each element of the array to indicate which entries it wants the kernel |
266 | entries of the MSI-X table with message address and message data. | 161 | to assign interrupts for. It is invalid to fill in two entries with the |
267 | Last but not least, the PCI subsystem will write the 1:1 | 162 | same number. |
268 | vector-to-entry mapping into the field 'vector' of each element. A | 163 | |
269 | device driver is responsible for keeping track of allocated MSI-X | 164 | 4.3.1 pci_enable_msix |
270 | vectors in its internal data structure. | 165 | |
271 | 166 | int pci_enable_msix(struct pci_dev *dev, struct msix_entry *entries, int nvec) | |
272 | A return of zero indicates that the number of MSI-X vectors was | 167 | |
273 | successfully allocated. A return of greater than zero indicates | 168 | Calling this function asks the PCI subsystem to allocate 'nvec' MSIs. |
274 | MSI-X vector shortage. Or a return of less than zero indicates | 169 | The 'entries' argument is a pointer to an array of msix_entry structs |
275 | a failure. This failure may be a result of duplicate entries | 170 | which should be at least 'nvec' entries in size. On success, the |
276 | specified in second argument, or a result of no available vector, | 171 | function will return 0 and the device will have been switched into |
277 | or a result of failing to initialize MSI-X table entries. | 172 | MSI-X interrupt mode. The 'vector' elements in each entry will have |
278 | 173 | been filled in with the interrupt number. The driver should then call | |
279 | 5.3.3 API pci_disable_msix | 174 | request_irq() for each 'vector' that it decides to use. |
175 | |||
176 | If this function returns a negative number, it indicates an error and | ||
177 | the driver should not attempt to allocate any more MSI-X interrupts for | ||
178 | this device. If it returns a positive number, it indicates the maximum | ||
179 | number of interrupt vectors that could have been allocated. See example | ||
180 | below. | ||
181 | |||
182 | This function, in contrast with pci_enable_msi(), does not adjust | ||
183 | dev->irq. The device will not generate interrupts for this interrupt | ||
184 | number once MSI-X is enabled. The device driver is responsible for | ||
185 | keeping track of the interrupts assigned to the MSI-X vectors so it can | ||
186 | free them again later. | ||
187 | |||
188 | Device drivers should normally call this function once per device | ||
189 | during the initialization phase. | ||
190 | |||
191 | It is ideal if drivers can cope with a variable number of MSI-X interrupts; | ||
192 | there are many reasons why the platform may not be able to provide the | ||
193 | exact number a driver asks for. | ||
194 | |||
195 | A request loop to achieve that might look like: | ||
196 | |||
197 | static int foo_driver_enable_msix(struct foo_adapter *adapter, int nvec) | ||
198 | { | ||
199 | while (nvec >= FOO_DRIVER_MINIMUM_NVEC) { | ||
200 | rc = pci_enable_msix(adapter->pdev, | ||
201 | adapter->msix_entries, nvec); | ||
202 | if (rc > 0) | ||
203 | nvec = rc; | ||
204 | else | ||
205 | return rc; | ||
206 | } | ||
207 | |||
208 | return -ENOSPC; | ||
209 | } | ||
210 | |||
211 | 4.3.2 pci_disable_msix | ||
280 | 212 | ||
281 | void pci_disable_msix(struct pci_dev *dev) | 213 | void pci_disable_msix(struct pci_dev *dev) |
282 | 214 | ||
283 | This API should always be used to undo the effect of pci_enable_msix() | 215 | This API should be used to undo the effect of pci_enable_msix(). It frees |
284 | when a device driver is unloading. Note that a device driver should | 216 | the previously allocated message signaled interrupts. The interrupts may |
285 | always call free_irq() on all MSI-X vectors it has done request_irq() | 217 | subsequently be assigned to another device, so drivers should not cache |
286 | on before calling this API. Failure to do so results in a BUG_ON() and | 218 | the value of the 'vector' elements over a call to pci_disable_msix(). |
287 | a device will be left with MSI-X enabled and leaks its vectors. | 219 | |
288 | 220 | A device driver must always call free_irq() on the interrupt(s) | |
289 | 5.3.4 MSI-X mode vs. legacy mode diagram | 221 | for which it has called request_irq() before calling this function. |
290 | 222 | Failure to do so will result in a BUG_ON(), the device will be left with | |
291 | The diagram below shows the events which switch the interrupt | 223 | MSI-X enabled and will leak its vectors. |
292 | mode on the MSI-X capable device function between MSI-X mode and | 224 | |
293 | PIN-IRQ assertion mode (legacy). | 225 | 4.3.3 The MSI-X Table |
294 | 226 | ||
295 | ------------ pci_enable_msix(,,n) ------------------------ | 227 | The MSI-X capability specifies a BAR and offset within that BAR for the |
296 | | | <=============== | | | 228 | MSI-X Table. This address is mapped by the PCI subsystem, and should not |
297 | | MSI-X MODE | | PIN-IRQ ASSERTION MODE | | 229 | be accessed directly by the device driver. If the driver wishes to |
298 | | | ===============> | | | 230 | mask or unmask an interrupt, it should call disable_irq() / enable_irq(). |
299 | ------------ pci_disable_msix ------------------------ | 231 | |
300 | 232 | 4.4 Handling devices implementing both MSI and MSI-X capabilities | |
301 | Figure 2. MSI-X Mode vs. Legacy Mode | 233 | |
302 | 234 | If a device implements both MSI and MSI-X capabilities, it can | |
303 | In Figure 2, a device operates by default in legacy mode. A | 235 | run in either MSI mode or MSI-X mode but not both simultaneously. |
304 | successful MSI-X request (using pci_enable_msix()) switches a | 236 | This is a requirement of the PCI spec, and it is enforced by the |
305 | device's interrupt mode to MSI-X mode. A pre-assigned IOAPIC vector | 237 | PCI layer. Calling pci_enable_msi() when MSI-X is already enabled or |
306 | stored in dev->irq will be saved by the PCI subsystem; however, | 238 | pci_enable_msix() when MSI is already enabled will result in an error. |
307 | unlike MSI mode, the PCI subsystem will not replace dev->irq with | 239 | If a device driver wishes to switch between MSI and MSI-X at runtime, |
308 | assigned MSI-X vector because the PCI subsystem already writes the 1:1 | 240 | it must first quiesce the device, then switch it back to pin-interrupt |
309 | vector-to-entry mapping into the field 'vector' of each element | 241 | mode, before calling pci_enable_msi() or pci_enable_msix() and resuming |
310 | specified in second argument. | 242 | operation. This is not expected to be a common operation but may be |
311 | 243 | useful for debugging or testing during development. | |
312 | To return back to its default mode, a device driver should always call | 244 | |
313 | pci_disable_msix() to undo the effect of pci_enable_msix(). Note that | 245 | 4.5 Considerations when using MSIs |
314 | a device driver should always call free_irq() on all MSI-X vectors it | 246 | |
315 | has done request_irq() on before calling pci_disable_msix(). Failure | 247 | 4.5.1 Choosing between MSI-X and MSI |
316 | to do so results in a BUG_ON() and a device will be left with MSI-X | 248 | |
317 | enabled and leaks its vectors. Otherwise, the PCI subsystem switches a | 249 | If your device supports both MSI-X and MSI capabilities, you should use |
318 | device function's interrupt mode from MSI-X mode to legacy mode and | 250 | the MSI-X facilities in preference to the MSI facilities. As mentioned |
319 | marks all allocated MSI-X vectors as unused. | 251 | above, MSI-X supports any number of interrupts between 1 and 2048. |
320 | | 252 | In contrast, MSI is restricted to a maximum of 32 interrupts (and |
321 | Once being marked as unused, there is no guarantee that the PCI | 253 | must be a power of two). In addition, the MSI interrupt vectors must |
322 | subsystem will reserve these MSI-X vectors for a device. Depending on | 254 | be allocated consecutively, so the system may not be able to allocate |
323 | the availability of current PCI vector resources and the number of | 255 | as many vectors for MSI as it could for MSI-X. On some platforms, MSI |
324 | MSI/MSI-X requests from other drivers, these MSI-X vectors may be | 256 | interrupts must all be targeted at the same set of CPUs whereas MSI-X |
325 | re-assigned. | 257 | interrupts can all be targeted at different CPUs. |
326 | 258 | ||
327 | For the case where the PCI subsystem re-assigned these MSI-X vectors | 259 | 4.5.2 Spinlocks |
328 | to other drivers, a request to switch back to MSI-X mode may result | 260 | |
329 | being assigned with another set of MSI-X vectors or a failure if no | 261 | Most device drivers have a per-device spinlock which is taken in the |
330 | more vectors are available. | 262 | interrupt handler. With pin-based interrupts or a single MSI, it is not |
331 | 263 | necessary to disable interrupts (Linux guarantees the same interrupt will | |
332 | 5.4 Handling function implementing both MSI and MSI-X capabilities | 264 | not be re-entered). If a device uses multiple interrupts, the driver |
333 | 265 | must disable interrupts while the lock is held. If the device sends | |
334 | For the case where a function implements both MSI and MSI-X | 266 | a different interrupt, the driver will deadlock trying to recursively |
335 | capabilities, the PCI subsystem enables a device to run either in MSI | 267 | acquire the spinlock. |
336 | mode or MSI-X mode but not both. A device driver determines whether it | 268 | |
337 | wants MSI or MSI-X enabled on its hardware device. Once a device | 269 | There are two solutions. The first is to take the lock with |
338 | driver requests for MSI, for example, it is prohibited from requesting | 270 | spin_lock_irqsave() or spin_lock_irq() (see |
339 | MSI-X; in other words, a device driver is not permitted to ping-pong | 271 | Documentation/DocBook/kernel-locking). The second is to specify |
340 | between MSI mode and MSI-X mode at run-time. | 272 | IRQF_DISABLED to request_irq() so that the kernel runs the entire |
341 | 273 | interrupt routine with interrupts disabled. | |
342 | 5.5 Hardware requirements for MSI/MSI-X support | 274 | |
343 | 275 | If your MSI interrupt routine does not hold the lock for the whole time | |
344 | MSI/MSI-X support requires support from both system hardware and | 276 | it is running, the first solution may be best. The second solution is |
345 | individual hardware device functions. | 277 | normally preferred as it avoids making two transitions from interrupt |
346 | 278 | disabled to enabled and back again. | |
347 | 5.5.1 Required x86 hardware support | 279 | |
348 | 280 | 4.6 How to tell whether MSI/MSI-X is enabled on a device | |
349 | Since the target of MSI address is the local APIC CPU, enabling | 281 | |
350 | MSI/MSI-X support in the Linux kernel is dependent on whether existing | 282 | Using 'lspci -v' (as root) may show some devices with "MSI", "Message |
351 | system hardware supports local APIC. Users should verify that their | 283 | Signalled Interrupts" or "MSI-X" capabilities. Each of these capabilities |
352 | system supports local APIC operation by testing that it runs when | 284 | has an 'Enable' flag which will be followed with either "+" (enabled) |
353 | CONFIG_X86_LOCAL_APIC=y. | 285 | or "-" (disabled). |
354 | 286 | ||
355 | In SMP environment, CONFIG_X86_LOCAL_APIC is automatically set; | 287 | |
356 | however, in UP environment, users must manually set | 288 | 5. MSI quirks |
357 | CONFIG_X86_LOCAL_APIC. Once CONFIG_X86_LOCAL_APIC=y, setting | 289 | |
358 | CONFIG_PCI_MSI enables the VECTOR based scheme and the option for | 290 | Several PCI chipsets or devices are known not to support MSIs. |
359 | MSI-capable device drivers to selectively enable MSI/MSI-X. | 291 | The PCI stack provides three ways to disable MSIs: |
360 | 292 | ||
361 | Note that CONFIG_X86_IO_APIC setting is irrelevant because MSI/MSI-X | 293 | 1. globally |
362 | vector is allocated new during runtime and MSI/MSI-X support does not | 294 | 2. on all devices behind a specific bridge |
363 | depend on BIOS support. This key independency enables MSI/MSI-X | 295 | 3. on a single device |
364 | support on future IOxAPIC free platforms. | 296 | |
365 | 297 | 5.1. Disabling MSIs globally | |
366 | 5.5.2 Device hardware support | 298 | |
367 | 299 | Some host chipsets simply don't support MSIs properly. If we're | |
368 | The hardware device function supports MSI by indicating the | 300 | lucky, the manufacturer knows this and has indicated it in the ACPI |
369 | MSI/MSI-X capability structure on its PCI capability list. By | 301 | FADT table. In this case, Linux will automatically disable MSIs. |
370 | default, this capability structure will not be initialized by | 302 | Some boards don't include this information in the table and so we have |
371 | the kernel to enable MSI during the system boot. In other words, | 303 | to detect them ourselves. The complete list of these is found near the |
372 | the device function is running on its default pin assertion mode. | 304 | quirk_disable_all_msi() function in drivers/pci/quirks.c. |
373 | Note that in many cases the hardware supporting MSI have bugs, | 305 | |
374 | which may result in system hangs. The software driver of specific | 306 | If you have a board which has problems with MSIs, you can pass pci=nomsi |
375 | MSI-capable hardware is responsible for deciding whether to call | 307 | on the kernel command line to disable MSIs on all devices. It would be |
376 | pci_enable_msi or not. A return of zero indicates the kernel | 308 | in your best interests to report the problem to linux-pci@vger.kernel.org |
377 | successfully initialized the MSI/MSI-X capability structure of the | 309 | including a full 'lspci -v' so we can add the quirks to the kernel. |
378 | device function. The device function is now running on MSI/MSI-X mode. | 310 | |
379 | 311 | 5.2. Disabling MSIs below a bridge | |
380 | 5.6 How to tell whether MSI/MSI-X is enabled on device function | 312 | |
381 | 313 | Some PCI bridges are not able to route MSIs between busses properly. | |
382 | At the driver level, a return of zero from the function call of | 314 | In this case, MSIs must be disabled on all devices behind the bridge. |
383 | pci_enable_msi()/pci_enable_msix() indicates to a device driver that | 315 | |
384 | its device function is initialized successfully and ready to run in | 316 | Some bridges allow you to enable MSIs by changing some bits in their |
385 | MSI/MSI-X mode. | 317 | PCI configuration space (especially the Hypertransport chipsets such |
386 | 318 | as the nVidia nForce and Serverworks HT2000). As with host chipsets, | |
387 | At the user level, users can use the command 'cat /proc/interrupts' | 319 | Linux mostly knows about them and automatically enables MSIs if it can. |
388 | to display the vectors allocated for devices and their interrupt | 320 | If you have a bridge which Linux doesn't yet know about, you can enable |
389 | MSI/MSI-X modes ("PCI-MSI"/"PCI-MSI-X"). Below shows MSI mode is | 321 | MSIs in configuration space using whatever method you know works, then |
390 | enabled on a SCSI Adaptec 39320D Ultra320 controller. | 322 | enable MSIs on that bridge by doing: |
391 | 323 | ||
392 | CPU0 CPU1 | 324 | echo 1 > /sys/bus/pci/devices/$bridge/msi_bus |
393 | 0: 324639 0 IO-APIC-edge timer | 325 | |
394 | 1: 1186 0 IO-APIC-edge i8042 | 326 | where $bridge is the PCI address of the bridge you've enabled (eg |
395 | 2: 0 0 XT-PIC cascade | 327 | 0000:00:0e.0). |
396 | 12: 2797 0 IO-APIC-edge i8042 | 328 | |
397 | 14: 6543 0 IO-APIC-edge ide0 | 329 | To disable MSIs, echo 0 instead of 1. Changing this value should be |
398 | 15: 1 0 IO-APIC-edge ide1 | 330 | done with caution as it can break interrupt handling for all devices |
399 | 169: 0 0 IO-APIC-level uhci-hcd | 331 | below this bridge. |
400 | 185: 0 0 IO-APIC-level uhci-hcd | 332 | |
401 | 193: 138 10 PCI-MSI aic79xx | 333 | Again, please notify linux-pci@vger.kernel.org of any bridges that need |
402 | 201: 30 0 PCI-MSI aic79xx | 334 | special handling. |
403 | 225: 30 0 IO-APIC-level aic7xxx | 335 | |
404 | 233: 30 0 IO-APIC-level aic7xxx | 336 | 5.3. Disabling MSIs on a single device |
405 | NMI: 0 0 | 337 | |
406 | LOC: 324553 325068 | 338 | Some devices are known to have faulty MSI implementations. Usually this |
407 | ERR: 0 | 339 | is handled in the individual device driver but occasionally it's necessary |
408 | MIS: 0 | 340 | to handle this with a quirk. Some drivers have an option to disable use |
409 | 341 | of MSI. While this is a convenient workaround for the driver author, | |
410 | 6. MSI quirks | 342 | it is not good practise, and should not be emulated. |
411 | 343 | ||
412 | Several PCI chipsets or devices are known to not support MSI. | 344 | 5.4. Finding why MSIs are disabled on a device |
413 | The PCI stack provides 3 possible levels of MSI disabling: | 345 | |
414 | * on a single device | 346 | From the above three sections, you can see that there are many reasons |
415 | * on all devices behind a specific bridge | 347 | why MSIs may not be enabled for a given device. Your first step should |
416 | * globally | 348 | be to examine your dmesg carefully to determine whether MSIs are enabled |
417 | 349 | for your machine. You should also check your .config to be sure you | |
418 | 6.1. Disabling MSI on a single device | 350 | have enabled CONFIG_PCI_MSI. |
419 | 351 | ||
420 | Under some circumstances it might be required to disable MSI on a | 352 | Then, 'lspci -t' gives the list of bridges above a device. Reading |
421 | single device. This may be achieved by either not calling pci_enable_msi() | 353 | /sys/bus/pci/devices/*/msi_bus will tell you whether MSIs are enabled (1) |
422 | at all, or setting the pci_dev->no_msi flag before (most of the time | 354 | or disabled (0). If 0 is found in any of the msi_bus files belonging |
423 | in a quirk). | 355 | to bridges between the PCI root and the device, MSIs are disabled. |
424 | 356 | ||
425 | 6.2. Disabling MSI below a bridge | 357 | It is also worth checking the device driver to see whether it supports MSIs. |
426 | 358 | For example, it may contain calls to pci_enable_msi(), pci_enable_msix() or | |
427 | The vast majority of MSI quirks are required by PCI bridges not | 359 | pci_enable_msi_block(). |
428 | being able to route MSI between busses. In this case, MSI have to be | ||
429 | disabled on all devices behind this bridge. This is achieved by setting | ||
430 | the PCI_BUS_FLAGS_NO_MSI flag in the pci_bus->bus_flags of the bridge | ||
431 | subordinate bus. There is no need to set the same flag on bridges that | ||
432 | are below the broken bridge. When pci_enable_msi() is called to enable | ||
433 | MSI on a device, pci_msi_supported() takes care of checking the NO_MSI | ||
434 | flag in all parent busses of the device. | ||
435 | |||
436 | Some bridges actually support dynamic MSI support enabling/disabling | ||
437 | by changing some bits in their PCI configuration space (especially | ||
438 | the Hypertransport chipsets such as the nVidia nForce and Serverworks | ||
439 | HT2000). It may then be required to update the NO_MSI flag on the | ||
440 | corresponding devices in the sysfs hierarchy. To enable MSI support | ||
441 | on device "0000:00:0e", do: | ||
442 | |||
443 | echo 1 > /sys/bus/pci/devices/0000:00:0e/msi_bus | ||
444 | |||
445 | To disable MSI support, echo 0 instead of 1. Note that it should be | ||
446 | used with caution since changing this value might break interrupts. | ||
447 | |||
448 | 6.3. Disabling MSI globally | ||
449 | |||
450 | Some extreme cases may require disabling MSI globally on the system. | ||
451 | For now, the only known case is a Serverworks PCI-X chipset (MSIs are | ||
452 | not supported on several busses that are not all connected to the | ||
453 | chipset in the Linux PCI hierarchy). In the vast majority of other | ||
454 | cases, disabling only behind a specific bridge is enough. | ||
455 | |||
456 | For debugging purposes, the user may also pass pci=nomsi on the kernel | ||
457 | command-line to explicitly disable MSI globally. But, once the | ||
458 | appropriate quirks are added to the kernel, this option should not be | ||
459 | required anymore. | ||
460 | |||
461 | 6.4. Finding why MSI cannot be enabled on a device | ||
462 | |||
463 | Assuming that MSI are not enabled on a device, you should look at | ||
464 | dmesg to find messages that quirks may output when disabling MSI | ||
465 | on some devices, some bridges or even globally. | ||
466 | Then, lspci -t gives the list of bridges above a device. Reading | ||
467 | /sys/bus/pci/devices/0000:00:0e/msi_bus will tell you whether MSI | ||
468 | are enabled (1) or disabled (0). If 0 is found in a single bridge | ||
469 | msi_bus file above the device, MSI cannot be enabled. | ||
470 | |||
471 | 7. FAQ | ||
472 | |||
473 | Q1. Are there any limitations on using the MSI? | ||
474 | |||
475 | A1. If the PCI device supports MSI and conforms to the | ||
476 | specification and the platform supports the APIC local bus, | ||
477 | then using MSI should work. | ||
478 | |||
479 | Q2. Will it work on all the Pentium processors (P3, P4, Xeon, | ||
480 | AMD processors)? In P3 IPI's are transmitted on the APIC local | ||
481 | bus and in P4 and Xeon they are transmitted on the system | ||
482 | bus. Are there any implications with this? | ||
483 | |||
484 | A2. MSI support enables a PCI device sending an inbound | ||
485 | memory write (0xfeexxxxx as target address) on its PCI bus | ||
486 | directly to the FSB. Since the message address has a | ||
487 | redirection hint bit cleared, it should work. | ||
488 | |||
489 | Q3. The target address 0xfeexxxxx will be translated by the | ||
490 | Host Bridge into an interrupt message. Are there any | ||
491 | limitations on the chipsets such as Intel 8xx, Intel e7xxx, | ||
492 | or VIA? | ||
493 | |||
494 | A3. If these chipsets support an inbound memory write with | ||
495 | target address set as 0xfeexxxxx, as conformed to PCI | ||
496 | specification 2.3 or latest, then it should work. | ||
497 | |||
498 | Q4. From the driver point of view, if the MSI is lost because | ||
499 | of errors occurring during inbound memory write, then it may | ||
500 | wait forever. Is there a mechanism for it to recover? | ||
501 | |||
502 | A4. Since the target of the transaction is an inbound memory | ||
503 | write, all transaction termination conditions (Retry, | ||
504 | Master-Abort, Target-Abort, or normal completion) are | ||
505 | supported. A device sending an MSI must abide by all the PCI | ||
506 | rules and conditions regarding that inbound memory write. So, | ||
507 | if a retry is signaled it must retry, etc... We believe that | ||
508 | the recommendation for Abort is also a retry (refer to PCI | ||
509 | specification 2.3 or latest). | ||
diff --git a/Documentation/PCI/pci-iov-howto.txt b/Documentation/PCI/pci-iov-howto.txt new file mode 100644 index 000000000000..fc73ef5d65b8 --- /dev/null +++ b/Documentation/PCI/pci-iov-howto.txt | |||
@@ -0,0 +1,99 @@ | |||
1 | PCI Express I/O Virtualization Howto | ||
2 | Copyright (C) 2009 Intel Corporation | ||
3 | Yu Zhao <yu.zhao@intel.com> | ||
4 | |||
5 | |||
6 | 1. Overview | ||
7 | |||
8 | 1.1 What is SR-IOV | ||
9 | |||
10 | Single Root I/O Virtualization (SR-IOV) is a PCI Express Extended | ||
11 | capability which makes one physical device appear as multiple virtual | ||
12 | devices. The physical device is referred to as Physical Function (PF) | ||
13 | while the virtual devices are referred to as Virtual Functions (VF). | ||
14 | Allocation of the VF can be dynamically controlled by the PF via | ||
15 | registers encapsulated in the capability. By default, this feature is | ||
16 | not enabled and the PF behaves as a traditional PCIe device. Once it's | |
17 | turned on, each VF's PCI configuration space can be accessed by its own | |
18 | Bus, Device and Function Number (Routing ID). Each VF also has PCI | |
19 | Memory Space, which is used to map its register set. The VF device | |
20 | driver operates on the register set so that the VF is functional and | |
21 | appears as a real PCI device. | |
22 | |||
23 | 2. User Guide | ||
24 | |||
25 | 2.1 How can I enable SR-IOV capability | ||
26 | |||
27 | The device driver (PF driver) will control the enabling and disabling | |
28 | of the capability via the API provided by the SR-IOV core. If the hardware | |
29 | has SR-IOV capability, loading its PF driver would enable it and all | ||
30 | VFs associated with the PF. | ||
31 | |||
32 | 2.2 How can I use the Virtual Functions | ||
33 | |||
34 | VFs are treated as hot-plugged PCI devices in the kernel, so they | |
35 | should be able to work in the same way as real PCI devices. A VF | |
36 | requires a device driver, just as a normal PCI device does. | |
37 | |||
38 | 3. Developer Guide | ||
39 | |||
40 | 3.1 SR-IOV API | ||
41 | |||
42 | To enable SR-IOV capability: | ||
43 | int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn); | ||
44 | 'nr_virtfn' is the number of VFs to be enabled. | |
45 | |||
46 | To disable SR-IOV capability: | ||
47 | void pci_disable_sriov(struct pci_dev *dev); | ||
48 | |||
49 | To notify SR-IOV core of Virtual Function Migration: | ||
50 | irqreturn_t pci_sriov_migration(struct pci_dev *dev); | ||
51 | |||
52 | 3.2 Usage example | ||
53 | |||
54 | The following piece of code illustrates the usage of the SR-IOV API. | |
55 | |||
56 | static int __devinit dev_probe(struct pci_dev *dev, const struct pci_device_id *id) | ||
57 | { | ||
58 | pci_enable_sriov(dev, NR_VIRTFN); | ||
59 | |||
60 | ... | ||
61 | |||
62 | return 0; | ||
63 | } | ||
64 | |||
65 | static void __devexit dev_remove(struct pci_dev *dev) | ||
66 | { | ||
67 | pci_disable_sriov(dev); | ||
68 | |||
69 | ... | ||
70 | } | ||
71 | |||
72 | static int dev_suspend(struct pci_dev *dev, pm_message_t state) | ||
73 | { | ||
74 | ... | ||
75 | |||
76 | return 0; | ||
77 | } | ||
78 | |||
79 | static int dev_resume(struct pci_dev *dev) | ||
80 | { | ||
81 | ... | ||
82 | |||
83 | return 0; | ||
84 | } | ||
85 | |||
86 | static void dev_shutdown(struct pci_dev *dev) | ||
87 | { | ||
88 | ... | ||
89 | } | ||
90 | |||
91 | static struct pci_driver dev_driver = { | ||
92 | .name = "SR-IOV Physical Function driver", | ||
93 | .id_table = dev_id_table, | ||
94 | .probe = dev_probe, | ||
95 | .remove = __devexit_p(dev_remove), | ||
96 | .suspend = dev_suspend, | ||
97 | .resume = dev_resume, | ||
98 | .shutdown = dev_shutdown, | ||
99 | }; | ||
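The usage example in section 3.2 ignores the return value of pci_enable_sriov(), although the API in section 3.1 declares it as returning int. The following is a minimal sketch of a probe/remove pair that propagates that value. It is meant to compile in user space, so struct pci_dev and both SR-IOV functions are simplified stand-ins rather than the kernel definitions. The max_vfs and enabled_vfs fields, the value of NR_VIRTFN, the simplified dev_probe() signature, and the -EINVAL failure condition are all assumptions made for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pci_dev. */
struct pci_dev {
	int max_vfs;      /* assumed: number of VFs the device can expose */
	int enabled_vfs;  /* assumed: number of VFs currently enabled */
};

#define NR_VIRTFN 4       /* assumed value, for illustration only */

/* Mock with the prototype from section 3.1; the failure rule is invented. */
static int pci_enable_sriov(struct pci_dev *dev, int nr_virtfn)
{
	if (nr_virtfn < 0 || nr_virtfn > dev->max_vfs)
		return -EINVAL;
	dev->enabled_vfs = nr_virtfn;
	return 0;
}

/* Mock with the prototype from section 3.1. */
static void pci_disable_sriov(struct pci_dev *dev)
{
	dev->enabled_vfs = 0;
}

/* Probe that reports an SR-IOV failure instead of silently returning 0. */
static int dev_probe(struct pci_dev *dev)
{
	int err = pci_enable_sriov(dev, NR_VIRTFN);

	if (err)
		return err;

	/* ... remaining device setup would go here ... */
	return 0;
}

/* Remove path: disable the VFs before the rest of the teardown. */
static void dev_remove(struct pci_dev *dev)
{
	pci_disable_sriov(dev);

	/* ... remaining device teardown would go here ... */
}
```

Whether a failed enable should fail the whole probe is a design choice. A real PF driver might instead log the error and continue to operate without VFs.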
diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt index 1fd175368a87..4349c1487e91 100644 --- a/Documentation/RCU/listRCU.txt +++ b/Documentation/RCU/listRCU.txt | |||
@@ -118,7 +118,7 @@ Following are the RCU equivalents for these two functions: | |||
118 | list_for_each_entry(e, list, list) { | 118 | list_for_each_entry(e, list, list) { |
119 | if (!audit_compare_rule(rule, &e->rule)) { | 119 | if (!audit_compare_rule(rule, &e->rule)) { |
120 | list_del_rcu(&e->list); | 120 | list_del_rcu(&e->list); |
121 | call_rcu(&e->rcu, audit_free_rule, e); | 121 | call_rcu(&e->rcu, audit_free_rule); |
122 | return 0; | 122 | return 0; |
123 | } | 123 | } |
124 | } | 124 | } |
@@ -206,7 +206,7 @@ RCU ("read-copy update") its name. The RCU code is as follows: | |||
206 | ne->rule.action = newaction; | 206 | ne->rule.action = newaction; |
207 | ne->rule.file_count = newfield_count; | 207 | ne->rule.file_count = newfield_count; |
208 | list_replace_rcu(e, ne); | 208 | list_replace_rcu(e, ne); |
209 | call_rcu(&e->rcu, audit_free_rule, e); | 209 | call_rcu(&e->rcu, audit_free_rule); |
210 | return 0; | 210 | return 0; |
211 | } | 211 | } |
212 | } | 212 | } |
@@ -283,7 +283,7 @@ flag under the spinlock as follows: | |||
283 | list_del_rcu(&e->list); | 283 | list_del_rcu(&e->list); |
284 | e->deleted = 1; | 284 | e->deleted = 1; |
285 | spin_unlock(&e->lock); | 285 | spin_unlock(&e->lock); |
286 | call_rcu(&e->rcu, audit_free_rule, e); | 286 | call_rcu(&e->rcu, audit_free_rule); |
287 | return 0; | 287 | return 0; |
288 | } | 288 | } |
289 | } | 289 | } |
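The three hunks above drop the third argument from call_rcu(), so the callback now receives only a pointer to the embedded rcu_head and has to recover the enclosing object itself. The following user-space sketch shows that recovery step with container_of(). Everything here is a stand-in: the rcu_head layout, the call_rcu() mock (which runs the callback immediately, whereas the kernel defers it until a grace period has elapsed), the rule_id field, and the freed_rule_id variable are assumptions for illustration only.

```c
#include <stddef.h>
#include <stdlib.h>

/* User-space stand-in, NOT the kernel's struct rcu_head. */
struct rcu_head {
	void (*func)(struct rcu_head *head);
};

/* Simplified container_of(): map a member pointer back to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Mock of the two-argument call_rcu(). It invokes the callback right away;
 * the real function queues it to run after a grace period.
 */
static void call_rcu(struct rcu_head *head,
		     void (*func)(struct rcu_head *head))
{
	head->func = func;
	func(head);
}

/* Hypothetical, trimmed-down entry; only the rcu member matters here. */
struct audit_entry {
	int rule_id;
	struct rcu_head rcu;
};

/* Records the id of the last freed entry so the effect can be observed. */
static int freed_rule_id = -1;

/* Callback: recover the entry from its rcu_head, then free it. */
static void audit_free_rule(struct rcu_head *head)
{
	struct audit_entry *e = container_of(head, struct audit_entry, rcu);

	freed_rule_id = e->rule_id;
	free(e);
}
```

A caller would then write call_rcu(&e->rcu, audit_free_rule), matching the new column of the hunks above.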
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt index 95821a29ae41..7aa2002ade77 100644 --- a/Documentation/RCU/rcu.txt +++ b/Documentation/RCU/rcu.txt | |||
@@ -81,7 +81,7 @@ o I hear that RCU needs work in order to support realtime kernels? | |||
81 | This work is largely completed. Realtime-friendly RCU can be | 81 | This work is largely completed. Realtime-friendly RCU can be |
82 | enabled via the CONFIG_PREEMPT_RCU kernel configuration parameter. | 82 | enabled via the CONFIG_PREEMPT_RCU kernel configuration parameter. |
83 | However, work is in progress for enabling priority boosting of | 83 | However, work is in progress for enabling priority boosting of |
84 | preempted RCU read-side critical sections.This is needed if you | 84 | preempted RCU read-side critical sections. This is needed if you |
85 | have CPU-bound realtime threads. | 85 | have CPU-bound realtime threads. |
86 | 86 | ||
87 | o Where can I find more information on RCU? | 87 | o Where can I find more information on RCU? |
diff --git a/Documentation/RCU/rculist_nulls.txt b/Documentation/RCU/rculist_nulls.txt index 239f542d48ba..6389dec33459 100644 --- a/Documentation/RCU/rculist_nulls.txt +++ b/Documentation/RCU/rculist_nulls.txt | |||
@@ -21,7 +21,7 @@ if (obj) { | |||
21 | /* | 21 | /* |
22 | * Because a writer could delete object, and a writer could | 22 | * Because a writer could delete object, and a writer could |
23 | * reuse these object before the RCU grace period, we | 23 | * reuse these object before the RCU grace period, we |
24 | * must check key after geting the reference on object | 24 | * must check key after getting the reference on object |
25 | */ | 25 | */ |
26 | if (obj->key != key) { // not the object we expected | 26 | if (obj->key != key) { // not the object we expected |
27 | put_ref(obj); | 27 | put_ref(obj); |
@@ -117,7 +117,7 @@ a race (some writer did a delete and/or a move of an object | |||
117 | to another chain) checking the final 'nulls' value if | 117 | to another chain) checking the final 'nulls' value if |
118 | the lookup met the end of chain. If final 'nulls' value | 118 | the lookup met the end of chain. If final 'nulls' value |
119 | is not the slot number, then we must restart the lookup at | 119 | is not the slot number, then we must restart the lookup at |
120 | the begining. If the object was moved to same chain, | 120 | the beginning. If the object was moved to the same chain, |
121 | then the reader doesnt care : It might eventually | 121 | then the reader doesnt care : It might eventually |
122 | scan the list again without harm. | 122 | scan the list again without harm. |
123 | 123 | ||
diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX new file mode 100644 index 000000000000..3f58fa3d6d00 --- /dev/null +++ b/Documentation/cgroups/00-INDEX | |||
@@ -0,0 +1,18 @@ | |||
1 | 00-INDEX | ||
2 | - this file | ||
3 | cgroups.txt | ||
4 | - Control Groups definition, implementation details, examples and API. | ||
5 | cpuacct.txt | ||
6 | - CPU Accounting Controller; account CPU usage for groups of tasks. | ||
7 | cpusets.txt | ||
8 | - documents the cpusets feature; assign CPUs and Mem to a set of tasks. | ||
9 | devices.txt | ||
10 | - Device Whitelist Controller; description, interface and security. | ||
11 | freezer-subsystem.txt | ||
12 | - checkpointing; rationale to not use signals, interface. | ||
13 | memcg_test.txt | ||
14 | - Memory Resource Controller; implementation details. | ||
15 | memory.txt | ||
16 | - Memory Resource Controller; design, accounting, interface, testing. | ||
17 | resource_counter.txt | ||
18 | - Resource Counter API. | ||
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index 93feb8444489..6eb1a97e88ce 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt | |||
@@ -56,7 +56,7 @@ hierarchy, and a set of subsystems; each subsystem has system-specific | |||
56 | state attached to each cgroup in the hierarchy. Each hierarchy has | 56 | state attached to each cgroup in the hierarchy. Each hierarchy has |
57 | an instance of the cgroup virtual filesystem associated with it. | 57 | an instance of the cgroup virtual filesystem associated with it. |
58 | 58 | ||
59 | At any one time there may be multiple active hierachies of task | 59 | At any one time there may be multiple active hierarchies of task |
60 | cgroups. Each hierarchy is a partition of all tasks in the system. | 60 | cgroups. Each hierarchy is a partition of all tasks in the system. |
61 | 61 | ||
62 | User level code may create and destroy cgroups by name in an | 62 | User level code may create and destroy cgroups by name in an |
@@ -124,10 +124,10 @@ following lines: | |||
124 | / \ | 124 | / \ |
125 | Prof (15%) students (5%) | 125 | Prof (15%) students (5%) |
126 | 126 | ||
127 | Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go | 127 | Browsers like Firefox/Lynx go into the WWW network class, while (k)nfsd go |
128 | into NFS network class. | 128 | into NFS network class. |
129 | 129 | ||
130 | At the same time firefox/lynx will share an appropriate CPU/Memory class | 130 | At the same time Firefox/Lynx will share an appropriate CPU/Memory class |
131 | depending on who launched it (prof/student). | 131 | depending on who launched it (prof/student). |
132 | 132 | ||
133 | With the ability to classify tasks differently for different resources | 133 | With the ability to classify tasks differently for different resources |
@@ -325,7 +325,7 @@ and then start a subshell 'sh' in that cgroup: | |||
325 | Creating, modifying, using the cgroups can be done through the cgroup | 325 | Creating, modifying, using the cgroups can be done through the cgroup |
326 | virtual filesystem. | 326 | virtual filesystem. |
327 | 327 | ||
328 | To mount a cgroup hierarchy will all available subsystems, type: | 328 | To mount a cgroup hierarchy with all available subsystems, type: |
329 | # mount -t cgroup xxx /dev/cgroup | 329 | # mount -t cgroup xxx /dev/cgroup |
330 | 330 | ||
331 | The "xxx" is not interpreted by the cgroup code, but will appear in | 331 | The "xxx" is not interpreted by the cgroup code, but will appear in |
@@ -333,12 +333,23 @@ The "xxx" is not interpreted by the cgroup code, but will appear in | |||
333 | 333 | ||
334 | To mount a cgroup hierarchy with just the cpuset and numtasks | 334 | To mount a cgroup hierarchy with just the cpuset and numtasks |
335 | subsystems, type: | 335 | subsystems, type: |
336 | # mount -t cgroup -o cpuset,numtasks hier1 /dev/cgroup | 336 | # mount -t cgroup -o cpuset,memory hier1 /dev/cgroup |
337 | 337 | ||
338 | To change the set of subsystems bound to a mounted hierarchy, just | 338 | To change the set of subsystems bound to a mounted hierarchy, just |
339 | remount with different options: | 339 | remount with different options: |
340 | # mount -o remount,cpuset,ns hier1 /dev/cgroup | ||
340 | 341 | ||
341 | # mount -o remount,cpuset,ns /dev/cgroup | 342 | Now memory is removed from the hierarchy and ns is added. |
343 | |||
344 | Note this will add ns to the hierarchy but won't remove memory or | ||
345 | cpuset, because the new options are appended to the old ones: | ||
346 | # mount -o remount,ns /dev/cgroup | ||
347 | |||
348 | To specify a hierarchy's release_agent: | |
349 | # mount -t cgroup -o cpuset,release_agent="/sbin/cpuset_release_agent" \ | ||
350 | xxx /dev/cgroup | ||
351 | |||
352 | Note that specifying 'release_agent' more than once will return failure. | ||
342 | 353 | ||
343 | Note that changing the set of subsystems is currently only supported | 354 | Note that changing the set of subsystems is currently only supported |
344 | when the hierarchy consists of a single (root) cgroup. Supporting | 355 | when the hierarchy consists of a single (root) cgroup. Supporting |
@@ -349,6 +360,11 @@ Then under /dev/cgroup you can find a tree that corresponds to the | |||
349 | tree of the cgroups in the system. For instance, /dev/cgroup | 360 | tree of the cgroups in the system. For instance, /dev/cgroup |
350 | is the cgroup that holds the whole system. | 361 | is the cgroup that holds the whole system. |
351 | 362 | ||
363 | If you want to change the value of release_agent: | ||
364 | # echo "/sbin/new_release_agent" > /dev/cgroup/release_agent | ||
365 | |||
366 | It can also be changed via remount. | ||
367 | |||
352 | If you want to create a new cgroup under /dev/cgroup: | 368 | If you want to create a new cgroup under /dev/cgroup: |
353 | # cd /dev/cgroup | 369 | # cd /dev/cgroup |
354 | # mkdir my_cgroup | 370 | # mkdir my_cgroup |
@@ -476,11 +492,13 @@ cgroup->parent is still valid. (Note - can also be called for a | |||
476 | newly-created cgroup if an error occurs after this subsystem's | 492 | newly-created cgroup if an error occurs after this subsystem's |
477 | create() method has been called for the new cgroup). | 493 | create() method has been called for the new cgroup). |
478 | 494 | ||
479 | void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp); | 495 | int pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp); |
480 | 496 | ||
481 | Called before checking the reference count on each subsystem. This may | 497 | Called before checking the reference count on each subsystem. This may |
482 | be useful for subsystems which have some extra references even if | 498 | be useful for subsystems which have some extra references even if |
483 | there are not tasks in the cgroup. | 499 | there are not tasks in the cgroup. If pre_destroy() returns an error code, |
500 | rmdir() will fail with it. Because of this behavior, pre_destroy() can be | |
501 | called multiple times against a cgroup. | ||
484 | 502 | ||
485 | int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | 503 | int can_attach(struct cgroup_subsys *ss, struct cgroup *cgrp, |
486 | struct task_struct *task) | 504 | struct task_struct *task) |
@@ -521,7 +539,7 @@ always handled well. | |||
521 | void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp) | 539 | void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp) |
522 | (cgroup_mutex held by caller) | 540 | (cgroup_mutex held by caller) |
523 | 541 | ||
524 | Called at the end of cgroup_clone() to do any paramater | 542 | Called at the end of cgroup_clone() to do any parameter |
525 | initialization which might be required before a task could attach. For | 543 | initialization which might be required before a task could attach. For |
526 | example in cpusets, no task may attach before 'cpus' and 'mems' are set | 544 | example in cpusets, no task may attach before 'cpus' and 'mems' are set |
527 | up. | 545 | up. |
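The pre_destroy() hunk above changes the return type to int and notes that a failing pre_destroy() makes rmdir() fail, so the hook may run more than once for the same cgroup. The following user-space sketch models only that control flow. The struct layouts, the extra_refs field, the -EBUSY return value, and the demo_pre_destroy() and demo_rmdir() helpers are hypothetical; none of them is kernel code.

```c
#include <errno.h>

/* Hypothetical stand-ins for the kernel structures. */
struct cgroup_subsys { int unused; };
struct cgroup { int extra_refs; };  /* assumed: references beyond tasks */

/* Counts invocations to show that the hook may run repeatedly. */
static int pre_destroy_calls;

/* Refuses destruction while extra references remain (assumed policy). */
static int demo_pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
	(void)ss;
	pre_destroy_calls++;
	if (cgrp->extra_refs > 0)
		return -EBUSY;
	return 0;
}

/* Models the rmdir path: a pre_destroy() error is returned to the caller. */
static int demo_rmdir(struct cgroup_subsys *ss, struct cgroup *cgrp)
{
	int err = demo_pre_destroy(ss, cgrp);

	if (err)
		return err;

	/* ... reference-count checks and actual removal would follow ... */
	return 0;
}
```

Because a failed attempt can be retried, a pre_destroy() implementation should tolerate being called again on a cgroup it has already partly cleaned up.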
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt index 0611e9528c7c..f9ca389dddf4 100644 --- a/Documentation/cgroups/cpusets.txt +++ b/Documentation/cgroups/cpusets.txt | |||
@@ -131,7 +131,7 @@ Cpusets extends these two mechanisms as follows: | |||
131 | - The hierarchy of cpusets can be mounted at /dev/cpuset, for | 131 | - The hierarchy of cpusets can be mounted at /dev/cpuset, for |
132 | browsing and manipulation from user space. | 132 | browsing and manipulation from user space. |
133 | - A cpuset may be marked exclusive, which ensures that no other | 133 | - A cpuset may be marked exclusive, which ensures that no other |
134 | cpuset (except direct ancestors and descendents) may contain | 134 | cpuset (except direct ancestors and descendants) may contain |
135 | any overlapping CPUs or Memory Nodes. | 135 | any overlapping CPUs or Memory Nodes. |
136 | - You can list all the tasks (by pid) attached to any cpuset. | 136 | - You can list all the tasks (by pid) attached to any cpuset. |
137 | 137 | ||
@@ -226,7 +226,7 @@ nodes with memory--using the cpuset_track_online_nodes() hook. | |||
226 | -------------------------------- | 226 | -------------------------------- |
227 | 227 | ||
228 | If a cpuset is cpu or mem exclusive, no other cpuset, other than | 228 | If a cpuset is cpu or mem exclusive, no other cpuset, other than |
229 | a direct ancestor or descendent, may share any of the same CPUs or | 229 | a direct ancestor or descendant, may share any of the same CPUs or |
230 | Memory Nodes. | 230 | Memory Nodes. |
231 | 231 | ||
232 | A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", | 232 | A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled", |
@@ -427,7 +427,7 @@ child cpusets have this flag enabled. | |||
427 | When doing this, you don't usually want to leave any unpinned tasks in | 427 | When doing this, you don't usually want to leave any unpinned tasks in |
428 | the top cpuset that might use non-trivial amounts of CPU, as such tasks | 428 | the top cpuset that might use non-trivial amounts of CPU, as such tasks |
429 | may be artificially constrained to some subset of CPUs, depending on | 429 | may be artificially constrained to some subset of CPUs, depending on |
430 | the particulars of this flag setting in descendent cpusets. Even if | 430 | the particulars of this flag setting in descendant cpusets. Even if |
431 | such a task could use spare CPU cycles in some other CPUs, the kernel | 431 | such a task could use spare CPU cycles in some other CPUs, the kernel |
432 | scheduler might not consider the possibility of load balancing that | 432 | scheduler might not consider the possibility of load balancing that |
433 | task to that underused CPU. | 433 | task to that underused CPU. |
@@ -531,9 +531,9 @@ be idle. | |||
531 | 531 | ||
532 | Of course it takes some searching cost to find movable tasks and/or | 532 | Of course it takes some searching cost to find movable tasks and/or |
533 | idle CPUs, the scheduler might not search all CPUs in the domain | 533 | idle CPUs, the scheduler might not search all CPUs in the domain |
534 | everytime. In fact, in some architectures, the searching ranges on | 534 | every time. In fact, in some architectures, the searching ranges on |
535 | events are limited in the same socket or node where the CPU locates, | 535 | events are limited in the same socket or node where the CPU locates, |
536 | while the load balance on tick searchs all. | 536 | while the load balance on tick searches all. |
537 | 537 | ||
538 | For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | 538 | For example, assume CPU Z is relatively far from CPU X. Even if CPU Z |
539 | is idle while CPU X and the siblings are busy, scheduler can't migrate | 539 | is idle while CPU X and the siblings are busy, scheduler can't migrate |
@@ -601,7 +601,7 @@ its new cpuset, then the task will continue to use whatever subset | |||
601 | of MPOL_BIND nodes are still allowed in the new cpuset. If the task | 601 | of MPOL_BIND nodes are still allowed in the new cpuset. If the task |
602 | was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed | 602 | was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed |
603 | in the new cpuset, then the task will be essentially treated as if it | 603 | in the new cpuset, then the task will be essentially treated as if it |
604 | was MPOL_BIND bound to the new cpuset (even though its numa placement, | 604 | was MPOL_BIND bound to the new cpuset (even though its NUMA placement, |
605 | as queried by get_mempolicy(), doesn't change). If a task is moved | 605 | as queried by get_mempolicy(), doesn't change). If a task is moved |
606 | from one cpuset to another, then the kernel will adjust the tasks | 606 | from one cpuset to another, then the kernel will adjust the tasks |
607 | memory placement, as above, the next time that the kernel attempts | 607 | memory placement, as above, the next time that the kernel attempts |
diff --git a/Documentation/cgroups/devices.txt b/Documentation/cgroups/devices.txt index 7cc6e6a60672..57ca4c89fe5c 100644 --- a/Documentation/cgroups/devices.txt +++ b/Documentation/cgroups/devices.txt | |||
@@ -42,7 +42,7 @@ suffice, but we can decide the best way to adequately restrict | |||
42 | movement as people get some experience with this. We may just want | 42 | movement as people get some experience with this. We may just want |
43 | to require CAP_SYS_ADMIN, which at least is a separate bit from | 43 | to require CAP_SYS_ADMIN, which at least is a separate bit from |
44 | CAP_MKNOD. We may want to just refuse moving to a cgroup which | 44 | CAP_MKNOD. We may want to just refuse moving to a cgroup which |
45 | isn't a descendent of the current one. Or we may want to use | 45 | isn't a descendant of the current one. Or we may want to use |
46 | CAP_MAC_ADMIN, since we really are trying to lock down root. | 46 | CAP_MAC_ADMIN, since we really are trying to lock down root. |
47 | 47 | ||
48 | CAP_SYS_ADMIN is needed to modify the whitelist or move another | 48 | CAP_SYS_ADMIN is needed to modify the whitelist or move another |
diff --git a/Documentation/cgroups/memcg_test.txt b/Documentation/cgroups/memcg_test.txt index 523a9c16c400..72db89ed0609 100644 --- a/Documentation/cgroups/memcg_test.txt +++ b/Documentation/cgroups/memcg_test.txt | |||
@@ -1,5 +1,5 @@ | |||
1 | Memory Resource Controller(Memcg) Implementation Memo. | 1 | Memory Resource Controller(Memcg) Implementation Memo. |
2 | Last Updated: 2009/1/19 | 2 | Last Updated: 2009/1/20 |
3 | Base Kernel Version: based on 2.6.29-rc2. | 3 | Base Kernel Version: based on 2.6.29-rc2. |
4 | 4 | ||
5 | Because VM is getting complex (one of reasons is memcg...), memcg's behavior | 5 | Because VM is getting complex (one of reasons is memcg...), memcg's behavior |
@@ -356,7 +356,25 @@ Under below explanation, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | |||
356 | (Shell-B) | 356 | (Shell-B) |
357 | # move all tasks in /cgroup/test to /cgroup | 357 | # move all tasks in /cgroup/test to /cgroup |
358 | # /sbin/swapoff -a | 358 | # /sbin/swapoff -a |
359 | # rmdir /test/cgroup | 359 | # rmdir /cgroup/test |
360 | # kill malloc task. | 360 | # kill malloc task. |
361 | 361 | ||
362 | Of course, tmpfs v.s. swapoff test should be tested, too. | 362 | Of course, tmpfs v.s. swapoff test should be tested, too. |
363 | |||
364 | 9.8 OOM-Killer | ||
365 | Out-of-memory caused by memcg's limit will kill tasks under | ||
366 | the memcg. When hierarchy is used, a task under hierarchy | ||
367 | will be killed by the kernel. | ||
368 | In this case, panic_on_oom shouldn't be invoked and tasks | ||
369 | in other groups shouldn't be killed. | ||
370 | |||
371 | It's not difficult to cause OOM under memcg as follows. | |
372 | Case A) when you can swapoff | ||
373 | #swapoff -a | ||
374 | #echo 50M > /memory.limit_in_bytes | ||
375 | run 51M of malloc | ||
376 | |||
377 | Case B) when you use mem+swap limitation. | ||
378 | #echo 50M > memory.limit_in_bytes | ||
379 | #echo 50M > memory.memsw.limit_in_bytes | ||
380 | run 51M of malloc | ||
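Both cases in section 9.8 end with "run 51M of malloc" and leave the program unspecified. A minimal sketch of such a helper follows, assuming that the memory has to be written to, not just allocated, before its pages count against the limit. The function name touch_megabytes() and the fill pattern are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Allocate `mb` MiB and write to every byte so that the pages are
 * actually backed by memory. Returns the number of bytes touched, or 0
 * if malloc() fails. The buffer is handed back through `out`; the
 * caller owns it and must free() it.
 */
static size_t touch_megabytes(size_t mb, unsigned char **out)
{
	size_t bytes = mb * 1024 * 1024;
	unsigned char *buf = malloc(bytes);

	if (buf == NULL) {
		*out = NULL;
		return 0;
	}
	memset(buf, 0xA5, bytes);
	*out = buf;
	return bytes;
}
```

For the test described above, a small main() could call touch_megabytes(51, &buf) from a task attached to the limited cgroup and then block, for example in pause(), so that the outcome can be observed.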
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index e1501964df1e..a98a7fe7aabb 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt | |||
@@ -302,7 +302,7 @@ will be charged as a new owner of it. | |||
302 | unevictable - # of pages cannot be reclaimed.(mlocked etc) | 302 | unevictable - # of pages cannot be reclaimed.(mlocked etc) |
303 | 303 | ||
304 | Below is depend on CONFIG_DEBUG_VM. | 304 | Below is depend on CONFIG_DEBUG_VM. |
305 | inactive_ratio - VM inernal parameter. (see mm/page_alloc.c) | 305 | inactive_ratio - VM internal parameter. (see mm/page_alloc.c) |
306 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) | 306 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) |
307 | recent_rotated_file - VM internal parameter. (see mm/vmscan.c) | 307 | recent_rotated_file - VM internal parameter. (see mm/vmscan.c) |
308 | recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) | 308 | recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) |
diff --git a/Documentation/devices.txt b/Documentation/devices.txt index 62254d4510c6..327de1624759 100644 --- a/Documentation/devices.txt +++ b/Documentation/devices.txt | |||
@@ -1,7 +1,7 @@ | |||
1 | 1 | ||
2 | LINUX ALLOCATED DEVICES (2.6+ version) | 2 | LINUX ALLOCATED DEVICES (2.6+ version) |
3 | 3 | ||
4 | Maintained by Torben Mathiasen <device@lanana.org> | 4 | Maintained by Alan Cox <device@lanana.org> |
5 | 5 | ||
6 | Last revised: 29 November 2006 | 6 | Last revised: 29 November 2006 |
7 | 7 | ||
@@ -67,6 +67,11 @@ up to date. Due to the number of registrations I have to maintain it | |||
67 | in "batch mode", so there is likely additional registrations that | 67 | in "batch mode", so there is likely additional registrations that |
68 | haven't been listed yet. | 68 | haven't been listed yet. |
69 | 69 | ||
70 | Fourth, remember that Linux now has extensive support for dynamic allocation | ||
71 | of device numbering and can use sysfs and udev to handle the naming needs. | ||
72 | There are still some exceptions in the serial and boot device area. Before | ||
73 | asking for a device number make sure you actually need one. | ||
74 | |||
70 | Finally, sometimes I have to play "namespace police." Please don't be | 75 | Finally, sometimes I have to play "namespace police." Please don't be |
71 | offended. I often get submissions for /dev names that would be bound | 76 | offended. I often get submissions for /dev names that would be bound |
72 | to cause conflicts down the road. I am trying to avoid getting in a | 77 | to cause conflicts down the road. I am trying to avoid getting in a |
@@ -101,7 +106,7 @@ Your cooperation is appreciated. | |||
101 | 0 = /dev/ram0 First RAM disk | 106 | 0 = /dev/ram0 First RAM disk |
102 | 1 = /dev/ram1 Second RAM disk | 107 | 1 = /dev/ram1 Second RAM disk |
103 | ... | 108 | ... |
104 | 250 = /dev/initrd Initial RAM disk {2.6} | 109 | 250 = /dev/initrd Initial RAM disk |
105 | 110 | ||
106 | Older kernels had /dev/ramdisk (1, 1) here. | 111 | Older kernels had /dev/ramdisk (1, 1) here. |
107 | /dev/initrd refers to a RAM disk which was preloaded | 112 | /dev/initrd refers to a RAM disk which was preloaded |
@@ -340,7 +345,7 @@ Your cooperation is appreciated. | |||
340 | 14 = /dev/touchscreen/ucb1x00 UCB 1x00 touchscreen | 345 | 14 = /dev/touchscreen/ucb1x00 UCB 1x00 touchscreen |
341 | 15 = /dev/touchscreen/mk712 MK712 touchscreen | 346 | 15 = /dev/touchscreen/mk712 MK712 touchscreen |
342 | 128 = /dev/beep Fancy beep device | 347 | 128 = /dev/beep Fancy beep device |
343 | 129 = /dev/modreq Kernel module load request {2.6} | 348 | 129 = |
344 | 130 = /dev/watchdog Watchdog timer port | 349 | 130 = /dev/watchdog Watchdog timer port |
345 | 131 = /dev/temperature Machine internal temperature | 350 | 131 = /dev/temperature Machine internal temperature |
346 | 132 = /dev/hwtrap Hardware fault trap | 351 | 132 = /dev/hwtrap Hardware fault trap |
@@ -350,10 +355,10 @@ Your cooperation is appreciated. | |||
350 | 139 = /dev/openprom SPARC OpenBoot PROM | 355 | 139 = /dev/openprom SPARC OpenBoot PROM |
351 | 140 = /dev/relay8 Berkshire Products Octal relay card | 356 | 140 = /dev/relay8 Berkshire Products Octal relay card |
352 | 141 = /dev/relay16 Berkshire Products ISO-16 relay card | 357 | 141 = /dev/relay16 Berkshire Products ISO-16 relay card |
353 | 142 = /dev/msr x86 model-specific registers {2.6} | 358 | 142 = |
354 | 143 = /dev/pciconf PCI configuration space | 359 | 143 = /dev/pciconf PCI configuration space |
355 | 144 = /dev/nvram Non-volatile configuration RAM | 360 | 144 = /dev/nvram Non-volatile configuration RAM |
356 | 145 = /dev/hfmodem Soundcard shortwave modem control {2.6} | 361 | 145 = /dev/hfmodem Soundcard shortwave modem control |
357 | 146 = /dev/graphics Linux/SGI graphics device | 362 | 146 = /dev/graphics Linux/SGI graphics device |
358 | 147 = /dev/opengl Linux/SGI OpenGL pipe | 363 | 147 = /dev/opengl Linux/SGI OpenGL pipe |
359 | 148 = /dev/gfx Linux/SGI graphics effects device | 364 | 148 = /dev/gfx Linux/SGI graphics effects device |
@@ -435,6 +440,9 @@ Your cooperation is appreciated. | |||
435 | 228 = /dev/hpet HPET driver | 440 | 228 = /dev/hpet HPET driver |
436 | 229 = /dev/fuse Fuse (virtual filesystem in user-space) | 441 | 229 = /dev/fuse Fuse (virtual filesystem in user-space) |
437 | 230 = /dev/midishare MidiShare driver | 442 | 230 = /dev/midishare MidiShare driver |
443 | 231 = /dev/snapshot System memory snapshot device | ||
444 | 232 = /dev/kvm Kernel-based virtual machine (hardware virtualization extensions) | ||
445 | 233 = /dev/kmview View-OS A process with a view | ||
438 | 240-254 Reserved for local use | 446 | 240-254 Reserved for local use |
439 | 255 Reserved for MISC_DYNAMIC_MINOR | 447 | 255 Reserved for MISC_DYNAMIC_MINOR |
440 | 448 | ||
@@ -466,10 +474,7 @@ Your cooperation is appreciated. | |||
466 | The device names specified are proposed -- if there | 474 | The device names specified are proposed -- if there |
467 | are "standard" names for these devices, please let me know. | 475 | are "standard" names for these devices, please let me know. |
468 | 476 | ||
469 | 12 block MSCDEX CD-ROM callback support {2.6} | 477 | 12 block |
470 | 0 = /dev/dos_cd0 First MSCDEX CD-ROM | ||
471 | 1 = /dev/dos_cd1 Second MSCDEX CD-ROM | ||
472 | ... | ||
473 | 478 | ||
474 | 13 char Input core | 479 | 13 char Input core |
475 | 0 = /dev/input/js0 First joystick | 480 | 0 = /dev/input/js0 First joystick |
@@ -498,7 +503,7 @@ Your cooperation is appreciated. | |||
498 | 2 = /dev/midi00 First MIDI port | 503 | 2 = /dev/midi00 First MIDI port |
499 | 3 = /dev/dsp Digital audio | 504 | 3 = /dev/dsp Digital audio |
500 | 4 = /dev/audio Sun-compatible digital audio | 505 | 4 = /dev/audio Sun-compatible digital audio |
501 | 6 = /dev/sndstat Sound card status information {2.6} | 506 | 6 = |
502 | 7 = /dev/audioctl SPARC audio control device | 507 | 7 = /dev/audioctl SPARC audio control device |
503 | 8 = /dev/sequencer2 Sequencer -- alternate device | 508 | 8 = /dev/sequencer2 Sequencer -- alternate device |
504 | 16 = /dev/mixer1 Second soundcard mixer control | 509 | 16 = /dev/mixer1 Second soundcard mixer control |
@@ -510,14 +515,7 @@ Your cooperation is appreciated. | |||
510 | 34 = /dev/midi02 Third MIDI port | 515 | 34 = /dev/midi02 Third MIDI port |
511 | 50 = /dev/midi03 Fourth MIDI port | 516 | 50 = /dev/midi03 Fourth MIDI port |
512 | 517 | ||
513 | 14 block BIOS harddrive callback support {2.6} | 518 | 14 block |
514 | 0 = /dev/dos_hda First BIOS harddrive whole disk | ||
515 | 64 = /dev/dos_hdb Second BIOS harddrive whole disk | ||
516 | 128 = /dev/dos_hdc Third BIOS harddrive whole disk | ||
517 | 192 = /dev/dos_hdd Fourth BIOS harddrive whole disk | ||
518 | |||
519 | Partitions are handled in the same way as IDE disks | ||
520 | (see major number 3). | ||
521 | 519 | ||
522 | 15 char Joystick | 520 | 15 char Joystick |
523 | 0 = /dev/js0 First analog joystick | 521 | 0 = /dev/js0 First analog joystick |
@@ -535,14 +533,14 @@ Your cooperation is appreciated. | |||
535 | 16 block GoldStar CD-ROM | 533 | 16 block GoldStar CD-ROM |
536 | 0 = /dev/gscd GoldStar CD-ROM | 534 | 0 = /dev/gscd GoldStar CD-ROM |
537 | 535 | ||
538 | 17 char Chase serial card | 536 | 17 char OBSOLETE (was Chase serial card) |
539 | 0 = /dev/ttyH0 First Chase port | 537 | 0 = /dev/ttyH0 First Chase port |
540 | 1 = /dev/ttyH1 Second Chase port | 538 | 1 = /dev/ttyH1 Second Chase port |
541 | ... | 539 | ... |
542 | 17 block Optics Storage CD-ROM | 540 | 17 block Optics Storage CD-ROM |
543 | 0 = /dev/optcd Optics Storage CD-ROM | 541 | 0 = /dev/optcd Optics Storage CD-ROM |
544 | 542 | ||
545 | 18 char Chase serial card - alternate devices | 543 | 18 char OBSOLETE (was Chase serial card - alternate devices) |
546 | 0 = /dev/cuh0 Callout device for ttyH0 | 544 | 0 = /dev/cuh0 Callout device for ttyH0 |
547 | 1 = /dev/cuh1 Callout device for ttyH1 | 545 | 1 = /dev/cuh1 Callout device for ttyH1 |
548 | ... | 546 | ... |
@@ -644,8 +642,7 @@ Your cooperation is appreciated. | |||
644 | 2 = /dev/sbpcd2 Panasonic CD-ROM controller 0 unit 2 | 642 | 2 = /dev/sbpcd2 Panasonic CD-ROM controller 0 unit 2 |
645 | 3 = /dev/sbpcd3 Panasonic CD-ROM controller 0 unit 3 | 643 | 3 = /dev/sbpcd3 Panasonic CD-ROM controller 0 unit 3 |
646 | 644 | ||
647 | 26 char Quanta WinVision frame grabber {2.6} | 645 | 26 char |
648 | 0 = /dev/wvisfgrab Quanta WinVision frame grabber | ||
649 | 646 | ||
650 | 26 block Second Matsushita (Panasonic/SoundBlaster) CD-ROM | 647 | 26 block Second Matsushita (Panasonic/SoundBlaster) CD-ROM |
651 | 0 = /dev/sbpcd4 Panasonic CD-ROM controller 1 unit 0 | 648 | 0 = /dev/sbpcd4 Panasonic CD-ROM controller 1 unit 0 |
@@ -872,7 +869,7 @@ Your cooperation is appreciated. | |||
872 | and "user level packet I/O." This board is also | 869 | and "user level packet I/O." This board is also |
873 | accessible as a standard networking "eth" device. | 870 | accessible as a standard networking "eth" device. |
874 | 871 | ||
875 | 38 block Reserved for Linux/AP+ | 872 | 38 block OBSOLETE (was Linux/AP+) |
876 | 873 | ||
877 | 39 char ML-16P experimental I/O board | 874 | 39 char ML-16P experimental I/O board |
878 | 0 = /dev/ml16pa-a0 First card, first analog channel | 875 | 0 = /dev/ml16pa-a0 First card, first analog channel |
@@ -892,29 +889,16 @@ Your cooperation is appreciated. | |||
892 | 50 = /dev/ml16pb-c1 Second card, second counter/timer | 889 | 50 = /dev/ml16pb-c1 Second card, second counter/timer |
893 | 51 = /dev/ml16pb-c2 Second card, third counter/timer | 890 | 51 = /dev/ml16pb-c2 Second card, third counter/timer |
894 | ... | 891 | ... |
895 | 39 block Reserved for Linux/AP+ | 892 | 39 block |
896 | 893 | ||
897 | 40 char Matrox Meteor frame grabber {2.6} | 894 | 40 char |
898 | 0 = /dev/mmetfgrab Matrox Meteor frame grabber | ||
899 | 895 | ||
900 | 40 block Syquest EZ135 parallel port removable drive | 896 | 40 block |
901 | 0 = /dev/eza Parallel EZ135 drive, whole disk | ||
902 | |||
903 | This device is obsolete and will be removed in a | ||
904 | future version of Linux. It has been replaced with | ||
905 | the parallel port IDE disk driver at major number 45. | ||
906 | Partitions are handled in the same way as IDE disks | ||
907 | (see major number 3). | ||
908 | 897 | ||
909 | 41 char Yet Another Micro Monitor | 898 | 41 char Yet Another Micro Monitor |
910 | 0 = /dev/yamm Yet Another Micro Monitor | 899 | 0 = /dev/yamm Yet Another Micro Monitor |
911 | 900 | ||
912 | 41 block MicroSolutions BackPack parallel port CD-ROM | 901 | 41 block |
913 | 0 = /dev/bpcd BackPack CD-ROM | ||
914 | |||
915 | This device is obsolete and will be removed in a | ||
916 | future version of Linux. It has been replaced with | ||
917 | the parallel port ATAPI CD-ROM driver at major number 46. | ||
918 | 902 | ||
919 | 42 char Demo/sample use | 903 | 42 char Demo/sample use |
920 | 904 | ||
@@ -1681,13 +1665,7 @@ Your cooperation is appreciated. | |||
1681 | disks (see major number 3) except that the limit on | 1665 | disks (see major number 3) except that the limit on |
1682 | partitions is 15. | 1666 | partitions is 15. |
1683 | 1667 | ||
1684 | 93 char IBM Smart Capture Card frame grabber {2.6} | 1668 | 93 char |
1685 | 0 = /dev/iscc0 First Smart Capture Card | ||
1686 | 1 = /dev/iscc1 Second Smart Capture Card | ||
1687 | ... | ||
1688 | 128 = /dev/isccctl0 First Smart Capture Card control | ||
1689 | 129 = /dev/isccctl1 Second Smart Capture Card control | ||
1690 | ... | ||
1691 | 1669 | ||
1692 | 93 block NAND Flash Translation Layer filesystem | 1670 | 93 block NAND Flash Translation Layer filesystem |
1693 | 0 = /dev/nftla First NFTL layer | 1671 | 0 = /dev/nftla First NFTL layer |
@@ -1695,10 +1673,7 @@ Your cooperation is appreciated. | |||
1695 | ... | 1673 | ... |
1696 | 240 = /dev/nftlp 16th NFTL layer | 1674 | 240 = /dev/nftlp 16th NFTL layer |
1697 | 1675 | ||
1698 | 94 char miroVIDEO DC10/30 capture/playback device {2.6} | 1676 | 94 char |
1699 | 0 = /dev/dcxx0 First capture card | ||
1700 | 1 = /dev/dcxx1 Second capture card | ||
1701 | ... | ||
1702 | 1677 | ||
1703 | 94 block IBM S/390 DASD block storage | 1678 | 94 block IBM S/390 DASD block storage |
1704 | 0 = /dev/dasda First DASD device, major | 1679 | 0 = /dev/dasda First DASD device, major |
@@ -1791,11 +1766,7 @@ Your cooperation is appreciated. | |||
1791 | ... | 1766 | ... |
1792 | 15 = /dev/amiraid/ar?p15 15th partition | 1767 | 15 = /dev/amiraid/ar?p15 15th partition |
1793 | 1768 | ||
1794 | 102 char Philips SAA5249 Teletext signal decoder {2.6} | 1769 | 102 char |
1795 | 0 = /dev/tlk0 First Teletext decoder | ||
1796 | 1 = /dev/tlk1 Second Teletext decoder | ||
1797 | 2 = /dev/tlk2 Third Teletext decoder | ||
1798 | 3 = /dev/tlk3 Fourth Teletext decoder | ||
1799 | 1770 | ||
1800 | 102 block Compressed block device | 1771 | 102 block Compressed block device |
1801 | 0 = /dev/cbd/a First compressed block device, whole device | 1772 | 0 = /dev/cbd/a First compressed block device, whole device |
@@ -1916,10 +1887,7 @@ Your cooperation is appreciated. | |||
1916 | DAC960 (see major number 48) except that the limit on | 1887 | DAC960 (see major number 48) except that the limit on |
1917 | partitions is 15. | 1888 | partitions is 15. |
1918 | 1889 | ||
1919 | 111 char Philips SAA7146-based audio/video card {2.6} | 1890 | 111 char |
1920 | 0 = /dev/av0 First A/V card | ||
1921 | 1 = /dev/av1 Second A/V card | ||
1922 | ... | ||
1923 | 1891 | ||
1924 | 111 block Compaq Next Generation Drive Array, eighth controller | 1892 | 111 block Compaq Next Generation Drive Array, eighth controller |
1925 | 0 = /dev/cciss/c7d0 First logical drive, whole disk | 1893 | 0 = /dev/cciss/c7d0 First logical drive, whole disk |
@@ -2079,8 +2047,8 @@ Your cooperation is appreciated. | |||
2079 | ... | 2047 | ... |
2080 | 2048 | ||
2081 | 119 char VMware virtual network control | 2049 | 119 char VMware virtual network control |
2082 | 0 = /dev/vmnet0 1st virtual network | 2050 | 0 = /dev/vnet0 1st virtual network |
2083 | 1 = /dev/vmnet1 2nd virtual network | 2051 | 1 = /dev/vnet1 2nd virtual network |
2084 | ... | 2052 | ... |
2085 | 2053 | ||
2086 | 120-127 char LOCAL/EXPERIMENTAL USE | 2054 | 120-127 char LOCAL/EXPERIMENTAL USE |
@@ -2450,7 +2418,7 @@ Your cooperation is appreciated. | |||
2450 | 2 = /dev/raw/raw2 Second raw I/O device | 2418 | 2 = /dev/raw/raw2 Second raw I/O device |
2451 | ... | 2419 | ... |
2452 | 2420 | ||
2453 | 163 char UNASSIGNED (was Radio Tech BIM-XXX-RS232 radio modem - see 51) | 2421 | 163 char |
2454 | 2422 | ||
2455 | 164 char Chase Research AT/PCI-Fast serial card | 2423 | 164 char Chase Research AT/PCI-Fast serial card |
2456 | 0 = /dev/ttyCH0 AT/PCI-Fast board 0, port 0 | 2424 | 0 = /dev/ttyCH0 AT/PCI-Fast board 0, port 0 |
@@ -2542,6 +2510,12 @@ Your cooperation is appreciated. | |||
2542 | 1 = /dev/clanvi1 Second cLAN adapter | 2510 | 1 = /dev/clanvi1 Second cLAN adapter |
2543 | ... | 2511 | ... |
2544 | 2512 | ||
2513 | 179 block MMC block devices | ||
2514 | 0 = /dev/mmcblk0 First SD/MMC card | ||
2515 | 1 = /dev/mmcblk0p1 First partition on first MMC card | ||
2516 | 8 = /dev/mmcblk1 Second SD/MMC card | ||
2517 | ... | ||
2518 | |||
2545 | 179 char CCube DVXChip-based PCI products | 2519 | 179 char CCube DVXChip-based PCI products |
2546 | 0 = /dev/dvxirq0 First DVX device | 2520 | 0 = /dev/dvxirq0 First DVX device |
2547 | 1 = /dev/dvxirq1 Second DVX device | 2521 | 1 = /dev/dvxirq1 Second DVX device |
@@ -2560,6 +2534,9 @@ Your cooperation is appreciated. | |||
2560 | 96 = /dev/usb/hiddev0 1st USB HID device | 2534 | 96 = /dev/usb/hiddev0 1st USB HID device |
2561 | ... | 2535 | ... |
2562 | 111 = /dev/usb/hiddev15 16th USB HID device | 2536 | 111 = /dev/usb/hiddev15 16th USB HID device |
2537 | 112 = /dev/usb/auer0 1st auerswald ISDN device | ||
2538 | ... | ||
2539 | 127 = /dev/usb/auer15 16th auerswald ISDN device | ||
2563 | 128 = /dev/usb/brlvgr0 First Braille Voyager device | 2540 | 128 = /dev/usb/brlvgr0 First Braille Voyager device |
2564 | ... | 2541 | ... |
2565 | 131 = /dev/usb/brlvgr3 Fourth Braille Voyager device | 2542 | 131 = /dev/usb/brlvgr3 Fourth Braille Voyager device |
@@ -2810,6 +2787,16 @@ Your cooperation is appreciated. | |||
2810 | ... | 2787 | ... |
2811 | 190 = /dev/ttyUL3 Xilinx uartlite - port 3 | 2788 | 190 = /dev/ttyUL3 Xilinx uartlite - port 3 |
2812 | 191 = /dev/xvc0 Xen virtual console - port 0 | 2789 | 191 = /dev/xvc0 Xen virtual console - port 0 |
2790 | 192 = /dev/ttyPZ0 pmac_zilog - port 0 | ||
2791 | ... | ||
2792 | 195 = /dev/ttyPZ3 pmac_zilog - port 3 | ||
2793 | 196 = /dev/ttyTX0 TX39/49 serial port 0 | ||
2794 | ... | ||
2795 | 204 = /dev/ttyTX7 TX39/49 serial port 7 | ||
2796 | 205 = /dev/ttySC0 SC26xx serial port 0 | ||
2797 | 206 = /dev/ttySC1 SC26xx serial port 1 | ||
2798 | 207 = /dev/ttySC2 SC26xx serial port 2 | ||
2799 | 208 = /dev/ttySC3 SC26xx serial port 3 | ||
2813 | 2800 | ||
2814 | 205 char Low-density serial ports (alternate device) | 2801 | 205 char Low-density serial ports (alternate device) |
2815 | 0 = /dev/culu0 Callout device for ttyLU0 | 2802 | 0 = /dev/culu0 Callout device for ttyLU0 |
@@ -3145,6 +3132,14 @@ Your cooperation is appreciated. | |||
3145 | 1 = /dev/blockrom1 Second ROM card's translation layer interface | 3132 | 1 = /dev/blockrom1 Second ROM card's translation layer interface |
3146 | ... | 3133 | ... |
3147 | 3134 | ||
3135 | 259 block Block Extended Major | ||
3136 | Used dynamically to hold additional partition minor | ||
3137 | numbers and allow large numbers of partitions per device | ||
3138 | |||
3139 | 259 char FPGA configuration interfaces | ||
3140 | 0 = /dev/icap0 First Xilinx internal configuration | ||
3141 | 1 = /dev/icap1 Second Xilinx internal configuration | ||
3142 | |||
3148 | 260 char OSD (Object-based-device) SCSI Device | 3143 | 260 char OSD (Object-based-device) SCSI Device |
3149 | 0 = /dev/osd0 First OSD Device | 3144 | 0 = /dev/osd0 First OSD Device |
3150 | 1 = /dev/osd1 Second OSD Device | 3145 | 1 = /dev/osd1 Second OSD Device |
diff --git a/Documentation/fb/00-INDEX b/Documentation/fb/00-INDEX index caabbd395e61..a618fd99c9f0 100644 --- a/Documentation/fb/00-INDEX +++ b/Documentation/fb/00-INDEX | |||
@@ -11,8 +11,6 @@ aty128fb.txt | |||
11 | - info on the ATI Rage128 frame buffer driver. | 11 | - info on the ATI Rage128 frame buffer driver. |
12 | cirrusfb.txt | 12 | cirrusfb.txt |
13 | - info on the driver for Cirrus Logic chipsets. | 13 | - info on the driver for Cirrus Logic chipsets. |
14 | cyblafb/ | ||
15 | - directory with documentation files related to the cyblafb driver. | ||
16 | deferred_io.txt | 14 | deferred_io.txt |
17 | - an introduction to deferred IO. | 15 | - an introduction to deferred IO. |
18 | fbcon.txt | 16 | fbcon.txt |
diff --git a/Documentation/fb/cyblafb/bugs b/Documentation/fb/cyblafb/bugs deleted file mode 100644 index 9443a6d72cdd..000000000000 --- a/Documentation/fb/cyblafb/bugs +++ /dev/null | |||
@@ -1,13 +0,0 @@ | |||
1 | Bugs | ||
2 | ==== | ||
3 | |||
4 | I currently don't know of any bug. Please do send reports to: | ||
5 | - linux-fbdev-devel@lists.sourceforge.net | ||
6 | - Knut_Petersen@t-online.de. | ||
7 | |||
8 | |||
9 | Untested features | ||
10 | ================= | ||
11 | |||
12 | All LCD stuff is untested. If it worked in tridentfb, it should work in | ||
13 | cyblafb. Please test and report the results to Knut_Petersen@t-online.de. | ||
diff --git a/Documentation/fb/cyblafb/credits b/Documentation/fb/cyblafb/credits deleted file mode 100644 index 0eb3b443dc2b..000000000000 --- a/Documentation/fb/cyblafb/credits +++ /dev/null | |||
@@ -1,7 +0,0 @@ | |||
1 | Thanks to | ||
2 | ========= | ||
3 | * Alan Hourihane, for writing the X trident driver | ||
4 | * Jani Monoses, for writing the tridentfb driver | ||
5 | * Antonino A. Daplas, for review of the first published | ||
6 | version of cyblafb and some code | ||
7 | * Jochen Hein, for testing and a helpful bug report | ||
diff --git a/Documentation/fb/cyblafb/documentation b/Documentation/fb/cyblafb/documentation deleted file mode 100644 index bb1aac048425..000000000000 --- a/Documentation/fb/cyblafb/documentation +++ /dev/null | |||
@@ -1,17 +0,0 @@ | |||
1 | Available Documentation | ||
2 | ======================= | ||
3 | |||
4 | Apollo PLE 133 Chipset VT8601A North Bridge Datasheet, Rev. 1.82, October 22, | ||
5 | 2001, available from VIA: | ||
6 | |||
7 | http://www.viavpsd.com/product/6/15/DS8601A182.pdf | ||
8 | |||
9 | The datasheet is incomplete, some registers that need to be programmed are not | ||
10 | explained at all and important bits are listed as "reserved". But you really | ||
11 | need the datasheet to understand the code. "p. xxx" comments refer to page | ||
12 | numbers of this document. | ||
13 | |||
14 | XFree/XOrg drivers are available and of good quality, looking at the code | ||
15 | there is a good idea if the datasheet does not provide enough information | ||
16 | or if the datasheet seems to be wrong. | ||
17 | |||
diff --git a/Documentation/fb/cyblafb/fb.modes b/Documentation/fb/cyblafb/fb.modes deleted file mode 100644 index fe0e5223ba86..000000000000 --- a/Documentation/fb/cyblafb/fb.modes +++ /dev/null | |||
@@ -1,154 +0,0 @@ | |||
1 | # | ||
2 | # Sample fb.modes file | ||
3 | # | ||
4 | # Provides an incomplete list of working modes for | ||
5 | # the cyberblade/i1 graphics core. | ||
6 | # | ||
7 | # The value 4294967256 is used instead of -40. Of course, -40 is not | ||
8 | # a really reasonable value, but chip design does not always follow | ||
9 | # logic. Believe me, it's ok, and it's the way the BIOS does it. | ||
10 | # | ||
11 | # fbset requires 4294967256 in fb.modes and -40 as an argument to | ||
12 | # the -t parameter. That's also not too reasonable, and it might change | ||
13 | # in the future or might even be different for your current version. | ||
14 | # | ||
15 | |||
16 | mode "640x480-50" | ||
17 | geometry 640 480 2048 4096 8 | ||
18 | timings 47619 4294967256 24 17 0 216 3 | ||
19 | endmode | ||
20 | |||
21 | mode "640x480-60" | ||
22 | geometry 640 480 2048 4096 8 | ||
23 | timings 39682 4294967256 24 17 0 216 3 | ||
24 | endmode | ||
25 | |||
26 | mode "640x480-70" | ||
27 | geometry 640 480 2048 4096 8 | ||
28 | timings 34013 4294967256 24 17 0 216 3 | ||
29 | endmode | ||
30 | |||
31 | mode "640x480-72" | ||
32 | geometry 640 480 2048 4096 8 | ||
33 | timings 33068 4294967256 24 17 0 216 3 | ||
34 | endmode | ||
35 | |||
36 | mode "640x480-75" | ||
37 | geometry 640 480 2048 4096 8 | ||
38 | timings 31746 4294967256 24 17 0 216 3 | ||
39 | endmode | ||
40 | |||
41 | mode "640x480-80" | ||
42 | geometry 640 480 2048 4096 8 | ||
43 | timings 29761 4294967256 24 17 0 216 3 | ||
44 | endmode | ||
45 | |||
46 | mode "640x480-85" | ||
47 | geometry 640 480 2048 4096 8 | ||
48 | timings 28011 4294967256 24 17 0 216 3 | ||
49 | endmode | ||
50 | |||
51 | mode "800x600-50" | ||
52 | geometry 800 600 2048 4096 8 | ||
53 | timings 30303 96 24 14 0 136 11 | ||
54 | endmode | ||
55 | |||
56 | mode "800x600-60" | ||
57 | geometry 800 600 2048 4096 8 | ||
58 | timings 25252 96 24 14 0 136 11 | ||
59 | endmode | ||
60 | |||
61 | mode "800x600-70" | ||
62 | geometry 800 600 2048 4096 8 | ||
63 | timings 21645 96 24 14 0 136 11 | ||
64 | endmode | ||
65 | |||
66 | mode "800x600-72" | ||
67 | geometry 800 600 2048 4096 8 | ||
68 | timings 21043 96 24 14 0 136 11 | ||
69 | endmode | ||
70 | |||
71 | mode "800x600-75" | ||
72 | geometry 800 600 2048 4096 8 | ||
73 | timings 20202 96 24 14 0 136 11 | ||
74 | endmode | ||
75 | |||
76 | mode "800x600-80" | ||
77 | geometry 800 600 2048 4096 8 | ||
78 | timings 18939 96 24 14 0 136 11 | ||
79 | endmode | ||
80 | |||
81 | mode "800x600-85" | ||
82 | geometry 800 600 2048 4096 8 | ||
83 | timings 17825 96 24 14 0 136 11 | ||
84 | endmode | ||
85 | |||
86 | mode "1024x768-50" | ||
87 | geometry 1024 768 2048 4096 8 | ||
88 | timings 19054 144 24 29 0 120 3 | ||
89 | endmode | ||
90 | |||
91 | mode "1024x768-60" | ||
92 | geometry 1024 768 2048 4096 8 | ||
93 | timings 15880 144 24 29 0 120 3 | ||
94 | endmode | ||
95 | |||
96 | mode "1024x768-70" | ||
97 | geometry 1024 768 2048 4096 8 | ||
98 | timings 13610 144 24 29 0 120 3 | ||
99 | endmode | ||
100 | |||
101 | mode "1024x768-72" | ||
102 | geometry 1024 768 2048 4096 8 | ||
103 | timings 13232 144 24 29 0 120 3 | ||
104 | endmode | ||
105 | |||
106 | mode "1024x768-75" | ||
107 | geometry 1024 768 2048 4096 8 | ||
108 | timings 12703 144 24 29 0 120 3 | ||
109 | endmode | ||
110 | |||
111 | mode "1024x768-80" | ||
112 | geometry 1024 768 2048 4096 8 | ||
113 | timings 11910 144 24 29 0 120 3 | ||
114 | endmode | ||
115 | |||
116 | mode "1024x768-85" | ||
117 | geometry 1024 768 2048 4096 8 | ||
118 | timings 11209 144 24 29 0 120 3 | ||
119 | endmode | ||
120 | |||
121 | mode "1280x1024-50" | ||
122 | geometry 1280 1024 2048 4096 8 | ||
123 | timings 11114 232 16 39 0 160 3 | ||
124 | endmode | ||
125 | |||
126 | mode "1280x1024-60" | ||
127 | geometry 1280 1024 2048 4096 8 | ||
128 | timings 9262 232 16 39 0 160 3 | ||
129 | endmode | ||
130 | |||
131 | mode "1280x1024-70" | ||
132 | geometry 1280 1024 2048 4096 8 | ||
133 | timings 7939 232 16 39 0 160 3 | ||
134 | endmode | ||
135 | |||
136 | mode "1280x1024-72" | ||
137 | geometry 1280 1024 2048 4096 8 | ||
138 | timings 7719 232 16 39 0 160 3 | ||
139 | endmode | ||
140 | |||
141 | mode "1280x1024-75" | ||
142 | geometry 1280 1024 2048 4096 8 | ||
143 | timings 7410 232 16 39 0 160 3 | ||
144 | endmode | ||
145 | |||
146 | mode "1280x1024-80" | ||
147 | geometry 1280 1024 2048 4096 8 | ||
148 | timings 6946 232 16 39 0 160 3 | ||
149 | endmode | ||
150 | |||
151 | mode "1280x1024-85" | ||
152 | geometry 1280 1024 2048 4096 8 | ||
153 | timings 6538 232 16 39 0 160 3 | ||
154 | endmode | ||
diff --git a/Documentation/fb/cyblafb/performance b/Documentation/fb/cyblafb/performance deleted file mode 100644 index 8d15d5dfc6b3..000000000000 --- a/Documentation/fb/cyblafb/performance +++ /dev/null | |||
@@ -1,79 +0,0 @@ | |||
1 | Speed | ||
2 | ===== | ||
3 | |||
4 | CyBlaFB is much faster than tridentfb and vesafb. Compare the performance data | ||
5 | for mode 1280x1024-[8,16,32]@61 Hz. | ||
6 | |||
7 | Test 1: Cat a file with 2000 lines of 0 characters. | ||
8 | Test 2: Cat a file with 2000 lines of 80 characters. | ||
9 | Test 3: Cat a file with 2000 lines of 160 characters. | ||
10 | |||
11 | All values show system time use in seconds, kernel 2.6.12 was used for | ||
12 | the measurements. 2.6.13 is a bit slower, 2.6.14 hopefully will include a | ||
13 | patch that speeds up kernel bitblitting a lot ( > 20%). | ||
14 | |||
15 | +-----------+-----------------------------------------------------+ | ||
16 | | | not accelerated | | ||
17 | | TRIDENTFB +-----------------+-----------------+-----------------+ | ||
18 | | of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | | ||
19 | | | noypan | ypan | noypan | ypan | noypan | ypan | | ||
20 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
21 | | Test 1 | 4.31 | 4.33 | 6.05 | 12.81 | ---- | ---- | | ||
22 | | Test 2 | 67.94 | 5.44 | 123.16 | 14.79 | ---- | ---- | | ||
23 | | Test 3 | 131.36 | 6.55 | 240.12 | 16.76 | ---- | ---- | | ||
24 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
25 | | Comments | | | completely bro- | | ||
26 | | | | | ken, monitor | | ||
27 | | | | | switches off | | ||
28 | +-----------+-----------------+-----------------+-----------------+ | ||
29 | |||
30 | |||
31 | +-----------+-----------------------------------------------------+ | ||
32 | | | accelerated | | ||
33 | | TRIDENTFB +-----------------+-----------------+-----------------+ | ||
34 | | of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | | ||
35 | | | noypan | ypan | noypan | ypan | noypan | ypan | | ||
36 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
37 | | Test 1 | ---- | ---- | 20.62 | 1.22 | ---- | ---- | | ||
38 | | Test 2 | ---- | ---- | 22.61 | 3.19 | ---- | ---- | | ||
39 | | Test 3 | ---- | ---- | 24.59 | 5.16 | ---- | ---- | | ||
40 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
41 | | Comments | broken, writing | broken, ok only | completely bro- | | ||
42 | | | to wrong places | if bgcolor is | ken, monitor | | ||
43 | | | on screen + bug | black, bug in | switches off | | ||
44 | | | in fillrect() | fillrect() | | | ||
45 | +-----------+-----------------+-----------------+-----------------+ | ||
46 | |||
47 | |||
48 | +-----------+-----------------------------------------------------+ | ||
49 | | | not accelerated | | ||
50 | | VESAFB +-----------------+-----------------+-----------------+ | ||
51 | | of 2.6.12 | 8 bpp | 16 bpp | 32 bpp | | ||
52 | | | noypan | ypan | noypan | ypan | noypan | ypan | | ||
53 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
54 | | Test 1 | 4.26 | 3.76 | 5.99 | 7.23 | ---- | ---- | | ||
55 | | Test 2 | 65.65 | 4.89 | 120.88 | 9.08 | ---- | ---- | | ||
56 | | Test 3 | 126.91 | 5.94 | 235.77 | 11.03 | ---- | ---- | | ||
57 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
58 | | Comments | vga=0x307 | vga=0x31a | vga=0x31b not | | ||
59 | | | fh=80kHz | fh=80kHz | supported by | | ||
60 | | | fv=75Hz | fv=75Hz | video BIOS and | | ||
61 | | | | | hardware | | ||
62 | +-----------+-----------------+-----------------+-----------------+ | ||
63 | |||
64 | |||
65 | +-----------+-----------------------------------------------------+ | ||
66 | | | accelerated | | ||
67 | | CYBLAFB +-----------------+-----------------+-----------------+ | ||
68 | | | 8 bpp | 16 bpp | 32 bpp | | ||
69 | | | noypan | ypan | noypan | ypan | noypan | ypan | | ||
70 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
71 | | Test 1 | 8.02 | 0.23 | 19.04 | 0.61 | 57.12 | 2.74 | | ||
72 | | Test 2 | 8.38 | 0.55 | 19.39 | 0.92 | 57.54 | 3.13 | | ||
73 | | Test 3 | 8.73 | 0.86 | 19.74 | 1.24 | 57.95 | 3.51 | | ||
74 | +-----------+--------+--------+--------+--------+--------+--------+ | ||
75 | | Comments | | | | | ||
76 | | | | | | | ||
77 | | | | | | | ||
78 | | | | | | | ||
79 | +-----------+-----------------+-----------------+-----------------+ | ||
diff --git a/Documentation/fb/cyblafb/todo b/Documentation/fb/cyblafb/todo deleted file mode 100644 index c5f6d0eae545..000000000000 --- a/Documentation/fb/cyblafb/todo +++ /dev/null | |||
@@ -1,31 +0,0 @@ | |||
1 | TODO / Missing features | ||
2 | ======================= | ||
3 | |||
4 | Verify LCD stuff "stretch" and "center" options are | ||
5 | completely untested ... this code needs to be | ||
6 | verified. As I don't have access to such | ||
7 | hardware, please contact me if you are | ||
8 | willing to run some tests. | ||
9 | |||
10 | Interlaced video modes The reason that interlaced | ||
11 | modes are disabled is that I do not know | ||
12 | the meaning of the vertical interlace | ||
13 | parameter. Also the datasheet mentions a | ||
14 | bit d8 of a horizontal interlace parameter, | ||
15 | but nowhere the lower 8 bits. Please help | ||
16 | if you can. | ||
17 | |||
18 | low-res double scan modes Who needs it? | ||
19 | |||
20 | accelerated color blitting Who needs it? The console driver does use color | ||
21 | blitting for nothing but drawing the penguin, | ||
22 | everything else is done using color expanding | ||
23 | blitting of 1bpp character bitmaps. | ||
24 | |||
25 | ioctls Who needs it? | ||
26 | |||
27 | TV-out Will be done later. Use "vga= " at boot time | ||
28 | to set a suitable video mode. | ||
29 | |||
30 | ??? Feel free to contact me if you have any | ||
31 | feature requests | ||
diff --git a/Documentation/fb/cyblafb/usage b/Documentation/fb/cyblafb/usage deleted file mode 100644 index a39bb3d402a2..000000000000 --- a/Documentation/fb/cyblafb/usage +++ /dev/null | |||
@@ -1,217 +0,0 @@ | |||
1 | CyBlaFB is a framebuffer driver for the Cyberblade/i1 graphics core integrated | ||
2 | into the VIA Apollo PLE133 (aka vt8601) north bridge. It is developed and | ||
3 | tested using a VIA EPIA 5000 board. | ||
4 | |||
5 | Cyblafb - compiled into the kernel or as a module? | ||
6 | ================================================== | ||
7 | |||
8 | You might compile cyblafb either as a module or compile it permanently into the | ||
9 | kernel. | ||
10 | |||
11 | Unless you have a real reason to do so you should not compile both vesafb and | ||
12 | cyblafb permanently into the kernel. It's possible and it helps during the | ||
13 | development cycle, but it's useless and will at least block some otherwise | ||
14 | useful memory for ordinary users. | ||
15 | |||
16 | Selecting Modes | ||
17 | =============== | ||
18 | |||
19 | Startup Mode | ||
20 | ============ | ||
21 | |||
22 | First of all, you might use the "vga=???" boot parameter as it is | ||
23 | documented in vesafb.txt and svga.txt. Cyblafb will detect the video | ||
24 | mode selected and will use the geometry and timings found by | ||
25 | inspecting the hardware registers. | ||
26 | |||
27 | video=cyblafb vga=0x317 | ||
28 | |||
29 | Alternatively you might use a combination of the mode, ref and bpp | ||
30 | parameters. If you compiled the driver into the kernel, add something | ||
31 | like this to the kernel command line: | ||
32 | |||
33 | video=cyblafb:1280x1024,bpp=16,ref=50 ... | ||
34 | |||
35 | If you compiled the driver as a module, the same mode would be | ||
36 | selected by the following command: | ||
37 | |||
38 | modprobe cyblafb mode=1280x1024 bpp=16 ref=50 ... | ||
39 | |||
40 | None of the modes possible to select as startup modes are affected by | ||
41 | the problems described at the end of the next subsection. | ||
42 | |||
43 | For all startup modes cyblafb chooses a virtual x resolution of 2048, | ||
44 | the only exception is mode 1280x1024 in combination with 32 bpp. This | ||
45 | allows ywrap scrolling for all those modes if rotation is 0 or 2, and | ||
46 | also fast scrolling if rotation is 1 or 3. The default virtual y reso- | ||
47 | lution is 4096 for bpp == 8, 2048 for bpp==16 and 1024 for bpp == 32, | ||
48 | again with the only exception of 1280x1024 at 32 bpp. | ||
49 | |||
50 | Please do set your video memory size to 8 MB in the BIOS setup. Other | ||
51 | values will work, but performance is decreased for a lot of modes. | ||
52 | |||
53 | Mode changes using fbset | ||
54 | ======================== | ||
55 | |||
56 | You might use fbset to change the video mode, see "man fbset". Cyblafb | ||
57 | generally does assume that you know what you are doing. But it does | ||
58 | some checks, especially those that are needed to prevent you from | ||
59 | damaging your hardware. | ||
60 | |||
61 | - only 8, 16, 24 and 32 bpp video modes are accepted | ||
62 | - interlaced video modes are not accepted | ||
63 | - double scan video modes are not accepted | ||
64 | - if a flat panel is found, cyblafb does not allow you | ||
65 | to program a resolution higher than the physical | ||
66 | resolution of the flat panel monitor | ||
67 | - cyblafb does not allow vclk to exceed 230 MHz. As 32 bpp | ||
68 | and (currently) 24 bit modes use a doubled vclk internally, | ||
69 | the dotclock limit as seen by fbset is 115 MHz for those | ||
70 | modes and 230 MHz for 8 and 16 bpp modes. | ||
71 | - cyblafb will allow you to select very high resolutions as | ||
72 | long as the hardware can be programmed to these modes. The | ||
73 | documented limit 1600x1200 is not enforced, but don't expect | ||
74 | perfect signal quality. | ||
75 | |||
76 | Any request that violates the rules given above will be either changed | ||
77 | to something the hardware supports or an error value will be returned. | ||
78 | |||
79 | If you program a virtual y resolution higher than the hardware limit, | ||
80 | cyblafb will silently decrease that value to the highest possible | ||
81 | value. The same is true for a virtual x resolution that is not | ||
82 | supported by the hardware. Cyblafb tries to adapt vyres first because | ||
83 | vxres decides if ywrap scrolling is possible or not. | ||
84 | |||
85 | Attempts to disable acceleration are ignored, I believe that this is | ||
86 | safe. | ||
87 | |||
88 | Some video modes that should work do not work as expected. If you use | ||
89 | the standard fb.modes, fbset 640x480-60 will program that mode, but | ||
90 | you will see a vertical area, about two characters wide, with only | ||
91 | much darker characters than the other characters on the screen. | ||
92 | Cyblafb does allow that mode to be set, as it does not violate the | ||
93 | official specifications. It would need a lot of code to reliably sort | ||
94 | out all invalid modes, playing around with the margin values will | ||
95 | give a valid mode quickly. And if cyblafb would detect such an invalid | ||
96 | mode, should it silently alter the requested values or should it | ||
97 | report an error? Both options have some pros and cons. As stated | ||
98 | above, none of the startup modes are affected, and if you set | ||
99 | verbosity to 1 or higher, cyblafb will print the fbset command that | ||
100 | would be needed to program that mode using fbset. | ||
101 | |||
102 | |||
103 | Other Parameters | ||
104 | ================ | ||
105 | |||
106 | |||
107 | crt don't autodetect, assume monitor connected to | ||
108 | standard VGA connector | ||
109 | |||
110 | fp don't autodetect, assume flat panel display | ||
111 | connected to flat panel monitor interface | ||
112 | |||
113 | nativex inform driver about native x resolution of | ||
114 | flat panel monitor connected to special | ||
115 | interface (should be autodetected) | ||
116 | |||
117 | stretch stretch image to adapt low resolution modes to | ||
118 | higher resolutions of flat panel monitors | ||
119 | connected to special interface | ||
120 | |||
121 | center center image to adapt low resolution modes to | ||
122 | higher resolutions of flat panel monitors | ||
123 | connected to special interface | ||
124 | |||
125 | memsize use if autodetected memsize is wrong ... | ||
126 | should never be necessary | ||
127 | |||
128 | nopcirr disable PCI read retry | ||
129 | nopciwr disable PCI write retry | ||
130 | nopcirb disable PCI read bursts | ||
131 | nopciwb disable PCI write bursts | ||
132 | |||
133 | bpp bpp for specified modes | ||
134 | valid values: 8 || 16 || 24 || 32 | ||
135 | |||
136 | ref refresh rate for specified mode | ||
137 | valid values: 50 <= ref <= 85 | ||
138 | |||
139 | mode 640x480 or 800x600 or 1024x768 or 1280x1024 | ||
140 | if not specified, the startup mode will be detected | ||
141 | and used, so you might also use the vga=??? parameter | ||
142 | described in vesafb.txt. If you do not specify a mode, | ||
143 | bpp and ref parameters are ignored. | ||
144 | |||
145 | verbosity 0 is the default, increase to at least 2 for every | ||
146 | bug report! | ||
147 | |||
148 | Development hints | ||
149 | ================= | ||
150 | |||
151 | It's much faster to compile a module and to load the new version after | ||
152 | unloading the old module than to compile a new kernel and to reboot. So if you | ||
153 | try to work on cyblafb, it might be a good idea to use cyblafb as a module. | ||
154 | In real life, fast often means dangerous, and that's also the case here. If | ||
155 | you introduce a serious bug when cyblafb is compiled into the kernel, the | ||
156 | kernel will lock or oops with a high probability before the file system is | ||
157 | mounted, and the danger for your data is low. If you load your own broken version | ||
158 | of cyblafb on a running system, the danger for the integrity of the file | ||
159 | system is much higher as you might need a hard reset afterwards. Decide | ||
160 | yourself. | ||
161 | |||
162 | Module unloading, the vfb method | ||
163 | ================================ | ||
164 | |||
165 | If you want to unload/reload cyblafb using the virtual framebuffer, you need | ||
166 | to enable vfb support in the kernel first. After that, load the modules as | ||
167 | shown below: | ||
168 | |||
169 | modprobe vfb vfb_enable=1 | ||
170 | modprobe fbcon | ||
171 | modprobe cyblafb | ||
172 | fbset -fb /dev/fb1 1280x1024-60 -vyres 2662 | ||
173 | con2fb /dev/fb1 /dev/tty1 | ||
174 | ... | ||
175 | |||
176 | If you have now made some changes to cyblafb and want to reload it, you might do it | ||
177 | as shown below: | ||
178 | |||
179 | con2fb /dev/fb0 /dev/tty1 | ||
180 | ... | ||
181 | rmmod cyblafb | ||
182 | modprobe cyblafb | ||
183 | con2fb /dev/fb1 /dev/tty1 | ||
184 | ... | ||
185 | |||
186 | Of course, you might choose another mode, and most certainly you also want to | ||
187 | map some other /dev/tty* to the real framebuffer device. You might also choose | ||
188 | to compile fbcon as a kernel module or place it permanently in the kernel. | ||
189 | |||
190 | I do not know of any way to unload fbcon, and fbcon will prevent the | ||
191 | framebuffer device loaded first from unloading. [If there is a way, then | ||
192 | please add a description here!] | ||
193 | |||
194 | Module unloading, the vesafb method | ||
195 | =================================== | ||
196 | |||
197 | Configure the kernel: | ||
198 | |||
199 | <*> Support for frame buffer devices | ||
200 | [*] VESA VGA graphics support | ||
201 | <M> Cyberblade/i1 support | ||
202 | |||
203 | Add e.g. "video=vesafb:ypan vga=0x307" to the kernel parameters. The ypan | ||
204 | parameter is important, choose any vga parameter you like as long as it is | ||
205 | a graphics mode. | ||
206 | |||
207 | After booting, load cyblafb without any mode and bpp parameter and assign | ||
208 | cyblafb to individual ttys using con2fb, e.g.: | ||
209 | |||
210 | modprobe cyblafb | ||
211 | con2fb /dev/fb1 /dev/tty1 | ||
212 | |||
213 | Unloading cyblafb works without problems after you assign vesafb to all | ||
214 | ttys again, e.g.: | ||
215 | |||
216 | con2fb /dev/fb0 /dev/tty1 | ||
217 | rmmod cyblafb | ||
diff --git a/Documentation/fb/cyblafb/whatsnew b/Documentation/fb/cyblafb/whatsnew deleted file mode 100644 index 76c07a26e044..000000000000 --- a/Documentation/fb/cyblafb/whatsnew +++ /dev/null | |||
@@ -1,29 +0,0 @@ | |||
1 | 0.62 | ||
2 | ==== | ||
3 | |||
4 | - the vesafb parameter has been removed as I decided to allow the | ||
5 | feature without any special parameter. | ||
6 | |||
7 | - Cyblafb does not use the vga style of panning any longer, now the | ||
8 | "right view" register in the graphics engine IO space is used. Without | ||
9 | that change it was impossible to use all available memory, and without | ||
10 | access to all available memory it is impossible to ywrap. | ||
11 | |||
12 | - The imageblit function now uses hardware acceleration for all font | ||
13 | widths. Hardware blitting across pixel column 2048 is broken in the | ||
14 | cyberblade/i1 graphics core, but we work around that hardware bug. | ||
15 | |||
16 | - modes with vxres != xres are supported now. | ||
17 | |||
18 | - ywrap scrolling is supported now and the default. This is a big | ||
19 | performance gain. | ||
20 | |||
21 | - default video modes use vyres > yres and vxres > xres to allow | ||
22 | almost optimal scrolling speed for normal and rotated screens | ||
23 | |||
24 | - some features mainly useful for debugging the upper layers of the | ||
25 | framebuffer system have been added, have a look at the code | ||
26 | |||
27 | - fixed: Oops after unloading cyblafb when reading /proc/io* | ||
28 | |||
29 | - we work around some bugs of the higher framebuffer layers. | ||
diff --git a/Documentation/fb/cyblafb/whycyblafb b/Documentation/fb/cyblafb/whycyblafb deleted file mode 100644 index a123bc11e698..000000000000 --- a/Documentation/fb/cyblafb/whycyblafb +++ /dev/null | |||
@@ -1,85 +0,0 @@ | |||
1 | I tried the following framebuffer drivers: | ||
2 | |||
3 | - TRIDENTFB is full of bugs. Acceleration is broken for Blade3D | ||
4 | graphics cores like the cyberblade/i1. It claims to support a great | ||
5 | number of devices, but documentation for most of these devices is | ||
6 | unfortunately not available. There is _no_ reason to use tridentfb | ||
7 | for cyberblade/i1 + CRT users. VESAFB is faster, and the one | ||
8 | advantage, mode switching, is broken in tridentfb. | ||
9 | |||
10 | - VESAFB is used by many distributions as a standard. Vesafb does | ||
11 | not support mode switching. VESAFB is a bit faster than the working | ||
12 | configurations of TRIDENTFB, but it is still too slow, even if you | ||
13 | use ypan. | ||
14 | |||
15 | - EPIAFB (you'll find it on sourceforge) supports the Cyberblade/i1 | ||
16 | graphics core, but it still has serious bugs and development seems | ||
17 | to have stopped. This is the one driver with TV-out support. If you | ||
18 | do need this feature, try epiafb. | ||
19 | |||
20 | None of these drivers was a real option for me. | ||
21 | |||
22 | I believe that it is unreasonable to change code that claims to support 20 | ||
23 | devices if I only have more or less sufficient documentation for exactly one | ||
24 | of these. The risk of breaking device foo while fixing device bar is too high. | ||
25 | |||
26 | So I decided to start CyBlaFB as a stripped down tridentfb. | ||
27 | |||
28 | All code specific to other Trident chips has been removed. After that there | ||
29 | were a lot of cosmetic changes to increase the readability of the code. All | ||
30 | register names were changed to those mnemonics used in the datasheet. Function | ||
31 | and macro names were changed if they hindered easy understanding of the code. | ||
32 | |||
33 | After that I debugged the code and implemented some new features. I'll try to | ||
34 | give a little summary of the main changes: | ||
35 | |||
36 | - calculation of vertical and horizontal timings was fixed | ||
37 | |||
38 | - video signal quality has been improved dramatically | ||
39 | |||
40 | - acceleration: | ||
41 | |||
42 | - fillrect and copyarea were fixed and reenabled | ||
43 | |||
44 | - color expanding imageblit was newly implemented, color | ||
45 | imageblit (only used to draw the penguin) still uses the | ||
46 | generic code. | ||
47 | |||
48 | - init of the acceleration engine was improved and moved to a | ||
49 | place where it really works ... | ||
50 | |||
51 | - sync function has a timeout now and tries to reset and | ||
52 | reinit the accel engine if necessary | ||
53 | |||
54 | - fewer slow copyarea calls when doing ypan scrolling by using | ||
55 | undocumented bit d21 of screen start address stored in | ||
56 | CR2B[5]. BIOS does use it also, so this should be safe. | ||
57 | |||
58 | - cyblafb rejects any attempt to set modes that would cause vclk | ||
59 | values above reasonable 230 MHz. 32bit modes use a clock | ||
60 | multiplier of 2, so fbset does show the correct values for | ||
61 | pixclock but not for vclk in this case. The fbset limit is 115 MHz | ||
62 | for 32 bpp modes. | ||
63 | |||
64 | - cyblafb rejects modes known to be broken or unimplemented (all | ||
65 | interlaced modes, all doublescan modes for now) | ||
66 | |||
67 | - cyblafb now works independently of the video mode in effect at startup | ||
68 | time (tridentfb does not init all needed registers to reasonable | ||
69 | values) | ||
70 | |||
71 | - switching between video modes does work reliably now | ||
72 | |||
73 | - the first video mode now is the one selected on startup using the | ||
74 | vga=???? mechanism or any of | ||
75 | - 640x480, 800x600, 1024x768, 1280x1024 | ||
76 | - 8, 16, 24 or 32 bpp | ||
77 | - refresh between 50 Hz and 85 Hz, 1 Hz steps (1280x1024-32 | ||
78 | is limited to 63Hz) | ||
79 | |||
80 | - pci retry and pci burst mode are settable (try to disable if you | ||
81 | experience latency problems) | ||
82 | |||
83 | - built as a module cyblafb might be unloaded and reloaded using | ||
84 | the vfb module and con2vt or might be used together with vesafb | ||
85 | |||
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 5e02b83ac12b..39246fc11257 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt | |||
@@ -255,6 +255,16 @@ Who: Jan Engelhardt <jengelh@computergmbh.de> | |||
255 | 255 | ||
256 | --------------------------- | 256 | --------------------------- |
257 | 257 | ||
258 | What: GPIO autorequest on gpio_direction_{input,output}() in gpiolib | ||
259 | When: February 2010 | ||
260 | Why: All callers should use explicit gpio_request()/gpio_free(). | ||
261 | The autorequest mechanism in gpiolib was provided mostly as a | ||
262 | migration aid for legacy GPIO interfaces (for SOC based GPIOs). | ||
263 | Those users have now largely migrated. Platforms implementing | ||
264 | the GPIO interfaces without using gpiolib will see no changes. | ||
265 | Who: David Brownell <dbrownell@users.sourceforge.net> | ||
266 | --------------------------- | ||
267 | |||
258 | What: b43 support for firmware revision < 410 | 268 | What: b43 support for firmware revision < 410 |
259 | When: The schedule was July 2008, but it was decided that we are going to keep the | 269 | When: The schedule was July 2008, but it was decided that we are going to keep the |
260 | code as long as there are no major maintenance headaches. | 270 | code as long as there are no major maintenance headaches. |
@@ -273,13 +283,6 @@ Who: Glauber Costa <gcosta@redhat.com> | |||
273 | 283 | ||
274 | --------------------------- | 284 | --------------------------- |
275 | 285 | ||
276 | What: remove HID compat support | ||
277 | When: 2.6.29 | ||
278 | Why: needed only as a temporary solution until distros fix themselves up | ||
279 | Who: Jiri Slaby <jirislaby@gmail.com> | ||
280 | |||
281 | --------------------------- | ||
282 | |||
283 | What: print_fn_descriptor_symbol() | 286 | What: print_fn_descriptor_symbol() |
284 | When: October 2009 | 287 | When: October 2009 |
285 | Why: The %pF vsprintf format provides the same functionality in a | 288 | Why: The %pF vsprintf format provides the same functionality in a |
@@ -311,6 +314,18 @@ Who: Vlad Yasevich <vladislav.yasevich@hp.com> | |||
311 | 314 | ||
312 | --------------------------- | 315 | --------------------------- |
313 | 316 | ||
317 | What: Ability for non root users to shm_get hugetlb pages based on mlock | ||
318 | resource limits | ||
319 | When: 2.6.31 | ||
320 | Why: Non root users need to be part of /proc/sys/vm/hugetlb_shm_group or | ||
321 | have CAP_IPC_LOCK to be able to allocate shm segments backed by | ||
322 | huge pages. The mlock based rlimit check to allow shm hugetlb is | ||
323 | inconsistent with mmap based allocations. Hence it is being | ||
324 | deprecated. | ||
325 | Who: Ravikiran Thirumalai <kiran@scalex86.org> | ||
326 | |||
327 | --------------------------- | ||
328 | |||
314 | What: CONFIG_THERMAL_HWMON | 329 | What: CONFIG_THERMAL_HWMON |
315 | When: January 2009 | 330 | When: January 2009 |
316 | Why: This option was introduced just to allow older lm-sensors userspace | 331 | Why: This option was introduced just to allow older lm-sensors userspace |
@@ -380,3 +395,35 @@ Why: The defines and typedefs (hw_interrupt_type, no_irq_type, irq_desc_t) | |||
380 | have been kept around for migration reasons. After more than two years | 395 | have been kept around for migration reasons. After more than two years |
381 | it's time to remove them finally | 396 | it's time to remove them finally |
382 | Who: Thomas Gleixner <tglx@linutronix.de> | 397 | Who: Thomas Gleixner <tglx@linutronix.de> |
398 | |||
399 | --------------------------- | ||
400 | |||
401 | What: fakephp and associated sysfs files in /sys/bus/pci/slots/ | ||
402 | When: 2011 | ||
403 | Why: In 2.6.27, the semantics of /sys/bus/pci/slots was redefined to | ||
404 | represent a machine's physical PCI slots. The change in semantics | ||
405 | had userspace implications, as the hotplug core no longer allowed | ||
406 | drivers to create multiple sysfs files per physical slot (required | ||
407 | for multi-function devices, e.g.). fakephp was seen as a developer's | ||
408 | tool only, and its interface changed. Too late, we learned that | ||
409 | there were some users of the fakephp interface. | ||
410 | |||
411 | In 2.6.30, the original fakephp interface was restored. At the same | ||
412 | time, the PCI core gained the ability that fakephp provided, namely | ||
413 | function-level hot-remove and hot-add. | ||
414 | |||
415 | Since the PCI core now provides the same functionality, exposed in: | ||
416 | |||
417 | /sys/bus/pci/rescan | ||
418 | /sys/bus/pci/devices/.../remove | ||
419 | /sys/bus/pci/devices/.../rescan | ||
420 | |||
421 | there is no functional reason to maintain fakephp as well. | ||
422 | |||
423 | We will keep the existing module so that 'modprobe fakephp' will | ||
424 | present the old /sys/bus/pci/slots/... interface for compatibility, | ||
425 | but users are urged to migrate their applications to the API above. | ||
426 | |||
427 | After a reasonable transition period, we will remove the legacy | ||
428 | fakephp interface. | ||
429 | Who: Alex Chiang <achiang@hp.com> | ||
diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 4e78ce677843..76efe5b71d7d 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking | |||
@@ -505,7 +505,7 @@ prototypes: | |||
505 | void (*open)(struct vm_area_struct*); | 505 | void (*open)(struct vm_area_struct*); |
506 | void (*close)(struct vm_area_struct*); | 506 | void (*close)(struct vm_area_struct*); |
507 | int (*fault)(struct vm_area_struct*, struct vm_fault *); | 507 | int (*fault)(struct vm_area_struct*, struct vm_fault *); |
508 | int (*page_mkwrite)(struct vm_area_struct *, struct page *); | 508 | int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); |
509 | int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); | 509 | int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); |
510 | 510 | ||
511 | locking rules: | 511 | locking rules: |
diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt new file mode 100644 index 000000000000..382d52cdaf2d --- /dev/null +++ b/Documentation/filesystems/caching/backend-api.txt | |||
@@ -0,0 +1,658 @@ | |||
1 | ========================== | ||
2 | FS-CACHE CACHE BACKEND API | ||
3 | ========================== | ||
4 | |||
5 | The FS-Cache system provides an API by which actual caches can be supplied to | ||
6 | FS-Cache for it to then serve out to network filesystems and other interested | ||
7 | parties. | ||
8 | |||
9 | This API is declared in <linux/fscache-cache.h>. | ||
10 | |||
11 | |||
12 | ==================================== | ||
13 | INITIALISING AND REGISTERING A CACHE | ||
14 | ==================================== | ||
15 | |||
16 | To start off, a cache definition must be initialised and registered for each | ||
17 | cache the backend wants to make available. For instance, CacheFS does this in | ||
18 | the fill_super() operation on mounting. | ||
19 | |||
20 | The cache definition (struct fscache_cache) should be initialised by calling: | ||
21 | |||
22 | void fscache_init_cache(struct fscache_cache *cache, | ||
23 | struct fscache_cache_ops *ops, | ||
24 | const char *idfmt, | ||
25 | ...); | ||
26 | |||
27 | Where: | ||
28 | |||
29 | (*) "cache" is a pointer to the cache definition; | ||
30 | |||
31 | (*) "ops" is a pointer to the table of operations that the backend supports on | ||
32 | this cache; and | ||
33 | |||
34 | (*) "idfmt" is a format and printf-style arguments for constructing a label | ||
35 | for the cache. | ||
36 | |||
37 | |||
38 | The cache should then be registered with FS-Cache by passing a pointer to the | ||
39 | previously initialised cache definition to: | ||
40 | |||
41 | int fscache_add_cache(struct fscache_cache *cache, | ||
42 | struct fscache_object *fsdef, | ||
43 | const char *tagname); | ||
44 | |||
45 | Two extra arguments should also be supplied: | ||
46 | |||
47 | (*) "fsdef" which should point to the object representation for the FS-Cache | ||
48 | master index in this cache. Netfs primary index entries will be created | ||
49 | here. FS-Cache keeps the caller's reference to the index object if | ||
50 | successful and will release it upon withdrawal of the cache. | ||
51 | |||
52 | (*) "tagname" which, if given, should be a text string naming this cache. If | ||
53 | this is NULL, the identifier will be used instead. For CacheFS, the | ||
54 | identifier is set to name the underlying block device and the tag can be | ||
55 | supplied by mount. | ||
56 | |||
57 | This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag | ||
58 | is already in use. 0 will be returned on success. | ||
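
The return values described above can be illustrated with a small userspace
sketch. This is not the kernel implementation: struct mock_cache, the
mock_*() functions and the fixed-size registry are hypothetical stand-ins
that only model the documented behaviour (the identifier is used when
tagname is NULL, -EEXIST is returned for a tag that is already in use, and
0 is returned on success).

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define MOCK_MAX_CACHES 4

/* Userspace stand-in for struct fscache_cache; NOT the kernel structure. */
struct mock_cache {
	char identifier[36];	/* label built from the printf-style idfmt */
	const char *tag;	/* tag the cache was registered under */
};

static const struct mock_cache *mock_registry[MOCK_MAX_CACHES];
static int mock_nr_caches;

/* Models fscache_init_cache(): construct the label from idfmt and args. */
static void mock_init_cache(struct mock_cache *cache, const char *idfmt, ...)
{
	va_list va;

	va_start(va, idfmt);
	vsnprintf(cache->identifier, sizeof(cache->identifier), idfmt, va);
	va_end(va);
	cache->tag = NULL;
}

/* Models fscache_add_cache(): 0 on success, -EEXIST if the tag is taken. */
static int mock_add_cache(struct mock_cache *cache, const char *tagname)
{
	/* With no tag given, the identifier is used instead. */
	const char *tag = tagname ? tagname : cache->identifier;
	int i;

	for (i = 0; i < mock_nr_caches; i++)
		if (strcmp(mock_registry[i]->tag, tag) == 0)
			return -EEXIST;

	/* The full table stands in for the out-of-memory case. */
	if (mock_nr_caches == MOCK_MAX_CACHES)
		return -ENOMEM;

	cache->tag = tag;
	mock_registry[mock_nr_caches++] = cache;
	return 0;
}
```

A backend would normally treat any non-zero return as a failure to bring the
cache online and clean up whatever it had set up for that cache.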
59 | |||
60 | |||
61 | ===================== | ||
62 | UNREGISTERING A CACHE | ||
63 | ===================== | ||
64 | |||
65 | A cache can be withdrawn from the system by calling this function with a | ||
66 | pointer to the cache definition: | ||
67 | |||
68 | void fscache_withdraw_cache(struct fscache_cache *cache); | ||
69 | |||
70 | In CacheFS's case, this is called by put_super(). | ||
71 | |||
72 | |||
73 | ======== | ||
74 | SECURITY | ||
75 | ======== | ||
76 | |||
77 | The cache methods are executed in one of two contexts: | ||
78 | |||
79 | (1) that of the userspace process that issued the netfs operation that caused | ||
80 | the cache method to be invoked, or | ||
81 | |||
82 | (2) that of one of the processes in the FS-Cache thread pool. | ||
83 | |||
84 | In either case, this may not be an appropriate context in which to access the | ||
85 | cache. | ||
86 | |||
87 | The calling process's fsuid, fsgid and SELinux security identities may need to | ||
88 | be masqueraded for the duration of the cache driver's access to the cache. | ||
89 | This is left to the cache to handle; FS-Cache makes no effort in this regard. | ||
90 | |||
91 | |||
92 | =================================== | ||
93 | CONTROL AND STATISTICS PRESENTATION | ||
94 | =================================== | ||
95 | |||
96 | The cache may present data to the outside world through FS-Cache's interfaces | ||
97 | in sysfs and procfs - the former for control and the latter for statistics. | ||
98 | |||
99 | A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS | ||
100 | is enabled. This is accessible through the kobject struct fscache_cache::kobj | ||
101 | and is for use by the cache as it sees fit. | ||
102 | |||
103 | |||
104 | ======================== | ||
105 | RELEVANT DATA STRUCTURES | ||
106 | ======================== | ||
107 | |||
108 | (*) Index/Data file FS-Cache representation cookie: | ||
109 | |||
110 | struct fscache_cookie { | ||
111 | struct fscache_object_def *def; | ||
112 | struct fscache_netfs *netfs; | ||
113 | void *netfs_data; | ||
114 | ... | ||
115 | }; | ||
116 | |||
117 | The fields that might be of use to the backend describe the object | ||
118 | definition, the netfs definition and the netfs's data for this cookie. | ||
119 | The object definition contains functions supplied by the netfs for loading | ||
120 | and matching index entries; these are required to provide some of the | ||
121 | cache operations. | ||
122 | |||
123 | |||
124 | (*) In-cache object representation: | ||
125 | |||
126 | struct fscache_object { | ||
127 | int debug_id; | ||
128 | enum { | ||
129 | FSCACHE_OBJECT_RECYCLING, | ||
130 | ... | ||
131 | } state; | ||
132 | spinlock_t lock; | ||
133 | struct fscache_cache *cache; | ||
134 | struct fscache_cookie *cookie; | ||
135 | ... | ||
136 | }; | ||
137 | |||
138 | Structures of this type should be allocated by the cache backend and | ||
139 | passed to FS-Cache when requested by the appropriate cache operation. In | ||
140 | the case of CacheFS, they're embedded in CacheFS's internal object | ||
141 | structures. | ||
142 | |||
143 | The debug_id is a simple integer that can be used in debugging messages | ||
144 | that refer to a particular object. In such a case it should be printed | ||
145 | using "OBJ%x" to be consistent with FS-Cache. | ||
146 | |||
147 | Each object contains a pointer to the cookie that represents the object it | ||
148 | is backing. An object should be retired when put_object() is called if it is | ||
149 | in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be | ||
150 | initialised by calling fscache_object_init(object). | ||
151 | |||
152 | |||
153 | (*) FS-Cache operation record: | ||
154 | |||
155 | struct fscache_operation { | ||
156 | atomic_t usage; | ||
157 | struct fscache_object *object; | ||
158 | unsigned long flags; | ||
159 | #define FSCACHE_OP_EXCLUSIVE | ||
160 | void (*processor)(struct fscache_operation *op); | ||
161 | void (*release)(struct fscache_operation *op); | ||
162 | ... | ||
163 | }; | ||
164 | |||
165 | FS-Cache has a pool of threads that it uses to give CPU time to the | ||
166 | various asynchronous operations that need to be done as part of driving | ||
167 | the cache. These are represented by the above structure. The processor | ||
168 | method is called to give the op CPU time, and the release method to get | ||
169 | rid of it when its usage count reaches 0. | ||
170 | |||
171 | An operation can be made exclusive upon an object by setting the | ||
172 | appropriate flag before enqueuing it with fscache_enqueue_operation(). If | ||
173 | an operation needs more processing time, it should be enqueued again. | ||
174 | |||
175 | |||
176 | (*) FS-Cache retrieval operation record: | ||
177 | |||
178 | struct fscache_retrieval { | ||
179 | struct fscache_operation op; | ||
180 | struct address_space *mapping; | ||
181 | struct list_head *to_do; | ||
182 | ... | ||
183 | }; | ||
184 | |||
185 | A structure of this type is allocated by FS-Cache to record retrieval and | ||
186 | allocation requests made by the netfs. This struct is then passed to the | ||
187 | backend to do the operation. The backend may get extra refs to it by | ||
188 | calling fscache_get_retrieval() and refs may be discarded by calling | ||
189 | fscache_put_retrieval(). | ||
190 | |||
191 | A retrieval operation can be used by the backend to do retrieval work. To | ||
192 | do this, the retrieval->op.processor method pointer should be set | ||
193 | appropriately by the backend and fscache_enqueue_retrieval() called to | ||
194 | submit it to the thread pool. CacheFiles, for example, uses this to queue | ||
195 | page examination when it detects PG_lock being cleared. | ||
196 | |||
197 | The to_do field is an empty list available for the cache backend to use as | ||
198 | it sees fit. | ||
199 | |||
200 | |||
201 | (*) FS-Cache storage operation record: | ||
202 | |||
203 | struct fscache_storage { | ||
204 | struct fscache_operation op; | ||
205 | pgoff_t store_limit; | ||
206 | ... | ||
207 | }; | ||
208 | |||
209 | A structure of this type is allocated by FS-Cache to record outstanding | ||
210 | writes to be made. FS-Cache itself enqueues this operation and invokes | ||
211 | the write_page() method on the object at appropriate times to effect | ||
212 | storage. | ||
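
The life cycle of the operation record described earlier in this section
(the processor method gives the op CPU time, the release method disposes of
it when the usage count reaches 0) might be modelled in userspace roughly as
follows. All names here (struct mock_operation and the mock_*() helpers) are
hypothetical; the real structure uses an atomic_t usage count and is driven
by the FS-Cache thread pool.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for struct fscache_operation; NOT the kernel type. */
struct mock_operation {
	int usage;		/* plain int here; atomic_t in the kernel */
	int runs;		/* how many times the processor was called */
	int released;		/* set once the release method has run */
	void (*processor)(struct mock_operation *op);
	void (*release)(struct mock_operation *op);
};

static void mock_processor(struct mock_operation *op)
{
	op->runs++;		/* one slice of work */
}

static void mock_release(struct mock_operation *op)
{
	op->released = 1;	/* a real backend would free resources here */
}

static void mock_op_init(struct mock_operation *op)
{
	op->usage = 1;		/* the creator holds the first reference */
	op->runs = 0;
	op->released = 0;
	op->processor = mock_processor;
	op->release = mock_release;
}

/* What a pool thread would do with an enqueued operation. */
static void mock_run_operation(struct mock_operation *op)
{
	op->processor(op);
}

/* Drop a reference; release the operation when the count reaches 0. */
static void mock_put_operation(struct mock_operation *op)
{
	assert(op->usage > 0);
	if (--op->usage == 0)
		op->release(op);
}
```

An operation that needs more processing time is simply run again, which is
the mock equivalent of enqueuing it a second time.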
213 | |||
214 | |||
215 | ================ | ||
216 | CACHE OPERATIONS | ||
217 | ================ | ||
218 | |||
219 | The cache backend provides FS-Cache with a table of operations that can be | ||
220 | performed on the denizens of the cache. These are held in a structure of type: | ||
221 | |||
222 | struct fscache_cache_ops | ||
223 | |||
224 | (*) Name of cache provider [mandatory]: | ||
225 | |||
226 | const char *name | ||
227 | |||
228 | This isn't strictly an operation, but should be pointed at a string naming | ||
229 | the backend. | ||
230 | |||
231 | |||
232 | (*) Allocate a new object [mandatory]: | ||
233 | |||
234 | struct fscache_object *(*alloc_object)(struct fscache_cache *cache, | ||
235 | struct fscache_cookie *cookie) | ||
236 | |||
237 | This method is used to allocate a cache object representation to back a | ||
238 | cookie in a particular cache. fscache_object_init() should be called on | ||
239 | the object to initialise it prior to returning. | ||
240 | |||
241 | This function may also be used to parse the index key to be used for | ||
242 | multiple lookup calls to turn it into a more convenient form. FS-Cache | ||
243 | will call the lookup_complete() method to allow the cache to release the | ||
244 | form once lookup is complete or aborted. | ||
245 | |||
246 | |||
247 | (*) Look up and create object [mandatory]: | ||
248 | |||
249 | void (*lookup_object)(struct fscache_object *object) | ||
250 | |||
251 | This method is used to look up an object, given that the object is already | ||
252 | allocated and attached to the cookie. This should instantiate that object | ||
253 | in the cache if it can. | ||
254 | |||
255 | The method should call fscache_object_lookup_negative() as soon as | ||
256 | possible if it determines the object doesn't exist in the cache. If the | ||
257 | object is found to exist and the netfs indicates that it is valid then | ||
258 | fscache_obtained_object() should be called once the object is in a | ||
259 | position to have data stored in it. Similarly, fscache_obtained_object() | ||
260 | should also be called once a non-present object has been created. | ||
261 | |||
262 | If a lookup error occurs, fscache_object_lookup_error() should be called | ||
263 | to abort the lookup of that object. | ||
264 | |||
265 | |||
266 | (*) Release lookup data [mandatory]: | ||
267 | |||
268 | void (*lookup_complete)(struct fscache_object *object) | ||
269 | |||
270 | This method is called to ask the cache to release any resources it was | ||
271 | using to perform a lookup. | ||
272 | |||
273 | |||
274 | (*) Increment object refcount [mandatory]: | ||
275 | |||
276 | struct fscache_object *(*grab_object)(struct fscache_object *object) | ||
277 | |||
278 | This method is called to increment the reference count on an object. It | ||
279 | may fail (for instance if the cache is being withdrawn) by returning NULL. | ||
280 | It should return the object pointer if successful. | ||
281 | |||
282 | |||
283 | (*) Lock/Unlock object [mandatory]: | ||
284 | |||
285 | void (*lock_object)(struct fscache_object *object) | ||
286 | void (*unlock_object)(struct fscache_object *object) | ||
287 | |||
288 | These methods are used to exclusively lock an object. It must be possible | ||
289 | to schedule with the lock held, so a spinlock isn't sufficient. | ||
290 | |||
291 | |||
292 | (*) Pin/Unpin object [optional]: | ||
293 | |||
294 | int (*pin_object)(struct fscache_object *object) | ||
295 | void (*unpin_object)(struct fscache_object *object) | ||
296 | |||
297 | These methods are used to pin an object into the cache. Once pinned an | ||
298 | object cannot be reclaimed to make space. Return -ENOSPC if there's not | ||
299 | enough space in the cache to permit this. | ||
300 | |||
301 | |||
302 | (*) Update object [mandatory]: | ||
303 | |||
304 | int (*update_object)(struct fscache_object *object) | ||
305 | |||
306 | This is called to update the index entry for the specified object. The | ||
307 | new information should be in object->cookie->netfs_data. This can be | ||
308 | obtained by calling object->cookie->def->get_aux()/get_attr(). | ||
309 | |||
310 | |||
311 | (*) Discard object [mandatory]: | ||
312 | |||
313 | void (*drop_object)(struct fscache_object *object) | ||
314 | |||
315 | This method is called to indicate that an object has been unbound from its | ||
316 | cookie, and that the cache should release the object's resources and | ||
317 | retire it if it's in state FSCACHE_OBJECT_RECYCLING. | ||
318 | |||
319 | This method should not attempt to release any references held by the | ||
320 | caller. The caller will invoke the put_object() method as appropriate. | ||
321 | |||
322 | |||
323 | (*) Release object reference [mandatory]: | ||
324 | |||
325 | void (*put_object)(struct fscache_object *object) | ||
326 | |||
327 | This method is used to discard a reference to an object. The object may | ||
328 | be freed when all the references to it are released. | ||
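
The contract between grab_object() and put_object() can be sketched in
userspace as below. This is only an illustration under stated assumptions:
struct mock_object, its cache_withdrawn and freed flags, and the mock_*()
functions are hypothetical, and a real backend would need proper locking or
atomic counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for a backend's object; NOT struct fscache_object. */
struct mock_object {
	int usage;		/* reference count (not atomic in this sketch) */
	bool cache_withdrawn;	/* the owning cache is being withdrawn */
	bool freed;		/* set when the last reference is dropped */
};

/* Models grab_object(): return the object, or NULL if it cannot be taken. */
static struct mock_object *mock_grab_object(struct mock_object *object)
{
	if (object->cache_withdrawn)
		return NULL;
	object->usage++;
	return object;
}

/* Models put_object(): the object may be freed once all refs are gone. */
static void mock_put_object(struct mock_object *object)
{
	assert(object->usage > 0);
	if (--object->usage == 0)
		object->freed = true;
}
```

Callers must check the result of the grab for NULL before using the object,
since the grab is allowed to fail.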
329 | |||
330 | |||
331 | (*) Synchronise a cache [mandatory]: | ||
332 | |||
333 | void (*sync)(struct fscache_cache *cache) | ||
334 | |||
335 | This is called to ask the backend to synchronise a cache with its backing | ||
336 | device. | ||
337 | |||
338 | |||
339 | (*) Dissociate a cache [mandatory]: | ||
340 | |||
341 | void (*dissociate_pages)(struct fscache_cache *cache) | ||
342 | |||
343 | This is called to ask a cache to perform any page dissociations as part of | ||
344 | cache withdrawal. | ||
345 | |||
346 | |||
347 | (*) Notification that the attributes on a netfs file changed [mandatory]: | ||
348 | |||
349 | int (*attr_changed)(struct fscache_object *object); | ||
350 | |||
351 | This is called to indicate to the cache that certain attributes on a netfs | ||
352 | file have changed (for example the maximum size a file may reach). The | ||
353 | cache can read these from the netfs by calling the cookie's get_attr() | ||
354 | method. | ||
355 | |||
356 | The cache may use the file size information to reserve space on the cache. | ||
357 | It should also call fscache_set_store_limit() to indicate to FS-Cache the | ||
358 | highest byte it's willing to store for an object. | ||
359 | |||
360 | This method may return -ve if an error occurred or the cache object cannot | ||
361 | be expanded. In such a case, the object will be withdrawn from service. | ||
362 | |||
363 | This operation is run asynchronously from FS-Cache's thread pool, and | ||
364 | storage and retrieval operations from the netfs are excluded during the | ||
365 | execution of this operation. | ||
366 | |||
367 | |||
368 | (*) Reserve cache space for an object's data [optional]: | ||
369 | |||
370 | int (*reserve_space)(struct fscache_object *object, loff_t size); | ||
371 | |||
372 | This is called to request that cache space be reserved to hold the data | ||
373 | for an object and the metadata used to track it. Zero size should be | ||
374 | taken as a request to cancel a reservation. | ||
375 | |||
376 | This should return 0 if successful, -ENOSPC if there isn't enough space | ||
377 | available, or -ENOMEM or -EIO on other errors. | ||
378 | |||
379 | The reservation may exceed the current size of the object, thus permitting | ||
380 | future expansion. If the amount of space consumed by an object would | ||
381 | exceed the reservation, it's permitted to refuse requests to allocate | ||
382 | pages, but not required. An object may be pruned down to its reservation | ||
383 | size if larger than that already. | ||
384 | |||
385 | |||
386 | (*) Request page be read from cache [mandatory]: | ||
387 | |||
388 | int (*read_or_alloc_page)(struct fscache_retrieval *op, | ||
389 | struct page *page, | ||
390 | gfp_t gfp) | ||
391 | |||
392 | This is called to attempt to read a netfs page from the cache, or to | ||
393 | reserve a backing block if not. FS-Cache will have done as much checking | ||
394 | as it can before calling, but most of the work belongs to the backend. | ||
395 | |||
396 | If there's no page in the cache, then -ENODATA should be returned if the | ||
397 | backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it | ||
398 | didn't. | ||
399 | |||
400 | If there is suitable data in the cache, then a read operation should be | ||
401 | queued and 0 returned. When the read finishes, fscache_end_io() should be | ||
402 | called. | ||
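
The return-value rules for read_or_alloc_page() can be sketched in plain C.
This is a userspace model under stated assumptions, not backend code: struct
page_probe and read_or_alloc_result() are hypothetical names, and the choice
between -ENOMEM and -ENOBUFS is reduced to a single flag.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical summary of what a backend found for one netfs page. */
struct page_probe {
	bool data_present;	/* suitable data is already in the cache */
	bool block_reserved;	/* a backing block could be reserved */
	bool out_of_memory;	/* reservation failed for lack of memory */
};

/* Choose the value read_or_alloc_page() should return, as described above. */
static int read_or_alloc_result(const struct page_probe *p)
{
	if (p->data_present)
		return 0;		/* read queued; fscache_end_io() comes later */
	if (p->block_reserved)
		return -ENODATA;	/* nothing to read, but a block is set aside */
	return p->out_of_memory ? -ENOMEM : -ENOBUFS;
}
```

A real backend would queue the read before returning 0; the model only shows
which code goes with which situation.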
403 | |||
404 | fscache_mark_pages_cached() should be called for the page if any cache | ||
405 | metadata is retained. This will indicate to the netfs that the page needs | ||
406 | explicit uncaching. This operation takes a pagevec, thus allowing several | ||
407 | pages to be marked at once. | ||
408 | |||
409 | The retrieval record pointed to by op should be retained for each page | ||
410 | queued and released when I/O on the page has been formally ended. | ||
411 | fscache_get/put_retrieval() are available for this purpose. | ||
412 | |||
413 | The retrieval record may be used to get CPU time via the FS-Cache thread | ||
414 | pool. If this is desired, the op->op.processor should be set to point to | ||
415 | the appropriate processing routine, and fscache_enqueue_retrieval() should | ||
416 | be called at an appropriate point to request CPU time. For instance, the | ||
417 | retrieval routine could be enqueued upon the completion of a disk read. | ||
418 | The to_do field in the retrieval record is provided to aid in this. | ||
419 | |||
420 | If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS | ||
421 | returned if possible or fscache_end_io() called with a suitable error | ||
422 | code. | ||
423 | |||
424 | |||
425 | (*) Request pages be read from cache [mandatory]: | ||
426 | |||
427 | int (*read_or_alloc_pages)(struct fscache_retrieval *op, | ||
428 | struct list_head *pages, | ||
429 | unsigned *nr_pages, | ||
430 | gfp_t gfp) | ||
431 | |||
432 | This is like the read_or_alloc_page() method, except it is handed a list | ||
433 | of pages instead of one page. Any pages on which a read operation is | ||
434 | started must be added to the page cache for the specified mapping and also | ||
435 | to the LRU. Such pages must also be removed from the pages list and | ||
436 | *nr_pages decremented per page. | ||
437 | |||
438 | If there was an error such as -ENOMEM, then that should be returned; else | ||
439 | if one or more pages couldn't be read or allocated, then -ENOBUFS should | ||
440 | be returned; else if one or more pages couldn't be read, then -ENODATA | ||
441 | should be returned. If all the pages are dispatched then 0 should be | ||
442 | returned. | ||
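
The order in which the multi-page return codes take precedence can be
illustrated with a small userspace sketch. struct batch_outcome and
read_or_alloc_pages_result() are hypothetical names used only here; a real
backend would derive these counts while walking the pages list.

```c
#include <errno.h>

/* Hypothetical tally kept while processing a list of pages. */
struct batch_outcome {
	int hard_error;		/* 0, or a negative errno such as -ENOMEM */
	unsigned not_allocated;	/* pages that could be neither read nor allocated */
	unsigned not_read;	/* pages with a block allocated but no data to read */
};

/* Apply the precedence described above: error, -ENOBUFS, -ENODATA, then 0. */
static int read_or_alloc_pages_result(const struct batch_outcome *o)
{
	if (o->hard_error)
		return o->hard_error;
	if (o->not_allocated)
		return -ENOBUFS;
	if (o->not_read)
		return -ENODATA;
	return 0;		/* every page was dispatched for reading */
}
```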
443 | |||
444 | |||
445 | (*) Request page be allocated in the cache [mandatory]: | ||
446 | |||
447 | int (*allocate_page)(struct fscache_retrieval *op, | ||
448 | struct page *page, | ||
449 | gfp_t gfp) | ||
450 | |||
451 | This is like the read_or_alloc_page() method, except that it shouldn't | ||
452 | read from the cache, even if there's data there that could be retrieved. | ||
453 | It should, however, set up any internal metadata required such that | ||
454 | the write_page() method can write to the cache. | ||
455 | |||
456 | If there's no backing block available, then -ENOBUFS should be returned | ||
457 | (or -ENOMEM if there were other problems). If a block is successfully | ||
458 | allocated, then the netfs page should be marked and 0 returned. | ||
459 | |||
460 | |||
461 | (*) Request pages be allocated in the cache [mandatory]: | ||
462 | |||
463 | int (*allocate_pages)(struct fscache_retrieval *op, | ||
464 | struct list_head *pages, | ||
465 | unsigned *nr_pages, | ||
466 | gfp_t gfp) | ||
467 | |||
468 | This is a multiple-page version of the allocate_page() method. pages and | ||
469 | nr_pages should be treated as for the read_or_alloc_pages() method. | ||
470 | |||
471 | |||
472 | (*) Request page be written to cache [mandatory]: | ||
473 | |||
474 | int (*write_page)(struct fscache_storage *op, | ||
475 | struct page *page); | ||
476 | |||
477 | This is called to write from a page on which there was a previously | ||
478 | successful read_or_alloc_page() call or similar. FS-Cache filters out | ||
479 | pages that don't have mappings. | ||
480 | |||
481 | This method is called asynchronously from the FS-Cache thread pool. It is | ||
482 | not required to actually store anything, provided -ENODATA is then | ||
483 | returned to the next read of this page. | ||
484 | |||
485 | If an error occurred, then a negative error code should be returned, | ||
486 | otherwise zero should be returned. FS-Cache will take appropriate action | ||
487 | in response to an error, such as withdrawing this object. | ||
488 | |||
489 | If this method returns success then FS-Cache will inform the netfs | ||
490 | appropriately. | ||
491 | |||
492 | |||
493 | (*) Discard retained per-page metadata [mandatory]: | ||
494 | |||
495 | void (*uncache_page)(struct fscache_object *object, struct page *page) | ||
496 | |||
497 | This is called when a netfs page is being evicted from the pagecache. The | ||
498 | cache backend should tear down any internal representation or tracking it | ||
499 | maintains for this page. | ||
500 | |||
501 | |||
502 | ================== | ||
503 | FS-CACHE UTILITIES | ||
504 | ================== | ||
505 | |||
506 | FS-Cache provides some utilities that a cache backend may make use of: | ||
507 | |||
508 | (*) Note occurrence of an I/O error in a cache: | ||
509 | |||
510 | void fscache_io_error(struct fscache_cache *cache) | ||
511 | |||
512 | This tells FS-Cache that an I/O error occurred in the cache. After this | ||
513 | has been called, only resource dissociation operations (object and page | ||
514 | release) will be passed from the netfs to the cache backend for the | ||
515 | specified cache. | ||
516 | |||
517 | This does not actually withdraw the cache. That must be done separately. | ||
518 | |||
519 | |||
520 | (*) Invoke the retrieval I/O completion function: | ||
521 | |||
522 | void fscache_end_io(struct fscache_retrieval *op, struct page *page, | ||
523 | int error); | ||
524 | |||
525 | This is called to note the end of an attempt to retrieve a page. The | ||
526 | error value should be 0 if successful and an error otherwise. | ||
527 | |||
528 | |||
529 | (*) Set highest store limit: | ||
530 | |||
531 | void fscache_set_store_limit(struct fscache_object *object, | ||
532 | loff_t i_size); | ||
533 | |||
534 | This sets the limit FS-Cache imposes on the highest byte it's willing to | ||
535 | try and store for a netfs. Any page over this limit is automatically | ||
536 | rejected by fscache_read_or_alloc_page() and co with -ENOBUFS. | ||
537 | |||
538 | |||
539 | (*) Mark pages as being cached: | ||
540 | |||
541 | void fscache_mark_pages_cached(struct fscache_retrieval *op, | ||
542 | struct pagevec *pagevec); | ||
543 | |||
544 | This marks a set of pages as being cached. After this has been called, | ||
545 | the netfs must call fscache_uncache_page() to unmark the pages. | ||
546 | |||
547 | |||
548 | (*) Perform coherency check on an object: | ||
549 | |||
550 | enum fscache_checkaux fscache_check_aux(struct fscache_object *object, | ||
551 | const void *data, | ||
552 | uint16_t datalen); | ||
553 | |||
554 | This asks the netfs to perform a coherency check on an object that has | ||
555 | just been looked up. The cookie attached to the object will determine the | ||
556 | netfs to use. data and datalen should specify where the auxiliary data | ||
557 | retrieved from the cache can be found. | ||
558 | |||
559 | One of three values will be returned: | ||
560 | |||
561 | (*) FSCACHE_CHECKAUX_OKAY | ||
562 | |||
563 | The coherency data indicates the object is valid as is. | ||
564 | |||
565 | (*) FSCACHE_CHECKAUX_NEEDS_UPDATE | ||
566 | |||
567 | The coherency data needs updating, but otherwise the object is | ||
568 | valid. | ||
569 | |||
570 | (*) FSCACHE_CHECKAUX_OBSOLETE | ||
571 | |||
572 | The coherency data indicates that the object is obsolete and should | ||
573 | be discarded. | ||
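
One way a backend might act on the three results is sketched below as a
userspace model. The enum is a local stand-in that reuses the names above (its
numeric values are not taken from any header), and enum lookup_action and
act_on_checkaux() are hypothetical.

```c
/* Local stand-in for the result type; values are illustrative only. */
enum fscache_checkaux {
	FSCACHE_CHECKAUX_OKAY,
	FSCACHE_CHECKAUX_NEEDS_UPDATE,
	FSCACHE_CHECKAUX_OBSOLETE,
};

/* Hypothetical follow-up actions for a freshly looked-up object. */
enum lookup_action {
	LOOKUP_USE_OBJECT,	/* keep the object as it is */
	LOOKUP_REWRITE_AUX,	/* keep the object, rewrite its coherency data */
	LOOKUP_DISCARD_OBJECT,	/* throw the object away and start afresh */
};

static enum lookup_action act_on_checkaux(enum fscache_checkaux result)
{
	switch (result) {
	case FSCACHE_CHECKAUX_OKAY:
		return LOOKUP_USE_OBJECT;
	case FSCACHE_CHECKAUX_NEEDS_UPDATE:
		return LOOKUP_REWRITE_AUX;
	case FSCACHE_CHECKAUX_OBSOLETE:
		return LOOKUP_DISCARD_OBJECT;
	}
	/* Assumption: an unrecognised result is treated as obsolete. */
	return LOOKUP_DISCARD_OBJECT;
}
```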
574 | |||
575 | |||
576 | (*) Initialise a freshly allocated object: | ||
577 | |||
578 | void fscache_object_init(struct fscache_object *object); | ||
579 | |||
580 | This initialises all the fields in an object representation. | ||
581 | |||
582 | |||
583 | (*) Indicate the destruction of an object: | ||
584 | |||
585 | void fscache_object_destroyed(struct fscache_cache *cache); | ||
586 | |||
587 | This must be called to inform FS-Cache that an object that belonged to a | ||
588 | cache has been destroyed and deallocated. This will allow continuation | ||
589 | of the cache withdrawal process when it is stopped pending destruction of | ||
590 | all the objects. | ||
591 | |||
592 | |||
593 | (*) Indicate negative lookup on an object: | ||
594 | |||
595 | void fscache_object_lookup_negative(struct fscache_object *object); | ||
596 | |||
597 | This is called to indicate to FS-Cache that a lookup process for an object | ||
598 | found a negative result. | ||
599 | |||
600 | This changes the state of an object to permit reads pending on lookup | ||
601 | completion to go off and start fetching data from the netfs server as it's | ||
602 | known at this point that there can't be any data in the cache. | ||
603 | |||
604 | This may be called multiple times on an object. Only the first call is | ||
605 | significant - all subsequent calls are ignored. | ||
606 | |||
607 | |||
608 | (*) Indicate an object has been obtained: | ||
609 | |||
610 | void fscache_obtained_object(struct fscache_object *object); | ||
611 | |||
612 | This is called to indicate to FS-Cache that a lookup process for an object | ||
613 | produced a positive result, or that an object was created. This should | ||
614 | only be called once for any particular object. | ||
615 | |||
616 | This changes the state of an object to indicate: | ||
617 | |||
618 | (1) if no call to fscache_object_lookup_negative() has been made on | ||
619 | this object, that there may be data available, and that reads can | ||
620 | now go and look for it; and | ||
621 | |||
622 | (2) that writes may now proceed against this object. | ||
623 | |||
624 | |||
625 | (*) Indicate that object lookup failed: | ||
626 | |||
627 | void fscache_object_lookup_error(struct fscache_object *object); | ||
628 | |||
629 | This marks an object as having encountered a fatal error (usually EIO) | ||
630 | and causes it to move into a state whereby it will be withdrawn as soon | ||
631 | as possible. | ||
632 | |||
633 | |||
634 | (*) Get and release references on a retrieval record: | ||
635 | |||
636 | void fscache_get_retrieval(struct fscache_retrieval *op); | ||
637 | void fscache_put_retrieval(struct fscache_retrieval *op); | ||
638 | |||
639 | These two functions are used to retain a retrieval record whilst doing | ||
640 | asynchronous data retrieval and block allocation. | ||
641 | |||
642 | |||
643 | (*) Enqueue a retrieval record for processing: | ||
644 | |||
645 | void fscache_enqueue_retrieval(struct fscache_retrieval *op); | ||
646 | |||
647 | This enqueues a retrieval record for processing by the FS-Cache thread | ||
648 | pool. One of the threads in the pool will invoke the retrieval record's | ||
649 | op->op.processor callback function. This function may be called from | ||
650 | within the callback function. | ||
651 | |||
652 | |||
653 | (*) List of object state names: | ||
654 | |||
655 | const char *fscache_object_states[]; | ||
656 | |||
657 | For debugging purposes, this may be used to turn the state that an object | ||
658 | is in into a text string for display purposes. | ||
diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt new file mode 100644 index 000000000000..c78a49b7bba6 --- /dev/null +++ b/Documentation/filesystems/caching/cachefiles.txt | |||
@@ -0,0 +1,501 @@ | |||
1 | =============================================== | ||
2 | CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM | ||
3 | =============================================== | ||
4 | |||
5 | Contents: | ||
6 | |||
7 | (*) Overview. | ||
8 | |||
9 | (*) Requirements. | ||
10 | |||
11 | (*) Configuration. | ||
12 | |||
13 | (*) Starting the cache. | ||
14 | |||
15 | (*) Things to avoid. | ||
16 | |||
17 | (*) Cache culling. | ||
18 | |||
19 | (*) Cache structure. | ||
20 | |||
21 | (*) Security model and SELinux. | ||
22 | |||
23 | (*) A note on security. | ||
24 | |||
25 | (*) Statistical information. | ||
26 | |||
27 | (*) Debugging. | ||
28 | |||
29 | |||
30 | ======== | ||
31 | OVERVIEW | ||
32 | ======== | ||
33 | |||
34 | CacheFiles is a caching backend that uses a directory on an already mounted | ||
35 | filesystem of a local type (such as Ext3) as its cache. | ||
36 | |||
37 | CacheFiles uses a userspace daemon to do some of the cache management - such as | ||
38 | reaping stale nodes and culling. This is called cachefilesd and lives in | ||
39 | /sbin. | ||
40 | |||
41 | The filesystem and data integrity of the cache are only as good as those of the | ||
42 | filesystem providing the backing services. Note that CacheFiles does not | ||
43 | attempt to journal anything since the journalling interfaces of the various | ||
44 | filesystems are very specific in nature. | ||
45 | |||
46 | CacheFiles creates a misc character device - "/dev/cachefiles" - that is used | ||
47 | to communicate with the daemon. Only one thing may have this open at once, | ||
48 | and whilst it is open, a cache is at least partially in existence. The daemon | ||
49 | opens this and sends commands down it to control the cache. | ||
50 | |||
51 | CacheFiles is currently limited to a single cache. | ||
52 | |||
53 | CacheFiles attempts to maintain at least a certain percentage of free space on | ||
54 | the filesystem, shrinking the cache by culling the objects it contains to make | ||
55 | space if necessary - see the "Cache Culling" section. This means it can be | ||
56 | placed on the same medium as a live set of data, and will expand to make use of | ||
57 | spare space and automatically contract when the set of data requires more | ||
58 | space. | ||
59 | |||
60 | |||
61 | ============ | ||
62 | REQUIREMENTS | ||
63 | ============ | ||
64 | |||
65 | The use of CacheFiles and its daemon requires the following features to be | ||
66 | available in the system and in the cache filesystem: | ||
67 | |||
68 | - dnotify. | ||
69 | |||
70 | - extended attributes (xattrs). | ||
71 | |||
72 | - openat() and friends. | ||
73 | |||
74 | - bmap() support on files in the filesystem (FIBMAP ioctl). | ||
75 | |||
76 | - The use of bmap() to detect a partial page at the end of the file. | ||
77 | |||
78 | It is strongly recommended that the "dir_index" option is enabled on Ext3 | ||
79 | filesystems being used as a cache. | ||
80 | |||
81 | |||
82 | ============= | ||
83 | CONFIGURATION | ||
84 | ============= | ||
85 | |||
86 | The cache is configured by a script in /etc/cachefilesd.conf. These commands | ||
87 | set up the cache ready for use. The following script commands are available: | ||
88 | |||
89 | (*) brun <N>% | ||
90 | (*) bcull <N>% | ||
91 | (*) bstop <N>% | ||
92 | (*) frun <N>% | ||
93 | (*) fcull <N>% | ||
94 | (*) fstop <N>% | ||
95 | |||
96 | Configure the culling limits. Optional. See the section on culling. | ||
97 | The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. | ||
98 | |||
99 | The commands beginning with a 'b' are file space (block) limits, those | ||
100 | beginning with an 'f' are file count limits. | ||
101 | |||
102 | (*) dir <path> | ||
103 | |||
104 | Specify the directory containing the root of the cache. Mandatory. | ||
105 | |||
106 | (*) tag <name> | ||
107 | |||
108 | Specify a tag to FS-Cache to use in distinguishing multiple caches. | ||
109 | Optional. The default is "CacheFiles". | ||
110 | |||
111 | (*) debug <mask> | ||
112 | |||
113 | Specify a numeric bitmask to control debugging in the kernel module. | ||
114 | Optional. The default is zero (all off). The following values can be | ||
115 | OR'd into the mask to collect various information: | ||
116 | |||
117 | 1 Turn on trace of function entry (_enter() macros) | ||
118 | 2 Turn on trace of function exit (_leave() macros) | ||
119 | 4 Turn on trace of internal debug points (_debug()) | ||
120 | |||
121 | This mask can also be set through sysfs, eg: | ||
122 | |||
123 | echo 5 >/sys/module/cachefiles/parameters/debug | ||
124 | |||
125 | |||
126 | ================== | ||
127 | STARTING THE CACHE | ||
128 | ================== | ||
129 | |||
130 | The cache is started by running the daemon. The daemon opens the cache device, | ||
131 | configures the cache and tells it to begin caching. At that point the cache | ||
132 | binds to fscache and the cache becomes live. | ||
133 | |||
134 | The daemon is run as follows: | ||
135 | |||
136 | /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] | ||
137 | |||
138 | The flags are: | ||
139 | |||
140 | (*) -d | ||
141 | |||
142 | Increase the debugging level. This can be specified multiple times and | ||
143 | is cumulative with itself. | ||
144 | |||
145 | (*) -s | ||
146 | |||
147 | Send messages to stderr instead of syslog. | ||
148 | |||
149 | (*) -n | ||
150 | |||
151 | Don't daemonise and go into background. | ||
152 | |||
153 | (*) -f <configfile> | ||
154 | |||
155 | Use an alternative configuration file rather than the default one. | ||
156 | |||
157 | |||
158 | =============== | ||
159 | THINGS TO AVOID | ||
160 | =============== | ||
161 | |||
162 | Do not mount other things within the cache as this will cause problems. The | ||
163 | kernel module contains its own very cut-down path walking facility that ignores | ||
164 | mountpoints, but the daemon can't avoid them. | ||
165 | |||
166 | Do not create, rename or unlink files and directories in the cache whilst the | ||
167 | cache is active, as this may cause the state to become uncertain. | ||
168 | |||
169 | Renaming files in the cache might make objects appear to be other objects (the | ||
170 | filename is part of the lookup key). | ||
171 | |||
172 | Do not change or remove the extended attributes attached to cache files by the | ||
173 | cache as this will cause the cache state management to get confused. | ||
174 | |||
175 | Do not create files or directories in the cache, lest the cache get confused or | ||
176 | serve incorrect data. | ||
177 | |||
178 | Do not chmod files in the cache. The module creates things with minimal | ||
179 | permissions to prevent random users being able to access them directly. | ||
180 | |||
181 | |||
182 | ============= | ||
183 | CACHE CULLING | ||
184 | ============= | ||
185 | |||
186 | The cache may need culling occasionally to make space. This involves | ||
187 | discarding objects from the cache that have been used less recently than | ||
188 | anything else. Culling is based on the access time of data objects. Empty | ||
189 | directories are culled if not in use. | ||
190 | |||
191 | Cache culling is done on the basis of the percentage of blocks and the | ||
192 | percentage of files available in the underlying filesystem. There are six | ||
193 | "limits": | ||
194 | |||
195 | (*) brun | ||
196 | (*) frun | ||
197 | |||
198 | If the amount of free space and the number of available files in the cache | ||
199 | rises above both these limits, then culling is turned off. | ||
200 | |||
201 | (*) bcull | ||
202 | (*) fcull | ||
203 | |||
204 | If the amount of available space or the number of available files in the | ||
205 | cache falls below either of these limits, then culling is started. | ||
206 | |||
207 | (*) bstop | ||
208 | (*) fstop | ||
209 | |||
210 | If the amount of available space or the number of available files in the | ||
211 | cache falls below either of these limits, then no further allocation of | ||
212 | disk space or files is permitted until culling has raised things above | ||
213 | these limits again. | ||
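
The way the six limits interact can be modelled as a small state update. This
is a userspace sketch under stated assumptions, not the module's code: the
struct and function names are hypothetical, and how a value exactly equal to a
limit is treated is a guess, since the text only says "above" and "below".

```c
#include <stdbool.h>

/* The six limits, as percentages of available blocks and files. */
struct cache_limits {
	unsigned brun, bcull, bstop;
	unsigned frun, fcull, fstop;
};

/* Hypothetical cache state driven by the limits. */
struct cull_state {
	bool culling;		/* culling is currently turned on */
	bool alloc_allowed;	/* new disk space or files may be allocated */
};

/* Re-evaluate the state for the current percentages of free blocks/files. */
static void cull_update(struct cull_state *s, const struct cache_limits *l,
			unsigned bavail, unsigned favail)
{
	if (bavail < l->bcull || favail < l->fcull)
		s->culling = true;	/* either resource fell below its cull limit */
	else if (bavail > l->brun && favail > l->frun)
		s->culling = false;	/* both resources rose above their run limits */
	/* Otherwise the previous culling decision stands (hysteresis). */

	s->alloc_allowed = !(bavail < l->bstop || favail < l->fstop);
}
```

With the default limits of 7%, 5% and 1%, a cache that drops to 4% free blocks
starts culling and keeps culling at 6%, and only stops once both blocks and
files are above 7%.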
214 | |||
215 | These must be configured thusly: | ||
216 | |||
217 | 0 <= bstop < bcull < brun < 100 | ||
218 | 0 <= fstop < fcull < frun < 100 | ||
219 | |||
220 | Note that these are percentages of available space and available files, and do | ||
221 | _not_ appear as 100 minus the percentage displayed by the "df" program. | ||
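
The ordering constraint on each triple of limits is easy to check mechanically.
A minimal sketch, assuming the percentages have already been parsed into
unsigned integers; limits_valid() is a hypothetical helper, not part of the
daemon or the module.

```c
#include <stdbool.h>

/*
 * Check 0 <= stop < cull < run < 100 for one triple of limits (either the
 * block limits or the file limits).  The lower bound holds automatically
 * because the values are unsigned.
 */
static bool limits_valid(unsigned stop, unsigned cull, unsigned run)
{
	return stop < cull && cull < run && run < 100;
}
```

The check would be applied twice, once to (bstop, bcull, brun) and once to
(fstop, fcull, frun).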
222 | |||
223 | The userspace daemon scans the cache to build up a table of cullable objects. | ||
224 | These are then culled in least recently used order. A new scan of the cache is | ||
225 | started as soon as space is made in the table. Objects will be skipped if | ||
226 | their atimes have changed or if the kernel module says it is still using them. | ||
227 | |||
228 | |||
229 | =============== | ||
230 | CACHE STRUCTURE | ||
231 | =============== | ||
232 | |||
233 | The CacheFiles module will create two directories in the directory it was | ||
234 | given: | ||
235 | |||
236 | (*) cache/ | ||
237 | |||
238 | (*) graveyard/ | ||
239 | |||
240 | The active cache objects all reside in the first directory. The CacheFiles | ||
241 | kernel module moves any retired or culled objects that it can't simply unlink | ||
242 | to the graveyard from which the daemon will actually delete them. | ||
243 | |||
244 | The daemon uses dnotify to monitor the graveyard directory, and will delete | ||
245 | anything that appears therein. | ||
246 | |||
247 | |||
248 | The module represents index objects as directories with the filename "I..." or | ||
249 | "J...". Note that the "cache/" directory is itself a special index. | ||
250 | |||
251 | Data objects are represented as files if they have no children, or directories | ||
252 | if they do. Their filenames all begin "D..." or "E...". If represented as a | ||
253 | directory, data objects will have a file in the directory called "data" that | ||
254 | actually holds the data. | ||
255 | |||
256 | Special objects are similar to data objects, except their filenames begin | ||
257 | "S..." or "T...". | ||
258 | |||
259 | |||
260 | If an object has children, then it will be represented as a directory. | ||
261 | Immediately in the representative directory are a collection of directories | ||
262 | named for hash values of the child object keys with an '@' prepended. Into | ||
263 | this directory, if possible, will be placed the representations of the child | ||
264 | objects: | ||
265 | |||
266 | INDEX INDEX INDEX DATA FILES | ||
267 | ========= ========== ================================= ================ | ||
268 | cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 | ||
269 | cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry | ||
270 | cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry | ||
271 | cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry | ||
272 | |||
273 | |||
274 | If the key is so long that it exceeds NAME_MAX with the decorations added on to | ||
275 | it, then it will be cut into pieces, the first few of which will be used to | ||
276 | make a nest of directories, and the last one of which will be the objects | ||
277 | inside the last directory. The names of the intermediate directories will have | ||
278 | '+' prepended: | ||
279 | |||
280 | J1223/@23/+xy...z/+kl...m/Epqr | ||
281 | |||
282 | |||
283 | Note that keys are raw data, and not only may they exceed NAME_MAX in size, | ||
284 | they may also contain things like '/' and NUL characters, and so they may not | ||
285 | be suitable for turning directly into a filename. | ||
286 | |||
287 | To handle this, CacheFiles will use a suitably printable filename directly and | ||
288 | "base-64" encode ones that aren't directly suitable. The two versions of | ||
289 | object filenames indicate the encoding: | ||
290 | |||
291 | OBJECT TYPE PRINTABLE ENCODED | ||
292 | =============== =============== =============== | ||
293 | Index "I..." "J..." | ||
294 | Data "D..." "E..." | ||
295 | Special "S..." "T..." | ||
296 | |||
297 | Intermediate directories are always "@" or "+" as appropriate. | ||
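
The table above amounts to a lookup from object type and encoding to the first
character of the filename. A sketch of that mapping follows;
enum cache_object_type and object_name_prefix() are hypothetical names used
for illustration only.

```c
#include <stdbool.h>

/* Hypothetical object classes, in the order the table lists them. */
enum cache_object_type {
	CACHE_OBJECT_INDEX,
	CACHE_OBJECT_DATA,
	CACHE_OBJECT_SPECIAL,
};

/*
 * Return the leading filename character for an object: the first column is
 * used when the key is printable as-is, the second when it had to be encoded.
 */
static char object_name_prefix(enum cache_object_type type, bool encoded)
{
	static const char prefixes[3][2] = {
		{ 'I', 'J' },	/* index objects */
		{ 'D', 'E' },	/* data objects */
		{ 'S', 'T' },	/* special objects */
	};

	return prefixes[type][encoded ? 1 : 0];
}
```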
298 | |||
299 | |||
300 | Each object in the cache has an extended attribute label that holds the object | ||
301 | type ID (required to distinguish special objects) and the auxiliary data from | ||
302 | the netfs. The latter is used to detect stale objects in the cache and update | ||
303 | or retire them. | ||
304 | |||
305 | |||
306 | Note that CacheFiles will erase from the cache any file it doesn't recognise or | ||
307 | any file of an incorrect type (such as a FIFO file or a device file). | ||
308 | |||
309 | |||
310 | ========================== | ||
311 | SECURITY MODEL AND SELINUX | ||
312 | ========================== | ||
313 | |||
314 | CacheFiles is implemented to deal properly with the LSM security features of | ||
315 | the Linux kernel and the SELinux facility. | ||
316 | |||
317 | One of the problems that CacheFiles faces is that it is generally acting on | ||
318 | behalf of a process, and running in that process's context, and that includes a | ||
319 | security context that is not appropriate for accessing the cache - either | ||
320 | because the files in the cache are inaccessible to that process, or because if | ||
321 | the process creates a file in the cache, that file may be inaccessible to other | ||
322 | processes. | ||
323 | |||
324 | The way CacheFiles works is to temporarily change the security context (fsuid, | ||
325 | fsgid and actor security label) that the process acts as - without changing the | ||
326 | security context of the process when it is the target of an operation performed by | ||
327 | some other process (so signalling and suchlike still work correctly). | ||
328 | |||
329 | |||
330 | When the CacheFiles module is asked to bind to its cache, it: | ||
331 | |||
332 | (1) Finds the security label attached to the root cache directory and uses | ||
333 | that as the security label with which it will create files. By default, | ||
334 | this is: | ||
335 | |||
336 | cachefiles_var_t | ||
337 | |||
338 | (2) Finds the security label of the process which issued the bind request | ||
339 | (presumed to be the cachefilesd daemon), which by default will be: | ||
340 | |||
341 | cachefilesd_t | ||
342 | |||
343 | and asks LSM to supply a security ID as which it should act given the | ||
344 | daemon's label. By default, this will be: | ||
345 | |||
346 | cachefiles_kernel_t | ||
347 | |||
348 | SELinux transitions the daemon's security ID to the module's security ID | ||
349 | based on a rule of this form in the policy. | ||
350 | |||
351 | type_transition <daemon's-ID> kernel_t : process <module's-ID>; | ||
352 | |||
353 | For instance: | ||
354 | |||
355 | type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; | ||
356 | |||
357 | |||
358 | The module's security ID gives it permission to create, move and remove files | ||
359 | and directories in the cache, to find and access directories and files in the | ||
360 | cache, to set and access extended attributes on cache objects, and to read and | ||
361 | write files in the cache. | ||
362 | |||
363 | The daemon's security ID gives it only a very restricted set of permissions: it | ||
364 | may scan directories, stat files and erase files and directories. It may | ||
365 | not read or write files in the cache, and so it is precluded from accessing the | ||
366 | data cached therein; nor is it permitted to create new files in the cache. | ||
367 | |||
368 | |||
369 | There are policy source files available in: | ||
370 | |||
371 | http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 | ||
372 | |||
373 | and later versions. In that tarball, see the files: | ||
374 | |||
375 | cachefilesd.te | ||
376 | cachefilesd.fc | ||
377 | cachefilesd.if | ||
378 | |||
379 | They are built and installed directly by the RPM. | ||
380 | |||
381 | If a non-RPM based system is being used, then copy the above files to their own | ||
382 | directory and run: | ||
383 | |||
384 | make -f /usr/share/selinux/devel/Makefile | ||
385 | semodule -i cachefilesd.pp | ||
386 | |||
387 | You will need checkpolicy and selinux-policy-devel installed prior to the | ||
388 | build. | ||
389 | |||
390 | |||
391 | By default, the cache is located in /var/fscache, but if it is desirable that | ||
392 | it should be elsewhere, then either the above policy files must be altered, or | ||
393 | an auxiliary policy must be installed to label the alternate location of the | ||
394 | cache. | ||
395 | |||
396 | For instructions on how to add an auxiliary policy to enable the cache to be | ||
397 | located elsewhere when SELinux is in enforcing mode, please see: | ||
398 | |||
399 | /usr/share/doc/cachefilesd-*/move-cache.txt | ||
400 | |||
401 | When the cachefilesd rpm is installed; alternatively, the document can be found | ||
402 | in the sources. | ||
403 | |||
404 | |||
405 | ================== | ||
406 | A NOTE ON SECURITY | ||
407 | ================== | ||
408 | |||
409 | CacheFiles makes use of the split security in the task_struct. It allocates | ||
410 | its own task_security structure, and redirects current->act_as to point to it | ||
411 | when it acts on behalf of another process, in that process's context. | ||
412 | |||
413 | The reason it does this is that it calls vfs_mkdir() and suchlike rather than | ||
414 | bypassing security and calling inode ops directly. Therefore the VFS and LSM | ||
415 | may deny the CacheFiles access to the cache data because under some | ||
416 | circumstances the caching code is running in the security context of whatever | ||
417 | process issued the original syscall on the netfs. | ||
418 | |||
419 | Furthermore, should CacheFiles create a file or directory, the security | ||
420 | parameters with which that object is created (UID, GID, security label) would be | ||
421 | derived from the process that issued the system call, thus potentially | ||
422 | preventing other processes from accessing the cache - including CacheFiles's | ||
423 | cache management daemon (cachefilesd). | ||
424 | |||
425 | What is required is to temporarily override the security of the process that | ||
426 | issued the system call. We can't, however, just do an in-place change of the | ||
427 | security data as that affects the process as an object, not just as a subject. | ||
428 | This means it may lose signals or ptrace events for example, and affects what | ||
429 | the process looks like in /proc. | ||
430 | |||
431 | So CacheFiles makes use of a logical split in the security between the | ||
432 | objective security (task->sec) and the subjective security (task->act_as). The | ||
433 | objective security holds the intrinsic security properties of a process and is | ||
434 | never overridden. This is what appears in /proc, and is what is used when a | ||
435 | process is the target of an operation by some other process (SIGKILL for | ||
436 | example). | ||
437 | |||
438 | The subjective security holds the active security properties of a process, and | ||
439 | may be overridden. This is not seen externally, and is used when a process | ||
440 | acts upon another object, for example SIGKILLing another process or opening a | ||
441 | file. | ||
442 | |||
443 | LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request | ||
444 | for CacheFiles to run in a context of a specific security label, or to create | ||
445 | files and directories with another security label. | ||
446 | |||
447 | |||
448 | ======================= | ||
449 | STATISTICAL INFORMATION | ||
450 | ======================= | ||
451 | |||
452 | If CacheFiles is compiled with the following option enabled: | ||
453 | |||
454 | CONFIG_CACHEFILES_HISTOGRAM=y | ||
455 | |||
456 | then it will gather certain statistics and display them through a proc file. | ||
457 | |||
458 | (*) /proc/fs/cachefiles/histogram | ||
459 | |||
460 | cat /proc/fs/cachefiles/histogram | ||
461 | JIFS SECS LOOKUPS MKDIRS CREATES | ||
462 | ===== ===== ========= ========= ========= | ||
463 | |||
464 | For each duration between 0 jiffies and HZ-1 jiffies, this shows how many | ||
465 | times each of a variety of tasks took that long to run. The columns are | ||
466 | as follows: | ||
467 | |||
468 | COLUMN TIME MEASUREMENT | ||
469 | ======= ======================================================= | ||
470 | LOOKUPS Length of time to perform a lookup on the backing fs | ||
471 | MKDIRS Length of time to perform a mkdir on the backing fs | ||
472 | CREATES Length of time to perform a create on the backing fs | ||
473 | |||
474 | Each row shows the number of events that took a particular range of times. | ||
475 | Each step is 1 jiffy in size. The JIFS column indicates the particular | ||
476 | jiffy range covered, and the SECS field the equivalent number of seconds. | ||
477 | |||
478 | |||
479 | ========= | ||
480 | DEBUGGING | ||
481 | ========= | ||
482 | |||
483 | If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime | ||
484 | debugging enabled by adjusting the value in: | ||
485 | |||
486 | /sys/module/cachefiles/parameters/debug | ||
487 | |||
488 | This is a bitmask of debugging streams to enable: | ||
489 | |||
490 | BIT VALUE STREAM POINT | ||
491 | ======= ======= =============================== ======================= | ||
492 | 0 1 General Function entry trace | ||
493 | 1 2 Function exit trace | ||
494 | 2 4 General | ||
495 | |||
496 | The appropriate set of values should be OR'd together and the result written to | ||
497 | the control file. For example: | ||
498 | |||
499 | echo $((1|4)) >/sys/module/cachefiles/parameters/debug | ||
500 | |||
501 | will turn on function entry tracing and general debugging. | ||
diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt new file mode 100644 index 000000000000..9e94b9491d89 --- /dev/null +++ b/Documentation/filesystems/caching/fscache.txt | |||
@@ -0,0 +1,333 @@ | |||
1 | ========================== | ||
2 | General Filesystem Caching | ||
3 | ========================== | ||
4 | |||
5 | ======== | ||
6 | OVERVIEW | ||
7 | ======== | ||
8 | |||
9 | This facility is a general purpose cache for network filesystems, though it | ||
10 | could be used for caching other things such as ISO9660 filesystems too. | ||
11 | |||
12 | FS-Cache mediates between cache backends (such as CacheFS) and network | ||
13 | filesystems: | ||
14 | |||
15 | +---------+ | ||
16 | | | +--------------+ | ||
17 | | NFS |--+ | | | ||
18 | | | | +-->| CacheFS | | ||
19 | +---------+ | +----------+ | | /dev/hda5 | | ||
20 | | | | | +--------------+ | ||
21 | +---------+ +-->| | | | ||
22 | | | | |--+ | ||
23 | | AFS |----->| FS-Cache | | ||
24 | | | | |--+ | ||
25 | +---------+ +-->| | | | ||
26 | | | | | +--------------+ | ||
27 | +---------+ | +----------+ | | | | ||
28 | | | | +-->| CacheFiles | | ||
29 | | ISOFS |--+ | /var/cache | | ||
30 | | | +--------------+ | ||
31 | +---------+ | ||
32 | |||
33 | Or to look at it another way, FS-Cache is a module that provides a caching | ||
34 | facility to a network filesystem such that the cache is transparent to the | ||
35 | user: | ||
36 | |||
37 | +---------+ | ||
38 | | | | ||
39 | | Server | | ||
40 | | | | ||
41 | +---------+ | ||
42 | | NETWORK | ||
43 | ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
44 | | | ||
45 | | +----------+ | ||
46 | V | | | ||
47 | +---------+ | | | ||
48 | | | | | | ||
49 | | NFS |----->| FS-Cache | | ||
50 | | | | |--+ | ||
51 | +---------+ | | | +--------------+ +--------------+ | ||
52 | | | | | | | | | | ||
53 | V +----------+ +-->| CacheFiles |-->| Ext3 | | ||
54 | +---------+ | /var/cache | | /dev/sda6 | | ||
55 | | | +--------------+ +--------------+ | ||
56 | | VFS | ^ ^ | ||
57 | | | | | | ||
58 | +---------+ +--------------+ | | ||
59 | | KERNEL SPACE | | | ||
60 | ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~ | ||
61 | | USER SPACE | | | ||
62 | V | | | ||
63 | +---------+ +--------------+ | ||
64 | | | | | | ||
65 | | Process | | cachefilesd | | ||
66 | | | | | | ||
67 | +---------+ +--------------+ | ||
68 | |||
69 | |||
70 | FS-Cache does not follow the idea of completely loading every netfs file | ||
71 | opened in its entirety into a cache before permitting it to be accessed and | ||
72 | then serving the pages out of that cache rather than the netfs inode because: | ||
73 | |||
74 | (1) It must be practical to operate without a cache. | ||
75 | |||
76 | (2) The size of any accessible file must not be limited to the size of the | ||
77 | cache. | ||
78 | |||
79 | (3) The combined size of all opened files (this includes mapped libraries) | ||
80 | must not be limited to the size of the cache. | ||
81 | |||
82 | (4) The user should not be forced to download an entire file just to do a | ||
83 | one-off access of a small portion of it (such as might be done with the | ||
84 | "file" program). | ||
85 | |||
86 | It instead serves the cache out in PAGE_SIZE chunks as and when requested by | ||
87 | the netfs('s) using it. | ||
88 | |||
89 | |||
90 | FS-Cache provides the following facilities: | ||
91 | |||
92 | (1) More than one cache can be used at once. Caches can be selected | ||
93 | explicitly by use of tags. | ||
94 | |||
95 | (2) Caches can be added / removed at any time. | ||
96 | |||
97 | (3) The netfs is provided with an interface that allows either party to | ||
98 | withdraw caching facilities from a file (required for (2)). | ||
99 | |||
100 | (4) The interface to the netfs returns as few errors as possible, preferring | ||
101 | rather to let the netfs remain oblivious. | ||
102 | |||
103 | (5) Cookies are used to represent indices, files and other objects to the | ||
104 | netfs. The simplest cookie is just a NULL pointer - indicating nothing | ||
105 | cached there. | ||
106 | |||
107 | (6) The netfs is allowed to propose - dynamically - any index hierarchy it | ||
108 | desires, though it must be aware that the index search function is | ||
109 | recursive, stack space is limited, and indices can only be children of | ||
110 | indices. | ||
111 | |||
112 | (7) Data I/O is done direct to and from the netfs's pages. The netfs | ||
113 | indicates that page A is at index B of the data-file represented by cookie | ||
114 | C, and that it should be read or written. The cache backend may or may | ||
115 | not start I/O on that page, but if it does, a netfs callback will be | ||
116 | invoked to indicate completion. The I/O may be either synchronous or | ||
117 | asynchronous. | ||
118 | |||
119 | (8) Cookies can be "retired" upon release. At this point FS-Cache will mark | ||
120 | them as obsolete and the index hierarchy rooted at that point will get | ||
121 | recycled. | ||
122 | |||
123 | (9) The netfs provides a "match" function for index searches. In addition to | ||
124 | saying whether a match was made or not, this can also specify that an | ||
125 | entry should be updated or deleted. | ||
126 | |||
127 | (10) As much as possible is done asynchronously. | ||
128 | |||
129 | |||
130 | FS-Cache maintains a virtual indexing tree in which all indices, files, objects | ||
131 | and pages are kept. Bits of this tree may actually reside in one or more | ||
132 | caches. | ||
133 | |||
134 | FSDEF | ||
135 | | | ||
136 | +------------------------------------+ | ||
137 | | | | ||
138 | NFS AFS | ||
139 | | | | ||
140 | +--------------------------+ +-----------+ | ||
141 | | | | | | ||
142 | homedir mirror afs.org redhat.com | ||
143 | | | | | ||
144 | +------------+ +---------------+ +----------+ | ||
145 | | | | | | | | ||
146 | 00001 00002 00007 00125 vol00001 vol00002 | ||
147 | | | | | | | ||
148 | +---+---+ +-----+ +---+ +------+------+ +-----+----+ | ||
149 | | | | | | | | | | | | | | | ||
150 | PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak | ||
151 | | | | ||
152 | PG0 +-------+ | ||
153 | | | | ||
154 | 00001 00003 | ||
155 | | | ||
156 | +---+---+ | ||
157 | | | | | ||
158 | PG0 PG1 PG2 | ||
159 | |||
160 | In the example above, you can see two netfs's being backed: NFS and AFS. These | ||
161 | have different index hierarchies: | ||
162 | |||
163 | (*) The NFS primary index contains per-server indices. Each server index is | ||
164 | indexed by NFS file handles to get data file objects. Each data file | ||
165 | object can have an array of pages, but may also have further child | ||
166 | objects, such as extended attributes and directory entries. Extended | ||
167 | attribute objects themselves have page-array contents. | ||
168 | |||
169 | (*) The AFS primary index contains per-cell indices. Each cell index contains | ||
170 | per-logical-volume indices. Each volume index contains up to three | ||
171 | indices for the read-write, read-only and backup mirrors of those volumes. | ||
172 | Each of these contains vnode data file objects, each of which contains an | ||
173 | array of pages. | ||
174 | |||
175 | The very top index is the FS-Cache master index in which individual netfs's | ||
176 | have entries. | ||
177 | |||
178 | Any index object may reside in more than one cache, provided it only has index | ||
179 | children. Any index with non-index object children will be assumed to only | ||
180 | reside in one cache. | ||
181 | |||
182 | |||
183 | The netfs API to FS-Cache can be found in: | ||
184 | |||
185 | Documentation/filesystems/caching/netfs-api.txt | ||
186 | |||
187 | The cache backend API to FS-Cache can be found in: | ||
188 | |||
189 | Documentation/filesystems/caching/backend-api.txt | ||
190 | |||
191 | A description of the internal representations and object state machine can be | ||
192 | found in: | ||
193 | |||
194 | Documentation/filesystems/caching/object.txt | ||
195 | |||
196 | |||
197 | ======================= | ||
198 | STATISTICAL INFORMATION | ||
199 | ======================= | ||
200 | |||
201 | If FS-Cache is compiled with the following options enabled: | ||
202 | |||
203 | CONFIG_FSCACHE_STATS=y | ||
204 | CONFIG_FSCACHE_HISTOGRAM=y | ||
205 | |||
206 | then it will gather certain statistics and display them through a number of | ||
207 | proc files. | ||
208 | |||
209 | (*) /proc/fs/fscache/stats | ||
210 | |||
211 | This shows counts of a number of events that can happen in FS-Cache: | ||
212 | |||
213 | CLASS EVENT MEANING | ||
214 | ======= ======= ======================================================= | ||
215 | Cookies idx=N Number of index cookies allocated | ||
216 | dat=N Number of data storage cookies allocated | ||
217 | spc=N Number of special cookies allocated | ||
218 | Objects alc=N Number of objects allocated | ||
219 | nal=N Number of object allocation failures | ||
220 | avl=N Number of objects that reached the available state | ||
221 | ded=N Number of objects that reached the dead state | ||
222 | ChkAux non=N Number of objects that didn't have a coherency check | ||
223 | ok=N Number of objects that passed a coherency check | ||
224 | upd=N Number of objects that needed a coherency data update | ||
225 | obs=N Number of objects that were declared obsolete | ||
226 | Pages mrk=N Number of pages marked as being cached | ||
227 | unc=N Number of uncache page requests seen | ||
228 | Acquire n=N Number of acquire cookie requests seen | ||
229 | nul=N Number of acq reqs given a NULL parent | ||
230 | noc=N Number of acq reqs rejected due to no cache available | ||
231 | ok=N Number of acq reqs succeeded | ||
232 | nbf=N Number of acq reqs rejected due to error | ||
233 | oom=N Number of acq reqs failed on ENOMEM | ||
234 | Lookups n=N Number of lookup calls made on cache backends | ||
235 | neg=N Number of negative lookups made | ||
236 | pos=N Number of positive lookups made | ||
237 | crt=N Number of objects created by lookup | ||
238 | Updates n=N Number of update cookie requests seen | ||
239 | nul=N Number of upd reqs given a NULL parent | ||
240 | run=N Number of upd reqs granted CPU time | ||
241 | Relinqs n=N Number of relinquish cookie requests seen | ||
242 | nul=N Number of rlq reqs given a NULL parent | ||
243 | wcr=N Number of rlq reqs waited on completion of creation | ||
244 | AttrChg n=N Number of attribute changed requests seen | ||
245 | ok=N Number of attr changed requests queued | ||
246 | nbf=N Number of attr changed rejected -ENOBUFS | ||
247 | oom=N Number of attr changed failed -ENOMEM | ||
248 | run=N Number of attr changed ops given CPU time | ||
249 | Allocs n=N Number of allocation requests seen | ||
250 | ok=N Number of successful alloc reqs | ||
251 | wt=N Number of alloc reqs that waited on lookup completion | ||
252 | nbf=N Number of alloc reqs rejected -ENOBUFS | ||
253 | ops=N Number of alloc reqs submitted | ||
254 | owt=N Number of alloc reqs waited for CPU time | ||
255 | Retrvls n=N Number of retrieval (read) requests seen | ||
256 | ok=N Number of successful retr reqs | ||
257 | wt=N Number of retr reqs that waited on lookup completion | ||
258 | nod=N Number of retr reqs returned -ENODATA | ||
259 | nbf=N Number of retr reqs rejected -ENOBUFS | ||
260 | int=N Number of retr reqs aborted -ERESTARTSYS | ||
261 | oom=N Number of retr reqs failed -ENOMEM | ||
262 | ops=N Number of retr reqs submitted | ||
263 | owt=N Number of retr reqs waited for CPU time | ||
264 | Stores n=N Number of storage (write) requests seen | ||
265 | ok=N Number of successful store reqs | ||
266 | agn=N Number of store reqs on a page already pending storage | ||
267 | nbf=N Number of store reqs rejected -ENOBUFS | ||
268 | oom=N Number of store reqs failed -ENOMEM | ||
269 | ops=N Number of store reqs submitted | ||
270 | run=N Number of store reqs granted CPU time | ||
271 | Ops pend=N Number of times async ops added to pending queues | ||
272 | run=N Number of times async ops given CPU time | ||
273 | enq=N Number of times async ops queued for processing | ||
274 | dfr=N Number of async ops queued for deferred release | ||
275 | rel=N Number of async ops released | ||
276 | gc=N Number of deferred-release async ops garbage collected | ||
277 | |||
278 | |||
279 | (*) /proc/fs/fscache/histogram | ||
280 | |||
281 | cat /proc/fs/fscache/histogram | ||
282 | JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS | ||
283 | ===== ===== ========= ========= ========= ========= ========= | ||
284 | |||
285 | For each duration between 0 jiffies and HZ-1 jiffies, this shows how many | ||
286 | times each of a variety of tasks took that long to run. The columns are | ||
287 | as follows: | ||
288 | |||
289 | COLUMN TIME MEASUREMENT | ||
290 | ======= ======================================================= | ||
291 | OBJ INST Length of time to instantiate an object | ||
292 | OP RUNS Length of time a call to process an operation took | ||
293 | OBJ RUNS Length of time a call to process an object event took | ||
294 | RETRV DLY Time between requesting a read and lookup completing | ||
295 | RETRIEVLS Time between beginning and end of a retrieval | ||
296 | |||
297 | Each row shows the number of events that took a particular range of times. | ||
298 | Each step is 1 jiffy in size. The JIFS column indicates the particular | ||
299 | jiffy range covered, and the SECS field the equivalent number of seconds. | ||
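
The relationship between the JIFS and SECS columns can be sketched in C as
follows. This is only an illustration: hist_row_msecs() is a hypothetical
helper, not a kernel function, and the tick rate is passed in as a parameter
because HZ is a build-time setting that the sketch cannot know.

```c
#include <assert.h>

/*
 * Convert a histogram row (a duration in jiffies) to milliseconds.
 * hz stands in for the kernel's HZ value and must be supplied by the caller.
 */
static unsigned int hist_row_msecs(unsigned int jiffies, unsigned int hz)
{
	assert(hz > 0 && jiffies < hz);	/* rows run from 0 to HZ-1 */
	return jiffies * 1000u / hz;
}
```

With a tick rate of 250, for instance, row 25 corresponds to 100 milliseconds,
that is 0.100 seconds.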
300 | |||
301 | |||
302 | ========= | ||
303 | DEBUGGING | ||
304 | ========= | ||
305 | |||
306 | If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime | ||
307 | debugging enabled by adjusting the value in: | ||
308 | |||
309 | /sys/module/fscache/parameters/debug | ||
310 | |||
311 | This is a bitmask of debugging streams to enable: | ||
312 | |||
313 | BIT VALUE STREAM POINT | ||
314 | ======= ======= =============================== ======================= | ||
315 | 0 1 Cache management Function entry trace | ||
316 | 1 2 Function exit trace | ||
317 | 2 4 General | ||
318 | 3 8 Cookie management Function entry trace | ||
319 | 4 16 Function exit trace | ||
320 | 5 32 General | ||
321 | 6 64 Page handling Function entry trace | ||
322 | 7 128 Function exit trace | ||
323 | 8 256 General | ||
324 | 9 512 Operation management Function entry trace | ||
325 | 10 1024 Function exit trace | ||
326 | 11 2048 General | ||
327 | |||
328 | The appropriate set of values should be OR'd together and the result written to | ||
329 | the control file. For example: | ||
330 | |||
331 | echo $((1|8|64|512)) >/sys/module/fscache/parameters/debug | ||
332 | |||
333 | will turn on all function entry debugging. | ||
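
The layout of the bitmask table above (four streams, each owning three
consecutive bits for entry, exit and general messages) can be expressed as a
small C sketch. The names below are hypothetical and exist only for this
illustration; they are not taken from the FS-Cache sources.

```c
#include <assert.h>

/* Streams and trace points, in the order the table lists them. */
enum { DBG_CACHE, DBG_COOKIE, DBG_PAGE, DBG_OPERATION, DBG_NR_STREAMS };
enum { DBG_ENTER, DBG_LEAVE, DBG_GENERAL, DBG_NR_POINTS };

/* Bit value for one trace point of one stream: bit = stream * 3 + point. */
static unsigned int dbg_bit(unsigned int stream, unsigned int point)
{
	assert(stream < DBG_NR_STREAMS && point < DBG_NR_POINTS);
	return 1u << (stream * DBG_NR_POINTS + point);
}

/* OR together the same trace point across every stream. */
static unsigned int dbg_mask_all_streams(unsigned int point)
{
	unsigned int mask = 0;
	unsigned int stream;

	for (stream = 0; stream < DBG_NR_STREAMS; stream++)
		mask |= dbg_bit(stream, point);
	return mask;
}
```

Here dbg_mask_all_streams(DBG_ENTER) yields 585, which is 1|8|64|512, the
function entry bits of all four streams in the table.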
diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt new file mode 100644 index 000000000000..4db125b3a5c6 --- /dev/null +++ b/Documentation/filesystems/caching/netfs-api.txt | |||
@@ -0,0 +1,778 @@ | |||
1 | =============================== | ||
2 | FS-CACHE NETWORK FILESYSTEM API | ||
3 | =============================== | ||
4 | |||
5 | There's an API by which a network filesystem can make use of the FS-Cache | ||
6 | facilities. This is based around a number of principles: | ||
7 | |||
8 | (1) Caches can store a number of different object types. There are two main | ||
9 | object types: indices and files. The first is a special type used by | ||
10 | FS-Cache to make finding objects faster and to make retiring of groups of | ||
11 | objects easier. | ||
12 | |||
13 | (2) Every index, file or other object is represented by a cookie. This cookie | ||
14 | may or may not have anything associated with it, but the netfs doesn't | ||
15 | need to care. | ||
16 | |||
17 | (3) Barring the top-level index (one entry per cached netfs), the index | ||
18 | hierarchy for each netfs is structured according to the whim of the netfs. | ||
19 | |||
20 | This API is declared in <linux/fscache.h>. | ||
21 | |||
22 | This document contains the following sections: | ||
23 | |||
24 | (1) Network filesystem definition | ||
25 | (2) Index definition | ||
26 | (3) Object definition | ||
27 | (4) Network filesystem (un)registration | ||
28 | (5) Cache tag lookup | ||
29 | (6) Index registration | ||
30 | (7) Data file registration | ||
31 | (8) Miscellaneous object registration | ||
32 | (9) Setting the data file size | ||
33 | (10) Page alloc/read/write | ||
34 | (11) Page uncaching | ||
35 | (12) Index and data file update | ||
36 | (13) Miscellaneous cookie operations | ||
37 | (14) Cookie unregistration | ||
38 | (15) Index and data file invalidation | ||
39 | (16) FS-Cache specific page flags. | ||
40 | |||
41 | |||
42 | ============================= | ||
43 | NETWORK FILESYSTEM DEFINITION | ||
44 | ============================= | ||
45 | |||
46 | FS-Cache needs a description of the network filesystem. This is specified | ||
47 | using a record of the following structure: | ||
48 | |||
49 | struct fscache_netfs { | ||
50 | uint32_t version; | ||
51 | const char *name; | ||
52 | struct fscache_cookie *primary_index; | ||
53 | ... | ||
54 | }; | ||
55 | |||
56 | The first two fields should be filled in before registration, and the third | ||
57 | will be filled in by the registration function; any other fields should just be | ||
58 | ignored and are for internal use only. | ||
59 | |||
60 | The fields are: | ||
61 | |||
62 | (1) The name of the netfs (used as the key in the toplevel index). | ||
63 | |||
64 | (2) The version of the netfs (if the name matches but the version doesn't, the | ||
65 | entire in-cache hierarchy for this netfs will be scrapped and begun | ||
66 | afresh). | ||
67 | |||
68 | (3) The cookie representing the primary index will be allocated according to | ||
69 | another parameter passed into the registration function. | ||
70 | |||
71 | For example, kAFS (linux/fs/afs/) uses the following definitions to describe | ||
72 | itself: | ||
73 | |||
74 | struct fscache_netfs afs_cache_netfs = { | ||
75 | .version = 0, | ||
76 | .name = "afs", | ||
77 | }; | ||
78 | |||
79 | |||
80 | ================ | ||
81 | INDEX DEFINITION | ||
82 | ================ | ||
83 | |||
84 | Indices are used for two purposes: | ||
85 | |||
86 | (1) To aid the finding of a file based on a series of keys (such as AFS's | ||
87 | "cell", "volume ID", "vnode ID"). | ||
88 | |||
89 | (2) To make it easier to discard a subset of all the files cached based around | ||
90 | a particular key - for instance to mirror the removal of an AFS volume. | ||
91 | |||
92 | However, since it's unlikely that any two netfs's are going to want to define | ||
93 | their index hierarchies in quite the same way, FS-Cache tries to impose as few | ||
94 | restraints as possible on how an index is structured and where it is placed in | ||
95 | the tree. The netfs can even mix indices and data files at the same level, but | ||
96 | it's not recommended. | ||
97 | |||
98 | Each index entry consists of a key of indeterminate length plus some auxiliary | ||
99 | data, also of indeterminate length. | ||
100 | |||
101 | There are some limits on indices: | ||
102 | |||
103 | (1) Any index containing non-index objects should be restricted to a single | ||
104 | cache. Any such objects created within an index will be created in the | ||
105 | first cache only. The cache in which an index is created can be | ||
106 | controlled by cache tags (see below). | ||
107 | |||
108 | (2) The entry data must be atomically journallable, so it is limited to about | ||
109 | 400 bytes at present. At least 400 bytes will be available. | ||
110 | |||
111 | (3) The depth of the index tree should be judged with care as the search | ||
112 | function is recursive. Too many layers will run the kernel out of stack. | ||
113 | |||
114 | |||
115 | ================= | ||
116 | OBJECT DEFINITION | ||
117 | ================= | ||
118 | |||
119 | To define an object, a structure of the following type should be filled out: | ||
120 | |||
121 | struct fscache_cookie_def | ||
122 | { | ||
123 | uint8_t name[16]; | ||
124 | uint8_t type; | ||
125 | |||
126 | struct fscache_cache_tag *(*select_cache)( | ||
127 | const void *parent_netfs_data, | ||
128 | const void *cookie_netfs_data); | ||
129 | |||
130 | uint16_t (*get_key)(const void *cookie_netfs_data, | ||
131 | void *buffer, | ||
132 | uint16_t bufmax); | ||
133 | |||
134 | void (*get_attr)(const void *cookie_netfs_data, | ||
135 | uint64_t *size); | ||
136 | |||
137 | uint16_t (*get_aux)(const void *cookie_netfs_data, | ||
138 | void *buffer, | ||
139 | uint16_t bufmax); | ||
140 | |||
141 | enum fscache_checkaux (*check_aux)(void *cookie_netfs_data, | ||
142 | const void *data, | ||
143 | uint16_t datalen); | ||
144 | |||
145 | void (*get_context)(void *cookie_netfs_data, void *context); | ||
146 | |||
147 | void (*put_context)(void *cookie_netfs_data, void *context); | ||
148 | |||
149 | void (*mark_pages_cached)(void *cookie_netfs_data, | ||
150 | struct address_space *mapping, | ||
151 | struct pagevec *cached_pvec); | ||
152 | |||
153 | void (*now_uncached)(void *cookie_netfs_data); | ||
154 | }; | ||
155 | |||
156 | This has the following fields: | ||
157 | |||
158 | (1) The type of the object [mandatory]. | ||
159 | |||
160 | This is one of the following values: | ||
161 | |||
162 | (*) FSCACHE_COOKIE_TYPE_INDEX | ||
163 | |||
164 | This defines an index, which is a special FS-Cache type. | ||
165 | |||
166 | (*) FSCACHE_COOKIE_TYPE_DATAFILE | ||
167 | |||
168 | This defines an ordinary data file. | ||
169 | |||
170 | (*) Any other value between 2 and 255 | ||
171 | |||
172 | This defines an extraordinary object such as an XATTR. | ||
173 | |||
174 | (2) The name of the object type (NUL terminated unless all 16 chars are used) | ||
175 | [optional]. | ||
176 | |||
177 | (3) A function to select the cache in which to store an index [optional]. | ||
178 | |||
179 | This function is invoked when an index needs to be instantiated in a cache | ||
180 | during the instantiation of a non-index object. Only the immediate index | ||
181 | parent for the non-index object will be queried. Any indices above that | ||
182 | in the hierarchy may be stored in multiple caches. This function does not | ||
183 | need to be supplied for any non-index object or any index that will only | ||
184 | have index children. | ||
185 | |||
186 | If this function is not supplied or if it returns NULL then the first | ||
187 | cache in the parent's list will be chosen, or failing that, the first | ||
188 | cache in the master list. | ||
189 | |||
190 | (4) A function to retrieve an object's key from the netfs [mandatory]. | ||
191 | |||
192 | This function will be called with the netfs data that was passed to the | ||
193 | cookie acquisition function and the maximum length of key data that it may | ||
194 | provide. It should write the required key data into the given buffer and | ||
195 | return the quantity it wrote. | ||
196 | |||
197 | (5) A function to retrieve attribute data from the netfs [optional]. | ||
198 | |||
199 | This function will be called with the netfs data that was passed to the | ||
200 | cookie acquisition function. It should return the size of the file if | ||
201 | this is a data file. The size may be used to govern how much cache must | ||
202 | be reserved for this file in the cache. | ||
203 | |||
204 | If the function is absent, a file size of 0 is assumed. | ||
205 | |||
206 | (6) A function to retrieve auxiliary data from the netfs [optional]. | ||
207 | |||
208 | This function will be called with the netfs data that was passed to the | ||
209 | cookie acquisition function and the maximum length of auxiliary data that | ||
210 | it may provide. It should write the auxiliary data into the given buffer | ||
211 | and return the quantity it wrote. | ||
212 | |||
213 | If this function is absent, the auxiliary data length will be set to 0. | ||
214 | |||
215 | The length of the auxiliary data buffer may be dependent on the key | ||
216 | length. A netfs mustn't rely on being able to provide more than 400 bytes | ||
217 | for both. | ||
218 | |||
219 | (7) A function to check the auxiliary data [optional]. | ||
220 | |||
221 | This function will be called to check that a match found in the cache for | ||
222 | this object is valid. For instance with AFS it could check the auxiliary | ||
223 | data against the data version number returned by the server to determine | ||
224 | whether the index entry in a cache is still valid. | ||
225 | |||
226 | If this function is absent, it will be assumed that matching objects in a | ||
227 | cache are always valid. | ||
228 | |||
229 | If present, the function should return one of the following values: | ||
230 | |||
231 | (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is | ||
232 | (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update | ||
233 | (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted | ||
234 | |||
235 | This function can also be used to extract data from the auxiliary data in | ||
236 | the cache and copy it into the netfs's structures. | ||
237 | |||
238 | (8) A pair of functions to manage contexts for the completion callback | ||
239 | [optional]. | ||
240 | |||
241 | The cache read/write functions are passed a context which is then passed | ||
242 | to the I/O completion callback function. To ensure this context remains | ||
243 | valid until after the I/O completion is called, two functions may be | ||
244 | provided: one to get an extra reference on the context, and one to drop a | ||
245 | reference to it. | ||
246 | |||
247 | If the context is not used or is a type of object that won't go out of | ||
248 | scope, then these functions are not required. These functions are not | ||
249 | required for indices as indices may not contain data. These functions may | ||
250 | be called in interrupt context and so may not sleep. | ||
251 | |||
252 | (9) A function to mark a page as retaining cache metadata [optional]. | ||
253 | |||
254 | This is called by the cache to indicate that it is retaining in-memory | ||
255 | information for this page and that the netfs should uncache the page when | ||
256 | it has finished. This does not indicate whether there's data on the disk | ||
257 | or not. Note that several pages at once may be presented for marking. | ||
258 | |||
259 | The PG_fscache bit is set on the pages before this function would be | ||
260 | called, so the function need not be provided if this is sufficient. | ||
261 | |||
262 | This function is not required for indices as they're not permitted data. | ||
263 | |||
264 | (10) A function to unmark all the pages retaining cache metadata [mandatory]. | ||
265 | |||
266 | This is called by FS-Cache to indicate that a backing store is being | ||
267 | unbound from a cookie and that all the marks on the pages should be | ||
268 | cleared to prevent confusion. Note that the cache will have torn down all | ||
269 | its tracking information so that the pages don't need to be explicitly | ||
270 | uncached. | ||
271 | |||
272 | This function is not required for indices as they're not permitted data. | ||
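
As an illustration of the get_key, get_aux and check_aux operations described
above, the following userspace sketch follows the signatures shown in
struct fscache_cookie_def. Everything named demo_* is hypothetical, and the
enum is a local stand-in for the one declared in <linux/fscache.h> so that the
sketch can be compiled outside the kernel. Returning 0 from demo_get_key()
when the buffer is too small is an assumption of this sketch, not documented
behaviour.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-in for the kernel enum; the real one is in <linux/fscache.h>. */
enum fscache_checkaux {
	FSCACHE_CHECKAUX_OKAY,
	FSCACHE_CHECKAUX_NEEDS_UPDATE,
	FSCACHE_CHECKAUX_OBSOLETE,
};

/* Hypothetical netfs object handed over as cookie_netfs_data. */
struct demo_vnode {
	uint32_t vnode_id;
	uint32_t uniquifier;
	uint64_t data_version;
};

/* Write the object's key into the buffer and return the length written. */
static uint16_t demo_get_key(const void *cookie_netfs_data,
			     void *buffer, uint16_t bufmax)
{
	const struct demo_vnode *v = cookie_netfs_data;
	uint32_t key[2] = { v->vnode_id, v->uniquifier };

	if (bufmax < sizeof(key))
		return 0;	/* assumption: provide no key if it cannot fit */
	memcpy(buffer, key, sizeof(key));
	return (uint16_t)sizeof(key);
}

/* Store the data version as the auxiliary data. */
static uint16_t demo_get_aux(const void *cookie_netfs_data,
			     void *buffer, uint16_t bufmax)
{
	const struct demo_vnode *v = cookie_netfs_data;

	if (bufmax < sizeof(v->data_version))
		return 0;
	memcpy(buffer, &v->data_version, sizeof(v->data_version));
	return (uint16_t)sizeof(v->data_version);
}

/* Compare the cached data version with the one the netfs currently holds. */
static enum fscache_checkaux demo_check_aux(void *cookie_netfs_data,
					    const void *data, uint16_t datalen)
{
	const struct demo_vnode *v = cookie_netfs_data;
	uint64_t cached;

	if (datalen != sizeof(cached))
		return FSCACHE_CHECKAUX_OBSOLETE;
	memcpy(&cached, data, sizeof(cached));
	return cached == v->data_version ? FSCACHE_CHECKAUX_OKAY
					 : FSCACHE_CHECKAUX_OBSOLETE;
}
```

In this sketch a cache entry whose stored data version differs from the
current one is reported as obsolete; a real netfs might instead return
FSCACHE_CHECKAUX_NEEDS_UPDATE where the entry can be refreshed in place.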
273 | |||
274 | |||
275 | =================================== | ||
276 | NETWORK FILESYSTEM (UN)REGISTRATION | ||
277 | =================================== | ||
278 | |||
279 | The first step is to declare the network filesystem to the cache. This also | ||
280 | involves specifying the layout of the primary index (for AFS, this would be the | ||
281 | "cell" level). | ||
282 | |||
283 | The registration function is: | ||
284 | |||
285 | int fscache_register_netfs(struct fscache_netfs *netfs); | ||
286 | |||
287 | It just takes a pointer to the netfs definition. It returns 0 or an error as | ||
288 | appropriate. | ||
289 | |||
290 | For kAFS, registration is done as follows: | ||
291 | |||
292 | ret = fscache_register_netfs(&afs_cache_netfs); | ||
293 | |||
294 | The last step is, of course, unregistration: | ||
295 | |||
296 | void fscache_unregister_netfs(struct fscache_netfs *netfs); | ||
297 | |||
298 | |||
299 | ================ | ||
300 | CACHE TAG LOOKUP | ||
301 | ================ | ||
302 | |||
303 | FS-Cache permits the use of more than one cache. To permit particular index | ||
304 | subtrees to be bound to particular caches, the second step is to look up cache | ||
305 | representation tags. This step is optional; it can be left entirely up to | ||
306 | FS-Cache as to which cache should be used. The problem with doing that is that | ||
307 | FS-Cache will always pick the first cache that was registered. | ||
308 | |||
309 | To get the representation for a named tag: | ||
310 | |||
311 | struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name); | ||
312 | |||
313 | This takes a text string as the name and returns a representation of a tag. It | ||
314 | will never return an error. It may return a dummy tag, however, if it runs out | ||
315 | of memory; this will inhibit caching with this tag. | ||
316 | |||
317 | Any representation so obtained must be released by passing it to this function: | ||
318 | |||
319 | void fscache_release_cache_tag(struct fscache_cache_tag *tag); | ||
320 | |||
321 | The tag will be retrieved by FS-Cache when it calls the object definition | ||
322 | operation select_cache(). | ||
323 | |||
324 | |||
325 | ================== | ||
326 | INDEX REGISTRATION | ||
327 | ================== | ||
328 | |||
329 | The third step is to inform FS-Cache about part of an index hierarchy that can | ||
330 | be used to locate files. This is done by requesting a cookie for each index in | ||
331 | the path to the file: | ||
332 | |||
333 | struct fscache_cookie * | ||
334 | fscache_acquire_cookie(struct fscache_cookie *parent, | ||
335 | const struct fscache_cookie_def *def, | ||
336 | void *netfs_data); | ||
337 | |||
338 | This function creates an index entry in the index represented by parent, | ||
339 | filling in the index entry by calling the operations pointed to by def. | ||
340 | |||
341 | Note that this function never returns an error - all errors are handled | ||
342 | internally. It may, however, return NULL to indicate no cookie. It is quite | ||
343 | acceptable to pass this token back to this function as the parent to another | ||
344 | acquisition (or even to the relinquish cookie, read page and write page | ||
345 | functions - see below). | ||
346 | |||
347 | Note also that no indices are actually created in a cache until a non-index | ||
348 | object needs to be created somewhere down the hierarchy. Furthermore, an index | ||
349 | may be created in several different caches independently at different times. | ||
350 | This is all handled transparently, and the netfs doesn't see any of it. | ||
351 | |||
352 | For example, with AFS, a cell would be added to the primary index. This index | ||
353 | entry would have a dependent inode containing a volume location index for the | ||
354 | volume mappings within this cell: | ||
355 | |||
356 | cell->cache = | ||
357 | fscache_acquire_cookie(afs_cache_netfs.primary_index, | ||
358 | &afs_cell_cache_index_def, | ||
359 | cell); | ||
360 | |||
361 | Then when a volume location was accessed, it would be entered into the cell's | ||
362 | index and an inode would be allocated that acts as a volume type and hash chain | ||
363 | combination: | ||
364 | |||
365 | vlocation->cache = | ||
366 | fscache_acquire_cookie(cell->cache, | ||
367 | &afs_vlocation_cache_index_def, | ||
368 | vlocation); | ||
369 | |||
370 | And then a particular flavour of volume (R/O for example) could be added to | ||
371 | that index, creating another index for vnodes (AFS inode equivalents): | ||
372 | |||
373 | volume->cache = | ||
374 | fscache_acquire_cookie(vlocation->cache, | ||
375 | &afs_volume_cache_index_def, | ||
376 | volume); | ||
377 | |||
378 | |||
379 | ====================== | ||
380 | DATA FILE REGISTRATION | ||
381 | ====================== | ||
382 | |||
383 | The fourth step is to request a data file be created in the cache. This is | ||
384 | almost identical to index cookie acquisition. The only difference is that the | ||
385 | type in the object definition should be something other than index type. | ||
386 | |||
387 | vnode->cache = | ||
388 | fscache_acquire_cookie(volume->cache, | ||
389 | &afs_vnode_cache_object_def, | ||
390 | vnode); | ||
391 | |||
392 | |||
393 | ================================= | ||
394 | MISCELLANEOUS OBJECT REGISTRATION | ||
395 | ================================= | ||
396 | |||
397 | An optional step is to request an object of miscellaneous type be created in | ||
398 | the cache. This is almost identical to index cookie acquisition. The only | ||
399 | difference is that the type in the object definition should be something other | ||
400 | than index type. Whilst the parent object could be an index, it's more likely | ||
401 | it would be some other type of object such as a data file. | ||
402 | |||
403 | xattr->cache = | ||
404 | fscache_acquire_cookie(vnode->cache, | ||
405 | &afs_xattr_cache_object_def, | ||
406 | xattr); | ||
407 | |||
408 | Miscellaneous objects might be used to store extended attributes or directory | ||
409 | entries for example. | ||
410 | |||
411 | |||
412 | ========================== | ||
413 | SETTING THE DATA FILE SIZE | ||
414 | ========================== | ||
415 | |||
416 | The fifth step is to set the physical attributes of the file, such as its size. | ||
417 | This doesn't automatically reserve any space in the cache, but permits the | ||
418 | cache to adjust its metadata for data tracking appropriately: | ||
419 | |||
420 | int fscache_attr_changed(struct fscache_cookie *cookie); | ||
421 | |||
422 | The cache will return -ENOBUFS if there is no backing cache or if there is no | ||
423 | space to allocate any extra metadata required in the cache. The attributes | ||
424 | will be accessed with the get_attr() cookie definition operation. | ||
425 | |||
426 | Note that attempts to read or write data pages in the cache over this size may | ||
427 | be rebuffed with -ENOBUFS. | ||
428 | |||
429 | This operation schedules an attribute adjustment to happen asynchronously at | ||
430 | some point in the future, and as such, it may happen after the function returns | ||
431 | to the caller. The attribute adjustment excludes read and write operations. | ||
432 | |||
433 | |||
434 | ===================== | ||
435 | PAGE READ/ALLOC/WRITE | ||
436 | ===================== | ||
437 | |||
438 | And the sixth step is to store and retrieve pages in the cache. There are | ||
439 | three functions that are used to do this. | ||
440 | |||
441 | Note: | ||
442 | |||
443 | (1) A page should not be re-read or re-allocated without uncaching it first. | ||
444 | |||
445 | (2) A read or allocated page must be uncached when the netfs page is released | ||
446 | from the pagecache. | ||
447 | |||
448 | (3) A page should only be written to the cache if previously read or allocated. | ||
449 | |||
450 | This permits the cache to maintain its page tracking in proper order. | ||
451 | |||
452 | |||
453 | PAGE READ | ||
454 | --------- | ||
455 | |||
456 | Firstly, the netfs should ask FS-Cache to examine the caches and read the | ||
457 | contents cached for a particular page of a particular file if present, or else | ||
458 | allocate space to store the contents if not: | ||
459 | |||
460 | typedef | ||
461 | void (*fscache_rw_complete_t)(struct page *page, | ||
462 | void *context, | ||
463 | int error); | ||
464 | |||
465 | int fscache_read_or_alloc_page(struct fscache_cookie *cookie, | ||
466 | struct page *page, | ||
467 | fscache_rw_complete_t end_io_func, | ||
468 | void *context, | ||
469 | gfp_t gfp); | ||
470 | |||
471 | The cookie argument must specify a cookie for an object that isn't an index, | ||
472 | the page specified will have the data loaded into it (and is also used to | ||
473 | specify the page number), and the gfp argument is used to control how any | ||
474 | memory allocations made are satisfied. | ||
475 | |||
476 | If the cookie indicates the inode is not cached: | ||
477 | |||
478 | (1) The function will return -ENOBUFS. | ||
479 | |||
480 | Else if there's a copy of the page resident in the cache: | ||
481 | |||
482 | (1) The mark_pages_cached() cookie operation will be called on that page. | ||
483 | |||
484 | (2) The function will submit a request to read the data from the cache's | ||
485 | backing device directly into the page specified. | ||
486 | |||
487 | (3) The function will return 0. | ||
488 | |||
489 | (4) When the read is complete, end_io_func() will be invoked with: | ||
490 | |||
491 | (*) The netfs data supplied when the cookie was created. | ||
492 | |||
493 | (*) The page descriptor. | ||
494 | |||
495 | (*) The context argument passed to the above function. This will be | ||
496 | maintained with the get_context/put_context functions mentioned above. | ||
497 | |||
498 | (*) An argument that's 0 on success or negative for an error code. | ||
499 | |||
500 | If an error occurs, it should be assumed that the page contains no usable | ||
501 | data. | ||
502 | |||
503 | end_io_func() will be called in process context if the read results in | ||
504 | an error, but it might be called in interrupt context if the read is | ||
505 | successful. | ||
506 | |||
507 | Otherwise, if there's not a copy available in cache, but the cache may be able | ||
508 | to store the page: | ||
509 | |||
510 | (1) The mark_pages_cached() cookie operation will be called on that page. | ||
511 | |||
512 | (2) A block may be reserved in the cache and attached to the object at the | ||
513 | appropriate place. | ||
514 | |||
515 | (3) The function will return -ENODATA. | ||
516 | |||
517 | This function may also return -ENOMEM or -EINTR, in which case it won't have | ||
518 | read any data from the cache. | ||
519 | |||
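The return values listed above could be handled along the following lines.
This is a minimal sketch rather than code from any netfs: the enum and the
classify_read_result() helper are hypothetical names introduced purely for
illustration, and only the error codes themselves come from the text above.

```c
#include <errno.h>

/* Hypothetical outcome labels; these are not part of the FS-Cache API. */
enum read_outcome {
	READ_DISPATCHED,	/* 0: read submitted, end_io_func() will be called */
	READ_FROM_SERVER,	/* -ENODATA: nothing cached, fetch from the server */
	READ_UNCACHED,		/* -ENOBUFS: the inode is not cached */
	READ_FAILED,		/* -ENOMEM, -EINTR or anything else */
};

/* Map the result of fscache_read_or_alloc_page() onto what to do next. */
enum read_outcome classify_read_result(int ret)
{
	switch (ret) {
	case 0:
		return READ_DISPATCHED;
	case -ENODATA:
		return READ_FROM_SERVER;
	case -ENOBUFS:
		return READ_UNCACHED;
	default:
		return READ_FAILED;
	}
}
```

In the -ENODATA case the netfs would normally download the page and then pass
it to fscache_write_page(); in every case other than 0, no data has been read
from the cache.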
520 | |||
521 | PAGE ALLOCATE | ||
522 | ------------- | ||
523 | |||
524 | Alternatively, if there's not expected to be any data in the cache for a page | ||
525 | because the file has been extended, a block can simply be allocated instead: | ||
526 | |||
527 | int fscache_alloc_page(struct fscache_cookie *cookie, | ||
528 | struct page *page, | ||
529 | gfp_t gfp); | ||
530 | |||
531 | This is similar to the fscache_read_or_alloc_page() function, except that it | ||
532 | never reads from the cache. It will return 0 if a block has been allocated, | ||
533 | rather than -ENODATA as the other would. One or the other must be performed | ||
534 | before writing to the cache. | ||
535 | |||
536 | The mark_pages_cached() cookie operation will be called on the page if | ||
537 | successful. | ||
538 | |||
539 | |||
540 | PAGE WRITE | ||
541 | ---------- | ||
542 | |||
543 | Secondly, if the netfs changes the contents of the page (either due to an | ||
544 | initial download or if a user performs a write), then the page should be | ||
545 | written back to the cache: | ||
546 | |||
547 | int fscache_write_page(struct fscache_cookie *cookie, | ||
548 | struct page *page, | ||
549 | gfp_t gfp); | ||
550 | |||
551 | The cookie argument must specify a data file cookie, the page specified should | ||
552 | contain the data to be written (and is also used to specify the page number), | ||
553 | and the gfp argument is used to control how any memory allocations made are | ||
554 | satisfied. | ||
555 | |||
556 | The page must have first been read or allocated successfully and must not have | ||
557 | been uncached before writing is performed. | ||
558 | |||
559 | If the cookie indicates the inode is not cached then: | ||
560 | |||
561 | (1) The function will return -ENOBUFS. | ||
562 | |||
563 | Else if space can be allocated in the cache to hold this page: | ||
564 | |||
565 | (1) PG_fscache_write will be set on the page. | ||
566 | |||
567 | (2) The function will submit a request to write the data to the cache's | ||
568 | backing device directly from the page specified. | ||
569 | |||
570 | (3) The function will return 0. | ||
571 | |||
572 | (4) When the write is complete PG_fscache_write is cleared on the page and | ||
573 | anyone waiting for that bit will be woken up. | ||
574 | |||
575 | Else if there's no space available in the cache, -ENOBUFS will be returned. It | ||
576 | is also possible for the PG_fscache_write bit to be cleared when no write took | ||
577 | place if unforeseen circumstances arose (such as a disk error). | ||
578 | |||
579 | Writing takes place asynchronously. | ||
580 | |||
581 | |||
582 | MULTIPLE PAGE READ | ||
583 | ------------------ | ||
584 | |||
585 | A facility is provided to read several pages at once, as requested by the | ||
586 | readpages() address space operation: | ||
587 | |||
588 | int fscache_read_or_alloc_pages(struct fscache_cookie *cookie, | ||
589 | struct address_space *mapping, | ||
590 | struct list_head *pages, | ||
591 | int *nr_pages, | ||
592 | fscache_rw_complete_t end_io_func, | ||
593 | void *context, | ||
594 | gfp_t gfp); | ||
595 | |||
596 | This works in a similar way to fscache_read_or_alloc_page(), except: | ||
597 | |||
598 | (1) Any page it can retrieve data for is removed from pages and nr_pages and | ||
599 | dispatched for reading to the disk. Reads of adjacent pages on disk may | ||
600 | be merged for greater efficiency. | ||
601 | |||
602 | (2) The mark_pages_cached() cookie operation will be called on several pages | ||
603 | at once if they're being read or allocated. | ||
604 | |||
605 | (3) If there was a general error, then that error will be returned. | ||
606 | |||
607 | Else if some pages couldn't be allocated or read, then -ENOBUFS will be | ||
608 | returned. | ||
609 | |||
610 | Else if some pages couldn't be read but were allocated, then -ENODATA will | ||
611 | be returned. | ||
612 | |||
613 | Otherwise, if all pages had reads dispatched, then 0 will be returned, the | ||
614 | list will be empty and *nr_pages will be 0. | ||
615 | |||
616 | (4) end_io_func will be called once for each page being read as the reads | ||
617 | complete. It will be called in process context if error != 0, but it may | ||
618 | be called in interrupt context if there is no error. | ||
619 | |||
620 | Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude | ||
621 | some of the pages being read and some being allocated. Those pages will have | ||
622 | been marked appropriately and will need uncaching. | ||
623 | |||
624 | |||
625 | ============== | ||
626 | PAGE UNCACHING | ||
627 | ============== | ||
628 | |||
629 | To uncache a page, this function should be called: | ||
630 | |||
631 | void fscache_uncache_page(struct fscache_cookie *cookie, | ||
632 | struct page *page); | ||
633 | |||
634 | This function permits the cache to release any in-memory representation it | ||
635 | might be holding for this netfs page. This function must be called once for | ||
636 | each page on which the read or write page functions above have been called to | ||
637 | make sure the cache's in-memory tracking information gets torn down. | ||
638 | |||
639 | Note that pages can't be explicitly deleted from a data file. The whole | ||
640 | data file must be retired (see the relinquish cookie function below). | ||
641 | |||
642 | Furthermore, note that this does not cancel the asynchronous read or write | ||
643 | operation started by the read/alloc and write functions, so the page | ||
644 | invalidation and release functions must use: | ||
645 | |||
646 | bool fscache_check_page_write(struct fscache_cookie *cookie, | ||
647 | struct page *page); | ||
648 | |||
649 | to see if a page is being written to the cache, and: | ||
650 | |||
651 | void fscache_wait_on_page_write(struct fscache_cookie *cookie, | ||
652 | struct page *page); | ||
653 | |||
654 | to wait for it to finish if it is. | ||
655 | |||
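The ordering described above (check for a write in progress, wait for it if
necessary, then uncache) can be modelled as follows. This is only a sketch
under stated assumptions: struct page_model and release_page_model() are
hypothetical stand-ins for the kernel types and for a netfs release function,
and the may_wait parameter is an assumption about how a caller might decide
whether sleeping is allowed.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a netfs page; not a kernel structure. */
struct page_model {
	bool cached;		/* models the cache having an interest in the page */
	bool write_pending;	/* models a write to the cache still in progress */
};

/* Returns true if the page was released, false if it must be kept for now. */
bool release_page_model(struct page_model *page, bool may_wait)
{
	if (!page->cached)
		return true;	/* the cache has no interest in this page */

	if (page->write_pending) {	/* cf. fscache_check_page_write() */
		if (!may_wait)
			return false;
		/* cf. fscache_wait_on_page_write(): modelled as completing */
		page->write_pending = false;
	}

	page->cached = false;	/* cf. fscache_uncache_page() */
	return true;
}
```

The point of the sketch is that uncaching alone does not cancel a write, so the
pending write is dealt with before the page is allowed to go away.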
656 | |||
657 | ========================== | ||
658 | INDEX AND DATA FILE UPDATE | ||
659 | ========================== | ||
660 | |||
661 | To request an update of the index data for an index or other object, the | ||
662 | following function should be called: | ||
663 | |||
664 | void fscache_update_cookie(struct fscache_cookie *cookie); | ||
665 | |||
666 | This function will refer back to the netfs_data pointer stored in the cookie by | ||
667 | the acquisition function to obtain the data to write into each revised index | ||
668 | entry. The update method in the parent index definition will be called to | ||
669 | transfer the data. | ||
670 | |||
671 | Note that partial updates may happen automatically at other times, such as when | ||
672 | data blocks are added to a data file object. | ||
673 | |||
674 | |||
675 | =============================== | ||
676 | MISCELLANEOUS COOKIE OPERATIONS | ||
677 | =============================== | ||
678 | |||
679 | There are a number of operations that can be used to control cookies: | ||
680 | |||
681 | (*) Cookie pinning: | ||
682 | |||
683 | int fscache_pin_cookie(struct fscache_cookie *cookie); | ||
684 | void fscache_unpin_cookie(struct fscache_cookie *cookie); | ||
685 | |||
686 | These operations permit data cookies to be pinned into the cache and to | ||
687 | have the pinning removed. They are not permitted on index cookies. | ||
688 | |||
689 | The pinning function will return 0 if successful, -ENOBUFS if the cookie | ||
690 | isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning, | ||
691 | -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or | ||
692 | -EIO if there's any other problem. | ||
693 | |||
694 | (*) Data space reservation: | ||
695 | |||
696 | int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size); | ||
697 | |||
698 | This permits a netfs to request cache space be reserved to store up to the | ||
699 | given amount of a file. It is permitted to ask for more than the current | ||
700 | size of the file to allow for future file expansion. | ||
701 | |||
702 | If size is given as zero then the reservation will be cancelled. | ||
703 | |||
704 | The function will return 0 if successful, -ENOBUFS if the cookie isn't | ||
705 | backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations, | ||
706 | -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or | ||
707 | -EIO if there's any other problem. | ||
708 | |||
709 | Note that this doesn't pin an object in a cache; it can still be culled to | ||
710 | make space if it's not in use. | ||
711 | |||
712 | |||
713 | ===================== | ||
714 | COOKIE UNREGISTRATION | ||
715 | ===================== | ||
716 | |||
717 | To get rid of a cookie, this function should be called. | ||
718 | |||
719 | void fscache_relinquish_cookie(struct fscache_cookie *cookie, | ||
720 | int retire); | ||
721 | |||
722 | If retire is non-zero, then the object will be marked for recycling, and all | ||
723 | copies of it will be removed from all active caches in which it is present. | ||
724 | Not only that but all child objects will also be retired. | ||
725 | |||
726 | If retire is zero, then the object may be available again when next the | ||
727 | acquisition function is called. Retirement here will overrule the pinning on a | ||
728 | cookie. | ||
729 | |||
730 | One very important note - relinquish must NOT be called for a cookie unless all | ||
731 | the cookies for "child" indices, objects and pages have been relinquished | ||
732 | first. | ||
733 | |||
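The ordering rule above amounts to a post-order walk of the cookie tree. The
following is a minimal model of that walk, assuming a hypothetical
struct cookie_model with child and sibling links; a real netfs keeps its
cookies in its own structures (cell, volume, vnode and so on) and would call
fscache_relinquish_cookie() where the model records an order number.

```c
#include <stddef.h>

/* Hypothetical model of a cookie hierarchy; not the real fscache_cookie. */
struct cookie_model {
	struct cookie_model *child;	/* first child, or NULL */
	struct cookie_model *sibling;	/* next sibling, or NULL */
	int relinquish_order;		/* 0 until relinquished */
};

/* Relinquish every child before the cookie itself. */
void relinquish_subtree(struct cookie_model *cookie, int *counter)
{
	struct cookie_model *c;

	for (c = cookie->child; c; c = c->sibling)
		relinquish_subtree(c, counter);

	/* Stands in for fscache_relinquish_cookie(cookie, retire). */
	cookie->relinquish_order = ++*counter;
}
```

With a volume cookie that has two vnode cookies below it, the vnodes are
relinquished first and the volume last.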
734 | |||
735 | ================================ | ||
736 | INDEX AND DATA FILE INVALIDATION | ||
737 | ================================ | ||
738 | |||
739 | There is no direct way to invalidate an index subtree or a data file. To do | ||
740 | this, the caller should relinquish and retire the cookie they have, and then | ||
741 | acquire a new one. | ||
742 | |||
743 | |||
744 | =========================== | ||
745 | FS-CACHE SPECIFIC PAGE FLAG | ||
746 | =========================== | ||
747 | |||
748 | FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is | ||
749 | given the alternative name PG_fscache. | ||
750 | |||
751 | PG_fscache is used to indicate that the page is known by the cache, and that | ||
752 | the cache must be informed if the page is going to go away. It's an indication | ||
753 | to the netfs that the cache has an interest in this page, where an interest may | ||
754 | be a pointer to it, resources allocated or reserved for it, or I/O in progress | ||
755 | upon it. | ||
756 | |||
757 | The netfs can use this information in methods such as releasepage() to | ||
758 | determine whether it needs to uncache a page or update it. | ||
759 | |||
760 | Furthermore, if this bit is set, releasepage() and invalidatepage() operations | ||
761 | will be called on a page to get rid of it, even if PG_private is not set. This | ||
762 | allows caching to be attempted on a page, with read_cache_pages() called after | ||
763 | fscache_read_or_alloc_pages(), as read_cache_pages() will try and release pages | ||
764 | it was given under certain circumstances. | ||
765 | |||
766 | This bit does not overlap with other flags such as PG_private. This means that | ||
767 | FS-Cache can be used with a filesystem that uses the block buffering code. | ||
768 | |||
769 | There are a number of operations defined on this flag: | ||
770 | |||
771 | int PageFsCache(struct page *page); | ||
772 | void SetPageFsCache(struct page *page); | ||
773 | void ClearPageFsCache(struct page *page); | ||
774 | int TestSetPageFsCache(struct page *page); | ||
775 | int TestClearPageFsCache(struct page *page); | ||
776 | |||
777 | These functions are bit test, bit set, bit clear, bit test and set and bit | ||
778 | test and clear operations on PG_fscache. | ||
diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.txt new file mode 100644 index 000000000000..e8b0a35d8fe5 --- /dev/null +++ b/Documentation/filesystems/caching/object.txt | |||
@@ -0,0 +1,313 @@ | |||
1 | ==================================================== | ||
2 | IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT | ||
3 | ==================================================== | ||
4 | |||
5 | By: David Howells <dhowells@redhat.com> | ||
6 | |||
7 | Contents: | ||
8 | |||
9 | (*) Representation | ||
10 | |||
11 | (*) Object management state machine. | ||
12 | |||
13 | - Provision of cpu time. | ||
14 | - Locking simplification. | ||
15 | |||
16 | (*) The set of states. | ||
17 | |||
18 | (*) The set of events. | ||
19 | |||
20 | |||
21 | ============== | ||
22 | REPRESENTATION | ||
23 | ============== | ||
24 | |||
25 | FS-Cache maintains an in-kernel representation of each object that a netfs is | ||
26 | currently interested in. Such objects are represented by the fscache_cookie | ||
27 | struct and are referred to as cookies. | ||
28 | |||
29 | FS-Cache also maintains a separate in-kernel representation of the objects that | ||
30 | a cache backend is currently actively caching. Such objects are represented by | ||
31 | the fscache_object struct. The cache backends allocate these upon request, and | ||
32 | are expected to embed them in their own representations. These are referred to | ||
33 | as objects. | ||
34 | |||
35 | There is a 1:N relationship between cookies and objects. A cookie may be | ||
36 | represented by multiple objects - an index may exist in more than one cache - | ||
37 | or even by no objects (it may not be cached). | ||
38 | |||
39 | Furthermore, both cookies and objects are hierarchical. The two hierarchies | ||
40 | correspond, but the cookies tree is a superset of the union of the object trees | ||
41 | of multiple caches: | ||
42 | |||
43 | NETFS INDEX TREE : CACHE 1 : CACHE 2 | ||
44 | : : | ||
45 | : +-----------+ : | ||
46 | +----------->| IObject | : | ||
47 | +-----------+ | : +-----------+ : | ||
48 | | ICookie |-------+ : | : | ||
49 | +-----------+ | : | : +-----------+ | ||
50 | | +------------------------------>| IObject | | ||
51 | | : | : +-----------+ | ||
52 | | : V : | | ||
53 | | : +-----------+ : | | ||
54 | V +----------->| IObject | : | | ||
55 | +-----------+ | : +-----------+ : | | ||
56 | | ICookie |-------+ : | : V | ||
57 | +-----------+ | : | : +-----------+ | ||
58 | | +------------------------------>| IObject | | ||
59 | +-----+-----+ : | : +-----------+ | ||
60 | | | : | : | | ||
61 | V | : V : | | ||
62 | +-----------+ | : +-----------+ : | | ||
63 | | ICookie |------------------------->| IObject | : | | ||
64 | +-----------+ | : +-----------+ : | | ||
65 | | V : | : V | ||
66 | | +-----------+ : | : +-----------+ | ||
67 | | | ICookie |-------------------------------->| IObject | | ||
68 | | +-----------+ : | : +-----------+ | ||
69 | V | : V : | | ||
70 | +-----------+ | : +-----------+ : | | ||
71 | | DCookie |------------------------->| DObject | : | | ||
72 | +-----------+ | : +-----------+ : | | ||
73 | | : : | | ||
74 | +-------+-------+ : : | | ||
75 | | | : : | | ||
76 | V V : : V | ||
77 | +-----------+ +-----------+ : : +-----------+ | ||
78 | | DCookie | | DCookie |------------------------>| DObject | | ||
79 | +-----------+ +-----------+ : : +-----------+ | ||
80 | : : | ||
81 | |||
82 | In the above illustration, ICookie and IObject represent indices and DCookie | ||
83 | and DObject represent data storage objects. Indices may have representation in | ||
84 | multiple caches, but currently, non-index objects may not. Objects of any type | ||
85 | may also be entirely unrepresented. | ||
86 | |||
87 | As far as the netfs API goes, the netfs is only actually permitted to see | ||
88 | pointers to the cookies. The cookies themselves and any objects attached to | ||
89 | those cookies are hidden from it. | ||
90 | |||
91 | |||
92 | =============================== | ||
93 | OBJECT MANAGEMENT STATE MACHINE | ||
94 | =============================== | ||
95 | |||
96 | Within FS-Cache, each active object is managed by its own individual state | ||
97 | machine. The state for an object is kept in the fscache_object struct, in | ||
98 | object->state. A cookie may point to a set of objects that are in different | ||
99 | states. | ||
100 | |||
101 | Each state has an action associated with it that is invoked when the machine | ||
102 | wakes up in that state. There are four logical sets of states: | ||
103 | |||
104 | (1) Preparation: states that wait for the parent objects to become ready. The | ||
105 | representations are hierarchical, and it is expected that an object must | ||
106 | be created or accessed with respect to its parent object. | ||
107 | |||
108 | (2) Initialisation: states that perform lookups in the cache and validate | ||
109 | what's found and that create on disk any missing metadata. | ||
110 | |||
111 | (3) Normal running: states that allow netfs operations on objects to proceed | ||
112 | and that update the state of objects. | ||
113 | |||
114 | (4) Termination: states that detach objects from their netfs cookies, that | ||
115 | delete objects from disk, that handle disk and system errors and that free | ||
116 | up in-memory resources. | ||
117 | |||
118 | |||
119 | In most cases, transitioning between states is in response to signalled events. | ||
120 | When a state has finished processing, it will usually set the mask of events in | ||
121 | which it is interested (object->event_mask) and relinquish the worker thread. | ||
122 | Then when an event is raised (by calling fscache_raise_event()), if the event | ||
123 | is not masked, the object will be queued for processing (by calling | ||
124 | fscache_enqueue_object()). | ||
125 | |||
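The interaction between the event mask and event raising can be modelled as
below. This is a simplified sketch, not the FS-Cache implementation:
struct object_model and raise_event_model() are hypothetical, the queued
counter stands in for a call to fscache_enqueue_object(), and the real code
may differ in detail (for instance in how it treats events that are already
pending).

```c
#include <stdbool.h>

/* Hypothetical model of the event fields of an object. */
struct object_model {
	unsigned long event_mask;	/* events the current state is interested in */
	unsigned long events;		/* events raised so far */
	int queued;			/* times the object was queued for processing */
};

/* Record an event; queue the object only if the event is not masked. */
bool raise_event_model(struct object_model *obj, unsigned int event)
{
	unsigned long bit = 1UL << event;

	obj->events |= bit;
	if (!(obj->event_mask & bit))
		return false;	/* current state is not interested */

	obj->queued++;		/* stands in for fscache_enqueue_object() */
	return true;
}
```

An event that the current state has not asked for is still recorded, so a
later state that does include it in its mask can act on it.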
126 | |||
127 | PROVISION OF CPU TIME | ||
128 | --------------------- | ||
129 | |||
130 | The work to be done by the various states is given CPU time by the threads of | ||
131 | the slow work facility (see Documentation/slow-work.txt). This is used in | ||
132 | preference to the workqueue facility because: | ||
133 | |||
134 | (1) Threads may be completely occupied for very long periods of time by a | ||
135 | particular work item. These state actions may be doing sequences of | ||
136 | synchronous, journalled disk accesses (lookup, mkdir, create, setxattr, | ||
137 | getxattr, truncate, unlink, rmdir, rename). | ||
138 | |||
139 | (2) Threads may do little actual work, but may rather spend a lot of time | ||
140 | sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded | ||
141 | workqueues don't necessarily have the right numbers of threads. | ||
142 | |||
143 | |||
144 | LOCKING SIMPLIFICATION | ||
145 | ---------------------- | ||
146 | |||
147 | Because only one worker thread may be operating on any particular object's | ||
148 | state machine at once, this simplifies the locking, particularly with respect | ||
149 | to disconnecting the netfs's representation of a cache object (fscache_cookie) | ||
150 | from the cache backend's representation (fscache_object) - which may be | ||
151 | requested from either end. | ||
152 | |||
153 | |||
154 | ================= | ||
155 | THE SET OF STATES | ||
156 | ================= | ||
157 | |||
158 | The object state machine has a set of states that it can be in. There are | ||
159 | preparation states in which the object sets itself up and waits for its parent | ||
160 | object to transit to a state that allows access to its children: | ||
161 | |||
162 | (1) State FSCACHE_OBJECT_INIT. | ||
163 | |||
164 | Initialise the object and wait for the parent object to become active. In | ||
165 | the cache, it is expected that it will not be possible to look an object | ||
166 | up from the parent object, until that parent object itself has been looked | ||
167 | up. | ||
168 | |||
169 | There are initialisation states in which the object sets itself up and accesses | ||
170 | disk for the object metadata: | ||
171 | |||
172 | (2) State FSCACHE_OBJECT_LOOKING_UP. | ||
173 | |||
174 | Look up the object on disk, using the parent as a starting point. | ||
175 | FS-Cache expects the cache backend to probe the cache to see whether this | ||
176 | object is represented there, and if it is, to see if it's valid (coherency | ||
177 | management). | ||
178 | |||
179 | The cache should call fscache_object_lookup_negative() to indicate lookup | ||
180 | failure for whatever reason, and should call fscache_obtained_object() to | ||
181 | indicate success. | ||
182 | |||
183 | At the completion of lookup, FS-Cache will let the netfs go ahead with | ||
184 | read operations, no matter whether the file is yet cached. If not yet | ||
185 | cached, read operations will be immediately rejected with ENODATA until | ||
186 | the first known page is uncached - as up to that point there can be no | ||
187 | data to be read out of the cache for that file that isn't currently also | ||
188 | held in the pagecache. | ||
189 | |||
190 | (3) State FSCACHE_OBJECT_CREATING. | ||
191 | |||
192 | Create an object on disk, using the parent as a starting point. This | ||
193 | happens if the lookup failed to find the object, or if the object's | ||
194 | coherency data indicated what's on disk is out of date. In this state, | ||
195 | FS-Cache expects the cache to create the object. | ||
196 | |||
197 | The cache should call fscache_obtained_object() if creation completes | ||
198 | successfully, fscache_object_lookup_negative() otherwise. | ||
199 | |||
200 | At the completion of creation, FS-Cache will start processing write | ||
201 | operations the netfs has queued for an object. If creation failed, the | ||
202 | write ops will be transparently discarded, and nothing recorded in the | ||
203 | cache. | ||
204 | |||
205 | There are some normal running states in which the object spends its time | ||
206 | servicing netfs requests: | ||
207 | |||
208 | (4) State FSCACHE_OBJECT_AVAILABLE. | ||
209 | |||
210 | A transient state in which pending operations are started, child objects | ||
211 | are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary | ||
212 | lookup data is freed. | ||
213 | |||
214 | (5) State FSCACHE_OBJECT_ACTIVE. | ||
215 | |||
216 | The normal running state. In this state, requests the netfs makes will be | ||
217 | passed on to the cache. | ||
218 | |||
219 | (6) State FSCACHE_OBJECT_UPDATING. | ||
220 | |||
221 | The state machine comes here to update the object in the cache from the | ||
222 | netfs's records. This involves updating the auxiliary data that is used | ||
223 | to maintain coherency. | ||
224 | |||
225 | And there are terminal states in which an object cleans itself up, deallocates | ||
226 | memory and potentially deletes stuff from disk: | ||
227 | |||
228 | (7) State FSCACHE_OBJECT_LC_DYING. | ||
229 | |||
230 | The object comes here if it is dying because of a lookup or creation | ||
231 | error. This would be due to a disk error or system error of some sort. | ||
232 | Temporary data is cleaned up, and the parent is released. | ||
233 | |||
234 | (8) State FSCACHE_OBJECT_DYING. | ||
235 | |||
236 | The object comes here if it is dying due to an error, because its parent | ||
237 | cookie has been relinquished by the netfs or because the cache is being | ||
238 | withdrawn. | ||
239 | |||
240 | Any child objects waiting on this one are given CPU time so that they too | ||
241 | can destroy themselves. This object waits for all its children to go away | ||
242 | before advancing to the next state. | ||
243 | |||
244 | (9) State FSCACHE_OBJECT_ABORT_INIT. | ||
245 | |||
246 | The object comes to this state if it was waiting on its parent in | ||
247 | FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself | ||
248 | so that the parent may proceed from the FSCACHE_OBJECT_DYING state. | ||
249 | |||
250 | (10) State FSCACHE_OBJECT_RELEASING. | ||
251 | (11) State FSCACHE_OBJECT_RECYCLING. | ||
252 | |||
253 | The object comes to one of these two states when dying once it is rid of | ||
254 | all its children, if it is dying because the netfs relinquished its | ||
255 | cookie. In the first state, the cached data is expected to persist, and | ||
256 | in the second it will be deleted. | ||
257 | |||
258 | (12) State FSCACHE_OBJECT_WITHDRAWING. | ||
259 | |||
260 | The object transits to this state if the cache decides it wants to | ||
261 | withdraw the object from service, perhaps to make space, but also due to | ||
262 | error or just because the whole cache is being withdrawn. | ||
263 | |||
264 | (13) State FSCACHE_OBJECT_DEAD. | ||
265 | |||
266 | The object transits to this state when the in-memory object record is | ||
267 | ready to be deleted. The object processor shouldn't ever see an object in | ||
268 | this state. | ||
269 | |||
270 | |||
271 | THE SET OF EVENTS | ||
272 | ----------------- | ||
273 | |||
274 | There are a number of events that can be raised to an object state machine: | ||
275 | |||
276 | (*) FSCACHE_OBJECT_EV_UPDATE | ||
277 | |||
278 | The netfs requested that an object be updated. The state machine will ask | ||
279 | the cache backend to update the object, and the cache backend will ask the | ||
280 | netfs for details of the change through its cookie definition ops. | ||
281 | |||
282 | (*) FSCACHE_OBJECT_EV_CLEARED | ||
283 | |||
284 | This is signalled in two circumstances: | ||
285 | |||
286 | (a) when an object's last child object is dropped and | ||
287 | |||
288 | (b) when the last operation outstanding on an object is completed. | ||
289 | |||
290 | This is used to proceed from the dying state. | ||
291 | |||
292 | (*) FSCACHE_OBJECT_EV_ERROR | ||
293 | |||
294 | This is signalled when an I/O error occurs during the processing of some | ||
295 | object. | ||
296 | |||
297 | (*) FSCACHE_OBJECT_EV_RELEASE | ||
298 | (*) FSCACHE_OBJECT_EV_RETIRE | ||
299 | |||
300 | These are signalled when the netfs relinquishes a cookie it was using. | ||
301 | The event selected depends on whether the netfs asks for the backing | ||
302 | object to be retired (deleted) or retained. | ||
303 | |||
304 | (*) FSCACHE_OBJECT_EV_WITHDRAW | ||
305 | |||
306 | This is signalled when the cache backend wants to withdraw an object. | ||
307 | This means that the object will have to be detached from the netfs's | ||
308 | cookie. | ||
309 | |||
310 | Because the withdrawing and releasing/retiring events are all handled by the object | ||
311 | state machine, it doesn't matter if there's a collision with both ends trying | ||
312 | to sever the connection at the same time. The state machine can just pick | ||
313 | which one it wants to honour, and that effects the other. | ||
diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt new file mode 100644 index 000000000000..b6b070c57cbf --- /dev/null +++ b/Documentation/filesystems/caching/operations.txt | |||
@@ -0,0 +1,213 @@ | |||
1 | ================================ | ||
2 | ASYNCHRONOUS OPERATIONS HANDLING | ||
3 | ================================ | ||
4 | |||
5 | By: David Howells <dhowells@redhat.com> | ||
6 | |||
7 | Contents: | ||
8 | |||
9 | (*) Overview. | ||
10 | |||
11 | (*) Operation record initialisation. | ||
12 | |||
13 | (*) Parameters. | ||
14 | |||
15 | (*) Procedure. | ||
16 | |||
17 | (*) Asynchronous callback. | ||
18 | |||
19 | |||
20 | ======== | ||
21 | OVERVIEW | ||
22 | ======== | ||
23 | |||
24 | FS-Cache has an asynchronous operations handling facility that it uses for its | ||
25 | data storage and retrieval routines. Its operations are represented by | ||
26 | fscache_operation structs, though these are usually embedded into some other | ||
27 | structure. | ||
28 | |||
29 | This facility is available to and expected to be used by the cache backends, | ||
30 | and FS-Cache will create operations and pass them off to the appropriate cache | ||
31 | backend for completion. | ||
32 | |||
33 | To make use of this facility, <linux/fscache-cache.h> should be #included. | ||
34 | |||
35 | |||
36 | =============================== | ||
37 | OPERATION RECORD INITIALISATION | ||
38 | =============================== | ||
39 | |||
40 | An operation is recorded in an fscache_operation struct: | ||
41 | |||
42 | struct fscache_operation { | ||
43 | union { | ||
44 | struct work_struct fast_work; | ||
45 | struct slow_work slow_work; | ||
46 | }; | ||
47 | unsigned long flags; | ||
48 | fscache_operation_processor_t processor; | ||
49 | ... | ||
50 | }; | ||
51 | |||
52 | Someone wanting to issue an operation should allocate something with this | ||
53 | struct embedded in it. They should initialise it by calling: | ||
54 | |||
55 | void fscache_operation_init(struct fscache_operation *op, | ||
56 | fscache_operation_release_t release); | ||
57 | |||
58 | with the operation to be initialised and the release function to use. | ||
59 | |||
60 | The op->flags parameter should be set to indicate the CPU time provision and | ||
61 | the exclusivity (see the Parameters section). | ||
62 | |||
63 | The op->fast_work, op->slow_work and op->processor members should be set as | ||
64 | appropriate for the CPU time provision (see the Parameters section). | ||
65 | |||
66 | FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the | ||
67 | operation and waited for afterwards. | ||
68 | |||
69 | |||
70 | ========== | ||
71 | PARAMETERS | ||
72 | ========== | ||
73 | |||
74 | There are a number of parameters that can be set in the operation record's flag | ||
75 | parameter. There are three options for the provision of CPU time in these | ||
76 | operations: | ||
77 | |||
78 | (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread | ||
79 | may decide it wants to handle an operation itself without deferring it to | ||
80 | another thread. | ||
81 | |||
82 | This is, for example, used in read operations for calling readpages() on | ||
83 | the backing filesystem in CacheFiles. Although readpages() does an | ||
84 | asynchronous data fetch, the determination of whether pages exist is done | ||
85 | synchronously - and the netfs does not proceed until this has been | ||
86 | determined. | ||
87 | |||
88 | If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags | ||
89 | before submitting the operation, and the operating thread must wait for it | ||
90 | to be cleared before proceeding: | ||
91 | |||
92 | wait_on_bit(&op->flags, FSCACHE_OP_WAITING, | ||
93 | fscache_wait_bit, TASK_UNINTERRUPTIBLE); | ||
94 | |||
95 | |||
96 | (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it | ||
97 | will be given to keventd to process. Such an operation is not permitted | ||
98 | to sleep on I/O. | ||
99 | |||
100 | This is, for example, used by CacheFiles to copy data from a backing fs | ||
101 | page to a netfs page after the backing fs has read the page in. | ||
102 | |||
103 | If this option is used, op->fast_work and op->processor must be | ||
104 | initialised before submitting the operation: | ||
105 | |||
106 | INIT_WORK(&op->fast_work, do_some_work); | ||
107 | |||
108 | |||
109 | (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it | ||
110 | will be given to the slow work facility to process. Such an operation is | ||
111 | permitted to sleep on I/O. | ||
112 | |||
113 | This is, for example, used by FS-Cache to handle background writes of | ||
114 | pages that have just been fetched from a remote server. | ||
115 | |||
116 | If this option is used, op->slow_work and op->processor must be | ||
117 | initialised before submitting the operation: | ||
118 | |||
119 | fscache_operation_init_slow(op, processor) | ||
120 | |||
121 | |||
122 | Furthermore, operations may be one of two types: | ||
123 | |||
124 | (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in | ||
125 | conjunction with any other operation on the object being operated upon. | ||
126 | |||
127 | An example of this is the attribute change operation, in which the file | ||
128 | being written to may need truncation. | ||
129 | |||
130 | (2) Shareable. Operations of this type may be running simultaneously. It's | ||
131 | up to the operation implementation to prevent interference between other | ||
132 | operations running at the same time. | ||
133 | |||
134 | |||
135 | ========= | ||
136 | PROCEDURE | ||
137 | ========= | ||
138 | |||
139 | Operations are used through the following procedure: | ||
140 | |||
141 | (1) The submitting thread must allocate the operation and initialise it | ||
142 | itself. Normally this would be part of a more specific structure with the | ||
143 | generic op embedded within. | ||
144 | |||
145 | (2) The submitting thread must then submit the operation for processing using | ||
146 | one of the following two functions: | ||
147 | |||
148 | int fscache_submit_op(struct fscache_object *object, | ||
149 | struct fscache_operation *op); | ||
150 | |||
151 | int fscache_submit_exclusive_op(struct fscache_object *object, | ||
152 | struct fscache_operation *op); | ||
153 | |||
154 | The first function should be used to submit non-exclusive ops and the | ||
155 | second to submit exclusive ones. The caller must still set the | ||
156 | FSCACHE_OP_EXCLUSIVE flag. | ||
157 | |||
158 | If successful, both functions will assign the operation to the specified | ||
159 | object and return 0. -ENOBUFS will be returned if the object specified is | ||
160 | permanently unavailable. | ||
161 | |||
162 | The operation manager will defer operations on an object that is still | ||
163 | undergoing lookup or creation. The operation will also be deferred if an | ||
164 | operation of conflicting exclusivity is in progress on the object. | ||
165 | |||
166 | If the operation is asynchronous, the manager will retain a reference to | ||
167 | it, so the caller should put their reference to it by passing it to: | ||
168 | |||
169 | void fscache_put_operation(struct fscache_operation *op); | ||
170 | |||
171 | (3) If the submitting thread wants to do the work itself, and has marked the | ||
172 | operation with FSCACHE_OP_MYTHREAD, then it should monitor | ||
173 | FSCACHE_OP_WAITING as described above and check the state of the object if | ||
174 | necessary (the object might have died whilst the thread was waiting). | ||
175 | |||
176 | When it has finished doing its processing, it should call | ||
177 | fscache_put_operation() on it. | ||
178 | |||
179 | (4) The operation holds an effective lock upon the object, preventing other | ||
180 | exclusive ops conflicting until it is released. The operation can be | ||
181 | enqueued for further immediate asynchronous processing by adjusting the | ||
182 | CPU time provisioning option if necessary, eg: | ||
183 | |||
184 | op->flags &= ~FSCACHE_OP_TYPE; | ||
185 | op->flags |= FSCACHE_OP_FAST; | ||
186 | |||
187 | and calling: | ||
188 | |||
189 | void fscache_enqueue_operation(struct fscache_operation *op); | ||
190 | |||
191 | This can be used to allow other things to have use of the worker thread | ||
192 | pools. | ||
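The flag adjustment in step (4) can be sketched in user space as follows. This is only an illustration: the DEMO_* constants and demo_set_op_type() are hypothetical names with made-up values, not the definitions from <linux/fscache-cache.h>. The point it shows is that the provision type is replaced by first clearing the type mask and then OR-ing in the new type, so that flags outside the mask survive.

```c
#include <assert.h>

/*
 * Hypothetical values for illustration only; the real definitions live in
 * <linux/fscache-cache.h> and may differ.
 */
#define DEMO_OP_TYPE     0x000fUL  /* mask covering the CPU time provision */
#define DEMO_OP_FAST     0x0001UL
#define DEMO_OP_SLOW     0x0002UL
#define DEMO_OP_MYTHREAD 0x0003UL
#define DEMO_OP_EXCL     0x0020UL  /* an unrelated flag outside the mask */

/* Replace the provision type in flags, leaving every other bit untouched. */
static unsigned long demo_set_op_type(unsigned long flags, unsigned long type)
{
	assert((type & ~DEMO_OP_TYPE) == 0);	/* type must fit in the mask */
	flags &= ~DEMO_OP_TYPE;			/* clear the old provision type */
	flags |= type;				/* set the new one */
	return flags;
}
```

Note that the second step ORs in the type value itself; OR-ing in its complement would set nearly every bit in the flags word.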
193 | |||
194 | |||
195 | ===================== | ||
196 | ASYNCHRONOUS CALLBACK | ||
197 | ===================== | ||
198 | |||
199 | When used in asynchronous mode, the worker thread pool will invoke the | ||
200 | processor method with a pointer to the operation. This should then get at the | ||
201 | container struct by using container_of(): | ||
202 | |||
203 | static void fscache_write_op(struct fscache_operation *_op) | ||
204 | { | ||
205 | struct fscache_storage *op = | ||
206 | container_of(_op, struct fscache_storage, op); | ||
207 | ... | ||
208 | } | ||
209 | |||
210 | The caller holds a reference on the operation, and will invoke | ||
211 | fscache_put_operation() when the processor function returns. The processor | ||
212 | function is at liberty to call fscache_enqueue_operation() or to take extra | ||
213 | references. | ||
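The embedding pattern used by the asynchronous callback can be tried outside the kernel with a minimal sketch such as the one below. All demo_* names are hypothetical stand-ins (demo_container_of() imitates the kernel's container_of() using offsetof()), and the structures are simplified; it is meant only to show how a processor function that is handed the generic operation gets back to the structure that contains it.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified generic operation record (hypothetical). */
struct demo_operation {
	unsigned long flags;
	void (*processor)(struct demo_operation *op);
};

/* A more specific structure with the generic op embedded within. */
struct demo_storage {
	int pages_written;
	struct demo_operation op;
};

/* Processor: receives the generic op and recovers its container. */
static void demo_write_op(struct demo_operation *_op)
{
	struct demo_storage *op =
		demo_container_of(_op, struct demo_storage, op);

	op->pages_written++;
}

/* Stand-in for the worker thread pool invoking the processor once. */
static int demo_run(void)
{
	struct demo_storage s = { .pages_written = 0 };

	s.op.processor = demo_write_op;
	s.op.processor(&s.op);
	return s.pages_written;
}
```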
diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt new file mode 100644 index 000000000000..0ced74c2f73c --- /dev/null +++ b/Documentation/filesystems/exofs.txt | |||
@@ -0,0 +1,176 @@ | |||
1 | =============================================================================== | ||
2 | WHAT IS EXOFS? | ||
3 | =============================================================================== | ||
4 | |||
5 | exofs is a file system that uses an OSD and exports the API of a normal Linux | ||
6 | file system. Users access exofs like any other local file system, and exofs | ||
7 | will in turn issue commands to the local OSD initiator. | ||
8 | |||
9 | OSD is a new T10 command set that views storage devices not as a large/flat | ||
10 | array of sectors but as a container of objects, each having a length, quota, | ||
11 | time attributes and more. Each object is addressed by a 64bit ID, and is | ||
12 | contained in a 64bit ID partition. Each object has associated attributes | ||
13 | attached to it, which are an integral part of the object and provide metadata about | ||
14 | the object. The standard defines some common obligatory attributes, but user | ||
15 | attributes can be added as needed. | ||
16 | |||
17 | =============================================================================== | ||
18 | ENVIRONMENT | ||
19 | =============================================================================== | ||
20 | |||
21 | To use this file system, you need to have an object store to run it on. You | ||
22 | may download a target from: | ||
23 | http://open-osd.org | ||
24 | |||
25 | See Documentation/scsi/osd.txt for how to set up a working osd environment. | ||
26 | |||
27 | =============================================================================== | ||
28 | USAGE | ||
29 | =============================================================================== | ||
30 | |||
31 | 1. Download and compile exofs and open-osd initiator: | ||
32 | You need an external Kernel source tree or kernel headers from your | ||
33 | distribution. (anything based on 2.6.26 or later). | ||
34 | |||
35 | a. download open-osd including exofs source using: | ||
36 | [parent-directory]$ git clone git://git.open-osd.org/open-osd.git | ||
37 | |||
38 | b. Build the library module like this: | ||
39 | [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd | ||
40 | |||
41 | This will build both the open-osd initiator as well as the exofs kernel | ||
42 | module. Use whatever parameters you compiled your Kernel with and | ||
43 | $(KER_DIR) above pointing to the Kernel you compile against. See the file | ||
44 | open-osd/top-level-Makefile for an example. | ||
45 | |||
46 | 2. Get the OSD initiator and target set up properly, and login to the target. | ||
47 | See Documentation/scsi/osd.txt for further instructions. Also see ./do-osd | ||
48 | for an example script that does all these steps. | ||
49 | |||
50 | 3. Insmod the exofs.ko module: | ||
51 | [exofs]$ insmod exofs.ko | ||
52 | |||
53 | 4. Make sure the directory where you want to mount exists. If not, create it. | ||
54 | (For example, mkdir /mnt/exofs) | ||
55 | |||
56 | 5. At first run you will need to invoke the mkfs.exofs application | ||
57 | |||
58 | As an example, this will create the file system on: | ||
59 | /dev/osd0 partition ID 65536 | ||
60 | |||
61 | mkfs.exofs --pid=65536 --format /dev/osd0 | ||
62 | |||
63 | The --format is optional. If not specified, no OSD_FORMAT will be | ||
64 | performed and a clean file system will be created in the specified pid, | ||
65 | in the available space of the target. (Use --format=size_in_meg to limit | ||
66 | the total LUN space available) | ||
67 | |||
68 | If the pid already exists, it will be deleted and a new one will be created in its | ||
69 | place. Be careful. | ||
70 | |||
71 | An exofs lives inside a single OSD partition. You can create multiple exofs | ||
72 | filesystems on the same device using multiple pids. | ||
73 | |||
74 | (run mkfs.exofs without any parameters for usage help message) | ||
75 | |||
76 | 6. Mount the file system. | ||
77 | |||
78 | For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs: | ||
79 | |||
80 | mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/ | ||
81 | |||
82 | 7. For reference (See do-exofs example script): | ||
83 | do-exofs start - an example of how to perform the above steps. | ||
84 | do-exofs stop - an example of how to unmount the file system. | ||
85 | do-exofs format - an example of how to format and mkfs a new exofs. | ||
86 | |||
87 | 8. Extra compilation flags (uncomment in fs/exofs/Kbuild): | ||
88 | CONFIG_EXOFS_DEBUG - for debug messages and extra checks. | ||
89 | |||
90 | =============================================================================== | ||
91 | exofs mount options | ||
92 | =============================================================================== | ||
93 | Similar to any mount command: | ||
94 | mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory | ||
95 | |||
96 | Where: | ||
97 | -t exofs: specifies the exofs file system | ||
98 | |||
99 | /dev/osdX: X is a decimal number. /dev/osdX was created after a successful | ||
100 | login into an OSD target. | ||
101 | |||
102 | mount_exofs_directory: The directory to mount the file system on | ||
103 | |||
104 | exofs specific options: Options are separated by commas (,) | ||
105 | pid=<integer> - The partition number to mount/create as | ||
106 | container of the filesystem. | ||
107 | This option is mandatory | ||
108 | to=<integer> - Timeout in ticks for a single command | ||
109 | default is (60 * HZ) [for debugging only] | ||
110 | |||
111 | =============================================================================== | ||
112 | DESIGN | ||
113 | =============================================================================== | ||
114 | |||
115 | * The file system control block (AKA on-disk superblock) resides in an object | ||
116 | with a special ID (defined in common.h). | ||
117 | Information included in the file system control block is used to fill the | ||
118 | in-memory superblock structure at mount time. This object is created before | ||
119 | the file system is used by mkexofs.c. It contains information such as: | ||
120 | - The file system's magic number | ||
121 | - The next inode number to be allocated | ||
122 | |||
123 | * Each file resides in its own object and contains the data (and it will be | ||
124 | possible to extend the file over multiple objects, though this has not been | ||
125 | implemented yet). | ||
126 | |||
127 | * A directory is treated as a file, and essentially contains a list of <file | ||
128 | name, inode #> pairs for files that are found in that directory. The object | ||
129 | IDs correspond to the files' inode numbers and will be allocated according to | ||
130 | a bitmap (stored in a separate object). Now they are allocated using a | ||
131 | counter. | ||
132 | |||
133 | * Each file's control block (AKA on-disk inode) is stored in its object's | ||
134 | attributes. This applies to both regular files and other types (directories, | ||
135 | device files, symlinks, etc.). | ||
136 | |||
137 | * Credentials are generated per object (inode and superblock) when it is | ||
138 | created in memory (read off disk or created). The credential works for all | ||
139 | operations and is used as long as the object remains in memory. | ||
140 | |||
141 | * Async OSD operations are used whenever possible, but the target may execute | ||
142 | them out of order. The operations that concern us are create, delete, | ||
143 | readpage, writepage, update_inode, and truncate. The following pairs of | ||
144 | operations should execute in the order written, and we need to prevent them | ||
145 | from executing in reverse order: | ||
146 | - The following are handled with the OBJ_CREATED and OBJ_2BCREATED | ||
147 | flags. OBJ_CREATED is set when we know the object exists on the OSD - | ||
148 | in create's callback function, and when we successfully do a read_inode. | ||
149 | OBJ_2BCREATED is set in the beginning of the create function, so we | ||
150 | know that we should wait. | ||
151 | - create/delete: delete should wait until the object is created | ||
152 | on the OSD. | ||
153 | - create/readpage: readpage should be able to return a page | ||
154 | full of zeroes in this case. If there was a write already | ||
155 | en-route (i.e. create, writepage, readpage) then the page | ||
156 | would be locked, and so it would really be the same as | ||
157 | create/writepage. | ||
158 | - create/writepage: if writepage is called for a sync write, it | ||
159 | should wait until the object is created on the OSD. | ||
160 | Otherwise, it should just return. | ||
161 | - create/truncate: truncate should wait until the object is | ||
162 | created on the OSD. | ||
163 | - create/update_inode: update_inode should wait until the | ||
164 | object is created on the OSD. | ||
165 | - Handled by VFS locks: | ||
166 | - readpage/delete: shouldn't happen because of page lock. | ||
167 | - writepage/delete: shouldn't happen because of page lock. | ||
168 | - readpage/writepage: shouldn't happen because of page lock. | ||
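The "wait until the object is created" rule described for the create/delete, create/truncate and create/update_inode pairs can be sketched as a simple flag test. This is an assumption-laden illustration, not the exofs implementation: the DEMO_* bit values and demo_must_wait_for_create() are hypothetical, and real code would sleep until the create callback sets the created flag rather than just report that a wait is needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values standing in for OBJ_2BCREATED and OBJ_CREATED. */
#define DEMO_OBJ_2BCREATED 0x1u	/* set at the start of the create function */
#define DEMO_OBJ_CREATED   0x2u	/* set once the object exists on the OSD */

/*
 * A dependent operation has to wait only while a create has been issued
 * but the object is not yet known to exist on the OSD.
 */
static bool demo_must_wait_for_create(unsigned int flags)
{
	return (flags & DEMO_OBJ_2BCREATED) && !(flags & DEMO_OBJ_CREATED);
}
```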
169 | |||
170 | =============================================================================== | ||
171 | LICENSE/COPYRIGHT | ||
172 | =============================================================================== | ||
173 | The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel | ||
174 | version 2.6.10). All files include the original copyrights, and the license | ||
175 | is GPL version 2 (only version 2, as is true for the Linux kernel). The | ||
176 | Linux kernel can be downloaded from www.kernel.org. | ||
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index e5f3833a6ef8..570f9bd9be2b 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt | |||
@@ -14,6 +14,11 @@ Options | |||
14 | When mounting an ext3 filesystem, the following option are accepted: | 14 | When mounting an ext3 filesystem, the following option are accepted: |
15 | (*) == default | 15 | (*) == default |
16 | 16 | ||
17 | ro Mount filesystem read only. Note that ext3 will replay | ||
18 | the journal (and thus write to the partition) even when | ||
19 | mounted "read only". Mount options "ro,noload" can be | ||
20 | used to prevent writes to the filesystem. | ||
21 | |||
17 | journal=update Update the ext3 file system's journal to the current | 22 | journal=update Update the ext3 file system's journal to the current |
18 | format. | 23 | format. |
19 | 24 | ||
@@ -27,7 +32,9 @@ journal_dev=devnum When the external journal device's major/minor numbers | |||
27 | identified through its new major/minor numbers encoded | 32 | identified through its new major/minor numbers encoded |
28 | in devnum. | 33 | in devnum. |
29 | 34 | ||
30 | noload Don't load the journal on mounting. | 35 | noload Don't load the journal on mounting. Note that this forces |
36 | mount of inconsistent filesystem, which can lead to | ||
37 | various problems. | ||
31 | 38 | ||
32 | data=journal All data are committed into the journal prior to being | 39 | data=journal All data are committed into the journal prior to being |
33 | written into the main file system. | 40 | written into the main file system. |
@@ -92,9 +99,12 @@ nocheck | |||
92 | 99 | ||
93 | debug Extra debugging information is sent to syslog. | 100 | debug Extra debugging information is sent to syslog. |
94 | 101 | ||
95 | errors=remount-ro(*) Remount the filesystem read-only on an error. | 102 | errors=remount-ro Remount the filesystem read-only on an error. |
96 | errors=continue Keep going on a filesystem error. | 103 | errors=continue Keep going on a filesystem error. |
97 | errors=panic Panic and halt the machine if an error occurs. | 104 | errors=panic Panic and halt the machine if an error occurs. |
105 | (These mount options override the errors behavior | ||
106 | specified in the superblock, which can be | ||
107 | configured using tune2fs.) | ||
98 | 108 | ||
99 | data_err=ignore(*) Just print an error message if an error occurs | 109 | data_err=ignore(*) Just print an error message if an error occurs |
100 | in a file data buffer in ordered mode. | 110 | in a file data buffer in ordered mode. |
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index cec829bc7291..97882df04865 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt | |||
@@ -85,7 +85,7 @@ Note: More extensive information for getting started with ext4 can be | |||
85 | * extent format more robust in face of on-disk corruption due to magics, | 85 | * extent format more robust in face of on-disk corruption due to magics, |
86 | * internal redundancy in tree | 86 | * internal redundancy in tree |
87 | * improved file allocation (multi-block alloc) | 87 | * improved file allocation (multi-block alloc) |
88 | * fix 32000 subdirectory limit | 88 | * lift 32000 subdirectory limit imposed by i_links_count[1] |
89 | * nsec timestamps for mtime, atime, ctime, create time | 89 | * nsec timestamps for mtime, atime, ctime, create time |
90 | * inode version field on disk (NFSv4, Lustre) | 90 | * inode version field on disk (NFSv4, Lustre) |
91 | * reduced e2fsck time via uninit_bg feature | 91 | * reduced e2fsck time via uninit_bg feature |
@@ -100,6 +100,9 @@ Note: More extensive information for getting started with ext4 can be | |||
100 | * efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force | 100 | * efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force |
101 | the ordering) | 101 | the ordering) |
102 | 102 | ||
103 | [1] Filesystems with a block size of 1k may see a limit imposed by the | ||
104 | directory hash tree having a maximum depth of two. | ||
105 | |||
103 | 2.2 Candidate features for future inclusion | 106 | 2.2 Candidate features for future inclusion |
104 | 107 | ||
105 | * Online defrag (patches available but not well tested) | 108 | * Online defrag (patches available but not well tested) |
@@ -180,8 +183,8 @@ commit=nrsec (*) Ext4 can be told to sync all its data and metadata | |||
180 | performance. | 183 | performance. |
181 | 184 | ||
182 | barrier=<0|1(*)> This enables/disables the use of write barriers in | 185 | barrier=<0|1(*)> This enables/disables the use of write barriers in |
183 | the jbd code. barrier=0 disables, barrier=1 enables. | 186 | barrier(*) the jbd code. barrier=0 disables, barrier=1 enables. |
184 | This also requires an IO stack which can support | 187 | nobarrier This also requires an IO stack which can support |
185 | barriers, and if jbd gets an error on a barrier | 188 | barriers, and if jbd gets an error on a barrier |
186 | write, it will disable again with a warning. | 189 | write, it will disable again with a warning. |
187 | Write barriers enforce proper on-disk ordering | 190 | Write barriers enforce proper on-disk ordering |
@@ -189,6 +192,9 @@ barrier=<0|1(*)> This enables/disables the use of write barriers in | |||
189 | safe to use, at some performance penalty. If | 192 | safe to use, at some performance penalty. If |
190 | your disks are battery-backed in one way or another, | 193 | your disks are battery-backed in one way or another, |
191 | disabling barriers may safely improve performance. | 194 | disabling barriers may safely improve performance. |
195 | The mount options "barrier" and "nobarrier" can | ||
196 | also be used to enable or disable barriers, for | ||
197 | consistency with other ext4 mount options. | ||
192 | 198 | ||
193 | inode_readahead=n This tuning parameter controls the maximum | 199 | inode_readahead=n This tuning parameter controls the maximum |
194 | number of inode table blocks that ext4's inode | 200 | number of inode table blocks that ext4's inode |
@@ -310,6 +316,24 @@ journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the | |||
310 | a slightly higher priority than the default I/O | 316 | a slightly higher priority than the default I/O |
311 | priority. | 317 | priority. |
312 | 318 | ||
319 | auto_da_alloc(*) Many broken applications don't use fsync() when | ||
320 | noauto_da_alloc replacing existing files via patterns such as | ||
321 | fd = open("foo.new")/write(fd,..)/close(fd)/ | ||
322 | rename("foo.new", "foo"), or worse yet, | ||
323 | fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). | ||
324 | If auto_da_alloc is enabled, ext4 will detect | ||
325 | the replace-via-rename and replace-via-truncate | ||
326 | patterns and force that any delayed allocation | ||
327 | blocks are allocated such that at the next | ||
328 | journal commit, in the default data=ordered | ||
329 | mode, the data blocks of the new file are forced | ||
330 | to disk before the rename() operation is | ||
331 | committed. This provides roughly the same level | ||
332 | of guarantees as ext3, and avoids the | ||
333 | "zero-length" problem that can happen when a | ||
334 | system crashes before the delayed allocation | ||
335 | blocks are forced to disk. | ||
336 | |||
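For comparison, an application that does not want to rely on auto_da_alloc can call fsync() itself before the rename. The following is a minimal user-space sketch of the replace-via-rename pattern with that call added; demo_replace_file() and its parameters are hypothetical names chosen for illustration, and error handling is kept deliberately simple.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `buf` by writing `tmp_path` first.
 * `tmp_path` is assumed to be on the same filesystem as `path`.
 * Returns 0 on success, -1 on failure.
 */
static int demo_replace_file(const char *path, const char *tmp_path,
			     const void *buf, size_t len)
{
	const char *p = buf;
	int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd < 0)
		return -1;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n <= 0) {
			close(fd);
			unlink(tmp_path);
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}

	/* Force the new data to disk before the rename makes it visible. */
	if (fsync(fd) < 0) {
		close(fd);
		unlink(tmp_path);
		return -1;
	}
	if (close(fd) < 0) {
		unlink(tmp_path);
		return -1;
	}
	if (rename(tmp_path, path) < 0) {
		unlink(tmp_path);
		return -1;
	}
	return 0;
}
```

A stricter version might also fsync() the containing directory after the rename so that the new directory entry itself is durable; that step is omitted here for brevity.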
313 | Data Mode | 337 | Data Mode |
314 | ========= | 338 | ========= |
315 | There are 3 different data modes: | 339 | There are 3 different data modes: |
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 830bad7cce0f..ce84cfc9eae0 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt | |||
@@ -5,6 +5,7 @@ | |||
5 | Bodo Bauer <bb@ricochet.net> | 5 | Bodo Bauer <bb@ricochet.net> |
6 | 6 | ||
7 | 2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 | 7 | 2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 |
8 | move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009 | ||
8 | ------------------------------------------------------------------------------ | 9 | ------------------------------------------------------------------------------ |
9 | Version 1.3 Kernel version 2.2.12 | 10 | Version 1.3 Kernel version 2.2.12 |
10 | Kernel version 2.4.0-test11-pre4 | 11 | Kernel version 2.4.0-test11-pre4 |
@@ -26,25 +27,17 @@ Table of Contents | |||
26 | 1.6 Parallel port info in /proc/parport | 27 | 1.6 Parallel port info in /proc/parport |
27 | 1.7 TTY info in /proc/tty | 28 | 1.7 TTY info in /proc/tty |
28 | 1.8 Miscellaneous kernel statistics in /proc/stat | 29 | 1.8 Miscellaneous kernel statistics in /proc/stat |
30 | 1.9 Ext4 file system parameters | ||
29 | 31 | ||
30 | 2 Modifying System Parameters | 32 | 2 Modifying System Parameters |
31 | 2.1 /proc/sys/fs - File system data | 33 | |
32 | 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats | 34 | 3 Per-Process Parameters |
33 | 2.3 /proc/sys/kernel - general kernel parameters | 35 | 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score |
34 | 2.4 /proc/sys/vm - The virtual memory subsystem | 36 | 3.2 /proc/<pid>/oom_score - Display current oom-killer score |
35 | 2.5 /proc/sys/dev - Device specific parameters | 37 | 3.3 /proc/<pid>/io - Display the IO accounting fields |
36 | 2.6 /proc/sys/sunrpc - Remote procedure calls | 38 | 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings |
37 | 2.7 /proc/sys/net - Networking stuff | 39 | 3.5 /proc/<pid>/mountinfo - Information about mounts |
38 | 2.8 /proc/sys/net/ipv4 - IPV4 settings | 40 | |
39 | 2.9 Appletalk | ||
40 | 2.10 IPX | ||
41 | 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem | ||
42 | 2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score | ||
43 | 2.13 /proc/<pid>/oom_score - Display current oom-killer score | ||
44 | 2.14 /proc/<pid>/io - Display the IO accounting fields | ||
45 | 2.15 /proc/<pid>/coredump_filter - Core dump filtering settings | ||
46 | 2.16 /proc/<pid>/mountinfo - Information about mounts | ||
47 | 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface | ||
48 | 41 | ||
49 | ------------------------------------------------------------------------------ | 42 | ------------------------------------------------------------------------------ |
50 | Preface | 43 | Preface |
@@ -940,27 +933,6 @@ Table 1-10: Files in /proc/fs/ext4/<devname> | |||
940 | File Content | 933 | File Content |
941 | mb_groups details of multiblock allocator buddy cache of free blocks | 934 | mb_groups details of multiblock allocator buddy cache of free blocks |
942 | mb_history multiblock allocation history | 935 | mb_history multiblock allocation history |
943 | stats controls whether the multiblock allocator should start | ||
944 | collecting statistics, which are shown during the unmount | ||
945 | group_prealloc the multiblock allocator will round up allocation | ||
946 | requests to a multiple of this tuning parameter if the | ||
947 | stripe size is not set in the ext4 superblock | ||
948 | max_to_scan The maximum number of extents the multiblock allocator | ||
949 | will search to find the best extent | ||
950 | min_to_scan The minimum number of extents the multiblock allocator | ||
951 | will search to find the best extent | ||
952 | order2_req Tuning parameter which controls the minimum size for | ||
953 | requests (as a power of 2) where the buddy cache is | ||
954 | used | ||
955 | stream_req Files which have fewer blocks than this tunable | ||
956 | parameter will have their blocks allocated out of a | ||
957 | block group specific preallocation pool, so that small | ||
958 | files are packed closely together. Each large file | ||
959 | will have its blocks allocated out of its own unique | ||
960 | preallocation pool. | ||
961 | inode_readahead Tuning parameter which controls the maximum number of | ||
962 | inode table blocks that ext4's inode table readahead | ||
963 | algorithm will pre-read into the buffer cache | ||
964 | .............................................................................. | 936 | .............................................................................. |
965 | 937 | ||
966 | 938 | ||
@@ -1011,1021 +983,24 @@ review the kernel documentation in the directory /usr/src/linux/Documentation. | |||
1011 | This chapter is heavily based on the documentation included in the pre 2.2 | 983 | This chapter is heavily based on the documentation included in the pre 2.2 |
1012 | kernels, and became part of it in version 2.2.1 of the Linux kernel. | 984 | kernels, and became part of it in version 2.2.1 of the Linux kernel. |
1013 | 985 | ||
1014 | 2.1 /proc/sys/fs - File system data | 986 | Please see: Documentation/sysctls/ directory for descriptions of these |
1015 | ----------------------------------- | ||
1016 | |||
1017 | This subdirectory contains specific file system, file handle, inode, dentry | ||
1018 | and quota information. | ||
1019 | |||
1020 | Currently, these files are in /proc/sys/fs: | ||
1021 | |||
1022 | dentry-state | ||
1023 | ------------ | ||
1024 | |||
1025 | Status of the directory cache. Since directory entries are dynamically | ||
1026 | allocated and deallocated, this file indicates the current status. It holds | ||
1027 | six values, in which the last two are not used and are always zero. The others | ||
1028 | are listed in table 2-1. | ||
1029 | |||
1030 | |||
1031 | Table 2-1: Status files of the directory cache | ||
1032 | .............................................................................. | ||
1033 | File Content | ||
1034 | nr_dentry Almost always zero | ||
1035 | nr_unused Number of unused cache entries | ||
1036 | age_limit | ||
1037 | in seconds after the entry may be reclaimed, when memory is short | ||
1038 | want_pages internally | ||
1039 | .............................................................................. | ||
1040 | |||
1041 | dquot-nr and dquot-max | ||
1042 | ---------------------- | ||
1043 | |||
1044 | The file dquot-max shows the maximum number of cached disk quota entries. | ||
1045 | |||
1046 | The file dquot-nr shows the number of allocated disk quota entries and the | ||
1047 | number of free disk quota entries. | ||
1048 | |||
1049 | If the number of available cached disk quotas is very low and you have a large | ||
1050 | number of simultaneous system users, you might want to raise the limit. | ||
1051 | |||
1052 | file-nr and file-max | ||
1053 | -------------------- | ||
1054 | |||
1055 | The kernel allocates file handles dynamically, but doesn't free them again at | ||
1056 | this time. | ||
1057 | |||
1058 | The value in file-max denotes the maximum number of file handles that the | ||
1059 | Linux kernel will allocate. When you get a lot of error messages about running | ||
1060 | out of file handles, you might want to raise this limit. The default value is | ||
1061 | 10% of RAM in kilobytes. To change it, just write the new number into the | ||
1062 | file: | ||
1063 | |||
1064 | # cat /proc/sys/fs/file-max | ||
1065 | 4096 | ||
1066 | # echo 8192 > /proc/sys/fs/file-max | ||
1067 | # cat /proc/sys/fs/file-max | ||
1068 | 8192 | ||
1069 | |||
1070 | |||
1071 | This method of revision is useful for all customizable parameters of the | ||
1072 | kernel - simply echo the new value to the corresponding file. | ||
1073 | |||
1074 | Historically, the three values in file-nr denoted the number of allocated file | ||
1075 | handles, the number of allocated but unused file handles, and the maximum | ||
1076 | number of file handles. Linux 2.6 always reports 0 as the number of free file | ||
1077 | handles -- this is not an error, it just means that the number of allocated | ||
1078 | file handles exactly matches the number of used file handles. | ||
1079 | |||
1080 | Attempts to allocate more file descriptors than file-max are reported with | ||
1081 | printk, look for "VFS: file-max limit <number> reached". | ||
1082 | |||
1083 | inode-state and inode-nr | ||
1084 | ------------------------ | ||
1085 | |||
1086 | The file inode-nr contains the first two items from inode-state, so we'll skip | ||
1087 | to that file... | ||
1088 | |||
1089 | inode-state contains two actual numbers and five dummy values. The numbers | ||
1090 | are nr_inodes and nr_free_inodes (in order of appearance). | ||
1091 | |||
1092 | nr_inodes | ||
1093 | ~~~~~~~~~ | ||
1094 | |||
1095 | Denotes the number of inodes the system has allocated. This number will | ||
1096 | grow and shrink dynamically. | ||
1097 | |||
1098 | nr_open | ||
1099 | ------- | ||
1100 | |||
1101 | Denotes the maximum number of file-handles a process can | ||
1102 | allocate. Default value is 1024*1024 (1048576) which should be | ||
1103 | enough for most machines. Actual limit depends on RLIMIT_NOFILE | ||
1104 | resource limit. | ||
1105 | |||
1106 | nr_free_inodes | ||
1107 | -------------- | ||
1108 | |||
1109 | Represents the number of free inodes, i.e. the number of in-use inodes is | ||
1110 | (nr_inodes - nr_free_inodes). | ||
1111 | |||
1112 | aio-nr and aio-max-nr | ||
1113 | --------------------- | ||
1114 | |||
1115 | aio-nr is the running total of the number of events specified on the | ||
1116 | io_setup system call for all currently active aio contexts. If aio-nr | ||
1117 | reaches aio-max-nr then io_setup will fail with EAGAIN. Note that | ||
1118 | raising aio-max-nr does not result in the pre-allocation or re-sizing | ||
1119 | of any kernel data structures. | ||
1120 | |||
1121 | 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats | ||
1122 | ----------------------------------------------------------- | ||
1123 | |||
1124 | Besides these files, there is the subdirectory /proc/sys/fs/binfmt_misc. This | ||
1125 | handles the kernel support for miscellaneous binary formats. | ||
1126 | |||
1127 | Binfmt_misc provides the ability to register additional binary formats to the | ||
1128 | Kernel without compiling an additional module/kernel. Therefore, binfmt_misc | ||
1129 | needs to know magic numbers at the beginning or the filename extension of the | ||
1130 | binary. | ||
1131 | |||
1132 | It works by maintaining a linked list of structs that contain a description of | ||
1133 | a binary format, including a magic with size (or the filename extension), | ||
1134 | offset and mask, and the interpreter name. On request it invokes the given | ||
1135 | interpreter with the original program as argument, as binfmt_java and | ||
1136 | binfmt_em86 and binfmt_mz do. Since binfmt_misc does not define any default | ||
1137 | binary-formats, you have to register an additional binary-format. | ||
1138 | |||
1139 | There are two general files in binfmt_misc and one file per registered format. | ||
1140 | The two general files are register and status. | ||
1141 | |||
1142 | Registering a new binary format | ||
1143 | ------------------------------- | ||
1144 | |||
1145 | To register a new binary format you have to issue the command | ||
1146 | |||
1147 | echo :name:type:offset:magic:mask:interpreter: > /proc/sys/fs/binfmt_misc/register | ||
1148 | |||
1149 | |||
1150 | |||
1151 | with appropriate name (the name for the /proc-dir entry), offset (defaults to | ||
1152 | 0, if omitted), magic, mask (which can be omitted, defaults to all 0xff) and | ||
1153 | last but not least, the interpreter that is to be invoked (for example and | ||
1154 | testing /bin/echo). Type can be M for usual magic matching or E for filename | ||
1155 | extension matching (give extension in place of magic). | ||
1156 | |||
1157 | Check or reset the status of the binary format handler | ||
1158 | ------------------------------------------------------ | ||
1159 | |||
1160 | If you do a cat on the file /proc/sys/fs/binfmt_misc/status, you will get the | ||
1161 | current status (enabled/disabled) of binfmt_misc. Change the status by echoing | ||
1162 | 0 (disables) or 1 (enables) or -1 (caution: this clears all previously | ||
1163 | registered binary formats) to status. For example echo 0 > status to disable | ||
1164 | binfmt_misc (temporarily). | ||
1165 | |||
1166 | Status of a single handler | ||
1167 | -------------------------- | ||
1168 | |||
1169 | Each registered handler has an entry in /proc/sys/fs/binfmt_misc. These files | ||
1170 | perform the same function as status, but their scope is limited to the actual | ||
1171 | binary format. By cating this file, you also receive all related information | ||
1172 | about the interpreter/magic of the binfmt. | ||
1173 | |||
1174 | Example usage of binfmt_misc (emulate binfmt_java) | ||
1175 | -------------------------------------------------- | ||
1176 | |||
1177 | cd /proc/sys/fs/binfmt_misc | ||
1178 | echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/java/bin/javawrapper:' > register | ||
1179 | echo ':HTML:E::html::/usr/local/java/bin/appletviewer:' > register | ||
1180 | echo ':Applet:M::<!--applet::/usr/local/java/bin/appletviewer:' > register | ||
1181 | echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register | ||
1182 | |||
1183 | |||
1184 | These four lines add support for Java executables and Java applets (like | ||
1185 | binfmt_java, additionally recognizing the .html extension with no need to put | ||
1186 | <!--applet> to every applet file). You have to install the JDK and the | ||
1187 | shell-script /usr/local/java/bin/javawrapper too. It works around the | ||
1188 | brokenness of the Java filename handling. To add a Java binary, just create a | ||
1189 | link to the class-file somewhere in the path. | ||
1190 | |||
1191 | 2.3 /proc/sys/kernel - general kernel parameters | ||
1192 | ------------------------------------------------ | ||
1193 | |||
1194 | This directory reflects general kernel behaviors. As I've said before, the | ||
1195 | contents depend on your configuration. Here you'll find the most important | ||
1196 | files, along with descriptions of what they mean and how to use them. | ||
1197 | |||
1198 | acct | ||
1199 | ---- | ||
1200 | |||
1201 | The file contains three values; highwater, lowwater, and frequency. | ||
1202 | |||
1203 | It exists only when BSD-style process accounting is enabled. These values | ||
1204 | control its behavior. If the free space on the file system where the log lives | ||
1205 | goes below lowwater percentage, accounting suspends. If it goes above | ||
1206 | highwater percentage, accounting resumes. Frequency determines how often you | ||
1207 | check the amount of free space (value is in seconds). Default settings are: 4, | ||
1208 | 2, and 30. That is, suspend accounting if there is less than 2 percent free; | ||
1209 | resume it once free space reaches 4 percent or more; consider information about | ||
1210 | the amount of free space valid for 30 seconds. | ||
1211 | |||
1212 | ctrl-alt-del | ||
1213 | ------------ | ||
1214 | |||
1215 | When the value in this file is 0, ctrl-alt-del is trapped and sent to the init | ||
1216 | program to handle a graceful restart. However, when the value is greater than | ||
1217 | zero, Linux's reaction to this key combination will be an immediate reboot, | ||
1218 | without syncing its dirty buffers. | ||
1219 | |||
1220 | [NOTE] | ||
1221 | When a program (like dosemu) has the keyboard in raw mode, the | ||
1222 | ctrl-alt-del is intercepted by the program before it ever reaches the | ||
1223 | kernel tty layer, and it is up to the program to decide what to do with | ||
1224 | it. | ||
1225 | |||
1226 | domainname and hostname | ||
1227 | ----------------------- | ||
1228 | |||
1229 | These files can be controlled to set the NIS domainname and hostname of your | ||
1230 | box. For the classic darkstar.frop.org a simple: | ||
1231 | |||
1232 | # echo "darkstar" > /proc/sys/kernel/hostname | ||
1233 | # echo "frop.org" > /proc/sys/kernel/domainname | ||
1234 | |||
1235 | |||
1236 | would suffice to set your hostname and NIS domainname. | ||
1237 | |||
1238 | osrelease, ostype and version | ||
1239 | ----------------------------- | ||
1240 | |||
1241 | The names make it pretty obvious what these fields contain: | ||
1242 | |||
1243 | > cat /proc/sys/kernel/osrelease | ||
1244 | 2.2.12 | ||
1245 | |||
1246 | > cat /proc/sys/kernel/ostype | ||
1247 | Linux | ||
1248 | |||
1249 | > cat /proc/sys/kernel/version | ||
1250 | #4 Fri Oct 1 12:41:14 PDT 1999 | ||
1251 | |||
1252 | |||
1253 | The files osrelease and ostype should be clear enough. Version needs a little | ||
1254 | more clarification. The #4 means that this is the 4th kernel built from this | ||
1255 | source base and the date after it indicates the time the kernel was built. The | ||
1256 | only way to tune these values is to rebuild the kernel. | ||
1257 | |||
1258 | panic | ||
1259 | ----- | ||
1260 | |||
1261 | The value in this file represents the number of seconds the kernel waits | ||
1262 | before rebooting on a panic. When you use the software watchdog, the | ||
1263 | recommended setting is 60. If set to 0, the auto reboot after a kernel panic | ||
1264 | is disabled, which is the default setting. | ||
1265 | |||
1266 | printk | ||
1267 | ------ | ||
1268 | |||
1269 | The four values in printk denote | ||
1270 | * console_loglevel, | ||
1271 | * default_message_loglevel, | ||
1272 | * minimum_console_loglevel and | ||
1273 | * default_console_loglevel | ||
1274 | respectively. | ||
1275 | |||
1276 | These values influence printk() behavior when printing or logging error | ||
1277 | messages, which come from inside the kernel. See syslog(2) for more | ||
1278 | information on the different log levels. | ||
1279 | |||
1280 | console_loglevel | ||
1281 | ---------------- | ||
1282 | |||
1283 | Messages with a higher priority than this will be printed to the console. | ||
1284 | |||
1285 | default_message_loglevel | ||
1286 | ------------------------ | ||
1287 | |||
1288 | Messages without an explicit priority will be printed with this priority. | ||
1289 | |||
1290 | minimum_console_loglevel | ||
1291 | ------------------------ | ||
1292 | |||
1293 | Minimum (highest) value to which the console_loglevel can be set. | ||
1294 | |||
1295 | default_console_loglevel | ||
1296 | ------------------------ | ||
1297 | |||
1298 | Default value for console_loglevel. | ||
1299 | |||
1300 | sg-big-buff | ||
1301 | ----------- | ||
1302 | |||
1303 | This file shows the size of the generic SCSI (sg) buffer. At this point, you | ||
1304 | can't tune it yet, but you can change it at compile time by editing | ||
1305 | include/scsi/sg.h and changing the value of SG_BIG_BUFF. | ||
1306 | |||
1307 | If you use a scanner with SANE (Scanner Access Now Easy) you might want to set | ||
1308 | this to a higher value. Refer to the SANE documentation on this issue. | ||
1309 | |||
1310 | modprobe | ||
1311 | -------- | ||
1312 | |||
1313 | The location where the modprobe binary is located. The kernel uses this | ||
1314 | program to load modules on demand. | ||
1315 | |||
1316 | unknown_nmi_panic | ||
1317 | ----------------- | ||
1318 | |||
1319 | The value in this file affects behavior of handling NMI. When the value is | ||
1320 | non-zero, unknown NMI is trapped and then panic occurs. At that time, kernel | ||
1321 | debugging information is displayed on console. | ||
1322 | |||
1323 | For example, the NMI switch found on most IA32 servers raises an unknown NMI. | ||
1324 | If a system hangs up, try pressing the NMI switch. | ||
1325 | |||
1326 | panic_on_unrecovered_nmi | ||
1327 | ------------------------ | ||
1328 | |||
1329 | The default Linux behaviour on an NMI of either memory or unknown is to continue | ||
1330 | operation. For many environments such as scientific computing it is preferable | ||
1331 | that the box is taken out and the error dealt with than an uncorrected | ||
1332 | parity/ECC error get propagated. | ||
1333 | |||
1334 | A small number of systems do generate NMI's for bizarre random reasons such as | ||
1335 | power management so the default is off. That sysctl works like the existing | ||
1336 | panic controls already in that directory. | ||
1337 | |||
1338 | nmi_watchdog | ||
1339 | ------------ | ||
1340 | |||
1341 | Enables/Disables the NMI watchdog on x86 systems. When the value is non-zero | ||
1342 | the NMI watchdog is enabled and will continuously test all online cpus to | ||
1343 | determine whether or not they are still functioning properly. Currently, | ||
1344 | passing "nmi_watchdog=" parameter at boot time is required for this function | ||
1345 | to work. | ||
1346 | |||
1347 | If LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel parameter), the | ||
1348 | NMI watchdog shares registers with oprofile. By disabling the NMI watchdog, | ||
1349 | oprofile may have more registers to utilize. | ||
1350 | |||
1351 | msgmni | ||
1352 | ------ | ||
1353 | |||
1354 | Maximum number of message queue ids on the system. | ||
1355 | This value scales to the amount of lowmem. It is automatically recomputed | ||
1356 | upon memory add/remove or ipc namespace creation/removal. | ||
1357 | When a value is written into this file, msgmni's value becomes fixed, i.e. it | ||
1358 | is not recomputed anymore when one of the above events occurs. | ||
1359 | Use auto_msgmni to change this behavior. | ||
1360 | |||
1361 | auto_msgmni | ||
1362 | ----------- | ||
1363 | |||
1364 | Enables/Disables automatic recomputing of msgmni upon memory add/remove or | ||
1365 | upon ipc namespace creation/removal (see the msgmni description above). | ||
1366 | Echoing "1" into this file enables msgmni automatic recomputing. | ||
1367 | Echoing "0" turns it off. | ||
1368 | auto_msgmni default value is 1. | ||
1369 | |||
1370 | |||
1371 | 2.4 /proc/sys/vm - The virtual memory subsystem | ||
1372 | ----------------------------------------------- | ||
1373 | |||
1374 | Please see: Documentation/sysctls/vm.txt for a description of these | ||
1375 | entries. | 987 | entries. |
1376 | 988 | ||
989 | ------------------------------------------------------------------------------ | ||
990 | Summary | ||
991 | ------------------------------------------------------------------------------ | ||
992 | Certain aspects of kernel behavior can be modified at runtime, without the | ||
993 | need to recompile the kernel, or even to reboot the system. The files in the | ||
994 | /proc/sys tree can not only be read, but also modified. You can use the echo | ||
995 | command to write values into these files, thereby changing the default settings | ||
996 | of the kernel. | ||
997 | ------------------------------------------------------------------------------ | ||
1377 | 998 | ||
1378 | 2.5 /proc/sys/dev - Device specific parameters | 999 | ------------------------------------------------------------------------------ |
1379 | ---------------------------------------------- | 1000 | CHAPTER 3: PER-PROCESS PARAMETERS |
1380 | 1001 | ------------------------------------------------------------------------------ | |
1381 | Currently there is only support for CDROM drives, and for those, there is only | ||
1382 | one read-only file containing information about the CD-ROM drives attached to | ||
1383 | the system: | ||
1384 | |||
1385 | >cat /proc/sys/dev/cdrom/info | ||
1386 | CD-ROM information, Id: cdrom.c 2.55 1999/04/25 | ||
1387 | |||
1388 | drive name: sr0 hdb | ||
1389 | drive speed: 32 40 | ||
1390 | drive # of slots: 1 0 | ||
1391 | Can close tray: 1 1 | ||
1392 | Can open tray: 1 1 | ||
1393 | Can lock tray: 1 1 | ||
1394 | Can change speed: 1 1 | ||
1395 | Can select disk: 0 1 | ||
1396 | Can read multisession: 1 1 | ||
1397 | Can read MCN: 1 1 | ||
1398 | Reports media changed: 1 1 | ||
1399 | Can play audio: 1 1 | ||
1400 | |||
1401 | |||
1402 | You see two drives, sr0 and hdb, along with a list of their features. | ||
1403 | |||
1404 | 2.6 /proc/sys/sunrpc - Remote procedure calls | ||
1405 | --------------------------------------------- | ||
1406 | |||
1407 | This directory contains four files, which enable or disable debugging for the | ||
1408 | RPC functions NFS, NFS-daemon, RPC and NLM. The default values are 0. They can | ||
1409 | be set to one to turn debugging on. | ||
1410 | |||
1411 | 2.7 /proc/sys/net - Networking stuff | ||
1412 | ------------------------------------ | ||
1413 | |||
1414 | The interface to the networking parts of the kernel is located in | ||
1415 | /proc/sys/net. Table 2-3 shows all possible subdirectories. You may see only | ||
1416 | some of them, depending on your kernel's configuration. | ||
1417 | |||
1418 | |||
1419 | Table 2-3: Subdirectories in /proc/sys/net | ||
1420 | .............................................................................. | ||
1421 | Directory Content Directory Content | ||
1422 | core General parameter appletalk Appletalk protocol | ||
1423 | unix Unix domain sockets netrom NET/ROM | ||
1424 | 802 E802 protocol ax25 AX25 | ||
1425 | ethernet Ethernet protocol rose X.25 PLP layer | ||
1426 | ipv4 IP version 4 x25 X.25 protocol | ||
1427 | ipx IPX token-ring IBM token ring | ||
1428 | bridge Bridging decnet DEC net | ||
1429 | ipv6 IP version 6 | ||
1430 | .............................................................................. | ||
1431 | |||
1432 | We will concentrate on IP networking here. Since AX25, X.25, and DEC Net are | ||
1433 | only minor players in the Linux world, we'll skip them in this chapter. You'll | ||
1434 | find some short info on Appletalk and IPX further on in this chapter. Review | ||
1435 | the online documentation and the kernel source to get a detailed view of the | ||
1436 | parameters for those protocols. In this section we'll discuss the | ||
1437 | subdirectories printed in bold letters in the table above. As default values | ||
1438 | are suitable for most needs, there is no need to change these values. | ||
1439 | |||
1440 | /proc/sys/net/core - Network core options | ||
1441 | ----------------------------------------- | ||
1442 | |||
1443 | rmem_default | ||
1444 | ------------ | ||
1445 | |||
1446 | The default setting of the socket receive buffer in bytes. | ||
1447 | |||
1448 | rmem_max | ||
1449 | -------- | ||
1450 | |||
1451 | The maximum receive socket buffer size in bytes. | ||
1452 | |||
1453 | wmem_default | ||
1454 | ------------ | ||
1455 | |||
1456 | The default setting (in bytes) of the socket send buffer. | ||
1457 | |||
1458 | wmem_max | ||
1459 | -------- | ||
1460 | |||
1461 | The maximum send socket buffer size in bytes. | ||
1462 | |||
1463 | message_burst and message_cost | ||
1464 | ------------------------------ | ||
1465 | |||
1466 | These parameters are used to limit the warning messages written to the kernel | ||
1467 | log from the networking code. They enforce a rate limit to make a | ||
1468 | denial-of-service attack impossible. A higher message_cost factor results in | ||
1469 | fewer messages that will be written. Message_burst controls when messages will | ||
1470 | be dropped. The default settings limit warning messages to one every five | ||
1471 | seconds. | ||
1472 | |||
1473 | warnings | ||
1474 | -------- | ||
1475 | |||
1476 | This controls console messages from the networking stack that can occur because | ||
1477 | of problems on the network like duplicate address or bad checksums. Normally, | ||
1478 | this should be enabled, but if the problem persists the messages can be | ||
1479 | disabled. | ||
1480 | |||
1481 | netdev_budget | ||
1482 | ------------- | ||
1483 | |||
1484 | Maximum number of packets taken from all interfaces in one polling cycle (NAPI | ||
1485 | poll). In one polling cycle interfaces which are registered to polling are | ||
1486 | probed in a round-robin manner. The limit of packets in one such probe can be | ||
1487 | set per-device via sysfs class/net/<device>/weight. | ||
1488 | |||
1489 | netdev_max_backlog | ||
1490 | ------------------ | ||
1491 | |||
1492 | Maximum number of packets, queued on the INPUT side, when the interface | ||
1493 | receives packets faster than kernel can process them. | ||
1494 | |||
1495 | optmem_max | ||
1496 | ---------- | ||
1497 | |||
1498 | Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence | ||
1499 | of struct cmsghdr structures with appended data. | ||
1500 | |||
1501 | /proc/sys/net/unix - Parameters for Unix domain sockets | ||
1502 | ------------------------------------------------------- | ||
1503 | |||
1504 | There are only two files in this subdirectory. They control the delays for | ||
1505 | deleting and destroying socket descriptors. | ||
1506 | |||
1507 | 2.8 /proc/sys/net/ipv4 - IPV4 settings | ||
1508 | -------------------------------------- | ||
1509 | |||
1510 | IP version 4 is still the most used protocol in Unix networking. It will be | ||
1511 | replaced by IP version 6 in the next couple of years, but for the moment it's | ||
1512 | the de facto standard for the internet and is used in most networking | ||
1513 | environments around the world. Because of the importance of this protocol, | ||
1514 | we'll have a deeper look into the subtree controlling the behavior of the IPv4 | ||
1515 | subsystem of the Linux kernel. | ||
1516 | |||
1517 | Let's start with the entries in /proc/sys/net/ipv4. | ||
1518 | |||
1519 | ICMP settings | ||
1520 | ------------- | ||
1521 | |||
1522 | icmp_echo_ignore_all and icmp_echo_ignore_broadcasts | ||
1523 | ---------------------------------------------------- | ||
1524 | |||
1525 | Turn on (1) or off (0), if the kernel should ignore all ICMP ECHO requests, or | ||
1526 | just those to broadcast and multicast addresses. | ||
1527 | |||
1528 | Please note that if you accept ICMP echo requests with a broadcast/multicast | ||
1529 | destination address your network may be used as an exploder for denial of | ||
1530 | service packet flooding attacks to other hosts. | ||
1531 | |||
1532 | icmp_destunreach_rate, icmp_echoreply_rate, icmp_paramprob_rate and icmp_timeexceed_rate | ||
1533 | ---------------------------------------------------------------------------------------- | ||
1534 | |||
1535 | Sets limits for sending ICMP packets to specific targets. A value of zero | ||
1536 | disables all limiting. Any positive value sets the maximum packet rate in | ||
1537 | hundredths of a second (on Intel systems). | ||
1538 | |||
1539 | IP settings | ||
1540 | ----------- | ||
1541 | |||
1542 | ip_autoconfig | ||
1543 | ------------- | ||
1544 | |||
1545 | This file contains the number one if the host received its IP configuration by | ||
1546 | RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero. | ||
1547 | |||
1548 | ip_default_ttl | ||
1549 | -------------- | ||
1550 | |||
1551 | TTL (Time To Live) for IPv4 interfaces. This is simply the maximum number of | ||
1552 | hops a packet may travel. | ||
1553 | |||
1554 | ip_dynaddr | ||
1555 | ---------- | ||
1556 | |||
1557 | Enable dynamic socket address rewriting on interface address change. This is | ||
1558 | useful for dialup interface with changing IP addresses. | ||
1559 | |||
1560 | ip_forward | ||
1561 | ---------- | ||
1562 | |||
1563 | Enable or disable forwarding of IP packets between interfaces. Changing this | ||
1564 | value resets all other parameters to their default values. They differ if the | ||
1565 | kernel is configured as host or router. | ||
1566 | |||
1567 | ip_local_port_range | ||
1568 | ------------------- | ||
1569 | |||
1570 | Range of ports used by TCP and UDP to choose the local port. Contains two | ||
1571 | numbers, the first number is the lowest port, the second number the highest | ||
1572 | local port. Default is 1024-4999. Should be changed to 32768-61000 for | ||
1573 | high-usage systems. | ||
1574 | |||
1575 | ip_no_pmtu_disc | ||
1576 | --------------- | ||
1577 | |||
1578 | Global switch to turn path MTU discovery off. It can also be set on a per | ||
1579 | socket basis by the applications or on a per route basis. | ||
1580 | |||
1581 | ip_masq_debug | ||
1582 | ------------- | ||
1583 | |||
1584 | Enable/disable debugging of IP masquerading. | ||
1585 | |||
1586 | IP fragmentation settings | ||
1587 | ------------------------- | ||
1588 | |||
1589 | ipfrag_high_thresh and ipfrag_low_thresh | ||
1590 | ---------------------------------------- | ||
1591 | |||
1592 | Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes | ||
1593 | of memory is allocated for this purpose, the fragment handler will toss | ||
1594 | packets until ipfrag_low_thresh is reached. | ||
1595 | |||
1596 | ipfrag_time | ||
1597 | ----------- | ||
1598 | |||
1599 | Time in seconds to keep an IP fragment in memory. | ||
1600 | |||
1601 | TCP settings | ||
1602 | ------------ | ||
1603 | |||
1604 | tcp_ecn | ||
1605 | ------- | ||
1606 | |||
1607 | This file controls the use of the ECN bit in the IPv4 headers. This is a new | ||
1608 | feature about Explicit Congestion Notification, but some routers and firewalls | ||
1609 | block traffic that has this bit set, so it could be necessary to echo 0 to | ||
1610 | /proc/sys/net/ipv4/tcp_ecn if you want to talk to these sites. For more info | ||
1611 | you could read RFC2481. | ||
1612 | |||
1613 | tcp_retrans_collapse | ||
1614 | -------------------- | ||
1615 | |||
1616 | Bug-to-bug compatibility with some broken printers. On retransmit, try to send | ||
1617 | larger packets to work around bugs in certain TCP stacks. Can be turned off by | ||
1618 | setting it to zero. | ||
1619 | |||
1620 | tcp_keepalive_probes | ||
1621 | -------------------- | ||
1622 | |||
1623 | Number of keep alive probes TCP sends out, until it decides that the | ||
1624 | connection is broken. | ||
1625 | |||
1626 | tcp_keepalive_time | ||
1627 | ------------------ | ||
1628 | |||
1629 | How often TCP sends out keep alive messages, when keep alive is enabled. The | ||
1630 | default is 2 hours. | ||
1631 | |||
1632 | tcp_syn_retries | ||
1633 | --------------- | ||
1634 | |||
1635 | Number of times initial SYNs for a TCP connection attempt will be | ||
1636 | retransmitted. Should not be higher than 255. This is only the timeout for | ||
1637 | outgoing connections, for incoming connections the number of retransmits is | ||
1638 | defined by tcp_retries1. | ||
1639 | |||
1640 | tcp_sack | ||
1641 | -------- | ||
1642 | |||
1643 | Enable selective acknowledgments as defined in RFC2018. | ||
1644 | |||
1645 | tcp_timestamps | ||
1646 | -------------- | ||
1647 | |||
1648 | Enable timestamps as defined in RFC1323. | ||
1649 | |||
1650 | tcp_stdurg | ||
1651 | ---------- | ||
1652 | |||
1653 | Enable the strict RFC793 interpretation of the TCP urgent pointer field. The | ||
1654 | default is to use the BSD compatible interpretation of the urgent pointer | ||
1655 | pointing to the first byte after the urgent data. The RFC793 interpretation is | ||
1656 | to have it point to the last byte of urgent data. Enabling this option may | ||
1657 | lead to interoperability problems. Disabled by default. | ||
1658 | |||
1659 | tcp_syncookies | ||
1660 | -------------- | ||
1661 | |||
1662 | Only valid when the kernel was compiled with CONFIG_SYNCOOKIES. Send out | ||
1663 | syncookies when the syn backlog queue of a socket overflows. This is to ward | ||
1664 | off the common 'syn flood attack'. Disabled by default. | ||
1665 | |||
1666 | Note that the concept of a socket backlog is abandoned. This means the peer | ||
1667 | may not receive reliable error messages from an overloaded server with | ||
1668 | syncookies enabled. | ||
1669 | |||
1670 | tcp_window_scaling | ||
1671 | ------------------ | ||
1672 | |||
1673 | Enable window scaling as defined in RFC1323. | ||
1674 | |||
1675 | tcp_fin_timeout | ||
1676 | --------------- | ||
1677 | |||
1678 | The length of time in seconds it takes to receive a final FIN before the | ||
1679 | socket is always closed. This is strictly a violation of the TCP | ||
1680 | specification, but required to prevent denial-of-service attacks. | ||
1681 | |||
1682 | tcp_max_ka_probes | ||
1683 | ----------------- | ||
1684 | |||
1685 | Indicates how many keep alive probes are sent per slow timer run. Should not | ||
1686 | be set too high to prevent bursts. | ||
1687 | |||
1688 | tcp_max_syn_backlog | ||
1689 | ------------------- | ||
1690 | |||
1691 | Length of the per socket backlog queue. Since Linux 2.2 the backlog specified | ||
1692 | in listen(2) only specifies the length of the backlog queue of already | ||
1693 | established sockets. When more connection requests arrive Linux starts to drop | ||
1694 | packets. When syncookies are enabled the packets are still answered and the | ||
1695 | maximum queue is effectively ignored. | ||
1696 | |||
1697 | tcp_retries1 | ||
1698 | ------------ | ||
1699 | |||
1700 | Defines how often an answer to a TCP connection request is retransmitted | ||
1701 | before giving up. | ||
1702 | |||
1703 | tcp_retries2 | ||
1704 | ------------ | ||
1705 | |||
1706 | Defines how often a TCP packet is retransmitted before giving up. | ||
1707 | |||
1708 | Interface specific settings | ||
1709 | --------------------------- | ||
1710 | |||
1711 | In the directory /proc/sys/net/ipv4/conf you'll find one subdirectory for each | ||
1712 | interface the system knows about and one directory called all. Changes in the | ||
1713 | all subdirectory affect all interfaces, whereas changes in the other | ||
1714 | subdirectories affect only one interface. All directories have the same | ||
1715 | entries: | ||
1716 | |||
1717 | accept_redirects | ||
1718 | ---------------- | ||
1719 | |||
1720 | This switch decides if the kernel accepts ICMP redirect messages or not. The | ||
1721 | default is 'yes' if the kernel is configured for a regular host and 'no' for a | ||
1722 | router configuration. | ||
1723 | |||
1724 | accept_source_route | ||
1725 | ------------------- | ||
1726 | |||
1727 | Whether source routed packets are accepted or declined. The default is | ||
1728 | dependent on the kernel configuration. It's 'yes' for routers and 'no' for | ||
1729 | hosts. | ||
1730 | |||
1731 | bootp_relay | ||
1732 | ~~~~~~~~~~~ | ||
1733 | |||
1734 | Accept packets with source address 0.b.c.d with destinations not to this host | ||
1735 | as local ones. It is supposed that a BOOTP relay daemon will catch and forward | ||
1736 | such packets. | ||
1737 | |||
1738 | The default is 0, since this feature is not implemented yet (kernel version | ||
1739 | 2.2.12). | ||
1740 | |||
1741 | forwarding | ||
1742 | ---------- | ||
1743 | |||
1744 | Enable or disable IP forwarding on this interface. | ||
1745 | |||
1746 | log_martians | ||
1747 | ------------ | ||
1748 | |||
1749 | Log packets with source addresses with no known route to kernel log. | ||
1750 | |||
1751 | mc_forwarding | ||
1752 | ------------- | ||
1753 | |||
1754 | Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a | ||
1755 | multicast routing daemon is required. | ||
1756 | |||
1757 | proxy_arp | ||
1758 | --------- | ||
1759 | |||
1760 | Does (1) or does not (0) perform proxy ARP. | ||
1761 | |||
1762 | rp_filter | ||
1763 | --------- | ||
1764 | |||
1765 | Integer value determines if a source validation should be made. 1 means yes, 0 | ||
1766 | means no. Disabled by default, but local/broadcast address spoofing is always | ||
1767 | on. | ||
1768 | |||
1769 | If you set this to 1 on a router that is the only connection for a network to | ||
1770 | the net, it will prevent spoofing attacks against your internal networks | ||
1771 | (external addresses can still be spoofed), without the need for additional | ||
1772 | firewall rules. | ||
1773 | |||
1774 | secure_redirects | ||
1775 | ---------------- | ||
1776 | |||
1777 | Accept ICMP redirect messages only for gateways, listed in default gateway | ||
1778 | list. Enabled by default. | ||
1779 | |||
1780 | shared_media | ||
1781 | ------------ | ||
1782 | |||
1783 | If it is not set the kernel does not assume that different subnets on this | ||
1784 | device can communicate directly. Default setting is 'yes'. | ||
1785 | |||
1786 | send_redirects | ||
1787 | -------------- | ||
1788 | |||
1789 | Determines whether to send ICMP redirects to other hosts. | ||
1790 | |||
1791 | Routing settings | ||
1792 | ---------------- | ||
1793 | |||
1794 | The directory /proc/sys/net/ipv4/route contains several files to control | ||
1795 | routing issues. | ||
1796 | |||
1797 | error_burst and error_cost | ||
1798 | -------------------------- | ||
1799 | |||
1800 | These parameters are used to limit how many ICMP destination unreachable messages to | ||
1801 | send from the host in question. ICMP destination unreachable messages are | ||
1802 | sent when we cannot reach the next hop while trying to transmit a packet. | ||
1803 | It will also print some error messages to kernel logs if someone is ignoring | ||
1804 | our ICMP redirects. The higher the error_cost factor is, the fewer | ||
1805 | destination unreachable and error messages will be let through. Error_burst | ||
1806 | controls when destination unreachable messages and error messages will be | ||
1807 | dropped. The default settings limit warning messages to five every second. | ||
1808 | |||
1809 | flush | ||
1810 | ----- | ||
1811 | |||
1812 | Writing to this file results in a flush of the routing cache. | ||
1813 | |||
1814 | gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh | ||
1815 | --------------------------------------------------------------------- | ||
1816 | |||
1817 | Values to control the frequency and behavior of the garbage collection | ||
1818 | algorithm for the routing cache. gc_min_interval is deprecated and replaced | ||
1819 | by gc_min_interval_ms. | ||
1820 | |||
1821 | |||
1822 | max_size | ||
1823 | -------- | ||
1824 | |||
1825 | Maximum size of the routing cache. Old entries will be purged once the cache | ||
1826 | has reached this size. | ||
1827 | |||
1828 | redirect_load, redirect_number | ||
1829 | ------------------------------ | ||
1830 | |||
1831 | Factors which determine if more ICMP redirects should be sent to a specific | ||
1832 | host. No redirects will be sent once the load limit or the maximum number of | ||
1833 | redirects has been reached. | ||
1834 | |||
1835 | redirect_silence | ||
1836 | ---------------- | ||
1837 | |||
1838 | Timeout for redirects. After this period redirects will be sent again, even if | ||
1839 | this has been stopped, because the load or number limit has been reached. | ||
1840 | |||
1841 | Network Neighbor handling | ||
1842 | ------------------------- | ||
1843 | |||
1844 | Settings about how to handle connections with direct neighbors (nodes attached | ||
1845 | to the same link) can be found in the directory /proc/sys/net/ipv4/neigh. | ||
1846 | |||
1847 | As we saw in the conf directory, there is a default subdirectory which | ||
1848 | holds the default values, and one directory for each interface. The contents | ||
1849 | of the directories are identical, with the single exception that the default | ||
1850 | settings contain additional options to set garbage collection parameters. | ||
1851 | |||
1852 | In the interface directories you'll find the following entries: | ||
1853 | |||
1854 | base_reachable_time, base_reachable_time_ms | ||
1855 | ------------------------------------------- | ||
1856 | |||
1857 | A base value used for computing the random reachable time value as specified | ||
1858 | in RFC2461. | ||
1859 | |||
1860 | Expression of base_reachable_time, which is deprecated, is in seconds. | ||
1861 | Expression of base_reachable_time_ms is in milliseconds. | ||
1862 | |||
1863 | retrans_time, retrans_time_ms | ||
1864 | ----------------------------- | ||
1865 | |||
1866 | The time between retransmitted Neighbor Solicitation messages. | ||
1867 | Used for address resolution and to determine if a neighbor is | ||
1868 | unreachable. | ||
1869 | |||
1870 | Expression of retrans_time, which is deprecated, is in 1/100 seconds (for | ||
1871 | IPv4) or in jiffies (for IPv6). | ||
1872 | Expression of retrans_time_ms is in milliseconds. | ||
1873 | |||
1874 | unres_qlen | ||
1875 | ---------- | ||
1876 | |||
1877 | Maximum queue length for a pending arp request - the number of packets which | ||
1878 | are accepted from other layers while the ARP address is still being resolved. | ||
1879 | |||
1880 | anycast_delay | ||
1881 | ------------- | ||
1882 | |||
1883 | Maximum for random delay of answers to neighbor solicitation messages in | ||
1884 | jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support | ||
1885 | yet). | ||
1886 | |||
1887 | ucast_solicit | ||
1888 | ------------- | ||
1889 | |||
1890 | Maximum number of retries for unicast solicitation. | ||
1891 | |||
1892 | mcast_solicit | ||
1893 | ------------- | ||
1894 | |||
1895 | Maximum number of retries for multicast solicitation. | ||
1896 | |||
1897 | delay_first_probe_time | ||
1898 | ---------------------- | ||
1899 | |||
1900 | Delay for the first time probe if the neighbor is reachable. (see | ||
1901 | gc_stale_time) | ||
1902 | |||
1903 | locktime | ||
1904 | -------- | ||
1905 | |||
1906 | An ARP/neighbor entry is only replaced with a new one if the old is at least | ||
1907 | locktime old. This prevents ARP cache thrashing. | ||
1908 | |||
1909 | proxy_delay | ||
1910 | ----------- | ||
1911 | |||
1912 | Maximum time (real time is random [0..proxytime]) before answering to an ARP | ||
1913 | request for which we have a proxy ARP entry. In some cases, this is used to | ||
1914 | prevent network flooding. | ||
1915 | |||
1916 | proxy_qlen | ||
1917 | ---------- | ||
1918 | |||
1919 | Maximum queue length of the delayed proxy arp timer. (see proxy_delay). | ||
1920 | |||
1921 | app_solicit | ||
1922 | ----------- | ||
1923 | |||
1924 | Determines the number of requests to send to the user level ARP daemon. Use 0 | ||
1925 | to turn off. | ||
1926 | |||
1927 | gc_stale_time | ||
1928 | ------------- | ||
1929 | |||
1930 | Determines how often to check for stale ARP entries. After an ARP entry is | ||
1931 | stale it will be resolved again (which is useful when an IP address migrates | ||
1932 | to another machine). When ucast_solicit is greater than 0 it first tries to | ||
1933 | send an ARP packet directly to the known host. When that fails and | ||
1934 | mcast_solicit is greater than 0, an ARP request is broadcast. | ||
1935 | |||
1936 | 2.9 Appletalk | ||
1937 | ------------- | ||
1938 | |||
1939 | The /proc/sys/net/appletalk directory holds the Appletalk configuration data | ||
1940 | when Appletalk is loaded. The configurable parameters are: | ||
1941 | |||
1942 | aarp-expiry-time | ||
1943 | ---------------- | ||
1944 | |||
1945 | The amount of time we keep an ARP entry before expiring it. Used to age out | ||
1946 | old hosts. | ||
1947 | |||
1948 | aarp-resolve-time | ||
1949 | ----------------- | ||
1950 | |||
1951 | The amount of time we will spend trying to resolve an Appletalk address. | ||
1952 | |||
1953 | aarp-retransmit-limit | ||
1954 | --------------------- | ||
1955 | |||
1956 | The number of times we will retransmit a query before giving up. | ||
1957 | |||
1958 | aarp-tick-time | ||
1959 | -------------- | ||
1960 | |||
1961 | Controls the rate at which expires are checked. | ||
1962 | |||
1963 | The directory /proc/net/appletalk holds the list of active Appletalk sockets | ||
1964 | on a machine. | ||
1965 | |||
1966 | The fields indicate the DDP type, the local address (in network:node format), | ||
1967 | the remote address, the size of the transmit pending queue, the size of the | ||
1968 | received queue (bytes waiting for applications to read), the state and the uid | ||
1969 | owning the socket. | ||
1970 | |||
1971 | /proc/net/atalk_iface lists all the interfaces configured for appletalk. It | ||
1972 | shows the name of the interface, its Appletalk address, the network range on | ||
1973 | that address (or network number for phase 1 networks), and the status of the | ||
1974 | interface. | ||
1975 | |||
1976 | /proc/net/atalk_route lists each known network route. It lists the target | ||
1977 | (network) that the route leads to, the router (may be directly connected), the | ||
1978 | route flags, and the device the route is using. | ||
1979 | |||
1980 | 2.10 IPX | ||
1981 | -------- | ||
1982 | |||
1983 | The IPX protocol has no tunable values in /proc/sys/net. | ||
1984 | |||
1985 | The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX | ||
1986 | socket giving the local and remote addresses in Novell format (that is | ||
1987 | network:node:port). In accordance with the strange Novell tradition, | ||
1988 | everything but the port is in hex. Not_Connected is displayed for sockets that | ||
1989 | are not tied to a specific remote address. The Tx and Rx queue sizes indicate | ||
1990 | the number of bytes pending for transmission and reception. The state | ||
1991 | indicates the state the socket is in and the uid is the owning uid of the | ||
1992 | socket. | ||
1993 | |||
1994 | The /proc/net/ipx_interface file lists all IPX interfaces. For each interface | ||
1995 | it gives the network number, the node number, and indicates if the network is | ||
1996 | the primary network. It also indicates which device it is bound to (or | ||
1997 | Internal for internal networks) and the Frame Type if appropriate. Linux | ||
1998 | supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for | ||
1999 | IPX. | ||
2000 | |||
2001 | The /proc/net/ipx_route table holds a list of IPX routes. For each route it | ||
2002 | gives the destination network, the router node (or Directly) and the network | ||
2003 | address of the router (or Connected) for internal networks. | ||
2004 | |||
2005 | 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem | ||
2006 | ---------------------------------------------------------- | ||
2007 | |||
2008 | The "mqueue" filesystem provides the necessary kernel features to enable the | ||
2009 | creation of a user space library that implements the POSIX message queues | ||
2010 | API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System | ||
2011 | Interfaces specification.) | ||
2012 | |||
2013 | The "mqueue" filesystem contains values for determining/setting the amount of | ||
2014 | resources used by the file system. | ||
2015 | |||
2016 | /proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the | ||
2017 | maximum number of message queues allowed on the system. | ||
2018 | |||
2019 | /proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the | ||
2020 | maximum number of messages in a queue value. In fact it is the limiting value | ||
2021 | for another (user) limit which is set in mq_open invocation. This attribute of | ||
2022 | a queue must be less than or equal to msg_max. | ||
2023 | |||
2024 | /proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the | ||
2025 | maximum message size value (it is every message queue's attribute set during | ||
2026 | its creation). | ||
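The relationship between the per-queue attributes and the two system-wide limits described above can be sketched as a small check. This is only an illustration of the rule stated in the text, not kernel code: the function name check_mq_attrs is hypothetical, the limit values in the usage note are arbitrary sample numbers, and a real program would read the limits from /proc/sys/fs/mqueue before calling mq_open.

```python
def check_mq_attrs(mq_maxmsg, mq_msgsize, msg_max, msgsize_max):
    """Return a list of problems; an empty list means the request fits.

    mq_maxmsg and mq_msgsize are the attributes a caller would pass to
    mq_open; msg_max and msgsize_max are the values read from
    /proc/sys/fs/mqueue (passed in here so the check stays pure).
    """
    problems = []
    if mq_maxmsg > msg_max:
        problems.append(f"mq_maxmsg {mq_maxmsg} exceeds msg_max {msg_max}")
    if mq_msgsize > msgsize_max:
        problems.append(
            f"mq_msgsize {mq_msgsize} exceeds msgsize_max {msgsize_max}"
        )
    return problems
```

For instance, with sample limits msg_max=10 and msgsize_max=8192, a request for 10 messages of 4096 bytes passes, while a request for 11 messages is reported as exceeding msg_max.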
2027 | 1002 | ||
2028 | 2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score | 1003 | 3.1 /proc/<pid>/oom_adj - Adjust the oom-killer score |
2029 | ------------------------------------------------------ | 1004 | ------------------------------------------------------ |
2030 | 1005 | ||
2031 | This file can be used to adjust the score used to select which processes | 1006 | This file can be used to adjust the score used to select which processes |
@@ -2062,25 +1037,15 @@ The task with the highest badness score is then selected and its children | |||
2062 | are killed, process itself will be killed in an OOM situation when it does | 1037 | are killed, process itself will be killed in an OOM situation when it does |
2063 | not have children or some of them disabled oom like described above. | 1038 | not have children or some of them disabled oom like described above. |
2064 | 1039 | ||
2065 | 2.13 /proc/<pid>/oom_score - Display current oom-killer score | 1040 | 3.2 /proc/<pid>/oom_score - Display current oom-killer score |
2066 | ------------------------------------------------------------- | 1041 | ------------------------------------------------------------- |
2067 | 1042 | ||
2068 | ------------------------------------------------------------------------------ | ||
2069 | This file can be used to check the current score used by the oom-killer for | 1043 | This file can be used to check the current score used by the oom-killer for |
2070 | any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which | 1044 | any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which |
2071 | process should be killed in an out-of-memory situation. | 1045 | process should be killed in an out-of-memory situation. |
2072 | 1046 | ||
2073 | ------------------------------------------------------------------------------ | ||
2074 | Summary | ||
2075 | ------------------------------------------------------------------------------ | ||
2076 | Certain aspects of kernel behavior can be modified at runtime, without the | ||
2077 | need to recompile the kernel, or even to reboot the system. The files in the | ||
2078 | /proc/sys tree can not only be read, but also modified. You can use the echo | ||
2079 | command to write value into these files, thereby changing the default settings | ||
2080 | of the kernel. | ||
2081 | ------------------------------------------------------------------------------ | ||
2082 | 1047 | ||
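The removed summary describes reading and writing files under /proc/sys at run time. A minimal Python sketch of the same idea follows. The helper names (sysctl_path, read_sysctl, write_sysctl) and the root parameter are assumptions made for illustration and are not part of any kernel or library interface; writing below the real /proc/sys normally requires root, and the simple dot-splitting does not handle interface names that themselves contain dots.

```python
import os


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted name such as 'net.ipv4.tcp_ecn' to a file under root."""
    return os.path.join(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string with surrounding whitespace removed."""
    with open(sysctl_path(name, root)) as f:
        return f.read().strip()


def write_sysctl(name, value, root="/proc/sys"):
    """Write a new value, the equivalent of `echo value > file`."""
    with open(sysctl_path(name, root), "w") as f:
        f.write(f"{value}\n")
```

The root parameter exists so the helpers can be pointed at a scratch directory while experimenting, instead of at the live kernel settings.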
2083 | 2.14 /proc/<pid>/io - Display the IO accounting fields | 1048 | 3.3 /proc/<pid>/io - Display the IO accounting fields |
2084 | ------------------------------------------------------- | 1049 | ------------------------------------------------------- |
2085 | 1050 | ||
2086 | This file contains IO statistics for each running process | 1051 | This file contains IO statistics for each running process |
@@ -2182,7 +1147,7 @@ those 64-bit counters, process A could see an intermediate result. | |||
2182 | More information about this can be found within the taskstats documentation in | 1147 | More information about this can be found within the taskstats documentation in |
2183 | Documentation/accounting. | 1148 | Documentation/accounting. |
2184 | 1149 | ||
2185 | 2.15 /proc/<pid>/coredump_filter - Core dump filtering settings | 1150 | 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings |
2186 | --------------------------------------------------------------- | 1151 | --------------------------------------------------------------- |
2187 | When a process is dumped, all anonymous memory is written to a core file as | 1152 | When a process is dumped, all anonymous memory is written to a core file as |
2188 | long as the size of the core file isn't limited. But sometimes we don't want | 1153 | long as the size of the core file isn't limited. But sometimes we don't want |
@@ -2226,7 +1191,7 @@ For example: | |||
2226 | $ echo 0x7 > /proc/self/coredump_filter | 1191 | $ echo 0x7 > /proc/self/coredump_filter |
2227 | $ ./some_program | 1192 | $ ./some_program |
2228 | 1193 | ||
2229 | 2.16 /proc/<pid>/mountinfo - Information about mounts | 1194 | 3.5 /proc/<pid>/mountinfo - Information about mounts |
2230 | -------------------------------------------------------- | 1195 | -------------------------------------------------------- |
2231 | 1196 | ||
2232 | This file contains lines of the form: | 1197 | This file contains lines of the form: |
@@ -2263,30 +1228,3 @@ For more information on mount propagation see: | |||
2263 | 1228 | ||
2264 | Documentation/filesystems/sharedsubtree.txt | 1229 | Documentation/filesystems/sharedsubtree.txt |
2265 | 1230 | ||
2266 | 2.17 /proc/sys/fs/epoll - Configuration options for the epoll interface | ||
2267 | -------------------------------------------------------- | ||
2268 | |||
2269 | This directory contains configuration options for the epoll(7) interface. | ||
2270 | |||
2271 | max_user_instances | ||
2272 | ------------------ | ||
2273 | |||
2274 | This is the maximum number of epoll file descriptors that a single user can | ||
2275 | have open at a given time. The default value is 128, and should be enough | ||
2276 | for normal users. | ||
2277 | |||
2278 | max_user_watches | ||
2279 | ---------------- | ||
2280 | |||
2281 | Every epoll file descriptor can store a number of files to be monitored | ||
2282 | for event readiness. Each one of these monitored files constitutes a "watch". | ||
2283 | This configuration option sets the maximum number of "watches" that are | ||
2284 | allowed for each user. | ||
2285 | Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes | ||
2286 | on a 64bit one. | ||
2287 | The current default value for max_user_watches is the 1/32 of the available | ||
2288 | low memory, divided by the "watch" cost in bytes. | ||
2289 | |||
2290 | |||
2291 | ------------------------------------------------------------------------------ | ||
2292 | |||
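The max_user_watches text removed above gives a rule of thumb: 1/32 of the available low memory, divided by the per-watch cost. The arithmetic can be sketched as below. The function name and constants are hypothetical and simply restate the rough figures from the prose (about 90 bytes per watch on 32-bit, about 160 on 64-bit); the kernel's actual computation may differ in detail.

```python
# Approximate per-watch cost in bytes, taken from the documentation text.
WATCH_COST_32BIT = 90
WATCH_COST_64BIT = 160


def default_max_user_watches(low_memory_bytes, is_64bit=True):
    """Estimate the default limit: 1/32 of low memory divided by watch cost."""
    cost = WATCH_COST_64BIT if is_64bit else WATCH_COST_32BIT
    return (low_memory_bytes // 32) // cost
```

As a worked case, 1 GiB of low memory on a 64-bit kernel gives (1073741824 // 32) // 160 = 209715 watches under these assumptions.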
diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt index 9f8740ca3f3b..26e4b8bc53ee 100644 --- a/Documentation/filesystems/sysfs-pci.txt +++ b/Documentation/filesystems/sysfs-pci.txt | |||
@@ -12,6 +12,7 @@ that support it. For example, a given bus might look like this: | |||
12 | | |-- enable | 12 | | |-- enable |
13 | | |-- irq | 13 | | |-- irq |
14 | | |-- local_cpus | 14 | | |-- local_cpus |
15 | | |-- remove | ||
15 | | |-- resource | 16 | | |-- resource |
16 | | |-- resource0 | 17 | | |-- resource0 |
17 | | |-- resource1 | 18 | | |-- resource1 |
@@ -36,6 +37,7 @@ files, each with their own function. | |||
36 | enable Whether the device is enabled (ascii, rw) | 37 | enable Whether the device is enabled (ascii, rw) |
37 | irq IRQ number (ascii, ro) | 38 | irq IRQ number (ascii, ro) |
38 | local_cpus nearby CPU mask (cpumask, ro) | 39 | local_cpus nearby CPU mask (cpumask, ro) |
40 | remove remove device from kernel's list (ascii, wo) | ||
39 | resource PCI resource host addresses (ascii, ro) | 41 | resource PCI resource host addresses (ascii, ro) |
40 | resource0..N PCI resource N, if present (binary, mmap) | 42 | resource0..N PCI resource N, if present (binary, mmap) |
41 | resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) | 43 | resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) |
@@ -46,6 +48,7 @@ files, each with their own function. | |||
46 | 48 | ||
47 | ro - read only file | 49 | ro - read only file |
48 | rw - file is readable and writable | 50 | rw - file is readable and writable |
51 | wo - write only file | ||
49 | mmap - file is mmapable | 52 | mmap - file is mmapable |
50 | ascii - file contains ascii text | 53 | ascii - file contains ascii text |
51 | binary - file contains binary data | 54 | binary - file contains binary data |
@@ -73,6 +76,13 @@ that the device must be enabled for a rom read to return data succesfully. | |||
73 | In the event a driver is not bound to the device, it can be enabled using the | 76 | In the event a driver is not bound to the device, it can be enabled using the |
74 | 'enable' file, documented above. | 77 | 'enable' file, documented above. |
75 | 78 | ||
79 | The 'remove' file is used to remove the PCI device, by writing a non-zero | ||
80 | integer to the file. This does not involve any kind of hot-plug functionality, | ||
81 | e.g. powering off the device. The device is removed from the kernel's list of | ||
82 | PCI devices, the sysfs directory for it is removed, and the device will be | ||
83 | removed from any drivers attached to it. Removal of PCI root buses is | ||
84 | disallowed. | ||
85 | |||
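The write to the 'remove' file described above could be scripted roughly as follows. This is a sketch under stated assumptions: the function name remove_pci_device is hypothetical, the device address format and the default /sys/bus/pci/devices location are assumed rather than taken from this patch, and the write needs root on a real system. The sysfs_root parameter lets the code be exercised against a scratch directory.

```python
import os


def remove_pci_device(address, sysfs_root="/sys/bus/pci/devices"):
    """Write a non-zero integer to the device's 'remove' file.

    Returns the path that was written, which is handy for logging.
    """
    path = os.path.join(sysfs_root, address, "remove")
    with open(path, "w") as f:
        f.write("1\n")
    return path
```

Because the operation detaches the device from its driver and deletes its sysfs directory, a cautious caller would confirm the address first rather than removing devices in a loop.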
76 | Accessing legacy resources through sysfs | 86 | Accessing legacy resources through sysfs |
77 | ---------------------------------------- | 87 | ---------------------------------------- |
78 | 88 | ||
diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt index fde829a756e6..902b95d0ee51 100644 --- a/Documentation/filesystems/udf.txt +++ b/Documentation/filesystems/udf.txt | |||
@@ -24,6 +24,8 @@ The following mount options are supported: | |||
24 | 24 | ||
25 | gid= Set the default group. | 25 | gid= Set the default group. |
26 | umask= Set the default umask. | 26 | umask= Set the default umask. |
27 | mode= Set the default file permissions. | ||
28 | dmode= Set the default directory permissions. | ||
27 | uid= Set the default user. | 29 | uid= Set the default user. |
28 | bs= Set the block size. | 30 | bs= Set the block size. |
29 | unhide Show otherwise hidden files. | 31 | unhide Show otherwise hidden files. |
diff --git a/Documentation/gpio.txt b/Documentation/gpio.txt index b1b988701247..145c25a170c7 100644 --- a/Documentation/gpio.txt +++ b/Documentation/gpio.txt | |||
@@ -123,7 +123,10 @@ platform-specific implementation issue. | |||
123 | 123 | ||
124 | Using GPIOs | 124 | Using GPIOs |
125 | ----------- | 125 | ----------- |
126 | One of the first things to do with a GPIO, often in board setup code when | 126 | The first thing a system should do with a GPIO is allocate it, using |
127 | the gpio_request() call; see later. | ||
128 | |||
129 | One of the next things to do with a GPIO, often in board setup code when | ||
127 | setting up a platform_device using the GPIO, is mark its direction: | 130 | setting up a platform_device using the GPIO, is mark its direction: |
128 | 131 | ||
129 | /* set as input or output, returning 0 or negative errno */ | 132 | /* set as input or output, returning 0 or negative errno */ |
@@ -141,8 +144,8 @@ This helps avoid signal glitching during system startup. | |||
141 | 144 | ||
142 | For compatibility with legacy interfaces to GPIOs, setting the direction | 145 | For compatibility with legacy interfaces to GPIOs, setting the direction |
143 | of a GPIO implicitly requests that GPIO (see below) if it has not been | 146 | of a GPIO implicitly requests that GPIO (see below) if it has not been |
144 | requested already. That compatibility may be removed in the future; | 147 | requested already. That compatibility is being removed from the optional |
145 | explicitly requesting GPIOs is strongly preferred. | 148 | gpiolib framework. |
146 | 149 | ||
147 | Setting the direction can fail if the GPIO number is invalid, or when | 150 | Setting the direction can fail if the GPIO number is invalid, or when |
148 | that particular GPIO can't be used in that mode. It's generally a bad | 151 | that particular GPIO can't be used in that mode. It's generally a bad |
@@ -195,7 +198,7 @@ This requires sleeping, which can't be done from inside IRQ handlers. | |||
195 | 198 | ||
196 | Platforms that support this type of GPIO distinguish them from other GPIOs | 199 | Platforms that support this type of GPIO distinguish them from other GPIOs |
197 | by returning nonzero from this call (which requires a valid GPIO number, | 200 | by returning nonzero from this call (which requires a valid GPIO number, |
198 | either explicitly or implicitly requested): | 201 | which should have been previously allocated with gpio_request): |
199 | 202 | ||
200 | int gpio_cansleep(unsigned gpio); | 203 | int gpio_cansleep(unsigned gpio); |
201 | 204 | ||
@@ -212,10 +215,9 @@ for GPIOs that can't be accessed from IRQ handlers, these calls act the | |||
212 | same as the spinlock-safe calls. | 215 | same as the spinlock-safe calls. |
213 | 216 | ||
214 | 217 | ||
215 | Claiming and Releasing GPIOs (OPTIONAL) | 218 | Claiming and Releasing GPIOs |
216 | --------------------------------------- | 219 | ---------------------------- |
217 | To help catch system configuration errors, two calls are defined. | 220 | To help catch system configuration errors, two calls are defined. |
218 | However, many platforms don't currently support this mechanism. | ||
219 | 221 | ||
220 | /* request GPIO, returning 0 or negative errno. | 222 | /* request GPIO, returning 0 or negative errno. |
221 | * non-null labels may be useful for diagnostics. | 223 | * non-null labels may be useful for diagnostics. |
@@ -244,13 +246,6 @@ Some platforms may also use knowledge about what GPIOs are active for | |||
244 | power management, such as by powering down unused chip sectors and, more | 246 | power management, such as by powering down unused chip sectors and, more |
245 | easily, gating off unused clocks. | 247 | easily, gating off unused clocks. |
246 | 248 | ||
247 | These two calls are optional because not all current Linux platforms | ||
248 | offer such functionality in their GPIO support; a valid implementation | ||
249 | could return success for all gpio_request() calls. Unlike the other calls, | ||
250 | the state they represent doesn't normally match anything from a hardware | ||
251 | register; it's just a software bitmap which clearly is not necessary for | ||
252 | correct operation of hardware or (bug free) drivers. | ||
253 | |||
254 | Note that requesting a GPIO does NOT cause it to be configured in any | 249 | Note that requesting a GPIO does NOT cause it to be configured in any |
255 | way; it just marks that GPIO as in use. Separate code must handle any | 250 | way; it just marks that GPIO as in use. Separate code must handle any |
256 | pin setup (e.g. controlling which pin the GPIO uses, pullup/pulldown). | 251 | pin setup (e.g. controlling which pin the GPIO uses, pullup/pulldown). |
diff --git a/Documentation/hwmon/lis3lv02d b/Documentation/hwmon/lis3lv02d index 287f8c902656..effe949a7282 100644 --- a/Documentation/hwmon/lis3lv02d +++ b/Documentation/hwmon/lis3lv02d | |||
@@ -1,11 +1,11 @@ | |||
1 | Kernel driver lis3lv02d | 1 | Kernel driver lis3lv02d |
2 | ================== | 2 | ======================= |
3 | 3 | ||
4 | Supported chips: | 4 | Supported chips: |
5 | 5 | ||
6 | * STMicroelectronics LIS3LV02DL and LIS3LV02DQ | 6 | * STMicroelectronics LIS3LV02DL and LIS3LV02DQ |
7 | 7 | ||
8 | Author: | 8 | Authors: |
9 | Yan Burman <burman.yan@gmail.com> | 9 | Yan Burman <burman.yan@gmail.com> |
10 | Eric Piel <eric.piel@tremplin-utc.net> | 10 | Eric Piel <eric.piel@tremplin-utc.net> |
11 | 11 | ||
@@ -15,7 +15,7 @@ Description | |||
15 | 15 | ||
16 | This driver provides support for the accelerometer found in various HP | 16 | This driver provides support for the accelerometer found in various HP |
17 | laptops sporting the feature officially called "HP Mobile Data | 17 | laptops sporting the feature officially called "HP Mobile Data |
18 | Protection System 3D" or "HP 3D DriveGuard". It detect automatically | 18 | Protection System 3D" or "HP 3D DriveGuard". It detects automatically |
19 | laptops with this sensor. Known models (for now the HP 2133, nc6420, | 19 | laptops with this sensor. Known models (for now the HP 2133, nc6420, |
20 | nc2510, nc8510, nc84x0, nw9440 and nx9420) will have their axis | 20 | nc2510, nc8510, nc84x0, nw9440 and nx9420) will have their axis |
21 | automatically oriented on standard way (eg: you can directly play | 21 | automatically oriented on standard way (eg: you can directly play |
@@ -27,7 +27,7 @@ position - 3D position that the accelerometer reports. Format: "(x,y,z)" | |||
27 | calibrate - read: values (x, y, z) that are used as the base for input | 27 | calibrate - read: values (x, y, z) that are used as the base for input |
28 | class device operation. | 28 | class device operation. |
29 | write: forces the base to be recalibrated with the current | 29 | write: forces the base to be recalibrated with the current |
30 | position. | 30 | position. |
31 | rate - reports the sampling rate of the accelerometer device in HZ | 31 | rate - reports the sampling rate of the accelerometer device in HZ |
32 | 32 | ||
33 | This driver also provides an absolute input class device, allowing | 33 | This driver also provides an absolute input class device, allowing |
@@ -48,7 +48,7 @@ For better compatibility between the various laptops. The values reported by | |||
48 | the accelerometer are converted into a "standard" organisation of the axes | 48 | the accelerometer are converted into a "standard" organisation of the axes |
49 | (aka "can play neverball out of the box"): | 49 | (aka "can play neverball out of the box"): |
50 | * When the laptop is horizontal the position reported is about 0 for X and Y | 50 | * When the laptop is horizontal the position reported is about 0 for X and Y |
51 | and a positive value for Z | 51 | and a positive value for Z |
52 | * If the left side is elevated, X increases (becomes positive) | 52 | * If the left side is elevated, X increases (becomes positive) |
53 | * If the front side (where the touchpad is) is elevated, Y decreases | 53 | * If the front side (where the touchpad is) is elevated, Y decreases |
54 | (becomes negative) | 54 | (becomes negative) |
@@ -59,3 +59,13 @@ email to the authors to add it to the database. When reporting a new | |||
59 | laptop, please include the output of "dmidecode" plus the value of | 59 | laptop, please include the output of "dmidecode" plus the value of |
60 | /sys/devices/platform/lis3lv02d/position in these four cases. | 60 | /sys/devices/platform/lis3lv02d/position in these four cases. |
61 | 61 | ||
62 | Q&A | ||
63 | --- | ||
64 | |||
65 | Q: How do I safely simulate freefall? I have an HP "portable | ||
66 | workstation" which weighs about 3.5kg and has a plastic case, so letting it | ||
67 | fall to the ground is out of the question... | ||
68 | |||
69 | A: The sensor is pretty sensitive, so your hands can do it. Lift it | ||
70 | into free space, follow the fall with your hands for like 10 | ||
71 | centimeters. That should be enough to trigger the detection. | ||
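When collecting the position readings requested above, it can help to split the "(x,y,z)" string into separate values. The following is a minimal sketch, not part of the driver; the function name lis3_position is hypothetical, and the default path is the one given in the text.

```shell
# Sketch: split a lis3lv02d "position" reading of the form "(x,y,z)"
# into three values. The default path is the one named in the text;
# pass another file as $1 to try it without the hardware.
lis3_position() {
    pos_file=${1:-/sys/devices/platform/lis3lv02d/position}
    IFS= read -r raw < "$pos_file" || [ -n "$raw" ] || return 1
    # Strip the surrounding parentheses, then split on commas.
    raw=${raw#\(}
    raw=${raw%\)}
    old_ifs=$IFS
    IFS=,
    set -- $raw
    IFS=$old_ifs
    [ $# -eq 3 ] || return 1
    printf 'x=%s y=%s z=%s\n' "$1" "$2" "$3"
}
```

Calling lis3_position with no argument reads the live sysfs file; the function returns non-zero if the content does not have exactly three comma-separated fields.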
diff --git a/Documentation/hwmon/ltc4215 b/Documentation/hwmon/ltc4215 new file mode 100644 index 000000000000..2e6a21eb656c --- /dev/null +++ b/Documentation/hwmon/ltc4215 | |||
@@ -0,0 +1,50 @@ | |||
1 | Kernel driver ltc4215 | ||
2 | ===================== | ||
3 | |||
4 | Supported chips: | ||
5 | * Linear Technology LTC4215 | ||
6 | Prefix: 'ltc4215' | ||
7 | Addresses scanned: 0x44 | ||
8 | Datasheet: | ||
9 | http://www.linear.com/pc/downloadDocument.do?navId=H0,C1,C1003,C1006,C1163,P17572,D12697 | ||
10 | |||
11 | Author: Ira W. Snyder <iws@ovro.caltech.edu> | ||
12 | |||
13 | |||
14 | Description | ||
15 | ----------- | ||
16 | |||
17 | The LTC4215 controller allows a board to be safely inserted and removed | ||
18 | from a live backplane. | ||
19 | |||
20 | |||
21 | Usage Notes | ||
22 | ----------- | ||
23 | |||
24 | This driver does not probe for LTC4215 devices, due to the fact that some | ||
25 | of the possible addresses are unfriendly to probing. You will need to use | ||
26 | the "force" parameter to tell the driver where to find the device. | ||
27 | |||
28 | Example: the following will load the driver for an LTC4215 at address 0x44 | ||
29 | on I2C bus #0: | ||
30 | $ modprobe ltc4215 force=0,0x44 | ||
31 | |||
32 | |||
33 | Sysfs entries | ||
34 | ------------- | ||
35 | |||
36 | The LTC4215 has built-in limits for overvoltage, undervoltage, and | ||
37 | undercurrent warnings. This makes it very likely that the reference | ||
38 | circuit will be used. | ||
39 | |||
40 | in1_input input voltage | ||
41 | in2_input output voltage | ||
42 | |||
43 | in1_min_alarm input undervoltage alarm | ||
44 | in1_max_alarm input overvoltage alarm | ||
45 | |||
46 | curr1_input current | ||
47 | curr1_max_alarm overcurrent alarm | ||
48 | |||
49 | power1_input power usage | ||
50 | power1_alarm power bad alarm | ||
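The sysfs entries listed above can be read with ordinary file I/O. The sketch below dumps them from a directory passed as an argument; the function name ltc4215_dump is hypothetical, and the directory is an assumption, since it depends on the I2C bus and address in use. Under the usual hwmon sysfs convention the values are expected in millivolts, milliamps and microwatts.

```shell
# Sketch: print the LTC4215 attributes named in the list above from a
# hwmon-style sysfs directory given as $1. The directory is an
# assumption; it depends on where the device sits on the target system.
ltc4215_dump() {
    dir=$1
    for attr in in1_input in2_input in1_min_alarm in1_max_alarm \
                curr1_input curr1_max_alarm power1_input power1_alarm; do
        if [ -r "$dir/$attr" ]; then
            IFS= read -r val < "$dir/$attr" || [ -n "$val" ] || val=
            printf '%s=%s\n' "$attr" "$val"
        else
            # Attribute missing or unreadable; report it rather than fail.
            printf '%s=unavailable\n' "$attr"
        fi
    done
}
```

The function always prints one line per attribute, so a missing file shows up as "unavailable" instead of aborting the dump.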
diff --git a/Documentation/ia64/kvm.txt b/Documentation/ia64/kvm.txt index 84f7cb3d5bec..ffb5c80bec3e 100644 --- a/Documentation/ia64/kvm.txt +++ b/Documentation/ia64/kvm.txt | |||
@@ -42,7 +42,7 @@ Note: For step 2, please make sure that host page size == TARGET_PAGE_SIZE of qe | |||
42 | hg clone http://xenbits.xensource.com/ext/efi-vfirmware.hg | 42 | hg clone http://xenbits.xensource.com/ext/efi-vfirmware.hg |
43 | you can get the firmware's binary in the directory of efi-vfirmware.hg/binaries. | 43 | you can get the firmware's binary in the directory of efi-vfirmware.hg/binaries. |
44 | 44 | ||
45 | (3) Rename the firware you owned to Flash.fd, and copy it to /usr/local/share/qemu | 45 | (3) Rename the firmware you owned to Flash.fd, and copy it to /usr/local/share/qemu |
46 | 46 | ||
47 | 4. Boot up Linux or Windows guests: | 47 | 4. Boot up Linux or Windows guests: |
48 | 4.1 Create or install a image for guest boot. If you have xen experience, it should be easy. | 48 | 4.1 Create or install a image for guest boot. If you have xen experience, it should be easy. |
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index aeedb89a307a..0b7351b0582c 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -617,6 +617,9 @@ and is between 256 and 4096 characters. It is defined in the file | |||
617 | 617 | ||
618 | debug_objects [KNL] Enable object debugging | 618 | debug_objects [KNL] Enable object debugging |
619 | 619 | ||
620 | no_debug_objects | ||
621 | [KNL] Disable object debugging | ||
622 | |||
620 | debugpat [X86] Enable PAT debugging | 623 | debugpat [X86] Enable PAT debugging |
621 | 624 | ||
622 | decnet.addr= [HW,NET] | 625 | decnet.addr= [HW,NET] |
@@ -1523,7 +1526,9 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1523 | 1526 | ||
1524 | noclflush [BUGS=X86] Don't use the CLFLUSH instruction | 1527 | noclflush [BUGS=X86] Don't use the CLFLUSH instruction |
1525 | 1528 | ||
1526 | nohlt [BUGS=ARM,SH] | 1529 | nohlt [BUGS=ARM,SH] Tells the kernel that the sleep(SH) or |
1530 | wfi(ARM) instruction doesn't work correctly and not to | ||
1531 | use it. This is also useful when using JTAG debugger. | ||
1527 | 1532 | ||
1528 | no-hlt [BUGS=X86-32] Tells the kernel that the hlt | 1533 | no-hlt [BUGS=X86-32] Tells the kernel that the hlt |
1529 | instruction doesn't work correctly and not to | 1534 | instruction doesn't work correctly and not to |
@@ -1603,7 +1608,7 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1603 | nosoftlockup [KNL] Disable the soft-lockup detector. | 1608 | nosoftlockup [KNL] Disable the soft-lockup detector. |
1604 | 1609 | ||
1605 | noswapaccount [KNL] Disable accounting of swap in memory resource | 1610 | noswapaccount [KNL] Disable accounting of swap in memory resource |
1606 | controller. (See Documentation/controllers/memory.txt) | 1611 | controller. (See Documentation/cgroups/memory.txt) |
1607 | 1612 | ||
1608 | nosync [HW,M68K] Disables sync negotiation for all devices. | 1613 | nosync [HW,M68K] Disables sync negotiation for all devices. |
1609 | 1614 | ||
@@ -1695,6 +1700,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1695 | See also Documentation/blockdev/paride.txt. | 1700 | See also Documentation/blockdev/paride.txt. |
1696 | 1701 | ||
1697 | pci=option[,option...] [PCI] various PCI subsystem options: | 1702 | pci=option[,option...] [PCI] various PCI subsystem options: |
1703 | earlydump [X86] dump PCI config space before the kernel | ||
1704 | changes anything | ||
1698 | off [X86] don't probe for the PCI bus | 1705 | off [X86] don't probe for the PCI bus |
1699 | bios [X86-32] force use of PCI BIOS, don't access | 1706 | bios [X86-32] force use of PCI BIOS, don't access |
1700 | the hardware directly. Use this if your machine | 1707 | the hardware directly. Use this if your machine |
@@ -1794,6 +1801,15 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1794 | cbmemsize=nn[KMG] The fixed amount of bus space which is | 1801 | cbmemsize=nn[KMG] The fixed amount of bus space which is |
1795 | reserved for the CardBus bridge's memory | 1802 | reserved for the CardBus bridge's memory |
1796 | window. The default value is 64 megabytes. | 1803 | window. The default value is 64 megabytes. |
1804 | resource_alignment= | ||
1805 | Format: | ||
1806 | [<order of align>@][<domain>:]<bus>:<slot>.<func>[; ...] | ||
1807 | Specifies alignment and device to reassign | ||
1808 | aligned memory resources. | ||
1809 | If <order of align> is not specified, | ||
1810 | PAGE_SIZE is used as alignment. | ||
1811 | PCI-PCI bridge can be specified, if resource | ||
1812 | windows need to be expanded. | ||
1797 | 1813 | ||
1798 | pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power | 1814 | pcie_aspm= [PCIE] Forcibly enable or disable PCIe Active State Power |
1799 | Management. | 1815 | Management. |
@@ -1942,7 +1958,7 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1942 | 1958 | ||
1943 | relax_domain_level= | 1959 | relax_domain_level= |
1944 | [KNL, SMP] Set scheduler's default relax_domain_level. | 1960 | [KNL, SMP] Set scheduler's default relax_domain_level. |
1945 | See Documentation/cpusets.txt. | 1961 | See Documentation/cgroups/cpusets.txt. |
1946 | 1962 | ||
1947 | reserve= [KNL,BUGS] Force the kernel to ignore some iomem area | 1963 | reserve= [KNL,BUGS] Force the kernel to ignore some iomem area |
1948 | 1964 | ||
diff --git a/Documentation/md.txt b/Documentation/md.txt index 1da9d1b1793f..4edd39ec7db9 100644 --- a/Documentation/md.txt +++ b/Documentation/md.txt | |||
@@ -164,15 +164,19 @@ All md devices contain: | |||
164 | raid_disks | 164 | raid_disks |
165 | a text file with a simple number indicating the number of devices | 165 | a text file with a simple number indicating the number of devices |
166 | in a fully functional array. If this is not yet known, the file | 166 | in a fully functional array. If this is not yet known, the file |
167 | will be empty. If an array is being resized (not currently | 167 | will be empty. If an array is being resized this will contain |
168 | possible) this will contain the larger of the old and new sizes. | 168 | the new number of devices. |
169 | Some raid level (RAID1) allow this value to be set while the | 169 | Some raid levels allow this value to be set while the array is |
170 | array is active. This will reconfigure the array. Otherwise | 170 | active. This will reconfigure the array. Otherwise it can only |
171 | it can only be set while assembling an array. | 171 | be set while assembling an array. |
172 | A change to this attribute will not be permitted if it would | ||
173 | reduce the size of the array. To reduce the number of drives | ||
174 | in an e.g. raid5, the array size must first be reduced by | ||
175 | setting the 'array_size' attribute. | ||
172 | 176 | ||
173 | chunk_size | 177 | chunk_size |
174 | This is the size if bytes for 'chunks' and is only relevant to | 178 | This is the size in bytes for 'chunks' and is only relevant to |
175 | raid levels that involve striping (1,4,5,6,10). The address space | 179 | raid levels that involve striping (0,4,5,6,10). The address space |
176 | of the array is conceptually divided into chunks and consecutive | 180 | of the array is conceptually divided into chunks and consecutive |
177 | chunks are striped onto neighbouring devices. | 181 | chunks are striped onto neighbouring devices. |
178 | The size should be at least PAGE_SIZE (4k) and should be a power | 182 | The size should be at least PAGE_SIZE (4k) and should be a power |
@@ -183,6 +187,20 @@ All md devices contain: | |||
183 | simply a number that is interpretted differently by different | 187 | simply a number that is interpretted differently by different |
184 | levels. It can be written while assembling an array. | 188 | levels. It can be written while assembling an array. |
185 | 189 | ||
190 | array_size | ||
191 | This can be used to artificially constrain the available space in | ||
192 | the array to be less than is actually available on the combined | ||
193 | devices. Writing a number (in Kilobytes) which is less than | ||
194 | the available size will set the size. Any reconfiguration of the | ||
195 | array (e.g. adding devices) will not cause the size to change. | ||
196 | Writing the word 'default' will cause the effective size of the | ||
197 | array to be whatever size is actually available based on | ||
198 | 'level', 'chunk_size' and 'component_size'. | ||
199 | |||
200 | This can be used to reduce the size of the array before reducing | ||
201 | the number of devices in a raid4/5/6, or to support external | ||
202 | metadata formats which mandate such clipping. | ||
203 | |||
186 | reshape_position | 204 | reshape_position |
187 | This is either "none" or a sector number within the devices of | 205 | This is either "none" or a sector number within the devices of |
188 | the array where "reshape" is up to. If this is set, the three | 206 | the array where "reshape" is up to. If this is set, the three |
@@ -207,6 +225,11 @@ All md devices contain: | |||
207 | about the array. It can be 0.90 (traditional format), 1.0, 1.1, | 225 | about the array. It can be 0.90 (traditional format), 1.0, 1.1, |
208 | 1.2 (newer format in varying locations) or "none" indicating that | 226 | 1.2 (newer format in varying locations) or "none" indicating that |
209 | the kernel isn't managing metadata at all. | 227 | the kernel isn't managing metadata at all. |
228 | Alternately it can be "external:" followed by a string which | ||
229 | is set by user-space. This indicates that metadata is managed | ||
230 | by a user-space program. Any device failure or other event that | ||
231 | requires a metadata update will cause array activity to be | ||
232 | suspended until the event is acknowledged. | ||
210 | 233 | ||
211 | resync_start | 234 | resync_start |
212 | The point at which resync should start. If no resync is needed, | 235 | The point at which resync should start. If no resync is needed, |
diff --git a/Documentation/misc-devices/isl29003 b/Documentation/misc-devices/isl29003 new file mode 100644 index 000000000000..c4ff5f38e010 --- /dev/null +++ b/Documentation/misc-devices/isl29003 | |||
@@ -0,0 +1,62 @@ | |||
1 | Kernel driver isl29003 | ||
2 | ====================== | ||
3 | |||
4 | Supported chips: | ||
5 | * Intersil ISL29003 | ||
6 | Prefix: 'isl29003' | ||
7 | Addresses scanned: none | ||
8 | Datasheet: | ||
9 | http://www.intersil.com/data/fn/fn7464.pdf | ||
10 | |||
11 | Author: Daniel Mack <daniel@caiaq.de> | ||
12 | |||
13 | |||
14 | Description | ||
15 | ----------- | ||
16 | The ISL29003 is an integrated light sensor with a 16-bit integrating type | ||
17 | ADC, I2C user programmable lux range select for optimized counts/lux, and | ||
18 | I2C multi-function control and monitoring capabilities. The internal ADC | ||
19 | provides 16-bit resolution while rejecting 50Hz and 60Hz flicker caused by | ||
20 | artificial light sources. | ||
21 | |||
22 | The driver allows setting the lux range, the bit resolution, the operational | ||
23 | mode (see below) and the power state of the device, and reading the current | ||
24 | lux value. | ||
25 | |||
26 | |||
27 | Detection | ||
28 | --------- | ||
29 | |||
30 | The ISL29003 does not have an ID register which could be used to identify | ||
31 | it, so the detection routine will just try to read from the configured I2C | ||
32 | address and consider the device to be present as soon as it ACKs the | ||
33 | transfer. | ||
34 | |||
35 | |||
36 | Sysfs entries | ||
37 | ------------- | ||
38 | |||
39 | range: | ||
40 | 0: 0 lux to 1000 lux (default) | ||
41 | 1: 0 lux to 4000 lux | ||
42 | 2: 0 lux to 16,000 lux | ||
43 | 3: 0 lux to 64,000 lux | ||
44 | |||
45 | resolution: | ||
46 | 0: 2^16 cycles (default) | ||
47 | 1: 2^12 cycles | ||
48 | 2: 2^8 cycles | ||
49 | 3: 2^4 cycles | ||
50 | |||
51 | mode: | ||
52 | 0: diode1's current (unsigned 16bit) (default) | ||
53 | 1: diode2's current (unsigned 16bit) | ||
54 | 2: difference between diodes (l1 - l2, signed 15bit) | ||
55 | |||
56 | power_state: | ||
57 | 0: device is disabled (default) | ||
58 | 1: device is enabled | ||
59 | |||
60 | lux (read only): | ||
61 | returns the value from the last sensor reading | ||
62 | |||
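A typical sequence with the entries above is to pick a range, enable the device and read one value. The sketch below does that against a sysfs directory passed as the first argument; the function name isl29003_read_lux is hypothetical, and the directory is an assumption, since it depends on the I2C bus number and address.

```shell
# Sketch: select a lux range, enable the sensor and print one reading.
# $1 is the device's sysfs directory (an assumption; it depends on the
# I2C bus and address), $2 is a range index from the table above.
isl29003_read_lux() {
    dev=$1
    range=${2:-0}
    case $range in
        0|1|2|3) ;;
        *) echo "range must be 0..3" >&2; return 1 ;;
    esac
    echo "$range" > "$dev/range" || return 1
    echo 1 > "$dev/power_state" || return 1
    cat "$dev/lux"
}
```

The range is validated before anything is written, so an out-of-range index leaves the device untouched.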
diff --git a/Documentation/networking/vxge.txt b/Documentation/networking/vxge.txt new file mode 100644 index 000000000000..d2e2997e6fa0 --- /dev/null +++ b/Documentation/networking/vxge.txt | |||
@@ -0,0 +1,100 @@ | |||
1 | Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver | ||
2 | ============================================================================== | ||
3 | |||
4 | Contents | ||
5 | -------- | ||
6 | |||
7 | 1) Introduction | ||
8 | 2) Features supported | ||
9 | 3) Configurable driver parameters | ||
10 | 4) Troubleshooting | ||
11 | |||
12 | 1) Introduction: | ||
13 | ---------------- | ||
14 | This Linux driver supports all Neterion's X3100 series 10 GbE PCIe I/O | ||
15 | Virtualized Server adapters. | ||
16 | The X3100 series supports four modes of operation, configurable via | ||
17 | firmware - | ||
18 | Single function mode | ||
19 | Multi function mode | ||
20 | SRIOV mode | ||
21 | MRIOV mode | ||
22 | The functions share a 10GbE link and the pci-e bus, but hardly anything else | ||
23 | inside the ASIC. Features like independent hw reset, statistics, bandwidth/ | ||
24 | priority allocation and guarantees, GRO, TSO, interrupt moderation etc are | ||
25 | supported independently on each function. | ||
26 | |||
27 | (See below for a complete list of features supported for both IPv4 and IPv6) | ||
28 | |||
29 | 2) Features supported: | ||
30 | ---------------------- | ||
31 | |||
32 | i) Single function mode (up to 17 queues) | ||
33 | |||
34 | ii) Multi function mode (up to 17 functions) | ||
35 | |||
36 | iii) PCI-SIG's I/O Virtualization | ||
37 | - Single Root mode: v1.0 (up to 17 functions) | ||
38 | - Multi-Root mode: v1.0 (up to 17 functions) | ||
39 | |||
40 | iv) Jumbo frames | ||
41 | X3100 Series supports MTU up to 9600 bytes, modifiable using | ||
42 | ifconfig command. | ||
43 | |||
44 | v) Offloads supported: (Enabled by default) | ||
45 | Checksum offload (TCP/UDP/IP) on transmit and receive paths | ||
46 | TCP Segmentation Offload (TSO) on transmit path | ||
47 | Generic Receive Offload (GRO) on receive path | ||
48 | |||
49 | vi) MSI-X: (Enabled by default) | ||
50 | Resulting in noticeable performance improvement (up to 7% on certain | ||
51 | platforms). | ||
52 | |||
53 | vii) NAPI: (Enabled by default) | ||
54 | For better Rx interrupt moderation. | ||
55 | |||
56 | viii)RTH (Receive Traffic Hash): (Enabled by default) | ||
57 | Receive side steering for better scaling. | ||
58 | |||
59 | ix) Statistics | ||
60 | Comprehensive MAC-level and software statistics displayed using | ||
61 | "ethtool -S" option. | ||
62 | |||
63 | x) Multiple hardware queues: (Enabled by default) | ||
64 | Up to 17 hardware based transmit and receive data channels, with | ||
65 | multiple steering options (transmit multiqueue enabled by default). | ||
66 | |||
67 | 3) Configurable driver parameters: | ||
68 | ---------------------------------- | ||
69 | |||
70 | i) max_config_dev | ||
71 | Specifies maximum device functions to be enabled. | ||
72 | Valid range: 1-8 | ||
73 | |||
74 | ii) max_config_port | ||
75 | Specifies number of ports to be enabled. | ||
76 | Valid range: 1,2 | ||
77 | Default: 1 | ||
78 | |||
79 | iii)max_config_vpath | ||
80 | Specifies maximum VPATH(s) configured for each device function. | ||
81 | Valid range: 1-17 | ||
82 | |||
83 | iv) vlan_tag_strip | ||
84 | Enables/disables vlan tag stripping from all received tagged frames that | ||
85 | are not replicated at the internal L2 switch. | ||
86 | Valid range: 0,1 (disabled, enabled respectively) | ||
87 | Default: 1 | ||
88 | |||
89 | v) addr_learn_en | ||
90 | Enable learning the mac address of the guest OS interface in | ||
91 | virtualization environment. | ||
92 | Valid range: 0,1 (disabled, enabled respectively) | ||
93 | Default: 0 | ||
94 | |||
95 | 4) Troubleshooting: | ||
96 | ------------------- | ||
97 | |||
98 | To resolve an issue with the source code or X3100 series adapter, please collect | ||
99 | the statistics, register dumps using ethtool, relevant logs and email them to | ||
100 | support@neterion.com. | ||
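The parameters in section 3 are passed as name=value pairs when the module is loaded. The sketch below checks each pair against the ranges listed there and prints the resulting modprobe command without running it; the function name vxge_modprobe_cmd is hypothetical, and the module name "vxge" is assumed from the name of this file.

```shell
# Sketch: validate vxge parameters against the ranges in section 3 and
# print the modprobe command. It only prints; it does not load anything.
vxge_modprobe_cmd() {
    cmd="modprobe vxge"
    for arg in "$@"; do
        name=${arg%%=*}
        value=${arg#*=}
        case $name in
            max_config_dev)   lo=1 hi=8 ;;
            max_config_port)  lo=1 hi=2 ;;
            max_config_vpath) lo=1 hi=17 ;;
            vlan_tag_strip|addr_learn_en) lo=0 hi=1 ;;
            *) echo "unknown parameter: $name" >&2; return 1 ;;
        esac
        case $value in
            ''|*[!0-9]*) echo "$name: not a number: $value" >&2; return 1 ;;
        esac
        if [ "$value" -lt "$lo" ] || [ "$value" -gt "$hi" ]; then
            echo "$name=$value is outside $lo-$hi" >&2
            return 1
        fi
        cmd="$cmd $name=$value"
    done
    echo "$cmd"
}
```

For example, vxge_modprobe_cmd max_config_port=2 vlan_tag_strip=0 prints a command that enables both ports and turns off VLAN tag stripping.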
diff --git a/Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/firmware.txt b/Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/firmware.txt index 6c238f59b2a9..249db3a15d15 100644 --- a/Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/firmware.txt +++ b/Documentation/powerpc/dts-bindings/fsl/cpm_qe/qe/firmware.txt | |||
@@ -1,6 +1,6 @@ | |||
1 | * Uploaded QE firmware | 1 | * Uploaded QE firmware |
2 | 2 | ||
3 | If a new firwmare has been uploaded to the QE (usually by the | 3 | If a new firmware has been uploaded to the QE (usually by the |
4 | boot loader), then a 'firmware' child node should be added to the QE | 4 | boot loader), then a 'firmware' child node should be added to the QE |
5 | node. This node provides information on the uploaded firmware that | 5 | node. This node provides information on the uploaded firmware that |
6 | device drivers may need. | 6 | device drivers may need. |
diff --git a/Documentation/powerpc/dts-bindings/mmc-spi-slot.txt b/Documentation/powerpc/dts-bindings/mmc-spi-slot.txt new file mode 100644 index 000000000000..c39ac2891951 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/mmc-spi-slot.txt | |||
@@ -0,0 +1,23 @@ | |||
1 | MMC/SD/SDIO slot directly connected to a SPI bus | ||
2 | |||
3 | Required properties: | ||
4 | - compatible : should be "mmc-spi-slot". | ||
5 | - reg : should specify SPI address (chip-select number). | ||
6 | - spi-max-frequency : maximum frequency for this device (Hz). | ||
7 | - voltage-ranges : two cells are required, first cell specifies minimum | ||
8 | slot voltage (mV), second cell specifies maximum slot voltage (mV). | ||
9 | Several ranges could be specified. | ||
10 | - gpios : (optional) may specify GPIOs in this order: Card-Detect GPIO, | ||
11 | Write-Protect GPIO. | ||
12 | |||
13 | Example: | ||
14 | |||
15 | mmc-slot@0 { | ||
16 | compatible = "fsl,mpc8323rdb-mmc-slot", | ||
17 | "mmc-spi-slot"; | ||
18 | reg = <0>; | ||
19 | gpios = <&qe_pio_d 14 1 | ||
20 | &qe_pio_d 15 0>; | ||
21 | voltage-ranges = <3300 3300>; | ||
22 | spi-max-frequency = <50000000>; | ||
23 | }; | ||
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index 3ef339f491e0..5ba4d3fc625a 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt | |||
@@ -126,7 +126,7 @@ This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_u | |||
126 | to control the CPU time reserved for each control group instead. | 126 | to control the CPU time reserved for each control group instead. |
127 | 127 | ||
128 | For more information on working with control groups, you should read | 128 | For more information on working with control groups, you should read |
129 | Documentation/cgroups.txt as well. | 129 | Documentation/cgroups/cgroups.txt as well. |
130 | 130 | ||
131 | Group settings are checked against the following limits in order to keep the configuration | 131 | Group settings are checked against the following limits in order to keep the configuration |
132 | schedulable: | 132 | schedulable: |
diff --git a/Documentation/slow-work.txt b/Documentation/slow-work.txt new file mode 100644 index 000000000000..ebc50f808ea4 --- /dev/null +++ b/Documentation/slow-work.txt | |||
@@ -0,0 +1,174 @@ | |||
1 | ==================================== | ||
2 | SLOW WORK ITEM EXECUTION THREAD POOL | ||
3 | ==================================== | ||
4 | |||
5 | By: David Howells <dhowells@redhat.com> | ||
6 | |||
7 | The slow work item execution thread pool is a pool of threads for performing | ||
8 | things that take a relatively long time, such as making mkdir calls. | ||
9 | Typically, when processing something, these items will spend a lot of time | ||
10 | blocking a thread on I/O, thus making that thread unavailable for doing other | ||
11 | work. | ||
12 | |||
13 | The standard workqueue model is unsuitable for this class of work item as that | ||
14 | limits the owner to a single thread or a single thread per CPU. For some | ||
15 | tasks, however, more threads - or fewer - are required. | ||
16 | |||
17 | There is just one pool per system. It contains no threads unless something | ||
18 | wants to use it - and that something must register its interest first. When | ||
19 | the pool is active, the number of threads it contains is dynamic, varying | ||
20 | between a maximum and minimum setting, depending on the load. | ||
21 | |||
22 | |||
23 | ==================== | ||
24 | CLASSES OF WORK ITEM | ||
25 | ==================== | ||
26 | |||
27 | This pool supports two classes of work items: | ||
28 | |||
29 | (*) Slow work items. | ||
30 | |||
31 | (*) Very slow work items. | ||
32 | |||
33 | The former are expected to finish much quicker than the latter. | ||
34 | |||
35 | An operation of the very slow class may do a batch combination of several | ||
36 | lookups, mkdirs, and a create for instance. | ||
37 | |||
38 | An operation of the ordinarily slow class may, for example, write stuff or | ||
39 | expand files, provided the time taken to do so isn't too long. | ||
40 | |||
41 | Operations of both types may sleep during execution, thus tying up the threads | ||
42 | loaned to them. | ||
43 | |||
44 | |||
45 | THREAD-TO-CLASS ALLOCATION | ||
46 | -------------------------- | ||
47 | |||
48 | Not all the threads in the pool are available to work on very slow work items. | ||
49 | The number will be between one and one fewer than the number of active threads. | ||
50 | This is configurable (see the "Pool Configuration" section). | ||
51 | |||
52 | All the threads are available to work on ordinarily slow work items, but a | ||
53 | percentage of the threads will prefer to work on very slow work items. | ||
54 | |||
55 | The configuration ensures that at least one thread will be available to work on | ||
56 | very slow work items, and at least one thread will be available that won't work | ||
57 | on very slow work items at all. | ||
58 | |||
59 | |||
60 | ===================== | ||
61 | USING SLOW WORK ITEMS | ||
62 | ===================== | ||
63 | |||
64 | Firstly, a module or subsystem wanting to make use of slow work items must | ||
65 | register its interest: | ||
66 | |||
67 | int ret = slow_work_register_user(); | ||
68 | |||
69 | This will return 0 if successful, or a -ve error upon failure. | ||
70 | |||
71 | |||
72 | Slow work items may then be set up by: | ||
73 | |||
74 | (1) Declaring a slow_work struct type variable: | ||
75 | |||
76 | #include <linux/slow-work.h> | ||
77 | |||
78 | struct slow_work myitem; | ||
79 | |||
80 | (2) Declaring the operations to be used for this item: | ||
81 | |||
82 | struct slow_work_ops myitem_ops = { | ||
83 | .get_ref = myitem_get_ref, | ||
84 | .put_ref = myitem_put_ref, | ||
85 | .execute = myitem_execute, | ||
86 | }; | ||
87 | |||
88 | [*] For a description of the ops, see section "Item Operations". | ||
89 | |||
90 | (3) Initialising the item: | ||
91 | |||
92 | slow_work_init(&myitem, &myitem_ops); | ||
93 | |||
94 | or: | ||
95 | |||
96 | vslow_work_init(&myitem, &myitem_ops); | ||
97 | |||
98 | depending on its class. | ||
99 | |||
100 | A suitably set up work item can then be enqueued for processing: | ||
101 | |||
102 | int ret = slow_work_enqueue(&myitem); | ||
103 | |||
104 | This will return a -ve error if the thread pool is unable to gain a reference | ||
105 | on the item, 0 otherwise. | ||
106 | |||
107 | |||
108 | The items are reference counted, so there ought to be no need for a flush | ||
109 | operation. When all a module's slow work items have been processed, and the | ||
110 | module has no further interest in the facility, it should unregister its | ||
111 | interest: | ||
112 | |||
113 | slow_work_unregister_user(); | ||
114 | |||
115 | |||
116 | =============== | ||
117 | ITEM OPERATIONS | ||
118 | =============== | ||
119 | |||
120 | Each work item requires a table of operations of type struct slow_work_ops. | ||
121 | All members are required: | ||
122 | |||
123 | (*) Get a reference on an item: | ||
124 | |||
125 | int (*get_ref)(struct slow_work *work); | ||
126 | |||
127 | This allows the thread pool to attempt to pin an item by getting a | ||
128 | reference on it. This function should return 0 if the reference was | ||
129 | granted, or a -ve error otherwise. If an error is returned, | ||
130 | slow_work_enqueue() will fail. | ||
131 | |||
132 | The reference is held whilst the item is queued and whilst it is being | ||
133 | executed. The item may then be requeued with the same reference held, or | ||
134 | the reference will be released. | ||
135 | |||
136 | (*) Release a reference on an item: | ||
137 | |||
138 | void (*put_ref)(struct slow_work *work); | ||
139 | |||
140 | This allows the thread pool to unpin an item by releasing the reference on | ||
141 | it. The thread pool will not touch the item again once this has been | ||
142 | called. | ||
143 | |||
144 | (*) Execute an item: | ||
145 | |||
146 | void (*execute)(struct slow_work *work); | ||
147 | |||
148 | This should perform the work required of the item. It may sleep, it may | ||
149 | perform disk I/O and it may wait for locks. | ||
150 | |||
151 | |||
152 | ================== | ||
153 | POOL CONFIGURATION | ||
154 | ================== | ||
155 | |||
156 | The slow-work thread pool has a number of configurables: | ||
157 | |||
158 | (*) /proc/sys/kernel/slow-work/min-threads | ||
159 | |||
160 | The minimum number of threads that should be in the pool whilst it is in | ||
161 | use. This may be anywhere between 2 and max-threads. | ||
162 | |||
163 | (*) /proc/sys/kernel/slow-work/max-threads | ||
164 | |||
165 | The maximum number of threads that should be in the pool. This may be | ||
166 | anywhere between min-threads and 255 or NR_CPUS * 2, whichever is greater. | ||
167 | |||
168 | (*) /proc/sys/kernel/slow-work/vslow-percentage | ||
169 | |||
170 | The percentage of active threads in the pool that may be used to execute | ||
171 | very slow work items. This may be between 1 and 99. The resultant number | ||
172 | is bounded to between 1 and one fewer than the number of active threads. | ||
173 | This ensures there is always at least one thread that can process very | ||
174 | slow work items, and always at least one thread that won't. | ||
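The effect of vslow-percentage can be worked through with a little arithmetic. The sketch below applies the percentage to the number of active threads and then clamps the result to the range 1 to active minus 1, as described above. Rounding down is an assumption, and the function name vslow_threads is hypothetical; this illustrates the stated bounds and is not the kernel's code.

```shell
# Sketch: number of active threads allowed to run very slow work items.
# $1 = active thread count (at least 2, matching the min-threads bound),
# $2 = vslow-percentage (1..99).
# Rounding down is an assumption; the clamp is the bound stated above.
vslow_threads() {
    active=$1
    pct=$2
    n=$(( active * pct / 100 ))
    if [ "$n" -lt 1 ]; then
        n=1
    fi
    if [ "$n" -gt $(( active - 1 )) ]; then
        n=$(( active - 1 ))
    fi
    echo "$n"
}
```

With 4 active threads, a setting of 50 gives 2 threads, while a setting of 1 still gives 1 thread because of the lower bound. The current setting can be read from /proc/sys/kernel/slow-work/vslow-percentage.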
diff --git a/Documentation/sysctl/00-INDEX b/Documentation/sysctl/00-INDEX index a20a9066dc4c..1286f455992f 100644 --- a/Documentation/sysctl/00-INDEX +++ b/Documentation/sysctl/00-INDEX | |||
@@ -10,6 +10,8 @@ fs.txt | |||
10 | - documentation for /proc/sys/fs/*. | 10 | - documentation for /proc/sys/fs/*. |
11 | kernel.txt | 11 | kernel.txt |
12 | - documentation for /proc/sys/kernel/*. | 12 | - documentation for /proc/sys/kernel/*. |
13 | net.txt | ||
14 | - documentation for /proc/sys/net/*. | ||
13 | sunrpc.txt | 15 | sunrpc.txt |
14 | - documentation for /proc/sys/sunrpc/*. | 16 | - documentation for /proc/sys/sunrpc/*. |
15 | vm.txt | 17 | vm.txt |
diff --git a/Documentation/sysctl/fs.txt b/Documentation/sysctl/fs.txt index f99254327ae5..1458448436cc 100644 --- a/Documentation/sysctl/fs.txt +++ b/Documentation/sysctl/fs.txt | |||
@@ -1,5 +1,6 @@ | |||
1 | Documentation for /proc/sys/fs/* kernel version 2.2.10 | 1 | Documentation for /proc/sys/fs/* kernel version 2.2.10 |
2 | (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> | 2 | (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> |
3 | (c) 2009, Shen Feng<shen@cn.fujitsu.com> | ||
3 | 4 | ||
4 | For general info and legal blurb, please look in README. | 5 | For general info and legal blurb, please look in README. |
5 | 6 | ||
@@ -14,7 +15,12 @@ kernel. Since some of the files _can_ be used to screw up your | |||
14 | system, it is advisable to read both documentation and source | 15 | system, it is advisable to read both documentation and source |
15 | before actually making adjustments. | 16 | before actually making adjustments. |
16 | 17 | ||
18 | 1. /proc/sys/fs | ||
19 | ---------------------------------------------------------- | ||
20 | |||
17 | Currently, these files are in /proc/sys/fs: | 21 | Currently, these files are in /proc/sys/fs: |
22 | - aio-max-nr | ||
23 | - aio-nr | ||
18 | - dentry-state | 24 | - dentry-state |
19 | - dquot-max | 25 | - dquot-max |
20 | - dquot-nr | 26 | - dquot-nr |
@@ -30,8 +36,15 @@ Currently, these files are in /proc/sys/fs: | |||
30 | - super-max | 36 | - super-max |
31 | - super-nr | 37 | - super-nr |
32 | 38 | ||
33 | Documentation for the files in /proc/sys/fs/binfmt_misc is | 39 | ============================================================== |
34 | in Documentation/binfmt_misc.txt. | 40 | |
41 | aio-nr & aio-max-nr: | ||
42 | |||
43 | aio-nr is the running total of the number of events specified on the | ||
44 | io_setup system call for all currently active aio contexts. If aio-nr | ||
45 | reaches aio-max-nr then io_setup will fail with EAGAIN. Note that | ||
46 | raising aio-max-nr does not result in the pre-allocation or re-sizing | ||
47 | of any kernel data structures. | ||
35 | 48 | ||
36 | ============================================================== | 49 | ============================================================== |
37 | 50 | ||
@@ -178,3 +191,60 @@ requests. aio-max-nr allows you to change the maximum value | |||
178 | aio-nr can grow to. | 191 | aio-nr can grow to. |
179 | 192 | ||
180 | ============================================================== | 193 | ============================================================== |
194 | |||
195 | |||
196 | 2. /proc/sys/fs/binfmt_misc | ||
197 | ---------------------------------------------------------- | ||
198 | |||
199 | Documentation for the files in /proc/sys/fs/binfmt_misc is | ||
200 | in Documentation/binfmt_misc.txt. | ||
201 | |||
202 | |||
203 | 3. /proc/sys/fs/mqueue - POSIX message queues filesystem | ||
204 | ---------------------------------------------------------- | ||
205 | |||
206 | The "mqueue" filesystem provides the necessary kernel features to enable the | ||
207 | creation of a user space library that implements the POSIX message queues | ||
208 | API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System | ||
209 | Interfaces specification.) | ||
210 | |||
211 | The "mqueue" filesystem contains values for determining/setting the amount of | ||
212 | resources used by the file system. | ||
213 | |||
214 | /proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the | ||
215 | maximum number of message queues allowed on the system. | ||
216 | |||
217 | /proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the | ||
218 | maximum number of messages in a queue. In fact it is the limiting value | ||
219 | for another (user) limit which is set in the mq_open invocation. This attribute | ||
220 | of a queue must be less than or equal to msg_max. | ||
221 | |||
222 | /proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the | ||
223 | maximum message size value (it is every message queue's attribute set during | ||
224 | its creation). | ||
225 | |||
226 | |||
227 | 4. /proc/sys/fs/epoll - Configuration options for the epoll interface | ||
228 | -------------------------------------------------------- | ||
229 | |||
230 | This directory contains configuration options for the epoll(7) interface. | ||
231 | |||
232 | max_user_instances | ||
233 | ------------------ | ||
234 | |||
235 | This is the maximum number of epoll file descriptors that a single user can | ||
236 | have open at a given time. The default value is 128, and should be enough | ||
237 | for normal users. | ||
238 | |||
239 | max_user_watches | ||
240 | ---------------- | ||
241 | |||
242 | Every epoll file descriptor can store a number of files to be monitored | ||
243 | for event readiness. Each one of these monitored files constitutes a "watch". | ||
244 | This configuration option sets the maximum number of "watches" that are | ||
245 | allowed for each user. | ||
246 | Each "watch" costs roughly 90 bytes on a 32bit kernel, and roughly 160 bytes | ||
247 | on a 64bit one. | ||
248 | The current default value for max_user_watches is 1/32 of the available | ||
249 | low memory, divided by the "watch" cost in bytes. | ||
250 | |||
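The default for max_user_watches described above is simple arithmetic, sketched here for illustration. The function name and the example low-memory size are assumptions; the per-watch costs (roughly 90 bytes on 32-bit, roughly 160 bytes on 64-bit) come from the text.

```python
def default_max_user_watches(low_memory_bytes: int, watch_cost_bytes: int) -> int:
    """Return 1/32 of low memory divided by the per-watch cost in bytes.

    Hypothetical helper mirroring the description in the text; the kernel's
    own computation may differ in detail.
    """
    if low_memory_bytes < 0 or watch_cost_bytes <= 0:
        raise ValueError("memory must be non-negative and cost positive")
    return (low_memory_bytes // 32) // watch_cost_bytes


if __name__ == "__main__":
    # Assumed example: 896 MiB of low memory on a 32-bit kernel.
    low_mem = 896 * 1024 * 1024
    print(default_max_user_watches(low_mem, 90))
```

Because the budget is a fixed fraction of low memory, the default scales with the machine rather than being a constant.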
diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index a4ccdd1981cf..f11ca7979fa6 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt | |||
@@ -1,5 +1,6 @@ | |||
1 | Documentation for /proc/sys/kernel/* kernel version 2.2.10 | 1 | Documentation for /proc/sys/kernel/* kernel version 2.2.10 |
2 | (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> | 2 | (c) 1998, 1999, Rik van Riel <riel@nl.linux.org> |
3 | (c) 2009, Shen Feng<shen@cn.fujitsu.com> | ||
3 | 4 | ||
4 | For general info and legal blurb, please look in README. | 5 | For general info and legal blurb, please look in README. |
5 | 6 | ||
@@ -18,6 +19,7 @@ Currently, these files might (depending on your configuration) | |||
18 | show up in /proc/sys/kernel: | 19 | show up in /proc/sys/kernel: |
19 | - acpi_video_flags | 20 | - acpi_video_flags |
20 | - acct | 21 | - acct |
22 | - auto_msgmni | ||
21 | - core_pattern | 23 | - core_pattern |
22 | - core_uses_pid | 24 | - core_uses_pid |
23 | - ctrl-alt-del | 25 | - ctrl-alt-del |
@@ -33,6 +35,7 @@ show up in /proc/sys/kernel: | |||
33 | - msgmax | 35 | - msgmax |
34 | - msgmnb | 36 | - msgmnb |
35 | - msgmni | 37 | - msgmni |
38 | - nmi_watchdog | ||
36 | - osrelease | 39 | - osrelease |
37 | - ostype | 40 | - ostype |
38 | - overflowgid | 41 | - overflowgid |
@@ -40,6 +43,7 @@ show up in /proc/sys/kernel: | |||
40 | - panic | 43 | - panic |
41 | - pid_max | 44 | - pid_max |
42 | - powersave-nap [ PPC only ] | 45 | - powersave-nap [ PPC only ] |
46 | - panic_on_unrecovered_nmi | ||
43 | - printk | 47 | - printk |
44 | - randomize_va_space | 48 | - randomize_va_space |
45 | - real-root-dev ==> Documentation/initrd.txt | 49 | - real-root-dev ==> Documentation/initrd.txt |
@@ -55,6 +59,7 @@ show up in /proc/sys/kernel: | |||
55 | - sysrq ==> Documentation/sysrq.txt | 59 | - sysrq ==> Documentation/sysrq.txt |
56 | - tainted | 60 | - tainted |
57 | - threads-max | 61 | - threads-max |
62 | - unknown_nmi_panic | ||
58 | - version | 63 | - version |
59 | 64 | ||
60 | ============================================================== | 65 | ============================================================== |
@@ -381,3 +386,51 @@ can be ORed together: | |||
381 | 512 - A kernel warning has occurred. | 386 | 512 - A kernel warning has occurred. |
382 | 1024 - A module from drivers/staging was loaded. | 387 | 1024 - A module from drivers/staging was loaded. |
383 | 388 | ||
389 | ============================================================== | ||
390 | |||
391 | auto_msgmni: | ||
392 | |||
393 | Enables/Disables automatic recomputing of msgmni upon memory add/remove or | ||
394 | upon ipc namespace creation/removal (see the msgmni description above). | ||
395 | Echoing "1" into this file enables msgmni automatic recomputing. | ||
396 | Echoing "0" turns it off. | ||
397 | auto_msgmni default value is 1. | ||
398 | |||
399 | ============================================================== | ||
400 | |||
401 | nmi_watchdog: | ||
402 | |||
403 | Enables/Disables the NMI watchdog on x86 systems. When the value is non-zero | ||
404 | the NMI watchdog is enabled and will continuously test all online cpus to | ||
405 | determine whether or not they are still functioning properly. Currently, | ||
406 | passing "nmi_watchdog=" parameter at boot time is required for this function | ||
407 | to work. | ||
408 | |||
409 | If LAPIC NMI watchdog method is in use (nmi_watchdog=2 kernel parameter), the | ||
410 | NMI watchdog shares registers with oprofile. By disabling the NMI watchdog, | ||
411 | oprofile may have more registers to utilize. | ||
412 | |||
413 | ============================================================== | ||
414 | |||
415 | unknown_nmi_panic: | ||
416 | |||
417 | The value in this file affects how NMIs are handled. When the value is | ||
418 | non-zero, an unknown NMI is trapped and the kernel then panics. At that time, | ||
419 | kernel debugging information is displayed on the console. | ||
420 | |||
421 | For example, the NMI switch found on most IA32 servers raises an unknown NMI. | ||
422 | If a system hangs, try pressing the NMI switch. | ||
423 | |||
424 | ============================================================== | ||
425 | |||
426 | panic_on_unrecovered_nmi: | ||
427 | |||
428 | The default Linux behaviour on an NMI caused by a memory error or an unknown | ||
429 | source is to continue operation. For many environments such as scientific | ||
430 | computing it is preferable that the box is taken out and the error dealt with | ||
431 | than that an uncorrected parity/ECC error gets propagated. | ||
432 | |||
433 | A small number of systems do generate NMIs for bizarre random reasons such as | ||
434 | power management, so the default is off. This sysctl works like the existing | ||
435 | panic controls already in that directory. | ||
436 | |||
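The entries documented in kernel.txt are plain files under /proc/sys/kernel, so they can be read like any other file. The sketch below assumes the dotted naming convention used by the sysctl(8) tool (for example "kernel.auto_msgmni"), which this document does not itself describe; both function names are hypothetical.

```python
from pathlib import Path


def sysctl_path(name: str, root: str = "/proc/sys") -> Path:
    """Map a dotted sysctl name to its file under root.

    Assumption: every dot separates a path component, so names whose
    components themselves contain dots are not handled.
    """
    if not name or name.startswith(".") or name.endswith("."):
        raise ValueError("expected a name such as 'kernel.auto_msgmni'")
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name: str, root: str = "/proc/sys") -> str:
    """Return the current value of a sysctl as a stripped string."""
    return sysctl_path(name, root).read_text().strip()


if __name__ == "__main__":
    # These files only exist if the running kernel provides them.
    for entry in ("kernel.auto_msgmni", "kernel.unknown_nmi_panic"):
        try:
            print(entry, "=", read_sysctl(entry))
        except OSError as err:
            print(entry, "is not available:", err)
```

Writing a value works the same way in reverse, but requires root privileges and should be done only after reading the relevant entry's description.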
diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt new file mode 100644 index 000000000000..a34d55b65441 --- /dev/null +++ b/Documentation/sysctl/net.txt | |||
@@ -0,0 +1,175 @@ | |||
1 | Documentation for /proc/sys/net/* kernel version 2.4.0-test11-pre4 | ||
2 | (c) 1999 Terrehon Bowden <terrehon@pacbell.net> | ||
3 | Bodo Bauer <bb@ricochet.net> | ||
4 | (c) 2000 Jorge Nerin <comandante@zaralinux.com> | ||
5 | (c) 2009 Shen Feng <shen@cn.fujitsu.com> | ||
6 | |||
7 | For general info and legal blurb, please look in README. | ||
8 | |||
9 | ============================================================== | ||
10 | |||
11 | This file contains the documentation for the sysctl files in | ||
12 | /proc/sys/net and is valid for Linux kernel version 2.4.0-test11-pre4. | ||
13 | |||
14 | The interface to the networking parts of the kernel is located in | ||
15 | /proc/sys/net. The following table shows all possible subdirectories. You may | ||
16 | see only some of them, depending on your kernel's configuration. | ||
17 | |||
18 | |||
19 | Table : Subdirectories in /proc/sys/net | ||
20 | .............................................................................. | ||
21 | Directory Content Directory Content | ||
22 | core General parameter appletalk Appletalk protocol | ||
23 | unix Unix domain sockets netrom NET/ROM | ||
24 | 802 E802 protocol ax25 AX25 | ||
25 | ethernet Ethernet protocol rose X.25 PLP layer | ||
26 | ipv4 IP version 4 x25 X.25 protocol | ||
27 | ipx IPX token-ring IBM token ring | ||
28 | bridge Bridging decnet DEC net | ||
29 | ipv6 IP version 6 | ||
30 | .............................................................................. | ||
31 | |||
32 | 1. /proc/sys/net/core - Network core options | ||
33 | ------------------------------------------------------- | ||
34 | |||
35 | rmem_default | ||
36 | ------------ | ||
37 | |||
38 | The default setting of the socket receive buffer in bytes. | ||
39 | |||
40 | rmem_max | ||
41 | -------- | ||
42 | |||
43 | The maximum receive socket buffer size in bytes. | ||
44 | |||
45 | wmem_default | ||
46 | ------------ | ||
47 | |||
48 | The default setting (in bytes) of the socket send buffer. | ||
49 | |||
50 | wmem_max | ||
51 | -------- | ||
52 | |||
53 | The maximum send socket buffer size in bytes. | ||
54 | |||
55 | message_burst and message_cost | ||
56 | ------------------------------ | ||
57 | |||
58 | These parameters are used to limit the warning messages written to the kernel | ||
59 | log from the networking code. They enforce a rate limit to make a | ||
60 | denial-of-service attack impossible. A higher message_cost factor results in | ||
61 | fewer messages being written. Message_burst controls when messages will | ||
62 | be dropped. The default settings limit warning messages to one every five | ||
63 | seconds. | ||
64 | |||
65 | warnings | ||
66 | -------- | ||
67 | |||
68 | This controls console messages from the networking stack that can occur because | ||
69 | of problems on the network like duplicate address or bad checksums. Normally, | ||
70 | this should be enabled, but if the problem persists the messages can be | ||
71 | disabled. | ||
72 | |||
73 | netdev_budget | ||
74 | ------------- | ||
75 | |||
76 | Maximum number of packets taken from all interfaces in one polling cycle (NAPI | ||
77 | poll). In one polling cycle interfaces which are registered to polling are | ||
78 | probed in a round-robin manner. The limit of packets in one such probe can be | ||
79 | set per-device via sysfs class/net/<device>/weight. | ||
80 | |||
81 | netdev_max_backlog | ||
82 | ------------------ | ||
83 | |||
84 | Maximum number of packets queued on the INPUT side when the interface | ||
85 | receives packets faster than the kernel can process them. | ||
86 | |||
87 | optmem_max | ||
88 | ---------- | ||
89 | |||
90 | Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence | ||
91 | of struct cmsghdr structures with appended data. | ||
92 | |||
93 | 2. /proc/sys/net/unix - Parameters for Unix domain sockets | ||
94 | ------------------------------------------------------- | ||
95 | |||
96 | There is only one file in this directory. | ||
97 | unix_dgram_qlen limits the max number of datagrams queued in a Unix domain | ||
98 | socket's buffer. It will not take effect unless the PF_UNIX flag is specified. | ||
99 | |||
100 | |||
101 | 3. /proc/sys/net/ipv4 - IPV4 settings | ||
102 | ------------------------------------------------------- | ||
103 | Please see: Documentation/networking/ip-sysctl.txt and ipvs-sysctl.txt for | ||
104 | descriptions of these entries. | ||
105 | |||
106 | |||
107 | 4. Appletalk | ||
108 | ------------------------------------------------------- | ||
109 | |||
110 | The /proc/sys/net/appletalk directory holds the Appletalk configuration data | ||
111 | when Appletalk is loaded. The configurable parameters are: | ||
112 | |||
113 | aarp-expiry-time | ||
114 | ---------------- | ||
115 | |||
116 | The amount of time we keep an ARP entry before expiring it. Used to age out | ||
117 | old hosts. | ||
118 | |||
119 | aarp-resolve-time | ||
120 | ----------------- | ||
121 | |||
122 | The amount of time we will spend trying to resolve an Appletalk address. | ||
123 | |||
124 | aarp-retransmit-limit | ||
125 | --------------------- | ||
126 | |||
127 | The number of times we will retransmit a query before giving up. | ||
128 | |||
129 | aarp-tick-time | ||
130 | -------------- | ||
131 | |||
132 | Controls the rate at which expires are checked. | ||
133 | |||
134 | The directory /proc/net/appletalk holds the list of active Appletalk sockets | ||
135 | on a machine. | ||
136 | |||
137 | The fields indicate the DDP type, the local address (in network:node format), | ||
138 | the remote address, the size of the transmit pending queue, the size of the | ||
139 | received queue (bytes waiting for applications to read), the state and the uid | ||
140 | owning the socket. | ||
141 | |||
142 | /proc/net/atalk_iface lists all the interfaces configured for appletalk. It | ||
143 | shows the name of the interface, its Appletalk address, the network range on | ||
144 | that address (or network number for phase 1 networks), and the status of the | ||
145 | interface. | ||
146 | |||
147 | /proc/net/atalk_route lists each known network route. It lists the target | ||
148 | (network) that the route leads to, the router (may be directly connected), the | ||
149 | route flags, and the device the route is using. | ||
150 | |||
151 | |||
152 | 5. IPX | ||
153 | ------------------------------------------------------- | ||
154 | |||
155 | The IPX protocol has no tunable values in /proc/sys/net. | ||
156 | |||
157 | The IPX protocol does, however, provide /proc/net/ipx. This lists each IPX | ||
158 | socket giving the local and remote addresses in Novell format (that is | ||
159 | network:node:port). In accordance with the strange Novell tradition, | ||
160 | everything but the port is in hex. Not_Connected is displayed for sockets that | ||
161 | are not tied to a specific remote address. The Tx and Rx queue sizes indicate | ||
162 | the number of bytes pending for transmission and reception. The state | ||
163 | indicates the state the socket is in and the uid is the owning uid of the | ||
164 | socket. | ||
165 | |||
166 | The /proc/net/ipx_interface file lists all IPX interfaces. For each interface | ||
167 | it gives the network number, the node number, and indicates if the network is | ||
168 | the primary network. It also indicates which device it is bound to (or | ||
169 | Internal for internal networks) and the Frame Type if appropriate. Linux | ||
170 | supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for | ||
171 | IPX. | ||
172 | |||
173 | The /proc/net/ipx_route table holds a list of IPX routes. For each route it | ||
174 | gives the destination network, the router node (or Directly) and the network | ||
175 | address of the router (or Connected) for internal networks. | ||
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 9e592c718afb..afa2946892da 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt | |||
@@ -81,6 +81,8 @@ On all - write a character to /proc/sysrq-trigger. e.g.: | |||
81 | 81 | ||
82 | 'i' - Send a SIGKILL to all processes, except for init. | 82 | 'i' - Send a SIGKILL to all processes, except for init. |
83 | 83 | ||
84 | 'j' - Forcibly "Just thaw it" - filesystems frozen by the FIFREEZE ioctl. | ||
85 | |||
84 | 'k' - Secure Access Key (SAK) Kills all programs on the current virtual | 86 | 'k' - Secure Access Key (SAK) Kills all programs on the current virtual |
85 | console. NOTE: See important comments below in SAK section. | 87 | console. NOTE: See important comments below in SAK section. |
86 | 88 | ||
@@ -160,6 +162,9 @@ t'E'rm and k'I'll are useful if you have some sort of runaway process you | |||
160 | are unable to kill any other way, especially if it's spawning other | 162 | are unable to kill any other way, especially if it's spawning other |
161 | processes. | 163 | processes. |
162 | 164 | ||
165 | "'J'ust thaw it" is useful if your system becomes unresponsive due to a frozen | ||
166 | (probably root) filesystem via the FIFREEZE ioctl. | ||
167 | |||
163 | * Sometimes SysRq seems to get 'stuck' after using it, what can I do? | 168 | * Sometimes SysRq seems to get 'stuck' after using it, what can I do? |
164 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | 169 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
165 | That happens to me, also. I've found that tapping shift, alt, and control | 170 | That happens to me, also. I've found that tapping shift, alt, and control |
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index 6aaaeb38730c..be45dbb9d7f2 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt | |||
@@ -8,7 +8,8 @@ The current memory policy support was added to Linux 2.6 around May 2004. This | |||
8 | document attempts to describe the concepts and APIs of the 2.6 memory policy | 8 | document attempts to describe the concepts and APIs of the 2.6 memory policy |
9 | support. | 9 | support. |
10 | 10 | ||
11 | Memory policies should not be confused with cpusets (Documentation/cpusets.txt) | 11 | Memory policies should not be confused with cpusets |
12 | (Documentation/cgroups/cpusets.txt) | ||
12 | which is an administrative mechanism for restricting the nodes from which | 13 | which is an administrative mechanism for restricting the nodes from which |
13 | memory may be allocated by a set of processes. Memory policies are a | 14 | memory may be allocated by a set of processes. Memory policies are a |
14 | programming interface that a NUMA-aware application can take advantage of. When | 15 | programming interface that a NUMA-aware application can take advantage of. When |
diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration index d5fdfd34bbaf..6513fe2d90b8 100644 --- a/Documentation/vm/page_migration +++ b/Documentation/vm/page_migration | |||
@@ -37,7 +37,8 @@ locations. | |||
37 | 37 | ||
38 | Larger installations usually partition the system using cpusets into | 38 | Larger installations usually partition the system using cpusets into |
39 | sections of nodes. Paul Jackson has equipped cpusets with the ability to | 39 | sections of nodes. Paul Jackson has equipped cpusets with the ability to |
40 | move pages when a task is moved to another cpuset (See ../cpusets.txt). | 40 | move pages when a task is moved to another cpuset (See |
41 | Documentation/cgroups/cpusets.txt). | ||
41 | Cpusets allows the automation of process locality. If a task is moved to | 42 | Cpusets allows the automation of process locality. If a task is moved to |
42 | a new cpuset then also all its pages are moved with it so that the | 43 | a new cpuset then also all its pages are moved with it so that the |
43 | performance of the process does not sink dramatically. Also the pages | 44 | performance of the process does not sink dramatically. Also the pages |
diff --git a/Documentation/x86/x86_64/fake-numa-for-cpusets b/Documentation/x86/x86_64/fake-numa-for-cpusets index 33bb56655991..0f11d9becb0b 100644 --- a/Documentation/x86/x86_64/fake-numa-for-cpusets +++ b/Documentation/x86/x86_64/fake-numa-for-cpusets | |||
@@ -7,7 +7,8 @@ you can create fake NUMA nodes that represent contiguous chunks of memory and | |||
7 | assign them to cpusets and their attached tasks. This is a way of limiting the | 7 | assign them to cpusets and their attached tasks. This is a way of limiting the |
8 | amount of system memory that are available to a certain class of tasks. | 8 | amount of system memory that are available to a certain class of tasks. |
9 | 9 | ||
10 | For more information on the features of cpusets, see Documentation/cpusets.txt. | 10 | For more information on the features of cpusets, see |
11 | Documentation/cgroups/cpusets.txt. | ||
11 | There are a number of different configurations you can use for your needs. For | 12 | There are a number of different configurations you can use for your needs. For |
12 | more information on the numa=fake command line option and its various ways of | 13 | more information on the numa=fake command line option and its various ways of |
13 | configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. | 14 | configuring fake nodes, see Documentation/x86/x86_64/boot-options.txt. |
@@ -32,7 +33,7 @@ A machine may be split as follows with "numa=fake=4*512," as reported by dmesg: | |||
32 | On node 3 totalpages: 131072 | 33 | On node 3 totalpages: 131072 |
33 | 34 | ||
34 | Now following the instructions for mounting the cpusets filesystem from | 35 | Now following the instructions for mounting the cpusets filesystem from |
35 | Documentation/cpusets.txt, you can assign fake nodes (i.e. contiguous memory | 36 | Documentation/cgroups/cpusets.txt, you can assign fake nodes (i.e. contiguous memory |
36 | address spaces) to individual cpusets: | 37 | address spaces) to individual cpusets: |
37 | 38 | ||
38 | [root@xroads /]# mkdir exampleset | 39 | [root@xroads /]# mkdir exampleset |