Diffstat (limited to 'Documentation')
22 files changed, 1639 insertions, 228 deletions
diff --git a/Documentation/ABI/testing/sysfs-class-regulator b/Documentation/ABI/testing/sysfs-class-regulator index 3731f6f29bcb..873ef1fc1569 100644 --- a/Documentation/ABI/testing/sysfs-class-regulator +++ b/Documentation/ABI/testing/sysfs-class-regulator | |||
@@ -3,8 +3,9 @@ Date: April 2008 | |||
3 | KernelVersion: 2.6.26 | 3 | KernelVersion: 2.6.26 |
4 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 4 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
5 | Description: | 5 | Description: |
6 | Each regulator directory will contain a field called | 6 | Some regulator directories will contain a field called |
7 | state. This holds the regulator output state. | 7 | state. This reports the regulator enable status, for |
8 | regulators which can report that value. | ||
8 | 9 | ||
9 | This will be one of the following strings: | 10 | This will be one of the following strings: |
10 | 11 | ||
@@ -18,7 +19,8 @@ Description: | |||
18 | 'disabled' means the regulator output is OFF and is not | 19 | 'disabled' means the regulator output is OFF and is not |
19 | supplying power to the system. | 20 | supplying power to the system. |
20 | 21 | ||
21 | 'unknown' means software cannot determine the state. | 22 | 'unknown' means software cannot determine the state, or |
23 | the reported state is invalid. | ||
22 | 24 | ||
23 | NOTE: this field can be used in conjunction with microvolts | 25 | NOTE: this field can be used in conjunction with microvolts |
24 | and microamps to determine regulator output levels. | 26 | and microamps to determine regulator output levels. |
@@ -53,9 +55,10 @@ Date: April 2008 | |||
53 | KernelVersion: 2.6.26 | 55 | KernelVersion: 2.6.26 |
54 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 56 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
55 | Description: | 57 | Description: |
56 | Each regulator directory will contain a field called | 58 | Some regulator directories will contain a field called |
57 | microvolts. This holds the regulator output voltage setting | 59 | microvolts. This holds the regulator output voltage setting |
58 | measured in microvolts (i.e. E-6 Volts). | 60 | measured in microvolts (i.e. E-6 Volts), for regulators |
61 | which can report that voltage. | ||
59 | 62 | ||
60 | NOTE: This value should not be used to determine the regulator | 63 | NOTE: This value should not be used to determine the regulator |
61 | output voltage level as this value is the same regardless of | 64 | output voltage level as this value is the same regardless of |
@@ -67,9 +70,10 @@ Date: April 2008 | |||
67 | KernelVersion: 2.6.26 | 70 | KernelVersion: 2.6.26 |
68 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 71 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
69 | Description: | 72 | Description: |
70 | Each regulator directory will contain a field called | 73 | Some regulator directories will contain a field called |
71 | microamps. This holds the regulator output current limit | 74 | microamps. This holds the regulator output current limit |
72 | setting measured in microamps (i.e. E-6 Amps). | 75 | setting measured in microamps (i.e. E-6 Amps), for regulators |
76 | which can report that current. | ||
73 | 77 | ||
74 | NOTE: This value should not be used to determine the regulator | 78 | NOTE: This value should not be used to determine the regulator |
75 | output current level as this value is the same regardless of | 79 | output current level as this value is the same regardless of |
@@ -81,8 +85,9 @@ Date: April 2008 | |||
81 | KernelVersion: 2.6.26 | 85 | KernelVersion: 2.6.26 |
82 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 86 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
83 | Description: | 87 | Description: |
84 | Each regulator directory will contain a field called | 88 | Some regulator directories will contain a field called |
85 | opmode. This holds the regulator operating mode setting. | 89 | opmode. This holds the current regulator operating mode, |
90 | for regulators which can report it. | ||
86 | 91 | ||
87 | The opmode value can be one of the following strings: | 92 | The opmode value can be one of the following strings: |
88 | 93 | ||
@@ -92,7 +97,7 @@ Description: | |||
92 | 'standby' | 97 | 'standby' |
93 | 'unknown' | 98 | 'unknown' |
94 | 99 | ||
95 | The modes are described in include/linux/regulator/regulator.h | 100 | The modes are described in include/linux/regulator/consumer.h |
96 | 101 | ||
97 | NOTE: This value should not be used to determine the regulator | 102 | NOTE: This value should not be used to determine the regulator |
98 | output operating mode as this value is the same regardless of | 103 | output operating mode as this value is the same regardless of |
@@ -104,9 +109,10 @@ Date: April 2008 | |||
104 | KernelVersion: 2.6.26 | 109 | KernelVersion: 2.6.26 |
105 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 110 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
106 | Description: | 111 | Description: |
107 | Each regulator directory will contain a field called | 112 | Some regulator directories will contain a field called |
108 | min_microvolts. This holds the minimum safe working regulator | 113 | min_microvolts. This holds the minimum safe working regulator |
109 | output voltage setting for this domain measured in microvolts. | 114 | output voltage setting for this domain measured in microvolts, |
115 | for regulators which support voltage constraints. | ||
110 | 116 | ||
111 | NOTE: this will return the string 'constraint not defined' if | 117 | NOTE: this will return the string 'constraint not defined' if |
112 | the power domain has no min microvolts constraint defined by | 118 | the power domain has no min microvolts constraint defined by |
@@ -118,9 +124,10 @@ Date: April 2008 | |||
118 | KernelVersion: 2.6.26 | 124 | KernelVersion: 2.6.26 |
119 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 125 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
120 | Description: | 126 | Description: |
121 | Each regulator directory will contain a field called | 127 | Some regulator directories will contain a field called |
122 | max_microvolts. This holds the maximum safe working regulator | 128 | max_microvolts. This holds the maximum safe working regulator |
123 | output voltage setting for this domain measured in microvolts. | 129 | output voltage setting for this domain measured in microvolts, |
130 | for regulators which support voltage constraints. | ||
124 | 131 | ||
125 | NOTE: this will return the string 'constraint not defined' if | 132 | NOTE: this will return the string 'constraint not defined' if |
126 | the power domain has no max microvolts constraint defined by | 133 | the power domain has no max microvolts constraint defined by |
@@ -132,10 +139,10 @@ Date: April 2008 | |||
132 | KernelVersion: 2.6.26 | 139 | KernelVersion: 2.6.26 |
133 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 140 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
134 | Description: | 141 | Description: |
135 | Each regulator directory will contain a field called | 142 | Some regulator directories will contain a field called |
136 | min_microamps. This holds the minimum safe working regulator | 143 | min_microamps. This holds the minimum safe working regulator |
137 | output current limit setting for this domain measured in | 144 | output current limit setting for this domain measured in |
138 | microamps. | 145 | microamps, for regulators which support current constraints. |
139 | 146 | ||
140 | NOTE: this will return the string 'constraint not defined' if | 147 | NOTE: this will return the string 'constraint not defined' if |
141 | the power domain has no min microamps constraint defined by | 148 | the power domain has no min microamps constraint defined by |
@@ -147,10 +154,10 @@ Date: April 2008 | |||
147 | KernelVersion: 2.6.26 | 154 | KernelVersion: 2.6.26 |
148 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 155 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
149 | Description: | 156 | Description: |
150 | Each regulator directory will contain a field called | 157 | Some regulator directories will contain a field called |
151 | max_microamps. This holds the maximum safe working regulator | 158 | max_microamps. This holds the maximum safe working regulator |
152 | output current limit setting for this domain measured in | 159 | output current limit setting for this domain measured in |
153 | microamps. | 160 | microamps, for regulators which support current constraints. |
154 | 161 | ||
155 | NOTE: this will return the string 'constraint not defined' if | 162 | NOTE: this will return the string 'constraint not defined' if |
156 | the power domain has no max microamps constraint defined by | 163 | the power domain has no max microamps constraint defined by |
@@ -185,7 +192,7 @@ Date: April 2008 | |||
185 | KernelVersion: 2.6.26 | 192 | KernelVersion: 2.6.26 |
186 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 193 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
187 | Description: | 194 | Description: |
188 | Each regulator directory will contain a field called | 195 | Some regulator directories will contain a field called |
189 | requested_microamps. This holds the total requested load | 196 | requested_microamps. This holds the total requested load |
190 | current in microamps for this regulator from all its consumer | 197 | current in microamps for this regulator from all its consumer |
191 | devices. | 198 | devices. |
@@ -204,125 +211,102 @@ Date: May 2008 | |||
204 | KernelVersion: 2.6.26 | 211 | KernelVersion: 2.6.26 |
205 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 212 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
206 | Description: | 213 | Description: |
207 | Each regulator directory will contain a field called | 214 | Some regulator directories will contain a field called |
208 | suspend_mem_microvolts. This holds the regulator output | 215 | suspend_mem_microvolts. This holds the regulator output |
209 | voltage setting for this domain measured in microvolts when | 216 | voltage setting for this domain measured in microvolts when |
210 | the system is suspended to memory. | 217 | the system is suspended to memory, for voltage regulators |
211 | 218 | implementing suspend voltage configuration constraints. | |
212 | NOTE: this will return the string 'not defined' if | ||
213 | the power domain has no suspend to memory voltage defined by | ||
214 | platform code. | ||
215 | 219 | ||
216 | What: /sys/class/regulator/.../suspend_disk_microvolts | 220 | What: /sys/class/regulator/.../suspend_disk_microvolts |
217 | Date: May 2008 | 221 | Date: May 2008 |
218 | KernelVersion: 2.6.26 | 222 | KernelVersion: 2.6.26 |
219 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 223 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
220 | Description: | 224 | Description: |
221 | Each regulator directory will contain a field called | 225 | Some regulator directories will contain a field called |
222 | suspend_disk_microvolts. This holds the regulator output | 226 | suspend_disk_microvolts. This holds the regulator output |
223 | voltage setting for this domain measured in microvolts when | 227 | voltage setting for this domain measured in microvolts when |
224 | the system is suspended to disk. | 228 | the system is suspended to disk, for voltage regulators |
225 | 229 | implementing suspend voltage configuration constraints. | |
226 | NOTE: this will return the string 'not defined' if | ||
227 | the power domain has no suspend to disk voltage defined by | ||
228 | platform code. | ||
229 | 230 | ||
230 | What: /sys/class/regulator/.../suspend_standby_microvolts | 231 | What: /sys/class/regulator/.../suspend_standby_microvolts |
231 | Date: May 2008 | 232 | Date: May 2008 |
232 | KernelVersion: 2.6.26 | 233 | KernelVersion: 2.6.26 |
233 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 234 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
234 | Description: | 235 | Description: |
235 | Each regulator directory will contain a field called | 236 | Some regulator directories will contain a field called |
236 | suspend_standby_microvolts. This holds the regulator output | 237 | suspend_standby_microvolts. This holds the regulator output |
237 | voltage setting for this domain measured in microvolts when | 238 | voltage setting for this domain measured in microvolts when |
238 | the system is suspended to standby. | 239 | the system is suspended to standby, for voltage regulators |
239 | 240 | implementing suspend voltage configuration constraints. | |
240 | NOTE: this will return the string 'not defined' if | ||
241 | the power domain has no suspend to standby voltage defined by | ||
242 | platform code. | ||
243 | 241 | ||
244 | What: /sys/class/regulator/.../suspend_mem_mode | 242 | What: /sys/class/regulator/.../suspend_mem_mode |
245 | Date: May 2008 | 243 | Date: May 2008 |
246 | KernelVersion: 2.6.26 | 244 | KernelVersion: 2.6.26 |
247 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 245 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
248 | Description: | 246 | Description: |
249 | Each regulator directory will contain a field called | 247 | Some regulator directories will contain a field called |
250 | suspend_mem_mode. This holds the regulator operating mode | 248 | suspend_mem_mode. This holds the regulator operating mode |
251 | setting for this domain when the system is suspended to | 249 | setting for this domain when the system is suspended to |
252 | memory. | 250 | memory, for regulators implementing suspend mode |
253 | 251 | configuration constraints. | |
254 | NOTE: this will return the string 'not defined' if | ||
255 | the power domain has no suspend to memory mode defined by | ||
256 | platform code. | ||
257 | 252 | ||
258 | What: /sys/class/regulator/.../suspend_disk_mode | 253 | What: /sys/class/regulator/.../suspend_disk_mode |
259 | Date: May 2008 | 254 | Date: May 2008 |
260 | KernelVersion: 2.6.26 | 255 | KernelVersion: 2.6.26 |
261 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 256 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
262 | Description: | 257 | Description: |
263 | Each regulator directory will contain a field called | 258 | Some regulator directories will contain a field called |
264 | suspend_disk_mode. This holds the regulator operating mode | 259 | suspend_disk_mode. This holds the regulator operating mode |
265 | setting for this domain when the system is suspended to disk. | 260 | setting for this domain when the system is suspended to disk, |
266 | 261 | for regulators implementing suspend mode configuration | |
267 | NOTE: this will return the string 'not defined' if | 262 | constraints. |
268 | the power domain has no suspend to disk mode defined by | ||
269 | platform code. | ||
270 | 263 | ||
271 | What: /sys/class/regulator/.../suspend_standby_mode | 264 | What: /sys/class/regulator/.../suspend_standby_mode |
272 | Date: May 2008 | 265 | Date: May 2008 |
273 | KernelVersion: 2.6.26 | 266 | KernelVersion: 2.6.26 |
274 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 267 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
275 | Description: | 268 | Description: |
276 | Each regulator directory will contain a field called | 269 | Some regulator directories will contain a field called |
277 | suspend_standby_mode. This holds the regulator operating mode | 270 | suspend_standby_mode. This holds the regulator operating mode |
278 | setting for this domain when the system is suspended to | 271 | setting for this domain when the system is suspended to |
279 | standby. | 272 | standby, for regulators implementing suspend mode |
280 | 273 | configuration constraints. | |
281 | NOTE: this will return the string 'not defined' if | ||
282 | the power domain has no suspend to standby mode defined by | ||
283 | platform code. | ||
284 | 274 | ||
285 | What: /sys/class/regulator/.../suspend_mem_state | 275 | What: /sys/class/regulator/.../suspend_mem_state |
286 | Date: May 2008 | 276 | Date: May 2008 |
287 | KernelVersion: 2.6.26 | 277 | KernelVersion: 2.6.26 |
288 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 278 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
289 | Description: | 279 | Description: |
290 | Each regulator directory will contain a field called | 280 | Some regulator directories will contain a field called |
291 | suspend_mem_state. This holds the regulator operating state | 281 | suspend_mem_state. This holds the regulator operating state |
292 | when suspended to memory. | 282 | when suspended to memory, for regulators implementing suspend |
293 | 283 | configuration constraints. | |
294 | This will be one of the following strings: | ||
295 | 284 | ||
296 | 'enabled' | 285 | This will be one of the same strings reported by |
297 | 'disabled' | 286 | the "state" attribute. |
298 | 'not defined' | ||
299 | 287 | ||
300 | What: /sys/class/regulator/.../suspend_disk_state | 288 | What: /sys/class/regulator/.../suspend_disk_state |
301 | Date: May 2008 | 289 | Date: May 2008 |
302 | KernelVersion: 2.6.26 | 290 | KernelVersion: 2.6.26 |
303 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 291 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
304 | Description: | 292 | Description: |
305 | Each regulator directory will contain a field called | 293 | Some regulator directories will contain a field called |
306 | suspend_disk_state. This holds the regulator operating state | 294 | suspend_disk_state. This holds the regulator operating state |
307 | when suspended to disk. | 295 | when suspended to disk, for regulators implementing |
308 | 296 | suspend configuration constraints. | |
309 | This will be one of the following strings: | ||
310 | 297 | ||
311 | 'enabled' | 298 | This will be one of the same strings reported by |
312 | 'disabled' | 299 | the "state" attribute. |
313 | 'not defined' | ||
314 | 300 | ||
315 | What: /sys/class/regulator/.../suspend_standby_state | 301 | What: /sys/class/regulator/.../suspend_standby_state |
316 | Date: May 2008 | 302 | Date: May 2008 |
317 | KernelVersion: 2.6.26 | 303 | KernelVersion: 2.6.26 |
318 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> | 304 | Contact: Liam Girdwood <lrg@slimlogic.co.uk> |
319 | Description: | 305 | Description: |
320 | Each regulator directory will contain a field called | 306 | Some regulator directories will contain a field called |
321 | suspend_standby_state. This holds the regulator operating | 307 | suspend_standby_state. This holds the regulator operating |
322 | state when suspended to standby. | 308 | state when suspended to standby, for regulators implementing |
323 | 309 | suspend configuration constraints. | |
324 | This will be one of the following strings: | ||
325 | 310 | ||
326 | 'enabled' | 311 | This will be one of the same strings reported by |
327 | 'disabled' | 312 | the "state" attribute. |
328 | 'not defined' | ||
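The revised ABI text above says that only some regulator directories contain each attribute, and that state is meant to be read together with microvolts. A minimal user-space sketch of a reader that tolerates missing attributes might look like the following. The helper names read_regulator_attr, output_microvolts and summarize are hypothetical and are not part of the patch; the only things taken from the ABI file are the /sys/class/regulator path and the attribute names.

```python
import os


def read_regulator_attr(regulator_dir, name):
    """Return the attribute text with whitespace stripped, or None if absent.

    Not every regulator exposes every attribute, so a missing file is
    treated as "not reported" rather than as an error.
    """
    path = os.path.join(regulator_dir, name)
    try:
        with open(path) as f:
            return f.read().strip()
    except FileNotFoundError:
        return None


def output_microvolts(regulator_dir):
    """Return the voltage setting only when state reads 'enabled'.

    The microvolts attribute alone is a setting, not a measurement, so it
    is combined with state here. Any other state, including 'unknown' or
    a missing attribute, yields None.
    """
    state = read_regulator_attr(regulator_dir, "state")
    microvolts = read_regulator_attr(regulator_dir, "microvolts")
    if state != "enabled" or microvolts is None:
        return None
    return int(microvolts)


def summarize(class_dir="/sys/class/regulator"):
    """Map each regulator directory name to its output voltage or None."""
    if not os.path.isdir(class_dir):
        return {}
    return {
        name: output_microvolts(os.path.join(class_dir, name))
        for name in sorted(os.listdir(class_dir))
    }
```

Calling summarize() on a machine without the regulator class simply returns an empty dictionary, which keeps the sketch safe to run anywhere.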
diff --git a/Documentation/DocBook/Makefile b/Documentation/DocBook/Makefile index 0a08126d3094..dc3154e49279 100644 --- a/Documentation/DocBook/Makefile +++ b/Documentation/DocBook/Makefile | |||
@@ -12,7 +12,7 @@ DOCBOOKS := z8530book.xml mcabook.xml \ | |||
12 | kernel-api.xml filesystems.xml lsm.xml usb.xml kgdb.xml \ | 12 | kernel-api.xml filesystems.xml lsm.xml usb.xml kgdb.xml \ |
13 | gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml \ | 13 | gadget.xml libata.xml mtdnand.xml librs.xml rapidio.xml \ |
14 | genericirq.xml s390-drivers.xml uio-howto.xml scsi.xml \ | 14 | genericirq.xml s390-drivers.xml uio-howto.xml scsi.xml \ |
15 | mac80211.xml debugobjects.xml sh.xml | 15 | mac80211.xml debugobjects.xml sh.xml regulator.xml |
16 | 16 | ||
17 | ### | 17 | ### |
18 | # The build process is as follows (targets): | 18 | # The build process is as follows (targets): |
diff --git a/Documentation/DocBook/regulator.tmpl b/Documentation/DocBook/regulator.tmpl new file mode 100644 index 000000000000..53f4f8d3b810 --- /dev/null +++ b/Documentation/DocBook/regulator.tmpl | |||
@@ -0,0 +1,304 @@ | |||
1 | <?xml version="1.0" encoding="UTF-8"?> | ||
2 | <!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN" | ||
3 | "http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []> | ||
4 | |||
5 | <book id="regulator-api"> | ||
6 | <bookinfo> | ||
7 | <title>Voltage and current regulator API</title> | ||
8 | |||
9 | <authorgroup> | ||
10 | <author> | ||
11 | <firstname>Liam</firstname> | ||
12 | <surname>Girdwood</surname> | ||
13 | <affiliation> | ||
14 | <address> | ||
15 | <email>lrg@slimlogic.co.uk</email> | ||
16 | </address> | ||
17 | </affiliation> | ||
18 | </author> | ||
19 | <author> | ||
20 | <firstname>Mark</firstname> | ||
21 | <surname>Brown</surname> | ||
22 | <affiliation> | ||
23 | <orgname>Wolfson Microelectronics</orgname> | ||
24 | <address> | ||
25 | <email>broonie@opensource.wolfsonmicro.com</email> | ||
26 | </address> | ||
27 | </affiliation> | ||
28 | </author> | ||
29 | </authorgroup> | ||
30 | |||
31 | <copyright> | ||
32 | <year>2007-2008</year> | ||
33 | <holder>Wolfson Microelectronics</holder> | ||
34 | </copyright> | ||
35 | <copyright> | ||
36 | <year>2008</year> | ||
37 | <holder>Liam Girdwood</holder> | ||
38 | </copyright> | ||
39 | |||
40 | <legalnotice> | ||
41 | <para> | ||
42 | This documentation is free software; you can redistribute | ||
43 | it and/or modify it under the terms of the GNU General Public | ||
44 | License version 2 as published by the Free Software Foundation. | ||
45 | </para> | ||
46 | |||
47 | <para> | ||
48 | This program is distributed in the hope that it will be | ||
49 | useful, but WITHOUT ANY WARRANTY; without even the implied | ||
50 | warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. | ||
51 | See the GNU General Public License for more details. | ||
52 | </para> | ||
53 | |||
54 | <para> | ||
55 | You should have received a copy of the GNU General Public | ||
56 | License along with this program; if not, write to the Free | ||
57 | Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, | ||
58 | MA 02111-1307 USA | ||
59 | </para> | ||
60 | |||
61 | <para> | ||
62 | For more details see the file COPYING in the source | ||
63 | distribution of Linux. | ||
64 | </para> | ||
65 | </legalnotice> | ||
66 | </bookinfo> | ||
67 | |||
68 | <toc></toc> | ||
69 | |||
70 | <chapter id="intro"> | ||
71 | <title>Introduction</title> | ||
72 | <para> | ||
73 | This framework is designed to provide a standard kernel | ||
74 | interface to control voltage and current regulators. | ||
75 | </para> | ||
76 | <para> | ||
77 | The intention is to allow systems to dynamically control | ||
78 | regulator power output in order to save power and prolong | ||
79 | battery life. This applies to both voltage regulators (where | ||
80 | voltage output is controllable) and current sinks (where current | ||
81 | limit is controllable). | ||
82 | </para> | ||
83 | <para> | ||
84 | Note that additional (and currently more complete) documentation | ||
85 | is available in the Linux kernel source under | ||
86 | <filename>Documentation/power/regulator</filename>. | ||
87 | </para> | ||
88 | |||
89 | <sect1 id="glossary"> | ||
90 | <title>Glossary</title> | ||
91 | <para> | ||
92 | The regulator API uses a number of terms which may not be | ||
93 | familiar: | ||
94 | </para> | ||
95 | <glossary> | ||
96 | |||
97 | <glossentry> | ||
98 | <glossterm>Regulator</glossterm> | ||
99 | <glossdef> | ||
100 | <para> | ||
101 | Electronic device that supplies power to other devices. Most | ||
102 | regulators can enable and disable their output and some can also | ||
103 | control their output voltage or current. | ||
104 | </para> | ||
105 | </glossdef> | ||
106 | </glossentry> | ||
107 | |||
108 | <glossentry> | ||
109 | <glossterm>Consumer</glossterm> | ||
110 | <glossdef> | ||
111 | <para> | ||
112 | Electronic device which consumes power provided by a regulator. | ||
113 | These may either be static, requiring only a fixed supply, or | ||
114 | dynamic, requiring active management of the regulator at | ||
115 | runtime. | ||
116 | </para> | ||
117 | </glossdef> | ||
118 | </glossentry> | ||
119 | |||
120 | <glossentry> | ||
121 | <glossterm>Power Domain</glossterm> | ||
122 | <glossdef> | ||
123 | <para> | ||
124 | The electronic circuit supplied by a given regulator, including | ||
125 | the regulator and all consumer devices. The configuration of | ||
126 | the regulator is shared between all the components in the | ||
127 | circuit. | ||
128 | </para> | ||
129 | </glossdef> | ||
130 | </glossentry> | ||
131 | |||
132 | <glossentry> | ||
133 | <glossterm>Power Management Integrated Circuit</glossterm> | ||
134 | <acronym>PMIC</acronym> | ||
135 | <glossdef> | ||
136 | <para> | ||
137 | An IC which contains numerous regulators and often also other | ||
138 | subsystems. In an embedded system the primary PMIC is often | ||
139 | equivalent to a combination of the PSU and southbridge in a | ||
140 | desktop system. | ||
141 | </para> | ||
142 | </glossdef> | ||
143 | </glossentry> | ||
144 | </glossary> | ||
145 | </sect1> | ||
146 | </chapter> | ||
147 | |||
148 | <chapter id="consumer"> | ||
149 | <title>Consumer driver interface</title> | ||
150 | <para> | ||
151 | This offers a similar API to the kernel clock framework. | ||
152 | Consumer drivers use <link | ||
153 | linkend='API-regulator-get'>get</link> and <link | ||
154 | linkend='API-regulator-put'>put</link> operations to acquire and | ||
155 | release regulators. Functions are | ||
156 | provided to <link linkend='API-regulator-enable'>enable</link> | ||
157 | and <link linkend='API-regulator-disable'>disable</link> the | ||
158 | regulator and to get and set the runtime parameters of the | ||
159 | regulator. | ||
160 | </para> | ||
161 | <para> | ||
162 | When requesting regulators consumers use symbolic names for their | ||
163 | supplies, such as "Vcc", which are mapped into actual regulator | ||
164 | devices by the machine interface. | ||
165 | </para> | ||
166 | <para> | ||
167 | A stub version of this API is provided when the regulator | ||
168 | framework is not in use in order to minimise the need to use | ||
169 | ifdefs. | ||
170 | </para> | ||
171 | |||
172 | <sect1 id="consumer-enable"> | ||
173 | <title>Enabling and disabling</title> | ||
174 | <para> | ||
175 | The regulator API provides reference counted enabling and | ||
176 | disabling of regulators. Consumer devices use the <function><link | ||
177 | linkend='API-regulator-enable'>regulator_enable</link></function> | ||
178 | and <function><link | ||
179 | linkend='API-regulator-disable'>regulator_disable</link> | ||
180 | </function> functions to enable and disable regulators. Calls | ||
181 | to the two functions must be balanced. | ||
182 | </para> | ||
183 | <para> | ||
184 | Note that since multiple consumers may be using a regulator and | ||
185 | machine constraints may not allow the regulator to be disabled | ||
186 | there is no guarantee that calling | ||
187 | <function>regulator_disable</function> will actually cause the | ||
188 | supply provided by the regulator to be disabled. Consumer | ||
189 | drivers should assume that the regulator may be enabled at all | ||
190 | times. | ||
191 | </para> | ||
192 | </sect1> | ||
193 | |||
194 | <sect1 id="consumer-config"> | ||
195 | <title>Configuration</title> | ||
196 | <para> | ||
197 | Some consumer devices may need to be able to dynamically | ||
198 | configure their supplies. For example, MMC drivers may need to | ||
199 | select the correct operating voltage for their cards. This may | ||
200 | be done while the regulator is enabled or disabled. | ||
201 | </para> | ||
202 | <para> | ||
203 | The <function><link | ||
204 | linkend='API-regulator-set-voltage'>regulator_set_voltage</link> | ||
205 | </function> and <function><link | ||
206 | linkend='API-regulator-set-current-limit' | ||
207 | >regulator_set_current_limit</link> | ||
208 | </function> functions provide the primary interface for this. | ||
209 | Both take ranges of voltages and currents, supporting drivers | ||
210 | that do not require a specific value (eg, CPU frequency scaling | ||
211 | normally permits the CPU to use a wider range of supply | ||
212 | voltages at lower frequencies but does not require that the | ||
213 | supply voltage be lowered). Where an exact value is required | ||
214 | both minimum and maximum values should be identical. | ||
215 | </para> | ||
216 | </sect1> | ||
217 | |||
218 | <sect1 id="consumer-callback"> | ||
219 | <title>Callbacks</title> | ||
220 | <para> | ||
221 | Callbacks may also be <link | ||
222 | linkend='API-regulator-register-notifier'>registered</link> | ||
223 | for events such as regulation failures. | ||
224 | </para> | ||
225 | </sect1> | ||
226 | </chapter> | ||
227 | |||
228 | <chapter id="driver"> | ||
229 | <title>Regulator driver interface</title> | ||
230 | <para> | ||
231 | Drivers for regulator chips <link | ||
232 | linkend='API-regulator-register'>register</link> the regulators | ||
233 | with the regulator core, providing operations structures to the | ||
234 | core. A <link | ||
235 | linkend='API-regulator-notifier-call-chain'>notifier</link> interface | ||
236 | allows error conditions to be reported to the core. | ||
237 | </para> | ||
238 | <para> | ||
239 | Registration should be triggered by explicit setup done by the | ||
240 | platform, supplying a <link | ||
241 | linkend='API-struct-regulator-init-data'>struct | ||
242 | regulator_init_data</link> for the regulator containing | ||
243 | <link linkend='machine-constraint'>constraint</link> and | ||
244 | <link linkend='machine-supply'>supply</link> information. | ||
245 | </para> | ||
246 | </chapter> | ||
247 | |||
248 | <chapter id="machine"> | ||
249 | <title>Machine interface</title> | ||
250 | <para> | ||
251 | This interface provides a way to define how regulators are | ||
252 | connected to consumers on a given system and what the valid | ||
253 | operating parameters are for the system. | ||
254 | </para> | ||
255 | |||
256 | <sect1 id="machine-supply"> | ||
257 | <title>Supplies</title> | ||
258 | <para> | ||
259 | Regulator supplies are specified using <link | ||
260 | linkend='API-struct-regulator-consumer-supply'>struct | ||
261 | regulator_consumer_supply</link>. This is done at | ||
262 | <link linkend='driver'>driver registration | ||
263 | time</link> as part of the machine constraints. | ||
264 | </para> | ||
265 | </sect1> | ||
266 | |||
267 | <sect1 id="machine-constraint"> | ||
268 | <title>Constraints</title> | ||
269 | <para> | ||
270 | As well as defining the connections the machine interface | ||
271 | also provides constraints defining the operations that | ||
272 | clients are allowed to perform and the parameters that may be | ||
273 | set. This is required since generally regulator devices will | ||
274 | offer more flexibility than it is safe to use on a given | ||
275 | system, for example supporting higher supply voltages than the | ||
276 | consumers are rated for. | ||
277 | </para> | ||
278 | <para> | ||
279 | This is done at <link linkend='driver'>driver | ||
280 | registration time</link> by providing a <link | ||
281 | linkend='API-struct-regulation-constraints'>struct | ||
282 | regulation_constraints</link>. | ||
283 | </para> | ||
284 | <para> | ||
285 | The constraints may also specify an initial configuration for the | ||
286 | regulator, which is particularly useful for | ||
287 | use with static consumers. | ||
288 | </para> | ||
289 | </sect1> | ||
290 | </chapter> | ||
291 | |||
292 | <chapter id="api"> | ||
293 | <title>API reference</title> | ||
294 | <para> | ||
295 | Due to limitations of the kernel documentation framework and the | ||
296 | existing layout of the source code the entire regulator API is | ||
297 | documented here. | ||
298 | </para> | ||
299 | !Iinclude/linux/regulator/consumer.h | ||
300 | !Iinclude/linux/regulator/machine.h | ||
301 | !Iinclude/linux/regulator/driver.h | ||
302 | !Edrivers/regulator/core.c | ||
303 | </chapter> | ||
304 | </book> | ||
diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX index 7dc0695a8f90..9bb62f7b89c3 100644 --- a/Documentation/RCU/00-INDEX +++ b/Documentation/RCU/00-INDEX | |||
@@ -12,6 +12,8 @@ rcuref.txt | |||
12 | - Reference-count design for elements of lists/arrays protected by RCU | 12 | - Reference-count design for elements of lists/arrays protected by RCU |
13 | rcu.txt | 13 | rcu.txt |
14 | - RCU Concepts | 14 | - RCU Concepts |
15 | rcubarrier.txt | ||
16 | - Unloading modules that use RCU callbacks | ||
15 | RTFP.txt | 17 | RTFP.txt |
16 | - List of RCU papers (bibliography) going back to 1980. | 18 | - List of RCU papers (bibliography) going back to 1980. |
17 | torture.txt | 19 | torture.txt |
diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt new file mode 100644 index 000000000000..909602d409bb --- /dev/null +++ b/Documentation/RCU/rcubarrier.txt | |||
@@ -0,0 +1,304 @@ | |||
1 | RCU and Unloadable Modules | ||
2 | |||
3 | [Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/] | ||
4 | |||
5 | RCU (read-copy update) is a synchronization mechanism that can be thought | ||
6 | of as a replacement for reader-writer locking (among other things), but with | ||
7 | very low-overhead readers that are immune to deadlock, priority inversion, | ||
8 | and unbounded latency. RCU read-side critical sections are delimited | ||
9 | by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPT | ||
10 | kernels, generate no code whatsoever. | ||
11 | |||
12 | This means that RCU writers are unaware of the presence of concurrent | ||
13 | readers, so that RCU updates to shared data must be undertaken quite | ||
14 | carefully, leaving an old version of the data structure in place until all | ||
15 | pre-existing readers have finished. These old versions are needed because | ||
16 | such readers might hold a reference to them. RCU updates can therefore be | ||
17 | rather expensive, and RCU is thus best suited for read-mostly situations. | ||
18 | |||
19 | How can an RCU writer possibly determine when all readers are finished, | ||
20 | given that readers might well leave absolutely no trace of their | ||
21 | presence? There is a synchronize_rcu() primitive that blocks until all | ||
22 | pre-existing readers have completed. An updater wishing to delete an | ||
23 | element p from a linked list might do the following, while holding an | ||
24 | appropriate lock, of course: | ||
25 | |||
26 | list_del_rcu(p); | ||
27 | synchronize_rcu(); | ||
28 | kfree(p); | ||
29 | |||
30 | But the above code cannot be used in IRQ context -- the call_rcu() | ||
31 | primitive must be used instead. This primitive takes a pointer to an | ||
32 | rcu_head struct placed within the RCU-protected data structure and | ||
33 | another pointer to a function that may be invoked later to free that | ||
34 | structure. Code to delete an element p from the linked list from IRQ | ||
35 | context might then be as follows: | ||
36 | |||
37 | list_del_rcu(p); | ||
38 | call_rcu(&p->rcu, p_callback); | ||
39 | |||
40 | Since call_rcu() never blocks, this code can safely be used from within | ||
41 | IRQ context. The function p_callback() might be defined as follows: | ||
42 | |||
43 | static void p_callback(struct rcu_head *rp) | ||
44 | { | ||
45 | struct pstruct *p = container_of(rp, struct pstruct, rcu); | ||
46 | |||
47 | kfree(p); | ||
48 | } | ||
49 | |||
50 | |||
51 | Unloading Modules That Use call_rcu() | ||
52 | |||
53 | But what if p_callback is defined in an unloadable module? | ||
54 | |||
55 | If we unload the module while some RCU callbacks are pending, | ||
56 | the CPUs executing these callbacks are going to be severely | ||
57 | disappointed when they are later invoked, as fancifully depicted at | ||
58 | http://lwn.net/images/ns/kernel/rcu-drop.jpg. | ||
59 | |||
60 | We could try placing a synchronize_rcu() in the module-exit code path, | ||
61 | but this is not sufficient. Although synchronize_rcu() does wait for a | ||
62 | grace period to elapse, it does not wait for the callbacks to complete. | ||
63 | |||
64 | One might be tempted to try several back-to-back synchronize_rcu() | ||
65 | calls, but this is still not guaranteed to work. If there is a very | ||
66 | heavy RCU-callback load, then some of the callbacks might be deferred | ||
67 | in order to allow other processing to proceed. Such deferral is required | ||
68 | in realtime kernels in order to avoid excessive scheduling latencies. | ||
69 | |||
70 | |||
71 | rcu_barrier() | ||
72 | |||
73 | We instead need the rcu_barrier() primitive. This primitive is similar | ||
74 | to synchronize_rcu(), but instead of waiting solely for a grace | ||
75 | period to elapse, it also waits for all outstanding RCU callbacks to | ||
76 | complete. Pseudo-code using rcu_barrier() is as follows: | ||
77 | |||
78 | 1. Prevent any new RCU callbacks from being posted. | ||
79 | 2. Execute rcu_barrier(). | ||
80 | 3. Allow the module to be unloaded. | ||
81 | |||
82 | Quick Quiz #1: Why is there no srcu_barrier()? | ||
83 | |||
84 | The rcutorture module makes use of rcu_barrier() in its exit function | ||
85 | as follows: | ||
86 | |||
87 | 1 static void | ||
88 | 2 rcu_torture_cleanup(void) | ||
89 | 3 { | ||
90 | 4 int i; | ||
91 | 5 | ||
92 | 6 fullstop = 1; | ||
93 | 7 if (shuffler_task != NULL) { | ||
94 | 8 VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task"); | ||
95 | 9 kthread_stop(shuffler_task); | ||
96 | 10 } | ||
97 | 11 shuffler_task = NULL; | ||
98 | 12 | ||
99 | 13 if (writer_task != NULL) { | ||
100 | 14 VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task"); | ||
101 | 15 kthread_stop(writer_task); | ||
102 | 16 } | ||
103 | 17 writer_task = NULL; | ||
104 | 18 | ||
105 | 19 if (reader_tasks != NULL) { | ||
106 | 20 for (i = 0; i < nrealreaders; i++) { | ||
107 | 21 if (reader_tasks[i] != NULL) { | ||
108 | 22 VERBOSE_PRINTK_STRING( | ||
109 | 23 "Stopping rcu_torture_reader task"); | ||
110 | 24 kthread_stop(reader_tasks[i]); | ||
111 | 25 } | ||
112 | 26 reader_tasks[i] = NULL; | ||
113 | 27 } | ||
114 | 28 kfree(reader_tasks); | ||
115 | 29 reader_tasks = NULL; | ||
116 | 30 } | ||
117 | 31 rcu_torture_current = NULL; | ||
118 | 32 | ||
119 | 33 if (fakewriter_tasks != NULL) { | ||
120 | 34 for (i = 0; i < nfakewriters; i++) { | ||
121 | 35 if (fakewriter_tasks[i] != NULL) { | ||
122 | 36 VERBOSE_PRINTK_STRING( | ||
123 | 37 "Stopping rcu_torture_fakewriter task"); | ||
124 | 38 kthread_stop(fakewriter_tasks[i]); | ||
125 | 39 } | ||
126 | 40 fakewriter_tasks[i] = NULL; | ||
127 | 41 } | ||
128 | 42 kfree(fakewriter_tasks); | ||
129 | 43 fakewriter_tasks = NULL; | ||
130 | 44 } | ||
131 | 45 | ||
132 | 46 if (stats_task != NULL) { | ||
133 | 47 VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task"); | ||
134 | 48 kthread_stop(stats_task); | ||
135 | 49 } | ||
136 | 50 stats_task = NULL; | ||
137 | 51 | ||
138 | 52 /* Wait for all RCU callbacks to fire. */ | ||
139 | 53 rcu_barrier(); | ||
140 | 54 | ||
141 | 55 rcu_torture_stats_print(); /* -After- the stats thread is stopped! */ | ||
142 | 56 | ||
143 | 57 if (cur_ops->cleanup != NULL) | ||
144 | 58 cur_ops->cleanup(); | ||
145 | 59 if (atomic_read(&n_rcu_torture_error)) | ||
146 | 60 rcu_torture_print_module_parms("End of test: FAILURE"); | ||
147 | 61 else | ||
148 | 62 rcu_torture_print_module_parms("End of test: SUCCESS"); | ||
149 | 63 } | ||
150 | |||
151 | Line 6 sets a global variable that prevents any RCU callbacks from | ||
152 | re-posting themselves. This will not be necessary in most cases, since | ||
153 | RCU callbacks rarely include calls to call_rcu(). However, the rcutorture | ||
154 | module is an exception to this rule, and therefore needs to set this | ||
155 | global variable. | ||
156 | |||
157 | Lines 7-50 stop all the kernel tasks associated with the rcutorture | ||
158 | module. Therefore, once execution reaches line 53, no more rcutorture | ||
159 | RCU callbacks will be posted. The rcu_barrier() call on line 53 waits | ||
160 | for any pre-existing callbacks to complete. | ||
161 | |||
162 | Then lines 55-62 print status and do operation-specific cleanup, and | ||
163 | then return, permitting the module-unload operation to be completed. | ||
164 | |||
165 | Quick Quiz #2: Is there any other situation where rcu_barrier() might | ||
166 | be required? | ||
167 | |||
168 | Your module might have additional complications. For example, if your | ||
169 | module invokes call_rcu() from timers, you will need to first cancel all | ||
170 | the timers, and only then invoke rcu_barrier() to wait for any remaining | ||
171 | RCU callbacks to complete. | ||
172 | |||
173 | |||
174 | Implementing rcu_barrier() | ||
175 | |||
176 | Dipankar Sarma's implementation of rcu_barrier() makes use of the fact | ||
177 | that RCU callbacks are never reordered once queued on one of the per-CPU | ||
178 | queues. His implementation queues an RCU callback on each of the per-CPU | ||
179 | callback queues, and then waits until they have all started executing, at | ||
180 | which point, all earlier RCU callbacks are guaranteed to have completed. | ||
181 | |||
182 | The original code for rcu_barrier() was as follows: | ||
183 | |||
184 | 1 void rcu_barrier(void) | ||
185 | 2 { | ||
186 | 3 BUG_ON(in_interrupt()); | ||
187 | 4 /* Take cpucontrol mutex to protect against CPU hotplug */ | ||
188 | 5 mutex_lock(&rcu_barrier_mutex); | ||
189 | 6 init_completion(&rcu_barrier_completion); | ||
190 | 7 atomic_set(&rcu_barrier_cpu_count, 0); | ||
191 | 8 on_each_cpu(rcu_barrier_func, NULL, 0, 1); | ||
192 | 9 wait_for_completion(&rcu_barrier_completion); | ||
193 | 10 mutex_unlock(&rcu_barrier_mutex); | ||
194 | 11 } | ||
195 | |||
196 | Line 3 verifies that the caller is in process context, and lines 5 and 10 | ||
197 | use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the | ||
198 | global completion and counters at a time, which are initialized on lines | ||
199 | 6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is | ||
200 | shown below. Note that the final "1" in on_each_cpu()'s argument list | ||
201 | ensures that all the calls to rcu_barrier_func() will have completed | ||
202 | before on_each_cpu() returns. Line 9 then waits for the completion. | ||
203 | |||
204 | This code was rewritten in 2008 to support rcu_barrier_bh() and | ||
205 | rcu_barrier_sched() in addition to the original rcu_barrier(). | ||
206 | |||
207 | The rcu_barrier_func() runs on each CPU, where it invokes call_rcu() | ||
208 | to post an RCU callback, as follows: | ||
209 | |||
210 | 1 static void rcu_barrier_func(void *notused) | ||
211 | 2 { | ||
212 | 3 int cpu = smp_processor_id(); | ||
213 | 4 struct rcu_data *rdp = &per_cpu(rcu_data, cpu); | ||
214 | 5 struct rcu_head *head; | ||
215 | 6 | ||
216 | 7 head = &rdp->barrier; | ||
217 | 8 atomic_inc(&rcu_barrier_cpu_count); | ||
218 | 9 call_rcu(head, rcu_barrier_callback); | ||
219 | 10 } | ||
220 | |||
221 | Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure, | ||
222 | which contains the struct rcu_head that is needed for the later call to | ||
223 | call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line | ||
224 | 8 increments a global counter. This counter will later be decremented | ||
225 | by the callback. Line 9 then registers the rcu_barrier_callback() on | ||
226 | the current CPU's queue. | ||
227 | |||
228 | The rcu_barrier_callback() function simply atomically decrements the | ||
229 | rcu_barrier_cpu_count variable and finalizes the completion when it | ||
230 | reaches zero, as follows: | ||
231 | |||
232 | 1 static void rcu_barrier_callback(struct rcu_head *notused) | ||
233 | 2 { | ||
234 | 3 if (atomic_dec_and_test(&rcu_barrier_cpu_count)) | ||
235 | 4 complete(&rcu_barrier_completion); | ||
236 | 5 } | ||
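The counting scheme described above can be tried out in user space. The following is only a single-threaded model sketched for illustration, not kernel code: the names queues, model_call_rcu(), model_barrier_post() and model_run_one() are hypothetical, the per-CPU queues are plain FIFO arrays, and the model assumes that every CPU has posted its barrier callback before any callback is invoked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS   4	/* model size, chosen arbitrarily */
#define QUEUE_LEN 8

typedef void (*callback_t)(void);

/* One FIFO callback queue per modelled CPU; entries are never reordered. */
struct cpu_queue {
	callback_t cb[QUEUE_LEN];
	size_t head, tail;
};

static struct cpu_queue queues[NR_CPUS];
static int barrier_cpu_count;	/* stands in for rcu_barrier_cpu_count */
static bool barrier_done;	/* stands in for rcu_barrier_completion */
static int user_callbacks_run;

/* Queue a callback at the tail of one CPU's queue. */
static void model_call_rcu(int cpu, callback_t cb)
{
	struct cpu_queue *q = &queues[cpu];

	assert(q->tail < QUEUE_LEN);
	q->cb[q->tail++] = cb;
}

/* An ordinary callback, such as one posted by a module. */
static void user_callback(void)
{
	user_callbacks_run++;
}

/* Mirrors rcu_barrier_callback(): the last one signals completion. */
static void barrier_callback(void)
{
	if (--barrier_cpu_count == 0)
		barrier_done = true;
}

/* Models rcu_barrier_func() having run once on every CPU. */
static void model_barrier_post(void)
{
	barrier_done = false;
	barrier_cpu_count = 0;
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		barrier_cpu_count++;
		model_call_rcu(cpu, barrier_callback);
	}
}

/* Invoke the oldest pending callback on one CPU; false if none is left. */
static bool model_run_one(int cpu)
{
	struct cpu_queue *q = &queues[cpu];

	if (q->head == q->tail)
		return false;
	q->cb[q->head++]();
	return true;
}
```

Because each queue is drained in order, barrier_done can only become true once every callback queued ahead of a barrier callback has already run, whichever order the CPUs are drained in.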
237 | |||
238 | Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes | ||
239 | immediately (thus incrementing rcu_barrier_cpu_count to the | ||
240 | value one), but the other CPU's rcu_barrier_func() invocations | ||
241 | are delayed for a full grace period? Couldn't this result in | ||
242 | rcu_barrier() returning prematurely? | ||
243 | |||
244 | |||
245 | rcu_barrier() Summary | ||
246 | |||
247 | The rcu_barrier() primitive has seen relatively little use, since most | ||
248 | code using RCU is in the core kernel rather than in modules. However, if | ||
249 | you are using RCU from an unloadable module, you need to use rcu_barrier() | ||
250 | so that your module may be safely unloaded. | ||
251 | |||
252 | |||
253 | Answers to Quick Quizzes | ||
254 | |||
255 | Quick Quiz #1: Why is there no srcu_barrier()? | ||
256 | |||
257 | Answer: Since there is no call_srcu(), there can be no outstanding SRCU | ||
258 | callbacks. Therefore, there is no need to wait for them. | ||
259 | |||
260 | Quick Quiz #2: Is there any other situation where rcu_barrier() might | ||
261 | be required? | ||
262 | |||
263 | Answer: Interestingly enough, rcu_barrier() was not originally | ||
264 | implemented for module unloading. Nikita Danilov was using | ||
265 | RCU in a filesystem, which resulted in a similar situation at | ||
266 | filesystem-unmount time. Dipankar Sarma coded up rcu_barrier() | ||
267 | in response, so that Nikita could invoke it during the | ||
268 | filesystem-unmount process. | ||
269 | |||
270 | Much later, yours truly hit the RCU module-unload problem when | ||
271 | implementing rcutorture, and found that rcu_barrier() solves | ||
272 | this problem as well. | ||
273 | |||
274 | Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes | ||
275 | immediately (thus incrementing rcu_barrier_cpu_count to the | ||
276 | value one), but the other CPU's rcu_barrier_func() invocations | ||
277 | are delayed for a full grace period? Couldn't this result in | ||
278 | rcu_barrier() returning prematurely? | ||
279 | |||
280 | Answer: This cannot happen. The reason is that on_each_cpu() has its last | ||
281 | argument, the wait flag, set to "1". This flag is passed through | ||
282 | to smp_call_function() and further to smp_call_function_on_cpu(), | ||
283 | causing this latter to spin until the cross-CPU invocation of | ||
284 | rcu_barrier_func() has completed. This by itself would prevent | ||
285 | a grace period from completing on non-CONFIG_PREEMPT kernels, | ||
286 | since each CPU must undergo a context switch (or other quiescent | ||
287 | state) before the grace period can complete. However, this is | ||
288 | of no use in CONFIG_PREEMPT kernels. | ||
289 | |||
290 | Therefore, on_each_cpu() disables preemption across its call | ||
291 | to smp_call_function() and also across the local call to | ||
292 | rcu_barrier_func(). This prevents the local CPU from context | ||
293 | switching, again preventing grace periods from completing. This | ||
294 | means that all CPUs have executed rcu_barrier_func() before | ||
295 | the first rcu_barrier_callback() can possibly execute, in turn | ||
296 | preventing rcu_barrier_cpu_count from prematurely reaching zero. | ||
297 | |||
298 | Currently, -rt implementations of RCU keep but a single global | ||
299 | queue for RCU callbacks, and thus do not suffer from this | ||
300 | problem. However, when the -rt RCU eventually does have per-CPU | ||
301 | callback queues, things will have to change. One simple change | ||
302 | is to add an rcu_read_lock() before line 8 of rcu_barrier() | ||
303 | and an rcu_read_unlock() after line 8 of this same function. If | ||
304 | you can think of a better change, please let me know! | ||
diff --git a/Documentation/bad_memory.txt b/Documentation/bad_memory.txt new file mode 100644 index 000000000000..df8416213202 --- /dev/null +++ b/Documentation/bad_memory.txt | |||
@@ -0,0 +1,45 @@ | |||
1 | March 2008 | ||
2 | Jan-Simon Moeller, dl9pf@gmx.de | ||
3 | |||
4 | |||
5 | How to deal with bad memory e.g. reported by memtest86+ ? | ||
6 | ######################################################### | ||
7 | |||
8 | There are three possibilities I know of: | ||
9 | |||
10 | 1) Reinsert/swap the memory modules | ||
11 | |||
12 | 2) Buy new modules (best!) or try to exchange the memory | ||
13 | if you have spare-parts | ||
14 | |||
15 | 3) Use BadRAM or memmap | ||
16 | |||
17 | This Howto is about number 3) . | ||
18 | |||
19 | |||
20 | BadRAM | ||
21 | ###### | ||
22 | BadRAM is actively developed and is available as a kernel patch | ||
23 | here: http://rick.vanrein.org/linux/badram/ | ||
24 | |||
25 | For more details see the BadRAM documentation. | ||
26 | |||
27 | memmap | ||
28 | ###### | ||
29 | |||
30 | memmap is already in the kernel and usable as a kernel parameter at | ||
31 | boot time. Its syntax is slightly strange and you may need to | ||
32 | calculate the values yourself! | ||
33 | |||
34 | Syntax to exclude a memory area (see kernel-parameters.txt for details): | ||
35 | memmap=<size>$<address> | ||
36 | |||
37 | Example: memtest86+ reported errors at addresses 0x18691458, 0x18698424 and | ||
38 | some others. All had 0x1869xxxx in common, so I chose a pattern of | ||
39 | 0x18690000,0xffff0000. | ||
40 | |||
41 | With the numbers of the example above: | ||
42 | memmap=64K$0x18690000 | ||
43 | or | ||
44 | memmap=0x10000$0x18690000 | ||
45 | |||
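The size and address used in the example can be derived mechanically from the address/mask pattern. The following is a minimal user-space sketch in C, not something shipped with the kernel. The names memmap_from_pattern() and memmap_format() are hypothetical, and the sketch assumes that the mask is a contiguous run of high-order one bits, as in the 0x18690000,0xffff0000 pattern above.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* A region to exclude, as used in memmap=<size>$<address>. */
struct memmap_region {
	uint32_t base;
	uint32_t size;
};

/*
 * Hypothetical helper: derive the excluded region from a faulty address
 * and a mask made of contiguous high-order one bits (e.g. 0xffff0000).
 */
static struct memmap_region memmap_from_pattern(uint32_t addr, uint32_t mask)
{
	struct memmap_region r;

	r.base = addr & mask;	/* start of the excluded region */
	r.size = ~mask + 1;	/* number of bytes covered by the mask */

	/* For such a mask the size is a nonzero power of two. */
	assert(r.size != 0 && (r.size & (r.size - 1)) == 0);
	return r;
}

/* Format the region as a boot parameter; returns snprintf()'s result. */
static int memmap_format(const struct memmap_region *r, char *buf, size_t len)
{
	return snprintf(buf, len, "memmap=0x%" PRIx32 "$0x%" PRIx32,
			r->size, r->base);
}
```

With the address 0x18691458 and the mask 0xffff0000, this yields a base of 0x18690000 and a size of 0x10000 (64K), matching the second form shown above.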
diff --git a/Documentation/cgroups/cgroups.txt b/Documentation/cgroups/cgroups.txt index d9014aa0eb68..e33ee74eee77 100644 --- a/Documentation/cgroups/cgroups.txt +++ b/Documentation/cgroups/cgroups.txt | |||
@@ -227,7 +227,6 @@ Each cgroup is represented by a directory in the cgroup file system | |||
227 | containing the following files describing that cgroup: | 227 | containing the following files describing that cgroup: |
228 | 228 | ||
229 | - tasks: list of tasks (by pid) attached to that cgroup | 229 | - tasks: list of tasks (by pid) attached to that cgroup |
230 | - releasable flag: cgroup currently removeable? | ||
231 | - notify_on_release flag: run the release agent on exit? | 230 | - notify_on_release flag: run the release agent on exit? |
232 | - release_agent: the path to use for release notifications (this file | 231 | - release_agent: the path to use for release notifications (this file |
233 | exists in the top cgroup only) | 232 | exists in the top cgroup only) |
@@ -360,7 +359,7 @@ Now you want to do something with this cgroup. | |||
360 | 359 | ||
361 | In this directory you can find several files: | 360 | In this directory you can find several files: |
362 | # ls | 361 | # ls |
363 | notify_on_release releasable tasks | 362 | notify_on_release tasks |
364 | (plus whatever files added by the attached subsystems) | 363 | (plus whatever files added by the attached subsystems) |
365 | 364 | ||
366 | Now attach your shell to this cgroup: | 365 | Now attach your shell to this cgroup: |
@@ -479,7 +478,6 @@ newly-created cgroup if an error occurs after this subsystem's | |||
479 | create() method has been called for the new cgroup). | 478 | create() method has been called for the new cgroup). |
480 | 479 | ||
481 | void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp); | 480 | void pre_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp); |
482 | (cgroup_mutex held by caller) | ||
483 | 481 | ||
484 | Called before checking the reference count on each subsystem. This may | 482 | Called before checking the reference count on each subsystem. This may |
485 | be useful for subsystems which have some extra references even if | 483 | be useful for subsystems which have some extra references even if |
@@ -498,6 +496,7 @@ remain valid while the caller holds cgroup_mutex. | |||
498 | 496 | ||
499 | void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, | 497 | void attach(struct cgroup_subsys *ss, struct cgroup *cgrp, |
500 | struct cgroup *old_cgrp, struct task_struct *task) | 498 | struct cgroup *old_cgrp, struct task_struct *task) |
499 | (cgroup_mutex held by caller) | ||
501 | 500 | ||
502 | Called after the task has been attached to the cgroup, to allow any | 501 | Called after the task has been attached to the cgroup, to allow any |
503 | post-attachment activity that requires memory allocations or blocking. | 502 | post-attachment activity that requires memory allocations or blocking. |
@@ -511,6 +510,7 @@ void exit(struct cgroup_subsys *ss, struct task_struct *task) | |||
511 | Called during task exit. | 510 | Called during task exit. |
512 | 511 | ||
513 | int populate(struct cgroup_subsys *ss, struct cgroup *cgrp) | 512 | int populate(struct cgroup_subsys *ss, struct cgroup *cgrp) |
513 | (cgroup_mutex held by caller) | ||
514 | 514 | ||
515 | Called after creation of a cgroup to allow a subsystem to populate | 515 | Called after creation of a cgroup to allow a subsystem to populate |
516 | the cgroup directory with file entries. The subsystem should make | 516 | the cgroup directory with file entries. The subsystem should make |
@@ -520,6 +520,7 @@ method can return an error code, the error code is currently not | |||
520 | always handled well. | 520 | always handled well. |
521 | 521 | ||
522 | void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp) | 522 | void post_clone(struct cgroup_subsys *ss, struct cgroup *cgrp) |
523 | (cgroup_mutex held by caller) | ||
523 | 524 | ||
524 | Called at the end of cgroup_clone() to do any parameter | 525 | Called at the end of cgroup_clone() to do any parameter |
525 | initialization which might be required before a task could attach. For | 526 | initialization which might be required before a task could attach. For |
@@ -527,7 +528,7 @@ example in cpusets, no task may attach before 'cpus' and 'mems' are set | |||
527 | up. | 528 | up. |
528 | 529 | ||
529 | void bind(struct cgroup_subsys *ss, struct cgroup *root) | 530 | void bind(struct cgroup_subsys *ss, struct cgroup *root) |
530 | (cgroup_mutex held by caller) | 531 | (cgroup_mutex and ss->hierarchy_mutex held by caller) |
531 | 532 | ||
532 | Called when a cgroup subsystem is rebound to a different hierarchy | 533 | Called when a cgroup subsystem is rebound to a different hierarchy |
533 | and root cgroup. Currently this will only involve movement between | 534 | and root cgroup. Currently this will only involve movement between |
diff --git a/Documentation/controllers/memcg_test.txt b/Documentation/controllers/memcg_test.txt new file mode 100644 index 000000000000..08d4d3ea0d79 --- /dev/null +++ b/Documentation/controllers/memcg_test.txt | |||
@@ -0,0 +1,342 @@ | |||
1 | Memory Resource Controller(Memcg) Implementation Memo. | ||
2 | Last Updated: 2008/12/15 | ||
3 | Base Kernel Version: based on 2.6.28-rc8-mm. | ||
4 | |||
5 | Because VM is getting complex (one of the reasons is memcg...), memcg's behavior | ||
6 | is complex. This is a document for memcg's internal behavior. | ||
7 | Please note that implementation details can be changed. | ||
8 | |||
9 | (*) Topics on the API should be in Documentation/controllers/memory.txt | ||
10 | |||
11 | 0. How to record usage ? | ||
12 | 2 objects are used. | ||
13 | |||
14 | page_cgroup ....an object per page. | ||
15 | Allocated at boot or memory hotplug. Freed at memory hot removal. | ||
16 | |||
17 | swap_cgroup ... an entry per swp_entry. | ||
18 | Allocated at swapon(). Freed at swapoff(). | ||
19 | |||
20 | The page_cgroup has a USED bit, so double counting against a page_cgroup never | ||
21 | occurs. swap_cgroup is used only when a charged page is swapped out. | ||
22 | |||
23 | 1. Charge | ||
24 | |||
25 | a page/swp_entry may be charged (usage += PAGE_SIZE) at | ||
26 | |||
27 | mem_cgroup_newpage_charge() | ||
28 | Called at new page fault and Copy-On-Write. | ||
29 | |||
30 | mem_cgroup_try_charge_swapin() | ||
31 | Called at do_swap_page() (page fault on swap entry) and swapoff. | ||
32 | Followed by charge-commit-cancel protocol. (With swap accounting) | ||
33 | At commit, a charge recorded in swap_cgroup is removed. | ||
34 | |||
35 | mem_cgroup_cache_charge() | ||
36 | Called at add_to_page_cache() | ||
37 | |||
38 | mem_cgroup_cache_charge_swapin() | ||
39 | Called at shmem's swapin. | ||
40 | |||
41 | mem_cgroup_prepare_migration() | ||
42 | Called before migration. "extra" charge is done and followed by | ||
43 | charge-commit-cancel protocol. | ||
44 | At commit, charge against oldpage or newpage will be committed. | ||
45 | |||
46 | 2. Uncharge | ||
47 | a page/swp_entry may be uncharged (usage -= PAGE_SIZE) by | ||
48 | |||
49 | mem_cgroup_uncharge_page() | ||
50 | Called when an anonymous page is fully unmapped. I.e., mapcount goes | ||
51 | to 0. If the page is SwapCache, uncharge is delayed until | ||
52 | mem_cgroup_uncharge_swapcache(). | ||
53 | |||
54 | mem_cgroup_uncharge_cache_page() | ||
55 | Called when a page-cache is deleted from radix-tree. If the page is | ||
56 | SwapCache, uncharge is delayed until mem_cgroup_uncharge_swapcache(). | ||
57 | |||
58 | mem_cgroup_uncharge_swapcache() | ||
59 | Called when SwapCache is removed from radix-tree. The charge itself | ||
60 | is moved to swap_cgroup. (If mem+swap controller is disabled, no | ||
61 | charge to swap occurs.) | ||
62 | |||
63 | mem_cgroup_uncharge_swap() | ||
64 | Called when swp_entry's refcnt goes down to 0. A charge against swap | ||
65 | disappears. | ||
66 | |||
67 | mem_cgroup_end_migration(old, new) | ||
68 | At success of migration old is uncharged (if necessary), a charge | ||
69 | to new page is committed. At failure, charge to old page is committed. | ||
70 | |||
71 | 3. charge-commit-cancel | ||
72 | In some cases, we can't know at charging time whether this "charge" is valid | ||
73 | (because of races). | ||
74 | To handle such cases, there are charge-commit-cancel functions. | ||
75 | mem_cgroup_try_charge_XXX | ||
76 | mem_cgroup_commit_charge_XXX | ||
77 | mem_cgroup_cancel_charge_XXX | ||
78 | These are used in swap-in and migration. | ||
79 | |||
80 | At try_charge(), there are no flags to say "this page is charged". | ||
81 | At this point, usage += PAGE_SIZE. | ||
82 | |||
83 | At commit(), the function checks whether the page should be charged | ||
84 | and either sets the flags or avoids charging (usage -= PAGE_SIZE). | ||
85 | |||
86 | At cancel(), simply usage -= PAGE_SIZE. | ||
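The try/commit/cancel steps above can be modelled in a few lines of user-space C. This is only a sketch of the accounting rules as described here, not the memcg implementation: struct model_memcg, struct model_page, MODEL_PAGE_SIZE and the model_*() functions are hypothetical names, and the "used" field merely stands in for page_cgroup's USED bit.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096L	/* assumed page size for the model */

struct model_memcg {
	long usage;	/* bytes currently accounted */
};

struct model_page {
	bool used;	/* stands in for page_cgroup's USED bit */
};

/* try_charge: account the page speculatively; no flag is set yet. */
static void model_try_charge(struct model_memcg *mem)
{
	mem->usage += MODEL_PAGE_SIZE;
}

/*
 * commit: if the page turns out to be charged already, drop the
 * speculative charge; otherwise keep it and mark the page as used.
 */
static void model_commit_charge(struct model_memcg *mem,
				struct model_page *pg)
{
	if (pg->used)
		mem->usage -= MODEL_PAGE_SIZE;	/* avoid double counting */
	else
		pg->used = true;
}

/* cancel: simply give the speculative charge back. */
static void model_cancel_charge(struct model_memcg *mem)
{
	assert(mem->usage >= MODEL_PAGE_SIZE);
	mem->usage -= MODEL_PAGE_SIZE;
}
```

In this model a try followed by a cancel, or by a commit against an already-charged page, leaves usage unchanged, while a commit against an uncharged page leaves usage one page higher.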
87 | |||
88 | In the explanation below, we assume CONFIG_MEM_RES_CTRL_SWAP=y. | ||
89 | |||
90 | 4. Anonymous | ||
91 | An anonymous page is newly allocated at | ||
92 | - page fault into MAP_ANONYMOUS mapping. | ||
93 | - Copy-On-Write. | ||
94 | It is charged right after it's allocated before doing any page table | ||
95 | related operations. Of course, it's uncharged when another page is used | ||
96 | for the fault address. | ||
97 | |||
98 | When anonymous pages are freed (by exit() or munmap()), zap_pte() is called | ||
99 | and pages for ptes are freed one by one (see mm/memory.c). Uncharges | ||
100 | are done at page_remove_rmap() when page_mapcount() goes down to 0. | ||
101 | |||
102 | Pages are also freed by page reclaim (vmscan.c), where anonymous | ||
103 | pages are swapped out. In this case, the page is marked as | ||
104 | PageSwapCache(). The uncharge() routine doesn't uncharge a page marked | ||
105 | as SwapCache(). It's delayed until __delete_from_swap_cache(). | ||
106 | |||
107 | 4.1 Swap-in. | ||
108 | At swap-in, the page is taken from swap-cache. There are 2 cases. | ||
109 | |||
110 | (a) If the SwapCache is newly allocated and read, it has no charges. | ||
111 | (b) If the SwapCache has been mapped by processes, it has been | ||
112 | charged already. | ||
113 | |||
114 | Swap-in is one of the most complicated operations. In do_swap_page(), | ||
115 | the following events occur when the pte is unchanged. | ||
116 | |||
117 | (1) the page (SwapCache) is looked up. | ||
118 | (2) lock_page() | ||
119 | (3) try_charge_swapin() | ||
120 | (4) reuse_swap_page() (may call delete_from_swap_cache()) | ||
121 | (5) commit_charge_swapin() | ||
122 | (6) swap_free(). | ||
123 | |||
124 | Consider the following situations, for example. | ||
125 | |||
126 | (A) The page has not been charged before (2) and reuse_swap_page() | ||
127 | doesn't call delete_from_swap_cache(). | ||
128 | (B) The page has not been charged before (2) and reuse_swap_page() | ||
129 | calls delete_from_swap_cache(). | ||
130 | (C) The page has been charged before (2) and reuse_swap_page() doesn't | ||
131 | call delete_from_swap_cache(). | ||
132 | (D) The page has been charged before (2) and reuse_swap_page() calls | ||
133 | delete_from_swap_cache(). | ||
134 | |||
135 | Changes in memory.usage/memsw.usage for this page/swp_entry will be | ||
136 | Case (A) (B) (C) (D) | ||
137 | Event | ||
138 | Before (2) 0/ 1 0/ 1 1/ 1 1/ 1 | ||
139 | =========================================== | ||
140 | (3) +1/+1 +1/+1 +1/+1 +1/+1 | ||
141 | (4) - 0/ 0 - -1/ 0 | ||
142 | (5) 0/-1 0/ 0 -1/-1 0/ 0 | ||
143 | (6) - 0/-1 - 0/-1 | ||
144 | =========================================== | ||
145 | Result 1/ 1 1/ 1 1/ 1 1/ 1 | ||
146 | |||
147 | In every case, charges to this page should be 1/ 1.
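The table can be checked by adding up the per-event changes for each case. The sketch below does this in user-space C. It only transcribes the numbers from the table (with "-" entries taken as zero) and is not derived from the kernel sources; struct usage, the enum names and swapin_result() are hypothetical.

```c
#include <assert.h>

/* memory.usage / memsw.usage for one page, counted in pages. */
struct usage {
	int mem;
	int memsw;
};

enum { CASE_A, CASE_B, CASE_C, CASE_D, NR_CASES };
enum { EV_TRY_CHARGE, EV_REUSE, EV_COMMIT, EV_SWAP_FREE, NR_EVENTS };

/* The "Before (2)" row of the table. */
static const struct usage before[NR_CASES] = {
	{0, 1}, {0, 1}, {1, 1}, {1, 1},
};

/* Rows (3) to (6) of the table; "-" entries are recorded as zero. */
static const struct usage delta[NR_EVENTS][NR_CASES] = {
	[EV_TRY_CHARGE] = { {+1, +1}, {+1, +1}, {+1, +1}, {+1, +1} },
	[EV_REUSE]      = { { 0,  0}, { 0,  0}, { 0,  0}, {-1,  0} },
	[EV_COMMIT]     = { { 0, -1}, { 0,  0}, {-1, -1}, { 0,  0} },
	[EV_SWAP_FREE]  = { { 0,  0}, { 0, -1}, { 0,  0}, { 0, -1} },
};

/* Sum the starting state and all event deltas for one case. */
static struct usage swapin_result(int c)
{
	struct usage u;

	assert(c >= 0 && c < NR_CASES);
	u = before[c];
	for (int ev = 0; ev < NR_EVENTS; ev++) {
		u.mem += delta[ev][c].mem;
		u.memsw += delta[ev][c].memsw;
	}
	return u;
}
```

For all four cases the sum comes out as 1/1, which agrees with the "Result" row.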
148 | |||
149 | 4.2 Swap-out. | ||
150 | At swap-out, a typical state transition is as follows. | ||
151 | |||
152 | (a) add to swap cache. (marked as SwapCache) | ||
153 | swp_entry's refcnt += 1. | ||
154 | (b) fully unmapped. | ||
155 | swp_entry's refcnt += # of ptes. | ||
156 | (c) write back to swap. | ||
157 | (d) delete from swap cache. (remove from SwapCache) | ||
158 | swp_entry's refcnt -= 1. | ||
159 | |||
160 | |||
161 | At (b), the page is marked as SwapCache and not uncharged. | ||
162 | At (d), the page is removed from SwapCache and a charge in page_cgroup | ||
163 | is moved to swap_cgroup. | ||
164 | |||
165 | Finally, at task exit, | ||
166 | (e) zap_pte() is called and swp_entry's refcnt -=1 -> 0. | ||
167 | Here, a charge in swap_cgroup disappears. | ||
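The transitions (a) to (e) can be followed with a small user-space model. This is a sketch built only from the steps listed above, not from the kernel sources: struct model_swap_state and the model_*() functions are hypothetical, and the model tracks a single page together with its swp_entry.

```c
#include <assert.h>
#include <stdbool.h>

struct model_swap_state {
	int refcnt;		/* swp_entry's reference count */
	bool in_swap_cache;	/* page is marked as SwapCache */
	bool page_charged;	/* charge held in page_cgroup */
	bool swap_charged;	/* charge held in swap_cgroup */
};

/* (a) add to swap cache */
static void model_add_to_swap_cache(struct model_swap_state *s)
{
	s->in_swap_cache = true;
	s->refcnt += 1;
}

/* (b) fully unmapped; the page is SwapCache, so it is not uncharged */
static void model_unmap_all(struct model_swap_state *s, int nr_ptes)
{
	assert(s->in_swap_cache);
	s->refcnt += nr_ptes;
}

/* (d) delete from swap cache; the charge moves to swap_cgroup */
static void model_delete_from_swap_cache(struct model_swap_state *s)
{
	s->in_swap_cache = false;
	s->refcnt -= 1;
	s->swap_charged = s->page_charged;
	s->page_charged = false;
}

/* (e) zap_pte() drops one reference; at zero the swap charge goes away */
static void model_zap_pte(struct model_swap_state *s)
{
	assert(s->refcnt > 0);
	if (--s->refcnt == 0)
		s->swap_charged = false;
}
```

Starting from a charged page mapped by one pte, steps (a), (b) and (d) leave refcnt at 1 with the charge held in swap_cgroup, and step (e) brings refcnt to 0 and removes the charge.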
168 | |||
169 | 5. Page Cache | ||
170 | Page Cache is charged at | ||
171 | - add_to_page_cache_locked(). | ||
172 | |||
173 | uncharged at | ||
174 | - __remove_from_page_cache(). | ||
175 | |||
176 | The logic is very clear. (About migration, see below) | ||
177 | Note: __remove_from_page_cache() is called by remove_from_page_cache() | ||
178 | and __remove_mapping(). | ||
179 | |||
180 | 6. Shmem(tmpfs) Page Cache | ||
181 | Memcg's charge/uncharge have special handlers of shmem. The best way | ||
182 | to understand shmem's page state transition is to read mm/shmem.c. | ||
183 | But brief explanation of the behavior of memcg around shmem will be | ||
184 | helpful to understand the logic. | ||
185 | |||
186 | Shmem's page (just leaf page, not direct/indirect block) can be on | ||
187 | - radix-tree of shmem's inode. | ||
188 | - SwapCache. | ||
189 | - Both on radix-tree and SwapCache. This happens at swap-in | ||
190 | and swap-out. | ||
191 | |||
192 | It's charged when... | ||
193 | - A new page is added to shmem's radix-tree. | ||
194 | - A swp page is read. (move a charge from swap_cgroup to page_cgroup) | ||
195 | It's uncharged when | ||
196 | - A page is removed from radix-tree and not SwapCache. | ||
197 | - When SwapCache is removed, a charge is moved to swap_cgroup. | ||
198 | - When swp_entry's refcnt goes down to 0, a charge in swap_cgroup | ||
199 | disappears. | ||
200 | |||
201 | 7. Page Migration | ||
202 | One of the most complicated functions is page-migration-handler. | ||
203 | Memcg has 2 routines. Assume that we are migrating a page's contents | ||
204 | from OLDPAGE to NEWPAGE. | ||
205 | |||
206 | The usual migration logic is: | ||
207 | (a) remove the page from LRU. | ||
208 | (b) allocate NEWPAGE (migration target) | ||
209 | (c) lock by lock_page(). | ||
210 | (d) unmap all mappings. | ||
211 | (e-1) If necessary, replace entry in radix-tree. | ||
212 | (e-2) move contents of a page. | ||
213 | (f) map all mappings again. | ||
214 | (g) pushback the page to LRU. | ||
215 | (-) OLDPAGE will be freed. | ||
216 | |||
217 | Before (g), memcg should complete all necessary charge/uncharge to | ||
218 | NEWPAGE/OLDPAGE. | ||
219 | |||
220 | The point is.... | ||
221 | - If OLDPAGE is anonymous, all charges will be dropped at (d) because | ||
222 | try_to_unmap() drops all mapcount and the page will not be | ||
223 | SwapCache. | ||
224 | |||
225 | - If OLDPAGE is SwapCache, charges will be kept at (g) because | ||
226 | __delete_from_swap_cache() isn't called at (e-1) | ||
227 | |||
228 | - If OLDPAGE is page-cache, charges will be kept at (g) because | ||
229 | remove_from_page_cache() isn't called at (e-1) | ||
230 | |||
231 | memcg provides following hooks. | ||
232 | |||
233 | - mem_cgroup_prepare_migration(OLDPAGE) | ||
234 | Called after (b) to account a charge (usage += PAGE_SIZE) against | ||
235 | memcg which OLDPAGE belongs to. | ||
236 | |||
237 | - mem_cgroup_end_migration(OLDPAGE, NEWPAGE) | ||
238 | Called after (f) before (g). | ||
239 | If OLDPAGE is used, commit OLDPAGE again. If OLDPAGE is already | ||
240 | charged, a charge by prepare_migration() is automatically canceled. | ||
241 | If NEWPAGE is used, commit NEWPAGE and uncharge OLDPAGE. | ||
242 | |||
243 | Because zap_pte() (by exit or munmap) can be called during migration, | ||
244 | we have to check whether OLDPAGE/NEWPAGE is a valid page after commit(). | ||
245 | |||
246 | 8. LRU | ||
247 | Each memcg has its own private LRU. Currently, its handling is under the global | ||
248 | VM's control (meaning that it is handled under the global zone->lru_lock). | ||
249 | Almost all routines around memcg's LRU are called by the global LRU's | ||
250 | list management functions under zone->lru_lock. | ||
251 | |||
252 | A special function is mem_cgroup_isolate_pages(). This scans | ||
253 | memcg's private LRU and calls __isolate_lru_page() to extract a page | ||
254 | from LRU. | ||
255 | (By __isolate_lru_page(), the page is removed from both of global and | ||
256 | private LRU.) | ||
257 | |||
258 | |||
259 | 9. Typical Tests. | ||
260 | |||
261 | Tests for racy cases. | ||
262 | |||
263 | 9.1 Small limit to memcg. | ||
264 | When testing racy cases, it is a good idea to set memcg's limit to be | ||
265 | very small (KB or MB) rather than GB. Many races have been found in tests | ||
266 | under xKB or xxMB limits. | ||
267 | (Memory behavior under a GB limit and under an MB limit can be very | ||
268 | different.) | ||
269 | |||
270 | 9.2 Shmem | ||
271 | Historically, memcg's shmem handling was poor and we saw a fair amount | ||
272 | of trouble here. This is because shmem is page-cache but can also be | ||
273 | SwapCache. Testing with shmem/tmpfs is always a good test. | ||
274 | |||
275 | 9.3 Migration | ||
276 | For NUMA, migration is another special case. For easy testing, cpuset | ||
277 | is useful. The following is a sample script to do migration. | ||
278 | |||
279 | mount -t cgroup -o cpuset none /opt/cpuset | ||
280 | |||
281 | mkdir /opt/cpuset/01 | ||
282 | echo 1 > /opt/cpuset/01/cpuset.cpus | ||
283 | echo 0 > /opt/cpuset/01/cpuset.mems | ||
284 | echo 1 > /opt/cpuset/01/cpuset.memory_migrate | ||
285 | mkdir /opt/cpuset/02 | ||
286 | echo 1 > /opt/cpuset/02/cpuset.cpus | ||
287 | echo 1 > /opt/cpuset/02/cpuset.mems | ||
288 | echo 1 > /opt/cpuset/02/cpuset.memory_migrate | ||
289 | |||
290 | With the above setup, when you move a task from 01 to 02, page migration | ||
291 | from node 0 to node 1 will occur. The following is a script to migrate all | ||
292 | tasks under a cpuset. | ||
293 | -- | ||
294 | move_task() | ||
295 | { | ||
296 | for pid in $1 | ||
297 | do | ||
298 | /bin/echo $pid >$2/tasks 2>/dev/null | ||
299 | echo -n $pid | ||
300 | echo -n " " | ||
301 | done | ||
302 | echo END | ||
303 | } | ||
304 | |||
305 | G1_TASK=`cat ${G1}/tasks` | ||
306 | G2_TASK=`cat ${G2}/tasks` | ||
307 | move_task "${G1_TASK}" ${G2} & | ||
308 | -- | ||
309 | 9.4 Memory hotplug. | ||
310 | A memory hotplug test is also a good test. | ||
311 | To offline memory, do the following. | ||
312 | # echo offline > /sys/devices/system/memory/memoryXXX/state | ||
313 | (XXX is the place of memory) | ||
314 | This is an easy way to test page migration, too. | ||
315 | |||
316 | 9.5 mkdir/rmdir | ||
317 | When using hierarchy, mkdir/rmdir test should be done. | ||
318 | Use tests like the following. | ||
319 | |||
320 | echo 1 >/opt/cgroup/01/memory.use_hierarchy | ||
321 | mkdir /opt/cgroup/01/child_a | ||
322 | mkdir /opt/cgroup/01/child_b | ||
323 | |||
324 | set limit to 01. | ||
325 | add limit to 01/child_b | ||
326 | run jobs under child_a and child_b | ||
327 | |||
328 | create/delete following groups at random while jobs are running. | ||
329 | /opt/cgroup/01/child_a/child_aa | ||
330 | /opt/cgroup/01/child_b/child_bb | ||
331 | /opt/cgroup/01/child_c | ||
332 | |||
333 | running new jobs in new group is also good. | ||
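
The create/delete step above could be scripted roughly as follows. This is
only a sketch: churn_groups is a hypothetical helper name, the group names
are the ones listed above, and it toggles each group on every pass instead
of picking one at random. On a real memcg mount, rmdir may fail (for
example with -EBUSY), which the sketch ignores.

```shell
# churn_groups BASE COUNT
# For COUNT passes, toggle each test group under BASE: remove it if it
# exists, create it otherwise. Failures (e.g. a busy group) are ignored.
churn_groups() {
    base=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        for g in child_a/child_aa child_b/child_bb child_c; do
            if [ -d "$base/$g" ]; then
                rmdir "$base/$g" 2>/dev/null || true
            else
                mkdir "$base/$g" 2>/dev/null || true
            fi
        done
        i=$((i + 1))
    done
}
```

For example, "churn_groups /opt/cgroup/01 1000 &" could run in the
background while jobs are running under child_a and child_b.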
334 | |||
335 | 9.6 Mount with other subsystems. | ||
336 | Mounting with other subsystems is a good test because there is a | ||
337 | race and lock dependency with other cgroup subsystems. | ||
338 | |||
339 | example) | ||
340 | # mount -t cgroup none /cgroup -o cpuset,memory,cpu,devices | ||
341 | |||
342 | and do task move, mkdir, rmdir etc...under this. | ||
diff --git a/Documentation/controllers/memory.txt b/Documentation/controllers/memory.txt index 1c07547d3f81..e1501964df1e 100644 --- a/Documentation/controllers/memory.txt +++ b/Documentation/controllers/memory.txt | |||
@@ -137,7 +137,32 @@ behind this approach is that a cgroup that aggressively uses a shared | |||
137 | page will eventually get charged for it (once it is uncharged from | 137 | page will eventually get charged for it (once it is uncharged from |
138 | the cgroup that brought it in -- this will happen on memory pressure). | 138 | the cgroup that brought it in -- this will happen on memory pressure). |
139 | 139 | ||
140 | 2.4 Reclaim | 140 | Exception: If CONFIG_CGROUP_MEM_RES_CTLR_SWAP is not used. |
141 | When you do swapoff and force swapped-out pages of shmem(tmpfs) to | ||
142 | be brought back into memory, charges for the pages are accounted against the | ||
143 | caller of swapoff rather than the users of shmem. | ||
144 | |||
145 | |||
146 | 2.4 Swap Extension (CONFIG_CGROUP_MEM_RES_CTLR_SWAP) | ||
147 | Swap Extension allows you to record charge for swap. A swapped-in page is | ||
148 | charged back to original page allocator if possible. | ||
149 | |||
150 | When swap is accounted, following files are added. | ||
151 | - memory.memsw.usage_in_bytes. | ||
152 | - memory.memsw.limit_in_bytes. | ||
153 | |||
154 | usage of mem+swap is limited by memsw.limit_in_bytes. | ||
155 | |||
156 | Note: why 'mem+swap' rather than swap. | ||
157 | The global LRU(kswapd) can swap out arbitrary pages. Swap-out means | ||
158 | to move account from memory to swap...there is no change in usage of | ||
159 | mem+swap. | ||
160 | |||
161 | In other words, when we want to limit the usage of swap without affecting | ||
162 | global LRU, mem+swap limit is better than just limiting swap from OS point | ||
163 | of view. | ||
164 | |||
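
Because memsw.usage_in_bytes counts mem+swap, the swap-only part can be
estimated by subtracting the memory usage from it. A minimal sketch,
assuming the cgroup directory also exposes memory.usage_in_bytes;
swap_usage is a hypothetical helper name, and the two reads are not atomic,
so the result is only approximate on a busy group.

```shell
# swap_usage CGROUP_DIR
# Print (mem+swap usage) - (memory usage) in bytes for the given cgroup
# directory. Returns non-zero if either file cannot be read.
swap_usage() {
    dir=$1
    mem=$(cat "$dir/memory.usage_in_bytes") || return 1
    memsw=$(cat "$dir/memory.memsw.usage_in_bytes") || return 1
    echo $((memsw - mem))
}
```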
165 | 2.5 Reclaim | ||
141 | 166 | ||
142 | Each cgroup maintains a per cgroup LRU that consists of an active | 167 | Each cgroup maintains a per cgroup LRU that consists of an active |
143 | and inactive list. When a cgroup goes over its limit, we first try | 168 | and inactive list. When a cgroup goes over its limit, we first try |
@@ -207,12 +232,6 @@ exceeded. | |||
207 | The memory.stat file gives accounting information. Now, the number of | 232 | The memory.stat file gives accounting information. Now, the number of |
208 | caches, RSS and Active pages/Inactive pages are shown. | 233 | caches, RSS and Active pages/Inactive pages are shown. |
209 | 234 | ||
210 | The memory.force_empty gives an interface to drop *all* charges by force. | ||
211 | |||
212 | # echo 1 > memory.force_empty | ||
213 | |||
214 | will drop all charges in cgroup. Currently, this is maintained for test. | ||
215 | |||
216 | 4. Testing | 235 | 4. Testing |
217 | 236 | ||
218 | Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. | 237 | Balbir posted lmbench, AIM9, LTP and vmmstress results [10] and [11]. |
@@ -242,10 +261,106 @@ reclaimed. | |||
242 | 261 | ||
243 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a | 262 | A cgroup can be removed by rmdir, but as discussed in sections 4.1 and 4.2, a |
244 | cgroup might have some charge associated with it, even though all | 263 | cgroup might have some charge associated with it, even though all |
245 | tasks have migrated away from it. Such charges are automatically dropped at | 264 | tasks have migrated away from it. |
246 | rmdir() if there are no tasks. | 265 | Such charges are freed(at default) or moved to its parent. When moved, |
266 | both of RSS and CACHES are moved to parent. | ||
267 | If both of them are busy, rmdir() returns -EBUSY. See 5.1 Also. | ||
268 | |||
269 | Charges recorded in swap information are not updated at removal of cgroup. | ||
270 | Recorded information is discarded and a cgroup which uses swap (swapcache) | ||
271 | will be charged as a new owner of it. | ||
272 | |||
273 | |||
274 | 5. Misc. interfaces. | ||
275 | |||
276 | 5.1 force_empty | ||
277 | memory.force_empty interface is provided to make cgroup's memory usage empty. | ||
278 | You can use this interface only when the cgroup has no tasks. | ||
279 | When anything is written to this file, e.g. | ||
280 | |||
281 | # echo 0 > memory.force_empty | ||
282 | |||
283 | Almost all pages tracked by this memcg will be unmapped and freed. Some | ||
284 | pages cannot be freed because they are locked or in use. Such pages are moved | ||
285 | to the parent and this cgroup will be empty. But this may return -EBUSY in | ||
286 | some very busy cases. | ||
287 | |||
288 | A typical use case of this interface is calling it before rmdir(). | ||
289 | Because rmdir() moves all pages to parent, some out-of-use page caches can be | ||
290 | moved to the parent. If you want to avoid that, force_empty will be useful. | ||
291 | |||
292 | 5.2 stat file | ||
293 | memory.stat file includes following statistics (now) | ||
294 | cache - # of pages from page-cache and shmem. | ||
295 | rss - # of pages from anonymous memory. | ||
296 | pgpgin - # of event of charging | ||
297 | pgpgout - # of event of uncharging | ||
298 | active_anon - # of pages on active lru of anon, shmem. | ||
299 | inactive_anon - # of pages on inactive lru of anon, shmem | ||
300 | active_file - # of pages on active lru of file-cache | ||
301 | inactive_file - # of pages on inactive lru of file cache | ||
302 | unevictable - # of pages cannot be reclaimed.(mlocked etc) | ||
303 | |||
304 | The following depend on CONFIG_DEBUG_VM. | ||
305 | inactive_ratio - VM internal parameter. (see mm/page_alloc.c) | ||
306 | recent_rotated_anon - VM internal parameter. (see mm/vmscan.c) | ||
307 | recent_rotated_file - VM internal parameter. (see mm/vmscan.c) | ||
308 | recent_scanned_anon - VM internal parameter. (see mm/vmscan.c) | ||
309 | recent_scanned_file - VM internal parameter. (see mm/vmscan.c) | ||
310 | |||
311 | Memo: | ||
312 | recent_rotated means recent frequency of lru rotation. | ||
313 | recent_scanned means recent # of scans to lru. | ||
314 | These are shown to aid debugging; please see the code for their meanings. | ||
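
A single statistic can be pulled out of memory.stat with a small helper.
This is a sketch which assumes the file holds one "name value" pair per
line; memcg_stat is a hypothetical helper name.

```shell
# memcg_stat FILE FIELD
# Print the value of FIELD from a memory.stat-style file.
# Exits non-zero if FIELD is not present.
memcg_stat() {
    awk -v key="$2" '$1 == key { print $2; found = 1 } END { exit !found }' "$1"
}
```

For example, "memcg_stat /opt/cgroup/01/memory.stat rss" would print the
rss count of group 01 (the mount point is the one used in the test examples
and is an assumption here).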
315 | |||
316 | |||
317 | 5.3 swappiness | ||
318 | Similar to /proc/sys/vm/swappiness, but affecting a hierarchy of groups only. | ||
319 | |||
320 | The following cgroups' swappiness can't be changed. | ||
321 | - root cgroup (uses /proc/sys/vm/swappiness). | ||
322 | - a cgroup which uses hierarchy and it has child cgroup. | ||
323 | - a cgroup which uses hierarchy and not the root of hierarchy. | ||
324 | |||
325 | |||
326 | 6. Hierarchy support | ||
327 | |||
328 | The memory controller supports a deep hierarchy and hierarchical accounting. | ||
329 | The hierarchy is created by creating the appropriate cgroups in the | ||
330 | cgroup filesystem. Consider for example, the following cgroup filesystem | ||
331 | hierarchy | ||
332 | |||
333 | root | ||
334 | / | \ | ||
335 | / | \ | ||
336 | a b c | ||
337 | | \ | ||
338 | | \ | ||
339 | d e | ||
340 | |||
341 | In the diagram above, with hierarchical accounting enabled, all memory | ||
342 | usage of e, is accounted to its ancestors up until the root (i.e, c and root), | ||
343 | that has memory.use_hierarchy enabled. If one of the ancestors goes over its | ||
344 | limit, the reclaim algorithm reclaims from the tasks in the ancestor and the | ||
345 | children of the ancestor. | ||
346 | |||
347 | 6.1 Enabling hierarchical accounting and reclaim | ||
348 | |||
349 | The memory controller by default disables the hierarchy feature. Support | ||
350 | can be enabled by writing 1 to memory.use_hierarchy file of the root cgroup | ||
351 | |||
352 | # echo 1 > memory.use_hierarchy | ||
353 | |||
354 | The feature can be disabled by | ||
355 | |||
356 | # echo 0 > memory.use_hierarchy | ||
357 | |||
358 | NOTE1: Enabling/disabling will fail if the cgroup already has other | ||
359 | cgroups created below it. | ||
360 | |||
361 | NOTE2: This feature can be enabled/disabled per subtree. | ||
247 | 362 | ||
248 | 5. TODO | 363 | 7. TODO |
249 | 364 | ||
250 | 1. Add support for accounting huge pages (as a separate controller) | 365 | 1. Add support for accounting huge pages (as a separate controller) |
251 | 2. Make per-cgroup scanner reclaim not-shared pages first | 366 | 2. Make per-cgroup scanner reclaim not-shared pages first |
diff --git a/Documentation/crypto/async-tx-api.txt b/Documentation/crypto/async-tx-api.txt index c1e9545c59bd..9f59fcbf5d82 100644 --- a/Documentation/crypto/async-tx-api.txt +++ b/Documentation/crypto/async-tx-api.txt | |||
@@ -13,9 +13,9 @@ | |||
13 | 3.6 Constraints | 13 | 3.6 Constraints |
14 | 3.7 Example | 14 | 3.7 Example |
15 | 15 | ||
16 | 4 DRIVER DEVELOPER NOTES | 16 | 4 DMAENGINE DRIVER DEVELOPER NOTES |
17 | 4.1 Conformance points | 17 | 4.1 Conformance points |
18 | 4.2 "My application needs finer control of hardware channels" | 18 | 4.2 "My application needs exclusive control of hardware channels" |
19 | 19 | ||
20 | 5 SOURCE | 20 | 5 SOURCE |
21 | 21 | ||
@@ -150,6 +150,7 @@ ops_run_* and ops_complete_* routines in drivers/md/raid5.c for more | |||
150 | implementation examples. | 150 | implementation examples. |
151 | 151 | ||
152 | 4 DRIVER DEVELOPMENT NOTES | 152 | 4 DRIVER DEVELOPMENT NOTES |
153 | |||
153 | 4.1 Conformance points: | 154 | 4.1 Conformance points: |
154 | There are a few conformance points required in dmaengine drivers to | 155 | There are a few conformance points required in dmaengine drivers to |
155 | accommodate assumptions made by applications using the async_tx API: | 156 | accommodate assumptions made by applications using the async_tx API: |
@@ -158,58 +159,49 @@ accommodate assumptions made by applications using the async_tx API: | |||
158 | 3/ Use async_tx_run_dependencies() in the descriptor clean up path to | 159 | 3/ Use async_tx_run_dependencies() in the descriptor clean up path to |
159 | handle submission of dependent operations | 160 | handle submission of dependent operations |
160 | 161 | ||
161 | 4.2 "My application needs finer control of hardware channels" | 162 | 4.2 "My application needs exclusive control of hardware channels" |
162 | This requirement seems to arise from cases where a DMA engine driver is | 163 | Primarily this requirement arises from cases where a DMA engine driver |
163 | trying to support device-to-memory DMA. The dmaengine and async_tx | 164 | is being used to support device-to-memory operations. A channel that is |
164 | implementations were designed for offloading memory-to-memory | 165 | performing these operations cannot, for many platform specific reasons, |
165 | operations; however, there are some capabilities of the dmaengine layer | 166 | be shared. For these cases the dma_request_channel() interface is |
166 | that can be used for platform-specific channel management. | 167 | provided. |
167 | Platform-specific constraints can be handled by registering the | 168 | |
168 | application as a 'dma_client' and implementing a 'dma_event_callback' to | 169 | The interface is: |
169 | apply a filter to the available channels in the system. Before showing | 170 | struct dma_chan *dma_request_channel(dma_cap_mask_t mask, |
170 | how to implement a custom dma_event callback some background of | 171 | dma_filter_fn filter_fn, |
171 | dmaengine's client support is required. | 172 | void *filter_param); |
172 | 173 | ||
173 | The following routines in dmaengine support multiple clients requesting | 174 | Where dma_filter_fn is defined as: |
174 | use of a channel: | 175 | typedef bool (*dma_filter_fn)(struct dma_chan *chan, void *filter_param); |
175 | - dma_async_client_register(struct dma_client *client) | 176 | |
176 | - dma_async_client_chan_request(struct dma_client *client) | 177 | When the optional 'filter_fn' parameter is set to NULL |
177 | 178 | dma_request_channel simply returns the first channel that satisfies the | |
178 | dma_async_client_register takes a pointer to an initialized dma_client | 179 | capability mask. Otherwise, when the mask parameter is insufficient for |
179 | structure. It expects that the 'event_callback' and 'cap_mask' fields | 180 | specifying the necessary channel, the filter_fn routine can be used to |
180 | are already initialized. | 181 | disposition the available channels in the system. The filter_fn routine |
181 | 182 | is called once for each free channel in the system. Upon seeing a | |
182 | dma_async_client_chan_request triggers dmaengine to notify the client of | 183 | suitable channel filter_fn returns DMA_ACK which flags that channel to |
183 | all channels that satisfy the capability mask. It is up to the client's | 184 | be the return value from dma_request_channel. A channel allocated via |
184 | event_callback routine to track how many channels the client needs and | 185 | this interface is exclusive to the caller, until dma_release_channel() |
185 | how many it is currently using. The dma_event_callback routine returns a | 186 | is called. |
186 | dma_state_client code to let dmaengine know the status of the | 187 | |
187 | allocation. | 188 | The DMA_PRIVATE capability flag is used to tag dma devices that should |
188 | 189 | not be used by the general-purpose allocator. It can be set at | |
189 | Below is the example of how to extend this functionality for | 190 | initialization time if it is known that a channel will always be |
190 | platform-specific filtering of the available channels beyond the | 191 | private. Alternatively, it is set when dma_request_channel() finds an |
191 | standard capability mask: | 192 | unused "public" channel. |
192 | 193 | ||
193 | static enum dma_state_client | 194 | A couple caveats to note when implementing a driver and consumer: |
194 | my_dma_client_callback(struct dma_client *client, | 195 | 1/ Once a channel has been privately allocated it will no longer be |
195 | struct dma_chan *chan, enum dma_state state) | 196 | considered by the general-purpose allocator even after a call to |
196 | { | 197 | dma_release_channel(). |
197 | struct dma_device *dma_dev; | 198 | 2/ Since capabilities are specified at the device level a dma_device |
198 | struct my_platform_specific_dma *plat_dma_dev; | 199 | with multiple channels will either have all channels public, or all |
199 | 200 | channels private. | |
200 | dma_dev = chan->device; | ||
201 | plat_dma_dev = container_of(dma_dev, | ||
202 | struct my_platform_specific_dma, | ||
203 | dma_dev); | ||
204 | |||
205 | if (!plat_dma_dev->platform_specific_capability) | ||
206 | return DMA_DUP; | ||
207 | |||
208 | . . . | ||
209 | } | ||
210 | 201 | ||
211 | 5 SOURCE | 202 | 5 SOURCE |
212 | include/linux/dmaengine.h: core header file for DMA drivers and clients | 203 | |
204 | include/linux/dmaengine.h: core header file for DMA drivers and api users | ||
213 | drivers/dma/dmaengine.c: offload engine channel management routines | 205 | drivers/dma/dmaengine.c: offload engine channel management routines |
214 | drivers/dma/: location for offload engine drivers | 206 | drivers/dma/: location for offload engine drivers |
215 | include/linux/async_tx.h: core header file for the async_tx api | 207 | include/linux/async_tx.h: core header file for the async_tx api |
diff --git a/Documentation/development-process/4.Coding b/Documentation/development-process/4.Coding index 014aca8f14e2..a5a3450faaa0 100644 --- a/Documentation/development-process/4.Coding +++ b/Documentation/development-process/4.Coding | |||
@@ -375,10 +375,10 @@ say, this can be a large job, so it is best to be sure that the | |||
375 | justification is solid. | 375 | justification is solid. |
376 | 376 | ||
377 | When making an incompatible API change, one should, whenever possible, | 377 | When making an incompatible API change, one should, whenever possible, |
378 | ensure that code which has not been updated is caught by the compiler. | 378 | ensure that code which has not been updated is caught by the compiler. |
379 | This will help you to be sure that you have found all in-tree uses of that | 379 | This will help you to be sure that you have found all in-tree uses of that |
380 | interface. It will also alert developers of out-of-tree code that there is | 380 | interface. It will also alert developers of out-of-tree code that there is |
381 | a change that they need to respond to. Supporting out-of-tree code is not | 381 | a change that they need to respond to. Supporting out-of-tree code is not |
382 | something that kernel developers need to be worried about, but we also do | 382 | something that kernel developers need to be worried about, but we also do |
383 | not have to make life harder for out-of-tree developers than it it needs to | 383 | not have to make life harder for out-of-tree developers than it needs to |
384 | be. | 384 | be. |
diff --git a/Documentation/dmaengine.txt b/Documentation/dmaengine.txt new file mode 100644 index 000000000000..0c1c2f63c0a9 --- /dev/null +++ b/Documentation/dmaengine.txt | |||
@@ -0,0 +1 @@ | |||
See Documentation/crypto/async-tx-api.txt | |||
diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt new file mode 100644 index 000000000000..64087c34327f --- /dev/null +++ b/Documentation/filesystems/btrfs.txt | |||
@@ -0,0 +1,91 @@ | |||
1 | |||
2 | BTRFS | ||
3 | ===== | ||
4 | |||
5 | Btrfs is a new copy on write filesystem for Linux aimed at | ||
6 | implementing advanced features while focusing on fault tolerance, | ||
7 | repair and easy administration. Initially developed by Oracle, Btrfs | ||
8 | is licensed under the GPL and open for contribution from anyone. | ||
9 | |||
10 | Linux has a wealth of filesystems to choose from, but we are facing a | ||
11 | number of challenges with scaling to the large storage subsystems that | ||
12 | are becoming common in today's data centers. Filesystems need to scale | ||
13 | in their ability to address and manage large storage, and also in | ||
14 | their ability to detect, repair and tolerate errors in the data stored | ||
15 | on disk. Btrfs is under heavy development, and is not suitable for | ||
16 | any uses other than benchmarking and review. The Btrfs disk format is | ||
17 | not yet finalized. | ||
18 | |||
19 | The main Btrfs features include: | ||
20 | |||
21 | * Extent based file storage (2^64 max file size) | ||
22 | * Space efficient packing of small files | ||
23 | * Space efficient indexed directories | ||
24 | * Dynamic inode allocation | ||
25 | * Writable snapshots | ||
26 | * Subvolumes (separate internal filesystem roots) | ||
27 | * Object level mirroring and striping | ||
28 | * Checksums on data and metadata (multiple algorithms available) | ||
29 | * Compression | ||
30 | * Integrated multiple device support, with several raid algorithms | ||
31 | * Online filesystem check (not yet implemented) | ||
32 | * Very fast offline filesystem check | ||
33 | * Efficient incremental backup and FS mirroring (not yet implemented) | ||
34 | * Online filesystem defragmentation | ||
35 | |||
36 | |||
37 | |||
38 | MAILING LIST | ||
39 | ============ | ||
40 | |||
41 | There is a Btrfs mailing list hosted on vger.kernel.org. You can | ||
42 | find details on how to subscribe here: | ||
43 | |||
44 | http://vger.kernel.org/vger-lists.html#linux-btrfs | ||
45 | |||
46 | Mailing list archives are available from gmane: | ||
47 | |||
48 | http://dir.gmane.org/gmane.comp.file-systems.btrfs | ||
49 | |||
50 | |||
51 | |||
52 | IRC | ||
53 | === | ||
54 | |||
55 | Discussion of Btrfs also occurs on the #btrfs channel of the Freenode | ||
56 | IRC network. | ||
57 | |||
58 | |||
59 | |||
60 | UTILITIES | ||
61 | ========= | ||
62 | |||
63 | Userspace tools for creating and manipulating Btrfs file systems are | ||
64 | available from the git repository at the following location: | ||
65 | |||
66 | http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs-unstable.git | ||
67 | git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git | ||
68 | |||
69 | These include the following tools: | ||
70 | |||
71 | mkfs.btrfs: create a filesystem | ||
72 | |||
73 | btrfsctl: control program to create snapshots and subvolumes: | ||
74 | |||
75 | mount /dev/sda2 /mnt | ||
76 | btrfsctl -s new_subvol_name /mnt | ||
77 | btrfsctl -s snapshot_of_default /mnt/default | ||
78 | btrfsctl -s snapshot_of_new_subvol /mnt/new_subvol_name | ||
79 | btrfsctl -s snapshot_of_a_snapshot /mnt/snapshot_of_new_subvol | ||
80 | ls /mnt | ||
81 | default snapshot_of_a_snapshot snapshot_of_new_subvol | ||
82 | new_subvol_name snapshot_of_default | ||
83 | |||
84 | Snapshots and subvolumes cannot be deleted right now, but you can | ||
85 | rm -rf all the files and directories inside them. | ||
86 | |||
87 | btrfsck: do a limited check of the FS extent trees. | ||
88 | |||
89 | btrfs-debug-tree: print all of the FS metadata in text form. Example: | ||
90 | |||
91 | btrfs-debug-tree /dev/sda2 >& big_output_file | ||
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 174eaff7ded9..cec829bc7291 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt | |||
@@ -58,13 +58,22 @@ Note: More extensive information for getting started with ext4 can be | |||
58 | 58 | ||
59 | # mount -t ext4 /dev/hda1 /wherever | 59 | # mount -t ext4 /dev/hda1 /wherever |
60 | 60 | ||
61 | - When comparing performance with other filesystems, remember that | 61 | - When comparing performance with other filesystems, it's always |
62 | ext3/4 by default offers higher data integrity guarantees than most. | 62 | important to try multiple workloads; very often a subtle change in a |
63 | So when comparing with a metadata-only journalling filesystem, such | 63 | workload parameter can completely change the ranking of which |
64 | as ext3, use `mount -o data=writeback'. And you might as well use | 64 | filesystems do well compared to others. When comparing versus ext3, |
65 | `mount -o nobh' too along with it. Making the journal larger than | 65 | note that ext4 enables write barriers by default, while ext3 does |
66 | the mke2fs default often helps performance with metadata-intensive | 66 | not enable write barriers by default. So it is useful to |
67 | workloads. | 67 | explicitly specify whether barriers are enabled or not via the |
68 | '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems | ||
69 | for a fair comparison. When tuning ext3 for best benchmark numbers, | ||
70 | it is often worthwhile to try changing the data journaling mode; '-o | ||
71 | data=writeback,nobh' can be faster for some workloads. (Note | ||
72 | however that running mounted with data=writeback can potentially | ||
73 | leave stale data exposed in recently written files in case of an | ||
74 | unclean shutdown, which could be a security exposure in some | ||
75 | situations.) Configuring the filesystem with a large journal can | ||
76 | also be helpful for metadata-intensive workloads. | ||
68 | 77 | ||
69 | 2. Features | 78 | 2. Features |
70 | =========== | 79 | =========== |
@@ -74,7 +83,7 @@ Note: More extensive information for getting started with ext4 can be | |||
74 | * ability to use filesystems > 16TB (e2fsprogs support not available yet) | 83 | * ability to use filesystems > 16TB (e2fsprogs support not available yet) |
75 | * extent format reduces metadata overhead (RAM, IO for access, transactions) | 84 | * extent format reduces metadata overhead (RAM, IO for access, transactions) |
76 | * extent format more robust in face of on-disk corruption due to magics, | 85 | * extent format more robust in face of on-disk corruption due to magics, |
77 | * internal redunancy in tree | 86 | * internal redundancy in tree |
78 | * improved file allocation (multi-block alloc) | 87 | * improved file allocation (multi-block alloc) |
79 | * fix 32000 subdirectory limit | 88 | * fix 32000 subdirectory limit |
80 | * nsec timestamps for mtime, atime, ctime, create time | 89 | * nsec timestamps for mtime, atime, ctime, create time |
@@ -116,10 +125,11 @@ grouping of bitmaps and inode tables. Some test results available here: | |||
116 | When mounting an ext4 filesystem, the following option are accepted: | 125 | When mounting an ext4 filesystem, the following option are accepted: |
117 | (*) == default | 126 | (*) == default |
118 | 127 | ||
119 | extents (*) ext4 will use extents to address file data. The | 128 | ro Mount filesystem read only. Note that ext4 will |
120 | file system will no longer be mountable by ext3. | 129 | replay the journal (and thus write to the |
121 | 130 | partition) even when mounted "read only". The | |
122 | noextents ext4 will not use extents for newly created files | 131 | mount options "ro,noload" can be used to prevent |
132 | writes to the filesystem. | ||
123 | 133 | ||
124 | journal_checksum Enable checksumming of the journal transactions. | 134 | journal_checksum Enable checksumming of the journal transactions. |
125 | This will allow the recovery code in e2fsck and the | 135 | This will allow the recovery code in e2fsck and the |
@@ -134,17 +144,17 @@ journal_async_commit Commit block can be written to disk without waiting | |||
134 | journal=update Update the ext4 file system's journal to the current | 144 | journal=update Update the ext4 file system's journal to the current |
135 | format. | 145 | format. |
136 | 146 | ||
137 | journal=inum When a journal already exists, this option is ignored. | ||
138 | Otherwise, it specifies the number of the inode which | ||
139 | will represent the ext4 file system's journal file. | ||
140 | |||
141 | journal_dev=devnum When the external journal device's major/minor numbers | 147 | journal_dev=devnum When the external journal device's major/minor numbers |
142 | have changed, this option allows the user to specify | 148 | have changed, this option allows the user to specify |
143 | the new journal location. The journal device is | 149 | the new journal location. The journal device is |
144 | identified through its new major/minor numbers encoded | 150 | identified through its new major/minor numbers encoded |
145 | in devnum. | 151 | in devnum. |
146 | 152 | ||
147 | noload Don't load the journal on mounting. | 153 | noload Don't load the journal on mounting. Note that |
154 | if the filesystem was not unmounted cleanly, | ||
155 | skipping the journal replay will lead to the | ||
156 | filesystem containing inconsistencies that can | ||
157 | lead to any number of problems. | ||
148 | 158 | ||
149 | data=journal All data are committed into the journal prior to being | 159 | data=journal All data are committed into the journal prior to being |
150 | written into the main file system. | 160 | written into the main file system. |
@@ -219,9 +229,12 @@ minixdf Make 'df' act like Minix. | |||
219 | 229 | ||
220 | debug Extra debugging information is sent to syslog. | 230 | debug Extra debugging information is sent to syslog. |
221 | 231 | ||
222 | errors=remount-ro(*) Remount the filesystem read-only on an error. | 232 | errors=remount-ro Remount the filesystem read-only on an error. |
223 | errors=continue Keep going on a filesystem error. | 233 | errors=continue Keep going on a filesystem error. |
224 | errors=panic Panic and halt the machine if an error occurs. | 234 | errors=panic Panic and halt the machine if an error occurs. |
235 | (These mount options override the errors behavior | ||
236 | specified in the superblock, which can be configured | ||
237 | using tune2fs) | ||
225 | 238 | ||
226 | data_err=ignore(*) Just print an error message if an error occurs | 239 | data_err=ignore(*) Just print an error message if an error occurs |
227 | in a file data buffer in ordered mode. | 240 | in a file data buffer in ordered mode. |
@@ -261,6 +274,42 @@ delalloc (*) Deferring block allocation until write-out time. | |||
261 | nodelalloc Disable delayed allocation. Blocks are allocated | 274 | nodelalloc Disable delayed allocation. Blocks are allocated |
262 | when data is copied from user to page cache. | 275 | when data is copied from user to page cache. |
263 | 276 | ||
277 | max_batch_time=usec Maximum amount of time ext4 should wait for | ||
278 | additional filesystem operations to be batched | ||
279 | together with a synchronous write operation. | ||
280 | Since a synchronous write operation is going to | ||
281 | force a commit and then a wait for the I/O to | ||
282 | complete, it doesn't cost much, and can be a | ||
283 | huge throughput win, we wait for a small amount | ||
284 | of time to see if any other transactions can | ||
285 | piggyback on the synchronous write. The | ||
286 | algorithm used is designed to automatically tune | ||
287 | for the speed of the disk, by measuring the | ||
288 | amount of time (on average) that it takes to | ||
289 | finish committing a transaction. Call this time | ||
290 | the "commit time". If the time that the | ||
291 | transaction has been running is less than the | ||
292 | commit time, ext4 will try sleeping for the | ||
293 | commit time to see if other operations will join | ||
294 | the transaction. The commit time is capped by | ||
295 | the max_batch_time, which defaults to 15000us | ||
296 | (15ms). This optimization can be turned off | ||
297 | entirely by setting max_batch_time to 0. | ||
298 | |||
299 | min_batch_time=usec This parameter sets the commit time (as | ||
300 | described above) to be at least min_batch_time. | ||
301 | It defaults to zero microseconds. Increasing | ||
302 | this parameter may improve the throughput of | ||
303 | multi-threaded, synchronous workloads on very | ||
304 | fast disks, at the cost of increasing latency. | ||
305 | |||
306 | journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the | ||
307 | highest priority) which should be used for I/O | ||
308 | operations submitted by kjournald2 during a | ||
309 | commit operation. This defaults to 3, which is | ||
310 | a slightly higher priority than the default I/O | ||
311 | priority. | ||
312 | |||
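The max_batch_time and min_batch_time descriptions added in this hunk can be read as a small clamping rule. The following is a minimal sketch of that rule as the text states it, not the kernel's implementation; the function name and arguments are hypothetical, and the handling of min_batch_time greater than max_batch_time is an assumption.

```python
def batch_sleep_usec(avg_commit_usec, txn_running_usec,
                     min_batch_time=0, max_batch_time=15000):
    """Return how long (in microseconds) to wait for more operations.

    avg_commit_usec is the measured average "commit time"; the defaults
    mirror the documented defaults (0us and 15000us).
    """
    if max_batch_time == 0:
        return 0  # the optimization is turned off entirely
    # The commit time is at least min_batch_time and capped by max_batch_time.
    commit_time = min(max(avg_commit_usec, min_batch_time), max_batch_time)
    # Sleep only if the transaction has been running for less than the commit time.
    if txn_running_usec < commit_time:
        return commit_time
    return 0
```

With the defaults, a disk whose commits average 50000us is capped at 15000us, and raising min_batch_time forces a wait even on a very fast disk.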
264 | Data Mode | 313 | Data Mode |
265 | ========= | 314 | ========= |
266 | There are 3 different data modes: | 315 | There are 3 different data modes: |
diff --git a/Documentation/hwmon/abituguru-datasheet b/Documentation/hwmon/abituguru-datasheet index 4d184f2db0ea..d9251efdcec7 100644 --- a/Documentation/hwmon/abituguru-datasheet +++ b/Documentation/hwmon/abituguru-datasheet | |||
@@ -121,7 +121,7 @@ Once all bytes have been read data will hold 0x09, but there is no reason to | |||
121 | test for this. Notice that the number of bytes is bank address dependent see | 121 | test for this. Notice that the number of bytes is bank address dependent see |
122 | above and below. | 122 | above and below. |
123 | 123 | ||
124 | After completing a successfull read it is advised to put the uGuru back in | 124 | After completing a successful read it is advised to put the uGuru back in |
125 | ready mode, so that it is ready for the next read / write cycle. This way | 125 | ready mode, so that it is ready for the next read / write cycle. This way |
126 | if your program / driver is unloaded and later loaded again the detection | 126 | if your program / driver is unloaded and later loaded again the detection |
127 | algorithm described above will still work. | 127 | algorithm described above will still work. |
@@ -141,7 +141,7 @@ don't ask why this is the way it is. | |||
141 | 141 | ||
142 | Once DATA holds 0x01 read CMD it should hold 0xAC now. | 142 | Once DATA holds 0x01 read CMD it should hold 0xAC now. |
143 | 143 | ||
144 | After completing a successfull write it is advised to put the uGuru back in | 144 | After completing a successful write it is advised to put the uGuru back in |
145 | ready mode, so that it is ready for the next read / write cycle. This way | 145 | ready mode, so that it is ready for the next read / write cycle. This way |
146 | if your program / driver is unloaded and later loaded again the detection | 146 | if your program / driver is unloaded and later loaded again the detection |
147 | algorithm described above will still work. | 147 | algorithm described above will still work. |
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 532eacbbed62..8511d3532c27 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -141,6 +141,7 @@ and is between 256 and 4096 characters. It is defined in the file | |||
141 | ht -- run only enough ACPI to enable Hyper Threading | 141 | ht -- run only enough ACPI to enable Hyper Threading |
142 | strict -- Be less tolerant of platforms that are not | 142 | strict -- Be less tolerant of platforms that are not |
143 | strictly ACPI specification compliant. | 143 | strictly ACPI specification compliant. |
144 | rsdt -- prefer RSDT over (default) XSDT | ||
144 | 145 | ||
145 | See also Documentation/power/pm.txt, pci=noacpi | 146 | See also Documentation/power/pm.txt, pci=noacpi |
146 | 147 | ||
@@ -151,16 +152,20 @@ and is between 256 and 4096 characters. It is defined in the file | |||
151 | default: 0 | 152 | default: 0 |
152 | 153 | ||
153 | acpi_sleep= [HW,ACPI] Sleep options | 154 | acpi_sleep= [HW,ACPI] Sleep options |
154 | Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, old_ordering } | 155 | Format: { s3_bios, s3_mode, s3_beep, s4_nohwsig, |
155 | See Documentation/power/video.txt for s3_bios and s3_mode. | 156 | old_ordering, s4_nonvs } |
157 | See Documentation/power/video.txt for information on | ||
158 | s3_bios and s3_mode. | ||
156 | s3_beep is for debugging; it makes the PC's speaker beep | 159 | s3_beep is for debugging; it makes the PC's speaker beep |
157 | as soon as the kernel's real-mode entry point is called. | 160 | as soon as the kernel's real-mode entry point is called. |
158 | s4_nohwsig prevents ACPI hardware signature from being | 161 | s4_nohwsig prevents ACPI hardware signature from being |
159 | used during resume from hibernation. | 162 | used during resume from hibernation. |
160 | old_ordering causes the ACPI 1.0 ordering of the _PTS | 163 | old_ordering causes the ACPI 1.0 ordering of the _PTS |
161 | control method, wrt putting devices into low power | 164 | control method, with respect to putting devices into |
162 | states, to be enforced (the ACPI 2.0 ordering of _PTS is | 165 | low power states, to be enforced (the ACPI 2.0 ordering |
163 | used by default). | 166 | of _PTS is used by default). |
167 | s4_nonvs prevents the kernel from saving/restoring the | ||
168 | ACPI NVS memory during hibernation. | ||
164 | 169 | ||
165 | acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode | 170 | acpi_sci= [HW,ACPI] ACPI System Control Interrupt trigger mode |
166 | Format: { level | edge | high | low } | 171 | Format: { level | edge | high | low } |
@@ -195,7 +200,7 @@ and is between 256 and 4096 characters. It is defined in the file | |||
195 | acpi_skip_timer_override [HW,ACPI] | 200 | acpi_skip_timer_override [HW,ACPI] |
196 | Recognize and ignore IRQ0/pin2 Interrupt Override. | 201 | Recognize and ignore IRQ0/pin2 Interrupt Override. |
197 | For broken nForce2 BIOS resulting in XT-PIC timer. | 202 | For broken nForce2 BIOS resulting in XT-PIC timer. |
198 | acpi_use_timer_override [HW,ACPI} | 203 | acpi_use_timer_override [HW,ACPI] |
199 | Use timer override. For some broken Nvidia NF5 boards | 204 | Use timer override. For some broken Nvidia NF5 boards |
200 | that require a timer override, but don't have | 205 | that require a timer override, but don't have |
201 | HPET | 206 | HPET |
@@ -829,8 +834,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
829 | 834 | ||
830 | hlt [BUGS=ARM,SH] | 835 | hlt [BUGS=ARM,SH] |
831 | 836 | ||
832 | hvc_iucv= [S390] Number of z/VM IUCV Hypervisor console (HVC) | 837 | hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) |
833 | back-ends. Valid parameters: 0..8 | 838 | terminal devices. Valid values: 0..8 |
834 | 839 | ||
835 | i8042.debug [HW] Toggle i8042 debug mode | 840 | i8042.debug [HW] Toggle i8042 debug mode |
836 | i8042.direct [HW] Put keyboard port into non-translated mode | 841 | i8042.direct [HW] Put keyboard port into non-translated mode |
@@ -878,17 +883,19 @@ and is between 256 and 4096 characters. It is defined in the file | |||
878 | See Documentation/ide/ide.txt. | 883 | See Documentation/ide/ide.txt. |
879 | 884 | ||
880 | idle= [X86] | 885 | idle= [X86] |
881 | Format: idle=poll or idle=mwait, idle=halt, idle=nomwait | 886 | Format: idle=poll, idle=mwait, idle=halt, idle=nomwait |
882 | Poll forces a polling idle loop that can slightly improves the performance | 887 | Poll forces a polling idle loop that can slightly |
883 | of waking up a idle CPU, but will use a lot of power and make the system | 888 | improve the performance of waking up an idle CPU, but |
884 | run hot. Not recommended. | 889 | will use a lot of power and make the system run hot. |
885 | idle=mwait. On systems which support MONITOR/MWAIT but the kernel chose | 890 | Not recommended. |
886 | to not use it because it doesn't save as much power as a normal idle | 891 | idle=mwait: On systems which support MONITOR/MWAIT but |
887 | loop use the MONITOR/MWAIT idle loop anyways. Performance should be the same | 892 | the kernel chose to not use it because it doesn't save |
888 | as idle=poll. | 893 | as much power as a normal idle loop, use the |
889 | idle=halt. Halt is forced to be used for CPU idle. | 894 | MONITOR/MWAIT idle loop anyways. Performance should be |
895 | the same as idle=poll. | ||
896 | idle=halt: Halt is forced to be used for CPU idle. | ||
890 | In such case C2/C3 won't be used again. | 897 | In such case C2/C3 won't be used again. |
891 | idle=nomwait. Disable mwait for CPU C-states | 898 | idle=nomwait: Disable mwait for CPU C-states |
892 | 899 | ||
893 | ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem | 900 | ide-pci-generic.all-generic-ide [HW] (E)IDE subsystem |
894 | Claim all unknown PCI IDE storage controllers. | 901 | Claim all unknown PCI IDE storage controllers. |
@@ -1074,8 +1081,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1074 | lapic [X86-32,APIC] Enable the local APIC even if BIOS | 1081 | lapic [X86-32,APIC] Enable the local APIC even if BIOS |
1075 | disabled it. | 1082 | disabled it. |
1076 | 1083 | ||
1077 | lapic_timer_c2_ok [X86-32,x86-64,APIC] trust the local apic timer in | 1084 | lapic_timer_c2_ok [X86-32,x86-64,APIC] trust the local apic timer |
1078 | C2 power state. | 1085 | in C2 power state. |
1079 | 1086 | ||
1080 | libata.dma= [LIBATA] DMA control | 1087 | libata.dma= [LIBATA] DMA control |
1081 | libata.dma=0 Disable all PATA and SATA DMA | 1088 | libata.dma=0 Disable all PATA and SATA DMA |
@@ -1562,6 +1569,9 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1562 | 1569 | ||
1563 | nosoftlockup [KNL] Disable the soft-lockup detector. | 1570 | nosoftlockup [KNL] Disable the soft-lockup detector. |
1564 | 1571 | ||
1572 | noswapaccount [KNL] Disable accounting of swap in memory resource | ||
1573 | controller. (See Documentation/controllers/memory.txt) | ||
1574 | |||
1565 | nosync [HW,M68K] Disables sync negotiation for all devices. | 1575 | nosync [HW,M68K] Disables sync negotiation for all devices. |
1566 | 1576 | ||
1567 | notsc [BUGS=X86-32] Disable Time Stamp Counter | 1577 | notsc [BUGS=X86-32] Disable Time Stamp Counter |
@@ -2300,7 +2310,8 @@ and is between 256 and 4096 characters. It is defined in the file | |||
2300 | 2310 | ||
2301 | thermal.psv= [HW,ACPI] | 2311 | thermal.psv= [HW,ACPI] |
2302 | -1: disable all passive trip points | 2312 | -1: disable all passive trip points |
2303 | <degrees C>: override all passive trip points to this value | 2313 | <degrees C>: override all passive trip points to this |
2314 | value | ||
2304 | 2315 | ||
2305 | thermal.tzp= [HW,ACPI] | 2316 | thermal.tzp= [HW,ACPI] |
2306 | Specify global default ACPI thermal zone polling rate | 2317 | Specify global default ACPI thermal zone polling rate |
diff --git a/Documentation/powerpc/dts-bindings/4xx/ndfc.txt b/Documentation/powerpc/dts-bindings/4xx/ndfc.txt new file mode 100644 index 000000000000..869f0b5f16e8 --- /dev/null +++ b/Documentation/powerpc/dts-bindings/4xx/ndfc.txt | |||
@@ -0,0 +1,39 @@ | |||
1 | AMCC NDFC (NanD Flash Controller) | ||
2 | |||
3 | Required properties: | ||
4 | - compatible : "ibm,ndfc". | ||
5 | - reg : should specify chip select and size used for the chip (0x2000). | ||
6 | |||
7 | Optional properties: | ||
8 | - ccr : NDFC config and control register value (default 0). | ||
9 | - bank-settings : NDFC bank configuration register value (default 0). | ||
10 | |||
11 | Notes: | ||
12 | - partition(s) - follows the OF MTD standard for partitions | ||
13 | |||
14 | Example: | ||
15 | |||
16 | ndfc@1,0 { | ||
17 | compatible = "ibm,ndfc"; | ||
18 | reg = <0x00000001 0x00000000 0x00002000>; | ||
19 | ccr = <0x00001000>; | ||
20 | bank-settings = <0x80002222>; | ||
21 | #address-cells = <1>; | ||
22 | #size-cells = <1>; | ||
23 | |||
24 | nand { | ||
25 | #address-cells = <1>; | ||
26 | #size-cells = <1>; | ||
27 | |||
28 | partition@0 { | ||
29 | label = "kernel"; | ||
30 | reg = <0x00000000 0x00200000>; | ||
31 | }; | ||
32 | partition@200000 { | ||
33 | label = "root"; | ||
34 | reg = <0x00200000 0x03E00000>; | ||
35 | }; | ||
36 | }; | ||
37 | }; | ||
38 | |||
39 | |||
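The partition offsets and sizes in the ndfc example above can be checked with a little arithmetic. This is a sketch only; the helper name is hypothetical and the tuples simply copy the `reg` values from the example.

```python
# (label, offset, size) copied from the partition@... nodes in the example.
partitions = [
    ("kernel", 0x00000000, 0x00200000),
    ("root",   0x00200000, 0x03E00000),
]

def check_contiguous(parts):
    """Verify each partition starts where the previous one ends; return total size."""
    end = 0
    for label, offset, size in parts:
        if offset != end:
            raise ValueError("gap or overlap before partition %r" % label)
        end = offset + size
    return end

total = check_contiguous(partitions)
```

Here the two partitions are back to back and together cover 0x04000000 bytes (64 MiB).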
diff --git a/Documentation/powerpc/dts-bindings/fsl/board.txt b/Documentation/powerpc/dts-bindings/fsl/board.txt index 81a917ef96e9..6c974d28eeb4 100644 --- a/Documentation/powerpc/dts-bindings/fsl/board.txt +++ b/Documentation/powerpc/dts-bindings/fsl/board.txt | |||
@@ -18,7 +18,7 @@ This is the memory-mapped registers for on board FPGA. | |||
18 | 18 | ||
19 | Required properties: | 19 | Required properties: |
20 | - compatible : should be "fsl,fpga-pixis". | 20 | - compatible : should be "fsl,fpga-pixis". |
21 | - reg : should contain the address and the lenght of the FPPGA register | 21 | - reg : should contain the address and the length of the FPGA register |
22 | set. | 22 | set. |
23 | 23 | ||
24 | Example (MPC8610HPCD): | 24 | Example (MPC8610HPCD): |
@@ -27,3 +27,33 @@ Example (MPC8610HPCD): | |||
27 | compatible = "fsl,fpga-pixis"; | 27 | compatible = "fsl,fpga-pixis"; |
28 | reg = <0xe8000000 32>; | 28 | reg = <0xe8000000 32>; |
29 | }; | 29 | }; |
30 | |||
31 | * Freescale BCSR GPIO banks | ||
32 | |||
33 | Some BCSR registers act as simple GPIO controllers, each such | ||
34 | register can be represented by the gpio-controller node. | ||
35 | |||
36 | Required properties: | ||
37 | - compatible : Should be "fsl,<board>-bcsr-gpio". | ||
38 | - reg : Should contain the address and the length of the GPIO bank | ||
39 | register. | ||
40 | - #gpio-cells : Should be two. The first cell is the pin number and the | ||
41 | second cell is used to specify optional parameters (currently unused). | ||
42 | - gpio-controller : Marks the port as GPIO controller. | ||
43 | |||
44 | Example: | ||
45 | |||
46 | bcsr@1,0 { | ||
47 | #address-cells = <1>; | ||
48 | #size-cells = <1>; | ||
49 | compatible = "fsl,mpc8360mds-bcsr"; | ||
50 | reg = <1 0 0x8000>; | ||
51 | ranges = <0 1 0 0x8000>; | ||
52 | |||
53 | bcsr13: gpio-controller@d { | ||
54 | #gpio-cells = <2>; | ||
55 | compatible = "fsl,mpc8360mds-bcsr-gpio"; | ||
56 | reg = <0xd 1>; | ||
57 | gpio-controller; | ||
58 | }; | ||
59 | }; | ||
diff --git a/Documentation/scsi/scsi_fc_transport.txt b/Documentation/scsi/scsi_fc_transport.txt index 38d324d62b25..e5b071d46619 100644 --- a/Documentation/scsi/scsi_fc_transport.txt +++ b/Documentation/scsi/scsi_fc_transport.txt | |||
@@ -191,7 +191,7 @@ Vport States: | |||
191 | This is equivalent to a driver "attach" on an adapter, which is | 191 | This is equivalent to a driver "attach" on an adapter, which is |
192 | independent of the adapter's link state. | 192 | independent of the adapter's link state. |
193 | - Instantiation of the vport on the FC link via ELS traffic, etc. | 193 | - Instantiation of the vport on the FC link via ELS traffic, etc. |
194 | This is equivalent to a "link up" and successfull link initialization. | 194 | This is equivalent to a "link up" and successful link initialization. |
195 | Further information can be found in the interfaces section below for | 195 | Further information can be found in the interfaces section below for |
196 | Vport Creation. | 196 | Vport Creation. |
197 | 197 | ||
@@ -320,7 +320,7 @@ Vport Creation: | |||
320 | This is equivalent to a driver "attach" on an adapter, which is | 320 | This is equivalent to a driver "attach" on an adapter, which is |
321 | independent of the adapter's link state. | 321 | independent of the adapter's link state. |
322 | - Instantiation of the vport on the FC link via ELS traffic, etc. | 322 | - Instantiation of the vport on the FC link via ELS traffic, etc. |
323 | This is equivalent to a "link up" and successfull link initialization. | 323 | This is equivalent to a "link up" and successful link initialization. |
324 | 324 | ||
325 | The LLDD's vport_create() function will not synchronously wait for both | 325 | The LLDD's vport_create() function will not synchronously wait for both |
326 | parts to be fully completed before returning. It must validate that the | 326 | parts to be fully completed before returning. It must validate that the |
diff --git a/Documentation/w1/masters/00-INDEX b/Documentation/w1/masters/00-INDEX index 7b0ceaaad7af..d63fa024ac05 100644 --- a/Documentation/w1/masters/00-INDEX +++ b/Documentation/w1/masters/00-INDEX | |||
@@ -4,5 +4,7 @@ ds2482 | |||
4 | - The Maxim/Dallas Semiconductor DS2482 provides 1-wire busses. | 4 | - The Maxim/Dallas Semiconductor DS2482 provides 1-wire busses. |
5 | ds2490 | 5 | ds2490 |
6 | - The Maxim/Dallas Semiconductor DS2490 builds USB <-> W1 bridges. | 6 | - The Maxim/Dallas Semiconductor DS2490 builds USB <-> W1 bridges. |
7 | mxc_w1 | ||
8 | - W1 master controller driver found on Freescale MX2/MX3 SoCs | ||
7 | w1-gpio | 9 | w1-gpio |
8 | - GPIO 1-wire bus master driver. | 10 | - GPIO 1-wire bus master driver. |
diff --git a/Documentation/w1/masters/mxc-w1 b/Documentation/w1/masters/mxc-w1 new file mode 100644 index 000000000000..97f6199a7f39 --- /dev/null +++ b/Documentation/w1/masters/mxc-w1 | |||
@@ -0,0 +1,11 @@ | |||
1 | Kernel driver mxc_w1 | ||
2 | ==================== | ||
3 | |||
4 | Supported chips: | ||
5 | * Freescale MX27, MX31 and probably other i.MX SoCs | ||
6 | Datasheets: | ||
7 | http://www.freescale.com/files/32bit/doc/data_sheet/MCIMX31.pdf?fpsp=1 | ||
8 | http://www.freescale.com/files/dsp/MCIMX27.pdf?fpsp=1 | ||
9 | |||
10 | Author: Originally based on Freescale code, prepared for mainline by | ||
11 | Sascha Hauer <s.hauer@pengutronix.de> | ||
diff --git a/Documentation/w1/w1.netlink b/Documentation/w1/w1.netlink index 3640c7c87d45..804445f745ed 100644 --- a/Documentation/w1/w1.netlink +++ b/Documentation/w1/w1.netlink | |||
@@ -5,69 +5,157 @@ Message types. | |||
5 | ============= | 5 | ============= |
6 | 6 | ||
7 | There are three types of messages between w1 core and userspace: | 7 | There are three types of messages between w1 core and userspace: |
8 | 1. Events. They are generated each time new master or slave device found | 8 | 1. Events. They are generated each time new master or slave device |
9 | either due to automatic or requested search. | 9 | found either due to automatic or requested search. |
10 | 2. Userspace commands. Includes read/write and search/alarm search comamnds. | 10 | 2. Userspace commands. |
11 | 3. Replies to userspace commands. | 11 | 3. Replies to userspace commands. |
12 | 12 | ||
13 | 13 | ||
14 | Protocol. | 14 | Protocol. |
15 | ======== | 15 | ======== |
16 | 16 | ||
17 | [struct cn_msg] - connector header. It's length field is equal to size of the attached data. | 17 | [struct cn_msg] - connector header. |
18 | Its length field is equal to size of the attached data | ||
18 | [struct w1_netlink_msg] - w1 netlink header. | 19 | [struct w1_netlink_msg] - w1 netlink header. |
19 | __u8 type - message type. | 20 | __u8 type - message type. |
20 | W1_SLAVE_ADD/W1_SLAVE_REMOVE - slave add/remove events. | 21 | W1_LIST_MASTERS |
21 | W1_MASTER_ADD/W1_MASTER_REMOVE - master add/remove events. | 22 | list current bus masters |
22 | W1_MASTER_CMD - userspace command for bus master device (search/alarm search). | 23 | W1_SLAVE_ADD/W1_SLAVE_REMOVE |
23 | W1_SLAVE_CMD - userspace command for slave device (read/write/ search/alarm search | 24 | slave add/remove events |
24 | for bus master device where given slave device found). | 25 | W1_MASTER_ADD/W1_MASTER_REMOVE |
26 | master add/remove events | ||
27 | W1_MASTER_CMD | ||
28 | userspace command for bus master | ||
29 | device (search/alarm search) | ||
30 | W1_SLAVE_CMD | ||
31 | userspace command for slave device | ||
32 | (read/write/touch) | ||
25 | __u8 res - reserved | 33 | __u8 res - reserved |
26 | __u16 len - size of attached to this header data. | 34 | __u16 len - size of data attached to this header |
27 | union { | 35 | union { |
28 | __u8 id; - slave unique device id | 36 | __u8 id[8]; - slave unique device id |
29 | struct w1_mst { | 37 | struct w1_mst { |
30 | __u32 id; - master's id. | 38 | __u32 id; - master's id |
31 | __u32 res; - reserved | 39 | __u32 res; - reserved |
32 | } mst; | 40 | } mst; |
33 | } id; | 41 | } id; |
34 | 42 | ||
35 | [strucrt w1_netlink_cmd] - command for gived master or slave device. | 43 | [struct w1_netlink_cmd] - command for given master or slave device. |
36 | __u8 cmd - command opcode. | 44 | __u8 cmd - command opcode. |
37 | W1_CMD_READ - read command. | 45 | W1_CMD_READ - read command |
38 | W1_CMD_WRITE - write command. | 46 | W1_CMD_WRITE - write command |
39 | W1_CMD_SEARCH - search command. | 47 | W1_CMD_TOUCH - touch command |
40 | W1_CMD_ALARM_SEARCH - alarm search command. | 48 | (write and sample data back to userspace) |
49 | W1_CMD_SEARCH - search command | ||
50 | W1_CMD_ALARM_SEARCH - alarm search command | ||
41 | __u8 res - reserved | 51 | __u8 res - reserved |
42 | __u16 len - length of data for this command. | 52 | __u16 len - length of data for this command |
43 | For read command data must be allocated like for write command. | 53 | For read command data must be allocated like for write command |
44 | __u8 data[0] - data for this command. | 54 | __u8 data[0] - data for this command |
45 | 55 | ||
46 | 56 | ||
47 | Each connector message can include one or more w1_netlink_msg with zero of more attached w1_netlink_cmd messages. | 57 | Each connector message can include one or more w1_netlink_msg with |
58 | zero or more attached w1_netlink_cmd messages. | ||
48 | 59 | ||
49 | For event messages there are no w1_netlink_cmd embedded structures, only connector header | 60 | For event messages there are no w1_netlink_cmd embedded structures, |
50 | and w1_netlink_msg strucutre with "len" field being zero and filled type (one of event types) | 61 | only connector header and w1_netlink_msg structure with "len" field |
51 | and id - either 8 bytes of slave unique id in host order, or master's id, which is assigned | 62 | being zero and filled type (one of event types) and id: |
52 | to bus master device when it is added to w1 core. | 63 | either 8 bytes of slave unique id in host order, |
64 | or master's id, which is assigned to bus master device | ||
65 | when it is added to w1 core. | ||
66 | |||
67 | Currently replies to userspace commands are only generated for read | ||
68 | command request. One reply is generated exactly for one w1_netlink_cmd | ||
69 | read request. Replies are not combined when sent - i.e. a typical reply | ||
70 | message looks like the following: | ||
53 | 71 | ||
54 | Currently replies to userspace commands are only generated for read command request. | ||
55 | One reply is generated exactly for one w1_netlink_cmd read request. | ||
56 | Replies are not combined when sent - i.e. typical reply messages looks like the following: | ||
57 | [cn_msg][w1_netlink_msg][w1_netlink_cmd] | 72 | [cn_msg][w1_netlink_msg][w1_netlink_cmd] |
58 | cn_msg.len = sizeof(struct w1_netlink_msg) + sizeof(struct w1_netlink_cmd) + cmd->len; | 73 | cn_msg.len = sizeof(struct w1_netlink_msg) + |
74 | sizeof(struct w1_netlink_cmd) + | ||
75 | cmd->len; | ||
59 | w1_netlink_msg.len = sizeof(struct w1_netlink_cmd) + cmd->len; | 76 | w1_netlink_msg.len = sizeof(struct w1_netlink_cmd) + cmd->len; |
60 | w1_netlink_cmd.len = cmd->len; | 77 | w1_netlink_cmd.len = cmd->len; |
61 | 78 | ||
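The three length fields of a read reply nest inside each other, as the formulas above show. The sketch below computes them with Python's `struct` module; the packed layouts are an assumption derived from the field listings in this document (u8, u8, u16 plus an 8-byte id union for w1_netlink_msg; u8, u8, u16 for w1_netlink_cmd), and the names are hypothetical.

```python
import struct

# Assumed header layouts, taken from the field listings in this document.
W1_MSG = struct.Struct("=BBH8s")  # type, res, len, 8-byte id union -> 12 bytes
W1_CMD = struct.Struct("=BBH")    # cmd, res, len                   -> 4 bytes

def read_reply_lengths(data_len):
    """Return (cn_msg.len, w1_netlink_msg.len, w1_netlink_cmd.len)."""
    cmd_len = data_len                            # w1_netlink_cmd.len
    msg_len = W1_CMD.size + cmd_len               # w1_netlink_msg.len
    cn_len = W1_MSG.size + W1_CMD.size + cmd_len  # cn_msg.len
    return cn_len, msg_len, cmd_len
```

Under these assumed sizes, a 9-byte read yields the lengths (25, 13, 9).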
79 | Replies to W1_LIST_MASTERS should send a message back to the userspace | ||
80 | which will contain list of all registered master ids in the following | ||
81 | format: | ||
82 | |||
83 | cn_msg (CN_W1_IDX.CN_W1_VAL as id, len is equal to sizeof(struct | ||
84 | w1_netlink_msg) plus number of masters multiplied by 4) | ||
85 | w1_netlink_msg (type: W1_LIST_MASTERS, len is equal to | ||
86 | number of masters multiplied by 4 (u32 size)) | ||
87 | id0 ... idN | ||
88 | |||
89 | Each message is at most 4k in size, so if number of master devices | ||
90 | exceeds this, it will be split into several messages, | ||
91 | cn.seq will be increased for each one. | ||
92 | |||
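The W1_LIST_MASTERS reply body described above is a flat array of u32 master ids. A minimal sketch of splitting it, assuming host byte order and a hypothetical helper name:

```python
import struct

def parse_master_ids(payload):
    """Split a W1_LIST_MASTERS payload (id0 ... idN) into a list of u32 ids."""
    if len(payload) % 4:
        raise ValueError("payload length must be a multiple of 4 (u32 size)")
    count = len(payload) // 4
    # "=" selects native byte order without padding; host order is an assumption.
    return list(struct.unpack("=%dI" % count, payload))
```

The payload length passed in corresponds to w1_netlink_msg.len, so the number of masters in one message is that length divided by 4.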
93 | W1 search and alarm search commands. | ||
94 | request: | ||
95 | [cn_msg] | ||
96 | [w1_netlink_msg type = W1_MASTER_CMD | ||
97 | id is equal to the bus master id to use for searching] | ||
98 | [w1_netlink_cmd cmd = W1_CMD_SEARCH or W1_CMD_ALARM_SEARCH] | ||
99 | |||
100 | reply: | ||
101 | [cn_msg, ack = 1 and increasing, 0 means the last message, | ||
102 | seq is equal to the request seq] | ||
103 | [w1_netlink_msg type = W1_MASTER_CMD] | ||
104 | [w1_netlink_cmd cmd = W1_CMD_SEARCH or W1_CMD_ALARM_SEARCH | ||
105 | len is equal to number of IDs multiplied by 8] | ||
106 | [64bit-id0 ... 64bit-idN] | ||
107 | Length in each header corresponds to the size of the data behind it, so | ||
108 | w1_netlink_cmd->len = N * 8; where N is number of IDs in this message. | ||
109 | Can be zero. | ||
110 | w1_netlink_msg->len = sizeof(struct w1_netlink_cmd) + N * 8; | ||
111 | cn_msg->len = sizeof(struct w1_netlink_msg) + | ||
112 | sizeof(struct w1_netlink_cmd) + | ||
113 | N*8; | ||
114 | |||
115 | W1 reset command. | ||
116 | [cn_msg] | ||
117 | [w1_netlink_msg type = W1_MASTER_CMD | ||
118 | id is equal to the bus master id to use for searching] | ||
119 | [w1_netlink_cmd cmd = W1_CMD_RESET] | ||
120 | |||
121 | |||
122 | Command status replies. | ||
123 | ====================== | ||
124 | |||
125 | Each command (either root, master or slave with or without w1_netlink_cmd | ||
126 | structure) will be 'acked' by the w1 core. Format of the reply is the same | ||
127 | as request message except that length parameters do not account for data | ||
128 | requested by the user, i.e. read/write/touch IO requests will not contain | ||
129 | data, so w1_netlink_cmd.len will be 0, w1_netlink_msg.len will be size | ||
130 | of the w1_netlink_cmd structure and cn_msg.len will be equal to the sum | ||
131 | of the sizeof(struct w1_netlink_msg) and sizeof(struct w1_netlink_cmd). | ||
132 | If reply is generated for master or root command (which do not have | ||
133 | w1_netlink_cmd attached), reply will contain only cn_msg and w1_netlink_msg | ||
134 | structures. | ||
135 | |||
136 | w1_netlink_msg.status field will carry positive error value | ||
137 | (EINVAL for example) or zero in case of success. | ||
138 | |||
139 | All other fields in every structure will mirror the same parameters in the | ||
140 | request message (except lengths as described above). | ||
141 | |||
142 | Status reply is generated for every w1_netlink_cmd embedded in the | ||
143 | w1_netlink_msg, if there are no w1_netlink_cmd structures, | ||
144 | reply will be generated for the w1_netlink_msg. | ||
145 | |||
146 | All w1_netlink_cmd command structures are handled in every w1_netlink_msg, | ||
147 | even if there were errors, only length mismatch interrupts message processing. | ||
148 | |||
62 | 149 | ||
63 | Operation steps in w1 core when new command is received. | 150 | Operation steps in w1 core when new command is received. |
64 | ======================================================= | 151 | ======================================================= |
65 | 152 | ||
66 | When new message (w1_netlink_msg) is received w1 core detects if it is master of slave request, | 153 | When new message (w1_netlink_msg) is received w1 core detects if it is |
67 | according to w1_netlink_msg.type field. | 154 | master or slave request, according to w1_netlink_msg.type field. |
68 | Then master or slave device is searched for. | 155 | Then master or slave device is searched for. |
69 | When found, master device (requested or those one on where slave device is found) is locked. | 156 | When found, master device (requested one, or the one where slave device |
70 | If slave command is requested, then reset/select procedure is started to select given device. | 157 | is found) is locked. If slave command is requested, then reset/select |
158 | procedure is started to select given device. | ||
71 | 159 | ||
72 | Then all requested in w1_netlink_msg operations are performed one by one. | 160 | Then all requested in w1_netlink_msg operations are performed one by one. |
73 | If command requires reply (like read command) it is sent on command completion. | 161 | If command requires reply (like read command) it is sent on command completion. |
@@ -82,8 +170,8 @@ Connector [1] specific documentation. | |||
82 | Each connector message includes two u32 fields as "address". | 170 | Each connector message includes two u32 fields as "address". |
83 | w1 uses CN_W1_IDX and CN_W1_VAL defined in include/linux/connector.h header. | 171 | w1 uses CN_W1_IDX and CN_W1_VAL defined in include/linux/connector.h header. |
84 | Each message also includes sequence and acknowledge numbers. | 172 | Each message also includes sequence and acknowledge numbers. |
85 | Sequence number for event messages is appropriate bus master sequence number increased with | 173 | Sequence number for event messages is appropriate bus master sequence number |
86 | each event message sent "through" this master. | 174 | increased with each event message sent "through" this master. |
87 | Sequence number for userspace requests is set by userspace application. | 175 | Sequence number for userspace requests is set by userspace application. |
88 | Sequence number for reply is the same as was in request, and | 176 | Sequence number for reply is the same as was in request, and |
89 | acknowledge number is set to seq+1. | 177 | acknowledge number is set to seq+1. |
@@ -93,6 +181,6 @@ Additional documantion, source code examples. | |||
93 | ============================================ | 181 | ============================================ |
94 | 182 | ||
95 | 1. Documentation/connector | 183 | 1. Documentation/connector |
96 | 2. http://tservice.net.ru/~s0mbre/archive/w1 | 184 | 2. http://www.ioremap.net/archive/w1 |
97 | This archive includes userspace application w1d.c which | 185 | This archive includes userspace application w1d.c which uses |
98 | uses read/write/search commands for all master/slave devices found on the bus. | 186 | read/write/search commands for all master/slave devices found on the bus. |