author    Benjamin Herrenschmidt <benh@kernel.crashing.org>  2008-07-15 21:07:59 -0400
committer Benjamin Herrenschmidt <benh@kernel.crashing.org>  2008-07-15 21:07:59 -0400
commit    84c3d4aaec3338201b449034beac41635866bddf
tree      3412951682fb2dd4feb8a5532f8efbaf8b345933 /Documentation
parent    43d2548bb2ef7e6d753f91468a746784041e522d
parent    fafa3a3f16723997f039a0193997464d66dafd8f
Merge commit 'origin/master'
Manual merge of:

	arch/powerpc/Kconfig
	arch/powerpc/kernel/stacktrace.c
	arch/powerpc/mm/slice.c
	arch/ppc/kernel/smp.c

Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/IRQ-affinity.txt               37
-rw-r--r--  Documentation/RCU/NMI-RCU.txt                 3
-rw-r--r--  Documentation/RCU/RTFP.txt                  108
-rw-r--r--  Documentation/RCU/checklist.txt              89
-rw-r--r--  Documentation/RCU/torture.txt                48
-rw-r--r--  Documentation/RCU/whatisRCU.txt              58
-rw-r--r--  Documentation/cputopology.txt                26
-rw-r--r--  Documentation/feature-removal-schedule.txt    7
-rw-r--r--  Documentation/filesystems/ext4.txt          125
-rw-r--r--  Documentation/filesystems/gfs2-glocks.txt   114
-rw-r--r--  Documentation/filesystems/proc.txt           29
-rw-r--r--  Documentation/i2c/busses/i2c-i810            47
-rw-r--r--  Documentation/i2c/busses/i2c-prosavage       23
-rw-r--r--  Documentation/i2c/busses/i2c-savage4         26
-rw-r--r--  Documentation/i2c/fault-codes               127
-rw-r--r--  Documentation/i2c/smbus-protocol              4
-rw-r--r--  Documentation/i2c/writing-clients            51
-rw-r--r--  Documentation/kernel-parameters.txt           9
18 files changed, 662 insertions, 269 deletions
diff --git a/Documentation/IRQ-affinity.txt b/Documentation/IRQ-affinity.txt
index 938d7dd05490..b4a615b78403 100644
--- a/Documentation/IRQ-affinity.txt
+++ b/Documentation/IRQ-affinity.txt
@@ -1,17 +1,26 @@
+ChangeLog:
+	Started by Ingo Molnar <mingo@redhat.com>
+	Update by Max Krasnyansky <maxk@qualcomm.com>
 
-SMP IRQ affinity, started by Ingo Molnar <mingo@redhat.com>
-
+SMP IRQ affinity
 
 /proc/irq/IRQ#/smp_affinity specifies which target CPUs are permitted
 for a given IRQ source. It's a bitmask of allowed CPUs. It's not allowed
 to turn off all CPUs, and if an IRQ controller does not support IRQ
 affinity then the value will not change from the default 0xffffffff.
 
+/proc/irq/default_smp_affinity specifies default affinity mask that applies
+to all non-active IRQs. Once IRQ is allocated/activated its affinity bitmask
+will be set to the default mask. It can then be changed as described above.
+Default mask is 0xffffffff.
+
 Here is an example of restricting IRQ44 (eth1) to CPU0-3 then restricting
-the IRQ to CPU4-7 (this is an 8-CPU SMP box):
+it to CPU4-7 (this is an 8-CPU SMP box):
 
+[root@moon 44]# cd /proc/irq/44
 [root@moon 44]# cat smp_affinity
 ffffffff
+
 [root@moon 44]# echo 0f > smp_affinity
 [root@moon 44]# cat smp_affinity
 0000000f
@@ -21,17 +30,27 @@ PING hell (195.4.7.3): 56 data bytes
 --- hell ping statistics ---
 6029 packets transmitted, 6027 packets received, 0% packet loss
 round-trip min/avg/max = 0.1/0.1/0.4 ms
-[root@moon 44]# cat /proc/interrupts | grep 44:
- 44:          0       1785       1785       1783       1783          1
-1          0   IO-APIC-level  eth1
+[root@moon 44]# cat /proc/interrupts | grep 'CPU\|44:'
+           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
+ 44:       1068       1785       1785       1783          0          0          0          0   IO-APIC-level  eth1
+
+As can be seen from the line above IRQ44 was delivered only to the first four
+processors (0-3).
+Now lets restrict that IRQ to CPU(4-7).
+
 [root@moon 44]# echo f0 > smp_affinity
+[root@moon 44]# cat smp_affinity
+000000f0
 [root@moon 44]# ping -f h
 PING hell (195.4.7.3): 56 data bytes
 ..
 --- hell ping statistics ---
 2779 packets transmitted, 2777 packets received, 0% packet loss
 round-trip min/avg/max = 0.1/0.5/585.4 ms
-[root@moon 44]# cat /proc/interrupts | grep 44:
- 44:       1068       1785       1785       1784       1784       1069       1070       1069   IO-APIC-level  eth1
-[root@moon 44]#
+[root@moon 44]# cat /proc/interrupts | 'CPU\|44:'
+           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
+ 44:       1068       1785       1785       1783       1784       1069       1070       1069   IO-APIC-level  eth1
+
+This time around IRQ44 was delivered only to the last four processors.
+i.e counters for the CPU0-3 did not change.
 
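The smp_affinity values in the transcript above (0f for CPU0-3, f0 for CPU4-7) are hexadecimal bitmasks in which bit N stands for CPU N. A minimal Python sketch of that arithmetic follows. The helper names are illustrative only and are not part of any kernel interface, and the eight-digit formatting simply mirrors what the transcript prints. The counter values are copied from the two /proc/interrupts snapshots in the new text.

```python
def cpus_to_mask(cpus):
    """Return the bitmask with bit N set for every CPU number N in cpus."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def mask_to_cpus(mask):
    """Return the sorted list of CPU numbers whose bits are set in mask."""
    return [cpu for cpu in range(mask.bit_length()) if (mask >> cpu) & 1]


def format_mask(mask):
    """Format a mask as eight hex digits, as the transcript shows it."""
    return f"{mask:08x}"


# Per-CPU counters for IRQ44 from the two /proc/interrupts snapshots
# in the new text of the diff (after "echo 0f" and after "echo f0").
before = [1068, 1785, 1785, 1783, 0, 0, 0, 0]
after = [1068, 1785, 1785, 1783, 1784, 1069, 1070, 1069]
changed = [cpu for cpu, (b, a) in enumerate(zip(before, after)) if a != b]

print(format_mask(cpus_to_mask(range(0, 4))))  # 0000000f
print(format_mask(cpus_to_mask(range(4, 8))))  # 000000f0
print(mask_to_cpus(int("f0", 16)))             # [4, 5, 6, 7]
print(changed)                                 # [4, 5, 6, 7]
```

The last line shows the same thing the prose states: only the counters for the CPUs named by the f0 mask moved between the two snapshots.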
diff --git a/Documentation/RCU/NMI-RCU.txt b/Documentation/RCU/NMI-RCU.txt
index c64158ecde43..a6d32e65d222 100644
--- a/Documentation/RCU/NMI-RCU.txt
+++ b/Documentation/RCU/NMI-RCU.txt
@@ -93,6 +93,9 @@ Since NMI handlers disable preemption, synchronize_sched() is guaranteed
 not to return until all ongoing NMI handlers exit. It is therefore safe
 to free up the handler's data as soon as synchronize_sched() returns.
 
+Important note: for this to work, the architecture in question must
+invoke irq_enter() and irq_exit() on NMI entry and exit, respectively.
+
 
 Answer to Quick Quiz
 
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt
index 39ad8f56783a..9f711d2df91b 100644
--- a/Documentation/RCU/RTFP.txt
+++ b/Documentation/RCU/RTFP.txt
@@ -52,6 +52,10 @@ of each iteration. Unfortunately, chaotic relaxation requires highly
 structured data, such as the matrices used in scientific programs, and
 is thus inapplicable to most data structures in operating-system kernels.
 
+In 1992, Henry (now Alexia) Massalin completed a dissertation advising
+parallel programmers to defer processing when feasible to simplify
+synchronization. RCU makes extremely heavy use of this advice.
+
 In 1993, Jacobson [Jacobson93] verbally described what is perhaps the
 simplest deferred-free technique: simply waiting a fixed amount of time
 before freeing blocks awaiting deferred free. Jacobson did not describe
@@ -138,6 +142,13 @@ blocking in read-side critical sections appeared [PaulEMcKenney2006c],
 Robert Olsson described an RCU-protected trie-hash combination
 [RobertOlsson2006a].
 
+2007 saw the journal version of the award-winning RCU paper from 2006
+[ThomasEHart2007a], as well as a paper demonstrating use of Promela
+and Spin to mechanically verify an optimization to Oleg Nesterov's
+QRCU [PaulEMcKenney2007QRCUspin], a design document describing
+preemptible RCU [PaulEMcKenney2007PreemptibleRCU], and the three-part
+LWN "What is RCU?" series [PaulEMcKenney2007WhatIsRCUFundamentally,
+PaulEMcKenney2008WhatIsRCUUsage, and PaulEMcKenney2008WhatIsRCUAPI].
 
 Bibtex Entries
 
@@ -202,6 +213,20 @@ Bibtex Entries
 ,Year="1991"
 }
 
+@phdthesis{HMassalinPhD
+,author="H. Massalin"
+,title="Synthesis: An Efficient Implementation of Fundamental Operating
+System Services"
+,school="Columbia University"
+,address="New York, NY"
+,year="1992"
+,annotation="
+	Mondo optimizing compiler.
+	Wait-free stuff.
+	Good advice: defer work to avoid synchronization.
+"
+}
+
 @unpublished{Jacobson93
 ,author="Van Jacobson"
 ,title="Avoid Read-Side Locking Via Delayed Free"
@@ -635,3 +660,86 @@ Revised:
 "
 }
 
+@unpublished{PaulEMcKenney2007PreemptibleRCU
+,Author="Paul E. McKenney"
+,Title="The design of preemptible read-copy-update"
+,month="October"
+,day="8"
+,year="2007"
+,note="Available:
+\url{http://lwn.net/Articles/253651/}
+[Viewed October 25, 2007]"
+,annotation="
+	LWN article describing the design of preemptible RCU.
+"
+}
+
+########################################################################
+#
+# "What is RCU?" LWN series.
+#
+
+@unpublished{PaulEMcKenney2007WhatIsRCUFundamentally
+,Author="Paul E. McKenney and Jonathan Walpole"
+,Title="What is {RCU}, Fundamentally?"
+,month="December"
+,day="17"
+,year="2007"
+,note="Available:
+\url{http://lwn.net/Articles/262464/}
+[Viewed December 27, 2007]"
+,annotation="
+	Lays out the three basic components of RCU: (1) publish-subscribe,
+	(2) wait for pre-existing readers to complete, and (2) maintain
+	multiple versions.
+"
+}
+
+@unpublished{PaulEMcKenney2008WhatIsRCUUsage
+,Author="Paul E. McKenney"
+,Title="What is {RCU}? Part 2: Usage"
+,month="January"
+,day="4"
+,year="2008"
+,note="Available:
+\url{http://lwn.net/Articles/263130/}
+[Viewed January 4, 2008]"
+,annotation="
+	Lays out six uses of RCU:
+	1. RCU is a Reader-Writer Lock Replacement
+	2. RCU is a Restricted Reference-Counting Mechanism
+	3. RCU is a Bulk Reference-Counting Mechanism
+	4. RCU is a Poor Man's Garbage Collector
+	5. RCU is a Way of Providing Existence Guarantees
+	6. RCU is a Way of Waiting for Things to Finish
+"
+}
+
+@unpublished{PaulEMcKenney2008WhatIsRCUAPI
+,Author="Paul E. McKenney"
+,Title="{RCU} part 3: the {RCU} {API}"
+,month="January"
+,day="17"
+,year="2008"
+,note="Available:
+\url{http://lwn.net/Articles/264090/}
+[Viewed January 10, 2008]"
+,annotation="
+	Gives an overview of the Linux-kernel RCU API and a brief annotated RCU
+	bibliography.
+"
+}
+
+@article{DinakarGuniguntala2008IBMSysJ
+,author="D. Guniguntala and P. E. McKenney and J. Triplett and J. Walpole"
+,title="The read-copy-update mechanism for supporting real-time applications on shared-memory multiprocessor systems with {Linux}"
+,Year="2008"
+,Month="April"
+,journal="IBM Systems Journal"
+,volume="47"
+,number="2"
+,pages="@@-@@"
+,annotation="
+	RCU, realtime RCU, sleepable RCU, performance.
+"
+}
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 42b01bc2e1b4..cf5562cbe356 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -13,10 +13,13 @@ over a rather long period of time, but improvements are always welcome!
 	detailed performance measurements show that RCU is nonetheless
 	the right tool for the job.
 
-	The other exception would be where performance is not an issue,
-	and RCU provides a simpler implementation. An example of this
-	situation is the dynamic NMI code in the Linux 2.6 kernel,
-	at least on architectures where NMIs are rare.
+	Another exception is where performance is not an issue, and RCU
+	provides a simpler implementation. An example of this situation
+	is the dynamic NMI code in the Linux 2.6 kernel, at least on
+	architectures where NMIs are rare.
+
+	Yet another exception is where the low real-time latency of RCU's
+	read-side primitives is critically important.
 
 1.	Does the update code have proper mutual exclusion?
 
@@ -39,9 +42,10 @@ over a rather long period of time, but improvements are always welcome!
 
 2.	Do the RCU read-side critical sections make proper use of
 	rcu_read_lock() and friends? These primitives are needed
-	to suppress preemption (or bottom halves, in the case of
-	rcu_read_lock_bh()) in the read-side critical sections,
-	and are also an excellent aid to readability.
+	to prevent grace periods from ending prematurely, which
+	could result in data being unceremoniously freed out from
+	under your read-side code, which can greatly increase the
+	actuarial risk of your kernel.
 
 	As a rough rule of thumb, any dereference of an RCU-protected
 	pointer must be covered by rcu_read_lock() or rcu_read_lock_bh()
@@ -54,15 +58,30 @@ over a rather long period of time, but improvements are always welcome!
 	be running while updates are in progress. There are a number
 	of ways to handle this concurrency, depending on the situation:
 
-	a.	Make updates appear atomic to readers. For example,
+	a.	Use the RCU variants of the list and hlist update
+		primitives to add, remove, and replace elements on an
+		RCU-protected list. Alternatively, use the RCU-protected
+		trees that have been added to the Linux kernel.
+
+		This is almost always the best approach.
+
+	b.	Proceed as in (a) above, but also maintain per-element
+		locks (that are acquired by both readers and writers)
+		that guard per-element state. Of course, fields that
+		the readers refrain from accessing can be guarded by the
+		update-side lock.
+
+		This works quite well, also.
+
+	c.	Make updates appear atomic to readers. For example,
 		pointer updates to properly aligned fields will appear
 		atomic, as will individual atomic primitives. Operations
 		performed under a lock and sequences of multiple atomic
 		primitives will -not- appear to be atomic.
 
-		This is almost always the best approach.
+		This can work, but is starting to get a bit tricky.
 
-	b.	Carefully order the updates and the reads so that
+	d.	Carefully order the updates and the reads so that
 		readers see valid data at all phases of the update.
 		This is often more difficult than it sounds, especially
 		given modern CPUs' tendency to reorder memory references.
@@ -123,18 +142,22 @@ over a rather long period of time, but improvements are always welcome!
 		when publicizing a pointer to a structure that can
 		be traversed by an RCU read-side critical section.
 
-5.	If call_rcu(), or a related primitive such as call_rcu_bh(),
-	is used, the callback function must be written to be called
-	from softirq context. In particular, it cannot block.
+5.	If call_rcu(), or a related primitive such as call_rcu_bh() or
+	call_rcu_sched(), is used, the callback function must be
+	written to be called from softirq context. In particular,
+	it cannot block.
 
 6.	Since synchronize_rcu() can block, it cannot be called from
-	any sort of irq context.
+	any sort of irq context. Ditto for synchronize_sched() and
+	synchronize_srcu().
 
 7.	If the updater uses call_rcu(), then the corresponding readers
 	must use rcu_read_lock() and rcu_read_unlock(). If the updater
 	uses call_rcu_bh(), then the corresponding readers must use
-	rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up
-	will result in confusion and broken kernels.
+	rcu_read_lock_bh() and rcu_read_unlock_bh(). If the updater
+	uses call_rcu_sched(), then the corresponding readers must
+	disable preemption. Mixing things up will result in confusion
+	and broken kernels.
 
 	One exception to this rule: rcu_read_lock() and rcu_read_unlock()
 	may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
@@ -143,9 +166,9 @@ over a rather long period of time, but improvements are always welcome!
 	such cases is a must, of course! And the jury is still out on
 	whether the increased speed is worth it.
 
-8.	Although synchronize_rcu() is a bit slower than is call_rcu(),
-	it usually results in simpler code. So, unless update
-	performance is critically important or the updaters cannot block,
+8.	Although synchronize_rcu() is slower than is call_rcu(), it
+	usually results in simpler code. So, unless update performance
+	is critically important or the updaters cannot block,
 	synchronize_rcu() should be used in preference to call_rcu().
 
 	An especially important property of the synchronize_rcu()
@@ -187,23 +210,23 @@ over a rather long period of time, but improvements are always welcome!
 	number of updates per grace period.
 
 9.	All RCU list-traversal primitives, which include
-	list_for_each_rcu(), list_for_each_entry_rcu(),
+	rcu_dereference(), list_for_each_rcu(), list_for_each_entry_rcu(),
 	list_for_each_continue_rcu(), and list_for_each_safe_rcu(),
-	must be within an RCU read-side critical section. RCU
+	must be either within an RCU read-side critical section or
+	must be protected by appropriate update-side locks. RCU
 	read-side critical sections are delimited by rcu_read_lock()
 	and rcu_read_unlock(), or by similar primitives such as
 	rcu_read_lock_bh() and rcu_read_unlock_bh().
 
-	Use of the _rcu() list-traversal primitives outside of an
-	RCU read-side critical section causes no harm other than
-	a slight performance degradation on Alpha CPUs. It can
-	also be quite helpful in reducing code bloat when common
-	code is shared between readers and updaters.
+	The reason that it is permissible to use RCU list-traversal
+	primitives when the update-side lock is held is that doing so
+	can be quite helpful in reducing code bloat when common code is
+	shared between readers and updaters.
 
 10.	Conversely, if you are in an RCU read-side critical section,
-	you -must- use the "_rcu()" variants of the list macros.
-	Failing to do so will break Alpha and confuse people reading
-	your code.
+	and you don't hold the appropriate update-side lock, you -must-
+	use the "_rcu()" variants of the list macros. Failing to do so
+	will break Alpha and confuse people reading your code.
 
 11.	Note that synchronize_rcu() -only- guarantees to wait until
 	all currently executing rcu_read_lock()-protected RCU read-side
@@ -230,6 +253,14 @@ over a rather long period of time, but improvements are always welcome!
 	must use whatever locking or other synchronization is required
 	to safely access and/or modify that data structure.
 
+	RCU callbacks are -usually- executed on the same CPU that executed
+	the corresponding call_rcu(), call_rcu_bh(), or call_rcu_sched(),
+	but are by -no- means guaranteed to be. For example, if a given
+	CPU goes offline while having an RCU callback pending, then that
+	RCU callback will execute on some surviving CPU. (If this was
+	not the case, a self-spawning RCU callback would prevent the
+	victim CPU from ever going offline.)
+
 14.	SRCU (srcu_read_lock(), srcu_read_unlock(), and synchronize_srcu())
 	may only be invoked from process context. Unlike other forms of
 	RCU, it -is- permissible to block in an SRCU read-side critical
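Several of the checklist items in this file's diff rest on one rule: a callback queued with call_rcu() may run only after every read-side critical section that was already in progress has ended, while readers that start later are not waited for. The following is a toy, single-threaded Python model of that bookkeeping, offered only as an illustration. The class and method names are hypothetical, and it says nothing about the kernel's actual implementation, memory ordering, softirq context, or concurrency.

```python
class ToyRCU:
    """Bookkeeping-only model of RCU grace periods (not kernel code)."""

    def __init__(self):
        self._next_id = 0
        self._active = set()   # ids of readers inside a critical section
        self._pending = []     # entries: [readers to wait for, callback]

    def read_lock(self):
        """Enter a read-side critical section and return a reader id."""
        reader = self._next_id
        self._next_id += 1
        self._active.add(reader)
        return reader

    def read_unlock(self, reader):
        """Leave a critical section, then run callbacks that became ready."""
        self._active.discard(reader)
        for entry in self._pending:
            entry[0].discard(reader)
        self._run_ready()

    def call(self, callback):
        """Defer callback until all currently active readers have left.

        Toy simplification: with no active reader it runs at once.
        """
        self._pending.append([set(self._active), callback])
        self._run_ready()

    def _run_ready(self):
        ready = [e for e in self._pending if not e[0]]
        self._pending = [e for e in self._pending if e[0]]
        for _, callback in ready:
            callback()


log = []
rcu = ToyRCU()
old_reader = rcu.read_lock()            # may still be using the old data
rcu.call(lambda: log.append("freed"))   # updater queues the deferred free
new_reader = rcu.read_lock()            # started later, so not waited for
assert log == []                        # old_reader is still running
rcu.read_unlock(old_reader)             # the modeled grace period ends here
print(log)                              # ['freed']
rcu.read_unlock(new_reader)
```

The model waits only for readers that existed when the callback was queued, which is why the later reader does not delay the free.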
diff --git a/Documentation/RCU/torture.txt b/Documentation/RCU/torture.txt
index 2967a65269d8..a342b6e1cc10 100644
--- a/Documentation/RCU/torture.txt
+++ b/Documentation/RCU/torture.txt
@@ -10,23 +10,30 @@ status messages via printk(), which can be examined via the dmesg
 command (perhaps grepping for "torture"). The test is started
 when the module is loaded, and stops when the module is unloaded.
 
-However, actually setting this config option to "y" results in the system
-running the test immediately upon boot, and ending only when the system
-is taken down. Normally, one will instead want to build the system
-with CONFIG_RCU_TORTURE_TEST=m and to use modprobe and rmmod to control
-the test, perhaps using a script similar to the one shown at the end of
-this document. Note that you will need CONFIG_MODULE_UNLOAD in order
-to be able to end the test.
+CONFIG_RCU_TORTURE_TEST_RUNNABLE
+
+It is also possible to specify CONFIG_RCU_TORTURE_TEST=y, which will
+result in the tests being loaded into the base kernel. In this case,
+the CONFIG_RCU_TORTURE_TEST_RUNNABLE config option is used to specify
+whether the RCU torture tests are to be started immediately during
+boot or whether the /proc/sys/kernel/rcutorture_runnable file is used
+to enable them. This /proc file can be used to repeatedly pause and
+restart the tests, regardless of the initial state specified by the
+CONFIG_RCU_TORTURE_TEST_RUNNABLE config option.
+
+You will normally -not- want to start the RCU torture tests during boot
+(and thus the default is CONFIG_RCU_TORTURE_TEST_RUNNABLE=n), but doing
+this can sometimes be useful in finding boot-time bugs.
 
 
 MODULE PARAMETERS
 
 This module has the following parameters:
 
-nreaders	This is the number of RCU reading threads supported.
-		The default is twice the number of CPUs. Why twice?
-		To properly exercise RCU implementations with preemptible
-		read-side critical sections.
+irqreaders	Says to invoke RCU readers from irq level. This is currently
+		done via timers. Defaults to "1" for variants of RCU that
+		permit this. (Or, more accurately, variants of RCU that do
+		-not- permit this know to ignore this variable.)
 
 nfakewriters	This is the number of RCU fake writer threads to run. Fake
 		writer threads repeatedly use the synchronous "wait for
@@ -37,6 +44,16 @@ nfakewriters This is the number of RCU fake writer threads to run. Fake
 		to trigger special cases caused by multiple writers, such as
 		the synchronize_srcu() early return optimization.
 
+nreaders	This is the number of RCU reading threads supported.
+		The default is twice the number of CPUs. Why twice?
+		To properly exercise RCU implementations with preemptible
+		read-side critical sections.
+
+shuffle_interval
+		The number of seconds to keep the test threads affinitied
+		to a particular subset of the CPUs, defaults to 3 seconds.
+		Used in conjunction with test_no_idle_hz.
+
 stat_interval	The number of seconds between output of torture
 		statistics (via printk()). Regardless of the interval,
 		statistics are printed when the module is unloaded.
@@ -44,10 +61,11 @@ stat_interval The number of seconds between output of torture
 		be printed -only- when the module is unloaded, and this
 		is the default.
 
-shuffle_interval
-		The number of seconds to keep the test threads affinitied
-		to a particular subset of the CPUs, defaults to 5 seconds.
-		Used in conjunction with test_no_idle_hz.
+stutter		The length of time to run the test before pausing for this
+		same period of time. Defaults to "stutter=5", so as
+		to run and pause for (roughly) five-second intervals.
+		Specifying "stutter=0" causes the test to run continuously
+		without pausing, which is the old default behavior.
 
 test_no_idle_hz	Whether or not to test the ability of RCU to operate in
 		a kernel that disables the scheduling-clock interrupt to
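The stutter parameter added in the last hunk alternates run and pause periods of equal length, with stutter=0 meaning no pauses. As a rough model only, that schedule can be sketched in Python as below. The function name and the exact phase boundaries are assumptions made for illustration and are not taken from the rcutorture source, which the text itself describes as only "roughly" periodic.

```python
def torture_running(elapsed_seconds, stutter=5):
    """Return True if the modeled test is in a run phase at this time.

    stutter=0 models continuous running; otherwise the model runs for
    `stutter` seconds, pauses for `stutter` seconds, and repeats.
    """
    if stutter == 0:
        return True
    return (elapsed_seconds // stutter) % 2 == 0


print([torture_running(t) for t in (0, 4, 5, 9, 10)])
# [True, True, False, False, True]
print(torture_running(7, stutter=0))
# True
```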
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index e0d6d99b8f9b..e04d643a9f57 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -1,3 +1,11 @@
+Please note that the "What is RCU?" LWN series is an excellent place
+to start learning about RCU:
+
+1.	What is RCU, Fundamentally?  http://lwn.net/Articles/262464/
+2.	What is RCU? Part 2: Usage   http://lwn.net/Articles/263130/
+3.	RCU part 3: the RCU API      http://lwn.net/Articles/264090/
+
+
 What is RCU?
 
 RCU is a synchronization mechanism that was added to the Linux kernel
@@ -772,26 +780,18 @@ Linux-kernel source code, but it helps to have a full list of the
 APIs, since there does not appear to be a way to categorize them
 in docbook. Here is the list, by category.
 
-Markers for RCU read-side critical sections:
-
-	rcu_read_lock
-	rcu_read_unlock
-	rcu_read_lock_bh
-	rcu_read_unlock_bh
-	srcu_read_lock
-	srcu_read_unlock
-
 RCU pointer/list traversal:
 
 	rcu_dereference
+	list_for_each_entry_rcu
+	hlist_for_each_entry_rcu
+
 	list_for_each_rcu		(to be deprecated in favor of
 					 list_for_each_entry_rcu)
-	list_for_each_entry_rcu
 	list_for_each_continue_rcu	(to be deprecated in favor of new
 					 list_for_each_entry_continue_rcu)
-	hlist_for_each_entry_rcu
 
-RCU pointer update:
+RCU pointer/list update:
 
 	rcu_assign_pointer
 	list_add_rcu
@@ -799,16 +799,36 @@ RCU pointer update:
 	list_del_rcu
 	list_replace_rcu
 	hlist_del_rcu
+	hlist_add_after_rcu
+	hlist_add_before_rcu
 	hlist_add_head_rcu
+	hlist_replace_rcu
+	list_splice_init_rcu()
 
-RCU grace period:
+RCU:	Critical sections	Grace period		Barrier
+
+	rcu_read_lock		synchronize_net		rcu_barrier
+	rcu_read_unlock		synchronize_rcu
+				call_rcu
+
+
+bh:	Critical sections	Grace period		Barrier
+
+	rcu_read_lock_bh	call_rcu_bh		rcu_barrier_bh
+	rcu_read_unlock_bh
+
+
+sched:	Critical sections	Grace period		Barrier
+
+	[preempt_disable]	synchronize_sched	rcu_barrier_sched
+	[and friends]		call_rcu_sched
+
+
+SRCU:	Critical sections	Grace period		Barrier
+
+	srcu_read_lock		synchronize_srcu	N/A
+	srcu_read_unlock
 
-	synchronize_net
-	synchronize_sched
-	synchronize_rcu
-	synchronize_srcu
-	call_rcu
-	call_rcu_bh
 
 See the comment headers in the source code (or the docbook generated
 from them) for more information.
diff --git a/Documentation/cputopology.txt b/Documentation/cputopology.txt
index b61cb9564023..bd699da24666 100644
--- a/Documentation/cputopology.txt
+++ b/Documentation/cputopology.txt
@@ -14,9 +14,8 @@ represent the thread siblings to cpu X in the same physical package;
 To implement it in an architecture-neutral way, a new source file,
 drivers/base/topology.c, is to export the 4 attributes.
 
-If one architecture wants to support this feature, it just needs to
-implement 4 defines, typically in file include/asm-XXX/topology.h.
-The 4 defines are:
+For an architecture to support this feature, it must define some of
+these macros in include/asm-XXX/topology.h:
 #define topology_physical_package_id(cpu)
 #define topology_core_id(cpu)
 #define topology_thread_siblings(cpu)
@@ -25,17 +24,10 @@ The 4 defines are:
 The type of **_id is int.
 The type of siblings is cpumask_t.
 
-To be consistent on all architectures, the 4 attributes should have
-default values if their values are unavailable. Below is the rule.
-1) physical_package_id: If cpu has no physical package id, -1 is the
-default value.
-2) core_id: If cpu doesn't support multi-core, its core id is 0.
-3) thread_siblings: Just include itself, if the cpu doesn't support
-HT/multi-thread.
-4) core_siblings: Just include itself, if the cpu doesn't support
-multi-core and HT/Multi-thread.
-
-So be careful when declaring the 4 defines in include/asm-XXX/topology.h.
-
-If an attribute isn't defined on an architecture, it won't be exported.
-
+To be consistent on all architectures, include/linux/topology.h
+provides default definitions for any of the above macros that are
+not defined by include/asm-XXX/topology.h:
+1) physical_package_id: -1
+2) core_id: 0
+3) thread_siblings: just the given CPU
+4) core_siblings: just the given CPU
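The new text says that include/linux/topology.h supplies a default for any of the four macros an architecture leaves undefined. The fallback rule itself can be sketched as a small Python model. This illustrates the rule only, not the C preprocessor mechanism: the dictionary, the topology() helper, and the example architecture are all hypothetical, and a Python set stands in for cpumask_t.

```python
# Defaults listed in the new text, used for any macro the architecture
# does not define itself. A set of CPU numbers stands in for cpumask_t.
DEFAULTS = {
    "physical_package_id": lambda cpu: -1,
    "core_id": lambda cpu: 0,
    "thread_siblings": lambda cpu: {cpu},
    "core_siblings": lambda cpu: {cpu},
}


def topology(arch_macros, name, cpu):
    """Evaluate attribute `name` for `cpu`, falling back to the default."""
    macro = arch_macros.get(name, DEFAULTS[name])
    return macro(cpu)


# Hypothetical architecture that defines only topology_core_id().
arch = {"core_id": lambda cpu: cpu % 2}

print(topology(arch, "core_id", 3))              # 1
print(topology(arch, "physical_package_id", 3))  # -1
print(topology(arch, "thread_siblings", 3))      # {3}
```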
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 46ece3fba6f9..65a1482457a8 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -222,13 +222,6 @@ Who: Thomas Gleixner <tglx@linutronix.de>
222 222
223--------------------------- 223---------------------------
224 224
225What: i2c-i810, i2c-prosavage and i2c-savage4
226When: May 2008
227Why: These drivers are superseded by i810fb, intelfb and savagefb.
228Who: Jean Delvare <khali@linux-fr.org>
229
230---------------------------
231
232What (Why): 225What (Why):
233 - include/linux/netfilter_ipv4/ipt_TOS.h ipt_tos.h header files 226 - include/linux/netfilter_ipv4/ipt_TOS.h ipt_tos.h header files
234 (superseded by xt_TOS/xt_tos target & match) 227 (superseded by xt_TOS/xt_tos target & match)
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 0c5086db8352..80e193d82e2e 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -13,72 +13,93 @@ Mailing list: linux-ext4@vger.kernel.org
131. Quick usage instructions: 131. Quick usage instructions:
14=========================== 14===========================
15 15
16 - Grab updated e2fsprogs from 16 - Compile and install the latest version of e2fsprogs (as of this
17 ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/ 17 writing version 1.41) from:
18 This is a patchset on top of e2fsprogs-1.39, which can be found at 18
19 http://sourceforge.net/project/showfiles.php?group_id=2406
20
21 or
22
19 ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ 23 ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/
20 24
21 - It's still mke2fs -j /dev/hda1 25 or grab the latest git repository from:
26
27 git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git
28
29 - Create a new filesystem using the ext4dev filesystem type:
30
31 # mke2fs -t ext4dev /dev/hda1
32
33 Or configure an existing ext3 filesystem to support extents and set
34 the test_fs flag to indicate that it's ok for an in-development
35 filesystem to touch this filesystem:
22 36
23 - mount /dev/hda1 /wherever -t ext4dev 37 # tune2fs -O extents -E test_fs /dev/hda1
24 38
25 - To enable extents, 39 If the filesystem was created with 128 byte inodes, it can be
40 converted to use 256 byte for greater efficiency via:
26 41
27 mount /dev/hda1 /wherever -t ext4dev -o extents 42 # tune2fs -I 256 /dev/hda1
28 43
29 - The filesystem is compatible with the ext3 driver until you add a file 44 (Note: we currently do not have tools to convert an ext4dev
30 which has extents (ie: `mount -o extents', then create a file). 45 filesystem back to ext3; so please do not try this on production
46 filesystems.)
31 47
32 NOTE: The "extents" mount flag is temporary. It will soon go away and 48 - Mounting:
33 extents will be enabled by the "-o extents" flag to mke2fs or tune2fs 49
50 # mount -t ext4dev /dev/hda1 /wherever
34 51
35 - When comparing performance with other filesystems, remember that 52 - When comparing performance with other filesystems, remember that
36 ext3/4 by default offers higher data integrity guarantees than most. So 53 ext3/4 by default offers higher data integrity guarantees than most.
37 when comparing with a metadata-only journalling filesystem, use `mount -o 54 So when comparing with a metadata-only journalling filesystem, such
38 data=writeback'. And you might as well use `mount -o nobh' too along 55 as ext3, use `mount -o data=writeback'. And you might as well use
39 with it. Making the journal larger than the mke2fs default often helps 56 `mount -o nobh' too along with it. Making the journal larger than
40 performance with metadata-intensive workloads. 57 the mke2fs default often helps performance with metadata-intensive
58 workloads.
41 59
422. Features 602. Features
43=========== 61===========
44 62
452.1 Currently available 632.1 Currently available
46 64
47* ability to use filesystems > 16TB 65* ability to use filesystems > 16TB (e2fsprogs support not available yet)
48* extent format reduces metadata overhead (RAM, IO for access, transactions) 66* extent format reduces metadata overhead (RAM, IO for access, transactions)
49* extent format more robust in face of on-disk corruption due to magics, 67* extent format more robust in face of on-disk corruption due to magics,
50* internal redundancy in tree 68* internal redundancy in tree
51 69* improved file allocation (multi-block alloc)
522.1 Previously available, soon to be enabled by default by "mkefs.ext4": 70* fix 32000 subdirectory limit
53 71* nsec timestamps for mtime, atime, ctime, create time
54* dir_index and resize inode will be on by default 72* inode version field on disk (NFSv4, Lustre)
55* large inodes will be used by default for fast EAs, nsec timestamps, etc 73* reduced e2fsck time via uninit_bg feature
74* journal checksumming for robustness, performance
75* persistent file preallocation (e.g for streaming media, databases)
76* ability to pack bitmaps and inode tables into larger virtual groups via the
77 flex_bg feature
78* large file support
79* Inode allocation using large virtual block groups via flex_bg
80* delayed allocation
81* large block (up to pagesize) support
82* efficient new ordered mode in JBD2 and ext4 (avoid using buffer head to force
83 the ordering)
56 84
572.2 Candidate features for future inclusion 852.2 Candidate features for future inclusion
58 86
59There are several under discussion, whether they all make it in is 87* Online defrag (patches available but not well tested)
60partly a function of how much time everyone has to work on them: 88* reduced mke2fs time via lazy itable initialization in conjunction with
89 the uninit_bg feature (capability to do this is available in e2fsprogs
90 but a kernel thread to do lazy zeroing of unused inode table blocks
91 after filesystem is first mounted is required for safety)
61 92
62* improved file allocation (multi-block alloc, delayed alloc; basically done) 93There are several others under discussion, whether they all make it in is
63* fix 32000 subdirectory limit (patch exists, needs some e2fsck work) 94partly a function of how much time everyone has to work on them. Features like
64* nsec timestamps for mtime, atime, ctime, create time (patch exists, 95metadata checksumming have been discussed and planned for a bit but no patches
65 needs some e2fsck work) 96exist yet so I'm not sure they're in the near-term roadmap.
66* inode version field on disk (NFSv4, Lustre; prototype exists)
67* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists)
68* journal checksumming for robustness, performance (prototype exists)
69* persistent file preallocation (e.g for streaming media, databases)
70 97
71Features like metadata checksumming have been discussed and planned for 98The big performance win will come with mballoc, delalloc and flex_bg
72a bit but no patches exist yet so I'm not sure they're in the near-term 99grouping of bitmaps and inode tables. Some test results available here:
73roadmap.
74 100
75The big performance win will come with mballoc and delalloc. CFS has 101 - http://www.bullopensource.org/ext4/20080530/ffsb-write-2.6.26-rc2.html
76been using mballoc for a few years already with Lustre, and IBM + Bull 102 - http://www.bullopensource.org/ext4/20080530/ffsb-readwrite-2.6.26-rc2.html
77did a lot of benchmarking on it. The reason it isn't in the first set of
78patches is partly a manageability issue, and partly because it doesn't
79directly affect the on-disk format (outside of much better allocation)
80so it isn't critical to get into the first round of changes. I believe
81Alex is working on a new set of patches right now.
82 103
833. Options 1043. Options
84========== 105==========
@@ -222,9 +243,11 @@ stripe=n Number of filesystem blocks that mballoc will try
222 to use for allocation size and alignment. For RAID5/6 243 to use for allocation size and alignment. For RAID5/6
223 systems this should be the number of data 244 systems this should be the number of data
224 disks * RAID chunk size in file system blocks. 245 disks * RAID chunk size in file system blocks.
225 246delalloc (*) Deferring block allocation until write-out time.
247nodelalloc Disable delayed allocation. Blocks are allocated
248 when data is copied from user to page cache.
226Data Mode 249Data Mode
227--------- 250=========
228There are 3 different data modes: 251There are 3 different data modes:
229 252
230* writeback mode 253* writeback mode
@@ -236,10 +259,10 @@ typically provide the best ext4 performance.
236 259
237* ordered mode 260* ordered mode
238In data=ordered mode, ext4 only officially journals metadata, but it logically 261In data=ordered mode, ext4 only officially journals metadata, but it logically
239groups metadata and data blocks into a single unit called a transaction. When 262groups metadata information related to data changes with the data blocks into a
240it's time to write the new metadata out to disk, the associated data blocks 263single unit called a transaction. When it's time to write the new metadata
241are written first. In general, this mode performs slightly slower than 264out to disk, the associated data blocks are written first. In general,
242writeback but significantly faster than journal mode. 265this mode performs slightly slower than writeback but significantly faster than journal mode.
243 266
244* journal mode 267* journal mode
245data=journal mode provides full data and metadata journaling. All new data is 268data=journal mode provides full data and metadata journaling. All new data is
@@ -247,7 +270,8 @@ written to the journal first, and then to its final location.
247In the event of a crash, the journal can be replayed, bringing both data and 270In the event of a crash, the journal can be replayed, bringing both data and
248metadata into a consistent state. This mode is the slowest except when data 271metadata into a consistent state. This mode is the slowest except when data
249needs to be read from and written to disk at the same time where it 272needs to be read from and written to disk at the same time where it
250outperforms all others modes. 273outperforms all other modes. Currently ext4 does not have delayed
274allocation support if this data journalling mode is selected.
251 275
252References 276References
253========== 277==========
@@ -256,7 +280,8 @@ kernel source: <file:fs/ext4/>
256 <file:fs/jbd2/> 280 <file:fs/jbd2/>
257 281
258programs: http://e2fsprogs.sourceforge.net/ 282programs: http://e2fsprogs.sourceforge.net/
259 http://ext2resize.sourceforge.net
260 283
261useful links: http://fedoraproject.org/wiki/ext3-devel 284useful links: http://fedoraproject.org/wiki/ext3-devel
262 http://www.bullopensource.org/ext4/ 285 http://www.bullopensource.org/ext4/
286 http://ext4.wiki.kernel.org/index.php/Main_Page
287 http://fedoraproject.org/wiki/Features/Ext4
diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt
new file mode 100644
index 000000000000..4dae9a3840bf
--- /dev/null
+++ b/Documentation/filesystems/gfs2-glocks.txt
@@ -0,0 +1,114 @@
1 Glock internal locking rules
2 ------------------------------
3
4This documents the basic principles of the glock state machine
5internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h)
6has two main (internal) locks:
7
8 1. A spinlock (gl_spin) which protects the internal state such
9 as gl_state, gl_target and the list of holders (gl_holders)
10 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other
11 threads from making calls to the DLM, etc. at the same time. If a
12 thread takes this lock, it must then call run_queue (usually via the
13 workqueue) when it releases it in order to ensure any pending tasks
14 are completed.
15
16The gl_holders list contains all the queued lock requests (not
17just the holders) associated with the glock. If there are any
18held locks, then they will be contiguous entries at the head
19of the list. Locks are granted in strictly the order that they
20are queued, except for those marked LM_FLAG_PRIORITY which are
21used only during recovery, and even then only for journal locks.
22
23There are three lock states that users of the glock layer can request,
24namely shared (SH), deferred (DF) and exclusive (EX). Those translate
25to the following DLM lock modes:
26
27Glock mode | DLM lock mode
28------------------------------
29 UN | IV/NL Unlocked (no DLM lock associated with glock) or NL
30 SH | PR (Protected read)
31 DF | CW (Concurrent write)
32 EX | EX (Exclusive)
33
34Thus DF is basically a shared mode which is incompatible with the "normal"
35shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O
36operations. The glocks are basically a lock plus some routines which deal
37with cache management. The following rules apply for the cache:
38
39Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata
40--------------------------------------------------------------------------
41 UN | No | No | No | No
42 SH | Yes | Yes | No | No
43 DF | No | Yes | No | No
44 EX | Yes | Yes | Yes | Yes
45
46These rules are implemented using the various glock operations which
47are defined for each type of glock. Not all types of glocks use
48all the modes. Only inode glocks use the DF mode for example.
49
50Table of glock operations and per type constants:
51
52Field | Purpose
53----------------------------------------------------------------------------
54go_xmote_th | Called before remote state change (e.g. to sync dirty data)
55go_xmote_bh | Called after remote state change (e.g. to refill cache)
56go_inval | Called if remote state change requires invalidating the cache
57go_demote_ok | Returns boolean value of whether it's ok to demote a glock
58 | (e.g. checks timeout, and that there is no cached data)
59go_lock | Called for the first local holder of a lock
60go_unlock | Called on the final local unlock of a lock
61go_dump | Called to print content of object for debugfs file, or on
62 | error to dump glock to the log.
63go_type | The type of the glock, LM_TYPE_.....
64go_min_hold_time | The minimum hold time
65
66The minimum hold time for each lock is the time after a remote lock
67grant for which we ignore remote demote requests. This is in order to
68prevent a situation where locks are being bounced around the cluster
69from node to node with none of the nodes making any progress. This
70tends to show up most with shared mmaped files which are being written
71to by multiple nodes. Delaying the demotion in response to a
72remote callback gives the userspace program time to make
73some progress before the pages are unmapped.
74
75There is a plan to try and remove the go_lock and go_unlock callbacks
76if possible, in order to try and speed up the fast path through the locking.
77Also, eventually we hope to make the glock "EX" mode locally shared
78such that any local locking will be done with the i_mutex as required
79rather than via the glock.
80
81Locking rules for glock operations:
82
83Operation | GLF_LOCK bit lock held | gl_spin spinlock held
84-----------------------------------------------------------------
85go_xmote_th | Yes | No
86go_xmote_bh | Yes | No
87go_inval | Yes | No
88go_demote_ok | Sometimes | Yes
89go_lock | Yes | No
90go_unlock | Yes | No
91go_dump | Sometimes | Yes
92
93N.B. Operations must not drop either the bit lock or the spinlock
94if it's held on entry. go_dump and go_demote_ok must never block.
95Note that go_dump will only be called if the glock's state
96indicates that it is caching uptodate data.
97
98Glock locking order within GFS2:
99
100 1. i_mutex (if required)
101 2. Rename glock (for rename only)
102 3. Inode glock(s)
103 (Parents before children, inodes at "same level" with same parent in
104 lock number order)
105 4. Rgrp glock(s) (for (de)allocation operations)
106 5. Transaction glock (via gfs2_trans_begin) for non-read operations
107 6. Page lock (always last, very important!)
108
109There are two glocks per inode. One deals with access to the inode
110itself (locking order as above), and the other, known as the iopen
111glock is used in conjunction with the i_nlink field in the inode to
112determine the lifetime of the inode in question. Locking of inodes
113is on a per-inode basis. Locking of rgrps is on a per rgrp basis.
114
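The two mode tables above (glock mode to DLM mode, and the cache rules) can be captured as lookup tables. The following is only an illustrative sketch: the enum and struct names are hypothetical and are not taken from the GFS2 sources.

```c
#include <stdbool.h>

/* Hypothetical names for the four glock modes described above. */
enum glock_mode { GLOCK_UN, GLOCK_SH, GLOCK_DF, GLOCK_EX };

/* What may be cached or dirtied while a glock is held in a given mode. */
struct glock_cache_rules {
	bool cache_data;
	bool cache_metadata;
	bool dirty_data;
	bool dirty_metadata;
};

/* Rows follow the cache rules table in the text. */
static const struct glock_cache_rules cache_rules[] = {
	[GLOCK_UN] = { false, false, false, false },
	[GLOCK_SH] = { true,  true,  false, false },
	[GLOCK_DF] = { false, true,  false, false },
	[GLOCK_EX] = { true,  true,  true,  true  },
};

/* DLM lock mode corresponding to each glock mode. */
static const char *dlm_mode_name(enum glock_mode mode)
{
	switch (mode) {
	case GLOCK_SH: return "PR";	/* protected read */
	case GLOCK_DF: return "CW";	/* concurrent write */
	case GLOCK_EX: return "EX";	/* exclusive */
	default:       return "IV/NL";	/* unlocked, or NL */
	}
}

/* Only a mode that allows dirty data may be used to modify file data. */
static bool may_dirty_data(enum glock_mode mode)
{
	return cache_rules[mode].dirty_data;
}
```

Keeping the rules in one table makes it easy to see that DF is the only mode that caches metadata without caching data.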
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index dbc3c6a3650f..7f268f327d75 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -380,28 +380,35 @@ i386 and x86_64 platforms support the new IRQ vector displays.
380Of some interest is the introduction of the /proc/irq directory to 2.4. 380Of some interest is the introduction of the /proc/irq directory to 2.4.
381It could be used to set IRQ to CPU affinity, this means that you can "hook" an 381It could be used to set IRQ to CPU affinity, this means that you can "hook" an
382IRQ to only one CPU, or to exclude a CPU from handling IRQs. The contents of the 382IRQ to only one CPU, or to exclude a CPU from handling IRQs. The contents of the
383irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask 383irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and
384prof_cpu_mask.
384 385
385For example 386For example
386 > ls /proc/irq/ 387 > ls /proc/irq/
387 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask 388 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask
388 1 11 13 15 17 19 3 5 7 9 389 1 11 13 15 17 19 3 5 7 9 default_smp_affinity
389 > ls /proc/irq/0/ 390 > ls /proc/irq/0/
390 smp_affinity 391 smp_affinity
391 392
392The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ 393smp_affinity is a bitmask, in which you can specify which CPUs can handle the
393is the same by default: 394IRQ, you can set it by doing:
394 395
395 > cat /proc/irq/0/smp_affinity 396 > echo 1 > /proc/irq/10/smp_affinity
396 ffffffff 397
398This means that only the first CPU will handle the IRQ, but you can also echo
3995 which means that only the first and third CPU can handle the IRQ.
397 400
398It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can 401The contents of each smp_affinity file is the same by default:
399set it by doing: 402
403 > cat /proc/irq/0/smp_affinity
404 ffffffff
400 405
401 > echo 1 > /proc/irq/prof_cpu_mask 406The default_smp_affinity mask applies to all non-active IRQs, which are the
407IRQs which have not yet been allocated/activated, and hence which lack a
408/proc/irq/[0-9]* directory.
402 409
403This means that only the first CPU will handle the IRQ, but you can also echo 5 410prof_cpu_mask specifies which CPUs are to be profiled by the system wide
404which means that only the first and fourth CPU can handle the IRQ. 411profiler. Default value is ffffffff (all cpus).
405 412
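To make the smp_affinity bitmask concrete, here is a minimal user-space sketch that decodes such a mask into CPU numbers. The helper names are made up for this example, and it assumes the mask is a single hex word that fits in an unsigned long (no comma-separated groups). Bit N of the mask corresponds to CPU N, so the value 5 (binary 101) selects CPU 0 and CPU 2.

```c
#include <stdlib.h>

/* Parse a hex mask string such as "ffffffff" or "5" (hypothetical helper). */
static unsigned long parse_affinity_mask(const char *hex)
{
	return strtoul(hex, NULL, 16);
}

/* Return non-zero if the given zero-based CPU is allowed by the mask. */
static int cpu_allowed(unsigned long mask, unsigned int cpu)
{
	return (int)((mask >> cpu) & 1UL);
}

/*
 * Store the zero-based numbers of all CPUs set in the mask into out[],
 * writing at most max entries. Returns the number of entries written.
 */
static int mask_to_cpus(unsigned long mask, int *out, int max)
{
	int n = 0;

	for (int cpu = 0; mask != 0 && n < max; cpu++, mask >>= 1)
		if (mask & 1UL)
			out[n++] = cpu;
	return n;
}
```

A caller would typically read the string from /proc/irq/IRQ#/smp_affinity, pass it to parse_affinity_mask(), and print the result of mask_to_cpus().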
406The way IRQs are routed is handled by the IO-APIC, and it's Round Robin 413The way IRQs are routed is handled by the IO-APIC, and it's Round Robin
407between all the CPUs which are allowed to handle it. As usual the kernel has 414between all the CPUs which are allowed to handle it. As usual the kernel has
diff --git a/Documentation/i2c/busses/i2c-i810 b/Documentation/i2c/busses/i2c-i810
deleted file mode 100644
index 778210ee1583..000000000000
--- a/Documentation/i2c/busses/i2c-i810
+++ /dev/null
@@ -1,47 +0,0 @@
1Kernel driver i2c-i810
2
3Supported adapters:
4 * Intel 82810, 82810-DC100, 82810E, and 82815 (GMCH)
5 * Intel 82845G (GMCH)
6
7Authors:
8 Frodo Looijaard <frodol@dds.nl>,
9 Philip Edelbrock <phil@netroedge.com>,
10 Kyösti Mälkki <kmalkki@cc.hut.fi>,
11 Ralph Metzler <rjkm@thp.uni-koeln.de>,
12 Mark D. Studebaker <mdsxyz123@yahoo.com>
13
14Main contact: Mark Studebaker <mdsxyz123@yahoo.com>
15
16Description
17-----------
18
19WARNING: If you have an '810' or '815' motherboard, your standard I2C
20temperature sensors are most likely on the 801's I2C bus. You want the
21i2c-i801 driver for those, not this driver.
22
23Now for the i2c-i810...
24
25The GMCH chip contains two I2C interfaces.
26
27The first interface is used for DDC (Data Display Channel) which is a
28serial channel through the VGA monitor connector to a DDC-compliant
29monitor. This interface is defined by the Video Electronics Standards
30Association (VESA). The standards are available for purchase at
31http://www.vesa.org .
32
33The second interface is a general-purpose I2C bus. It may be connected to a
34TV-out chip such as the BT869 or possibly to a digital flat-panel display.
35
36Features
37--------
38
39Both busses use the i2c-algo-bit driver for 'bit banging'
40and support for specific transactions is provided by i2c-algo-bit.
41
42Issues
43------
44
45If you enable bus testing in i2c-algo-bit (insmod i2c-algo-bit bit_test=1),
46the test may fail; if so, the i2c-i810 driver won't be inserted. However,
47we think this has been fixed.
diff --git a/Documentation/i2c/busses/i2c-prosavage b/Documentation/i2c/busses/i2c-prosavage
deleted file mode 100644
index 703687902511..000000000000
--- a/Documentation/i2c/busses/i2c-prosavage
+++ /dev/null
@@ -1,23 +0,0 @@
1Kernel driver i2c-prosavage
2
3Supported adapters:
4
5 S3/VIA KM266/VT8375 aka ProSavage8
6 S3/VIA KM133/VT8365 aka Savage4
7
8Author: Henk Vergonet <henk@god.dyndns.org>
9
10Description
11-----------
12
13The Savage4 chips contain two I2C interfaces (aka a I2C 'master' or
14'host').
15
16The first interface is used for DDC (Data Display Channel) which is a
17serial channel through the VGA monitor connector to a DDC-compliant
18monitor. This interface is defined by the Video Electronics Standards
19Association (VESA). The standards are available for purchase at
20http://www.vesa.org . The second interface is a general-purpose I2C bus.
21
22Usefull for gaining access to the TV Encoder chips.
23
diff --git a/Documentation/i2c/busses/i2c-savage4 b/Documentation/i2c/busses/i2c-savage4
deleted file mode 100644
index 6ecceab618d3..000000000000
--- a/Documentation/i2c/busses/i2c-savage4
+++ /dev/null
@@ -1,26 +0,0 @@
1Kernel driver i2c-savage4
2
3Supported adapters:
4 * Savage4
5 * Savage2000
6
7Authors:
8 Alexander Wold <awold@bigfoot.com>,
9 Mark D. Studebaker <mdsxyz123@yahoo.com>
10
11Description
12-----------
13
14The Savage4 chips contain two I2C interfaces (aka a I2C 'master'
15or 'host').
16
17The first interface is used for DDC (Data Display Channel) which is a
18serial channel through the VGA monitor connector to a DDC-compliant
19monitor. This interface is defined by the Video Electronics Standards
20Association (VESA). The standards are available for purchase at
21http://www.vesa.org . The DDC bus is not yet supported because its register
22is not directly memory-mapped.
23
24The second interface is a general-purpose I2C bus. This is the only
25interface supported by the driver at the moment.
26
diff --git a/Documentation/i2c/fault-codes b/Documentation/i2c/fault-codes
new file mode 100644
index 000000000000..045765c0b9b5
--- /dev/null
+++ b/Documentation/i2c/fault-codes
@@ -0,0 +1,127 @@
1This is a summary of the most important conventions for use of fault
2codes in the I2C/SMBus stack.
3
4
5A "Fault" is not always an "Error"
6----------------------------------
7Not all fault reports imply errors; "page faults" should be a familiar
8example. Software often retries idempotent operations after transient
9faults. There may be fancier recovery schemes that are appropriate in
10some cases, such as re-initializing (and maybe resetting). After such
11recovery, triggered by a fault report, there is no error.
12
13In a similar way, sometimes a "fault" code just reports one defined
14result for an operation ... it doesn't indicate that anything is wrong
15at all, just that the outcome wasn't on the "golden path".
16
17In short, your I2C driver code may need to know these codes in order
18to respond correctly. Other code may need to rely on YOUR code reporting
19the right fault code, so that it can (in turn) behave correctly.
20
21
22I2C and SMBus fault codes
23-------------------------
24These are returned as negative numbers from most calls, with zero or
25some positive number indicating a non-fault return. The specific
26numbers associated with these symbols differ between architectures,
27though most Linux systems use <asm-generic/errno*.h> numbering.
28
29Note that the descriptions here are not exhaustive. There are other
30codes that may be returned, and other cases where these codes should
31be returned. However, drivers should not return other codes for these
32cases (unless the hardware doesn't provide unique fault reports).
33
34Also, codes returned by adapter probe methods follow rules which are
35specific to their host bus (such as PCI, or the platform bus).
36
37
38EAGAIN
39 Returned by I2C adapters when they lose arbitration in master
40 transmit mode: some other master was transmitting different
41 data at the same time.
42
43 Also returned when trying to invoke an I2C operation in an
44 atomic context, when some task is already using that I2C bus
45 to execute some other operation.
46
47EBADMSG
48 Returned by SMBus logic when an invalid Packet Error Code byte
49 is received. This code is a CRC covering all bytes in the
50 transaction, and is sent before the terminating STOP. This
51 fault is only reported on read transactions; the SMBus slave
52 may have a way to report PEC mismatches on writes from the
53 host. Note that even if PECs are in use, you should not rely
54 on these as the only way to detect incorrect data transfers.
55
56EBUSY
57 Returned by SMBus adapters when the bus was busy for longer
58 than allowed. This usually indicates some device (maybe the
59 SMBus adapter) needs some fault recovery (such as resetting),
60 or that the reset was attempted but failed.
61
62EINVAL
63 This rather vague error means an invalid parameter has been
64 detected before any I/O operation was started. Use a more
65 specific fault code when you can.
66
67 One example would be a driver trying an SMBus Block Write
68 with block size outside the range of 1-32 bytes.
69
70EIO
71 This rather vague error means something went wrong when
72 performing an I/O operation. Use a more specific fault
73 code when you can.
74
75ENODEV
76 Returned by driver probe() methods. This is a bit more
77 specific than ENXIO, implying the problem isn't with the
78 address, but with the device found there. Driver probes
79 may verify the device returns *correct* responses, and
80 return this as appropriate. (The driver core will warn
81 about probe faults other than ENXIO and ENODEV.)
82
83ENOMEM
84 Returned by any component that can't allocate memory when
85 it needs to do so.
86
87ENXIO
88 Returned by I2C adapters to indicate that the address phase
89 of a transfer didn't get an ACK. While it might just mean
90 an I2C device was temporarily not responding, usually it
91 means there's nothing listening at that address.
92
93 Returned by driver probe() methods to indicate that they
94 found no device to bind to. (ENODEV may also be used.)
95
96EOPNOTSUPP
97 Returned by an adapter when asked to perform an operation
98 that it doesn't, or can't, support.
99
100 For example, this would be returned when an adapter that
101 doesn't support SMBus block transfers is asked to execute
102 one. In that case, the driver making that request should
103 have verified that functionality was supported before it
104 made that block transfer request.
105
106 Similarly, if an I2C adapter can't execute all legal I2C
107 messages, it should return this when asked to perform a
108 transaction it can't. (These limitations can't be seen in
109 the adapter's functionality mask, since the assumption is
110 that if an adapter supports I2C it supports all of I2C.)
111
112EPROTO
113 Returned when slave does not conform to the relevant I2C
114 or SMBus (or chip-specific) protocol specifications. One
115 case is when the length of an SMBus block data response
116 (from the SMBus slave) is outside the range 1-32 bytes.
117
118ETIMEDOUT
119 This is returned by drivers when an operation took too much
120 time, and was aborted before it completed.
121
122 SMBus adapters may return it when an operation took more
123 time than allowed by the SMBus specification; for example,
124 when a slave stretches clocks too far. I2C has no such
125 timeouts, but it's normal for I2C adapters to impose some
126 arbitrary limits (much longer than SMBus!) too.
127
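As an illustration of how a caller might act on these codes, the sketch below retries a transfer that reports -EAGAIN (for example after lost arbitration) and rejects an out-of-range SMBus block length with -EINVAL before any I/O starts. It is a user-space mock under stated assumptions: xfer_with_retry(), check_block_len() and the fake bus are hypothetical names, not part of the I2C stack, and retrying is only appropriate for idempotent operations.

```c
#include <errno.h>

#define SMBUS_BLOCK_MAX 32	/* SMBus block transfers carry 1-32 bytes */

/* A transfer callback: returns 0 on success or a negative fault code. */
typedef int (*xfer_fn)(void *ctx);

/* Validate a block length before starting any I/O. */
static int check_block_len(int len)
{
	return (len < 1 || len > SMBUS_BLOCK_MAX) ? -EINVAL : 0;
}

/*
 * Call xfer until it returns something other than -EAGAIN, at most
 * max_tries times. Any other fault is passed back to the caller as-is.
 */
static int xfer_with_retry(xfer_fn xfer, void *ctx, int max_tries)
{
	int ret = -EAGAIN;

	for (int i = 0; i < max_tries && ret == -EAGAIN; i++)
		ret = xfer(ctx);
	return ret;
}

/* Fake bus for demonstration: loses arbitration a fixed number of times. */
struct fake_bus {
	int failures_left;
	int calls;
};

static int fake_xfer(void *ctx)
{
	struct fake_bus *bus = ctx;

	bus->calls++;
	if (bus->failures_left > 0) {
		bus->failures_left--;
		return -EAGAIN;
	}
	return 0;
}
```

Only -EAGAIN is treated as transient here; whether other codes (such as ETIMEDOUT) deserve a retry depends on the device and is left to the caller.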
diff --git a/Documentation/i2c/smbus-protocol b/Documentation/i2c/smbus-protocol
index 03f08fb491cc..24bfb65da17d 100644
--- a/Documentation/i2c/smbus-protocol
+++ b/Documentation/i2c/smbus-protocol
@@ -42,8 +42,8 @@ Count (8 bits): A data byte containing the length of a block operation.
42[..]: Data sent by I2C device, as opposed to data sent by the host adapter. 42[..]: Data sent by I2C device, as opposed to data sent by the host adapter.
43 43
44 44
45SMBus Quick Command: i2c_smbus_write_quick() 45SMBus Quick Command
46============================================= 46===================
47 47
48This sends a single bit to the device, at the place of the Rd/Wr bit. 48This sends a single bit to the device, at the place of the Rd/Wr bit.
49 49
diff --git a/Documentation/i2c/writing-clients b/Documentation/i2c/writing-clients
index d4cd4126d1ad..6b61b3a2e90b 100644
--- a/Documentation/i2c/writing-clients
+++ b/Documentation/i2c/writing-clients
@@ -44,6 +44,10 @@ static struct i2c_driver foo_driver = {
44 .id_table = foo_ids, 44 .id_table = foo_ids,
45 .probe = foo_probe, 45 .probe = foo_probe,
46 .remove = foo_remove, 46 .remove = foo_remove,
47 /* if device autodetection is needed: */
48 .class = I2C_CLASS_SOMETHING,
49 .detect = foo_detect,
50 .address_data = &addr_data,
47 51
48 /* else, driver uses "legacy" binding model: */ 52 /* else, driver uses "legacy" binding model: */
49 .attach_adapter = foo_attach_adapter, 53 .attach_adapter = foo_attach_adapter,
@@ -217,6 +221,31 @@ in the I2C bus driver. You may want to save the returned i2c_client
217reference for later use. 221reference for later use.
218 222
219 223
224Device Detection (Standard driver model)
225----------------------------------------
226
227Sometimes you do not know in advance which I2C devices are connected to
228a given I2C bus. This is for example the case of hardware monitoring
229devices on a PC's SMBus. In that case, you may want to let your driver
230detect supported devices automatically. This is how the legacy model
231worked, and it is now available as an extension to the standard
232driver model (so that we can finally get rid of the legacy model.)
233
234You simply have to define a detect callback which will attempt to
235identify supported devices (returning 0 for supported ones and -ENODEV
236for unsupported ones), a list of addresses to probe, and a device type
237(or class) so that only I2C buses which may have that type of device
238connected (and not otherwise enumerated) will be probed. The i2c
239core will then call you back as needed and will instantiate a device
240for you for every successful detection.
241
242Note that this mechanism is purely optional and not suitable for all
243devices. You need some reliable way to identify the supported devices
244(typically using device-specific, dedicated identification registers),
245otherwise misdetections are likely to occur and things can go wrong
246quickly.
247
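The core idea of a detect callback, checking a dedicated identification register and returning 0 or -ENODEV, can be sketched outside the kernel as follows. Everything here is a stand-in: struct mock_client, the register address and the chip ID value are hypothetical, and the real callback operates on the kernel's own client structure rather than on this mock.

```c
#include <errno.h>
#include <stdint.h>

#define FOO_REG_CHIP_ID	0x00	/* hypothetical identification register */
#define FOO_CHIP_ID	0x5a	/* hypothetical value identifying a "foo" chip */

/* Minimal stand-in for a client: an address plus a register read hook. */
struct mock_client {
	uint8_t addr;
	/* Returns the register byte (0-255) or a negative fault code. */
	int (*read_reg)(const struct mock_client *client, uint8_t reg);
};

/* Return 0 if a supported device answers at this address, else -ENODEV. */
static int foo_detect(const struct mock_client *client)
{
	int id = client->read_reg(client, FOO_REG_CHIP_ID);

	if (id < 0)
		return -ENODEV;	/* nothing usable answered */
	if (id != FOO_CHIP_ID)
		return -ENODEV;	/* some other chip lives here */
	return 0;
}

/* Mock devices used to exercise foo_detect(). */
static int read_foo(const struct mock_client *client, uint8_t reg)
{
	(void)client;
	return reg == FOO_REG_CHIP_ID ? FOO_CHIP_ID : 0;
}

static int read_other(const struct mock_client *client, uint8_t reg)
{
	(void)client;
	(void)reg;
	return 0x11;
}

static int read_nak(const struct mock_client *client, uint8_t reg)
{
	(void)client;
	(void)reg;
	return -ENXIO;
}
```

A device without a reliable identification register has no equivalent of the FOO_CHIP_ID check, which is why detection is not suitable for every chip.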
248
220Device Deletion (Standard driver model) 249Device Deletion (Standard driver model)
221--------------------------------------- 250---------------------------------------
222 251
@@ -569,7 +598,6 @@ SMBus communication
569 in terms of it. Never use this function directly! 598 in terms of it. Never use this function directly!
570 599
571 600
572 extern s32 i2c_smbus_write_quick(struct i2c_client * client, u8 value);
573 extern s32 i2c_smbus_read_byte(struct i2c_client * client); 601 extern s32 i2c_smbus_read_byte(struct i2c_client * client);
574 extern s32 i2c_smbus_write_byte(struct i2c_client * client, u8 value); 602 extern s32 i2c_smbus_write_byte(struct i2c_client * client, u8 value);
575 extern s32 i2c_smbus_read_byte_data(struct i2c_client * client, u8 command); 603 extern s32 i2c_smbus_read_byte_data(struct i2c_client * client, u8 command);
@@ -578,30 +606,31 @@ SMBus communication
578 extern s32 i2c_smbus_read_word_data(struct i2c_client * client, u8 command); 606 extern s32 i2c_smbus_read_word_data(struct i2c_client * client, u8 command);
579 extern s32 i2c_smbus_write_word_data(struct i2c_client * client, 607 extern s32 i2c_smbus_write_word_data(struct i2c_client * client,
580 u8 command, u16 value); 608 u8 command, u16 value);
609 extern s32 i2c_smbus_read_block_data(struct i2c_client * client,
610 u8 command, u8 *values);
581 extern s32 i2c_smbus_write_block_data(struct i2c_client * client, 611 extern s32 i2c_smbus_write_block_data(struct i2c_client * client,
582 u8 command, u8 length, 612 u8 command, u8 length,
583 u8 *values); 613 u8 *values);
584 extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client, 614 extern s32 i2c_smbus_read_i2c_block_data(struct i2c_client * client,
585 u8 command, u8 length, u8 *values); 615 u8 command, u8 length, u8 *values);
586
587These ones were removed in Linux 2.6.10 because they had no users, but could
588be added back later if needed:
589
590 extern s32 i2c_smbus_read_block_data(struct i2c_client * client,
591 u8 command, u8 *values);
592 extern s32 i2c_smbus_write_i2c_block_data(struct i2c_client * client, 616 extern s32 i2c_smbus_write_i2c_block_data(struct i2c_client * client,
593 u8 command, u8 length, 617 u8 command, u8 length,
594 u8 *values); 618 u8 *values);
619
620These ones were removed from i2c-core because they had no users, but could
621be added back later if needed:
622
623 extern s32 i2c_smbus_write_quick(struct i2c_client * client, u8 value);
595 extern s32 i2c_smbus_process_call(struct i2c_client * client, 624 extern s32 i2c_smbus_process_call(struct i2c_client * client,
596 u8 command, u16 value); 625 u8 command, u16 value);
597 extern s32 i2c_smbus_block_process_call(struct i2c_client *client, 626 extern s32 i2c_smbus_block_process_call(struct i2c_client *client,
598 u8 command, u8 length, 627 u8 command, u8 length,
599 u8 *values) 628 u8 *values)
600 629
601All these transactions return -1 on failure. The 'write' transactions 630All these transactions return a negative errno value on failure. The 'write'
602return 0 on success; the 'read' transactions return the read value, except 631transactions return 0 on success; the 'read' transactions return the read
603for read_block, which returns the number of values read. The block buffers 632value, except for block transactions, which return the number of values
604need not be longer than 32 bytes. 633read. The block buffers need not be longer than 32 bytes.
605 634
606You can read the file `smbus-protocol' for more information about the 635You can read the file `smbus-protocol' for more information about the
607actual SMBus protocol. 636actual SMBus protocol.
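The return convention described above (negative errno on failure, 0 for
a successful write, the value read for a successful read) can be
sketched in userspace as follows. This is only an illustration under
stated assumptions: struct mock_client, mock_read_byte_data(),
mock_write_byte_data() and demo_read_reg() are hypothetical stand-ins
rather than the i2c-core functions, the s32 typedef imitates the kernel
type, and the choice of -ENXIO for an absent device is an assumption
made for the example.

```c
#include <errno.h>
#include <stdint.h>

typedef int32_t s32;	/* userspace stand-in for the kernel's s32 */

/* Hypothetical mock of a client device; not struct i2c_client. */
struct mock_client {
	int present;		/* 0 simulates a device that does not answer */
	uint8_t regs[256];
};

/* Read mock: returns the byte read (0..255) or a negative errno. */
static s32 mock_read_byte_data(const struct mock_client *c, uint8_t command)
{
	if (!c->present)
		return -ENXIO;	/* assumed errno for this sketch */
	return c->regs[command];
}

/* Write mock: returns 0 on success or a negative errno. */
static s32 mock_write_byte_data(struct mock_client *c, uint8_t command,
				uint8_t value)
{
	if (!c->present)
		return -ENXIO;
	c->regs[command] = value;
	return 0;
}

/*
 * Caller pattern: test for a negative result before using the data,
 * and pass the errno value up unchanged instead of collapsing it to -1.
 */
static int demo_read_reg(const struct mock_client *c, uint8_t reg,
			 uint8_t *out)
{
	s32 ret = mock_read_byte_data(c, reg);

	if (ret < 0)
		return ret;	/* propagate the fault code */
	*out = (uint8_t)ret;	/* valid data is in the low 8 bits */
	return 0;
}
```

Because data and error share one signed return value, the caller has to
check for a negative result first; on failure *out is left untouched.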
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b3a5aad7e629..312fe77764a4 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -571,6 +571,8 @@ and is between 256 and 4096 characters. It is defined in the file
571 571
572 debug_objects [KNL] Enable object debugging 572 debug_objects [KNL] Enable object debugging
573 573
574 debugpat [X86] Enable PAT debugging
575
574 decnet.addr= [HW,NET] 576 decnet.addr= [HW,NET]
575 Format: <area>[,<node>] 577 Format: <area>[,<node>]
576 See also Documentation/networking/decnet.txt. 578 See also Documentation/networking/decnet.txt.
@@ -756,9 +758,6 @@ and is between 256 and 4096 characters. It is defined in the file
756 hd= [EIDE] (E)IDE hard drive subsystem geometry 758 hd= [EIDE] (E)IDE hard drive subsystem geometry
757 Format: <cyl>,<head>,<sect> 759 Format: <cyl>,<head>,<sect>
758 760
759 hd?= [HW] (E)IDE subsystem
760 hd?lun= See Documentation/ide/ide.txt.
761
762 highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact 761 highmem=nn[KMG] [KNL,BOOT] forces the highmem zone to have an exact
763 size of <nn>. This works even on boxes that have no 762 size of <nn>. This works even on boxes that have no
764 highmem otherwise. This also works to reduce highmem 763 highmem otherwise. This also works to reduce highmem
@@ -1610,6 +1609,10 @@ and is between 256 and 4096 characters. It is defined in the file
1610 Format: { parport<nr> | timid | 0 } 1609 Format: { parport<nr> | timid | 0 }
1611 See also Documentation/parport.txt. 1610 See also Documentation/parport.txt.
1612 1611
1612 pmtmr= [X86] Manual setup of pmtmr I/O Port.
1613 Override pmtimer IOPort with a hex value.
1614 e.g. pmtmr=0x508
1615
1613 pnpacpi= [ACPI] 1616 pnpacpi= [ACPI]
1614 { off } 1617 { off }
1615 1618