diff options
author | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
commit | 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch) | |
tree | 0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/RCU |
Linux-2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!
Diffstat (limited to 'Documentation/RCU')
-rw-r--r-- | Documentation/RCU/RTFP.txt | 387 | ||||
-rw-r--r-- | Documentation/RCU/UP.txt | 64 | ||||
-rw-r--r-- | Documentation/RCU/arrayRCU.txt | 141 | ||||
-rw-r--r-- | Documentation/RCU/checklist.txt | 157 | ||||
-rw-r--r-- | Documentation/RCU/listRCU.txt | 307 | ||||
-rw-r--r-- | Documentation/RCU/rcu.txt | 67 |
6 files changed, 1123 insertions, 0 deletions
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt new file mode 100644 index 00000000000..12250b342e1 --- /dev/null +++ b/Documentation/RCU/RTFP.txt | |||
@@ -0,0 +1,387 @@ | |||
1 | Read the F-ing Papers! | ||
2 | |||
3 | |||
4 | This document describes RCU-related publications, and is followed by | ||
5 | the corresponding bibtex entries. | ||
6 | |||
7 | The first thing resembling RCU was published in 1980, when Kung and Lehman | ||
8 | [Kung80] recommended use of a garbage collector to defer destruction | ||
9 | of nodes in a parallel binary search tree in order to simplify its | ||
10 | implementation. This works well in environments that have garbage | ||
11 | collectors, but current production garbage collectors incur significant | ||
12 | read-side overhead. | ||
13 | |||
14 | In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring | ||
15 | destruction until all threads running at that time have terminated, again | ||
16 | for a parallel binary search tree. This approach works well in systems | ||
17 | with short-lived threads, such as the K42 research operating system. | ||
18 | However, Linux has long-lived tasks, so more is needed. | ||
19 | |||
20 | In 1986, Hennessy, Osisek, and Seigh [Hennessy89] introduced passive | ||
21 | serialization, which is an RCU-like mechanism that relies on the presence | ||
22 | of "quiescent states" in the VM/XA hypervisor that are guaranteed not | ||
23 | to be referencing the data structure. However, this mechanism was not | ||
24 | optimized for modern computer systems, which is not surprising given | ||
25 | that these overheads were not so expensive in the mid-80s. Nonetheless, | ||
26 | passive serialization appears to be the first deferred-destruction | ||
27 | mechanism to be used in production. Furthermore, the relevant patent has | ||
28 | lapsed, so this approach may be used in non-GPL software, if desired. | ||
29 | (In contrast, use of RCU is permitted only in software licensed under | ||
30 | GPL. Sorry!!!) | ||
31 | |||
32 | In 1990, Pugh [Pugh90] noted that explicitly tracking which threads | ||
33 | were reading a given data structure permitted deferred free to operate | ||
34 | in the presence of non-terminating threads. However, this explicit | ||
35 | tracking imposes significant read-side overhead, which is undesirable | ||
36 | in read-mostly situations. This algorithm does take pains to avoid | ||
37 | write-side contention and parallelize the other write-side overheads by | ||
38 | providing a fine-grained locking design, however, it would be interesting | ||
39 | to see how much of the performance advantage reported in 1990 remains | ||
40 | in 2004. | ||
41 | |||
42 | At about this same time, Adams [Adams91] described ``chaotic relaxation'', | ||
43 | where the normal barriers between successive iterations of convergent | ||
44 | numerical algorithms are relaxed, so that iteration $n$ might use | ||
45 | data from iteration $n-1$ or even $n-2$. This introduces error, | ||
46 | which typically slows convergence and thus increases the number of | ||
47 | iterations required. However, this increase is sometimes more than made | ||
48 | up for by a reduction in the number of expensive barrier operations, | ||
49 | which are otherwise required to synchronize the threads at the end | ||
50 | of each iteration. Unfortunately, chaotic relaxation requires highly | ||
51 | structured data, such as the matrices used in scientific programs, and | ||
52 | is thus inapplicable to most data structures in operating-system kernels. | ||
53 | |||
54 | In 1993, Jacobson [Jacobson93] verbally described what is perhaps the | ||
55 | simplest deferred-free technique: simply waiting a fixed amount of time | ||
56 | before freeing blocks awaiting deferred free. Jacobson did not describe | ||
57 | any write-side changes he might have made in this work using SGI's Irix | ||
58 | kernel. Aju John published a similar technique in 1995 [AjuJohn95]. | ||
59 | This works well if there is a well-defined upper bound on the length of | ||
60 | time that reading threads can hold references, as there might well be in | ||
61 | hard real-time systems. However, if this time is exceeded, perhaps due | ||
62 | to preemption, excessive interrupts, or larger-than-anticipated load, | ||
63 | memory corruption can ensue, with no reasonable means of diagnosis. | ||
64 | Jacobson's technique is therefore inappropriate for use in production | ||
65 | operating-system kernels, except when such kernels can provide hard | ||
66 | real-time response guarantees for all operations. | ||
67 | |||
68 | Also in 1995, Pu et al. [Pu95a] applied a technique similar to that of Pugh's | ||
69 | read-side-tracking to permit replugging of algorithms within a commercial | ||
70 | Unix operating system. However, this replugging permitted only a single | ||
71 | reader at a time. The following year, this same group of researchers | ||
72 | extended their technique to allow for multiple readers [Cowan96a]. | ||
73 | Their approach requires memory barriers (and thus pipeline stalls), | ||
74 | but reduces memory latency, contention, and locking overheads. | ||
75 | |||
76 | 1995 also saw the first publication of DYNIX/ptx's RCU mechanism | ||
77 | [Slingwine95], which was optimized for modern CPU architectures, | ||
78 | and was successfully applied to a number of situations within the | ||
79 | DYNIX/ptx kernel. The corresponding conference paper appeared in 1998 | ||
80 | [McKenney98]. | ||
81 | |||
82 | In 1999, the Tornado and K42 groups described their "generations" | ||
83 | mechanism, which quite similar to RCU [Gamsa99]. These operating systems | ||
84 | made pervasive use of RCU in place of "existence locks", which greatly | ||
85 | simplifies locking hierarchies. | ||
86 | |||
87 | 2001 saw the first RCU presentation involving Linux [McKenney01a] | ||
88 | at OLS. The resulting abundance of RCU patches was presented the | ||
89 | following year [McKenney02a], and use of RCU in dcache was first | ||
90 | described that same year [Linder02a]. | ||
91 | |||
92 | Also in 2002, Michael [Michael02b,Michael02a] presented techniques | ||
93 | that defer the destruction of data structures to simplify non-blocking | ||
94 | synchronization (wait-free synchronization, lock-free synchronization, | ||
95 | and obstruction-free synchronization are all examples of non-blocking | ||
96 | synchronization). In particular, this technique eliminates locking, | ||
97 | reduces contention, reduces memory latency for readers, and parallelizes | ||
98 | pipeline stalls and memory latency for writers. However, these | ||
99 | techniques still impose significant read-side overhead in the form of | ||
100 | memory barriers. Researchers at Sun worked along similar lines in the | ||
101 | same timeframe [HerlihyLM02,HerlihyLMS03]. | ||
102 | |||
103 | In 2003, the K42 group described how RCU could be used to create | ||
104 | hot-pluggable implementations of operating-system functions. Later that | ||
105 | year saw a paper describing an RCU implementation of System V IPC | ||
106 | [Arcangeli03], and an introduction to RCU in Linux Journal [McKenney03a]. | ||
107 | |||
108 | 2004 has seen a Linux-Journal article on use of RCU in dcache | ||
109 | [McKenney04a], a performance comparison of locking to RCU on several | ||
110 | different CPUs [McKenney04b], a dissertation describing use of RCU in a | ||
111 | number of operating-system kernels [PaulEdwardMcKenneyPhD], and a paper | ||
112 | describing how to make RCU safe for soft-realtime applications [Sarma04c]. | ||
113 | |||
114 | |||
115 | Bibtex Entries | ||
116 | |||
117 | @article{Kung80 | ||
118 | ,author="H. T. Kung and Q. Lehman" | ||
119 | ,title="Concurrent Maintenance of Binary Search Trees" | ||
120 | ,Year="1980" | ||
121 | ,Month="September" | ||
122 | ,journal="ACM Transactions on Database Systems" | ||
123 | ,volume="5" | ||
124 | ,number="3" | ||
125 | ,pages="354-382" | ||
126 | } | ||
127 | |||
128 | @techreport{Manber82 | ||
129 | ,author="Udi Manber and Richard E. Ladner" | ||
130 | ,title="Concurrency Control in a Dynamic Search Structure" | ||
131 | ,institution="Department of Computer Science, University of Washington" | ||
132 | ,address="Seattle, Washington" | ||
133 | ,year="1982" | ||
134 | ,number="82-01-01" | ||
135 | ,month="January" | ||
136 | ,pages="28" | ||
137 | } | ||
138 | |||
139 | @article{Manber84 | ||
140 | ,author="Udi Manber and Richard E. Ladner" | ||
141 | ,title="Concurrency Control in a Dynamic Search Structure" | ||
142 | ,Year="1984" | ||
143 | ,Month="September" | ||
144 | ,journal="ACM Transactions on Database Systems" | ||
145 | ,volume="9" | ||
146 | ,number="3" | ||
147 | ,pages="439-455" | ||
148 | } | ||
149 | |||
150 | @techreport{Hennessy89 | ||
151 | ,author="James P. Hennessy and Damian L. Osisek and Joseph W. {Seigh II}" | ||
152 | ,title="Passive Serialization in a Multitasking Environment" | ||
153 | ,institution="US Patent and Trademark Office" | ||
154 | ,address="Washington, DC" | ||
155 | ,year="1989" | ||
156 | ,number="US Patent 4,809,168 (lapsed)" | ||
157 | ,month="February" | ||
158 | ,pages="11" | ||
159 | } | ||
160 | |||
161 | @techreport{Pugh90 | ||
162 | ,author="William Pugh" | ||
163 | ,title="Concurrent Maintenance of Skip Lists" | ||
164 | ,institution="Institute of Advanced Computer Science Studies, Department of Computer Science, University of Maryland" | ||
165 | ,address="College Park, Maryland" | ||
166 | ,year="1990" | ||
167 | ,number="CS-TR-2222.1" | ||
168 | ,month="June" | ||
169 | } | ||
170 | |||
171 | @Book{Adams91 | ||
172 | ,Author="Gregory R. Adams" | ||
173 | ,title="Concurrent Programming, Principles, and Practices" | ||
174 | ,Publisher="Benjamin Cummins" | ||
175 | ,Year="1991" | ||
176 | } | ||
177 | |||
178 | @unpublished{Jacobson93 | ||
179 | ,author="Van Jacobson" | ||
180 | ,title="Avoid Read-Side Locking Via Delayed Free" | ||
181 | ,year="1993" | ||
182 | ,month="September" | ||
183 | ,note="Verbal discussion" | ||
184 | } | ||
185 | |||
186 | @Conference{AjuJohn95 | ||
187 | ,Author="Aju John" | ||
188 | ,Title="Dynamic vnodes -- Design and Implementation" | ||
189 | ,Booktitle="{USENIX Winter 1995}" | ||
190 | ,Publisher="USENIX Association" | ||
191 | ,Month="January" | ||
192 | ,Year="1995" | ||
193 | ,pages="11-23" | ||
194 | ,Address="New Orleans, LA" | ||
195 | } | ||
196 | |||
197 | @techreport{Slingwine95 | ||
198 | ,author="John D. Slingwine and Paul E. McKenney" | ||
199 | ,title="Apparatus and Method for Achieving Reduced Overhead Mutual | ||
200 | Exclusion and Maintaining Coherency in a Multiprocessor System | ||
201 | Utilizing Execution History and Thread Monitoring" | ||
202 | ,institution="US Patent and Trademark Office" | ||
203 | ,address="Washington, DC" | ||
204 | ,year="1995" | ||
205 | ,number="US Patent 5,442,758 (contributed under GPL)" | ||
206 | ,month="August" | ||
207 | } | ||
208 | |||
209 | @techreport{Slingwine97 | ||
210 | ,author="John D. Slingwine and Paul E. McKenney" | ||
211 | ,title="Method for maintaining data coherency using thread | ||
212 | activity summaries in a multicomputer system" | ||
213 | ,institution="US Patent and Trademark Office" | ||
214 | ,address="Washington, DC" | ||
215 | ,year="1997" | ||
216 | ,number="US Patent 5,608,893 (contributed under GPL)" | ||
217 | ,month="March" | ||
218 | } | ||
219 | |||
220 | @techreport{Slingwine98 | ||
221 | ,author="John D. Slingwine and Paul E. McKenney" | ||
222 | ,title="Apparatus and method for achieving reduced overhead | ||
223 | mutual exclusion and maintaining coherency in a multiprocessor | ||
224 | system utilizing execution history and thread monitoring" | ||
225 | ,institution="US Patent and Trademark Office" | ||
226 | ,address="Washington, DC" | ||
227 | ,year="1998" | ||
228 | ,number="US Patent 5,727,209 (contributed under GPL)" | ||
229 | ,month="March" | ||
230 | } | ||
231 | |||
232 | @Conference{McKenney98 | ||
233 | ,Author="Paul E. McKenney and John D. Slingwine" | ||
234 | ,Title="Read-Copy Update: Using Execution History to Solve Concurrency | ||
235 | Problems" | ||
236 | ,Booktitle="{Parallel and Distributed Computing and Systems}" | ||
237 | ,Month="October" | ||
238 | ,Year="1998" | ||
239 | ,pages="509-518" | ||
240 | ,Address="Las Vegas, NV" | ||
241 | } | ||
242 | |||
243 | @Conference{Gamsa99 | ||
244 | ,Author="Ben Gamsa and Orran Krieger and Jonathan Appavoo and Michael Stumm" | ||
245 | ,Title="Tornado: Maximizing Locality and Concurrency in a Shared Memory | ||
246 | Multiprocessor Operating System" | ||
247 | ,Booktitle="{Proceedings of the 3\textsuperscript{rd} Symposium on | ||
248 | Operating System Design and Implementation}" | ||
249 | ,Month="February" | ||
250 | ,Year="1999" | ||
251 | ,pages="87-100" | ||
252 | ,Address="New Orleans, LA" | ||
253 | } | ||
254 | |||
255 | @techreport{Slingwine01 | ||
256 | ,author="John D. Slingwine and Paul E. McKenney" | ||
257 | ,title="Apparatus and method for achieving reduced overhead | ||
258 | mutual exclusion and maintaining coherency in a multiprocessor | ||
259 | system utilizing execution history and thread monitoring" | ||
260 | ,institution="US Patent and Trademark Office" | ||
261 | ,address="Washington, DC" | ||
262 | ,year="2001" | ||
263 | ,number="US Patent 5,219,690 (contributed under GPL)" | ||
264 | ,month="April" | ||
265 | } | ||
266 | |||
267 | @Conference{McKenney01a | ||
268 | ,Author="Paul E. McKenney and Jonathan Appavoo and Andi Kleen and | ||
269 | Orran Krieger and Rusty Russell and Dipankar Sarma and Maneesh Soni" | ||
270 | ,Title="Read-Copy Update" | ||
271 | ,Booktitle="{Ottawa Linux Symposium}" | ||
272 | ,Month="July" | ||
273 | ,Year="2001" | ||
274 | ,note="Available: | ||
275 | \url{http://www.linuxsymposium.org/2001/abstracts/readcopy.php} | ||
276 | \url{http://www.rdrop.com/users/paulmck/rclock/rclock_OLS.2001.05.01c.pdf} | ||
277 | [Viewed June 23, 2004]" | ||
278 | annotation=" | ||
279 | Described RCU, and presented some patches implementing and using it in | ||
280 | the Linux kernel. | ||
281 | " | ||
282 | } | ||
283 | |||
284 | @Conference{Linder02a | ||
285 | ,Author="Hanna Linder and Dipankar Sarma and Maneesh Soni" | ||
286 | ,Title="Scalability of the Directory Entry Cache" | ||
287 | ,Booktitle="{Ottawa Linux Symposium}" | ||
288 | ,Month="June" | ||
289 | ,Year="2002" | ||
290 | ,pages="289-300" | ||
291 | } | ||
292 | |||
293 | @Conference{McKenney02a | ||
294 | ,Author="Paul E. McKenney and Dipankar Sarma and | ||
295 | Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell" | ||
296 | ,Title="Read-Copy Update" | ||
297 | ,Booktitle="{Ottawa Linux Symposium}" | ||
298 | ,Month="June" | ||
299 | ,Year="2002" | ||
300 | ,pages="338-367" | ||
301 | ,note="Available: | ||
302 | \url{http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz} | ||
303 | [Viewed June 23, 2004]" | ||
304 | } | ||
305 | |||
306 | @article{Appavoo03a | ||
307 | ,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and | ||
308 | D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and | ||
309 | B. Gamsa and G. R. Ganger and P. McKenney and M. Ostrowski and | ||
310 | B. Rosenburg and M. Stumm and J. Xenidis" | ||
311 | ,title="Enabling Autonomic Behavior in Systems Software With Hot Swapping" | ||
312 | ,Year="2003" | ||
313 | ,Month="January" | ||
314 | ,journal="IBM Systems Journal" | ||
315 | ,volume="42" | ||
316 | ,number="1" | ||
317 | ,pages="60-76" | ||
318 | } | ||
319 | |||
320 | @Conference{Arcangeli03 | ||
321 | ,Author="Andrea Arcangeli and Mingming Cao and Paul E. McKenney and | ||
322 | Dipankar Sarma" | ||
323 | ,Title="Using Read-Copy Update Techniques for {System V IPC} in the | ||
324 | {Linux} 2.5 Kernel" | ||
325 | ,Booktitle="Proceedings of the 2003 USENIX Annual Technical Conference | ||
326 | (FREENIX Track)" | ||
327 | ,Publisher="USENIX Association" | ||
328 | ,year="2003" | ||
329 | ,month="June" | ||
330 | ,pages="297-310" | ||
331 | } | ||
332 | |||
333 | @article{McKenney03a | ||
334 | ,author="Paul E. McKenney" | ||
335 | ,title="Using {RCU} in the {Linux} 2.5 Kernel" | ||
336 | ,Year="2003" | ||
337 | ,Month="October" | ||
338 | ,journal="Linux Journal" | ||
339 | ,volume="1" | ||
340 | ,number="114" | ||
341 | ,pages="18-26" | ||
342 | } | ||
343 | |||
344 | @article{McKenney04a | ||
345 | ,author="Paul E. McKenney and Dipankar Sarma and Maneesh Soni" | ||
346 | ,title="Scaling dcache with {RCU}" | ||
347 | ,Year="2004" | ||
348 | ,Month="January" | ||
349 | ,journal="Linux Journal" | ||
350 | ,volume="1" | ||
351 | ,number="118" | ||
352 | ,pages="38-46" | ||
353 | } | ||
354 | |||
355 | @Conference{McKenney04b | ||
356 | ,Author="Paul E. McKenney" | ||
357 | ,Title="{RCU} vs. Locking Performance on Different {CPUs}" | ||
358 | ,Booktitle="{linux.conf.au}" | ||
359 | ,Month="January" | ||
360 | ,Year="2004" | ||
361 | ,Address="Adelaide, Australia" | ||
362 | ,note="Available: | ||
363 | \url{http://www.linux.org.au/conf/2004/abstracts.html#90} | ||
364 | \url{http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf} | ||
365 | [Viewed June 23, 2004]" | ||
366 | } | ||
367 | |||
368 | @phdthesis{PaulEdwardMcKenneyPhD | ||
369 | ,author="Paul E. McKenney" | ||
370 | ,title="Exploiting Deferred Destruction: | ||
371 | An Analysis of Read-Copy-Update Techniques | ||
372 | in Operating System Kernels" | ||
373 | ,school="OGI School of Science and Engineering at | ||
374 | Oregon Health and Sciences University" | ||
375 | ,year="2004" | ||
376 | } | ||
377 | |||
378 | @Conference{Sarma04c | ||
379 | ,Author="Dipankar Sarma and Paul E. McKenney" | ||
380 | ,Title="Making RCU Safe for Deep Sub-Millisecond Response Realtime Applications" | ||
381 | ,Booktitle="Proceedings of the 2004 USENIX Annual Technical Conference | ||
382 | (FREENIX Track)" | ||
383 | ,Publisher="USENIX Association" | ||
384 | ,year="2004" | ||
385 | ,month="June" | ||
386 | ,pages="182-191" | ||
387 | } | ||
diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt new file mode 100644 index 00000000000..551a803d82a --- /dev/null +++ b/Documentation/RCU/UP.txt | |||
@@ -0,0 +1,64 @@ | |||
1 | RCU on Uniprocessor Systems | ||
2 | |||
3 | |||
4 | A common misconception is that, on UP systems, the call_rcu() primitive | ||
5 | may immediately invoke its function, and that the synchronize_kernel | ||
6 | primitive may return immediately. The basis of this misconception | ||
7 | is that since there is only one CPU, it should not be necessary to | ||
8 | wait for anything else to get done, since there are no other CPUs for | ||
9 | anything else to be happening on. Although this approach will sort of | ||
10 | work a surprising amount of the time, it is a very bad idea in general. | ||
11 | This document presents two examples that demonstrate exactly how bad an | ||
12 | idea this is. | ||
13 | |||
14 | |||
15 | Example 1: softirq Suicide | ||
16 | |||
17 | Suppose that an RCU-based algorithm scans a linked list containing | ||
18 | elements A, B, and C in process context, and can delete elements from | ||
19 | this same list in softirq context. Suppose that the process-context scan | ||
20 | is referencing element B when it is interrupted by softirq processing, | ||
21 | which deletes element B, and then invokes call_rcu() to free element B | ||
22 | after a grace period. | ||
23 | |||
24 | Now, if call_rcu() were to directly invoke its arguments, then upon return | ||
25 | from softirq, the list scan would find itself referencing a newly freed | ||
26 | element B. This situation can greatly decrease the life expectancy of | ||
27 | your kernel. | ||
28 | |||
29 | |||
30 | Example 2: Function-Call Fatality | ||
31 | |||
32 | Of course, one could avert the suicide described in the preceding example | ||
33 | by having call_rcu() directly invoke its arguments only if it was called | ||
34 | from process context. However, this can fail in a similar manner. | ||
35 | |||
36 | Suppose that an RCU-based algorithm again scans a linked list containing | ||
37 | elements A, B, and C in process contexts, but that it invokes a function | ||
38 | on each element as it is scanned. Suppose further that this function | ||
39 | deletes element B from the list, then passes it to call_rcu() for deferred | ||
40 | freeing. This may be a bit unconventional, but it is perfectly legal | ||
41 | RCU usage, since call_rcu() must wait for a grace period to elapse. | ||
42 | Therefore, in this case, allowing call_rcu() to immediately invoke | ||
43 | its arguments would cause it to fail to make the fundamental guarantee | ||
44 | underlying RCU, namely that call_rcu() defers invoking its arguments until | ||
45 | all RCU read-side critical sections currently executing have completed. | ||
46 | |||
47 | Quick Quiz: why is it -not- legal to invoke synchronize_kernel() in | ||
48 | this case? | ||
49 | |||
50 | |||
51 | Summary | ||
52 | |||
53 | Permitting call_rcu() to immediately invoke its arguments or permitting | ||
54 | synchronize_kernel() to immediately return breaks RCU, even on a UP system. | ||
55 | So do not do it! Even on a UP system, the RCU infrastructure -must- | ||
56 | respect grace periods. | ||
57 | |||
58 | |||
59 | Answer to Quick Quiz | ||
60 | |||
61 | The calling function is scanning an RCU-protected linked list, and | ||
62 | is therefore within an RCU read-side critical section. Therefore, | ||
63 | the called function has been invoked within an RCU read-side critical | ||
64 | section, and is not permitted to block. | ||
diff --git a/Documentation/RCU/arrayRCU.txt b/Documentation/RCU/arrayRCU.txt new file mode 100644 index 00000000000..453ebe6953e --- /dev/null +++ b/Documentation/RCU/arrayRCU.txt | |||
@@ -0,0 +1,141 @@ | |||
1 | Using RCU to Protect Read-Mostly Arrays | ||
2 | |||
3 | |||
4 | Although RCU is more commonly used to protect linked lists, it can | ||
5 | also be used to protect arrays. Three situations are as follows: | ||
6 | |||
7 | 1. Hash Tables | ||
8 | |||
9 | 2. Static Arrays | ||
10 | |||
11 | 3. Resizeable Arrays | ||
12 | |||
13 | Each of these situations are discussed below. | ||
14 | |||
15 | |||
16 | Situation 1: Hash Tables | ||
17 | |||
18 | Hash tables are often implemented as an array, where each array entry | ||
19 | has a linked-list hash chain. Each hash chain can be protected by RCU | ||
20 | as described in the listRCU.txt document. This approach also applies | ||
21 | to other array-of-list situations, such as radix trees. | ||
22 | |||
23 | |||
24 | Situation 2: Static Arrays | ||
25 | |||
26 | Static arrays, where the data (rather than a pointer to the data) is | ||
27 | located in each array element, and where the array is never resized, | ||
28 | have not been used with RCU. Rik van Riel recommends using seqlock in | ||
29 | this situation, which would also have minimal read-side overhead as long | ||
30 | as updates are rare. | ||
31 | |||
32 | Quick Quiz: Why is it so important that updates be rare when | ||
33 | using seqlock? | ||
34 | |||
35 | |||
36 | Situation 3: Resizeable Arrays | ||
37 | |||
38 | Use of RCU for resizeable arrays is demonstrated by the grow_ary() | ||
39 | function used by the System V IPC code. The array is used to map from | ||
40 | semaphore, message-queue, and shared-memory IDs to the data structure | ||
41 | that represents the corresponding IPC construct. The grow_ary() | ||
42 | function does not acquire any locks; instead its caller must hold the | ||
43 | ids->sem semaphore. | ||
44 | |||
45 | The grow_ary() function, shown below, does some limit checks, allocates a | ||
46 | new ipc_id_ary, copies the old to the new portion of the new, initializes | ||
47 | the remainder of the new, updates the ids->entries pointer to point to | ||
48 | the new array, and invokes ipc_rcu_putref() to free up the old array. | ||
49 | Note that rcu_assign_pointer() is used to update the ids->entries pointer, | ||
50 | which includes any memory barriers required on whatever architecture | ||
51 | you are running on. | ||
52 | |||
53 | static int grow_ary(struct ipc_ids* ids, int newsize) | ||
54 | { | ||
55 | struct ipc_id_ary* new; | ||
56 | struct ipc_id_ary* old; | ||
57 | int i; | ||
58 | int size = ids->entries->size; | ||
59 | |||
60 | if(newsize > IPCMNI) | ||
61 | newsize = IPCMNI; | ||
62 | if(newsize <= size) | ||
63 | return newsize; | ||
64 | |||
65 | new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize + | ||
66 | sizeof(struct ipc_id_ary)); | ||
67 | if(new == NULL) | ||
68 | return size; | ||
69 | new->size = newsize; | ||
70 | memcpy(new->p, ids->entries->p, | ||
71 | sizeof(struct kern_ipc_perm *)*size + | ||
72 | sizeof(struct ipc_id_ary)); | ||
73 | for(i=size;i<newsize;i++) { | ||
74 | new->p[i] = NULL; | ||
75 | } | ||
76 | old = ids->entries; | ||
77 | |||
78 | /* | ||
79 | * Use rcu_assign_pointer() to make sure the memcpyed | ||
80 | * contents of the new array are visible before the new | ||
81 | * array becomes visible. | ||
82 | */ | ||
83 | rcu_assign_pointer(ids->entries, new); | ||
84 | |||
85 | ipc_rcu_putref(old); | ||
86 | return newsize; | ||
87 | } | ||
88 | |||
89 | The ipc_rcu_putref() function decrements the array's reference count | ||
90 | and then, if the reference count has dropped to zero, uses call_rcu() | ||
91 | to free the array after a grace period has elapsed. | ||
92 | |||
93 | The array is traversed by the ipc_lock() function. This function | ||
94 | indexes into the array under the protection of rcu_read_lock(), | ||
95 | using rcu_dereference() to pick up the pointer to the array so | ||
96 | that it may later safely be dereferenced -- memory barriers are | ||
97 | required on the Alpha CPU. Since the size of the array is stored | ||
98 | with the array itself, there can be no array-size mismatches, so | ||
99 | a simple check suffices. The pointer to the structure corresponding | ||
100 | to the desired IPC object is placed in "out", with NULL indicating | ||
101 | a non-existent entry. After acquiring "out->lock", the "out->deleted" | ||
102 | flag indicates whether the IPC object is in the process of being | ||
103 | deleted, and, if not, the pointer is returned. | ||
104 | |||
105 | struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id) | ||
106 | { | ||
107 | struct kern_ipc_perm* out; | ||
108 | int lid = id % SEQ_MULTIPLIER; | ||
109 | struct ipc_id_ary* entries; | ||
110 | |||
111 | rcu_read_lock(); | ||
112 | entries = rcu_dereference(ids->entries); | ||
113 | if(lid >= entries->size) { | ||
114 | rcu_read_unlock(); | ||
115 | return NULL; | ||
116 | } | ||
117 | out = entries->p[lid]; | ||
118 | if(out == NULL) { | ||
119 | rcu_read_unlock(); | ||
120 | return NULL; | ||
121 | } | ||
122 | spin_lock(&out->lock); | ||
123 | |||
124 | /* ipc_rmid() may have already freed the ID while ipc_lock | ||
125 | * was spinning: here verify that the structure is still valid | ||
126 | */ | ||
127 | if (out->deleted) { | ||
128 | spin_unlock(&out->lock); | ||
129 | rcu_read_unlock(); | ||
130 | return NULL; | ||
131 | } | ||
132 | return out; | ||
133 | } | ||
134 | |||
135 | |||
136 | Answer to Quick Quiz: | ||
137 | |||
138 | The reason that it is important that updates be rare when | ||
139 | using seqlock is that frequent updates can livelock readers. | ||
140 | One way to avoid this problem is to assign a seqlock for | ||
141 | each array entry rather than to the entire array. | ||
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt new file mode 100644 index 00000000000..b3a568abe6b --- /dev/null +++ b/Documentation/RCU/checklist.txt | |||
@@ -0,0 +1,157 @@ | |||
1 | Review Checklist for RCU Patches | ||
2 | |||
3 | |||
4 | This document contains a checklist for producing and reviewing patches | ||
5 | that make use of RCU. Violating any of the rules listed below will | ||
6 | result in the same sorts of problems that leaving out a locking primitive | ||
7 | would cause. This list is based on experiences reviewing such patches | ||
8 | over a rather long period of time, but improvements are always welcome! | ||
9 | |||
10 | 0. Is RCU being applied to a read-mostly situation? If the data | ||
11 | structure is updated more than about 10% of the time, then | ||
12 | you should strongly consider some other approach, unless | ||
13 | detailed performance measurements show that RCU is nonetheless | ||
14 | the right tool for the job. | ||
15 | |||
16 | The other exception would be where performance is not an issue, | ||
17 | and RCU provides a simpler implementation. An example of this | ||
18 | situation is the dynamic NMI code in the Linux 2.6 kernel, | ||
19 | at least on architectures where NMIs are rare. | ||
20 | |||
21 | 1. Does the update code have proper mutual exclusion? | ||
22 | |||
23 | RCU does allow -readers- to run (almost) naked, but -writers- must | ||
24 | still use some sort of mutual exclusion, such as: | ||
25 | |||
26 | a. locking, | ||
27 | b. atomic operations, or | ||
28 | c. restricting updates to a single task. | ||
29 | |||
30 | If you choose #b, be prepared to describe how you have handled | ||
31 | memory barriers on weakly ordered machines (pretty much all of | ||
32 | them -- even x86 allows reads to be reordered), and be prepared | ||
33 | to explain why this added complexity is worthwhile. If you | ||
34 | choose #c, be prepared to explain how this single task does not | ||
35 | become a major bottleneck on big multiprocessor machines. | ||
36 | |||
37 | 2. Do the RCU read-side critical sections make proper use of | ||
38 | rcu_read_lock() and friends? These primitives are needed | ||
39 | to suppress preemption (or bottom halves, in the case of | ||
40 | rcu_read_lock_bh()) in the read-side critical sections, | ||
41 | and are also an excellent aid to readability. | ||
42 | |||
43 | 3. Does the update code tolerate concurrent accesses? | ||
44 | |||
45 | The whole point of RCU is to permit readers to run without | ||
46 | any locks or atomic operations. This means that readers will | ||
47 | be running while updates are in progress. There are a number | ||
48 | of ways to handle this concurrency, depending on the situation: | ||
49 | |||
50 | a. Make updates appear atomic to readers. For example, | ||
51 | pointer updates to properly aligned fields will appear | ||
52 | atomic, as will individual atomic primitives. Operations | ||
53 | performed under a lock and sequences of multiple atomic | ||
54 | primitives will -not- appear to be atomic. | ||
55 | |||
56 | This is almost always the best approach. | ||
57 | |||
58 | b. Carefully order the updates and the reads so that | ||
59 | readers see valid data at all phases of the update. | ||
60 | This is often more difficult than it sounds, especially | ||
61 | given modern CPUs' tendency to reorder memory references. | ||
62 | One must usually liberally sprinkle memory barriers | ||
63 | (smp_wmb(), smp_rmb(), smp_mb()) through the code, | ||
64 | making it difficult to understand and to test. | ||
65 | |||
66 | It is usually better to group the changing data into | ||
67 | a separate structure, so that the change may be made | ||
68 | to appear atomic by updating a pointer to reference | ||
69 | a new structure containing updated values. | ||
70 | |||
71 | 4. Weakly ordered CPUs pose special challenges. Almost all CPUs | ||
72 | are weakly ordered -- even i386 CPUs allow reads to be reordered. | ||
73 | RCU code must take all of the following measures to prevent | ||
74 | memory-corruption problems: | ||
75 | |||
76 | a. Readers must maintain proper ordering of their memory | ||
77 | accesses. The rcu_dereference() primitive ensures that | ||
78 | the CPU picks up the pointer before it picks up the data | ||
79 | that the pointer points to. This really is necessary | ||
80 | on Alpha CPUs. If you don't believe me, see: | ||
81 | |||
82 | http://www.openvms.compaq.com/wizard/wiz_2637.html | ||
83 | |||
84 | The rcu_dereference() primitive is also an excellent | ||
85 | documentation aid, letting the person reading the code | ||
86 | know exactly which pointers are protected by RCU. | ||
87 | |||
88 | The rcu_dereference() primitive is used by the various | ||
89 | "_rcu()" list-traversal primitives, such as the | ||
90 | list_for_each_entry_rcu(). | ||
91 | |||
92 | b. If the list macros are being used, the list_del_rcu(), | ||
93 | list_add_tail_rcu(), and list_del_rcu() primitives must | ||
94 | be used in order to prevent weakly ordered machines from | ||
95 | misordering structure initialization and pointer planting. | ||
96 | Similarly, if the hlist macros are being used, the | ||
97 | hlist_del_rcu() and hlist_add_head_rcu() primitives | ||
98 | are required. | ||
99 | |||
100 | c. Updates must ensure that initialization of a given | ||
101 | structure happens before pointers to that structure are | ||
102 | publicized. Use the rcu_assign_pointer() primitive | ||
103 | when publicizing a pointer to a structure that can | ||
104 | be traversed by an RCU read-side critical section. | ||
105 | |||
106 | [The rcu_assign_pointer() primitive is in process.] | ||
107 | |||
108 | 5. If call_rcu(), or a related primitive such as call_rcu_bh(), | ||
109 | is used, the callback function must be written to be called | ||
110 | from softirq context. In particular, it cannot block. | ||
111 | |||
112 | 6. Since synchronize_kernel() blocks, it cannot be called from | ||
113 | any sort of irq context. | ||
114 | |||
115 | 7. If the updater uses call_rcu(), then the corresponding readers | ||
116 | must use rcu_read_lock() and rcu_read_unlock(). If the updater | ||
117 | uses call_rcu_bh(), then the corresponding readers must use | ||
118 | rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up | ||
119 | will result in confusion and broken kernels. | ||
120 | |||
121 | One exception to this rule: rcu_read_lock() and rcu_read_unlock() | ||
122 | may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() | ||
123 | in cases where local bottom halves are already known to be | ||
124 | disabled, for example, in irq or softirq context. Commenting | ||
125 | such cases is a must, of course! And the jury is still out on | ||
126 | whether the increased speed is worth it. | ||
127 | |||
128 | 8. Although synchronize_kernel() is a bit slower than is call_rcu(), | ||
129 | it usually results in simpler code. So, unless update performance | ||
130 | is important or the updaters cannot block, synchronize_kernel() | ||
131 | should be used in preference to call_rcu(). | ||
132 | |||
133 | 9. All RCU list-traversal primitives, which include | ||
134 | list_for_each_rcu(), list_for_each_entry_rcu(), | ||
135 | list_for_each_continue_rcu(), and list_for_each_safe_rcu(), | ||
136 | must be within an RCU read-side critical section. RCU | ||
137 | read-side critical sections are delimited by rcu_read_lock() | ||
138 | and rcu_read_unlock(), or by similar primitives such as | ||
139 | rcu_read_lock_bh() and rcu_read_unlock_bh(). | ||
140 | |||
141 | Use of the _rcu() list-traversal primitives outside of an | ||
142 | RCU read-side critical section causes no harm other than | ||
143 | a slight performance degradation on Alpha CPUs and some | ||
144 | confusion on the part of people trying to read the code. | ||
145 | |||
146 | Another way of thinking of this is "If you are holding the | ||
147 | lock that prevents the data structure from changing, why do | ||
148 | you also need RCU-based protection?" That said, there may | ||
149 | well be situations where use of the _rcu() list-traversal | ||
150 | primitives while the update-side lock is held results in | ||
151 | simpler and more maintainable code. The jury is still out | ||
152 | on this question. | ||
153 | |||
154 | 10. Conversely, if you are in an RCU read-side critical section, | ||
155 | you -must- use the "_rcu()" variants of the list macros. | ||
156 | Failing to do so will break Alpha and confuse people reading | ||
157 | your code. | ||
diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt new file mode 100644 index 00000000000..bda6ead69bd --- /dev/null +++ b/Documentation/RCU/listRCU.txt | |||
@@ -0,0 +1,307 @@ | |||
1 | Using RCU to Protect Read-Mostly Linked Lists | ||
2 | |||
3 | |||
4 | One of the best applications of RCU is to protect read-mostly linked lists | ||
5 | ("struct list_head" in list.h). One big advantage of this approach | ||
6 | is that all of the required memory barriers are included for you in | ||
7 | the list macros. This document describes several applications of RCU, | ||
8 | with the best fits first. | ||
9 | |||
10 | |||
11 | Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates | ||
12 | |||
13 | The best applications are cases where, if reader-writer locking were | ||
14 | used, the read-side lock would be dropped before taking any action | ||
15 | based on the results of the search. The most celebrated example is | ||
16 | the routing table. Because the routing table is tracking the state of | ||
17 | equipment outside of the computer, it will at times contain stale data. | ||
18 | Therefore, once the route has been computed, there is no need to hold | ||
19 | the routing table static during transmission of the packet. After all, | ||
20 | you can hold the routing table static all you want, but that won't keep | ||
21 | the external Internet from changing, and it is the state of the external | ||
22 | Internet that really matters. In addition, routing entries are typically | ||
23 | added or deleted, rather than being modified in place. | ||
24 | |||
25 | A straightforward example of this use of RCU may be found in the | ||
26 | system-call auditing support. For example, a reader-writer locked | ||
27 | implementation of audit_filter_task() might be as follows: | ||
28 | |||
29 | static enum audit_state audit_filter_task(struct task_struct *tsk) | ||
30 | { | ||
31 | struct audit_entry *e; | ||
32 | enum audit_state state; | ||
33 | |||
34 | read_lock(&auditsc_lock); | ||
35 | list_for_each_entry(e, &audit_tsklist, list) { | ||
36 | if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { | ||
37 | read_unlock(&auditsc_lock); | ||
38 | return state; | ||
39 | } | ||
40 | } | ||
41 | read_unlock(&auditsc_lock); | ||
42 | return AUDIT_BUILD_CONTEXT; | ||
43 | } | ||
44 | |||
45 | Here the list is searched under the lock, but the lock is dropped before | ||
46 | the corresponding value is returned. By the time that this value is acted | ||
47 | on, the list may well have been modified. This makes sense, since if | ||
48 | you are turning auditing off, it is OK to audit a few extra system calls. | ||
49 | |||
50 | This means that RCU can be easily applied to the read side, as follows: | ||
51 | |||
52 | static enum audit_state audit_filter_task(struct task_struct *tsk) | ||
53 | { | ||
54 | struct audit_entry *e; | ||
55 | enum audit_state state; | ||
56 | |||
57 | rcu_read_lock(); | ||
58 | list_for_each_entry_rcu(e, &audit_tsklist, list) { | ||
59 | if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { | ||
60 | rcu_read_unlock(); | ||
61 | return state; | ||
62 | } | ||
63 | } | ||
64 | rcu_read_unlock(); | ||
65 | return AUDIT_BUILD_CONTEXT; | ||
66 | } | ||
67 | |||
68 | The read_lock() and read_unlock() calls have become rcu_read_lock() | ||
69 | and rcu_read_unlock(), respectively, and the list_for_each_entry() has | ||
70 | become list_for_each_entry_rcu(). The _rcu() list-traversal primitives | ||
71 | insert the read-side memory barriers that are required on DEC Alpha CPUs. | ||
72 | |||
73 | The changes to the update side are also straightforward. A reader-writer | ||
74 | lock might be used as follows for deletion and insertion: | ||
75 | |||
76 | static inline int audit_del_rule(struct audit_rule *rule, | ||
77 | struct list_head *list) | ||
78 | { | ||
79 | struct audit_entry *e; | ||
80 | |||
81 | write_lock(&auditsc_lock); | ||
82 | list_for_each_entry(e, list, list) { | ||
83 | if (!audit_compare_rule(rule, &e->rule)) { | ||
84 | list_del(&e->list); | ||
85 | write_unlock(&auditsc_lock); | ||
86 | return 0; | ||
87 | } | ||
88 | } | ||
89 | write_unlock(&auditsc_lock); | ||
90 | return -EFAULT; /* No matching rule */ | ||
91 | } | ||
92 | |||
93 | static inline int audit_add_rule(struct audit_entry *entry, | ||
94 | struct list_head *list) | ||
95 | { | ||
96 | write_lock(&auditsc_lock); | ||
97 | if (entry->rule.flags & AUDIT_PREPEND) { | ||
98 | entry->rule.flags &= ~AUDIT_PREPEND; | ||
99 | list_add(&entry->list, list); | ||
100 | } else { | ||
101 | list_add_tail(&entry->list, list); | ||
102 | } | ||
103 | write_unlock(&auditsc_lock); | ||
104 | return 0; | ||
105 | } | ||
106 | |||
107 | Following are the RCU equivalents for these two functions: | ||
108 | |||
109 | static inline int audit_del_rule(struct audit_rule *rule, | ||
110 | struct list_head *list) | ||
111 | { | ||
112 | struct audit_entry *e; | ||
113 | |||
114 | /* Do not use the _rcu iterator here, since this is the only | ||
115 | * deletion routine. */ | ||
116 | list_for_each_entry(e, list, list) { | ||
117 | if (!audit_compare_rule(rule, &e->rule)) { | ||
118 | list_del_rcu(&e->list); | ||
119 | call_rcu(&e->rcu, audit_free_rule, e); | ||
120 | return 0; | ||
121 | } | ||
122 | } | ||
123 | return -EFAULT; /* No matching rule */ | ||
124 | } | ||
125 | |||
126 | static inline int audit_add_rule(struct audit_entry *entry, | ||
127 | struct list_head *list) | ||
128 | { | ||
129 | if (entry->rule.flags & AUDIT_PREPEND) { | ||
130 | entry->rule.flags &= ~AUDIT_PREPEND; | ||
131 | list_add_rcu(&entry->list, list); | ||
132 | } else { | ||
133 | list_add_tail_rcu(&entry->list, list); | ||
134 | } | ||
135 | return 0; | ||
136 | } | ||
137 | |||
138 | Normally, the write_lock() and write_unlock() would be replaced by | ||
139 | a spin_lock() and a spin_unlock(), but in this case, all callers hold | ||
140 | audit_netlink_sem, so no additional locking is required. The auditsc_lock | ||
141 | can therefore be eliminated, since use of RCU eliminates the need for | ||
142 | writers to exclude readers. | ||
143 | |||
144 | The list_del(), list_add(), and list_add_tail() primitives have been | ||
145 | replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). | ||
146 | The _rcu() list-manipulation primitives add memory barriers that are | ||
147 | needed on weakly ordered CPUs (most of them!). | ||
148 | |||
149 | So, when readers can tolerate stale data and when entries are either added | ||
150 | or deleted, without in-place modification, it is very easy to use RCU! | ||
151 | |||
152 | |||
153 | Example 2: Handling In-Place Updates | ||
154 | |||
155 | The system-call auditing code does not update auditing rules in place. | ||
156 | However, if it did, reader-writer-locked code to do so might look as | ||
157 | follows (presumably, the field_count is only permitted to decrease, | ||
158 | otherwise, the added fields would need to be filled in): | ||
159 | |||
160 | static inline int audit_upd_rule(struct audit_rule *rule, | ||
161 | struct list_head *list, | ||
162 | __u32 newaction, | ||
163 | __u32 newfield_count) | ||
164 | { | ||
165 | struct audit_entry *e; | ||
166 | struct audit_newentry *ne; | ||
167 | |||
168 | write_lock(&auditsc_lock); | ||
169 | list_for_each_entry(e, list, list) { | ||
170 | if (!audit_compare_rule(rule, &e->rule)) { | ||
171 | e->rule.action = newaction; | ||
172 | e->rule.file_count = newfield_count; | ||
173 | write_unlock(&auditsc_lock); | ||
174 | return 0; | ||
175 | } | ||
176 | } | ||
177 | write_unlock(&auditsc_lock); | ||
178 | return -EFAULT; /* No matching rule */ | ||
179 | } | ||
180 | |||
181 | The RCU version creates a copy, updates the copy, then replaces the old | ||
182 | entry with the newly updated entry. This sequence of actions, allowing | ||
183 | concurrent reads while doing a copy to perform an update, is what gives | ||
184 | RCU ("read-copy update") its name. The RCU code is as follows: | ||
185 | |||
186 | static inline int audit_upd_rule(struct audit_rule *rule, | ||
187 | struct list_head *list, | ||
188 | __u32 newaction, | ||
189 | __u32 newfield_count) | ||
190 | { | ||
191 | struct audit_entry *e; | ||
192 | struct audit_newentry *ne; | ||
193 | |||
194 | list_for_each_entry(e, list, list) { | ||
195 | if (!audit_compare_rule(rule, &e->rule)) { | ||
196 | ne = kmalloc(sizeof(*entry), GFP_ATOMIC); | ||
197 | if (ne == NULL) | ||
198 | return -ENOMEM; | ||
199 | audit_copy_rule(&ne->rule, &e->rule); | ||
200 | ne->rule.action = newaction; | ||
201 | ne->rule.file_count = newfield_count; | ||
202 | list_add_rcu(ne, e); | ||
203 | list_del(e); | ||
204 | call_rcu(&e->rcu, audit_free_rule, e); | ||
205 | return 0; | ||
206 | } | ||
207 | } | ||
208 | return -EFAULT; /* No matching rule */ | ||
209 | } | ||
210 | |||
211 | Again, this assumes that the caller holds audit_netlink_sem. Normally, | ||
212 | the reader-writer lock would become a spinlock in this sort of code. | ||
213 | |||
214 | |||
215 | Example 3: Eliminating Stale Data | ||
216 | |||
217 | The auditing examples above tolerate stale data, as do most algorithms | ||
218 | that are tracking external state. Because there is a delay from the | ||
219 | time the external state changes before Linux becomes aware of the change, | ||
220 | additional RCU-induced staleness is normally not a problem. | ||
221 | |||
222 | However, there are many examples where stale data cannot be tolerated. | ||
223 | One example in the Linux kernel is the System V IPC (see the ipc_lock() | ||
224 | function in ipc/util.c). This code checks a "deleted" flag under a | ||
225 | per-entry spinlock, and, if the "deleted" flag is set, pretends that the | ||
226 | entry does not exist. For this to be helpful, the search function must | ||
227 | return holding the per-entry spinlock, as ipc_lock() does in fact do. | ||
228 | |||
229 | Quick Quiz: Why does the search function need to return holding the | ||
230 | per-entry lock for this deleted-flag technique to be helpful? | ||
231 | |||
232 | If the system-call audit module were to ever need to reject stale data, | ||
233 | one way to accomplish this would be to add a "deleted" flag and a "lock" | ||
234 | spinlock to the audit_entry structure, and modify audit_filter_task() | ||
235 | as follows: | ||
236 | |||
237 | static enum audit_state audit_filter_task(struct task_struct *tsk) | ||
238 | { | ||
239 | struct audit_entry *e; | ||
240 | enum audit_state state; | ||
241 | |||
242 | rcu_read_lock(); | ||
243 | list_for_each_entry_rcu(e, &audit_tsklist, list) { | ||
244 | if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { | ||
245 | spin_lock(&e->lock); | ||
246 | if (e->deleted) { | ||
247 | spin_unlock(&e->lock); | ||
248 | rcu_read_unlock(); | ||
249 | return AUDIT_BUILD_CONTEXT; | ||
250 | } | ||
251 | rcu_read_unlock(); | ||
252 | return state; | ||
253 | } | ||
254 | } | ||
255 | rcu_read_unlock(); | ||
256 | return AUDIT_BUILD_CONTEXT; | ||
257 | } | ||
258 | |||
259 | Note that this example assumes that entries are only added and deleted. | ||
260 | Additional mechanism is required to deal correctly with the | ||
261 | update-in-place performed by audit_upd_rule(). For one thing, | ||
262 | audit_upd_rule() would need additional memory barriers to ensure | ||
263 | that the list_add_rcu() was really executed before the list_del_rcu(). | ||
264 | |||
265 | The audit_del_rule() function would need to set the "deleted" | ||
266 | flag under the spinlock as follows: | ||
267 | |||
268 | static inline int audit_del_rule(struct audit_rule *rule, | ||
269 | struct list_head *list) | ||
270 | { | ||
271 | struct audit_entry *e; | ||
272 | |||
273 | /* Do not use the _rcu iterator here, since this is the only | ||
274 | * deletion routine. */ | ||
275 | list_for_each_entry(e, list, list) { | ||
276 | if (!audit_compare_rule(rule, &e->rule)) { | ||
277 | spin_lock(&e->lock); | ||
278 | list_del_rcu(&e->list); | ||
279 | e->deleted = 1; | ||
280 | spin_unlock(&e->lock); | ||
281 | call_rcu(&e->rcu, audit_free_rule, e); | ||
282 | return 0; | ||
283 | } | ||
284 | } | ||
285 | return -EFAULT; /* No matching rule */ | ||
286 | } | ||
287 | |||
288 | |||
289 | Summary | ||
290 | |||
291 | Read-mostly list-based data structures that can tolerate stale data are | ||
292 | the most amenable to use of RCU. The simplest case is where entries are | ||
293 | either added or deleted from the data structure (or atomically modified | ||
294 | in place), but non-atomic in-place modifications can be handled by making | ||
295 | a copy, updating the copy, then replacing the original with the copy. | ||
296 | If stale data cannot be tolerated, then a "deleted" flag may be used | ||
297 | in conjunction with a per-entry spinlock in order to allow the search | ||
298 | function to reject newly deleted data. | ||
299 | |||
300 | |||
301 | Answer to Quick Quiz | ||
302 | |||
303 | If the search function drops the per-entry lock before returning, then | ||
304 | the caller will be processing stale data in any case. If it is really | ||
305 | OK to be processing stale data, then you don't need a "deleted" flag. | ||
306 | If processing stale data really is a problem, then you need to hold the | ||
307 | per-entry lock across all of the code that uses the value looked up. | ||
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt new file mode 100644 index 00000000000..7e0c2ab6f2b --- /dev/null +++ b/Documentation/RCU/rcu.txt | |||
@@ -0,0 +1,67 @@ | |||
1 | RCU Concepts | ||
2 | |||
3 | |||
4 | The basic idea behind RCU (read-copy update) is to split destructive | ||
5 | operations into two parts, one that prevents anyone from seeing the data | ||
6 | item being destroyed, and one that actually carries out the destruction. | ||
7 | A "grace period" must elapse between the two parts, and this grace period | ||
8 | must be long enough that any readers accessing the item being deleted have | ||
9 | since dropped their references. For example, an RCU-protected deletion | ||
10 | from a linked list would first remove the item from the list, wait for | ||
11 | a grace period to elapse, then free the element. See the listRCU.txt | ||
12 | file for more information on using RCU with linked lists. | ||
13 | |||
14 | |||
15 | Frequently Asked Questions | ||
16 | |||
17 | o Why would anyone want to use RCU? | ||
18 | |||
19 | The advantage of RCU's two-part approach is that RCU readers need | ||
20 | not acquire any locks, perform any atomic instructions, write to | ||
21 | shared memory, or (on CPUs other than Alpha) execute any memory | ||
22 | barriers. The fact that these operations are quite expensive | ||
23 | on modern CPUs is what gives RCU its performance advantages | ||
24 | in read-mostly situations. The fact that RCU readers need not | ||
25 | acquire locks can also greatly simplify deadlock-avoidance code. | ||
26 | |||
27 | o How can the updater tell when a grace period has completed | ||
28 | if the RCU readers give no indication when they are done? | ||
29 | |||
30 | Just as with spinlocks, RCU readers are not permitted to | ||
31 | block, switch to user-mode execution, or enter the idle loop. | ||
32 | Therefore, as soon as a CPU is seen passing through any of these | ||
33 | three states, we know that that CPU has exited any previous RCU | ||
34 | read-side critical sections. So, if we remove an item from a | ||
35 | linked list, and then wait until all CPUs have switched context, | ||
36 | executed in user mode, or executed in the idle loop, we can | ||
37 | safely free up that item. | ||
38 | |||
39 | o If I am running on a uniprocessor kernel, which can only do one | ||
40 | thing at a time, why should I wait for a grace period? | ||
41 | |||
42 | See the UP.txt file in this directory. | ||
43 | |||
44 | o How can I see where RCU is currently used in the Linux kernel? | ||
45 | |||
46 | Search for "rcu_read_lock", "call_rcu", and "synchronize_kernel". | ||
47 | |||
48 | o What guidelines should I follow when writing code that uses RCU? | ||
49 | |||
50 | See the checklist.txt file in this directory. | ||
51 | |||
52 | o Why the name "RCU"? | ||
53 | |||
54 | "RCU" stands for "read-copy update". The file listRCU.txt has | ||
55 | more information on where this name came from, search for | ||
56 | "read-copy update" to find it. | ||
57 | |||
58 | o I hear that RCU is patented? What is with that? | ||
59 | |||
60 | Yes, it is. There are several known patents related to RCU, | ||
61 | search for the string "Patent" in RTFP.txt to find them. | ||
62 | Of these, one was allowed to lapse by the assignee, and the | ||
63 | others have been contributed to the Linux kernel under GPL. | ||
64 | |||
65 | o Where can I find more information on RCU? | ||
66 | |||
67 | See the RTFP.txt file in this directory. | ||