Diffstat (limited to 'Documentation/RCU')
-rw-r--r--  Documentation/RCU/RTFP.txt      | 387
-rw-r--r--  Documentation/RCU/UP.txt        |  64
-rw-r--r--  Documentation/RCU/arrayRCU.txt  | 141
-rw-r--r--  Documentation/RCU/checklist.txt | 157
-rw-r--r--  Documentation/RCU/listRCU.txt   | 307
-rw-r--r--  Documentation/RCU/rcu.txt       |  67
6 files changed, 1123 insertions, 0 deletions
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt
new file mode 100644
index 000000000000..12250b342e1f
--- /dev/null
+++ b/Documentation/RCU/RTFP.txt
@@ -0,0 +1,387 @@
1Read the F-ing Papers!
2
3
4This document describes RCU-related publications, and is followed by
5the corresponding bibtex entries.
6
7The first thing resembling RCU was published in 1980, when Kung and Lehman
8[Kung80] recommended use of a garbage collector to defer destruction
9of nodes in a parallel binary search tree in order to simplify its
10implementation. This works well in environments that have garbage
11collectors, but current production garbage collectors incur significant
12read-side overhead.
13
14In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring
15destruction until all threads running at that time have terminated, again
16for a parallel binary search tree. This approach works well in systems
17with short-lived threads, such as the K42 research operating system.
18However, Linux has long-lived tasks, so more is needed.
19
20In 1986, Hennessy, Osisek, and Seigh [Hennessy89] introduced passive
21serialization, which is an RCU-like mechanism that relies on the presence
22of "quiescent states" in the VM/XA hypervisor that are guaranteed not
23to be referencing the data structure. However, this mechanism was not
24optimized for modern computer systems, which is not surprising given
25that these overheads were not so expensive in the mid-80s. Nonetheless,
26passive serialization appears to be the first deferred-destruction
27mechanism to be used in production. Furthermore, the relevant patent has
28lapsed, so this approach may be used in non-GPL software, if desired.
29(In contrast, use of RCU is permitted only in software licensed under
30GPL. Sorry!!!)
31
32In 1990, Pugh [Pugh90] noted that explicitly tracking which threads
33were reading a given data structure permitted deferred free to operate
34in the presence of non-terminating threads. However, this explicit
35tracking imposes significant read-side overhead, which is undesirable
36in read-mostly situations. This algorithm does take pains to avoid
37write-side contention and parallelize the other write-side overheads by
38providing a fine-grained locking design; however, it would be interesting
39to see how much of the performance advantage reported in 1990 remains
40in 2004.
41
42At about this same time, Adams [Adams91] described ``chaotic relaxation'',
43where the normal barriers between successive iterations of convergent
44numerical algorithms are relaxed, so that iteration $n$ might use
45data from iteration $n-1$ or even $n-2$. This introduces error,
46which typically slows convergence and thus increases the number of
47iterations required. However, this increase is sometimes more than made
48up for by a reduction in the number of expensive barrier operations,
49which are otherwise required to synchronize the threads at the end
50of each iteration. Unfortunately, chaotic relaxation requires highly
51structured data, such as the matrices used in scientific programs, and
52is thus inapplicable to most data structures in operating-system kernels.
53
54In 1993, Jacobson [Jacobson93] verbally described what is perhaps the
55simplest deferred-free technique: simply waiting a fixed amount of time
56before freeing blocks awaiting deferred free. Jacobson did not describe
57any write-side changes he might have made in this work using SGI's Irix
58kernel. Aju John published a similar technique in 1995 [AjuJohn95].
59This works well if there is a well-defined upper bound on the length of
60time that reading threads can hold references, as there might well be in
61hard real-time systems. However, if this time is exceeded, perhaps due
62to preemption, excessive interrupts, or larger-than-anticipated load,
63memory corruption can ensue, with no reasonable means of diagnosis.
64Jacobson's technique is therefore inappropriate for use in production
65operating-system kernels, except when such kernels can provide hard
66real-time response guarantees for all operations.
67
68Also in 1995, Pu et al. [Pu95a] applied a technique similar to Pugh's
69read-side tracking to permit replugging of algorithms within a commercial
70Unix operating system. However, this replugging permitted only a single
71reader at a time. The following year, this same group of researchers
72extended their technique to allow for multiple readers [Cowan96a].
73Their approach requires memory barriers (and thus pipeline stalls),
74but reduces memory latency, contention, and locking overheads.
75
761995 also saw the first publication of DYNIX/ptx's RCU mechanism
77[Slingwine95], which was optimized for modern CPU architectures,
78and was successfully applied to a number of situations within the
79DYNIX/ptx kernel. The corresponding conference paper appeared in 1998
80[McKenney98].
81
82In 1999, the Tornado and K42 groups described their "generations"
83mechanism, which is quite similar to RCU [Gamsa99]. These operating systems
84made pervasive use of RCU in place of "existence locks", which greatly
85simplifies locking hierarchies.
86
872001 saw the first RCU presentation involving Linux [McKenney01a]
88at OLS. The resulting abundance of RCU patches was presented the
89following year [McKenney02a], and use of RCU in dcache was first
90described that same year [Linder02a].
91
92Also in 2002, Michael [Michael02b,Michael02a] presented techniques
93that defer the destruction of data structures to simplify non-blocking
94synchronization (wait-free synchronization, lock-free synchronization,
95and obstruction-free synchronization are all examples of non-blocking
96synchronization). In particular, this technique eliminates locking,
97reduces contention, reduces memory latency for readers, and parallelizes
98pipeline stalls and memory latency for writers. However, these
99techniques still impose significant read-side overhead in the form of
100memory barriers. Researchers at Sun worked along similar lines in the
101same timeframe [HerlihyLM02,HerlihyLMS03].
102
103In 2003, the K42 group described how RCU could be used to create
104hot-pluggable implementations of operating-system functions [Appavoo03a]. Later that
105year saw a paper describing an RCU implementation of System V IPC
106[Arcangeli03], and an introduction to RCU in Linux Journal [McKenney03a].
107
1082004 has seen a Linux-Journal article on use of RCU in dcache
109[McKenney04a], a performance comparison of locking to RCU on several
110different CPUs [McKenney04b], a dissertation describing use of RCU in a
111number of operating-system kernels [PaulEdwardMcKenneyPhD], and a paper
112describing how to make RCU safe for soft-realtime applications [Sarma04c].
113
114
115Bibtex Entries
116
117@article{Kung80
118,author="H. T. Kung and Q. Lehman"
119,title="Concurrent Maintenance of Binary Search Trees"
120,Year="1980"
121,Month="September"
122,journal="ACM Transactions on Database Systems"
123,volume="5"
124,number="3"
125,pages="354-382"
126}
127
128@techreport{Manber82
129,author="Udi Manber and Richard E. Ladner"
130,title="Concurrency Control in a Dynamic Search Structure"
131,institution="Department of Computer Science, University of Washington"
132,address="Seattle, Washington"
133,year="1982"
134,number="82-01-01"
135,month="January"
136,pages="28"
137}
138
139@article{Manber84
140,author="Udi Manber and Richard E. Ladner"
141,title="Concurrency Control in a Dynamic Search Structure"
142,Year="1984"
143,Month="September"
144,journal="ACM Transactions on Database Systems"
145,volume="9"
146,number="3"
147,pages="439-455"
148}
149
150@techreport{Hennessy89
151,author="James P. Hennessy and Damian L. Osisek and Joseph W. {Seigh II}"
152,title="Passive Serialization in a Multitasking Environment"
153,institution="US Patent and Trademark Office"
154,address="Washington, DC"
155,year="1989"
156,number="US Patent 4,809,168 (lapsed)"
157,month="February"
158,pages="11"
159}
160
161@techreport{Pugh90
162,author="William Pugh"
163,title="Concurrent Maintenance of Skip Lists"
164,institution="Institute of Advanced Computer Science Studies, Department of Computer Science, University of Maryland"
165,address="College Park, Maryland"
166,year="1990"
167,number="CS-TR-2222.1"
168,month="June"
169}
170
171@Book{Adams91
172,Author="Gregory R. Adams"
173,title="Concurrent Programming, Principles, and Practices"
174,Publisher="Benjamin Cummins"
175,Year="1991"
176}
177
178@unpublished{Jacobson93
179,author="Van Jacobson"
180,title="Avoid Read-Side Locking Via Delayed Free"
181,year="1993"
182,month="September"
183,note="Verbal discussion"
184}
185
186@Conference{AjuJohn95
187,Author="Aju John"
188,Title="Dynamic vnodes -- Design and Implementation"
189,Booktitle="{USENIX Winter 1995}"
190,Publisher="USENIX Association"
191,Month="January"
192,Year="1995"
193,pages="11-23"
194,Address="New Orleans, LA"
195}
196
197@techreport{Slingwine95
198,author="John D. Slingwine and Paul E. McKenney"
199,title="Apparatus and Method for Achieving Reduced Overhead Mutual
200Exclusion and Maintaining Coherency in a Multiprocessor System
201Utilizing Execution History and Thread Monitoring"
202,institution="US Patent and Trademark Office"
203,address="Washington, DC"
204,year="1995"
205,number="US Patent 5,442,758 (contributed under GPL)"
206,month="August"
207}
208
209@techreport{Slingwine97
210,author="John D. Slingwine and Paul E. McKenney"
211,title="Method for maintaining data coherency using thread
212activity summaries in a multicomputer system"
213,institution="US Patent and Trademark Office"
214,address="Washington, DC"
215,year="1997"
216,number="US Patent 5,608,893 (contributed under GPL)"
217,month="March"
218}
219
220@techreport{Slingwine98
221,author="John D. Slingwine and Paul E. McKenney"
222,title="Apparatus and method for achieving reduced overhead
223mutual exclusion and maintaining coherency in a multiprocessor
224system utilizing execution history and thread monitoring"
225,institution="US Patent and Trademark Office"
226,address="Washington, DC"
227,year="1998"
228,number="US Patent 5,727,209 (contributed under GPL)"
229,month="March"
230}
231
232@Conference{McKenney98
233,Author="Paul E. McKenney and John D. Slingwine"
234,Title="Read-Copy Update: Using Execution History to Solve Concurrency
235Problems"
236,Booktitle="{Parallel and Distributed Computing and Systems}"
237,Month="October"
238,Year="1998"
239,pages="509-518"
240,Address="Las Vegas, NV"
241}
242
243@Conference{Gamsa99
244,Author="Ben Gamsa and Orran Krieger and Jonathan Appavoo and Michael Stumm"
245,Title="Tornado: Maximizing Locality and Concurrency in a Shared Memory
246Multiprocessor Operating System"
247,Booktitle="{Proceedings of the 3\textsuperscript{rd} Symposium on
248Operating System Design and Implementation}"
249,Month="February"
250,Year="1999"
251,pages="87-100"
252,Address="New Orleans, LA"
253}
254
255@techreport{Slingwine01
256,author="John D. Slingwine and Paul E. McKenney"
257,title="Apparatus and method for achieving reduced overhead
258mutual exclusion and maintaining coherency in a multiprocessor
259system utilizing execution history and thread monitoring"
260,institution="US Patent and Trademark Office"
261,address="Washington, DC"
262,year="2001"
263,number="US Patent 5,219,690 (contributed under GPL)"
264,month="April"
265}
266
267@Conference{McKenney01a
268,Author="Paul E. McKenney and Jonathan Appavoo and Andi Kleen and
269Orran Krieger and Rusty Russell and Dipankar Sarma and Maneesh Soni"
270,Title="Read-Copy Update"
271,Booktitle="{Ottawa Linux Symposium}"
272,Month="July"
273,Year="2001"
274,note="Available:
275\url{http://www.linuxsymposium.org/2001/abstracts/readcopy.php}
276\url{http://www.rdrop.com/users/paulmck/rclock/rclock_OLS.2001.05.01c.pdf}
277[Viewed June 23, 2004]"
278,annotation="
279Described RCU, and presented some patches implementing and using it in
280the Linux kernel.
281"
282}
283
284@Conference{Linder02a
285,Author="Hanna Linder and Dipankar Sarma and Maneesh Soni"
286,Title="Scalability of the Directory Entry Cache"
287,Booktitle="{Ottawa Linux Symposium}"
288,Month="June"
289,Year="2002"
290,pages="289-300"
291}
292
293@Conference{McKenney02a
294,Author="Paul E. McKenney and Dipankar Sarma and
295Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell"
296,Title="Read-Copy Update"
297,Booktitle="{Ottawa Linux Symposium}"
298,Month="June"
299,Year="2002"
300,pages="338-367"
301,note="Available:
302\url{http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz}
303[Viewed June 23, 2004]"
304}
305
306@article{Appavoo03a
307,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and
308D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and
309B. Gamsa and G. R. Ganger and P. McKenney and M. Ostrowski and
310B. Rosenburg and M. Stumm and J. Xenidis"
311,title="Enabling Autonomic Behavior in Systems Software With Hot Swapping"
312,Year="2003"
313,Month="January"
314,journal="IBM Systems Journal"
315,volume="42"
316,number="1"
317,pages="60-76"
318}
319
320@Conference{Arcangeli03
321,Author="Andrea Arcangeli and Mingming Cao and Paul E. McKenney and
322Dipankar Sarma"
323,Title="Using Read-Copy Update Techniques for {System V IPC} in the
324{Linux} 2.5 Kernel"
325,Booktitle="Proceedings of the 2003 USENIX Annual Technical Conference
326(FREENIX Track)"
327,Publisher="USENIX Association"
328,year="2003"
329,month="June"
330,pages="297-310"
331}
332
333@article{McKenney03a
334,author="Paul E. McKenney"
335,title="Using {RCU} in the {Linux} 2.5 Kernel"
336,Year="2003"
337,Month="October"
338,journal="Linux Journal"
339,volume="1"
340,number="114"
341,pages="18-26"
342}
343
344@article{McKenney04a
345,author="Paul E. McKenney and Dipankar Sarma and Maneesh Soni"
346,title="Scaling dcache with {RCU}"
347,Year="2004"
348,Month="January"
349,journal="Linux Journal"
350,volume="1"
351,number="118"
352,pages="38-46"
353}
354
355@Conference{McKenney04b
356,Author="Paul E. McKenney"
357,Title="{RCU} vs. Locking Performance on Different {CPUs}"
358,Booktitle="{linux.conf.au}"
359,Month="January"
360,Year="2004"
361,Address="Adelaide, Australia"
362,note="Available:
363\url{http://www.linux.org.au/conf/2004/abstracts.html#90}
364\url{http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf}
365[Viewed June 23, 2004]"
366}
367
368@phdthesis{PaulEdwardMcKenneyPhD
369,author="Paul E. McKenney"
370,title="Exploiting Deferred Destruction:
371An Analysis of Read-Copy-Update Techniques
372in Operating System Kernels"
373,school="OGI School of Science and Engineering at
374Oregon Health and Sciences University"
375,year="2004"
376}
377
378@Conference{Sarma04c
379,Author="Dipankar Sarma and Paul E. McKenney"
380,Title="Making RCU Safe for Deep Sub-Millisecond Response Realtime Applications"
381,Booktitle="Proceedings of the 2004 USENIX Annual Technical Conference
382(FREENIX Track)"
383,Publisher="USENIX Association"
384,year="2004"
385,month="June"
386,pages="182-191"
387}
diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt
new file mode 100644
index 000000000000..551a803d82a8
--- /dev/null
+++ b/Documentation/RCU/UP.txt
@@ -0,0 +1,64 @@
1RCU on Uniprocessor Systems
2
3
4A common misconception is that, on UP systems, the call_rcu() primitive
5may immediately invoke its function, and that the synchronize_kernel()
6primitive may return immediately. The basis of this misconception
7is that since there is only one CPU, it should not be necessary to
8wait for anything else to get done, since there are no other CPUs for
9anything else to be happening on. Although this approach will sort of
10work a surprising amount of the time, it is a very bad idea in general.
11This document presents two examples that demonstrate exactly how bad an
12idea this is.
13
14
15Example 1: softirq Suicide
16
17Suppose that an RCU-based algorithm scans a linked list containing
18elements A, B, and C in process context, and can delete elements from
19this same list in softirq context. Suppose that the process-context scan
20is referencing element B when it is interrupted by softirq processing,
21which deletes element B, and then invokes call_rcu() to free element B
22after a grace period.
23
24Now, if call_rcu() were to directly invoke its arguments, then upon return
25from softirq, the list scan would find itself referencing a newly freed
26element B. This situation can greatly decrease the life expectancy of
27your kernel.
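
To make this scenario concrete, here is a minimal sketch of the two contexts
involved. All of the names in it (struct foo, foo_scan(), foo_delete(),
foo_destroy(), and do_something_with()) are hypothetical, and the sketch uses
the three-argument call_rcu() calling convention shown in listRCU.txt:

	struct foo {
		struct list_head list;
		struct rcu_head rcu;
		int key;
	};

	extern void do_something_with(int key);	/* hypothetical */

	/* Process context: scans the list, may be interrupted by softirqs. */
	void foo_scan(struct list_head *head)
	{
		struct foo *p;

		rcu_read_lock();
		list_for_each_entry_rcu(p, head, list)
			do_something_with(p->key);
		rcu_read_unlock();
	}

	/* RCU callback: runs only after a grace period has elapsed. */
	static void foo_destroy(void *arg)
	{
		kfree(arg);
	}

	/* Softirq context: may run in the middle of foo_scan(). */
	void foo_delete(struct foo *p)
	{
		list_del_rcu(&p->list);
		call_rcu(&p->rcu, foo_destroy, p);
	}

If call_rcu() were to invoke foo_destroy() immediately, foo_scan() would
resume with its cursor pointing into freed memory.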
28
29
30Example 2: Function-Call Fatality
31
32Of course, one could avert the suicide described in the preceding example
33by having call_rcu() directly invoke its arguments only if it was called
34from process context. However, this can fail in a similar manner.
35
36Suppose that an RCU-based algorithm again scans a linked list containing
37elements A, B, and C in process context, but that it invokes a function
38on each element as it is scanned. Suppose further that this function
39deletes element B from the list, then passes it to call_rcu() for deferred
40freeing. This may be a bit unconventional, but it is perfectly legal
41RCU usage, since call_rcu() must wait for a grace period to elapse.
42Therefore, in this case, allowing call_rcu() to immediately invoke
43its arguments would cause it to fail to make the fundamental guarantee
44underlying RCU, namely that call_rcu() defers invoking its arguments until
45all RCU read-side critical sections currently executing have completed.
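
Reusing the hypothetical struct foo and foo_destroy() from the sketch in
Example 1, this second scenario might look as follows; visit() and
foo_scan_and_prune() are likewise invented for illustration:

	/* Invoked on each element during the process-context scan. */
	static void visit(struct foo *p, int key_to_prune)
	{
		if (p->key == key_to_prune) {
			list_del_rcu(&p->list);
			call_rcu(&p->rcu, foo_destroy, p);
		}
	}

	/* Caller provides update-side mutual exclusion. */
	void foo_scan_and_prune(struct list_head *head, int key_to_prune)
	{
		struct foo *p;

		rcu_read_lock();
		list_for_each_entry_rcu(p, head, list)
			visit(p, key_to_prune);	/* may delete p, but must not free it yet */
		rcu_read_unlock();
	}

Here, an immediate invocation of the callback would free the very element
that the scan is about to advance past.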
46
47Quick Quiz: why is it -not- legal to invoke synchronize_kernel() in
48this case?
49
50
51Summary
52
53Permitting call_rcu() to immediately invoke its arguments or permitting
54synchronize_kernel() to immediately return breaks RCU, even on a UP system.
55So do not do it! Even on a UP system, the RCU infrastructure -must-
56respect grace periods.
57
58
59Answer to Quick Quiz
60
61The calling function is scanning an RCU-protected linked list, and
62is therefore within an RCU read-side critical section. Therefore,
63the called function has been invoked within an RCU read-side critical
64section, and is not permitted to block.
diff --git a/Documentation/RCU/arrayRCU.txt b/Documentation/RCU/arrayRCU.txt
new file mode 100644
index 000000000000..453ebe6953ee
--- /dev/null
+++ b/Documentation/RCU/arrayRCU.txt
@@ -0,0 +1,141 @@
1Using RCU to Protect Read-Mostly Arrays
2
3
4Although RCU is more commonly used to protect linked lists, it can
5also be used to protect arrays. Three situations are as follows:
6
71. Hash Tables
8
92. Static Arrays
10
113. Resizeable Arrays
12
13Each of these situations is discussed below.
14
15
16Situation 1: Hash Tables
17
18Hash tables are often implemented as an array, where each array entry
19has a linked-list hash chain. Each hash chain can be protected by RCU
20as described in the listRCU.txt document. This approach also applies
21to other array-of-list situations, such as radix trees.
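
As a brief sketch of the read side (the element type, table, and hash
function below are hypothetical, not taken from any particular kernel
subsystem):

	#define MY_HASH_SIZE 16

	struct my_elem {
		struct list_head list;
		unsigned long key;
		int value;
	};

	/* Buckets initialized elsewhere with INIT_LIST_HEAD(). */
	static struct list_head my_hash[MY_HASH_SIZE];

	int my_hash_lookup(unsigned long key, int *value)
	{
		struct my_elem *p;

		rcu_read_lock();
		list_for_each_entry_rcu(p, &my_hash[key % MY_HASH_SIZE], list) {
			if (p->key == key) {
				*value = p->value;	/* copy out while still in the read-side critical section */
				rcu_read_unlock();
				return 0;
			}
		}
		rcu_read_unlock();
		return -ENOENT;
	}

Updates to each bucket would use list_add_rcu(), list_del_rcu(), and
call_rcu(), exactly as in the listRCU.txt examples.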
22
23
24Situation 2: Static Arrays
25
26Static arrays, where the data (rather than a pointer to the data) is
27located in each array element, and where the array is never resized,
28have not been used with RCU. Rik van Riel recommends using seqlock in
29this situation, which would also have minimal read-side overhead as long
30as updates are rare.
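
A rough sketch of this suggestion follows; the array, element type, and lock
name are all hypothetical, and the seqlock primitives shown (SEQLOCK_UNLOCKED,
read_seqbegin(), read_seqretry(), write_seqlock(), write_sequnlock()) are the
ones available in this kernel series:

	#define MY_ARRAY_SIZE 16

	struct my_data {
		int a;
		int b;
	};

	static struct my_data my_array[MY_ARRAY_SIZE];
	static seqlock_t my_array_lock = SEQLOCK_UNLOCKED;

	struct my_data my_array_read(int idx)
	{
		struct my_data val;
		unsigned seq;

		do {
			seq = read_seqbegin(&my_array_lock);
			val = my_array[idx];	/* copy the element */
		} while (read_seqretry(&my_array_lock, seq));	/* retry if an update raced with us */
		return val;
	}

	void my_array_update(int idx, struct my_data *newval)
	{
		write_seqlock(&my_array_lock);
		my_array[idx] = *newval;
		write_sequnlock(&my_array_lock);
	}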
31
32Quick Quiz: Why is it so important that updates be rare when
33 using seqlock?
34
35
36Situation 3: Resizeable Arrays
37
38Use of RCU for resizeable arrays is demonstrated by the grow_ary()
39function used by the System V IPC code. The array is used to map from
40semaphore, message-queue, and shared-memory IDs to the data structure
41that represents the corresponding IPC construct. The grow_ary()
42function does not acquire any locks; instead its caller must hold the
43ids->sem semaphore.
44
45The grow_ary() function, shown below, does some limit checks, allocates a
46new ipc_id_ary, copies the old array into the beginning of the new one,
47initializes the remainder of the new array, updates the ids->entries pointer
48to point to the new array, and invokes ipc_rcu_putref() to free up the old array.
49Note that rcu_assign_pointer() is used to update the ids->entries pointer,
50which includes any memory barriers required on whatever architecture
51you are running on.
52
53 static int grow_ary(struct ipc_ids* ids, int newsize)
54 {
55 struct ipc_id_ary* new;
56 struct ipc_id_ary* old;
57 int i;
58 int size = ids->entries->size;
59
60 if(newsize > IPCMNI)
61 newsize = IPCMNI;
62 if(newsize <= size)
63 return newsize;
64
65 new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize +
66 sizeof(struct ipc_id_ary));
67 if(new == NULL)
68 return size;
69 new->size = newsize;
70 memcpy(new->p, ids->entries->p,
71 sizeof(struct kern_ipc_perm *)*size +
72 sizeof(struct ipc_id_ary));
73 for(i=size;i<newsize;i++) {
74 new->p[i] = NULL;
75 }
76 old = ids->entries;
77
78 /*
79 * Use rcu_assign_pointer() to make sure the memcpyed
80 * contents of the new array are visible before the new
81 * array becomes visible.
82 */
83 rcu_assign_pointer(ids->entries, new);
84
85 ipc_rcu_putref(old);
86 return newsize;
87 }
88
89The ipc_rcu_putref() function decrements the array's reference count
90and then, if the reference count has dropped to zero, uses call_rcu()
91to free the array after a grace period has elapsed.
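
The actual ipc_rcu_putref() implementation lives in ipc/util.c; the following
is only a rough sketch of the reference-count-plus-call_rcu() pattern it
embodies, using a hypothetical header structure and ignoring details such as
vmalloc()ed arrays:

	struct my_rcu_hdr {
		atomic_t refcount;
		struct rcu_head rcu;
		/* the array itself follows this header */
	};

	static void my_rcu_free(void *arg)
	{
		kfree(arg);	/* runs only after a grace period */
	}

	void my_rcu_putref(struct my_rcu_hdr *hdr)
	{
		if (atomic_dec_and_test(&hdr->refcount))
			call_rcu(&hdr->rcu, my_rcu_free, hdr);
	}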
92
93The array is traversed by the ipc_lock() function. This function
94indexes into the array under the protection of rcu_read_lock(),
95using rcu_dereference() to pick up the pointer to the array so
96that it may later safely be dereferenced -- memory barriers are
97required on the Alpha CPU. Since the size of the array is stored
98with the array itself, there can be no array-size mismatches, so
99a simple check suffices. The pointer to the structure corresponding
100to the desired IPC object is placed in "out", with NULL indicating
101a non-existent entry. After acquiring "out->lock", the "out->deleted"
102flag indicates whether the IPC object is in the process of being
103deleted, and, if not, the pointer is returned.
104
105 struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id)
106 {
107 struct kern_ipc_perm* out;
108 int lid = id % SEQ_MULTIPLIER;
109 struct ipc_id_ary* entries;
110
111 rcu_read_lock();
112 entries = rcu_dereference(ids->entries);
113 if(lid >= entries->size) {
114 rcu_read_unlock();
115 return NULL;
116 }
117 out = entries->p[lid];
118 if(out == NULL) {
119 rcu_read_unlock();
120 return NULL;
121 }
122 spin_lock(&out->lock);
123
124 /* ipc_rmid() may have already freed the ID while ipc_lock
125 * was spinning: here verify that the structure is still valid
126 */
127 if (out->deleted) {
128 spin_unlock(&out->lock);
129 rcu_read_unlock();
130 return NULL;
131 }
132 return out;
133 }
134
135
136Answer to Quick Quiz:
137
138 The reason that it is important that updates be rare when
139 using seqlock is that frequent updates can livelock readers.
140 One way to avoid this problem is to assign a seqlock for
141 each array entry rather than to the entire array.
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
new file mode 100644
index 000000000000..b3a568abe6b1
--- /dev/null
+++ b/Documentation/RCU/checklist.txt
@@ -0,0 +1,157 @@
1Review Checklist for RCU Patches
2
3
4This document contains a checklist for producing and reviewing patches
5that make use of RCU. Violating any of the rules listed below will
6result in the same sorts of problems that leaving out a locking primitive
7would cause. This list is based on experiences reviewing such patches
8over a rather long period of time, but improvements are always welcome!
9
100. Is RCU being applied to a read-mostly situation? If the data
11 structure is updated more than about 10% of the time, then
12 you should strongly consider some other approach, unless
13 detailed performance measurements show that RCU is nonetheless
14 the right tool for the job.
15
16 The other exception would be where performance is not an issue,
17 and RCU provides a simpler implementation. An example of this
18 situation is the dynamic NMI code in the Linux 2.6 kernel,
19 at least on architectures where NMIs are rare.
20
211. Does the update code have proper mutual exclusion?
22
23 RCU does allow -readers- to run (almost) naked, but -writers- must
24 still use some sort of mutual exclusion, such as:
25
26 a. locking,
27 b. atomic operations, or
28 c. restricting updates to a single task.
29
30 If you choose #b, be prepared to describe how you have handled
31 memory barriers on weakly ordered machines (pretty much all of
32 them -- even x86 allows reads to be reordered), and be prepared
33 to explain why this added complexity is worthwhile. If you
34 choose #c, be prepared to explain how this single task does not
35 become a major bottleneck on big multiprocessor machines.
36
372. Do the RCU read-side critical sections make proper use of
38 rcu_read_lock() and friends? These primitives are needed
39 to suppress preemption (or bottom halves, in the case of
40 rcu_read_lock_bh()) in the read-side critical sections,
41 and are also an excellent aid to readability.
42
433. Does the update code tolerate concurrent accesses?
44
45 The whole point of RCU is to permit readers to run without
46 any locks or atomic operations. This means that readers will
47 be running while updates are in progress. There are a number
48 of ways to handle this concurrency, depending on the situation:
49
50 a. Make updates appear atomic to readers. For example,
51 pointer updates to properly aligned fields will appear
52 atomic, as will individual atomic primitives. Operations
53 performed under a lock and sequences of multiple atomic
54 primitives will -not- appear to be atomic.
55
56 This is almost always the best approach.
57
58 b. Carefully order the updates and the reads so that
59 readers see valid data at all phases of the update.
60 This is often more difficult than it sounds, especially
61 given modern CPUs' tendency to reorder memory references.
62 One must usually liberally sprinkle memory barriers
63 (smp_wmb(), smp_rmb(), smp_mb()) through the code,
64 making it difficult to understand and to test.
65
66 It is usually better to group the changing data into
67 a separate structure, so that the change may be made
68 to appear atomic by updating a pointer to reference
69 a new structure containing updated values.
70
714. Weakly ordered CPUs pose special challenges. Almost all CPUs
72 are weakly ordered -- even i386 CPUs allow reads to be reordered.
73 RCU code must take all of the following measures to prevent
74 memory-corruption problems:
75
76 a. Readers must maintain proper ordering of their memory
77 accesses. The rcu_dereference() primitive ensures that
78 the CPU picks up the pointer before it picks up the data
79 that the pointer points to. This really is necessary
80 on Alpha CPUs. If you don't believe me, see:
81
82 http://www.openvms.compaq.com/wizard/wiz_2637.html
83
84 The rcu_dereference() primitive is also an excellent
85 documentation aid, letting the person reading the code
86 know exactly which pointers are protected by RCU.
87
88 The rcu_dereference() primitive is used by the various
89 "_rcu()" list-traversal primitives, such as the
90 list_for_each_entry_rcu().
91
92	b. If the list macros are being used, the list_add_rcu(),
93 list_add_tail_rcu(), and list_del_rcu() primitives must
94 be used in order to prevent weakly ordered machines from
95 misordering structure initialization and pointer planting.
96 Similarly, if the hlist macros are being used, the
97 hlist_del_rcu() and hlist_add_head_rcu() primitives
98 are required.
99
100 c. Updates must ensure that initialization of a given
101 structure happens before pointers to that structure are
102 publicized. Use the rcu_assign_pointer() primitive
103 when publicizing a pointer to a structure that can
104 be traversed by an RCU read-side critical section.
105
106		[The rcu_assign_pointer() primitive is being merged; see the sketch below.]
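
	For example, a minimal sketch of publishing a newly initialized
	structure (the structure and pointer names here are hypothetical)
	might look like this:

		struct foo {
			int a;
			int b;
		};
		struct foo *global_foo;		/* readers use rcu_dereference() */

		void foo_publish(int a, int b)
		{
			struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

			if (p == NULL)
				return;		/* error handling omitted */
			p->a = a;		/* fully initialize the structure... */
			p->b = b;
			rcu_assign_pointer(global_foo, p);	/* ...then publish it */
		}

	Disposing of any structure that global_foo previously referenced
	would, of course, require call_rcu() or synchronize_kernel().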
107
1085. If call_rcu(), or a related primitive such as call_rcu_bh(),
109 is used, the callback function must be written to be called
110 from softirq context. In particular, it cannot block.
111
1126. Since synchronize_kernel() blocks, it cannot be called from
113 any sort of irq context.
114
1157. If the updater uses call_rcu(), then the corresponding readers
116 must use rcu_read_lock() and rcu_read_unlock(). If the updater
117 uses call_rcu_bh(), then the corresponding readers must use
118 rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up
119 will result in confusion and broken kernels.
120
121 One exception to this rule: rcu_read_lock() and rcu_read_unlock()
122 may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh()
123 in cases where local bottom halves are already known to be
124 disabled, for example, in irq or softirq context. Commenting
125 such cases is a must, of course! And the jury is still out on
126 whether the increased speed is worth it.
127
1288. Although synchronize_kernel() is a bit slower than is call_rcu(),
129 it usually results in simpler code. So, unless update performance
130 is important or the updaters cannot block, synchronize_kernel()
131 should be used in preference to call_rcu().
132
1339. All RCU list-traversal primitives, which include
134 list_for_each_rcu(), list_for_each_entry_rcu(),
135 list_for_each_continue_rcu(), and list_for_each_safe_rcu(),
136 must be within an RCU read-side critical section. RCU
137 read-side critical sections are delimited by rcu_read_lock()
138 and rcu_read_unlock(), or by similar primitives such as
139 rcu_read_lock_bh() and rcu_read_unlock_bh().
140
141 Use of the _rcu() list-traversal primitives outside of an
142 RCU read-side critical section causes no harm other than
143 a slight performance degradation on Alpha CPUs and some
144 confusion on the part of people trying to read the code.
145
146 Another way of thinking of this is "If you are holding the
147 lock that prevents the data structure from changing, why do
148 you also need RCU-based protection?" That said, there may
149 well be situations where use of the _rcu() list-traversal
150 primitives while the update-side lock is held results in
151 simpler and more maintainable code. The jury is still out
152 on this question.
153
15410. Conversely, if you are in an RCU read-side critical section,
155 you -must- use the "_rcu()" variants of the list macros.
156 Failing to do so will break Alpha and confuse people reading
157 your code.
diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt
new file mode 100644
index 000000000000..bda6ead69bd0
--- /dev/null
+++ b/Documentation/RCU/listRCU.txt
@@ -0,0 +1,307 @@
1Using RCU to Protect Read-Mostly Linked Lists
2
3
4One of the best applications of RCU is to protect read-mostly linked lists
5("struct list_head" in list.h). One big advantage of this approach
6is that all of the required memory barriers are included for you in
7the list macros. This document describes several applications of RCU,
8with the best fits first.
9
10
11Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates
12
13The best applications are cases where, if reader-writer locking were
14used, the read-side lock would be dropped before taking any action
15based on the results of the search. The most celebrated example is
16the routing table. Because the routing table is tracking the state of
17equipment outside of the computer, it will at times contain stale data.
18Therefore, once the route has been computed, there is no need to hold
19the routing table static during transmission of the packet. After all,
20you can hold the routing table static all you want, but that won't keep
21the external Internet from changing, and it is the state of the external
22Internet that really matters. In addition, routing entries are typically
23added or deleted, rather than being modified in place.
24
25A straightforward example of this use of RCU may be found in the
26system-call auditing support. For example, a reader-writer locked
27implementation of audit_filter_task() might be as follows:
28
29 static enum audit_state audit_filter_task(struct task_struct *tsk)
30 {
31 struct audit_entry *e;
32 enum audit_state state;
33
34 read_lock(&auditsc_lock);
35 list_for_each_entry(e, &audit_tsklist, list) {
36 if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
37 read_unlock(&auditsc_lock);
38 return state;
39 }
40 }
41 read_unlock(&auditsc_lock);
42 return AUDIT_BUILD_CONTEXT;
43 }
44
45Here the list is searched under the lock, but the lock is dropped before
46the corresponding value is returned. By the time that this value is acted
47on, the list may well have been modified. This makes sense, since if
48you are turning auditing off, it is OK to audit a few extra system calls.
49
50This means that RCU can be easily applied to the read side, as follows:
51
52 static enum audit_state audit_filter_task(struct task_struct *tsk)
53 {
54 struct audit_entry *e;
55 enum audit_state state;
56
57 rcu_read_lock();
58 list_for_each_entry_rcu(e, &audit_tsklist, list) {
59 if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
60 rcu_read_unlock();
61 return state;
62 }
63 }
64 rcu_read_unlock();
65 return AUDIT_BUILD_CONTEXT;
66 }
67
68The read_lock() and read_unlock() calls have become rcu_read_lock()
69and rcu_read_unlock(), respectively, and the list_for_each_entry() has
70become list_for_each_entry_rcu(). The _rcu() list-traversal primitives
71insert the read-side memory barriers that are required on DEC Alpha CPUs.
72
73The changes to the update side are also straightforward. A reader-writer
74lock might be used as follows for deletion and insertion:
75
76 static inline int audit_del_rule(struct audit_rule *rule,
77 struct list_head *list)
78 {
79 struct audit_entry *e;
80
81 write_lock(&auditsc_lock);
82 list_for_each_entry(e, list, list) {
83 if (!audit_compare_rule(rule, &e->rule)) {
84 list_del(&e->list);
85 write_unlock(&auditsc_lock);
86 return 0;
87 }
88 }
89 write_unlock(&auditsc_lock);
90 return -EFAULT; /* No matching rule */
91 }
92
93 static inline int audit_add_rule(struct audit_entry *entry,
94 struct list_head *list)
95 {
96 write_lock(&auditsc_lock);
97 if (entry->rule.flags & AUDIT_PREPEND) {
98 entry->rule.flags &= ~AUDIT_PREPEND;
99 list_add(&entry->list, list);
100 } else {
101 list_add_tail(&entry->list, list);
102 }
103 write_unlock(&auditsc_lock);
104 return 0;
105 }
106
107Following are the RCU equivalents for these two functions:
108
109 static inline int audit_del_rule(struct audit_rule *rule,
110 struct list_head *list)
111 {
112 struct audit_entry *e;
113
114 /* Do not use the _rcu iterator here, since this is the only
115 * deletion routine. */
116 list_for_each_entry(e, list, list) {
117 if (!audit_compare_rule(rule, &e->rule)) {
118 list_del_rcu(&e->list);
119 call_rcu(&e->rcu, audit_free_rule, e);
120 return 0;
121 }
122 }
123 return -EFAULT; /* No matching rule */
124 }
125
126 static inline int audit_add_rule(struct audit_entry *entry,
127 struct list_head *list)
128 {
129 if (entry->rule.flags & AUDIT_PREPEND) {
130 entry->rule.flags &= ~AUDIT_PREPEND;
131 list_add_rcu(&entry->list, list);
132 } else {
133 list_add_tail_rcu(&entry->list, list);
134 }
135 return 0;
136 }
137
138Normally, the write_lock() and write_unlock() would be replaced by
139a spin_lock() and a spin_unlock(), but in this case, all callers hold
140audit_netlink_sem, so no additional locking is required. The auditsc_lock
141can therefore be eliminated, since use of RCU eliminates the need for
142writers to exclude readers.
143
144The list_del(), list_add(), and list_add_tail() primitives have been
145replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
146The _rcu() list-manipulation primitives add memory barriers that are
147needed on weakly ordered CPUs (most of them!).
148
149So, when readers can tolerate stale data and when entries are either added
150or deleted, without in-place modification, it is very easy to use RCU!
151
152
153Example 2: Handling In-Place Updates
154
155The system-call auditing code does not update auditing rules in place.
156However, if it did, reader-writer-locked code to do so might look as
157follows (presumably, the field_count is only permitted to decrease;
158otherwise, the added fields would need to be filled in):
159
160 static inline int audit_upd_rule(struct audit_rule *rule,
161 struct list_head *list,
162 __u32 newaction,
163 __u32 newfield_count)
164 {
165 struct audit_entry *e;
166 struct audit_newentry *ne;
167
168 write_lock(&auditsc_lock);
169 list_for_each_entry(e, list, list) {
170 if (!audit_compare_rule(rule, &e->rule)) {
171 e->rule.action = newaction;
172				e->rule.field_count = newfield_count;
173 write_unlock(&auditsc_lock);
174 return 0;
175 }
176 }
177 write_unlock(&auditsc_lock);
178 return -EFAULT; /* No matching rule */
179 }
180
181The RCU version creates a copy, updates the copy, then replaces the old
182entry with the newly updated entry. This sequence of actions, allowing
183concurrent reads while doing a copy to perform an update, is what gives
184RCU ("read-copy update") its name. The RCU code is as follows:
185
186 static inline int audit_upd_rule(struct audit_rule *rule,
187 struct list_head *list,
188 __u32 newaction,
189 __u32 newfield_count)
190 {
191 struct audit_entry *e;
192 struct audit_newentry *ne;
193
194 list_for_each_entry(e, list, list) {
195 if (!audit_compare_rule(rule, &e->rule)) {
196				ne = kmalloc(sizeof(*ne), GFP_ATOMIC);
197 if (ne == NULL)
198 return -ENOMEM;
199 audit_copy_rule(&ne->rule, &e->rule);
200 ne->rule.action = newaction;
201				ne->rule.field_count = newfield_count;
202				list_add_rcu(&ne->list, &e->list);
203				list_del_rcu(&e->list);
204 call_rcu(&e->rcu, audit_free_rule, e);
205 return 0;
206 }
207 }
208 return -EFAULT; /* No matching rule */
209 }
210
211Again, this assumes that the caller holds audit_netlink_sem. Normally,
212the reader-writer lock would become a spinlock in this sort of code.
213
214
215Example 3: Eliminating Stale Data
216
217The auditing examples above tolerate stale data, as do most algorithms
218that are tracking external state. Because there is a delay between the time
219the external state changes and the time Linux becomes aware of the change,
220additional RCU-induced staleness is normally not a problem.
221
222However, there are many examples where stale data cannot be tolerated.
223One example in the Linux kernel is the System V IPC (see the ipc_lock()
224function in ipc/util.c). This code checks a "deleted" flag under a
225per-entry spinlock, and, if the "deleted" flag is set, pretends that the
226entry does not exist. For this to be helpful, the search function must
227return holding the per-entry spinlock, as ipc_lock() does in fact do.
228
229Quick Quiz: Why does the search function need to return holding the
230per-entry lock for this deleted-flag technique to be helpful?
231
232If the system-call audit module were to ever need to reject stale data,
233one way to accomplish this would be to add a "deleted" flag and a "lock"
234spinlock to the audit_entry structure, and modify audit_filter_task()
235as follows:
236
237 static enum audit_state audit_filter_task(struct task_struct *tsk)
238 {
239 struct audit_entry *e;
240 enum audit_state state;
241
242 rcu_read_lock();
243 list_for_each_entry_rcu(e, &audit_tsklist, list) {
244 if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
245 spin_lock(&e->lock);
246 if (e->deleted) {
247 spin_unlock(&e->lock);
248 rcu_read_unlock();
249 return AUDIT_BUILD_CONTEXT;
250 }
251 rcu_read_unlock();
252 return state;
253 }
254 }
255 rcu_read_unlock();
256 return AUDIT_BUILD_CONTEXT;
257 }
258
259Note that this example assumes that entries are only added and deleted.
260Additional mechanism is required to deal correctly with the
261update-in-place performed by audit_upd_rule(). For one thing,
262audit_upd_rule() would need additional memory barriers to ensure
263that the list_add_rcu() was really executed before the list_del_rcu().
264
265The audit_del_rule() function would need to set the "deleted"
266flag under the spinlock as follows:
267
268 static inline int audit_del_rule(struct audit_rule *rule,
269 struct list_head *list)
270 {
271 struct audit_entry *e;
272
273 /* Do not use the _rcu iterator here, since this is the only
274 * deletion routine. */
275 list_for_each_entry(e, list, list) {
276 if (!audit_compare_rule(rule, &e->rule)) {
277 spin_lock(&e->lock);
278 list_del_rcu(&e->list);
279 e->deleted = 1;
280 spin_unlock(&e->lock);
281 call_rcu(&e->rcu, audit_free_rule, e);
282 return 0;
283 }
284 }
285 return -EFAULT; /* No matching rule */
286 }
287
288
289Summary
290
291Read-mostly list-based data structures that can tolerate stale data are
292the most amenable to use of RCU. The simplest case is where entries are
293either added or deleted from the data structure (or atomically modified
294in place), but non-atomic in-place modifications can be handled by making
295a copy, updating the copy, then replacing the original with the copy.
296If stale data cannot be tolerated, then a "deleted" flag may be used
297in conjunction with a per-entry spinlock in order to allow the search
298function to reject newly deleted data.
299
300
301Answer to Quick Quiz
302
303If the search function drops the per-entry lock before returning, then
304the caller will be processing stale data in any case. If it is really
305OK to be processing stale data, then you don't need a "deleted" flag.
306If processing stale data really is a problem, then you need to hold the
307per-entry lock across all of the code that uses the value looked up.
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt
new file mode 100644
index 000000000000..7e0c2ab6f2bd
--- /dev/null
+++ b/Documentation/RCU/rcu.txt
@@ -0,0 +1,67 @@
1RCU Concepts
2
3
4The basic idea behind RCU (read-copy update) is to split destructive
5operations into two parts, one that prevents anyone from seeing the data
6item being destroyed, and one that actually carries out the destruction.
7A "grace period" must elapse between the two parts, and this grace period
8must be long enough that any readers accessing the item being deleted have
9since dropped their references. For example, an RCU-protected deletion
10from a linked list would first remove the item from the list, wait for
11a grace period to elapse, then free the element. See the listRCU.txt
12file for more information on using RCU with linked lists.
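
As a minimal sketch of such a deletion (the element type and function are
hypothetical; see listRCU.txt for complete examples):

	struct foo {
		struct list_head list;
		/* ... user data ... */
	};

	/* Caller must hold the update-side lock. */
	void foo_del(struct foo *p)
	{
		list_del_rcu(&p->list);		/* part one: unlink, so new readers cannot find p */
		synchronize_kernel();		/* wait for pre-existing readers to finish */
		kfree(p);			/* part two: actually destroy the element */
	}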
13
14
15Frequently Asked Questions
16
17o Why would anyone want to use RCU?
18
19 The advantage of RCU's two-part approach is that RCU readers need
20 not acquire any locks, perform any atomic instructions, write to
21 shared memory, or (on CPUs other than Alpha) execute any memory
22 barriers. The fact that these operations are quite expensive
23 on modern CPUs is what gives RCU its performance advantages
24 in read-mostly situations. The fact that RCU readers need not
25 acquire locks can also greatly simplify deadlock-avoidance code.
26
27o How can the updater tell when a grace period has completed
28 if the RCU readers give no indication when they are done?
29
30 Just as with spinlocks, RCU readers are not permitted to
31 block, switch to user-mode execution, or enter the idle loop.
32 Therefore, as soon as a CPU is seen passing through any of these
33 three states, we know that that CPU has exited any previous RCU
34 read-side critical sections. So, if we remove an item from a
35 linked list, and then wait until all CPUs have switched context,
36 executed in user mode, or executed in the idle loop, we can
37 safely free up that item.
38
39o If I am running on a uniprocessor kernel, which can only do one
40 thing at a time, why should I wait for a grace period?
41
42 See the UP.txt file in this directory.
43
44o How can I see where RCU is currently used in the Linux kernel?
45
46 Search for "rcu_read_lock", "call_rcu", and "synchronize_kernel".
47
48o What guidelines should I follow when writing code that uses RCU?
49
50 See the checklist.txt file in this directory.
51
52o Why the name "RCU"?
53
54 "RCU" stands for "read-copy update". The file listRCU.txt has
55	more information on where this name came from; search for
56 "read-copy update" to find it.
57
58o I hear that RCU is patented? What is with that?
59
60	Yes, it is. There are several known patents related to RCU;
61 search for the string "Patent" in RTFP.txt to find them.
62 Of these, one was allowed to lapse by the assignee, and the
63 others have been contributed to the Linux kernel under GPL.
64
65o Where can I find more information on RCU?
66
67 See the RTFP.txt file in this directory.