diff options
| author | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
|---|---|---|
| committer | Linus Torvalds <torvalds@ppc970.osdl.org> | 2005-04-16 18:20:36 -0400 |
| commit | 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 (patch) | |
| tree | 0bba044c4ce775e45a88a51686b5d9f90697ea9d /Documentation/RCU | |
Linux-2.6.12-rc2v2.6.12-rc2
Initial git repository build. I'm not bothering with the full history,
even though we have it. We can create a separate "historical" git
archive of that later if we want to, and in the meantime it's about
3.2GB when imported into git - space that would just make the early
git days unnecessarily complicated, when we don't have a lot of good
infrastructure for it.
Let it rip!
Diffstat (limited to 'Documentation/RCU')
| -rw-r--r-- | Documentation/RCU/RTFP.txt | 387 | ||||
| -rw-r--r-- | Documentation/RCU/UP.txt | 64 | ||||
| -rw-r--r-- | Documentation/RCU/arrayRCU.txt | 141 | ||||
| -rw-r--r-- | Documentation/RCU/checklist.txt | 157 | ||||
| -rw-r--r-- | Documentation/RCU/listRCU.txt | 307 | ||||
| -rw-r--r-- | Documentation/RCU/rcu.txt | 67 |
6 files changed, 1123 insertions, 0 deletions
diff --git a/Documentation/RCU/RTFP.txt b/Documentation/RCU/RTFP.txt new file mode 100644 index 0000000000..12250b342e --- /dev/null +++ b/Documentation/RCU/RTFP.txt | |||
| @@ -0,0 +1,387 @@ | |||
| 1 | Read the F-ing Papers! | ||
| 2 | |||
| 3 | |||
| 4 | This document describes RCU-related publications, and is followed by | ||
| 5 | the corresponding bibtex entries. | ||
| 6 | |||
| 7 | The first thing resembling RCU was published in 1980, when Kung and Lehman | ||
| 8 | [Kung80] recommended use of a garbage collector to defer destruction | ||
| 9 | of nodes in a parallel binary search tree in order to simplify its | ||
| 10 | implementation. This works well in environments that have garbage | ||
| 11 | collectors, but current production garbage collectors incur significant | ||
| 12 | read-side overhead. | ||
| 13 | |||
| 14 | In 1982, Manber and Ladner [Manber82,Manber84] recommended deferring | ||
| 15 | destruction until all threads running at that time have terminated, again | ||
| 16 | for a parallel binary search tree. This approach works well in systems | ||
| 17 | with short-lived threads, such as the K42 research operating system. | ||
| 18 | However, Linux has long-lived tasks, so more is needed. | ||
| 19 | |||
| 20 | In 1986, Hennessy, Osisek, and Seigh [Hennessy89] introduced passive | ||
| 21 | serialization, which is an RCU-like mechanism that relies on the presence | ||
| 22 | of "quiescent states" in the VM/XA hypervisor that are guaranteed not | ||
| 23 | to be referencing the data structure. However, this mechanism was not | ||
| 24 | optimized for modern computer systems, which is not surprising given | ||
| 25 | that these overheads were not so expensive in the mid-80s. Nonetheless, | ||
| 26 | passive serialization appears to be the first deferred-destruction | ||
| 27 | mechanism to be used in production. Furthermore, the relevant patent has | ||
| 28 | lapsed, so this approach may be used in non-GPL software, if desired. | ||
| 29 | (In contrast, use of RCU is permitted only in software licensed under | ||
| 30 | GPL. Sorry!!!) | ||
| 31 | |||
| 32 | In 1990, Pugh [Pugh90] noted that explicitly tracking which threads | ||
| 33 | were reading a given data structure permitted deferred free to operate | ||
| 34 | in the presence of non-terminating threads. However, this explicit | ||
| 35 | tracking imposes significant read-side overhead, which is undesirable | ||
| 36 | in read-mostly situations. This algorithm does take pains to avoid | ||
| 37 | write-side contention and parallelize the other write-side overheads by | ||
| 38 | providing a fine-grained locking design, however, it would be interesting | ||
| 39 | to see how much of the performance advantage reported in 1990 remains | ||
| 40 | in 2004. | ||
| 41 | |||
| 42 | At about this same time, Adams [Adams91] described ``chaotic relaxation'', | ||
| 43 | where the normal barriers between successive iterations of convergent | ||
| 44 | numerical algorithms are relaxed, so that iteration $n$ might use | ||
| 45 | data from iteration $n-1$ or even $n-2$. This introduces error, | ||
| 46 | which typically slows convergence and thus increases the number of | ||
| 47 | iterations required. However, this increase is sometimes more than made | ||
| 48 | up for by a reduction in the number of expensive barrier operations, | ||
| 49 | which are otherwise required to synchronize the threads at the end | ||
| 50 | of each iteration. Unfortunately, chaotic relaxation requires highly | ||
| 51 | structured data, such as the matrices used in scientific programs, and | ||
| 52 | is thus inapplicable to most data structures in operating-system kernels. | ||
| 53 | |||
| 54 | In 1993, Jacobson [Jacobson93] verbally described what is perhaps the | ||
| 55 | simplest deferred-free technique: simply waiting a fixed amount of time | ||
| 56 | before freeing blocks awaiting deferred free. Jacobson did not describe | ||
| 57 | any write-side changes he might have made in this work using SGI's Irix | ||
| 58 | kernel. Aju John published a similar technique in 1995 [AjuJohn95]. | ||
| 59 | This works well if there is a well-defined upper bound on the length of | ||
| 60 | time that reading threads can hold references, as there might well be in | ||
| 61 | hard real-time systems. However, if this time is exceeded, perhaps due | ||
| 62 | to preemption, excessive interrupts, or larger-than-anticipated load, | ||
| 63 | memory corruption can ensue, with no reasonable means of diagnosis. | ||
| 64 | Jacobson's technique is therefore inappropriate for use in production | ||
| 65 | operating-system kernels, except when such kernels can provide hard | ||
| 66 | real-time response guarantees for all operations. | ||
| 67 | |||
| 68 | Also in 1995, Pu et al. [Pu95a] applied a technique similar to that of Pugh's | ||
| 69 | read-side-tracking to permit replugging of algorithms within a commercial | ||
| 70 | Unix operating system. However, this replugging permitted only a single | ||
| 71 | reader at a time. The following year, this same group of researchers | ||
| 72 | extended their technique to allow for multiple readers [Cowan96a]. | ||
| 73 | Their approach requires memory barriers (and thus pipeline stalls), | ||
| 74 | but reduces memory latency, contention, and locking overheads. | ||
| 75 | |||
| 76 | 1995 also saw the first publication of DYNIX/ptx's RCU mechanism | ||
| 77 | [Slingwine95], which was optimized for modern CPU architectures, | ||
| 78 | and was successfully applied to a number of situations within the | ||
| 79 | DYNIX/ptx kernel. The corresponding conference paper appeared in 1998 | ||
| 80 | [McKenney98]. | ||
| 81 | |||
| 82 | In 1999, the Tornado and K42 groups described their "generations" | ||
| 83 | mechanism, which quite similar to RCU [Gamsa99]. These operating systems | ||
| 84 | made pervasive use of RCU in place of "existence locks", which greatly | ||
| 85 | simplifies locking hierarchies. | ||
| 86 | |||
| 87 | 2001 saw the first RCU presentation involving Linux [McKenney01a] | ||
| 88 | at OLS. The resulting abundance of RCU patches was presented the | ||
| 89 | following year [McKenney02a], and use of RCU in dcache was first | ||
| 90 | described that same year [Linder02a]. | ||
| 91 | |||
| 92 | Also in 2002, Michael [Michael02b,Michael02a] presented techniques | ||
| 93 | that defer the destruction of data structures to simplify non-blocking | ||
| 94 | synchronization (wait-free synchronization, lock-free synchronization, | ||
| 95 | and obstruction-free synchronization are all examples of non-blocking | ||
| 96 | synchronization). In particular, this technique eliminates locking, | ||
| 97 | reduces contention, reduces memory latency for readers, and parallelizes | ||
| 98 | pipeline stalls and memory latency for writers. However, these | ||
| 99 | techniques still impose significant read-side overhead in the form of | ||
| 100 | memory barriers. Researchers at Sun worked along similar lines in the | ||
| 101 | same timeframe [HerlihyLM02,HerlihyLMS03]. | ||
| 102 | |||
| 103 | In 2003, the K42 group described how RCU could be used to create | ||
| 104 | hot-pluggable implementations of operating-system functions. Later that | ||
| 105 | year saw a paper describing an RCU implementation of System V IPC | ||
| 106 | [Arcangeli03], and an introduction to RCU in Linux Journal [McKenney03a]. | ||
| 107 | |||
| 108 | 2004 has seen a Linux-Journal article on use of RCU in dcache | ||
| 109 | [McKenney04a], a performance comparison of locking to RCU on several | ||
| 110 | different CPUs [McKenney04b], a dissertation describing use of RCU in a | ||
| 111 | number of operating-system kernels [PaulEdwardMcKenneyPhD], and a paper | ||
| 112 | describing how to make RCU safe for soft-realtime applications [Sarma04c]. | ||
| 113 | |||
| 114 | |||
| 115 | Bibtex Entries | ||
| 116 | |||
| 117 | @article{Kung80 | ||
| 118 | ,author="H. T. Kung and Q. Lehman" | ||
| 119 | ,title="Concurrent Maintenance of Binary Search Trees" | ||
| 120 | ,Year="1980" | ||
| 121 | ,Month="September" | ||
| 122 | ,journal="ACM Transactions on Database Systems" | ||
| 123 | ,volume="5" | ||
| 124 | ,number="3" | ||
| 125 | ,pages="354-382" | ||
| 126 | } | ||
| 127 | |||
| 128 | @techreport{Manber82 | ||
| 129 | ,author="Udi Manber and Richard E. Ladner" | ||
| 130 | ,title="Concurrency Control in a Dynamic Search Structure" | ||
| 131 | ,institution="Department of Computer Science, University of Washington" | ||
| 132 | ,address="Seattle, Washington" | ||
| 133 | ,year="1982" | ||
| 134 | ,number="82-01-01" | ||
| 135 | ,month="January" | ||
| 136 | ,pages="28" | ||
| 137 | } | ||
| 138 | |||
| 139 | @article{Manber84 | ||
| 140 | ,author="Udi Manber and Richard E. Ladner" | ||
| 141 | ,title="Concurrency Control in a Dynamic Search Structure" | ||
| 142 | ,Year="1984" | ||
| 143 | ,Month="September" | ||
| 144 | ,journal="ACM Transactions on Database Systems" | ||
| 145 | ,volume="9" | ||
| 146 | ,number="3" | ||
| 147 | ,pages="439-455" | ||
| 148 | } | ||
| 149 | |||
| 150 | @techreport{Hennessy89 | ||
| 151 | ,author="James P. Hennessy and Damian L. Osisek and Joseph W. {Seigh II}" | ||
| 152 | ,title="Passive Serialization in a Multitasking Environment" | ||
| 153 | ,institution="US Patent and Trademark Office" | ||
| 154 | ,address="Washington, DC" | ||
| 155 | ,year="1989" | ||
| 156 | ,number="US Patent 4,809,168 (lapsed)" | ||
| 157 | ,month="February" | ||
| 158 | ,pages="11" | ||
| 159 | } | ||
| 160 | |||
| 161 | @techreport{Pugh90 | ||
| 162 | ,author="William Pugh" | ||
| 163 | ,title="Concurrent Maintenance of Skip Lists" | ||
| 164 | ,institution="Institute of Advanced Computer Science Studies, Department of Computer Science, University of Maryland" | ||
| 165 | ,address="College Park, Maryland" | ||
| 166 | ,year="1990" | ||
| 167 | ,number="CS-TR-2222.1" | ||
| 168 | ,month="June" | ||
| 169 | } | ||
| 170 | |||
| 171 | @Book{Adams91 | ||
| 172 | ,Author="Gregory R. Adams" | ||
| 173 | ,title="Concurrent Programming, Principles, and Practices" | ||
| 174 | ,Publisher="Benjamin Cummins" | ||
| 175 | ,Year="1991" | ||
| 176 | } | ||
| 177 | |||
| 178 | @unpublished{Jacobson93 | ||
| 179 | ,author="Van Jacobson" | ||
| 180 | ,title="Avoid Read-Side Locking Via Delayed Free" | ||
| 181 | ,year="1993" | ||
| 182 | ,month="September" | ||
| 183 | ,note="Verbal discussion" | ||
| 184 | } | ||
| 185 | |||
| 186 | @Conference{AjuJohn95 | ||
| 187 | ,Author="Aju John" | ||
| 188 | ,Title="Dynamic vnodes -- Design and Implementation" | ||
| 189 | ,Booktitle="{USENIX Winter 1995}" | ||
| 190 | ,Publisher="USENIX Association" | ||
| 191 | ,Month="January" | ||
| 192 | ,Year="1995" | ||
| 193 | ,pages="11-23" | ||
| 194 | ,Address="New Orleans, LA" | ||
| 195 | } | ||
| 196 | |||
| 197 | @techreport{Slingwine95 | ||
| 198 | ,author="John D. Slingwine and Paul E. McKenney" | ||
| 199 | ,title="Apparatus and Method for Achieving Reduced Overhead Mutual | ||
| 200 | Exclusion and Maintaining Coherency in a Multiprocessor System | ||
| 201 | Utilizing Execution History and Thread Monitoring" | ||
| 202 | ,institution="US Patent and Trademark Office" | ||
| 203 | ,address="Washington, DC" | ||
| 204 | ,year="1995" | ||
| 205 | ,number="US Patent 5,442,758 (contributed under GPL)" | ||
| 206 | ,month="August" | ||
| 207 | } | ||
| 208 | |||
| 209 | @techreport{Slingwine97 | ||
| 210 | ,author="John D. Slingwine and Paul E. McKenney" | ||
| 211 | ,title="Method for maintaining data coherency using thread | ||
| 212 | activity summaries in a multicomputer system" | ||
| 213 | ,institution="US Patent and Trademark Office" | ||
| 214 | ,address="Washington, DC" | ||
| 215 | ,year="1997" | ||
| 216 | ,number="US Patent 5,608,893 (contributed under GPL)" | ||
| 217 | ,month="March" | ||
| 218 | } | ||
| 219 | |||
| 220 | @techreport{Slingwine98 | ||
| 221 | ,author="John D. Slingwine and Paul E. McKenney" | ||
| 222 | ,title="Apparatus and method for achieving reduced overhead | ||
| 223 | mutual exclusion and maintaining coherency in a multiprocessor | ||
| 224 | system utilizing execution history and thread monitoring" | ||
| 225 | ,institution="US Patent and Trademark Office" | ||
| 226 | ,address="Washington, DC" | ||
| 227 | ,year="1998" | ||
| 228 | ,number="US Patent 5,727,209 (contributed under GPL)" | ||
| 229 | ,month="March" | ||
| 230 | } | ||
| 231 | |||
| 232 | @Conference{McKenney98 | ||
| 233 | ,Author="Paul E. McKenney and John D. Slingwine" | ||
| 234 | ,Title="Read-Copy Update: Using Execution History to Solve Concurrency | ||
| 235 | Problems" | ||
| 236 | ,Booktitle="{Parallel and Distributed Computing and Systems}" | ||
| 237 | ,Month="October" | ||
| 238 | ,Year="1998" | ||
| 239 | ,pages="509-518" | ||
| 240 | ,Address="Las Vegas, NV" | ||
| 241 | } | ||
| 242 | |||
| 243 | @Conference{Gamsa99 | ||
| 244 | ,Author="Ben Gamsa and Orran Krieger and Jonathan Appavoo and Michael Stumm" | ||
| 245 | ,Title="Tornado: Maximizing Locality and Concurrency in a Shared Memory | ||
| 246 | Multiprocessor Operating System" | ||
| 247 | ,Booktitle="{Proceedings of the 3\textsuperscript{rd} Symposium on | ||
| 248 | Operating System Design and Implementation}" | ||
| 249 | ,Month="February" | ||
| 250 | ,Year="1999" | ||
| 251 | ,pages="87-100" | ||
| 252 | ,Address="New Orleans, LA" | ||
| 253 | } | ||
| 254 | |||
| 255 | @techreport{Slingwine01 | ||
| 256 | ,author="John D. Slingwine and Paul E. McKenney" | ||
| 257 | ,title="Apparatus and method for achieving reduced overhead | ||
| 258 | mutual exclusion and maintaining coherency in a multiprocessor | ||
| 259 | system utilizing execution history and thread monitoring" | ||
| 260 | ,institution="US Patent and Trademark Office" | ||
| 261 | ,address="Washington, DC" | ||
| 262 | ,year="2001" | ||
| 263 | ,number="US Patent 5,219,690 (contributed under GPL)" | ||
| 264 | ,month="April" | ||
| 265 | } | ||
| 266 | |||
| 267 | @Conference{McKenney01a | ||
| 268 | ,Author="Paul E. McKenney and Jonathan Appavoo and Andi Kleen and | ||
| 269 | Orran Krieger and Rusty Russell and Dipankar Sarma and Maneesh Soni" | ||
| 270 | ,Title="Read-Copy Update" | ||
| 271 | ,Booktitle="{Ottawa Linux Symposium}" | ||
| 272 | ,Month="July" | ||
| 273 | ,Year="2001" | ||
| 274 | ,note="Available: | ||
| 275 | \url{http://www.linuxsymposium.org/2001/abstracts/readcopy.php} | ||
| 276 | \url{http://www.rdrop.com/users/paulmck/rclock/rclock_OLS.2001.05.01c.pdf} | ||
| 277 | [Viewed June 23, 2004]" | ||
| 278 | annotation=" | ||
| 279 | Described RCU, and presented some patches implementing and using it in | ||
| 280 | the Linux kernel. | ||
| 281 | " | ||
| 282 | } | ||
| 283 | |||
| 284 | @Conference{Linder02a | ||
| 285 | ,Author="Hanna Linder and Dipankar Sarma and Maneesh Soni" | ||
| 286 | ,Title="Scalability of the Directory Entry Cache" | ||
| 287 | ,Booktitle="{Ottawa Linux Symposium}" | ||
| 288 | ,Month="June" | ||
| 289 | ,Year="2002" | ||
| 290 | ,pages="289-300" | ||
| 291 | } | ||
| 292 | |||
| 293 | @Conference{McKenney02a | ||
| 294 | ,Author="Paul E. McKenney and Dipankar Sarma and | ||
| 295 | Andrea Arcangeli and Andi Kleen and Orran Krieger and Rusty Russell" | ||
| 296 | ,Title="Read-Copy Update" | ||
| 297 | ,Booktitle="{Ottawa Linux Symposium}" | ||
| 298 | ,Month="June" | ||
| 299 | ,Year="2002" | ||
| 300 | ,pages="338-367" | ||
| 301 | ,note="Available: | ||
| 302 | \url{http://www.linux.org.uk/~ajh/ols2002_proceedings.pdf.gz} | ||
| 303 | [Viewed June 23, 2004]" | ||
| 304 | } | ||
| 305 | |||
| 306 | @article{Appavoo03a | ||
| 307 | ,author="J. Appavoo and K. Hui and C. A. N. Soules and R. W. Wisniewski and | ||
| 308 | D. M. {Da Silva} and O. Krieger and M. A. Auslander and D. J. Edelsohn and | ||
| 309 | B. Gamsa and G. R. Ganger and P. McKenney and M. Ostrowski and | ||
| 310 | B. Rosenburg and M. Stumm and J. Xenidis" | ||
| 311 | ,title="Enabling Autonomic Behavior in Systems Software With Hot Swapping" | ||
| 312 | ,Year="2003" | ||
| 313 | ,Month="January" | ||
| 314 | ,journal="IBM Systems Journal" | ||
| 315 | ,volume="42" | ||
| 316 | ,number="1" | ||
| 317 | ,pages="60-76" | ||
| 318 | } | ||
| 319 | |||
| 320 | @Conference{Arcangeli03 | ||
| 321 | ,Author="Andrea Arcangeli and Mingming Cao and Paul E. McKenney and | ||
| 322 | Dipankar Sarma" | ||
| 323 | ,Title="Using Read-Copy Update Techniques for {System V IPC} in the | ||
| 324 | {Linux} 2.5 Kernel" | ||
| 325 | ,Booktitle="Proceedings of the 2003 USENIX Annual Technical Conference | ||
| 326 | (FREENIX Track)" | ||
| 327 | ,Publisher="USENIX Association" | ||
| 328 | ,year="2003" | ||
| 329 | ,month="June" | ||
| 330 | ,pages="297-310" | ||
| 331 | } | ||
| 332 | |||
| 333 | @article{McKenney03a | ||
| 334 | ,author="Paul E. McKenney" | ||
| 335 | ,title="Using {RCU} in the {Linux} 2.5 Kernel" | ||
| 336 | ,Year="2003" | ||
| 337 | ,Month="October" | ||
| 338 | ,journal="Linux Journal" | ||
| 339 | ,volume="1" | ||
| 340 | ,number="114" | ||
| 341 | ,pages="18-26" | ||
| 342 | } | ||
| 343 | |||
| 344 | @article{McKenney04a | ||
| 345 | ,author="Paul E. McKenney and Dipankar Sarma and Maneesh Soni" | ||
| 346 | ,title="Scaling dcache with {RCU}" | ||
| 347 | ,Year="2004" | ||
| 348 | ,Month="January" | ||
| 349 | ,journal="Linux Journal" | ||
| 350 | ,volume="1" | ||
| 351 | ,number="118" | ||
| 352 | ,pages="38-46" | ||
| 353 | } | ||
| 354 | |||
| 355 | @Conference{McKenney04b | ||
| 356 | ,Author="Paul E. McKenney" | ||
| 357 | ,Title="{RCU} vs. Locking Performance on Different {CPUs}" | ||
| 358 | ,Booktitle="{linux.conf.au}" | ||
| 359 | ,Month="January" | ||
| 360 | ,Year="2004" | ||
| 361 | ,Address="Adelaide, Australia" | ||
| 362 | ,note="Available: | ||
| 363 | \url{http://www.linux.org.au/conf/2004/abstracts.html#90} | ||
| 364 | \url{http://www.rdrop.com/users/paulmck/rclock/lockperf.2004.01.17a.pdf} | ||
| 365 | [Viewed June 23, 2004]" | ||
| 366 | } | ||
| 367 | |||
| 368 | @phdthesis{PaulEdwardMcKenneyPhD | ||
| 369 | ,author="Paul E. McKenney" | ||
| 370 | ,title="Exploiting Deferred Destruction: | ||
| 371 | An Analysis of Read-Copy-Update Techniques | ||
| 372 | in Operating System Kernels" | ||
| 373 | ,school="OGI School of Science and Engineering at | ||
| 374 | Oregon Health and Sciences University" | ||
| 375 | ,year="2004" | ||
| 376 | } | ||
| 377 | |||
| 378 | @Conference{Sarma04c | ||
| 379 | ,Author="Dipankar Sarma and Paul E. McKenney" | ||
| 380 | ,Title="Making RCU Safe for Deep Sub-Millisecond Response Realtime Applications" | ||
| 381 | ,Booktitle="Proceedings of the 2004 USENIX Annual Technical Conference | ||
| 382 | (FREENIX Track)" | ||
| 383 | ,Publisher="USENIX Association" | ||
| 384 | ,year="2004" | ||
| 385 | ,month="June" | ||
| 386 | ,pages="182-191" | ||
| 387 | } | ||
diff --git a/Documentation/RCU/UP.txt b/Documentation/RCU/UP.txt new file mode 100644 index 0000000000..551a803d82 --- /dev/null +++ b/Documentation/RCU/UP.txt | |||
| @@ -0,0 +1,64 @@ | |||
| 1 | RCU on Uniprocessor Systems | ||
| 2 | |||
| 3 | |||
| 4 | A common misconception is that, on UP systems, the call_rcu() primitive | ||
| 5 | may immediately invoke its function, and that the synchronize_kernel | ||
| 6 | primitive may return immediately. The basis of this misconception | ||
| 7 | is that since there is only one CPU, it should not be necessary to | ||
| 8 | wait for anything else to get done, since there are no other CPUs for | ||
| 9 | anything else to be happening on. Although this approach will sort of | ||
| 10 | work a surprising amount of the time, it is a very bad idea in general. | ||
| 11 | This document presents two examples that demonstrate exactly how bad an | ||
| 12 | idea this is. | ||
| 13 | |||
| 14 | |||
| 15 | Example 1: softirq Suicide | ||
| 16 | |||
| 17 | Suppose that an RCU-based algorithm scans a linked list containing | ||
| 18 | elements A, B, and C in process context, and can delete elements from | ||
| 19 | this same list in softirq context. Suppose that the process-context scan | ||
| 20 | is referencing element B when it is interrupted by softirq processing, | ||
| 21 | which deletes element B, and then invokes call_rcu() to free element B | ||
| 22 | after a grace period. | ||
| 23 | |||
| 24 | Now, if call_rcu() were to directly invoke its arguments, then upon return | ||
| 25 | from softirq, the list scan would find itself referencing a newly freed | ||
| 26 | element B. This situation can greatly decrease the life expectancy of | ||
| 27 | your kernel. | ||
| 28 | |||
| 29 | |||
| 30 | Example 2: Function-Call Fatality | ||
| 31 | |||
| 32 | Of course, one could avert the suicide described in the preceding example | ||
| 33 | by having call_rcu() directly invoke its arguments only if it was called | ||
| 34 | from process context. However, this can fail in a similar manner. | ||
| 35 | |||
| 36 | Suppose that an RCU-based algorithm again scans a linked list containing | ||
| 37 | elements A, B, and C in process contexts, but that it invokes a function | ||
| 38 | on each element as it is scanned. Suppose further that this function | ||
| 39 | deletes element B from the list, then passes it to call_rcu() for deferred | ||
| 40 | freeing. This may be a bit unconventional, but it is perfectly legal | ||
| 41 | RCU usage, since call_rcu() must wait for a grace period to elapse. | ||
| 42 | Therefore, in this case, allowing call_rcu() to immediately invoke | ||
| 43 | its arguments would cause it to fail to make the fundamental guarantee | ||
| 44 | underlying RCU, namely that call_rcu() defers invoking its arguments until | ||
| 45 | all RCU read-side critical sections currently executing have completed. | ||
| 46 | |||
| 47 | Quick Quiz: why is it -not- legal to invoke synchronize_kernel() in | ||
| 48 | this case? | ||
| 49 | |||
| 50 | |||
| 51 | Summary | ||
| 52 | |||
| 53 | Permitting call_rcu() to immediately invoke its arguments or permitting | ||
| 54 | synchronize_kernel() to immediately return breaks RCU, even on a UP system. | ||
| 55 | So do not do it! Even on a UP system, the RCU infrastructure -must- | ||
| 56 | respect grace periods. | ||
| 57 | |||
| 58 | |||
| 59 | Answer to Quick Quiz | ||
| 60 | |||
| 61 | The calling function is scanning an RCU-protected linked list, and | ||
| 62 | is therefore within an RCU read-side critical section. Therefore, | ||
| 63 | the called function has been invoked within an RCU read-side critical | ||
| 64 | section, and is not permitted to block. | ||
diff --git a/Documentation/RCU/arrayRCU.txt b/Documentation/RCU/arrayRCU.txt new file mode 100644 index 0000000000..453ebe6953 --- /dev/null +++ b/Documentation/RCU/arrayRCU.txt | |||
| @@ -0,0 +1,141 @@ | |||
| 1 | Using RCU to Protect Read-Mostly Arrays | ||
| 2 | |||
| 3 | |||
| 4 | Although RCU is more commonly used to protect linked lists, it can | ||
| 5 | also be used to protect arrays. Three situations are as follows: | ||
| 6 | |||
| 7 | 1. Hash Tables | ||
| 8 | |||
| 9 | 2. Static Arrays | ||
| 10 | |||
| 11 | 3. Resizeable Arrays | ||
| 12 | |||
| 13 | Each of these situations are discussed below. | ||
| 14 | |||
| 15 | |||
| 16 | Situation 1: Hash Tables | ||
| 17 | |||
| 18 | Hash tables are often implemented as an array, where each array entry | ||
| 19 | has a linked-list hash chain. Each hash chain can be protected by RCU | ||
| 20 | as described in the listRCU.txt document. This approach also applies | ||
| 21 | to other array-of-list situations, such as radix trees. | ||
| 22 | |||
| 23 | |||
| 24 | Situation 2: Static Arrays | ||
| 25 | |||
| 26 | Static arrays, where the data (rather than a pointer to the data) is | ||
| 27 | located in each array element, and where the array is never resized, | ||
| 28 | have not been used with RCU. Rik van Riel recommends using seqlock in | ||
| 29 | this situation, which would also have minimal read-side overhead as long | ||
| 30 | as updates are rare. | ||
| 31 | |||
| 32 | Quick Quiz: Why is it so important that updates be rare when | ||
| 33 | using seqlock? | ||
| 34 | |||
| 35 | |||
| 36 | Situation 3: Resizeable Arrays | ||
| 37 | |||
| 38 | Use of RCU for resizeable arrays is demonstrated by the grow_ary() | ||
| 39 | function used by the System V IPC code. The array is used to map from | ||
| 40 | semaphore, message-queue, and shared-memory IDs to the data structure | ||
| 41 | that represents the corresponding IPC construct. The grow_ary() | ||
| 42 | function does not acquire any locks; instead its caller must hold the | ||
| 43 | ids->sem semaphore. | ||
| 44 | |||
| 45 | The grow_ary() function, shown below, does some limit checks, allocates a | ||
| 46 | new ipc_id_ary, copies the old to the new portion of the new, initializes | ||
| 47 | the remainder of the new, updates the ids->entries pointer to point to | ||
| 48 | the new array, and invokes ipc_rcu_putref() to free up the old array. | ||
| 49 | Note that rcu_assign_pointer() is used to update the ids->entries pointer, | ||
| 50 | which includes any memory barriers required on whatever architecture | ||
| 51 | you are running on. | ||
| 52 | |||
| 53 | static int grow_ary(struct ipc_ids* ids, int newsize) | ||
| 54 | { | ||
| 55 | struct ipc_id_ary* new; | ||
| 56 | struct ipc_id_ary* old; | ||
| 57 | int i; | ||
| 58 | int size = ids->entries->size; | ||
| 59 | |||
| 60 | if(newsize > IPCMNI) | ||
| 61 | newsize = IPCMNI; | ||
| 62 | if(newsize <= size) | ||
| 63 | return newsize; | ||
| 64 | |||
| 65 | new = ipc_rcu_alloc(sizeof(struct kern_ipc_perm *)*newsize + | ||
| 66 | sizeof(struct ipc_id_ary)); | ||
| 67 | if(new == NULL) | ||
| 68 | return size; | ||
| 69 | new->size = newsize; | ||
| 70 | memcpy(new->p, ids->entries->p, | ||
| 71 | sizeof(struct kern_ipc_perm *)*size + | ||
| 72 | sizeof(struct ipc_id_ary)); | ||
| 73 | for(i=size;i<newsize;i++) { | ||
| 74 | new->p[i] = NULL; | ||
| 75 | } | ||
| 76 | old = ids->entries; | ||
| 77 | |||
| 78 | /* | ||
| 79 | * Use rcu_assign_pointer() to make sure the memcpyed | ||
| 80 | * contents of the new array are visible before the new | ||
| 81 | * array becomes visible. | ||
| 82 | */ | ||
| 83 | rcu_assign_pointer(ids->entries, new); | ||
| 84 | |||
| 85 | ipc_rcu_putref(old); | ||
| 86 | return newsize; | ||
| 87 | } | ||
| 88 | |||
| 89 | The ipc_rcu_putref() function decrements the array's reference count | ||
| 90 | and then, if the reference count has dropped to zero, uses call_rcu() | ||
| 91 | to free the array after a grace period has elapsed. | ||
| 92 | |||
| 93 | The array is traversed by the ipc_lock() function. This function | ||
| 94 | indexes into the array under the protection of rcu_read_lock(), | ||
| 95 | using rcu_dereference() to pick up the pointer to the array so | ||
| 96 | that it may later safely be dereferenced -- memory barriers are | ||
| 97 | required on the Alpha CPU. Since the size of the array is stored | ||
| 98 | with the array itself, there can be no array-size mismatches, so | ||
| 99 | a simple check suffices. The pointer to the structure corresponding | ||
| 100 | to the desired IPC object is placed in "out", with NULL indicating | ||
| 101 | a non-existent entry. After acquiring "out->lock", the "out->deleted" | ||
| 102 | flag indicates whether the IPC object is in the process of being | ||
| 103 | deleted, and, if not, the pointer is returned. | ||
| 104 | |||
| 105 | struct kern_ipc_perm* ipc_lock(struct ipc_ids* ids, int id) | ||
| 106 | { | ||
| 107 | struct kern_ipc_perm* out; | ||
| 108 | int lid = id % SEQ_MULTIPLIER; | ||
| 109 | struct ipc_id_ary* entries; | ||
| 110 | |||
| 111 | rcu_read_lock(); | ||
| 112 | entries = rcu_dereference(ids->entries); | ||
| 113 | if(lid >= entries->size) { | ||
| 114 | rcu_read_unlock(); | ||
| 115 | return NULL; | ||
| 116 | } | ||
| 117 | out = entries->p[lid]; | ||
| 118 | if(out == NULL) { | ||
| 119 | rcu_read_unlock(); | ||
| 120 | return NULL; | ||
| 121 | } | ||
| 122 | spin_lock(&out->lock); | ||
| 123 | |||
| 124 | /* ipc_rmid() may have already freed the ID while ipc_lock | ||
| 125 | * was spinning: here verify that the structure is still valid | ||
| 126 | */ | ||
| 127 | if (out->deleted) { | ||
| 128 | spin_unlock(&out->lock); | ||
| 129 | rcu_read_unlock(); | ||
| 130 | return NULL; | ||
| 131 | } | ||
| 132 | return out; | ||
| 133 | } | ||
| 134 | |||
| 135 | |||
| 136 | Answer to Quick Quiz: | ||
| 137 | |||
| 138 | The reason that it is important that updates be rare when | ||
| 139 | using seqlock is that frequent updates can livelock readers. | ||
| 140 | One way to avoid this problem is to assign a seqlock for | ||
| 141 | each array entry rather than to the entire array. | ||
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt new file mode 100644 index 0000000000..b3a568abe6 --- /dev/null +++ b/Documentation/RCU/checklist.txt | |||
| @@ -0,0 +1,157 @@ | |||
| 1 | Review Checklist for RCU Patches | ||
| 2 | |||
| 3 | |||
| 4 | This document contains a checklist for producing and reviewing patches | ||
| 5 | that make use of RCU. Violating any of the rules listed below will | ||
| 6 | result in the same sorts of problems that leaving out a locking primitive | ||
| 7 | would cause. This list is based on experiences reviewing such patches | ||
| 8 | over a rather long period of time, but improvements are always welcome! | ||
| 9 | |||
| 10 | 0. Is RCU being applied to a read-mostly situation? If the data | ||
| 11 | structure is updated more than about 10% of the time, then | ||
| 12 | you should strongly consider some other approach, unless | ||
| 13 | detailed performance measurements show that RCU is nonetheless | ||
| 14 | the right tool for the job. | ||
| 15 | |||
| 16 | The other exception would be where performance is not an issue, | ||
| 17 | and RCU provides a simpler implementation. An example of this | ||
| 18 | situation is the dynamic NMI code in the Linux 2.6 kernel, | ||
| 19 | at least on architectures where NMIs are rare. | ||
| 20 | |||
| 21 | 1. Does the update code have proper mutual exclusion? | ||
| 22 | |||
| 23 | RCU does allow -readers- to run (almost) naked, but -writers- must | ||
| 24 | still use some sort of mutual exclusion, such as: | ||
| 25 | |||
| 26 | a. locking, | ||
| 27 | b. atomic operations, or | ||
| 28 | c. restricting updates to a single task. | ||
| 29 | |||
| 30 | If you choose #b, be prepared to describe how you have handled | ||
| 31 | memory barriers on weakly ordered machines (pretty much all of | ||
| 32 | them -- even x86 allows reads to be reordered), and be prepared | ||
| 33 | to explain why this added complexity is worthwhile. If you | ||
| 34 | choose #c, be prepared to explain how this single task does not | ||
| 35 | become a major bottleneck on big multiprocessor machines. | ||
| 36 | |||
| 37 | 2. Do the RCU read-side critical sections make proper use of | ||
| 38 | rcu_read_lock() and friends? These primitives are needed | ||
| 39 | to suppress preemption (or bottom halves, in the case of | ||
| 40 | rcu_read_lock_bh()) in the read-side critical sections, | ||
| 41 | and are also an excellent aid to readability. | ||
| 42 | |||
| 43 | 3. Does the update code tolerate concurrent accesses? | ||
| 44 | |||
| 45 | The whole point of RCU is to permit readers to run without | ||
| 46 | any locks or atomic operations. This means that readers will | ||
| 47 | be running while updates are in progress. There are a number | ||
| 48 | of ways to handle this concurrency, depending on the situation: | ||
| 49 | |||
| 50 | a. Make updates appear atomic to readers. For example, | ||
| 51 | pointer updates to properly aligned fields will appear | ||
| 52 | atomic, as will individual atomic primitives. Operations | ||
| 53 | performed under a lock and sequences of multiple atomic | ||
| 54 | primitives will -not- appear to be atomic. | ||
| 55 | |||
| 56 | This is almost always the best approach. | ||
| 57 | |||
| 58 | b. Carefully order the updates and the reads so that | ||
| 59 | readers see valid data at all phases of the update. | ||
| 60 | This is often more difficult than it sounds, especially | ||
| 61 | given modern CPUs' tendency to reorder memory references. | ||
| 62 | One must usually liberally sprinkle memory barriers | ||
| 63 | (smp_wmb(), smp_rmb(), smp_mb()) through the code, | ||
| 64 | making it difficult to understand and to test. | ||
| 65 | |||
| 66 | It is usually better to group the changing data into | ||
| 67 | a separate structure, so that the change may be made | ||
| 68 | to appear atomic by updating a pointer to reference | ||
| 69 | a new structure containing updated values. | ||
| 70 | |||
| 71 | 4. Weakly ordered CPUs pose special challenges. Almost all CPUs | ||
| 72 | are weakly ordered -- even i386 CPUs allow reads to be reordered. | ||
| 73 | RCU code must take all of the following measures to prevent | ||
| 74 | memory-corruption problems: | ||
| 75 | |||
| 76 | a. Readers must maintain proper ordering of their memory | ||
| 77 | accesses. The rcu_dereference() primitive ensures that | ||
| 78 | the CPU picks up the pointer before it picks up the data | ||
| 79 | that the pointer points to. This really is necessary | ||
| 80 | on Alpha CPUs. If you don't believe me, see: | ||
| 81 | |||
| 82 | http://www.openvms.compaq.com/wizard/wiz_2637.html | ||
| 83 | |||
| 84 | The rcu_dereference() primitive is also an excellent | ||
| 85 | documentation aid, letting the person reading the code | ||
| 86 | know exactly which pointers are protected by RCU. | ||
| 87 | |||
| 88 | The rcu_dereference() primitive is used by the various | ||
| 89 | "_rcu()" list-traversal primitives, such as the | ||
| 90 | list_for_each_entry_rcu(). | ||
| 91 | |||
| 92 | b. If the list macros are being used, the list_del_rcu(), | ||
| 93 | list_add_tail_rcu(), and list_del_rcu() primitives must | ||
| 94 | be used in order to prevent weakly ordered machines from | ||
| 95 | misordering structure initialization and pointer planting. | ||
| 96 | Similarly, if the hlist macros are being used, the | ||
| 97 | hlist_del_rcu() and hlist_add_head_rcu() primitives | ||
| 98 | are required. | ||
| 99 | |||
| 100 | c. Updates must ensure that initialization of a given | ||
| 101 | structure happens before pointers to that structure are | ||
| 102 | publicized. Use the rcu_assign_pointer() primitive | ||
| 103 | when publicizing a pointer to a structure that can | ||
| 104 | be traversed by an RCU read-side critical section. | ||
| 105 | |||
| 106 | [The rcu_assign_pointer() primitive is in process.] | ||
| 107 | |||
| 108 | 5. If call_rcu(), or a related primitive such as call_rcu_bh(), | ||
| 109 | is used, the callback function must be written to be called | ||
| 110 | from softirq context. In particular, it cannot block. | ||
| 111 | |||
| 112 | 6. Since synchronize_kernel() blocks, it cannot be called from | ||
| 113 | any sort of irq context. | ||
| 114 | |||
| 115 | 7. If the updater uses call_rcu(), then the corresponding readers | ||
| 116 | must use rcu_read_lock() and rcu_read_unlock(). If the updater | ||
| 117 | uses call_rcu_bh(), then the corresponding readers must use | ||
| 118 | rcu_read_lock_bh() and rcu_read_unlock_bh(). Mixing things up | ||
| 119 | will result in confusion and broken kernels. | ||
| 120 | |||
| 121 | One exception to this rule: rcu_read_lock() and rcu_read_unlock() | ||
| 122 | may be substituted for rcu_read_lock_bh() and rcu_read_unlock_bh() | ||
| 123 | in cases where local bottom halves are already known to be | ||
| 124 | disabled, for example, in irq or softirq context. Commenting | ||
| 125 | such cases is a must, of course! And the jury is still out on | ||
| 126 | whether the increased speed is worth it. | ||
| 127 | |||
| 128 | 8. Although synchronize_kernel() is a bit slower than is call_rcu(), | ||
| 129 | it usually results in simpler code. So, unless update performance | ||
| 130 | is important or the updaters cannot block, synchronize_kernel() | ||
| 131 | should be used in preference to call_rcu(). | ||
| 132 | |||
| 133 | 9. All RCU list-traversal primitives, which include | ||
| 134 | list_for_each_rcu(), list_for_each_entry_rcu(), | ||
| 135 | list_for_each_continue_rcu(), and list_for_each_safe_rcu(), | ||
| 136 | must be within an RCU read-side critical section. RCU | ||
| 137 | read-side critical sections are delimited by rcu_read_lock() | ||
| 138 | and rcu_read_unlock(), or by similar primitives such as | ||
| 139 | rcu_read_lock_bh() and rcu_read_unlock_bh(). | ||
| 140 | |||
| 141 | Use of the _rcu() list-traversal primitives outside of an | ||
| 142 | RCU read-side critical section causes no harm other than | ||
| 143 | a slight performance degradation on Alpha CPUs and some | ||
| 144 | confusion on the part of people trying to read the code. | ||
| 145 | |||
| 146 | Another way of thinking of this is "If you are holding the | ||
| 147 | lock that prevents the data structure from changing, why do | ||
| 148 | you also need RCU-based protection?" That said, there may | ||
| 149 | well be situations where use of the _rcu() list-traversal | ||
| 150 | primitives while the update-side lock is held results in | ||
| 151 | simpler and more maintainable code. The jury is still out | ||
| 152 | on this question. | ||
| 153 | |||
| 154 | 10. Conversely, if you are in an RCU read-side critical section, | ||
| 155 | you -must- use the "_rcu()" variants of the list macros. | ||
| 156 | Failing to do so will break Alpha and confuse people reading | ||
| 157 | your code. | ||
diff --git a/Documentation/RCU/listRCU.txt b/Documentation/RCU/listRCU.txt new file mode 100644 index 0000000000..bda6ead69b --- /dev/null +++ b/Documentation/RCU/listRCU.txt | |||
| @@ -0,0 +1,307 @@ | |||
| 1 | Using RCU to Protect Read-Mostly Linked Lists | ||
| 2 | |||
| 3 | |||
| 4 | One of the best applications of RCU is to protect read-mostly linked lists | ||
| 5 | ("struct list_head" in list.h). One big advantage of this approach | ||
| 6 | is that all of the required memory barriers are included for you in | ||
| 7 | the list macros. This document describes several applications of RCU, | ||
| 8 | with the best fits first. | ||
| 9 | |||
| 10 | |||
| 11 | Example 1: Read-Side Action Taken Outside of Lock, No In-Place Updates | ||
| 12 | |||
| 13 | The best applications are cases where, if reader-writer locking were | ||
| 14 | used, the read-side lock would be dropped before taking any action | ||
| 15 | based on the results of the search. The most celebrated example is | ||
| 16 | the routing table. Because the routing table is tracking the state of | ||
| 17 | equipment outside of the computer, it will at times contain stale data. | ||
| 18 | Therefore, once the route has been computed, there is no need to hold | ||
| 19 | the routing table static during transmission of the packet. After all, | ||
| 20 | you can hold the routing table static all you want, but that won't keep | ||
| 21 | the external Internet from changing, and it is the state of the external | ||
| 22 | Internet that really matters. In addition, routing entries are typically | ||
| 23 | added or deleted, rather than being modified in place. | ||
| 24 | |||
| 25 | A straightforward example of this use of RCU may be found in the | ||
| 26 | system-call auditing support. For example, a reader-writer locked | ||
| 27 | implementation of audit_filter_task() might be as follows: | ||
| 28 | |||
| 29 | static enum audit_state audit_filter_task(struct task_struct *tsk) | ||
| 30 | { | ||
| 31 | struct audit_entry *e; | ||
| 32 | enum audit_state state; | ||
| 33 | |||
| 34 | read_lock(&auditsc_lock); | ||
| 35 | list_for_each_entry(e, &audit_tsklist, list) { | ||
| 36 | if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { | ||
| 37 | read_unlock(&auditsc_lock); | ||
| 38 | return state; | ||
| 39 | } | ||
| 40 | } | ||
| 41 | read_unlock(&auditsc_lock); | ||
| 42 | return AUDIT_BUILD_CONTEXT; | ||
| 43 | } | ||
| 44 | |||
| 45 | Here the list is searched under the lock, but the lock is dropped before | ||
| 46 | the corresponding value is returned. By the time that this value is acted | ||
| 47 | on, the list may well have been modified. This makes sense, since if | ||
| 48 | you are turning auditing off, it is OK to audit a few extra system calls. | ||
| 49 | |||
| 50 | This means that RCU can be easily applied to the read side, as follows: | ||
| 51 | |||
| 52 | static enum audit_state audit_filter_task(struct task_struct *tsk) | ||
| 53 | { | ||
| 54 | struct audit_entry *e; | ||
| 55 | enum audit_state state; | ||
| 56 | |||
| 57 | rcu_read_lock(); | ||
| 58 | list_for_each_entry_rcu(e, &audit_tsklist, list) { | ||
| 59 | if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { | ||
| 60 | rcu_read_unlock(); | ||
| 61 | return state; | ||
| 62 | } | ||
| 63 | } | ||
| 64 | rcu_read_unlock(); | ||
| 65 | return AUDIT_BUILD_CONTEXT; | ||
| 66 | } | ||
| 67 | |||
| 68 | The read_lock() and read_unlock() calls have become rcu_read_lock() | ||
| 69 | and rcu_read_unlock(), respectively, and the list_for_each_entry() has | ||
| 70 | become list_for_each_entry_rcu(). The _rcu() list-traversal primitives | ||
| 71 | insert the read-side memory barriers that are required on DEC Alpha CPUs. | ||
| 72 | |||
| 73 | The changes to the update side are also straightforward. A reader-writer | ||
| 74 | lock might be used as follows for deletion and insertion: | ||
| 75 | |||
| 76 | static inline int audit_del_rule(struct audit_rule *rule, | ||
| 77 | struct list_head *list) | ||
| 78 | { | ||
| 79 | struct audit_entry *e; | ||
| 80 | |||
| 81 | write_lock(&auditsc_lock); | ||
| 82 | list_for_each_entry(e, list, list) { | ||
| 83 | if (!audit_compare_rule(rule, &e->rule)) { | ||
| 84 | list_del(&e->list); | ||
| 85 | write_unlock(&auditsc_lock); | ||
| 86 | return 0; | ||
| 87 | } | ||
| 88 | } | ||
| 89 | write_unlock(&auditsc_lock); | ||
| 90 | return -EFAULT; /* No matching rule */ | ||
| 91 | } | ||
| 92 | |||
| 93 | static inline int audit_add_rule(struct audit_entry *entry, | ||
| 94 | struct list_head *list) | ||
| 95 | { | ||
| 96 | write_lock(&auditsc_lock); | ||
| 97 | if (entry->rule.flags & AUDIT_PREPEND) { | ||
| 98 | entry->rule.flags &= ~AUDIT_PREPEND; | ||
| 99 | list_add(&entry->list, list); | ||
| 100 | } else { | ||
| 101 | list_add_tail(&entry->list, list); | ||
| 102 | } | ||
| 103 | write_unlock(&auditsc_lock); | ||
| 104 | return 0; | ||
| 105 | } | ||
| 106 | |||
| 107 | Following are the RCU equivalents for these two functions: | ||
| 108 | |||
| 109 | static inline int audit_del_rule(struct audit_rule *rule, | ||
| 110 | struct list_head *list) | ||
| 111 | { | ||
| 112 | struct audit_entry *e; | ||
| 113 | |||
| 114 | /* Do not use the _rcu iterator here, since this is the only | ||
| 115 | * deletion routine. */ | ||
| 116 | list_for_each_entry(e, list, list) { | ||
| 117 | if (!audit_compare_rule(rule, &e->rule)) { | ||
| 118 | list_del_rcu(&e->list); | ||
| 119 | call_rcu(&e->rcu, audit_free_rule, e); | ||
| 120 | return 0; | ||
| 121 | } | ||
| 122 | } | ||
| 123 | return -EFAULT; /* No matching rule */ | ||
| 124 | } | ||
| 125 | |||
| 126 | static inline int audit_add_rule(struct audit_entry *entry, | ||
| 127 | struct list_head *list) | ||
| 128 | { | ||
| 129 | if (entry->rule.flags & AUDIT_PREPEND) { | ||
| 130 | entry->rule.flags &= ~AUDIT_PREPEND; | ||
| 131 | list_add_rcu(&entry->list, list); | ||
| 132 | } else { | ||
| 133 | list_add_tail_rcu(&entry->list, list); | ||
| 134 | } | ||
| 135 | return 0; | ||
| 136 | } | ||
| 137 | |||
| 138 | Normally, the write_lock() and write_unlock() would be replaced by | ||
| 139 | a spin_lock() and a spin_unlock(), but in this case, all callers hold | ||
| 140 | audit_netlink_sem, so no additional locking is required. The auditsc_lock | ||
| 141 | can therefore be eliminated, since use of RCU eliminates the need for | ||
| 142 | writers to exclude readers. | ||
| 143 | |||
| 144 | The list_del(), list_add(), and list_add_tail() primitives have been | ||
| 145 | replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu(). | ||
| 146 | The _rcu() list-manipulation primitives add memory barriers that are | ||
| 147 | needed on weakly ordered CPUs (most of them!). | ||
| 148 | |||
| 149 | So, when readers can tolerate stale data and when entries are either added | ||
| 150 | or deleted, without in-place modification, it is very easy to use RCU! | ||
| 151 | |||
| 152 | |||
| 153 | Example 2: Handling In-Place Updates | ||
| 154 | |||
| 155 | The system-call auditing code does not update auditing rules in place. | ||
| 156 | However, if it did, reader-writer-locked code to do so might look as | ||
| 157 | follows (presumably, the field_count is only permitted to decrease, | ||
| 158 | otherwise, the added fields would need to be filled in): | ||
| 159 | |||
| 160 | static inline int audit_upd_rule(struct audit_rule *rule, | ||
| 161 | struct list_head *list, | ||
| 162 | __u32 newaction, | ||
| 163 | __u32 newfield_count) | ||
| 164 | { | ||
| 165 | struct audit_entry *e; | ||
| 166 | struct audit_newentry *ne; | ||
| 167 | |||
| 168 | write_lock(&auditsc_lock); | ||
| 169 | list_for_each_entry(e, list, list) { | ||
| 170 | if (!audit_compare_rule(rule, &e->rule)) { | ||
| 171 | e->rule.action = newaction; | ||
| 172 | e->rule.file_count = newfield_count; | ||
| 173 | write_unlock(&auditsc_lock); | ||
| 174 | return 0; | ||
| 175 | } | ||
| 176 | } | ||
| 177 | write_unlock(&auditsc_lock); | ||
| 178 | return -EFAULT; /* No matching rule */ | ||
| 179 | } | ||
| 180 | |||
| 181 | The RCU version creates a copy, updates the copy, then replaces the old | ||
| 182 | entry with the newly updated entry. This sequence of actions, allowing | ||
| 183 | concurrent reads while doing a copy to perform an update, is what gives | ||
| 184 | RCU ("read-copy update") its name. The RCU code is as follows: | ||
| 185 | |||
| 186 | static inline int audit_upd_rule(struct audit_rule *rule, | ||
| 187 | struct list_head *list, | ||
| 188 | __u32 newaction, | ||
| 189 | __u32 newfield_count) | ||
| 190 | { | ||
| 191 | struct audit_entry *e; | ||
| 192 | struct audit_newentry *ne; | ||
| 193 | |||
| 194 | list_for_each_entry(e, list, list) { | ||
| 195 | if (!audit_compare_rule(rule, &e->rule)) { | ||
| 196 | ne = kmalloc(sizeof(*entry), GFP_ATOMIC); | ||
| 197 | if (ne == NULL) | ||
| 198 | return -ENOMEM; | ||
| 199 | audit_copy_rule(&ne->rule, &e->rule); | ||
| 200 | ne->rule.action = newaction; | ||
| 201 | ne->rule.file_count = newfield_count; | ||
| 202 | list_add_rcu(ne, e); | ||
| 203 | list_del(e); | ||
| 204 | call_rcu(&e->rcu, audit_free_rule, e); | ||
| 205 | return 0; | ||
| 206 | } | ||
| 207 | } | ||
| 208 | return -EFAULT; /* No matching rule */ | ||
| 209 | } | ||
| 210 | |||
| 211 | Again, this assumes that the caller holds audit_netlink_sem. Normally, | ||
| 212 | the reader-writer lock would become a spinlock in this sort of code. | ||
| 213 | |||
| 214 | |||
| 215 | Example 3: Eliminating Stale Data | ||
| 216 | |||
| 217 | The auditing examples above tolerate stale data, as do most algorithms | ||
| 218 | that are tracking external state. Because there is a delay from the | ||
| 219 | time the external state changes before Linux becomes aware of the change, | ||
| 220 | additional RCU-induced staleness is normally not a problem. | ||
| 221 | |||
| 222 | However, there are many examples where stale data cannot be tolerated. | ||
| 223 | One example in the Linux kernel is the System V IPC (see the ipc_lock() | ||
| 224 | function in ipc/util.c). This code checks a "deleted" flag under a | ||
| 225 | per-entry spinlock, and, if the "deleted" flag is set, pretends that the | ||
| 226 | entry does not exist. For this to be helpful, the search function must | ||
| 227 | return holding the per-entry spinlock, as ipc_lock() does in fact do. | ||
| 228 | |||
| 229 | Quick Quiz: Why does the search function need to return holding the | ||
| 230 | per-entry lock for this deleted-flag technique to be helpful? | ||
| 231 | |||
| 232 | If the system-call audit module were to ever need to reject stale data, | ||
| 233 | one way to accomplish this would be to add a "deleted" flag and a "lock" | ||
| 234 | spinlock to the audit_entry structure, and modify audit_filter_task() | ||
| 235 | as follows: | ||
| 236 | |||
| 237 | static enum audit_state audit_filter_task(struct task_struct *tsk) | ||
| 238 | { | ||
| 239 | struct audit_entry *e; | ||
| 240 | enum audit_state state; | ||
| 241 | |||
| 242 | rcu_read_lock(); | ||
| 243 | list_for_each_entry_rcu(e, &audit_tsklist, list) { | ||
| 244 | if (audit_filter_rules(tsk, &e->rule, NULL, &state)) { | ||
| 245 | spin_lock(&e->lock); | ||
| 246 | if (e->deleted) { | ||
| 247 | spin_unlock(&e->lock); | ||
| 248 | rcu_read_unlock(); | ||
| 249 | return AUDIT_BUILD_CONTEXT; | ||
| 250 | } | ||
| 251 | rcu_read_unlock(); | ||
| 252 | return state; | ||
| 253 | } | ||
| 254 | } | ||
| 255 | rcu_read_unlock(); | ||
| 256 | return AUDIT_BUILD_CONTEXT; | ||
| 257 | } | ||
| 258 | |||
| 259 | Note that this example assumes that entries are only added and deleted. | ||
| 260 | Additional mechanism is required to deal correctly with the | ||
| 261 | update-in-place performed by audit_upd_rule(). For one thing, | ||
| 262 | audit_upd_rule() would need additional memory barriers to ensure | ||
| 263 | that the list_add_rcu() was really executed before the list_del_rcu(). | ||
| 264 | |||
| 265 | The audit_del_rule() function would need to set the "deleted" | ||
| 266 | flag under the spinlock as follows: | ||
| 267 | |||
| 268 | static inline int audit_del_rule(struct audit_rule *rule, | ||
| 269 | struct list_head *list) | ||
| 270 | { | ||
| 271 | struct audit_entry *e; | ||
| 272 | |||
| 273 | /* Do not use the _rcu iterator here, since this is the only | ||
| 274 | * deletion routine. */ | ||
| 275 | list_for_each_entry(e, list, list) { | ||
| 276 | if (!audit_compare_rule(rule, &e->rule)) { | ||
| 277 | spin_lock(&e->lock); | ||
| 278 | list_del_rcu(&e->list); | ||
| 279 | e->deleted = 1; | ||
| 280 | spin_unlock(&e->lock); | ||
| 281 | call_rcu(&e->rcu, audit_free_rule, e); | ||
| 282 | return 0; | ||
| 283 | } | ||
| 284 | } | ||
| 285 | return -EFAULT; /* No matching rule */ | ||
| 286 | } | ||
| 287 | |||
| 288 | |||
| 289 | Summary | ||
| 290 | |||
| 291 | Read-mostly list-based data structures that can tolerate stale data are | ||
| 292 | the most amenable to use of RCU. The simplest case is where entries are | ||
| 293 | either added or deleted from the data structure (or atomically modified | ||
| 294 | in place), but non-atomic in-place modifications can be handled by making | ||
| 295 | a copy, updating the copy, then replacing the original with the copy. | ||
| 296 | If stale data cannot be tolerated, then a "deleted" flag may be used | ||
| 297 | in conjunction with a per-entry spinlock in order to allow the search | ||
| 298 | function to reject newly deleted data. | ||
| 299 | |||
| 300 | |||
| 301 | Answer to Quick Quiz | ||
| 302 | |||
| 303 | If the search function drops the per-entry lock before returning, then | ||
| 304 | the caller will be processing stale data in any case. If it is really | ||
| 305 | OK to be processing stale data, then you don't need a "deleted" flag. | ||
| 306 | If processing stale data really is a problem, then you need to hold the | ||
| 307 | per-entry lock across all of the code that uses the value looked up. | ||
diff --git a/Documentation/RCU/rcu.txt b/Documentation/RCU/rcu.txt new file mode 100644 index 0000000000..7e0c2ab6f2 --- /dev/null +++ b/Documentation/RCU/rcu.txt | |||
| @@ -0,0 +1,67 @@ | |||
| 1 | RCU Concepts | ||
| 2 | |||
| 3 | |||
| 4 | The basic idea behind RCU (read-copy update) is to split destructive | ||
| 5 | operations into two parts, one that prevents anyone from seeing the data | ||
| 6 | item being destroyed, and one that actually carries out the destruction. | ||
| 7 | A "grace period" must elapse between the two parts, and this grace period | ||
| 8 | must be long enough that any readers accessing the item being deleted have | ||
| 9 | since dropped their references. For example, an RCU-protected deletion | ||
| 10 | from a linked list would first remove the item from the list, wait for | ||
| 11 | a grace period to elapse, then free the element. See the listRCU.txt | ||
| 12 | file for more information on using RCU with linked lists. | ||
| 13 | |||
| 14 | |||
| 15 | Frequently Asked Questions | ||
| 16 | |||
| 17 | o Why would anyone want to use RCU? | ||
| 18 | |||
| 19 | The advantage of RCU's two-part approach is that RCU readers need | ||
| 20 | not acquire any locks, perform any atomic instructions, write to | ||
| 21 | shared memory, or (on CPUs other than Alpha) execute any memory | ||
| 22 | barriers. The fact that these operations are quite expensive | ||
| 23 | on modern CPUs is what gives RCU its performance advantages | ||
| 24 | in read-mostly situations. The fact that RCU readers need not | ||
| 25 | acquire locks can also greatly simplify deadlock-avoidance code. | ||
| 26 | |||
| 27 | o How can the updater tell when a grace period has completed | ||
| 28 | if the RCU readers give no indication when they are done? | ||
| 29 | |||
| 30 | Just as with spinlocks, RCU readers are not permitted to | ||
| 31 | block, switch to user-mode execution, or enter the idle loop. | ||
| 32 | Therefore, as soon as a CPU is seen passing through any of these | ||
| 33 | three states, we know that that CPU has exited any previous RCU | ||
| 34 | read-side critical sections. So, if we remove an item from a | ||
| 35 | linked list, and then wait until all CPUs have switched context, | ||
| 36 | executed in user mode, or executed in the idle loop, we can | ||
| 37 | safely free up that item. | ||
| 38 | |||
| 39 | o If I am running on a uniprocessor kernel, which can only do one | ||
| 40 | thing at a time, why should I wait for a grace period? | ||
| 41 | |||
| 42 | See the UP.txt file in this directory. | ||
| 43 | |||
| 44 | o How can I see where RCU is currently used in the Linux kernel? | ||
| 45 | |||
| 46 | Search for "rcu_read_lock", "call_rcu", and "synchronize_kernel". | ||
| 47 | |||
| 48 | o What guidelines should I follow when writing code that uses RCU? | ||
| 49 | |||
| 50 | See the checklist.txt file in this directory. | ||
| 51 | |||
| 52 | o Why the name "RCU"? | ||
| 53 | |||
| 54 | "RCU" stands for "read-copy update". The file listRCU.txt has | ||
| 55 | more information on where this name came from, search for | ||
| 56 | "read-copy update" to find it. | ||
| 57 | |||
| 58 | o I hear that RCU is patented? What is with that? | ||
| 59 | |||
| 60 | Yes, it is. There are several known patents related to RCU, | ||
| 61 | search for the string "Patent" in RTFP.txt to find them. | ||
| 62 | Of these, one was allowed to lapse by the assignee, and the | ||
| 63 | others have been contributed to the Linux kernel under GPL. | ||
| 64 | |||
| 65 | o Where can I find more information on RCU? | ||
| 66 | |||
| 67 | See the RTFP.txt file in this directory. | ||
