diff options
| author | Paul E. McKenney <paulmck@linux.vnet.ibm.com> | 2017-08-07 20:23:20 -0400 |
|---|---|---|
| committer | Paul E. McKenney <paulmck@linux.vnet.ibm.com> | 2017-10-09 17:23:36 -0400 |
| commit | bb7e5ce7dde6f42e7793bd6cf4b1eb71a20145aa (patch) | |
| tree | 9707a2bac694c443b95341fd777d3d287cbbe6c5 /Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html | |
| parent | 8a5776a5f49812d29fe4b2d0a2d71675c3facf3f (diff) | |
documentation: RCU grace-period memory ordering guarantees
This commit provides text and diagrams showing how Tree RCU implements
its grace-period memory ordering guarantees.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html')
| -rw-r--r-- | Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html | 707 |
1 files changed, 707 insertions, 0 deletions
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html new file mode 100644 index 000000000000..8651b0b4fd79 --- /dev/null +++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html | |||
| @@ -0,0 +1,707 @@ | |||
| 1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" | ||
| 2 | "http://www.w3.org/TR/html4/loose.dtd"> | ||
| 3 | <html> | ||
| 4 | <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title> | ||
| 5 | <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> | ||
| 6 | |||
| 7 | <p>August 8, 2017</p> | ||
| 8 | <p>This article was contributed by Paul E. McKenney</p> | ||
| 9 | |||
| 10 | <h3>Introduction</h3> | ||
| 11 | |||
| 12 | <p>This document gives a rough visual overview of how Tree RCU's | ||
| 13 | grace-period memory ordering guarantee is provided. | ||
| 14 | |||
| 15 | <ol> | ||
| 16 | <li> <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> | ||
| 17 | What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a> | ||
| 18 | <li> <a href="#Tree RCU Grace Period Memory Ordering Building Blocks"> | ||
| 19 | Tree RCU Grace Period Memory Ordering Building Blocks</a> | ||
| 20 | <li> <a href="#Tree RCU Grace Period Memory Ordering Components"> | ||
| 21 | Tree RCU Grace Period Memory Ordering Components</a> | ||
| 22 | <li> <a href="#Putting It All Together">Putting It All Together</a> | ||
| 23 | </ol> | ||
| 24 | |||
| 25 | <h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?"> | ||
| 26 | What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3> | ||
| 27 | |||
| 28 | <p>RCU grace periods provide extremely strong memory-ordering guarantees | ||
| 29 | for non-idle non-offline code. | ||
| 30 | Any code that happens after the end of a given RCU grace period is guaranteed | ||
| 31 | to see the effects of all accesses prior to the beginning of that grace | ||
| 32 | period that are within RCU read-side critical sections. | ||
| 33 | Similarly, any code that happens before the beginning of a given RCU grace | ||
| 34 | period is guaranteed to see the effects of all accesses following the end | ||
| 35 | of that grace period that are within RCU read-side critical sections. | ||
| 36 | |||
| 37 | <p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>, | ||
| 38 | for which RCU-sched read-side critical sections include any region | ||
| 39 | of code for which preemption is disabled. | ||
| 40 | Given that each individual machine instruction can be thought of as | ||
| 41 | an extremely small region of preemption-disabled code, one can think of | ||
| 42 | <tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids. | ||
| 43 | |||
| 44 | <p>RCU updaters use this guarantee by splitting their updates into | ||
| 45 | two phases, one of which is executed before the grace period and | ||
| 46 | the other of which is executed after the grace period. | ||
| 47 | In the most common use case, phase one removes an element from | ||
| 48 | a linked RCU-protected data structure, and phase two frees that element. | ||
| 49 | For this to work, any readers that have witnessed state prior to the | ||
| 50 | phase-one update (in the common case, removal) must not witness state | ||
| 51 | following the phase-two update (in the common case, freeing). | ||
| 52 | |||
| 53 | <p>The RCU implementation provides this guarantee using a network | ||
| 54 | of lock-based critical sections, memory barriers, and per-CPU | ||
| 55 | processing, as is described in the following sections. | ||
| 56 | |||
| 57 | <h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks"> | ||
| 58 | Tree RCU Grace Period Memory Ordering Building Blocks</a></h3> | ||
| 59 | |||
| 60 | <p>The workhorse for RCU's grace-period memory ordering is the | ||
| 61 | critical section for the <tt>rcu_node</tt> structure's | ||
| 62 | <tt>->lock</tt>. | ||
| 63 | These critical sections use helper functions for lock acquisition, including | ||
| 64 | <tt>raw_spin_lock_rcu_node()</tt>, | ||
| 65 | <tt>raw_spin_lock_irq_rcu_node()</tt>, and | ||
| 66 | <tt>raw_spin_lock_irqsave_rcu_node()</tt>. | ||
| 67 | Their lock-release counterparts are | ||
| 68 | <tt>raw_spin_unlock_rcu_node()</tt>, | ||
| 69 | <tt>raw_spin_unlock_irq_rcu_node()</tt>, and | ||
| 70 | <tt>raw_spin_unlock_irqrestore_rcu_node()</tt>, | ||
| 71 | respectively. | ||
| 72 | For completeness, a | ||
| 73 | <tt>raw_spin_trylock_rcu_node()</tt> | ||
| 74 | is also provided. | ||
| 75 | The key point is that the lock-acquisition functions, including | ||
| 76 | <tt>raw_spin_trylock_rcu_node()</tt>, all invoke | ||
| 77 | <tt>smp_mb__after_unlock_lock()</tt> immediately after successful | ||
| 78 | acquisition of the lock. | ||
| 79 | |||
| 80 | <p>Therefore, for any given <tt>rcu_node</tt> struction, any access | ||
| 81 | happening before one of the above lock-release functions will be seen | ||
| 82 | by all CPUs as happening before any access happening after a later | ||
| 83 | one of the above lock-acquisition functions. | ||
| 84 | Furthermore, any access happening before one of the | ||
| 85 | above lock-release function on any given CPU will be seen by all | ||
| 86 | CPUs as happening before any access happening after a later one | ||
| 87 | of the above lock-acquisition functions executing on that same CPU, | ||
| 88 | even if the lock-release and lock-acquisition functions are operating | ||
| 89 | on different <tt>rcu_node</tt> structures. | ||
| 90 | Tree RCU uses these two ordering guarantees to form an ordering | ||
| 91 | network among all CPUs that were in any way involved in the grace | ||
| 92 | period, including any CPUs that came online or went offline during | ||
| 93 | the grace period in question. | ||
| 94 | |||
| 95 | <p>The following litmus test exhibits the ordering effects of these | ||
| 96 | lock-acquisition and lock-release functions: | ||
| 97 | |||
| 98 | <pre> | ||
| 99 | 1 int x, y, z; | ||
| 100 | 2 | ||
| 101 | 3 void task0(void) | ||
| 102 | 4 { | ||
| 103 | 5 raw_spin_lock_rcu_node(rnp); | ||
| 104 | 6 WRITE_ONCE(x, 1); | ||
| 105 | 7 r1 = READ_ONCE(y); | ||
| 106 | 8 raw_spin_unlock_rcu_node(rnp); | ||
| 107 | 9 } | ||
| 108 | 10 | ||
| 109 | 11 void task1(void) | ||
| 110 | 12 { | ||
| 111 | 13 raw_spin_lock_rcu_node(rnp); | ||
| 112 | 14 WRITE_ONCE(y, 1); | ||
| 113 | 15 r2 = READ_ONCE(z); | ||
| 114 | 16 raw_spin_unlock_rcu_node(rnp); | ||
| 115 | 17 } | ||
| 116 | 18 | ||
| 117 | 19 void task2(void) | ||
| 118 | 20 { | ||
| 119 | 21 WRITE_ONCE(z, 1); | ||
| 120 | 22 smp_mb(); | ||
| 121 | 23 r3 = READ_ONCE(x); | ||
| 122 | 24 } | ||
| 123 | 25 | ||
| 124 | 26 WARN_ON(r1 == 0 && r2 == 0 && r3 == 0); | ||
| 125 | </pre> | ||
| 126 | |||
| 127 | <p>The <tt>WARN_ON()</tt> is evaluated at “the end of time”, | ||
| 128 | after all changes have propagated throughout the system. | ||
| 129 | Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the | ||
| 130 | acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example | ||
| 131 | on PowerPC. | ||
| 132 | The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this | ||
| 133 | <tt>WARN_ON()</tt> from triggering. | ||
| 134 | |||
| 135 | <p>This approach must be extended to include idle CPUs, which need | ||
| 136 | RCU's grace-period memory ordering guarantee to extend to any | ||
| 137 | RCU read-side critical sections preceding and following the current | ||
| 138 | idle sojourn. | ||
| 139 | This case is handled by calls to the strongly ordered | ||
| 140 | <tt>atomic_add_return()</tt> read-modify-write atomic operation that | ||
| 141 | is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry | ||
| 142 | time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time. | ||
| 143 | The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and | ||
| 144 | <tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke | ||
| 145 | an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs. | ||
| 146 | |||
| 147 | <table> | ||
| 148 | <tr><th> </th></tr> | ||
| 149 | <tr><th align="left">Quick Quiz:</th></tr> | ||
| 150 | <tr><td> | ||
| 151 | But what about CPUs that remain offline for the entire | ||
| 152 | grace period? | ||
| 153 | </td></tr> | ||
| 154 | <tr><th align="left">Answer:</th></tr> | ||
| 155 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||
| 156 | Such CPUs will be offline at the beginning of the grace period, | ||
| 157 | so the grace period won't expect quiescent states from them. | ||
| 158 | Races between grace-period start and CPU-hotplug operations | ||
| 159 | are mediated by the CPU's leaf <tt>rcu_node</tt> structure's | ||
| 160 | <tt>->lock</tt> as described above. | ||
| 161 | </font></td></tr> | ||
| 162 | <tr><td> </td></tr> | ||
| 163 | </table> | ||
| 164 | |||
| 165 | <p>The approach must be extended to handle one final case, that | ||
| 166 | of waking a task blocked in <tt>synchronize_rcu()</tt>. | ||
| 167 | This task might be affinitied to a CPU that is not yet aware that | ||
| 168 | the grace period has ended, and thus might not yet be subject to | ||
| 169 | the grace period's memory ordering. | ||
| 170 | Therefore, there is an <tt>smp_mb()</tt> after the return from | ||
| 171 | <tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt> | ||
| 172 | code path. | ||
| 173 | |||
| 174 | <table> | ||
| 175 | <tr><th> </th></tr> | ||
| 176 | <tr><th align="left">Quick Quiz:</th></tr> | ||
| 177 | <tr><td> | ||
| 178 | What? Where??? | ||
| 179 | I don't see any <tt>smp_mb()</tt> after the return from | ||
| 180 | <tt>wait_for_completion()</tt>!!! | ||
| 181 | </td></tr> | ||
| 182 | <tr><th align="left">Answer:</th></tr> | ||
| 183 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||
| 184 | That would be because I spotted the need for that | ||
| 185 | <tt>smp_mb()</tt> during the creation of this documentation, | ||
| 186 | and it is therefore unlikely to hit mainline before v4.14. | ||
| 187 | Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and | ||
| 188 | Jonathan Cameron for asking questions that sensitized me | ||
| 189 | to the rather elaborate sequence of events that demonstrate | ||
| 190 | the need for this memory barrier. | ||
| 191 | </font></td></tr> | ||
| 192 | <tr><td> </td></tr> | ||
| 193 | </table> | ||
| 194 | |||
| 195 | <p>Tree RCU's grace--period memory-ordering guarantees rely most | ||
| 196 | heavily on the <tt>rcu_node</tt> structure's <tt>->lock</tt> | ||
| 197 | field, so much so that it is necessary to abbreviate this pattern | ||
| 198 | in the diagrams in the next section. | ||
| 199 | For example, consider the <tt>rcu_prepare_for_idle()</tt> function | ||
| 200 | shown below, which is one of several functions that enforce ordering | ||
| 201 | of newly arrived RCU callbacks against future grace periods: | ||
| 202 | |||
| 203 | <pre> | ||
| 204 | 1 static void rcu_prepare_for_idle(void) | ||
| 205 | 2 { | ||
| 206 | 3 bool needwake; | ||
| 207 | 4 struct rcu_data *rdp; | ||
| 208 | 5 struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks); | ||
| 209 | 6 struct rcu_node *rnp; | ||
| 210 | 7 struct rcu_state *rsp; | ||
| 211 | 8 int tne; | ||
| 212 | 9 | ||
| 213 | 10 if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) || | ||
| 214 | 11 rcu_is_nocb_cpu(smp_processor_id())) | ||
| 215 | 12 return; | ||
| 216 | 13 tne = READ_ONCE(tick_nohz_active); | ||
| 217 | 14 if (tne != rdtp->tick_nohz_enabled_snap) { | ||
| 218 | 15 if (rcu_cpu_has_callbacks(NULL)) | ||
| 219 | 16 invoke_rcu_core(); | ||
| 220 | 17 rdtp->tick_nohz_enabled_snap = tne; | ||
| 221 | 18 return; | ||
| 222 | 19 } | ||
| 223 | 20 if (!tne) | ||
| 224 | 21 return; | ||
| 225 | 22 if (rdtp->all_lazy && | ||
| 226 | 23 rdtp->nonlazy_posted != rdtp->nonlazy_posted_snap) { | ||
| 227 | 24 rdtp->all_lazy = false; | ||
| 228 | 25 rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted; | ||
| 229 | 26 invoke_rcu_core(); | ||
| 230 | 27 return; | ||
| 231 | 28 } | ||
| 232 | 29 if (rdtp->last_accelerate == jiffies) | ||
| 233 | 30 return; | ||
| 234 | 31 rdtp->last_accelerate = jiffies; | ||
| 235 | 32 for_each_rcu_flavor(rsp) { | ||
| 236 | 33 rdp = this_cpu_ptr(rsp->rda); | ||
| 237 | 34 if (rcu_segcblist_pend_cbs(&rdp->cblist)) | ||
| 238 | 35 continue; | ||
| 239 | 36 rnp = rdp->mynode; | ||
| 240 | 37 raw_spin_lock_rcu_node(rnp); | ||
| 241 | 38 needwake = rcu_accelerate_cbs(rsp, rnp, rdp); | ||
| 242 | 39 raw_spin_unlock_rcu_node(rnp); | ||
| 243 | 40 if (needwake) | ||
| 244 | 41 rcu_gp_kthread_wake(rsp); | ||
| 245 | 42 } | ||
| 246 | 43 } | ||
| 247 | </pre> | ||
| 248 | |||
| 249 | <p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really | ||
| 250 | matters for this discussion are lines 37–39. | ||
| 251 | We will therefore abbreviate this function as follows: | ||
| 252 | |||
| 253 | </p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg"> | ||
| 254 | |||
| 255 | <p>The box represents the <tt>rcu_node</tt> structure's <tt>->lock</tt> | ||
| 256 | critical section, with the double line on top representing the additional | ||
| 257 | <tt>smp_mb__after_unlock_lock()</tt>. | ||
| 258 | |||
| 259 | <h3><a name="Tree RCU Grace Period Memory Ordering Components"> | ||
| 260 | Tree RCU Grace Period Memory Ordering Components</a></h3> | ||
| 261 | |||
| 262 | <p>Tree RCU's grace-period memory-ordering guarantee is provided by | ||
| 263 | a number of RCU components: | ||
| 264 | |||
| 265 | <ol> | ||
| 266 | <li> <a href="#Callback Registry">Callback Registry</a> | ||
| 267 | <li> <a href="#Grace-Period Initialization">Grace-Period Initialization</a> | ||
| 268 | <li> <a href="#Self-Reported Quiescent States"> | ||
| 269 | Self-Reported Quiescent States</a> | ||
| 270 | <li> <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a> | ||
| 271 | <li> <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a> | ||
| 272 | <li> <a href="Forcing Quiescent States">Forcing Quiescent States</a> | ||
| 273 | <li> <a href="Grace-Period Cleanup">Grace-Period Cleanup</a> | ||
| 274 | <li> <a href="Callback Invocation">Callback Invocation</a> | ||
| 275 | </ol> | ||
| 276 | |||
| 277 | <p>Each of the following section looks at the corresponding component | ||
| 278 | in detail. | ||
| 279 | |||
| 280 | <h4><a name="Callback Registry">Callback Registry</a></h4> | ||
| 281 | |||
| 282 | <p>If RCU's grace-period guarantee is to mean anything at all, any | ||
| 283 | access that happens before a given invocation of <tt>call_rcu()</tt> | ||
| 284 | must also happen before the corresponding grace period. | ||
| 285 | The implementation of this portion of RCU's grace period guarantee | ||
| 286 | is shown in the following figure: | ||
| 287 | |||
| 288 | </p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg"> | ||
| 289 | |||
| 290 | <p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state, | ||
| 291 | it provides no ordering guarantees, either for itself or for | ||
| 292 | phase one of the update (which again will usually be removal of | ||
| 293 | an element from an RCU-protected data structure). | ||
| 294 | It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list, | ||
| 295 | which cannot become associated with a grace period until a later | ||
| 296 | call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above. | ||
| 297 | |||
| 298 | <p>One set of code paths shown on the left invokes | ||
| 299 | <tt>rcu_accelerate_cbs()</tt> via | ||
| 300 | <tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if | ||
| 301 | the current CPU is inundated with queued <tt>rcu_head</tt> structures) | ||
| 302 | or more likely from an <tt>RCU_SOFTIRQ</tt> handler. | ||
| 303 | Another code path in the middle is taken only in kernels built with | ||
| 304 | <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes | ||
| 305 | <tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>. | ||
| 306 | The final code path on the right is taken only in kernels built with | ||
| 307 | <tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes | ||
| 308 | <tt>rcu_accelerate_cbs()</tt> via | ||
| 309 | <tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>, | ||
| 310 | <tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>, | ||
| 311 | which in turn is invoked on a surviving CPU after the outgoing | ||
| 312 | CPU has been completely offlined. | ||
| 313 | |||
| 314 | <p>There are a few other code paths within grace-period processing | ||
| 315 | that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>. | ||
| 316 | However, either way, all of the CPU's recently queued <tt>rcu_head</tt> | ||
| 317 | structures are associated with a future grace-period number under | ||
| 318 | the protection of the CPU's lead <tt>rcu_node</tt> structure's | ||
| 319 | <tt>->lock</tt>. | ||
| 320 | In all cases, there is full ordering against any prior critical section | ||
| 321 | for that same <tt>rcu_node</tt> structure's <tt>->lock</tt>, and | ||
| 322 | also full ordering against any of the current task's or CPU's prior critical | ||
| 323 | sections for any <tt>rcu_node</tt> structure's <tt>->lock</tt>. | ||
| 324 | |||
| 325 | <p>The next section will show how this ordering ensures that any | ||
| 326 | accesses prior to the <tt>call_rcu()</tt> (particularly including phase | ||
| 327 | one of the update) | ||
| 328 | happen before the start of the corresponding grace period. | ||
| 329 | |||
| 330 | <table> | ||
| 331 | <tr><th> </th></tr> | ||
| 332 | <tr><th align="left">Quick Quiz:</th></tr> | ||
| 333 | <tr><td> | ||
| 334 | But what about <tt>synchronize_rcu()</tt>? | ||
| 335 | </td></tr> | ||
| 336 | <tr><th align="left">Answer:</th></tr> | ||
| 337 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||
| 338 | The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt> | ||
| 339 | to <tt>wait_rcu_gp()</tt>, which invokes it. | ||
| 340 | So either way, it eventually comes down to <tt>call_rcu()</tt>. | ||
| 341 | </font></td></tr> | ||
| 342 | <tr><td> </td></tr> | ||
| 343 | </table> | ||
| 344 | |||
| 345 | <h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4> | ||
| 346 | |||
| 347 | <p>Grace-period initialization is carried out by | ||
| 348 | the grace-period kernel thread, which makes several passes over the | ||
| 349 | <tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function. | ||
| 350 | This means that showing the full flow of ordering through the | ||
| 351 | grace-period computation will require duplicating this tree. | ||
| 352 | If you find this confusing, please note that the state of the | ||
| 353 | <tt>rcu_node</tt> changes over time, just like Heraclitus's river. | ||
| 354 | However, to keep the <tt>rcu_node</tt> river tractable, the | ||
| 355 | grace-period kernel thread's traversals are presented in multiple | ||
| 356 | parts, starting in this section with the various phases of | ||
| 357 | grace-period initialization. | ||
| 358 | |||
| 359 | <p>The first ordering-related grace-period initialization action is to | ||
| 360 | increment the <tt>rcu_state</tt> structure's <tt>->gpnum</tt> | ||
| 361 | grace-period-number counter, as shown below: | ||
| 362 | |||
| 363 | </p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> | ||
| 364 | |||
| 365 | <p>The actual increment is carried out using <tt>smp_store_release()</tt>, | ||
| 366 | which helps reject false-positive RCU CPU stall detection. | ||
| 367 | Note that only the root <tt>rcu_node</tt> structure is touched. | ||
| 368 | |||
| 369 | <p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks | ||
| 370 | based on CPUs having come online or gone offline since the start of | ||
| 371 | the previous grace period. | ||
| 372 | In the common case where the number of online CPUs for this <tt>rcu_node</tt> | ||
| 373 | structure has not transitioned to or from zero, | ||
| 374 | this pass will scan only the leaf <tt>rcu_node</tt> structures. | ||
| 375 | However, if the number of online CPUs for a given leaf <tt>rcu_node</tt> | ||
| 376 | structure has transitioned from zero, | ||
| 377 | <tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU. | ||
| 378 | Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt> | ||
| 379 | structure has transitioned to zero, | ||
| 380 | <tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU. | ||
| 381 | The diagram below shows the path of ordering if the leftmost | ||
| 382 | <tt>rcu_node</tt> structure onlines its first CPU and if the next | ||
| 383 | <tt>rcu_node</tt> structure has no online CPUs | ||
| 384 | (or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines | ||
| 385 | its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs). | ||
| 386 | |||
| 387 | </p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> | ||
| 388 | |||
| 389 | <p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt> | ||
| 390 | tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's | ||
| 391 | <tt>->gpnum</tt> field to the newly incremented value from the | ||
| 392 | <tt>rcu_state</tt> structure, as shown in the following diagram. | ||
| 393 | |||
| 394 | </p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-1.svg" width="75%"> | ||
| 395 | |||
| 396 | <p>This change will also cause each CPU's next call to | ||
| 397 | <tt>__note_gp_changes()</tt> | ||
| 398 | to notice that a new grace period has started, as described in the next | ||
| 399 | section. | ||
| 400 | But because the grace-period kthread started the grace period at the | ||
| 401 | root (with the increment of the <tt>rcu_state</tt> structure's | ||
| 402 | <tt>->gpnum</tt> field) before setting each leaf <tt>rcu_node</tt> | ||
| 403 | structure's <tt>->gpnum</tt> field, each CPU's observation of | ||
| 404 | the start of the grace period will happen after the actual start | ||
| 405 | of the grace period. | ||
| 406 | |||
| 407 | <table> | ||
| 408 | <tr><th> </th></tr> | ||
| 409 | <tr><th align="left">Quick Quiz:</th></tr> | ||
| 410 | <tr><td> | ||
| 411 | But what about the CPU that started the grace period? | ||
| 412 | Why wouldn't it see the start of the grace period right when | ||
| 413 | it started that grace period? | ||
| 414 | </td></tr> | ||
| 415 | <tr><th align="left">Answer:</th></tr> | ||
| 416 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||
| 417 | In some deep philosophical and overly anthromorphized | ||
| 418 | sense, yes, the CPU starting the grace period is immediately | ||
| 419 | aware of having done so. | ||
| 420 | However, if we instead assume that RCU is not self-aware, | ||
| 421 | then even the CPU starting the grace period does not really | ||
| 422 | become aware of the start of this grace period until its | ||
| 423 | first call to <tt>__note_gp_changes()</tt>. | ||
| 424 | On the other hand, this CPU potentially gets early notification | ||
| 425 | because it invokes <tt>__note_gp_changes()</tt> during its | ||
| 426 | last <tt>rcu_gp_init()</tt> pass through its leaf | ||
| 427 | <tt>rcu_node</tt> structure. | ||
| 428 | </font></td></tr> | ||
| 429 | <tr><td> </td></tr> | ||
| 430 | </table> | ||
| 431 | |||
| 432 | <h4><a name="Self-Reported Quiescent States"> | ||
| 433 | Self-Reported Quiescent States</a></h4> | ||
| 434 | |||
| 435 | <p>When all entities that might block the grace period have reported | ||
| 436 | quiescent states (or as described in a later section, had quiescent | ||
| 437 | states reported on their behalf), the grace period can end. | ||
| 438 | Online non-idle CPUs report their own quiescent states, as shown | ||
| 439 | in the following diagram: | ||
| 440 | |||
| 441 | </p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%"> | ||
| 442 | |||
| 443 | <p>This is for the last CPU to report a quiescent state, which signals | ||
| 444 | the end of the grace period. | ||
| 445 | Earlier quiescent states would push up the <tt>rcu_node</tt> tree | ||
| 446 | only until they encountered an <tt>rcu_node</tt> structure that | ||
| 447 | is waiting for additional quiescent states. | ||
| 448 | However, ordering is nevertheless preserved because some later quiescent | ||
| 449 | state will acquire that <tt>rcu_node</tt> structure's <tt>->lock</tt>. | ||
| 450 | |||
| 451 | <p>Any number of events can lead up to a CPU invoking | ||
| 452 | <tt>note_gp_changes</tt> (or alternatively, directly invoking | ||
| 453 | <tt>__note_gp_changes()</tt>), at which point that CPU will notice | ||
| 454 | the start of a new grace period while holding its leaf | ||
| 455 | <tt>rcu_node</tt> lock. | ||
| 456 | Therefore, all execution shown in this diagram happens after the | ||
| 457 | start of the grace period. | ||
| 458 | In addition, this CPU will consider any RCU read-side critical | ||
| 459 | section that started before the invocation of <tt>__note_gp_changes()</tt> | ||
| 460 | to have started before the grace period, and thus a critical | ||
| 461 | section that the grace period must wait on. | ||
| 462 | |||
| 463 | <table> | ||
| 464 | <tr><th> </th></tr> | ||
| 465 | <tr><th align="left">Quick Quiz:</th></tr> | ||
| 466 | <tr><td> | ||
| 467 | But a RCU read-side critical section might have started | ||
| 468 | after the beginning of the grace period | ||
| 469 | (the <tt>->gpnum++</tt> from earlier), so why should | ||
| 470 | the grace period wait on such a critical section? | ||
| 471 | </td></tr> | ||
| 472 | <tr><th align="left">Answer:</th></tr> | ||
| 473 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||
| 474 | It is indeed not necessary for the grace period to wait on such | ||
| 475 | a critical section. | ||
| 476 | However, it is permissible to wait on it. | ||
| 477 | And it is furthermore important to wait on it, as this | ||
| 478 | lazy approach is far more scalable than a “big bang” | ||
| 479 | all-at-once grace-period start could possibly be. | ||
| 480 | </font></td></tr> | ||
| 481 | <tr><td> </td></tr> | ||
| 482 | </table> | ||
| 483 | |||
| 484 | <p>If the CPU does a context switch, a quiescent state will be | ||
| 485 | noted by <tt>rcu_node_context_switch()</tt> on the left. | ||
| 486 | On the other hand, if the CPU takes a scheduler-clock interrupt | ||
| 487 | while executing in usermode, a quiescent state will be noted by | ||
| 488 | <tt>rcu_check_callbacks()</tt> on the right. | ||
| 489 | Either way, the passage through a quiescent state will be noted | ||
| 490 | in a per-CPU variable. | ||
| 491 | |||
| 492 | <p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on | ||
| 493 | this CPU (for example, after the next scheduler-clock | ||
| 494 | interrupt), <tt>__rcu_process_callbacks()</tt> will invoke | ||
| 495 | <tt>rcu_check_quiescent_state()</tt>, which will notice the | ||
| 496 | recorded quiescent state, and invoke | ||
| 497 | <tt>rcu_report_qs_rdp()</tt>. | ||
| 498 | If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state | ||
| 499 | really does apply to the current grace period, it invokes | ||
| 500 | <tt>rcu_report_rnp()</tt> which traverses up the <tt>rcu_node</tt> | ||
| 501 | tree as shown at the bottom of the diagram, clearing bits from | ||
| 502 | each <tt>rcu_node</tt> structure's <tt>->qsmask</tt> field, | ||
| 503 | and propagating up the tree when the result is zero. | ||
| 504 | |||
| 505 | <p>Note that traversal passes upwards out of a given <tt>rcu_node</tt> | ||
| 506 | structure only if the current CPU is reporting the last quiescent | ||
| 507 | state for the subtree headed by that <tt>rcu_node</tt> structure. | ||
| 508 | A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt> | ||
| 509 | structure, then there will be a later traversal by another CPU | ||
| 510 | (or perhaps the same one) that proceeds upwards | ||
| 511 | from that point, and the <tt>rcu_node</tt> <tt>->lock</tt> | ||
| 512 | guarantees that the first CPU's quiescent state happens before the | ||
| 513 | remainder of the second CPU's traversal. | ||
| 514 | Applying this line of thought repeatedly shows that all CPUs' | ||
| 515 | quiescent states happen before the last CPU traverses through | ||
| 516 | the root <tt>rcu_node</tt> structure, the “last CPU” | ||
| 517 | being the one that clears the last bit in the root <tt>rcu_node</tt> | ||
| 518 | structure's <tt>->qsmask</tt> field. | ||
| 519 | |||
| 520 | <h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4> | ||
| 521 | |||
| 522 | <p>Due to energy-efficiency considerations, RCU is forbidden from | ||
| 523 | disturbing idle CPUs. | ||
| 524 | CPUs are therefore required to notify RCU when entering or leaving idle | ||
| 525 | state, which they do via fully ordered value-returning atomic operations | ||
| 526 | on a per-CPU variable. | ||
| 527 | The ordering effects are as shown below: | ||
| 528 | |||
| 529 | </p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%"> | ||
| 530 | |||
| 531 | <p>The RCU grace-period kernel thread samples the per-CPU idleness | ||
| 532 | variable while holding the corresponding CPU's leaf <tt>rcu_node</tt> | ||
| 533 | structure's <tt>->lock</tt>. | ||
| 534 | This means that any RCU read-side critical sections that precede the | ||
| 535 | idle period (the oval near the top of the diagram above) will happen | ||
| 536 | before the end of the current grace period. | ||
| 537 | Similarly, the beginning of the current grace period will happen before | ||
| 538 | any RCU read-side critical sections that follow the | ||
| 539 | idle period (the oval near the bottom of the diagram above). | ||
| 540 | |||
| 541 | <p>Plumbing this into the full grace-period execution is described | ||
| 542 | <a href="#Forcing Quiescent States">below</a>. | ||
| 543 | |||
| 544 | <h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4> | ||
| 545 | |||
| 546 | <p>RCU is also forbidden from disturbing offline CPUs, which might well | ||
| 547 | be powered off and removed from the system completely. | ||
| 548 | CPUs are therefore required to notify RCU of their comings and goings | ||
| 549 | as part of the corresponding CPU hotplug operations. | ||
| 550 | The ordering effects are shown below: | ||
| 551 | |||
| 552 | </p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%"> | ||
| 553 | |||
| 554 | <p>Because CPU hotplug operations are much less frequent than idle transitions, | ||
| 555 | they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt> | ||
| 556 | structure's <tt>->lock</tt> and update this structure's | ||
| 557 | <tt>->qsmaskinitnext</tt>. | ||
| 558 | The RCU grace-period kernel thread samples this mask to detect CPUs | ||
| 559 | having gone offline since the beginning of this grace period. | ||
| 560 | |||
| 561 | <p>Plumbing this into the full grace-period execution is described | ||
| 562 | <a href="#Forcing Quiescent States">below</a>. | ||
| 563 | |||
| 564 | <h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4> | ||
| 565 | |||
| 566 | <p>As noted above, idle and offline CPUs cannot report their own | ||
| 567 | quiescent states, and therefore the grace-period kernel thread | ||
| 568 | must do the reporting on their behalf. | ||
| 569 | This process is called “forcing quiescent states”, it is | ||
| 570 | repeated every few jiffies, and its ordering effects are shown below: | ||
| 571 | |||
| 572 | </p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%"> | ||
| 573 | |||
| 574 | <p>Each pass of quiescent state forcing is guaranteed to traverse the | ||
| 575 | leaf <tt>rcu_node</tt> structures, and if there are no new quiescent | ||
| 576 | states due to recently idled and/or offlined CPUs, then only the | ||
| 577 | leaves are traversed. | ||
| 578 | However, if there is a newly offlined CPU as illustrated on the left | ||
| 579 | or a newly idled CPU as illustrated on the right, the corresponding | ||
| 580 | quiescent state will be driven up towards the root. | ||
| 581 | As with self-reported quiescent states, the upwards driving stops | ||
| 582 | once it reaches an <tt>rcu_node</tt> structure that has quiescent | ||
| 583 | states outstanding from other CPUs. | ||
| 584 | |||
| 585 | <table> | ||
| 586 | <tr><th> </th></tr> | ||
| 587 | <tr><th align="left">Quick Quiz:</th></tr> | ||
| 588 | <tr><td> | ||
| 589 | The leftmost drive to root stopped before it reached | ||
| 590 | the root <tt>rcu_node</tt> structure, which means that | ||
| 591 | there are still CPUs subordinate to that structure on | ||
| 592 | which the current grace period is waiting. | ||
| 593 | Given that, how is it possible that the rightmost drive | ||
| 594 | to root ended the grace period? | ||
| 595 | </td></tr> | ||
| 596 | <tr><th align="left">Answer:</th></tr> | ||
| 597 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||
| 598 | Good analysis! | ||
| 599 | It is in fact impossible in the absence of bugs in RCU. | ||
| 600 | But this diagram is complex enough as it is, so simplicity | ||
| 601 | overrode accuracy. | ||
| 602 | You can think of it as poetic license, or you can think of | ||
| 603 | it as misdirection that is resolved in the | ||
| 604 | <a href="#Putting It All Together">stitched-together diagram</a>. | ||
| 605 | </font></td></tr> | ||
| 606 | <tr><td> </td></tr> | ||
| 607 | </table> | ||
| 608 | |||
| 609 | <h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4> | ||
| 610 | |||
| 611 | <p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree | ||
| 612 | breadth-first setting all the <tt>->completed</tt> fields equal | ||
| 613 | to the number of the newly completed grace period, then it sets | ||
| 614 | the <tt>rcu_state</tt> structure's <tt>->completed</tt> field, | ||
| 615 | again to the number of the newly completed grace period. | ||
| 616 | The ordering effects are shown below: | ||
| 617 | |||
| 618 | </p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%"> | ||
| 619 | |||
| 620 | <p>As indicated by the oval at the bottom of the diagram, once | ||
| 621 | grace-period cleanup is complete, the next grace period can begin. | ||
| 622 | |||
| 623 | <table> | ||
| 624 | <tr><th> </th></tr> | ||
| 625 | <tr><th align="left">Quick Quiz:</th></tr> | ||
| 626 | <tr><td> | ||
| 627 | But when precisely does the grace period end? | ||
| 628 | </td></tr> | ||
| 629 | <tr><th align="left">Answer:</th></tr> | ||
| 630 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | ||
| 631 | There is no useful single point at which the grace period | ||
| 632 | can be said to end. | ||
| 633 | The earliest reasonable candidate is as soon as the last | ||
| 634 | CPU has reported its quiescent state, but it may be some | ||
| 635 | milliseconds before RCU becomes aware of this. | ||
| 636 | The latest reasonable candidate is once the <tt>rcu_state</tt> | ||
| 637 | structure's <tt>->completed</tt> field has been updated, | ||
| 638 | but it is quite possible that some CPUs have already completed | ||
| 639 | phase two of their updates by that time. | ||
| 640 | In short, if you are going to work with RCU, you need to | ||
| 641 | learn to embrace uncertainty. | ||
| 642 | </font></td></tr> | ||
| 643 | <tr><td> </td></tr> | ||
| 644 | </table> | ||
| 645 | |||
| 646 | |||
| 647 | <h4><a name="Callback Invocation">Callback Invocation</a></h4> | ||
| 648 | |||
| 649 | <p>Once a given CPU's leaf <tt>rcu_node</tt> structure's | ||
| 650 | <tt>->completed</tt> field has been updated, that CPU can begin | ||
| 651 | invoking its RCU callbacks that were waiting for this grace period | ||
| 652 | to end. | ||
| 653 | These callbacks are identified by <tt>rcu_advance_cbs()</tt>, | ||
| 654 | which is usually invoked by <tt>__note_gp_changes()</tt>. | ||
| 655 | As shown in the diagram below, this invocation can be triggered by | ||
| 656 | the scheduling-clock interrupt (<tt>rcu_check_callbacks()</tt> on | ||
| 657 | the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on | ||
| 658 | the right, but only for kernels build with | ||
| 659 | <tt>CONFIG_RCU_FAST_NO_HZ=y</tt>). | ||
| 660 | Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in | ||
| 661 | <tt>rcu_do_batch()</tt> invoking the callbacks, which in turn | ||
| 662 | allows those callbacks to carry out (either directly or indirectly | ||
| 663 | via wakeup) the needed phase-two processing for each update. | ||
| 664 | |||
| 665 | </p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%"> | ||
| 666 | |||
| 667 | <p>Please note that callback invocation can also be prompted by any | ||
| 668 | number of corner-case code paths, for example, when a CPU notes that | ||
| 669 | it has excessive numbers of callbacks queued. | ||
| 670 | In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's | ||
| 671 | <tt>->lock</tt> before invoking callbacks, which preserves the | ||
| 672 | required ordering against the newly completed grace period. | ||
| 673 | |||
| 674 | <p>However, if the callback function communicates to other CPUs, | ||
| 675 | for example, doing a wakeup, then it is that function's responsibility | ||
| 676 | to maintain ordering. | ||
| 677 | For example, if the callback function wakes up a task that runs on | ||
| 678 | some other CPU, proper ordering must in place in both the callback | ||
| 679 | function and the task being awakened. | ||
| 680 | To see why this is important, consider the top half of the | ||
| 681 | <a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram. | ||
| 682 | The callback might be running on a CPU corresponding to the leftmost | ||
| 683 | leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on | ||
| 684 | a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure, | ||
| 685 | and the grace-period kernel thread might not yet have reached the | ||
| 686 | rightmost leaf. | ||
| 687 | In this case, the grace period's memory ordering might not yet have | ||
| 688 | reached that CPU, so again the callback function and the awakened | ||
| 689 | task must supply proper ordering. | ||
| 690 | |||
| 691 | <h3><a name="Putting It All Together">Putting It All Together</a></h3> | ||
| 692 | |||
| 693 | <p>A stitched-together diagram is | ||
| 694 | <a href="Tree-RCU-Diagram.html">here</a>. | ||
| 695 | |||
| 696 | <h3><a name="Legal Statement"> | ||
| 697 | Legal Statement</a></h3> | ||
| 698 | |||
| 699 | <p>This work represents the view of the author and does not necessarily | ||
| 700 | represent the view of IBM. | ||
| 701 | |||
| 702 | </p><p>Linux is a registered trademark of Linus Torvalds. | ||
| 703 | |||
| 704 | </p><p>Other company, product, and service names may be trademarks or | ||
| 705 | service marks of others. | ||
| 706 | |||
| 707 | </body></html> | ||
