documentation: RCU grace-period memory ordering guarantees

This commit provides text and diagrams showing how Tree RCU implements its grace-period memory ordering guarantees. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Linus Torvalds <torvalds@linux-foundation.org>
author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> 2017-08-07 20:23:20 -0400
committer: Paul E. McKenney <paulmck@linux.vnet.ibm.com> 2017-10-09 17:23:36 -0400
commit: bb7e5ce7dde6f42e7793bd6cf4b1eb71a20145aa (patch)
tree: 9707a2bac694c443b95341fd777d3d287cbbe6c5 /Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
parent: 8a5776a5f49812d29fe4b2d0a2d71675c3facf3f (diff)
1 files changed, 707 insertions, 0 deletions
diff --git a/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
new file mode 100644
index 000000000000..8651b0b4fd79
--- /dev/null
+++ b/Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
@@ -0,0 +1,707 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
+        "http://www.w3.org/TR/html4/loose.dtd">
+        <html>
+        <head><title>A Tour Through TREE_RCU's Grace-Period Memory Ordering</title>
+        <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
+           <p>August 8, 2017</p>
+           <p>This article was contributed by Paul E.&nbsp;McKenney</p>
+<h3>Introduction</h3>
+<p>This document gives a rough visual overview of how Tree RCU's
+grace-period memory ordering guarantee is provided.
+<ol>
+<li>    <a href="#What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
+        What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a>
+<li>    <a href="#Tree RCU Grace Period Memory Ordering Building Blocks">
+        Tree RCU Grace Period Memory Ordering Building Blocks</a>
+<li>    <a href="#Tree RCU Grace Period Memory Ordering Components">
+        Tree RCU Grace Period Memory Ordering Components</a>
+<li>    <a href="#Putting It All Together">Putting It All Together</a>
+</ol>
+<h3><a name="What Is Tree RCU's Grace Period Memory Ordering Guarantee?">
+What Is Tree RCU's Grace Period Memory Ordering Guarantee?</a></h3>
+<p>RCU grace periods provide extremely strong memory-ordering guarantees
+for non-idle non-offline code.
+Any code that happens after the end of a given RCU grace period is guaranteed
+to see the effects of all accesses prior to the beginning of that grace
+period that are within RCU read-side critical sections.
+Similarly, any code that happens before the beginning of a given RCU grace
+period is guaranteed to see the effects of all accesses following the end
+of that grace period that are within RCU read-side critical sections.
+<p>This guarantee is particularly pervasive for <tt>synchronize_sched()</tt>,
+for which RCU-sched read-side critical sections include any region
+of code for which preemption is disabled.
+Given that each individual machine instruction can be thought of as
+an extremely small region of preemption-disabled code, one can think of
+<tt>synchronize_sched()</tt> as <tt>smp_mb()</tt> on steroids.
+<p>RCU updaters use this guarantee by splitting their updates into
+two phases, one of which is executed before the grace period and
+the other of which is executed after the grace period.
+In the most common use case, phase one removes an element from
+a linked RCU-protected data structure, and phase two frees that element.
+For this to work, any readers that have witnessed state prior to the
+phase-one update (in the common case, removal) must not witness state
+following the phase-two update (in the common case, freeing).
+<p>The RCU implementation provides this guarantee using a network
+of lock-based critical sections, memory barriers, and per-CPU
+processing, as is described in the following sections.
+<h3><a name="Tree RCU Grace Period Memory Ordering Building Blocks">
+Tree RCU Grace Period Memory Ordering Building Blocks</a></h3>
+<p>The workhorse for RCU's grace-period memory ordering is the
+critical section for the <tt>rcu_node</tt> structure's
+<tt>-&gt;lock</tt>.
+These critical sections use helper functions for lock acquisition, including
+<tt>raw_spin_lock_rcu_node()</tt>,
+<tt>raw_spin_lock_irq_rcu_node()</tt>, and
+<tt>raw_spin_lock_irqsave_rcu_node()</tt>.
+Their lock-release counterparts are
+<tt>raw_spin_unlock_rcu_node()</tt>,
+<tt>raw_spin_unlock_irq_rcu_node()</tt>, and
+<tt>raw_spin_unlock_irqrestore_rcu_node()</tt>,
+respectively.
+For completeness, a
+<tt>raw_spin_trylock_rcu_node()</tt>
+is also provided.
+The key point is that the lock-acquisition functions, including
+<tt>raw_spin_trylock_rcu_node()</tt>, all invoke
+<tt>smp_mb__after_unlock_lock()</tt> immediately after successful
+acquisition of the lock.
+<p>Therefore, for any given <tt>rcu_node</tt> struction, any access
+happening before one of the above lock-release functions will be seen
+by all CPUs as happening before any access happening after a later
+one of the above lock-acquisition functions.
+Furthermore, any access happening before one of the
+above lock-release function on any given CPU will be seen by all
+CPUs as happening before any access happening after a later one
+of the above lock-acquisition functions executing on that same CPU,
+even if the lock-release and lock-acquisition functions are operating
+on different <tt>rcu_node</tt> structures.
+Tree RCU uses these two ordering guarantees to form an ordering
+network among all CPUs that were in any way involved in the grace
+period, including any CPUs that came online or went offline during
+the grace period in question.
+<p>The following litmus test exhibits the ordering effects of these
+lock-acquisition and lock-release functions:
+<pre>
+ 1 int x, y, z;
+ 2
+ 3 void task0(void)
+ 4 {
+ 5   raw_spin_lock_rcu_node(rnp);
+ 6   WRITE_ONCE(x, 1);
+ 7   r1 = READ_ONCE(y);
+ 8   raw_spin_unlock_rcu_node(rnp);
+ 9 }
+10
+11 void task1(void)
+12 {
+13   raw_spin_lock_rcu_node(rnp);
+14   WRITE_ONCE(y, 1);
+15   r2 = READ_ONCE(z);
+16   raw_spin_unlock_rcu_node(rnp);
+17 }
+18
+19 void task2(void)
+20 {
+21   WRITE_ONCE(z, 1);
+22   smp_mb();
+23   r3 = READ_ONCE(x);
+24 }
+25
+26 WARN_ON(r1 == 0 &amp;&amp; r2 == 0 &amp;&amp; r3 == 0);
+</pre>
+<p>The <tt>WARN_ON()</tt> is evaluated at &ldquo;the end of time&rdquo;,
+after all changes have propagated throughout the system.
+Without the <tt>smp_mb__after_unlock_lock()</tt> provided by the
+acquisition functions, this <tt>WARN_ON()</tt> could trigger, for example
+on PowerPC.
+The <tt>smp_mb__after_unlock_lock()</tt> invocations prevent this
+<tt>WARN_ON()</tt> from triggering.
+<p>This approach must be extended to include idle CPUs, which need
+RCU's grace-period memory ordering guarantee to extend to any
+RCU read-side critical sections preceding and following the current
+idle sojourn.
+This case is handled by calls to the strongly ordered
+<tt>atomic_add_return()</tt> read-modify-write atomic operation that
+is invoked within <tt>rcu_dynticks_eqs_enter()</tt> at idle-entry
+time and within <tt>rcu_dynticks_eqs_exit()</tt> at idle-exit time.
+The grace-period kthread invokes <tt>rcu_dynticks_snap()</tt> and
+<tt>rcu_dynticks_in_eqs_since()</tt> (both of which invoke
+an <tt>atomic_add_return()</tt> of zero) to detect idle CPUs.
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+        But what about CPUs that remain offline for the entire
+        grace period?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+        Such CPUs will be offline at the beginning of the grace period,
+        so the grace period won't expect quiescent states from them.
+        Races between grace-period start and CPU-hotplug operations
+        are mediated by the CPU's leaf <tt>rcu_node</tt> structure's
+        <tt>-&gt;lock</tt> as described above.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+<p>The approach must be extended to handle one final case, that
+of waking a task blocked in <tt>synchronize_rcu()</tt>.
+This task might be affinitied to a CPU that is not yet aware that
+the grace period has ended, and thus might not yet be subject to
+the grace period's memory ordering.
+Therefore, there is an <tt>smp_mb()</tt> after the return from
+<tt>wait_for_completion()</tt> in the <tt>synchronize_rcu()</tt>
+code path.
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+        What?  Where???
+        I don't see any <tt>smp_mb()</tt> after the return from
+        <tt>wait_for_completion()</tt>!!!
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+        That would be because I spotted the need for that
+        <tt>smp_mb()</tt> during the creation of this documentation,
+        and it is therefore unlikely to hit mainline before v4.14.
+        Kudos to Lance Roy, Will Deacon, Peter Zijlstra, and
+        Jonathan Cameron for asking questions that sensitized me
+        to the rather elaborate sequence of events that demonstrate
+        the need for this memory barrier.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+<p>Tree RCU's grace--period memory-ordering guarantees rely most
+heavily on the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
+field, so much so that it is necessary to abbreviate this pattern
+in the diagrams in the next section.
+For example, consider the <tt>rcu_prepare_for_idle()</tt> function
+shown below, which is one of several functions that enforce ordering
+of newly arrived RCU callbacks against future grace periods:
+<pre>
+ 1 static void rcu_prepare_for_idle(void)
+ 2 {
+ 3   bool needwake;
+ 4   struct rcu_data *rdp;
+ 5   struct rcu_dynticks *rdtp = this_cpu_ptr(&amp;rcu_dynticks);
+ 6   struct rcu_node *rnp;
+ 7   struct rcu_state *rsp;
+ 8   int tne;
+ 9
+10   if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
+11       rcu_is_nocb_cpu(smp_processor_id()))
+12     return;
+13   tne = READ_ONCE(tick_nohz_active);
+14   if (tne != rdtp-&gt;tick_nohz_enabled_snap) {
+15     if (rcu_cpu_has_callbacks(NULL))
+16       invoke_rcu_core();
+17     rdtp-&gt;tick_nohz_enabled_snap = tne;
+18     return;
+19   }
+20   if (!tne)
+21     return;
+22   if (rdtp-&gt;all_lazy &amp;&amp;
+23       rdtp-&gt;nonlazy_posted != rdtp-&gt;nonlazy_posted_snap) {
+24     rdtp-&gt;all_lazy = false;
+25     rdtp-&gt;nonlazy_posted_snap = rdtp-&gt;nonlazy_posted;
+26     invoke_rcu_core();
+27     return;
+28   }
+29   if (rdtp-&gt;last_accelerate == jiffies)
+30     return;
+31   rdtp-&gt;last_accelerate = jiffies;
+32   for_each_rcu_flavor(rsp) {
+33     rdp = this_cpu_ptr(rsp-&gt;rda);
+34     if (rcu_segcblist_pend_cbs(&amp;rdp-&gt;cblist))
+35       continue;
+36     rnp = rdp-&gt;mynode;
+37     raw_spin_lock_rcu_node(rnp);
+38     needwake = rcu_accelerate_cbs(rsp, rnp, rdp);
+39     raw_spin_unlock_rcu_node(rnp);
+40     if (needwake)
+41       rcu_gp_kthread_wake(rsp);
+42   }
+43 }
+</pre>
+<p>But the only part of <tt>rcu_prepare_for_idle()</tt> that really
+matters for this discussion are lines&nbsp;37&ndash;39.
+We will therefore abbreviate this function as follows:
+</p><p><img src="rcu_node-lock.svg" alt="rcu_node-lock.svg">
+<p>The box represents the <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>
+critical section, with the double line on top representing the additional
+<tt>smp_mb__after_unlock_lock()</tt>.
+<h3><a name="Tree RCU Grace Period Memory Ordering Components">
+Tree RCU Grace Period Memory Ordering Components</a></h3>
+<p>Tree RCU's grace-period memory-ordering guarantee is provided by
+a number of RCU components:
+<ol>
+<li>    <a href="#Callback Registry">Callback Registry</a>
+<li>    <a href="#Grace-Period Initialization">Grace-Period Initialization</a>
+<li>    <a href="#Self-Reported Quiescent States">
+        Self-Reported Quiescent States</a>
+<li>    <a href="#Dynamic Tick Interface">Dynamic Tick Interface</a>
+<li>    <a href="#CPU-Hotplug Interface">CPU-Hotplug Interface</a>
+<li>    <a href="Forcing Quiescent States">Forcing Quiescent States</a>
+<li>    <a href="Grace-Period Cleanup">Grace-Period Cleanup</a>
+<li>    <a href="Callback Invocation">Callback Invocation</a>
+</ol>
+<p>Each of the following section looks at the corresponding component
+in detail.
+<h4><a name="Callback Registry">Callback Registry</a></h4>
+<p>If RCU's grace-period guarantee is to mean anything at all, any
+access that happens before a given invocation of <tt>call_rcu()</tt>
+must also happen before the corresponding grace period.
+The implementation of this portion of RCU's grace period guarantee
+is shown in the following figure:
+</p><p><img src="TreeRCU-callback-registry.svg" alt="TreeRCU-callback-registry.svg">
+<p>Because <tt>call_rcu()</tt> normally acts only on CPU-local state,
+it provides no ordering guarantees, either for itself or for
+phase one of the update (which again will usually be removal of
+an element from an RCU-protected data structure).
+It simply enqueues the <tt>rcu_head</tt> structure on a per-CPU list,
+which cannot become associated with a grace period until a later
+call to <tt>rcu_accelerate_cbs()</tt>, as shown in the diagram above.
+<p>One set of code paths shown on the left invokes
+<tt>rcu_accelerate_cbs()</tt> via
+<tt>note_gp_changes()</tt>, either directly from <tt>call_rcu()</tt> (if
+the current CPU is inundated with queued <tt>rcu_head</tt> structures)
+or more likely from an <tt>RCU_SOFTIRQ</tt> handler.
+Another code path in the middle is taken only in kernels built with
+<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>, which invokes
+<tt>rcu_accelerate_cbs()</tt> via <tt>rcu_prepare_for_idle()</tt>.
+The final code path on the right is taken only in kernels built with
+<tt>CONFIG_HOTPLUG_CPU=y</tt>, which invokes
+<tt>rcu_accelerate_cbs()</tt> via
+<tt>rcu_advance_cbs()</tt>, <tt>rcu_migrate_callbacks</tt>,
+<tt>rcutree_migrate_callbacks()</tt>, and <tt>takedown_cpu()</tt>,
+which in turn is invoked on a surviving CPU after the outgoing
+CPU has been completely offlined.
+<p>There are a few other code paths within grace-period processing
+that opportunistically invoke <tt>rcu_accelerate_cbs()</tt>.
+However, either way, all of the CPU's recently queued <tt>rcu_head</tt>
+structures are associated with a future grace-period number under
+the protection of the CPU's lead <tt>rcu_node</tt> structure's
+<tt>-&gt;lock</tt>.
+In all cases, there is full ordering against any prior critical section
+for that same <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>, and
+also full ordering against any of the current task's or CPU's prior critical
+sections for any <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.
+<p>The next section will show how this ordering ensures that any
+accesses prior to the <tt>call_rcu()</tt> (particularly including phase
+one of the update)
+happen before the start of the corresponding grace period.
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+        But what about <tt>synchronize_rcu()</tt>?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+        The <tt>synchronize_rcu()</tt> passes <tt>call_rcu()</tt>
+        to <tt>wait_rcu_gp()</tt>, which invokes it.
+        So either way, it eventually comes down to <tt>call_rcu()</tt>.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+<h4><a name="Grace-Period Initialization">Grace-Period Initialization</a></h4>
+<p>Grace-period initialization is carried out by
+the grace-period kernel thread, which makes several passes over the
+<tt>rcu_node</tt> tree within the <tt>rcu_gp_init()</tt> function.
+This means that showing the full flow of ordering through the
+grace-period computation will require duplicating this tree.
+If you find this confusing, please note that the state of the
+<tt>rcu_node</tt> changes over time, just like Heraclitus's river.
+However, to keep the <tt>rcu_node</tt> river tractable, the
+grace-period kernel thread's traversals are presented in multiple
+parts, starting in this section with the various phases of
+grace-period initialization.
+<p>The first ordering-related grace-period initialization action is to
+increment the <tt>rcu_state</tt> structure's <tt>-&gt;gpnum</tt>
+grace-period-number counter, as shown below:
+</p><p><img src="TreeRCU-gp-init-1.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
+<p>The actual increment is carried out using <tt>smp_store_release()</tt>,
+which helps reject false-positive RCU CPU stall detection.
+Note that only the root <tt>rcu_node</tt> structure is touched.
+<p>The first pass through the <tt>rcu_node</tt> tree updates bitmasks
+based on CPUs having come online or gone offline since the start of
+the previous grace period.
+In the common case where the number of online CPUs for this <tt>rcu_node</tt>
+structure has not transitioned to or from zero,
+this pass will scan only the leaf <tt>rcu_node</tt> structures.
+However, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
+structure has transitioned from zero,
+<tt>rcu_init_new_rnp()</tt> will be invoked for the first incoming CPU.
+Similarly, if the number of online CPUs for a given leaf <tt>rcu_node</tt>
+structure has transitioned to zero,
+<tt>rcu_cleanup_dead_rnp()</tt> will be invoked for the last outgoing CPU.
+The diagram below shows the path of ordering if the leftmost
+<tt>rcu_node</tt> structure onlines its first CPU and if the next
+<tt>rcu_node</tt> structure has no online CPUs
+(or, alternatively if the leftmost <tt>rcu_node</tt> structure offlines
+its last CPU and if the next <tt>rcu_node</tt> structure has no online CPUs).
+</p><p><img src="TreeRCU-gp-init-2.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
+<p>The final <tt>rcu_gp_init()</tt> pass through the <tt>rcu_node</tt>
+tree traverses breadth-first, setting each <tt>rcu_node</tt> structure's
+<tt>-&gt;gpnum</tt> field to the newly incremented value from the
+<tt>rcu_state</tt> structure, as shown in the following diagram.
+</p><p><img src="TreeRCU-gp-init-3.svg" alt="TreeRCU-gp-init-1.svg" width="75%">
+<p>This change will also cause each CPU's next call to
+<tt>__note_gp_changes()</tt>
+to notice that a new grace period has started, as described in the next
+section.
+But because the grace-period kthread started the grace period at the
+root (with the increment of the <tt>rcu_state</tt> structure's
+<tt>-&gt;gpnum</tt> field) before setting each leaf <tt>rcu_node</tt>
+structure's <tt>-&gt;gpnum</tt> field, each CPU's observation of
+the start of the grace period will happen after the actual start
+of the grace period.
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+        But what about the CPU that started the grace period?
+        Why wouldn't it see the start of the grace period right when
+        it started that grace period?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+        In some deep philosophical and overly anthromorphized
+        sense, yes, the CPU starting the grace period is immediately
+        aware of having done so.
+        However, if we instead assume that RCU is not self-aware,
+        then even the CPU starting the grace period does not really
+        become aware of the start of this grace period until its
+        first call to <tt>__note_gp_changes()</tt>.
+        On the other hand, this CPU potentially gets early notification
+        because it invokes <tt>__note_gp_changes()</tt> during its
+        last <tt>rcu_gp_init()</tt> pass through its leaf
+        <tt>rcu_node</tt> structure.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+<h4><a name="Self-Reported Quiescent States">
+Self-Reported Quiescent States</a></h4>
+<p>When all entities that might block the grace period have reported
+quiescent states (or as described in a later section, had quiescent
+states reported on their behalf), the grace period can end.
+Online non-idle CPUs report their own quiescent states, as shown
+in the following diagram:
+</p><p><img src="TreeRCU-qs.svg" alt="TreeRCU-qs.svg" width="75%">
+<p>This is for the last CPU to report a quiescent state, which signals
+the end of the grace period.
+Earlier quiescent states would push up the <tt>rcu_node</tt> tree
+only until they encountered an <tt>rcu_node</tt> structure that
+is waiting for additional quiescent states.
+However, ordering is nevertheless preserved because some later quiescent
+state will acquire that <tt>rcu_node</tt> structure's <tt>-&gt;lock</tt>.
+<p>Any number of events can lead up to a CPU invoking
+<tt>note_gp_changes</tt> (or alternatively, directly invoking
+<tt>__note_gp_changes()</tt>), at which point that CPU will notice
+the start of a new grace period while holding its leaf
+<tt>rcu_node</tt> lock.
+Therefore, all execution shown in this diagram happens after the
+start of the grace period.
+In addition, this CPU will consider any RCU read-side critical
+section that started before the invocation of <tt>__note_gp_changes()</tt>
+to have started before the grace period, and thus a critical
+section that the grace period must wait on.
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+        But a RCU read-side critical section might have started
+        after the beginning of the grace period
+        (the <tt>-&gt;gpnum++</tt> from earlier), so why should
+        the grace period wait on such a critical section?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+        It is indeed not necessary for the grace period to wait on such
+        a critical section.
+        However, it is permissible to wait on it.
+        And it is furthermore important to wait on it, as this
+        lazy approach is far more scalable than a &ldquo;big bang&rdquo;
+        all-at-once grace-period start could possibly be.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+<p>If the CPU does a context switch, a quiescent state will be
+noted by <tt>rcu_node_context_switch()</tt> on the left.
+On the other hand, if the CPU takes a scheduler-clock interrupt
+while executing in usermode, a quiescent state will be noted by
+<tt>rcu_check_callbacks()</tt> on the right.
+Either way, the passage through a quiescent state will be noted
+in a per-CPU variable.
+<p>The next time an <tt>RCU_SOFTIRQ</tt> handler executes on
+this CPU (for example, after the next scheduler-clock
+interrupt), <tt>__rcu_process_callbacks()</tt> will invoke
+<tt>rcu_check_quiescent_state()</tt>, which will notice the
+recorded quiescent state, and invoke
+<tt>rcu_report_qs_rdp()</tt>.
+If <tt>rcu_report_qs_rdp()</tt> verifies that the quiescent state
+really does apply to the current grace period, it invokes
+<tt>rcu_report_rnp()</tt> which traverses up the <tt>rcu_node</tt>
+tree as shown at the bottom of the diagram, clearing bits from
+each <tt>rcu_node</tt> structure's <tt>-&gt;qsmask</tt> field,
+and propagating up the tree when the result is zero.
+<p>Note that traversal passes upwards out of a given <tt>rcu_node</tt>
+structure only if the current CPU is reporting the last quiescent
+state for the subtree headed by that <tt>rcu_node</tt> structure.
+A key point is that if a CPU's traversal stops at a given <tt>rcu_node</tt>
+structure, then there will be a later traversal by another CPU
+(or perhaps the same one) that proceeds upwards
+from that point, and the <tt>rcu_node</tt> <tt>-&gt;lock</tt>
+guarantees that the first CPU's quiescent state happens before the
+remainder of the second CPU's traversal.
+Applying this line of thought repeatedly shows that all CPUs'
+quiescent states happen before the last CPU traverses through
+the root <tt>rcu_node</tt> structure, the &ldquo;last CPU&rdquo;
+being the one that clears the last bit in the root <tt>rcu_node</tt>
+structure's <tt>-&gt;qsmask</tt> field.
+<h4><a name="Dynamic Tick Interface">Dynamic Tick Interface</a></h4>
+<p>Due to energy-efficiency considerations, RCU is forbidden from
+disturbing idle CPUs.
+CPUs are therefore required to notify RCU when entering or leaving idle
+state, which they do via fully ordered value-returning atomic operations
+on a per-CPU variable.
+The ordering effects are as shown below:
+</p><p><img src="TreeRCU-dyntick.svg" alt="TreeRCU-dyntick.svg" width="50%">
+<p>The RCU grace-period kernel thread samples the per-CPU idleness
+variable while holding the corresponding CPU's leaf <tt>rcu_node</tt>
+structure's <tt>-&gt;lock</tt>.
+This means that any RCU read-side critical sections that precede the
+idle period (the oval near the top of the diagram above) will happen
+before the end of the current grace period.
+Similarly, the beginning of the current grace period will happen before
+any RCU read-side critical sections that follow the
+idle period (the oval near the bottom of the diagram above).
+<p>Plumbing this into the full grace-period execution is described
+<a href="#Forcing Quiescent States">below</a>.
+<h4><a name="CPU-Hotplug Interface">CPU-Hotplug Interface</a></h4>
+<p>RCU is also forbidden from disturbing offline CPUs, which might well
+be powered off and removed from the system completely.
+CPUs are therefore required to notify RCU of their comings and goings
+as part of the corresponding CPU hotplug operations.
+The ordering effects are shown below:
+</p><p><img src="TreeRCU-hotplug.svg" alt="TreeRCU-hotplug.svg" width="50%">
+<p>Because CPU hotplug operations are much less frequent than idle transitions,
+they are heavier weight, and thus acquire the CPU's leaf <tt>rcu_node</tt>
+structure's <tt>-&gt;lock</tt> and update this structure's
+<tt>-&gt;qsmaskinitnext</tt>.
+The RCU grace-period kernel thread samples this mask to detect CPUs
+having gone offline since the beginning of this grace period.
+<p>Plumbing this into the full grace-period execution is described
+<a href="#Forcing Quiescent States">below</a>.
+<h4><a name="Forcing Quiescent States">Forcing Quiescent States</a></h4>
+<p>As noted above, idle and offline CPUs cannot report their own
+quiescent states, and therefore the grace-period kernel thread
+must do the reporting on their behalf.
+This process is called &ldquo;forcing quiescent states&rdquo;, it is
+repeated every few jiffies, and its ordering effects are shown below:
+</p><p><img src="TreeRCU-gp-fqs.svg" alt="TreeRCU-gp-fqs.svg" width="100%">
+<p>Each pass of quiescent state forcing is guaranteed to traverse the
+leaf <tt>rcu_node</tt> structures, and if there are no new quiescent
+states due to recently idled and/or offlined CPUs, then only the
+leaves are traversed.
+However, if there is a newly offlined CPU as illustrated on the left
+or a newly idled CPU as illustrated on the right, the corresponding
+quiescent state will be driven up towards the root.
+As with self-reported quiescent states, the upwards driving stops
+once it reaches an <tt>rcu_node</tt> structure that has quiescent
+states outstanding from other CPUs.
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+        The leftmost drive to root stopped before it reached
+        the root <tt>rcu_node</tt> structure, which means that
+        there are still CPUs subordinate to that structure on
+        which the current grace period is waiting.
+        Given that, how is it possible that the rightmost drive
+        to root ended the grace period?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+        Good analysis!
+        It is in fact impossible in the absence of bugs in RCU.
+        But this diagram is complex enough as it is, so simplicity
+        overrode accuracy.
+        You can think of it as poetic license, or you can think of
+        it as misdirection that is resolved in the
+        <a href="#Putting It All Together">stitched-together diagram</a>.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+<h4><a name="Grace-Period Cleanup">Grace-Period Cleanup</a></h4>
+<p>Grace-period cleanup first scans the <tt>rcu_node</tt> tree
+breadth-first setting all the <tt>-&gt;completed</tt> fields equal
+to the number of the newly completed grace period, then it sets
+the <tt>rcu_state</tt> structure's <tt>-&gt;completed</tt> field,
+again to the number of the newly completed grace period.
+The ordering effects are shown below:
+</p><p><img src="TreeRCU-gp-cleanup.svg" alt="TreeRCU-gp-cleanup.svg" width="75%">
+<p>As indicated by the oval at the bottom of the diagram, once
+grace-period cleanup is complete, the next grace period can begin.
+<table>
+<tr><th>&nbsp;</th></tr>
+<tr><th align="left">Quick Quiz:</th></tr>
+<tr><td>
+        But when precisely does the grace period end?
+</td></tr>
+<tr><th align="left">Answer:</th></tr>
+<tr><td bgcolor="#ffffff"><font color="ffffff">
+        There is no useful single point at which the grace period
+        can be said to end.
+        The earliest reasonable candidate is as soon as the last
+        CPU has reported its quiescent state, but it may be some
+        milliseconds before RCU becomes aware of this.
+        The latest reasonable candidate is once the <tt>rcu_state</tt>
+        structure's <tt>-&gt;completed</tt> field has been updated,
+        but it is quite possible that some CPUs have already completed
+        phase two of their updates by that time.
+        In short, if you are going to work with RCU, you need to
+        learn to embrace uncertainty.
+</font></td></tr>
+<tr><td>&nbsp;</td></tr>
+</table>
+<h4><a name="Callback Invocation">Callback Invocation</a></h4>
+<p>Once a given CPU's leaf <tt>rcu_node</tt> structure's
+<tt>-&gt;completed</tt> field has been updated, that CPU can begin
+invoking its RCU callbacks that were waiting for this grace period
+to end.
+These callbacks are identified by <tt>rcu_advance_cbs()</tt>,
+which is usually invoked by <tt>__note_gp_changes()</tt>.
+As shown in the diagram below, this invocation can be triggered by
+the scheduling-clock interrupt (<tt>rcu_check_callbacks()</tt> on
+the left) or by idle entry (<tt>rcu_cleanup_after_idle()</tt> on
+the right, but only for kernels build with
+<tt>CONFIG_RCU_FAST_NO_HZ=y</tt>).
+Either way, <tt>RCU_SOFTIRQ</tt> is raised, which results in
+<tt>rcu_do_batch()</tt> invoking the callbacks, which in turn
+allows those callbacks to carry out (either directly or indirectly
+via wakeup) the needed phase-two processing for each update.
+</p><p><img src="TreeRCU-callback-invocation.svg" alt="TreeRCU-callback-invocation.svg" width="60%">
+<p>Please note that callback invocation can also be prompted by any
+number of corner-case code paths, for example, when a CPU notes that
+it has excessive numbers of callbacks queued.
+In all cases, the CPU acquires its leaf <tt>rcu_node</tt> structure's
+<tt>-&gt;lock</tt> before invoking callbacks, which preserves the
+required ordering against the newly completed grace period.
+<p>However, if the callback function communicates to other CPUs,
+for example, doing a wakeup, then it is that function's responsibility
+to maintain ordering.
+For example, if the callback function wakes up a task that runs on
+some other CPU, proper ordering must in place in both the callback
+function and the task being awakened.
+To see why this is important, consider the top half of the
+<a href="#Grace-Period Cleanup">grace-period cleanup</a> diagram.
+The callback might be running on a CPU corresponding to the leftmost
+leaf <tt>rcu_node</tt> structure, and awaken a task that is to run on
+a CPU corresponding to the rightmost leaf <tt>rcu_node</tt> structure,
+and the grace-period kernel thread might not yet have reached the
+rightmost leaf.
+In this case, the grace period's memory ordering might not yet have
+reached that CPU, so again the callback function and the awakened
+task must supply proper ordering.
+<h3><a name="Putting It All Together">Putting It All Together</a></h3>
+<p>A stitched-together diagram is
+<a href="Tree-RCU-Diagram.html">here</a>.
+<h3><a name="Legal Statement">
+Legal Statement</a></h3>
+<p>This work represents the view of the author and does not necessarily
+represent the view of IBM.
+</p><p>Linux is a registered trademark of Linus Torvalds.
+</p><p>Other company, product, and service names may be trademarks or
+service marks of others.
+</body></html>
author	Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2017-08-07 20:23:20 -0400
committer	Paul E. McKenney <paulmck@linux.vnet.ibm.com>	2017-10-09 17:23:36 -0400
commit	bb7e5ce7dde6f42e7793bd6cf4b1eb71a20145aa (patch)
tree	9707a2bac694c443b95341fd777d3d287cbbe6c5 /Documentation/RCU/Design/Memory-Ordering/Tree-RCU-Memory-Ordering.html
parent	8a5776a5f49812d29fe4b2d0a2d71675c3facf3f (diff)