Document RCU and unloadable modules

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> 2008-11-13 21:11:52 -0500
committer: Jonathan Corbet <corbet@lwn.net> 2008-12-03 17:58:01 -0500
commit: 1c12757c56b4c9ab5aab1f6c1248ae4ea8af3a01 (patch)
tree: 1a501e8c8bea09ce85a080af6c28da42d021fd80
parent: 061e41fdb5047b1fb161e89664057835935ca1d2 (diff)
2 files changed, 306 insertions, 0 deletions
diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index 461481dfb7c3..0f2a8d081681 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -12,6 +12,8 @@ rcuref.txt
        - Reference-count design for elements of lists/arrays protected by RCU
 rcu.txt
        - RCU Concepts
+rcubarrier.txt
+        - Unloading modules that use RCU callbacks
 RTFP.txt
        - List of RCU papers (bibliography) going back to 1980.
 torture.txt
diff --git a/Documentation/RCU/rcubarrier.txt b/Documentation/RCU/rcubarrier.txt
new file mode 100644
index 000000000000..909602d409bb
--- /dev/null
+++ b/Documentation/RCU/rcubarrier.txt
@@ -0,0 +1,304 @@
+RCU and Unloadable Modules
+
+[Originally published in LWN Jan. 14, 2007: http://lwn.net/Articles/217484/]
+
+RCU (read-copy update) is a synchronization mechanism that can be thought
+of as a replacement for reader-writer locking (among other things), but with
+very low-overhead readers that are immune to deadlock, priority inversion,
+and unbounded latency. RCU read-side critical sections are delimited
+by rcu_read_lock() and rcu_read_unlock(), which, in non-CONFIG_PREEMPT
+kernels, generate no code whatsoever.
+This means that RCU writers are unaware of the presence of concurrent
+readers, so that RCU updates to shared data must be undertaken quite
+carefully, leaving an old version of the data structure in place until all
+pre-existing readers have finished. These old versions are needed because
+such readers might hold a reference to them. RCU updates can therefore be
+rather expensive, and RCU is thus best suited for read-mostly situations.
+How can an RCU writer possibly determine when all readers are finished,
+given that readers might well leave absolutely no trace of their
+presence? There is a synchronize_rcu() primitive that blocks until all
+pre-existing readers have completed. An updater wishing to delete an
+element p from a linked list might do the following, while holding an
+appropriate lock, of course:
+        list_del_rcu(p);
+        synchronize_rcu();
+        kfree(p);
+But the above code cannot be used in IRQ context -- the call_rcu()
+primitive must be used instead. This primitive takes a pointer to an
+rcu_head struct placed within the RCU-protected data structure and
+another pointer to a function that may be invoked later to free that
+structure. Code to delete an element p from the linked list from IRQ
+context might then be as follows:
+        list_del_rcu(p);
+        call_rcu(&p->rcu, p_callback);
+Since call_rcu() never blocks, this code can safely be used from within
+IRQ context. The function p_callback() might be defined as follows:
+        static void p_callback(struct rcu_head *rp)
+        {
+                struct pstruct *p = container_of(rp, struct pstruct, rcu);
+                kfree(p);
+        }
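+
+The container_of() step in p_callback() can be tried outside the kernel.
+The following is a minimal userspace sketch, not kernel code: struct
+rcu_head and container_of() are simplified stand-ins, the key field of
+struct pstruct and the demo_p_callback() helper are hypothetical, and
+free() takes the place of kfree().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct rcu_head (assumption). */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

/* Simplified container_of(): subtract the member's offset from its address. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical RCU-protected element; the field layout is illustrative. */
struct pstruct {
	int key;
	struct rcu_head rcu;
};

static int p_freed;	/* counts callback invocations, for the demo only */

static void p_callback(struct rcu_head *rp)
{
	struct pstruct *p = container_of(rp, struct pstruct, rcu);

	p_freed++;
	free(p);	/* kfree(p) in the kernel */
}

/* Returns 1 if the callback recovered and freed the enclosing element. */
int demo_p_callback(void)
{
	struct pstruct *p = malloc(sizeof(*p));

	if (p == NULL)
		return 0;
	p_freed = 0;
	p->key = 42;
	/* The embedded rcu_head maps back to the element that contains it. */
	assert(container_of(&p->rcu, struct pstruct, rcu) == p);
	p_callback(&p->rcu);	/* what RCU would do after a grace period */
	return p_freed == 1;
}
```

+Calling demo_p_callback() from a test program's main() returns 1 once the
+callback has located the enclosing element and freed it.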
+
+
+Unloading Modules That Use call_rcu()
+
+But what if p_callback is defined in an unloadable module?
+If we unload the module while some RCU callbacks are pending,
+the CPUs executing these callbacks are going to be severely
+disappointed when they are later invoked, as fancifully depicted at
+http://lwn.net/images/ns/kernel/rcu-drop.jpg.
+We could try placing a synchronize_rcu() in the module-exit code path,
+but this is not sufficient. Although synchronize_rcu() does wait for a
+grace period to elapse, it does not wait for the callbacks to complete.
+One might be tempted to try several back-to-back synchronize_rcu()
+calls, but this is still not guaranteed to work. If there is a very
+heavy RCU-callback load, then some of the callbacks might be deferred
+in order to allow other processing to proceed. Such deferral is required
+in realtime kernels in order to avoid excessive scheduling latencies.
+
+
+rcu_barrier()
+
+We instead need the rcu_barrier() primitive. This primitive is similar
+to synchronize_rcu(), but instead of waiting solely for a grace
+period to elapse, it also waits for all outstanding RCU callbacks to
+complete. Pseudo-code using rcu_barrier() is as follows:
+   1. Prevent any new RCU callbacks from being posted.
+   2. Execute rcu_barrier().
+   3. Allow the module to be unloaded.
+Quick Quiz #1: Why is there no srcu_barrier()?
+The rcutorture module makes use of rcu_barrier() in its exit function
+as follows:
+ 1 static void
+ 2 rcu_torture_cleanup(void)
+ 3 {
+ 4   int i;
+ 5
+ 6   fullstop = 1;
+ 7   if (shuffler_task != NULL) {
+ 8     VERBOSE_PRINTK_STRING("Stopping rcu_torture_shuffle task");
+ 9     kthread_stop(shuffler_task);
+10   }
+11   shuffler_task = NULL;
+12
+13   if (writer_task != NULL) {
+14     VERBOSE_PRINTK_STRING("Stopping rcu_torture_writer task");
+15     kthread_stop(writer_task);
+16   }
+17   writer_task = NULL;
+18
+19   if (reader_tasks != NULL) {
+20     for (i = 0; i < nrealreaders; i++) {
+21       if (reader_tasks[i] != NULL) {
+22         VERBOSE_PRINTK_STRING(
+23           "Stopping rcu_torture_reader task");
+24         kthread_stop(reader_tasks[i]);
+25       }
+26       reader_tasks[i] = NULL;
+27     }
+28     kfree(reader_tasks);
+29     reader_tasks = NULL;
+30   }
+31   rcu_torture_current = NULL;
+32
+33   if (fakewriter_tasks != NULL) {
+34     for (i = 0; i < nfakewriters; i++) {
+35       if (fakewriter_tasks[i] != NULL) {
+36         VERBOSE_PRINTK_STRING(
+37           "Stopping rcu_torture_fakewriter task");
+38         kthread_stop(fakewriter_tasks[i]);
+39       }
+40       fakewriter_tasks[i] = NULL;
+41     }
+42     kfree(fakewriter_tasks);
+43     fakewriter_tasks = NULL;
+44   }
+45
+46   if (stats_task != NULL) {
+47     VERBOSE_PRINTK_STRING("Stopping rcu_torture_stats task");
+48     kthread_stop(stats_task);
+49   }
+50   stats_task = NULL;
+51
+52   /* Wait for all RCU callbacks to fire. */
+53   rcu_barrier();
+54
+55   rcu_torture_stats_print(); /* -After- the stats thread is stopped! */
+56
+57   if (cur_ops->cleanup != NULL)
+58     cur_ops->cleanup();
+59   if (atomic_read(&n_rcu_torture_error))
+60     rcu_torture_print_module_parms("End of test: FAILURE");
+61   else
+62     rcu_torture_print_module_parms("End of test: SUCCESS");
+63 }
+Line 6 sets a global variable that prevents any RCU callbacks from
+re-posting themselves. This will not be necessary in most cases, since
+RCU callbacks rarely include calls to call_rcu(). However, the rcutorture
+module is an exception to this rule, and therefore needs to set this
+global variable.
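+
+The effect of such a flag can be illustrated with a small userspace
+simulation. This is only a sketch under stated assumptions: the
+one-element pending list, sim_call_rcu(), sim_drain(), and
+demo_fullstop() are hypothetical stand-ins and are not taken from
+rcutorture. It shows that a self-reposting callback never leaves the
+queue until the flag is set.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct rcu_head (assumption). */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

static struct rcu_head *pending;	/* simulated list of queued callbacks */
static int fullstop;			/* set to forbid further re-posting */
static int invocations;			/* callback invocations so far */

/* Queue a callback; a simple list is enough for one element. */
static void sim_call_rcu(struct rcu_head *head,
			 void (*func)(struct rcu_head *head))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* A callback that re-posts itself unless fullstop has been set. */
static void reposting_callback(struct rcu_head *head)
{
	invocations++;
	if (!fullstop)
		sim_call_rcu(head, reposting_callback);
}

/* Invoke up to max callbacks; return 1 if any callback is still queued. */
static int sim_drain(int max)
{
	while (pending != NULL && max-- > 0) {
		struct rcu_head *head = pending;

		pending = head->next;
		head->func(head);
	}
	return pending != NULL;
}

/* Returns 1 if the queue only drains after fullstop is set. */
int demo_fullstop(void)
{
	static struct rcu_head rh;
	int busy_before, busy_after;

	pending = NULL;
	fullstop = 0;
	invocations = 0;
	sim_call_rcu(&rh, reposting_callback);
	busy_before = sim_drain(100);	/* the callback keeps re-posting */
	fullstop = 1;
	busy_after = sim_drain(100);	/* one last invocation, then empty */
	assert(invocations == 101);
	return busy_before == 1 && busy_after == 0;
}
```

+In this simulation, waiting for the queue to empty before setting the
+flag would never finish, which is why the flag must be set first.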
+Lines 7-50 stop all the kernel tasks associated with the rcutorture
+module. Therefore, once execution reaches line 53, no more rcutorture
+RCU callbacks will be posted. The rcu_barrier() call on line 53 waits
+for any pre-existing callbacks to complete.
+Then lines 55-62 print status and do operation-specific cleanup, and
+then return, permitting the module-unload operation to be completed.
+Quick Quiz #2: Is there any other situation where rcu_barrier() might
+        be required?
+Your module might have additional complications. For example, if your
+module invokes call_rcu() from timers, you will need to first cancel all
+the timers, and only then invoke rcu_barrier() to wait for any remaining
+RCU callbacks to complete.
+
+
+Implementing rcu_barrier()
+
+Dipankar Sarma's implementation of rcu_barrier() makes use of the fact
+that RCU callbacks are never reordered once queued on one of the per-CPU
+queues. His implementation queues an RCU callback on each of the per-CPU
+callback queues, and then waits until they have all started executing, at
+which point, all earlier RCU callbacks are guaranteed to have completed.
+The original code for rcu_barrier() was as follows:
+ 1 void rcu_barrier(void)
+ 2 {
+ 3   BUG_ON(in_interrupt());
+ 4   /* Take cpucontrol mutex to protect against CPU hotplug */
+ 5   mutex_lock(&rcu_barrier_mutex);
+ 6   init_completion(&rcu_barrier_completion);
+ 7   atomic_set(&rcu_barrier_cpu_count, 0);
+ 8   on_each_cpu(rcu_barrier_func, NULL, 0, 1);
+ 9   wait_for_completion(&rcu_barrier_completion);
+10   mutex_unlock(&rcu_barrier_mutex);
+11 }
+Line 3 verifies that the caller is in process context, and lines 5 and 10
+use rcu_barrier_mutex to ensure that only one rcu_barrier() is using the
+global completion and counters at a time, which are initialized on lines
+6 and 7. Line 8 causes each CPU to invoke rcu_barrier_func(), which is
+shown below. Note that the final "1" in on_each_cpu()'s argument list
+ensures that all the calls to rcu_barrier_func() will have completed
+before on_each_cpu() returns. Line 9 then waits for the completion.
+This code was rewritten in 2008 to support rcu_barrier_bh() and
+rcu_barrier_sched() in addition to the original rcu_barrier().
+The rcu_barrier_func() runs on each CPU, where it invokes call_rcu()
+to post an RCU callback, as follows:
+ 1 static void rcu_barrier_func(void *notused)
+ 2 {
+ 3   int cpu = smp_processor_id();
+ 4   struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+ 5   struct rcu_head *head;
+ 6
+ 7   head = &rdp->barrier;
+ 8   atomic_inc(&rcu_barrier_cpu_count);
+ 9   call_rcu(head, rcu_barrier_callback);
+10 }
+Lines 3 and 4 locate RCU's internal per-CPU rcu_data structure,
+which contains the struct rcu_head that is needed for the later call to
+call_rcu(). Line 7 picks up a pointer to this struct rcu_head, and line
+8 increments a global counter. This counter will later be decremented
+by the callback. Line 9 then registers the rcu_barrier_callback() on
+the current CPU's queue.
+The rcu_barrier_callback() function simply atomically decrements the
+rcu_barrier_cpu_count variable and finalizes the completion when it
+reaches zero, as follows:
+ 1 static void rcu_barrier_callback(struct rcu_head *notused)
+ 2 {
+ 3   if (atomic_dec_and_test(&rcu_barrier_cpu_count))
+ 4     complete(&rcu_barrier_completion);
+ 5 }
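+
+The counting scheme can be exercised in a single-threaded userspace
+simulation. The sketch below is not the kernel implementation: the
+per-CPU FIFO queues, sim_call_rcu(), sim_invoke_one(), demo_barrier(),
+and the choice of four CPUs are all assumptions made for illustration,
+and a plain flag stands in for the completion. It posts one barrier
+callback behind the ordinary callbacks on every queue and records how
+many ordinary callbacks had run when the count reached zero.

```c
#include <assert.h>
#include <stddef.h>

#define NR_SIM_CPUS 4	/* arbitrary CPU count for this sketch */

/* Userspace stand-in for the kernel's struct rcu_head (assumption). */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

/* One FIFO callback queue per simulated CPU. */
static struct rcu_head *queue_head[NR_SIM_CPUS];
static struct rcu_head **queue_tail[NR_SIM_CPUS];

static int barrier_cpu_count;		/* plays the role of rcu_barrier_cpu_count */
static int barrier_done;		/* plays the role of the completion */
static int ordinary_invoked;		/* ordinary callbacks invoked so far */
static int invoked_at_completion;	/* snapshot taken when the barrier completes */

/* Append a callback to the tail of one CPU's queue, preserving order. */
static void sim_call_rcu(int cpu, struct rcu_head *head,
			 void (*func)(struct rcu_head *head))
{
	head->next = NULL;
	head->func = func;
	*queue_tail[cpu] = head;
	queue_tail[cpu] = &head->next;
}

static void ordinary_callback(struct rcu_head *unused)
{
	(void)unused;
	ordinary_invoked++;
}

/* Mirrors rcu_barrier_callback(): the last decrement signals completion. */
static void sim_barrier_callback(struct rcu_head *unused)
{
	(void)unused;
	if (--barrier_cpu_count == 0) {
		invoked_at_completion = ordinary_invoked;
		barrier_done = 1;
	}
}

/* Invoke the oldest callback on one CPU's queue, if there is one. */
static void sim_invoke_one(int cpu)
{
	struct rcu_head *head = queue_head[cpu];

	if (head == NULL)
		return;
	queue_head[cpu] = head->next;
	if (queue_head[cpu] == NULL)
		queue_tail[cpu] = &queue_head[cpu];
	head->func(head);
}

/* Returns how many ordinary callbacks had run when the barrier completed. */
int demo_barrier(void)
{
	static struct rcu_head ordinary[NR_SIM_CPUS][NR_SIM_CPUS];
	static struct rcu_head barrier[NR_SIM_CPUS];
	int cpu, i;

	barrier_cpu_count = 0;
	barrier_done = 0;
	ordinary_invoked = 0;
	invoked_at_completion = -1;
	for (cpu = 0; cpu < NR_SIM_CPUS; cpu++) {
		queue_head[cpu] = NULL;
		queue_tail[cpu] = &queue_head[cpu];
		/* CPU n starts with n + 1 ordinary callbacks: 1 + 2 + 3 + 4 = 10. */
		for (i = 0; i <= cpu; i++)
			sim_call_rcu(cpu, &ordinary[cpu][i], ordinary_callback);
	}

	/* As with on_each_cpu(): every CPU posts before any callback runs. */
	for (cpu = 0; cpu < NR_SIM_CPUS; cpu++) {
		barrier_cpu_count++;
		sim_call_rcu(cpu, &barrier[cpu], sim_barrier_callback);
	}

	/* Round-robin invocation, so the queues drain at different times. */
	while (!barrier_done)
		for (cpu = 0; cpu < NR_SIM_CPUS; cpu++)
			sim_invoke_one(cpu);

	assert(barrier_cpu_count == 0);
	return invoked_at_completion;
}
```

+In this simulation the barrier callback on the shortest queue runs while
+other queues still hold ordinary callbacks, yet the count cannot reach
+zero until the longest queue has drained, so demo_barrier() returns 10,
+the total number of ordinary callbacks queued.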
+Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes
+        immediately (thus incrementing rcu_barrier_cpu_count to the
+        value one), but the other CPUs' rcu_barrier_func() invocations
+        are delayed for a full grace period? Couldn't this result in
+        rcu_barrier() returning prematurely?
+
+
+rcu_barrier() Summary
+
+The rcu_barrier() primitive has seen relatively little use, since most
+code using RCU is in the core kernel rather than in modules. However, if
+you are using RCU from an unloadable module, you need to use rcu_barrier()
+so that your module may be safely unloaded.
+
+
+Answers to Quick Quizzes
+
+Quick Quiz #1: Why is there no srcu_barrier()?
+Answer: Since there is no call_srcu(), there can be no outstanding SRCU
+        callbacks. Therefore, there is no need to wait for them.
+Quick Quiz #2: Is there any other situation where rcu_barrier() might
+        be required?
+Answer: Interestingly enough, rcu_barrier() was not originally
+        implemented for module unloading. Nikita Danilov was using
+        RCU in a filesystem, which resulted in a similar situation at
+        filesystem-unmount time. Dipankar Sarma coded up rcu_barrier()
+        in response, so that Nikita could invoke it during the
+        filesystem-unmount process.
+        Much later, yours truly hit the RCU module-unload problem when
+        implementing rcutorture, and found that rcu_barrier() solves
+        this problem as well.
+Quick Quiz #3: What happens if CPU 0's rcu_barrier_func() executes
+        immediately (thus incrementing rcu_barrier_cpu_count to the
+        value one), but the other CPUs' rcu_barrier_func() invocations
+        are delayed for a full grace period? Couldn't this result in
+        rcu_barrier() returning prematurely?
+Answer: This cannot happen. The reason is that on_each_cpu() has its last
+        argument, the wait flag, set to "1". This flag is passed through
+        to smp_call_function() and further to smp_call_function_on_cpu(),
+        causing this latter to spin until the cross-CPU invocation of
+        rcu_barrier_func() has completed. This by itself would prevent
+        a grace period from completing on non-CONFIG_PREEMPT kernels,
+        since each CPU must undergo a context switch (or other quiescent
+        state) before the grace period can complete. However, this is
+        of no use in CONFIG_PREEMPT kernels.
+        Therefore, on_each_cpu() disables preemption across its call
+        to smp_call_function() and also across the local call to
+        rcu_barrier_func(). This prevents the local CPU from context
+        switching, again preventing grace periods from completing. This
+        means that all CPUs have executed rcu_barrier_func() before
+        the first rcu_barrier_callback() can possibly execute, in turn
+        preventing rcu_barrier_cpu_count from prematurely reaching zero.
+        Currently, -rt implementations of RCU keep but a single global
+        queue for RCU callbacks, and thus do not suffer from this
+        problem. However, when the -rt RCU eventually does have per-CPU
+        callback queues, things will have to change. One simple change
+        is to add an rcu_read_lock() before line 8 of rcu_barrier()
+        and an rcu_read_unlock() after line 8 of this same function. If
+        you can think of a better change, please let me know!