author     Paul E. McKenney <paulmck@linux.vnet.ibm.com>   2013-12-11 16:59:04 -0500
committer  Ingo Molnar <mingo@kernel.org>                  2013-12-16 05:36:08 -0500
commit     2ecf810121c7eae34473b8fa108112036bc61127 (patch)
tree       db2f48fec611baab7b9fa15f45c9be8c5587670c
parent     962d9c5757b2ebd30678a88a4f475ae997f4d5dd (diff)
Documentation/memory-barriers.txt: Add needed ACCESS_ONCE() calls to memory-barriers.txt
The Documentation/memory-barriers.txt file was written before
the need for ACCESS_ONCE() was fully appreciated. It therefore
contains no ACCESS_ONCE() calls, which can be a problem when
people lift examples from it. This commit therefore adds
ACCESS_ONCE() calls.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Reviewed-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <linux-arch@vger.kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/1386799151-2219-1-git-send-email-paulmck@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-rw-r--r--  Documentation/memory-barriers.txt | 206
1 file changed, 126 insertions(+), 80 deletions(-)
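For context when reading the hunks below: at this point in the kernel's history, ACCESS_ONCE() is a simple volatile cast defined in include/linux/compiler.h. A minimal standalone sketch of the idiom (the macro body matches the kernel's definition, spelled __typeof__ so it builds as plain C; flag and main() are made-up scaffolding):

        /*
         * Standalone sketch of ACCESS_ONCE().  The volatile cast forces
         * the compiler to emit exactly one load or store per source-level
         * access: nothing cached, repeated, merged, or torn.
         */
        #include <stdio.h>

        #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

        static int flag;

        int main(void)
        {
                int v;

                ACCESS_ONCE(flag) = 1;          /* exactly one store */
                v = ACCESS_ONCE(flag);          /* exactly one load  */
                printf("flag = %d\n", v);
                return 0;
        }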
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 020cccdbdd0c..1d067235b0bc 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -194,18 +194,22 @@ There are some minimal guarantees that may be expected of a CPU:
  (*) On any given CPU, dependent memory accesses will be issued in order, with
      respect to itself.  This means that for:
 
-        Q = P; D = *Q;
+        ACCESS_ONCE(Q) = P; smp_read_barrier_depends(); D = ACCESS_ONCE(*Q);
 
      the CPU will issue the following memory operations:
 
         Q = LOAD P, D = LOAD *Q
 
-     and always in that order.
+     and always in that order.  On most systems, smp_read_barrier_depends()
+     does nothing, but it is required for DEC Alpha.  The ACCESS_ONCE()
+     is required to prevent compiler mischief.  Please note that you
+     should normally use something like rcu_dereference() instead of
+     open-coding smp_read_barrier_depends().
 
  (*) Overlapping loads and stores within a particular CPU will appear to be
      ordered within that CPU.  This means that for:
 
-        a = *X; *X = b;
+        a = ACCESS_ONCE(*X); ACCESS_ONCE(*X) = b;
 
      the CPU will only issue the following sequence of memory operations:
 
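As the new text advises, kernel code should normally obtain both the ACCESS_ONCE() semantics and the dependency barrier from rcu_dereference() rather than open-coding them. A hedged kernel-context sketch of the same dependent-load pattern (struct foo, gp, and reader() are made-up names, not anything from the patch):

        #include <linux/rcupdate.h>

        struct foo {
                int a;
        };

        struct foo __rcu *gp;           /* RCU-protected pointer */

        int reader(void)
        {
                struct foo *q;
                int d = -1;

                rcu_read_lock();
                q = rcu_dereference(gp);        /* one careful load of gp,
                                                   including the dependency
                                                   barrier on DEC Alpha */
                if (q)
                        d = q->a;               /* dependent load */
                rcu_read_unlock();
                return d;
        }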
@@ -213,7 +217,7 @@ There are some minimal guarantees that may be expected of a CPU:
 
      And for:
 
-        *X = c; d = *X;
+        ACCESS_ONCE(*X) = c; d = ACCESS_ONCE(*X);
 
      the CPU will only issue:
 
@@ -224,6 +228,41 @@ There are some minimal guarantees that may be expected of a CPU:
 
 And there are a number of things that _must_ or _must_not_ be assumed:
 
+ (*) It _must_not_ be assumed that the compiler will do what you want with
+     memory references that are not protected by ACCESS_ONCE().  Without
+     ACCESS_ONCE(), the compiler is within its rights to do all sorts
+     of "creative" transformations:
+
+     (-) Repeat the load, possibly getting a different value on the second
+         and subsequent loads.  This is especially prone to happen when
+         register pressure is high.
+
+     (-) Merge adjacent loads and stores to the same location.  The most
+         familiar example is the transformation from:
+
+                while (a)
+                        do_something();
+
+         to something like:
+
+                if (a)
+                        for (;;)
+                                do_something();
+
+         Using ACCESS_ONCE() as follows prevents this sort of optimization:
+
+                while (ACCESS_ONCE(a))
+                        do_something();
+
+     (-) "Store tearing", where a single store in the source code is split
+         into smaller stores in the object code.  Note that gcc really
+         will do this on some architectures when storing certain constants.
+         It can be cheaper to do a series of immediate stores than to
+         form the constant in a register and then to store that register.
+
+     (-) "Load tearing", which splits loads in a manner analogous to
+         store tearing.
+
  (*) It _must_not_ be assumed that independent loads and stores will be issued
      in the order given.  This means that for:
 
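To make the tearing items concrete, here is a standalone sketch (shared_val is a made-up name). Per the text above, gcc on some architectures really may emit the plain store of this constant as two 32-bit immediate stores, so a concurrent reader could observe half of the old value and half of the new:

        #include <stdint.h>

        #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

        uint64_t shared_val;            /* read concurrently elsewhere */

        void plain_store(void)
        {
                /* May be torn into two 32-bit immediate stores. */
                shared_val = 0x0000000100000002ULL;
        }

        void once_store(void)
        {
                /* Emitted as a single 64-bit store wherever the
                   architecture can do one. */
                ACCESS_ONCE(shared_val) = 0x0000000100000002ULL;
        }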
@@ -450,14 +489,14 @@ The usage requirements of data dependency barriers are a little subtle, and
 it's not always obvious that they're needed.  To illustrate, consider the
 following sequence of events:
 
         CPU 1                   CPU 2
         ===============         ===============
         { A == 1, B == 2, C = 3, P == &A, Q == &C }
         B = 4;
         <write barrier>
-        P = &B
-                                Q = P;
+        ACCESS_ONCE(P) = &B
+                                Q = ACCESS_ONCE(P);
                                 D = *Q;
 
 There's a clear data dependency here, and it would seem that by the end of the
 sequence, Q must be either &A or &B, and that:
@@ -477,15 +516,15 @@ Alpha).
 To deal with this, a data dependency barrier or better must be inserted
 between the address load and the data load:
 
         CPU 1                   CPU 2
         ===============         ===============
         { A == 1, B == 2, C = 3, P == &A, Q == &C }
         B = 4;
         <write barrier>
-        P = &B
-                                Q = P;
+        ACCESS_ONCE(P) = &B
+                                Q = ACCESS_ONCE(P);
                                 <data dependency barrier>
                                 D = *Q;
 
 This enforces the occurrence of one of the two implications, and prevents the
 third possibility from arising.
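Written out as kernel-style C, the corrected two-CPU sequence looks roughly like the sketch below, with smp_wmb() standing in for <write barrier> (cpu1()/cpu2() are made-up wrappers; in real code rcu_assign_pointer() and rcu_dereference() would normally replace the open-coded barriers):

        #include <linux/compiler.h>
        #include <asm/barrier.h>

        int A = 1, B = 2;
        int *P = &A;

        void cpu1(void)                         /* publisher */
        {
                B = 4;
                smp_wmb();                      /* <write barrier> */
                ACCESS_ONCE(P) = &B;
        }

        int cpu2(void)                          /* subscriber */
        {
                int *Q, D;

                Q = ACCESS_ONCE(P);
                smp_read_barrier_depends();     /* <data dependency barrier> */
                D = *Q;                         /* if Q == &B, D must be 4 */
                return D;
        }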
@@ -504,21 +543,22 @@ Another example of where data dependency barriers might be required is where a
 number is read from memory and then used to calculate the index for an array
 access:
 
         CPU 1                   CPU 2
         ===============         ===============
         { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
         M[1] = 4;
         <write barrier>
-        P = 1
-                                Q = P;
+        ACCESS_ONCE(P) = 1
+                                Q = ACCESS_ONCE(P);
                                 <data dependency barrier>
                                 D = M[Q];
 
 
-The data dependency barrier is very important to the RCU system, for example.
-See rcu_dereference() in include/linux/rcupdate.h.  This permits the current
-target of an RCU'd pointer to be replaced with a new modified target, without
-the replacement target appearing to be incompletely initialised.
+The data dependency barrier is very important to the RCU system,
+for example.  See rcu_assign_pointer() and rcu_dereference() in
+include/linux/rcupdate.h.  This permits the current target of an RCU'd
+pointer to be replaced with a new modified target, without the replacement
+target appearing to be incompletely initialised.
 
 See also the subsection on "Cache Coherency" for a more thorough example.
 
@@ -530,22 +570,23 @@ A control dependency requires a full read memory barrier, not simply a data
 dependency barrier to make it work correctly.  Consider the following bit of
 code:
 
-        q = &a;
+        q = ACCESS_ONCE(a);
         if (p) {
                 <data dependency barrier>
-                q = &b;
+                q = ACCESS_ONCE(b);
         }
         x = *q;
 
 This will not have the desired effect because there is no actual data
-dependency, but rather a control dependency that the CPU may short-circuit by
-attempting to predict the outcome in advance.  In such a case what's actually
-required is:
+dependency, but rather a control dependency that the CPU may short-circuit
+by attempting to predict the outcome in advance, so that other CPUs see
+the load from b as having happened before the load from a.  In such a
+case what's actually required is:
 
-        q = &a;
+        q = ACCESS_ONCE(a);
         if (p) {
                 <read barrier>
-                q = &b;
+                q = ACCESS_ONCE(b);
         }
         x = *q;
 
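The corrected control-dependency code, sketched as kernel-style C with smp_rmb() as the explicit <read barrier> (a and b are declared as pointers here so that the final dereference is type-correct; all names are made up):

        #include <linux/compiler.h>
        #include <asm/barrier.h>

        int *a, *b;
        int p;

        int reader(void)
        {
                int *q, x;

                q = ACCESS_ONCE(a);
                if (p) {
                        smp_rmb();      /* a control dependency alone would
                                           not order the loads from a and b */
                        q = ACCESS_ONCE(b);
                }
                x = *q;
                return x;
        }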
@@ -561,23 +602,23 @@ barrier, though a general barrier would also be viable.  Similarly a read
 barrier or a data dependency barrier should always be paired with at least an
 write barrier, though, again, a general barrier is viable:
 
         CPU 1                   CPU 2
         ===============         ===============
-        a = 1;
+        ACCESS_ONCE(a) = 1;
         <write barrier>
-        b = 2;                  x = b;
+        ACCESS_ONCE(b) = 2;     x = ACCESS_ONCE(b);
                                 <read barrier>
-                                y = a;
+                                y = ACCESS_ONCE(a);
 
 Or:
 
         CPU 1                   CPU 2
         ===============         ===============================
         a = 1;
         <write barrier>
-        b = &a;                 x = b;
+        ACCESS_ONCE(b) = &a;    x = ACCESS_ONCE(b);
                                 <data dependency barrier>
                                 y = *x;
 
 Basically, the read barrier always has to be there, even though it can be of
 the "weaker" type.
@@ -586,13 +627,13 @@ the "weaker" type.
 match the loads after the read barrier or the data dependency barrier, and vice
 versa:
 
         CPU 1                               CPU 2
-        ===============                     ===============
-        a = 1;               }----   --->{  v = c
-        b = 2;               }    \ /    {  w = d
+        ===================                 ===================
+        ACCESS_ONCE(a) = 1;  }----   --->{  v = ACCESS_ONCE(c);
+        ACCESS_ONCE(b) = 2;  }    \ /    {  w = ACCESS_ONCE(d);
         <write barrier>            \        <read barrier>
-        c = 3;               }    / \    {  x = a;
-        d = 4;               }----   --->{  y = b;
+        ACCESS_ONCE(c) = 3;  }    / \    {  x = ACCESS_ONCE(a);
+        ACCESS_ONCE(d) = 4;  }----   --->{  y = ACCESS_ONCE(b);
 
 
 EXAMPLES OF MEMORY BARRIER SEQUENCES
@@ -1435,12 +1476,12 @@ three CPUs; then should the following sequence of events occur:
 
         CPU 1                           CPU 2
         =============================== ===============================
-        *A = a;                         *E = e;
+        ACCESS_ONCE(*A) = a;            ACCESS_ONCE(*E) = e;
         LOCK M                          LOCK Q
-        *B = b;                         *F = f;
-        *C = c;                         *G = g;
+        ACCESS_ONCE(*B) = b;            ACCESS_ONCE(*F) = f;
+        ACCESS_ONCE(*C) = c;            ACCESS_ONCE(*G) = g;
         UNLOCK M                        UNLOCK Q
-        *D = d;                         *H = h;
+        ACCESS_ONCE(*D) = d;            ACCESS_ONCE(*H) = h;
 
 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
 through *H occur in, other than the constraints imposed by the separate locks
@@ -1460,17 +1501,17 @@ However, if the following occurs:
 
         CPU 1                           CPU 2
         =============================== ===============================
-        *A = a;
+        ACCESS_ONCE(*A) = a;
         LOCK M [1]
-        *B = b;
-        *C = c;
+        ACCESS_ONCE(*B) = b;
+        ACCESS_ONCE(*C) = c;
         UNLOCK M [1]
-        *D = d;                         *E = e;
+        ACCESS_ONCE(*D) = d;            ACCESS_ONCE(*E) = e;
                                         LOCK M [2]
-                                        *F = f;
-                                        *G = g;
+                                        ACCESS_ONCE(*F) = f;
+                                        ACCESS_ONCE(*G) = g;
                                         UNLOCK M [2]
-                                        *H = h;
+                                        ACCESS_ONCE(*H) = h;
 
 CPU 3 might see:
 
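For readers mapping the pseudo-ops onto real primitives, a hedged sketch of the single-lock sequence above, with a spinlock standing in for mutex M (all names are made up; this shows the shape of the code, not a recommended idiom):

        #include <linux/spinlock.h>
        #include <linux/compiler.h>

        static DEFINE_SPINLOCK(m);              /* plays the role of "M" */
        int A, B, C, D, E, F, G, H;

        void cpu1(void)
        {
                ACCESS_ONCE(A) = 1;
                spin_lock(&m);                  /* LOCK M [1] */
                ACCESS_ONCE(B) = 2;
                ACCESS_ONCE(C) = 3;
                spin_unlock(&m);                /* UNLOCK M [1] */
                ACCESS_ONCE(D) = 4;
        }

        void cpu2(void)
        {
                ACCESS_ONCE(E) = 5;
                spin_lock(&m);                  /* LOCK M [2] */
                ACCESS_ONCE(F) = 6;
                ACCESS_ONCE(G) = 7;
                spin_unlock(&m);                /* UNLOCK M [2] */
                ACCESS_ONCE(H) = 8;
        }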
@@ -2177,11 +2218,11 @@ A programmer might take it for granted that the CPU will perform memory
 operations in exactly the order specified, so that if the CPU is, for example,
 given the following piece of code to execute:
 
-        a = *A;
-        *B = b;
-        c = *C;
-        d = *D;
-        *E = e;
+        a = ACCESS_ONCE(*A);
+        ACCESS_ONCE(*B) = b;
+        c = ACCESS_ONCE(*C);
+        d = ACCESS_ONCE(*D);
+        ACCESS_ONCE(*E) = e;
 
 they would then expect that the CPU will complete the memory operation for each
 instruction before moving on to the next one, leading to a definite sequence of
@@ -2228,12 +2269,12 @@ However, it is guaranteed that a CPU will be self-consistent: it will see its
 _own_ accesses appear to be correctly ordered, without the need for a memory
 barrier.  For instance with the following code:
 
-        U = *A;
-        *A = V;
-        *A = W;
-        X = *A;
-        *A = Y;
-        Z = *A;
+        U = ACCESS_ONCE(*A);
+        ACCESS_ONCE(*A) = V;
+        ACCESS_ONCE(*A) = W;
+        X = ACCESS_ONCE(*A);
+        ACCESS_ONCE(*A) = Y;
+        Z = ACCESS_ONCE(*A);
 
 and assuming no intervention by an external influence, it can be assumed that
 the final result will appear to be:
@@ -2250,7 +2291,12 @@ accesses:
 
 in that order, but, without intervention, the sequence may have almost any
 combination of elements combined or discarded, provided the program's view of
-the world remains consistent.
+the world remains consistent.  Note that ACCESS_ONCE() is -not- optional
+in the above example, as there are architectures where a given CPU might
+interchange successive loads to the same location.  On such architectures,
+ACCESS_ONCE() does whatever is necessary to prevent this, for example, on
+Itanium the volatile casts used by ACCESS_ONCE() cause GCC to emit the
+special ld.acq and st.rel instructions that prevent such reordering.
 
 The compiler may also combine, discard or defer elements of the sequence before
 the CPU even sees them.
@@ -2264,13 +2310,13 @@ may be reduced to:
 
         *A = W;
 
-since, without a write barrier, it can be assumed that the effect of the
-storage of V to *A is lost.  Similarly:
+since, without either a write barrier or an ACCESS_ONCE(), it can be
+assumed that the effect of the storage of V to *A is lost.  Similarly:
 
         *A = Y;
         Z = *A;
 
-may, without a memory barrier, be reduced to:
+may, without a memory barrier or an ACCESS_ONCE(), be reduced to:
 
         *A = Y;
         Z = Y;
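A standalone sketch of that reduction and its cure; compile both functions with -O2 and compare the emitted stores (the variable A stands in for *A; all names are made up):

        #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

        int A;

        void plain(int V, int W)
        {
                A = V;                  /* the compiler may discard this... */
                A = W;                  /* ...keeping only this store       */
        }

        void once(int V, int W)
        {
                ACCESS_ONCE(A) = V;     /* both stores must be emitted */
                ACCESS_ONCE(A) = W;
        }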