Merge branch 'master' into upstream

author: Jeff Garzik <jeff@garzik.org> 2006-06-13 20:29:04 -0400
committer: Jeff Garzik <jeff@garzik.org> 2006-06-13 20:29:04 -0400
commit: b5ed7639c9f502898af4109e778f5613dacbfd9c (patch)
tree: abe908c60ce1ea8f201028c9fc830cacd25c724b /Documentation
parent: 0638dec01e89059c853515ab71c55fd13ba5a8ea (diff)
parent: eb35cf60e462491249166182e3e755d3d5d91a28 (diff)
1 files changed, 270 insertions, 78 deletions
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index c61d8b876fdb..4710845dbac4 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -19,6 +19,7 @@ Contents:
     - Control dependencies.
     - SMP barrier pairing.
     - Examples of memory barrier sequences.
+     - Read memory barriers vs load speculation.
 (*) Explicit kernel barriers.
@@ -248,7 +249,7 @@ And there are a number of things that _must_ or _must_not_ be assumed:
     we may get either of:
        STORE *A = X; Y = LOAD *A;
-        STORE *A = Y;
+        STORE *A = Y = X;
 =========================
@@ -344,9 +345,12 @@ Memory barriers come in four basic varieties:
 (4) General memory barriers.
-     A general memory barrier is a combination of both a read memory barrier
+     A general memory barrier gives a guarantee that all the LOAD and STORE
-     and a write memory barrier.  It is a partial ordering over both loads and
+     operations specified before the barrier will appear to happen before all
-     stores.
+     the LOAD and STORE operations specified after the barrier with respect to
+     the other components of the system.
+     A general memory barrier is a partial ordering over both loads and stores.
     General memory barriers imply both read and write memory barriers, and so
     can substitute for either.
@@ -546,9 +550,9 @@ write barrier, though, again, a general barrier is viable:
        =============== ===============
        a = 1;
        <write barrier>
-        b = 2;          x = a;
+        b = 2;          x = b;
                        <read barrier>
-                        y = b;
+                        y = a;
 Or:
@@ -563,6 +567,18 @@ Or:
 Basically, the read barrier always has to be there, even though it can be of
 the "weaker" type.
+[!] Note that the stores before the write barrier would normally be expected to
+match the loads after the read barrier or data dependency barrier, and vice
+versa:
+        CPU 1                           CPU 2
+        ===============                 ===============
+        a = 1;           }----   --->{  v = c
+        b = 2;           }    \ /    {  w = d
+        <write barrier>        \        <read barrier>
+        c = 3;           }    / \    {  x = a;
+        d = 4;           }----   --->{  y = b;
 EXAMPLES OF MEMORY BARRIER SEQUENCES
 ------------------------------------
@@ -600,8 +616,8 @@ STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E
        |       |       +------+
        +-------+       :      :
                           |
-                           | Sequence in which stores committed to memory system
+                           | Sequence in which stores are committed to the
-                           | by CPU 1
+                           | memory system by CPU 1
                           V
@@ -683,14 +699,12 @@ then the following will occur:
                                       |        :       :       |       |
                                       |        :       :       | CPU 2 |
                                       |        +-------+       |       |
-                                        \       | X->9  |------>|       |
+                                       |        | X->9  |------>|       |
-                                         \      +-------+       |       |
+                                       |        +-------+       |       |
-                                          ----->| B->2  |       |       |
+          Makes sure all effects --->   \   ddddddddddddddddd   |       |
-                                                +-------+       |       |
+          prior to the store of C        \      +-------+       |       |
-             Makes sure all effects --->    ddddddddddddddddd   |       |
+          are perceptible to              ----->| B->2  |------>|       |
-             prior to the store of C            +-------+       |       |
+          subsequent loads                      +-------+       |       |
-             are perceptible to                 | B->2  |------>|       |
-             successive loads                   +-------+       |       |
                                                :       :       +-------+
@@ -699,73 +713,239 @@ following sequence of events:
        CPU 1                   CPU 2
        ======================= =======================
+                { A = 0, B = 9 }
        STORE A=1
-        STORE B=2
-        STORE C=3
        <write barrier>
-        STORE D=4
+        STORE B=2
-        STORE E=5
-                                LOAD A
                                LOAD B
-                                LOAD C
+                                LOAD A
-                                LOAD D
-                                LOAD E
 Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
 some effectively random order, despite the write barrier issued by CPU 1:
-        +-------+       :      :
+        +-------+       :      :                :       :
-        |       |       +------+
+        |       |       +------+                +-------+
-        |       |------>| C=3  | }
+        |       |------>| A=1  |------      --->| A->0  |
-        |       |  :    +------+ }
+        |       |       +------+      \         +-------+
-        |       |  :    | A=1  | }
+        | CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
-        |       |  :    +------+ }
+        |       |       +------+        |       +-------+
-        | CPU 1 |  :    | B=2  | }---
+        |       |------>| B=2  |---     |       :       :
-        |       |       +------+ }   \
+        |       |       +------+   \    |       :       :       +-------+
-        |       |   wwwwwwwwwwwww}    \
+        +-------+       :      :    \   |       +-------+       |       |
-        |       |       +------+ }     \          :       :       +-------+
+                                     ---------->| B->2  |------>|       |
-        |       |  :    | E=5  | }      \         +-------+       |       |
+                                        |       +-------+       | CPU 2 |
-        |       |  :    +------+ }       \      { | C->3  |------>|       |
+                                        |       | A->0  |------>|       |
-        |       |------>| D=4  | }        \     { +-------+    :  |       |
+                                        |       +-------+       |       |
-        |       |       +------+           \    { | E->5  |    :  |       |
+                                        |       :       :       +-------+
-        +-------+       :      :            \   { +-------+    :  |       |
+                                         \      :       :
-                                   Transfer  -->{ | A->1  |    :  | CPU 2 |
+                                          \     +-------+
-                                  from CPU 1    { +-------+    :  |       |
+                                           ---->| A->1  |
-                                   to CPU 2     { | D->4  |    :  |       |
+                                                +-------+
-                                                { +-------+    :  |       |
+                                                :       :
-                                                { | B->2  |------>|       |
-                                                  +-------+       |       |
-                                                  :       :       +-------+
-If, however, a read barrier were to be placed between the load of C and the
-load of D on CPU 2, then the partial ordering imposed by CPU 1 will be
-perceived correctly by CPU 2.
-        +-------+       :      :
-        |       |       +------+
+If, however, a read barrier were to be placed between the load of E and the
-        |       |------>| C=3  | }
+load of A on CPU 2:
-        |       |  :    +------+ }
-        |       |  :    | A=1  | }---
+        CPU 1                   CPU 2
-        |       |  :    +------+ }   \
+        ======================= =======================
-        | CPU 1 |  :    | B=2  | }    \
+                { A = 0, B = 9 }
-        |       |       +------+       \
+        STORE A=1
-        |       |   wwwwwwwwwwwwwwww    \
+        <write barrier>
-        |       |       +------+         \        :       :       +-------+
+        STORE B=2
-        |       |  :    | E=5  | }        \       +-------+       |       |
+                                LOAD B
-        |       |  :    +------+ }---      \    { | C->3  |------>|       |
+                                <read barrier>
-        |       |------>| D=4  | }   \      \   { +-------+    :  |       |
+                                LOAD A
-        |       |       +------+      \      -->{ | B->2  |    :  |       |
-        +-------+       :      :       \        { +-------+    :  |       |
+then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
-                                        \       { | A->1  |    :  | CPU 2 |
+2:
-                                         \        +-------+       |       |
-           At this point the read ---->   \   rrrrrrrrrrrrrrrrr   |       |
+        +-------+       :      :                :       :
-           barrier causes all effects      \      +-------+       |       |
+        |       |       +------+                +-------+
-           prior to the storage of C        \   { | E->5  |    :  |       |
+        |       |------>| A=1  |------      --->| A->0  |
-           to be perceptible to CPU 2        -->{ +-------+    :  |       |
+        |       |       +------+      \         +-------+
-                                                { | D->4  |------>|       |
+        | CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
-                                                  +-------+       |       |
+        |       |       +------+        |       +-------+
-                                                  :       :       +-------+
+        |       |------>| B=2  |---     |       :       :
+        |       |       +------+   \    |       :       :       +-------+
+        +-------+       :      :    \   |       +-------+       |       |
+                                     ---------->| B->2  |------>|       |
+                                        |       +-------+       | CPU 2 |
+                                        |       :       :       |       |
+                                        |       :       :       |       |
+          At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
+          barrier causes all effects      \     +-------+       |       |
+          prior to the storage of B        ---->| A->1  |------>|       |
+          to be perceptible to CPU 2            +-------+       |       |
+                                                :       :       +-------+
+To illustrate this more completely, consider what could happen if the code
+contained a load of A either side of the read barrier:
+        CPU 1                   CPU 2
+        ======================= =======================
+                { A = 0, B = 9 }
+        STORE A=1
+        <write barrier>
+        STORE B=2
+                                LOAD B
+                                LOAD A [first load of A]
+                                <read barrier>
+                                LOAD A [second load of A]
+Even though the two loads of A both occur after the load of B, they may both
+come up with different values:
+        +-------+       :      :                :       :
+        |       |       +------+                +-------+
+        |       |------>| A=1  |------      --->| A->0  |
+        |       |       +------+      \         +-------+
+        | CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
+        |       |       +------+        |       +-------+
+        |       |------>| B=2  |---     |       :       :
+        |       |       +------+   \    |       :       :       +-------+
+        +-------+       :      :    \   |       +-------+       |       |
+                                     ---------->| B->2  |------>|       |
+                                        |       +-------+       | CPU 2 |
+                                        |       :       :       |       |
+                                        |       :       :       |       |
+                                        |       +-------+       |       |
+                                        |       | A->0  |------>| 1st   |
+                                        |       +-------+       |       |
+          At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
+          barrier causes all effects      \     +-------+       |       |
+          prior to the storage of B        ---->| A->1  |------>| 2nd   |
+          to be perceptible to CPU 2            +-------+       |       |
+                                                :       :       +-------+
+But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
+before the read barrier completes anyway:
+        +-------+       :      :                :       :
+        |       |       +------+                +-------+
+        |       |------>| A=1  |------      --->| A->0  |
+        |       |       +------+      \         +-------+
+        | CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
+        |       |       +------+        |       +-------+
+        |       |------>| B=2  |---     |       :       :
+        |       |       +------+   \    |       :       :       +-------+
+        +-------+       :      :    \   |       +-------+       |       |
+                                     ---------->| B->2  |------>|       |
+                                        |       +-------+       | CPU 2 |
+                                        |       :       :       |       |
+                                         \      :       :       |       |
+                                          \     +-------+       |       |
+                                           ---->| A->1  |------>| 1st   |
+                                                +-------+       |       |
+                                            rrrrrrrrrrrrrrrrr   |       |
+                                                +-------+       |       |
+                                                | A->1  |------>| 2nd   |
+                                                +-------+       |       |
+                                                :       :       +-------+
+The guarantee is that the second load will always come up with A == 1 if the
+load of B came up with B == 2.  No such guarantee exists for the first load of
+A; that may come up with either A == 0 or A == 1.
+READ MEMORY BARRIERS VS LOAD SPECULATION
+----------------------------------------
+Many CPUs speculate with loads: that is they see that they will need to load an
+item from memory, and they find a time where they're not using the bus for any
+other loads, and so do the load in advance - even though they haven't actually
+got to that point in the instruction execution flow yet.  This permits the
+actual load instruction to potentially complete immediately because the CPU
+already has the value to hand.
+It may turn out that the CPU didn't actually need the value - perhaps because a
+branch circumvented the load - in which case it can discard the value or just
+cache it for later use.
+Consider:
+        CPU 1                   CPU 2
+        ======================= =======================
+                                LOAD B
+                                DIVIDE          } Divide instructions generally
+                                DIVIDE          } take a long time to perform
+                                LOAD A
+Which might appear as this:
+                                                :       :       +-------+
+                                                +-------+       |       |
+                                            --->| B->2  |------>|       |
+                                                +-------+       | CPU 2 |
+                                                :       :DIVIDE |       |
+                                                +-------+       |       |
+        The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
+        division speculates on the              +-------+   ~   |       |
+        LOAD of A                               :       :   ~   |       |
+                                                :       :DIVIDE |       |
+                                                :       :   ~   |       |
+        Once the divisions are complete -->     :       :   ~-->|       |
+        the CPU can then perform the            :       :       |       |
+        LOAD with immediate effect              :       :       +-------+
+Placing a read barrier or a data dependency barrier just before the second
+load:
+        CPU 1                   CPU 2
+        ======================= =======================
+                                LOAD B
+                                DIVIDE
+                                DIVIDE
+                                <read barrier>
+                                LOAD A
+will force any value speculatively obtained to be reconsidered to an extent
+dependent on the type of barrier used.  If there was no change made to the
+speculated memory location, then the speculated value will just be used:
+                                                :       :       +-------+
+                                                +-------+       |       |
+                                            --->| B->2  |------>|       |
+                                                +-------+       | CPU 2 |
+                                                :       :DIVIDE |       |
+                                                +-------+       |       |
+        The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
+        division speculates on the              +-------+   ~   |       |
+        LOAD of A                               :       :   ~   |       |
+                                                :       :DIVIDE |       |
+                                                :       :   ~   |       |
+                                                :       :   ~   |       |
+                                            rrrrrrrrrrrrrrrr~   |       |
+                                                :       :   ~   |       |
+                                                :       :   ~-->|       |
+                                                :       :       |       |
+                                                :       :       +-------+
+but if there was an update or an invalidation from another CPU pending, then
+the speculation will be cancelled and the value reloaded:
+                                                :       :       +-------+
+                                                +-------+       |       |
+                                            --->| B->2  |------>|       |
+                                                +-------+       | CPU 2 |
+                                                :       :DIVIDE |       |
+                                                +-------+       |       |
+        The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
+        division speculates on the              +-------+   ~   |       |
+        LOAD of A                               :       :   ~   |       |
+                                                :       :DIVIDE |       |
+                                                :       :   ~   |       |
+                                                :       :   ~   |       |
+                                            rrrrrrrrrrrrrrrrr   |       |
+                                                +-------+       |       |
+        The speculation is discarded --->   --->| A->1  |------>|       |
+        and an updated value is                 +-------+       |       |
+        retrieved                               :       :       +-------+
 ========================
@@ -901,7 +1081,7 @@ IMPLICIT KERNEL MEMORY BARRIERS
 ===============================
 Some of the other functions in the linux kernel imply memory barriers, amongst
-which are locking, scheduling and memory allocation functions.
+which are locking and scheduling functions.
 This specification is a _minimum_ guarantee; any particular architecture may
 provide more substantial guarantees, but these may not be relied upon outside
@@ -966,6 +1146,20 @@ equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
    barriers is that the effects instructions outside of a critical section may
    seep into the inside of the critical section.
+A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
+because it is possible for an access preceding the LOCK to happen after the
+LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
+two accesses can themselves then cross:
+        *A = a;
+        LOCK
+        UNLOCK
+        *B = b;
+may occur as:
+        LOCK, STORE *B, STORE *A, UNLOCK
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
 anything at all - especially with respect to I/O accesses - unless combined
@@ -1016,8 +1210,6 @@ Other functions that imply barriers:
 (*) schedule() and similar imply full memory barriers.
- (*) Memory allocation and release functions imply full memory barriers.
 =================================
 INTER-CPU LOCKING BARRIER EFFECTS
author	Jeff Garzik <jeff@garzik.org>	2006-06-13 20:29:04 -0400
committer	Jeff Garzik <jeff@garzik.org>	2006-06-13 20:29:04 -0400
commit	b5ed7639c9f502898af4109e778f5613dacbfd9c (patch)
tree	abe908c60ce1ea8f201028c9fc830cacd25c724b /Documentation
parent	0638dec01e89059c853515ab71c55fd13ba5a8ea (diff)
parent	eb35cf60e462491249166182e3e755d3d5d91a28 (diff)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index c61d8b876fdb..4710845dbac4 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt
@@ -19,6 +19,7 @@ Contents:
19	- Control dependencies.	19	- Control dependencies.
20	- SMP barrier pairing.	20	- SMP barrier pairing.
21	- Examples of memory barrier sequences.	21	- Examples of memory barrier sequences.
		22	- Read memory barriers vs load speculation.
22		23
23	(*) Explicit kernel barriers.	24	(*) Explicit kernel barriers.
24		25
@@ -248,7 +249,7 @@ And there are a number of things that _must_ or _must_not_ be assumed:
248	we may get either of:	249	we may get either of:
249		250
250	STORE A = X; Y = LOAD A;	251	STORE A = X; Y = LOAD A;
251	STORE *A = Y;	252	STORE *A = Y = X;
252		253
253		254
254	=========================	255	=========================
@@ -344,9 +345,12 @@ Memory barriers come in four basic varieties:
344		345
345	(4) General memory barriers.	346	(4) General memory barriers.
346		347
347	A general memory barrier is a combination of both a read memory barrier	348	A general memory barrier gives a guarantee that all the LOAD and STORE
348	and a write memory barrier. It is a partial ordering over both loads and	349	operations specified before the barrier will appear to happen before all
349	stores.	350	the LOAD and STORE operations specified after the barrier with respect to
		351	the other components of the system.
		352
		353	A general memory barrier is a partial ordering over both loads and stores.
350		354
351	General memory barriers imply both read and write memory barriers, and so	355	General memory barriers imply both read and write memory barriers, and so
352	can substitute for either.	356	can substitute for either.
@@ -546,9 +550,9 @@ write barrier, though, again, a general barrier is viable:
546	=============== ===============	550	=============== ===============
547	a = 1;	551	a = 1;
548	<write barrier>	552	<write barrier>
549	b = 2; x = a;	553	b = 2; x = b;
550	<read barrier>	554	<read barrier>
551	y = b;	555	y = a;
552		556
553	Or:	557	Or:
554		558
@@ -563,6 +567,18 @@ Or:
563	Basically, the read barrier always has to be there, even though it can be of	567	Basically, the read barrier always has to be there, even though it can be of
564	the "weaker" type.	568	the "weaker" type.
565		569
		570	[!] Note that the stores before the write barrier would normally be expected to
		571	match the loads after the read barrier or data dependency barrier, and vice
		572	versa:
		573
		574	CPU 1 CPU 2
		575	=============== ===============
		576	a = 1; }---- --->{ v = c
		577	b = 2; } \ / { w = d
		578	<write barrier> \ <read barrier>
		579	c = 3; } / \ { x = a;
		580	d = 4; }---- --->{ y = b;
		581
566		582
567	EXAMPLES OF MEMORY BARRIER SEQUENCES	583	EXAMPLES OF MEMORY BARRIER SEQUENCES
568	------------------------------------	584	------------------------------------
@@ -600,8 +616,8 @@ STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E
600	\| \| +------+	616	\| \| +------+
601	+-------+ : :	617	+-------+ : :
602	\|	618	\|
603	\| Sequence in which stores committed to memory system	619	\| Sequence in which stores are committed to the
604	\| by CPU 1	620	\| memory system by CPU 1
605	V	621	V
606		622
607		623
@@ -683,14 +699,12 @@ then the following will occur:
683	\| : : \| \|	699	\| : : \| \|
684	\| : : \| CPU 2 \|	700	\| : : \| CPU 2 \|
685	\| +-------+ \| \|	701	\| +-------+ \| \|
686	\ \| X->9 \|------>\| \|	702	\| \| X->9 \|------>\| \|
687	\ +-------+ \| \|	703	\| +-------+ \| \|
688	----->\| B->2 \| \| \|	704	Makes sure all effects ---> \ ddddddddddddddddd \| \|
689	+-------+ \| \|	705	prior to the store of C \ +-------+ \| \|
690	Makes sure all effects ---> ddddddddddddddddd \| \|	706	are perceptible to ----->\| B->2 \|------>\| \|
691	prior to the store of C +-------+ \| \|	707	subsequent loads +-------+ \| \|
692	are perceptible to \| B->2 \|------>\| \|
693	successive loads +-------+ \| \|
694	: : +-------+	708	: : +-------+
695		709
696		710
@@ -699,73 +713,239 @@ following sequence of events:
699		713
700	CPU 1 CPU 2	714	CPU 1 CPU 2
701	======================= =======================	715	======================= =======================
		716	{ A = 0, B = 9 }
702	STORE A=1	717	STORE A=1
703	STORE B=2
704	STORE C=3
705	<write barrier>	718	<write barrier>
706	STORE D=4	719	STORE B=2
707	STORE E=5
708	LOAD A
709	LOAD B	720	LOAD B
710	LOAD C	721	LOAD A
711	LOAD D
712	LOAD E
713		722
714	Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in	723	Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
715	some effectively random order, despite the write barrier issued by CPU 1:	724	some effectively random order, despite the write barrier issued by CPU 1:
716		725
717	+-------+ : :	726	+-------+ : : : :
718	\| \| +------+	727	\| \| +------+ +-------+
719	\| \|------>\| C=3 \| }	728	\| \|------>\| A=1 \|------ --->\| A->0 \|
720	\| \| : +------+ }	729	\| \| +------+ \ +-------+
721	\| \| : \| A=1 \| }	730	\| CPU 1 \| wwwwwwwwwwwwwwww \ --->\| B->9 \|
722	\| \| : +------+ }	731	\| \| +------+ \| +-------+
723	\| CPU 1 \| : \| B=2 \| }---	732	\| \|------>\| B=2 \|--- \| : :
724	\| \| +------+ } \	733	\| \| +------+ \ \| : : +-------+
725	\| \| wwwwwwwwwwwww} \	734	+-------+ : : \ \| +-------+ \| \|
726	\| \| +------+ } \ : : +-------+	735	---------->\| B->2 \|------>\| \|
727	\| \| : \| E=5 \| } \ +-------+ \| \|	736	\| +-------+ \| CPU 2 \|
728	\| \| : +------+ } \ { \| C->3 \|------>\| \|	737	\| \| A->0 \|------>\| \|
729	\| \|------>\| D=4 \| } \ { +-------+ : \| \|	738	\| +-------+ \| \|
730	\| \| +------+ \ { \| E->5 \| : \| \|	739	\| : : +-------+
731	+-------+ : : \ { +-------+ : \| \|	740	\ : :
732	Transfer -->{ \| A->1 \| : \| CPU 2 \|	741	\ +-------+
733	from CPU 1 { +-------+ : \| \|	742	---->\| A->1 \|
734	to CPU 2 { \| D->4 \| : \| \|	743	+-------+
735	{ +-------+ : \| \|	744	: :
736	{ \| B->2 \|------>\| \|
737	+-------+ \| \|
738	: : +-------+
739
740
741	If, however, a read barrier were to be placed between the load of C and the
742	load of D on CPU 2, then the partial ordering imposed by CPU 1 will be
743	perceived correctly by CPU 2.
744		745
745	+-------+ : :	746
746	\| \| +------+	747	If, however, a read barrier were to be placed between the load of E and the
747	\| \|------>\| C=3 \| }	748	load of A on CPU 2:
748	\| \| : +------+ }	749
749	\| \| : \| A=1 \| }---	750	CPU 1 CPU 2
750	\| \| : +------+ } \	751	======================= =======================
751	\| CPU 1 \| : \| B=2 \| } \	752	{ A = 0, B = 9 }
752	\| \| +------+ \	753	STORE A=1
753	\| \| wwwwwwwwwwwwwwww \	754	<write barrier>
754	\| \| +------+ \ : : +-------+	755	STORE B=2
755	\| \| : \| E=5 \| } \ +-------+ \| \|	756	LOAD B
756	\| \| : +------+ }--- \ { \| C->3 \|------>\| \|	757	<read barrier>
757	\| \|------>\| D=4 \| } \ \ { +-------+ : \| \|	758	LOAD A
758	\| \| +------+ \ -->{ \| B->2 \| : \| \|	759
759	+-------+ : : \ { +-------+ : \| \|	760	then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
760	\ { \| A->1 \| : \| CPU 2 \|	761	2:
761	\ +-------+ \| \|	762
762	At this point the read ----> \ rrrrrrrrrrrrrrrrr \| \|	763	+-------+ : : : :
763	barrier causes all effects \ +-------+ \| \|	764	\| \| +------+ +-------+
764	prior to the storage of C \ { \| E->5 \| : \| \|	765	\| \|------>\| A=1 \|------ --->\| A->0 \|
765	to be perceptible to CPU 2 -->{ +-------+ : \| \|	766	\| \| +------+ \ +-------+
766	{ \| D->4 \|------>\| \|	767	\| CPU 1 \| wwwwwwwwwwwwwwww \ --->\| B->9 \|
767	+-------+ \| \|	768	\| \| +------+ \| +-------+
768	: : +-------+	769	\| \|------>\| B=2 \|--- \| : :
		770	\| \| +------+ \ \| : : +-------+
		771	+-------+ : : \ \| +-------+ \| \|
		772	---------->\| B->2 \|------>\| \|
		773	\| +-------+ \| CPU 2 \|
		774	\| : : \| \|
		775	\| : : \| \|
		776	At this point the read ----> \ rrrrrrrrrrrrrrrrr \| \|
		777	barrier causes all effects \ +-------+ \| \|
		778	prior to the storage of B ---->\| A->1 \|------>\| \|
		779	to be perceptible to CPU 2 +-------+ \| \|
		780	: : +-------+
		781
		782
		783	To illustrate this more completely, consider what could happen if the code
		784	contained a load of A either side of the read barrier:
		785
		786	CPU 1 CPU 2
		787	======================= =======================
		788	{ A = 0, B = 9 }
		789	STORE A=1
		790	<write barrier>
		791	STORE B=2
		792	LOAD B
		793	LOAD A [first load of A]
		794	<read barrier>
		795	LOAD A [second load of A]
		796
		797	Even though the two loads of A both occur after the load of B, they may both
		798	come up with different values:
		799
		800	+-------+ : : : :
		801	\| \| +------+ +-------+
		802	\| \|------>\| A=1 \|------ --->\| A->0 \|
		803	\| \| +------+ \ +-------+
		804	\| CPU 1 \| wwwwwwwwwwwwwwww \ --->\| B->9 \|
		805	\| \| +------+ \| +-------+
		806	\| \|------>\| B=2 \|--- \| : :
		807	\| \| +------+ \ \| : : +-------+
		808	+-------+ : : \ \| +-------+ \| \|
		809	---------->\| B->2 \|------>\| \|
		810	\| +-------+ \| CPU 2 \|
		811	\| : : \| \|
		812	\| : : \| \|
		813	\| +-------+ \| \|
		814	\| \| A->0 \|------>\| 1st \|
		815	\| +-------+ \| \|
		816	At this point the read ----> \ rrrrrrrrrrrrrrrrr \| \|
		817	barrier causes all effects \ +-------+ \| \|
		818	prior to the storage of B ---->\| A->1 \|------>\| 2nd \|
		819	to be perceptible to CPU 2 +-------+ \| \|
		820	: : +-------+
		821
		822
		823	But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
		824	before the read barrier completes anyway:
		825
		826	+-------+ : : : :
		827	\| \| +------+ +-------+
		828	\| \|------>\| A=1 \|------ --->\| A->0 \|
		829	\| \| +------+ \ +-------+
		830	\| CPU 1 \| wwwwwwwwwwwwwwww \ --->\| B->9 \|
		831	\| \| +------+ \| +-------+
		832	\| \|------>\| B=2 \|--- \| : :
		833	\| \| +------+ \ \| : : +-------+
		834	+-------+ : : \ \| +-------+ \| \|
		835	---------->\| B->2 \|------>\| \|
		836	\| +-------+ \| CPU 2 \|
		837	\| : : \| \|
		838	\ : : \| \|
		839	\ +-------+ \| \|
		840	---->\| A->1 \|------>\| 1st \|
		841	+-------+ \| \|
		842	rrrrrrrrrrrrrrrrr \| \|
		843	+-------+ \| \|
		844	\| A->1 \|------>\| 2nd \|
		845	+-------+ \| \|
		846	: : +-------+
		847
		848
		849	The guarantee is that the second load will always come up with A == 1 if the
		850	load of B came up with B == 2. No such guarantee exists for the first load of
		851	A; that may come up with either A == 0 or A == 1.
		852
		853
		854	READ MEMORY BARRIERS VS LOAD SPECULATION
		855	----------------------------------------
		856
		857	Many CPUs speculate with loads: that is they see that they will need to load an
		858	item from memory, and they find a time where they're not using the bus for any
		859	other loads, and so do the load in advance - even though they haven't actually
		860	got to that point in the instruction execution flow yet. This permits the
		861	actual load instruction to potentially complete immediately because the CPU
		862	already has the value to hand.
		863
		864	It may turn out that the CPU didn't actually need the value - perhaps because a
		865	branch circumvented the load - in which case it can discard the value or just
		866	cache it for later use.
		867
		868	Consider:
		869
		870	CPU 1 CPU 2
		871	======================= =======================
		872	LOAD B
		873	DIVIDE } Divide instructions generally
		874	DIVIDE } take a long time to perform
		875	LOAD A
		876
		877	Which might appear as this:
		878
		879	: : +-------+
		880	+-------+ \| \|
		881	--->\| B->2 \|------>\| \|
		882	+-------+ \| CPU 2 \|
		883	: :DIVIDE \| \|
		884	+-------+ \| \|
		885	The CPU being busy doing a ---> --->\| A->0 \|~~~~ \| \|
		886	division speculates on the +-------+ ~ \| \|
		887	LOAD of A : : ~ \| \|
		888	: :DIVIDE \| \|
		889	: : ~ \| \|
		890	Once the divisions are complete --> : : ~-->\| \|
		891	the CPU can then perform the : : \| \|
		892	LOAD with immediate effect : : +-------+
		893
		894
		895	Placing a read barrier or a data dependency barrier just before the second
		896	load:
		897
		898	CPU 1 CPU 2
		899	======================= =======================
		900	LOAD B
		901	DIVIDE
		902	DIVIDE
		903	<read barrier>
		904	LOAD A
		905
		906	will force any value speculatively obtained to be reconsidered to an extent
		907	dependent on the type of barrier used. If there was no change made to the
		908	speculated memory location, then the speculated value will just be used:
		909
		910	: : +-------+
		911	+-------+ \| \|
		912	--->\| B->2 \|------>\| \|
		913	+-------+ \| CPU 2 \|
		914	: :DIVIDE \| \|
		915	+-------+ \| \|
		916	The CPU being busy doing a ---> --->\| A->0 \|~~~~ \| \|
		917	division speculates on the +-------+ ~ \| \|
		918	LOAD of A : : ~ \| \|
		919	: :DIVIDE \| \|
		920	: : ~ \| \|
		921	: : ~ \| \|
		922	rrrrrrrrrrrrrrrr~ \| \|
		923	: : ~ \| \|
		924	: : ~-->\| \|
		925	: : \| \|
		926	: : +-------+
		927
		928
		929	but if there was an update or an invalidation from another CPU pending, then
		930	the speculation will be cancelled and the value reloaded:
		931
		932	: : +-------+
		933	+-------+ \| \|
		934	--->\| B->2 \|------>\| \|
		935	+-------+ \| CPU 2 \|
		936	: :DIVIDE \| \|
		937	+-------+ \| \|
		938	The CPU being busy doing a ---> --->\| A->0 \|~~~~ \| \|
		939	division speculates on the +-------+ ~ \| \|
		940	LOAD of A : : ~ \| \|
		941	: :DIVIDE \| \|
		942	: : ~ \| \|
		943	: : ~ \| \|
		944	rrrrrrrrrrrrrrrrr \| \|
		945	+-------+ \| \|
		946	The speculation is discarded ---> --->\| A->1 \|------>\| \|
		947	and an updated value is +-------+ \| \|
		948	retrieved : : +-------+
769		949
770		950
771	========================	951	========================
@@ -901,7 +1081,7 @@ IMPLICIT KERNEL MEMORY BARRIERS
901	===============================	1081	===============================
902		1082
903	Some of the other functions in the linux kernel imply memory barriers, amongst	1083	Some of the other functions in the linux kernel imply memory barriers, amongst
904	which are locking, scheduling and memory allocation functions.	1084	which are locking and scheduling functions.
905		1085
906	This specification is a _minimum_ guarantee; any particular architecture may	1086	This specification is a _minimum_ guarantee; any particular architecture may
907	provide more substantial guarantees, but these may not be relied upon outside	1087	provide more substantial guarantees, but these may not be relied upon outside
@@ -966,6 +1146,20 @@ equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
966	barriers is that the effects instructions outside of a critical section may	1146	barriers is that the effects instructions outside of a critical section may
967	seep into the inside of the critical section.	1147	seep into the inside of the critical section.
968		1148
		1149	A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
		1150	because it is possible for an access preceding the LOCK to happen after the
		1151	LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
		1152	two accesses can themselves then cross:
		1153
		1154	*A = a;
		1155	LOCK
		1156	UNLOCK
		1157	*B = b;
		1158
		1159	may occur as:
		1160
		1161	LOCK, STORE B, STORE A, UNLOCK
		1162
969	Locks and semaphores may not provide any guarantee of ordering on UP compiled	1163	Locks and semaphores may not provide any guarantee of ordering on UP compiled
970	systems, and so cannot be counted on in such a situation to actually achieve	1164	systems, and so cannot be counted on in such a situation to actually achieve
971	anything at all - especially with respect to I/O accesses - unless combined	1165	anything at all - especially with respect to I/O accesses - unless combined
@@ -1016,8 +1210,6 @@ Other functions that imply barriers:
1016		1210
1017	(*) schedule() and similar imply full memory barriers.	1211	(*) schedule() and similar imply full memory barriers.
1018		1212
1019	(*) Memory allocation and release functions imply full memory barriers.
1020
1021		1213
1022	=================================	1214	=================================
1023	INTER-CPU LOCKING BARRIER EFFECTS	1215	INTER-CPU LOCKING BARRIER EFFECTS