1 file changed, 116 insertions, 25 deletions
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 904ee42d078e..3729cbe60e41 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -232,7 +232,7 @@ And there are a number of things that _must_ or _must_not_ be assumed:
     with memory references that are not protected by READ_ONCE() and
     WRITE_ONCE().  Without them, the compiler is within its rights to
     do all sorts of "creative" transformations, which are covered in
-     the Compiler Barrier section.
+     the COMPILER BARRIER section.
 (*) It _must_not_ be assumed that independent loads and stores will be issued
     in the order given.  This means that for:
@@ -555,6 +555,30 @@ between the address load and the data load:
 This enforces the occurrence of one of the two implications, and prevents the
 third possibility from arising.
+A data-dependency barrier must also order against dependent writes:
+        CPU 1                 CPU 2
+        ===============       ===============
+        { A == 1, B == 2, C == 3, P == &A, Q == &C }
+        B = 4;
+        <write barrier>
+        WRITE_ONCE(P, &B);
+                              Q = READ_ONCE(P);
+                              <data dependency barrier>
+                              *Q = 5;
+The data-dependency barrier must order the read into Q with the store
+into *Q.  This prohibits the following outcome:
+        (Q == &B) && (B == 4)
+Please note that this pattern should be rare.  After all, the whole point
+of dependency ordering is to -prevent- writes to the data structure, along
+with the expensive cache misses associated with those writes.  This pattern
+can be used to record rare error conditions and the like, and the ordering
+prevents such records from being lost.
 [!] Note that this extremely counterintuitive situation arises most easily on
 machines with split caches, so that, for example, one cache bank processes
 even-numbered cache lines and the other bank processes odd-numbered cache
@@ -565,21 +589,6 @@ odd-numbered bank is idle, one can see the new value of the pointer P (&B),
 but the old value of the variable B (2).
-Another example of where data dependency barriers might be required is where a
-number is read from memory and then used to calculate the index for an array
-access:
-        CPU 1                 CPU 2
-        ===============       ===============
-        { M[0] == 1, M[1] == 2, M[3] = 3, P == 0, Q == 3 }
-        M[1] = 4;
-        <write barrier>
-        WRITE_ONCE(P, 1);
-                              Q = READ_ONCE(P);
-                              <data dependency barrier>
-                              D = M[Q];
 The data dependency barrier is very important to the RCU system,
 for example.  See rcu_assign_pointer() and rcu_dereference() in
 include/linux/rcupdate.h.  This permits the current target of an RCU'd
@@ -800,9 +809,13 @@ In summary:
      use smp_rmb(), smp_wmb(), or, in the case of prior stores and
      later loads, smp_mb().
-  (*) If both legs of the "if" statement begin with identical stores
-      to the same variable, a barrier() statement is required at the
-      beginning of each leg of the "if" statement.
+  (*) If both legs of the "if" statement begin with identical stores to
+      the same variable, then those stores must be ordered, either by
+      preceding both of them with smp_mb() or by using smp_store_release()
+      to carry out the stores.  Please note that it is -not- sufficient
+      to use barrier() at the beginning of each leg of the "if" statement,
+      as optimizing compilers do not necessarily respect barrier()
+      in this case.
  (*) Control dependencies require at least one run-time conditional
      between the prior load and the subsequent store, and this
@@ -814,7 +827,7 @@ In summary:
  (*) Control dependencies require that the compiler avoid reordering the
      dependency into nonexistence.  Careful use of READ_ONCE() or
      atomic{,64}_read() can help to preserve your control dependency.
-      Please see the Compiler Barrier section for more information.
+      Please see the COMPILER BARRIER section for more information.
  (*) Control dependencies pair normally with other types of barriers.
@@ -1257,7 +1270,7 @@ TRANSITIVITY
 Transitivity is a deeply intuitive notion about ordering that is not
 always provided by real computer systems.  The following example
-demonstrates transitivity (also called "cumulativity"):
+demonstrates transitivity:
        CPU 1                   CPU 2                   CPU 3
        ======================= ======================= =======================
@@ -1305,8 +1318,86 @@ or a level of cache, CPU 2 might have early access to CPU 1's writes.
 General barriers are therefore required to ensure that all CPUs agree
 on the combined order of CPU 1's and CPU 2's accesses.
-To reiterate, if your code requires transitivity, use general barriers
-throughout.
+General barriers provide "global transitivity", so that all CPUs will
+agree on the order of operations.  In contrast, a chain of release-acquire
+pairs provides only "local transitivity", so that only those CPUs on
+the chain are guaranteed to agree on the combined order of the accesses.
+For example, switching to C code in deference to Herman Hollerith:
+        int u, v, x, y, z;
+        void cpu0(void)
+        {
+                r0 = smp_load_acquire(&x);
+                WRITE_ONCE(u, 1);
+                smp_store_release(&y, 1);
+        }
+        void cpu1(void)
+        {
+                r1 = smp_load_acquire(&y);
+                r4 = READ_ONCE(v);
+                r5 = READ_ONCE(u);
+                smp_store_release(&z, 1);
+        }
+        void cpu2(void)
+        {
+                r2 = smp_load_acquire(&z);
+                smp_store_release(&x, 1);
+        }
+        void cpu3(void)
+        {
+                WRITE_ONCE(v, 1);
+                smp_mb();
+                r3 = READ_ONCE(u);
+        }
+Because cpu0(), cpu1(), and cpu2() participate in a local transitive
+chain of smp_store_release()/smp_load_acquire() pairs, the following
+outcome is prohibited:
+        r0 == 1 && r1 == 1 && r2 == 1
+Furthermore, because of the release-acquire relationship between cpu0()
+and cpu1(), cpu1() must see cpu0()'s writes, so that the following
+outcome is prohibited:
+        r1 == 1 && r5 == 0
+However, the transitivity of release-acquire is local to the participating
+CPUs and does not apply to cpu3().  Therefore, the following outcome
+is possible:
+        r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0
+As an aside, the following outcome is also possible:
+        r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 && r5 == 1
+Although cpu0(), cpu1(), and cpu2() will see their respective reads and
+writes in order, CPUs not involved in the release-acquire chain might
+well disagree on the order.  This disagreement stems from the fact that
+the weak memory-barrier instructions used to implement smp_load_acquire()
+and smp_store_release() are not required to order prior stores against
+subsequent loads in all cases.  This means that cpu3() can see cpu0()'s
+store to u as happening -after- cpu1()'s load from v, even though
+both cpu0() and cpu1() agree that these two operations occurred in the
+intended order.
+However, please keep in mind that smp_load_acquire() is not magic.
+In particular, it simply reads from its argument with ordering.  It does
+-not- ensure that any particular value will be read.  Therefore, the
+following outcome is possible:
+        r0 == 0 && r1 == 0 && r2 == 0 && r5 == 0
+Note that this outcome can happen even on a mythical sequentially
+consistent system where nothing is ever reordered.
+To reiterate, if your code requires global transitivity, use general
+barriers throughout.
 ========================
@@ -1459,7 +1550,7 @@ of optimizations:
     the following:
        a = 0;
-        /* Code that does not store to variable a. */
+        ... Code that does not store to variable a ...
        a = 0;
     The compiler sees that the value of variable 'a' is already zero, so
@@ -1471,7 +1562,7 @@ of optimizations:
     wrong guess:
        WRITE_ONCE(a, 0);
-        /* Code that does not store to variable a. */
+        ... Code that does not store to variable a ...
        WRITE_ONCE(a, 0);
 (*) The compiler is within its rights to reorder memory accesses unless
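
The release-acquire relationship between cpu0() and cpu1() in the
TRANSITIVITY hunk above (where r1 == 1 && r5 == 0 is prohibited) can be
approximated in userspace.  What follows is a minimal sketch, not kernel
code: it assumes that C11 atomics and POSIX threads are acceptable
stand-ins for smp_store_release(), smp_load_acquire(), WRITE_ONCE(), and
READ_ONCE().  The names writer(), reader(), struct result, and
count_prohibited() are hypothetical and appear nowhere in the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Userspace stand-ins for the variables u and y in the example. */
static atomic_int u, y;

struct result {
        int r1;
        int r5;
};

/* Mirrors the tail of cpu0(): store to u, then release-store to y. */
void *writer(void *arg)
{
        (void)arg;
        /* Assumed analogue of WRITE_ONCE(u, 1). */
        atomic_store_explicit(&u, 1, memory_order_relaxed);
        /* Assumed analogue of smp_store_release(&y, 1). */
        atomic_store_explicit(&y, 1, memory_order_release);
        return NULL;
}

/* Mirrors the head of cpu1(): acquire-load from y, then load from u. */
void *reader(void *arg)
{
        struct result *res = arg;

        /* Assumed analogue of smp_load_acquire(&y). */
        res->r1 = atomic_load_explicit(&y, memory_order_acquire);
        /* Assumed analogue of READ_ONCE(u). */
        res->r5 = atomic_load_explicit(&u, memory_order_relaxed);
        return NULL;
}

/*
 * Run the writer/reader pair the given number of times and return how
 * often the prohibited outcome (r1 == 1 && r5 == 0) was observed.
 */
int count_prohibited(int trials)
{
        int bad = 0;

        for (int i = 0; i < trials; i++) {
                pthread_t tw, tr;
                struct result res = { 0, 0 };
                int rc;

                atomic_store(&u, 0);
                atomic_store(&y, 0);

                rc = pthread_create(&tr, NULL, reader, &res);
                assert(rc == 0);
                rc = pthread_create(&tw, NULL, writer, NULL);
                assert(rc == 0);

                pthread_join(tw, NULL);
                pthread_join(tr, NULL);

                if (res.r1 == 1 && res.r5 == 0)
                        bad++;
        }
        return bad;
}
```

Calling count_prohibited() with any trial count should return zero,
because an acquire load that observes the release store also observes
the earlier store to u.  The outcome r1 == 0 remains entirely possible,
which matches the patch's point that smp_load_acquire() does not ensure
that any particular value will be read.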

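
The revised control-dependency summary item above (identical stores at
the start of both legs of an "if" statement) can be illustrated in the
same userspace style.  This is a sketch under the same assumptions: C11
release stores stand in for smp_store_release(), a relaxed load stands
in for READ_ONCE(), and ordered_legs() and usage_example() are
hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Userspace stand-ins for two shared variables. */
static atomic_int a, b;

/*
 * Both legs begin with the same store to b.  Each store is a release
 * store, so it stays ordered after the load from a even if the
 * compiler merges the two identical stores and hoists them above the
 * branch.  Returns 1 or 2 to indicate which leg ran.
 */
int ordered_legs(void)
{
        /* Assumed analogue of q = READ_ONCE(a). */
        int q = atomic_load_explicit(&a, memory_order_relaxed);

        if (q) {
                /* Assumed analogue of smp_store_release(&b, 1). */
                atomic_store_explicit(&b, 1, memory_order_release);
                return 1;
        } else {
                /* Same store, same ordering, in the other leg. */
                atomic_store_explicit(&b, 1, memory_order_release);
                return 2;
        }
}

/* Single-threaded usage: either leg leaves b == 1. */
void usage_example(void)
{
        int leg;

        atomic_store(&a, 0);
        atomic_store(&b, 0);
        leg = ordered_legs();
        assert(leg == 2);
        assert(atomic_load(&b) == 1);

        atomic_store(&a, 5);
        atomic_store(&b, 0);
        leg = ordered_legs();
        assert(leg == 1);
        assert(atomic_load(&b) == 1);
}
```

The single-threaded usage only confirms the functional behavior.  The
ordering property itself comes from the release semantics of the stores,
which is why the patch recommends smp_store_release() or a preceding
smp_mb() rather than barrier().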