diff options
author | Paul E. McKenney <paulmck@linux.vnet.ibm.com> | 2017-08-29 18:49:21 -0400 |
---|---|---|
committer | Paul E. McKenney <paulmck@linux.vnet.ibm.com> | 2017-10-09 17:23:36 -0400 |
commit | f1ab25a30ce81f4e9be3cb33cd9bb9fb2db64b28 (patch) | |
tree | 5bb7cb22d9e51dba5acd7b64bd2c00b015bf0149 | |
parent | d3cf5176d0b17610b7c7f1562e496ec401fe01f8 (diff) |
memory-barriers: Replace uses of "transitive"
The current version of memory-barriers.txt misuses the term "transitive",
so this commit replaces it with multi-copy atomic, also adding a
definition of this term.
Reported-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
-rw-r--r-- | Documentation/memory-barriers.txt | 185 |
1 files changed, 91 insertions, 94 deletions
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index b759a60624fd..b6882680247e 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt | |||
@@ -53,7 +53,7 @@ CONTENTS | |||
53 | - SMP barrier pairing. | 53 | - SMP barrier pairing. |
54 | - Examples of memory barrier sequences. | 54 | - Examples of memory barrier sequences. |
55 | - Read memory barriers vs load speculation. | 55 | - Read memory barriers vs load speculation. |
56 | - Transitivity | 56 | - Multicopy atomicity. |
57 | 57 | ||
58 | (*) Explicit kernel barriers. | 58 | (*) Explicit kernel barriers. |
59 | 59 | ||
@@ -635,6 +635,11 @@ can be used to record rare error conditions and the like, and the CPUs' | |||
635 | naturally occurring ordering prevents such records from being lost. | 635 | naturally occurring ordering prevents such records from being lost. |
636 | 636 | ||
637 | 637 | ||
638 | Note well that the ordering provided by a data dependency is local to | ||
639 | the CPU containing it. See the section on "Multicopy atomicity" for | ||
640 | more information. | ||
641 | |||
642 | |||
638 | The data dependency barrier is very important to the RCU system, | 643 | The data dependency barrier is very important to the RCU system, |
639 | for example. See rcu_assign_pointer() and rcu_dereference() in | 644 | for example. See rcu_assign_pointer() and rcu_dereference() in |
640 | include/linux/rcupdate.h. This permits the current target of an RCU'd | 645 | include/linux/rcupdate.h. This permits the current target of an RCU'd |
@@ -851,38 +856,11 @@ In short, control dependencies apply only to the stores in the then-clause | |||
851 | and else-clause of the if-statement in question (including functions | 856 | and else-clause of the if-statement in question (including functions |
852 | invoked by those two clauses), not to code following that if-statement. | 857 | invoked by those two clauses), not to code following that if-statement. |
853 | 858 | ||
854 | Finally, control dependencies do -not- provide transitivity. This is | ||
855 | demonstrated by two related examples, with the initial values of | ||
856 | 'x' and 'y' both being zero: | ||
857 | |||
858 | CPU 0 CPU 1 | ||
859 | ======================= ======================= | ||
860 | r1 = READ_ONCE(x); r2 = READ_ONCE(y); | ||
861 | if (r1 > 0) if (r2 > 0) | ||
862 | WRITE_ONCE(y, 1); WRITE_ONCE(x, 1); | ||
863 | |||
864 | assert(!(r1 == 1 && r2 == 1)); | ||
865 | |||
866 | The above two-CPU example will never trigger the assert(). However, | ||
867 | if control dependencies guaranteed transitivity (which they do not), | ||
868 | then adding the following CPU would guarantee a related assertion: | ||
869 | |||
870 | CPU 2 | ||
871 | ===================== | ||
872 | WRITE_ONCE(x, 2); | ||
873 | |||
874 | assert(!(r1 == 2 && r2 == 1 && x == 2)); /* FAILS!!! */ | ||
875 | 859 | ||
876 | But because control dependencies do -not- provide transitivity, the above | 860 | Note well that the ordering provided by a control dependency is local |
877 | assertion can fail after the combined three-CPU example completes. If you | 861 | to the CPU containing it. See the section on "Multicopy atomicity" |
878 | need the three-CPU example to provide ordering, you will need smp_mb() | 862 | for more information. |
879 | between the loads and stores in the CPU 0 and CPU 1 code fragments, | ||
880 | that is, just before or just after the "if" statements. Furthermore, | ||
881 | the original two-CPU example is very fragile and should be avoided. | ||
882 | 863 | ||
883 | These two examples are the LB and WWC litmus tests from this paper: | ||
884 | http://www.cl.cam.ac.uk/users/pes20/ppc-supplemental/test6.pdf and this | ||
885 | site: https://www.cl.cam.ac.uk/~pes20/ppcmem/index.html. | ||
886 | 864 | ||
887 | In summary: | 865 | In summary: |
888 | 866 | ||
@@ -922,8 +900,8 @@ In summary: | |||
922 | 900 | ||
923 | (*) Control dependencies pair normally with other types of barriers. | 901 | (*) Control dependencies pair normally with other types of barriers. |
924 | 902 | ||
925 | (*) Control dependencies do -not- provide transitivity. If you | 903 | (*) Control dependencies do -not- provide multicopy atomicity. If you |
926 | need transitivity, use smp_mb(). | 904 | need all the CPUs to see a given store at the same time, use smp_mb(). |
927 | 905 | ||
928 | (*) Compilers do not understand control dependencies. It is therefore | 906 | (*) Compilers do not understand control dependencies. It is therefore |
929 | your job to ensure that they do not break your code. | 907 | your job to ensure that they do not break your code. |
@@ -936,13 +914,14 @@ When dealing with CPU-CPU interactions, certain types of memory barrier should | |||
936 | always be paired. A lack of appropriate pairing is almost certainly an error. | 914 | always be paired. A lack of appropriate pairing is almost certainly an error. |
937 | 915 | ||
938 | General barriers pair with each other, though they also pair with most | 916 | General barriers pair with each other, though they also pair with most |
939 | other types of barriers, albeit without transitivity. An acquire barrier | 917 | other types of barriers, albeit without multicopy atomicity. An acquire |
940 | pairs with a release barrier, but both may also pair with other barriers, | 918 | barrier pairs with a release barrier, but both may also pair with other |
941 | including of course general barriers. A write barrier pairs with a data | 919 | barriers, including of course general barriers. A write barrier pairs |
942 | dependency barrier, a control dependency, an acquire barrier, a release | 920 | with a data dependency barrier, a control dependency, an acquire barrier, |
943 | barrier, a read barrier, or a general barrier. Similarly a read barrier, | 921 | a release barrier, a read barrier, or a general barrier. Similarly a |
944 | control dependency, or a data dependency barrier pairs with a write | 922 | read barrier, control dependency, or a data dependency barrier pairs |
945 | barrier, an acquire barrier, a release barrier, or a general barrier: | 923 | with a write barrier, an acquire barrier, a release barrier, or a |
924 | general barrier: | ||
946 | 925 | ||
947 | CPU 1 CPU 2 | 926 | CPU 1 CPU 2 |
948 | =============== =============== | 927 | =============== =============== |
@@ -1359,64 +1338,77 @@ the speculation will be cancelled and the value reloaded: | |||
1359 | retrieved : : +-------+ | 1338 | retrieved : : +-------+ |
1360 | 1339 | ||
1361 | 1340 | ||
1362 | TRANSITIVITY | 1341 | MULTICOPY ATOMICITY |
1363 | ------------ | 1342 | -------------------- |
1343 | |||
1344 | Multicopy atomicity is a deeply intuitive notion about ordering that is | ||
1345 | not always provided by real computer systems, namely that a given store | ||
1346 | is visible at the same time to all CPUs, or, alternatively, that all | ||
1347 | CPUs agree on the order in which all stores took place. However, use of | ||
1348 | full multicopy atomicity would rule out valuable hardware optimizations, | ||
1349 | so a weaker form called ``other multicopy atomicity'' instead guarantees | ||
1350 | that a given store is observed at the same time by all -other- CPUs. The | ||
1351 | remainder of this document discusses this weaker form, but for brevity | ||
1352 | will call it simply ``multicopy atomicity''. | ||
1364 | 1353 | ||
1365 | Transitivity is a deeply intuitive notion about ordering that is not | 1354 | The following example demonstrates multicopy atomicity: |
1366 | always provided by real computer systems. The following example | ||
1367 | demonstrates transitivity: | ||
1368 | 1355 | ||
1369 | CPU 1 CPU 2 CPU 3 | 1356 | CPU 1 CPU 2 CPU 3 |
1370 | ======================= ======================= ======================= | 1357 | ======================= ======================= ======================= |
1371 | { X = 0, Y = 0 } | 1358 | { X = 0, Y = 0 } |
1372 | STORE X=1 LOAD X STORE Y=1 | 1359 | STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1) |
1373 | <general barrier> <general barrier> | 1360 | <general barrier> <read barrier> |
1374 | LOAD Y LOAD X | 1361 | STORE Y=r1 LOAD X |
1375 | 1362 | ||
1376 | Suppose that CPU 2's load from X returns 1 and its load from Y returns 0. | 1363 | Suppose that CPU 2's load from X returns 1 which it then stores to Y and |
1377 | This indicates that CPU 2's load from X in some sense follows CPU 1's | 1364 | that CPU 3's load from Y returns 1. This indicates that CPU 2's load |
1378 | store to X and that CPU 2's load from Y in some sense preceded CPU 3's | 1365 | from X in some sense follows CPU 1's store to X and that CPU 2's store |
1379 | store to Y. The question is then "Can CPU 3's load from X return 0?" | 1366 | to Y in some sense preceded CPU 3's load from Y. The question is then |
1367 | "Can CPU 3's load from X return 0?" | ||
1380 | 1368 | ||
1381 | Because CPU 2's load from X in some sense came after CPU 1's store, it | 1369 | Because CPU 3's load from X in some sense came after CPU 2's load, it |
1382 | is natural to expect that CPU 3's load from X must therefore return 1. | 1370 | is natural to expect that CPU 3's load from X must therefore return 1. |
1383 | This expectation is an example of transitivity: if a load executing on | 1371 | This expectation is an example of multicopy atomicity: if a load executing |
1384 | CPU A follows a load from the same variable executing on CPU B, then | 1372 | on CPU A follows a load from the same variable executing on CPU B, then |
1385 | CPU A's load must either return the same value that CPU B's load did, | 1373 | an understandable but incorrect expectation is that CPU A's load must |
1386 | or must return some later value. | 1374 | either return the same value that CPU B's load did, or must return some |
1387 | 1375 | later value. | |
1388 | In the Linux kernel, use of general memory barriers guarantees | 1376 | |
1389 | transitivity. Therefore, in the above example, if CPU 2's load from X | 1377 | In the Linux kernel, the above use of a general memory barrier compensates |
1390 | returns 1 and its load from Y returns 0, then CPU 3's load from X must | 1378 | for any lack of multicopy atomicity. Therefore, in the above example, |
1391 | also return 1. | 1379 | if CPU 2's load from X returns 1 and its load from Y returns 0, and CPU 3's |
1392 | 1380 | load from Y returns 1, then CPU 3's load from X must also return 1. | |
1393 | However, transitivity is -not- guaranteed for read or write barriers. | 1381 | |
1394 | For example, suppose that CPU 2's general barrier in the above example | 1382 | However, dependencies, read barriers, and write barriers are not always |
1395 | is changed to a read barrier as shown below: | 1383 | able to compensate for non-multicopy atomicity. For example, suppose |
1384 | that CPU 2's general barrier is removed from the above example, leaving | ||
1385 | only the data dependency shown below: | ||
1396 | 1386 | ||
1397 | CPU 1 CPU 2 CPU 3 | 1387 | CPU 1 CPU 2 CPU 3 |
1398 | ======================= ======================= ======================= | 1388 | ======================= ======================= ======================= |
1399 | { X = 0, Y = 0 } | 1389 | { X = 0, Y = 0 } |
1400 | STORE X=1 LOAD X STORE Y=1 | 1390 | STORE X=1 r1=LOAD X (reads 1) LOAD Y (reads 1) |
1401 | <read barrier> <general barrier> | 1391 | <data dependency> <read barrier> |
1402 | LOAD Y LOAD X | 1392 | STORE Y=r1 LOAD X (reads 0) |
1403 | 1393 | ||
1404 | This substitution destroys transitivity: in this example, it is perfectly | 1394 | This substitution allows non-multicopy atomicity to run rampant: in |
1405 | legal for CPU 2's load from X to return 1, its load from Y to return 0, | 1395 | this example, it is perfectly legal for CPU 2's load from X to return 1, |
1406 | and CPU 3's load from X to return 0. | 1396 | CPU 3's load from Y to return 1, and its load from X to return 0. |
1407 | 1397 | ||
1408 | The key point is that although CPU 2's read barrier orders its pair | 1398 | The key point is that although CPU 2's data dependency orders its load |
1409 | of loads, it does not guarantee to order CPU 1's store. Therefore, if | 1399 | and store, it does not guarantee to order CPU 1's store. Therefore, |
1410 | this example runs on a system where CPUs 1 and 2 share a store buffer | 1400 | if this example runs on a non-multicopy-atomic system where CPUs 1 and 2 |
1411 | or a level of cache, CPU 2 might have early access to CPU 1's writes. | 1401 | share a store buffer or a level of cache, CPU 2 might have early access |
1412 | General barriers are therefore required to ensure that all CPUs agree | 1402 | to CPU 1's writes. A general barrier is therefore required to ensure |
1413 | on the combined order of CPU 1's and CPU 2's accesses. | 1403 | that all CPUs agree on the combined order of CPU 1's and CPU 2's accesses. |
1414 | 1404 | ||
1415 | General barriers provide "global transitivity", so that all CPUs will | 1405 | General barriers can compensate not only for non-multicopy atomicity, |
1416 | agree on the order of operations. In contrast, a chain of release-acquire | 1406 | but can also generate additional ordering that can ensure that -all- |
1417 | pairs provides only "local transitivity", so that only those CPUs on | 1407 | CPUs will perceive the same order of -all- operations. In contrast, a |
1418 | the chain are guaranteed to agree on the combined order of the accesses. | 1408 | chain of release-acquire pairs do not provide this additional ordering, |
1419 | For example, switching to C code in deference to Herman Hollerith: | 1409 | which means that only those CPUs on the chain are guaranteed to agree |
1410 | on the combined order of the accesses. For example, switching to C code | ||
1411 | in deference to the ghost of Herman Hollerith: | ||
1420 | 1412 | ||
1421 | int u, v, x, y, z; | 1413 | int u, v, x, y, z; |
1422 | 1414 | ||
@@ -1448,9 +1440,9 @@ For example, switching to C code in deference to Herman Hollerith: | |||
1448 | r3 = READ_ONCE(u); | 1440 | r3 = READ_ONCE(u); |
1449 | } | 1441 | } |
1450 | 1442 | ||
1451 | Because cpu0(), cpu1(), and cpu2() participate in a local transitive | 1443 | Because cpu0(), cpu1(), and cpu2() participate in a chain of |
1452 | chain of smp_store_release()/smp_load_acquire() pairs, the following | 1444 | smp_store_release()/smp_load_acquire() pairs, the following outcome |
1453 | outcome is prohibited: | 1445 | is prohibited: |
1454 | 1446 | ||
1455 | r0 == 1 && r1 == 1 && r2 == 1 | 1447 | r0 == 1 && r1 == 1 && r2 == 1 |
1456 | 1448 | ||
@@ -1460,9 +1452,9 @@ outcome is prohibited: | |||
1460 | 1452 | ||
1461 | r1 == 1 && r5 == 0 | 1453 | r1 == 1 && r5 == 0 |
1462 | 1454 | ||
1463 | However, the transitivity of release-acquire is local to the participating | 1455 | However, the ordering provided by a release-acquire chain is local |
1464 | CPUs and does not apply to cpu3(). Therefore, the following outcome | 1456 | to the CPUs participating in that chain and does not apply to cpu3(), |
1465 | is possible: | 1457 | at least aside from stores. Therefore, the following outcome is possible: |
1466 | 1458 | ||
1467 | r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 | 1459 | r0 == 0 && r1 == 1 && r2 == 1 && r3 == 0 && r4 == 0 |
1468 | 1460 | ||
@@ -1490,8 +1482,8 @@ following outcome is possible: | |||
1490 | Note that this outcome can happen even on a mythical sequentially | 1482 | Note that this outcome can happen even on a mythical sequentially |
1491 | consistent system where nothing is ever reordered. | 1483 | consistent system where nothing is ever reordered. |
1492 | 1484 | ||
1493 | To reiterate, if your code requires global transitivity, use general | 1485 | To reiterate, if your code requires full ordering of all operations, |
1494 | barriers throughout. | 1486 | use general barriers throughout. |
1495 | 1487 | ||
1496 | 1488 | ||
1497 | ======================== | 1489 | ======================== |
@@ -3101,6 +3093,9 @@ AMD64 Architecture Programmer's Manual Volume 2: System Programming | |||
3101 | Chapter 7.1: Memory-Access Ordering | 3093 | Chapter 7.1: Memory-Access Ordering |
3102 | Chapter 7.4: Buffering and Combining Memory Writes | 3094 | Chapter 7.4: Buffering and Combining Memory Writes |
3103 | 3095 | ||
3096 | ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile) | ||
3097 | Chapter B2: The AArch64 Application Level Memory Model | ||
3098 | |||
3104 | IA-32 Intel Architecture Software Developer's Manual, Volume 3: | 3099 | IA-32 Intel Architecture Software Developer's Manual, Volume 3: |
3105 | System Programming Guide | 3100 | System Programming Guide |
3106 | Chapter 7.1: Locked Atomic Operations | 3101 | Chapter 7.1: Locked Atomic Operations |
@@ -3112,6 +3107,8 @@ The SPARC Architecture Manual, Version 9 | |||
3112 | Appendix D: Formal Specification of the Memory Models | 3107 | Appendix D: Formal Specification of the Memory Models |
3113 | Appendix J: Programming with the Memory Models | 3108 | Appendix J: Programming with the Memory Models |
3114 | 3109 | ||
3110 | Storage in the PowerPC (Stone and Fitzgerald) | ||
3111 | |||
3115 | UltraSPARC Programmer Reference Manual | 3112 | UltraSPARC Programmer Reference Manual |
3116 | Chapter 5: Memory Accesses and Cacheability | 3113 | Chapter 5: Memory Accesses and Cacheability |
3117 | Chapter 15: Sparc-V9 Memory Models | 3114 | Chapter 15: Sparc-V9 Memory Models |