diff options
Diffstat (limited to 'Documentation/memory-barriers.txt')
-rw-r--r-- | Documentation/memory-barriers.txt | 352 |
1 files changed, 272 insertions, 80 deletions
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt index 92f0056d928c..4710845dbac4 100644 --- a/Documentation/memory-barriers.txt +++ b/Documentation/memory-barriers.txt | |||
@@ -19,6 +19,7 @@ Contents: | |||
19 | - Control dependencies. | 19 | - Control dependencies. |
20 | - SMP barrier pairing. | 20 | - SMP barrier pairing. |
21 | - Examples of memory barrier sequences. | 21 | - Examples of memory barrier sequences. |
22 | - Read memory barriers vs load speculation. | ||
22 | 23 | ||
23 | (*) Explicit kernel barriers. | 24 | (*) Explicit kernel barriers. |
24 | 25 | ||
@@ -248,7 +249,7 @@ And there are a number of things that _must_ or _must_not_ be assumed: | |||
248 | we may get either of: | 249 | we may get either of: |
249 | 250 | ||
250 | STORE *A = X; Y = LOAD *A; | 251 | STORE *A = X; Y = LOAD *A; |
251 | STORE *A = Y; | 252 | STORE *A = Y = X; |
252 | 253 | ||
253 | 254 | ||
254 | ========================= | 255 | ========================= |
@@ -344,9 +345,12 @@ Memory barriers come in four basic varieties: | |||
344 | 345 | ||
345 | (4) General memory barriers. | 346 | (4) General memory barriers. |
346 | 347 | ||
347 | A general memory barrier is a combination of both a read memory barrier | 348 | A general memory barrier gives a guarantee that all the LOAD and STORE |
348 | and a write memory barrier. It is a partial ordering over both loads and | 349 | operations specified before the barrier will appear to happen before all |
349 | stores. | 350 | the LOAD and STORE operations specified after the barrier with respect to |
351 | the other components of the system. | ||
352 | |||
353 | A general memory barrier is a partial ordering over both loads and stores. | ||
350 | 354 | ||
351 | General memory barriers imply both read and write memory barriers, and so | 355 | General memory barriers imply both read and write memory barriers, and so |
352 | can substitute for either. | 356 | can substitute for either. |
@@ -546,9 +550,9 @@ write barrier, though, again, a general barrier is viable: | |||
546 | =============== =============== | 550 | =============== =============== |
547 | a = 1; | 551 | a = 1; |
548 | <write barrier> | 552 | <write barrier> |
549 | b = 2; x = a; | 553 | b = 2; x = b; |
550 | <read barrier> | 554 | <read barrier> |
551 | y = b; | 555 | y = a; |
552 | 556 | ||
553 | Or: | 557 | Or: |
554 | 558 | ||
@@ -563,6 +567,18 @@ Or: | |||
563 | Basically, the read barrier always has to be there, even though it can be of | 567 | Basically, the read barrier always has to be there, even though it can be of |
564 | the "weaker" type. | 568 | the "weaker" type. |
565 | 569 | ||
570 | [!] Note that the stores before the write barrier would normally be expected to | ||
571 | match the loads after the read barrier or data dependency barrier, and vice | ||
572 | versa: | ||
573 | |||
574 | CPU 1 CPU 2 | ||
575 | =============== =============== | ||
576 | a = 1; }---- --->{ v = c | ||
577 | b = 2; } \ / { w = d | ||
578 | <write barrier> \ <read barrier> | ||
579 | c = 3; } / \ { x = a; | ||
580 | d = 4; }---- --->{ y = b; | ||
581 | |||
566 | 582 | ||
567 | EXAMPLES OF MEMORY BARRIER SEQUENCES | 583 | EXAMPLES OF MEMORY BARRIER SEQUENCES |
568 | ------------------------------------ | 584 | ------------------------------------ |
@@ -600,8 +616,8 @@ STORE B, STORE C } all occuring before the unordered set of { STORE D, STORE E | |||
600 | | | +------+ | 616 | | | +------+ |
601 | +-------+ : : | 617 | +-------+ : : |
602 | | | 618 | | |
603 | | Sequence in which stores committed to memory system | 619 | | Sequence in which stores are committed to the |
604 | | by CPU 1 | 620 | | memory system by CPU 1 |
605 | V | 621 | V |
606 | 622 | ||
607 | 623 | ||
@@ -683,14 +699,12 @@ then the following will occur: | |||
683 | | : : | | | 699 | | : : | | |
684 | | : : | CPU 2 | | 700 | | : : | CPU 2 | |
685 | | +-------+ | | | 701 | | +-------+ | | |
686 | \ | X->9 |------>| | | 702 | | | X->9 |------>| | |
687 | \ +-------+ | | | 703 | | +-------+ | | |
688 | ----->| B->2 | | | | 704 | Makes sure all effects ---> \ ddddddddddddddddd | | |
689 | +-------+ | | | 705 | prior to the store of C \ +-------+ | | |
690 | Makes sure all effects ---> ddddddddddddddddd | | | 706 | are perceptible to ----->| B->2 |------>| | |
691 | prior to the store of C +-------+ | | | 707 | subsequent loads +-------+ | | |
692 | are perceptible to | B->2 |------>| | | ||
693 | successive loads +-------+ | | | ||
694 | : : +-------+ | 708 | : : +-------+ |
695 | 709 | ||
696 | 710 | ||
@@ -699,73 +713,239 @@ following sequence of events: | |||
699 | 713 | ||
700 | CPU 1 CPU 2 | 714 | CPU 1 CPU 2 |
701 | ======================= ======================= | 715 | ======================= ======================= |
716 | { A = 0, B = 9 } | ||
702 | STORE A=1 | 717 | STORE A=1 |
703 | STORE B=2 | ||
704 | STORE C=3 | ||
705 | <write barrier> | 718 | <write barrier> |
706 | STORE D=4 | 719 | STORE B=2 |
707 | STORE E=5 | ||
708 | LOAD A | ||
709 | LOAD B | 720 | LOAD B |
710 | LOAD C | 721 | LOAD A |
711 | LOAD D | ||
712 | LOAD E | ||
713 | 722 | ||
714 | Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in | 723 | Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in |
715 | some effectively random order, despite the write barrier issued by CPU 1: | 724 | some effectively random order, despite the write barrier issued by CPU 1: |
716 | 725 | ||
717 | +-------+ : : | 726 | +-------+ : : : : |
718 | | | +------+ | 727 | | | +------+ +-------+ |
719 | | |------>| C=3 | } | 728 | | |------>| A=1 |------ --->| A->0 | |
720 | | | : +------+ } | 729 | | | +------+ \ +-------+ |
721 | | | : | A=1 | } | 730 | | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | |
722 | | | : +------+ } | 731 | | | +------+ | +-------+ |
723 | | CPU 1 | : | B=2 | }--- | 732 | | |------>| B=2 |--- | : : |
724 | | | +------+ } \ | 733 | | | +------+ \ | : : +-------+ |
725 | | | wwwwwwwwwwwww} \ | 734 | +-------+ : : \ | +-------+ | | |
726 | | | +------+ } \ : : +-------+ | 735 | ---------->| B->2 |------>| | |
727 | | | : | E=5 | } \ +-------+ | | | 736 | | +-------+ | CPU 2 | |
728 | | | : +------+ } \ { | C->3 |------>| | | 737 | | | A->0 |------>| | |
729 | | |------>| D=4 | } \ { +-------+ : | | | 738 | | +-------+ | | |
730 | | | +------+ \ { | E->5 | : | | | 739 | | : : +-------+ |
731 | +-------+ : : \ { +-------+ : | | | 740 | \ : : |
732 | Transfer -->{ | A->1 | : | CPU 2 | | 741 | \ +-------+ |
733 | from CPU 1 { +-------+ : | | | 742 | ---->| A->1 | |
734 | to CPU 2 { | D->4 | : | | | 743 | +-------+ |
735 | { +-------+ : | | | 744 | : : |
736 | { | B->2 |------>| | | ||
737 | +-------+ | | | ||
738 | : : +-------+ | ||
739 | |||
740 | |||
741 | If, however, a read barrier were to be placed between the load of C and the | ||
742 | load of D on CPU 2, then the partial ordering imposed by CPU 1 will be | ||
743 | perceived correctly by CPU 2. | ||
744 | 745 | ||
745 | +-------+ : : | 746 | |
746 | | | +------+ | 747 | If, however, a read barrier were to be placed between the load of E and the |
747 | | |------>| C=3 | } | 748 | load of A on CPU 2: |
748 | | | : +------+ } | 749 | |
749 | | | : | A=1 | }--- | 750 | CPU 1 CPU 2 |
750 | | | : +------+ } \ | 751 | ======================= ======================= |
751 | | CPU 1 | : | B=2 | } \ | 752 | { A = 0, B = 9 } |
752 | | | +------+ \ | 753 | STORE A=1 |
753 | | | wwwwwwwwwwwwwwww \ | 754 | <write barrier> |
754 | | | +------+ \ : : +-------+ | 755 | STORE B=2 |
755 | | | : | E=5 | } \ +-------+ | | | 756 | LOAD B |
756 | | | : +------+ }--- \ { | C->3 |------>| | | 757 | <read barrier> |
757 | | |------>| D=4 | } \ \ { +-------+ : | | | 758 | LOAD A |
758 | | | +------+ \ -->{ | B->2 | : | | | 759 | |
759 | +-------+ : : \ { +-------+ : | | | 760 | then the partial ordering imposed by CPU 1 will be perceived correctly by CPU |
760 | \ { | A->1 | : | CPU 2 | | 761 | 2: |
761 | \ +-------+ | | | 762 | |
762 | At this point the read ----> \ rrrrrrrrrrrrrrrrr | | | 763 | +-------+ : : : : |
763 | barrier causes all effects \ +-------+ | | | 764 | | | +------+ +-------+ |
764 | prior to the storage of C \ { | E->5 | : | | | 765 | | |------>| A=1 |------ --->| A->0 | |
765 | to be perceptible to CPU 2 -->{ +-------+ : | | | 766 | | | +------+ \ +-------+ |
766 | { | D->4 |------>| | | 767 | | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | |
767 | +-------+ | | | 768 | | | +------+ | +-------+ |
768 | : : +-------+ | 769 | | |------>| B=2 |--- | : : |
770 | | | +------+ \ | : : +-------+ | ||
771 | +-------+ : : \ | +-------+ | | | ||
772 | ---------->| B->2 |------>| | | ||
773 | | +-------+ | CPU 2 | | ||
774 | | : : | | | ||
775 | | : : | | | ||
776 | At this point the read ----> \ rrrrrrrrrrrrrrrrr | | | ||
777 | barrier causes all effects \ +-------+ | | | ||
778 | prior to the storage of B ---->| A->1 |------>| | | ||
779 | to be perceptible to CPU 2 +-------+ | | | ||
780 | : : +-------+ | ||
781 | |||
782 | |||
783 | To illustrate this more completely, consider what could happen if the code | ||
784 | contained a load of A either side of the read barrier: | ||
785 | |||
786 | CPU 1 CPU 2 | ||
787 | ======================= ======================= | ||
788 | { A = 0, B = 9 } | ||
789 | STORE A=1 | ||
790 | <write barrier> | ||
791 | STORE B=2 | ||
792 | LOAD B | ||
793 | LOAD A [first load of A] | ||
794 | <read barrier> | ||
795 | LOAD A [second load of A] | ||
796 | |||
797 | Even though the two loads of A both occur after the load of B, they may both | ||
798 | come up with different values: | ||
799 | |||
800 | +-------+ : : : : | ||
801 | | | +------+ +-------+ | ||
802 | | |------>| A=1 |------ --->| A->0 | | ||
803 | | | +------+ \ +-------+ | ||
804 | | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | | ||
805 | | | +------+ | +-------+ | ||
806 | | |------>| B=2 |--- | : : | ||
807 | | | +------+ \ | : : +-------+ | ||
808 | +-------+ : : \ | +-------+ | | | ||
809 | ---------->| B->2 |------>| | | ||
810 | | +-------+ | CPU 2 | | ||
811 | | : : | | | ||
812 | | : : | | | ||
813 | | +-------+ | | | ||
814 | | | A->0 |------>| 1st | | ||
815 | | +-------+ | | | ||
816 | At this point the read ----> \ rrrrrrrrrrrrrrrrr | | | ||
817 | barrier causes all effects \ +-------+ | | | ||
818 | prior to the storage of B ---->| A->1 |------>| 2nd | | ||
819 | to be perceptible to CPU 2 +-------+ | | | ||
820 | : : +-------+ | ||
821 | |||
822 | |||
823 | But it may be that the update to A from CPU 1 becomes perceptible to CPU 2 | ||
824 | before the read barrier completes anyway: | ||
825 | |||
826 | +-------+ : : : : | ||
827 | | | +------+ +-------+ | ||
828 | | |------>| A=1 |------ --->| A->0 | | ||
829 | | | +------+ \ +-------+ | ||
830 | | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 | | ||
831 | | | +------+ | +-------+ | ||
832 | | |------>| B=2 |--- | : : | ||
833 | | | +------+ \ | : : +-------+ | ||
834 | +-------+ : : \ | +-------+ | | | ||
835 | ---------->| B->2 |------>| | | ||
836 | | +-------+ | CPU 2 | | ||
837 | | : : | | | ||
838 | \ : : | | | ||
839 | \ +-------+ | | | ||
840 | ---->| A->1 |------>| 1st | | ||
841 | +-------+ | | | ||
842 | rrrrrrrrrrrrrrrrr | | | ||
843 | +-------+ | | | ||
844 | | A->1 |------>| 2nd | | ||
845 | +-------+ | | | ||
846 | : : +-------+ | ||
847 | |||
848 | |||
849 | The guarantee is that the second load will always come up with A == 1 if the | ||
850 | load of B came up with B == 2. No such guarantee exists for the first load of | ||
851 | A; that may come up with either A == 0 or A == 1. | ||
852 | |||
853 | |||
854 | READ MEMORY BARRIERS VS LOAD SPECULATION | ||
855 | ---------------------------------------- | ||
856 | |||
857 | Many CPUs speculate with loads: that is they see that they will need to load an | ||
858 | item from memory, and they find a time where they're not using the bus for any | ||
859 | other loads, and so do the load in advance - even though they haven't actually | ||
860 | got to that point in the instruction execution flow yet. This permits the | ||
861 | actual load instruction to potentially complete immediately because the CPU | ||
862 | already has the value to hand. | ||
863 | |||
864 | It may turn out that the CPU didn't actually need the value - perhaps because a | ||
865 | branch circumvented the load - in which case it can discard the value or just | ||
866 | cache it for later use. | ||
867 | |||
868 | Consider: | ||
869 | |||
870 | CPU 1 CPU 2 | ||
871 | ======================= ======================= | ||
872 | LOAD B | ||
873 | DIVIDE } Divide instructions generally | ||
874 | DIVIDE } take a long time to perform | ||
875 | LOAD A | ||
876 | |||
877 | Which might appear as this: | ||
878 | |||
879 | : : +-------+ | ||
880 | +-------+ | | | ||
881 | --->| B->2 |------>| | | ||
882 | +-------+ | CPU 2 | | ||
883 | : :DIVIDE | | | ||
884 | +-------+ | | | ||
885 | The CPU being busy doing a ---> --->| A->0 |~~~~ | | | ||
886 | division speculates on the +-------+ ~ | | | ||
887 | LOAD of A : : ~ | | | ||
888 | : :DIVIDE | | | ||
889 | : : ~ | | | ||
890 | Once the divisions are complete --> : : ~-->| | | ||
891 | the CPU can then perform the : : | | | ||
892 | LOAD with immediate effect : : +-------+ | ||
893 | |||
894 | |||
895 | Placing a read barrier or a data dependency barrier just before the second | ||
896 | load: | ||
897 | |||
898 | CPU 1 CPU 2 | ||
899 | ======================= ======================= | ||
900 | LOAD B | ||
901 | DIVIDE | ||
902 | DIVIDE | ||
903 | <read barrier> | ||
904 | LOAD A | ||
905 | |||
906 | will force any value speculatively obtained to be reconsidered to an extent | ||
907 | dependent on the type of barrier used. If there was no change made to the | ||
908 | speculated memory location, then the speculated value will just be used: | ||
909 | |||
910 | : : +-------+ | ||
911 | +-------+ | | | ||
912 | --->| B->2 |------>| | | ||
913 | +-------+ | CPU 2 | | ||
914 | : :DIVIDE | | | ||
915 | +-------+ | | | ||
916 | The CPU being busy doing a ---> --->| A->0 |~~~~ | | | ||
917 | division speculates on the +-------+ ~ | | | ||
918 | LOAD of A : : ~ | | | ||
919 | : :DIVIDE | | | ||
920 | : : ~ | | | ||
921 | : : ~ | | | ||
922 | rrrrrrrrrrrrrrrr~ | | | ||
923 | : : ~ | | | ||
924 | : : ~-->| | | ||
925 | : : | | | ||
926 | : : +-------+ | ||
927 | |||
928 | |||
929 | but if there was an update or an invalidation from another CPU pending, then | ||
930 | the speculation will be cancelled and the value reloaded: | ||
931 | |||
932 | : : +-------+ | ||
933 | +-------+ | | | ||
934 | --->| B->2 |------>| | | ||
935 | +-------+ | CPU 2 | | ||
936 | : :DIVIDE | | | ||
937 | +-------+ | | | ||
938 | The CPU being busy doing a ---> --->| A->0 |~~~~ | | | ||
939 | division speculates on the +-------+ ~ | | | ||
940 | LOAD of A : : ~ | | | ||
941 | : :DIVIDE | | | ||
942 | : : ~ | | | ||
943 | : : ~ | | | ||
944 | rrrrrrrrrrrrrrrrr | | | ||
945 | +-------+ | | | ||
946 | The speculation is discarded ---> --->| A->1 |------>| | | ||
947 | and an updated value is +-------+ | | | ||
948 | retrieved : : +-------+ | ||
769 | 949 | ||
770 | 950 | ||
771 | ======================== | 951 | ======================== |
@@ -901,7 +1081,7 @@ IMPLICIT KERNEL MEMORY BARRIERS | |||
901 | =============================== | 1081 | =============================== |
902 | 1082 | ||
903 | Some of the other functions in the linux kernel imply memory barriers, amongst | 1083 | Some of the other functions in the linux kernel imply memory barriers, amongst |
904 | which are locking, scheduling and memory allocation functions. | 1084 | which are locking and scheduling functions. |
905 | 1085 | ||
906 | This specification is a _minimum_ guarantee; any particular architecture may | 1086 | This specification is a _minimum_ guarantee; any particular architecture may |
907 | provide more substantial guarantees, but these may not be relied upon outside | 1087 | provide more substantial guarantees, but these may not be relied upon outside |
@@ -966,6 +1146,20 @@ equivalent to a full barrier, but a LOCK followed by an UNLOCK is not. | |||
966 | barriers is that the effects instructions outside of a critical section may | 1146 | barriers is that the effects instructions outside of a critical section may |
967 | seep into the inside of the critical section. | 1147 | seep into the inside of the critical section. |
968 | 1148 | ||
1149 | A LOCK followed by an UNLOCK may not be assumed to be full memory barrier | ||
1150 | because it is possible for an access preceding the LOCK to happen after the | ||
1151 | LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the | ||
1152 | two accesses can themselves then cross: | ||
1153 | |||
1154 | *A = a; | ||
1155 | LOCK | ||
1156 | UNLOCK | ||
1157 | *B = b; | ||
1158 | |||
1159 | may occur as: | ||
1160 | |||
1161 | LOCK, STORE *B, STORE *A, UNLOCK | ||
1162 | |||
969 | Locks and semaphores may not provide any guarantee of ordering on UP compiled | 1163 | Locks and semaphores may not provide any guarantee of ordering on UP compiled |
970 | systems, and so cannot be counted on in such a situation to actually achieve | 1164 | systems, and so cannot be counted on in such a situation to actually achieve |
971 | anything at all - especially with respect to I/O accesses - unless combined | 1165 | anything at all - especially with respect to I/O accesses - unless combined |
@@ -1016,8 +1210,6 @@ Other functions that imply barriers: | |||
1016 | 1210 | ||
1017 | (*) schedule() and similar imply full memory barriers. | 1211 | (*) schedule() and similar imply full memory barriers. |
1018 | 1212 | ||
1019 | (*) Memory allocation and release functions imply full memory barriers. | ||
1020 | |||
1021 | 1213 | ||
1022 | ================================= | 1214 | ================================= |
1023 | INTER-CPU LOCKING BARRIER EFFECTS | 1215 | INTER-CPU LOCKING BARRIER EFFECTS |
@@ -1031,7 +1223,7 @@ conflict on any particular lock. | |||
1031 | LOCKS VS MEMORY ACCESSES | 1223 | LOCKS VS MEMORY ACCESSES |
1032 | ------------------------ | 1224 | ------------------------ |
1033 | 1225 | ||
1034 | Consider the following: the system has a pair of spinlocks (N) and (Q), and | 1226 | Consider the following: the system has a pair of spinlocks (M) and (Q), and |
1035 | three CPUs; then should the following sequence of events occur: | 1227 | three CPUs; then should the following sequence of events occur: |
1036 | 1228 | ||
1037 | CPU 1 CPU 2 | 1229 | CPU 1 CPU 2 |
@@ -1678,7 +1870,7 @@ CPU's caches by some other cache event: | |||
1678 | smp_wmb(); | 1870 | smp_wmb(); |
1679 | <A:modify v=2> <C:busy> | 1871 | <A:modify v=2> <C:busy> |
1680 | <C:queue v=2> | 1872 | <C:queue v=2> |
1681 | p = &b; q = p; | 1873 | p = &v; q = p; |
1682 | <D:request p> | 1874 | <D:request p> |
1683 | <B:modify p=&v> <D:commit p=&v> | 1875 | <B:modify p=&v> <D:commit p=&v> |
1684 | <D:read p> | 1876 | <D:read p> |