author     Linus Torvalds <torvalds@linux-foundation.org>   2018-04-02 13:27:16 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>   2018-04-02 13:27:16 -0400
commit     701f3b314905ac05f09fc052c87b022825d831f2 (patch)
tree       630065bc1c4f046029a1f3398e049e0831a33035 /tools
parent     8747a29173c6eb6f4b3e8d3b3bcabc0fa132678a (diff)
parent     19193bcad8dced863f2f720b1a76110bda07c970 (diff)
Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
"The main changes in the locking subsystem in this cycle were:
- Add the Linux Kernel Memory Consistency Model (LKMM) subsystem,
which is an array of tools in tools/memory-model/ that formally
describe the Linux memory coherency model (a.k.a.
Documentation/memory-barriers.txt), and also produce 'litmus tests'
in the form of kernel code which can be directly executed and tested.
Here's a high level background article about an earlier version of
this work on LWN.net:
https://lwn.net/Articles/718628/
The design principles:
"There is reason to believe that Documentation/memory-barriers.txt
could use some help, and a major purpose of this patch is to
provide that help in the form of a design-time tool that can
produce all valid executions of a small fragment of concurrent
Linux-kernel code, which is called a "litmus test". This tool's
functionality is roughly similar to a full state-space search.
Please note that this is a design-time tool, not useful for
regression testing. However, we hope that the underlying
Linux-kernel memory model will be incorporated into other tools
capable of analyzing large bodies of code for regression-testing
purposes."
[...]
"A second tool is klitmus7, which converts litmus tests to
loadable kernel modules for direct testing. As with herd7, the
klitmus7 code is freely available from
http://diy.inria.fr/sources/index.html
(and via "git" at https://github.com/herd/herdtools7)"
[...]
Credits go to:
"This patch was the result of a most excellent collaboration
founded by Jade Alglave and also including Alan Stern, Andrea
Parri, and Luc Maranget."
... and to the gents listed in the MAINTAINERS entry:
LINUX KERNEL MEMORY CONSISTENCY MODEL (LKMM)
M: Alan Stern <stern@rowland.harvard.edu>
M: Andrea Parri <parri.andrea@gmail.com>
M: Will Deacon <will.deacon@arm.com>
M: Peter Zijlstra <peterz@infradead.org>
M: Boqun Feng <boqun.feng@gmail.com>
M: Nicholas Piggin <npiggin@gmail.com>
M: David Howells <dhowells@redhat.com>
M: Jade Alglave <j.alglave@ucl.ac.uk>
M: Luc Maranget <luc.maranget@inria.fr>
M: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
The LKMM project already found several bugs in Linux locking
primitives and improved the understanding and the documentation of
the Linux memory model all around.
- Add KASAN instrumentation to atomic APIs (Dmitry Vyukov)
- Add RWSEM API debugging and reorganize the lock debugging Kconfig
(Waiman Long)
- ... misc cleanups and other smaller changes"
* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (31 commits)
locking/Kconfig: Restructure the lock debugging menu
locking/Kconfig: Add LOCK_DEBUGGING_SUPPORT to make it more readable
locking/rwsem: Add DEBUG_RWSEMS to look for lock/unlock mismatches
lockdep: Make the lock debug output more useful
locking/rtmutex: Handle non enqueued waiters gracefully in remove_waiter()
locking/atomic, asm-generic, x86: Add comments for atomic instrumentation
locking/atomic, asm-generic: Add KASAN instrumentation to atomic operations
locking/atomic/x86: Switch atomic.h to use atomic-instrumented.h
locking/atomic, asm-generic: Add asm-generic/atomic-instrumented.h
locking/xchg/alpha: Remove superfluous memory barriers from the _local() variants
tools/memory-model: Finish the removal of rb-dep, smp_read_barrier_depends(), and lockless_dereference()
tools/memory-model: Add documentation of new litmus test
tools/memory-model: Remove mention of docker/gentoo image
locking/memory-barriers: De-emphasize smp_read_barrier_depends() some more
locking/lockdep: Show unadorned pointers
mutex: Drop linkage.h from mutex.h
tools/memory-model: Remove rb-dep, smp_read_barrier_depends, and lockless_dereference
tools/memory-model: Convert underscores to hyphens
tools/memory-model: Add a S lock-based external-view litmus test
tools/memory-model: Add required herd7 version to README file
...
Diffstat (limited to 'tools')
40 files changed, 4234 insertions, 0 deletions
diff --git a/tools/memory-model/Documentation/cheatsheet.txt b/tools/memory-model/Documentation/cheatsheet.txt
new file mode 100644
index 000000000000..956b1ae4aafb
--- /dev/null
+++ b/tools/memory-model/Documentation/cheatsheet.txt
@@ -0,0 +1,29 @@
1 | Prior Operation Subsequent Operation | ||
2 | --------------- --------------------------- | ||
3 | C Self R W RMW Self R W DR DW RMW SV | ||
4 | -- ---- - - --- ---- - - -- -- --- -- | ||
5 | |||
6 | Store, e.g., WRITE_ONCE() Y Y | ||
7 | Load, e.g., READ_ONCE() Y Y Y Y | ||
8 | Unsuccessful RMW operation Y Y Y Y | ||
9 | rcu_dereference() Y Y Y Y | ||
10 | Successful *_acquire() R Y Y Y Y Y Y | ||
11 | Successful *_release() C Y Y Y W Y | ||
12 | smp_rmb() Y R Y Y R | ||
13 | smp_wmb() Y W Y Y W | ||
14 | smp_mb() & synchronize_rcu() CP Y Y Y Y Y Y Y Y | ||
15 | Successful full non-void RMW CP Y Y Y Y Y Y Y Y Y Y Y | ||
16 | smp_mb__before_atomic() CP Y Y Y a a a a Y | ||
17 | smp_mb__after_atomic() CP a a Y Y Y Y Y | ||
18 | |||
19 | |||
20 | Key: C: Ordering is cumulative | ||
21 | P: Ordering propagates | ||
22 | R: Read, for example, READ_ONCE(), or read portion of RMW | ||
23 | W: Write, for example, WRITE_ONCE(), or write portion of RMW | ||
24 | Y: Provides ordering | ||
25 | a: Provides ordering given intervening RMW atomic operation | ||
26 | DR: Dependent read (address dependency) | ||
27 | DW: Dependent write (address, data, or control dependency) | ||
28 | RMW: Atomic read-modify-write operation | ||
29 | SV: Same-variable access | ||
diff --git a/tools/memory-model/Documentation/explanation.txt b/tools/memory-model/Documentation/explanation.txt
new file mode 100644
index 000000000000..a727c82bd434
--- /dev/null
+++ b/tools/memory-model/Documentation/explanation.txt
@@ -0,0 +1,1845 @@
1 | Explanation of the Linux-Kernel Memory Consistency Model | ||
2 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
3 | |||
4 | :Author: Alan Stern <stern@rowland.harvard.edu> | ||
5 | :Created: October 2017 | ||
6 | |||
7 | .. Contents | ||
8 | |||
9 | 1. INTRODUCTION | ||
10 | 2. BACKGROUND | ||
11 | 3. A SIMPLE EXAMPLE | ||
12 | 4. A SELECTION OF MEMORY MODELS | ||
13 | 5. ORDERING AND CYCLES | ||
14 | 6. EVENTS | ||
15 | 7. THE PROGRAM ORDER RELATION: po AND po-loc | ||
16 | 8. A WARNING | ||
17 | 9. DEPENDENCY RELATIONS: data, addr, and ctrl | ||
18 | 10. THE READS-FROM RELATION: rf, rfi, and rfe | ||
19 | 11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe | ||
20 | 12. THE FROM-READS RELATION: fr, fri, and fre | ||
21 | 13. AN OPERATIONAL MODEL | ||
22 | 14. PROPAGATION ORDER RELATION: cumul-fence | ||
23 | 15. DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL | ||
24 | 16. SEQUENTIAL CONSISTENCY PER VARIABLE | ||
25 | 17. ATOMIC UPDATES: rmw | ||
26 | 18. THE PRESERVED PROGRAM ORDER RELATION: ppo | ||
27 | 19. AND THEN THERE WAS ALPHA | ||
28 | 20. THE HAPPENS-BEFORE RELATION: hb | ||
29 | 21. THE PROPAGATES-BEFORE RELATION: pb | ||
30 | 22. RCU RELATIONS: link, gp-link, rscs-link, and rcu-path | ||
31 | 23. ODDS AND ENDS | ||
32 | |||
33 | |||
34 | |||
35 | INTRODUCTION | ||
36 | ------------ | ||
37 | |||
38 | The Linux-kernel memory consistency model (LKMM) is rather complex and | ||
39 | obscure. This is particularly evident if you read through the | ||
40 | linux-kernel.bell and linux-kernel.cat files that make up the formal | ||
41 | version of the model; they are extremely terse and their meanings are | ||
42 | far from clear. | ||
43 | |||
44 | This document describes the ideas underlying the LKMM. It is meant | ||
45 | for people who want to understand how the model was designed. It does | ||
46 | not go into the details of the code in the .bell and .cat files; | ||
47 | rather, it explains in English what the code expresses symbolically. | ||
48 | |||
49 | Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed | ||
50 | toward beginners; they explain what memory consistency models are and | ||
51 | the basic notions shared by all such models. People already familiar | ||
52 | with these concepts can skim or skip over them. Sections 6 (EVENTS) | ||
53 | through 12 (THE FROM-READS RELATION) describe the fundamental | ||
54 | relations used in many models. Starting in Section 13 (AN OPERATIONAL | ||
55 | MODEL), the workings of the LKMM itself are covered. | ||
56 | |||
57 | Warning: The code examples in this document are not written in the | ||
58 | proper format for litmus tests. They don't include a header line, the | ||
59 | initializations are not enclosed in braces, the global variables are | ||
60 | not passed by pointers, and they don't have an "exists" clause at the | ||
61 | end. Converting them to the right format is left as an exercise for | ||
62 | the reader. | ||
63 | |||
64 | |||
65 | BACKGROUND | ||
66 | ---------- | ||
67 | |||
68 | A memory consistency model (or just memory model, for short) is | ||
69 | something which predicts, given a piece of computer code running on a | ||
70 | particular kind of system, what values may be obtained by the code's | ||
71 | load instructions. The LKMM makes these predictions for code running | ||
72 | as part of the Linux kernel. | ||
73 | |||
74 | In practice, people tend to use memory models the other way around. | ||
75 | That is, given a piece of code and a collection of values specified | ||
76 | for the loads, the model will predict whether it is possible for the | ||
77 | code to run in such a way that the loads will indeed obtain the | ||
78 | specified values. Of course, this is just another way of expressing | ||
79 | the same idea. | ||
80 | |||
81 | For code running on a uniprocessor system, the predictions are easy: | ||
82 | Each load instruction must obtain the value written by the most recent | ||
83 | store instruction accessing the same location (we ignore complicating | ||
84 | factors such as DMA and mixed-size accesses.) But on multiprocessor | ||
85 | systems, with multiple CPUs making concurrent accesses to shared | ||
86 | memory locations, things aren't so simple. | ||
87 | |||
88 | Different architectures have differing memory models, and the Linux | ||
89 | kernel supports a variety of architectures. The LKMM has to be fairly | ||
90 | permissive, in the sense that any behavior allowed by one of these | ||
91 | architectures also has to be allowed by the LKMM. | ||
92 | |||
93 | |||
94 | A SIMPLE EXAMPLE | ||
95 | ---------------- | ||
96 | |||
97 | Here is a simple example to illustrate the basic concepts. Consider | ||
98 | some code running as part of a device driver for an input device. The | ||
99 | driver might contain an interrupt handler which collects data from the | ||
100 | device, stores it in a buffer, and sets a flag to indicate the buffer | ||
101 | is full. Running concurrently on a different CPU might be a part of | ||
102 | the driver code being executed by a process in the midst of a read(2) | ||
103 | system call. This code tests the flag to see whether the buffer is | ||
104 | ready, and if it is, copies the data back to userspace. The buffer | ||
105 | and the flag are memory locations shared between the two CPUs. | ||
106 | |||
107 | We can abstract out the important pieces of the driver code as follows | ||
108 | (the reason for using WRITE_ONCE() and READ_ONCE() instead of simple | ||
109 | assignment statements is discussed later): | ||
110 | |||
111 | int buf = 0, flag = 0; | ||
112 | |||
113 | P0() | ||
114 | { | ||
115 | WRITE_ONCE(buf, 1); | ||
116 | WRITE_ONCE(flag, 1); | ||
117 | } | ||
118 | |||
119 | P1() | ||
120 | { | ||
121 | int r1; | ||
122 | int r2 = 0; | ||
123 | |||
124 | r1 = READ_ONCE(flag); | ||
125 | if (r1) | ||
126 | r2 = READ_ONCE(buf); | ||
127 | } | ||
128 | |||
129 | Here the P0() function represents the interrupt handler running on one | ||
130 | CPU and P1() represents the read() routine running on another. The | ||
131 | value 1 stored in buf represents input data collected from the device. | ||
132 | Thus, P0 stores the data in buf and then sets flag. Meanwhile, P1 | ||
133 | reads flag into the private variable r1, and if it is set, reads the | ||
134 | data from buf into a second private variable r2 for copying to | ||
135 | userspace. (Presumably if flag is not set then the driver will wait a | ||
136 | while and try again.) | ||
137 | |||
138 | This pattern of memory accesses, where one CPU stores values to two | ||
139 | shared memory locations and another CPU loads from those locations in | ||
140 | the opposite order, is widely known as the "Message Passing" or MP | ||
141 | pattern. It is typical of memory access patterns in the kernel. | ||
142 | |||
143 | Please note that this example code is a simplified abstraction. Real | ||
144 | buffers are usually larger than a single integer, real device drivers | ||
145 | usually use sleep and wakeup mechanisms rather than polling for I/O | ||
146 | completion, and real code generally doesn't bother to copy values into | ||
147 | private variables before using them. All that is beside the point; | ||
148 | the idea here is simply to illustrate the overall pattern of memory | ||
149 | accesses by the CPUs. | ||
150 | |||
151 | A memory model will predict what values P1 might obtain for its loads | ||
152 | from flag and buf, or equivalently, what values r1 and r2 might end up | ||
153 | with after the code has finished running. | ||
154 | |||
155 | Some predictions are trivial. For instance, no sane memory model would | ||
156 | predict that r1 = 42 or r2 = -7, because neither of those values ever | ||
157 | gets stored in flag or buf. | ||
158 | |||
159 | Some nontrivial predictions are nonetheless quite simple. For | ||
160 | instance, P1 might run entirely before P0 begins, in which case r1 and | ||
161 | r2 will both be 0 at the end. Or P0 might run entirely before P1 | ||
162 | begins, in which case r1 and r2 will both be 1. | ||
163 | |||
164 | The interesting predictions concern what might happen when the two | ||
165 | routines run concurrently. One possibility is that P1 runs after P0's | ||
166 | store to buf but before the store to flag. In this case, r1 and r2 | ||
167 | will again both be 0. (If P1 had been designed to read buf | ||
168 | unconditionally then we would instead have r1 = 0 and r2 = 1.) | ||
169 | |||
170 | However, the most interesting possibility is where r1 = 1 and r2 = 0. | ||
171 | If this were to occur it would mean the driver contains a bug, because | ||
172 | incorrect data would get sent to the user: 0 instead of 1. As it | ||
173 | happens, the LKMM does predict this outcome can occur, and the example | ||
174 | driver code shown above is indeed buggy. | ||
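
One way to repair the driver (a sketch for illustration, using the
same buf and flag variables as above) is to replace the plain
WRITE_ONCE()/READ_ONCE() accesses to flag with a release/acquire
pair:

	int buf = 0, flag = 0;

	P0()
	{
		WRITE_ONCE(buf, 1);
		smp_store_release(&flag, 1);
	}

	P1()
	{
		int r1;
		int r2 = 0;

		r1 = smp_load_acquire(&flag);
		if (r1)
			r2 = READ_ONCE(buf);
	}

With this change the LKMM predicts that the outcome r1 = 1 and r2 = 0
cannot occur; the reasons will become clear in the sections below.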
175 | |||
176 | |||
177 | A SELECTION OF MEMORY MODELS | ||
178 | ---------------------------- | ||
179 | |||
180 | The first widely cited memory model, and the simplest to understand, | ||
181 | is Sequential Consistency. According to this model, systems behave as | ||
182 | if each CPU executed its instructions in order but with unspecified | ||
183 | timing. In other words, the instructions from the various CPUs get | ||
184 | interleaved in a nondeterministic way, always according to some single | ||
185 | global order that agrees with the order of the instructions in the | ||
186 | program source for each CPU. The model says that the value obtained | ||
187 | by each load is simply the value written by the most recently executed | ||
188 | store to the same memory location, from any CPU. | ||
189 | |||
190 | For the MP example code shown above, Sequential Consistency predicts | ||
191 | that the undesired result r1 = 1, r2 = 0 cannot occur. The reasoning | ||
192 | goes like this: | ||
193 | |||
194 | Since r1 = 1, P0 must store 1 to flag before P1 loads 1 from | ||
195 | it, as loads can obtain values only from earlier stores. | ||
196 | |||
197 | P1 loads from flag before loading from buf, since CPUs execute | ||
198 | their instructions in order. | ||
199 | |||
200 | P1 must load 0 from buf before P0 stores 1 to it; otherwise r2 | ||
201 | would be 1 since a load obtains its value from the most recent | ||
202 | store to the same address. | ||
203 | |||
204 | P0 stores 1 to buf before storing 1 to flag, since it executes | ||
205 | its instructions in order. | ||
206 | |||
207 | Since an instruction (in this case, P1's store to flag) cannot | ||
208 | execute before itself, the specified outcome is impossible. | ||
209 | |||
210 | However, real computer hardware almost never follows the Sequential | ||
211 | Consistency memory model; doing so would rule out too many valuable | ||
212 | performance optimizations. On ARM and PowerPC architectures, for | ||
213 | instance, the MP example code really does sometimes yield r1 = 1 and | ||
214 | r2 = 0. | ||
215 | |||
216 | x86 and SPARC follow yet a different memory model: TSO (Total Store | ||
217 | Ordering). This model predicts that the undesired outcome for the MP | ||
218 | pattern cannot occur, but in other respects it differs from Sequential | ||
219 | Consistency. One example is the Store Buffer (SB) pattern, in which | ||
220 | each CPU stores to its own shared location and then loads from the | ||
221 | other CPU's location: | ||
222 | |||
223 | int x = 0, y = 0; | ||
224 | |||
225 | P0() | ||
226 | { | ||
227 | int r0; | ||
228 | |||
229 | WRITE_ONCE(x, 1); | ||
230 | r0 = READ_ONCE(y); | ||
231 | } | ||
232 | |||
233 | P1() | ||
234 | { | ||
235 | int r1; | ||
236 | |||
237 | WRITE_ONCE(y, 1); | ||
238 | r1 = READ_ONCE(x); | ||
239 | } | ||
240 | |||
241 | Sequential Consistency predicts that the outcome r0 = 0, r1 = 0 is | ||
242 | impossible. (Exercise: Figure out the reasoning.) But TSO allows | ||
243 | this outcome to occur, and in fact it does sometimes occur on x86 and | ||
244 | SPARC systems. | ||
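
(As an aside, the undesirable SB outcome can be prevented -- on TSO
machines and in the LKMM alike -- by placing a full memory barrier
between the store and the load on each CPU; a sketch for P0, with P1
modified the same way:

	WRITE_ONCE(x, 1);
	smp_mb();
	r0 = READ_ONCE(y);

The role of smp_mb() and the other fences is taken up in detail in
the operational model below.)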
245 | |||
246 | The LKMM was inspired by the memory models followed by PowerPC, ARM, | ||
247 | x86, Alpha, and other architectures. However, it is different in | ||
248 | detail from each of them. | ||
249 | |||
250 | |||
251 | ORDERING AND CYCLES | ||
252 | ------------------- | ||
253 | |||
254 | Memory models are all about ordering. Often this is temporal ordering | ||
255 | (i.e., the order in which certain events occur) but it doesn't have to | ||
256 | be; consider for example the order of instructions in a program's | ||
257 | source code. We saw above that Sequential Consistency makes an | ||
258 | important assumption that CPUs execute instructions in the same order | ||
259 | as those instructions occur in the code, and there are many other | ||
260 | instances of ordering playing central roles in memory models. | ||
261 | |||
262 | The counterpart to ordering is a cycle. Ordering rules out cycles: | ||
263 | It's not possible to have X ordered before Y, Y ordered before Z, and | ||
264 | Z ordered before X, because this would mean that X is ordered before | ||
265 | itself. The analysis of the MP example under Sequential Consistency | ||
266 | involved just such an impossible cycle: | ||
267 | |||
268 | W: P0 stores 1 to flag executes before | ||
269 | X: P1 loads 1 from flag executes before | ||
270 | Y: P1 loads 0 from buf executes before | ||
271 | Z: P0 stores 1 to buf executes before | ||
272 | W: P0 stores 1 to flag. | ||
273 | |||
274 | In short, if a memory model requires certain accesses to be ordered, | ||
275 | and a certain outcome for the loads in a piece of code can happen only | ||
276 | if those accesses would form a cycle, then the memory model predicts | ||
277 | that outcome cannot occur. | ||
278 | |||
279 | The LKMM is defined largely in terms of cycles, as we will see. | ||
280 | |||
281 | |||
282 | EVENTS | ||
283 | ------ | ||
284 | |||
285 | The LKMM does not work directly with the C statements that make up | ||
286 | kernel source code. Instead it considers the effects of those | ||
287 | statements in a more abstract form, namely, events. The model | ||
288 | includes three types of events: | ||
289 | |||
290 | Read events correspond to loads from shared memory, such as | ||
291 | calls to READ_ONCE(), smp_load_acquire(), or | ||
292 | rcu_dereference(). | ||
293 | |||
294 | Write events correspond to stores to shared memory, such as | ||
295 | calls to WRITE_ONCE(), smp_store_release(), or atomic_set(). | ||
296 | |||
297 | Fence events correspond to memory barriers (also known as | ||
298 | fences), such as calls to smp_rmb() or rcu_read_lock(). | ||
299 | |||
300 | These categories are not exclusive; a read or write event can also be | ||
301 | a fence. This happens with functions like smp_load_acquire() or | ||
302 | spin_lock(). However, no single event can be both a read and a write. | ||
303 | Atomic read-modify-write accesses, such as atomic_inc() or xchg(), | ||
304 | correspond to a pair of events: a read followed by a write. (The | ||
305 | write event is omitted for executions where it doesn't occur, such as | ||
306 | a cmpxchg() where the comparison fails.) | ||
307 | |||
308 | Other parts of the code, those which do not involve interaction with | ||
309 | shared memory, do not give rise to events. Thus, arithmetic and | ||
310 | logical computations, control-flow instructions, or accesses to | ||
311 | private memory or CPU registers are not of central interest to the | ||
312 | memory model. They only affect the model's predictions indirectly. | ||
313 | For example, an arithmetic computation might determine the value that | ||
314 | gets stored to a shared memory location (or in the case of an array | ||
315 | index, the address where the value gets stored), but the memory model | ||
316 | is concerned only with the store itself -- its value and its address | ||
317 | -- not the computation leading up to it. | ||
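
As an illustration (our own, not drawn from the litmus-test suite),
here is how a few statements map onto events:

	r1 = READ_ONCE(x);	/* one read event */
	WRITE_ONCE(y, r1 + 1);	/* one write event; the addition is not an event */
	smp_mb();		/* one fence event */
	atomic_inc(&z);		/* a read event followed by a write event */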
318 | |||
319 | Events in the LKMM can be linked by various relations, which we will | ||
320 | describe in the following sections. The memory model requires certain | ||
321 | of these relations to be orderings, that is, it requires them not to | ||
322 | have any cycles. | ||
323 | |||
324 | |||
325 | THE PROGRAM ORDER RELATION: po AND po-loc | ||
326 | ----------------------------------------- | ||
327 | |||
328 | The most important relation between events is program order (po). You | ||
329 | can think of it as the order in which statements occur in the source | ||
330 | code after branches are taken into account and loops have been | ||
331 | unrolled. A better description might be the order in which | ||
332 | instructions are presented to a CPU's execution unit. Thus, we say | ||
333 | that X is po-before Y (written as "X ->po Y" in formulas) if X occurs | ||
334 | before Y in the instruction stream. | ||
335 | |||
336 | This is inherently a single-CPU relation; two instructions executing | ||
337 | on different CPUs are never linked by po. Also, it is by definition | ||
338 | an ordering so it cannot have any cycles. | ||
339 | |||
340 | po-loc is a sub-relation of po. It links two memory accesses when the | ||
341 | first comes before the second in program order and they access the | ||
342 | same memory location (the "-loc" suffix). | ||
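
For example (a minimal sketch, assuming x and y are shared
variables):

	WRITE_ONCE(x, 1);	/* A */
	WRITE_ONCE(y, 1);	/* B */
	r1 = READ_ONCE(x);	/* C */

Here A ->po B ->po C, and in addition A ->po-loc C, because A and C
access the same location.  A and B are linked only by po.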
343 | |||
344 | Although this may seem straightforward, there is one subtle aspect to | ||
345 | program order we need to explain. The LKMM was inspired by low-level | ||
346 | architectural memory models which describe the behavior of machine | ||
347 | code, and it retains their outlook to a considerable extent. The | ||
348 | read, write, and fence events used by the model are close in spirit to | ||
349 | individual machine instructions. Nevertheless, the LKMM describes | ||
350 | kernel code written in C, and the mapping from C to machine code can | ||
351 | be extremely complex. | ||
352 | |||
353 | Optimizing compilers have great freedom in the way they translate | ||
354 | source code to object code. They are allowed to apply transformations | ||
355 | that add memory accesses, eliminate accesses, combine them, split them | ||
356 | into pieces, or move them around. Faced with all these possibilities, | ||
357 | the LKMM basically gives up. It insists that the code it analyzes | ||
358 | must contain no ordinary accesses to shared memory; all accesses must | ||
359 | be performed using READ_ONCE(), WRITE_ONCE(), or one of the other | ||
360 | atomic or synchronization primitives. These primitives prevent a | ||
361 | large number of compiler optimizations. In particular, it is | ||
362 | guaranteed that the compiler will not remove such accesses from the | ||
363 | generated code (unless it can prove the accesses will never be | ||
364 | executed), it will not change the order in which they occur in the | ||
365 | code (within limits imposed by the C standard), and it will not | ||
366 | introduce extraneous accesses. | ||
367 | |||
368 | This explains why the MP and SB examples above used READ_ONCE() and | ||
369 | WRITE_ONCE() rather than ordinary memory accesses. Thanks to this | ||
370 | usage, we can be certain that in the MP example, P0's write event to | ||
371 | buf really is po-before its write event to flag, and similarly for the | ||
372 | other shared memory accesses in the examples. | ||
373 | |||
374 | Private variables are not subject to this restriction. Since they are | ||
375 | not shared between CPUs, they can be accessed normally without | ||
376 | READ_ONCE() or WRITE_ONCE(), and there will be no ill effects. In | ||
377 | fact, they need not even be stored in normal memory at all -- in | ||
378 | principle a private variable could be stored in a CPU register (hence | ||
379 | the convention that these variables have names starting with the | ||
380 | letter 'r'). | ||
381 | |||
382 | |||
383 | A WARNING | ||
384 | --------- | ||
385 | |||
386 | The protections provided by READ_ONCE(), WRITE_ONCE(), and others are | ||
387 | not perfect; and under some circumstances it is possible for the | ||
388 | compiler to undermine the memory model. Here is an example. Suppose | ||
389 | both branches of an "if" statement store the same value to the same | ||
390 | location: | ||
391 | |||
392 | r1 = READ_ONCE(x); | ||
393 | if (r1) { | ||
394 | WRITE_ONCE(y, 2); | ||
395 | ... /* do something */ | ||
396 | } else { | ||
397 | WRITE_ONCE(y, 2); | ||
398 | ... /* do something else */ | ||
399 | } | ||
400 | |||
401 | For this code, the LKMM predicts that the load from x will always be | ||
402 | executed before either of the stores to y. However, a compiler could | ||
403 | lift the stores out of the conditional, transforming the code into | ||
404 | something resembling: | ||
405 | |||
406 | r1 = READ_ONCE(x); | ||
407 | WRITE_ONCE(y, 2); | ||
408 | if (r1) { | ||
409 | ... /* do something */ | ||
410 | } else { | ||
411 | ... /* do something else */ | ||
412 | } | ||
413 | |||
414 | Given this version of the code, the LKMM would predict that the load | ||
415 | from x could be executed after the store to y. Thus, the memory | ||
416 | model's original prediction could be invalidated by the compiler. | ||
417 | |||
418 | Another issue arises from the fact that in C, arguments to many | ||
419 | operators and function calls can be evaluated in any order. For | ||
420 | example: | ||
421 | |||
422 | r1 = f(5) + g(6); | ||
423 | |||
424 | The object code might call f(5) either before or after g(6); the | ||
425 | memory model cannot assume there is a fixed program order relation | ||
426 | between them. (In fact, if the functions are inlined then the | ||
427 | compiler might even interleave their object code.) | ||
428 | |||
429 | |||
430 | DEPENDENCY RELATIONS: data, addr, and ctrl | ||
431 | ------------------------------------------ | ||
432 | |||
433 | We say that two events are linked by a dependency relation when the | ||
434 | execution of the second event depends in some way on a value obtained | ||
435 | from memory by the first. The first event must be a read, and the | ||
436 | value it obtains must somehow affect what the second event does. | ||
437 | There are three kinds of dependencies: data, address (addr), and | ||
438 | control (ctrl). | ||
439 | |||
440 | A read and a write event are linked by a data dependency if the value | ||
441 | obtained by the read affects the value stored by the write. As a very | ||
442 | simple example: | ||
443 | |||
444 | int x, y; | ||
445 | |||
446 | r1 = READ_ONCE(x); | ||
447 | WRITE_ONCE(y, r1 + 5); | ||
448 | |||
449 | The value stored by the WRITE_ONCE obviously depends on the value | ||
450 | loaded by the READ_ONCE. Such dependencies can wind through | ||
451 | arbitrarily complicated computations, and a write can depend on the | ||
452 | values of multiple reads. | ||
453 | |||
454 | A read event and another memory access event are linked by an address | ||
455 | dependency if the value obtained by the read affects the location | ||
456 | accessed by the other event. The second event can be either a read or | ||
457 | a write. Here's another simple example: | ||
458 | |||
459 | int a[20]; | ||
460 | int i; | ||
461 | |||
462 | r1 = READ_ONCE(i); | ||
463 | r2 = READ_ONCE(a[r1]); | ||
464 | |||
465 | Here the location accessed by the second READ_ONCE() depends on the | ||
466 | index value loaded by the first. Pointer indirection also gives rise | ||
467 | to address dependencies, since the address of a location accessed | ||
468 | through a pointer will depend on the value read earlier from that | ||
469 | pointer. | ||
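
For instance (a sketch assuming a shared pointer ptr that points to a
shared int):

	int *r1;
	int r2;

	r1 = READ_ONCE(ptr);
	r2 = READ_ONCE(*r1);

The location accessed by the second READ_ONCE() is determined by the
value loaded by the first, so the two loads are linked by an address
dependency.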
470 | |||
471 | Finally, a read event and another memory access event are linked by a | ||
472 | control dependency if the value obtained by the read affects whether | ||
473 | the second event is executed at all. Simple example: | ||
474 | |||
475 | int x, y; | ||
476 | |||
477 | r1 = READ_ONCE(x); | ||
478 | if (r1) | ||
479 | WRITE_ONCE(y, 1984); | ||
480 | |||
481 | Execution of the WRITE_ONCE() is controlled by a conditional expression | ||
482 | which depends on the value obtained by the READ_ONCE(); hence there is | ||
483 | a control dependency from the load to the store. | ||
484 | |||
485 | It should be pretty obvious that events can only depend on reads that | ||
486 | come earlier in program order. Symbolically, if we have R ->data X, | ||
487 | R ->addr X, or R ->ctrl X (where R is a read event), then we must also | ||
488 | have R ->po X. It wouldn't make sense for a computation to depend | ||
489 | somehow on a value that doesn't get loaded from shared memory until | ||
490 | later in the code! | ||
491 | |||
492 | |||
493 | THE READS-FROM RELATION: rf, rfi, and rfe | ||
494 | ----------------------------------------- | ||
495 | |||
496 | The reads-from relation (rf) links a write event to a read event when | ||
497 | the value loaded by the read is the value that was stored by the | ||
498 | write. In colloquial terms, the load "reads from" the store. We | ||
499 | write W ->rf R to indicate that the load R reads from the store W. We | ||
500 | further distinguish the cases where the load and the store occur on | ||
501 | the same CPU (internal reads-from, or rfi) and where they occur on | ||
502 | different CPUs (external reads-from, or rfe). | ||
503 | |||
504 | For our purposes, a memory location's initial value is treated as | ||
505 | though it had been written there by an imaginary initial store that | ||
506 | executes on a separate CPU before the program runs. | ||
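
To illustrate (this labeling is ours, not part of the original
example), the outcome r1 = 1 and r2 = 0 in the MP pattern corresponds
to the links:

	P0's store to flag        ->rfe  P1's load from flag
	the initial store to buf  ->rf   P1's load from buf

There are no rfi links here, because neither CPU reads a value that
it stored itself.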
507 | |||
508 | Usage of the rf relation implicitly assumes that loads will always | ||
509 | read from a single store. It doesn't apply properly in the presence | ||
510 | of load-tearing, where a load obtains some of its bits from one store | ||
511 | and some of them from another store. Fortunately, use of READ_ONCE() | ||
512 | and WRITE_ONCE() will prevent load-tearing; it's not possible to have: | ||
513 | |||
514 | int x = 0; | ||
515 | |||
516 | P0() | ||
517 | { | ||
518 | WRITE_ONCE(x, 0x1234); | ||
519 | } | ||
520 | |||
521 | P1() | ||
522 | { | ||
523 | int r1; | ||
524 | |||
525 | r1 = READ_ONCE(x); | ||
526 | } | ||
527 | |||
528 | and end up with r1 = 0x1200 (partly from x's initial value and partly | ||
529 | from the value stored by P0). | ||
530 | |||
531 | On the other hand, load-tearing is unavoidable when mixed-size | ||
532 | accesses are used. Consider this example: | ||
533 | |||
534 | union { | ||
535 | u32 w; | ||
536 | u16 h[2]; | ||
537 | } x; | ||
538 | |||
539 | P0() | ||
540 | { | ||
541 | WRITE_ONCE(x.h[0], 0x1234); | ||
542 | WRITE_ONCE(x.h[1], 0x5678); | ||
543 | } | ||
544 | |||
545 | P1() | ||
546 | { | ||
547 | int r1; | ||
548 | |||
549 | r1 = READ_ONCE(x.w); | ||
550 | } | ||
551 | |||
552 | If r1 = 0x56781234 (little-endian!) at the end, then P1 must have read | ||
553 | from both of P0's stores. It is possible to handle mixed-size and | ||
554 | unaligned accesses in a memory model, but the LKMM currently does not | ||
555 | attempt to do so. It requires all accesses to be properly aligned and | ||
556 | of the location's actual size. | ||
557 | |||
558 | |||
559 | CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe | ||
560 | ------------------------------------------------------------------ | ||
561 | |||
562 | Cache coherence is a general principle requiring that in a | ||
563 | multi-processor system, the CPUs must share a consistent view of the | ||
564 | memory contents. Specifically, it requires that for each location in | ||
565 | shared memory, the stores to that location must form a single global | ||
566 | ordering which all the CPUs agree on (the coherence order), and this | ||
567 | ordering must be consistent with the program order for accesses to | ||
568 | that location. | ||
569 | |||
570 | To put it another way, for any variable x, the coherence order (co) of | ||
571 | the stores to x is simply the order in which the stores overwrite one | ||
572 | another. The imaginary store which establishes x's initial value | ||
573 | comes first in the coherence order; the store which directly | ||
574 | overwrites the initial value comes second; the store which overwrites | ||
575 | that value comes third, and so on. | ||
576 | |||
577 | You can think of the coherence order as being the order in which the | ||
578 | stores reach x's location in memory (or if you prefer a more | ||
579 | hardware-centric view, the order in which the stores get written to | ||
580 | x's cache line). We write W ->co W' if W comes before W' in the | ||
581 | coherence order, that is, if the value stored by W gets overwritten, | ||
582 | directly or indirectly, by the value stored by W'. | ||
583 | |||
584 | Coherence order is required to be consistent with program order. This | ||
585 | requirement takes the form of four coherency rules: | ||
586 | |||
587 | Write-write coherence: If W ->po-loc W' (i.e., W comes before | ||
588 | W' in program order and they access the same location), where W | ||
589 | and W' are two stores, then W ->co W'. | ||
590 | |||
591 | Write-read coherence: If W ->po-loc R, where W is a store and R | ||
592 | is a load, then R must read from W or from some other store | ||
593 | which comes after W in the coherence order. | ||
594 | |||
595 | Read-write coherence: If R ->po-loc W, where R is a load and W | ||
596 | is a store, then the store which R reads from must come before | ||
597 | W in the coherence order. | ||
598 | |||
599 | Read-read coherence: If R ->po-loc R', where R and R' are two | ||
600 | loads, then either they read from the same store or else the | ||
601 | store read by R comes before the store read by R' in the | ||
602 | coherence order. | ||
603 | |||
604 | This is sometimes referred to as sequential consistency per variable, | ||
605 | because it means that the accesses to any single memory location obey | ||
606 | the rules of the Sequential Consistency memory model. (According to | ||
607 | Wikipedia, sequential consistency per variable and cache coherence | ||
608 | mean the same thing except that cache coherence includes an extra | ||
609 | requirement that every store eventually becomes visible to every CPU.) | ||
610 | |||
611 | Any reasonable memory model will include cache coherence. Indeed, our | ||
612 | expectation of cache coherence is so deeply ingrained that violations | ||
613 | of its requirements look more like hardware bugs than programming | ||
614 | errors: | ||
615 | |||
616 | int x; | ||
617 | |||
618 | P0() | ||
619 | { | ||
620 | WRITE_ONCE(x, 17); | ||
621 | WRITE_ONCE(x, 23); | ||
622 | } | ||
623 | |||
624 | If the final value stored in x after this code ran was 17, you would | ||
625 | think your computer was broken. It would be a violation of the | ||
626 | write-write coherence rule: Since the store of 23 comes later in | ||
627 | program order, it must also come later in x's coherence order and | ||
628 | thus must overwrite the store of 17. | ||
629 | |||
630 | int x = 0; | ||
631 | |||
632 | P0() | ||
633 | { | ||
634 | int r1; | ||
635 | |||
636 | r1 = READ_ONCE(x); | ||
637 | WRITE_ONCE(x, 666); | ||
638 | } | ||
639 | |||
640 | If r1 = 666 at the end, this would violate the read-write coherence | ||
641 | rule: The READ_ONCE() load comes before the WRITE_ONCE() store in | ||
642 | program order, so it must not read from that store but rather from one | ||
643 | coming earlier in the coherence order (in this case, x's initial | ||
644 | value). | ||
645 | |||
646 | int x = 0; | ||
647 | |||
648 | P0() | ||
649 | { | ||
650 | WRITE_ONCE(x, 5); | ||
651 | } | ||
652 | |||
653 | P1() | ||
654 | { | ||
655 | int r1, r2; | ||
656 | |||
657 | r1 = READ_ONCE(x); | ||
658 | r2 = READ_ONCE(x); | ||
659 | } | ||
660 | |||
661 | If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the | ||
662 | imaginary store which establishes x's initial value) at the end, this | ||
663 | would violate the read-read coherence rule: The r1 load comes before | ||
664 | the r2 load in program order, so it must not read from a store that | ||
665 | comes later in the coherence order. | ||
666 | |||
667 | (As a minor curiosity, if this code had used normal loads instead of | ||
668 | READ_ONCE() in P1, on Itanium it sometimes could end up with r1 = 5 | ||
669 | and r2 = 0! This results from parallel execution of the operations | ||
670 | encoded in Itanium's Very-Long-Instruction-Word format, and it is yet | ||
671 | another motivation for using READ_ONCE() when accessing shared memory | ||
672 | locations.) | ||
673 | |||
674 | Just like the po relation, co is inherently an ordering -- it is not | ||
675 | possible for a store to directly or indirectly overwrite itself! And | ||
676 | just like with the rf relation, we distinguish between stores that | ||
677 | occur on the same CPU (internal coherence order, or coi) and stores | ||
678 | that occur on different CPUs (external coherence order, or coe). | ||
679 | |||
680 | On the other hand, stores to different memory locations are never | ||
681 | related by co, just as instructions on different CPUs are never | ||
682 | related by po. Coherence order is strictly per-location, or if you | ||
683 | prefer, each location has its own independent coherence order. | ||
684 | |||
685 | |||
686 | THE FROM-READS RELATION: fr, fri, and fre | ||
687 | ----------------------------------------- | ||
688 | |||
689 | The from-reads relation (fr) can be a little difficult for people to | ||
690 | grok. It describes the situation where a load reads a value that gets | ||
691 | overwritten by a store. In other words, we have R ->fr W when the | ||
692 | value that R reads is overwritten (directly or indirectly) by W, or | ||
693 | equivalently, when R reads from a store which comes earlier than W in | ||
694 | the coherence order. | ||
695 | |||
696 | For example: | ||
697 | |||
698 | int x = 0; | ||
699 | |||
700 | P0() | ||
701 | { | ||
702 | int r1; | ||
703 | |||
704 | r1 = READ_ONCE(x); | ||
705 | WRITE_ONCE(x, 2); | ||
706 | } | ||
707 | |||
708 | The value loaded from x will be 0 (assuming cache coherence!), and it | ||
709 | gets overwritten by the value 2. Thus there is an fr link from the | ||
710 | READ_ONCE() to the WRITE_ONCE(). If the code contained any later | ||
711 | stores to x, there would also be fr links from the READ_ONCE() to | ||
712 | them. | ||
713 | |||
714 | As with rf, rfi, and rfe, we subdivide the fr relation into fri (when | ||
715 | the load and the store are on the same CPU) and fre (when they are on | ||
716 | different CPUs). | ||
717 | |||
718 | Note that the fr relation is determined entirely by the rf and co | ||
719 | relations; it is not independent. Given a read event R and a write | ||
720 | event W for the same location, we will have R ->fr W if and only if | ||
721 | the write which R reads from is co-before W. In symbols, | ||
722 | |||
723 | (R ->fr W) := (there exists W' with W' ->rf R and W' ->co W). | ||
724 | |||
725 | |||
726 | AN OPERATIONAL MODEL | ||
727 | -------------------- | ||
728 | |||
729 | The LKMM is based on various operational memory models, meaning that | ||
730 | the models arise from an abstract view of how a computer system | ||
731 | operates. Here are the main ideas, as incorporated into the LKMM. | ||
732 | |||
733 | The system as a whole is divided into the CPUs and a memory subsystem. | ||
734 | The CPUs are responsible for executing instructions (not necessarily | ||
735 | in program order), and they communicate with the memory subsystem. | ||
736 | For the most part, executing an instruction requires a CPU to perform | ||
737 | only internal operations. However, loads, stores, and fences involve | ||
738 | more. | ||
739 | |||
740 | When CPU C executes a store instruction, it tells the memory subsystem | ||
741 | to store a certain value at a certain location. The memory subsystem | ||
742 | propagates the store to all the other CPUs as well as to RAM. (As a | ||
743 | special case, we say that the store propagates to its own CPU at the | ||
744 | time it is executed.) The memory subsystem also determines where the | ||
745 | store falls in the location's coherence order. In particular, it must | ||
746 | arrange for the store to be co-later than (i.e., to overwrite) any | ||
747 | other store to the same location which has already propagated to CPU C. | ||
748 | |||
749 | When a CPU executes a load instruction R, it first checks to see | ||
750 | whether there are any as-yet unexecuted store instructions, for the | ||
751 | same location, that come before R in program order. If there are, it | ||
752 | uses the value of the po-latest such store as the value obtained by R, | ||
753 | and we say that the store's value is forwarded to R. Otherwise, the | ||
754 | CPU asks the memory subsystem for the value to load and we say that R | ||
755 | is satisfied from memory. The memory subsystem hands back the value | ||
756 | of the co-latest store to the location in question which has already | ||
757 | propagated to that CPU. | ||
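
As a small illustration (ours, not from the original text), consider
a single CPU executing:

	WRITE_ONCE(x, 1);
	r1 = READ_ONCE(x);

If the store has not yet been executed (that is, the CPU has not yet
told the memory subsystem about it) when the load runs, the value 1
is simply forwarded to r1.  Otherwise the load is satisfied from
memory and, thanks to the coherence rules, it obtains 1 or the value
of some co-later store.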
758 | |||
759 | (In fact, the picture needs to be a little more complicated than this. | ||
760 | CPUs have local caches, and propagating a store to a CPU really means | ||
761 | propagating it to the CPU's local cache. A local cache can take some | ||
762 | time to process the stores that it receives, and a store can't be used | ||
763 | to satisfy one of the CPU's loads until it has been processed. On | ||
764 | most architectures, the local caches process stores in | ||
765 | First-In-First-Out order, and consequently the processing delay | ||
766 | doesn't matter for the memory model. But on Alpha, the local caches | ||
767 | have a partitioned design that results in non-FIFO behavior. We will | ||
768 | discuss this in more detail later.) | ||
769 | |||
770 | Note that load instructions may be executed speculatively and may be | ||
771 | restarted under certain circumstances. The memory model ignores these | ||
772 | premature executions; we simply say that the load executes at the | ||
773 | final time it is forwarded or satisfied. | ||
774 | |||
775 | Executing a fence (or memory barrier) instruction doesn't require a | ||
776 | CPU to do anything special other than informing the memory subsystem | ||
777 | about the fence. However, fences do constrain the way CPUs and the | ||
778 | memory subsystem handle other instructions, in two respects. | ||
779 | |||
780 | First, a fence forces the CPU to execute various instructions in | ||
781 | program order. Exactly which instructions are ordered depends on the | ||
782 | type of fence: | ||
783 | |||
784 | Strong fences, including smp_mb() and synchronize_rcu(), force | ||
785 | the CPU to execute all po-earlier instructions before any | ||
786 | po-later instructions; | ||
787 | |||
788 | smp_rmb() forces the CPU to execute all po-earlier loads | ||
789 | before any po-later loads; | ||
790 | |||
791 | smp_wmb() forces the CPU to execute all po-earlier stores | ||
792 | before any po-later stores; | ||
793 | |||
794 | Acquire fences, such as smp_load_acquire(), force the CPU to | ||
795 | execute the load associated with the fence (e.g., the load | ||
796 | part of an smp_load_acquire()) before any po-later | ||
797 | instructions; | ||
798 | |||
799 | Release fences, such as smp_store_release(), force the CPU to | ||
800 | execute all po-earlier instructions before the store | ||
801 | associated with the fence (e.g., the store part of an | ||
802 | smp_store_release()). | ||
803 | |||
804 | Second, some types of fence affect the way the memory subsystem | ||
805 | propagates stores. When a fence instruction is executed on CPU C: | ||
806 | |||
807 | For each other CPU C', smp_wmb() forces all po-earlier stores | ||
808 | on C to propagate to C' before any po-later stores do. | ||
809 | |||
810 | For each other CPU C', any store which propagates to C before | ||
811 | a release fence is executed (including all po-earlier | ||
812 | stores executed on C) is forced to propagate to C' before the | ||
813 | store associated with the release fence does. | ||
814 | |||
815 | Any store which propagates to C before a strong fence is | ||
816 | executed (including all po-earlier stores on C) is forced to | ||
817 | propagate to all other CPUs before any instructions po-after | ||
818 | the strong fence are executed on C. | ||
819 | |||
820 | The propagation ordering enforced by release fences and strong fences | ||
821 | affects stores from other CPUs that propagate to CPU C before the | ||
822 | fence is executed, as well as stores that are executed on C before the | ||
823 | fence. We describe this property by saying that release fences and | ||
824 | strong fences are A-cumulative. By contrast, smp_wmb() fences are not | ||
825 | A-cumulative; they only affect the propagation of stores that are | ||
826 | executed on C before the fence (i.e., those which precede the fence in | ||
827 | program order). | ||
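
Here is a sketch (ours, not taken from the litmus-test suite) of the
kind of pattern where A-cumulativity matters:

	int x = 0, y = 0;

	P0()
	{
		WRITE_ONCE(x, 1);
	}

	P1()
	{
		int r1;

		r1 = READ_ONCE(x);
		smp_store_release(&y, 1);
	}

	P2()
	{
		int r2, r3;

		r2 = smp_load_acquire(&y);
		r3 = READ_ONCE(x);
	}

If r1 = 1 then P0's store to x propagated to P1 before the release
fence executed, so the store to x must propagate to every CPU before
the store to y does.  Consequently the outcome r1 = 1, r2 = 1,
r3 = 0 is forbidden.  Replacing the release with smp_wmb() would not,
by itself, give this guarantee, because the store to x was not
executed on P1.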
828 | |||
829 | rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have | ||
830 | other properties which we discuss later. | ||
831 | |||
832 | |||
833 | PROPAGATION ORDER RELATION: cumul-fence | ||
834 | --------------------------------------- | ||
835 | |||
836 | The fences which affect propagation order (i.e., strong, release, and | ||
837 | smp_wmb() fences) are collectively referred to as cumul-fences, even | ||
838 | though smp_wmb() isn't A-cumulative. The cumul-fence relation is | ||
839 | defined to link memory access events E and F whenever: | ||
840 | |||
841 | E and F are both stores on the same CPU and an smp_wmb() fence | ||
842 | event occurs between them in program order; or | ||
843 | |||
844 | F is a release fence and some X comes before F in program order, | ||
845 | where either X = E or else E ->rf X; or | ||
846 | |||
847 | A strong fence event occurs between some X and F in program | ||
848 | order, where either X = E or else E ->rf X. | ||
849 | |||
850 | The operational model requires that whenever W and W' are both stores | ||
851 | and W ->cumul-fence W', then W must propagate to any given CPU | ||
852 | before W' does. However, for different CPUs C and C', it does not | ||
853 | require W to propagate to C before W' propagates to C'. | ||
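
The simplest case (our own sketch) is the smp_wmb() one:

	WRITE_ONCE(x, 1);
	smp_wmb();
	WRITE_ONCE(y, 1);

Here the store to x is cumul-fence-before the store to y, so the
store to x must propagate to each CPU before the store to y does.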
854 | |||
855 | |||
856 | DERIVATION OF THE LKMM FROM THE OPERATIONAL MODEL | ||
857 | ------------------------------------------------- | ||
858 | |||
859 | The LKMM is derived from the restrictions imposed by the design | ||
860 | outlined above. These restrictions involve the necessity of | ||
861 | maintaining cache coherence and the fact that a CPU can't operate on a | ||
862 | value before it knows what that value is, among other things. | ||
863 | |||
864 | The formal version of the LKMM is defined by five requirements, or | ||
865 | axioms: | ||
866 | |||
867 | Sequential consistency per variable: This requires that the | ||
868 | system obey the four coherency rules. | ||
869 | |||
870 | Atomicity: This requires that atomic read-modify-write | ||
871 | operations really are atomic, that is, no other stores can | ||
872 | sneak into the middle of such an update. | ||
873 | |||
874 | Happens-before: This requires that certain instructions are | ||
875 | executed in a specific order. | ||
876 | |||
877 | Propagation: This requires that certain stores propagate to | ||
878 | CPUs and to RAM in a specific order. | ||
879 | |||
880 | Rcu: This requires that RCU read-side critical sections and | ||
881 | grace periods obey the rules of RCU, in particular, the | ||
882 | Grace-Period Guarantee. | ||
883 | |||
884 | The first and second are quite common; they can be found in many | ||
885 | memory models (such as those for C11/C++11). The "happens-before" and | ||
886 | "propagation" axioms have analogs in other memory models as well. The | ||
887 | "rcu" axiom is specific to the LKMM. | ||
888 | |||
889 | Each of these axioms is discussed below. | ||
890 | |||
891 | |||
892 | SEQUENTIAL CONSISTENCY PER VARIABLE | ||
893 | ----------------------------------- | ||
894 | |||
895 | According to the principle of cache coherence, the stores to any fixed | ||
896 | shared location in memory form a global ordering. We can imagine | ||
897 | inserting the loads from that location into this ordering, by placing | ||
898 | each load between the store that it reads from and the following | ||
899 | store. This leaves the relative positions of loads that read from the | ||
900 | same store unspecified; let's say they are inserted in program order, | ||
901 | first for CPU 0, then CPU 1, etc. | ||
902 | |||
903 | You can check that the four coherency rules imply that the rf, co, fr, | ||
904 | and po-loc relations agree with this global ordering; in other words, | ||
905 | whenever we have X ->rf Y or X ->co Y or X ->fr Y or X ->po-loc Y, the | ||
906 | X event comes before the Y event in the global ordering. The LKMM's | ||
907 | "coherence" axiom expresses this by requiring the union of these | ||
908 | relations not to have any cycles. This means it must not be possible | ||
909 | to find events | ||
910 | |||
911 | X0 -> X1 -> X2 -> ... -> Xn -> X0, | ||
912 | |||
913 | where each of the links is either rf, co, fr, or po-loc. This has to | ||
914 | hold if the accesses to the fixed memory location can be ordered as | ||
915 | cache coherence demands. | ||
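
As an illustration, the forbidden outcome r1 = 5, r2 = 0 in the
read-read coherence example above would require just such a cycle:

	P0's store of 5 to x  ->rf      P1's load into r1
	P1's load into r1     ->po-loc  P1's load into r2
	P1's load into r2     ->fr      P0's store of 5 to x

where the final fr link holds because the r2 load reads from the
initial store, which is co-before the store of 5.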
916 | |||
917 | Although it is not obvious, it can be shown that the converse is also | ||
918 | true: This LKMM axiom implies that the four coherency rules are | ||
919 | obeyed. | ||
920 | |||
921 | |||
922 | ATOMIC UPDATES: rmw | ||
923 | ------------------- | ||
924 | |||
925 | What does it mean to say that a read-modify-write (rmw) update, such | ||
926 | as atomic_inc(&x), is atomic? It means that the memory location (x in | ||
927 | this case) does not get altered between the read and the write events | ||
928 | making up the atomic operation. In particular, if two CPUs perform | ||
929 | atomic_inc(&x) concurrently, it must be guaranteed that the final | ||
930 | value of x will be the initial value plus two. We should never have | ||
931 | the following sequence of events: | ||
932 | |||
933 | CPU 0 loads x obtaining 13; | ||
934 | CPU 1 loads x obtaining 13; | ||
935 | CPU 0 stores 14 to x; | ||
936 | CPU 1 stores 14 to x; | ||
937 | |||
938 | where the final value of x is wrong (14 rather than 15). | ||
939 | |||
940 | In this example, CPU 0's increment effectively gets lost because it | ||
941 | occurs in between CPU 1's load and store. To put it another way, the | ||
942 | problem is that the position of CPU 0's store in x's coherence order | ||
943 | is between the store that CPU 1 reads from and the store that CPU 1 | ||
944 | performs. | ||
945 | |||
946 | The same analysis applies to all atomic update operations. Therefore, | ||
947 | to enforce atomicity the LKMM requires that atomic updates follow this | ||
948 | rule: Whenever R and W are the read and write events composing an | ||
949 | atomic read-modify-write and W' is the write event which R reads from, | ||
950 | there must not be any stores coming between W' and W in the coherence | ||
951 | order. Equivalently, | ||
952 | |||
953 | (R ->rmw W) implies (there is no X with R ->fr X and X ->co W), | ||
954 | |||
955 | where the rmw relation links the read and write events making up each | ||
956 | atomic update. This is what the LKMM's "atomic" axiom says. | ||
957 | |||
958 | |||
959 | THE PRESERVED PROGRAM ORDER RELATION: ppo | ||
960 | ----------------------------------------- | ||
961 | |||
962 | There are many situations where a CPU is obligated to execute two | ||
963 | instructions in program order. We amalgamate them into the ppo (for | ||
964 | "preserved program order") relation, which links the po-earlier | ||
965 | instruction to the po-later instruction and is thus a sub-relation of | ||
966 | po. | ||
967 | |||
968 | The operational model already includes a description of one such | ||
969 | situation: Fences are a source of ppo links. Suppose X and Y are | ||
970 | memory accesses with X ->po Y; then the CPU must execute X before Y if | ||
971 | any of the following hold: | ||
972 | |||
973 | A strong (smp_mb() or synchronize_rcu()) fence occurs between | ||
974 | X and Y; | ||
975 | |||
976 | X and Y are both stores and an smp_wmb() fence occurs between | ||
977 | them; | ||
978 | |||
979 | X and Y are both loads and an smp_rmb() fence occurs between | ||
980 | them; | ||
981 | |||
982 | X is also an acquire fence, such as smp_load_acquire(); | ||
983 | |||
984 | Y is also a release fence, such as smp_store_release(). | ||
985 | |||
986 | Another possibility, not mentioned earlier but discussed in the next | ||
987 | section, is: | ||
988 | |||
989 | X and Y are both loads, X ->addr Y (i.e., there is an address | ||
990 | dependency from X to Y), and X is a READ_ONCE() or an atomic | ||
991 | access. | ||
992 | |||
993 | Dependencies can also cause instructions to be executed in program | ||
994 | order. This is uncontroversial when the second instruction is a | ||
995 | store; either a data, address, or control dependency from a load R to | ||
996 | a store W will force the CPU to execute R before W. This is very | ||
997 | simply because the CPU cannot tell the memory subsystem about W's | ||
998 | store before it knows what value should be stored (in the case of a | ||
999 | data dependency), what location it should be stored into (in the case | ||
1000 | of an address dependency), or whether the store should actually take | ||
1001 | place (in the case of a control dependency). | ||
1002 | |||
1003 | Dependencies to load instructions are more problematic. To begin with, | ||
1004 | there is no such thing as a data dependency to a load. Next, a CPU | ||
1005 | has no reason to respect a control dependency to a load, because it | ||
1006 | can always satisfy the second load speculatively before the first, and | ||
1007 | then ignore the result if it turns out that the second load shouldn't | ||
1008 | be executed after all. And lastly, the real difficulties begin when | ||
1009 | we consider address dependencies to loads. | ||
1010 | |||
1011 | To be fair about it, all Linux-supported architectures do execute | ||
1012 | loads in program order if there is an address dependency between them. | ||
1013 | After all, a CPU cannot ask the memory subsystem to load a value from | ||
1014 | a particular location before it knows what that location is. However, | ||
1015 | the split-cache design used by Alpha can cause it to behave in a way | ||
1016 | that looks as if the loads were executed out of order (see the next | ||
1017 | section for more details). The kernel includes a workaround for this | ||
1018 | problem when the loads come from READ_ONCE(), and therefore the LKMM | ||
1019 | includes address dependencies to loads in the ppo relation. | ||
1020 | |||
1021 | On the other hand, dependencies can indirectly affect the ordering of | ||
1022 | two loads. This happens when there is a dependency from a load to a | ||
1023 | store and a second, po-later load reads from that store: | ||
1024 | |||
1025 | R ->dep W ->rfi R', | ||
1026 | |||
1027 | where the dep link can be either an address or a data dependency. In | ||
1028 | this situation we know it is possible for the CPU to execute R' before | ||
1029 | W, because it can forward the value that W will store to R'. But it | ||
1030 | cannot execute R' before R, because it cannot forward the value before | ||
1031 | it knows what that value is, or that W and R' do access the same | ||
1032 | location. However, if there is merely a control dependency between R | ||
1033 | and W then the CPU can speculatively forward W to R' before executing | ||
1034 | R; if the speculation turns out to be wrong then the CPU merely has to | ||
1035 | restart or abandon R'. | ||
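
For instance, the R ->dep W ->rfi R' situation might look like this
(a sketch, not a litmus test from the original text):

	int x, y;
	int r1, r2;

	r1 = READ_ONCE(x);	/* R */
	WRITE_ONCE(y, r1);	/* W: data dependency on R */
	r2 = READ_ONCE(y);	/* R': may be satisfied by forwarding from W */

The CPU may execute R' before W by forwarding W's value, but it cannot
execute R' before R, because the value to be forwarded is not known
until R has executed.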
1036 | |||
1037 | (In theory, a CPU might forward a store to a load when it runs across | ||
1038 | an address dependency like this: | ||
1039 | |||
1040 | r1 = READ_ONCE(ptr); | ||
1041 | WRITE_ONCE(*r1, 17); | ||
1042 | r2 = READ_ONCE(*r1); | ||
1043 | |||
1044 | because it could tell that the store and the second load access the | ||
1045 | same location even before it knows what the location's address is. | ||
1046 | However, none of the architectures supported by the Linux kernel do | ||
1047 | this.) | ||
1048 | |||
1049 | Two memory accesses of the same location must always be executed in | ||
1050 | program order if the second access is a store. Thus, if we have | ||
1051 | |||
1052 | R ->po-loc W | ||
1053 | |||
1054 | (the po-loc link says that R comes before W in program order and they | ||
1055 | access the same location), the CPU is obliged to execute W after R. | ||
1056 | If it executed W first then the memory subsystem would respond to R's | ||
1057 | read request with the value stored by W (or an even later store), in | ||
1058 | violation of the read-write coherence rule. Similarly, if we had | ||
1059 | |||
1060 | W ->po-loc W' | ||
1061 | |||
1062 | and the CPU executed W' before W, then the memory subsystem would put | ||
1063 | W' before W in the coherence order. It would effectively cause W to | ||
1064 | overwrite W', in violation of the write-write coherence rule. | ||
1065 | (Interestingly, an early ARMv8 memory model, now obsolete, proposed | ||
1066 | allowing out-of-order writes like this to occur. The model avoided | ||
1067 | violating the write-write coherence rule by requiring the CPU not to | ||
1068 | send the W write to the memory subsystem at all!) | ||
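
For example, in a fragment like the following (a sketch, not part of
the original text):

	r1 = READ_ONCE(x);	/* R */
	WRITE_ONCE(x, 2);	/* W: same location, po-later */

the CPU must execute the load before the store; otherwise the load
would be satisfied by the x = 2 store (or an even later one),
violating the read-write coherence rule.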
1069 | |||
1070 | There is one last example of preserved program order in the LKMM: when | ||
1071 | a load-acquire reads from an earlier store-release. For example: | ||
1072 | |||
1073 | smp_store_release(&x, 123); | ||
1074 | r1 = smp_load_acquire(&x); | ||
1075 | |||
1076 | If the smp_load_acquire() ends up obtaining the 123 value that was | ||
1077 | stored by the smp_store_release(), the LKMM says that the load must be | ||
1078 | executed after the store; the store cannot be forwarded to the load. | ||
1079 | This requirement does not arise from the operational model, but it | ||
1080 | yields correct predictions on all architectures supported by the Linux | ||
1081 | kernel, although for differing reasons. | ||
1082 | |||
1083 | On some architectures, including x86 and ARMv8, it is true that the | ||
1084 | store cannot be forwarded to the load. On others, including PowerPC | ||
1085 | and ARMv7, smp_store_release() generates object code that starts with | ||
1086 | a fence and smp_load_acquire() generates object code that ends with a | ||
1087 | fence. The upshot is that even though the store may be forwarded to | ||
1088 | the load, it is still true that any instruction preceding the store | ||
1089 | will be executed before the load or any following instructions, and | ||
1090 | the store will be executed before any instruction following the load. | ||
1091 | |||
1092 | |||
1093 | AND THEN THERE WAS ALPHA | ||
1094 | ------------------------ | ||
1095 | |||
1096 | As mentioned above, the Alpha architecture is unique in that it does | ||
1097 | not appear to respect address dependencies to loads. This means that | ||
1098 | code such as the following: | ||
1099 | |||
1100 | int x = 0; | ||
1101 | int y = -1; | ||
1102 | int *ptr = &y; | ||
1103 | |||
1104 | P0() | ||
1105 | { | ||
1106 | WRITE_ONCE(x, 1); | ||
1107 | smp_wmb(); | ||
1108 | WRITE_ONCE(ptr, &x); | ||
1109 | } | ||
1110 | |||
1111 | P1() | ||
1112 | { | ||
1113 | int *r1; | ||
1114 | int r2; | ||
1115 | |||
1116 | r1 = ptr; | ||
1117 | r2 = READ_ONCE(*r1); | ||
1118 | } | ||
1119 | |||
1120 | can malfunction on Alpha systems (notice that P1 uses an ordinary load | ||
1121 | to read ptr instead of READ_ONCE()). It is quite possible that r1 = &x | ||
1122 | and r2 = 0 at the end, in spite of the address dependency. | ||
1123 | |||
1124 | At first glance this doesn't seem to make sense. We know that the | ||
1125 | smp_wmb() forces P0's store to x to propagate to P1 before the store | ||
1126 | to ptr does. And since P1 can't execute its second load | ||
1127 | until it knows what location to load from, i.e., after executing its | ||
1128 | first load, the value x = 1 must have propagated to P1 before the | ||
1129 | second load executed. So why doesn't r2 end up equal to 1? | ||
1130 | |||
1131 | The answer lies in the Alpha's split local caches. Although the two | ||
1132 | stores do reach P1's local cache in the proper order, it can happen | ||
1133 | that the first store is processed by a busy part of the cache while | ||
1134 | the second store is processed by an idle part. As a result, the x = 1 | ||
1135 | value may not become available for P1's CPU to read until after the | ||
1136 | ptr = &x value does, leading to the undesirable result above. The | ||
1137 | final effect is that even though the two loads really are executed in | ||
1138 | program order, it appears that they aren't. | ||
1139 | |||
1140 | This could not have happened if the local cache had processed the | ||
1141 | incoming stores in FIFO order. By contrast, other architectures | ||
1142 | maintain at least the appearance of FIFO order. | ||
1143 | |||
1144 | In practice, this difficulty is solved by inserting a special fence | ||
1145 | between P1's two loads when the kernel is compiled for the Alpha | ||
1146 | architecture. In fact, as of version 4.15, the kernel automatically | ||
1147 | adds this fence (called smp_read_barrier_depends() and defined as | ||
1148 | nothing at all on non-Alpha builds) after every READ_ONCE() and atomic | ||
1149 | load. The effect of the fence is to cause the CPU not to execute any | ||
1150 | po-later instructions until after the local cache has finished | ||
1151 | processing all the stores it has already received. Thus, if the code | ||
1152 | was changed to: | ||
1153 | |||
1154 | P1() | ||
1155 | { | ||
1156 | int *r1; | ||
1157 | int r2; | ||
1158 | |||
1159 | r1 = READ_ONCE(ptr); | ||
1160 | r2 = READ_ONCE(*r1); | ||
1161 | } | ||
1162 | |||
1163 | then we would never get r1 = &x and r2 = 0. By the time P1 executed | ||
1164 | its second load, the x = 1 store would already be fully processed by | ||
1165 | the local cache and available for satisfying the read request. Thus | ||
1166 | we have yet another reason why shared data should always be read with | ||
1167 | READ_ONCE() or another synchronization primitive rather than accessed | ||
1168 | directly. | ||
1169 | |||
1170 | The LKMM requires that smp_rmb(), acquire fences, and strong fences | ||
1171 | share this property with smp_read_barrier_depends(): They do not allow | ||
1172 | the CPU to execute any po-later instructions (or po-later loads in the | ||
1173 | case of smp_rmb()) until all outstanding stores have been processed by | ||
1174 | the local cache. In the case of a strong fence, the CPU first has to | ||
1175 | wait for all of its po-earlier stores to propagate to every other CPU | ||
1176 | in the system; then it has to wait for the local cache to process all | ||
1177 | the stores received as of that time -- not just the stores received | ||
1178 | when the strong fence began. | ||
1179 | |||
1180 | And of course, none of this matters for any architecture other than | ||
1181 | Alpha. | ||
1182 | |||
1183 | |||
1184 | THE HAPPENS-BEFORE RELATION: hb | ||
1185 | ------------------------------- | ||
1186 | |||
1187 | The happens-before relation (hb) links memory accesses that have to | ||
1188 | execute in a certain order. hb includes the ppo relation and two | ||
1189 | others, one of which is rfe. | ||
1190 | |||
1191 | W ->rfe R implies that W and R are on different CPUs. It also means | ||
1192 | that W's store must have propagated to R's CPU before R executed; | ||
1193 | otherwise R could not have read the value stored by W. Therefore W | ||
1194 | must have executed before R, and so we have W ->hb R. | ||
1195 | |||
1196 | The equivalent fact need not hold if W ->rfi R (i.e., W and R are on | ||
1197 | the same CPU). As we have already seen, the operational model allows | ||
1198 | W's value to be forwarded to R in such cases, meaning that R may well | ||
1199 | execute before W does. | ||
1200 | |||
1201 | It's important to understand that neither coe nor fre is included in | ||
1202 | hb, despite their similarities to rfe. For example, suppose we have | ||
1203 | W ->coe W'. This means that W and W' are stores to the same location, | ||
1204 | they execute on different CPUs, and W comes before W' in the coherence | ||
1205 | order (i.e., W' overwrites W). Nevertheless, it is possible for W' to | ||
1206 | execute before W, because the decision as to which store overwrites | ||
1207 | the other is made later by the memory subsystem. When the stores are | ||
1208 | nearly simultaneous, either one can come out on top. Similarly, | ||
1209 | R ->fre W means that W overwrites the value which R reads, but it | ||
1210 | doesn't mean that W has to execute after R. All that's necessary is | ||
1211 | for the memory subsystem not to propagate W to R's CPU until after R | ||
1212 | has executed, which is possible if W executes shortly before R. | ||
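
As a tiny illustration of the coe case (a sketch, not from the
original text), suppose two CPUs store to the same variable at nearly
the same time:

	int x;

	P0()
	{
		WRITE_ONCE(x, 1);
	}

	P1()
	{
		WRITE_ONCE(x, 2);
	}

If x ends up equal to 2 then P0's store comes before P1's store in the
coherence order, yet P1's store may perfectly well have executed
first; which store "wins" is decided afterward by the memory
subsystem.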
1213 | |||
1214 | The third relation included in hb is like ppo, in that it only links | ||
1215 | events that are on the same CPU. However it is more difficult to | ||
1216 | explain, because it arises only indirectly from the requirement of | ||
1217 | cache coherence. The relation is called prop, and it links two events | ||
1218 | on CPU C in situations where a store from some other CPU comes after | ||
1219 | the first event in the coherence order and propagates to C before the | ||
1220 | second event executes. | ||
1221 | |||
1222 | This is best explained with some examples. The simplest case looks | ||
1223 | like this: | ||
1224 | |||
1225 | int x; | ||
1226 | |||
1227 | P0() | ||
1228 | { | ||
1229 | int r1; | ||
1230 | |||
1231 | WRITE_ONCE(x, 1); | ||
1232 | r1 = READ_ONCE(x); | ||
1233 | } | ||
1234 | |||
1235 | P1() | ||
1236 | { | ||
1237 | WRITE_ONCE(x, 8); | ||
1238 | } | ||
1239 | |||
1240 | If r1 = 8 at the end then P0's accesses must have executed in program | ||
1241 | order. We can deduce this from the operational model; if P0's load | ||
1242 | had executed before its store then the value of the store would have | ||
1243 | been forwarded to the load, so r1 would have ended up equal to 1, not | ||
1244 | 8. In this case there is a prop link from P0's write event to its read | ||
1245 | event, because P1's store came after P0's store in x's coherence | ||
1246 | order, and P1's store propagated to P0 before P0's load executed. | ||
1247 | |||
1248 | An equally simple case involves two loads of the same location that | ||
1249 | read from different stores: | ||
1250 | |||
1251 | int x = 0; | ||
1252 | |||
1253 | P0() | ||
1254 | { | ||
1255 | int r1, r2; | ||
1256 | |||
1257 | r1 = READ_ONCE(x); | ||
1258 | r2 = READ_ONCE(x); | ||
1259 | } | ||
1260 | |||
1261 | P1() | ||
1262 | { | ||
1263 | WRITE_ONCE(x, 9); | ||
1264 | } | ||
1265 | |||
1266 | If r1 = 0 and r2 = 9 at the end then P0's accesses must have executed | ||
1267 | in program order. If the second load had executed before the first | ||
1268 | then the x = 9 store must have been propagated to P0 before the first | ||
1269 | load executed, and so r1 would have been 9 rather than 0. In this | ||
1270 | case there is a prop link from P0's first read event to its second, | ||
1271 | because P1's store overwrote the value read by P0's first load, and | ||
1272 | P1's store propagated to P0 before P0's second load executed. | ||
1273 | |||
1274 | Less trivial examples of prop all involve fences. Unlike the simple | ||
1275 | examples above, they can require that some instructions are executed | ||
1276 | out of program order. This next one should look familiar: | ||
1277 | |||
1278 | int buf = 0, flag = 0; | ||
1279 | |||
1280 | P0() | ||
1281 | { | ||
1282 | WRITE_ONCE(buf, 1); | ||
1283 | smp_wmb(); | ||
1284 | WRITE_ONCE(flag, 1); | ||
1285 | } | ||
1286 | |||
1287 | P1() | ||
1288 | { | ||
1289 | int r1; | ||
1290 | int r2; | ||
1291 | |||
1292 | r1 = READ_ONCE(flag); | ||
1293 | r2 = READ_ONCE(buf); | ||
1294 | } | ||
1295 | |||
1296 | This is the MP pattern again, with an smp_wmb() fence between the two | ||
1297 | stores. If r1 = 1 and r2 = 0 at the end then there is a prop link | ||
1298 | from P1's second load to its first (backwards!). The reason is | ||
1299 | similar to the previous examples: The value P1 loads from buf gets | ||
1300 | overwritten by P0's store to buf, the fence guarantees that the store | ||
1301 | to buf will propagate to P1 before the store to flag does, and the | ||
1302 | store to flag propagates to P1 before P1 reads flag. | ||
1303 | |||
1304 | The prop link says that in order to obtain the r1 = 1, r2 = 0 result, | ||
1305 | P1 must execute its second load before the first. Indeed, if the load | ||
1306 | from flag were executed first, then the buf = 1 store would already | ||
1307 | have propagated to P1 by the time P1's load from buf executed, so r2 | ||
1308 | would have been 1 at the end, not 0. (The reasoning holds even for | ||
1309 | Alpha, although the details are more complicated and we will not go | ||
1310 | into them.) | ||
1311 | |||
1312 | But what if we put an smp_rmb() fence between P1's loads? The fence | ||
1313 | would force the two loads to be executed in program order, and it | ||
1314 | would generate a cycle in the hb relation: The fence would create a ppo | ||
1315 | link (hence an hb link) from the first load to the second, and the | ||
1316 | prop relation would give an hb link from the second load to the first. | ||
1317 | Since an instruction can't execute before itself, we are forced to | ||
1318 | conclude that if an smp_rmb() fence is added, the r1 = 1, r2 = 0 | ||
1319 | outcome is impossible -- as it should be. | ||
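
For reference, the strengthened P1() would be simply the earlier code
with the fence added (this is not a new example, just the variant
described above):

	P1()
	{
		int r1;
		int r2;

		r1 = READ_ONCE(flag);
		smp_rmb();
		r2 = READ_ONCE(buf);
	}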
1320 | |||
1321 | The formal definition of the prop relation involves a coe or fre link, | ||
1322 | followed by an arbitrary number of cumul-fence links, ending with an | ||
1323 | rfe link. You can concoct more exotic examples, containing more than | ||
1324 | one fence, although this quickly leads to diminishing returns in terms | ||
1325 | of complexity. For instance, here's an example containing a coe link | ||
1326 | followed by two fences and an rfe link, utilizing the fact that | ||
1327 | release fences are A-cumulative: | ||
1328 | |||
1329 | int x, y, z; | ||
1330 | |||
1331 | P0() | ||
1332 | { | ||
1333 | int r0; | ||
1334 | |||
1335 | WRITE_ONCE(x, 1); | ||
1336 | r0 = READ_ONCE(z); | ||
1337 | } | ||
1338 | |||
1339 | P1() | ||
1340 | { | ||
1341 | WRITE_ONCE(x, 2); | ||
1342 | smp_wmb(); | ||
1343 | WRITE_ONCE(y, 1); | ||
1344 | } | ||
1345 | |||
1346 | P2() | ||
1347 | { | ||
1348 | int r2; | ||
1349 | |||
1350 | r2 = READ_ONCE(y); | ||
1351 | smp_store_release(&z, 1); | ||
1352 | } | ||
1353 | |||
1354 | If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop | ||
1355 | link from P0's store to its load. This is because P0's store gets | ||
1356 | overwritten by P1's store since x = 2 at the end (a coe link), the | ||
1357 | smp_wmb() ensures that P1's store to x propagates to P2 before the | ||
1358 | store to y does (the first fence), the store to y propagates to P2 | ||
1359 | before P2's load and store execute, P2's smp_store_release() | ||
1360 | guarantees that the stores to x and y both propagate to P0 before the | ||
1361 | store to z does (the second fence), and P0's load executes after the | ||
1362 | store to z has propagated to P0 (an rfe link). | ||
1363 | |||
1364 | In summary, the fact that the hb relation links memory access events | ||
1365 | in the order they execute means that it must not have cycles. This | ||
1366 | requirement is the content of the LKMM's "happens-before" axiom. | ||
1367 | |||
1368 | The LKMM defines yet another relation connected to times of | ||
1369 | instruction execution, but it is not included in hb. It relies on the | ||
1370 | particular properties of strong fences, which we cover in the next | ||
1371 | section. | ||
1372 | |||
1373 | |||
1374 | THE PROPAGATES-BEFORE RELATION: pb | ||
1375 | ---------------------------------- | ||
1376 | |||
1377 | The propagates-before (pb) relation capitalizes on the special | ||
1378 | features of strong fences. It links two events E and F whenever some | ||
1379 | store is coherence-later than E and propagates to every CPU and to RAM | ||
1380 | before F executes. The formal definition requires that E be linked to | ||
1381 | F via a coe or fre link, an arbitrary number of cumul-fences, an | ||
1382 | optional rfe link, a strong fence, and an arbitrary number of hb | ||
1383 | links. Let's see how this definition works out. | ||
1384 | |||
1385 | Consider first the case where E is a store (implying that the sequence | ||
1386 | of links begins with coe). Then there are events W, X, Y, and Z such | ||
1387 | that: | ||
1388 | |||
1389 | E ->coe W ->cumul-fence* X ->rfe? Y ->strong-fence Z ->hb* F, | ||
1390 | |||
1391 | where the * suffix indicates an arbitrary number of links of the | ||
1392 | specified type, and the ? suffix indicates the link is optional (Y may | ||
1393 | be equal to X). Because of the cumul-fence links, we know that W will | ||
1394 | propagate to Y's CPU before X does, hence before Y executes and hence | ||
1395 | before the strong fence executes. Because this fence is strong, we | ||
1396 | know that W will propagate to every CPU and to RAM before Z executes. | ||
1397 | And because of the hb links, we know that Z will execute before F. | ||
1398 | Thus W, which comes later than E in the coherence order, will | ||
1399 | propagate to every CPU and to RAM before F executes. | ||
1400 | |||
1401 | The case where E is a load is exactly the same, except that the first | ||
1402 | link in the sequence is fre instead of coe. | ||
1403 | |||
1404 | The existence of a pb link from E to F implies that E must execute | ||
1405 | before F. To see why, suppose that F executed first. Then W would | ||
1406 | have propagated to E's CPU before E executed. If E was a store, the | ||
1407 | memory subsystem would then be forced to make E come after W in the | ||
1408 | coherence order, contradicting the fact that E ->coe W. If E was a | ||
1409 | load, the memory subsystem would then be forced to satisfy E's read | ||
1410 | request with the value stored by W or an even later store, | ||
1411 | contradicting the fact that E ->fre W. | ||
1412 | |||
1413 | A good example illustrating how pb works is the SB pattern with strong | ||
1414 | fences: | ||
1415 | |||
1416 | int x = 0, y = 0; | ||
1417 | |||
1418 | P0() | ||
1419 | { | ||
1420 | int r0; | ||
1421 | |||
1422 | WRITE_ONCE(x, 1); | ||
1423 | smp_mb(); | ||
1424 | r0 = READ_ONCE(y); | ||
1425 | } | ||
1426 | |||
1427 | P1() | ||
1428 | { | ||
1429 | int r1; | ||
1430 | |||
1431 | WRITE_ONCE(y, 1); | ||
1432 | smp_mb(); | ||
1433 | r1 = READ_ONCE(x); | ||
1434 | } | ||
1435 | |||
1436 | If r0 = 0 at the end then there is a pb link from P0's load to P1's | ||
1437 | load: an fre link from P0's load to P1's store (which overwrites the | ||
1438 | value read by P0), and a strong fence between P1's store and its load. | ||
1439 | In this example, the sequences of cumul-fence and hb links are empty. | ||
1440 | Note that this pb link is not included in hb as an instance of prop, | ||
1441 | because it does not start and end on the same CPU. | ||
1442 | |||
1443 | Similarly, if r1 = 0 at the end then there is a pb link from P1's load | ||
1444 | to P0's. This means that if both r0 and r1 were 0 there would be a | ||
1445 | cycle in pb, which is not possible since an instruction cannot execute | ||
1446 | before itself. Thus, adding smp_mb() fences to the SB pattern | ||
1447 | prevents the r0 = 0, r1 = 0 outcome. | ||
1448 | |||
1449 | In summary, the fact that the pb relation links events in the order | ||
1450 | they execute means that it cannot have cycles. This requirement is | ||
1451 | the content of the LKMM's "propagation" axiom. | ||
1452 | |||
1453 | |||
1454 | RCU RELATIONS: link, gp-link, rscs-link, and rcu-path | ||
1455 | ----------------------------------------------------- | ||
1456 | |||
1457 | RCU (Read-Copy-Update) is a powerful synchronization mechanism. It | ||
1458 | rests on two concepts: grace periods and read-side critical sections. | ||
1459 | |||
1460 | A grace period is the span of time occupied by a call to | ||
1461 | synchronize_rcu(). A read-side critical section (or just critical | ||
1462 | section, for short) is a region of code delimited by rcu_read_lock() | ||
1463 | at the start and rcu_read_unlock() at the end. Critical sections can | ||
1464 | be nested, although we won't make use of this fact. | ||
1465 | |||
1466 | As far as memory models are concerned, RCU's main feature is its | ||
1467 | Grace-Period Guarantee, which states that a critical section can never | ||
1468 | span a full grace period. In more detail, the Guarantee says: | ||
1469 | |||
1470 | If a critical section starts before a grace period then it | ||
1471 | must end before the grace period does. In addition, every | ||
1472 | store that propagates to the critical section's CPU before the | ||
1473 | end of the critical section must propagate to every CPU before | ||
1474 | the end of the grace period. | ||
1475 | |||
1476 | If a critical section ends after a grace period ends then it | ||
1477 | must start after the grace period does. In addition, every | ||
1478 | store that propagates to the grace period's CPU before the | ||
1479 | start of the grace period must propagate to every CPU before | ||
1480 | the start of the critical section. | ||
1481 | |||
1482 | Here is a simple example of RCU in action: | ||
1483 | |||
1484 | int x, y; | ||
1485 | |||
1486 | P0() | ||
1487 | { | ||
1488 | rcu_read_lock(); | ||
1489 | WRITE_ONCE(x, 1); | ||
1490 | WRITE_ONCE(y, 1); | ||
1491 | rcu_read_unlock(); | ||
1492 | } | ||
1493 | |||
1494 | P1() | ||
1495 | { | ||
1496 | int r1, r2; | ||
1497 | |||
1498 | r1 = READ_ONCE(x); | ||
1499 | synchronize_rcu(); | ||
1500 | r2 = READ_ONCE(y); | ||
1501 | } | ||
1502 | |||
1503 | The Grace Period Guarantee tells us that when this code runs, it will | ||
1504 | never end with r1 = 1 and r2 = 0. The reasoning is as follows. r1 = 1 | ||
1505 | means that P0's store to x propagated to P1 before P1 called | ||
1506 | synchronize_rcu(), so P0's critical section must have started before | ||
1507 | P1's grace period. On the other hand, r2 = 0 means that P0's store to | ||
1508 | y, which occurs before the end of the critical section, did not | ||
1509 | propagate to P1 before the end of the grace period, violating the | ||
1510 | Guarantee. | ||
1511 | |||
1512 | In the kernel's implementations of RCU, the business about stores | ||
1513 | propagating to every CPU is realized by placing strong fences at | ||
1514 | suitable places in the RCU-related code. Thus, if a critical section | ||
1515 | starts before a grace period does then the critical section's CPU will | ||
1516 | execute an smp_mb() fence after the end of the critical section and | ||
1517 | some time before the grace period's synchronize_rcu() call returns. | ||
1518 | And if a critical section ends after a grace period does then the | ||
1519 | synchronize_rcu() routine will execute an smp_mb() fence at its start | ||
1520 | and some time before the critical section's opening rcu_read_lock() | ||
1521 | executes. | ||
1522 | |||
1523 | What exactly do we mean by saying that a critical section "starts | ||
1524 | before" or "ends after" a grace period? Some aspects of the meaning | ||
1525 | are pretty obvious, as in the example above, but the details aren't | ||
1526 | entirely clear. The LKMM formalizes this notion by means of a | ||
1527 | relation with the unfortunately generic name "link". It is a very | ||
1528 | general relation; among other things, X ->link Z includes cases where | ||
1529 | X happens-before or is equal to some event Y which is equal to or | ||
1530 | comes before Z in the coherence order. Taking Y = Z, this says that | ||
1531 | X ->rfe Z implies X ->link Z, and taking Y = X, it says that X ->fr Z | ||
1532 | and X ->co Z each imply X ->link Z. | ||
1533 | |||
1534 | The formal definition of the link relation is more than a little | ||
1535 | obscure, and we won't give it here. It is closely related to the pb | ||
1536 | relation, and the details don't matter unless you want to comb through | ||
1537 | a somewhat lengthy formal proof. Pretty much all you need to know | ||
1538 | about link is the information in the preceding paragraph. | ||
1539 | |||
1540 | The LKMM goes on to define the gp-link and rscs-link relations. They | ||
1541 | bring grace periods and read-side critical sections into the picture, | ||
1542 | in the following way: | ||
1543 | |||
1544 | E ->gp-link F means there is a synchronize_rcu() fence event S | ||
1545 | and an event X such that E ->po S, either S ->po X or S = X, | ||
1546 | and X ->link F. In other words, E and F are connected by a | ||
1547 | grace period followed by an instance of link. | ||
1548 | |||
1549 | E ->rscs-link F means there is a critical section delimited by | ||
1550 | an rcu_read_lock() fence L and an rcu_read_unlock() fence U, | ||
1551 | and an event X such that E ->po U, either L ->po X or L = X, | ||
1552 | and X ->link F. Roughly speaking, this says that some event | ||
1553 | in the same critical section as E is connected by link to F. | ||
1554 | |||
1555 | If we think of the link relation as standing for an extended "before", | ||
1556 | then E ->gp-link F says that E executes before a grace period which | ||
1557 | ends before F executes. (In fact it says more than this, because it | ||
1558 | includes cases where E executes before a grace period and some store | ||
1559 | propagates to F's CPU before F executes and doesn't propagate to some | ||
1560 | other CPU until after the grace period ends.) Similarly, | ||
1561 | E ->rscs-link F says that E is part of (or before the start of) a | ||
1562 | critical section which starts before F executes. | ||
1563 | |||
1564 | Putting this all together, the LKMM expresses the Grace Period | ||
1565 | Guarantee by requiring that there are no cycles consisting of gp-link | ||
1566 | and rscs-link connections in which the number of gp-link instances is | ||
1567 | >= the number of rscs-link instances. It does this by defining the | ||
1568 | rcu-path relation to link events E and F whenever it is possible to | ||
1569 | pass from E to F by a sequence of gp-link and rscs-link connections | ||
1570 | with at least as many of the former as the latter. The LKMM's "rcu" | ||
1571 | axiom then says that there are no events E such that E ->rcu-path E. | ||
1572 | |||
1573 | Justifying this axiom takes some intellectual effort, but it is in | ||
1574 | fact a valid formalization of the Grace Period Guarantee. We won't | ||
1575 | attempt to go through the detailed argument, but the following | ||
1576 | analysis gives a taste of what is involved. Suppose we have a | ||
1577 | violation of the first part of the Guarantee: A critical section | ||
1578 | starts before a grace period, and some store propagates to the | ||
1579 | critical section's CPU before the end of the critical section but | ||
1580 | doesn't propagate to some other CPU until after the end of the grace | ||
1581 | period. | ||
1582 | |||
1583 | Putting symbols to these ideas, let L and U be the rcu_read_lock() and | ||
1584 | rcu_read_unlock() fence events delimiting the critical section in | ||
1585 | question, and let S be the synchronize_rcu() fence event for the grace | ||
1586 | period. Saying that the critical section starts before S means there | ||
1587 | are events E and F where E is po-after L (which marks the start of the | ||
1588 | critical section), E is "before" F in the sense of the link relation, | ||
1589 | and F is po-before the grace period S: | ||
1590 | |||
1591 | L ->po E ->link F ->po S. | ||
1592 | |||
1593 | Let W be the store mentioned above, let Z come before the end of the | ||
1594 | critical section and witness that W propagates to the critical | ||
1595 | section's CPU by reading from W, and let Y on some arbitrary CPU be a | ||
1596 | witness that W has not propagated to that CPU, where Y happens after | ||
1597 | some event X which is po-after S. Symbolically, this amounts to: | ||
1598 | |||
1599 | S ->po X ->hb* Y ->fr W ->rf Z ->po U. | ||
1600 | |||
1601 | The fr link from Y to W indicates that W has not propagated to Y's CPU | ||
1602 | at the time that Y executes. From this, it can be shown (see the | ||
1603 | discussion of the link relation earlier) that X and Z are connected by | ||
1604 | link, yielding: | ||
1605 | |||
1606 | S ->po X ->link Z ->po U. | ||
1607 | |||
1608 | These formulas say that S is po-between F and X, hence F ->gp-link Z | ||
1609 | via X. They also say that Z comes before the end of the critical | ||
1610 | section and E comes after its start, hence Z ->rscs-link F via E. But | ||
1611 | now we have a forbidden cycle: F ->gp-link Z ->rscs-link F. Thus the | ||
1612 | "rcu" axiom rules out this violation of the Grace Period Guarantee. | ||
1613 | |||
1614 | For something a little more down-to-earth, let's see how the axiom | ||
1615 | works out in practice. Consider the RCU code example from above, this | ||
1616 | time with statement labels added to the memory access instructions: | ||
1617 | |||
1618 | int x, y; | ||
1619 | |||
1620 | P0() | ||
1621 | { | ||
1622 | rcu_read_lock(); | ||
1623 | W: WRITE_ONCE(x, 1); | ||
1624 | X: WRITE_ONCE(y, 1); | ||
1625 | rcu_read_unlock(); | ||
1626 | } | ||
1627 | |||
1628 | P1() | ||
1629 | { | ||
1630 | int r1, r2; | ||
1631 | |||
1632 | Y: r1 = READ_ONCE(x); | ||
1633 | synchronize_rcu(); | ||
1634 | Z: r2 = READ_ONCE(y); | ||
1635 | } | ||
1636 | |||
1637 | |||
1638 | If r2 = 0 at the end then P0's store at X overwrites the value | ||
1639 | that P1's load at Z reads from, so we have Z ->fre X and thus | ||
1640 | Z ->link X. In addition, there is a synchronize_rcu() between Y and | ||
1641 | Z, so we have Y ->gp-link X. | ||
1642 | |||
1643 | If r1 = 1 at the end then P1's load at Y reads from P0's store at W, | ||
1644 | so we have W ->link Y. In addition, W and X are in the same critical | ||
1645 | section, so we have X ->rscs-link Y. | ||
1646 | |||
1647 | This gives us a cycle, Y ->gp-link X ->rscs-link Y, with one gp-link | ||
1648 | and one rscs-link, violating the "rcu" axiom. Hence the outcome is | ||
1649 | not allowed by the LKMM, as we would expect. | ||
1650 | |||
1651 | For contrast, let's see what can happen in a more complicated example: | ||
1652 | |||
1653 | int x, y, z; | ||
1654 | |||
1655 | P0() | ||
1656 | { | ||
1657 | int r0; | ||
1658 | |||
1659 | rcu_read_lock(); | ||
1660 | W: r0 = READ_ONCE(x); | ||
1661 | X: WRITE_ONCE(y, 1); | ||
1662 | rcu_read_unlock(); | ||
1663 | } | ||
1664 | |||
1665 | P1() | ||
1666 | { | ||
1667 | int r1; | ||
1668 | |||
1669 | Y: r1 = READ_ONCE(y); | ||
1670 | synchronize_rcu(); | ||
1671 | Z: WRITE_ONCE(z, 1); | ||
1672 | } | ||
1673 | |||
1674 | P2() | ||
1675 | { | ||
1676 | int r2; | ||
1677 | |||
1678 | rcu_read_lock(); | ||
1679 | U: r2 = READ_ONCE(z); | ||
1680 | V: WRITE_ONCE(x, 1); | ||
1681 | rcu_read_unlock(); | ||
1682 | } | ||
1683 | |||
1684 | If r0 = r1 = r2 = 1 at the end, then similar reasoning to before shows | ||
1685 | that W ->rscs-link Y via X, Y ->gp-link U via Z, and U ->rscs-link W | ||
1686 | via V. And just as before, this gives a cycle: | ||
1687 | |||
1688 | W ->rscs-link Y ->gp-link U ->rscs-link W. | ||
1689 | |||
1690 | However, this cycle has fewer gp-link instances than rscs-link | ||
1691 | instances, and consequently the outcome is not forbidden by the LKMM. | ||
1692 | The following instruction timing diagram shows how it might actually | ||
1693 | occur: | ||
1694 | |||
1695 | P0 P1 P2 | ||
1696 | -------------------- -------------------- -------------------- | ||
1697 | rcu_read_lock() | ||
1698 | X: WRITE_ONCE(y, 1) | ||
1699 | Y: r1 = READ_ONCE(y) | ||
1700 | synchronize_rcu() starts | ||
1701 | . rcu_read_lock() | ||
1702 | . V: WRITE_ONCE(x, 1) | ||
1703 | W: r0 = READ_ONCE(x) . | ||
1704 | rcu_read_unlock() . | ||
1705 | synchronize_rcu() ends | ||
1706 | Z: WRITE_ONCE(z, 1) | ||
1707 | U: r2 = READ_ONCE(z) | ||
1708 | rcu_read_unlock() | ||
1709 | |||
1710 | This requires P0 and P2 to execute their loads and stores out of | ||
1711 | program order, but of course they are allowed to do so. And as you | ||
1712 | can see, the Grace Period Guarantee is not violated: The critical | ||
1713 | section in P0 both starts before P1's grace period does and ends | ||
1714 | before it does, and the critical section in P2 both starts after P1's | ||
1715 | grace period does and ends after it does. | ||
1716 | |||
1717 | |||
1718 | ODDS AND ENDS | ||
1719 | ------------- | ||
1720 | |||
1721 | This section covers material that didn't quite fit anywhere in the | ||
1722 | earlier sections. | ||
1723 | |||
1724 | The descriptions in this document don't always match the formal | ||
1725 | version of the LKMM exactly. For example, the actual formal | ||
1726 | definition of the prop relation makes the initial coe or fre part | ||
1727 | optional, and it doesn't require the events linked by the relation to | ||
1728 | be on the same CPU. These differences are very unimportant; indeed, | ||
1729 | instances where the coe/fre part of prop is missing are of no interest | ||
1730 | because all the other parts (fences and rfe) are already included in | ||
1731 | hb anyway, and where the formal model adds prop into hb, it includes | ||
1732 | an explicit requirement that the events being linked are on the same | ||
1733 | CPU. | ||
1734 | |||
1735 | Another minor difference has to do with events that are both memory | ||
1736 | accesses and fences, such as those corresponding to smp_load_acquire() | ||
1737 | calls. In the formal model, these events aren't actually both reads | ||
1738 | and fences; rather, they are read events with an annotation marking | ||
1739 | them as acquires. (Or write events annotated as releases, in the case | ||
1740 | smp_store_release().) The final effect is the same. | ||
1741 | |||
1742 | Although we didn't mention it above, the instruction execution | ||
1743 | ordering provided by the smp_rmb() fence doesn't apply to read events | ||
1744 | that are part of a non-value-returning atomic update. For instance, | ||
1745 | given: | ||
1746 | |||
1747 | atomic_inc(&x); | ||
1748 | smp_rmb(); | ||
1749 | r1 = READ_ONCE(y); | ||
1750 | |||
1751 | it is not guaranteed that the load from y will execute after the | ||
1752 | update to x. This is because the ARMv8 architecture allows | ||
1753 | non-value-returning atomic operations effectively to be executed off | ||
1754 | the CPU. Basically, the CPU tells the memory subsystem to increment | ||
1755 | x, and then the increment is carried out by the memory hardware with | ||
1756 | no further involvement from the CPU. Since the CPU doesn't ever read | ||
1757 | the value of x, there is nothing for the smp_rmb() fence to act on. | ||
1758 | |||
1759 | The LKMM defines a few extra synchronization operations in terms of | ||
1760 | things we have already covered. In particular, rcu_dereference() is | ||
1761 | treated as READ_ONCE() and rcu_assign_pointer() is treated as | ||
1762 | smp_store_release() -- which is basically how the Linux kernel treats | ||
1763 | them. | ||
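
In other words, the model treats code like the following (gp, p, and q
are placeholder names) roughly as indicated in the comments:

	rcu_assign_pointer(gp, p);	/* modeled like smp_store_release(&gp, p) */

	q = rcu_dereference(gp);	/* modeled like q = READ_ONCE(gp) */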
1764 | |||
1765 | There are a few oddball fences which need special treatment: | ||
1766 | smp_mb__before_atomic(), smp_mb__after_atomic(), and | ||
1767 | smp_mb__after_spinlock(). The LKMM uses fence events with special | ||
1768 | annotations for them; they act as strong fences just like smp_mb() | ||
1769 | except for the sets of events that they order. Instead of ordering | ||
1770 | all po-earlier events against all po-later events, as smp_mb() does, | ||
1771 | they behave as follows: | ||
1772 | |||
1773 | smp_mb__before_atomic() orders all po-earlier events against | ||
1774 | po-later atomic updates and the events following them; | ||
1775 | |||
1776 | smp_mb__after_atomic() orders po-earlier atomic updates and | ||
1777 | the events preceding them against all po-later events; | ||
1778 | |||
1779 | smp_mb__after_spinlock() orders po-earlier lock acquisition | ||
1780 | events and the events preceding them against all po-later | ||
1781 | events. | ||
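
Returning to the atomic_inc() example above, one way to obtain the
desired ordering (a sketch consistent with the descriptions just
given, not a recommendation from the original text) would be to use
one of these fences in place of smp_rmb():

	atomic_inc(&x);
	smp_mb__after_atomic();
	r1 = READ_ONCE(y);

Here the fence orders the po-earlier atomic update against all
po-later events, so the load from y cannot execute before the update
to x.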
1782 | |||
1783 | The LKMM includes locking. In fact, there is special code for locking | ||
1784 | in the formal model, added in order to make tools run faster. | ||
1785 | However, this special code is intended to be exactly equivalent to | ||
1786 | concepts we have already covered. A spinlock_t variable is treated | ||
1787 | the same as an int, and spin_lock(&s) is treated the same as: | ||
1788 | |||
1789 | while (cmpxchg_acquire(&s, 0, 1) != 0) | ||
1790 | cpu_relax(); | ||
1791 | |||
1792 | which waits until s is equal to 0 and then atomically sets it to 1, | ||
1793 | and where the read part of the atomic update is also an acquire fence. | ||
1794 | An alternate way to express the same thing would be: | ||
1795 | |||
1796 | r = xchg_acquire(&s, 1); | ||
1797 | |||
1798 | along with a requirement that at the end, r = 0. spin_unlock(&s) is | ||
1799 | treated the same as: | ||
1800 | |||
1801 | smp_store_release(&s, 0); | ||
1802 | |||
1803 | Interestingly, RCU and locking each introduce the possibility of | ||
1804 | deadlock. When faced with code sequences such as: | ||
1805 | |||
1806 | spin_lock(&s); | ||
1807 | spin_lock(&s); | ||
1808 | spin_unlock(&s); | ||
1809 | spin_unlock(&s); | ||
1810 | |||
1811 | or: | ||
1812 | |||
1813 | rcu_read_lock(); | ||
1814 | synchronize_rcu(); | ||
1815 | rcu_read_unlock(); | ||
1816 | |||
1817 | what does the LKMM have to say? Answer: It says there are no allowed | ||
1818 | executions at all, which makes sense. But this can also lead to | ||
1819 | misleading results, because if a piece of code has multiple possible | ||
1820 | executions, some of which deadlock, the model will report only on the | ||
1821 | non-deadlocking executions. For example: | ||
1822 | |||
1823 | int x, y; | ||
1824 | |||
1825 | P0() | ||
1826 | { | ||
1827 | int r0; | ||
1828 | |||
1829 | WRITE_ONCE(x, 1); | ||
1830 | r0 = READ_ONCE(y); | ||
1831 | } | ||
1832 | |||
1833 | P1() | ||
1834 | { | ||
1835 | rcu_read_lock(); | ||
1836 | if (READ_ONCE(x) > 0) { | ||
1837 | WRITE_ONCE(y, 36); | ||
1838 | synchronize_rcu(); | ||
1839 | } | ||
1840 | rcu_read_unlock(); | ||
1841 | } | ||
1842 | |||
1843 | Is it possible to end up with r0 = 36 at the end? The LKMM will tell | ||
1844 | you it is not, but the model won't mention that this is because P1 | ||
1845 | will self-deadlock in the executions where it stores 36 in y. | ||
diff --git a/tools/memory-model/Documentation/recipes.txt b/tools/memory-model/Documentation/recipes.txt
new file mode 100644
index 000000000000..ee4309a87fc4
--- /dev/null
+++ b/tools/memory-model/Documentation/recipes.txt
@@ -0,0 +1,570 @@
1 | This document provides "recipes", that is, litmus tests for commonly | ||
2 | occurring situations, as well as a few that illustrate subtly broken but | ||
3 | attractive nuisances. Many of these recipes include example code from | ||
4 | v4.13 of the Linux kernel. | ||
5 | |||
6 | The first section covers simple special cases, the second section | ||
7 | takes off the training wheels to cover more involved examples, | ||
8 | and the third section provides a few rules of thumb. | ||
9 | |||
10 | |||
11 | Simple special cases | ||
12 | ==================== | ||
13 | |||
14 | This section presents two simple special cases, the first being where | ||
15 | there is only one CPU or only one memory location is accessed, and the | ||
16 | second being use of that old concurrency workhorse, locking. | ||
17 | |||
18 | |||
19 | Single CPU or single memory location | ||
20 | ------------------------------------ | ||
21 | |||
22 | If there is only one CPU on the one hand or only one variable | ||
23 | on the other, the code will execute in order. There are (as | ||
24 | usual) some things to be careful of: | ||
25 | |||
26 | 1. Some aspects of the C language are unordered. For example, | ||
27 | in the expression "f(x) + g(y)", the order in which f and g are | ||
28 | called is not defined; the object code is allowed to use either | ||
29 | order or even to interleave the computations. | ||
30 | |||
31 | 2. Compilers are permitted to use the "as-if" rule. That is, a | ||
32 | compiler can emit whatever code it likes for normal accesses, | ||
33 | as long as the results of a single-threaded execution appear | ||
34 | just as if the compiler had followed all the relevant rules. | ||
35 | To see this, compile with a high level of optimization and run | ||
36 | the debugger on the resulting binary. | ||
37 | |||
38 | 3. If there is only one variable but multiple CPUs, that variable | ||
39 | must be properly aligned and all accesses to that variable must | ||
40 | be full sized. Variables that straddle cachelines or pages void | ||
41 | your full-ordering warranty, as do undersized accesses that load | ||
42 | from or store to only part of the variable. | ||
43 | |||
44 | 4. If there are multiple CPUs, accesses to shared variables should | ||
45 | use READ_ONCE() and WRITE_ONCE() or stronger to prevent load/store | ||
46 | tearing, load/store fusing, and invented loads and stores. | ||
47 | There are exceptions to this rule, including: | ||
48 | |||
49 | i. When there is no possibility of a given shared variable | ||
50 | being updated by some other CPU, for example, while | ||
51 | holding the update-side lock, reads from that variable | ||
52 | need not use READ_ONCE(). | ||
53 | |||
54 | ii. When there is no possibility of a given shared variable | ||
55 | being either read or updated by other CPUs, for example, | ||
56 | when running during early boot, reads from that variable | ||
57 | need not use READ_ONCE() and writes to that variable | ||
58 | need not use WRITE_ONCE(). | ||
59 | |||
60 | |||
61 | Locking | ||
62 | ------- | ||
63 | |||
64 | Locking is well-known and straightforward, at least if you don't think | ||
65 | about it too hard. And the basic rule is indeed quite simple: Any CPU that | ||
66 | has acquired a given lock sees any changes previously seen or made by any | ||
67 | CPU before it released that same lock. Note that this statement is a bit | ||
68 | stronger than "Any CPU holding a given lock sees all changes made by any | ||
69 | CPU during the time that CPU was holding this same lock". For example, | ||
70 | consider the following pair of code fragments: | ||
71 | |||
72 | /* See MP+polocks.litmus. */ | ||
73 | void CPU0(void) | ||
74 | { | ||
75 | WRITE_ONCE(x, 1); | ||
76 | spin_lock(&mylock); | ||
77 | WRITE_ONCE(y, 1); | ||
78 | spin_unlock(&mylock); | ||
79 | } | ||
80 | |||
81 | void CPU1(void) | ||
82 | { | ||
83 | spin_lock(&mylock); | ||
84 | r0 = READ_ONCE(y); | ||
85 | spin_unlock(&mylock); | ||
86 | r1 = READ_ONCE(x); | ||
87 | } | ||
88 | |||
89 | The basic rule guarantees that if CPU0() acquires mylock before CPU1(), | ||
90 | then both r0 and r1 must be set to the value 1. This also has the | ||
91 | consequence that if the final value of r0 is equal to 1, then the final | ||
92 | value of r1 must also be equal to 1. In contrast, the weaker rule would | ||
93 | say nothing about the final value of r1. | ||
94 | |||
95 | The converse to the basic rule also holds, as illustrated by the | ||
96 | following litmus test: | ||
97 | |||
98 | /* See MP+porevlocks.litmus. */ | ||
99 | void CPU0(void) | ||
100 | { | ||
101 | r0 = READ_ONCE(y); | ||
102 | spin_lock(&mylock); | ||
103 | r1 = READ_ONCE(x); | ||
104 | spin_unlock(&mylock); | ||
105 | } | ||
106 | |||
107 | void CPU1(void) | ||
108 | { | ||
109 | spin_lock(&mylock); | ||
110 | WRITE_ONCE(x, 1); | ||
111 | spin_unlock(&mylock); | ||
112 | WRITE_ONCE(y, 1); | ||
113 | } | ||
114 | |||
115 | This converse to the basic rule guarantees that if CPU0() acquires | ||
116 | mylock before CPU1(), then both r0 and r1 must be set to the value 0. | ||
117 | This also has the consequence that if the final value of r1 is equal | ||
118 | to 0, then the final value of r0 must also be equal to 0. In contrast, | ||
119 | the weaker rule would say nothing about the final value of r0. | ||
120 | |||
121 | These examples show only a single pair of CPUs, but the effects of the | ||
122 | locking basic rule extend across multiple acquisitions of a given lock | ||
123 | across multiple CPUs. | ||
124 | |||
125 | However, it is not necessarily the case that accesses ordered by | ||
126 | locking will be seen as ordered by CPUs not holding that lock. | ||
127 | Consider this example: | ||
128 | |||
129 | /* See Z6.0+pooncelock+pooncelock+pombonce.litmus. */ | ||
130 | void CPU0(void) | ||
131 | { | ||
132 | spin_lock(&mylock); | ||
133 | WRITE_ONCE(x, 1); | ||
134 | WRITE_ONCE(y, 1); | ||
135 | spin_unlock(&mylock); | ||
136 | } | ||
137 | |||
138 | void CPU1(void) | ||
139 | { | ||
140 | spin_lock(&mylock); | ||
141 | r0 = READ_ONCE(y); | ||
142 | WRITE_ONCE(z, 1); | ||
143 | spin_unlock(&mylock); | ||
144 | } | ||
145 | |||
146 | void CPU2(void) | ||
147 | { | ||
148 | WRITE_ONCE(z, 2); | ||
149 | smp_mb(); | ||
150 | r1 = READ_ONCE(x); | ||
151 | } | ||
152 | |||
153 | Counter-intuitive though it might be, it is quite possible to have | ||
154 | the final value of r0 be 1, the final value of z be 2, and the final | ||
155 | value of r1 be 0. The reason for this surprising outcome is that | ||
156 | CPU2() never acquired the lock, and thus did not benefit from the | ||
157 | lock's ordering properties. | ||
158 | |||
159 | Ordering can be extended to CPUs not holding the lock by careful use | ||
160 | of smp_mb__after_spinlock(): | ||
161 | |||
162 | /* See Z6.0+pooncelock+poonceLock+pombonce.litmus. */ | ||
163 | void CPU0(void) | ||
164 | { | ||
165 | spin_lock(&mylock); | ||
166 | WRITE_ONCE(x, 1); | ||
167 | WRITE_ONCE(y, 1); | ||
168 | spin_unlock(&mylock); | ||
169 | } | ||
170 | |||
171 | void CPU1(void) | ||
172 | { | ||
173 | spin_lock(&mylock); | ||
174 | smp_mb__after_spinlock(); | ||
175 | r0 = READ_ONCE(y); | ||
176 | WRITE_ONCE(z, 1); | ||
177 | spin_unlock(&mylock); | ||
178 | } | ||
179 | |||
180 | void CPU2(void) | ||
181 | { | ||
182 | WRITE_ONCE(z, 2); | ||
183 | smp_mb(); | ||
184 | r1 = READ_ONCE(x); | ||
185 | } | ||
186 | |||
187 | This addition of smp_mb__after_spinlock() strengthens the lock acquisition | ||
188 | sufficiently to rule out the counter-intuitive outcome. | ||
189 | |||
190 | |||
191 | Taking off the training wheels | ||
192 | ============================== | ||
193 | |||
194 | This section looks at more complex examples, including message passing, | ||
195 | load buffering, release-acquire chains, and store buffering. | ||
196 | Many classes of litmus tests have abbreviated names, which may be found | ||
197 | here: https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf | ||
198 | |||
199 | |||
200 | Message passing (MP) | ||
201 | -------------------- | ||
202 | |||
203 | The MP pattern has one CPU execute a pair of stores to a pair of variables | ||
204 | and another CPU execute a pair of loads from this same pair of variables, | ||
205 | but in the opposite order. The goal is to avoid the counter-intuitive | ||
206 | outcome in which the first load sees the value written by the second store | ||
207 | but the second load does not see the value written by the first store. | ||
208 | In the absence of any ordering, this goal may not be met, as can be seen | ||
209 | in the MP+poonceonces.litmus litmus test. This section therefore looks at | ||
210 | a number of ways of meeting this goal. | ||
211 | |||
212 | |||
213 | Release and acquire | ||
214 | ~~~~~~~~~~~~~~~~~~~ | ||
215 | |||
216 | Use of smp_store_release() and smp_load_acquire() is one way to force | ||
217 | the desired MP ordering. The general approach is shown below: | ||
218 | |||
219 | /* See MP+pooncerelease+poacquireonce.litmus. */ | ||
220 | void CPU0(void) | ||
221 | { | ||
222 | WRITE_ONCE(x, 1); | ||
223 | smp_store_release(&y, 1); | ||
224 | } | ||
225 | |||
226 | void CPU1(void) | ||
227 | { | ||
228 | r0 = smp_load_acquire(&y); | ||
229 | r1 = READ_ONCE(x); | ||
230 | } | ||
231 | |||
232 | The smp_store_release() macro orders any prior accesses against the | ||
233 | store, while the smp_load_acquire() macro orders the load against any | ||
234 | subsequent accesses. Therefore, if the final value of r0 is the value 1, | ||
235 | the final value of r1 must also be the value 1. | ||
236 | |||
237 | The init_stack_slab() function in lib/stackdepot.c uses release-acquire | ||
238 | in this way to safely initialize a slab of the stack. Working out | ||
239 | the mutual-exclusion design is left as an exercise for the reader. | ||
240 | |||
241 | |||
242 | Assign and dereference | ||
243 | ~~~~~~~~~~~~~~~~~~~~~~ | ||
244 | |||
245 | Use of rcu_assign_pointer() and rcu_dereference() is quite similar to the | ||
246 | use of smp_store_release() and smp_load_acquire(), except that both | ||
247 | rcu_assign_pointer() and rcu_dereference() operate on RCU-protected | ||
248 | pointers. The general approach is shown below: | ||
249 | |||
250 | /* See MP+onceassign+derefonce.litmus. */ | ||
251 | int z; | ||
252 | int *y = &z; | ||
253 | int x; | ||
254 | |||
255 | void CPU0(void) | ||
256 | { | ||
257 | WRITE_ONCE(x, 1); | ||
258 | rcu_assign_pointer(y, &x); | ||
259 | } | ||
260 | |||
261 | void CPU1(void) | ||
262 | { | ||
263 | rcu_read_lock(); | ||
264 | r0 = rcu_dereference(y); | ||
265 | r1 = READ_ONCE(*r0); | ||
266 | rcu_read_unlock(); | ||
267 | } | ||
268 | |||
269 | In this example, if the final value of r0 is &x then the final value of | ||
270 | r1 must be 1. | ||
271 | |||
272 | The rcu_assign_pointer() macro has the same ordering properties as does | ||
273 | smp_store_release(), but the rcu_dereference() macro orders the load only | ||
274 | against later accesses that depend on the value loaded. A dependency | ||
275 | is present if the value loaded determines the address of a later access | ||
276 | (address dependency, as shown above), the value written by a later store | ||
277 | (data dependency), or whether or not a later store is executed in the | ||
278 | first place (control dependency). Note that the term "data dependency" | ||
279 | is sometimes casually used to cover both address and data dependencies. | ||
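
The following sketch (with placeholder names, not taken from the
kernel source) shows one instance of each kind of dependency flowing
from an rcu_dereference():

	int a[2];
	int *gp = a;		/* stands in for an RCU-protected pointer */
	int *q;
	int x;

	p = rcu_dereference(gp);
	r0 = READ_ONCE(*p);	/* address dependency: p determines the location loaded */
	WRITE_ONCE(p[1], 1);	/* address dependency: p determines the location stored to */
	WRITE_ONCE(q, p);	/* data dependency: p is the value being stored */
	if (p != NULL)		/* control dependency: p determines whether the store executes */
		WRITE_ONCE(x, 1);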
280 | |||
281 | In lib/prime_numbers.c, the expand_to_next_prime() function invokes | ||
282 | rcu_assign_pointer(), and the next_prime_number() function invokes | ||
283 | rcu_dereference(). This combination mediates access to a bit vector | ||
284 | that is expanded as additional primes are needed. | ||
285 | |||
286 | |||
287 | Write and read memory barriers | ||
288 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
289 | |||
290 | It is usually better to use smp_store_release() instead of smp_wmb() | ||
291 | and to use smp_load_acquire() instead of smp_rmb(). However, the older | ||
292 | smp_wmb() and smp_rmb() APIs are still heavily used, so it is important | ||
293 | to understand their use cases. The general approach is shown below: | ||
294 | |||
295 | /* See MP+wmbonceonce+rmbonceonce.litmus. */ | ||
296 | void CPU0(void) | ||
297 | { | ||
298 | WRITE_ONCE(x, 1); | ||
299 | smp_wmb(); | ||
300 | WRITE_ONCE(y, 1); | ||
301 | } | ||
302 | |||
303 | void CPU1(void) | ||
304 | { | ||
305 | r0 = READ_ONCE(y); | ||
306 | smp_rmb(); | ||
307 | r1 = READ_ONCE(x); | ||
308 | } | ||
309 | |||
310 | The smp_wmb() macro orders prior stores against later stores, and the | ||
311 | smp_rmb() macro orders prior loads against later loads. Therefore, if | ||
312 | the final value of r0 is 1, the final value of r1 must also be 1. | ||
313 | |||
314 | The xlog_state_switch_iclogs() function in fs/xfs/xfs_log.c contains | ||
315 | the following write-side code fragment: | ||
316 | |||
317 | log->l_curr_block -= log->l_logBBsize; | ||
318 | ASSERT(log->l_curr_block >= 0); | ||
319 | smp_wmb(); | ||
320 | log->l_curr_cycle++; | ||
321 | |||
322 | And the xlog_valid_lsn() function in fs/xfs/xfs_log_priv.h contains | ||
323 | the corresponding read-side code fragment: | ||
324 | |||
325 | cur_cycle = ACCESS_ONCE(log->l_curr_cycle); | ||
326 | smp_rmb(); | ||
327 | cur_block = ACCESS_ONCE(log->l_curr_block); | ||
328 | |||
329 | Alternatively, consider the following comment in function | ||
330 | perf_output_put_handle() in kernel/events/ring_buffer.c: | ||
331 | |||
332 | * kernel user | ||
333 | * | ||
334 | * if (LOAD ->data_tail) { LOAD ->data_head | ||
335 | * (A) smp_rmb() (C) | ||
336 | * STORE $data LOAD $data | ||
337 | * smp_wmb() (B) smp_mb() (D) | ||
338 | * STORE ->data_head STORE ->data_tail | ||
339 | * } | ||
340 | |||
341 | The B/C pairing is an example of the MP pattern using smp_wmb() on the | ||
342 | write side and smp_rmb() on the read side. | ||
343 | |||
344 | Of course, given that smp_mb() is strictly stronger than either smp_wmb() | ||
345 | or smp_rmb(), any code fragment that would work with smp_rmb() and | ||
346 | smp_wmb() would also work with smp_mb() replacing either or both of the | ||
347 | weaker barriers. | ||
348 | |||
349 | |||
350 | Load buffering (LB) | ||
351 | ------------------- | ||
352 | |||
353 | The LB pattern has one CPU load from one variable and then store to a | ||
354 | second, while another CPU loads from the second variable and then stores | ||
355 | to the first. The goal is to avoid the counter-intuitive situation where | ||
356 | each load reads the value written by the other CPU's store. In the | ||
357 | absence of any ordering it is quite possible that this may happen, as | ||
358 | can be seen in the LB+poonceonces.litmus litmus test. | ||
359 | |||
360 | One way of avoiding the counter-intuitive outcome is through the use of a | ||
361 | control dependency paired with a full memory barrier: | ||
362 | |||
363 | /* See LB+ctrlonceonce+mbonceonce.litmus. */ | ||
364 | void CPU0(void) | ||
365 | { | ||
366 | r0 = READ_ONCE(x); | ||
367 | if (r0) | ||
368 | WRITE_ONCE(y, 1); | ||
369 | } | ||
370 | |||
371 | void CPU1(void) | ||
372 | { | ||
373 | r1 = READ_ONCE(y); | ||
374 | smp_mb(); | ||
375 | WRITE_ONCE(x, 1); | ||
376 | } | ||
377 | |||
378 | This pairing of a control dependency in CPU0() with a full memory | ||
379 | barrier in CPU1() prevents r0 and r1 from both ending up equal to 1. | ||
380 | |||
381 | The A/D pairing from the ring-buffer use case shown earlier also | ||
382 | illustrates LB. Here is a repeat of the comment in | ||
383 | perf_output_put_handle() in kernel/events/ring_buffer.c, showing a | ||
384 | control dependency on the kernel side and a full memory barrier on | ||
385 | the user side: | ||
386 | |||
387 | * kernel user | ||
388 | * | ||
389 | * if (LOAD ->data_tail) { LOAD ->data_head | ||
390 | * (A) smp_rmb() (C) | ||
391 | * STORE $data LOAD $data | ||
392 | * smp_wmb() (B) smp_mb() (D) | ||
393 | * STORE ->data_head STORE ->data_tail | ||
394 | * } | ||
395 | * | ||
396 | * Where A pairs with D, and B pairs with C. | ||
397 | |||
398 | The kernel's control dependency between the load from ->data_tail | ||
399 | and the store to data combined with the user's full memory barrier | ||
400 | between the load from data and the store to ->data_tail prevents | ||
401 | the counter-intuitive outcome where the kernel overwrites the data | ||
402 | before the user gets done loading it. | ||
403 | |||
404 | |||
405 | Release-acquire chains | ||
406 | ---------------------- | ||
407 | |||
408 | Release-acquire chains are a low-overhead, flexible, and easy-to-use | ||
409 | method of maintaining order. However, they do have some limitations that | ||
410 | need to be fully understood. Here is an example that maintains order: | ||
411 | |||
412 | /* See ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus. */ | ||
413 | void CPU0(void) | ||
414 | { | ||
415 | WRITE_ONCE(x, 1); | ||
416 | smp_store_release(&y, 1); | ||
417 | } | ||
418 | |||
419 | void CPU1(void) | ||
420 | { | ||
421 | r0 = smp_load_acquire(&y); | ||
422 | smp_store_release(&z, 1); | ||
423 | } | ||
424 | |||
425 | void CPU2(void) | ||
426 | { | ||
427 | r1 = smp_load_acquire(&z); | ||
428 | r2 = READ_ONCE(x); | ||
429 | } | ||
430 | |||
431 | In this case, if r0 and r1 both have final values of 1, then r2 must | ||
432 | also have a final value of 1. | ||
433 | |||
434 | The ordering in this example is stronger than it needs to be. For | ||
435 | example, ordering would still be preserved if CPU1()'s smp_load_acquire() | ||
436 | invocation were replaced with READ_ONCE(). | ||
437 | |||
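| For example, here is a sketch (an editorial illustration, not a | ||
| separate litmus test in this series) of CPU1() with READ_ONCE() | ||
| standing in for smp_load_acquire(); CPU0() and CPU2() remain as | ||
| shown above: | ||
| |||
| void CPU1(void) | ||
| { | ||
| r0 = READ_ONCE(y); | ||
| smp_store_release(&z, 1); | ||
| } | ||
| |||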
438 | It is tempting to assume that CPU0()'s store to x is globally ordered | ||
439 | before CPU1()'s store to z, but this is not the case: | ||
440 | |||
441 | /* See Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus. */ | ||
442 | void CPU0(void) | ||
443 | { | ||
444 | WRITE_ONCE(x, 1); | ||
445 | smp_store_release(&y, 1); | ||
446 | } | ||
447 | |||
448 | void CPU1(void) | ||
449 | { | ||
450 | r0 = smp_load_acquire(&y); | ||
451 | smp_store_release(&z, 1); | ||
452 | } | ||
453 | |||
454 | void CPU2(void) | ||
455 | { | ||
456 | WRITE_ONCE(z, 2); | ||
457 | smp_mb(); | ||
458 | r1 = READ_ONCE(x); | ||
459 | } | ||
460 | |||
461 | One might hope that if the final value of r0 is 1 and the final value | ||
462 | of z is 2, then the final value of r1 must also be 1, but it really is | ||
463 | possible for r1 to have the final value of 0. The reason, of course, | ||
464 | is that in this version, CPU2() is not part of the release-acquire chain. | ||
465 | This situation is accounted for in the rules of thumb below. | ||
466 | |||
467 | Despite this limitation, release-acquire chains are low-overhead as | ||
468 | well as simple and powerful, at least as memory-ordering mechanisms go. | ||
469 | |||
470 | |||
471 | Store buffering | ||
472 | --------------- | ||
473 | |||
474 | Store buffering can be thought of as upside-down load buffering, so | ||
475 | that one CPU first stores to one variable and then loads from a second, | ||
476 | while another CPU stores to the second variable and then loads from the | ||
477 | first. Preserving order requires nothing less than full barriers: | ||
478 | |||
479 | /* See SB+mbonceonces.litmus. */ | ||
480 | void CPU0(void) | ||
481 | { | ||
482 | WRITE_ONCE(x, 1); | ||
483 | smp_mb(); | ||
484 | r0 = READ_ONCE(y); | ||
485 | } | ||
486 | |||
487 | void CPU1(void) | ||
488 | { | ||
489 | WRITE_ONCE(y, 1); | ||
490 | smp_mb(); | ||
491 | r1 = READ_ONCE(x); | ||
492 | } | ||
493 | |||
494 | Omitting either smp_mb() will allow both r0 and r1 to have final | ||
495 | values of 0, but providing both full barriers as shown above prevents | ||
496 | this counter-intuitive outcome. | ||
497 | |||
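| For comparison, here is a sketch of the same pattern with both | ||
| barriers omitted (an editorial illustration following the conventions | ||
| of the other fragments in this document): | ||
| |||
| void CPU0(void) | ||
| { | ||
| WRITE_ONCE(x, 1); | ||
| r0 = READ_ONCE(y); | ||
| } | ||
| |||
| void CPU1(void) | ||
| { | ||
| WRITE_ONCE(y, 1); | ||
| r1 = READ_ONCE(x); | ||
| } | ||
| |||
| Because nothing orders either CPU's store before its subsequent load, | ||
| both r0 and r1 can end up equal to 0. | ||
| |||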
498 | This pattern most famously appears as part of Dekker's locking | ||
499 | algorithm, but it has a much more practical use within the Linux kernel | ||
500 | of ordering wakeups. The following comment taken from waitqueue_active() | ||
501 | in include/linux/wait.h shows the canonical pattern: | ||
502 | |||
503 | * CPU0 - waker CPU1 - waiter | ||
504 | * | ||
505 | * for (;;) { | ||
506 | * @cond = true; prepare_to_wait(&wq_head, &wait, state); | ||
507 | * smp_mb(); // smp_mb() from set_current_state() | ||
508 | * if (waitqueue_active(wq_head)) if (@cond) | ||
509 | * wake_up(wq_head); break; | ||
510 | * schedule(); | ||
511 | * } | ||
512 | * finish_wait(&wq_head, &wait); | ||
513 | |||
514 | On CPU0, the store is to @cond and the load is in waitqueue_active(). | ||
515 | On CPU1, prepare_to_wait() contains both a store to wq_head and a call | ||
516 | to set_current_state(), which contains an smp_mb() barrier; the load is | ||
517 | "if (@cond)". The full barriers prevent the undesirable outcome where | ||
518 | CPU1 puts the waiting task to sleep and CPU0 fails to wake it up. | ||
519 | |||
520 | Note that use of locking can greatly simplify this pattern. | ||
521 | |||
522 | |||
523 | Rules of thumb | ||
524 | ============== | ||
525 | |||
526 | There might seem to be no pattern governing what ordering primitives are | ||
527 | needed in which situations, but this is not the case. There is a pattern | ||
528 | based on the relation between the accesses linking successive CPUs in a | ||
529 | given litmus test. There are three types of linkage: | ||
530 | |||
531 | 1. Write-to-read, where the next CPU reads the value that the | ||
532 | previous CPU wrote. The LB litmus-test patterns contain only | ||
533 | this type of relation. In formal memory-modeling texts, this | ||
534 | relation is called "reads-from" and is usually abbreviated "rf". | ||
535 | |||
536 | 2. Read-to-write, where the next CPU overwrites the value that the | ||
537 | previous CPU read. The SB litmus test contains only this type | ||
538 | of relation. In formal memory-modeling texts, this relation is | ||
539 | often called "from-reads" and is sometimes abbreviated "fr". | ||
540 | |||
541 | 3. Write-to-write, where the next CPU overwrites the value written | ||
542 | by the previous CPU. The Z6.0 litmus test pattern contains a | ||
543 | write-to-write relation between the last access of CPU1() and | ||
544 | the first access of CPU2(). In formal memory-modeling texts, | ||
545 | this relation is often called "coherence order" and is sometimes | ||
546 | abbreviated "co". In the C++ standard, it is instead called | ||
547 | "modification order" and often abbreviated "mo". | ||
548 | |||
549 | The strength of memory ordering required for a given litmus test to | ||
550 | avoid a counter-intuitive outcome depends on the types of relations | ||
551 | linking the memory accesses for the outcome in question: | ||
552 | |||
553 | o If all links are write-to-read links, then the weakest | ||
554 | possible ordering within each CPU suffices. For example, in | ||
555 | the LB litmus test, a control dependency was enough to do the | ||
556 | job. | ||
557 | |||
558 | o If all but one of the links are write-to-read links, then a | ||
559 | release-acquire chain suffices. Both the MP and the ISA2 | ||
560 | litmus tests illustrate this case. | ||
561 | |||
562 | o If more than one of the links are something other than | ||
563 | write-to-read links, then a full memory barrier is required | ||
564 | between each successive pair of non-write-to-read links. This | ||
565 | case is illustrated by the Z6.0 litmus tests, both in the | ||
566 | locking and in the release-acquire sections. | ||
567 | |||
568 | However, if you find yourself having to stretch these rules of thumb | ||
569 | to fit your situation, you should consider creating a litmus test and | ||
570 | running it on the model. | ||
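| |||
| For example, assuming the new test is saved in a file with the | ||
| hypothetical name foo.litmus in the tools/memory-model directory, | ||
| it can be checked as described in tools/memory-model/README: | ||
| |||
| $ herd7 -conf linux-kernel.cfg foo.litmus | ||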
diff --git a/tools/memory-model/Documentation/references.txt b/tools/memory-model/Documentation/references.txt new file mode 100644 index 000000000000..ba2e34c2ec3f --- /dev/null +++ b/tools/memory-model/Documentation/references.txt | |||
@@ -0,0 +1,107 @@ | |||
1 | This document provides background reading for memory models and related | ||
2 | tools. These documents are aimed at kernel hackers who are interested | ||
3 | in memory models. | ||
4 | |||
5 | |||
6 | Hardware manuals and models | ||
7 | =========================== | ||
8 | |||
9 | o SPARC International Inc. (Ed.). 1994. "The SPARC Architecture | ||
10 | Reference Manual Version 9". SPARC International Inc. | ||
11 | |||
12 | o Compaq Computer Corporation (Ed.). 2002. "Alpha Architecture | ||
13 | Reference Manual". Compaq Computer Corporation. | ||
14 | |||
15 | o Intel Corporation (Ed.). 2002. "A Formal Specification of Intel | ||
16 | Itanium Processor Family Memory Ordering". Intel Corporation. | ||
17 | |||
18 | o Intel Corporation (Ed.). 2002. "Intel 64 and IA-32 Architectures | ||
19 | Software Developer’s Manual". Intel Corporation. | ||
20 | |||
21 | o Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, | ||
22 | and Magnus O. Myreen. 2010. "x86-TSO: A Rigorous and Usable | ||
23 | Programmer's Model for x86 Multiprocessors". Commun. ACM 53, 7 | ||
24 | (July, 2010), 89-97. http://doi.acm.org/10.1145/1785414.1785443 | ||
25 | |||
26 | o IBM Corporation (Ed.). 2009. "Power ISA Version 2.06". IBM | ||
27 | Corporation. | ||
28 | |||
29 | o ARM Ltd. (Ed.). 2009. "ARM Barrier Litmus Tests and Cookbook". | ||
30 | ARM Ltd. | ||
31 | |||
32 | o Susmit Sarkar, Peter Sewell, Jade Alglave, Luc Maranget, and | ||
33 | Derek Williams. 2011. "Understanding POWER Multiprocessors". In | ||
34 | Proceedings of the 32nd ACM SIGPLAN Conference on Programming | ||
35 | Language Design and Implementation (PLDI ’11). ACM, New York, | ||
36 | NY, USA, 175–186. | ||
37 | |||
38 | o Susmit Sarkar, Kayvan Memarian, Scott Owens, Mark Batty, | ||
39 | Peter Sewell, Luc Maranget, Jade Alglave, and Derek Williams. | ||
40 | 2012. "Synchronising C/C++ and POWER". In Proceedings of the 33rd | ||
41 | ACM SIGPLAN Conference on Programming Language Design and | ||
42 | Implementation (PLDI '12). ACM, New York, NY, USA, 311-322. | ||
43 | |||
44 | o ARM Ltd. (Ed.). 2014. "ARM Architecture Reference Manual (ARMv8, | ||
45 | for ARMv8-A architecture profile)". ARM Ltd. | ||
46 | |||
47 | o Imagination Technologies, LTD. 2015. "MIPS(R) Architecture | ||
48 | For Programmers, Volume II-A: The MIPS64(R) Instruction | ||
49 | Set Reference Manual". Imagination Technologies, | ||
50 | LTD. https://imgtec.com/?do-download=4302. | ||
51 | |||
52 | o Shaked Flur, Kathryn E. Gray, Christopher Pulte, Susmit | ||
53 | Sarkar, Ali Sezgin, Luc Maranget, Will Deacon, and Peter | ||
54 | Sewell. 2016. "Modelling the ARMv8 Architecture, Operationally: | ||
55 | Concurrency and ISA". In Proceedings of the 43rd Annual ACM | ||
56 | SIGPLAN-SIGACT Symposium on Principles of Programming Languages | ||
57 | (POPL ’16). ACM, New York, NY, USA, 608–621. | ||
58 | |||
59 | o Shaked Flur, Susmit Sarkar, Christopher Pulte, Kyndylan Nienhuis, | ||
60 | Luc Maranget, Kathryn E. Gray, Ali Sezgin, Mark Batty, and Peter | ||
61 | Sewell. 2017. "Mixed-size Concurrency: ARM, POWER, C/C++11, | ||
62 | and SC". In Proceedings of the 44th ACM SIGPLAN Symposium on | ||
63 | Principles of Programming Languages (POPL 2017). ACM, New York, | ||
64 | NY, USA, 429–442. | ||
65 | |||
66 | |||
67 | Linux-kernel memory model | ||
68 | ========================= | ||
69 | |||
70 | o Andrea Parri, Alan Stern, Luc Maranget, Paul E. McKenney, | ||
71 | and Jade Alglave. 2017. "A formal model of | ||
72 | Linux-kernel memory ordering - companion webpage". | ||
73 | http://moscova.inria.fr/~maranget/cats7/linux/. (2017). [Online; | ||
74 | accessed 30-January-2017]. | ||
75 | |||
76 | o Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and | ||
77 | Alan Stern. 2017. "A formal kernel memory-ordering model (part 1)" | ||
78 | Linux Weekly News. https://lwn.net/Articles/718628/ | ||
79 | |||
80 | o Jade Alglave, Luc Maranget, Paul E. McKenney, Andrea Parri, and | ||
81 | Alan Stern. 2017. "A formal kernel memory-ordering model (part 2)" | ||
82 | Linux Weekly News. https://lwn.net/Articles/720550/ | ||
83 | |||
84 | |||
85 | Memory-model tooling | ||
86 | ==================== | ||
87 | |||
88 | o Daniel Jackson. 2002. "Alloy: A Lightweight Object Modelling | ||
89 | Notation". ACM Trans. Softw. Eng. Methodol. 11, 2 (April 2002), | ||
90 | 256–290. http://doi.acm.org/10.1145/505145.505149 | ||
91 | |||
92 | o Jade Alglave, Luc Maranget, and Michael Tautschnig. 2014. "Herding | ||
93 | Cats: Modelling, Simulation, Testing, and Data Mining for Weak | ||
94 | Memory". ACM Trans. Program. Lang. Syst. 36, 2, Article 7 (July | ||
95 | 2014), 7:1–7:74 pages. | ||
96 | |||
97 | o Jade Alglave, Patrick Cousot, and Luc Maranget. 2016. "Syntax and | ||
98 | semantics of the weak consistency model specification language | ||
99 | cat". CoRR abs/1608.07531 (2016). http://arxiv.org/abs/1608.07531 | ||
100 | |||
101 | |||
102 | Memory-model comparisons | ||
103 | ======================== | ||
104 | |||
105 | o Paul E. McKenney, Ulrich Weigand, Andrea Parri, and Boqun | ||
106 | Feng. 2016. "Linux-Kernel Memory Model". (6 June 2016). | ||
107 | http://open-std.org/JTC1/SC22/WG21/docs/papers/2016/p0124r2.html. | ||
diff --git a/tools/memory-model/README b/tools/memory-model/README new file mode 100644 index 000000000000..0b3a5f3c9ccd --- /dev/null +++ b/tools/memory-model/README | |||
@@ -0,0 +1,206 @@ | |||
1 | ===================================== | ||
2 | LINUX KERNEL MEMORY CONSISTENCY MODEL | ||
3 | ===================================== | ||
4 | |||
5 | ============ | ||
6 | INTRODUCTION | ||
7 | ============ | ||
8 | |||
9 | This directory contains the memory consistency model (memory model, for | ||
10 | short) of the Linux kernel, written in the "cat" language and executable | ||
11 | by the externally provided "herd7" simulator, which exhaustively explores | ||
12 | the state space of small litmus tests. | ||
13 | |||
14 | In addition, the "klitmus7" tool (also externally provided) may be used | ||
15 | to convert a litmus test to a Linux kernel module, which in turn allows | ||
16 | that litmus test to be exercised within the Linux kernel. | ||
17 | |||
18 | |||
19 | ============ | ||
20 | REQUIREMENTS | ||
21 | ============ | ||
22 | |||
23 | Version 7.48 of the "herd7" and "klitmus7" tools must be downloaded | ||
24 | separately: | ||
25 | |||
26 | https://github.com/herd/herdtools7 | ||
27 | |||
28 | See "herdtools7/INSTALL.md" for installation instructions. | ||
29 | |||
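| For example, one way to obtain the sources (assuming a from-source | ||
| build as described in that file) is: | ||
| |||
| $ git clone https://github.com/herd/herdtools7 | ||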
30 | |||
31 | ================== | ||
32 | BASIC USAGE: HERD7 | ||
33 | ================== | ||
34 | |||
35 | The memory model is used, in conjunction with "herd7", to exhaustively | ||
36 | explore the state space of small litmus tests. | ||
37 | |||
38 | For example, to run SB+mbonceonces.litmus against the memory model: | ||
39 | |||
40 | $ herd7 -conf linux-kernel.cfg litmus-tests/SB+mbonceonces.litmus | ||
41 | |||
42 | Here is the corresponding output: | ||
43 | |||
44 | Test SB+mbonceonces Allowed | ||
45 | States 3 | ||
46 | 0:r0=0; 1:r0=1; | ||
47 | 0:r0=1; 1:r0=0; | ||
48 | 0:r0=1; 1:r0=1; | ||
49 | No | ||
50 | Witnesses | ||
51 | Positive: 0 Negative: 3 | ||
52 | Condition exists (0:r0=0 /\ 1:r0=0) | ||
53 | Observation SB+mbonceonces Never 0 3 | ||
54 | Time SB+mbonceonces 0.01 | ||
55 | Hash=d66d99523e2cac6b06e66f4c995ebb48 | ||
56 | |||
57 | The "Positive: 0 Negative: 3" and the "Never 0 3" each indicate that | ||
58 | this litmus test's "exists" clause cannot be satisfied. | ||
59 | |||
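| By way of comparison, running a test whose "exists" clause can be | ||
| satisfied, for example litmus-tests/LB+poonceonces.litmus, should | ||
| instead produce a non-zero "Positive" count and an "Observation" | ||
| line reading "Sometimes" rather than "Never": | ||
| |||
| $ herd7 -conf linux-kernel.cfg litmus-tests/LB+poonceonces.litmus | ||
| |||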
60 | See "herd7 -help" or "herdtools7/doc/" for more information. | ||
61 | |||
62 | |||
63 | ===================== | ||
64 | BASIC USAGE: KLITMUS7 | ||
65 | ===================== | ||
66 | |||
67 | The "klitmus7" tool converts a litmus test into a Linux kernel module, | ||
68 | which may then be loaded and run. | ||
69 | |||
70 | For example, to run SB+mbonceonces.litmus against hardware: | ||
71 | |||
72 | $ mkdir mymodules | ||
73 | $ klitmus7 -o mymodules litmus-tests/SB+mbonceonces.litmus | ||
74 | $ cd mymodules ; make | ||
75 | $ sudo sh run.sh | ||
76 | |||
77 | The corresponding output includes: | ||
78 | |||
79 | Test SB+mbonceonces Allowed | ||
80 | Histogram (3 states) | ||
81 | 644580 :>0:r0=1; 1:r0=0; | ||
82 | 644328 :>0:r0=0; 1:r0=1; | ||
83 | 711092 :>0:r0=1; 1:r0=1; | ||
84 | No | ||
85 | Witnesses | ||
86 | Positive: 0, Negative: 2000000 | ||
87 | Condition exists (0:r0=0 /\ 1:r0=0) is NOT validated | ||
88 | Hash=d66d99523e2cac6b06e66f4c995ebb48 | ||
89 | Observation SB+mbonceonces Never 0 2000000 | ||
90 | Time SB+mbonceonces 0.16 | ||
91 | |||
92 | The "Positive: 0 Negative: 2000000" and the "Never 0 2000000" indicate | ||
93 | that during two million trials, the state specified in this litmus | ||
94 | test's "exists" clause was not reached. | ||
95 | |||
96 | And, as with "herd7", please see "klitmus7 -help" or "herdtools7/doc/" | ||
97 | for more information. | ||
98 | |||
99 | |||
100 | ==================== | ||
101 | DESCRIPTION OF FILES | ||
102 | ==================== | ||
103 | |||
104 | Documentation/cheatsheet.txt | ||
105 | Quick-reference guide to the Linux-kernel memory model. | ||
106 | |||
107 | Documentation/explanation.txt | ||
108 | Describes the memory model in detail. | ||
109 | |||
110 | Documentation/recipes.txt | ||
111 | Lists common memory-ordering patterns. | ||
112 | |||
113 | Documentation/references.txt | ||
114 | Provides background reading. | ||
115 | |||
116 | linux-kernel.bell | ||
117 | Categorizes the relevant instructions, including memory | ||
118 | references, memory barriers, atomic read-modify-write operations, | ||
119 | lock acquisition/release, and RCU operations. | ||
120 | |||
121 | More formally, this file (1) lists the subtypes of the various | ||
122 | event types used by the memory model and (2) performs RCU | ||
123 | read-side critical section nesting analysis. | ||
124 | |||
125 | linux-kernel.cat | ||
126 | Specifies what reorderings are forbidden by memory references, | ||
127 | memory barriers, atomic read-modify-write operations, and RCU. | ||
128 | |||
129 | More formally, this file specifies what executions are forbidden | ||
130 | by the memory model. Allowed executions are those which | ||
131 | satisfy the model's "coherence", "atomic", "happens-before", | ||
132 | "propagation", and "rcu" axioms, which are defined in the file. | ||
133 | |||
134 | linux-kernel.cfg | ||
135 | Convenience file that gathers the common-case herd7 command-line | ||
136 | arguments. | ||
137 | |||
138 | linux-kernel.def | ||
139 | Maps from C-like syntax to herd7's internal litmus-test | ||
140 | instruction-set architecture. | ||
141 | |||
142 | litmus-tests | ||
143 | Directory containing a few representative litmus tests, which | ||
144 | are listed in litmus-tests/README. Many more litmus | ||
145 | tests are available at https://github.com/paulmckrcu/litmus. | ||
146 | |||
147 | lock.cat | ||
148 | Provides a front-end analysis of lock acquisition and release, | ||
149 | for example, associating a lock acquisition with the preceding | ||
150 | and following releases and checking for self-deadlock. | ||
151 | |||
152 | More formally, this file defines a performance-enhanced scheme | ||
153 | for generation of the possible reads-from and coherence order | ||
154 | relations on the locking primitives. | ||
155 | |||
156 | README | ||
157 | This file. | ||
158 | |||
159 | |||
160 | =========== | ||
161 | LIMITATIONS | ||
162 | =========== | ||
163 | |||
164 | The Linux-kernel memory model has the following limitations: | ||
165 | |||
166 | 1. Compiler optimizations are not modeled. Of course, the use | ||
167 | of READ_ONCE() and WRITE_ONCE() limits the compiler's ability | ||
168 | to optimize, but there is Linux-kernel code that uses bare C | ||
169 | memory accesses. Handling this code is on the to-do list. | ||
170 | For more information, see Documentation/explanation.txt (in | ||
171 | particular, the "THE PROGRAM ORDER RELATION: po AND po-loc" | ||
172 | and "A WARNING" sections). | ||
173 | |||
174 | 2. Multiple access sizes for a single variable are not supported, | ||
175 | and neither are misaligned or partially overlapping accesses. | ||
176 | |||
177 | 3. Exceptions and interrupts are not modeled. In some cases, | ||
178 | this limitation can be overcome by modeling the interrupt or | ||
179 | exception with an additional process. | ||
180 | |||
181 | 4. I/O such as MMIO or DMA is not supported. | ||
182 | |||
183 | 5. Self-modifying code (such as that found in the kernel's | ||
184 | alternatives mechanism, function tracer, Berkeley Packet Filter | ||
185 | JIT compiler, and module loader) is not supported. | ||
186 | |||
187 | 6. Complete modeling of all variants of atomic read-modify-write | ||
188 | operations, locking primitives, and RCU is not provided. | ||
189 | For example, call_rcu() and rcu_barrier() are not supported. | ||
190 | However, a substantial amount of support is provided for these | ||
191 | operations, as shown in the linux-kernel.def file. | ||
192 | |||
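| As noted in item 1 above, here is a brief editorial illustration of | ||
| the distinction between marked accesses, which the model handles, | ||
| and bare C accesses, which it does not yet handle (the variable x | ||
| and the register r0 are hypothetical): | ||
| |||
| WRITE_ONCE(x, 1); /* marked store: modeled */ | ||
| r0 = READ_ONCE(x); /* marked load: modeled */ | ||
| x = 1; /* bare C store: not yet modeled */ | ||
| r0 = x; /* bare C load: not yet modeled */ | ||
| |||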
193 | The "herd7" tool has some additional limitations of its own, apart from | ||
194 | the memory model: | ||
195 | |||
196 | 1. Non-trivial data structures such as arrays or structures are | ||
197 | not supported. However, pointers are supported, allowing trivial | ||
198 | linked lists to be constructed. | ||
199 | |||
200 | 2. Dynamic memory allocation is not supported, although this can | ||
201 | be worked around in some cases by supplying multiple statically | ||
202 | allocated variables. | ||
203 | |||
204 | Some of these limitations may be overcome in the future, but others are | ||
205 | more likely to be addressed by incorporating the Linux-kernel memory model | ||
206 | into other tools. | ||
diff --git a/tools/memory-model/linux-kernel.bell b/tools/memory-model/linux-kernel.bell new file mode 100644 index 000000000000..432c7cf71b23 --- /dev/null +++ b/tools/memory-model/linux-kernel.bell | |||
@@ -0,0 +1,52 @@ | |||
1 | // SPDX-License-Identifier: GPL-2.0+ | ||
2 | (* | ||
3 | * Copyright (C) 2015 Jade Alglave <j.alglave@ucl.ac.uk>, | ||
4 | * Copyright (C) 2016 Luc Maranget <luc.maranget@inria.fr> for Inria | ||
5 | * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu>, | ||
6 | * Andrea Parri <parri.andrea@gmail.com> | ||
7 | * | ||
8 | * An earlier version of this file appears in the companion webpage for | ||
9 | * "Frightening small children and disconcerting grown-ups: Concurrency | ||
10 | * in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern, | ||
11 | * which is to appear in ASPLOS 2018. | ||
12 | *) | ||
13 | |||
14 | "Linux-kernel memory consistency model" | ||
15 | |||
16 | enum Accesses = 'once (*READ_ONCE,WRITE_ONCE,ACCESS_ONCE*) || | ||
17 | 'release (*smp_store_release*) || | ||
18 | 'acquire (*smp_load_acquire*) || | ||
19 | 'noreturn (* R of non-return RMW *) | ||
20 | instructions R[{'once,'acquire,'noreturn}] | ||
21 | instructions W[{'once,'release}] | ||
22 | instructions RMW[{'once,'acquire,'release}] | ||
23 | |||
24 | enum Barriers = 'wmb (*smp_wmb*) || | ||
25 | 'rmb (*smp_rmb*) || | ||
26 | 'mb (*smp_mb*) || | ||
27 | 'rcu-lock (*rcu_read_lock*) || | ||
28 | 'rcu-unlock (*rcu_read_unlock*) || | ||
29 | 'sync-rcu (*synchronize_rcu*) || | ||
30 | 'before-atomic (*smp_mb__before_atomic*) || | ||
31 | 'after-atomic (*smp_mb__after_atomic*) || | ||
32 | 'after-spinlock (*smp_mb__after_spinlock*) | ||
33 | instructions F[Barriers] | ||
34 | |||
35 | (* Compute matching pairs of nested Rcu-lock and Rcu-unlock *) | ||
36 | let matched = let rec | ||
37 | unmatched-locks = Rcu-lock \ domain(matched) | ||
38 | and unmatched-unlocks = Rcu-unlock \ range(matched) | ||
39 | and unmatched = unmatched-locks | unmatched-unlocks | ||
40 | and unmatched-po = [unmatched] ; po ; [unmatched] | ||
41 | and unmatched-locks-to-unlocks = | ||
42 | [unmatched-locks] ; po ; [unmatched-unlocks] | ||
43 | and matched = matched | (unmatched-locks-to-unlocks \ | ||
44 | (unmatched-po ; unmatched-po)) | ||
45 | in matched | ||
46 | |||
47 | (* Validate nesting *) | ||
48 | flag ~empty Rcu-lock \ domain(matched) as unbalanced-rcu-locking | ||
49 | flag ~empty Rcu-unlock \ range(matched) as unbalanced-rcu-locking | ||
50 | |||
51 | (* Outermost level of nesting only *) | ||
52 | let crit = matched \ (po^-1 ; matched ; po^-1) | ||
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat new file mode 100644 index 000000000000..df97db03b6c2 --- /dev/null +++ b/tools/memory-model/linux-kernel.cat | |||
@@ -0,0 +1,121 @@ | |||
1 | // SPDX-License-Identifier: GPL-2.0+ | ||
2 | (* | ||
3 | * Copyright (C) 2015 Jade Alglave <j.alglave@ucl.ac.uk>, | ||
4 | * Copyright (C) 2016 Luc Maranget <luc.maranget@inria.fr> for Inria | ||
5 | * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu>, | ||
6 | * Andrea Parri <parri.andrea@gmail.com> | ||
7 | * | ||
8 | * An earlier version of this file appears in the companion webpage for | ||
9 | * "Frightening small children and disconcerting grown-ups: Concurrency | ||
10 | * in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern, | ||
11 | * which is to appear in ASPLOS 2018. | ||
12 | *) | ||
13 | |||
14 | "Linux-kernel memory consistency model" | ||
15 | |||
16 | (* | ||
17 | * File "lock.cat" handles locks and is experimental. | ||
18 | * It can be replaced by include "cos.cat" for tests that do not use locks. | ||
19 | *) | ||
20 | |||
21 | include "lock.cat" | ||
22 | |||
23 | (*******************) | ||
24 | (* Basic relations *) | ||
25 | (*******************) | ||
26 | |||
27 | (* Fences *) | ||
28 | let rmb = [R \ Noreturn] ; fencerel(Rmb) ; [R \ Noreturn] | ||
29 | let wmb = [W] ; fencerel(Wmb) ; [W] | ||
30 | let mb = ([M] ; fencerel(Mb) ; [M]) | | ||
31 | ([M] ; fencerel(Before-atomic) ; [RMW] ; po? ; [M]) | | ||
32 | ([M] ; po? ; [RMW] ; fencerel(After-atomic) ; [M]) | | ||
33 | ([M] ; po? ; [LKW] ; fencerel(After-spinlock) ; [M]) | ||
34 | let gp = po ; [Sync-rcu] ; po? | ||
35 | |||
36 | let strong-fence = mb | gp | ||
37 | |||
38 | (* Release Acquire *) | ||
39 | let acq-po = [Acquire] ; po ; [M] | ||
40 | let po-rel = [M] ; po ; [Release] | ||
41 | let rfi-rel-acq = [Release] ; rfi ; [Acquire] | ||
42 | |||
43 | (**********************************) | ||
44 | (* Fundamental coherence ordering *) | ||
45 | (**********************************) | ||
46 | |||
47 | (* Sequential Consistency Per Variable *) | ||
48 | let com = rf | co | fr | ||
49 | acyclic po-loc | com as coherence | ||
50 | |||
51 | (* Atomic Read-Modify-Write *) | ||
52 | empty rmw & (fre ; coe) as atomic | ||
53 | |||
54 | (**********************************) | ||
55 | (* Instruction execution ordering *) | ||
56 | (**********************************) | ||
57 | |||
58 | (* Preserved Program Order *) | ||
59 | let dep = addr | data | ||
60 | let rwdep = (dep | ctrl) ; [W] | ||
61 | let overwrite = co | fr | ||
62 | let to-w = rwdep | (overwrite & int) | ||
63 | let to-r = addr | (dep ; rfi) | rfi-rel-acq | ||
64 | let fence = strong-fence | wmb | po-rel | rmb | acq-po | ||
65 | let ppo = to-r | to-w | fence | ||
66 | |||
67 | (* Propagation: Ordering from release operations and strong fences. *) | ||
68 | let A-cumul(r) = rfe? ; r | ||
69 | let cumul-fence = A-cumul(strong-fence | po-rel) | wmb | ||
70 | let prop = (overwrite & ext)? ; cumul-fence* ; rfe? | ||
71 | |||
72 | (* | ||
73 | * Happens Before: Ordering from the passage of time. | ||
74 | * No fences needed here for prop because relation confined to one process. | ||
75 | *) | ||
76 | let hb = ppo | rfe | ((prop \ id) & int) | ||
77 | acyclic hb as happens-before | ||
78 | |||
79 | (****************************************) | ||
80 | (* Write and fence propagation ordering *) | ||
81 | (****************************************) | ||
82 | |||
83 | (* Propagation: Each non-rf link needs a strong fence. *) | ||
84 | let pb = prop ; strong-fence ; hb* | ||
85 | acyclic pb as propagation | ||
86 | |||
87 | (*******) | ||
88 | (* RCU *) | ||
89 | (*******) | ||
90 | |||
91 | (* | ||
92 | * Effect of read-side critical section proceeds from the rcu_read_lock() | ||
93 | * onward on the one hand and from the rcu_read_unlock() backwards on the | ||
94 | * other hand. | ||
95 | *) | ||
96 | let rscs = po ; crit^-1 ; po? | ||
97 | |||
98 | (* | ||
99 | * The synchronize_rcu() strong fence is special in that it can order not | ||
100 | * one but two non-rf relations, but only in conjunction with an RCU | ||
101 | * read-side critical section. | ||
102 | *) | ||
103 | let link = hb* ; pb* ; prop | ||
104 | |||
105 | (* Chains that affect the RCU grace-period guarantee *) | ||
106 | let gp-link = gp ; link | ||
107 | let rscs-link = rscs ; link | ||
108 | |||
109 | (* | ||
110 | * A cycle containing at least as many grace periods as RCU read-side | ||
111 | * critical sections is forbidden. | ||
112 | *) | ||
113 | let rec rcu-path = | ||
114 | gp-link | | ||
115 | (gp-link ; rscs-link) | | ||
116 | (rscs-link ; gp-link) | | ||
117 | (rcu-path ; rcu-path) | | ||
118 | (gp-link ; rcu-path ; rscs-link) | | ||
119 | (rscs-link ; rcu-path ; gp-link) | ||
120 | |||
121 | irreflexive rcu-path as rcu | ||
diff --git a/tools/memory-model/linux-kernel.cfg b/tools/memory-model/linux-kernel.cfg new file mode 100644 index 000000000000..3c8098e99f41 --- /dev/null +++ b/tools/memory-model/linux-kernel.cfg | |||
@@ -0,0 +1,21 @@ | |||
1 | macros linux-kernel.def | ||
2 | bell linux-kernel.bell | ||
3 | model linux-kernel.cat | ||
4 | graph columns | ||
5 | squished true | ||
6 | showevents noregs | ||
7 | movelabel true | ||
8 | fontsize 8 | ||
9 | xscale 2.0 | ||
10 | yscale 1.5 | ||
11 | arrowsize 0.8 | ||
12 | showinitrf false | ||
13 | showfinalrf false | ||
14 | showinitwrites false | ||
15 | splines spline | ||
16 | pad 0.1 | ||
17 | edgeattr hb,color,indigo | ||
18 | edgeattr co,color,blue | ||
19 | edgeattr mb,color,darkgreen | ||
20 | edgeattr wmb,color,darkgreen | ||
21 | edgeattr rmb,color,darkgreen | ||
diff --git a/tools/memory-model/linux-kernel.def b/tools/memory-model/linux-kernel.def new file mode 100644 index 000000000000..397e4e67e8c8 --- /dev/null +++ b/tools/memory-model/linux-kernel.def | |||
@@ -0,0 +1,106 @@ | |||
1 | // SPDX-License-Identifier: GPL-2.0+ | ||
2 | // | ||
3 | // An earlier version of this file appears in the companion webpage for | ||
4 | // "Frightening small children and disconcerting grown-ups: Concurrency | ||
5 | // in the Linux kernel" by Alglave, Maranget, McKenney, Parri, and Stern, | ||
6 | // which is to appear in ASPLOS 2018. | ||
7 | |||
8 | // ONCE | ||
9 | READ_ONCE(X) __load{once}(X) | ||
10 | WRITE_ONCE(X,V) { __store{once}(X,V); } | ||
11 | |||
12 | // Release Acquire and friends | ||
13 | smp_store_release(X,V) { __store{release}(*X,V); } | ||
14 | smp_load_acquire(X) __load{acquire}(*X) | ||
15 | rcu_assign_pointer(X,V) { __store{release}(X,V); } | ||
16 | rcu_dereference(X) __load{once}(X) | ||
17 | |||
18 | // Fences | ||
19 | smp_mb() { __fence{mb} ; } | ||
20 | smp_rmb() { __fence{rmb} ; } | ||
21 | smp_wmb() { __fence{wmb} ; } | ||
22 | smp_mb__before_atomic() { __fence{before-atomic} ; } | ||
23 | smp_mb__after_atomic() { __fence{after-atomic} ; } | ||
24 | smp_mb__after_spinlock() { __fence{after-spinlock} ; } | ||
25 | |||
26 | // Exchange | ||
27 | xchg(X,V) __xchg{mb}(X,V) | ||
28 | xchg_relaxed(X,V) __xchg{once}(X,V) | ||
29 | xchg_release(X,V) __xchg{release}(X,V) | ||
30 | xchg_acquire(X,V) __xchg{acquire}(X,V) | ||
31 | cmpxchg(X,V,W) __cmpxchg{mb}(X,V,W) | ||
32 | cmpxchg_relaxed(X,V,W) __cmpxchg{once}(X,V,W) | ||
33 | cmpxchg_acquire(X,V,W) __cmpxchg{acquire}(X,V,W) | ||
34 | cmpxchg_release(X,V,W) __cmpxchg{release}(X,V,W) | ||
35 | |||
36 | // Spinlocks | ||
37 | spin_lock(X) { __lock(X) ; } | ||
38 | spin_unlock(X) { __unlock(X) ; } | ||
39 | spin_trylock(X) __trylock(X) | ||
40 | |||
41 | // RCU | ||
42 | rcu_read_lock() { __fence{rcu-lock}; } | ||
43 | rcu_read_unlock() { __fence{rcu-unlock};} | ||
44 | synchronize_rcu() { __fence{sync-rcu}; } | ||
45 | synchronize_rcu_expedited() { __fence{sync-rcu}; } | ||
46 | |||
47 | // Atomic | ||
48 | atomic_read(X) READ_ONCE(*X) | ||
49 | atomic_set(X,V) { WRITE_ONCE(*X,V) ; } | ||
50 | atomic_read_acquire(X) smp_load_acquire(X) | ||
51 | atomic_set_release(X,V) { smp_store_release(X,V); } | ||
52 | |||
53 | atomic_add(V,X) { __atomic_op(X,+,V) ; } | ||
54 | atomic_sub(V,X) { __atomic_op(X,-,V) ; } | ||
55 | atomic_inc(X) { __atomic_op(X,+,1) ; } | ||
56 | atomic_dec(X) { __atomic_op(X,-,1) ; } | ||
57 | |||
58 | atomic_add_return(V,X) __atomic_op_return{mb}(X,+,V) | ||
59 | atomic_add_return_relaxed(V,X) __atomic_op_return{once}(X,+,V) | ||
60 | atomic_add_return_acquire(V,X) __atomic_op_return{acquire}(X,+,V) | ||
61 | atomic_add_return_release(V,X) __atomic_op_return{release}(X,+,V) | ||
62 | atomic_fetch_add(V,X) __atomic_fetch_op{mb}(X,+,V) | ||
63 | atomic_fetch_add_relaxed(V,X) __atomic_fetch_op{once}(X,+,V) | ||
64 | atomic_fetch_add_acquire(V,X) __atomic_fetch_op{acquire}(X,+,V) | ||
65 | atomic_fetch_add_release(V,X) __atomic_fetch_op{release}(X,+,V) | ||
66 | |||
67 | atomic_inc_return(X) __atomic_op_return{mb}(X,+,1) | ||
68 | atomic_inc_return_relaxed(X) __atomic_op_return{once}(X,+,1) | ||
69 | atomic_inc_return_acquire(X) __atomic_op_return{acquire}(X,+,1) | ||
70 | atomic_inc_return_release(X) __atomic_op_return{release}(X,+,1) | ||
71 | atomic_fetch_inc(X) __atomic_fetch_op{mb}(X,+,1) | ||
72 | atomic_fetch_inc_relaxed(X) __atomic_fetch_op{once}(X,+,1) | ||
73 | atomic_fetch_inc_acquire(X) __atomic_fetch_op{acquire}(X,+,1) | ||
74 | atomic_fetch_inc_release(X) __atomic_fetch_op{release}(X,+,1) | ||
75 | |||
76 | atomic_sub_return(V,X) __atomic_op_return{mb}(X,-,V) | ||
77 | atomic_sub_return_relaxed(V,X) __atomic_op_return{once}(X,-,V) | ||
78 | atomic_sub_return_acquire(V,X) __atomic_op_return{acquire}(X,-,V) | ||
79 | atomic_sub_return_release(V,X) __atomic_op_return{release}(X,-,V) | ||
80 | atomic_fetch_sub(V,X) __atomic_fetch_op{mb}(X,-,V) | ||
81 | atomic_fetch_sub_relaxed(V,X) __atomic_fetch_op{once}(X,-,V) | ||
82 | atomic_fetch_sub_acquire(V,X) __atomic_fetch_op{acquire}(X,-,V) | ||
83 | atomic_fetch_sub_release(V,X) __atomic_fetch_op{release}(X,-,V) | ||
84 | |||
85 | atomic_dec_return(X) __atomic_op_return{mb}(X,-,1) | ||
86 | atomic_dec_return_relaxed(X) __atomic_op_return{once}(X,-,1) | ||
87 | atomic_dec_return_acquire(X) __atomic_op_return{acquire}(X,-,1) | ||
88 | atomic_dec_return_release(X) __atomic_op_return{release}(X,-,1) | ||
89 | atomic_fetch_dec(X) __atomic_fetch_op{mb}(X,-,1) | ||
90 | atomic_fetch_dec_relaxed(X) __atomic_fetch_op{once}(X,-,1) | ||
91 | atomic_fetch_dec_acquire(X) __atomic_fetch_op{acquire}(X,-,1) | ||
92 | atomic_fetch_dec_release(X) __atomic_fetch_op{release}(X,-,1) | ||
93 | |||
94 | atomic_xchg(X,V) __xchg{mb}(X,V) | ||
95 | atomic_xchg_relaxed(X,V) __xchg{once}(X,V) | ||
96 | atomic_xchg_release(X,V) __xchg{release}(X,V) | ||
97 | atomic_xchg_acquire(X,V) __xchg{acquire}(X,V) | ||
98 | atomic_cmpxchg(X,V,W) __cmpxchg{mb}(X,V,W) | ||
99 | atomic_cmpxchg_relaxed(X,V,W) __cmpxchg{once}(X,V,W) | ||
100 | atomic_cmpxchg_acquire(X,V,W) __cmpxchg{acquire}(X,V,W) | ||
101 | atomic_cmpxchg_release(X,V,W) __cmpxchg{release}(X,V,W) | ||
102 | |||
103 | atomic_sub_and_test(V,X) __atomic_op_return{mb}(X,-,V) == 0 | ||
104 | atomic_dec_and_test(X) __atomic_op_return{mb}(X,-,1) == 0 | ||
105 | atomic_inc_and_test(X) __atomic_op_return{mb}(X,+,1) == 0 | ||
106 | atomic_add_negative(V,X) __atomic_op_return{mb}(X,+,V) < 0 | ||
diff --git a/tools/memory-model/litmus-tests/CoRR+poonceonce+Once.litmus b/tools/memory-model/litmus-tests/CoRR+poonceonce+Once.litmus new file mode 100644 index 000000000000..967f9f2a6226 --- /dev/null +++ b/tools/memory-model/litmus-tests/CoRR+poonceonce+Once.litmus | |||
@@ -0,0 +1,26 @@ | |||
1 | C CoRR+poonceonce+Once | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * Test of read-read coherence, that is, whether or not two successive | ||
7 | * reads from the same variable are ordered. | ||
8 | *) | ||
9 | |||
10 | {} | ||
11 | |||
12 | P0(int *x) | ||
13 | { | ||
14 | WRITE_ONCE(*x, 1); | ||
15 | } | ||
16 | |||
17 | P1(int *x) | ||
18 | { | ||
19 | int r0; | ||
20 | int r1; | ||
21 | |||
22 | r0 = READ_ONCE(*x); | ||
23 | r1 = READ_ONCE(*x); | ||
24 | } | ||
25 | |||
26 | exists (1:r0=1 /\ 1:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/CoRW+poonceonce+Once.litmus b/tools/memory-model/litmus-tests/CoRW+poonceonce+Once.litmus new file mode 100644 index 000000000000..4635739f3974 --- /dev/null +++ b/tools/memory-model/litmus-tests/CoRW+poonceonce+Once.litmus | |||
@@ -0,0 +1,25 @@ | |||
1 | C CoRW+poonceonce+Once | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * Test of read-write coherence, that is, whether or not a read from | ||
7 | * a given variable and a later write to that same variable are ordered. | ||
8 | *) | ||
9 | |||
10 | {} | ||
11 | |||
12 | P0(int *x) | ||
13 | { | ||
14 | int r0; | ||
15 | |||
16 | r0 = READ_ONCE(*x); | ||
17 | WRITE_ONCE(*x, 1); | ||
18 | } | ||
19 | |||
20 | P1(int *x) | ||
21 | { | ||
22 | WRITE_ONCE(*x, 2); | ||
23 | } | ||
24 | |||
25 | exists (x=2 /\ 0:r0=2) | ||
diff --git a/tools/memory-model/litmus-tests/CoWR+poonceonce+Once.litmus b/tools/memory-model/litmus-tests/CoWR+poonceonce+Once.litmus new file mode 100644 index 000000000000..bb068c92d8da --- /dev/null +++ b/tools/memory-model/litmus-tests/CoWR+poonceonce+Once.litmus | |||
@@ -0,0 +1,25 @@ | |||
1 | C CoWR+poonceonce+Once | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * Test of write-read coherence, that is, whether or not a write to a | ||
7 | * given variable and a later read from that same variable are ordered. | ||
8 | *) | ||
9 | |||
10 | {} | ||
11 | |||
12 | P0(int *x) | ||
13 | { | ||
14 | int r0; | ||
15 | |||
16 | WRITE_ONCE(*x, 1); | ||
17 | r0 = READ_ONCE(*x); | ||
18 | } | ||
19 | |||
20 | P1(int *x) | ||
21 | { | ||
22 | WRITE_ONCE(*x, 2); | ||
23 | } | ||
24 | |||
25 | exists (x=1 /\ 0:r0=2) | ||
diff --git a/tools/memory-model/litmus-tests/CoWW+poonceonce.litmus b/tools/memory-model/litmus-tests/CoWW+poonceonce.litmus new file mode 100644 index 000000000000..0d9f0a958799 --- /dev/null +++ b/tools/memory-model/litmus-tests/CoWW+poonceonce.litmus | |||
@@ -0,0 +1,18 @@ | |||
1 | C CoWW+poonceonce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * Test of write-write coherence, that is, whether or not two successive | ||
7 | * writes to the same variable are ordered. | ||
8 | *) | ||
9 | |||
10 | {} | ||
11 | |||
12 | P0(int *x) | ||
13 | { | ||
14 | WRITE_ONCE(*x, 1); | ||
15 | WRITE_ONCE(*x, 2); | ||
16 | } | ||
17 | |||
18 | exists (x=1) | ||
diff --git a/tools/memory-model/litmus-tests/IRIW+mbonceonces+OnceOnce.litmus b/tools/memory-model/litmus-tests/IRIW+mbonceonces+OnceOnce.litmus new file mode 100644 index 000000000000..50d5db9ea983 --- /dev/null +++ b/tools/memory-model/litmus-tests/IRIW+mbonceonces+OnceOnce.litmus | |||
@@ -0,0 +1,45 @@ | |||
1 | C IRIW+mbonceonces+OnceOnce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * Test of independent reads from independent writes with smp_mb() | ||
7 | * between each pair of reads. In other words, is smp_mb() sufficient to | ||
8 | * cause two different reading processes to agree on the order of a pair | ||
9 | * of writes, where each write is to a different variable by a different | ||
10 | * process? | ||
11 | *) | ||
12 | |||
13 | {} | ||
14 | |||
15 | P0(int *x) | ||
16 | { | ||
17 | WRITE_ONCE(*x, 1); | ||
18 | } | ||
19 | |||
20 | P1(int *x, int *y) | ||
21 | { | ||
22 | int r0; | ||
23 | int r1; | ||
24 | |||
25 | r0 = READ_ONCE(*x); | ||
26 | smp_mb(); | ||
27 | r1 = READ_ONCE(*y); | ||
28 | } | ||
29 | |||
30 | P2(int *y) | ||
31 | { | ||
32 | WRITE_ONCE(*y, 1); | ||
33 | } | ||
34 | |||
35 | P3(int *x, int *y) | ||
36 | { | ||
37 | int r0; | ||
38 | int r1; | ||
39 | |||
40 | r0 = READ_ONCE(*y); | ||
41 | smp_mb(); | ||
42 | r1 = READ_ONCE(*x); | ||
43 | } | ||
44 | |||
45 | exists (1:r0=1 /\ 1:r1=0 /\ 3:r0=1 /\ 3:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/IRIW+poonceonces+OnceOnce.litmus b/tools/memory-model/litmus-tests/IRIW+poonceonces+OnceOnce.litmus new file mode 100644 index 000000000000..4b54dd6a6cd9 --- /dev/null +++ b/tools/memory-model/litmus-tests/IRIW+poonceonces+OnceOnce.litmus | |||
@@ -0,0 +1,43 @@ | |||
1 | C IRIW+poonceonces+OnceOnce | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * Test of independent reads from independent writes with nothing | ||
7 | * between each pair of reads. In other words, is anything at all | ||
8 | * needed to cause two different reading processes to agree on the order | ||
9 | * of a pair of writes, where each write is to a different variable by a | ||
10 | * different process? | ||
11 | *) | ||
12 | |||
13 | {} | ||
14 | |||
15 | P0(int *x) | ||
16 | { | ||
17 | WRITE_ONCE(*x, 1); | ||
18 | } | ||
19 | |||
20 | P1(int *x, int *y) | ||
21 | { | ||
22 | int r0; | ||
23 | int r1; | ||
24 | |||
25 | r0 = READ_ONCE(*x); | ||
26 | r1 = READ_ONCE(*y); | ||
27 | } | ||
28 | |||
29 | P2(int *y) | ||
30 | { | ||
31 | WRITE_ONCE(*y, 1); | ||
32 | } | ||
33 | |||
34 | P3(int *x, int *y) | ||
35 | { | ||
36 | int r0; | ||
37 | int r1; | ||
38 | |||
39 | r0 = READ_ONCE(*y); | ||
40 | r1 = READ_ONCE(*x); | ||
41 | } | ||
42 | |||
43 | exists (1:r0=1 /\ 1:r1=0 /\ 3:r0=1 /\ 3:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus new file mode 100644 index 000000000000..7a39a0aaa976 --- /dev/null +++ b/tools/memory-model/litmus-tests/ISA2+pooncelock+pooncelock+pombonce.litmus | |||
@@ -0,0 +1,41 @@ | |||
1 | C ISA2+pooncelock+pooncelock+pombonce | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * This test shows that the ordering provided by a lock-protected S | ||
7 | * litmus test (P0() and P1()) is not visible to the external process P2(). | ||
8 | * This is likely to change soon. | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x, int *y, spinlock_t *mylock) | ||
14 | { | ||
15 | spin_lock(mylock); | ||
16 | WRITE_ONCE(*x, 1); | ||
17 | WRITE_ONCE(*y, 1); | ||
18 | spin_unlock(mylock); | ||
19 | } | ||
20 | |||
21 | P1(int *y, int *z, spinlock_t *mylock) | ||
22 | { | ||
23 | int r0; | ||
24 | |||
25 | spin_lock(mylock); | ||
26 | r0 = READ_ONCE(*y); | ||
27 | WRITE_ONCE(*z, 1); | ||
28 | spin_unlock(mylock); | ||
29 | } | ||
30 | |||
31 | P2(int *x, int *z) | ||
32 | { | ||
33 | int r1; | ||
34 | int r2; | ||
35 | |||
36 | r2 = READ_ONCE(*z); | ||
37 | smp_mb(); | ||
38 | r1 = READ_ONCE(*x); | ||
39 | } | ||
40 | |||
41 | exists (1:r0=1 /\ 2:r2=1 /\ 2:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/ISA2+poonceonces.litmus b/tools/memory-model/litmus-tests/ISA2+poonceonces.litmus new file mode 100644 index 000000000000..b321aa6f4ea5 --- /dev/null +++ b/tools/memory-model/litmus-tests/ISA2+poonceonces.litmus | |||
@@ -0,0 +1,37 @@ | |||
1 | C ISA2+poonceonces | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * Given a release-acquire chain ordering the first process's store | ||
7 | * against the last process's load, is ordering preserved if all of the | ||
8 | * smp_store_release() invocations are replaced by WRITE_ONCE() and all | ||
9 | * of the smp_load_acquire() invocations are replaced by READ_ONCE()? | ||
10 | *) | ||
11 | |||
12 | {} | ||
13 | |||
14 | P0(int *x, int *y) | ||
15 | { | ||
16 | WRITE_ONCE(*x, 1); | ||
17 | WRITE_ONCE(*y, 1); | ||
18 | } | ||
19 | |||
20 | P1(int *y, int *z) | ||
21 | { | ||
22 | int r0; | ||
23 | |||
24 | r0 = READ_ONCE(*y); | ||
25 | WRITE_ONCE(*z, 1); | ||
26 | } | ||
27 | |||
28 | P2(int *x, int *z) | ||
29 | { | ||
30 | int r0; | ||
31 | int r1; | ||
32 | |||
33 | r0 = READ_ONCE(*z); | ||
34 | r1 = READ_ONCE(*x); | ||
35 | } | ||
36 | |||
37 | exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus b/tools/memory-model/litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus new file mode 100644 index 000000000000..025b0462ec9b --- /dev/null +++ b/tools/memory-model/litmus-tests/ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus | |||
@@ -0,0 +1,39 @@ | |||
1 | C ISA2+pooncerelease+poacquirerelease+poacquireonce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates that a release-acquire chain suffices | ||
7 | * to order P0()'s initial write against P2()'s final read. The reason | ||
8 | * that the release-acquire chain suffices is that in all but one | ||
9 | * case (P2() to P0()), each process reads from the preceding process's | ||
10 | * write. In memory-model-speak, there is only one non-reads-from | ||
11 | * (AKA non-rf) link, so release-acquire is all that is needed. | ||
12 | *) | ||
13 | |||
14 | {} | ||
15 | |||
16 | P0(int *x, int *y) | ||
17 | { | ||
18 | WRITE_ONCE(*x, 1); | ||
19 | smp_store_release(y, 1); | ||
20 | } | ||
21 | |||
22 | P1(int *y, int *z) | ||
23 | { | ||
24 | int r0; | ||
25 | |||
26 | r0 = smp_load_acquire(y); | ||
27 | smp_store_release(z, 1); | ||
28 | } | ||
29 | |||
30 | P2(int *x, int *z) | ||
31 | { | ||
32 | int r0; | ||
33 | int r1; | ||
34 | |||
35 | r0 = smp_load_acquire(z); | ||
36 | r1 = READ_ONCE(*x); | ||
37 | } | ||
38 | |||
39 | exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/LB+ctrlonceonce+mbonceonce.litmus b/tools/memory-model/litmus-tests/LB+ctrlonceonce+mbonceonce.litmus new file mode 100644 index 000000000000..de6708229dd1 --- /dev/null +++ b/tools/memory-model/litmus-tests/LB+ctrlonceonce+mbonceonce.litmus | |||
@@ -0,0 +1,34 @@ | |||
1 | C LB+ctrlonceonce+mbonceonce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates that lightweight ordering suffices for | ||
7 | * the load-buffering pattern, that is, preventing the outcome in which | ||
8 | * each process reads from the preceding process's write. In this example, | ||
9 | * the combination of a control dependency and a full memory barrier is enough | ||
10 | * to do the trick. (But the full memory barrier could be replaced with | ||
11 | * another control dependency and order would still be maintained.) | ||
12 | *) | ||
13 | |||
14 | {} | ||
15 | |||
16 | P0(int *x, int *y) | ||
17 | { | ||
18 | int r0; | ||
19 | |||
20 | r0 = READ_ONCE(*x); | ||
21 | if (r0) | ||
22 | WRITE_ONCE(*y, 1); | ||
23 | } | ||
24 | |||
25 | P1(int *x, int *y) | ||
26 | { | ||
27 | int r0; | ||
28 | |||
29 | r0 = READ_ONCE(*y); | ||
30 | smp_mb(); | ||
31 | WRITE_ONCE(*x, 1); | ||
32 | } | ||
33 | |||
34 | exists (0:r0=1 /\ 1:r0=1) | ||
diff --git a/tools/memory-model/litmus-tests/LB+poacquireonce+pooncerelease.litmus b/tools/memory-model/litmus-tests/LB+poacquireonce+pooncerelease.litmus new file mode 100644 index 000000000000..07b9904b0e49 --- /dev/null +++ b/tools/memory-model/litmus-tests/LB+poacquireonce+pooncerelease.litmus | |||
@@ -0,0 +1,29 @@ | |||
1 | C LB+poacquireonce+pooncerelease | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * Does a release-acquire pair suffice for the load-buffering litmus | ||
7 | * test, where each process reads from one of two variables then writes | ||
8 | * to the other? | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x, int *y) | ||
14 | { | ||
15 | int r0; | ||
16 | |||
17 | r0 = READ_ONCE(*x); | ||
18 | smp_store_release(y, 1); | ||
19 | } | ||
20 | |||
21 | P1(int *x, int *y) | ||
22 | { | ||
23 | int r0; | ||
24 | |||
25 | r0 = smp_load_acquire(y); | ||
26 | WRITE_ONCE(*x, 1); | ||
27 | } | ||
28 | |||
29 | exists (0:r0=1 /\ 1:r0=1) | ||
diff --git a/tools/memory-model/litmus-tests/LB+poonceonces.litmus b/tools/memory-model/litmus-tests/LB+poonceonces.litmus new file mode 100644 index 000000000000..74c49cb3c37b --- /dev/null +++ b/tools/memory-model/litmus-tests/LB+poonceonces.litmus | |||
@@ -0,0 +1,28 @@ | |||
1 | C LB+poonceonces | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * Can the counter-intuitive outcome for the load-buffering pattern | ||
7 | * be prevented even with no explicit ordering? | ||
8 | *) | ||
9 | |||
10 | {} | ||
11 | |||
12 | P0(int *x, int *y) | ||
13 | { | ||
14 | int r0; | ||
15 | |||
16 | r0 = READ_ONCE(*x); | ||
17 | WRITE_ONCE(*y, 1); | ||
18 | } | ||
19 | |||
20 | P1(int *x, int *y) | ||
21 | { | ||
22 | int r0; | ||
23 | |||
24 | r0 = READ_ONCE(*y); | ||
25 | WRITE_ONCE(*x, 1); | ||
26 | } | ||
27 | |||
28 | exists (0:r0=1 /\ 1:r0=1) | ||
diff --git a/tools/memory-model/litmus-tests/MP+onceassign+derefonce.litmus b/tools/memory-model/litmus-tests/MP+onceassign+derefonce.litmus new file mode 100644 index 000000000000..97731b4bbdd8 --- /dev/null +++ b/tools/memory-model/litmus-tests/MP+onceassign+derefonce.litmus | |||
@@ -0,0 +1,34 @@ | |||
1 | C MP+onceassign+derefonce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates that rcu_assign_pointer() and | ||
7 | * rcu_dereference() suffice to ensure that an RCU reader will not see | ||
8 | * pre-initialization garbage when it traverses an RCU-protected data | ||
9 | * structure containing a newly inserted element. | ||
10 | *) | ||
11 | |||
12 | { | ||
13 | y=z; | ||
14 | z=0; | ||
15 | } | ||
16 | |||
17 | P0(int *x, int **y) | ||
18 | { | ||
19 | WRITE_ONCE(*x, 1); | ||
20 | rcu_assign_pointer(*y, x); | ||
21 | } | ||
22 | |||
23 | P1(int *x, int **y) | ||
24 | { | ||
25 | int *r0; | ||
26 | int r1; | ||
27 | |||
28 | rcu_read_lock(); | ||
29 | r0 = rcu_dereference(*y); | ||
30 | r1 = READ_ONCE(*r0); | ||
31 | rcu_read_unlock(); | ||
32 | } | ||
33 | |||
34 | exists (1:r0=x /\ 1:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/MP+polocks.litmus b/tools/memory-model/litmus-tests/MP+polocks.litmus new file mode 100644 index 000000000000..712a4fcdf6ce --- /dev/null +++ b/tools/memory-model/litmus-tests/MP+polocks.litmus | |||
@@ -0,0 +1,35 @@ | |||
1 | C MP+polocks | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates how lock acquisitions and releases can | ||
7 | * stand in for smp_load_acquire() and smp_store_release(), respectively. | ||
8 | * In other words, when holding a given lock (or indeed after releasing a | ||
9 | * given lock), a CPU is not only guaranteed to see the accesses that other | ||
10 | * CPUs made while previously holding that lock, it is also guaranteed | ||
11 | * to see all prior accesses by those other CPUs. | ||
12 | *) | ||
13 | |||
14 | {} | ||
15 | |||
16 | P0(int *x, int *y, spinlock_t *mylock) | ||
17 | { | ||
18 | WRITE_ONCE(*x, 1); | ||
19 | spin_lock(mylock); | ||
20 | WRITE_ONCE(*y, 1); | ||
21 | spin_unlock(mylock); | ||
22 | } | ||
23 | |||
24 | P1(int *x, int *y, spinlock_t *mylock) | ||
25 | { | ||
26 | int r0; | ||
27 | int r1; | ||
28 | |||
29 | spin_lock(mylock); | ||
30 | r0 = READ_ONCE(*y); | ||
31 | spin_unlock(mylock); | ||
32 | r1 = READ_ONCE(*x); | ||
33 | } | ||
34 | |||
35 | exists (1:r0=1 /\ 1:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/MP+poonceonces.litmus b/tools/memory-model/litmus-tests/MP+poonceonces.litmus new file mode 100644 index 000000000000..b2b60b84fb9d --- /dev/null +++ b/tools/memory-model/litmus-tests/MP+poonceonces.litmus | |||
@@ -0,0 +1,27 @@ | |||
1 | C MP+poonceonces | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * Can the counter-intuitive message-passing outcome be prevented with | ||
7 | * no ordering at all? | ||
8 | *) | ||
9 | |||
10 | {} | ||
11 | |||
12 | P0(int *x, int *y) | ||
13 | { | ||
14 | WRITE_ONCE(*x, 1); | ||
15 | WRITE_ONCE(*y, 1); | ||
16 | } | ||
17 | |||
18 | P1(int *x, int *y) | ||
19 | { | ||
20 | int r0; | ||
21 | int r1; | ||
22 | |||
23 | r0 = READ_ONCE(*y); | ||
24 | r1 = READ_ONCE(*x); | ||
25 | } | ||
26 | |||
27 | exists (1:r0=1 /\ 1:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/MP+pooncerelease+poacquireonce.litmus b/tools/memory-model/litmus-tests/MP+pooncerelease+poacquireonce.litmus new file mode 100644 index 000000000000..d52c68429722 --- /dev/null +++ b/tools/memory-model/litmus-tests/MP+pooncerelease+poacquireonce.litmus | |||
@@ -0,0 +1,28 @@ | |||
1 | C MP+pooncerelease+poacquireonce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates that smp_store_release() and | ||
7 | * smp_load_acquire() provide sufficient ordering for the message-passing | ||
8 | * pattern. | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x, int *y) | ||
14 | { | ||
15 | WRITE_ONCE(*x, 1); | ||
16 | smp_store_release(y, 1); | ||
17 | } | ||
18 | |||
19 | P1(int *x, int *y) | ||
20 | { | ||
21 | int r0; | ||
22 | int r1; | ||
23 | |||
24 | r0 = smp_load_acquire(y); | ||
25 | r1 = READ_ONCE(*x); | ||
26 | } | ||
27 | |||
28 | exists (1:r0=1 /\ 1:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/MP+porevlocks.litmus b/tools/memory-model/litmus-tests/MP+porevlocks.litmus new file mode 100644 index 000000000000..72c9276b363e --- /dev/null +++ b/tools/memory-model/litmus-tests/MP+porevlocks.litmus | |||
@@ -0,0 +1,35 @@ | |||
1 | C MP+porevlocks | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates how lock acquisitions and releases can | ||
7 | * stand in for smp_load_acquire() and smp_store_release(), respectively. | ||
8 | * In other words, when holding a given lock (or indeed after releasing a | ||
9 | * given lock), a CPU is not only guaranteed to see the accesses that other | ||
10 | * CPUs made while previously holding that lock, it is also guaranteed to | ||
11 | * see all prior accesses by those other CPUs. | ||
12 | *) | ||
13 | |||
14 | {} | ||
15 | |||
16 | P0(int *x, int *y, spinlock_t *mylock) | ||
17 | { | ||
18 | int r0; | ||
19 | int r1; | ||
20 | |||
21 | r0 = READ_ONCE(*y); | ||
22 | spin_lock(mylock); | ||
23 | r1 = READ_ONCE(*x); | ||
24 | spin_unlock(mylock); | ||
25 | } | ||
26 | |||
27 | P1(int *x, int *y, spinlock_t *mylock) | ||
28 | { | ||
29 | spin_lock(mylock); | ||
30 | WRITE_ONCE(*x, 1); | ||
31 | spin_unlock(mylock); | ||
32 | WRITE_ONCE(*y, 1); | ||
33 | } | ||
34 | |||
35 | exists (0:r0=1 /\ 0:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/MP+wmbonceonce+rmbonceonce.litmus b/tools/memory-model/litmus-tests/MP+wmbonceonce+rmbonceonce.litmus new file mode 100644 index 000000000000..c078f38ff27a --- /dev/null +++ b/tools/memory-model/litmus-tests/MP+wmbonceonce+rmbonceonce.litmus | |||
@@ -0,0 +1,30 @@ | |||
1 | C MP+wmbonceonce+rmbonceonce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates that smp_wmb() and smp_rmb() provide | ||
7 | * sufficient ordering for the message-passing pattern. However, it | ||
8 | * is usually better to use smp_store_release() and smp_load_acquire(). | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x, int *y) | ||
14 | { | ||
15 | WRITE_ONCE(*x, 1); | ||
16 | smp_wmb(); | ||
17 | WRITE_ONCE(*y, 1); | ||
18 | } | ||
19 | |||
20 | P1(int *x, int *y) | ||
21 | { | ||
22 | int r0; | ||
23 | int r1; | ||
24 | |||
25 | r0 = READ_ONCE(*y); | ||
26 | smp_rmb(); | ||
27 | r1 = READ_ONCE(*x); | ||
28 | } | ||
29 | |||
30 | exists (1:r0=1 /\ 1:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/R+mbonceonces.litmus b/tools/memory-model/litmus-tests/R+mbonceonces.litmus new file mode 100644 index 000000000000..a0e884ad2132 --- /dev/null +++ b/tools/memory-model/litmus-tests/R+mbonceonces.litmus | |||
@@ -0,0 +1,30 @@ | |||
1 | C R+mbonceonces | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This is the fully ordered (via smp_mb()) version of one of the classic | ||
7 | * counterintuitive litmus tests that illustrates the effects of store | ||
8 | * propagation delays. Note that weakening either of the barriers would | ||
9 | * cause the resulting test to be allowed. | ||
10 | *) | ||
11 | |||
12 | {} | ||
13 | |||
14 | P0(int *x, int *y) | ||
15 | { | ||
16 | WRITE_ONCE(*x, 1); | ||
17 | smp_mb(); | ||
18 | WRITE_ONCE(*y, 1); | ||
19 | } | ||
20 | |||
21 | P1(int *x, int *y) | ||
22 | { | ||
23 | int r0; | ||
24 | |||
25 | WRITE_ONCE(*y, 2); | ||
26 | smp_mb(); | ||
27 | r0 = READ_ONCE(*x); | ||
28 | } | ||
29 | |||
30 | exists (y=2 /\ 1:r0=0) | ||
diff --git a/tools/memory-model/litmus-tests/R+poonceonces.litmus b/tools/memory-model/litmus-tests/R+poonceonces.litmus new file mode 100644 index 000000000000..5386f128a131 --- /dev/null +++ b/tools/memory-model/litmus-tests/R+poonceonces.litmus | |||
@@ -0,0 +1,27 @@ | |||
1 | C R+poonceonces | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * This is the unordered (thus lacking smp_mb()) version of one of the | ||
7 | * classic counterintuitive litmus tests that illustrates the effects of | ||
8 | * store propagation delays. | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x, int *y) | ||
14 | { | ||
15 | WRITE_ONCE(*x, 1); | ||
16 | WRITE_ONCE(*y, 1); | ||
17 | } | ||
18 | |||
19 | P1(int *x, int *y) | ||
20 | { | ||
21 | int r0; | ||
22 | |||
23 | WRITE_ONCE(*y, 2); | ||
24 | r0 = READ_ONCE(*x); | ||
25 | } | ||
26 | |||
27 | exists (y=2 /\ 1:r0=0) | ||
diff --git a/tools/memory-model/litmus-tests/README b/tools/memory-model/litmus-tests/README new file mode 100644 index 000000000000..04096fb8b8d9 --- /dev/null +++ b/tools/memory-model/litmus-tests/README | |||
@@ -0,0 +1,131 @@ | |||
1 | This directory contains the following litmus tests: | ||
2 | |||
3 | CoRR+poonceonce+Once.litmus | ||
4 | Test of read-read coherence, that is, whether or not two | ||
5 | successive reads from the same variable are ordered. | ||
6 | |||
7 | CoRW+poonceonce+Once.litmus | ||
8 | Test of read-write coherence, that is, whether or not a read | ||
9 | from a given variable followed by a write to that same variable | ||
10 | are ordered. | ||
11 | |||
12 | CoWR+poonceonce+Once.litmus | ||
13 | Test of write-read coherence, that is, whether or not a write | ||
14 | to a given variable followed by a read from that same variable | ||
15 | are ordered. | ||
16 | |||
17 | CoWW+poonceonce.litmus | ||
18 | Test of write-write coherence, that is, whether or not two | ||
19 | successive writes to the same variable are ordered. | ||
20 | |||
21 | IRIW+mbonceonces+OnceOnce.litmus | ||
22 | Test of independent reads from independent writes with smp_mb() | ||
23 | between each pair of reads. In other words, is smp_mb() | ||
24 | sufficient to cause two different reading processes to agree on | ||
25 | the order of a pair of writes, where each write is to a different | ||
26 | variable by a different process? | ||
27 | |||
28 | IRIW+poonceonces+OnceOnce.litmus | ||
29 | Test of independent reads from independent writes with nothing | ||
30 | between each pair of reads. In other words, is anything at all | ||
31 | needed to cause two different reading processes to agree on the | ||
32 | order of a pair of writes, where each write is to a different | ||
33 | variable by a different process? | ||
34 | |||
35 | ISA2+pooncelock+pooncelock+pombonce.litmus | ||
36 | Tests whether the ordering provided by a lock-protected S | ||
37 | litmus test is visible to an external process whose accesses are | ||
38 | separated by smp_mb(). This addition of an external process to | ||
39 | S is otherwise known as ISA2. | ||
40 | |||
41 | ISA2+poonceonces.litmus | ||
42 | As below, but with store-release replaced with WRITE_ONCE() | ||
43 | and load-acquire replaced with READ_ONCE(). | ||
44 | |||
45 | ISA2+pooncerelease+poacquirerelease+poacquireonce.litmus | ||
46 | Can a release-acquire chain order a prior store against | ||
47 | a later load? | ||
48 | |||
49 | LB+ctrlonceonce+mbonceonce.litmus | ||
50 | Does a control dependency and an smp_mb() suffice for the | ||
51 | load-buffering litmus test, where each process reads from one | ||
52 | of two variables then writes to the other? | ||
53 | |||
54 | LB+poacquireonce+pooncerelease.litmus | ||
55 | Does a release-acquire pair suffice for the load-buffering | ||
56 | litmus test, where each process reads from one of two variables then | ||
57 | writes to the other? | ||
58 | |||
59 | LB+poonceonces.litmus | ||
60 | As above, but with store-release replaced with WRITE_ONCE() | ||
61 | and load-acquire replaced with READ_ONCE(). | ||
62 | |||
63 | MP+onceassign+derefonce.litmus | ||
64 | As below, but with rcu_assign_pointer() and an rcu_dereference(). | ||
65 | |||
66 | MP+polocks.litmus | ||
67 | As below, but with the second access of the writer process | ||
68 | and the first access of the reader process protected by a lock. | ||
69 | |||
70 | MP+poonceonces.litmus | ||
71 | As below, but without the smp_rmb() and smp_wmb(). | ||
72 | |||
73 | MP+pooncerelease+poacquireonce.litmus | ||
74 | As below, but with a release-acquire chain. | ||
75 | |||
76 | MP+porevlocks.litmus | ||
77 | As below, but with the first access of the writer process | ||
78 | and the second access of the reader process protected by a lock. | ||
79 | |||
80 | MP+wmbonceonce+rmbonceonce.litmus | ||
81 | Does an smp_wmb() (between the stores) and an smp_rmb() (between | ||
82 | the loads) suffice for the message-passing litmus test, where one | ||
83 | process writes data and then a flag, and the other process reads | ||
84 | the flag and then the data? (This is similar to the ISA2 tests, | ||
85 | but with two processes instead of three.) | ||
86 | |||
87 | R+mbonceonces.litmus | ||
88 | This is the fully ordered (via smp_mb()) version of one of | ||
89 | the classic counterintuitive litmus tests that illustrates the | ||
90 | effects of store propagation delays. | ||
91 | |||
92 | R+poonceonces.litmus | ||
93 | As above, but without the smp_mb() invocations. | ||
94 | |||
95 | SB+mbonceonces.litmus | ||
96 | This is the fully ordered (again, via smp_mb()) version of store | ||
97 | buffering, which forms the core of Dekker's mutual-exclusion | ||
98 | algorithm. | ||
99 | |||
100 | SB+poonceonces.litmus | ||
101 | As above, but without the smp_mb() invocations. | ||
102 | |||
103 | S+poonceonces.litmus | ||
104 | As below, but without the smp_wmb() and acquire load. | ||
105 | |||
106 | S+wmbonceonce+poacquireonce.litmus | ||
107 | Can an smp_wmb(), instead of a release, and an acquire order | ||
108 | a prior store against a subsequent store? | ||
109 | |||
110 | WRC+poonceonces+Once.litmus | ||
111 | WRC+pooncerelease+rmbonceonce+Once.litmus | ||
112 | These two are members of an extension of the MP litmus-test class | ||
113 | in which the first write is moved to a separate process. | ||
114 | |||
115 | Z6.0+pooncelock+pooncelock+pombonce.litmus | ||
116 | Is the ordering provided by a spin_unlock() and a subsequent | ||
117 | spin_lock() sufficient to make ordering apparent to accesses | ||
118 | by a process not holding the lock? | ||
119 | |||
120 | Z6.0+pooncelock+poonceLock+pombonce.litmus | ||
121 | As above, but with smp_mb__after_spinlock() immediately | ||
122 | following the spin_lock(). | ||
123 | |||
124 | Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus | ||
125 | Is the ordering provided by a release-acquire chain sufficient | ||
126 | to make ordering apparent to accesses by a process that does | ||
127 | not participate in that release-acquire chain? | ||
128 | |||
129 | A great many more litmus tests are available here: | ||
130 | |||
131 | https://github.com/paulmckrcu/litmus | ||
diff --git a/tools/memory-model/litmus-tests/S+poonceonces.litmus b/tools/memory-model/litmus-tests/S+poonceonces.litmus new file mode 100644 index 000000000000..8c9c2f81a580 --- /dev/null +++ b/tools/memory-model/litmus-tests/S+poonceonces.litmus | |||
@@ -0,0 +1,28 @@ | |||
1 | C S+poonceonces | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * Starting with a two-process release-acquire chain ordering P0()'s | ||
7 | * first store against P1()'s final load, if the smp_store_release() | ||
8 | * is replaced by WRITE_ONCE() and the smp_load_acquire() replaced by | ||
9 | * READ_ONCE(), is ordering preserved? | ||
10 | *) | ||
11 | |||
12 | {} | ||
13 | |||
14 | P0(int *x, int *y) | ||
15 | { | ||
16 | WRITE_ONCE(*x, 2); | ||
17 | WRITE_ONCE(*y, 1); | ||
18 | } | ||
19 | |||
20 | P1(int *x, int *y) | ||
21 | { | ||
22 | int r0; | ||
23 | |||
24 | r0 = READ_ONCE(*y); | ||
25 | WRITE_ONCE(*x, 1); | ||
26 | } | ||
27 | |||
28 | exists (x=2 /\ 1:r0=1) | ||
diff --git a/tools/memory-model/litmus-tests/S+wmbonceonce+poacquireonce.litmus b/tools/memory-model/litmus-tests/S+wmbonceonce+poacquireonce.litmus new file mode 100644 index 000000000000..c53350205d28 --- /dev/null +++ b/tools/memory-model/litmus-tests/S+wmbonceonce+poacquireonce.litmus | |||
@@ -0,0 +1,27 @@ | |||
1 | C S+wmbonceonce+poacquireonce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * Can an smp_wmb(), instead of a release, and an acquire order a prior | ||
7 | * store against a subsequent store? | ||
8 | *) | ||
9 | |||
10 | {} | ||
11 | |||
12 | P0(int *x, int *y) | ||
13 | { | ||
14 | WRITE_ONCE(*x, 2); | ||
15 | smp_wmb(); | ||
16 | WRITE_ONCE(*y, 1); | ||
17 | } | ||
18 | |||
19 | P1(int *x, int *y) | ||
20 | { | ||
21 | int r0; | ||
22 | |||
23 | r0 = smp_load_acquire(y); | ||
24 | WRITE_ONCE(*x, 1); | ||
25 | } | ||
26 | |||
27 | exists (x=2 /\ 1:r0=1) | ||
diff --git a/tools/memory-model/litmus-tests/SB+mbonceonces.litmus b/tools/memory-model/litmus-tests/SB+mbonceonces.litmus new file mode 100644 index 000000000000..74b874ffa8da --- /dev/null +++ b/tools/memory-model/litmus-tests/SB+mbonceonces.litmus | |||
@@ -0,0 +1,32 @@ | |||
1 | C SB+mbonceonces | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates that full memory barriers suffice to | ||
7 | * order the store-buffering pattern, where each process writes to the | ||
8 | * variable that the preceding process reads. (Locking and RCU can also | ||
9 | * suffice, but not much else.) | ||
10 | *) | ||
11 | |||
12 | {} | ||
13 | |||
14 | P0(int *x, int *y) | ||
15 | { | ||
16 | int r0; | ||
17 | |||
18 | WRITE_ONCE(*x, 1); | ||
19 | smp_mb(); | ||
20 | r0 = READ_ONCE(*y); | ||
21 | } | ||
22 | |||
23 | P1(int *x, int *y) | ||
24 | { | ||
25 | int r0; | ||
26 | |||
27 | WRITE_ONCE(*y, 1); | ||
28 | smp_mb(); | ||
29 | r0 = READ_ONCE(*x); | ||
30 | } | ||
31 | |||
32 | exists (0:r0=0 /\ 1:r0=0) | ||
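
For reference, the store-buffering handshake above might appear in kernel code roughly as follows; this is a hedged sketch with made-up names, not code from this patch. Each side stores its own flag, executes smp_mb(), and then loads the other side's flag, so the two sides cannot both miss each other's store.

	#include <linux/compiler.h>
	#include <asm/barrier.h>

	static int demo_flag0;
	static int demo_flag1;

	/* CPU 0: announce, then check whether CPU 1 has announced. */
	static int demo_cpu0(void)
	{
		WRITE_ONCE(demo_flag0, 1);
		smp_mb();	/* order the store above before the load below */
		return READ_ONCE(demo_flag1);
	}

	/* CPU 1: the mirror image of demo_cpu0(). */
	static int demo_cpu1(void)
	{
		WRITE_ONCE(demo_flag1, 1);
		smp_mb();
		return READ_ONCE(demo_flag0);
	}

	/* Per the "Never" result above, once both functions have run on
	 * their respective CPUs, they cannot both have returned 0. */
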
diff --git a/tools/memory-model/litmus-tests/SB+poonceonces.litmus b/tools/memory-model/litmus-tests/SB+poonceonces.litmus new file mode 100644 index 000000000000..10d550730b25 --- /dev/null +++ b/tools/memory-model/litmus-tests/SB+poonceonces.litmus | |||
@@ -0,0 +1,29 @@ | |||
1 | C SB+poonceonces | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * This litmus test demonstrates that at least some ordering is required | ||
7 | * to order the store-buffering pattern, where each process writes to the | ||
8 | * variable that the preceding process reads. | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x, int *y) | ||
14 | { | ||
15 | int r0; | ||
16 | |||
17 | WRITE_ONCE(*x, 1); | ||
18 | r0 = READ_ONCE(*y); | ||
19 | } | ||
20 | |||
21 | P1(int *x, int *y) | ||
22 | { | ||
23 | int r0; | ||
24 | |||
25 | WRITE_ONCE(*y, 1); | ||
26 | r0 = READ_ONCE(*x); | ||
27 | } | ||
28 | |||
29 | exists (0:r0=0 /\ 1:r0=0) | ||
diff --git a/tools/memory-model/litmus-tests/WRC+poonceonces+Once.litmus b/tools/memory-model/litmus-tests/WRC+poonceonces+Once.litmus new file mode 100644 index 000000000000..6a2bc12a1af1 --- /dev/null +++ b/tools/memory-model/litmus-tests/WRC+poonceonces+Once.litmus | |||
@@ -0,0 +1,35 @@ | |||
1 | C WRC+poonceonces+Once | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * This litmus test is an extension of the message-passing pattern, | ||
7 | * where the first write is moved to a separate process. Note that this | ||
8 | * test has no ordering at all. | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x) | ||
14 | { | ||
15 | WRITE_ONCE(*x, 1); | ||
16 | } | ||
17 | |||
18 | P1(int *x, int *y) | ||
19 | { | ||
20 | int r0; | ||
21 | |||
22 | r0 = READ_ONCE(*x); | ||
23 | WRITE_ONCE(*y, 1); | ||
24 | } | ||
25 | |||
26 | P2(int *x, int *y) | ||
27 | { | ||
28 | int r0; | ||
29 | int r1; | ||
30 | |||
31 | r0 = READ_ONCE(*y); | ||
32 | r1 = READ_ONCE(*x); | ||
33 | } | ||
34 | |||
35 | exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/WRC+pooncerelease+rmbonceonce+Once.litmus b/tools/memory-model/litmus-tests/WRC+pooncerelease+rmbonceonce+Once.litmus new file mode 100644 index 000000000000..97fcbffde9a0 --- /dev/null +++ b/tools/memory-model/litmus-tests/WRC+pooncerelease+rmbonceonce+Once.litmus | |||
@@ -0,0 +1,36 @@ | |||
1 | C WRC+pooncerelease+rmbonceonce+Once | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test is an extension of the message-passing pattern, where | ||
7 | * the first write is moved to a separate process. Because it features | ||
8 | * a release and a read memory barrier, it should be forbidden. | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x) | ||
14 | { | ||
15 | WRITE_ONCE(*x, 1); | ||
16 | } | ||
17 | |||
18 | P1(int *x, int *y) | ||
19 | { | ||
20 | int r0; | ||
21 | |||
22 | r0 = READ_ONCE(*x); | ||
23 | smp_store_release(y, 1); | ||
24 | } | ||
25 | |||
26 | P2(int *x, int *y) | ||
27 | { | ||
28 | int r0; | ||
29 | int r1; | ||
30 | |||
31 | r0 = READ_ONCE(*y); | ||
32 | smp_rmb(); | ||
33 | r1 = READ_ONCE(*x); | ||
34 | } | ||
35 | |||
36 | exists (1:r0=1 /\ 2:r0=1 /\ 2:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/Z6.0+pooncelock+poonceLock+pombonce.litmus b/tools/memory-model/litmus-tests/Z6.0+pooncelock+poonceLock+pombonce.litmus new file mode 100644 index 000000000000..415248fb6699 --- /dev/null +++ b/tools/memory-model/litmus-tests/Z6.0+pooncelock+poonceLock+pombonce.litmus | |||
@@ -0,0 +1,42 @@ | |||
1 | C Z6.0+pooncelock+poonceLock+pombonce | ||
2 | |||
3 | (* | ||
4 | * Result: Never | ||
5 | * | ||
6 | * This litmus test demonstrates how smp_mb__after_spinlock() may be | ||
7 | * used to ensure that accesses in different critical sections for a | ||
8 | * given lock running on different CPUs are nevertheless seen in order | ||
9 | * by CPUs not holding that lock. | ||
10 | *) | ||
11 | |||
12 | {} | ||
13 | |||
14 | P0(int *x, int *y, spinlock_t *mylock) | ||
15 | { | ||
16 | spin_lock(mylock); | ||
17 | WRITE_ONCE(*x, 1); | ||
18 | WRITE_ONCE(*y, 1); | ||
19 | spin_unlock(mylock); | ||
20 | } | ||
21 | |||
22 | P1(int *y, int *z, spinlock_t *mylock) | ||
23 | { | ||
24 | int r0; | ||
25 | |||
26 | spin_lock(mylock); | ||
27 | smp_mb__after_spinlock(); | ||
28 | r0 = READ_ONCE(*y); | ||
29 | WRITE_ONCE(*z, 1); | ||
30 | spin_unlock(mylock); | ||
31 | } | ||
32 | |||
33 | P2(int *x, int *z) | ||
34 | { | ||
35 | int r1; | ||
36 | |||
37 | WRITE_ONCE(*z, 2); | ||
38 | smp_mb(); | ||
39 | r1 = READ_ONCE(*x); | ||
40 | } | ||
41 | |||
42 | exists (1:r0=1 /\ z=2 /\ 2:r1=0) | ||
diff --git a/tools/memory-model/litmus-tests/Z6.0+pooncelock+pooncelock+pombonce.litmus b/tools/memory-model/litmus-tests/Z6.0+pooncelock+pooncelock+pombonce.litmus new file mode 100644 index 000000000000..10a2aa04cd07 --- /dev/null +++ b/tools/memory-model/litmus-tests/Z6.0+pooncelock+pooncelock+pombonce.litmus | |||
@@ -0,0 +1,40 @@ | |||
1 | C Z6.0+pooncelock+pooncelock+pombonce | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * This example demonstrates that a pair of accesses made by different | ||
7 | * processes each while holding a given lock will not necessarily be | ||
8 | * seen as ordered by a third process not holding that lock. | ||
9 | *) | ||
10 | |||
11 | {} | ||
12 | |||
13 | P0(int *x, int *y, spinlock_t *mylock) | ||
14 | { | ||
15 | spin_lock(mylock); | ||
16 | WRITE_ONCE(*x, 1); | ||
17 | WRITE_ONCE(*y, 1); | ||
18 | spin_unlock(mylock); | ||
19 | } | ||
20 | |||
21 | P1(int *y, int *z, spinlock_t *mylock) | ||
22 | { | ||
23 | int r0; | ||
24 | |||
25 | spin_lock(mylock); | ||
26 | r0 = READ_ONCE(*y); | ||
27 | WRITE_ONCE(*z, 1); | ||
28 | spin_unlock(mylock); | ||
29 | } | ||
30 | |||
31 | P2(int *x, int *z) | ||
32 | { | ||
33 | int r1; | ||
34 | |||
35 | WRITE_ONCE(*z, 2); | ||
36 | smp_mb(); | ||
37 | r1 = READ_ONCE(*x); | ||
38 | } | ||
39 | |||
40 | exists (1:r0=1 /\ z=2 /\ 2:r1=0) | ||
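
The companion test Z6.0+pooncelock+poonceLock+pombonce above shows that adding smp_mb__after_spinlock() immediately after the second spin_lock() turns this "Sometimes" outcome into "Never". A hedged C sketch of that strengthened critical section follows; the identifiers are hypothetical.

	#include <linux/spinlock.h>
	#include <linux/compiler.h>

	static DEFINE_SPINLOCK(demo_lock);
	static int demo_y;
	static int demo_z;

	/* Second critical section, strengthened as in the poonceLock
	 * variant: the full barrier implied by smp_mb__after_spinlock()
	 * makes the two critical sections appear ordered even to CPUs
	 * that never take demo_lock. */
	static void demo_second_holder(void)
	{
		int r0;

		spin_lock(&demo_lock);
		smp_mb__after_spinlock();
		r0 = READ_ONCE(demo_y);
		WRITE_ONCE(demo_z, 1);
		spin_unlock(&demo_lock);
		(void)r0;
	}
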
diff --git a/tools/memory-model/litmus-tests/Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus b/tools/memory-model/litmus-tests/Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus new file mode 100644 index 000000000000..a20fc3fafb53 --- /dev/null +++ b/tools/memory-model/litmus-tests/Z6.0+pooncerelease+poacquirerelease+mbonceonce.litmus | |||
@@ -0,0 +1,42 @@ | |||
1 | C Z6.0+pooncerelease+poacquirerelease+mbonceonce | ||
2 | |||
3 | (* | ||
4 | * Result: Sometimes | ||
5 | * | ||
6 | * This litmus test shows that a release-acquire chain, while sufficient | ||
7 | * when there is but one non-reads-from (AKA non-rf) link, does not suffice | ||
8 | * if there is more than one. Of the three processes, only P1() reads from | ||
9 | * P0's write, which means that there are two non-rf links: P1() to P2() | ||
10 | * is a write-to-write link (AKA a "coherence" or just "co" link) and P2() | ||
11 | * to P0() is a read-to-write link (AKA a "from-reads" or just "fr" link). | ||
12 | * When there are two or more non-rf links, you typically will need one | ||
13 | * full barrier for each non-rf link. (Exceptions include some cases | ||
14 | * involving locking.) | ||
15 | *) | ||
16 | |||
17 | {} | ||
18 | |||
19 | P0(int *x, int *y) | ||
20 | { | ||
21 | WRITE_ONCE(*x, 1); | ||
22 | smp_store_release(y, 1); | ||
23 | } | ||
24 | |||
25 | P1(int *y, int *z) | ||
26 | { | ||
27 | int r0; | ||
28 | |||
29 | r0 = smp_load_acquire(y); | ||
30 | smp_store_release(z, 1); | ||
31 | } | ||
32 | |||
33 | P2(int *x, int *z) | ||
34 | { | ||
35 | int r1; | ||
36 | |||
37 | WRITE_ONCE(*z, 2); | ||
38 | smp_mb(); | ||
39 | r1 = READ_ONCE(*x); | ||
40 | } | ||
41 | |||
42 | exists (1:r0=1 /\ z=2 /\ 2:r1=0) | ||
diff --git a/tools/memory-model/lock.cat b/tools/memory-model/lock.cat new file mode 100644 index 000000000000..ba4a4ec6d313 --- /dev/null +++ b/tools/memory-model/lock.cat | |||
@@ -0,0 +1,99 @@ | |||
1 | // SPDX-License-Identifier: GPL-2.0+ | ||
2 | (* | ||
3 | * Copyright (C) 2016 Luc Maranget <luc.maranget@inria.fr> for Inria | ||
4 | * Copyright (C) 2017 Alan Stern <stern@rowland.harvard.edu> | ||
5 | *) | ||
6 | |||
7 | (* Generate coherence orders and handle lock operations *) | ||
8 | |||
9 | include "cross.cat" | ||
10 | |||
11 | (* From lock reads to their partner lock writes *) | ||
12 | let lk-rmw = ([LKR] ; po-loc ; [LKW]) \ (po ; po) | ||
13 | let rmw = rmw | lk-rmw | ||
14 | |||
15 | (* | ||
16 | * A paired LKR must always see an unlocked value; spin_lock() calls nested | ||
17 | * inside a critical section (for the same lock) always deadlock. | ||
18 | *) | ||
19 | empty ([LKW] ; po-loc ; [domain(lk-rmw)]) \ (po-loc ; [UL] ; po-loc) | ||
20 | as lock-nest | ||
21 | |||
22 | (* The litmus test is invalid if an LKW event is not part of an RMW pair *) | ||
23 | flag ~empty LKW \ range(lk-rmw) as unpaired-LKW | ||
24 | |||
25 | (* This will be allowed if we implement spin_is_locked() *) | ||
26 | flag ~empty LKR \ domain(lk-rmw) as unpaired-LKR | ||
27 | |||
28 | (* There should be no R or W accesses to spinlocks *) | ||
29 | let ALL-LOCKS = LKR | LKW | UL | LF | ||
30 | flag ~empty [M \ IW] ; loc ; [ALL-LOCKS] as mixed-lock-accesses | ||
31 | |||
32 | (* The final value of a spinlock should not be tested *) | ||
33 | flag ~empty [FW] ; loc ; [ALL-LOCKS] as lock-final | ||
34 | |||
35 | |||
36 | (* | ||
37 | * Put lock operations in their appropriate classes, but leave UL out of W | ||
38 | * until after the co relation has been generated. | ||
39 | *) | ||
40 | let R = R | LKR | LF | ||
41 | let W = W | LKW | ||
42 | |||
43 | let Release = Release | UL | ||
44 | let Acquire = Acquire | LKR | ||
45 | |||
46 | |||
47 | (* Match LKW events to their corresponding UL events *) | ||
48 | let critical = ([LKW] ; po-loc ; [UL]) \ (po-loc ; [LKW | UL] ; po-loc) | ||
49 | |||
50 | flag ~empty UL \ range(critical) as unmatched-unlock | ||
51 | |||
52 | (* Allow up to one unmatched LKW per location; more must deadlock *) | ||
53 | let UNMATCHED-LKW = LKW \ domain(critical) | ||
54 | empty ([UNMATCHED-LKW] ; loc ; [UNMATCHED-LKW]) \ id as unmatched-locks | ||
55 | |||
56 | |||
57 | (* rfi for LF events: link each LKW to the LF events in its critical section *) | ||
58 | let rfi-lf = ([LKW] ; po-loc ; [LF]) \ ([LKW] ; po-loc ; [UL] ; po-loc) | ||
59 | |||
60 | (* rfe for LF events *) | ||
61 | let all-possible-rfe-lf = | ||
62 | (* | ||
63 | * Given an LF event r, compute the possible rfe edges for that event | ||
64 | * (all those starting from LKW events in other threads), | ||
65 | * and then convert that relation to a set of single-edge relations. | ||
66 | *) | ||
67 | let possible-rfe-lf r = | ||
68 | let pair-to-relation p = p ++ 0 | ||
69 | in map pair-to-relation ((LKW * {r}) & loc & ext) | ||
70 | (* Do this for each LF event r that isn't in rfi-lf *) | ||
71 | in map possible-rfe-lf (LF \ range(rfi-lf)) | ||
72 | |||
73 | (* Generate all rf relations for LF events *) | ||
74 | with rfe-lf from cross(all-possible-rfe-lf) | ||
75 | let rf = rf | rfi-lf | rfe-lf | ||
76 | |||
77 | |||
78 | (* Generate all co relations, including LKW events but not UL *) | ||
79 | let co0 = co0 | ([IW] ; loc ; [LKW]) | | ||
80 | (([LKW] ; loc ; [UNMATCHED-LKW]) \ [UNMATCHED-LKW]) | ||
81 | include "cos-opt.cat" | ||
82 | let W = W | UL | ||
83 | let M = R | W | ||
84 | |||
85 | (* Merge UL events into co *) | ||
86 | let co = (co | critical | (critical^-1 ; co))+ | ||
87 | let coe = co & ext | ||
88 | let coi = co & int | ||
89 | |||
90 | (* Merge LKR events into rf *) | ||
91 | let rf = rf | ([IW | UL] ; singlestep(co) ; lk-rmw^-1) | ||
92 | let rfe = rf & ext | ||
93 | let rfi = rf & int | ||
94 | |||
95 | let fr = rf^-1 ; co | ||
96 | let fre = fr & ext | ||
97 | let fri = fr & int | ||
98 | |||
99 | show co,rf,fr | ||