aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/this_cpu_ops.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/this_cpu_ops.txt')
-rw-r--r--Documentation/this_cpu_ops.txt213
1 files changed, 171 insertions, 42 deletions
diff --git a/Documentation/this_cpu_ops.txt b/Documentation/this_cpu_ops.txt
index 1a4ce7e3e05f..0ec995712176 100644
--- a/Documentation/this_cpu_ops.txt
+++ b/Documentation/this_cpu_ops.txt
@@ -2,26 +2,26 @@ this_cpu operations
2------------------- 2-------------------
3 3
4this_cpu operations are a way of optimizing access to per cpu 4this_cpu operations are a way of optimizing access to per cpu
5variables associated with the *currently* executing processor through 5variables associated with the *currently* executing processor. This is
6the use of segment registers (or a dedicated register where the cpu 6done through the use of segment registers (or a dedicated register where
7permanently stored the beginning of the per cpu area for a specific 7the cpu permanently stored the beginning of the per cpu area for a
8processor). 8specific processor).
9 9
10The this_cpu operations add a per cpu variable offset to the processor 10this_cpu operations add a per cpu variable offset to the processor
11specific percpu base and encode that operation in the instruction 11specific per cpu base and encode that operation in the instruction
12operating on the per cpu variable. 12operating on the per cpu variable.
13 13
14This means there are no atomicity issues between the calculation of 14This means that there are no atomicity issues between the calculation of
15the offset and the operation on the data. Therefore it is not 15the offset and the operation on the data. Therefore it is not
16necessary to disable preempt or interrupts to ensure that the 16necessary to disable preemption or interrupts to ensure that the
17processor is not changed between the calculation of the address and 17processor is not changed between the calculation of the address and
18the operation on the data. 18the operation on the data.
19 19
20Read-modify-write operations are of particular interest. Frequently 20Read-modify-write operations are of particular interest. Frequently
21processors have special lower latency instructions that can operate 21processors have special lower latency instructions that can operate
22without the typical synchronization overhead but still provide some 22without the typical synchronization overhead, but still provide some
23sort of relaxed atomicity guarantee. The x86 for example can execute 23sort of relaxed atomicity guarantees. The x86, for example, can execute
24RMV (Read Modify Write) instructions like inc/dec/cmpxchg without the 24RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the
25lock prefix and the associated latency penalty. 25lock prefix and the associated latency penalty.
26 26
27Access to the variable without the lock prefix is not synchronized but 27Access to the variable without the lock prefix is not synchronized but
@@ -30,6 +30,38 @@ data specific to the currently executing processor. Only the current
30processor should be accessing that variable and therefore there are no 30processor should be accessing that variable and therefore there are no
31concurrency issues with other processors in the system. 31concurrency issues with other processors in the system.
32 32
33Please note that accesses by remote processors to a per cpu area are
34exceptional situations and may impact performance and/or correctness
35(remote write operations) of local RMW operations via this_cpu_*.
36
37The main use of the this_cpu operations has been to optimize counter
38operations.
39
40The following this_cpu() operations with implied preemption protection
41are defined. These operations can be used without worrying about
42preemption and interrupts.
43
44 this_cpu_add()
45 this_cpu_read(pcp)
46 this_cpu_write(pcp, val)
47 this_cpu_add(pcp, val)
48 this_cpu_and(pcp, val)
49 this_cpu_or(pcp, val)
50 this_cpu_add_return(pcp, val)
51 this_cpu_xchg(pcp, nval)
52 this_cpu_cmpxchg(pcp, oval, nval)
53 this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
54 this_cpu_sub(pcp, val)
55 this_cpu_inc(pcp)
56 this_cpu_dec(pcp)
57 this_cpu_sub_return(pcp, val)
58 this_cpu_inc_return(pcp)
59 this_cpu_dec_return(pcp)
60
61
62Inner working of this_cpu operations
63------------------------------------
64
33On x86 the fs: or the gs: segment registers contain the base of the 65On x86 the fs: or the gs: segment registers contain the base of the
34per cpu area. It is then possible to simply use the segment override 66per cpu area. It is then possible to simply use the segment override
35to relocate a per cpu relative address to the proper per cpu area for 67to relocate a per cpu relative address to the proper per cpu area for
@@ -48,22 +80,21 @@ results in a single instruction
48 mov ax, gs:[x] 80 mov ax, gs:[x]
49 81
50instead of a sequence of calculation of the address and then a fetch 82instead of a sequence of calculation of the address and then a fetch
51from that address which occurs with the percpu operations. Before 83from that address which occurs with the per cpu operations. Before
52this_cpu_ops such sequence also required preempt disable/enable to 84this_cpu_ops such sequence also required preempt disable/enable to
53prevent the kernel from moving the thread to a different processor 85prevent the kernel from moving the thread to a different processor
54while the calculation is performed. 86while the calculation is performed.
55 87
56The main use of the this_cpu operations has been to optimize counter 88Consider the following this_cpu operation:
57operations.
58 89
59 this_cpu_inc(x) 90 this_cpu_inc(x)
60 91
61results in the following single instruction (no lock prefix!) 92The above results in the following single instruction (no lock prefix!)
62 93
63 inc gs:[x] 94 inc gs:[x]
64 95
65instead of the following operations required if there is no segment 96instead of the following operations required if there is no segment
66register. 97register:
67 98
68 int *y; 99 int *y;
69 int cpu; 100 int cpu;
@@ -73,10 +104,10 @@ register.
73 (*y)++; 104 (*y)++;
74 put_cpu(); 105 put_cpu();
75 106
76Note that these operations can only be used on percpu data that is 107Note that these operations can only be used on per cpu data that is
77reserved for a specific processor. Without disabling preemption in the 108reserved for a specific processor. Without disabling preemption in the
78surrounding code this_cpu_inc() will only guarantee that one of the 109surrounding code this_cpu_inc() will only guarantee that one of the
79percpu counters is correctly incremented. However, there is no 110per cpu counters is correctly incremented. However, there is no
80guarantee that the OS will not move the process directly before or 111guarantee that the OS will not move the process directly before or
81after the this_cpu instruction is executed. In general this means that 112after the this_cpu instruction is executed. In general this means that
82the value of the individual counters for each processor are 113the value of the individual counters for each processor are
@@ -86,9 +117,9 @@ that is of interest.
86Per cpu variables are used for performance reasons. Bouncing cache 117Per cpu variables are used for performance reasons. Bouncing cache
87lines can be avoided if multiple processors concurrently go through 118lines can be avoided if multiple processors concurrently go through
88the same code paths. Since each processor has its own per cpu 119the same code paths. Since each processor has its own per cpu
89variables no concurrent cacheline updates take place. The price that 120variables no concurrent cache line updates take place. The price that
90has to be paid for this optimization is the need to add up the per cpu 121has to be paid for this optimization is the need to add up the per cpu
91counters when the value of the counter is needed. 122counters when the value of a counter is needed.
92 123
93 124
94Special operations: 125Special operations:
@@ -100,33 +131,39 @@ Takes the offset of a per cpu variable (&x !) and returns the address
100of the per cpu variable that belongs to the currently executing 131of the per cpu variable that belongs to the currently executing
101processor. this_cpu_ptr avoids multiple steps that the common 132processor. this_cpu_ptr avoids multiple steps that the common
102get_cpu/put_cpu sequence requires. No processor number is 133get_cpu/put_cpu sequence requires. No processor number is
103available. Instead the offset of the local per cpu area is simply 134available. Instead, the offset of the local per cpu area is simply
104added to the percpu offset. 135added to the per cpu offset.
105 136
137Note that this operation is usually used in a code segment when
138preemption has been disabled. The pointer is then used to
139access local per cpu data in a critical section. When preemption
140is re-enabled this pointer is usually no longer useful since it may
141no longer point to per cpu data of the current processor.
106 142
107 143
108Per cpu variables and offsets 144Per cpu variables and offsets
109----------------------------- 145-----------------------------
110 146
111Per cpu variables have *offsets* to the beginning of the percpu 147Per cpu variables have *offsets* to the beginning of the per cpu
112area. They do not have addresses although they look like that in the 148area. They do not have addresses although they look like that in the
113code. Offsets cannot be directly dereferenced. The offset must be 149code. Offsets cannot be directly dereferenced. The offset must be
114added to a base pointer of a percpu area of a processor in order to 150added to a base pointer of a per cpu area of a processor in order to
115form a valid address. 151form a valid address.
116 152
117Therefore the use of x or &x outside of the context of per cpu 153Therefore the use of x or &x outside of the context of per cpu
118operations is invalid and will generally be treated like a NULL 154operations is invalid and will generally be treated like a NULL
119pointer dereference. 155pointer dereference.
120 156
121In the context of per cpu operations 157 DEFINE_PER_CPU(int, x);
122 158
123 x is a per cpu variable. Most this_cpu operations take a cpu 159In the context of per cpu operations the above implies that x is a per
124 variable. 160cpu variable. Most this_cpu operations take a cpu variable.
125 161
126 &x is the *offset* a per cpu variable. this_cpu_ptr() takes 162 int __percpu *p = &x;
127 the offset of a per cpu variable which makes this look a bit
128 strange.
129 163
164&x and hence p is the *offset* of a per cpu variable. this_cpu_ptr()
165takes the offset of a per cpu variable which makes this look a bit
166strange.
130 167
131 168
132Operations on a field of a per cpu structure 169Operations on a field of a per cpu structure
@@ -152,7 +189,7 @@ If we have an offset to struct s:
152 189
153 struct s __percpu *ps = &p; 190 struct s __percpu *ps = &p;
154 191
155 z = this_cpu_dec(ps->m); 192 this_cpu_dec(ps->m);
156 193
157 z = this_cpu_inc_return(ps->n); 194 z = this_cpu_inc_return(ps->n);
158 195
@@ -172,29 +209,52 @@ if we do not make use of this_cpu ops later to manipulate fields:
172Variants of this_cpu ops 209Variants of this_cpu ops
173------------------------- 210-------------------------
174 211
175this_cpu ops are interrupt safe. Some architecture do not support 212this_cpu ops are interrupt safe. Some architectures do not support
176these per cpu local operations. In that case the operation must be 213these per cpu local operations. In that case the operation must be
177replaced by code that disables interrupts, then does the operations 214replaced by code that disables interrupts, then does the operations
178that are guaranteed to be atomic and then reenable interrupts. Doing 215that are guaranteed to be atomic and then re-enable interrupts. Doing
179so is expensive. If there are other reasons why the scheduler cannot 216so is expensive. If there are other reasons why the scheduler cannot
180change the processor we are executing on then there is no reason to 217change the processor we are executing on then there is no reason to
181disable interrupts. For that purpose the __this_cpu operations are 218disable interrupts. For that purpose the following __this_cpu operations
182provided. For example. 219are provided.
183 220
184 __this_cpu_inc(x); 221These operations have no guarantee against concurrent interrupts or
185 222preemption. If a per cpu variable is not used in an interrupt context
186Will increment x and will not fallback to code that disables 223and the scheduler cannot preempt, then they are safe. If any interrupts
224still occur while an operation is in progress and if the interrupt too
225modifies the variable, then RMW actions can not be guaranteed to be
226safe.
227
228 __this_cpu_add()
229 __this_cpu_read(pcp)
230 __this_cpu_write(pcp, val)
231 __this_cpu_add(pcp, val)
232 __this_cpu_and(pcp, val)
233 __this_cpu_or(pcp, val)
234 __this_cpu_add_return(pcp, val)
235 __this_cpu_xchg(pcp, nval)
236 __this_cpu_cmpxchg(pcp, oval, nval)
237 __this_cpu_cmpxchg_double(pcp1, pcp2, oval1, oval2, nval1, nval2)
238 __this_cpu_sub(pcp, val)
239 __this_cpu_inc(pcp)
240 __this_cpu_dec(pcp)
241 __this_cpu_sub_return(pcp, val)
242 __this_cpu_inc_return(pcp)
243 __this_cpu_dec_return(pcp)
244
245
246Will increment x and will not fall-back to code that disables
187interrupts on platforms that cannot accomplish atomicity through 247interrupts on platforms that cannot accomplish atomicity through
188address relocation and a Read-Modify-Write operation in the same 248address relocation and a Read-Modify-Write operation in the same
189instruction. 249instruction.
190 250
191 251
192
193&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n) 252&this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n)
194-------------------------------------------- 253--------------------------------------------
195 254
196The first operation takes the offset and forms an address and then 255The first operation takes the offset and forms an address and then
197adds the offset of the n field. 256adds the offset of the n field. This may result in two add
257instructions emitted by the compiler.
198 258
199The second one first adds the two offsets and then does the 259The second one first adds the two offsets and then does the
200relocation. IMHO the second form looks cleaner and has an easier time 260relocation. IMHO the second form looks cleaner and has an easier time
@@ -202,4 +262,73 @@ with (). The second form also is consistent with the way
202this_cpu_read() and friends are used. 262this_cpu_read() and friends are used.
203 263
204 264
205Christoph Lameter, April 3rd, 2013 265Remote access to per cpu data
266------------------------------
267
268Per cpu data structures are designed to be used by one cpu exclusively.
269If you use the variables as intended, this_cpu_ops() are guaranteed to
270be "atomic" as no other CPU has access to these data structures.
271
272There are special cases where you might need to access per cpu data
273structures remotely. It is usually safe to do a remote read access
274and that is frequently done to summarize counters. Remote write access
275something which could be problematic because this_cpu ops do not
276have lock semantics. A remote write may interfere with a this_cpu
277RMW operation.
278
279Remote write accesses to percpu data structures are highly discouraged
280unless absolutely necessary. Please consider using an IPI to wake up
281the remote CPU and perform the update to its per cpu area.
282
283To access per-cpu data structure remotely, typically the per_cpu_ptr()
284function is used:
285
286
287 DEFINE_PER_CPU(struct data, datap);
288
289 struct data *p = per_cpu_ptr(&datap, cpu);
290
291This makes it explicit that we are getting ready to access a percpu
292area remotely.
293
294You can also do the following to convert the datap offset to an address
295
296 struct data *p = this_cpu_ptr(&datap);
297
298but, passing of pointers calculated via this_cpu_ptr to other cpus is
299unusual and should be avoided.
300
301Remote access are typically only for reading the status of another cpus
302per cpu data. Write accesses can cause unique problems due to the
303relaxed synchronization requirements for this_cpu operations.
304
305One example that illustrates some concerns with write operations is
306the following scenario that occurs because two per cpu variables
307share a cache-line but the relaxed synchronization is applied to
308only one process updating the cache-line.
309
310Consider the following example
311
312
313 struct test {
314 atomic_t a;
315 int b;
316 };
317
318 DEFINE_PER_CPU(struct test, onecacheline);
319
320There is some concern about what would happen if the field 'a' is updated
321remotely from one processor and the local processor would use this_cpu ops
322to update field b. Care should be taken that such simultaneous accesses to
323data within the same cache line are avoided. Also costly synchronization
324may be necessary. IPIs are generally recommended in such scenarios instead
325of a remote write to the per cpu area of another processor.
326
327Even in cases where the remote writes are rare, please bear in
328mind that a remote write will evict the cache line from the processor
329that most likely will access it. If the processor wakes up and finds a
330missing local cache line of a per cpu area, its performance and hence
331the wake up times will be affected.
332
333Christoph Lameter, August 4th, 2014
334Pranith Kumar, Aug 2nd, 2014