author | Christoph Lameter <cl@linux.com> | 2013-04-04 10:41:08 -0400 |
---|---|---|
committer | Tejun Heo <tj@kernel.org> | 2013-04-04 13:24:53 -0400 |
commit | a1b2a555d6375f1ed34d0a1761b540de3a8727c6 (patch) | |
tree | 50e3dfcef9e04feb6e958967e7e5829420d30fba /Documentation/this_cpu_ops.txt | |
parent | 07961ac7c0ee8b546658717034fe692fd12eefa9 (diff) |
percpu: add documentation on this_cpu operations
Document the rationale and the way to use this_cpu operations.
V2: Improved after feedback from Randy Dunlap
v3: Further spelling fixes from Randy. Paragraphs refilled to 75
column.
tj: Added .txt file extension to the document.
Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Diffstat (limited to 'Documentation/this_cpu_ops.txt')
-rw-r--r-- | Documentation/this_cpu_ops.txt | 205 |
1 files changed, 205 insertions, 0 deletions
diff --git a/Documentation/this_cpu_ops.txt b/Documentation/this_cpu_ops.txt
new file mode 100644
index 000000000000..1a4ce7e3e05f
--- /dev/null
+++ b/Documentation/this_cpu_ops.txt
@@ -0,0 +1,205 @@
1 | this_cpu operations | ||
2 | ------------------- | ||
3 | |||
4 | this_cpu operations are a way of optimizing access to per cpu | ||
5 | variables associated with the *currently* executing processor through | ||
6 | the use of segment registers (or a dedicated register where the cpu | ||
7 | permanently stores the beginning of the per cpu area for a specific | ||
8 | processor). | ||
9 | |||
10 | The this_cpu operations add a per cpu variable offset to the processor | ||
11 | specific percpu base and encode that operation in the instruction | ||
12 | operating on the per cpu variable. | ||
13 | |||
14 | This means there are no atomicity issues between the calculation of | ||
15 | the offset and the operation on the data. Therefore it is not | ||
16 | necessary to disable preempt or interrupts to ensure that the | ||
17 | processor is not changed between the calculation of the address and | ||
18 | the operation on the data. | ||
19 | |||
20 | Read-modify-write operations are of particular interest. Frequently | ||
21 | processors have special lower latency instructions that can operate | ||
22 | without the typical synchronization overhead but still provide some | ||
23 | sort of relaxed atomicity guarantee. The x86 for example can execute | ||
24 | RMW (Read Modify Write) instructions like inc/dec/cmpxchg without the | ||
25 | lock prefix and the associated latency penalty. | ||
26 | |||
27 | Access to the variable without the lock prefix is not synchronized but | ||
28 | synchronization is not necessary since we are dealing with per cpu | ||
29 | data specific to the currently executing processor. Only the current | ||
30 | processor should be accessing that variable and therefore there are no | ||
31 | concurrency issues with other processors in the system. | ||
32 | |||
33 | On x86 the fs: or the gs: segment registers contain the base of the | ||
34 | per cpu area. It is then possible to simply use the segment override | ||
35 | to relocate a per cpu relative address to the proper per cpu area for | ||
36 | the processor. So the relocation to the per cpu base is encoded in the | ||
37 | instruction via a segment register prefix. | ||
38 | |||
39 | For example: | ||
40 | |||
41 | DEFINE_PER_CPU(int, x); | ||
42 | int z; | ||
43 | |||
44 | z = this_cpu_read(x); | ||
45 | |||
46 | results in a single instruction | ||
47 | |||
48 | mov ax, gs:[x] | ||
49 | |||
50 | instead of a sequence of calculation of the address and then a fetch | ||
51 | from that address which occurs with the percpu operations. Before | ||
52 | this_cpu_ops such a sequence also required preempt disable/enable to | ||
53 | prevent the kernel from moving the thread to a different processor | ||
54 | while the calculation is performed. | ||
55 | |||
56 | The main use of the this_cpu operations has been to optimize counter | ||
57 | operations. | ||
58 | |||
59 | this_cpu_inc(x) | ||
60 | |||
61 | results in the following single instruction (no lock prefix!) | ||
62 | |||
63 | inc gs:[x] | ||
64 | |||
65 | instead of the following operations required if there is no segment | ||
66 | register. | ||
67 | |||
68 | int *y; | ||
69 | int cpu; | ||
70 | |||
71 | cpu = get_cpu(); | ||
72 | y = per_cpu_ptr(&x, cpu); | ||
73 | (*y)++; | ||
74 | put_cpu(); | ||
75 | |||
76 | Note that these operations can only be used on percpu data that is | ||
77 | reserved for a specific processor. Without disabling preemption in the | ||
78 | surrounding code this_cpu_inc() will only guarantee that one of the | ||
79 | percpu counters is correctly incremented. However, there is no | ||
80 | guarantee that the OS will not move the process directly before or | ||
81 | after the this_cpu instruction is executed. In general this means that | ||
82 | the values of the individual counters for each processor are | ||
83 | meaningless. The sum of all the per cpu counters is the only value | ||
84 | that is of interest. | ||
85 | |||
86 | Per cpu variables are used for performance reasons. Bouncing cache | ||
87 | lines can be avoided if multiple processors concurrently go through | ||
88 | the same code paths. Since each processor has its own per cpu | ||
89 | variables no concurrent cacheline updates take place. The price that | ||
90 | has to be paid for this optimization is the need to add up the per cpu | ||
91 | counters when the value of the counter is needed. | ||
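
The trade-off described above can be modelled in plain userspace C. This
is only a sketch and not kernel code: the CPU count, the 64 byte cache
line size and the names counter_slots, model_cpu_inc() and
model_counter_sum() are assumptions made for illustration. The cpu
argument stands in for the segment register relocation that
this_cpu_inc() performs implicitly.

```c
#include <assert.h>
#include <stddef.h>

#define COUNTER_NR_CPUS 4	/* hypothetical CPU count for this model */

/* One counter slot per CPU, padded so that slots do not share a line. */
struct counter_slot {
	long count;
	char pad[64 - sizeof(long)];	/* 64 byte line size is an assumption */
};

static struct counter_slot counter_slots[COUNTER_NR_CPUS];

/* Models this_cpu_inc(): only the slot of the executing CPU is touched. */
static void model_cpu_inc(int cpu)
{
	assert(cpu >= 0 && cpu < COUNTER_NR_CPUS);
	counter_slots[cpu].count++;
}

/* The price of the optimization: a reader has to add up every slot. */
static long model_counter_sum(void)
{
	long sum = 0;

	for (int cpu = 0; cpu < COUNTER_NR_CPUS; cpu++)
		sum += counter_slots[cpu].count;
	return sum;
}
```

An individual slot only says on which CPU an increment happened to run.
The sum is the value a caller would normally consume.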
92 | |||
93 | |||
94 | Special operations: | ||
95 | ------------------- | ||
96 | |||
97 | y = this_cpu_ptr(&x) | ||
98 | |||
99 | Takes the offset of a per cpu variable (&x !) and returns the address | ||
100 | of the per cpu variable that belongs to the currently executing | ||
101 | processor. this_cpu_ptr avoids multiple steps that the common | ||
102 | get_cpu/put_cpu sequence requires. No processor number is | ||
103 | available. Instead the offset of the local per cpu area is simply | ||
104 | added to the percpu offset. | ||
105 | |||
106 | |||
107 | |||
108 | Per cpu variables and offsets | ||
109 | ----------------------------- | ||
110 | |||
111 | Per cpu variables have *offsets* to the beginning of the percpu | ||
112 | area. They do not have addresses although they look like that in the | ||
113 | code. Offsets cannot be directly dereferenced. The offset must be | ||
114 | added to a base pointer of a percpu area of a processor in order to | ||
115 | form a valid address. | ||
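
A userspace model may make the offset/address distinction more concrete.
It is a sketch under stated assumptions: struct model_area, model_areas,
x_offset and model_per_cpu_ptr() are hypothetical names, and the real
kernel computes the base from a segment register or a per cpu offset
table rather than from an array.

```c
#include <assert.h>
#include <stddef.h>

#define AREA_NR_CPUS 2		/* hypothetical CPU count for this model */

/* Layout of one per cpu area. Every CPU owns a private copy of it. */
struct model_area {
	long y;
	int x;
};

static struct model_area model_areas[AREA_NR_CPUS];

/* "&x" in per cpu terms: an offset into the area, not a usable address. */
static const size_t x_offset = offsetof(struct model_area, x);

/* Models per_cpu_ptr(): base of the chosen area plus the offset. */
static int *model_per_cpu_ptr(size_t offset, int cpu)
{
	assert(cpu >= 0 && cpu < AREA_NR_CPUS);
	return (int *)((char *)&model_areas[cpu] + offset);
}
```

The same offset yields a different address for each CPU, which is why
the offset alone cannot be dereferenced.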
116 | |||
117 | Therefore the use of x or &x outside of the context of per cpu | ||
118 | operations is invalid and will generally be treated like a NULL | ||
119 | pointer dereference. | ||
120 | |||
121 | In the context of per cpu operations | ||
122 | |||
123 | x is a per cpu variable. Most this_cpu operations take a per cpu | ||
124 | variable. | ||
125 | |||
126 | &x is the *offset* of a per cpu variable. this_cpu_ptr() takes | ||
127 | the offset of a per cpu variable which makes this look a bit | ||
128 | strange. | ||
129 | |||
130 | |||
131 | |||
132 | Operations on a field of a per cpu structure | ||
133 | -------------------------------------------- | ||
134 | |||
135 | Let's say we have a percpu structure | ||
136 | |||
137 | struct s { | ||
138 | int n,m; | ||
139 | }; | ||
140 | |||
141 | DEFINE_PER_CPU(struct s, p); | ||
142 | |||
143 | |||
144 | Operations on these fields are straightforward | ||
145 | |||
146 | this_cpu_inc(p.m) | ||
147 | |||
148 | z = this_cpu_cmpxchg(p.m, 0, 1); | ||
149 | |||
150 | |||
151 | If we have an offset to struct s: | ||
152 | |||
153 | struct s __percpu *ps = &p; | ||
154 | |||
155 | this_cpu_dec(ps->m); | ||
156 | |||
157 | z = this_cpu_inc_return(ps->n); | ||
158 | |||
159 | |||
160 | The calculation of the pointer may require the use of this_cpu_ptr() | ||
161 | if we do not make use of this_cpu ops later to manipulate fields: | ||
162 | |||
163 | struct s *pp; | ||
164 | |||
165 | pp = this_cpu_ptr(&p); | ||
166 | |||
167 | pp->m--; | ||
168 | |||
169 | z = pp->n++; | ||
170 | |||
171 | |||
172 | Variants of this_cpu ops | ||
173 | ------------------------- | ||
174 | |||
175 | this_cpu ops are interrupt safe. Some architectures do not support | ||
176 | these per cpu local operations. In that case the operation must be | ||
177 | replaced by code that disables interrupts, then does the operations | ||
178 | that are guaranteed to be atomic and then reenables interrupts. Doing | ||
179 | so is expensive. If there are other reasons why the scheduler cannot | ||
180 | change the processor we are executing on then there is no reason to | ||
181 | disable interrupts. For that purpose the __this_cpu operations are | ||
182 | provided. For example: | ||
183 | |||
184 | __this_cpu_inc(x); | ||
185 | |||
186 | Will increment x and will not fall back to code that disables | ||
187 | interrupts on platforms that cannot accomplish atomicity through | ||
188 | address relocation and a Read-Modify-Write operation in the same | ||
189 | instruction. | ||
190 | |||
191 | |||
192 | |||
193 | &this_cpu_ptr(pp)->n vs this_cpu_ptr(&pp->n) | ||
194 | -------------------------------------------- | ||
195 | |||
196 | The first operation takes the offset and forms an address and then | ||
197 | adds the offset of the n field. | ||
198 | |||
199 | The second one first adds the two offsets and then does the | ||
200 | relocation. IMHO the second form looks cleaner and has an easier time | ||
201 | with (). The second form also is consistent with the way | ||
202 | this_cpu_read() and friends are used. | ||
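
That both forms reach the same field can be checked with a small
userspace model. This is a sketch only: s_copies, relocate() and the two
helper functions are hypothetical names, the cpu argument replaces the
implicit "current processor", and field m is used instead of n so that
the field offset is non-zero.

```c
#include <assert.h>
#include <stddef.h>

struct s {
	int n, m;
};

/* Two private copies of struct s, one per modelled CPU. */
static struct s s_copies[2];

/* Models this_cpu_ptr(): relocate an offset into the area of "cpu". */
static void *relocate(size_t offset, int cpu)
{
	assert(cpu == 0 || cpu == 1);
	return (char *)&s_copies[cpu] + offset;
}

/* First form: relocate the structure, then step to the field. */
static int *field_after_relocation(size_t ps, int cpu)
{
	return &((struct s *)relocate(ps, cpu))->m;
}

/* Second form: add the field offset first, then relocate. */
static int *relocation_after_field(size_t ps, int cpu)
{
	return relocate(ps + offsetof(struct s, m), cpu);
}
```

Since the relocation is a plain addition in this model, the order of the
two additions does not change the result. The choice between the forms
is one of readability.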
203 | |||
204 | |||
205 | Christoph Lameter, April 3rd, 2013 | ||