author     Jason Baron <jbaron@redhat.com>    2012-02-21 15:03:30 -0500
committer  Ingo Molnar <mingo@elte.hu>        2012-02-24 03:12:19 -0500
commit     1cfa60dc7d7c7cc774a44eee47ff135a644a1f31 (patch)
tree       07036a68752d26ab864ec4852a8a4fac801b3495 /Documentation/static-keys.txt
parent     a746e3cc984b0aa5b620dd07c1a433283b1835cf (diff)
static keys: Add docs better explaining the whole 'struct static_key' mechanism
Add better documentation for static keys.
Signed-off-by: Jason Baron <jbaron@redhat.com>
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: davem@davemloft.net
Cc: ddaney.cavm@gmail.com
Cc: a.p.zijlstra@chello.nl
Link: http://lkml.kernel.org/r/52570e566e5f1914f27b67e4eafb5781b8f9f9db.1329851692.git.jbaron@redhat.com
[ Added a 'Summary' section and rewrote it to explain static keys ]
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Diffstat (limited to 'Documentation/static-keys.txt')

 Documentation/static-keys.txt | 286
 1 file changed, 286 insertions(+), 0 deletions(-)

diff --git a/Documentation/static-keys.txt b/Documentation/static-keys.txt
new file mode 100644
index 000000000000..d93f3c00f245
--- /dev/null
+++ b/Documentation/static-keys.txt
@@ -0,0 +1,286 @@
Static Keys
-----------

By: Jason Baron <jbaron@redhat.com>

0) Abstract

Static keys allow the inclusion of seldom used features in
performance-sensitive fast-path kernel code, via a GCC feature and a code
patching technique. A quick example:

struct static_key key = STATIC_KEY_INIT_FALSE;

...

if (static_key_false(&key))
        do unlikely code
else
        do likely code

...
static_key_slow_inc(&key);
...
static_key_slow_dec(&key);
...

The static_key_false() branch will be generated into the code with as little
impact on the likely code path as possible.


1) Motivation

Currently, tracepoints are implemented using a conditional branch. The
conditional check requires checking a global variable for each tracepoint.
Although the overhead of this check is small, it increases when the memory
cache comes under pressure (memory cache lines for these global variables may
be shared with other memory accesses). As we increase the number of tracepoints
in the kernel this overhead may become more of an issue. In addition,
tracepoints are often dormant (disabled) and provide no direct kernel
functionality. Thus, it is highly desirable to reduce their impact as much as
possible. Although tracepoints are the original motivation for this work, other
kernel code paths should be able to make use of the static keys facility.


2) Solution

gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:

http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html

Using 'asm goto', we can create branches that are either taken or not taken
by default, without the need to check memory. Then, at run-time, we can patch
the branch site to change the branch direction.

For example, if we have a simple branch that is disabled by default:

if (static_key_false(&key))
        printk("I am the true branch\n");

Thus, by default the 'printk' will not be emitted. And the generated code will
consist of a single atomic 'no-op' instruction (5 bytes on x86) in the
straight-line code path. When the branch is 'flipped', we patch the 'no-op' in
the straight-line codepath with a 'jump' instruction to the out-of-line true
branch. Thus, changing branch direction is expensive but branch selection is
basically 'free'. That is the basic tradeoff of this optimization.

This low-level patching mechanism is called 'jump label patching', and it
provides the basis for the static keys facility.

3) Static key label API, usage and examples:

In order to make use of this optimization you must first define a key:

struct static_key key;

Which is initialized as:

struct static_key key = STATIC_KEY_INIT_TRUE;

or:

struct static_key key = STATIC_KEY_INIT_FALSE;

If the key is not initialized, it defaults to false. The 'struct static_key'
must be a 'global'. That is, it can't be allocated on the stack or dynamically
allocated at run-time.

The key is then used in code as:

if (static_key_false(&key))
        do unlikely code
else
        do likely code

Or:

if (static_key_true(&key))
        do likely code
else
        do unlikely code

A key that is initialized via 'STATIC_KEY_INIT_FALSE' must be used in a
'static_key_false()' construct. Likewise, a key initialized via
'STATIC_KEY_INIT_TRUE' must be used in a 'static_key_true()' construct. A
single key can be used in many branches, but all the branches must match the
way that the key has been initialized.

The branch(es) can then be switched via:

static_key_slow_inc(&key);
...
static_key_slow_dec(&key);

Thus, 'static_key_slow_inc()' means 'make the branch true', and
'static_key_slow_dec()' means 'make the branch false', with appropriate
reference counting. For example, if the key is initialized true, a
static_key_slow_dec() will switch the branch to false. And a subsequent
static_key_slow_inc() will change the branch back to true. Likewise, if the
key is initialized false, a 'static_key_slow_inc()' will change the branch to
true. And then a 'static_key_slow_dec()' will again make the branch false.

An example usage in the kernel is the implementation of tracepoints:

static inline void trace_##name(proto)                          \
{                                                               \
        if (static_key_false(&__tracepoint_##name.key))         \
                __DO_TRACE(&__tracepoint_##name,                \
                        TP_PROTO(data_proto),                   \
                        TP_ARGS(data_args),                     \
                        TP_CONDITION(cond));                    \
}

Tracepoints are disabled by default, and can be placed in performance critical
pieces of the kernel. Thus, by using a static key, the tracepoints can have
absolutely minimal impact when not in use.


4) Architecture level code patching interface, 'jump labels'

There are a few functions and macros that architectures must implement in order
to take advantage of this optimization. If there is no architecture support, we
simply fall back to a traditional load, test, and jump sequence.

* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig

* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h

* __always_inline bool arch_static_branch(struct static_key *key), see:
  arch/x86/include/asm/jump_label.h

* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
  see: arch/x86/kernel/jump_label.c

* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
  see: arch/x86/kernel/jump_label.c

* struct jump_entry, see: arch/x86/include/asm/jump_label.h

5) Static keys / jump label analysis, results (x86_64):

As an example, let's add the following branch to 'getppid()', such that the
system call now looks like:

SYSCALL_DEFINE0(getppid)
{
        int pid;

+       if (static_key_false(&key))
+               printk("I am the true branch\n");

        rcu_read_lock();
        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
        rcu_read_unlock();

        return pid;
}

The resulting instructions with jump labels generated by GCC are:

ffffffff81044290 <sys_getppid>:
ffffffff81044290:       55                      push   %rbp
ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
ffffffff810442a0:       00 00
ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
ffffffff810442bc:       5d                      pop    %rbp
ffffffff810442bd:       48 98                   cltq
ffffffff810442bf:       c3                      retq
ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
ffffffff810442c7:       31 c0                   xor    %eax,%eax
ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>

Without the jump label optimization it looks like:

ffffffff810441f0 <sys_getppid>:
ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
ffffffff810441f6:       55                      push   %rbp
ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
ffffffff810441fa:       85 c0                   test   %eax,%eax
ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
ffffffff81044205:       00 00
ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
ffffffff81044221:       5d                      pop    %rbp
ffffffff81044222:       48 98                   cltq
ffffffff81044224:       c3                      retq
ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
ffffffff8104422c:       31 c0                   xor    %eax,%eax
ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
ffffffff8104423c:       00 00 00 00

Thus, the disabled jump label case adds a 'mov', 'test' and 'jne' instruction,
whereas the jump label case just has a 'no-op' or 'jmp 0'. (The 'jmp 0' is
patched to a 5-byte atomic no-op instruction at boot-time.) Thus, the disabled
jump label case adds:

6 (mov) + 2 (test) + 2 (jne) = 10 - 5 (5 byte jump 0) = 5 additional bytes.

If we then include the padding bytes, the jump label code saves 16 total bytes
of instruction memory for this small function. In this case the non-jump label
function is 80 bytes long. Thus, we have saved 20% of the instruction
footprint. We could in fact improve this even further, since the 5-byte no-op
really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
However, we have not yet implemented optimal no-op sizes (they are currently
hard-coded).

Since there are a number of static key API uses in the scheduler paths,
'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
performance improvement. Testing done on 3.3.0-rc2:

jump label disabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )

       1.601607384 seconds time elapsed                                          ( +-  0.07% )

jump label enabled:

 Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):

        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
   <not supported> stalled-cycles-frontend
   <not supported> stalled-cycles-backend
     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )

       1.579384366 seconds time elapsed

The percentage of saved branches is .7%, and we've saved 12% on
'branch-misses'. This is where we would expect to get the most savings, since
this optimization is about reducing the number of branches. In addition, we've
saved .2% on instructions, 2.8% on cycles, and 1.4% on elapsed time.