diff options
-rw-r--r-- | Documentation/kmemcheck.txt | 773 |
1 files changed, 773 insertions, 0 deletions
diff --git a/Documentation/kmemcheck.txt b/Documentation/kmemcheck.txt new file mode 100644 index 000000000000..363044609dad --- /dev/null +++ b/Documentation/kmemcheck.txt | |||
@@ -0,0 +1,773 @@ | |||
1 | GETTING STARTED WITH KMEMCHECK | ||
2 | ============================== | ||
3 | |||
4 | Vegard Nossum <vegardno@ifi.uio.no> | ||
5 | |||
6 | |||
7 | Contents | ||
8 | ======== | ||
9 | 0. Introduction | ||
10 | 1. Downloading | ||
11 | 2. Configuring and compiling | ||
12 | 3. How to use | ||
13 | 3.1. Booting | ||
14 | 3.2. Run-time enable/disable | ||
15 | 3.3. Debugging | ||
16 | 3.4. Annotating false positives | ||
17 | 4. Reporting errors | ||
18 | 5. Technical description | ||
19 | |||
20 | |||
21 | 0. Introduction | ||
22 | =============== | ||
23 | |||
24 | kmemcheck is a debugging feature for the Linux Kernel. More specifically, it | ||
25 | is a dynamic checker that detects and warns about some uses of uninitialized | ||
26 | memory. | ||
27 | |||
28 | Userspace programmers might be familiar with Valgrind's memcheck. The main | ||
29 | difference between memcheck and kmemcheck is that memcheck works for userspace | ||
30 | programs only, and kmemcheck works for the kernel only. The implementations | ||
31 | are of course vastly different. Because of this, kmemcheck is not as accurate | ||
32 | as memcheck, but it turns out to be good enough in practice to discover real | ||
33 | programmer errors that the compiler is not able to find through static | ||
34 | analysis. | ||
35 | |||
36 | Enabling kmemcheck on a kernel will probably slow it down to the extent that | ||
37 | the machine will not be usable for normal workloads such as e.g. an | ||
38 | interactive desktop. kmemcheck will also cause the kernel to use about twice | ||
39 | as much memory as normal. For this reason, kmemcheck is strictly a debugging | ||
40 | feature. | ||
41 | |||
42 | |||
43 | 1. Downloading | ||
44 | ============== | ||
45 | |||
46 | kmemcheck can only be downloaded using git. If you want to write patches | ||
47 | against the current code, you should use the kmemcheck development branch of | ||
48 | the tip tree. It is also possible to use the linux-next tree, which also | ||
49 | includes the latest version of kmemcheck. | ||
50 | |||
51 | Assuming that you've already cloned the linux-2.6.git repository, all you | ||
52 | have to do is add the -tip tree as a remote, like this: | ||
53 | |||
54 | $ git remote add tip git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git | ||
55 | |||
56 | To actually download the tree, fetch the remote: | ||
57 | |||
58 | $ git fetch tip | ||
59 | |||
60 | And to check out a new local branch with the kmemcheck code: | ||
61 | |||
62 | $ git checkout -b kmemcheck tip/kmemcheck | ||
63 | |||
64 | General instructions for the -tip tree can be found here: | ||
65 | http://people.redhat.com/mingo/tip.git/readme.txt | ||
66 | |||
67 | |||
68 | 2. Configuring and compiling | ||
69 | ============================ | ||
70 | |||
71 | kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of | ||
72 | configuration variables must have specific settings in order for the kmemcheck | ||
73 | menu to even appear in "menuconfig". These are: | ||
74 | |||
75 | o CONFIG_CC_OPTIMIZE_FOR_SIZE=n | ||
76 | |||
77 | This option is located under "General setup" / "Optimize for size". | ||
78 | |||
79 | Without this, gcc will use certain optimizations that usually lead to | ||
80 | false positive warnings from kmemcheck. An example of this is a 16-bit | ||
81 | field in a struct, where gcc may load 32 bits, then discard the upper | ||
82 | 16 bits. kmemcheck sees only the 32-bit load, and may trigger a | ||
83 | warning for the upper 16 bits (if they're uninitialized). | ||
84 | |||
85 | o CONFIG_SLAB=y or CONFIG_SLUB=y | ||
86 | |||
87 | This option is located under "General setup" / "Choose SLAB | ||
88 | allocator". | ||
89 | |||
90 | o CONFIG_FUNCTION_TRACER=n | ||
91 | |||
92 | This option is located under "Kernel hacking" / "Tracers" / "Kernel | ||
93 | Function Tracer" | ||
94 | |||
95 | When function tracing is compiled in, gcc emits a call to another | ||
96 | function at the beginning of every function. This means that when the | ||
97 | page fault handler is called, the ftrace framework will be called | ||
98 | before kmemcheck has had a chance to handle the fault. If ftrace then | ||
99 | modifies memory that was tracked by kmemcheck, the result is an | ||
100 | endless recursive page fault. | ||
101 | |||
102 | o CONFIG_DEBUG_PAGEALLOC=n | ||
103 | |||
104 | This option is located under "Kernel hacking" / "Debug page memory | ||
105 | allocations". | ||
106 | |||
107 | In addition, I highly recommend turning on CONFIG_DEBUG_INFO=y. This is also | ||
108 | located under "Kernel hacking". With this, you will be able to get line number | ||
109 | information from the kmemcheck warnings, which is extremely valuable in | ||
110 | debugging a problem. This option is not mandatory, however, because it slows | ||
111 | down the compilation process and produces a much bigger kernel image. | ||
112 | |||
113 | Now the kmemcheck menu should be visible (under "Kernel hacking" / "kmemcheck: | ||
114 | trap use of uninitialized memory"). Here follows a description of the | ||
115 | kmemcheck configuration variables: | ||
116 | |||
117 | o CONFIG_KMEMCHECK | ||
118 | |||
119 | This must be enabled in order to use kmemcheck at all... | ||
120 | |||
121 | o CONFIG_KMEMCHECK_[DISABLED | ENABLED | ONESHOT]_BY_DEFAULT | ||
122 | |||
123 | This option controls the status of kmemcheck at boot-time. "Enabled" | ||
124 | will enable kmemcheck right from the start, "disabled" will boot the | ||
125 | kernel as normal (but with the kmemcheck code compiled in, so it can | ||
126 | be enabled at run-time after the kernel has booted), and "one-shot" is | ||
127 | a special mode which will turn kmemcheck off automatically after | ||
128 | detecting the first use of uninitialized memory. | ||
129 | |||
130 | If you are using kmemcheck to actively debug a problem, then you | ||
131 | probably want to choose "enabled" here. | ||
132 | |||
133 | The one-shot mode is mostly useful in automated test setups because it | ||
134 | can prevent floods of warnings and increase the chances of the machine | ||
135 | surviving in case something is really wrong. In other cases, the one- | ||
136 | shot mode could actually be counter-productive because it would turn | ||
137 | itself off at the very first error -- in the case of a false positive | ||
138 | too -- and this would come in the way of debugging the specific | ||
139 | problem you were interested in. | ||
140 | |||
141 | If you would like to use your kernel as normal, but with a chance to | ||
142 | enable kmemcheck in case of some problem, it might be a good idea to | ||
143 | choose "disabled" here. When kmemcheck is disabled, most of the run- | ||
144 | time overhead is not incurred, and the kernel will be almost as fast | ||
145 | as normal. | ||
146 | |||
147 | o CONFIG_KMEMCHECK_QUEUE_SIZE | ||
148 | |||
149 | Select the maximum number of error reports to store in an internal | ||
150 | (fixed-size) buffer. Since errors can occur virtually anywhere and in | ||
151 | any context, we need a temporary storage area which is guaranteed not | ||
152 | to generate any other page faults when accessed. The queue will be | ||
153 | emptied as soon as a tasklet may be scheduled. If the queue is full, | ||
154 | new error reports will be lost. | ||
155 | |||
156 | The default value of 64 is probably fine. If some code produces more | ||
157 | than 64 errors within an irqs-off section, then the code is likely to | ||
158 | produce many, many more, too, and these additional reports seldom give | ||
159 | any more information (the first report is usually the most valuable | ||
160 | anyway). | ||
161 | |||
162 | This number might have to be adjusted if you are not using serial | ||
163 | console or similar to capture the kernel log. If you are using the | ||
164 | "dmesg" command to save the log, then getting a lot of kmemcheck | ||
165 | warnings might overflow the kernel log itself, and the earlier reports | ||
166 | will get lost in that way instead. Try setting this to 10 or so on | ||
167 | such a setup. | ||
168 | |||
169 | o CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT | ||
170 | |||
171 | Select the number of shadow bytes to save along with each entry of the | ||
172 | error-report queue. These bytes indicate what parts of an allocation | ||
173 | are initialized, uninitialized, etc. and will be displayed when an | ||
174 | error is detected to help the debugging of a particular problem. | ||
175 | |||
176 | The number entered here is actually the logarithm of the number of | ||
177 | bytes that will be saved. So if you pick for example 5 here, kmemcheck | ||
178 | will save 2^5 = 32 bytes. | ||
179 | |||
180 | The default value should be fine for debugging most problems. It also | ||
181 | fits nicely within 80 columns. | ||
182 | |||
183 | o CONFIG_KMEMCHECK_PARTIAL_OK | ||
184 | |||
185 | This option (when enabled) works around certain GCC optimizations that | ||
186 | produce 32-bit reads from 16-bit variables where the upper 16 bits are | ||
187 | thrown away afterwards. | ||
188 | |||
189 | The default value (enabled) is recommended. This may of course hide | ||
190 | some real errors, but disabling it would probably produce a lot of | ||
191 | false positives. | ||
192 | |||
193 | o CONFIG_KMEMCHECK_BITOPS_OK | ||
194 | |||
195 | This option silences warnings that would be generated for bit-field | ||
196 | accesses where not all the bits are initialized at the same time. This | ||
197 | may also hide some real bugs. | ||
198 | |||
199 | This option is probably obsolete, or it should be replaced with | ||
200 | the kmemcheck-/bitfield-annotations for the code in question. The | ||
201 | default value is therefore fine. | ||
202 | |||
203 | Now compile the kernel as usual. | ||
204 | |||
205 | |||
206 | 3. How to use | ||
207 | ============= | ||
208 | |||
209 | 3.1. Booting | ||
210 | ============ | ||
211 | |||
212 | First some information about the command-line options. There is only one | ||
213 | option specific to kmemcheck, and this is called "kmemcheck". It can be used | ||
214 | to override the default mode as chosen by the CONFIG_KMEMCHECK_*_BY_DEFAULT | ||
215 | option. Its possible settings are: | ||
216 | |||
217 | o kmemcheck=0 (disabled) | ||
218 | o kmemcheck=1 (enabled) | ||
219 | o kmemcheck=2 (one-shot mode) | ||
220 | |||
221 | If SLUB debugging has been enabled in the kernel, it may take precedence over | ||
222 | kmemcheck in such a way that the slab caches which are under SLUB debugging | ||
223 | will not be tracked by kmemcheck. In order to ensure that this doesn't happen | ||
224 | (even though it shouldn't by default), use SLUB's boot option "slub_debug", | ||
225 | like this: slub_debug=- | ||
226 | |||
227 | In fact, this option may also be used for fine-grained control over SLUB vs. | ||
228 | kmemcheck. For example, if the command line includes "kmemcheck=1 | ||
229 | slub_debug=,dentry", then SLUB debugging will be used only for the "dentry" | ||
230 | slab cache, and with kmemcheck tracking all the other caches. This is advanced | ||
231 | usage, however, and is not generally recommended. | ||
232 | |||
233 | |||
234 | 3.2. Run-time enable/disable | ||
235 | ============================ | ||
236 | |||
237 | When the kernel has booted, it is possible to enable or disable kmemcheck at | ||
238 | run-time. WARNING: This feature is still experimental and may cause false | ||
239 | positive warnings to appear. Therefore, try not to use this. If you find that | ||
240 | it doesn't work properly (e.g. you see an unreasonable amount of warnings), I | ||
241 | will be happy to take bug reports. | ||
242 | |||
243 | Use the file /proc/sys/kernel/kmemcheck for this purpose, e.g.: | ||
244 | |||
245 | $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck | ||
246 | |||
247 | The numbers are the same as for the kmemcheck= command-line option. | ||
248 | |||
249 | |||
250 | 3.3. Debugging | ||
251 | ============== | ||
252 | |||
253 | A typical report will look something like this: | ||
254 | |||
255 | WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) | ||
256 | 80000000000000000000000000000000000000000088ffff0000000000000000 | ||
257 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | ||
258 | ^ | ||
259 | |||
260 | Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A | ||
261 | RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 | ||
262 | RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 | ||
263 | RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 | ||
264 | RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 | ||
265 | RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 | ||
266 | R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e | ||
267 | R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 | ||
268 | FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 | ||
269 | CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 | ||
270 | CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 | ||
271 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | ||
272 | DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 | ||
273 | [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 | ||
274 | [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 | ||
275 | [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 | ||
276 | [<ffffffff8100c7b5>] int_signal+0x12/0x17 | ||
277 | [<ffffffffffffffff>] 0xffffffffffffffff | ||
278 | |||
279 | The single most valuable information in this report is the RIP (or EIP on 32- | ||
280 | bit) value. This will help us pinpoint exactly which instruction that caused | ||
281 | the warning. | ||
282 | |||
283 | If your kernel was compiled with CONFIG_DEBUG_INFO=y, then all we have to do | ||
284 | is give this address to the addr2line program, like this: | ||
285 | |||
286 | $ addr2line -e vmlinux -i ffffffff8104ede8 | ||
287 | arch/x86/include/asm/string_64.h:12 | ||
288 | include/asm-generic/siginfo.h:287 | ||
289 | kernel/signal.c:380 | ||
290 | kernel/signal.c:410 | ||
291 | |||
292 | The "-e vmlinux" tells addr2line which file to look in. IMPORTANT: This must | ||
293 | be the vmlinux of the kernel that produced the warning in the first place! If | ||
294 | not, the line number information will almost certainly be wrong. | ||
295 | |||
296 | The "-i" tells addr2line to also print the line numbers of inlined functions. | ||
297 | In this case, the flag was very important, because otherwise, it would only | ||
298 | have printed the first line, which is just a call to memcpy(), which could be | ||
299 | called from a thousand places in the kernel, and is therefore not very useful. | ||
300 | These inlined functions would not show up in the stack trace above, simply | ||
301 | because the kernel doesn't load the extra debugging information. This | ||
302 | technique can of course be used with ordinary kernel oopses as well. | ||
303 | |||
304 | In this case, it's the caller of memcpy() that is interesting, and it can be | ||
305 | found in include/asm-generic/siginfo.h, line 287: | ||
306 | |||
307 | 281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) | ||
308 | 282 { | ||
309 | 283 if (from->si_code < 0) | ||
310 | 284 memcpy(to, from, sizeof(*to)); | ||
311 | 285 else | ||
312 | 286 /* _sigchld is currently the largest know union member */ | ||
313 | 287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); | ||
314 | 288 } | ||
315 | |||
316 | Since this was a read (kmemcheck usually warns about reads only, though it can | ||
317 | warn about writes to unallocated or freed memory as well), it was probably the | ||
318 | "from" argument which contained some uninitialized bytes. Following the chain | ||
319 | of calls, we move upwards to see where "from" was allocated or initialized, | ||
320 | kernel/signal.c, line 380: | ||
321 | |||
322 | 359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) | ||
323 | 360 { | ||
324 | ... | ||
325 | 367 list_for_each_entry(q, &list->list, list) { | ||
326 | 368 if (q->info.si_signo == sig) { | ||
327 | 369 if (first) | ||
328 | 370 goto still_pending; | ||
329 | 371 first = q; | ||
330 | ... | ||
331 | 377 if (first) { | ||
332 | 378 still_pending: | ||
333 | 379 list_del_init(&first->list); | ||
334 | 380 copy_siginfo(info, &first->info); | ||
335 | 381 __sigqueue_free(first); | ||
336 | ... | ||
337 | 392 } | ||
338 | 393 } | ||
339 | |||
340 | Here, it is &first->info that is being passed on to copy_siginfo(). The | ||
341 | variable "first" was found on a list -- passed in as the second argument to | ||
342 | collect_signal(). We continue our journey through the stack, to figure out | ||
343 | where the item on "list" was allocated or initialized. We move to line 410: | ||
344 | |||
345 | 395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, | ||
346 | 396 siginfo_t *info) | ||
347 | 397 { | ||
348 | ... | ||
349 | 410 collect_signal(sig, pending, info); | ||
350 | ... | ||
351 | 414 } | ||
352 | |||
353 | Now we need to follow the "pending" pointer, since that is being passed on to | ||
354 | collect_signal() as "list". At this point, we've run out of lines from the | ||
355 | "addr2line" output. Not to worry, we just paste the next addresses from the | ||
356 | kmemcheck stack dump, i.e.: | ||
357 | |||
358 | [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 | ||
359 | [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 | ||
360 | [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 | ||
361 | [<ffffffff8100c7b5>] int_signal+0x12/0x17 | ||
362 | |||
363 | $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ | ||
364 | ffffffff8100b87d ffffffff8100c7b5 | ||
365 | kernel/signal.c:446 | ||
366 | kernel/signal.c:1806 | ||
367 | arch/x86/kernel/signal.c:805 | ||
368 | arch/x86/kernel/signal.c:871 | ||
369 | arch/x86/kernel/entry_64.S:694 | ||
370 | |||
371 | Remember that since these addresses were found on the stack and not as the | ||
372 | RIP value, they actually point to the _next_ instruction (they are return | ||
373 | addresses). This becomes obvious when we look at the code for line 446: | ||
374 | |||
375 | 422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) | ||
376 | 423 { | ||
377 | ... | ||
378 | 431 signr = __dequeue_signal(&tsk->signal->shared_pending, | ||
379 | 432 mask, info); | ||
380 | 433 /* | ||
381 | 434 * itimer signal ? | ||
382 | 435 * | ||
383 | 436 * itimers are process shared and we restart periodic | ||
384 | 437 * itimers in the signal delivery path to prevent DoS | ||
385 | 438 * attacks in the high resolution timer case. This is | ||
386 | 439 * compliant with the old way of self restarting | ||
387 | 440 * itimers, as the SIGALRM is a legacy signal and only | ||
388 | 441 * queued once. Changing the restart behaviour to | ||
389 | 442 * restart the timer in the signal dequeue path is | ||
390 | 443 * reducing the timer noise on heavy loaded !highres | ||
391 | 444 * systems too. | ||
392 | 445 */ | ||
393 | 446 if (unlikely(signr == SIGALRM)) { | ||
394 | ... | ||
395 | 489 } | ||
396 | |||
397 | So instead of looking at 446, we should be looking at 431, which is the line | ||
398 | that executes just before 446. Here we see that what we are looking for is | ||
399 | &tsk->signal->shared_pending. | ||
400 | |||
401 | Our next task is now to figure out which function that puts items on this | ||
402 | "shared_pending" list. A crude, but efficient tool, is git grep: | ||
403 | |||
404 | $ git grep -n 'shared_pending' kernel/ | ||
405 | ... | ||
406 | kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; | ||
407 | kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; | ||
408 | ... | ||
409 | |||
410 | There were more results, but none of them were related to list operations, | ||
411 | and these were the only assignments. We inspect the line numbers more closely | ||
412 | and find that this is indeed where items are being added to the list: | ||
413 | |||
414 | 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, | ||
415 | 817 int group) | ||
416 | 818 { | ||
417 | ... | ||
418 | 828 pending = group ? &t->signal->shared_pending : &t->pending; | ||
419 | ... | ||
420 | 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && | ||
421 | 852 (is_si_special(info) || | ||
422 | 853 info->si_code >= 0))); | ||
423 | 854 if (q) { | ||
424 | 855 list_add_tail(&q->list, &pending->list); | ||
425 | ... | ||
426 | 890 } | ||
427 | |||
428 | and: | ||
429 | |||
430 | 1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) | ||
431 | 1310 { | ||
432 | .... | ||
433 | 1339 pending = group ? &t->signal->shared_pending : &t->pending; | ||
434 | 1340 list_add_tail(&q->list, &pending->list); | ||
435 | .... | ||
436 | 1347 } | ||
437 | |||
438 | In the first case, the list element we are looking for, "q", is being returned | ||
439 | from the function __sigqueue_alloc(), which looks like an allocation function. | ||
440 | Let's take a look at it: | ||
441 | |||
442 | 187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, | ||
443 | 188 int override_rlimit) | ||
444 | 189 { | ||
445 | 190 struct sigqueue *q = NULL; | ||
446 | 191 struct user_struct *user; | ||
447 | 192 | ||
448 | 193 /* | ||
449 | 194 * We won't get problems with the target's UID changing under us | ||
450 | 195 * because changing it requires RCU be used, and if t != current, the | ||
451 | 196 * caller must be holding the RCU readlock (by way of a spinlock) and | ||
452 | 197 * we use RCU protection here | ||
453 | 198 */ | ||
454 | 199 user = get_uid(__task_cred(t)->user); | ||
455 | 200 atomic_inc(&user->sigpending); | ||
456 | 201 if (override_rlimit || | ||
457 | 202 atomic_read(&user->sigpending) <= | ||
458 | 203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) | ||
459 | 204 q = kmem_cache_alloc(sigqueue_cachep, flags); | ||
460 | 205 if (unlikely(q == NULL)) { | ||
461 | 206 atomic_dec(&user->sigpending); | ||
462 | 207 free_uid(user); | ||
463 | 208 } else { | ||
464 | 209 INIT_LIST_HEAD(&q->list); | ||
465 | 210 q->flags = 0; | ||
466 | 211 q->user = user; | ||
467 | 212 } | ||
468 | 213 | ||
469 | 214 return q; | ||
470 | 215 } | ||
471 | |||
472 | We see that this function initializes q->list, q->flags, and q->user. It seems | ||
473 | that now is the time to look at the definition of "struct sigqueue", e.g.: | ||
474 | |||
475 | 14 struct sigqueue { | ||
476 | 15 struct list_head list; | ||
477 | 16 int flags; | ||
478 | 17 siginfo_t info; | ||
479 | 18 struct user_struct *user; | ||
480 | 19 }; | ||
481 | |||
482 | And, you might remember, it was a memcpy() on &first->info that caused the | ||
483 | warning, so this makes perfect sense. It also seems reasonable to assume that | ||
484 | it is the caller of __sigqueue_alloc() that has the responsibility of filling | ||
485 | out (initializing) this member. | ||
486 | |||
487 | But just which fields of the struct were uninitialized? Let's look at | ||
488 | kmemcheck's report again: | ||
489 | |||
490 | WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) | ||
491 | 80000000000000000000000000000000000000000088ffff0000000000000000 | ||
492 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | ||
493 | ^ | ||
494 | |||
495 | These first two lines are the memory dump of the memory object itself, and the | ||
496 | shadow bytemap, respectively. The memory object itself is in this case | ||
497 | &first->info. Just beware that the start of this dump is NOT the start of the | ||
498 | object itself! The position of the caret (^) corresponds with the address of | ||
499 | the read (ffff88003e4a2024). | ||
500 | |||
501 | The shadow bytemap dump legend is as follows: | ||
502 | |||
503 | i - initialized | ||
504 | u - uninitialized | ||
505 | a - unallocated (memory has been allocated by the slab layer, but has not | ||
506 | yet been handed off to anybody) | ||
507 | f - freed (memory has been allocated by the slab layer, but has been freed | ||
508 | by the previous owner) | ||
509 | |||
510 | In order to figure out where (relative to the start of the object) the | ||
511 | uninitialized memory was located, we have to look at the disassembly. For | ||
512 | that, we'll need the RIP address again: | ||
513 | |||
514 | RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 | ||
515 | |||
516 | $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: | ||
517 | ffffffff8104edc8: mov %r8,0x8(%r8) | ||
518 | ffffffff8104edcc: test %r10d,%r10d | ||
519 | ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> | ||
520 | ffffffff8104edd5: mov %rax,%rdx | ||
521 | ffffffff8104edd8: mov $0xc,%ecx | ||
522 | ffffffff8104eddd: mov %r13,%rdi | ||
523 | ffffffff8104ede0: mov $0x30,%eax | ||
524 | ffffffff8104ede5: mov %rdx,%rsi | ||
525 | ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) | ||
526 | ffffffff8104edea: test $0x2,%al | ||
527 | ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> | ||
528 | ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) | ||
529 | ffffffff8104edf0: test $0x1,%al | ||
530 | ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> | ||
531 | ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) | ||
532 | ffffffff8104edf5: mov %r8,%rdi | ||
533 | ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> | ||
534 | |||
535 | As expected, it's the "rep movsl" instruction from the memcpy() that causes | ||
536 | the warning. We know about REP MOVSL that it uses the register RCX to count | ||
537 | the number of remaining iterations. By taking a look at the register dump | ||
538 | again (from the kmemcheck report), we can figure out how many bytes were left | ||
539 | to copy: | ||
540 | |||
541 | RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 | ||
542 | |||
543 | By looking at the disassembly, we also see that %ecx is being loaded with the | ||
544 | value $0xc just before (ffffffff8104edd8), so we are very lucky. Keep in mind | ||
545 | that this is the number of iterations, not bytes. And since this is a "long" | ||
546 | operation, we need to multiply by 4 to get the number of bytes. So this means | ||
547 | that the uninitialized value was encountered at 4 * (0xc - 0x9) = 12 bytes | ||
548 | from the start of the object. | ||
549 | |||
550 | We can now try to figure out which field of the "struct siginfo" that was not | ||
551 | initialized. This is the beginning of the struct: | ||
552 | |||
553 | 40 typedef struct siginfo { | ||
554 | 41 int si_signo; | ||
555 | 42 int si_errno; | ||
556 | 43 int si_code; | ||
557 | 44 | ||
558 | 45 union { | ||
559 | .. | ||
560 | 92 } _sifields; | ||
561 | 93 } siginfo_t; | ||
562 | |||
563 | On 64-bit, the int is 4 bytes long, so it must the the union member that has | ||
564 | not been initialized. We can verify this using gdb: | ||
565 | |||
566 | $ gdb vmlinux | ||
567 | ... | ||
568 | (gdb) p &((struct siginfo *) 0)->_sifields | ||
569 | $1 = (union {...} *) 0x10 | ||
570 | |||
571 | Actually, it seems that the union member is located at offset 0x10 -- which | ||
572 | means that gcc has inserted 4 bytes of padding between the members si_code | ||
573 | and _sifields. We can now get a fuller picture of the memory dump: | ||
574 | |||
575 | _----------------------------=> si_code | ||
576 | / _--------------------=> (padding) | ||
577 | | / _------------=> _sifields(._kill._pid) | ||
578 | | | / _----=> _sifields(._kill._uid) | ||
579 | | | | / | ||
580 | -------|-------|-------|-------| | ||
581 | 80000000000000000000000000000000000000000088ffff0000000000000000 | ||
582 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | ||
583 | |||
584 | This allows us to realize another important fact: si_code contains the value | ||
585 | 0x80. Remember that x86 is little endian, so the first 4 bytes "80000000" are | ||
586 | really the number 0x00000080. With a bit of research, we find that this is | ||
587 | actually the constant SI_KERNEL defined in include/asm-generic/siginfo.h: | ||
588 | |||
589 | 144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ | ||
590 | |||
591 | This macro is used in exactly one place in the x86 kernel: In send_signal() | ||
592 | in kernel/signal.c: | ||
593 | |||
594 | 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, | ||
595 | 817 int group) | ||
596 | 818 { | ||
597 | ... | ||
598 | 828 pending = group ? &t->signal->shared_pending : &t->pending; | ||
599 | ... | ||
600 | 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && | ||
601 | 852 (is_si_special(info) || | ||
602 | 853 info->si_code >= 0))); | ||
603 | 854 if (q) { | ||
604 | 855 list_add_tail(&q->list, &pending->list); | ||
605 | 856 switch ((unsigned long) info) { | ||
606 | ... | ||
607 | 865 case (unsigned long) SEND_SIG_PRIV: | ||
608 | 866 q->info.si_signo = sig; | ||
609 | 867 q->info.si_errno = 0; | ||
610 | 868 q->info.si_code = SI_KERNEL; | ||
611 | 869 q->info.si_pid = 0; | ||
612 | 870 q->info.si_uid = 0; | ||
613 | 871 break; | ||
614 | ... | ||
615 | 890 } | ||
616 | |||
617 | Not only does this match with the .si_code member, it also matches the place | ||
618 | we found earlier when looking for where siginfo_t objects are enqueued on the | ||
619 | "shared_pending" list. | ||
620 | |||
621 | So to sum up: It seems that it is the padding introduced by the compiler | ||
622 | between two struct fields that is uninitialized, and this gets reported when | ||
623 | we do a memcpy() on the struct. This means that we have identified a false | ||
624 | positive warning. | ||
625 | |||
626 | Normally, kmemcheck will not report uninitialized accesses in memcpy() calls | ||
627 | when both the source and destination addresses are tracked. (Instead, we copy | ||
628 | the shadow bytemap as well). In this case, the destination address clearly | ||
629 | was not tracked. We can dig a little deeper into the stack trace from above: | ||
630 | |||
631 | arch/x86/kernel/signal.c:805 | ||
632 | arch/x86/kernel/signal.c:871 | ||
633 | arch/x86/kernel/entry_64.S:694 | ||
634 | |||
635 | And we clearly see that the destination siginfo object is located on the | ||
636 | stack: | ||
637 | |||
638 | 782 static void do_signal(struct pt_regs *regs) | ||
639 | 783 { | ||
640 | 784 struct k_sigaction ka; | ||
641 | 785 siginfo_t info; | ||
642 | ... | ||
643 | 804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); | ||
644 | ... | ||
645 | 854 } | ||
646 | |||
647 | And this &info is what eventually gets passed to copy_siginfo() as the | ||
648 | destination argument. | ||
649 | |||
650 | Now, even though we didn't find an actual error here, the example is still a | ||
651 | good one, because it shows how one would go about to find out what the report | ||
652 | was all about. | ||
653 | |||
654 | |||
655 | 3.4. Annotating false positives | ||
656 | =============================== | ||
657 | |||
658 | There are a few different ways to make annotations in the source code that | ||
659 | will keep kmemcheck from checking and reporting certain allocations. Here | ||
660 | they are: | ||
661 | |||
662 | o __GFP_NOTRACK_FALSE_POSITIVE | ||
663 | |||
664 | This flag can be passed to kmalloc() or kmem_cache_alloc() (therefore | ||
665 | also to other functions that end up calling one of these) to indicate | ||
666 | that the allocation should not be tracked because it would lead to | ||
667 | a false positive report. This is a "big hammer" way of silencing | ||
668 | kmemcheck; after all, even if the false positive pertains to | ||
669 | particular field in a struct, for example, we will now lose the | ||
670 | ability to find (real) errors in other parts of the same struct. | ||
671 | |||
672 | Example: | ||
673 | |||
674 | /* No warnings will ever trigger on accessing any part of x */ | ||
675 | x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); | ||
676 | |||
677 | o kmemcheck_bitfield_begin(name)/kmemcheck_bitfield_end(name) and | ||
678 | kmemcheck_annotate_bitfield(ptr, name) | ||
679 | |||
680 | The first two of these three macros can be used inside struct | ||
681 | definitions to signal, respectively, the beginning and end of a | ||
682 | bitfield. Additionally, this will assign the bitfield a name, which | ||
683 | is given as an argument to the macros. | ||
684 | |||
685 | Having used these markers, one can later use | ||
686 | kmemcheck_annotate_bitfield() at the point of allocation, to indicate | ||
687 | which parts of the allocation is part of a bitfield. | ||
688 | |||
689 | Example: | ||
690 | |||
691 | struct foo { | ||
692 | int x; | ||
693 | |||
694 | kmemcheck_bitfield_begin(flags); | ||
695 | int flag_a:1; | ||
696 | int flag_b:1; | ||
697 | kmemcheck_bitfield_end(flags); | ||
698 | |||
699 | int y; | ||
700 | }; | ||
701 | |||
702 | struct foo *x = kmalloc(sizeof *x); | ||
703 | |||
704 | /* No warnings will trigger on accessing the bitfield of x */ | ||
705 | kmemcheck_annotate_bitfield(x, flags); | ||
706 | |||
707 | Note that kmemcheck_annotate_bitfield() can be used even before the | ||
708 | return value of kmalloc() is checked -- in other words, passing NULL | ||
709 | as the first argument is legal (and will do nothing). | ||
710 | |||
711 | |||
712 | 4. Reporting errors | ||
713 | =================== | ||
714 | |||
715 | As we have seen, kmemcheck will produce false positive reports. Therefore, it | ||
716 | is not very wise to blindly post kmemcheck warnings to mailing lists and | ||
717 | maintainers. Instead, I encourage maintainers and developers to find errors | ||
718 | in their own code. If you get a warning, you can try to work around it, try | ||
719 | to figure out if it's a real error or not, or simply ignore it. Most | ||
720 | developers know their own code and will quickly and efficiently determine the | ||
721 | root cause of a kmemcheck report. This is therefore also the most efficient | ||
722 | way to work with kmemcheck. | ||
723 | |||
724 | That said, we (the kmemcheck maintainers) will always be on the lookout for | ||
725 | false positives that we can annotate and silence. So whatever you find, | ||
726 | please drop us a note privately! Kernel configs and steps to reproduce (if | ||
727 | available) are of course a great help too. | ||
728 | |||
729 | Happy hacking! | ||
730 | |||
731 | |||
732 | 5. Technical description | ||
733 | ======================== | ||
734 | |||
735 | kmemcheck works by marking memory pages non-present. This means that whenever | ||
736 | somebody attempts to access the page, a page fault is generated. The page | ||
737 | fault handler notices that the page was in fact only hidden, and so it calls | ||
738 | on the kmemcheck code to make further investigations. | ||
739 | |||
740 | When the investigations are completed, kmemcheck "shows" the page by marking | ||
741 | it present (as it would be under normal circumstances). This way, the | ||
742 | interrupted code can continue as usual. | ||
743 | |||
744 | But after the instruction has been executed, we should hide the page again, so | ||
745 | that we can catch the next access too! Now kmemcheck makes use of a debugging | ||
746 | feature of the processor, namely single-stepping. When the processor has | ||
747 | finished the one instruction that generated the memory access, a debug | ||
748 | exception is raised. From here, we simply hide the page again and continue | ||
749 | execution, this time with the single-stepping feature turned off. | ||
750 | |||
751 | kmemcheck requires some assistance from the memory allocator in order to work. | ||
752 | The memory allocator needs to | ||
753 | |||
754 | 1. Tell kmemcheck about newly allocated pages and pages that are about to | ||
755 | be freed. This allows kmemcheck to set up and tear down the shadow memory | ||
756 | for the pages in question. The shadow memory stores the status of each | ||
757 | byte in the allocation proper, e.g. whether it is initialized or | ||
758 | uninitialized. | ||
759 | |||
760 | 2. Tell kmemcheck which parts of memory should be marked uninitialized. | ||
761 | There are actually a few more states, such as "not yet allocated" and | ||
762 | "recently freed". | ||
763 | |||
764 | If a slab cache is set up using the SLAB_NOTRACK flag, it will never return | ||
765 | memory that can take page faults because of kmemcheck. | ||
766 | |||
767 | If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still | ||
768 | request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. | ||
769 | This does not prevent the page faults from occurring, however, but marks the | ||
770 | object in question as being initialized so that no warnings will ever be | ||
771 | produced for this object. | ||
772 | |||
773 | Currently, the SLAB and SLUB allocators are supported by kmemcheck. | ||