| author | Levin, Alexander (Sasha Levin) <alexander.levin@verizon.com> | 2017-11-15 20:36:02 -0500 |
|---|---|---|
| committer | Linus Torvalds <torvalds@linux-foundation.org> | 2017-11-15 21:21:05 -0500 |
| commit | 4675ff05de2d76d167336b368bd07f3fef6ed5a6 (patch) | |
| tree | 212d8adf40e13c2a27ac7834d14ca4900923b98c /Documentation/dev-tools | |
| parent | d8be75663cec0069b85f80191abd2682ce4a512f (diff) | |
kmemcheck: rip it out
Fix up makefiles, remove references, and git rm kmemcheck.
Link: http://lkml.kernel.org/r/20171007030159.22241-4-alexander.levin@verizon.com
Signed-off-by: Sasha Levin <alexander.levin@verizon.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Vegard Nossum <vegardno@ifi.uio.no>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Tim Hansen <devtimhansen@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation/dev-tools')
| -rw-r--r-- | Documentation/dev-tools/index.rst | 1 | ||||
| -rw-r--r-- | Documentation/dev-tools/kmemcheck.rst | 733 |
2 files changed, 0 insertions, 734 deletions
diff --git a/Documentation/dev-tools/index.rst b/Documentation/dev-tools/index.rst index a81787cd47d7..e313925fb0fa 100644 --- a/Documentation/dev-tools/index.rst +++ b/Documentation/dev-tools/index.rst | |||
| @@ -21,7 +21,6 @@ whole; patches welcome! | |||
| 21 | kasan | 21 | kasan |
| 22 | ubsan | 22 | ubsan |
| 23 | kmemleak | 23 | kmemleak |
| 24 | kmemcheck | ||
| 25 | gdb-kernel-debugging | 24 | gdb-kernel-debugging |
| 26 | kgdb | 25 | kgdb |
| 27 | kselftest | 26 | kselftest |
diff --git a/Documentation/dev-tools/kmemcheck.rst b/Documentation/dev-tools/kmemcheck.rst deleted file mode 100644 index 7f3d1985de74..000000000000 --- a/Documentation/dev-tools/kmemcheck.rst +++ /dev/null | |||
| @@ -1,733 +0,0 @@ | |||
| 1 | Getting started with kmemcheck | ||
| 2 | ============================== | ||
| 3 | |||
| 4 | Vegard Nossum <vegardno@ifi.uio.no> | ||
| 5 | |||
| 6 | |||
| 7 | Introduction | ||
| 8 | ------------ | ||
| 9 | |||
| 10 | kmemcheck is a debugging feature for the Linux kernel. More specifically, it | ||
| 11 | is a dynamic checker that detects and warns about some uses of uninitialized | ||
| 12 | memory. | ||
| 13 | |||
| 14 | Userspace programmers might be familiar with Valgrind's memcheck. The main | ||
| 15 | difference between memcheck and kmemcheck is that memcheck works for userspace | ||
| 16 | programs only, and kmemcheck works for the kernel only. The implementations | ||
| 17 | are of course vastly different. Because of this, kmemcheck is not as accurate | ||
| 18 | as memcheck, but it turns out to be good enough in practice to discover real | ||
| 19 | programmer errors that the compiler is not able to find through static | ||
| 20 | analysis. | ||
| 21 | |||
| 22 | Enabling kmemcheck on a kernel will probably slow it down to the extent that | ||
| 23 | the machine will not be usable for normal workloads such as an | ||
| 24 | interactive desktop. kmemcheck will also cause the kernel to use about twice | ||
| 25 | as much memory as normal. For this reason, kmemcheck is strictly a debugging | ||
| 26 | feature. | ||
| 27 | |||
| 28 | |||
| 29 | Downloading | ||
| 30 | ----------- | ||
| 31 | |||
| 32 | As of version 2.6.31-rc1, kmemcheck is included in the mainline kernel. | ||
| 33 | |||
| 34 | |||
| 35 | Configuring and compiling | ||
| 36 | ------------------------- | ||
| 37 | |||
| 38 | kmemcheck only works for the x86 (both 32- and 64-bit) platform. A number of | ||
| 39 | configuration variables must have specific settings in order for the kmemcheck | ||
| 40 | menu to even appear in "menuconfig". These are: | ||
| 41 | |||
| 42 | - ``CONFIG_CC_OPTIMIZE_FOR_SIZE=n`` | ||
| 43 | This option is located under "General setup" / "Optimize for size". | ||
| 44 | |||
| 45 | Without this, gcc will use certain optimizations that usually lead to | ||
| 46 | false positive warnings from kmemcheck. An example of this is a 16-bit | ||
| 47 | field in a struct, where gcc may load 32 bits, then discard the upper | ||
| 48 | 16 bits. kmemcheck sees only the 32-bit load, and may trigger a | ||
| 49 | warning for the upper 16 bits (if they're uninitialized). | ||
| 50 | |||
| 51 | - ``CONFIG_SLAB=y`` or ``CONFIG_SLUB=y`` | ||
| 52 | This option is located under "General setup" / "Choose SLAB | ||
| 53 | allocator". | ||
| 54 | |||
| 55 | - ``CONFIG_FUNCTION_TRACER=n`` | ||
| 56 | This option is located under "Kernel hacking" / "Tracers" / "Kernel | ||
| 57 | Function Tracer" | ||
| 58 | |||
| 59 | When function tracing is compiled in, gcc emits a call to another | ||
| 60 | function at the beginning of every function. This means that when the | ||
| 61 | page fault handler is called, the ftrace framework will be called | ||
| 62 | before kmemcheck has had a chance to handle the fault. If ftrace then | ||
| 63 | modifies memory that was tracked by kmemcheck, the result is an | ||
| 64 | endless recursive page fault. | ||
| 65 | |||
| 66 | - ``CONFIG_DEBUG_PAGEALLOC=n`` | ||
| 67 | This option is located under "Kernel hacking" / "Memory Debugging" | ||
| 68 | / "Debug page memory allocations". | ||
| 69 | |||
| 70 | In addition, I highly recommend turning on ``CONFIG_DEBUG_INFO=y``. This is also | ||
| 71 | located under "Kernel hacking". With this, you will be able to get line number | ||
| 72 | information from the kmemcheck warnings, which is extremely valuable in | ||
| 73 | debugging a problem. This option is not mandatory, however, because it slows | ||
| 74 | down the compilation process and produces a much bigger kernel image. | ||
| 75 | |||
| 76 | Now the kmemcheck menu should be visible (under "Kernel hacking" / "Memory | ||
| 77 | Debugging" / "kmemcheck: trap use of uninitialized memory"). Here follows | ||
| 78 | a description of the kmemcheck configuration variables: | ||
| 79 | |||
| 80 | - ``CONFIG_KMEMCHECK`` | ||
| 81 | This must be enabled in order to use kmemcheck at all... | ||
| 82 | |||
| 83 | - ``CONFIG_KMEMCHECK_``[``DISABLED`` | ``ENABLED`` | ``ONESHOT``]``_BY_DEFAULT`` | ||
| 84 | This option controls the status of kmemcheck at boot-time. "Enabled" | ||
| 85 | will enable kmemcheck right from the start, "disabled" will boot the | ||
| 86 | kernel as normal (but with the kmemcheck code compiled in, so it can | ||
| 87 | be enabled at run-time after the kernel has booted), and "one-shot" is | ||
| 88 | a special mode which will turn kmemcheck off automatically after | ||
| 89 | detecting the first use of uninitialized memory. | ||
| 90 | |||
| 91 | If you are using kmemcheck to actively debug a problem, then you | ||
| 92 | probably want to choose "enabled" here. | ||
| 93 | |||
| 94 | The one-shot mode is mostly useful in automated test setups because it | ||
| 95 | can prevent floods of warnings and increase the chances of the machine | ||
| 96 | surviving in case something is really wrong. In other cases, the one- | ||
| 97 | shot mode could actually be counter-productive because it would turn | ||
| 98 | itself off at the very first error -- in the case of a false positive | ||
| 99 | too -- and this would get in the way of debugging the specific | ||
| 100 | problem you were interested in. | ||
| 101 | |||
| 102 | If you would like to use your kernel as normal, but with a chance to | ||
| 103 | enable kmemcheck in case of some problem, it might be a good idea to | ||
| 104 | choose "disabled" here. When kmemcheck is disabled, most of the run- | ||
| 105 | time overhead is not incurred, and the kernel will be almost as fast | ||
| 106 | as normal. | ||
| 107 | |||
| 108 | - ``CONFIG_KMEMCHECK_QUEUE_SIZE`` | ||
| 109 | Select the maximum number of error reports to store in an internal | ||
| 110 | (fixed-size) buffer. Since errors can occur virtually anywhere and in | ||
| 111 | any context, we need a temporary storage area which is guaranteed not | ||
| 112 | to generate any other page faults when accessed. The queue will be | ||
| 113 | emptied as soon as a tasklet may be scheduled. If the queue is full, | ||
| 114 | new error reports will be lost. | ||
| 115 | |||
| 116 | The default value of 64 is probably fine. If some code produces more | ||
| 117 | than 64 errors within an irqs-off section, then the code is likely to | ||
| 118 | produce many, many more, too, and these additional reports seldom give | ||
| 119 | any more information (the first report is usually the most valuable | ||
| 120 | anyway). | ||
| 121 | |||
| 122 | This number might have to be adjusted if you are not using serial | ||
| 123 | console or similar to capture the kernel log. If you are using the | ||
| 124 | "dmesg" command to save the log, then getting a lot of kmemcheck | ||
| 125 | warnings might overflow the kernel log itself, and the earlier reports | ||
| 126 | will get lost in that way instead. Try setting this to 10 or so on | ||
| 127 | such a setup. | ||
| 128 | |||
| 129 | - ``CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT`` | ||
| 130 | Select the number of shadow bytes to save along with each entry of the | ||
| 131 | error-report queue. These bytes indicate what parts of an allocation | ||
| 132 | are initialized, uninitialized, etc. and will be displayed when an | ||
| 133 | error is detected to help the debugging of a particular problem. | ||
| 134 | |||
| 135 | The number entered here is actually the base-2 logarithm of the number of | ||
| 136 | bytes that will be saved. So if you pick for example 5 here, kmemcheck | ||
| 137 | will save 2^5 = 32 bytes. | ||
| 138 | |||
| 139 | The default value should be fine for debugging most problems. It also | ||
| 140 | fits nicely within 80 columns. | ||
| 141 | |||
| 142 | - ``CONFIG_KMEMCHECK_PARTIAL_OK`` | ||
| 143 | This option (when enabled) works around certain GCC optimizations that | ||
| 144 | produce 32-bit reads from 16-bit variables where the upper 16 bits are | ||
| 145 | thrown away afterwards. | ||
| 146 | |||
| 147 | The default value (enabled) is recommended. This may of course hide | ||
| 148 | some real errors, but disabling it would probably produce a lot of | ||
| 149 | false positives. | ||
| 150 | |||
| 151 | - ``CONFIG_KMEMCHECK_BITOPS_OK`` | ||
| 152 | This option silences warnings that would be generated for bit-field | ||
| 153 | accesses where not all the bits are initialized at the same time. This | ||
| 154 | may also hide some real bugs. | ||
| 155 | |||
| 156 | This option is probably obsolete, or it should be replaced with | ||
| 157 | the kmemcheck-/bitfield-annotations for the code in question. The | ||
| 158 | default value is therefore fine. | ||
| 159 | |||
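As a small illustration of the ``CONFIG_KMEMCHECK_SHADOW_COPY_SHIFT`` description above, the following sketch converts a shift value into the number of shadow bytes saved per report. The helper name ``shadow_copy_bytes`` is hypothetical and is not part of the kernel; it only restates the 2^shift relationship.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not kernel code: number of shadow bytes saved
 * with each queued report for a given shift value (2^shift).
 */
static size_t shadow_copy_bytes(unsigned int shift)
{
    return (size_t)1 << shift;
}
```

With a shift of 5, this gives the 32 bytes mentioned in the option description.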
| 160 | Now compile the kernel as usual. | ||
| 161 | |||
| 162 | |||
| 163 | How to use | ||
| 164 | ---------- | ||
| 165 | |||
| 166 | Booting | ||
| 167 | ~~~~~~~ | ||
| 168 | |||
| 169 | First some information about the command-line options. There is only one | ||
| 170 | option specific to kmemcheck, and this is called "kmemcheck". It can be used | ||
| 171 | to override the default mode as chosen by the ``CONFIG_KMEMCHECK_*_BY_DEFAULT`` | ||
| 172 | option. Its possible settings are: | ||
| 173 | |||
| 174 | - ``kmemcheck=0`` (disabled) | ||
| 175 | - ``kmemcheck=1`` (enabled) | ||
| 176 | - ``kmemcheck=2`` (one-shot mode) | ||
| 177 | |||
| 178 | If SLUB debugging has been enabled in the kernel, it may take precedence over | ||
| 179 | kmemcheck in such a way that the slab caches which are under SLUB debugging | ||
| 180 | will not be tracked by kmemcheck. In order to ensure that this doesn't happen | ||
| 181 | (even though it shouldn't by default), use SLUB's boot option ``slub_debug``, | ||
| 182 | like this: ``slub_debug=-`` | ||
| 183 | |||
| 184 | In fact, this option may also be used for fine-grained control over SLUB vs. | ||
| 185 | kmemcheck. For example, if the command line includes | ||
| 186 | ``kmemcheck=1 slub_debug=,dentry``, then SLUB debugging will be used only | ||
| 187 | for the "dentry" slab cache, and with kmemcheck tracking all the other | ||
| 188 | caches. This is advanced usage, however, and is not generally recommended. | ||
| 189 | |||
| 190 | |||
| 191 | Run-time enable/disable | ||
| 192 | ~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 193 | |||
| 194 | When the kernel has booted, it is possible to enable or disable kmemcheck at | ||
| 195 | run-time. WARNING: This feature is still experimental and may cause false | ||
| 196 | positive warnings to appear. Therefore, try not to use this. If you find that | ||
| 197 | it doesn't work properly (e.g. you see an unreasonable amount of warnings), I | ||
| 198 | will be happy to take bug reports. | ||
| 199 | |||
| 200 | Use the file ``/proc/sys/kernel/kmemcheck`` for this purpose, e.g.:: | ||
| 201 | |||
| 202 | $ echo 0 > /proc/sys/kernel/kmemcheck # disables kmemcheck | ||
| 203 | |||
| 204 | The numbers are the same as for the ``kmemcheck=`` command-line option. | ||
| 205 | |||
| 206 | |||
| 207 | Debugging | ||
| 208 | ~~~~~~~~~ | ||
| 209 | |||
| 210 | A typical report will look something like this:: | ||
| 211 | |||
| 212 | WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) | ||
| 213 | 80000000000000000000000000000000000000000088ffff0000000000000000 | ||
| 214 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | ||
| 215 | ^ | ||
| 216 | |||
| 217 | Pid: 1856, comm: ntpdate Not tainted 2.6.29-rc5 #264 945P-A | ||
| 218 | RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 | ||
| 219 | RSP: 0018:ffff88003cdf7d98 EFLAGS: 00210002 | ||
| 220 | RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 | ||
| 221 | RDX: ffff88003e5d6018 RSI: ffff88003e5d6024 RDI: ffff88003cdf7e84 | ||
| 222 | RBP: ffff88003cdf7db8 R08: ffff88003e5d6000 R09: 0000000000000000 | ||
| 223 | R10: 0000000000000080 R11: 0000000000000000 R12: 000000000000000e | ||
| 224 | R13: ffff88003cdf7e78 R14: ffff88003d530710 R15: ffff88003d5a98c8 | ||
| 225 | FS: 0000000000000000(0000) GS:ffff880001982000(0063) knlGS:00000 | ||
| 226 | CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033 | ||
| 227 | CR2: ffff88003f806ea0 CR3: 000000003c036000 CR4: 00000000000006a0 | ||
| 228 | DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 | ||
| 229 | DR3: 0000000000000000 DR6: 00000000ffff4ff0 DR7: 0000000000000400 | ||
| 230 | [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 | ||
| 231 | [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 | ||
| 232 | [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 | ||
| 233 | [<ffffffff8100c7b5>] int_signal+0x12/0x17 | ||
| 234 | [<ffffffffffffffff>] 0xffffffffffffffff | ||
| 235 | |||
| 236 | The single most valuable piece of information in this report is the RIP (or EIP | ||
| 237 | on 32-bit) value. This will help us pinpoint exactly which instruction caused | ||
| 238 | the warning. | ||
| 239 | |||
| 240 | If your kernel was compiled with ``CONFIG_DEBUG_INFO=y``, then all we have to do | ||
| 241 | is give this address to the addr2line program, like this:: | ||
| 242 | |||
| 243 | $ addr2line -e vmlinux -i ffffffff8104ede8 | ||
| 244 | arch/x86/include/asm/string_64.h:12 | ||
| 245 | include/asm-generic/siginfo.h:287 | ||
| 246 | kernel/signal.c:380 | ||
| 247 | kernel/signal.c:410 | ||
| 248 | |||
| 249 | The "``-e vmlinux``" tells addr2line which file to look in. **IMPORTANT:** | ||
| 250 | This must be the vmlinux of the kernel that produced the warning in the | ||
| 251 | first place! If not, the line number information will almost certainly be | ||
| 252 | wrong. | ||
| 253 | |||
| 254 | The "``-i``" tells addr2line to also print the line numbers of inlined | ||
| 255 | functions. In this case, the flag was very important, because otherwise, | ||
| 256 | it would only have printed the first line, which is just a call to | ||
| 257 | ``memcpy()``, which could be called from a thousand places in the kernel, and | ||
| 258 | is therefore not very useful. These inlined functions would not show up in | ||
| 259 | the stack trace above, simply because the kernel doesn't load the extra | ||
| 260 | debugging information. This technique can of course be used with ordinary | ||
| 261 | kernel oopses as well. | ||
| 262 | |||
| 263 | In this case, it's the caller of ``memcpy()`` that is interesting, and it can be | ||
| 264 | found in ``include/asm-generic/siginfo.h``, line 287:: | ||
| 265 | |||
| 266 | 281 static inline void copy_siginfo(struct siginfo *to, struct siginfo *from) | ||
| 267 | 282 { | ||
| 268 | 283 if (from->si_code < 0) | ||
| 269 | 284 memcpy(to, from, sizeof(*to)); | ||
| 270 | 285 else | ||
| 271 | 286 /* _sigchld is currently the largest know union member */ | ||
| 272 | 287 memcpy(to, from, __ARCH_SI_PREAMBLE_SIZE + sizeof(from->_sifields._sigchld)); | ||
| 273 | 288 } | ||
| 274 | |||
| 275 | Since this was a read (kmemcheck usually warns about reads only, though it can | ||
| 276 | warn about writes to unallocated or freed memory as well), it was probably the | ||
| 277 | "from" argument which contained some uninitialized bytes. Following the chain | ||
| 278 | of calls, we move upwards to see where "from" was allocated or initialized, | ||
| 279 | ``kernel/signal.c``, line 380:: | ||
| 280 | |||
| 281 | 359 static void collect_signal(int sig, struct sigpending *list, siginfo_t *info) | ||
| 282 | 360 { | ||
| 283 | ... | ||
| 284 | 367 list_for_each_entry(q, &list->list, list) { | ||
| 285 | 368 if (q->info.si_signo == sig) { | ||
| 286 | 369 if (first) | ||
| 287 | 370 goto still_pending; | ||
| 288 | 371 first = q; | ||
| 289 | ... | ||
| 290 | 377 if (first) { | ||
| 291 | 378 still_pending: | ||
| 292 | 379 list_del_init(&first->list); | ||
| 293 | 380 copy_siginfo(info, &first->info); | ||
| 294 | 381 __sigqueue_free(first); | ||
| 295 | ... | ||
| 296 | 392 } | ||
| 297 | 393 } | ||
| 298 | |||
| 299 | Here, it is ``&first->info`` that is being passed on to ``copy_siginfo()``. The | ||
| 300 | variable ``first`` was found on a list -- passed in as the second argument to | ||
| 301 | ``collect_signal()``. We continue our journey through the stack, to figure out | ||
| 302 | where the item on "list" was allocated or initialized. We move to line 410:: | ||
| 303 | |||
| 304 | 395 static int __dequeue_signal(struct sigpending *pending, sigset_t *mask, | ||
| 305 | 396 siginfo_t *info) | ||
| 306 | 397 { | ||
| 307 | ... | ||
| 308 | 410 collect_signal(sig, pending, info); | ||
| 309 | ... | ||
| 310 | 414 } | ||
| 311 | |||
| 312 | Now we need to follow the ``pending`` pointer, since that is being passed on to | ||
| 313 | ``collect_signal()`` as ``list``. At this point, we've run out of lines from the | ||
| 314 | "addr2line" output. Not to worry, we just paste the next addresses from the | ||
| 315 | kmemcheck stack dump, i.e.:: | ||
| 316 | |||
| 317 | [<ffffffff8104f04e>] dequeue_signal+0x8e/0x170 | ||
| 318 | [<ffffffff81050bd8>] get_signal_to_deliver+0x98/0x390 | ||
| 319 | [<ffffffff8100b87d>] do_notify_resume+0xad/0x7d0 | ||
| 320 | [<ffffffff8100c7b5>] int_signal+0x12/0x17 | ||
| 321 | |||
| 322 | $ addr2line -e vmlinux -i ffffffff8104f04e ffffffff81050bd8 \ | ||
| 323 | ffffffff8100b87d ffffffff8100c7b5 | ||
| 324 | kernel/signal.c:446 | ||
| 325 | kernel/signal.c:1806 | ||
| 326 | arch/x86/kernel/signal.c:805 | ||
| 327 | arch/x86/kernel/signal.c:871 | ||
| 328 | arch/x86/kernel/entry_64.S:694 | ||
| 329 | |||
| 330 | Remember that since these addresses were found on the stack and not as the | ||
| 331 | RIP value, they actually point to the _next_ instruction (they are return | ||
| 332 | addresses). This becomes obvious when we look at the code for line 446:: | ||
| 333 | |||
| 334 | 422 int dequeue_signal(struct task_struct *tsk, sigset_t *mask, siginfo_t *info) | ||
| 335 | 423 { | ||
| 336 | ... | ||
| 337 | 431 signr = __dequeue_signal(&tsk->signal->shared_pending, | ||
| 338 | 432 mask, info); | ||
| 339 | 433 /* | ||
| 340 | 434 * itimer signal ? | ||
| 341 | 435 * | ||
| 342 | 436 * itimers are process shared and we restart periodic | ||
| 343 | 437 * itimers in the signal delivery path to prevent DoS | ||
| 344 | 438 * attacks in the high resolution timer case. This is | ||
| 345 | 439 * compliant with the old way of self restarting | ||
| 346 | 440 * itimers, as the SIGALRM is a legacy signal and only | ||
| 347 | 441 * queued once. Changing the restart behaviour to | ||
| 348 | 442 * restart the timer in the signal dequeue path is | ||
| 349 | 443 * reducing the timer noise on heavy loaded !highres | ||
| 350 | 444 * systems too. | ||
| 351 | 445 */ | ||
| 352 | 446 if (unlikely(signr == SIGALRM)) { | ||
| 353 | ... | ||
| 354 | 489 } | ||
| 355 | |||
| 356 | So instead of looking at 446, we should be looking at 431, which is the line | ||
| 357 | that executes just before 446. Here we see that what we are looking for is | ||
| 358 | ``&tsk->signal->shared_pending``. | ||
| 359 | |||
| 360 | Our next task is now to figure out which function puts items on this | ||
| 361 | ``shared_pending`` list. A crude but effective tool is ``git grep``:: | ||
| 362 | |||
| 363 | $ git grep -n 'shared_pending' kernel/ | ||
| 364 | ... | ||
| 365 | kernel/signal.c:828: pending = group ? &t->signal->shared_pending : &t->pending; | ||
| 366 | kernel/signal.c:1339: pending = group ? &t->signal->shared_pending : &t->pending; | ||
| 367 | ... | ||
| 368 | |||
| 369 | There were more results, but none of them were related to list operations, | ||
| 370 | and these were the only assignments. We inspect the line numbers more closely | ||
| 371 | and find that this is indeed where items are being added to the list:: | ||
| 372 | |||
| 373 | 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, | ||
| 374 | 817 int group) | ||
| 375 | 818 { | ||
| 376 | ... | ||
| 377 | 828 pending = group ? &t->signal->shared_pending : &t->pending; | ||
| 378 | ... | ||
| 379 | 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && | ||
| 380 | 852 (is_si_special(info) || | ||
| 381 | 853 info->si_code >= 0))); | ||
| 382 | 854 if (q) { | ||
| 383 | 855 list_add_tail(&q->list, &pending->list); | ||
| 384 | ... | ||
| 385 | 890 } | ||
| 386 | |||
| 387 | and:: | ||
| 388 | |||
| 389 | 1309 int send_sigqueue(struct sigqueue *q, struct task_struct *t, int group) | ||
| 390 | 1310 { | ||
| 391 | .... | ||
| 392 | 1339 pending = group ? &t->signal->shared_pending : &t->pending; | ||
| 393 | 1340 list_add_tail(&q->list, &pending->list); | ||
| 394 | .... | ||
| 395 | 1347 } | ||
| 396 | |||
| 397 | In the first case, the list element we are looking for, ``q``, is being | ||
| 398 | returned from the function ``__sigqueue_alloc()``, which looks like an | ||
| 399 | allocation function. Let's take a look at it:: | ||
| 400 | |||
| 401 | 187 static struct sigqueue *__sigqueue_alloc(struct task_struct *t, gfp_t flags, | ||
| 402 | 188 int override_rlimit) | ||
| 403 | 189 { | ||
| 404 | 190 struct sigqueue *q = NULL; | ||
| 405 | 191 struct user_struct *user; | ||
| 406 | 192 | ||
| 407 | 193 /* | ||
| 408 | 194 * We won't get problems with the target's UID changing under us | ||
| 409 | 195 * because changing it requires RCU be used, and if t != current, the | ||
| 410 | 196 * caller must be holding the RCU readlock (by way of a spinlock) and | ||
| 411 | 197 * we use RCU protection here | ||
| 412 | 198 */ | ||
| 413 | 199 user = get_uid(__task_cred(t)->user); | ||
| 414 | 200 atomic_inc(&user->sigpending); | ||
| 415 | 201 if (override_rlimit || | ||
| 416 | 202 atomic_read(&user->sigpending) <= | ||
| 417 | 203 t->signal->rlim[RLIMIT_SIGPENDING].rlim_cur) | ||
| 418 | 204 q = kmem_cache_alloc(sigqueue_cachep, flags); | ||
| 419 | 205 if (unlikely(q == NULL)) { | ||
| 420 | 206 atomic_dec(&user->sigpending); | ||
| 421 | 207 free_uid(user); | ||
| 422 | 208 } else { | ||
| 423 | 209 INIT_LIST_HEAD(&q->list); | ||
| 424 | 210 q->flags = 0; | ||
| 425 | 211 q->user = user; | ||
| 426 | 212 } | ||
| 427 | 213 | ||
| 428 | 214 return q; | ||
| 429 | 215 } | ||
| 430 | |||
| 431 | We see that this function initializes ``q->list``, ``q->flags``, and | ||
| 432 | ``q->user``. It seems that now is the time to look at the definition of | ||
| 433 | ``struct sigqueue``, e.g.:: | ||
| 434 | |||
| 435 | 14 struct sigqueue { | ||
| 436 | 15 struct list_head list; | ||
| 437 | 16 int flags; | ||
| 438 | 17 siginfo_t info; | ||
| 439 | 18 struct user_struct *user; | ||
| 440 | 19 }; | ||
| 441 | |||
| 442 | And, you might remember, it was a ``memcpy()`` on ``&first->info`` that | ||
| 443 | caused the warning, so this makes perfect sense. It also seems reasonable | ||
| 444 | to assume that it is the caller of ``__sigqueue_alloc()`` that has the | ||
| 445 | responsibility of filling out (initializing) this member. | ||
| 446 | |||
| 447 | But just which fields of the struct were uninitialized? Let's look at | ||
| 448 | kmemcheck's report again:: | ||
| 449 | |||
| 450 | WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (ffff88003e4a2024) | ||
| 451 | 80000000000000000000000000000000000000000088ffff0000000000000000 | ||
| 452 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | ||
| 453 | ^ | ||
| 454 | |||
| 455 | These first two lines are the memory dump of the memory object itself, and | ||
| 456 | the shadow bytemap, respectively. The memory object itself is in this case | ||
| 457 | ``&first->info``. Just beware that the start of this dump is NOT the start | ||
| 458 | of the object itself! The position of the caret (^) corresponds with the | ||
| 459 | address of the read (ffff88003e4a2024). | ||
| 460 | |||
| 461 | The shadow bytemap dump legend is as follows: | ||
| 462 | |||
| 463 | - i: initialized | ||
| 464 | - u: uninitialized | ||
| 465 | - a: unallocated (memory has been allocated by the slab layer, but has not | ||
| 466 | yet been handed off to anybody) | ||
| 467 | - f: freed (memory has been allocated by the slab layer, but has been freed | ||
| 468 | by the previous owner) | ||
| 469 | |||
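The legend above can be applied mechanically to a shadow line. The sketch below assumes the shadow line is available as a string of single-letter flags separated by spaces, as printed in the report; the function name ``first_non_initialized`` is made up for this example.

```c
/*
 * Hypothetical helper: scan a shadow bytemap line such as
 * "i i i i u u u u" and return the index of the first byte whose
 * flag is not 'i' (initialized), or -1 if all bytes are initialized.
 * Spaces between flags are skipped and do not count as bytes.
 */
static long first_non_initialized(const char *shadow)
{
    long index = 0;

    for (const char *p = shadow; *p != '\0'; p++) {
        if (*p == ' ')
            continue;
        if (*p != 'i')
            return index;
        index++;
    }
    return -1;
}
```

Note that the returned index is relative to the start of the dump, which is not necessarily the start of the object.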
| 470 | In order to figure out where (relative to the start of the object) the | ||
| 471 | uninitialized memory was located, we have to look at the disassembly. For | ||
| 472 | that, we'll need the RIP address again:: | ||
| 473 | |||
| 474 | RIP: 0010:[<ffffffff8104ede8>] [<ffffffff8104ede8>] __dequeue_signal+0xc8/0x190 | ||
| 475 | |||
| 476 | $ objdump -d --no-show-raw-insn vmlinux | grep -C 8 ffffffff8104ede8: | ||
| 477 | ffffffff8104edc8: mov %r8,0x8(%r8) | ||
| 478 | ffffffff8104edcc: test %r10d,%r10d | ||
| 479 | ffffffff8104edcf: js ffffffff8104ee88 <__dequeue_signal+0x168> | ||
| 480 | ffffffff8104edd5: mov %rax,%rdx | ||
| 481 | ffffffff8104edd8: mov $0xc,%ecx | ||
| 482 | ffffffff8104eddd: mov %r13,%rdi | ||
| 483 | ffffffff8104ede0: mov $0x30,%eax | ||
| 484 | ffffffff8104ede5: mov %rdx,%rsi | ||
| 485 | ffffffff8104ede8: rep movsl %ds:(%rsi),%es:(%rdi) | ||
| 486 | ffffffff8104edea: test $0x2,%al | ||
| 487 | ffffffff8104edec: je ffffffff8104edf0 <__dequeue_signal+0xd0> | ||
| 488 | ffffffff8104edee: movsw %ds:(%rsi),%es:(%rdi) | ||
| 489 | ffffffff8104edf0: test $0x1,%al | ||
| 490 | ffffffff8104edf2: je ffffffff8104edf5 <__dequeue_signal+0xd5> | ||
| 491 | ffffffff8104edf4: movsb %ds:(%rsi),%es:(%rdi) | ||
| 492 | ffffffff8104edf5: mov %r8,%rdi | ||
| 493 | ffffffff8104edf8: callq ffffffff8104de60 <__sigqueue_free> | ||
| 494 | |||
| 495 | As expected, it's the "``rep movsl``" instruction from the ``memcpy()`` | ||
| 496 | that causes the warning. We know that ``REP MOVSL`` uses the register | ||
| 497 | ``RCX`` to count the number of remaining iterations. By taking a look at the | ||
| 498 | register dump again (from the kmemcheck report), we can figure out how many | ||
| 499 | bytes were left to copy:: | ||
| 500 | |||
| 501 | RAX: 0000000000000030 RBX: ffff88003d4ea968 RCX: 0000000000000009 | ||
| 502 | |||
| 503 | By looking at the disassembly, we also see that ``%ecx`` is being loaded | ||
| 504 | with the value ``$0xc`` just before (ffffffff8104edd8), so we are very | ||
| 505 | lucky. Keep in mind that this is the number of iterations, not bytes. And | ||
| 506 | since this is a "long" operation, we need to multiply by 4 to get the | ||
| 507 | number of bytes. So this means that the uninitialized value was encountered | ||
| 508 | at 4 * (0xc - 0x9) = 12 bytes from the start of the object. | ||
| 509 | |||
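The arithmetic above can be written down as a short sketch. The function name and parameters are illustrative only; the inputs are the count loaded into ``%ecx`` before the copy, the ``RCX`` value from the register dump, and the element size of the string instruction (4 for ``movsl``).

```c
#include <stdint.h>

/*
 * Hypothetical helper: bytes already processed by an interrupted
 * string instruction such as "rep movsl", computed as
 * (initial count - remaining count) * element size.
 */
static uint64_t rep_bytes_done(uint64_t initial_count,
                               uint64_t remaining_count,
                               unsigned int element_size)
{
    return (initial_count - remaining_count) * element_size;
}
```

Calling ``rep_bytes_done(0xc, 0x9, 4)`` reproduces the 12-byte offset derived in the text.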
| 510 | We can now try to figure out which field of the "``struct siginfo``" | ||
| 511 | was not initialized. This is the beginning of the struct:: | ||
| 512 | |||
| 513 | 40 typedef struct siginfo { | ||
| 514 | 41 int si_signo; | ||
| 515 | 42 int si_errno; | ||
| 516 | 43 int si_code; | ||
| 517 | 44 | ||
| 518 | 45 union { | ||
| 519 | .. | ||
| 520 | 92 } _sifields; | ||
| 521 | 93 } siginfo_t; | ||
| 522 | |||
| 523 | On 64-bit, the int is 4 bytes long, so it must be the union member that has | ||
| 524 | not been initialized. We can verify this using gdb:: | ||
| 525 | |||
| 526 | $ gdb vmlinux | ||
| 527 | ... | ||
| 528 | (gdb) p &((struct siginfo *) 0)->_sifields | ||
| 529 | $1 = (union {...} *) 0x10 | ||
| 530 | |||
| 531 | Actually, it seems that the union member is located at offset 0x10 -- which | ||
| 532 | means that gcc has inserted 4 bytes of padding between the members ``si_code`` | ||
| 533 | and ``_sifields``. We can now get a fuller picture of the memory dump:: | ||
| 534 | |||
| 535 | _----------------------------=> si_code | ||
| 536 | / _--------------------=> (padding) | ||
| 537 | | / _------------=> _sifields(._kill._pid) | ||
| 538 | | | / _----=> _sifields(._kill._uid) | ||
| 539 | | | | / | ||
| 540 | -------|-------|-------|-------| | ||
| 541 | 80000000000000000000000000000000000000000088ffff0000000000000000 | ||
| 542 | i i i i u u u u i i i i i i i i u u u u u u u u u u u u u u u u | ||
| 543 | |||
| 544 | This allows us to realize another important fact: ``si_code`` contains the | ||
| 545 | value 0x80. Remember that x86 is little endian, so the first 4 bytes | ||
| 546 | "80000000" are really the number 0x00000080. With a bit of research, we | ||
| 547 | find that this is actually the constant ``SI_KERNEL`` defined in | ||
| 548 | ``include/asm-generic/siginfo.h``:: | ||
| 549 | |||
| 550 | 144 #define SI_KERNEL 0x80 /* sent by the kernel from somewhere */ | ||
| 551 | |||
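The little-endian reading of the dump can be checked with a short sketch. It assumes the dump is given as a plain string of hex digits in memory order, as in the report; the helper names ``hex_digit`` and ``le32_from_dump`` are invented for this example.

```c
#include <stdint.h>

/* Convert one hex digit to its value, or return -1 if it is not one. */
static int hex_digit(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 10;
    return -1;
}

/*
 * Hypothetical helper: interpret the first 8 hex digits of a dump
 * (4 bytes in memory order) as a little-endian 32-bit value.
 * Returns 0 on success, -1 if the input is too short or not hex.
 */
static int le32_from_dump(const char *hex, uint32_t *out)
{
    uint32_t value = 0;

    for (int i = 0; i < 4; i++) {
        int hi = hex_digit(hex[2 * i]);
        if (hi < 0)
            return -1;
        int lo = hex_digit(hex[2 * i + 1]);
        if (lo < 0)
            return -1;
        /* Byte i in memory is the i-th least significant byte. */
        value |= (uint32_t)(hi * 16 + lo) << (8 * i);
    }
    *out = value;
    return 0;
}
```

Decoding the string ``"80000000"`` this way yields 0x80, which matches ``SI_KERNEL``.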
| 552 | This macro is used in exactly one place in the x86 kernel: in ``send_signal()`` | ||
| 553 | in ``kernel/signal.c``:: | ||
| 554 | |||
| 555 | 816 static int send_signal(int sig, struct siginfo *info, struct task_struct *t, | ||
| 556 | 817 int group) | ||
| 557 | 818 { | ||
| 558 | ... | ||
| 559 | 828 pending = group ? &t->signal->shared_pending : &t->pending; | ||
| 560 | ... | ||
| 561 | 851 q = __sigqueue_alloc(t, GFP_ATOMIC, (sig < SIGRTMIN && | ||
| 562 | 852 (is_si_special(info) || | ||
| 563 | 853 info->si_code >= 0))); | ||
| 564 | 854 if (q) { | ||
| 565 | 855 list_add_tail(&q->list, &pending->list); | ||
| 566 | 856 switch ((unsigned long) info) { | ||
| 567 | ... | ||
| 568 | 865 case (unsigned long) SEND_SIG_PRIV: | ||
| 569 | 866 q->info.si_signo = sig; | ||
| 570 | 867 q->info.si_errno = 0; | ||
| 571 | 868 q->info.si_code = SI_KERNEL; | ||
| 572 | 869 q->info.si_pid = 0; | ||
| 573 | 870 q->info.si_uid = 0; | ||
| 574 | 871 break; | ||
| 575 | ... | ||
| 576 | 890 } | ||
| 577 | |||
| 578 | Not only does this match with the ``.si_code`` member, it also matches the place | ||
| 579 | we found earlier when looking for where siginfo_t objects are enqueued on the | ||
| 580 | ``shared_pending`` list. | ||
| 581 | |||
| 582 | So to sum up: It seems that it is the padding introduced by the compiler | ||
| 583 | between two struct fields that is uninitialized, and this gets reported when | ||
| 584 | we do a ``memcpy()`` on the struct. This means that we have identified a false | ||
| 585 | positive warning. | ||
| 586 | |||
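The padding effect can be sketched in plain userspace C. The struct below is a made-up stand-in, not the real ``siginfo_t``; how much padding a compiler inserts depends on the ABI, so the sketch computes it instead of assuming a number.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical struct: most ABIs insert padding between tag and value. */
struct sample {
	char tag;
	int value;
};

/* Number of bytes in struct sample that belong to no member. */
static size_t sample_padding_bytes(void)
{
	return sizeof(struct sample) - (sizeof(char) + sizeof(int));
}

/*
 * Copy the way memcpy() does: every byte of the object is read, including
 * padding bytes that member assignments never wrote. A byte-granular
 * checker sees those reads as accesses to uninitialized memory.
 */
static void sample_copy(struct sample *dst, const struct sample *src)
{
	assert(dst != NULL && src != NULL);
	memcpy(dst, src, sizeof(*dst));
}
```

Assigning ``tag`` and ``value`` one by one initializes only the member bytes, while ``sample_copy()`` still reads all ``sizeof(struct sample)`` bytes. That pattern is what produces a report like the one analyzed above.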
| 587 | Normally, kmemcheck will not report uninitialized accesses in ``memcpy()`` calls | ||
| 588 | when both the source and destination addresses are tracked. (Instead, we copy | ||
| 589 | the shadow bytemap as well.) In this case, the destination address clearly | ||
| 590 | was not tracked. We can dig a little deeper into the stack trace from above:: | ||
| 591 | |||
| 592 | arch/x86/kernel/signal.c:805 | ||
| 593 | arch/x86/kernel/signal.c:871 | ||
| 594 | arch/x86/kernel/entry_64.S:694 | ||
| 595 | |||
| 596 | And we clearly see that the destination siginfo object is located on the | ||
| 597 | stack:: | ||
| 598 | |||
| 599 | 782 static void do_signal(struct pt_regs *regs) | ||
| 600 | 783 { | ||
| 601 | 784 struct k_sigaction ka; | ||
| 602 | 785 siginfo_t info; | ||
| 603 | ... | ||
| 604 | 804 signr = get_signal_to_deliver(&info, &ka, regs, NULL); | ||
| 605 | ... | ||
| 606 | 854 } | ||
| 607 | |||
| 608 | And this ``&info`` is what eventually gets passed to ``copy_siginfo()`` as the | ||
| 609 | destination argument. | ||
| 610 | |||
| 611 | Now, even though we didn't find an actual error here, the example is still a | ||
| 612 | good one, because it shows how one would go about finding out what the report | ||
| 613 | was all about. | ||
| 614 | |||
| 615 | |||
| 616 | Annotating false positives | ||
| 617 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
| 618 | |||
| 619 | There are a few different ways to make annotations in the source code that | ||
| 620 | will keep kmemcheck from checking and reporting certain allocations. Here | ||
| 621 | they are: | ||
| 622 | |||
| 623 | - ``__GFP_NOTRACK_FALSE_POSITIVE`` | ||
| 624 | This flag can be passed to ``kmalloc()`` or ``kmem_cache_alloc()`` | ||
| 625 | (therefore also to other functions that end up calling one of | ||
| 626 | these) to indicate that the allocation should not be tracked | ||
| 627 | because it would lead to a false positive report. This is a "big | ||
| 628 | hammer" way of silencing kmemcheck; after all, even if the false | ||
| 629 | positive pertains to a particular field in a struct, for example, we | ||
| 630 | will now lose the ability to find (real) errors in other parts of | ||
| 631 | the same struct. | ||
| 632 | |||
| 633 | Example:: | ||
| 634 | |||
| 635 | /* No warnings will ever trigger on accessing any part of x */ | ||
| 636 | x = kmalloc(sizeof *x, GFP_KERNEL | __GFP_NOTRACK_FALSE_POSITIVE); | ||
| 637 | |||
| 638 | - ``kmemcheck_bitfield_begin(name)``/``kmemcheck_bitfield_end(name)`` and | ||
| 639 | ``kmemcheck_annotate_bitfield(ptr, name)`` | ||
| 640 | The first two of these three macros can be used inside struct | ||
| 641 | definitions to signal, respectively, the beginning and end of a | ||
| 642 | bitfield. Additionally, this will assign the bitfield a name, which | ||
| 643 | is given as an argument to the macros. | ||
| 644 | |||
| 645 | Having used these markers, one can later use | ||
| 646 | ``kmemcheck_annotate_bitfield()`` at the point of allocation, to indicate | ||
| 647 | which parts of the allocation are part of a bitfield. | ||
| 648 | |||
| 649 | Example:: | ||
| 650 | |||
| 651 | struct foo { | ||
| 652 | int x; | ||
| 653 | |||
| 654 | kmemcheck_bitfield_begin(flags); | ||
| 655 | int flag_a:1; | ||
| 656 | int flag_b:1; | ||
| 657 | kmemcheck_bitfield_end(flags); | ||
| 658 | |||
| 659 | int y; | ||
| 660 | }; | ||
| 661 | |||
| 662 | struct foo *x = kmalloc(sizeof *x, GFP_KERNEL); | ||
| 663 | |||
| 664 | /* No warnings will trigger on accessing the bitfield of x */ | ||
| 665 | kmemcheck_annotate_bitfield(x, flags); | ||
| 666 | |||
| 667 | Note that ``kmemcheck_annotate_bitfield()`` can be used even before the | ||
| 668 | return value of ``kmalloc()`` is checked -- in other words, passing NULL | ||
| 669 | as the first argument is legal (and will do nothing). | ||
| 670 | |||
| 671 | |||
| 672 | Reporting errors | ||
| 673 | ---------------- | ||
| 674 | |||
| 675 | As we have seen, kmemcheck will produce false positive reports. Therefore, it | ||
| 676 | is not very wise to blindly post kmemcheck warnings to mailing lists and | ||
| 677 | maintainers. Instead, I encourage maintainers and developers to find errors | ||
| 678 | in their own code. If you get a warning, you can try to work around it, try | ||
| 679 | to figure out if it's a real error or not, or simply ignore it. Most | ||
| 680 | developers know their own code and will quickly and efficiently determine the | ||
| 681 | root cause of a kmemcheck report. This is therefore also the most efficient | ||
| 682 | way to work with kmemcheck. | ||
| 683 | |||
| 684 | That said, we (the kmemcheck maintainers) will always be on the lookout for | ||
| 685 | false positives that we can annotate and silence. So whatever you find, | ||
| 686 | please drop us a note privately! Kernel configs and steps to reproduce (if | ||
| 687 | available) are of course a great help too. | ||
| 688 | |||
| 689 | Happy hacking! | ||
| 690 | |||
| 691 | |||
| 692 | Technical description | ||
| 693 | --------------------- | ||
| 694 | |||
| 695 | kmemcheck works by marking memory pages non-present. This means that whenever | ||
| 696 | somebody attempts to access the page, a page fault is generated. The page | ||
| 697 | fault handler notices that the page was in fact only hidden, and so it calls | ||
| 698 | on the kmemcheck code to make further investigations. | ||
| 699 | |||
| 700 | When the investigations are completed, kmemcheck "shows" the page by marking | ||
| 701 | it present (as it would be under normal circumstances). This way, the | ||
| 702 | interrupted code can continue as usual. | ||
| 703 | |||
| 704 | But after the instruction has been executed, we should hide the page again, so | ||
| 705 | that we can catch the next access too! For this, kmemcheck makes use of a debugging | ||
| 706 | feature of the processor, namely single-stepping. When the processor has | ||
| 707 | finished the one instruction that generated the memory access, a debug | ||
| 708 | exception is raised. From here, we simply hide the page again and continue | ||
| 709 | execution, this time with the single-stepping feature turned off. | ||
| 710 | |||
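The hide/show cycle described above can be modelled as a small state machine. This is only a userspace sketch: the type and function names are invented for illustration, and real page-table manipulation and debug exceptions are replaced by plain function calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one tracked page. */
enum page_state { PAGE_HIDDEN, PAGE_SHOWN };

struct page_model {
	enum page_state state;
	int single_step;	/* non-zero while single-stepping is armed */
	unsigned int checks;	/* how many accesses were inspected */
};

/* Page fault on a hidden page: inspect the access, show the page, arm single-step. */
static void model_page_fault(struct page_model *m)
{
	assert(m != NULL && m->state == PAGE_HIDDEN);
	m->checks++;
	m->state = PAGE_SHOWN;
	m->single_step = 1;
}

/* Debug exception after one instruction: hide the page again, disarm single-step. */
static void model_debug_exception(struct page_model *m)
{
	assert(m != NULL && m->single_step);
	m->state = PAGE_HIDDEN;
	m->single_step = 0;
}

/* One memory access to the tracked page, seen from the outside. */
static void model_access(struct page_model *m)
{
	if (m->state == PAGE_HIDDEN)
		model_page_fault(m);
	/* ... the interrupted instruction now runs against the visible page ... */
	if (m->single_step)
		model_debug_exception(m);
}
```

In this model every call to ``model_access()`` goes through both steps, so the page is always hidden again before the next access and no access escapes inspection.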
| 711 | kmemcheck requires some assistance from the memory allocator in order to work. | ||
| 712 | The memory allocator needs to | ||
| 713 | |||
| 714 | 1. Tell kmemcheck about newly allocated pages and pages that are about to | ||
| 715 | be freed. This allows kmemcheck to set up and tear down the shadow memory | ||
| 716 | for the pages in question. The shadow memory stores the status of each | ||
| 717 | byte in the allocation proper, e.g. whether it is initialized or | ||
| 718 | uninitialized. | ||
| 719 | |||
| 720 | 2. Tell kmemcheck which parts of memory should be marked uninitialized. | ||
| 721 | There are actually a few more states, such as "not yet allocated" and | ||
| 722 | "recently freed". | ||
| 723 | |||
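The role of the shadow memory can be sketched with one status byte per data byte. Everything below (the state names, ``struct tracked_region`` and its helpers) is a hypothetical userspace model, not kmemcheck's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-byte states, mirroring the ones named in the text. */
enum shadow_state {
	SHADOW_UNALLOCATED,
	SHADOW_UNINITIALIZED,
	SHADOW_INITIALIZED,
	SHADOW_FREED
};

#define REGION_SIZE 64

struct tracked_region {
	unsigned char data[REGION_SIZE];
	unsigned char shadow[REGION_SIZE];	/* one state byte per data byte */
};

/* Allocation: all bytes exist now, but none has been written. */
static void region_alloc(struct tracked_region *r)
{
	memset(r->shadow, SHADOW_UNINITIALIZED, REGION_SIZE);
}

/* Free: any later access to these bytes is suspicious. */
static void region_free(struct tracked_region *r)
{
	memset(r->shadow, SHADOW_FREED, REGION_SIZE);
}

/* A write stores the data and marks exactly the written bytes initialized. */
static void region_write(struct tracked_region *r, size_t off,
			 const void *src, size_t len)
{
	assert(off <= REGION_SIZE && len <= REGION_SIZE - off);
	memcpy(r->data + off, src, len);
	memset(r->shadow + off, SHADOW_INITIALIZED, len);
}

/* Returns 1 if every byte in the range is initialized, 0 if a report is due. */
static int region_read_ok(const struct tracked_region *r, size_t off, size_t len)
{
	size_t i;

	assert(off <= REGION_SIZE && len <= REGION_SIZE - off);
	for (i = 0; i < len; i++)
		if (r->shadow[off + i] != SHADOW_INITIALIZED)
			return 0;
	return 1;
}
```

Because the state is tracked per byte, a read that covers even one unwritten byte (padding, for instance) fails the check, while reads confined to written bytes pass.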
| 724 | If a slab cache is set up using the SLAB_NOTRACK flag, it will never return | ||
| 725 | memory that can take page faults because of kmemcheck. | ||
| 726 | |||
| 727 | If a slab cache is NOT set up using the SLAB_NOTRACK flag, callers can still | ||
| 728 | request memory with the __GFP_NOTRACK or __GFP_NOTRACK_FALSE_POSITIVE flags. | ||
| 729 | This does not prevent the page faults from occurring, but it marks the | ||
| 730 | object in question as being initialized so that no warnings will ever be | ||
| 731 | produced for this object. | ||
| 732 | |||
| 733 | Currently, the SLAB and SLUB allocators are supported by kmemcheck. | ||
