diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/debugfs-kmemtrace | 71 | ||||
-rw-r--r-- | Documentation/ftrace.txt | 74 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 10 | ||||
-rw-r--r-- | Documentation/sysrq.txt | 2 | ||||
-rw-r--r-- | Documentation/vm/kmemtrace.txt | 126 |
5 files changed, 283 insertions, 0 deletions
diff --git a/Documentation/ABI/testing/debugfs-kmemtrace b/Documentation/ABI/testing/debugfs-kmemtrace new file mode 100644 index 000000000000..5e6a92a02d85 --- /dev/null +++ b/Documentation/ABI/testing/debugfs-kmemtrace | |||
@@ -0,0 +1,71 @@ | |||
1 | What: /sys/kernel/debug/kmemtrace/ | ||
2 | Date: July 2008 | ||
3 | Contact: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> | ||
4 | Description: | ||
5 | |||
6 | In kmemtrace-enabled kernels, the following files are created: | ||
7 | |||
8 | /sys/kernel/debug/kmemtrace/ | ||
9 | cpu<n> (0400) Per-CPU tracing data, see below. (binary) | ||
10 | total_overruns (0400) Total number of bytes which were dropped from | ||
11 | cpu<n> files because of full buffer condition, | ||
12 | non-binary. (text) | ||
13 | abi_version (0400) Kernel's kmemtrace ABI version. (text) | ||
14 | |||
15 | Each per-CPU file should be read according to the relay interface. That is, | ||
16 | the reader should set affinity to that specific CPU and, as currently done by | ||
17 | the userspace application (though there are other methods), use poll() with | ||
18 | an infinite timeout before every read(). Otherwise, erroneous data may be | ||
19 | read. The binary data has the following _core_ format: | ||
20 | |||
21 | Event ID (1 byte) Unsigned integer, one of: | ||
22 | 0 - represents an allocation (KMEMTRACE_EVENT_ALLOC) | ||
23 | 1 - represents a freeing of previously allocated memory | ||
24 | (KMEMTRACE_EVENT_FREE) | ||
25 | Type ID (1 byte) Unsigned integer, one of: | ||
26 | 0 - this is a kmalloc() / kfree() | ||
27 | 1 - this is a kmem_cache_alloc() / kmem_cache_free() | ||
28 | 2 - this is a __get_free_pages() et al. | ||
29 | Event size (2 bytes) Unsigned integer representing the | ||
30 | size of this event. Used to extend | ||
31 | kmemtrace. Discard the bytes you | ||
32 | don't know about. | ||
33 | Sequence number (4 bytes) Signed integer used to reorder data | ||
34 | logged on SMP machines. Wraparound | ||
35 | must be taken into account, although | ||
36 | it is unlikely. | ||
37 | Caller address (8 bytes) Return address to the caller. | ||
38 | Pointer to mem (8 bytes) Pointer to target memory area. Can be | ||
39 | NULL, but not all such calls might be | ||
40 | recorded. | ||
41 | |||
42 | In case of KMEMTRACE_EVENT_ALLOC events, the next fields follow: | ||
43 | |||
44 | Requested bytes (8 bytes) Total number of requested bytes, | ||
45 | unsigned, must not be zero. | ||
46 | Allocated bytes (8 bytes) Total number of actually allocated | ||
47 | bytes, unsigned, must not be lower | ||
48 | than requested bytes. | ||
49 | Requested flags (4 bytes) GFP flags supplied by the caller. | ||
50 | Target CPU (4 bytes) Signed integer, valid for event id 1. | ||
51 | If equal to -1, target CPU is the same | ||
52 | as origin CPU, but the reverse might | ||
53 | not be true. | ||
54 | |||
55 | The data is made available in the same endianness the machine has. | ||
56 | |||
57 | Other event ids and type ids may be defined and added. Other fields may be | ||
58 | added by increasing event size, but see below for details. | ||
59 | Every modification to the ABI, including new id definitions, are followed | ||
60 | by bumping the ABI version by one. | ||
61 | |||
62 | Adding new data to the packet (features) is done at the end of the mandatory | ||
63 | data: | ||
64 | Feature size (2 byte) | ||
65 | Feature ID (1 byte) | ||
66 | Feature data (Feature size - 3 bytes) | ||
67 | |||
68 | |||
69 | Users: | ||
70 | kmemtrace-user - git://repo.or.cz/kmemtrace-user.git | ||
71 | |||
diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt index 803b1318b13d..758fb42a1b68 100644 --- a/Documentation/ftrace.txt +++ b/Documentation/ftrace.txt | |||
@@ -165,6 +165,8 @@ Here is the list of current tracers that may be configured. | |||
165 | nop - This is not a tracer. To remove all tracers from tracing | 165 | nop - This is not a tracer. To remove all tracers from tracing |
166 | simply echo "nop" into current_tracer. | 166 | simply echo "nop" into current_tracer. |
167 | 167 | ||
168 | hw-branch-tracer - traces branches on all cpu's in a circular buffer. | ||
169 | |||
168 | 170 | ||
169 | Examples of using the tracer | 171 | Examples of using the tracer |
170 | ---------------------------- | 172 | ---------------------------- |
@@ -1152,6 +1154,78 @@ int main (int argc, char **argv) | |||
1152 | return 0; | 1154 | return 0; |
1153 | } | 1155 | } |
1154 | 1156 | ||
1157 | |||
1158 | hw-branch-tracer (x86 only) | ||
1159 | --------------------------- | ||
1160 | |||
1161 | This tracer uses the x86 last branch tracing hardware feature to | ||
1162 | collect a branch trace on all cpus with relatively low overhead. | ||
1163 | |||
1164 | The tracer uses a fixed-size circular buffer per cpu and only | ||
1165 | traces ring 0 branches. The trace file dumps that buffer in the | ||
1166 | following format: | ||
1167 | |||
1168 | # tracer: hw-branch-tracer | ||
1169 | # | ||
1170 | # CPU# TO <- FROM | ||
1171 | 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 | ||
1172 | 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a | ||
1173 | 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf | ||
1174 | 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf | ||
1175 | 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a | ||
1176 | 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf | ||
1177 | |||
1178 | |||
1179 | The tracer may be used to dump the trace for the oops'ing cpu on a | ||
1180 | kernel oops into the system log. To enable this, ftrace_dump_on_oops | ||
1181 | must be set. To set ftrace_dump_on_oops, one can either use the sysctl | ||
1182 | function or set it via the proc system interface. | ||
1183 | |||
1184 | sysctl kernel.ftrace_dump_on_oops=1 | ||
1185 | |||
1186 | or | ||
1187 | |||
1188 | echo 1 > /proc/sys/kernel/ftrace_dump_on_oops | ||
1189 | |||
1190 | |||
1191 | Here's an example of such a dump after a null pointer dereference in a | ||
1192 | kernel module: | ||
1193 | |||
1194 | [57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 | ||
1195 | [57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops] | ||
1196 | [57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 | ||
1197 | [57848.106019] Oops: 0002 [#1] SMP | ||
1198 | [57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus | ||
1199 | [57848.106019] Dumping ftrace buffer: | ||
1200 | [57848.106019] --------------------------------- | ||
1201 | [...] | ||
1202 | [57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 | ||
1203 | [57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 | ||
1204 | [57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 | ||
1205 | [57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 | ||
1206 | [57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 | ||
1207 | [57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] | ||
1208 | [57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 | ||
1209 | [57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b | ||
1210 | [57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 | ||
1211 | [57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 | ||
1212 | [57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 | ||
1213 | [...] | ||
1214 | [57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 | ||
1215 | [57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 | ||
1216 | [57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 | ||
1217 | [57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 | ||
1218 | [...] | ||
1219 | [57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 | ||
1220 | [57848.106019] --------------------------------- | ||
1221 | [57848.106019] CPU 0 | ||
1222 | [57848.106019] Modules linked in: oops | ||
1223 | [57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 | ||
1224 | [57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops] | ||
1225 | [57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 | ||
1226 | [...] | ||
1227 | |||
1228 | |||
1155 | dynamic ftrace | 1229 | dynamic ftrace |
1156 | -------------- | 1230 | -------------- |
1157 | 1231 | ||
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index b182626739ea..fc22e9223427 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -49,6 +49,7 @@ parameter is applicable: | |||
49 | ISAPNP ISA PnP code is enabled. | 49 | ISAPNP ISA PnP code is enabled. |
50 | ISDN Appropriate ISDN support is enabled. | 50 | ISDN Appropriate ISDN support is enabled. |
51 | JOY Appropriate joystick support is enabled. | 51 | JOY Appropriate joystick support is enabled. |
52 | KMEMTRACE kmemtrace is enabled. | ||
52 | LIBATA Libata driver is enabled | 53 | LIBATA Libata driver is enabled |
53 | LP Printer support is enabled. | 54 | LP Printer support is enabled. |
54 | LOOP Loopback device support is enabled. | 55 | LOOP Loopback device support is enabled. |
@@ -1045,6 +1046,15 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1045 | use the HighMem zone if it exists, and the Normal | 1046 | use the HighMem zone if it exists, and the Normal |
1046 | zone if it does not. | 1047 | zone if it does not. |
1047 | 1048 | ||
1049 | kmemtrace.enable= [KNL,KMEMTRACE] Format: { yes | no } | ||
1050 | Controls whether kmemtrace is enabled | ||
1051 | at boot-time. | ||
1052 | |||
1053 | kmemtrace.subbufs=n [KNL,KMEMTRACE] Overrides the number of | ||
1054 | subbufs kmemtrace's relay channel has. Set this | ||
1055 | higher than default (KMEMTRACE_N_SUBBUFS in code) if | ||
1056 | you experience buffer overruns. | ||
1057 | |||
1048 | movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter | 1058 | movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter |
1049 | is similar to kernelcore except it specifies the | 1059 | is similar to kernelcore except it specifies the |
1050 | amount of memory used for migratable allocations. | 1060 | amount of memory used for migratable allocations. |
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 9e592c718afb..535aeb936dbc 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt | |||
@@ -113,6 +113,8 @@ On all - write a character to /proc/sysrq-trigger. e.g.: | |||
113 | 113 | ||
114 | 'x' - Used by xmon interface on ppc/powerpc platforms. | 114 | 'x' - Used by xmon interface on ppc/powerpc platforms. |
115 | 115 | ||
116 | 'z' - Dump the ftrace buffer | ||
117 | |||
116 | '0'-'9' - Sets the console log level, controlling which kernel messages | 118 | '0'-'9' - Sets the console log level, controlling which kernel messages |
117 | will be printed to your console. ('0', for example would make | 119 | will be printed to your console. ('0', for example would make |
118 | it so that only emergency messages like PANICs or OOPSes would | 120 | it so that only emergency messages like PANICs or OOPSes would |
diff --git a/Documentation/vm/kmemtrace.txt b/Documentation/vm/kmemtrace.txt new file mode 100644 index 000000000000..a956d9b7f943 --- /dev/null +++ b/Documentation/vm/kmemtrace.txt | |||
@@ -0,0 +1,126 @@ | |||
1 | kmemtrace - Kernel Memory Tracer | ||
2 | |||
3 | by Eduard - Gabriel Munteanu | ||
4 | <eduard.munteanu@linux360.ro> | ||
5 | |||
6 | I. Introduction | ||
7 | =============== | ||
8 | |||
9 | kmemtrace helps kernel developers figure out two things: | ||
10 | 1) how different allocators (SLAB, SLUB etc.) perform | ||
11 | 2) how kernel code allocates memory and how much | ||
12 | |||
13 | To do this, we trace every allocation and export information to the userspace | ||
14 | through the relay interface. We export things such as the number of requested | ||
15 | bytes, the number of bytes actually allocated (i.e. including internal | ||
16 | fragmentation), whether this is a slab allocation or a plain kmalloc() and so | ||
17 | on. | ||
18 | |||
19 | The actual analysis is performed by a userspace tool (see section III for | ||
20 | details on where to get it from). It logs the data exported by the kernel, | ||
21 | processes it and (as of writing this) can provide the following information: | ||
22 | - the total amount of memory allocated and fragmentation per call-site | ||
23 | - the amount of memory allocated and fragmentation per allocation | ||
24 | - total memory allocated and fragmentation in the collected dataset | ||
25 | - number of cross-CPU allocation and frees (makes sense in NUMA environments) | ||
26 | |||
27 | Moreover, it can potentially find inconsistent and erroneous behavior in | ||
28 | kernel code, such as using slab free functions on kmalloc'ed memory or | ||
29 | allocating less memory than requested (but not truly failed allocations). | ||
30 | |||
31 | kmemtrace also makes provisions for tracing on some arch and analysing the | ||
32 | data on another. | ||
33 | |||
34 | II. Design and goals | ||
35 | ==================== | ||
36 | |||
37 | kmemtrace was designed to handle rather large amounts of data. Thus, it uses | ||
38 | the relay interface to export whatever is logged to userspace, which then | ||
39 | stores it. Analysis and reporting is done asynchronously, that is, after the | ||
40 | data is collected and stored. By design, it allows one to log and analyse | ||
41 | on different machines and different arches. | ||
42 | |||
43 | As of writing this, the ABI is not considered stable, though it might not | ||
44 | change much. However, no guarantees are made about compatibility yet. When | ||
45 | deemed stable, the ABI should still allow easy extension while maintaining | ||
46 | backward compatibility. This is described further in Documentation/ABI. | ||
47 | |||
48 | Summary of design goals: | ||
49 | - allow logging and analysis to be done across different machines | ||
50 | - be fast and anticipate usage in high-load environments (*) | ||
51 | - be reasonably extensible | ||
52 | - make it possible for GNU/Linux distributions to have kmemtrace | ||
53 | included in their repositories | ||
54 | |||
55 | (*) - one of the reasons Pekka Enberg's original userspace data analysis | ||
56 | tool's code was rewritten from Perl to C (although this is more than a | ||
57 | simple conversion) | ||
58 | |||
59 | |||
60 | III. Quick usage guide | ||
61 | ====================== | ||
62 | |||
63 | 1) Get a kernel that supports kmemtrace and build it accordingly (i.e. enable | ||
64 | CONFIG_KMEMTRACE). | ||
65 | |||
66 | 2) Get the userspace tool and build it: | ||
67 | $ git-clone git://repo.or.cz/kmemtrace-user.git # current repository | ||
68 | $ cd kmemtrace-user/ | ||
69 | $ ./autogen.sh | ||
70 | $ ./configure | ||
71 | $ make | ||
72 | |||
73 | 3) Boot the kmemtrace-enabled kernel if you haven't, preferably in the | ||
74 | 'single' runlevel (so that relay buffers don't fill up easily), and run | ||
75 | kmemtrace: | ||
76 | # '$' does not mean user, but root here. | ||
77 | $ mount -t debugfs none /sys/kernel/debug | ||
78 | $ mount -t proc none /proc | ||
79 | $ cd path/to/kmemtrace-user/ | ||
80 | $ ./kmemtraced | ||
81 | Wait a bit, then stop it with CTRL+C. | ||
82 | $ cat /sys/kernel/debug/kmemtrace/total_overruns # Check if we didn't | ||
83 | # overrun, should | ||
84 | # be zero. | ||
85 | $ (Optionally) [Run kmemtrace_check separately on each cpu[0-9]*.out file to | ||
86 | check its correctness] | ||
87 | $ ./kmemtrace-report | ||
88 | |||
89 | Now you should have a nice and short summary of how the allocator performs. | ||
90 | |||
91 | IV. FAQ and known issues | ||
92 | ======================== | ||
93 | |||
94 | Q: 'cat /sys/kernel/debug/kmemtrace/total_overruns' is non-zero, how do I fix | ||
95 | this? Should I worry? | ||
96 | A: If it's non-zero, this affects kmemtrace's accuracy, depending on how | ||
97 | large the number is. You can fix it by supplying a higher | ||
98 | 'kmemtrace.subbufs=N' kernel parameter. | ||
99 | --- | ||
100 | |||
101 | Q: kmemtrace_check reports errors, how do I fix this? Should I worry? | ||
102 | A: This is a bug and should be reported. It can occur for a variety of | ||
103 | reasons: | ||
104 | - possible bugs in relay code | ||
105 | - possible misuse of relay by kmemtrace | ||
106 | - timestamps being collected unorderly | ||
107 | Or you may fix it yourself and send us a patch. | ||
108 | --- | ||
109 | |||
110 | Q: kmemtrace_report shows many errors, how do I fix this? Should I worry? | ||
111 | A: This is a known issue and I'm working on it. These might be true errors | ||
112 | in kernel code, which may have inconsistent behavior (e.g. allocating memory | ||
113 | with kmem_cache_alloc() and freeing it with kfree()). Pekka Enberg pointed | ||
114 | out this behavior may work with SLAB, but may fail with other allocators. | ||
115 | |||
116 | It may also be due to lack of tracing in some unusual allocator functions. | ||
117 | |||
118 | We don't want bug reports regarding this issue yet. | ||
119 | --- | ||
120 | |||
121 | V. See also | ||
122 | =========== | ||
123 | |||
124 | Documentation/kernel-parameters.txt | ||
125 | Documentation/ABI/testing/debugfs-kmemtrace | ||
126 | |||