diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/ABI/testing/debugfs-kmemtrace | 71 | ||||
-rw-r--r-- | Documentation/ftrace.txt | 1134 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 10 | ||||
-rw-r--r-- | Documentation/sysrq.txt | 2 | ||||
-rw-r--r-- | Documentation/vm/kmemtrace.txt | 126 |
5 files changed, 978 insertions, 365 deletions
diff --git a/Documentation/ABI/testing/debugfs-kmemtrace b/Documentation/ABI/testing/debugfs-kmemtrace new file mode 100644 index 000000000000..5e6a92a02d85 --- /dev/null +++ b/Documentation/ABI/testing/debugfs-kmemtrace | |||
@@ -0,0 +1,71 @@ | |||
1 | What: /sys/kernel/debug/kmemtrace/ | ||
2 | Date: July 2008 | ||
3 | Contact: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> | ||
4 | Description: | ||
5 | |||
6 | In kmemtrace-enabled kernels, the following files are created: | ||
7 | |||
8 | /sys/kernel/debug/kmemtrace/ | ||
9 | cpu<n> (0400) Per-CPU tracing data, see below. (binary) | ||
10 | total_overruns (0400) Total number of bytes which were dropped from | ||
11 | cpu<n> files because of full buffer condition, | ||
12 | non-binary. (text) | ||
13 | abi_version (0400) Kernel's kmemtrace ABI version. (text) | ||
14 | |||
15 | Each per-CPU file should be read according to the relay interface. That is, | ||
16 | the reader should set affinity to that specific CPU and, as currently done by | ||
17 | the userspace application (though there are other methods), use poll() with | ||
18 | an infinite timeout before every read(). Otherwise, erroneous data may be | ||
19 | read. The binary data has the following _core_ format: | ||
20 | |||
21 | Event ID (1 byte) Unsigned integer, one of: | ||
22 | 0 - represents an allocation (KMEMTRACE_EVENT_ALLOC) | ||
23 | 1 - represents a freeing of previously allocated memory | ||
24 | (KMEMTRACE_EVENT_FREE) | ||
25 | Type ID (1 byte) Unsigned integer, one of: | ||
26 | 0 - this is a kmalloc() / kfree() | ||
27 | 1 - this is a kmem_cache_alloc() / kmem_cache_free() | ||
28 | 2 - this is a __get_free_pages() et al. | ||
29 | Event size (2 bytes) Unsigned integer representing the | ||
30 | size of this event. Used to extend | ||
31 | kmemtrace. Discard the bytes you | ||
32 | don't know about. | ||
33 | Sequence number (4 bytes) Signed integer used to reorder data | ||
34 | logged on SMP machines. Wraparound | ||
35 | must be taken into account, although | ||
36 | it is unlikely. | ||
37 | Caller address (8 bytes) Return address to the caller. | ||
38 | Pointer to mem (8 bytes) Pointer to target memory area. Can be | ||
39 | NULL, but not all such calls might be | ||
40 | recorded. | ||
41 | |||
42 | In case of KMEMTRACE_EVENT_ALLOC events, the next fields follow: | ||
43 | |||
44 | Requested bytes (8 bytes) Total number of requested bytes, | ||
45 | unsigned, must not be zero. | ||
46 | Allocated bytes (8 bytes) Total number of actually allocated | ||
47 | bytes, unsigned, must not be lower | ||
48 | than requested bytes. | ||
49 | Requested flags (4 bytes) GFP flags supplied by the caller. | ||
50 | Target CPU (4 bytes) Signed integer, valid for event id 1. | ||
51 | If equal to -1, target CPU is the same | ||
52 | as origin CPU, but the reverse might | ||
53 | not be true. | ||
54 | |||
55 | The data is made available in the same endianness the machine has. | ||
56 | |||
57 | Other event ids and type ids may be defined and added. Other fields may be | ||
58 | added by increasing event size, but see below for details. | ||
59 | Every modification to the ABI, including new id definitions, are followed | ||
60 | by bumping the ABI version by one. | ||
61 | |||
62 | Adding new data to the packet (features) is done at the end of the mandatory | ||
63 | data: | ||
64 | Feature size (2 byte) | ||
65 | Feature ID (1 byte) | ||
66 | Feature data (Feature size - 3 bytes) | ||
67 | |||
68 | |||
69 | Users: | ||
70 | kmemtrace-user - git://repo.or.cz/kmemtrace-user.git | ||
71 | |||
diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt index 803b1318b13d..2041ee951c1a 100644 --- a/Documentation/ftrace.txt +++ b/Documentation/ftrace.txt | |||
@@ -15,31 +15,31 @@ Introduction | |||
15 | 15 | ||
16 | Ftrace is an internal tracer designed to help out developers and | 16 | Ftrace is an internal tracer designed to help out developers and |
17 | designers of systems to find what is going on inside the kernel. | 17 | designers of systems to find what is going on inside the kernel. |
18 | It can be used for debugging or analyzing latencies and performance | 18 | It can be used for debugging or analyzing latencies and |
19 | issues that take place outside of user-space. | 19 | performance issues that take place outside of user-space. |
20 | 20 | ||
21 | Although ftrace is the function tracer, it also includes an | 21 | Although ftrace is the function tracer, it also includes an |
22 | infrastructure that allows for other types of tracing. Some of the | 22 | infrastructure that allows for other types of tracing. Some of |
23 | tracers that are currently in ftrace include a tracer to trace | 23 | the tracers that are currently in ftrace include a tracer to |
24 | context switches, the time it takes for a high priority task to | 24 | trace context switches, the time it takes for a high priority |
25 | run after it was woken up, the time interrupts are disabled, and | 25 | task to run after it was woken up, the time interrupts are |
26 | more (ftrace allows for tracer plugins, which means that the list of | 26 | disabled, and more (ftrace allows for tracer plugins, which |
27 | tracers can always grow). | 27 | means that the list of tracers can always grow). |
28 | 28 | ||
29 | 29 | ||
30 | The File System | 30 | The File System |
31 | --------------- | 31 | --------------- |
32 | 32 | ||
33 | Ftrace uses the debugfs file system to hold the control files as well | 33 | Ftrace uses the debugfs file system to hold the control files as |
34 | as the files to display output. | 34 | well as the files to display output. |
35 | 35 | ||
36 | To mount the debugfs system: | 36 | To mount the debugfs system: |
37 | 37 | ||
38 | # mkdir /debug | 38 | # mkdir /debug |
39 | # mount -t debugfs nodev /debug | 39 | # mount -t debugfs nodev /debug |
40 | 40 | ||
41 | (Note: it is more common to mount at /sys/kernel/debug, but for simplicity | 41 | ( Note: it is more common to mount at /sys/kernel/debug, but for |
42 | this document will use /debug) | 42 | simplicity this document will use /debug) |
43 | 43 | ||
44 | That's it! (assuming that you have ftrace configured into your kernel) | 44 | That's it! (assuming that you have ftrace configured into your kernel) |
45 | 45 | ||
@@ -50,90 +50,124 @@ of ftrace. Here is a list of some of the key files: | |||
50 | 50 | ||
51 | Note: all time values are in microseconds. | 51 | Note: all time values are in microseconds. |
52 | 52 | ||
53 | current_tracer: This is used to set or display the current tracer | 53 | current_tracer: |
54 | that is configured. | 54 | |
55 | 55 | This is used to set or display the current tracer | |
56 | available_tracers: This holds the different types of tracers that | 56 | that is configured. |
57 | have been compiled into the kernel. The tracers | 57 | |
58 | listed here can be configured by echoing their name | 58 | available_tracers: |
59 | into current_tracer. | 59 | |
60 | 60 | This holds the different types of tracers that | |
61 | tracing_enabled: This sets or displays whether the current_tracer | 61 | have been compiled into the kernel. The |
62 | is activated and tracing or not. Echo 0 into this | 62 | tracers listed here can be configured by |
63 | file to disable the tracer or 1 to enable it. | 63 | echoing their name into current_tracer. |
64 | 64 | ||
65 | trace: This file holds the output of the trace in a human readable | 65 | tracing_enabled: |
66 | format (described below). | 66 | |
67 | 67 | This sets or displays whether the current_tracer | |
68 | latency_trace: This file shows the same trace but the information | 68 | is activated and tracing or not. Echo 0 into this |
69 | is organized more to display possible latencies | 69 | file to disable the tracer or 1 to enable it. |
70 | in the system (described below). | 70 | |
71 | 71 | trace: | |
72 | trace_pipe: The output is the same as the "trace" file but this | 72 | |
73 | file is meant to be streamed with live tracing. | 73 | This file holds the output of the trace in a human |
74 | Reads from this file will block until new data | 74 | readable format (described below). |
75 | is retrieved. Unlike the "trace" and "latency_trace" | 75 | |
76 | files, this file is a consumer. This means reading | 76 | latency_trace: |
77 | from this file causes sequential reads to display | 77 | |
78 | more current data. Once data is read from this | 78 | This file shows the same trace but the information |
79 | file, it is consumed, and will not be read | 79 | is organized more to display possible latencies |
80 | again with a sequential read. The "trace" and | 80 | in the system (described below). |
81 | "latency_trace" files are static, and if the | 81 | |
82 | tracer is not adding more data, they will display | 82 | trace_pipe: |
83 | the same information every time they are read. | 83 | |
84 | 84 | The output is the same as the "trace" file but this | |
85 | trace_options: This file lets the user control the amount of data | 85 | file is meant to be streamed with live tracing. |
86 | that is displayed in one of the above output | 86 | Reads from this file will block until new data |
87 | files. | 87 | is retrieved. Unlike the "trace" and "latency_trace" |
88 | 88 | files, this file is a consumer. This means reading | |
89 | trace_max_latency: Some of the tracers record the max latency. | 89 | from this file causes sequential reads to display |
90 | For example, the time interrupts are disabled. | 90 | more current data. Once data is read from this |
91 | This time is saved in this file. The max trace | 91 | file, it is consumed, and will not be read |
92 | will also be stored, and displayed by either | 92 | again with a sequential read. The "trace" and |
93 | "trace" or "latency_trace". A new max trace will | 93 | "latency_trace" files are static, and if the |
94 | only be recorded if the latency is greater than | 94 | tracer is not adding more data, they will display |
95 | the value in this file. (in microseconds) | 95 | the same information every time they are read. |
96 | 96 | ||
97 | buffer_size_kb: This sets or displays the number of kilobytes each CPU | 97 | trace_options: |
98 | buffer can hold. The tracer buffers are the same size | 98 | |
99 | for each CPU. The displayed number is the size of the | 99 | This file lets the user control the amount of data |
100 | CPU buffer and not total size of all buffers. The | 100 | that is displayed in one of the above output |
101 | trace buffers are allocated in pages (blocks of memory | 101 | files. |
102 | that the kernel uses for allocation, usually 4 KB in size). | 102 | |
103 | If the last page allocated has room for more bytes | 103 | trace_max_latency: |
104 | than requested, the rest of the page will be used, | 104 | |
105 | making the actual allocation bigger than requested. | 105 | Some of the tracers record the max latency. |
106 | (Note, the size may not be a multiple of the page size due | 106 | For example, the time interrupts are disabled. |
107 | to buffer managment overhead.) | 107 | This time is saved in this file. The max trace |
108 | 108 | will also be stored, and displayed by either | |
109 | This can only be updated when the current_tracer | 109 | "trace" or "latency_trace". A new max trace will |
110 | is set to "nop". | 110 | only be recorded if the latency is greater than |
111 | 111 | the value in this file. (in microseconds) | |
112 | tracing_cpumask: This is a mask that lets the user only trace | 112 | |
113 | on specified CPUS. The format is a hex string | 113 | buffer_size_kb: |
114 | representing the CPUS. | 114 | |
115 | 115 | This sets or displays the number of kilobytes each CPU | |
116 | set_ftrace_filter: When dynamic ftrace is configured in (see the | 116 | buffer can hold. The tracer buffers are the same size |
117 | section below "dynamic ftrace"), the code is dynamically | 117 | for each CPU. The displayed number is the size of the |
118 | modified (code text rewrite) to disable calling of the | 118 | CPU buffer and not total size of all buffers. The |
119 | function profiler (mcount). This lets tracing be configured | 119 | trace buffers are allocated in pages (blocks of memory |
120 | in with practically no overhead in performance. This also | 120 | that the kernel uses for allocation, usually 4 KB in size). |
121 | has a side effect of enabling or disabling specific functions | 121 | If the last page allocated has room for more bytes |
122 | to be traced. Echoing names of functions into this file | 122 | than requested, the rest of the page will be used, |
123 | will limit the trace to only those functions. | 123 | making the actual allocation bigger than requested. |
124 | 124 | ( Note, the size may not be a multiple of the page size | |
125 | set_ftrace_notrace: This has an effect opposite to that of | 125 | due to buffer managment overhead. ) |
126 | set_ftrace_filter. Any function that is added here will not | 126 | |
127 | be traced. If a function exists in both set_ftrace_filter | 127 | This can only be updated when the current_tracer |
128 | and set_ftrace_notrace, the function will _not_ be traced. | 128 | is set to "nop". |
129 | 129 | ||
130 | set_ftrace_pid: Have the function tracer only trace a single thread. | 130 | tracing_cpumask: |
131 | 131 | ||
132 | available_filter_functions: This lists the functions that ftrace | 132 | This is a mask that lets the user only trace |
133 | has processed and can trace. These are the function | 133 | on specified CPUS. The format is a hex string |
134 | names that you can pass to "set_ftrace_filter" or | 134 | representing the CPUS. |
135 | "set_ftrace_notrace". (See the section "dynamic ftrace" | 135 | |
136 | below for more details.) | 136 | set_ftrace_filter: |
137 | |||
138 | When dynamic ftrace is configured in (see the | ||
139 | section below "dynamic ftrace"), the code is dynamically | ||
140 | modified (code text rewrite) to disable calling of the | ||
141 | function profiler (mcount). This lets tracing be configured | ||
142 | in with practically no overhead in performance. This also | ||
143 | has a side effect of enabling or disabling specific functions | ||
144 | to be traced. Echoing names of functions into this file | ||
145 | will limit the trace to only those functions. | ||
146 | |||
147 | set_ftrace_notrace: | ||
148 | |||
149 | This has an effect opposite to that of | ||
150 | set_ftrace_filter. Any function that is added here will not | ||
151 | be traced. If a function exists in both set_ftrace_filter | ||
152 | and set_ftrace_notrace, the function will _not_ be traced. | ||
153 | |||
154 | set_ftrace_pid: | ||
155 | |||
156 | Have the function tracer only trace a single thread. | ||
157 | |||
158 | set_graph_function: | ||
159 | |||
160 | Set a "trigger" function where tracing should start | ||
161 | with the function graph tracer (See the section | ||
162 | "dynamic ftrace" for more details). | ||
163 | |||
164 | available_filter_functions: | ||
165 | |||
166 | This lists the functions that ftrace | ||
167 | has processed and can trace. These are the function | ||
168 | names that you can pass to "set_ftrace_filter" or | ||
169 | "set_ftrace_notrace". (See the section "dynamic ftrace" | ||
170 | below for more details.) | ||
137 | 171 | ||
138 | 172 | ||
139 | The Tracers | 173 | The Tracers |
@@ -141,36 +175,66 @@ The Tracers | |||
141 | 175 | ||
142 | Here is the list of current tracers that may be configured. | 176 | Here is the list of current tracers that may be configured. |
143 | 177 | ||
144 | function - function tracer that uses mcount to trace all functions. | 178 | "function" |
179 | |||
180 | Function call tracer to trace all kernel functions. | ||
181 | |||
182 | "function_graph_tracer" | ||
183 | |||
184 | Similar to the function tracer except that the | ||
185 | function tracer probes the functions on their entry | ||
186 | whereas the function graph tracer traces on both entry | ||
187 | and exit of the functions. It then provides the ability | ||
188 | to draw a graph of function calls similar to C code | ||
189 | source. | ||
145 | 190 | ||
146 | sched_switch - traces the context switches between tasks. | 191 | "sched_switch" |
147 | 192 | ||
148 | irqsoff - traces the areas that disable interrupts and saves | 193 | Traces the context switches and wakeups between tasks. |
149 | the trace with the longest max latency. | ||
150 | See tracing_max_latency. When a new max is recorded, | ||
151 | it replaces the old trace. It is best to view this | ||
152 | trace via the latency_trace file. | ||
153 | 194 | ||
154 | preemptoff - Similar to irqsoff but traces and records the amount of | 195 | "irqsoff" |
155 | time for which preemption is disabled. | ||
156 | 196 | ||
157 | preemptirqsoff - Similar to irqsoff and preemptoff, but traces and | 197 | Traces the areas that disable interrupts and saves |
158 | records the largest time for which irqs and/or preemption | 198 | the trace with the longest max latency. |
159 | is disabled. | 199 | See tracing_max_latency. When a new max is recorded, |
200 | it replaces the old trace. It is best to view this | ||
201 | trace via the latency_trace file. | ||
160 | 202 | ||
161 | wakeup - Traces and records the max latency that it takes for | 203 | "preemptoff" |
162 | the highest priority task to get scheduled after | ||
163 | it has been woken up. | ||
164 | 204 | ||
165 | nop - This is not a tracer. To remove all tracers from tracing | 205 | Similar to irqsoff but traces and records the amount of |
166 | simply echo "nop" into current_tracer. | 206 | time for which preemption is disabled. |
207 | |||
208 | "preemptirqsoff" | ||
209 | |||
210 | Similar to irqsoff and preemptoff, but traces and | ||
211 | records the largest time for which irqs and/or preemption | ||
212 | is disabled. | ||
213 | |||
214 | "wakeup" | ||
215 | |||
216 | Traces and records the max latency that it takes for | ||
217 | the highest priority task to get scheduled after | ||
218 | it has been woken up. | ||
219 | |||
220 | "hw-branch-tracer" | ||
221 | |||
222 | Uses the BTS CPU feature on x86 CPUs to traces all | ||
223 | branches executed. | ||
224 | |||
225 | "nop" | ||
226 | |||
227 | This is the "trace nothing" tracer. To remove all | ||
228 | tracers from tracing simply echo "nop" into | ||
229 | current_tracer. | ||
167 | 230 | ||
168 | 231 | ||
169 | Examples of using the tracer | 232 | Examples of using the tracer |
170 | ---------------------------- | 233 | ---------------------------- |
171 | 234 | ||
172 | Here are typical examples of using the tracers when controlling them only | 235 | Here are typical examples of using the tracers when controlling |
173 | with the debugfs interface (without using any user-land utilities). | 236 | them only with the debugfs interface (without using any |
237 | user-land utilities). | ||
174 | 238 | ||
175 | Output format: | 239 | Output format: |
176 | -------------- | 240 | -------------- |
@@ -187,16 +251,16 @@ Here is an example of the output format of the file "trace" | |||
187 | bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput | 251 | bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput |
188 | -------- | 252 | -------- |
189 | 253 | ||
190 | A header is printed with the tracer name that is represented by the trace. | 254 | A header is printed with the tracer name that is represented by |
191 | In this case the tracer is "function". Then a header showing the format. Task | 255 | the trace. In this case the tracer is "function". Then a header |
192 | name "bash", the task PID "4251", the CPU that it was running on | 256 | showing the format. Task name "bash", the task PID "4251", the |
193 | "01", the timestamp in <secs>.<usecs> format, the function name that was | 257 | CPU that it was running on "01", the timestamp in <secs>.<usecs> |
194 | traced "path_put" and the parent function that called this function | 258 | format, the function name that was traced "path_put" and the |
195 | "path_walk". The timestamp is the time at which the function was | 259 | parent function that called this function "path_walk". The |
196 | entered. | 260 | timestamp is the time at which the function was entered. |
197 | 261 | ||
198 | The sched_switch tracer also includes tracing of task wakeups and | 262 | The sched_switch tracer also includes tracing of task wakeups |
199 | context switches. | 263 | and context switches. |
200 | 264 | ||
201 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S | 265 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S |
202 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S | 266 | ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S |
@@ -205,8 +269,8 @@ context switches. | |||
205 | kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R | 269 | kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R |
206 | ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R | 270 | ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R |
207 | 271 | ||
208 | Wake ups are represented by a "+" and the context switches are shown as | 272 | Wake ups are represented by a "+" and the context switches are |
209 | "==>". The format is: | 273 | shown as "==>". The format is: |
210 | 274 | ||
211 | Context switches: | 275 | Context switches: |
212 | 276 | ||
@@ -220,19 +284,20 @@ Wake ups are represented by a "+" and the context switches are shown as | |||
220 | 284 | ||
221 | <pid>:<prio>:<state> + <pid>:<prio>:<state> | 285 | <pid>:<prio>:<state> + <pid>:<prio>:<state> |
222 | 286 | ||
223 | The prio is the internal kernel priority, which is the inverse of the | 287 | The prio is the internal kernel priority, which is the inverse |
224 | priority that is usually displayed by user-space tools. Zero represents | 288 | of the priority that is usually displayed by user-space tools. |
225 | the highest priority (99). Prio 100 starts the "nice" priorities with | 289 | Zero represents the highest priority (99). Prio 100 starts the |
226 | 100 being equal to nice -20 and 139 being nice 19. The prio "140" is | 290 | "nice" priorities with 100 being equal to nice -20 and 139 being |
227 | reserved for the idle task which is the lowest priority thread (pid 0). | 291 | nice 19. The prio "140" is reserved for the idle task which is |
292 | the lowest priority thread (pid 0). | ||
228 | 293 | ||
229 | 294 | ||
230 | Latency trace format | 295 | Latency trace format |
231 | -------------------- | 296 | -------------------- |
232 | 297 | ||
233 | For traces that display latency times, the latency_trace file gives | 298 | For traces that display latency times, the latency_trace file |
234 | somewhat more information to see why a latency happened. Here is a typical | 299 | gives somewhat more information to see why a latency happened. |
235 | trace. | 300 | Here is a typical trace. |
236 | 301 | ||
237 | # tracer: irqsoff | 302 | # tracer: irqsoff |
238 | # | 303 | # |
@@ -259,20 +324,20 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8 | |||
259 | <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) | 324 | <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) |
260 | 325 | ||
261 | 326 | ||
327 | This shows that the current tracer is "irqsoff" tracing the time | ||
328 | for which interrupts were disabled. It gives the trace version | ||
329 | and the version of the kernel upon which this was executed on | ||
330 | (2.6.26-rc8). Then it displays the max latency in microsecs (97 | ||
331 | us). The number of trace entries displayed and the total number | ||
332 | recorded (both are three: #3/3). The type of preemption that was | ||
333 | used (PREEMPT). VP, KP, SP, and HP are always zero and are | ||
334 | reserved for later use. #P is the number of online CPUS (#P:2). | ||
262 | 335 | ||
263 | This shows that the current tracer is "irqsoff" tracing the time for which | 336 | The task is the process that was running when the latency |
264 | interrupts were disabled. It gives the trace version and the version | 337 | occurred. (swapper pid: 0). |
265 | of the kernel upon which this was executed on (2.6.26-rc8). Then it displays | ||
266 | the max latency in microsecs (97 us). The number of trace entries displayed | ||
267 | and the total number recorded (both are three: #3/3). The type of | ||
268 | preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero | ||
269 | and are reserved for later use. #P is the number of online CPUS (#P:2). | ||
270 | |||
271 | The task is the process that was running when the latency occurred. | ||
272 | (swapper pid: 0). | ||
273 | 338 | ||
274 | The start and stop (the functions in which the interrupts were disabled and | 339 | The start and stop (the functions in which the interrupts were |
275 | enabled respectively) that caused the latencies: | 340 | disabled and enabled respectively) that caused the latencies: |
276 | 341 | ||
277 | apic_timer_interrupt is where the interrupts were disabled. | 342 | apic_timer_interrupt is where the interrupts were disabled. |
278 | do_softirq is where they were enabled again. | 343 | do_softirq is where they were enabled again. |
@@ -308,12 +373,12 @@ The above is mostly meaningful for kernel developers. | |||
308 | latency_trace file is relative to the start of the trace. | 373 | latency_trace file is relative to the start of the trace. |
309 | 374 | ||
310 | delay: This is just to help catch your eye a bit better. And | 375 | delay: This is just to help catch your eye a bit better. And |
311 | needs to be fixed to be only relative to the same CPU. | 376 | needs to be fixed to be only relative to the same CPU. |
312 | The marks are determined by the difference between this | 377 | The marks are determined by the difference between this |
313 | current trace and the next trace. | 378 | current trace and the next trace. |
314 | '!' - greater than preempt_mark_thresh (default 100) | 379 | '!' - greater than preempt_mark_thresh (default 100) |
315 | '+' - greater than 1 microsecond | 380 | '+' - greater than 1 microsecond |
316 | ' ' - less than or equal to 1 microsecond. | 381 | ' ' - less than or equal to 1 microsecond. |
317 | 382 | ||
318 | The rest is the same as the 'trace' file. | 383 | The rest is the same as the 'trace' file. |
319 | 384 | ||
@@ -321,14 +386,15 @@ The above is mostly meaningful for kernel developers. | |||
321 | trace_options | 386 | trace_options |
322 | ------------- | 387 | ------------- |
323 | 388 | ||
324 | The trace_options file is used to control what gets printed in the trace | 389 | The trace_options file is used to control what gets printed in |
325 | output. To see what is available, simply cat the file: | 390 | the trace output. To see what is available, simply cat the file: |
326 | 391 | ||
327 | cat /debug/tracing/trace_options | 392 | cat /debug/tracing/trace_options |
328 | print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ | 393 | print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ |
329 | noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj | 394 | noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj |
330 | 395 | ||
331 | To disable one of the options, echo in the option prepended with "no". | 396 | To disable one of the options, echo in the option prepended with |
397 | "no". | ||
332 | 398 | ||
333 | echo noprint-parent > /debug/tracing/trace_options | 399 | echo noprint-parent > /debug/tracing/trace_options |
334 | 400 | ||
@@ -338,8 +404,8 @@ To enable an option, leave off the "no". | |||
338 | 404 | ||
339 | Here are the available options: | 405 | Here are the available options: |
340 | 406 | ||
341 | print-parent - On function traces, display the calling function | 407 | print-parent - On function traces, display the calling (parent) |
342 | as well as the function being traced. | 408 | function as well as the function being traced. |
343 | 409 | ||
344 | print-parent: | 410 | print-parent: |
345 | bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul | 411 | bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul |
@@ -348,15 +414,16 @@ Here are the available options: | |||
348 | bash-4000 [01] 1477.606694: simple_strtoul | 414 | bash-4000 [01] 1477.606694: simple_strtoul |
349 | 415 | ||
350 | 416 | ||
351 | sym-offset - Display not only the function name, but also the offset | 417 | sym-offset - Display not only the function name, but also the |
352 | in the function. For example, instead of seeing just | 418 | offset in the function. For example, instead of |
353 | "ktime_get", you will see "ktime_get+0xb/0x20". | 419 | seeing just "ktime_get", you will see |
420 | "ktime_get+0xb/0x20". | ||
354 | 421 | ||
355 | sym-offset: | 422 | sym-offset: |
356 | bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 | 423 | bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 |
357 | 424 | ||
358 | sym-addr - this will also display the function address as well as | 425 | sym-addr - this will also display the function address as well |
359 | the function name. | 426 | as the function name. |
360 | 427 | ||
361 | sym-addr: | 428 | sym-addr: |
362 | bash-4000 [01] 1477.606694: simple_strtoul <c0339346> | 429 | bash-4000 [01] 1477.606694: simple_strtoul <c0339346> |
@@ -366,35 +433,41 @@ Here are the available options: | |||
366 | bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ | 433 | bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ |
367 | (+0.000ms): simple_strtoul (strict_strtoul) | 434 | (+0.000ms): simple_strtoul (strict_strtoul) |
368 | 435 | ||
369 | raw - This will display raw numbers. This option is best for use with | 436 | raw - This will display raw numbers. This option is best for |
370 | user applications that can translate the raw numbers better than | 437 | use with user applications that can translate the raw |
371 | having it done in the kernel. | 438 | numbers better than having it done in the kernel. |
372 | 439 | ||
373 | hex - Similar to raw, but the numbers will be in a hexadecimal format. | 440 | hex - Similar to raw, but the numbers will be in a hexadecimal |
441 | format. | ||
374 | 442 | ||
375 | bin - This will print out the formats in raw binary. | 443 | bin - This will print out the formats in raw binary. |
376 | 444 | ||
377 | block - TBD (needs update) | 445 | block - TBD (needs update) |
378 | 446 | ||
379 | stacktrace - This is one of the options that changes the trace itself. | 447 | stacktrace - This is one of the options that changes the trace |
380 | When a trace is recorded, so is the stack of functions. | 448 | itself. When a trace is recorded, so is the stack |
381 | This allows for back traces of trace sites. | 449 | of functions. This allows for back traces of |
450 | trace sites. | ||
382 | 451 | ||
383 | userstacktrace - This option changes the trace. | 452 | userstacktrace - This option changes the trace. It records a |
384 | It records a stacktrace of the current userspace thread. | 453 | stacktrace of the current userspace thread. |
385 | 454 | ||
386 | sym-userobj - when user stacktrace are enabled, look up which object the | 455 | sym-userobj - when user stacktrace are enabled, look up which |
387 | address belongs to, and print a relative address | 456 | object the address belongs to, and print a |
388 | This is especially useful when ASLR is on, otherwise you don't | 457 | relative address. This is especially useful when |
389 | get a chance to resolve the address to object/file/line after the app is no | 458 | ASLR is on, otherwise you don't get a chance to |
390 | longer running | 459 | resolve the address to object/file/line after |
460 | the app is no longer running | ||
391 | 461 | ||
392 | The lookup is performed when you read trace,trace_pipe,latency_trace. Example: | 462 | The lookup is performed when you read |
463 | trace,trace_pipe,latency_trace. Example: | ||
393 | 464 | ||
394 | a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 | 465 | a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 |
395 | x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] | 466 | x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] |
396 | 467 | ||
397 | sched-tree - TBD (any users??) | 468 | sched-tree - trace all tasks that are on the runqueue, at |
469 | every scheduling event. Will add overhead if | ||
470 | there's a lot of tasks running at once. | ||
398 | 471 | ||
399 | 472 | ||
400 | sched_switch | 473 | sched_switch |
@@ -431,18 +504,19 @@ of how to use it. | |||
431 | [...] | 504 | [...] |
432 | 505 | ||
433 | 506 | ||
434 | As we have discussed previously about this format, the header shows | 507 | As we have discussed previously about this format, the header |
435 | the name of the trace and points to the options. The "FUNCTION" | 508 | shows the name of the trace and points to the options. The |
436 | is a misnomer since here it represents the wake ups and context | 509 | "FUNCTION" is a misnomer since here it represents the wake ups |
437 | switches. | 510 | and context switches. |
438 | 511 | ||
439 | The sched_switch file only lists the wake ups (represented with '+') | 512 | The sched_switch file only lists the wake ups (represented with |
440 | and context switches ('==>') with the previous task or current task | 513 | '+') and context switches ('==>') with the previous task or |
441 | first followed by the next task or task waking up. The format for both | 514 | current task first followed by the next task or task waking up. |
442 | of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO | 515 | The format for both of these is PID:KERNEL-PRIO:TASK-STATE. |
443 | is the inverse of the actual priority with zero (0) being the highest | 516 | Remember that the KERNEL-PRIO is the inverse of the actual |
444 | priority and the nice values starting at 100 (nice -20). Below is | 517 | priority with zero (0) being the highest priority and the nice |
445 | a quick chart to map the kernel priority to user land priorities. | 518 | values starting at 100 (nice -20). Below is a quick chart to map |
519 | the kernel priority to user land priorities. | ||
446 | 520 | ||
447 | Kernel priority: 0 to 99 ==> user RT priority 99 to 0 | 521 | Kernel priority: 0 to 99 ==> user RT priority 99 to 0 |
448 | Kernel priority: 100 to 139 ==> user nice -20 to 19 | 522 | Kernel priority: 100 to 139 ==> user nice -20 to 19 |
@@ -463,10 +537,10 @@ The task states are: | |||
463 | ftrace_enabled | 537 | ftrace_enabled |
464 | -------------- | 538 | -------------- |
465 | 539 | ||
466 | The following tracers (listed below) give different output depending | 540 | The following tracers (listed below) give different output |
467 | on whether or not the sysctl ftrace_enabled is set. To set ftrace_enabled, | 541 | depending on whether or not the sysctl ftrace_enabled is set. To |
468 | one can either use the sysctl function or set it via the proc | 542 | set ftrace_enabled, one can either use the sysctl function or |
469 | file system interface. | 543 | set it via the proc file system interface. |
470 | 544 | ||
471 | sysctl kernel.ftrace_enabled=1 | 545 | sysctl kernel.ftrace_enabled=1 |
472 | 546 | ||
@@ -474,12 +548,12 @@ file system interface. | |||
474 | 548 | ||
475 | echo 1 > /proc/sys/kernel/ftrace_enabled | 549 | echo 1 > /proc/sys/kernel/ftrace_enabled |
476 | 550 | ||
477 | To disable ftrace_enabled simply replace the '1' with '0' in | 551 | To disable ftrace_enabled simply replace the '1' with '0' in the |
478 | the above commands. | 552 | above commands. |
479 | 553 | ||
480 | When ftrace_enabled is set the tracers will also record the functions | 554 | When ftrace_enabled is set the tracers will also record the |
481 | that are within the trace. The descriptions of the tracers | 555 | functions that are within the trace. The descriptions of the |
482 | will also show an example with ftrace enabled. | 556 | tracers will also show an example with ftrace enabled. |
483 | 557 | ||
484 | 558 | ||
485 | irqsoff | 559 | irqsoff |
@@ -487,17 +561,18 @@ irqsoff | |||
487 | 561 | ||
488 | When interrupts are disabled, the CPU can not react to any other | 562 | When interrupts are disabled, the CPU can not react to any other |
489 | external event (besides NMIs and SMIs). This prevents the timer | 563 | external event (besides NMIs and SMIs). This prevents the timer |
490 | interrupt from triggering or the mouse interrupt from letting the | 564 | interrupt from triggering or the mouse interrupt from letting |
491 | kernel know of a new mouse event. The result is a latency with the | 565 | the kernel know of a new mouse event. The result is a latency |
492 | reaction time. | 566 | with the reaction time. |
493 | 567 | ||
494 | The irqsoff tracer tracks the time for which interrupts are disabled. | 568 | The irqsoff tracer tracks the time for which interrupts are |
495 | When a new maximum latency is hit, the tracer saves the trace leading up | 569 | disabled. When a new maximum latency is hit, the tracer saves |
496 | to that latency point so that every time a new maximum is reached, the old | 570 | the trace leading up to that latency point so that every time a |
497 | saved trace is discarded and the new trace is saved. | 571 | new maximum is reached, the old saved trace is discarded and the |
572 | new trace is saved. | ||
498 | 573 | ||
499 | To reset the maximum, echo 0 into tracing_max_latency. Here is an | 574 | To reset the maximum, echo 0 into tracing_max_latency. Here is |
500 | example: | 575 | an example: |
501 | 576 | ||
502 | # echo irqsoff > /debug/tracing/current_tracer | 577 | # echo irqsoff > /debug/tracing/current_tracer |
503 | # echo 0 > /debug/tracing/tracing_max_latency | 578 | # echo 0 > /debug/tracing/tracing_max_latency |
@@ -532,10 +607,11 @@ irqsoff latency trace v1.1.5 on 2.6.26 | |||
532 | 607 | ||
533 | 608 | ||
534 | Here we see that that we had a latency of 12 microsecs (which is | 609 | Here we see that that we had a latency of 12 microsecs (which is |
535 | very good). The _write_lock_irq in sys_setpgid disabled interrupts. | 610 | very good). The _write_lock_irq in sys_setpgid disabled |
536 | The difference between the 12 and the displayed timestamp 14us occurred | 611 | interrupts. The difference between the 12 and the displayed |
537 | because the clock was incremented between the time of recording the max | 612 | timestamp 14us occurred because the clock was incremented |
538 | latency and the time of recording the function that had that latency. | 613 | between the time of recording the max latency and the time of |
614 | recording the function that had that latency. | ||
539 | 615 | ||
540 | Note the above example had ftrace_enabled not set. If we set the | 616 | Note the above example had ftrace_enabled not set. If we set the |
541 | ftrace_enabled, we get a much larger output: | 617 | ftrace_enabled, we get a much larger output: |
@@ -586,24 +662,24 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8 | |||
586 | 662 | ||
587 | 663 | ||
588 | Here we traced a 50 microsecond latency. But we also see all the | 664 | Here we traced a 50 microsecond latency. But we also see all the |
589 | functions that were called during that time. Note that by enabling | 665 | functions that were called during that time. Note that by |
590 | function tracing, we incur an added overhead. This overhead may | 666 | enabling function tracing, we incur an added overhead. This |
591 | extend the latency times. But nevertheless, this trace has provided | 667 | overhead may extend the latency times. But nevertheless, this |
592 | some very helpful debugging information. | 668 | trace has provided some very helpful debugging information. |
593 | 669 | ||
594 | 670 | ||
595 | preemptoff | 671 | preemptoff |
596 | ---------- | 672 | ---------- |
597 | 673 | ||
598 | When preemption is disabled, we may be able to receive interrupts but | 674 | When preemption is disabled, we may be able to receive |
599 | the task cannot be preempted and a higher priority task must wait | 675 | interrupts but the task cannot be preempted and a higher |
600 | for preemption to be enabled again before it can preempt a lower | 676 | priority task must wait for preemption to be enabled again |
601 | priority task. | 677 | before it can preempt a lower priority task. |
602 | 678 | ||
603 | The preemptoff tracer traces the places that disable preemption. | 679 | The preemptoff tracer traces the places that disable preemption. |
604 | Like the irqsoff tracer, it records the maximum latency for which preemption | 680 | Like the irqsoff tracer, it records the maximum latency for |
605 | was disabled. The control of preemptoff tracer is much like the irqsoff | 681 | which preemption was disabled. The control of preemptoff tracer |
606 | tracer. | 682 | is much like the irqsoff tracer. |
607 | 683 | ||
608 | # echo preemptoff > /debug/tracing/current_tracer | 684 | # echo preemptoff > /debug/tracing/current_tracer |
609 | # echo 0 > /debug/tracing/tracing_max_latency | 685 | # echo 0 > /debug/tracing/tracing_max_latency |
@@ -637,11 +713,12 @@ preemptoff latency trace v1.1.5 on 2.6.26-rc8 | |||
637 | sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) | 713 | sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) |
638 | 714 | ||
639 | 715 | ||
640 | This has some more changes. Preemption was disabled when an interrupt | 716 | This has some more changes. Preemption was disabled when an |
641 | came in (notice the 'h'), and was enabled while doing a softirq. | 717 | interrupt came in (notice the 'h'), and was enabled while doing |
642 | (notice the 's'). But we also see that interrupts have been disabled | 718 | a softirq. (notice the 's'). But we also see that interrupts |
643 | when entering the preempt off section and leaving it (the 'd'). | 719 | have been disabled when entering the preempt off section and |
644 | We do not know if interrupts were enabled in the mean time. | 720 | leaving it (the 'd'). We do not know if interrupts were enabled |
721 | in the mean time. | ||
645 | 722 | ||
646 | # tracer: preemptoff | 723 | # tracer: preemptoff |
647 | # | 724 | # |
@@ -700,28 +777,30 @@ preemptoff latency trace v1.1.5 on 2.6.26-rc8 | |||
700 | sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) | 777 | sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) |
701 | 778 | ||
702 | 779 | ||
703 | The above is an example of the preemptoff trace with ftrace_enabled | 780 | The above is an example of the preemptoff trace with |
704 | set. Here we see that interrupts were disabled the entire time. | 781 | ftrace_enabled set. Here we see that interrupts were disabled |
705 | The irq_enter code lets us know that we entered an interrupt 'h'. | 782 | the entire time. The irq_enter code lets us know that we entered |
706 | Before that, the functions being traced still show that it is not | 783 | an interrupt 'h'. Before that, the functions being traced still |
707 | in an interrupt, but we can see from the functions themselves that | 784 | show that it is not in an interrupt, but we can see from the |
708 | this is not the case. | 785 | functions themselves that this is not the case. |
709 | 786 | ||
710 | Notice that __do_softirq when called does not have a preempt_count. | 787 | Notice that __do_softirq when called does not have a |
711 | It may seem that we missed a preempt enabling. What really happened | 788 | preempt_count. It may seem that we missed a preempt enabling. |
712 | is that the preempt count is held on the thread's stack and we | 789 | What really happened is that the preempt count is held on the |
713 | switched to the softirq stack (4K stacks in effect). The code | 790 | thread's stack and we switched to the softirq stack (4K stacks |
714 | does not copy the preempt count, but because interrupts are disabled, | 791 | in effect). The code does not copy the preempt count, but |
715 | we do not need to worry about it. Having a tracer like this is good | 792 | because interrupts are disabled, we do not need to worry about |
716 | for letting people know what really happens inside the kernel. | 793 | it. Having a tracer like this is good for letting people know |
794 | what really happens inside the kernel. | ||
717 | 795 | ||
718 | 796 | ||
719 | preemptirqsoff | 797 | preemptirqsoff |
720 | -------------- | 798 | -------------- |
721 | 799 | ||
722 | Knowing the locations that have interrupts disabled or preemption | 800 | Knowing the locations that have interrupts disabled or |
723 | disabled for the longest times is helpful. But sometimes we would | 801 | preemption disabled for the longest times is helpful. But |
724 | like to know when either preemption and/or interrupts are disabled. | 802 | sometimes we would like to know when either preemption and/or |
803 | interrupts are disabled. | ||
725 | 804 | ||
726 | Consider the following code: | 805 | Consider the following code: |
727 | 806 | ||
@@ -741,11 +820,13 @@ The preemptoff tracer will record the total length of | |||
741 | call_function_with_irqs_and_preemption_off() and | 820 | call_function_with_irqs_and_preemption_off() and |
742 | call_function_with_preemption_off(). | 821 | call_function_with_preemption_off(). |
743 | 822 | ||
744 | But neither will trace the time that interrupts and/or preemption | 823 | But neither will trace the time that interrupts and/or |
745 | is disabled. This total time is the time that we can not schedule. | 824 | preemption is disabled. This total time is the time that we can |
746 | To record this time, use the preemptirqsoff tracer. | 825 | not schedule. To record this time, use the preemptirqsoff |
826 | tracer. | ||
747 | 827 | ||
748 | Again, using this trace is much like the irqsoff and preemptoff tracers. | 828 | Again, using this trace is much like the irqsoff and preemptoff |
829 | tracers. | ||
749 | 830 | ||
750 | # echo preemptirqsoff > /debug/tracing/current_tracer | 831 | # echo preemptirqsoff > /debug/tracing/current_tracer |
751 | # echo 0 > /debug/tracing/tracing_max_latency | 832 | # echo 0 > /debug/tracing/tracing_max_latency |
@@ -781,9 +862,10 @@ preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 | |||
781 | 862 | ||
782 | 863 | ||
783 | The trace_hardirqs_off_thunk is called from assembly on x86 when | 864 | The trace_hardirqs_off_thunk is called from assembly on x86 when |
784 | interrupts are disabled in the assembly code. Without the function | 865 | interrupts are disabled in the assembly code. Without the |
785 | tracing, we do not know if interrupts were enabled within the preemption | 866 | function tracing, we do not know if interrupts were enabled |
786 | points. We do see that it started with preemption enabled. | 867 | within the preemption points. We do see that it started with |
868 | preemption enabled. | ||
787 | 869 | ||
788 | Here is a trace with ftrace_enabled set: | 870 | Here is a trace with ftrace_enabled set: |
789 | 871 | ||
@@ -871,40 +953,42 @@ preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 | |||
871 | sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) | 953 | sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) |
872 | 954 | ||
873 | 955 | ||
874 | This is a very interesting trace. It started with the preemption of | 956 | This is a very interesting trace. It started with the preemption |
875 | the ls task. We see that the task had the "need_resched" bit set | 957 | of the ls task. We see that the task had the "need_resched" bit |
876 | via the 'N' in the trace. Interrupts were disabled before the spin_lock | 958 | set via the 'N' in the trace. Interrupts were disabled before |
877 | at the beginning of the trace. We see that a schedule took place to run | 959 | the spin_lock at the beginning of the trace. We see that a |
878 | sshd. When the interrupts were enabled, we took an interrupt. | 960 | schedule took place to run sshd. When the interrupts were |
879 | On return from the interrupt handler, the softirq ran. We took another | 961 | enabled, we took an interrupt. On return from the interrupt |
880 | interrupt while running the softirq as we see from the capital 'H'. | 962 | handler, the softirq ran. We took another interrupt while |
963 | running the softirq as we see from the capital 'H'. | ||
881 | 964 | ||
882 | 965 | ||
883 | wakeup | 966 | wakeup |
884 | ------ | 967 | ------ |
885 | 968 | ||
886 | In a Real-Time environment it is very important to know the wakeup | 969 | In a Real-Time environment it is very important to know the |
887 | time it takes for the highest priority task that is woken up to the | 970 | wakeup time it takes for the highest priority task that is woken |
888 | time that it executes. This is also known as "schedule latency". | 971 | up to the time that it executes. This is also known as "schedule |
889 | I stress the point that this is about RT tasks. It is also important | 972 | latency". I stress the point that this is about RT tasks. It is |
890 | to know the scheduling latency of non-RT tasks, but the average | 973 | also important to know the scheduling latency of non-RT tasks, |
891 | schedule latency is better for non-RT tasks. Tools like | 974 | but the average schedule latency is better for non-RT tasks. |
892 | LatencyTop are more appropriate for such measurements. | 975 | Tools like LatencyTop are more appropriate for such |
976 | measurements. | ||
893 | 977 | ||
894 | Real-Time environments are interested in the worst case latency. | 978 | Real-Time environments are interested in the worst case latency. |
895 | That is the longest latency it takes for something to happen, and | 979 | That is the longest latency it takes for something to happen, |
896 | not the average. We can have a very fast scheduler that may only | 980 | and not the average. We can have a very fast scheduler that may |
897 | have a large latency once in a while, but that would not work well | 981 | only have a large latency once in a while, but that would not |
898 | with Real-Time tasks. The wakeup tracer was designed to record | 982 | work well with Real-Time tasks. The wakeup tracer was designed |
899 | the worst case wakeups of RT tasks. Non-RT tasks are not recorded | 983 | to record the worst case wakeups of RT tasks. Non-RT tasks are |
900 | because the tracer only records one worst case and tracing non-RT | 984 | not recorded because the tracer only records one worst case and |
901 | tasks that are unpredictable will overwrite the worst case latency | 985 | tracing non-RT tasks that are unpredictable will overwrite the |
902 | of RT tasks. | 986 | worst case latency of RT tasks. |
903 | 987 | ||
904 | Since this tracer only deals with RT tasks, we will run this slightly | 988 | Since this tracer only deals with RT tasks, we will run this |
905 | differently than we did with the previous tracers. Instead of performing | 989 | slightly differently than we did with the previous tracers. |
906 | an 'ls', we will run 'sleep 1' under 'chrt' which changes the | 990 | Instead of performing an 'ls', we will run 'sleep 1' under |
907 | priority of the task. | 991 | 'chrt' which changes the priority of the task. |
908 | 992 | ||
909 | # echo wakeup > /debug/tracing/current_tracer | 993 | # echo wakeup > /debug/tracing/current_tracer |
910 | # echo 0 > /debug/tracing/tracing_max_latency | 994 | # echo 0 > /debug/tracing/tracing_max_latency |
@@ -934,17 +1018,16 @@ wakeup latency trace v1.1.5 on 2.6.26-rc8 | |||
934 | <idle>-0 1d..4 4us : schedule (cpu_idle) | 1018 | <idle>-0 1d..4 4us : schedule (cpu_idle) |
935 | 1019 | ||
936 | 1020 | ||
1021 | Running this on an idle system, we see that it only took 4 | ||
1022 | microseconds to perform the task switch. Note, since the trace | ||
1023 | marker in the schedule is before the actual "switch", we stop | ||
1024 | the tracing when the recorded task is about to schedule in. This | ||
1025 | may change if we add a new marker at the end of the scheduler. | ||
937 | 1026 | ||
938 | Running this on an idle system, we see that it only took 4 microseconds | 1027 | Notice that the recorded task is 'sleep' with the PID of 4901 |
939 | to perform the task switch. Note, since the trace marker in the | 1028 | and it has an rt_prio of 5. This priority is user-space priority |
940 | schedule is before the actual "switch", we stop the tracing when | 1029 | and not the internal kernel priority. The policy is 1 for |
941 | the recorded task is about to schedule in. This may change if | 1030 | SCHED_FIFO and 2 for SCHED_RR. |
942 | we add a new marker at the end of the scheduler. | ||
943 | |||
944 | Notice that the recorded task is 'sleep' with the PID of 4901 and it | ||
945 | has an rt_prio of 5. This priority is user-space priority and not | ||
946 | the internal kernel priority. The policy is 1 for SCHED_FIFO and 2 | ||
947 | for SCHED_RR. | ||
948 | 1031 | ||
949 | Doing the same with chrt -r 5 and ftrace_enabled set. | 1032 | Doing the same with chrt -r 5 and ftrace_enabled set. |
950 | 1033 | ||
@@ -1001,24 +1084,25 @@ ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) | |||
1001 | ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) | 1084 | ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) |
1002 | ksoftirq-7 1d..4 50us : schedule (__cond_resched) | 1085 | ksoftirq-7 1d..4 50us : schedule (__cond_resched) |
1003 | 1086 | ||
1004 | The interrupt went off while running ksoftirqd. This task runs at | 1087 | The interrupt went off while running ksoftirqd. This task runs |
1005 | SCHED_OTHER. Why did not we see the 'N' set early? This may be | 1088 | at SCHED_OTHER. Why did not we see the 'N' set early? This may |
1006 | a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks | 1089 | be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K |
1007 | configured, the interrupt and softirq run with their own stack. | 1090 | stacks configured, the interrupt and softirq run with their own |
1008 | Some information is held on the top of the task's stack (need_resched | 1091 | stack. Some information is held on the top of the task's stack |
1009 | and preempt_count are both stored there). The setting of the NEED_RESCHED | 1092 | (need_resched and preempt_count are both stored there). The |
1010 | bit is done directly to the task's stack, but the reading of the | 1093 | setting of the NEED_RESCHED bit is done directly to the task's |
1011 | NEED_RESCHED is done by looking at the current stack, which in this case | 1094 | stack, but the reading of the NEED_RESCHED is done by looking at |
1012 | is the stack for the hard interrupt. This hides the fact that NEED_RESCHED | 1095 | the current stack, which in this case is the stack for the hard |
1013 | has been set. We do not see the 'N' until we switch back to the task's | 1096 | interrupt. This hides the fact that NEED_RESCHED has been set. |
1097 | We do not see the 'N' until we switch back to the task's | ||
1014 | assigned stack. | 1098 | assigned stack. |
1015 | 1099 | ||
1016 | function | 1100 | function |
1017 | -------- | 1101 | -------- |
1018 | 1102 | ||
1019 | This tracer is the function tracer. Enabling the function tracer | 1103 | This tracer is the function tracer. Enabling the function tracer |
1020 | can be done from the debug file system. Make sure the ftrace_enabled is | 1104 | can be done from the debug file system. Make sure the |
1021 | set; otherwise this tracer is a nop. | 1105 | ftrace_enabled is set; otherwise this tracer is a nop. |
1022 | 1106 | ||
1023 | # sysctl kernel.ftrace_enabled=1 | 1107 | # sysctl kernel.ftrace_enabled=1 |
1024 | # echo function > /debug/tracing/current_tracer | 1108 | # echo function > /debug/tracing/current_tracer |
@@ -1048,14 +1132,15 @@ set; otherwise this tracer is a nop. | |||
1048 | [...] | 1132 | [...] |
1049 | 1133 | ||
1050 | 1134 | ||
1051 | Note: function tracer uses ring buffers to store the above entries. | 1135 | Note: function tracer uses ring buffers to store the above |
1052 | The newest data may overwrite the oldest data. Sometimes using echo to | 1136 | entries. The newest data may overwrite the oldest data. |
1053 | stop the trace is not sufficient because the tracing could have overwritten | 1137 | Sometimes using echo to stop the trace is not sufficient because |
1054 | the data that you wanted to record. For this reason, it is sometimes better to | 1138 | the tracing could have overwritten the data that you wanted to |
1055 | disable tracing directly from a program. This allows you to stop the | 1139 | record. For this reason, it is sometimes better to disable |
1056 | tracing at the point that you hit the part that you are interested in. | 1140 | tracing directly from a program. This allows you to stop the |
1057 | To disable the tracing directly from a C program, something like following | 1141 | tracing at the point that you hit the part that you are |
1058 | code snippet can be used: | 1142 | interested in. To disable the tracing directly from a C program, |
1143 | something like following code snippet can be used: | ||
1059 | 1144 | ||
1060 | int trace_fd; | 1145 | int trace_fd; |
1061 | [...] | 1146 | [...] |
@@ -1070,10 +1155,10 @@ int main(int argc, char *argv[]) { | |||
1070 | } | 1155 | } |
1071 | 1156 | ||
1072 | Note: Here we hard coded the path name. The debugfs mount is not | 1157 | Note: Here we hard coded the path name. The debugfs mount is not |
1073 | guaranteed to be at /debug (and is more commonly at /sys/kernel/debug). | 1158 | guaranteed to be at /debug (and is more commonly at |
1074 | For simple one time traces, the above is sufficent. For anything else, | 1159 | /sys/kernel/debug). For simple one time traces, the above is |
1075 | a search through /proc/mounts may be needed to find where the debugfs | 1160 | sufficent. For anything else, a search through /proc/mounts may |
1076 | file-system is mounted. | 1161 | be needed to find where the debugfs file-system is mounted. |
1077 | 1162 | ||
1078 | 1163 | ||
1079 | Single thread tracing | 1164 | Single thread tracing |
@@ -1152,49 +1237,297 @@ int main (int argc, char **argv) | |||
1152 | return 0; | 1237 | return 0; |
1153 | } | 1238 | } |
1154 | 1239 | ||
1240 | |||
1241 | hw-branch-tracer (x86 only) | ||
1242 | --------------------------- | ||
1243 | |||
1244 | This tracer uses the x86 last branch tracing hardware feature to | ||
1245 | collect a branch trace on all cpus with relatively low overhead. | ||
1246 | |||
1247 | The tracer uses a fixed-size circular buffer per cpu and only | ||
1248 | traces ring 0 branches. The trace file dumps that buffer in the | ||
1249 | following format: | ||
1250 | |||
1251 | # tracer: hw-branch-tracer | ||
1252 | # | ||
1253 | # CPU# TO <- FROM | ||
1254 | 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 | ||
1255 | 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a | ||
1256 | 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf | ||
1257 | 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf | ||
1258 | 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a | ||
1259 | 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf | ||
1260 | |||
1261 | |||
1262 | The tracer may be used to dump the trace for the oops'ing cpu on | ||
1263 | a kernel oops into the system log. To enable this, | ||
1264 | ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one | ||
1265 | can either use the sysctl function or set it via the proc system | ||
1266 | interface. | ||
1267 | |||
1268 | sysctl kernel.ftrace_dump_on_oops=1 | ||
1269 | |||
1270 | or | ||
1271 | |||
1272 | echo 1 > /proc/sys/kernel/ftrace_dump_on_oops | ||
1273 | |||
1274 | |||
1275 | Here's an example of such a dump after a null pointer | ||
1276 | dereference in a kernel module: | ||
1277 | |||
1278 | [57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 | ||
1279 | [57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops] | ||
1280 | [57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 | ||
1281 | [57848.106019] Oops: 0002 [#1] SMP | ||
1282 | [57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus | ||
1283 | [57848.106019] Dumping ftrace buffer: | ||
1284 | [57848.106019] --------------------------------- | ||
1285 | [...] | ||
1286 | [57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 | ||
1287 | [57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 | ||
1288 | [57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 | ||
1289 | [57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 | ||
1290 | [57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 | ||
1291 | [57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] | ||
1292 | [57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 | ||
1293 | [57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b | ||
1294 | [57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 | ||
1295 | [57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 | ||
1296 | [57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 | ||
1297 | [...] | ||
1298 | [57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 | ||
1299 | [57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 | ||
1300 | [57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 | ||
1301 | [57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 | ||
1302 | [...] | ||
1303 | [57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 | ||
1304 | [57848.106019] --------------------------------- | ||
1305 | [57848.106019] CPU 0 | ||
1306 | [57848.106019] Modules linked in: oops | ||
1307 | [57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 | ||
1308 | [57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops] | ||
1309 | [57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 | ||
1310 | [...] | ||
1311 | |||
1312 | |||
1313 | function graph tracer | ||
1314 | --------------------------- | ||
1315 | |||
1316 | This tracer is similar to the function tracer except that it | ||
1317 | probes a function on its entry and its exit. This is done by | ||
1318 | using a dynamically allocated stack of return addresses in each | ||
1319 | task_struct. On function entry the tracer overwrites the return | ||
1320 | address of each function traced to set a custom probe. Thus the | ||
1321 | original return address is stored on the stack of return address | ||
1322 | in the task_struct. | ||
1323 | |||
1324 | Probing on both ends of a function leads to special features | ||
1325 | such as: | ||
1326 | |||
1327 | - measure of a function's time execution | ||
1328 | - having a reliable call stack to draw function calls graph | ||
1329 | |||
1330 | This tracer is useful in several situations: | ||
1331 | |||
1332 | - you want to find the reason of a strange kernel behavior and | ||
1333 | need to see what happens in detail on any areas (or specific | ||
1334 | ones). | ||
1335 | |||
1336 | - you are experiencing weird latencies but it's difficult to | ||
1337 | find its origin. | ||
1338 | |||
1339 | - you want to find quickly which path is taken by a specific | ||
1340 | function | ||
1341 | |||
1342 | - you just want to peek inside a working kernel and want to see | ||
1343 | what happens there. | ||
1344 | |||
1345 | # tracer: function_graph | ||
1346 | # | ||
1347 | # CPU DURATION FUNCTION CALLS | ||
1348 | # | | | | | | | | ||
1349 | |||
1350 | 0) | sys_open() { | ||
1351 | 0) | do_sys_open() { | ||
1352 | 0) | getname() { | ||
1353 | 0) | kmem_cache_alloc() { | ||
1354 | 0) 1.382 us | __might_sleep(); | ||
1355 | 0) 2.478 us | } | ||
1356 | 0) | strncpy_from_user() { | ||
1357 | 0) | might_fault() { | ||
1358 | 0) 1.389 us | __might_sleep(); | ||
1359 | 0) 2.553 us | } | ||
1360 | 0) 3.807 us | } | ||
1361 | 0) 7.876 us | } | ||
1362 | 0) | alloc_fd() { | ||
1363 | 0) 0.668 us | _spin_lock(); | ||
1364 | 0) 0.570 us | expand_files(); | ||
1365 | 0) 0.586 us | _spin_unlock(); | ||
1366 | |||
1367 | |||
1368 | There are several columns that can be dynamically | ||
1369 | enabled/disabled. You can use every combination of options you | ||
1370 | want, depending on your needs. | ||
1371 | |||
1372 | - The cpu number on which the function executed is default | ||
1373 | enabled. It is sometimes better to only trace one cpu (see | ||
1374 | tracing_cpu_mask file) or you might sometimes see unordered | ||
1375 | function calls while cpu tracing switch. | ||
1376 | |||
1377 | hide: echo nofuncgraph-cpu > /debug/tracing/trace_options | ||
1378 | show: echo funcgraph-cpu > /debug/tracing/trace_options | ||
1379 | |||
1380 | - The duration (function's time of execution) is displayed on | ||
1381 | the closing bracket line of a function or on the same line | ||
1382 | than the current function in case of a leaf one. It is default | ||
1383 | enabled. | ||
1384 | |||
1385 | hide: echo nofuncgraph-duration > /debug/tracing/trace_options | ||
1386 | show: echo funcgraph-duration > /debug/tracing/trace_options | ||
1387 | |||
1388 | - The overhead field precedes the duration field in case of | ||
1389 | reached duration thresholds. | ||
1390 | |||
1391 | hide: echo nofuncgraph-overhead > /debug/tracing/trace_options | ||
1392 | show: echo funcgraph-overhead > /debug/tracing/trace_options | ||
1393 | depends on: funcgraph-duration | ||
1394 | |||
1395 | ie: | ||
1396 | |||
1397 | 0) | up_write() { | ||
1398 | 0) 0.646 us | _spin_lock_irqsave(); | ||
1399 | 0) 0.684 us | _spin_unlock_irqrestore(); | ||
1400 | 0) 3.123 us | } | ||
1401 | 0) 0.548 us | fput(); | ||
1402 | 0) + 58.628 us | } | ||
1403 | |||
1404 | [...] | ||
1405 | |||
1406 | 0) | putname() { | ||
1407 | 0) | kmem_cache_free() { | ||
1408 | 0) 0.518 us | __phys_addr(); | ||
1409 | 0) 1.757 us | } | ||
1410 | 0) 2.861 us | } | ||
1411 | 0) ! 115.305 us | } | ||
1412 | 0) ! 116.402 us | } | ||
1413 | |||
1414 | + means that the function exceeded 10 usecs. | ||
1415 | ! means that the function exceeded 100 usecs. | ||
1416 | |||
1417 | |||
1418 | - The task/pid field displays the thread cmdline and pid which | ||
1419 | executed the function. It is default disabled. | ||
1420 | |||
1421 | hide: echo nofuncgraph-proc > /debug/tracing/trace_options | ||
1422 | show: echo funcgraph-proc > /debug/tracing/trace_options | ||
1423 | |||
1424 | ie: | ||
1425 | |||
1426 | # tracer: function_graph | ||
1427 | # | ||
1428 | # CPU TASK/PID DURATION FUNCTION CALLS | ||
1429 | # | | | | | | | | | | ||
1430 | 0) sh-4802 | | d_free() { | ||
1431 | 0) sh-4802 | | call_rcu() { | ||
1432 | 0) sh-4802 | | __call_rcu() { | ||
1433 | 0) sh-4802 | 0.616 us | rcu_process_gp_end(); | ||
1434 | 0) sh-4802 | 0.586 us | check_for_new_grace_period(); | ||
1435 | 0) sh-4802 | 2.899 us | } | ||
1436 | 0) sh-4802 | 4.040 us | } | ||
1437 | 0) sh-4802 | 5.151 us | } | ||
1438 | 0) sh-4802 | + 49.370 us | } | ||
1439 | |||
1440 | |||
1441 | - The absolute time field is an absolute timestamp given by the | ||
1442 | system clock since it started. A snapshot of this time is | ||
1443 | given on each entry/exit of functions | ||
1444 | |||
1445 | hide: echo nofuncgraph-abstime > /debug/tracing/trace_options | ||
1446 | show: echo funcgraph-abstime > /debug/tracing/trace_options | ||
1447 | |||
1448 | ie: | ||
1449 | |||
1450 | # | ||
1451 | # TIME CPU DURATION FUNCTION CALLS | ||
1452 | # | | | | | | | | | ||
1453 | 360.774522 | 1) 0.541 us | } | ||
1454 | 360.774522 | 1) 4.663 us | } | ||
1455 | 360.774523 | 1) 0.541 us | __wake_up_bit(); | ||
1456 | 360.774524 | 1) 6.796 us | } | ||
1457 | 360.774524 | 1) 7.952 us | } | ||
1458 | 360.774525 | 1) 9.063 us | } | ||
1459 | 360.774525 | 1) 0.615 us | journal_mark_dirty(); | ||
1460 | 360.774527 | 1) 0.578 us | __brelse(); | ||
1461 | 360.774528 | 1) | reiserfs_prepare_for_journal() { | ||
1462 | 360.774528 | 1) | unlock_buffer() { | ||
1463 | 360.774529 | 1) | wake_up_bit() { | ||
1464 | 360.774529 | 1) | bit_waitqueue() { | ||
1465 | 360.774530 | 1) 0.594 us | __phys_addr(); | ||
1466 | |||
1467 | |||
1468 | You can put some comments on specific functions by using | ||
1469 | ftrace_printk() For example, if you want to put a comment inside | ||
1470 | the __might_sleep() function, you just have to include | ||
1471 | <linux/ftrace.h> and call ftrace_printk() inside __might_sleep() | ||
1472 | |||
1473 | ftrace_printk("I'm a comment!\n") | ||
1474 | |||
1475 | will produce: | ||
1476 | |||
1477 | 1) | __might_sleep() { | ||
1478 | 1) | /* I'm a comment! */ | ||
1479 | 1) 1.449 us | } | ||
1480 | |||
1481 | |||
1482 | You might find other useful features for this tracer in the | ||
1483 | following "dynamic ftrace" section such as tracing only specific | ||
1484 | functions or tasks. | ||
1485 | |||
1155 | dynamic ftrace | 1486 | dynamic ftrace |
1156 | -------------- | 1487 | -------------- |
1157 | 1488 | ||
1158 | If CONFIG_DYNAMIC_FTRACE is set, the system will run with | 1489 | If CONFIG_DYNAMIC_FTRACE is set, the system will run with |
1159 | virtually no overhead when function tracing is disabled. The way | 1490 | virtually no overhead when function tracing is disabled. The way |
1160 | this works is the mcount function call (placed at the start of | 1491 | this works is the mcount function call (placed at the start of |
1161 | every kernel function, produced by the -pg switch in gcc), starts | 1492 | every kernel function, produced by the -pg switch in gcc), |
1162 | of pointing to a simple return. (Enabling FTRACE will include the | 1493 | starts of pointing to a simple return. (Enabling FTRACE will |
1163 | -pg switch in the compiling of the kernel.) | 1494 | include the -pg switch in the compiling of the kernel.) |
1164 | 1495 | ||
1165 | At compile time every C file object is run through the | 1496 | At compile time every C file object is run through the |
1166 | recordmcount.pl script (located in the scripts directory). This | 1497 | recordmcount.pl script (located in the scripts directory). This |
1167 | script will process the C object using objdump to find all the | 1498 | script will process the C object using objdump to find all the |
1168 | locations in the .text section that call mcount. (Note, only | 1499 | locations in the .text section that call mcount. (Note, only the |
1169 | the .text section is processed, since processing other sections | 1500 | .text section is processed, since processing other sections like |
1170 | like .init.text may cause races due to those sections being freed). | 1501 | .init.text may cause races due to those sections being freed). |
1171 | 1502 | ||
1172 | A new section called "__mcount_loc" is created that holds references | 1503 | A new section called "__mcount_loc" is created that holds |
1173 | to all the mcount call sites in the .text section. This section is | 1504 | references to all the mcount call sites in the .text section. |
1174 | compiled back into the original object. The final linker will add | 1505 | This section is compiled back into the original object. The |
1175 | all these references into a single table. | 1506 | final linker will add all these references into a single table. |
1176 | 1507 | ||
1177 | On boot up, before SMP is initialized, the dynamic ftrace code | 1508 | On boot up, before SMP is initialized, the dynamic ftrace code |
1178 | scans this table and updates all the locations into nops. It also | 1509 | scans this table and updates all the locations into nops. It |
1179 | records the locations, which are added to the available_filter_functions | 1510 | also records the locations, which are added to the |
1180 | list. Modules are processed as they are loaded and before they are | 1511 | available_filter_functions list. Modules are processed as they |
1181 | executed. When a module is unloaded, it also removes its functions from | 1512 | are loaded and before they are executed. When a module is |
1182 | the ftrace function list. This is automatic in the module unload | 1513 | unloaded, it also removes its functions from the ftrace function |
1183 | code, and the module author does not need to worry about it. | 1514 | list. This is automatic in the module unload code, and the |
1184 | 1515 | module author does not need to worry about it. | |
1185 | When tracing is enabled, kstop_machine is called to prevent races | 1516 | |
1186 | with the CPUS executing code being modified (which can cause the | 1517 | When tracing is enabled, kstop_machine is called to prevent |
1187 | CPU to do undesireable things), and the nops are patched back | 1518 | races with the CPUS executing code being modified (which can |
1188 | to calls. But this time, they do not call mcount (which is just | 1519 | cause the CPU to do undesireable things), and the nops are |
1189 | a function stub). They now call into the ftrace infrastructure. | 1520 | patched back to calls. But this time, they do not call mcount |
1521 | (which is just a function stub). They now call into the ftrace | ||
1522 | infrastructure. | ||
1190 | 1523 | ||
1191 | One special side-effect to the recording of the functions being | 1524 | One special side-effect to the recording of the functions being |
1192 | traced is that we can now selectively choose which functions we | 1525 | traced is that we can now selectively choose which functions we |
1193 | wish to trace and which ones we want the mcount calls to remain as | 1526 | wish to trace and which ones we want the mcount calls to remain |
1194 | nops. | 1527 | as nops. |
1195 | 1528 | ||
1196 | Two files are used, one for enabling and one for disabling the tracing | 1529 | Two files are used, one for enabling and one for disabling the |
1197 | of specified functions. They are: | 1530 | tracing of specified functions. They are: |
1198 | 1531 | ||
1199 | set_ftrace_filter | 1532 | set_ftrace_filter |
1200 | 1533 | ||
@@ -1202,8 +1535,8 @@ and | |||
1202 | 1535 | ||
1203 | set_ftrace_notrace | 1536 | set_ftrace_notrace |
1204 | 1537 | ||
1205 | A list of available functions that you can add to these files is listed | 1538 | A list of available functions that you can add to these files is |
1206 | in: | 1539 | listed in: |
1207 | 1540 | ||
1208 | available_filter_functions | 1541 | available_filter_functions |
1209 | 1542 | ||
@@ -1240,8 +1573,8 @@ hrtimer_interrupt | |||
1240 | sys_nanosleep | 1573 | sys_nanosleep |
1241 | 1574 | ||
1242 | 1575 | ||
1243 | Perhaps this is not enough. The filters also allow simple wild cards. | 1576 | Perhaps this is not enough. The filters also allow simple wild |
1244 | Only the following are currently available | 1577 | cards. Only the following are currently available |
1245 | 1578 | ||
1246 | <match>* - will match functions that begin with <match> | 1579 | <match>* - will match functions that begin with <match> |
1247 | *<match> - will match functions that end with <match> | 1580 | *<match> - will match functions that end with <match> |
@@ -1251,9 +1584,9 @@ These are the only wild cards which are supported. | |||
1251 | 1584 | ||
1252 | <match>*<match> will not work. | 1585 | <match>*<match> will not work. |
1253 | 1586 | ||
1254 | Note: It is better to use quotes to enclose the wild cards, otherwise | 1587 | Note: It is better to use quotes to enclose the wild cards, |
1255 | the shell may expand the parameters into names of files in the local | 1588 | otherwise the shell may expand the parameters into names |
1256 | directory. | 1589 | of files in the local directory. |
1257 | 1590 | ||
1258 | # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter | 1591 | # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter |
1259 | 1592 | ||
@@ -1299,7 +1632,8 @@ This is because the '>' and '>>' act just like they do in bash. | |||
1299 | To rewrite the filters, use '>' | 1632 | To rewrite the filters, use '>' |
1300 | To append to the filters, use '>>' | 1633 | To append to the filters, use '>>' |
1301 | 1634 | ||
1302 | To clear out a filter so that all functions will be recorded again: | 1635 | To clear out a filter so that all functions will be recorded |
1636 | again: | ||
1303 | 1637 | ||
1304 | # echo > /debug/tracing/set_ftrace_filter | 1638 | # echo > /debug/tracing/set_ftrace_filter |
1305 | # cat /debug/tracing/set_ftrace_filter | 1639 | # cat /debug/tracing/set_ftrace_filter |
@@ -1331,7 +1665,8 @@ hrtimer_get_res | |||
1331 | hrtimer_init_sleeper | 1665 | hrtimer_init_sleeper |
1332 | 1666 | ||
1333 | 1667 | ||
1334 | The set_ftrace_notrace prevents those functions from being traced. | 1668 | The set_ftrace_notrace prevents those functions from being |
1669 | traced. | ||
1335 | 1670 | ||
1336 | # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace | 1671 | # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace |
1337 | 1672 | ||
@@ -1353,13 +1688,75 @@ Produces: | |||
1353 | 1688 | ||
1354 | We can see that there's no more lock or preempt tracing. | 1689 | We can see that there's no more lock or preempt tracing. |
1355 | 1690 | ||
1691 | |||
1692 | Dynamic ftrace with the function graph tracer | ||
1693 | --------------------------------------------- | ||
1694 | |||
1695 | Although what has been explained above concerns both the | ||
1696 | function tracer and the function-graph-tracer, there are some | ||
1697 | special features only available in the function-graph tracer. | ||
1698 | |||
1699 | If you want to trace only one function and all of its children, | ||
1700 | you just have to echo its name into set_graph_function: | ||
1701 | |||
1702 | echo __do_fault > set_graph_function | ||
1703 | |||
1704 | will produce the following "expanded" trace of the __do_fault() | ||
1705 | function: | ||
1706 | |||
1707 | 0) | __do_fault() { | ||
1708 | 0) | filemap_fault() { | ||
1709 | 0) | find_lock_page() { | ||
1710 | 0) 0.804 us | find_get_page(); | ||
1711 | 0) | __might_sleep() { | ||
1712 | 0) 1.329 us | } | ||
1713 | 0) 3.904 us | } | ||
1714 | 0) 4.979 us | } | ||
1715 | 0) 0.653 us | _spin_lock(); | ||
1716 | 0) 0.578 us | page_add_file_rmap(); | ||
1717 | 0) 0.525 us | native_set_pte_at(); | ||
1718 | 0) 0.585 us | _spin_unlock(); | ||
1719 | 0) | unlock_page() { | ||
1720 | 0) 0.541 us | page_waitqueue(); | ||
1721 | 0) 0.639 us | __wake_up_bit(); | ||
1722 | 0) 2.786 us | } | ||
1723 | 0) + 14.237 us | } | ||
1724 | 0) | __do_fault() { | ||
1725 | 0) | filemap_fault() { | ||
1726 | 0) | find_lock_page() { | ||
1727 | 0) 0.698 us | find_get_page(); | ||
1728 | 0) | __might_sleep() { | ||
1729 | 0) 1.412 us | } | ||
1730 | 0) 3.950 us | } | ||
1731 | 0) 5.098 us | } | ||
1732 | 0) 0.631 us | _spin_lock(); | ||
1733 | 0) 0.571 us | page_add_file_rmap(); | ||
1734 | 0) 0.526 us | native_set_pte_at(); | ||
1735 | 0) 0.586 us | _spin_unlock(); | ||
1736 | 0) | unlock_page() { | ||
1737 | 0) 0.533 us | page_waitqueue(); | ||
1738 | 0) 0.638 us | __wake_up_bit(); | ||
1739 | 0) 2.793 us | } | ||
1740 | 0) + 14.012 us | } | ||
1741 | |||
1742 | You can also expand several functions at once: | ||
1743 | |||
1744 | echo sys_open > set_graph_function | ||
1745 | echo sys_close >> set_graph_function | ||
1746 | |||
1747 | Now if you want to go back to trace all functions you can clear | ||
1748 | this special filter via: | ||
1749 | |||
1750 | echo > set_graph_function | ||
1751 | |||
1752 | |||
1356 | trace_pipe | 1753 | trace_pipe |
1357 | ---------- | 1754 | ---------- |
1358 | 1755 | ||
1359 | The trace_pipe outputs the same content as the trace file, but the effect | 1756 | The trace_pipe outputs the same content as the trace file, but |
1360 | on the tracing is different. Every read from trace_pipe is consumed. | 1757 | the effect on the tracing is different. Every read from |
1361 | This means that subsequent reads will be different. The trace | 1758 | trace_pipe is consumed. This means that subsequent reads will be |
1362 | is live. | 1759 | different. The trace is live. |
1363 | 1760 | ||
1364 | # echo function > /debug/tracing/current_tracer | 1761 | # echo function > /debug/tracing/current_tracer |
1365 | # cat /debug/tracing/trace_pipe > /tmp/trace.out & | 1762 | # cat /debug/tracing/trace_pipe > /tmp/trace.out & |
@@ -1387,38 +1784,45 @@ is live. | |||
1387 | bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up | 1784 | bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up |
1388 | 1785 | ||
1389 | 1786 | ||
1390 | Note, reading the trace_pipe file will block until more input is added. | 1787 | Note, reading the trace_pipe file will block until more input is |
1391 | By changing the tracer, trace_pipe will issue an EOF. We needed | 1788 | added. By changing the tracer, trace_pipe will issue an EOF. We |
1392 | to set the function tracer _before_ we "cat" the trace_pipe file. | 1789 | needed to set the function tracer _before_ we "cat" the |
1790 | trace_pipe file. | ||
1393 | 1791 | ||
1394 | 1792 | ||
1395 | trace entries | 1793 | trace entries |
1396 | ------------- | 1794 | ------------- |
1397 | 1795 | ||
1398 | Having too much or not enough data can be troublesome in diagnosing | 1796 | Having too much or not enough data can be troublesome in |
1399 | an issue in the kernel. The file buffer_size_kb is used to modify | 1797 | diagnosing an issue in the kernel. The file buffer_size_kb is |
1400 | the size of the internal trace buffers. The number listed | 1798 | used to modify the size of the internal trace buffers. The |
1401 | is the number of entries that can be recorded per CPU. To know | 1799 | number listed is the number of entries that can be recorded per |
1402 | the full size, multiply the number of possible CPUS with the | 1800 | CPU. To know the full size, multiply the number of possible CPUS |
1403 | number of entries. | 1801 | with the number of entries. |
1404 | 1802 | ||
1405 | # cat /debug/tracing/buffer_size_kb | 1803 | # cat /debug/tracing/buffer_size_kb |
1406 | 1408 (units kilobytes) | 1804 | 1408 (units kilobytes) |
1407 | 1805 | ||
1408 | Note, to modify this, you must have tracing completely disabled. To do that, | 1806 | Note, to modify this, you must have tracing completely disabled. |
1409 | echo "nop" into the current_tracer. If the current_tracer is not set | 1807 | To do that, echo "nop" into the current_tracer. If the |
1410 | to "nop", an EINVAL error will be returned. | 1808 | current_tracer is not set to "nop", an EINVAL error will be |
1809 | returned. | ||
1411 | 1810 | ||
1412 | # echo nop > /debug/tracing/current_tracer | 1811 | # echo nop > /debug/tracing/current_tracer |
1413 | # echo 10000 > /debug/tracing/buffer_size_kb | 1812 | # echo 10000 > /debug/tracing/buffer_size_kb |
1414 | # cat /debug/tracing/buffer_size_kb | 1813 | # cat /debug/tracing/buffer_size_kb |
1415 | 10000 (units kilobytes) | 1814 | 10000 (units kilobytes) |
1416 | 1815 | ||
1417 | The number of pages which will be allocated is limited to a percentage | 1816 | The number of pages which will be allocated is limited to a |
1418 | of available memory. Allocating too much will produce an error. | 1817 | percentage of available memory. Allocating too much will produce |
1818 | an error. | ||
1419 | 1819 | ||
1420 | # echo 1000000000000 > /debug/tracing/buffer_size_kb | 1820 | # echo 1000000000000 > /debug/tracing/buffer_size_kb |
1421 | -bash: echo: write error: Cannot allocate memory | 1821 | -bash: echo: write error: Cannot allocate memory |
1422 | # cat /debug/tracing/buffer_size_kb | 1822 | # cat /debug/tracing/buffer_size_kb |
1423 | 85 | 1823 | 85 |
1424 | 1824 | ||
1825 | ----------- | ||
1826 | |||
1827 | More details can be found in the source code, in the | ||
1828 | kernel/tracing/*.c files. | ||
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 319785b6dcb1..4a7700876c96 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -49,6 +49,7 @@ parameter is applicable: | |||
49 | ISAPNP ISA PnP code is enabled. | 49 | ISAPNP ISA PnP code is enabled. |
50 | ISDN Appropriate ISDN support is enabled. | 50 | ISDN Appropriate ISDN support is enabled. |
51 | JOY Appropriate joystick support is enabled. | 51 | JOY Appropriate joystick support is enabled. |
52 | KMEMTRACE kmemtrace is enabled. | ||
52 | LIBATA Libata driver is enabled | 53 | LIBATA Libata driver is enabled |
53 | LP Printer support is enabled. | 54 | LP Printer support is enabled. |
54 | LOOP Loopback device support is enabled. | 55 | LOOP Loopback device support is enabled. |
@@ -1045,6 +1046,15 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1045 | use the HighMem zone if it exists, and the Normal | 1046 | use the HighMem zone if it exists, and the Normal |
1046 | zone if it does not. | 1047 | zone if it does not. |
1047 | 1048 | ||
1049 | kmemtrace.enable= [KNL,KMEMTRACE] Format: { yes | no } | ||
1050 | Controls whether kmemtrace is enabled | ||
1051 | at boot-time. | ||
1052 | |||
1053 | kmemtrace.subbufs=n [KNL,KMEMTRACE] Overrides the number of | ||
1054 | subbufs kmemtrace's relay channel has. Set this | ||
1055 | higher than default (KMEMTRACE_N_SUBBUFS in code) if | ||
1056 | you experience buffer overruns. | ||
1057 | |||
1048 | movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter | 1058 | movablecore=nn[KMG] [KNL,X86-32,IA-64,PPC,X86-64] This parameter |
1049 | is similar to kernelcore except it specifies the | 1059 | is similar to kernelcore except it specifies the |
1050 | amount of memory used for migratable allocations. | 1060 | amount of memory used for migratable allocations. |
diff --git a/Documentation/sysrq.txt b/Documentation/sysrq.txt index 9e592c718afb..535aeb936dbc 100644 --- a/Documentation/sysrq.txt +++ b/Documentation/sysrq.txt | |||
@@ -113,6 +113,8 @@ On all - write a character to /proc/sysrq-trigger. e.g.: | |||
113 | 113 | ||
114 | 'x' - Used by xmon interface on ppc/powerpc platforms. | 114 | 'x' - Used by xmon interface on ppc/powerpc platforms. |
115 | 115 | ||
116 | 'z' - Dump the ftrace buffer | ||
117 | |||
116 | '0'-'9' - Sets the console log level, controlling which kernel messages | 118 | '0'-'9' - Sets the console log level, controlling which kernel messages |
117 | will be printed to your console. ('0', for example would make | 119 | will be printed to your console. ('0', for example would make |
118 | it so that only emergency messages like PANICs or OOPSes would | 120 | it so that only emergency messages like PANICs or OOPSes would |
diff --git a/Documentation/vm/kmemtrace.txt b/Documentation/vm/kmemtrace.txt new file mode 100644 index 000000000000..a956d9b7f943 --- /dev/null +++ b/Documentation/vm/kmemtrace.txt | |||
@@ -0,0 +1,126 @@ | |||
1 | kmemtrace - Kernel Memory Tracer | ||
2 | |||
3 | by Eduard - Gabriel Munteanu | ||
4 | <eduard.munteanu@linux360.ro> | ||
5 | |||
6 | I. Introduction | ||
7 | =============== | ||
8 | |||
9 | kmemtrace helps kernel developers figure out two things: | ||
10 | 1) how different allocators (SLAB, SLUB etc.) perform | ||
11 | 2) how kernel code allocates memory and how much | ||
12 | |||
13 | To do this, we trace every allocation and export information to the userspace | ||
14 | through the relay interface. We export things such as the number of requested | ||
15 | bytes, the number of bytes actually allocated (i.e. including internal | ||
16 | fragmentation), whether this is a slab allocation or a plain kmalloc() and so | ||
17 | on. | ||
18 | |||
19 | The actual analysis is performed by a userspace tool (see section III for | ||
20 | details on where to get it from). It logs the data exported by the kernel, | ||
21 | processes it and (as of writing this) can provide the following information: | ||
22 | - the total amount of memory allocated and fragmentation per call-site | ||
23 | - the amount of memory allocated and fragmentation per allocation | ||
24 | - total memory allocated and fragmentation in the collected dataset | ||
25 | - number of cross-CPU allocation and frees (makes sense in NUMA environments) | ||
26 | |||
27 | Moreover, it can potentially find inconsistent and erroneous behavior in | ||
28 | kernel code, such as using slab free functions on kmalloc'ed memory or | ||
29 | allocating less memory than requested (but not truly failed allocations). | ||
30 | |||
31 | kmemtrace also makes provisions for tracing on some arch and analysing the | ||
32 | data on another. | ||
33 | |||
34 | II. Design and goals | ||
35 | ==================== | ||
36 | |||
37 | kmemtrace was designed to handle rather large amounts of data. Thus, it uses | ||
38 | the relay interface to export whatever is logged to userspace, which then | ||
39 | stores it. Analysis and reporting is done asynchronously, that is, after the | ||
40 | data is collected and stored. By design, it allows one to log and analyse | ||
41 | on different machines and different arches. | ||
42 | |||
43 | As of writing this, the ABI is not considered stable, though it might not | ||
44 | change much. However, no guarantees are made about compatibility yet. When | ||
45 | deemed stable, the ABI should still allow easy extension while maintaining | ||
46 | backward compatibility. This is described further in Documentation/ABI. | ||
47 | |||
48 | Summary of design goals: | ||
49 | - allow logging and analysis to be done across different machines | ||
50 | - be fast and anticipate usage in high-load environments (*) | ||
51 | - be reasonably extensible | ||
52 | - make it possible for GNU/Linux distributions to have kmemtrace | ||
53 | included in their repositories | ||
54 | |||
55 | (*) - one of the reasons Pekka Enberg's original userspace data analysis | ||
56 | tool's code was rewritten from Perl to C (although this is more than a | ||
57 | simple conversion) | ||
58 | |||
59 | |||
60 | III. Quick usage guide | ||
61 | ====================== | ||
62 | |||
63 | 1) Get a kernel that supports kmemtrace and build it accordingly (i.e. enable | ||
64 | CONFIG_KMEMTRACE). | ||
65 | |||
66 | 2) Get the userspace tool and build it: | ||
67 | $ git-clone git://repo.or.cz/kmemtrace-user.git # current repository | ||
68 | $ cd kmemtrace-user/ | ||
69 | $ ./autogen.sh | ||
70 | $ ./configure | ||
71 | $ make | ||
72 | |||
73 | 3) Boot the kmemtrace-enabled kernel if you haven't, preferably in the | ||
74 | 'single' runlevel (so that relay buffers don't fill up easily), and run | ||
75 | kmemtrace: | ||
76 | # '$' does not mean user, but root here. | ||
77 | $ mount -t debugfs none /sys/kernel/debug | ||
78 | $ mount -t proc none /proc | ||
79 | $ cd path/to/kmemtrace-user/ | ||
80 | $ ./kmemtraced | ||
81 | Wait a bit, then stop it with CTRL+C. | ||
82 | $ cat /sys/kernel/debug/kmemtrace/total_overruns # Check if we didn't | ||
83 | # overrun, should | ||
84 | # be zero. | ||
85 | $ (Optionally) [Run kmemtrace_check separately on each cpu[0-9]*.out file to | ||
86 | check its correctness] | ||
87 | $ ./kmemtrace-report | ||
88 | |||
89 | Now you should have a nice and short summary of how the allocator performs. | ||
90 | |||
91 | IV. FAQ and known issues | ||
92 | ======================== | ||
93 | |||
94 | Q: 'cat /sys/kernel/debug/kmemtrace/total_overruns' is non-zero, how do I fix | ||
95 | this? Should I worry? | ||
96 | A: If it's non-zero, this affects kmemtrace's accuracy, depending on how | ||
97 | large the number is. You can fix it by supplying a higher | ||
98 | 'kmemtrace.subbufs=N' kernel parameter. | ||
99 | --- | ||
100 | |||
101 | Q: kmemtrace_check reports errors, how do I fix this? Should I worry? | ||
102 | A: This is a bug and should be reported. It can occur for a variety of | ||
103 | reasons: | ||
104 | - possible bugs in relay code | ||
105 | - possible misuse of relay by kmemtrace | ||
106 | - timestamps being collected unorderly | ||
107 | Or you may fix it yourself and send us a patch. | ||
108 | --- | ||
109 | |||
110 | Q: kmemtrace_report shows many errors, how do I fix this? Should I worry? | ||
111 | A: This is a known issue and I'm working on it. These might be true errors | ||
112 | in kernel code, which may have inconsistent behavior (e.g. allocating memory | ||
113 | with kmem_cache_alloc() and freeing it with kfree()). Pekka Enberg pointed | ||
114 | out this behavior may work with SLAB, but may fail with other allocators. | ||
115 | |||
116 | It may also be due to lack of tracing in some unusual allocator functions. | ||
117 | |||
118 | We don't want bug reports regarding this issue yet. | ||
119 | --- | ||
120 | |||
121 | V. See also | ||
122 | =========== | ||
123 | |||
124 | Documentation/kernel-parameters.txt | ||
125 | Documentation/ABI/testing/debugfs-kmemtrace | ||
126 | |||