Diffstat (limited to 'Documentation/trace/ftrace.txt')
-rw-r--r-- | Documentation/trace/ftrace.txt | 2097 |
1 files changed, 1480 insertions, 617 deletions
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index a372304aef10..bfe8c29b1f1d 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt | |||
@@ -8,6 +8,7 @@ Copyright 2008 Red Hat Inc. | |||
8 | Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, | 8 | Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, |
9 | John Kacur, and David Teigland. | 9 | John Kacur, and David Teigland. |
10 | Written for: 2.6.28-rc2 | 10 | Written for: 2.6.28-rc2 |
11 | Updated for: 3.10 | ||
11 | 12 | ||
12 | Introduction | 13 | Introduction |
13 | ------------ | 14 | ------------ |
@@ -17,13 +18,16 @@ designers of systems to find what is going on inside the kernel. | |||
17 | It can be used for debugging or analyzing latencies and | 18 | It can be used for debugging or analyzing latencies and |
18 | performance issues that take place outside of user-space. | 19 | performance issues that take place outside of user-space. |
19 | 20 | ||
20 | Although ftrace is the function tracer, it also includes an | 21 | Although ftrace is typically considered the function tracer, it |
21 | infrastructure that allows for other types of tracing. Some of | 22 | is really a framework of several assorted tracing utilities. |
22 | the tracers that are currently in ftrace include a tracer to | 23 | There's latency tracing to examine what occurs between interrupts |
23 | trace context switches, the time it takes for a high priority | 24 | disabled and enabled, as well as for preemption, and from the time |
24 | task to run after it was woken up, the time interrupts are | 25 | a task is woken until it is actually scheduled in. |
25 | disabled, and more (ftrace allows for tracer plugins, which | 26 | |
26 | means that the list of tracers can always grow). | 27 | One of the most common uses of ftrace is event tracing. |
28 | Throughout the kernel are hundreds of static event points that | ||
29 | can be enabled via the debugfs file system to see what is | ||
30 | going on in certain parts of the kernel. | ||
27 | 31 | ||
28 | 32 | ||
29 | Implementation Details | 33 | Implementation Details |
@@ -61,7 +65,7 @@ the extended "/sys/kernel/debug/tracing" path name. | |||
61 | 65 | ||
62 | That's it! (assuming that you have ftrace configured into your kernel) | 66 | That's it! (assuming that you have ftrace configured into your kernel) |
63 | 67 | ||
64 | After mounting the debugfs, you can see a directory called | 68 | After mounting debugfs, you can see a directory called |
65 | "tracing". This directory contains the control and output files | 69 | "tracing". This directory contains the control and output files |
66 | of ftrace. Here is a list of some of the key files: | 70 | of ftrace. Here is a list of some of the key files: |
67 | 71 | ||
@@ -84,7 +88,9 @@ of ftrace. Here is a list of some of the key files: | |||
84 | 88 | ||
85 | This sets or displays whether writing to the trace | 89 | This sets or displays whether writing to the trace |
86 | ring buffer is enabled. Echo 0 into this file to disable | 90 | ring buffer is enabled. Echo 0 into this file to disable |
87 | the tracer or 1 to enable it. | 91 | the tracer or 1 to enable it. Note, this only disables |
92 | writing to the ring buffer, the tracing overhead may | ||
93 | still be occurring. | ||
88 | 94 | ||
89 | trace: | 95 | trace: |
90 | 96 | ||
@@ -109,7 +115,15 @@ of ftrace. Here is a list of some of the key files: | |||
109 | 115 | ||
110 | This file lets the user control the amount of data | 116 | This file lets the user control the amount of data |
111 | that is displayed in one of the above output | 117 | that is displayed in one of the above output |
112 | files. | 118 | files. Options also exist to modify how a tracer |
119 | or events work (stack traces, timestamps, etc). | ||
120 | |||
121 | options: | ||
122 | |||
123 | This is a directory that has a file for every available | ||
124 | trace option (also in trace_options). Options may also be set | ||
125 | or cleared by writing a "1" or "0" respectively into the | ||
126 | corresponding file with the option name. | ||
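The option files described above are plain text: writing "1" enables an option and "0" disables it. As a minimal sketch of driving this from C (the helper name set_trace_option and the tracing directory argument are illustrative assumptions, not kernel API):

```c
#include <stdio.h>

/* Hypothetical helper: set or clear a trace option by writing "1" or
 * "0" into the corresponding file under the options directory.
 * tracing_dir would normally be /sys/kernel/debug/tracing (an
 * assumption about where debugfs is mounted).
 * Returns 0 on success, -1 on failure. */
static int set_trace_option(const char *tracing_dir, const char *option,
			    int enable)
{
	char path[512];
	FILE *f;

	snprintf(path, sizeof(path), "%s/options/%s", tracing_dir, option);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fputs(enable ? "1" : "0", f);
	return fclose(f) ? -1 : 0;
}
```

Calling set_trace_option("/sys/kernel/debug/tracing", "stacktrace", 1) would be equivalent to echo 1 > options/stacktrace.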
113 | 127 | ||
114 | tracing_max_latency: | 128 | tracing_max_latency: |
115 | 129 | ||
@@ -121,10 +135,17 @@ of ftrace. Here is a list of some of the key files: | |||
121 | latency is greater than the value in this | 135 | latency is greater than the value in this |
122 | file. (in microseconds) | 136 | file. (in microseconds) |
123 | 137 | ||
138 | tracing_thresh: | ||
139 | |||
140 | Some latency tracers will record a trace whenever the | ||
141 | latency is greater than the number in this file. | ||
142 | Only active when the file contains a number greater than 0. | ||
143 | (in microseconds) | ||
144 | |||
124 | buffer_size_kb: | 145 | buffer_size_kb: |
125 | 146 | ||
126 | This sets or displays the number of kilobytes each CPU | 147 | This sets or displays the number of kilobytes each CPU |
127 | buffer can hold. The tracer buffers are the same size | 148 | buffer holds. By default, the trace buffers are the same size |
128 | for each CPU. The displayed number is the size of the | 149 | for each CPU. The displayed number is the size of the |
129 | CPU buffer and not total size of all buffers. The | 150 | CPU buffer and not total size of all buffers. The |
130 | trace buffers are allocated in pages (blocks of memory | 151 | trace buffers are allocated in pages (blocks of memory |
@@ -133,16 +154,30 @@ of ftrace. Here is a list of some of the key files: | |||
133 | than requested, the rest of the page will be used, | 154 | than requested, the rest of the page will be used, |
134 | making the actual allocation bigger than requested. | 155 | making the actual allocation bigger than requested. |
135 | ( Note, the size may not be a multiple of the page size | 156 | ( Note, the size may not be a multiple of the page size |
136 | due to buffer management overhead. ) | 157 | due to buffer management meta-data. ) |
137 | 158 | ||
138 | This can only be updated when the current_tracer | 159 | buffer_total_size_kb: |
139 | is set to "nop". | 160 | |
161 | This displays the total combined size of all the trace buffers. | ||
162 | |||
163 | free_buffer: | ||
164 | |||
165 | If a process is performing the tracing, and the ring buffer | ||
166 | should be shrunk ("freed") when the process is finished, even | ||
167 | if it were to be killed by a signal, this file can be used | ||
168 | for that purpose. On close of this file, the ring buffer will | ||
169 | be resized to its minimum size. If the tracing process also | ||
170 | holds this file open, then when that process exits, its file | ||
171 | descriptor for this file will be closed, and in doing so, the | ||
172 | ring buffer will be "freed". | ||
173 | |||
174 | It may also stop tracing if disable_on_free option is set. | ||
140 | 175 | ||
141 | tracing_cpumask: | 176 | tracing_cpumask: |
142 | 177 | ||
143 | This is a mask that lets the user only trace | 178 | This is a mask that lets the user only trace |
144 | on specified CPUS. The format is a hex string | 179 | on specified CPUs. The format is a hex string |
145 | representing the CPUS. | 180 | representing the CPUs. |
146 | 181 | ||
147 | set_ftrace_filter: | 182 | set_ftrace_filter: |
148 | 183 | ||
@@ -183,6 +218,261 @@ of ftrace. Here is a list of some of the key files: | |||
183 | "set_ftrace_notrace". (See the section "dynamic ftrace" | 218 | "set_ftrace_notrace". (See the section "dynamic ftrace" |
184 | below for more details.) | 219 | below for more details.) |
185 | 220 | ||
221 | enabled_functions: | ||
222 | |||
223 | This file is more for debugging ftrace, but can also be useful | ||
224 | in seeing if any function has a callback attached to it. | ||
225 | Not only does the trace infrastructure use the ftrace function | ||
226 | tracing utility, but other subsystems might too. This file | ||
227 | displays all functions that have a callback attached to them | ||
228 | as well as the number of callbacks that have been attached. | ||
229 | Note, a callback may also call multiple functions which will | ||
230 | not be listed in this count. | ||
231 | |||
232 | If the callback registered to be traced by a function with | ||
233 | the "save regs" attribute (thus even more overhead), an 'R' | ||
234 | will be displayed on the same line as the function that | ||
235 | is returning registers. | ||
236 | |||
237 | function_profile_enabled: | ||
238 | |||
239 | When set it will enable all functions with either the function | ||
240 | tracer, or if enabled, the function graph tracer. It will | ||
241 | keep a histogram of the number of times functions were called | ||
242 | and if run with the function graph tracer, it will also keep | ||
243 | track of the time spent in those functions. The histogram | ||
244 | content can be displayed in the files: | ||
245 | |||
246 | trace_stats/function<cpu> ( function0, function1, etc). | ||
247 | |||
248 | trace_stats: | ||
249 | |||
250 | A directory that holds different tracing stats. | ||
251 | |||
252 | kprobe_events: | ||
253 | |||
254 | Enable dynamic trace points. See kprobetrace.txt. | ||
255 | |||
256 | kprobe_profile: | ||
257 | |||
258 | Dynamic trace points stats. See kprobetrace.txt. | ||
259 | |||
260 | max_graph_depth: | ||
261 | |||
262 | Used with the function graph tracer. This is the max depth | ||
263 | it will trace into a function. Setting this to a value of | ||
264 | one will show only the first kernel function that is called | ||
265 | from user space. | ||
266 | |||
267 | printk_formats: | ||
268 | |||
269 | This is for tools that read the raw format files. If an event in | ||
270 | the ring buffer references a string (currently only trace_printk() | ||
271 | does this), only a pointer to the string is recorded into the buffer | ||
272 | and not the string itself. This prevents tools from knowing what | ||
273 | that string was. This file displays the string and address for | ||
274 | the string allowing tools to map the pointers to what the | ||
275 | strings were. | ||
276 | |||
277 | saved_cmdlines: | ||
278 | |||
279 | Only the pid of the task is recorded in a trace event unless | ||
280 | the event specifically saves the task comm as well. Ftrace | ||
281 | makes a cache of pid mappings to comms to try to display | ||
282 | comms for events. If a pid for a comm is not listed, then | ||
283 | "<...>" is displayed in the output. | ||
284 | |||
285 | snapshot: | ||
286 | |||
287 | This displays the "snapshot" buffer and also lets the user | ||
288 | take a snapshot of the current running trace. | ||
289 | See the "Snapshot" section below for more details. | ||
290 | |||
291 | stack_max_size: | ||
292 | |||
293 | When the stack tracer is activated, this will display the | ||
294 | maximum stack size it has encountered. | ||
295 | See the "Stack Trace" section below. | ||
296 | |||
297 | stack_trace: | ||
298 | |||
299 | This displays the stack back trace of the largest stack | ||
300 | that was encountered when the stack tracer is activated. | ||
301 | See the "Stack Trace" section below. | ||
302 | |||
303 | stack_trace_filter: | ||
304 | |||
305 | This is similar to "set_ftrace_filter" but it limits what | ||
306 | functions the stack tracer will check. | ||
307 | |||
308 | trace_clock: | ||
309 | |||
310 | Whenever an event is recorded into the ring buffer, a | ||
311 | "timestamp" is added. This stamp comes from a specified | ||
312 | clock. By default, ftrace uses the "local" clock. This | ||
313 | clock is very fast and strictly per cpu, but on some | ||
314 | systems it may not be monotonic with respect to other | ||
315 | CPUs. In other words, the local clocks may not be in sync | ||
316 | with local clocks on other CPUs. | ||
317 | |||
318 | Usual clocks for tracing: | ||
319 | |||
320 | # cat trace_clock | ||
321 | [local] global counter x86-tsc | ||
322 | |||
323 | local: Default clock, but may not be in sync across CPUs | ||
324 | |||
325 | global: This clock is in sync with all CPUs but may | ||
326 | be a bit slower than the local clock. | ||
327 | |||
328 | counter: This is not a clock at all, but literally an atomic | ||
329 | counter. It counts up one by one, but is in sync | ||
330 | with all CPUs. This is useful when you need to | ||
331 | know exactly the order events occurred with respect to | ||
332 | each other on different CPUs. | ||
333 | |||
334 | uptime: This uses the jiffies counter and the time stamp | ||
335 | is relative to the time since boot up. | ||
336 | |||
337 | perf: This makes ftrace use the same clock that perf uses. | ||
338 | Eventually perf will be able to read ftrace buffers | ||
339 | and this will help out in interleaving the data. | ||
340 | |||
341 | x86-tsc: Architectures may define their own clocks. For | ||
342 | example, x86 uses its own TSC cycle clock here. | ||
343 | |||
344 | To set a clock, simply echo the clock name into this file. | ||
345 | |||
346 | echo global > trace_clock | ||
347 | |||
348 | trace_marker: | ||
349 | |||
350 | This is a very useful file for synchronizing user space | ||
351 | with events happening in the kernel. Strings written into | ||
352 | this file are recorded into the ftrace buffer. | ||
353 | |||
354 | It is useful in applications to open this file at the start | ||
355 | of the application and just reference the file descriptor | ||
356 | for the file. | ||
357 | |||
358 | void trace_write(const char *fmt, ...) | ||
359 | { | ||
360 | va_list ap; | ||
361 | char buf[256]; | ||
362 | int n; | ||
363 | |||
364 | if (trace_fd < 0) | ||
365 | return; | ||
366 | |||
367 | va_start(ap, fmt); | ||
368 | n = vsnprintf(buf, 256, fmt, ap); | ||
369 | va_end(ap); | ||
370 | |||
371 | write(trace_fd, buf, n); | ||
372 | } | ||
373 | |||
374 | start: | ||
375 | |||
376 | trace_fd = open("trace_marker", O_WRONLY); | ||
377 | |||
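The snippet above can be expanded into a self-contained sketch (note the open flag is O_WRONLY; the trace_marker path is passed in as a parameter here, since the debugfs mount point is an assumption about the system):

```c
#include <stdarg.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

static int trace_fd = -1;

/* Open the marker file once at program start; on a real system the
 * path would be /sys/kernel/debug/tracing/trace_marker. */
static void trace_open(const char *path)
{
	trace_fd = open(path, O_WRONLY);
}

/* printf-style annotation written straight into the ftrace buffer */
static void trace_write(const char *fmt, ...)
{
	va_list ap;
	char buf[256];
	int n;

	if (trace_fd < 0)
		return;
	va_start(ap, fmt);
	n = vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);
	write(trace_fd, buf, n);
}
```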
378 | uprobe_events: | ||
379 | |||
380 | Add dynamic tracepoints in programs. | ||
381 | See uprobetracer.txt | ||
382 | |||
383 | uprobe_profile: | ||
384 | |||
385 | Uprobe statistics. See uprobetracer.txt | ||
386 | |||
387 | instances: | ||
388 | |||
389 | This is a way to make multiple trace buffers where different | ||
390 | events can be recorded in different buffers. | ||
391 | See "Instances" section below. | ||
392 | |||
393 | events: | ||
394 | |||
395 | This is the trace event directory. It holds event tracepoints | ||
396 | (also known as static tracepoints) that have been compiled | ||
397 | into the kernel. It shows what event tracepoints exist | ||
398 | and how they are grouped by system. There are "enable" | ||
399 | files at various levels that can enable the tracepoints | ||
400 | when a "1" is written to them. | ||
401 | |||
402 | See events.txt for more information. | ||
403 | |||
404 | per_cpu: | ||
405 | |||
406 | This is a directory that contains the trace per_cpu information. | ||
407 | |||
408 | per_cpu/cpu0/buffer_size_kb: | ||
409 | |||
410 | The ftrace buffer is defined per_cpu. That is, there's a separate | ||
411 | buffer for each CPU to allow writes to be done atomically, | ||
412 | and free from cache bouncing. Each of these buffers may be | ||
413 | sized differently. This file is similar to the buffer_size_kb | ||
414 | file, but it only displays or sets the buffer size for the | ||
415 | specific CPU (here, cpu0). | ||
416 | |||
417 | per_cpu/cpu0/trace: | ||
418 | |||
419 | This is similar to the "trace" file, but it will only display | ||
420 | the data specific for the CPU. If written to, it only clears | ||
421 | the specific CPU buffer. | ||
422 | |||
423 | per_cpu/cpu0/trace_pipe | ||
424 | |||
425 | This is similar to the "trace_pipe" file, and is a consuming | ||
426 | read, but it will only display (and consume) the data specific | ||
427 | for the CPU. | ||
428 | |||
429 | per_cpu/cpu0/trace_pipe_raw | ||
430 | |||
431 | For tools that can parse the ftrace ring buffer binary format, | ||
432 | the trace_pipe_raw file can be used to extract the data | ||
433 | from the ring buffer directly. With the use of the splice() | ||
434 | system call, the buffer data can be quickly transferred to | ||
435 | a file or to the network where a server is collecting the | ||
436 | data. | ||
437 | |||
438 | Like trace_pipe, this is a consuming reader, where multiple | ||
439 | reads will always produce different data. | ||
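The splice() transfer described above requires a pipe as an intermediary, since one side of every splice() call must be a pipe. A hedged sketch (the function name splice_drain and the 4096-byte chunk size are illustrative assumptions; on a real system in_fd would be opened on per_cpu/cpuN/trace_pipe_raw):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Sketch: drain one fd into another via splice(), using a pipe as the
 * required intermediate. With in_fd opened on trace_pipe_raw, this
 * moves ring-buffer pages without a user-space bounce buffer.
 * Returns the total bytes transferred, or -1 on error. */
static long splice_drain(int in_fd, int out_fd)
{
	int p[2];
	long total = 0;
	ssize_t n;

	if (pipe(p) < 0)
		return -1;
	for (;;) {
		/* page-sized chunks; EOF (or empty buffer) returns 0 */
		n = splice(in_fd, NULL, p[1], NULL, 4096, SPLICE_F_MOVE);
		if (n <= 0)
			break;
		if (splice(p[0], NULL, out_fd, NULL, n, SPLICE_F_MOVE) < 0) {
			n = -1;
			break;
		}
		total += n;
	}
	close(p[0]);
	close(p[1]);
	return n < 0 ? -1 : total;
}
```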
440 | |||
441 | per_cpu/cpu0/snapshot: | ||
442 | |||
443 | This is similar to the main "snapshot" file, but will only | ||
444 | snapshot the current CPU (if supported). It only displays | ||
445 | the content of the snapshot for a given CPU, and if | ||
446 | written to, only clears this CPU buffer. | ||
447 | |||
448 | per_cpu/cpu0/snapshot_raw: | ||
449 | |||
450 | Similar to the trace_pipe_raw, but will read the binary format | ||
451 | from the snapshot buffer for the given CPU. | ||
452 | |||
453 | per_cpu/cpu0/stats: | ||
454 | |||
455 | This displays certain stats about the ring buffer: | ||
456 | |||
457 | entries: The number of events that are still in the buffer. | ||
458 | |||
459 | overrun: The number of lost events due to overwriting when | ||
460 | the buffer was full. | ||
461 | |||
462 | commit overrun: Should always be zero. | ||
463 | This gets set if so many events happened within a nested | ||
464 | event (ring buffer is re-entrant), that it fills the | ||
465 | buffer and starts dropping events. | ||
466 | |||
467 | bytes: Bytes actually read (not overwritten). | ||
468 | |||
469 | oldest event ts: The oldest timestamp in the buffer | ||
470 | |||
471 | now ts: The current timestamp | ||
472 | |||
473 | dropped events: Events lost due to overwrite option being off. | ||
474 | |||
475 | read events: The number of events read. | ||
186 | 476 | ||
187 | The Tracers | 477 | The Tracers |
188 | ----------- | 478 | ----------- |
@@ -234,11 +524,6 @@ Here is the list of current tracers that may be configured. | |||
234 | RT tasks (as the current "wakeup" does). This is useful | 524 | RT tasks (as the current "wakeup" does). This is useful |
235 | for those interested in wake up timings of RT tasks. | 525 | for those interested in wake up timings of RT tasks. |
236 | 526 | ||
237 | "hw-branch-tracer" | ||
238 | |||
239 | Uses the BTS CPU feature on x86 CPUs to traces all | ||
240 | branches executed. | ||
241 | |||
242 | "nop" | 527 | "nop" |
243 | 528 | ||
244 | This is the "trace nothing" tracer. To remove all | 529 | This is the "trace nothing" tracer. To remove all |
@@ -261,70 +546,100 @@ Here is an example of the output format of the file "trace" | |||
261 | -------- | 546 | -------- |
262 | # tracer: function | 547 | # tracer: function |
263 | # | 548 | # |
264 | # TASK-PID CPU# TIMESTAMP FUNCTION | 549 | # entries-in-buffer/entries-written: 140080/250280 #P:4 |
265 | # | | | | | | 550 | # |
266 | bash-4251 [01] 10152.583854: path_put <-path_walk | 551 | # _-----=> irqs-off |
267 | bash-4251 [01] 10152.583855: dput <-path_put | 552 | # / _----=> need-resched |
268 | bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput | 553 | # | / _---=> hardirq/softirq |
554 | # || / _--=> preempt-depth | ||
555 | # ||| / delay | ||
556 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | ||
557 | # | | | |||| | | | ||
558 | bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath | ||
559 | bash-1977 [000] .... 17284.993653: __close_fd <-sys_close | ||
560 | bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd | ||
561 | sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify | ||
562 | bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock | ||
563 | bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd | ||
564 | bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock | ||
565 | bash-1977 [000] .... 17284.993657: filp_close <-__close_fd | ||
566 | bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close | ||
567 | sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath | ||
269 | -------- | 568 | -------- |
270 | 569 | ||
271 | A header is printed with the tracer name that is represented by | 570 | A header is printed with the tracer name that is represented by |
272 | the trace. In this case the tracer is "function". Then a header | 571 | the trace. In this case the tracer is "function". Then it shows the |
273 | showing the format. Task name "bash", the task PID "4251", the | 572 | number of events in the buffer as well as the total number of entries |
274 | CPU that it was running on "01", the timestamp in <secs>.<usecs> | 573 | that were written. The difference is the number of entries that were |
275 | format, the function name that was traced "path_put" and the | 574 | lost due to the buffer filling up (250280 - 140080 = 110200 events |
276 | parent function that called this function "path_walk". The | 575 | lost). |
277 | timestamp is the time at which the function was entered. | 576 | |
577 | The header explains the content of the events. Task name "bash", the task | ||
578 | PID "1977", the CPU that it was running on "000", the latency format | ||
579 | (explained below), the timestamp in <secs>.<usecs> format, the | ||
580 | function name that was traced "sys_close" and the parent function that | ||
581 | called this function "system_call_fastpath". The timestamp is the time | ||
582 | at which the function was entered. | ||
278 | 583 | ||
279 | Latency trace format | 584 | Latency trace format |
280 | -------------------- | 585 | -------------------- |
281 | 586 | ||
282 | When the latency-format option is enabled, the trace file gives | 587 | When the latency-format option is enabled or when one of the latency |
283 | somewhat more information to see why a latency happened. | 588 | tracers is set, the trace file gives somewhat more information to see |
284 | Here is a typical trace. | 589 | why a latency happened. Here is a typical trace. |
285 | 590 | ||
286 | # tracer: irqsoff | 591 | # tracer: irqsoff |
287 | # | 592 | # |
288 | irqsoff latency trace v1.1.5 on 2.6.26-rc8 | 593 | # irqsoff latency trace v1.1.5 on 3.8.0-test+ |
289 | -------------------------------------------------------------------- | 594 | # -------------------------------------------------------------------- |
290 | latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 595 | # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
291 | ----------------- | 596 | # ----------------- |
292 | | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) | 597 | # | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0) |
293 | ----------------- | 598 | # ----------------- |
294 | => started at: apic_timer_interrupt | 599 | # => started at: __lock_task_sighand |
295 | => ended at: do_softirq | 600 | # => ended at: _raw_spin_unlock_irqrestore |
296 | 601 | # | |
297 | # _------=> CPU# | 602 | # |
298 | # / _-----=> irqs-off | 603 | # _------=> CPU# |
299 | # | / _----=> need-resched | 604 | # / _-----=> irqs-off |
300 | # || / _---=> hardirq/softirq | 605 | # | / _----=> need-resched |
301 | # ||| / _--=> preempt-depth | 606 | # || / _---=> hardirq/softirq |
302 | # |||| / | 607 | # ||| / _--=> preempt-depth |
303 | # ||||| delay | 608 | # |||| / delay |
304 | # cmd pid ||||| time | caller | 609 | # cmd pid ||||| time | caller |
305 | # \ / ||||| \ | / | 610 | # \ / ||||| \ | / |
306 | <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt) | 611 | ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand |
307 | <idle>-0 0d.s. 97us : __do_softirq (do_softirq) | 612 | ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore |
308 | <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) | 613 | ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore |
614 | ps-6143 2d..1 306us : <stack trace> | ||
615 | => trace_hardirqs_on_caller | ||
616 | => trace_hardirqs_on | ||
617 | => _raw_spin_unlock_irqrestore | ||
618 | => do_task_stat | ||
619 | => proc_tgid_stat | ||
620 | => proc_single_show | ||
621 | => seq_read | ||
622 | => vfs_read | ||
623 | => sys_read | ||
624 | => system_call_fastpath | ||
309 | 625 | ||
310 | 626 | ||
311 | This shows that the current tracer is "irqsoff" tracing the time | 627 | This shows that the current tracer is "irqsoff" tracing the time |
312 | for which interrupts were disabled. It gives the trace version | 628 | for which interrupts were disabled. It gives the trace version (which |
313 | and the version of the kernel upon which this was executed on | 629 | never changes) and the version of the kernel upon which this was executed on |
314 | (2.6.26-rc8). Then it displays the max latency in microsecs (97 | 630 | (3.10). Then it displays the max latency in microseconds (259 us). The number |
315 | us). The number of trace entries displayed and the total number | 631 | of trace entries displayed and the total number (both are four: #4/4). |
316 | recorded (both are three: #3/3). The type of preemption that was | 632 | VP, KP, SP, and HP are always zero and are reserved for later use. |
317 | used (PREEMPT). VP, KP, SP, and HP are always zero and are | 633 | #P is the number of online CPUs (#P:4). |
318 | reserved for later use. #P is the number of online CPUS (#P:2). | ||
319 | 634 | ||
320 | The task is the process that was running when the latency | 635 | The task is the process that was running when the latency |
321 | occurred. (swapper pid: 0). | 636 | occurred. (ps pid: 6143). |
322 | 637 | ||
323 | The start and stop (the functions in which the interrupts were | 638 | The start and stop (the functions in which the interrupts were |
324 | disabled and enabled respectively) that caused the latencies: | 639 | disabled and enabled respectively) that caused the latencies: |
325 | 640 | ||
326 | apic_timer_interrupt is where the interrupts were disabled. | 641 | __lock_task_sighand is where the interrupts were disabled. |
327 | do_softirq is where they were enabled again. | 642 | _raw_spin_unlock_irqrestore is where they were enabled again. |
328 | 643 | ||
329 | The next lines after the header are the trace itself. The header | 644 | The next lines after the header are the trace itself. The header |
330 | explains which is which. | 645 | explains which is which. |
@@ -367,16 +682,43 @@ The above is mostly meaningful for kernel developers. | |||
367 | 682 | ||
368 | The rest is the same as the 'trace' file. | 683 | The rest is the same as the 'trace' file. |
369 | 684 | ||
685 | Note, the latency tracers will usually end with a back trace | ||
686 | to easily find where the latency occurred. | ||
370 | 687 | ||
371 | trace_options | 688 | trace_options |
372 | ------------- | 689 | ------------- |
373 | 690 | ||
374 | The trace_options file is used to control what gets printed in | 691 | The trace_options file (or the options directory) is used to control |
375 | the trace output. To see what is available, simply cat the file: | 692 | what gets printed in the trace output, or to manipulate the tracers. |
693 | To see what is available, simply cat the file: | ||
376 | 694 | ||
377 | cat trace_options | 695 | cat trace_options |
378 | print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ | 696 | print-parent |
379 | noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj | 697 | nosym-offset |
698 | nosym-addr | ||
699 | noverbose | ||
700 | noraw | ||
701 | nohex | ||
702 | nobin | ||
703 | noblock | ||
704 | nostacktrace | ||
705 | trace_printk | ||
706 | noftrace_preempt | ||
707 | nobranch | ||
708 | annotate | ||
709 | nouserstacktrace | ||
710 | nosym-userobj | ||
711 | noprintk-msg-only | ||
712 | context-info | ||
713 | latency-format | ||
714 | sleep-time | ||
715 | graph-time | ||
716 | record-cmd | ||
717 | overwrite | ||
718 | nodisable_on_free | ||
719 | irq-info | ||
720 | markers | ||
721 | function-trace | ||
380 | 722 | ||
381 | To disable one of the options, echo in the option prepended with | 723 | To disable one of the options, echo in the option prepended with |
382 | "no". | 724 | "no". |
@@ -428,13 +770,34 @@ Here are the available options: | |||
428 | 770 | ||
429 | bin - This will print out the formats in raw binary. | 771 | bin - This will print out the formats in raw binary. |
430 | 772 | ||
431 | block - TBD (needs update) | 773 | block - When set, reading trace_pipe will not block when polled. |
432 | 774 | ||
433 | stacktrace - This is one of the options that changes the trace | 775 | stacktrace - This is one of the options that changes the trace |
434 | itself. When a trace is recorded, so is the stack | 776 | itself. When a trace is recorded, so is the stack |
435 | of functions. This allows for back traces of | 777 | of functions. This allows for back traces of |
436 | trace sites. | 778 | trace sites. |
437 | 779 | ||
780 | trace_printk - Can disable trace_printk() from writing into the buffer. | ||
781 | |||
782 | branch - Enable branch tracing with the tracer. | ||
783 | |||
784 | annotate - It is sometimes confusing when the CPU buffers are full | ||
785 | and one CPU buffer had a lot of events recently, thus | ||
786 | a shorter time frame, where another CPU may have only had | ||
787 | a few events, which lets it have older events. When | ||
788 | the trace is reported, it shows the oldest events first, | ||
789 | and it may look like only one CPU ran (the one with the | ||
790 | oldest events). When the annotate option is set, it will | ||
791 | display when a new CPU buffer started: | ||
792 | |||
793 | <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on | ||
794 | <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on | ||
795 | <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore | ||
796 | ##### CPU 2 buffer started #### | ||
797 | <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle | ||
798 | <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog | ||
799 | <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock | ||
800 | |||
438 | userstacktrace - This option changes the trace. It records a | 801 | userstacktrace - This option changes the trace. It records a |
439 | stacktrace of the current userspace thread. | 802 | stacktrace of the current userspace thread. |
440 | 803 | ||
@@ -451,9 +814,13 @@ Here are the available options: | |||
451 | a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 | 814 | a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 |
452 | x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] | 815 | x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] |
453 | 816 | ||
454 | sched-tree - trace all tasks that are on the runqueue, at | 817 | |
455 | every scheduling event. Will add overhead if | 818 | printk-msg-only - When set, trace_printk()s will only show the format |
456 | there's a lot of tasks running at once. | 819 | and not their parameters (if trace_bprintk() or |
820 | trace_bputs() was used to save the trace_printk()). | ||
821 | |||
822 | context-info - Show only the event data. Hides the comm, PID, | ||
823 | timestamp, CPU, and other useful data. | ||
457 | 824 | ||
458 | latency-format - This option changes the trace. When | 825 | latency-format - This option changes the trace. When |
459 | it is enabled, the trace displays | 826 | it is enabled, the trace displays |
@@ -461,31 +828,61 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] | |||
461 | latencies, as described in "Latency | 828 | latencies, as described in "Latency |
462 | trace format". | 829 | trace format". |
463 | 830 | ||
831 | sleep-time - When running the function graph tracer, include | ||
832 | the time a task is scheduled out in its function. | ||
833 | When enabled, it will account time the task has been | ||
834 | scheduled out as part of the function call. | ||
835 | |||
836 | graph-time - When running the function graph tracer, include the | ||
837 | time spent in nested function calls. When this is not | ||
838 | set, the time reported for a function includes only | ||
839 | the time the function itself executed, not the time | ||
840 | spent in the functions that it called. | ||
841 | |||
842 | record-cmd - When any event or tracer is enabled, a hook is set | ||
843 | in the sched_switch trace point to fill the comm cache | ||
844 | with mapped PIDs and comms. This may cause some | ||
845 | overhead; if you only care about PIDs and not the | ||
846 | names of the tasks, disabling this option can lower | ||
847 | the impact of tracing. | ||
848 | |||
464 | overwrite - This controls what happens when the trace buffer is | 849 | overwrite - This controls what happens when the trace buffer is |
465 | full. If "1" (default), the oldest events are | 850 | full. If "1" (default), the oldest events are |
466 | discarded and overwritten. If "0", then the newest | 851 | discarded and overwritten. If "0", then the newest |
467 | events are discarded. | 852 | events are discarded. |
853 | (see per_cpu/cpu0/stats for overrun and dropped) | ||
468 | 854 | ||
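The per-CPU statistics file mentioned above reports how many events were overwritten or lost. A hedged sketch for inspecting it after a run, assuming debugfs at /sys/kernel/debug (run as root):

```shell
# Check for lost events on CPU 0 after a tracing run.
cd /sys/kernel/debug/tracing
grep -E 'overrun|dropped' per_cpu/cpu0/stats
```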
469 | ftrace_enabled | 855 | disable_on_free - When the free_buffer is closed, tracing will |
470 | -------------- | 856 | stop (tracing_on set to 0). |
471 | 857 | ||
472 | The following tracers (listed below) give different output | 858 | irq-info - Shows the interrupt, preempt count, need resched data. |
473 | depending on whether or not the sysctl ftrace_enabled is set. To | 859 | When disabled, the trace looks like: |
474 | set ftrace_enabled, one can either use the sysctl function or | ||
475 | set it via the proc file system interface. | ||
476 | 860 | ||
477 | sysctl kernel.ftrace_enabled=1 | 861 | # tracer: function |
862 | # | ||
863 | # entries-in-buffer/entries-written: 144405/9452052 #P:4 | ||
864 | # | ||
865 | # TASK-PID CPU# TIMESTAMP FUNCTION | ||
866 | # | | | | | | ||
867 | <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up | ||
868 | <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89 | ||
869 | <idle>-0 [002] 23636.756055: enqueue_task <-activate_task | ||
478 | 870 | ||
479 | or | ||
480 | 871 | ||
481 | echo 1 > /proc/sys/kernel/ftrace_enabled | 872 | markers - When set, the trace_marker is writable (only by root). |
873 | When disabled, the trace_marker will error with EINVAL | ||
874 | on write. | ||
875 | |||
876 | |||
877 | function-trace - The latency tracers will enable function tracing | ||
878 | if this option is enabled (it is by default). When | ||
879 | it is disabled, the latency tracers do not trace | ||
880 | functions. This keeps the overhead of the tracer down | ||
881 | when performing latency tests. | ||
482 | 882 | ||
483 | To disable ftrace_enabled simply replace the '1' with '0' in the | 883 | Note: Some tracers have their own options. They only appear |
484 | above commands. | 884 | when the tracer is active. |
485 | 885 | ||
486 | When ftrace_enabled is set the tracers will also record the | ||
487 | functions that are within the trace. The descriptions of the | ||
488 | tracers will also show an example with ftrace enabled. | ||
489 | 886 | ||
490 | 887 | ||
491 | irqsoff | 888 | irqsoff |
@@ -506,95 +903,133 @@ new trace is saved. | |||
506 | To reset the maximum, echo 0 into tracing_max_latency. Here is | 903 | To reset the maximum, echo 0 into tracing_max_latency. Here is |
507 | an example: | 904 | an example: |
508 | 905 | ||
906 | # echo 0 > options/function-trace | ||
509 | # echo irqsoff > current_tracer | 907 | # echo irqsoff > current_tracer |
510 | # echo latency-format > trace_options | ||
511 | # echo 0 > tracing_max_latency | ||
512 | # echo 1 > tracing_on | 908 | # echo 1 > tracing_on |
909 | # echo 0 > tracing_max_latency | ||
513 | # ls -ltr | 910 | # ls -ltr |
514 | [...] | 911 | [...] |
515 | # echo 0 > tracing_on | 912 | # echo 0 > tracing_on |
516 | # cat trace | 913 | # cat trace |
517 | # tracer: irqsoff | 914 | # tracer: irqsoff |
518 | # | 915 | # |
519 | irqsoff latency trace v1.1.5 on 2.6.26 | 916 | # irqsoff latency trace v1.1.5 on 3.8.0-test+ |
520 | -------------------------------------------------------------------- | 917 | # -------------------------------------------------------------------- |
521 | latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 918 | # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
522 | ----------------- | 919 | # ----------------- |
523 | | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0) | 920 | # | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0) |
524 | ----------------- | 921 | # ----------------- |
525 | => started at: sys_setpgid | 922 | # => started at: run_timer_softirq |
526 | => ended at: sys_setpgid | 923 | # => ended at: run_timer_softirq |
527 | 924 | # | |
528 | # _------=> CPU# | 925 | # |
529 | # / _-----=> irqs-off | 926 | # _------=> CPU# |
530 | # | / _----=> need-resched | 927 | # / _-----=> irqs-off |
531 | # || / _---=> hardirq/softirq | 928 | # | / _----=> need-resched |
532 | # ||| / _--=> preempt-depth | 929 | # || / _---=> hardirq/softirq |
533 | # |||| / | 930 | # ||| / _--=> preempt-depth |
534 | # ||||| delay | 931 | # |||| / delay |
535 | # cmd pid ||||| time | caller | 932 | # cmd pid ||||| time | caller |
536 | # \ / ||||| \ | / | 933 | # \ / ||||| \ | / |
537 | bash-3730 1d... 0us : _write_lock_irq (sys_setpgid) | 934 | <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq |
538 | bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid) | 935 | <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq |
539 | bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid) | 936 | <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq |
540 | 937 | <idle>-0 0dNs3 25us : <stack trace> | |
541 | 938 | => _raw_spin_unlock_irq | |
542 | Here we see that that we had a latency of 12 microsecs (which is | 939 | => run_timer_softirq |
543 | very good). The _write_lock_irq in sys_setpgid disabled | 940 | => __do_softirq |
544 | interrupts. The difference between the 12 and the displayed | 941 | => call_softirq |
545 | timestamp 14us occurred because the clock was incremented | 942 | => do_softirq |
943 | => irq_exit | ||
944 | => smp_apic_timer_interrupt | ||
945 | => apic_timer_interrupt | ||
946 | => rcu_idle_exit | ||
947 | => cpu_idle | ||
948 | => rest_init | ||
949 | => start_kernel | ||
950 | => x86_64_start_reservations | ||
951 | => x86_64_start_kernel | ||
952 | |||
953 | Here we see that we had a latency of 16 microseconds (which is | ||
954 | very good). The _raw_spin_lock_irq in run_timer_softirq disabled | ||
955 | interrupts. The difference between the 16 and the displayed | ||
956 | timestamp 25us occurred because the clock was incremented | ||
546 | between the time of recording the max latency and the time of | 957 | between the time of recording the max latency and the time of |
547 | recording the function that had that latency. | 958 | recording the function that had that latency. |
548 | 959 | ||
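The compact field after the task name packs the CPU number and the latency flags described in "Latency trace format". A small awk sketch that splits one such value (taken from the trace above) into its components:

```shell
# Decode a latency-format flag field like "0dNs3": CPU number, irqs-off,
# need-resched, hardirq/softirq context, and preempt-depth, in that order.
echo '0dNs3' | awk '{
    printf "cpu=%s irqs-off=%s resched=%s ctx=%s depth=%s\n",
        substr($1,1,1), substr($1,2,1), substr($1,3,1),
        substr($1,4,1), substr($1,5,1)
}'
# → cpu=0 irqs-off=d resched=N ctx=s depth=3
```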
549 | Note the above example had ftrace_enabled not set. If we set the | 960 | Note the above example had function-trace not set. If we set |
550 | ftrace_enabled, we get a much larger output: | 961 | function-trace, we get a much larger output: |
962 | |||
963 | with echo 1 > options/function-trace | ||
551 | 964 | ||
552 | # tracer: irqsoff | 965 | # tracer: irqsoff |
553 | # | 966 | # |
554 | irqsoff latency trace v1.1.5 on 2.6.26-rc8 | 967 | # irqsoff latency trace v1.1.5 on 3.8.0-test+ |
555 | -------------------------------------------------------------------- | 968 | # -------------------------------------------------------------------- |
556 | latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 969 | # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
557 | ----------------- | 970 | # ----------------- |
558 | | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0) | 971 | # | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0) |
559 | ----------------- | 972 | # ----------------- |
560 | => started at: __alloc_pages_internal | 973 | # => started at: ata_scsi_queuecmd |
561 | => ended at: __alloc_pages_internal | 974 | # => ended at: ata_scsi_queuecmd |
562 | 975 | # | |
563 | # _------=> CPU# | 976 | # |
564 | # / _-----=> irqs-off | 977 | # _------=> CPU# |
565 | # | / _----=> need-resched | 978 | # / _-----=> irqs-off |
566 | # || / _---=> hardirq/softirq | 979 | # | / _----=> need-resched |
567 | # ||| / _--=> preempt-depth | 980 | # || / _---=> hardirq/softirq |
568 | # |||| / | 981 | # ||| / _--=> preempt-depth |
569 | # ||||| delay | 982 | # |||| / delay |
570 | # cmd pid ||||| time | caller | 983 | # cmd pid ||||| time | caller |
571 | # \ / ||||| \ | / | 984 | # \ / ||||| \ | / |
572 | ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal) | 985 | bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd |
573 | ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist) | 986 | bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave |
574 | ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk) | 987 | bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd |
575 | ls-4339 0d..1 4us : add_preempt_count (_spin_lock) | 988 | bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev |
576 | ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk) | 989 | bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev |
577 | ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue) | 990 | bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd |
578 | ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest) | 991 | bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd |
579 | ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk) | 992 | bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd |
580 | ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue) | 993 | bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat |
581 | ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest) | ||
582 | ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk) | ||
583 | ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue) | ||
584 | [...] | 994 | [...] |
585 | ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue) | 995 | bash-2042 3d..1 67us : delay_tsc <-__delay |
586 | ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest) | 996 | bash-2042 3d..1 67us : add_preempt_count <-delay_tsc |
587 | ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk) | 997 | bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc |
588 | ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue) | 998 | bash-2042 3d..1 67us : add_preempt_count <-delay_tsc |
589 | ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest) | 999 | bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc |
590 | ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk) | 1000 | bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue |
591 | ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock) | 1001 | bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd |
592 | ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal) | 1002 | bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd |
593 | ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) | 1003 | bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd |
594 | 1004 | bash-2042 3d..1 120us : <stack trace> | |
595 | 1005 | => _raw_spin_unlock_irqrestore | |
596 | 1006 | => ata_scsi_queuecmd | |
597 | Here we traced a 50 microsecond latency. But we also see all the | 1007 | => scsi_dispatch_cmd |
1008 | => scsi_request_fn | ||
1009 | => __blk_run_queue_uncond | ||
1010 | => __blk_run_queue | ||
1011 | => blk_queue_bio | ||
1012 | => generic_make_request | ||
1013 | => submit_bio | ||
1014 | => submit_bh | ||
1015 | => __ext3_get_inode_loc | ||
1016 | => ext3_iget | ||
1017 | => ext3_lookup | ||
1018 | => lookup_real | ||
1019 | => __lookup_hash | ||
1020 | => walk_component | ||
1021 | => lookup_last | ||
1022 | => path_lookupat | ||
1023 | => filename_lookup | ||
1024 | => user_path_at_empty | ||
1025 | => user_path_at | ||
1026 | => vfs_fstatat | ||
1027 | => vfs_stat | ||
1028 | => sys_newstat | ||
1029 | => system_call_fastpath | ||
1030 | |||
1031 | |||
1032 | Here we traced a 71 microsecond latency. But we also see all the | ||
598 | functions that were called during that time. Note that by | 1033 | functions that were called during that time. Note that by |
599 | enabling function tracing, we incur an added overhead. This | 1034 | enabling function tracing, we incur an added overhead. This |
600 | overhead may extend the latency times. But nevertheless, this | 1035 | overhead may extend the latency times. But nevertheless, this |
@@ -614,120 +1049,122 @@ Like the irqsoff tracer, it records the maximum latency for | |||
614 | which preemption was disabled. The control of preemptoff tracer | 1049 | which preemption was disabled. The control of preemptoff tracer |
615 | is much like the irqsoff tracer. | 1050 | is much like the irqsoff tracer. |
616 | 1051 | ||
1052 | # echo 0 > options/function-trace | ||
617 | # echo preemptoff > current_tracer | 1053 | # echo preemptoff > current_tracer |
618 | # echo latency-format > trace_options | ||
619 | # echo 0 > tracing_max_latency | ||
620 | # echo 1 > tracing_on | 1054 | # echo 1 > tracing_on |
1055 | # echo 0 > tracing_max_latency | ||
621 | # ls -ltr | 1056 | # ls -ltr |
622 | [...] | 1057 | [...] |
623 | # echo 0 > tracing_on | 1058 | # echo 0 > tracing_on |
624 | # cat trace | 1059 | # cat trace |
625 | # tracer: preemptoff | 1060 | # tracer: preemptoff |
626 | # | 1061 | # |
627 | preemptoff latency trace v1.1.5 on 2.6.26-rc8 | 1062 | # preemptoff latency trace v1.1.5 on 3.8.0-test+ |
628 | -------------------------------------------------------------------- | 1063 | # -------------------------------------------------------------------- |
629 | latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 1064 | # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
630 | ----------------- | 1065 | # ----------------- |
631 | | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) | 1066 | # | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0) |
632 | ----------------- | 1067 | # ----------------- |
633 | => started at: do_IRQ | 1068 | # => started at: do_IRQ |
634 | => ended at: __do_softirq | 1069 | # => ended at: do_IRQ |
635 | 1070 | # | |
636 | # _------=> CPU# | 1071 | # |
637 | # / _-----=> irqs-off | 1072 | # _------=> CPU# |
638 | # | / _----=> need-resched | 1073 | # / _-----=> irqs-off |
639 | # || / _---=> hardirq/softirq | 1074 | # | / _----=> need-resched |
640 | # ||| / _--=> preempt-depth | 1075 | # || / _---=> hardirq/softirq |
641 | # |||| / | 1076 | # ||| / _--=> preempt-depth |
642 | # ||||| delay | 1077 | # |||| / delay |
643 | # cmd pid ||||| time | caller | 1078 | # cmd pid ||||| time | caller |
644 | # \ / ||||| \ | / | 1079 | # \ / ||||| \ | / |
645 | sshd-4261 0d.h. 0us+: irq_enter (do_IRQ) | 1080 | sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ |
646 | sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq) | 1081 | sshd-1991 1d..1 46us : irq_exit <-do_IRQ |
647 | sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) | 1082 | sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ |
1083 | sshd-1991 1d..1 52us : <stack trace> | ||
1084 | => sub_preempt_count | ||
1085 | => irq_exit | ||
1086 | => do_IRQ | ||
1087 | => ret_from_intr | ||
648 | 1088 | ||
649 | 1089 | ||
650 | This has some more changes. Preemption was disabled when an | 1090 | This has some more changes. Preemption was disabled when an |
651 | interrupt came in (notice the 'h'), and was enabled while doing | 1091 | interrupt came in (notice the 'h'), and was enabled on exit. |
652 | a softirq. (notice the 's'). But we also see that interrupts | 1092 | But we also see that interrupts have been disabled when entering |
653 | have been disabled when entering the preempt off section and | 1093 | the preempt off section and leaving it (the 'd'). We do not know if |
654 | leaving it (the 'd'). We do not know if interrupts were enabled | 1094 | interrupts were enabled in the meantime or shortly after this |
655 | in the mean time. | 1095 | was over. |
656 | 1096 | ||
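The '+' annotation after a timestamp flags a large gap before the next event; the gaps themselves can be recomputed from the time column. A hedged awk sketch, assuming the timestamp is the third whitespace-separated field (sample lines from the trace above):

```shell
# Compute inter-event deltas from the "time" column of a latency trace.
printf '%s\n' \
  'sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ' \
  'sshd-1991 1d..1 46us : irq_exit <-do_IRQ' \
  'sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ' |
awk '{ gsub(/[^0-9].*/, "", $3)       # strip "us", "+", ":" suffixes
       if (NR > 1) print $3 - prev "us"
       prev = $3 }'
# → 46us
# → 1us
```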
657 | # tracer: preemptoff | 1097 | # tracer: preemptoff |
658 | # | 1098 | # |
659 | preemptoff latency trace v1.1.5 on 2.6.26-rc8 | 1099 | # preemptoff latency trace v1.1.5 on 3.8.0-test+ |
660 | -------------------------------------------------------------------- | 1100 | # -------------------------------------------------------------------- |
661 | latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 1101 | # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
662 | ----------------- | 1102 | # ----------------- |
663 | | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) | 1103 | # | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0) |
664 | ----------------- | 1104 | # ----------------- |
665 | => started at: remove_wait_queue | 1105 | # => started at: wake_up_new_task |
666 | => ended at: __do_softirq | 1106 | # => ended at: task_rq_unlock |
667 | 1107 | # | |
668 | # _------=> CPU# | 1108 | # |
669 | # / _-----=> irqs-off | 1109 | # _------=> CPU# |
670 | # | / _----=> need-resched | 1110 | # / _-----=> irqs-off |
671 | # || / _---=> hardirq/softirq | 1111 | # | / _----=> need-resched |
672 | # ||| / _--=> preempt-depth | 1112 | # || / _---=> hardirq/softirq |
673 | # |||| / | 1113 | # ||| / _--=> preempt-depth |
674 | # ||||| delay | 1114 | # |||| / delay |
675 | # cmd pid ||||| time | caller | 1115 | # cmd pid ||||| time | caller |
676 | # \ / ||||| \ | / | 1116 | # \ / ||||| \ | / |
677 | sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue) | 1117 | bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task |
678 | sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue) | 1118 | bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq |
679 | sshd-4261 0d..1 2us : do_IRQ (common_interrupt) | 1119 | bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair |
680 | sshd-4261 0d..1 2us : irq_enter (do_IRQ) | 1120 | bash-1994 1d..1 1us : source_load <-select_task_rq_fair |
681 | sshd-4261 0d..1 2us : idle_cpu (irq_enter) | 1121 | bash-1994 1d..1 1us : source_load <-select_task_rq_fair |
682 | sshd-4261 0d..1 3us : add_preempt_count (irq_enter) | ||
683 | sshd-4261 0d.h1 3us : idle_cpu (irq_enter) | ||
684 | sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ) | ||
685 | [...] | 1122 | [...] |
686 | sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock) | 1123 | bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt |
687 | sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq) | 1124 | bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter |
688 | sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq) | 1125 | bash-1994 1d..1 13us : add_preempt_count <-irq_enter |
689 | sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq) | 1126 | bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt |
690 | sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock) | 1127 | bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt |
691 | sshd-4261 0d.h1 14us : irq_exit (do_IRQ) | 1128 | bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt |
692 | sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit) | 1129 | bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock |
693 | sshd-4261 0d..2 15us : do_softirq (irq_exit) | 1130 | bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt |
694 | sshd-4261 0d... 15us : __do_softirq (do_softirq) | ||
695 | sshd-4261 0d... 16us : __local_bh_disable (__do_softirq) | ||
696 | sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable) | ||
697 | sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable) | ||
698 | sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable) | ||
699 | sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable) | ||
700 | [...] | 1131 | [...] |
701 | sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable) | 1132 | bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event |
702 | sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable) | 1133 | bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt |
703 | sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable) | 1134 | bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit |
704 | sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable) | 1135 | bash-1994 1d..2 36us : do_softirq <-irq_exit |
705 | sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip) | 1136 | bash-1994 1d..2 36us : __do_softirq <-call_softirq |
706 | sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip) | 1137 | bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq |
707 | sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable) | 1138 | bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq |
708 | sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable) | 1139 | bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq |
1140 | bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock | ||
1141 | bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq | ||
709 | [...] | 1142 | [...] |
710 | sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq) | 1143 | bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks |
711 | sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) | 1144 | bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq |
1145 | bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable | ||
1146 | bash-1994 1dN.2 82us : idle_cpu <-irq_exit | ||
1147 | bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit | ||
1148 | bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit | ||
1149 | bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock | ||
1150 | bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock | ||
1151 | bash-1994 1.N.1 104us : <stack trace> | ||
1152 | => sub_preempt_count | ||
1153 | => _raw_spin_unlock_irqrestore | ||
1154 | => task_rq_unlock | ||
1155 | => wake_up_new_task | ||
1156 | => do_fork | ||
1157 | => sys_clone | ||
1158 | => stub_clone | ||
712 | 1159 | ||
713 | 1160 | ||
714 | The above is an example of the preemptoff trace with | 1161 | The above is an example of the preemptoff trace with |
715 | ftrace_enabled set. Here we see that interrupts were disabled | 1162 | function-trace set. Here we see that interrupts were not disabled |
716 | the entire time. The irq_enter code lets us know that we entered | 1163 | the entire time. The irq_enter code lets us know that we entered |
717 | an interrupt 'h'. Before that, the functions being traced still | 1164 | an interrupt 'h'. Before that, the functions being traced still |
718 | show that it is not in an interrupt, but we can see from the | 1165 | show that it is not in an interrupt, but we can see from the |
719 | functions themselves that this is not the case. | 1166 | functions themselves that this is not the case. |
720 | 1167 | ||
721 | Notice that __do_softirq when called does not have a | ||
722 | preempt_count. It may seem that we missed a preempt enabling. | ||
723 | What really happened is that the preempt count is held on the | ||
724 | thread's stack and we switched to the softirq stack (4K stacks | ||
725 | in effect). The code does not copy the preempt count, but | ||
726 | because interrupts are disabled, we do not need to worry about | ||
727 | it. Having a tracer like this is good for letting people know | ||
728 | what really happens inside the kernel. | ||
729 | |||
730 | |||
731 | preemptirqsoff | 1168 | preemptirqsoff |
732 | -------------- | 1169 | -------------- |
733 | 1170 | ||
@@ -762,38 +1199,57 @@ tracer. | |||
762 | Again, using this trace is much like the irqsoff and preemptoff | 1199 | Again, using this trace is much like the irqsoff and preemptoff |
763 | tracers. | 1200 | tracers. |
764 | 1201 | ||
1202 | # echo 0 > options/function-trace | ||
765 | # echo preemptirqsoff > current_tracer | 1203 | # echo preemptirqsoff > current_tracer |
766 | # echo latency-format > trace_options | ||
767 | # echo 0 > tracing_max_latency | ||
768 | # echo 1 > tracing_on | 1204 | # echo 1 > tracing_on |
1205 | # echo 0 > tracing_max_latency | ||
769 | # ls -ltr | 1206 | # ls -ltr |
770 | [...] | 1207 | [...] |
771 | # echo 0 > tracing_on | 1208 | # echo 0 > tracing_on |
772 | # cat trace | 1209 | # cat trace |
773 | # tracer: preemptirqsoff | 1210 | # tracer: preemptirqsoff |
774 | # | 1211 | # |
775 | preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 | 1212 | # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ |
776 | -------------------------------------------------------------------- | 1213 | # -------------------------------------------------------------------- |
777 | latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 1214 | # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
778 | ----------------- | 1215 | # ----------------- |
779 | | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0) | 1216 | # | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0) |
780 | ----------------- | 1217 | # ----------------- |
781 | => started at: apic_timer_interrupt | 1218 | # => started at: ata_scsi_queuecmd |
782 | => ended at: __do_softirq | 1219 | # => ended at: ata_scsi_queuecmd |
783 | 1220 | # | |
784 | # _------=> CPU# | 1221 | # |
785 | # / _-----=> irqs-off | 1222 | # _------=> CPU# |
786 | # | / _----=> need-resched | 1223 | # / _-----=> irqs-off |
787 | # || / _---=> hardirq/softirq | 1224 | # | / _----=> need-resched |
788 | # ||| / _--=> preempt-depth | 1225 | # || / _---=> hardirq/softirq |
789 | # |||| / | 1226 | # ||| / _--=> preempt-depth |
790 | # ||||| delay | 1227 | # |||| / delay |
791 | # cmd pid ||||| time | caller | 1228 | # cmd pid ||||| time | caller |
792 | # \ / ||||| \ | / | 1229 | # \ / ||||| \ | / |
793 | ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt) | 1230 | ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd |
794 | ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq) | 1231 | ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd |
795 | ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) | 1232 | ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd |
796 | 1233 | ls-2230 3...1 111us : <stack trace> | |
1234 | => sub_preempt_count | ||
1235 | => _raw_spin_unlock_irqrestore | ||
1236 | => ata_scsi_queuecmd | ||
1237 | => scsi_dispatch_cmd | ||
1238 | => scsi_request_fn | ||
1239 | => __blk_run_queue_uncond | ||
1240 | => __blk_run_queue | ||
1241 | => blk_queue_bio | ||
1242 | => generic_make_request | ||
1243 | => submit_bio | ||
1244 | => submit_bh | ||
1245 | => ext3_bread | ||
1246 | => ext3_dir_bread | ||
1247 | => htree_dirblock_to_tree | ||
1248 | => ext3_htree_fill_tree | ||
1249 | => ext3_readdir | ||
1250 | => vfs_readdir | ||
1251 | => sys_getdents | ||
1252 | => system_call_fastpath | ||
797 | 1253 | ||
798 | 1254 | ||
799 | The trace_hardirqs_off_thunk is called from assembly on x86 when | 1255 | The trace_hardirqs_off_thunk is called from assembly on x86 when |
@@ -802,105 +1258,158 @@ function tracing, we do not know if interrupts were enabled | |||
802 | within the preemption points. We do see that it started with | 1258 | within the preemption points. We do see that it started with |
803 | preemption enabled. | 1259 | preemption enabled. |
804 | 1260 | ||
805 | Here is a trace with ftrace_enabled set: | 1261 | Here is a trace with function-trace set: |
806 | |||
807 | 1262 | ||
808 | # tracer: preemptirqsoff | 1263 | # tracer: preemptirqsoff |
809 | # | 1264 | # |
810 | preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 | 1265 | # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ |
811 | -------------------------------------------------------------------- | 1266 | # -------------------------------------------------------------------- |
812 | latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 1267 | # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
813 | ----------------- | 1268 | # ----------------- |
814 | | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) | 1269 | # | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0) |
815 | ----------------- | 1270 | # ----------------- |
816 | => started at: write_chan | 1271 | # => started at: schedule |
817 | => ended at: __do_softirq | 1272 | # => ended at: mutex_unlock |
818 | 1273 | # | |
819 | # _------=> CPU# | 1274 | # |
820 | # / _-----=> irqs-off | 1275 | # _------=> CPU# |
821 | # | / _----=> need-resched | 1276 | # / _-----=> irqs-off |
822 | # || / _---=> hardirq/softirq | 1277 | # | / _----=> need-resched |
823 | # ||| / _--=> preempt-depth | 1278 | # || / _---=> hardirq/softirq |
824 | # |||| / | 1279 | # ||| / _--=> preempt-depth |
825 | # ||||| delay | 1280 | # |||| / delay |
826 | # cmd pid ||||| time | caller | 1281 | # cmd pid ||||| time | caller |
827 | # \ / ||||| \ | / | 1282 | # \ / ||||| \ | / |
828 | ls-4473 0.N.. 0us : preempt_schedule (write_chan) | 1283 | kworker/-59 3...1 0us : __schedule <-schedule |
829 | ls-4473 0dN.1 1us : _spin_lock (schedule) | 1284 | kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch |
830 | ls-4473 0dN.1 2us : add_preempt_count (_spin_lock) | 1285 | kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq |
831 | ls-4473 0d..2 2us : put_prev_task_fair (schedule) | 1286 | kworker/-59 3d..2 1us : deactivate_task <-__schedule |
832 | [...] | 1287 | kworker/-59 3d..2 1us : dequeue_task <-deactivate_task |
833 | ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts) | 1288 | kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task |
834 | ls-4473 0d..2 13us : __switch_to (schedule) | 1289 | kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task |
835 | sshd-4261 0d..2 14us : finish_task_switch (schedule) | 1290 | kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair |
836 | sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch) | 1291 | kworker/-59 3d..2 2us : update_min_vruntime <-update_curr |
837 | sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave) | 1292 | kworker/-59 3d..2 3us : cpuacct_charge <-update_curr |
838 | sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set) | 1293 | kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge |
839 | sshd-4261 0d..2 16us : do_IRQ (common_interrupt) | 1294 | kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge |
840 | sshd-4261 0d..2 17us : irq_enter (do_IRQ) | 1295 | kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair |
841 | sshd-4261 0d..2 17us : idle_cpu (irq_enter) | 1296 | kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair |
842 | sshd-4261 0d..2 18us : add_preempt_count (irq_enter) | 1297 | kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair |
843 | sshd-4261 0d.h2 18us : idle_cpu (irq_enter) | 1298 | kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair |
844 | sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ) | 1299 | kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair |
845 | sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq) | 1300 | kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair |
846 | sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock) | 1301 | kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule |
847 | sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq) | 1302 | kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping |
848 | sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock) | 1303 | kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule |
849 | [...] | 1304 | kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task |
850 | sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq) | 1305 | kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair |
851 | sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock) | 1306 | kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair |
852 | sshd-4261 0d.h2 29us : irq_exit (do_IRQ) | 1307 | kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity |
853 | sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit) | 1308 | ls-2269 3d..2 7us : finish_task_switch <-__schedule |
854 | sshd-4261 0d..3 30us : do_softirq (irq_exit) | 1309 | ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch |
855 | sshd-4261 0d... 30us : __do_softirq (do_softirq) | 1310 | ls-2269 3d..2 8us : do_IRQ <-ret_from_intr |
856 | sshd-4261 0d... 31us : __local_bh_disable (__do_softirq) | 1311 | ls-2269 3d..2 8us : irq_enter <-do_IRQ |
857 | sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable) | 1312 | ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter |
858 | sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable) | 1313 | ls-2269 3d..2 9us : add_preempt_count <-irq_enter |
1314 | ls-2269 3d.h2 9us : exit_idle <-do_IRQ | ||
859 | [...] | 1315 | [...] |
860 | sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip) | 1316 | ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock |
861 | sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip) | 1317 | ls-2269 3d.h2 20us : irq_exit <-do_IRQ |
862 | sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt) | 1318 | ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit |
863 | sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt) | 1319 | ls-2269 3d..3 21us : do_softirq <-irq_exit |
864 | sshd-4261 0d.s3 45us : idle_cpu (irq_enter) | 1320 | ls-2269 3d..3 21us : __do_softirq <-call_softirq |
865 | sshd-4261 0d.s3 46us : add_preempt_count (irq_enter) | 1321 | ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq |
866 | sshd-4261 0d.H3 46us : idle_cpu (irq_enter) | 1322 | ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip |
867 | sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt) | 1323 | ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip |
868 | sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt) | 1324 | ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr |
1325 | ls-2269 3d.s5 31us : irq_enter <-do_IRQ | ||
1326 | ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter | ||
869 | [...] | 1327 | [...] |
870 | sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt) | 1328 | ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter |
871 | sshd-4261 0d.H3 82us : ktime_get (tick_program_event) | 1329 | ls-2269 3d.s5 32us : add_preempt_count <-irq_enter |
872 | sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get) | 1330 | ls-2269 3d.H5 32us : exit_idle <-do_IRQ |
873 | sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts) | 1331 | ls-2269 3d.H5 32us : handle_irq <-do_IRQ |
874 | sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts) | 1332 | ls-2269 3d.H5 32us : irq_to_desc <-handle_irq |
875 | sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event) | 1333 | ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq |
876 | sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event) | ||
877 | sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt) | ||
878 | sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit) | ||
879 | sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit) | ||
880 | sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable) | ||
881 | [...] | 1334 | [...] |
882 | sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action) | 1335 | ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll |
883 | sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq) | 1336 | ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action |
884 | sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq) | 1337 | ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq |
885 | sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq) | 1338 | ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable |
886 | sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable) | 1339 | ls-2269 3d..3 159us : idle_cpu <-irq_exit |
887 | sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq) | 1340 | ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit |
888 | sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) | 1341 | ls-2269 3d..3 160us : sub_preempt_count <-irq_exit |
889 | 1342 | ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock | |
890 | 1343 | ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock | |
891 | This is a very interesting trace. It started with the preemption | 1344 | ls-2269 3d... 186us : <stack trace> |
892 | of the ls task. We see that the task had the "need_resched" bit | 1345 | => __mutex_unlock_slowpath |
893 | set via the 'N' in the trace. Interrupts were disabled before | 1346 | => mutex_unlock |
894 | the spin_lock at the beginning of the trace. We see that a | 1347 | => process_output |
895 | schedule took place to run sshd. When the interrupts were | 1348 | => n_tty_write |
896 | enabled, we took an interrupt. On return from the interrupt | 1349 | => tty_write |
897 | handler, the softirq ran. We took another interrupt while | 1350 | => vfs_write |
898 | running the softirq as we see from the capital 'H'. | 1351 | => sys_write |
1352 | => system_call_fastpath | ||
1353 | |||
1354 | This is an interesting trace. It started with kworker running and | ||
1355 | scheduling out and ls taking over. But as soon as ls released the | ||
1356 | rq lock and enabled interrupts (but not preemption), an interrupt | ||
1357 | triggered. When the interrupt finished, it started running softirqs. | ||
1358 | But while the softirq was running, another interrupt triggered. | ||
1359 | When an interrupt is running inside a softirq, the annotation is 'H'. | ||
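The five-character field after the task name (e.g. "3d.H5") packs the latency-format columns described in the trace header: CPU#, irqs-off, need-resched, hardirq/softirq, preempt-depth. As an illustration only, here is a small shell helper (the function name is made up, not part of ftrace) that splits one of these fields:

```shell
# decode_latency_flags: hypothetical helper that splits the
# latency-format field printed in these traces, e.g. "3d.H5".
# Column order, per the trace header: CPU#, irqs-off,
# need-resched, hardirq/softirq, preempt-depth.
decode_latency_flags() {
    f=$1
    printf 'cpu=%s irqs-off=%s resched=%s ctx=%s depth=%s\n' \
        "$(printf %s "$f" | cut -c1)" \
        "$(printf %s "$f" | cut -c2)" \
        "$(printf %s "$f" | cut -c3)" \
        "$(printf %s "$f" | cut -c4)" \
        "$(printf %s "$f" | cut -c5)"
}

decode_latency_flags 3d.H5
```

Running it on the field from the "exit_idle" line above reports CPU 3, interrupts disabled ('d'), no resched pending ('.'), inside a hard interrupt nested in a softirq ('H'), preempt depth 5.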
899 | 1360 | ||
900 | 1361 | ||
901 | wakeup | 1362 | wakeup |
902 | ------ | 1363 | ------ |
903 | 1364 | ||
1365 | One common case that people are interested in tracing is the | ||
1366 | time it takes for a task that is woken to actually start | ||
1367 | running. For non Real-Time tasks this can be arbitrary, but | ||
1368 | tracing it is nonetheless interesting. | ||
1369 | |||
1370 | Without function tracing: | ||
1371 | |||
1372 | # echo 0 > options/function-trace | ||
1373 | # echo wakeup > current_tracer | ||
1374 | # echo 1 > tracing_on | ||
1375 | # echo 0 > tracing_max_latency | ||
1376 | # chrt -f 5 sleep 1 | ||
1377 | # echo 0 > tracing_on | ||
1378 | # cat trace | ||
1379 | # tracer: wakeup | ||
1380 | # | ||
1381 | # wakeup latency trace v1.1.5 on 3.8.0-test+ | ||
1382 | # -------------------------------------------------------------------- | ||
1383 | # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) | ||
1384 | # ----------------- | ||
1385 | # | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0) | ||
1386 | # ----------------- | ||
1387 | # | ||
1388 | # _------=> CPU# | ||
1389 | # / _-----=> irqs-off | ||
1390 | # | / _----=> need-resched | ||
1391 | # || / _---=> hardirq/softirq | ||
1392 | # ||| / _--=> preempt-depth | ||
1393 | # |||| / delay | ||
1394 | # cmd pid ||||| time | caller | ||
1395 | # \ / ||||| \ | / | ||
1396 | <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H | ||
1397 | <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up | ||
1398 | <idle>-0 3d..3 15us : __schedule <-schedule | ||
1399 | <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H | ||
1400 | |||
1401 | The tracer only traces the highest priority task in the system | ||
1402 | to avoid recording ordinary, uninteresting wakeups. Here we see | ||
1403 | that the kworker, with a nice value of -20 (not very nice), took | ||
1404 | just 15 microseconds from the time it woke up to the time it | ||
1405 | ran. | ||
1406 | |||
1407 | Non Real-Time tasks are not that interesting. A more interesting | ||
1408 | trace is to concentrate only on Real-Time tasks. | ||
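The reported "latency: 15 us" in the header lines up with the timestamp of the final context-switch ('==>') line, measured from the wakeup ('+') line at 0us. A small awk sketch (sample lines copied from the trace above) recomputing that number:

```shell
# Recompute the wakeup latency from trace text: the '+' line is
# the wakeup at 0us, the '==>' line is the context switch, and
# its timestamp is the latency the header reports.
trace='<idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H
<idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H'

echo "$trace" | awk '/==>/ { sub(/us/, "", $3); print $3 "us wakeup latency" }'
```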
1409 | |||
1410 | wakeup_rt | ||
1411 | --------- | ||
1412 | |||
904 | In a Real-Time environment it is very important to know the | 1413 | In a Real-Time environment it is very important to know the |
905 | wakeup time it takes for the highest priority task that is woken | 1414 | wakeup time it takes for the highest priority task that is woken |
906 | up to the time that it executes. This is also known as "schedule | 1415 | up to the time that it executes. This is also known as "schedule |
@@ -914,124 +1423,229 @@ Real-Time environments are interested in the worst case latency. | |||
914 | That is the longest latency it takes for something to happen, | 1423 | That is the longest latency it takes for something to happen, |
915 | and not the average. We can have a very fast scheduler that may | 1424 | and not the average. We can have a very fast scheduler that may |
916 | only have a large latency once in a while, but that would not | 1425 | only have a large latency once in a while, but that would not |
917 | work well with Real-Time tasks. The wakeup tracer was designed | 1426 | work well with Real-Time tasks. The wakeup_rt tracer was designed |
918 | to record the worst case wakeups of RT tasks. Non-RT tasks are | 1427 | to record the worst case wakeups of RT tasks. Non-RT tasks are |
919 | not recorded because the tracer only records one worst case and | 1428 | not recorded because the tracer only records one worst case and |
920 | tracing non-RT tasks that are unpredictable will overwrite the | 1429 | tracing non-RT tasks that are unpredictable will overwrite the |
921 | worst case latency of RT tasks. | 1430 | worst case latency of RT tasks (just run the normal wakeup |
1431 | tracer for a while to see that effect). | ||
922 | 1432 | ||
923 | Since this tracer only deals with RT tasks, we will run this | 1433 | Since this tracer only deals with RT tasks, we will run this |
924 | slightly differently than we did with the previous tracers. | 1434 | slightly differently than we did with the previous tracers. |
925 | Instead of performing an 'ls', we will run 'sleep 1' under | 1435 | Instead of performing an 'ls', we will run 'sleep 1' under |
926 | 'chrt' which changes the priority of the task. | 1436 | 'chrt' which changes the priority of the task. |
927 | 1437 | ||
928 | # echo wakeup > current_tracer | 1438 | # echo 0 > options/function-trace |
929 | # echo latency-format > trace_options | 1439 | # echo wakeup_rt > current_tracer |
930 | # echo 0 > tracing_max_latency | ||
931 | # echo 1 > tracing_on | 1440 | # echo 1 > tracing_on |
1441 | # echo 0 > tracing_max_latency | ||
932 | # chrt -f 5 sleep 1 | 1442 | # chrt -f 5 sleep 1 |
933 | # echo 0 > tracing_on | 1443 | # echo 0 > tracing_on |
934 | # cat trace | 1444 | # cat trace |
935 | # tracer: wakeup | 1445 | # tracer: wakeup |
936 | # | 1446 | # |
937 | wakeup latency trace v1.1.5 on 2.6.26-rc8 | 1447 | # tracer: wakeup_rt |
938 | -------------------------------------------------------------------- | 1448 | # |
939 | latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 1449 | # wakeup_rt latency trace v1.1.5 on 3.8.0-test+ |
940 | ----------------- | 1450 | # -------------------------------------------------------------------- |
941 | | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5) | 1451 | # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
942 | ----------------- | 1452 | # ----------------- |
943 | 1453 | # | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5) | |
944 | # _------=> CPU# | 1454 | # ----------------- |
945 | # / _-----=> irqs-off | 1455 | # |
946 | # | / _----=> need-resched | 1456 | # _------=> CPU# |
947 | # || / _---=> hardirq/softirq | 1457 | # / _-----=> irqs-off |
948 | # ||| / _--=> preempt-depth | 1458 | # | / _----=> need-resched |
949 | # |||| / | 1459 | # || / _---=> hardirq/softirq |
950 | # ||||| delay | 1460 | # ||| / _--=> preempt-depth |
951 | # cmd pid ||||| time | caller | 1461 | # |||| / delay |
952 | # \ / ||||| \ | / | 1462 | # cmd pid ||||| time | caller |
953 | <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process) | 1463 | # \ / ||||| \ | / |
954 | <idle>-0 1d..4 4us : schedule (cpu_idle) | 1464 | <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep |
955 | 1465 | <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up | |
956 | 1466 | <idle>-0 3d..3 5us : __schedule <-schedule | |
957 | Running this on an idle system, we see that it only took 4 | 1467 | <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep |
958 | microseconds to perform the task switch. Note, since the trace | 1468 | |
959 | marker in the schedule is before the actual "switch", we stop | 1469 | |
960 | the tracing when the recorded task is about to schedule in. This | 1470 | Running this on an idle system, we see that it only took 5 microseconds |
961 | may change if we add a new marker at the end of the scheduler. | 1471 | to perform the task switch. Note, since the trace point in the schedule |
962 | 1472 | is before the actual "switch", we stop the tracing when the recorded task | |
963 | Notice that the recorded task is 'sleep' with the PID of 4901 | 1473 | is about to schedule in. This may change if we add a new marker at the |
1474 | end of the scheduler. | ||
1475 | |||
1476 | Notice that the recorded task is 'sleep' with the PID of 2389 | ||
964 | and it has an rt_prio of 5. This priority is user-space priority | 1477 | and it has an rt_prio of 5. This priority is user-space priority |
965 | and not the internal kernel priority. The policy is 1 for | 1478 | and not the internal kernel priority. The policy is 1 for |
966 | SCHED_FIFO and 2 for SCHED_RR. | 1479 | SCHED_FIFO and 2 for SCHED_RR. |
967 | 1480 | ||
968 | Doing the same with chrt -r 5 and ftrace_enabled set. | 1481 | Note that the trace data shows the internal priority (99 - rtprio). |
969 | 1482 | ||
970 | # tracer: wakeup | 1483 | <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep |
1484 | |||
1485 | The 0:120:R means idle was running with a nice value of 0 (120 - 120) | ||
1486 | and in the running state 'R'. The sleep task was scheduled in with | ||
1487 | 2389: 94:R. That is, its priority is the kernel rtprio (99 - 5 = 94) | ||
1488 | and it too is in the running state. | ||
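The arithmetic above can be restated as two tiny shell helpers (hypothetical names, simply encoding the 99 - rt_prio and 120 + nice mappings the text describes):

```shell
# Map the kernel-internal priority printed in the trace back to
# user-visible values: RT tasks show 99 - rt_prio, normal
# (SCHED_OTHER) tasks show 120 + nice.
rtprio_from_trace() { echo $((99 - $1)); }   # trace prio -> chrt rt_prio
nice_from_trace()   { echo $(($1 - 120)); }  # trace prio -> nice value

rtprio_from_trace 94   # the sleep task above: rt_prio 5
nice_from_trace 120    # the idle task: nice 0
```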
1489 | |||
1490 | Doing the same with chrt -r 5 and function-trace set. | ||
1491 | |||
1492 | echo 1 > options/function-trace | ||
1493 | |||
1494 | # tracer: wakeup_rt | ||
971 | # | 1495 | # |
972 | wakeup latency trace v1.1.5 on 2.6.26-rc8 | 1496 | # wakeup_rt latency trace v1.1.5 on 3.8.0-test+ |
973 | -------------------------------------------------------------------- | 1497 | # -------------------------------------------------------------------- |
974 | latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) | 1498 | # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) |
975 | ----------------- | 1499 | # ----------------- |
976 | | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5) | 1500 | # | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5) |
977 | ----------------- | 1501 | # ----------------- |
978 | 1502 | # | |
979 | # _------=> CPU# | 1503 | # _------=> CPU# |
980 | # / _-----=> irqs-off | 1504 | # / _-----=> irqs-off |
981 | # | / _----=> need-resched | 1505 | # | / _----=> need-resched |
982 | # || / _---=> hardirq/softirq | 1506 | # || / _---=> hardirq/softirq |
983 | # ||| / _--=> preempt-depth | 1507 | # ||| / _--=> preempt-depth |
984 | # |||| / | 1508 | # |||| / delay |
985 | # ||||| delay | 1509 | # cmd pid ||||| time | caller |
986 | # cmd pid ||||| time | caller | 1510 | # \ / ||||| \ | / |
987 | # \ / ||||| \ | / | 1511 | <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep |
988 | ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process) | 1512 | <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up |
989 | ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb) | 1513 | <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup |
990 | ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up) | 1514 | <idle>-0 3d.h3 3us : resched_task <-check_preempt_curr |
991 | ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup) | 1515 | <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup |
992 | ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr) | 1516 | <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up |
993 | ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup) | 1517 | <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock |
994 | ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up) | 1518 | <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up |
995 | ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up) | 1519 | <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up |
996 | [...] | 1520 | <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore |
997 | ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt) | 1521 | <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer |
998 | ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit) | 1522 | <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock |
999 | ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit) | 1523 | <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt |
1000 | ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq) | 1524 | <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock |
1001 | [...] | 1525 | <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt |
1002 | ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks) | 1526 | <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event |
1003 | ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq) | 1527 | <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event |
1004 | ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable) | 1528 | <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event |
1005 | ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd) | 1529 | <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt |
1006 | ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd) | 1530 | <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit |
1007 | ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched) | 1531 | <idle>-0 3dN.2 9us : idle_cpu <-irq_exit |
1008 | ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched) | 1532 | <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit |
1009 | ksoftirq-7 1.N.2 33us : schedule (__cond_resched) | 1533 | <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit |
1010 | ksoftirq-7 1.N.2 33us : add_preempt_count (schedule) | 1534 | <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit |
1011 | ksoftirq-7 1.N.3 34us : hrtick_clear (schedule) | 1535 | <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle |
1012 | ksoftirq-7 1dN.3 35us : _spin_lock (schedule) | 1536 | <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit |
1013 | ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock) | 1537 | <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle |
1014 | ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule) | 1538 | <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit |
1015 | ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair) | 1539 | <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit |
1016 | [...] | 1540 | <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit |
1017 | ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline) | 1541 | <idle>-0 3dN.1 13us : update_cpu_load_nohz <-tick_nohz_idle_exit |
1018 | ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock) | 1542 | <idle>-0 3dN.1 13us : _raw_spin_lock <-update_cpu_load_nohz |
1019 | ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) | 1543 | <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock |
1020 | ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) | 1544 | <idle>-0 3dN.2 13us : __update_cpu_load <-update_cpu_load_nohz |
1021 | ksoftirq-7 1d..4 50us : schedule (__cond_resched) | 1545 | <idle>-0 3dN.2 14us : sched_avg_update <-__update_cpu_load |
1022 | 1546 | <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_cpu_load_nohz | |
1023 | The interrupt went off while running ksoftirqd. This task runs | 1547 | <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock |
1024 | at SCHED_OTHER. Why did not we see the 'N' set early? This may | 1548 | <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit |
1025 | be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K | 1549 | <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit |
1026 | stacks configured, the interrupt and softirq run with their own | 1550 | <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit |
1027 | stack. Some information is held on the top of the task's stack | 1551 | <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel |
1028 | (need_resched and preempt_count are both stored there). The | 1552 | <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel |
1029 | setting of the NEED_RESCHED bit is done directly to the task's | 1553 | <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 |
1030 | stack, but the reading of the NEED_RESCHED is done by looking at | 1554 | <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave |
1031 | the current stack, which in this case is the stack for the hard | 1555 | <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16 |
1032 | interrupt. This hides the fact that NEED_RESCHED has been set. | 1556 | <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer |
1033 | We do not see the 'N' until we switch back to the task's | 1557 | <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram |
1034 | assigned stack. | 1558 | <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event |
1559 | <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event | ||
1560 | <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event | ||
1561 | <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel | ||
1562 | <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore | ||
1563 | <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit | ||
1564 | <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward | ||
1565 | <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward | ||
1566 | <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 | ||
1567 | <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns | ||
1568 | <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns | ||
1569 | <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 | ||
1570 | <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave | ||
1571 | <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns | ||
1572 | <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns | ||
1573 | <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns | ||
1574 | <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event | ||
1575 | <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event | ||
1576 | <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event | ||
1577 | <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns | ||
1578 | <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore | ||
1579 | <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit | ||
1580 | <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks | ||
1581 | <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle | ||
1582 | <idle>-0 3.N.. 25us : schedule <-cpu_idle | ||
1583 | <idle>-0 3.N.. 25us : __schedule <-preempt_schedule | ||
1584 | <idle>-0 3.N.. 26us : add_preempt_count <-__schedule | ||
1585 | <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule | ||
1586 | <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch | ||
1587 | <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch | ||
1588 | <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule | ||
1589 | <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq | ||
1590 | <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule | ||
1591 | <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task | ||
1592 | <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task | ||
1593 | <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt | ||
1594 | <idle>-0 3d..3 29us : __schedule <-preempt_schedule | ||
1595 | <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep | ||
1596 | |||
1597 | This isn't that big of a trace, even with function tracing enabled, | ||
1598 | so I included the entire trace. | ||
1599 | |||
1600 | The interrupt went off while the system was idle. Somewhere | ||
1601 | before task_woken_rt() was called, the NEED_RESCHED flag was set; | ||
1602 | this is indicated by the first occurrence of the 'N' flag. | ||
1603 | |||
1604 | Latency tracing and events | ||
1605 | -------------------------- | ||
1606 | Function tracing can induce a much larger latency, but without | ||
1607 | seeing what happens within the latency it is hard to know what | ||
1608 | caused it. There is a middle ground, and that is to enable | ||
1609 | events. | ||
1610 | |||
1611 | # echo 0 > options/function-trace | ||
1612 | # echo wakeup_rt > current_tracer | ||
1613 | # echo 1 > events/enable | ||
1614 | # echo 1 > tracing_on | ||
1615 | # echo 0 > tracing_max_latency | ||
1616 | # chrt -f 5 sleep 1 | ||
1617 | # echo 0 > tracing_on | ||
1618 | # cat trace | ||
1619 | # tracer: wakeup_rt | ||
1620 | # | ||
1621 | # wakeup_rt latency trace v1.1.5 on 3.8.0-test+ | ||
1622 | # -------------------------------------------------------------------- | ||
1623 | # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) | ||
1624 | # ----------------- | ||
1625 | # | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5) | ||
1626 | # ----------------- | ||
1627 | # | ||
1628 | # _------=> CPU# | ||
1629 | # / _-----=> irqs-off | ||
1630 | # | / _----=> need-resched | ||
1631 | # || / _---=> hardirq/softirq | ||
1632 | # ||| / _--=> preempt-depth | ||
1633 | # |||| / delay | ||
1634 | # cmd pid ||||| time | caller | ||
1635 | # \ / ||||| \ | / | ||
1636 | <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep | ||
1637 | <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up | ||
1638 | <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002 | ||
1639 | <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8 | ||
1640 | <idle>-0 2.N.2 2us : power_end: cpu_id=2 | ||
1641 | <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2 | ||
1642 | <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0 | ||
1643 | <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000 | ||
1644 | <idle>-0 2.N.2 5us : rcu_utilization: Start context switch | ||
1645 | <idle>-0 2.N.2 5us : rcu_utilization: End context switch | ||
1646 | <idle>-0 2d..3 6us : __schedule <-schedule | ||
1647 | <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep | ||
1648 | |||
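Event lines carry a key=value payload after the event name; a quick sed sketch (the sample line is copied from the sched_wakeup entry above) pulling a single field out of one:

```shell
# Extract the pid field from a sched_wakeup event line as it
# appears in the trace above.
line='sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002'
echo "$line" | sed -n 's/.*pid=\([0-9]*\).*/\1/p'
```

The same pattern works for any of the key=value pairs (comm, prio, target_cpu) when post-processing trace output.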
1035 | 1649 | ||
1036 | function | 1650 | function |
1037 | -------- | 1651 | -------- |
@@ -1039,6 +1653,7 @@ function | |||
1039 | This tracer is the function tracer. Enabling the function tracer | 1653 | This tracer is the function tracer. Enabling the function tracer |
1040 | can be done from the debug file system. Make sure the | 1654 | can be done from the debug file system. Make sure the |
1041 | ftrace_enabled is set; otherwise this tracer is a nop. | 1655 | ftrace_enabled is set; otherwise this tracer is a nop. |
1656 | See the "ftrace_enabled" section below. | ||
1042 | 1657 | ||
1043 | # sysctl kernel.ftrace_enabled=1 | 1658 | # sysctl kernel.ftrace_enabled=1 |
1044 | # echo function > current_tracer | 1659 | # echo function > current_tracer |
@@ -1048,23 +1663,23 @@ ftrace_enabled is set; otherwise this tracer is a nop. | |||
1048 | # cat trace | 1663 | # cat trace |
1049 | # tracer: function | 1664 | # tracer: function |
1050 | # | 1665 | # |
1051 | # TASK-PID CPU# TIMESTAMP FUNCTION | 1666 | # entries-in-buffer/entries-written: 24799/24799 #P:4 |
1052 | # | | | | | | 1667 | # |
1053 | bash-4003 [00] 123.638713: finish_task_switch <-schedule | 1668 | # _-----=> irqs-off |
1054 | bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch | 1669 | # / _----=> need-resched |
1055 | bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq | 1670 | # | / _---=> hardirq/softirq |
1056 | bash-4003 [00] 123.638715: hrtick_set <-schedule | 1671 | # || / _--=> preempt-depth |
1057 | bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set | 1672 | # ||| / delay |
1058 | bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave | 1673 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION |
1059 | bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set | 1674 | # | | | |||| | | |
1060 | bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore | 1675 | bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write |
1061 | bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set | 1676 | bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock |
1062 | bash-4003 [00] 123.638718: sub_preempt_count <-schedule | 1677 | bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify |
1063 | bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule | 1678 | bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify |
1064 | bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run | 1679 | bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify |
1065 | bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion | 1680 | bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock |
1066 | bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common | 1681 | bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock |
1067 | bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq | 1682 | bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify |
1068 | [...] | 1683 | [...] |
1069 | 1684 | ||
1070 | 1685 | ||
@@ -1214,79 +1829,19 @@ int main (int argc, char **argv) | |||
1214 | return 0; | 1829 | return 0; |
1215 | } | 1830 | } |
1216 | 1831 | ||
1832 | Or this simple script! | ||
1217 | 1833 | ||
1218 | hw-branch-tracer (x86 only) | 1834 | ------ |
1219 | --------------------------- | 1835 | #!/bin/bash |
1220 | 1836 | ||
1221 | This tracer uses the x86 last branch tracing hardware feature to | 1837 | debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts` |
1222 | collect a branch trace on all cpus with relatively low overhead. | 1838 | echo nop > $debugfs/tracing/current_tracer |
1223 | 1839 | echo 0 > $debugfs/tracing/tracing_on | |
1224 | The tracer uses a fixed-size circular buffer per cpu and only | 1840 | echo $$ > $debugfs/tracing/set_ftrace_pid |
1225 | traces ring 0 branches. The trace file dumps that buffer in the | 1841 | echo function > $debugfs/tracing/current_tracer |
1226 | following format: | 1842 | echo 1 > $debugfs/tracing/tracing_on |
1227 | 1843 | exec "$@" | |
1228 | # tracer: hw-branch-tracer | 1844 | ------ |
1229 | # | ||
1230 | # CPU# TO <- FROM | ||
1231 | 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 | ||
1232 | 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a | ||
1233 | 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf | ||
1234 | 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf | ||
1235 | 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a | ||
1236 | 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf | ||
1237 | |||
1238 | |||
1239 | The tracer may be used to dump the trace for the oops'ing cpu on | ||
1240 | a kernel oops into the system log. To enable this, | ||
1241 | ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one | ||
1242 | can either use the sysctl function or set it via the proc system | ||
1243 | interface. | ||
1244 | |||
1245 | sysctl kernel.ftrace_dump_on_oops=n | ||
1246 | |||
1247 | or | ||
1248 | |||
1249 | echo n > /proc/sys/kernel/ftrace_dump_on_oops | ||
1250 | |||
1251 | If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will | ||
1252 | only dump the buffer of the CPU that triggered the oops. | ||
1253 | |||
1254 | Here's an example of such a dump after a null pointer | ||
1255 | dereference in a kernel module: | ||
1256 | |||
1257 | [57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 | ||
1258 | [57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops] | ||
1259 | [57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 | ||
1260 | [57848.106019] Oops: 0002 [#1] SMP | ||
1261 | [57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus | ||
1262 | [57848.106019] Dumping ftrace buffer: | ||
1263 | [57848.106019] --------------------------------- | ||
1264 | [...] | ||
1265 | [57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 | ||
1266 | [57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 | ||
1267 | [57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 | ||
1268 | [57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 | ||
1269 | [57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 | ||
1270 | [57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] | ||
1271 | [57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 | ||
1272 | [57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b | ||
1273 | [57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 | ||
1274 | [57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 | ||
1275 | [57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 | ||
1276 | [...] | ||
1277 | [57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 | ||
1278 | [57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 | ||
1279 | [57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 | ||
1280 | [57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 | ||
1281 | [...] | ||
1282 | [57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 | ||
1283 | [57848.106019] --------------------------------- | ||
1284 | [57848.106019] CPU 0 | ||
1285 | [57848.106019] Modules linked in: oops | ||
1286 | [57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 | ||
1287 | [57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops] | ||
1288 | [57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 | ||
1289 | [...] | ||
1290 | 1845 | ||
1291 | 1846 | ||
1292 | function graph tracer | 1847 | function graph tracer |
@@ -1473,16 +2028,18 @@ starts of pointing to a simple return. (Enabling FTRACE will | |||
1473 | include the -pg switch in the compiling of the kernel.) | 2028 | include the -pg switch in the compiling of the kernel.) |
1474 | 2029 | ||
1475 | At compile time every C file object is run through the | 2030 | At compile time every C file object is run through the |
1476 | recordmcount.pl script (located in the scripts directory). This | 2031 | recordmcount program (located in the scripts directory). This |
1477 | script will process the C object using objdump to find all the | 2032 | program will parse the ELF headers in the C object to find all |
1478 | locations in the .text section that call mcount. (Note, only the | 2033 | the locations in the .text section that call mcount. (Note, only |
1479 | .text section is processed, since processing other sections like | 2034 | whitelisted .text sections are processed, since processing other |
1480 | .init.text may cause races due to those sections being freed). | 2035 | sections like .init.text may cause races due to those sections |
2036 | being freed unexpectedly). | ||
1481 | 2037 | ||
1482 | A new section called "__mcount_loc" is created that holds | 2038 | A new section called "__mcount_loc" is created that holds |
1483 | references to all the mcount call sites in the .text section. | 2039 | references to all the mcount call sites in the .text section. |
1484 | This section is compiled back into the original object. The | 2040 | The recordmcount program re-links this section back into the |
1485 | final linker will add all these references into a single table. | 2041 | original object. The final linking stage of the kernel will add all these |
2042 | references into a single table. | ||
1486 | 2043 | ||
1487 | On boot up, before SMP is initialized, the dynamic ftrace code | 2044 | On boot up, before SMP is initialized, the dynamic ftrace code |
1488 | scans this table and updates all the locations into nops. It | 2045 | scans this table and updates all the locations into nops. It |
@@ -1493,13 +2050,25 @@ unloaded, it also removes its functions from the ftrace function | |||
1493 | list. This is automatic in the module unload code, and the | 2050 | list. This is automatic in the module unload code, and the |
1494 | module author does not need to worry about it. | 2051 | module author does not need to worry about it. |
1495 | 2052 | ||
1496 | When tracing is enabled, kstop_machine is called to prevent | 2053 | When tracing is enabled, the process of modifying the function |
1497 | races with the CPUS executing code being modified (which can | 2054 | tracepoints is dependent on architecture. The old method is to use |
1498 | cause the CPU to do undesirable things), and the nops are | 2055 | kstop_machine to prevent races with the CPUs executing code being |
2056 | modified (which can cause the CPU to do undesirable things, especially | ||
2057 | if the modified code crosses cache (or page) boundaries), and the nops are | ||
1499 | patched back to calls. But this time, they do not call mcount | 2058 | patched back to calls. But this time, they do not call mcount |
1500 | (which is just a function stub). They now call into the ftrace | 2059 | (which is just a function stub). They now call into the ftrace |
1501 | infrastructure. | 2060 | infrastructure. |
1502 | 2061 | ||
2062 | The new method of modifying the function tracepoints is to place | ||
2063 | a breakpoint at the location to be modified, sync all CPUs, modify | ||
2064 | the rest of the instruction not covered by the breakpoint. Sync | ||
2065 | all CPUs again, and then replace the breakpoint with the finished | ||
2066 | version of the ftrace call site. | ||
2067 | |||
2068 | Some archs do not even need to monkey around with the synchronization, | ||
2069 | and can just slap the new code on top of the old without any | ||
2070 | problems with other CPUs executing it at the same time. | ||
2071 | |||
1503 | One special side-effect to the recording of the functions being | 2072 | One special side-effect to the recording of the functions being |
1504 | traced is that we can now selectively choose which functions we | 2073 | traced is that we can now selectively choose which functions we |
1505 | wish to trace and which ones we want the mcount calls to remain | 2074 | wish to trace and which ones we want the mcount calls to remain |
@@ -1530,20 +2099,28 @@ mutex_lock | |||
1530 | 2099 | ||
1531 | If I am only interested in sys_nanosleep and hrtimer_interrupt: | 2100 | If I am only interested in sys_nanosleep and hrtimer_interrupt: |
1532 | 2101 | ||
1533 | # echo sys_nanosleep hrtimer_interrupt \ | 2102 | # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter |
1534 | > set_ftrace_filter | ||
1535 | # echo function > current_tracer | 2103 | # echo function > current_tracer |
1536 | # echo 1 > tracing_on | 2104 | # echo 1 > tracing_on |
1537 | # usleep 1 | 2105 | # usleep 1 |
1538 | # echo 0 > tracing_on | 2106 | # echo 0 > tracing_on |
1539 | # cat trace | 2107 | # cat trace |
1540 | # tracer: ftrace | 2108 | # tracer: function |
2109 | # | ||
2110 | # entries-in-buffer/entries-written: 5/5 #P:4 | ||
1541 | # | 2111 | # |
1542 | # TASK-PID CPU# TIMESTAMP FUNCTION | 2112 | # _-----=> irqs-off |
1543 | # | | | | | | 2113 | # / _----=> need-resched |
1544 | usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt | 2114 | # | / _---=> hardirq/softirq |
1545 | usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call | 2115 | # || / _--=> preempt-depth |
1546 | <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt | 2116 | # ||| / delay |
2117 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | ||
2118 | # | | | |||| | | | ||
2119 | usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath | ||
2120 | <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt | ||
2121 | usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt | ||
2122 | <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt | ||
2123 | <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt | ||
1547 | 2124 | ||
1548 | To see which functions are being traced, you can cat the file: | 2125 | To see which functions are being traced, you can cat the file: |
1549 | 2126 | ||
@@ -1571,20 +2148,25 @@ Note: It is better to use quotes to enclose the wild cards, | |||
1571 | 2148 | ||
1572 | Produces: | 2149 | Produces: |
1573 | 2150 | ||
1574 | # tracer: ftrace | 2151 | # tracer: function |
1575 | # | 2152 | # |
1576 | # TASK-PID CPU# TIMESTAMP FUNCTION | 2153 | # entries-in-buffer/entries-written: 897/897 #P:4 |
1577 | # | | | | | | 2154 | # |
1578 | bash-4003 [00] 1480.611794: hrtimer_init <-copy_process | 2155 | # _-----=> irqs-off |
1579 | bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set | 2156 | # / _----=> need-resched |
1580 | bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear | 2157 | # | / _---=> hardirq/softirq |
1581 | bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel | 2158 | # || / _--=> preempt-depth |
1582 | <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt | 2159 | # ||| / delay |
1583 | <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt | 2160 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION |
1584 | <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt | 2161 | # | | | |||| | | |
1585 | <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt | 2162 | <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit |
1586 | <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt | 2163 | <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel |
1587 | 2164 | <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer | |
2165 | <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit | ||
2166 | <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 | ||
2167 | <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt | ||
2168 | <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter | ||
2169 | <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem | ||
1588 | 2170 | ||
1589 | Notice that we lost the sys_nanosleep. | 2171 | Notice that we lost the sys_nanosleep. |
1590 | 2172 | ||
@@ -1651,19 +2233,29 @@ traced. | |||
1651 | 2233 | ||
1652 | Produces: | 2234 | Produces: |
1653 | 2235 | ||
1654 | # tracer: ftrace | 2236 | # tracer: function |
2237 | # | ||
2238 | # entries-in-buffer/entries-written: 39608/39608 #P:4 | ||
1655 | # | 2239 | # |
1656 | # TASK-PID CPU# TIMESTAMP FUNCTION | 2240 | # _-----=> irqs-off |
1657 | # | | | | | | 2241 | # / _----=> need-resched |
1658 | bash-4043 [01] 115.281644: finish_task_switch <-schedule | 2242 | # | / _---=> hardirq/softirq |
1659 | bash-4043 [01] 115.281645: hrtick_set <-schedule | 2243 | # || / _--=> preempt-depth |
1660 | bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set | 2244 | # ||| / delay |
1661 | bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run | 2245 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION |
1662 | bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion | 2246 | # | | | |||| | | |
1663 | bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run | 2247 | bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open |
1664 | bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop | 2248 | bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last |
1665 | bash-4043 [01] 115.281648: wake_up_process <-kthread_stop | 2249 | bash-1994 [000] .... 4342.324897: ima_file_check <-do_last |
1666 | bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process | 2250 | bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check |
2251 | bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement | ||
2252 | bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action | ||
2253 | bash-1994 [000] .... 4342.324899: do_truncate <-do_last | ||
2254 | bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate | ||
2255 | bash-1994 [000] .... 4342.324899: notify_change <-do_truncate | ||
2256 | bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change | ||
2257 | bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time | ||
2258 | bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time | ||
1667 | 2259 | ||
1668 | We can see that there's no more lock or preempt tracing. | 2260 | We can see that there's no more lock or preempt tracing. |
1669 | 2261 | ||
@@ -1729,6 +2321,28 @@ this special filter via: | |||
1729 | echo > set_graph_function | 2321 | echo > set_graph_function |
1730 | 2322 | ||
1731 | 2323 | ||
2324 | ftrace_enabled | ||
2325 | -------------- | ||
2326 | |||
2327 | Note, the proc sysctl ftrace_enabled is a big on/off switch for the | ||
2328 | function tracer. By default it is enabled (when function tracing is | ||
2329 | enabled in the kernel). If it is disabled, all function tracing is | ||
2330 | disabled. This includes not only the function tracers for ftrace, but | ||
2331 | also for any other uses (perf, kprobes, stack tracing, profiling, etc). | ||
2332 | |||
2333 | Please disable this with care. | ||
2334 | |||
2335 | This can be disabled (and enabled) with: | ||
2336 | |||
2337 | sysctl kernel.ftrace_enabled=0 | ||
2338 | sysctl kernel.ftrace_enabled=1 | ||
2339 | |||
2340 | or | ||
2341 | |||
2342 | echo 0 > /proc/sys/kernel/ftrace_enabled | ||
2343 | echo 1 > /proc/sys/kernel/ftrace_enabled | ||
2344 | |||
2345 | |||
1732 | Filter commands | 2346 | Filter commands |
1733 | --------------- | 2347 | --------------- |
1734 | 2348 | ||
@@ -1763,12 +2377,58 @@ The following commands are supported: | |||
1763 | 2377 | ||
1764 | echo '__schedule_bug:traceoff:5' > set_ftrace_filter | 2378 | echo '__schedule_bug:traceoff:5' > set_ftrace_filter |
1765 | 2379 | ||
2380 | To always disable tracing when __schedule_bug is hit: | ||
2381 | |||
2382 | echo '__schedule_bug:traceoff' > set_ftrace_filter | ||
2383 | |||
1766 | These commands are cumulative whether or not they are appended | 2384 | These commands are cumulative whether or not they are appended |
1767 | to set_ftrace_filter. To remove a command, prepend it by '!' | 2385 | to set_ftrace_filter. To remove a command, prepend it by '!' |
1768 | and drop the parameter: | 2386 | and drop the parameter: |
1769 | 2387 | ||
2388 | echo '!__schedule_bug:traceoff:0' > set_ftrace_filter | ||
2389 | |||
2390 | The above removes the traceoff command for __schedule_bug | ||
2391 | that has a counter. To remove commands without counters: | ||
2392 | |||
1770 | echo '!__schedule_bug:traceoff' > set_ftrace_filter | 2393 | echo '!__schedule_bug:traceoff' > set_ftrace_filter |
1771 | 2394 | ||
2395 | - snapshot | ||
2396 | Will cause a snapshot to be triggered when the function is hit. | ||
2397 | |||
2398 | echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter | ||
2399 | |||
2400 | To only snapshot once: | ||
2401 | |||
2402 | echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter | ||
2403 | |||
2404 | To remove the above commands: | ||
2405 | |||
2406 | echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter | ||
2407 | echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter | ||
2408 | |||
2409 | - enable_event/disable_event | ||
2410 | These commands can enable or disable a trace event. Note, because | ||
2411 | function tracing callbacks are very sensitive, when these commands | ||
2412 | are registered, the trace point is activated, but disabled in | ||
2413 | a "soft" mode. That is, the tracepoint will be called, but | ||
2414 | just will not be traced. The event tracepoint stays in this mode | ||
2415 | as long as there's a command that triggers it. | ||
2416 | |||
2417 | echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \ | ||
2418 | set_ftrace_filter | ||
2419 | |||
2420 | The format is: | ||
2421 | |||
2422 | <function>:enable_event:<system>:<event>[:count] | ||
2423 | <function>:disable_event:<system>:<event>[:count] | ||
2424 | |||
2425 | To remove the events commands: | ||
2426 | |||
2427 | |||
2428 | echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \ | ||
2429 | set_ftrace_filter | ||
2430 | echo '!schedule:disable_event:sched:sched_switch' > \ | ||
2431 | set_ftrace_filter | ||
1772 | 2432 | ||
1773 | trace_pipe | 2433 | trace_pipe |
1774 | ---------- | 2434 | ---------- |
@@ -1787,28 +2447,31 @@ different. The trace is live. | |||
1787 | # cat trace | 2447 | # cat trace |
1788 | # tracer: function | 2448 | # tracer: function |
1789 | # | 2449 | # |
1790 | # TASK-PID CPU# TIMESTAMP FUNCTION | 2450 | # entries-in-buffer/entries-written: 0/0 #P:4 |
1791 | # | | | | | | 2451 | # |
2452 | # _-----=> irqs-off | ||
2453 | # / _----=> need-resched | ||
2454 | # | / _---=> hardirq/softirq | ||
2455 | # || / _--=> preempt-depth | ||
2456 | # ||| / delay | ||
2457 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | ||
2458 | # | | | |||| | | | ||
1792 | 2459 | ||
1793 | # | 2460 | # |
1794 | # cat /tmp/trace.out | 2461 | # cat /tmp/trace.out |
1795 | bash-4043 [00] 41.267106: finish_task_switch <-schedule | 2462 | bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write |
1796 | bash-4043 [00] 41.267106: hrtick_set <-schedule | 2463 | bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock |
1797 | bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set | 2464 | bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify |
1798 | bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run | 2465 | bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify |
1799 | bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion | 2466 | bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify |
1800 | bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run | 2467 | bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock |
1801 | bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop | 2468 | bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock |
1802 | bash-4043 [00] 41.267110: wake_up_process <-kthread_stop | 2469 | bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify |
1803 | bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process | 2470 | bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath |
1804 | bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up | ||
1805 | 2471 | ||
1806 | 2472 | ||
1807 | Note, reading the trace_pipe file will block until more input is | 2473 | Note, reading the trace_pipe file will block until more input is |
1808 | added. By changing the tracer, trace_pipe will issue an EOF. We | 2474 | added. |
1809 | needed to set the function tracer _before_ we "cat" the | ||
1810 | trace_pipe file. | ||
1811 | |||
1812 | 2475 | ||
1813 | trace entries | 2476 | trace entries |
1814 | ------------- | 2477 | ------------- |
@@ -1817,31 +2480,50 @@ Having too much or not enough data can be troublesome in | |||
1817 | diagnosing an issue in the kernel. The file buffer_size_kb is | 2480 | diagnosing an issue in the kernel. The file buffer_size_kb is |
1818 | used to modify the size of the internal trace buffers. The | 2481 | used to modify the size of the internal trace buffers. The |
1819 | number listed is the number of entries that can be recorded per | 2482 | number listed is the number of entries that can be recorded per |
1820 | CPU. To know the full size, multiply the number of possible CPUS | 2483 | CPU. To know the full size, multiply the number of possible CPUs |
1821 | with the number of entries. | 2484 | with the number of entries. |
1822 | 2485 | ||
1823 | # cat buffer_size_kb | 2486 | # cat buffer_size_kb |
1824 | 1408 (units kilobytes) | 2487 | 1408 (units kilobytes) |
1825 | 2488 | ||
1826 | Note, to modify this, you must have tracing completely disabled. | 2489 | Or simply read buffer_total_size_kb |
1827 | To do that, echo "nop" into the current_tracer. If the | 2490 | |
1828 | current_tracer is not set to "nop", an EINVAL error will be | 2491 | # cat buffer_total_size_kb |
1829 | returned. | 2492 | 5632 |
2493 | |||
2494 | To modify the buffer, simply echo in a number (in 1024 byte segments). | ||
1830 | 2495 | ||
1831 | # echo nop > current_tracer | ||
1832 | # echo 10000 > buffer_size_kb | 2496 | # echo 10000 > buffer_size_kb |
1833 | # cat buffer_size_kb | 2497 | # cat buffer_size_kb |
1834 | 10000 (units kilobytes) | 2498 | 10000 (units kilobytes) |
1835 | 2499 | ||
1836 | The number of pages which will be allocated is limited to a | 2500 | It will try to allocate as much as possible. If you allocate too |
1837 | percentage of available memory. Allocating too much will produce | 2501 | much, it can trigger the Out-Of-Memory killer. |
1838 | an error. | ||
1839 | 2502 | ||
1840 | # echo 1000000000000 > buffer_size_kb | 2503 | # echo 1000000000000 > buffer_size_kb |
1841 | -bash: echo: write error: Cannot allocate memory | 2504 | -bash: echo: write error: Cannot allocate memory |
1842 | # cat buffer_size_kb | 2505 | # cat buffer_size_kb |
1843 | 85 | 2506 | 85 |
1844 | 2507 | ||
2508 | The per_cpu buffers can be changed individually as well: | ||
2509 | |||
2510 | # echo 10000 > per_cpu/cpu0/buffer_size_kb | ||
2511 | # echo 100 > per_cpu/cpu1/buffer_size_kb | ||
2512 | |||
2513 | When the per_cpu buffers are not the same, the buffer_size_kb | ||
2514 | at the top level will just show an X | ||
2515 | |||
2516 | # cat buffer_size_kb | ||
2517 | X | ||
2518 | |||
2519 | This is where the buffer_total_size_kb is useful: | ||
2520 | |||
2521 | # cat buffer_total_size_kb | ||
2522 | 12916 | ||
2523 | |||
2524 | Writing to the top level buffer_size_kb will reset all the buffers | ||
2525 | to be the same again. | ||
2526 | |||
1845 | Snapshot | 2527 | Snapshot |
1846 | -------- | 2528 | -------- |
1847 | CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature | 2529 | CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature |
@@ -1925,7 +2607,188 @@ bash: echo: write error: Device or resource busy | |||
1925 | # cat snapshot | 2607 | # cat snapshot |
1926 | cat: snapshot: Device or resource busy | 2608 | cat: snapshot: Device or resource busy |
1927 | 2609 | ||
2610 | |||
2611 | Instances | ||
2612 | --------- | ||
2613 | In the debugfs tracing directory is a directory called "instances". | ||
2614 | This directory can have new directories created inside of it using | ||
2615 | mkdir, and removing directories with rmdir. The directory created | ||
2616 | with mkdir in this directory will already contain files and other | ||
2617 | directories after it is created. | ||
2618 | |||
2619 | # mkdir instances/foo | ||
2620 | # ls instances/foo | ||
2621 | buffer_size_kb buffer_total_size_kb events free_buffer per_cpu | ||
2622 | set_event snapshot trace trace_clock trace_marker trace_options | ||
2623 | trace_pipe tracing_on | ||
2624 | |||
2625 | As you can see, the new directory looks similar to the tracing directory | ||
2626 | itself. In fact, it is very similar, except that the buffer and | ||
2627 | events are agnostic from the main directory, or from any other | ||
2628 | instances that are created. | ||
2629 | |||
2630 | The files in the new directory work just like the files with the | ||
2631 | same name in the tracing directory except the buffer that is used | ||
2632 | is a separate and new buffer. The files affect that buffer but do not | ||
2633 | affect the main buffer with the exception of trace_options. Currently, | ||
2634 | the trace_options affect all instances and the top level buffer | ||
2635 | the same, but this may change in future releases. That is, options | ||
2636 | may become specific to the instance they reside in. | ||
2637 | |||
2638 | Notice that none of the function tracer files are there, nor is | ||
2639 | current_tracer and available_tracers. This is because the buffers | ||
2640 | can currently only have events enabled for them. | ||
2641 | |||
2642 | # mkdir instances/foo | ||
2643 | # mkdir instances/bar | ||
2644 | # mkdir instances/zoot | ||
2645 | # echo 100000 > buffer_size_kb | ||
2646 | # echo 1000 > instances/foo/buffer_size_kb | ||
2647 | # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb | ||
2648 | # echo function > current_tracer | ||
2649 | # echo 1 > instances/foo/events/sched/sched_wakeup/enable | ||
2650 | # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable | ||
2651 | # echo 1 > instances/foo/events/sched/sched_switch/enable | ||
2652 | # echo 1 > instances/bar/events/irq/enable | ||
2653 | # echo 1 > instances/zoot/events/syscalls/enable | ||
2654 | # cat trace_pipe | ||
2655 | CPU:2 [LOST 11745 EVENTS] | ||
2656 | bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist | ||
2657 | bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave | ||
2658 | bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist | ||
2659 | bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist | ||
2660 | bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock | ||
2661 | bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype | ||
2662 | bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist | ||
2663 | bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist | ||
2664 | bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics | ||
2665 | bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics | ||
2666 | bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process | ||
2667 | [...] | ||
2668 | |||
2669 | # cat instances/foo/trace_pipe | ||
2670 | bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 | ||
2671 | bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 | ||
2672 | <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003 | ||
2673 | <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120 | ||
2674 | rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120 | ||
2675 | bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 | ||
2676 | bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 | ||
2677 | bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120 | ||
2678 | kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001 | ||
2679 | kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120 | ||
2680 | [...] | ||
2681 | |||
2682 | # cat instances/bar/trace_pipe | ||
2683 | migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX] | ||
2684 | <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX] | ||
2685 | bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER] | ||
2686 | bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU] | ||
2687 | bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER] | ||
2688 | bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER] | ||
2689 | bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU] | ||
2690 | bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU] | ||
2691 | sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4 | ||
2692 | sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled | ||
2693 | sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0 | ||
2694 | sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled | ||
2695 | [...] | ||
2696 | |||
2697 | # cat instances/zoot/trace | ||
2698 | # tracer: nop | ||
2699 | # | ||
2700 | # entries-in-buffer/entries-written: 18996/18996 #P:4 | ||
2701 | # | ||
2702 | # _-----=> irqs-off | ||
2703 | # / _----=> need-resched | ||
2704 | # | / _---=> hardirq/softirq | ||
2705 | # || / _--=> preempt-depth | ||
2706 | # ||| / delay | ||
2707 | # TASK-PID CPU# |||| TIMESTAMP FUNCTION | ||
2708 | # | | | |||| | | | ||
2709 | bash-1998 [000] d... 140.733501: sys_write -> 0x2 | ||
2710 | bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1) | ||
2711 | bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1 | ||
2712 | bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0) | ||
2713 | bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1 | ||
2714 | bash-1998 [000] d... 140.733510: sys_close(fd: a) | ||
2715 | bash-1998 [000] d... 140.733510: sys_close -> 0x0 | ||
2716 | bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8) | ||
2717 | bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0 | ||
2718 | bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8) | ||
2719 | bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0 | ||
2720 | |||
2721 | You can see that the trace of the top-most trace buffer shows only | ||
2722 | the function tracing. The foo instance displays wakeups and task | ||
2723 | switches. | ||
2724 | |||
2725 | To remove the instances, simply delete their directories: | ||
2726 | |||
2727 | # rmdir instances/foo | ||
2728 | # rmdir instances/bar | ||
2729 | # rmdir instances/zoot | ||
2730 | |||
2731 | Note, if a process has a trace file open in one of the instance | ||
2732 | directories, the rmdir will fail with EBUSY. | ||
2733 | |||
2734 | |||
2735 | Stack trace | ||
1928 | ----------- | 2736 | ----------- |
2737 | Since the kernel has a fixed-size stack, it is important not to | ||
2738 | waste it in functions. A kernel developer must be conscious of | ||
2739 | what they allocate on the stack. If they add too much, the system | ||
2740 | can be in danger of a stack overflow, and corruption will occur, | ||
2741 | usually leading to a system panic. | ||
2742 | |||
2743 | There are some tools that check this, usually with interrupts | ||
2744 | periodically checking usage. But a check performed at every | ||
2745 | function call is far more useful. As ftrace provides | ||
2746 | a function tracer, it makes it convenient to check the stack size | ||
2747 | at every function call. This is enabled via the stack tracer. | ||
2748 | |||
CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.

  # echo 1 > /proc/sys/kernel/stack_tracer_enabled

You can also enable the stack tracer during boot up, to trace the
stack size of the kernel while it boots, by adding "stacktrace" to
the kernel command line.

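As a side note (a hedged sketch: this assumes the sysctl utility is
installed and you have root), the same knob is also reachable through
the sysctl interface, whose name mirrors the /proc path above:

```shell
# Enable the stack tracer; equivalent to echoing '1' into
# /proc/sys/kernel/stack_tracer_enabled (needs root).
sysctl kernel.stack_tracer_enabled=1

# Read the current value back.
sysctl kernel.stack_tracer_enabled
```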
After running it for a few minutes, the output looks like:

  # cat stack_max_size
  2928

  # cat stack_trace
          Depth    Size   Location    (18 entries)
          -----    ----   --------
    0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
    1)     2704     160   find_busiest_group+0x31/0x1f1
    2)     2544     256   load_balance+0xd9/0x662
    3)     2288      80   idle_balance+0xbb/0x130
    4)     2208     128   __schedule+0x26e/0x5b9
    5)     2080      16   schedule+0x64/0x66
    6)     2064     128   schedule_timeout+0x34/0xe0
    7)     1936     112   wait_for_common+0x97/0xf1
    8)     1824      16   wait_for_completion+0x1d/0x1f
    9)     1808     128   flush_work+0xfe/0x119
   10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
   11)     1664      48   input_available_p+0x1d/0x5c
   12)     1616      48   n_tty_poll+0x6d/0x134
   13)     1568      64   tty_poll+0x64/0x7f
   14)     1504     880   do_select+0x31e/0x511
   15)      624     400   core_sys_select+0x177/0x216
   16)      224      96   sys_select+0x91/0xb9
   17)      128     128   system_call_fastpath+0x16/0x1b

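Note that the Size column is simply the difference between a frame's
Depth and the Depth of the frame below it (the last frame's Size
equals its own Depth). As a small illustrative check, this can be
recomputed with awk over a few lines copied from the listing above:

```shell
# Recompute each frame's Size from consecutive Depth values:
# Size[i] = Depth[i] - Depth[i+1].  Fields: $1=index, $2=Depth, $3=Size.
awk 'NR > 1 { print prev - $2 }    # prints the Size of the previous frame
     { prev = $2 }' <<'EOF'
  0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
  1)     2704     160   find_busiest_group+0x31/0x1f1
  2)     2544     256   load_balance+0xd9/0x662
  3)     2288      80   idle_balance+0xbb/0x130
EOF
# Prints 224, 160, 256 -- matching the Size column above.
```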
Note, if -mfentry is being used by gcc, functions get traced before
they set up the stack frame. This means that leaf-level functions
are not tested by the stack tracer when -mfentry is used.

Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.

---------

More details can be found in the source code, in the
kernel/trace/*.c files.