aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2013-04-29 16:55:38 -0400
committerLinus Torvalds <torvalds@linux-foundation.org>2013-04-29 16:55:38 -0400
commit9e8529afc4518f4e5d610001545ebc97e1333c79 (patch)
tree26e1aa2cbb50f3f511cfa7d8e39e6b7bd9221b68 /Documentation
parentec25e246b94a3233ab064994ef05a170bdba0e7c (diff)
parent4c69e6ea415a35eb7f0fc8ee9390c8f7436492a2 (diff)
Merge tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing updates from Steven Rostedt: "Along with the usual minor fixes and clean ups there are a few major changes with this pull request. 1) Multiple buffers for the ftrace facility This feature has been requested by many people over the last few years. I even heard that Google was about to implement it themselves. I finally had time and cleaned up the code such that you can now create multiple instances of the ftrace buffer and have different events go to different buffers. This way, a low frequency event will not be lost in the noise of a high frequency event. Note, currently only events can go to different buffers, the tracers (ie function, function_graph and the latency tracers) still can only be written to the main buffer. 2) The function tracer triggers have now been extended. The function tracer had two triggers. One to enable tracing when a function is hit, and one to disable tracing. Now you can record a stack trace on a single (or many) function(s), take a snapshot of the buffer (copy it to the snapshot buffer), and you can enable or disable an event to be traced when a function is hit. 3) A perf clock has been added. A "perf" clock can be chosen to be used when tracing. This will cause ftrace to use the same clock as perf uses, and hopefully this will make it easier to interleave the perf and ftrace data for analysis." * tag 'trace-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace: (82 commits) tracepoints: Prevent null probe from being added tracing: Compare to 1 instead of zero for is_signed_type() tracing: Remove obsolete macro guard _TRACE_PROFILE_INIT ftrace: Get rid of ftrace_profile_bits tracing: Check return value of tracing_init_dentry() tracing: Get rid of unneeded key calculation in ftrace_hash_move() tracing: Reset ftrace_graph_filter_enabled if count is zero tracing: Fix off-by-one on allocating stat->pages kernel: tracing: Use strlcpy instead of strncpy tracing: Update debugfs README file tracing: Fix ftrace_dump() tracing: Rename trace_event_mutex to trace_event_sem tracing: Fix comment about prefix in arch_syscall_match_sym_name() tracing: Convert trace_destroy_fields() to static tracing: Move find_event_field() into trace_events.c tracing: Use TRACE_MAX_PRINT instead of constant tracing: Use pr_warn_once instead of open coded implementation ring-buffer: Add ring buffer startup selftest tracing: Bring Documentation/trace/ftrace.txt up to date tracing: Add "perf" trace_clock ... Conflicts: kernel/trace/ftrace.c kernel/trace/trace.c
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/kernel-parameters.txt7
-rw-r--r--Documentation/trace/ftrace.txt2097
2 files changed, 1487 insertions, 617 deletions
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 8ccbf27aead4..5abc09a93bc2 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -320,6 +320,13 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
320 on: enable for both 32- and 64-bit processes 320 on: enable for both 32- and 64-bit processes
321 off: disable for both 32- and 64-bit processes 321 off: disable for both 32- and 64-bit processes
322 322
323 alloc_snapshot [FTRACE]
324 Allocate the ftrace snapshot buffer on boot up when the
325 main buffer is allocated. This is handy if debugging
326 and you need to use tracing_snapshot() on boot up, and
327 do not want to use tracing_snapshot_alloc() as it needs
328 to be done where GFP_KERNEL allocations are allowed.
329
323 amd_iommu= [HW,X86-64] 330 amd_iommu= [HW,X86-64]
324 Pass parameters to the AMD IOMMU driver in the system. 331 Pass parameters to the AMD IOMMU driver in the system.
325 Possible values are: 332 Possible values are:
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index a372304aef10..bfe8c29b1f1d 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -8,6 +8,7 @@ Copyright 2008 Red Hat Inc.
8Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, 8Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
9 John Kacur, and David Teigland. 9 John Kacur, and David Teigland.
10Written for: 2.6.28-rc2 10Written for: 2.6.28-rc2
11Updated for: 3.10
11 12
12Introduction 13Introduction
13------------ 14------------
@@ -17,13 +18,16 @@ designers of systems to find what is going on inside the kernel.
17It can be used for debugging or analyzing latencies and 18It can be used for debugging or analyzing latencies and
18performance issues that take place outside of user-space. 19performance issues that take place outside of user-space.
19 20
20Although ftrace is the function tracer, it also includes an 21Although ftrace is typically considered the function tracer, it
21infrastructure that allows for other types of tracing. Some of 22is really a frame work of several assorted tracing utilities.
22the tracers that are currently in ftrace include a tracer to 23There's latency tracing to examine what occurs between interrupts
23trace context switches, the time it takes for a high priority 24disabled and enabled, as well as for preemption and from a time
24task to run after it was woken up, the time interrupts are 25a task is woken to the task is actually scheduled in.
25disabled, and more (ftrace allows for tracer plugins, which 26
26means that the list of tracers can always grow). 27One of the most common uses of ftrace is the event tracing.
28Through out the kernel is hundreds of static event points that
29can be enabled via the debugfs file system to see what is
30going on in certain parts of the kernel.
27 31
28 32
29Implementation Details 33Implementation Details
@@ -61,7 +65,7 @@ the extended "/sys/kernel/debug/tracing" path name.
61 65
62That's it! (assuming that you have ftrace configured into your kernel) 66That's it! (assuming that you have ftrace configured into your kernel)
63 67
64After mounting the debugfs, you can see a directory called 68After mounting debugfs, you can see a directory called
65"tracing". This directory contains the control and output files 69"tracing". This directory contains the control and output files
66of ftrace. Here is a list of some of the key files: 70of ftrace. Here is a list of some of the key files:
67 71
@@ -84,7 +88,9 @@ of ftrace. Here is a list of some of the key files:
84 88
85 This sets or displays whether writing to the trace 89 This sets or displays whether writing to the trace
86 ring buffer is enabled. Echo 0 into this file to disable 90 ring buffer is enabled. Echo 0 into this file to disable
87 the tracer or 1 to enable it. 91 the tracer or 1 to enable it. Note, this only disables
92 writing to the ring buffer, the tracing overhead may
93 still be occurring.
88 94
89 trace: 95 trace:
90 96
@@ -109,7 +115,15 @@ of ftrace. Here is a list of some of the key files:
109 115
110 This file lets the user control the amount of data 116 This file lets the user control the amount of data
111 that is displayed in one of the above output 117 that is displayed in one of the above output
112 files. 118 files. Options also exist to modify how a tracer
119 or events work (stack traces, timestamps, etc).
120
121 options:
122
123 This is a directory that has a file for every available
124 trace option (also in trace_options). Options may also be set
125 or cleared by writing a "1" or "0" respectively into the
126 corresponding file with the option name.
113 127
114 tracing_max_latency: 128 tracing_max_latency:
115 129
@@ -121,10 +135,17 @@ of ftrace. Here is a list of some of the key files:
121 latency is greater than the value in this 135 latency is greater than the value in this
122 file. (in microseconds) 136 file. (in microseconds)
123 137
138 tracing_thresh:
139
140 Some latency tracers will record a trace whenever the
141 latency is greater than the number in this file.
142 Only active when the file contains a number greater than 0.
143 (in microseconds)
144
124 buffer_size_kb: 145 buffer_size_kb:
125 146
126 This sets or displays the number of kilobytes each CPU 147 This sets or displays the number of kilobytes each CPU
127 buffer can hold. The tracer buffers are the same size 148 buffer holds. By default, the trace buffers are the same size
128 for each CPU. The displayed number is the size of the 149 for each CPU. The displayed number is the size of the
129 CPU buffer and not total size of all buffers. The 150 CPU buffer and not total size of all buffers. The
130 trace buffers are allocated in pages (blocks of memory 151 trace buffers are allocated in pages (blocks of memory
@@ -133,16 +154,30 @@ of ftrace. Here is a list of some of the key files:
133 than requested, the rest of the page will be used, 154 than requested, the rest of the page will be used,
134 making the actual allocation bigger than requested. 155 making the actual allocation bigger than requested.
135 ( Note, the size may not be a multiple of the page size 156 ( Note, the size may not be a multiple of the page size
136 due to buffer management overhead. ) 157 due to buffer management meta-data. )
137 158
138 This can only be updated when the current_tracer 159 buffer_total_size_kb:
139 is set to "nop". 160
161 This displays the total combined size of all the trace buffers.
162
163 free_buffer:
164
165 If a process is performing the tracing, and the ring buffer
166 should be shrunk "freed" when the process is finished, even
167 if it were to be killed by a signal, this file can be used
168 for that purpose. On close of this file, the ring buffer will
169 be resized to its minimum size. Having a process that is tracing
170 also open this file, when the process exits its file descriptor
171 for this file will be closed, and in doing so, the ring buffer
172 will be "freed".
173
174 It may also stop tracing if disable_on_free option is set.
140 175
141 tracing_cpumask: 176 tracing_cpumask:
142 177
143 This is a mask that lets the user only trace 178 This is a mask that lets the user only trace
144 on specified CPUS. The format is a hex string 179 on specified CPUs. The format is a hex string
145 representing the CPUS. 180 representing the CPUs.
146 181
147 set_ftrace_filter: 182 set_ftrace_filter:
148 183
@@ -183,6 +218,261 @@ of ftrace. Here is a list of some of the key files:
183 "set_ftrace_notrace". (See the section "dynamic ftrace" 218 "set_ftrace_notrace". (See the section "dynamic ftrace"
184 below for more details.) 219 below for more details.)
185 220
221 enabled_functions:
222
223 This file is more for debugging ftrace, but can also be useful
224 in seeing if any function has a callback attached to it.
225 Not only does the trace infrastructure use ftrace function
226 trace utility, but other subsystems might too. This file
227 displays all functions that have a callback attached to them
228 as well as the number of callbacks that have been attached.
229 Note, a callback may also call multiple functions which will
230 not be listed in this count.
231
232 If the callback registered to be traced by a function with
233 the "save regs" attribute (thus even more overhead), a 'R'
234 will be displayed on the same line as the function that
235 is returning registers.
236
237 function_profile_enabled:
238
239 When set it will enable all functions with either the function
240 tracer, or if enabled, the function graph tracer. It will
241 keep a histogram of the number of functions that were called
242 and if run with the function graph tracer, it will also keep
243 track of the time spent in those functions. The histogram
244 content can be displayed in the files:
245
246 trace_stats/function<cpu> ( function0, function1, etc).
247
248 trace_stats:
249
250 A directory that holds different tracing stats.
251
252 kprobe_events:
253
254 Enable dynamic trace points. See kprobetrace.txt.
255
256 kprobe_profile:
257
258 Dynamic trace points stats. See kprobetrace.txt.
259
260 max_graph_depth:
261
262 Used with the function graph tracer. This is the max depth
263 it will trace into a function. Setting this to a value of
264 one will show only the first kernel function that is called
265 from user space.
266
267 printk_formats:
268
269 This is for tools that read the raw format files. If an event in
270 the ring buffer references a string (currently only trace_printk()
271 does this), only a pointer to the string is recorded into the buffer
272 and not the string itself. This prevents tools from knowing what
273 that string was. This file displays the string and address for
274 the string allowing tools to map the pointers to what the
275 strings were.
276
277 saved_cmdlines:
278
279 Only the pid of the task is recorded in a trace event unless
280 the event specifically saves the task comm as well. Ftrace
281 makes a cache of pid mappings to comms to try to display
282 comms for events. If a pid for a comm is not listed, then
283 "<...>" is displayed in the output.
284
285 snapshot:
286
287 This displays the "snapshot" buffer and also lets the user
288 take a snapshot of the current running trace.
289 See the "Snapshot" section below for more details.
290
291 stack_max_size:
292
293 When the stack tracer is activated, this will display the
294 maximum stack size it has encountered.
295 See the "Stack Trace" section below.
296
297 stack_trace:
298
299 This displays the stack back trace of the largest stack
300 that was encountered when the stack tracer is activated.
301 See the "Stack Trace" section below.
302
303 stack_trace_filter:
304
305 This is similar to "set_ftrace_filter" but it limits what
306 functions the stack tracer will check.
307
308 trace_clock:
309
310 Whenever an event is recorded into the ring buffer, a
311 "timestamp" is added. This stamp comes from a specified
312 clock. By default, ftrace uses the "local" clock. This
313 clock is very fast and strictly per cpu, but on some
314 systems it may not be monotonic with respect to other
315 CPUs. In other words, the local clocks may not be in sync
316 with local clocks on other CPUs.
317
318 Usual clocks for tracing:
319
320 # cat trace_clock
321 [local] global counter x86-tsc
322
323 local: Default clock, but may not be in sync across CPUs
324
325 global: This clock is in sync with all CPUs but may
326 be a bit slower than the local clock.
327
328 counter: This is not a clock at all, but literally an atomic
329 counter. It counts up one by one, but is in sync
330 with all CPUs. This is useful when you need to
331 know exactly the order events occurred with respect to
332 each other on different CPUs.
333
334 uptime: This uses the jiffies counter and the time stamp
335 is relative to the time since boot up.
336
337 perf: This makes ftrace use the same clock that perf uses.
338 Eventually perf will be able to read ftrace buffers
339 and this will help out in interleaving the data.
340
341 x86-tsc: Architectures may define their own clocks. For
342 example, x86 uses its own TSC cycle clock here.
343
344 To set a clock, simply echo the clock name into this file.
345
346 echo global > trace_clock
347
348 trace_marker:
349
350 This is a very useful file for synchronizing user space
351 with events happening in the kernel. Writing strings into
352 this file will be written into the ftrace buffer.
353
354 It is useful in applications to open this file at the start
355 of the application and just reference the file descriptor
356 for the file.
357
358 void trace_write(const char *fmt, ...)
359 {
360 va_list ap;
361 char buf[256];
362 int n;
363
364 if (trace_fd < 0)
365 return;
366
367 va_start(ap, fmt);
368 n = vsnprintf(buf, 256, fmt, ap);
369 va_end(ap);
370
371 write(trace_fd, buf, n);
372 }
373
374 start:
375
376 trace_fd = open("trace_marker", WR_ONLY);
377
378 uprobe_events:
379
380 Add dynamic tracepoints in programs.
381 See uprobetracer.txt
382
383 uprobe_profile:
384
385 Uprobe statistics. See uprobetrace.txt
386
387 instances:
388
389 This is a way to make multiple trace buffers where different
390 events can be recorded in different buffers.
391 See "Instances" section below.
392
393 events:
394
395 This is the trace event directory. It holds event tracepoints
396 (also known as static tracepoints) that have been compiled
397 into the kernel. It shows what event tracepoints exist
398 and how they are grouped by system. There are "enable"
399 files at various levels that can enable the tracepoints
400 when a "1" is written to them.
401
402 See events.txt for more information.
403
404 per_cpu:
405
406 This is a directory that contains the trace per_cpu information.
407
408 per_cpu/cpu0/buffer_size_kb:
409
410 The ftrace buffer is defined per_cpu. That is, there's a separate
411 buffer for each CPU to allow writes to be done atomically,
412 and free from cache bouncing. These buffers may have different
413 size buffers. This file is similar to the buffer_size_kb
414 file, but it only displays or sets the buffer size for the
415 specific CPU. (here cpu0).
416
417 per_cpu/cpu0/trace:
418
419 This is similar to the "trace" file, but it will only display
420 the data specific for the CPU. If written to, it only clears
421 the specific CPU buffer.
422
423 per_cpu/cpu0/trace_pipe
424
425 This is similar to the "trace_pipe" file, and is a consuming
426 read, but it will only display (and consume) the data specific
427 for the CPU.
428
429 per_cpu/cpu0/trace_pipe_raw
430
431 For tools that can parse the ftrace ring buffer binary format,
432 the trace_pipe_raw file can be used to extract the data
433 from the ring buffer directly. With the use of the splice()
434 system call, the buffer data can be quickly transferred to
435 a file or to the network where a server is collecting the
436 data.
437
438 Like trace_pipe, this is a consuming reader, where multiple
439 reads will always produce different data.
440
441 per_cpu/cpu0/snapshot:
442
443 This is similar to the main "snapshot" file, but will only
444 snapshot the current CPU (if supported). It only displays
445 the content of the snapshot for a given CPU, and if
446 written to, only clears this CPU buffer.
447
448 per_cpu/cpu0/snapshot_raw:
449
450 Similar to the trace_pipe_raw, but will read the binary format
451 from the snapshot buffer for the given CPU.
452
453 per_cpu/cpu0/stats:
454
455 This displays certain stats about the ring buffer:
456
457 entries: The number of events that are still in the buffer.
458
459 overrun: The number of lost events due to overwriting when
460 the buffer was full.
461
462 commit overrun: Should always be zero.
463 This gets set if so many events happened within a nested
464 event (ring buffer is re-entrant), that it fills the
465 buffer and starts dropping events.
466
467 bytes: Bytes actually read (not overwritten).
468
469 oldest event ts: The oldest timestamp in the buffer
470
471 now ts: The current timestamp
472
473 dropped events: Events lost due to overwrite option being off.
474
475 read events: The number of events read.
186 476
187The Tracers 477The Tracers
188----------- 478-----------
@@ -234,11 +524,6 @@ Here is the list of current tracers that may be configured.
234 RT tasks (as the current "wakeup" does). This is useful 524 RT tasks (as the current "wakeup" does). This is useful
235 for those interested in wake up timings of RT tasks. 525 for those interested in wake up timings of RT tasks.
236 526
237 "hw-branch-tracer"
238
239 Uses the BTS CPU feature on x86 CPUs to traces all
240 branches executed.
241
242 "nop" 527 "nop"
243 528
244 This is the "trace nothing" tracer. To remove all 529 This is the "trace nothing" tracer. To remove all
@@ -261,70 +546,100 @@ Here is an example of the output format of the file "trace"
261 -------- 546 --------
262# tracer: function 547# tracer: function
263# 548#
264# TASK-PID CPU# TIMESTAMP FUNCTION 549# entries-in-buffer/entries-written: 140080/250280 #P:4
265# | | | | | 550#
266 bash-4251 [01] 10152.583854: path_put <-path_walk 551# _-----=> irqs-off
267 bash-4251 [01] 10152.583855: dput <-path_put 552# / _----=> need-resched
268 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput 553# | / _---=> hardirq/softirq
554# || / _--=> preempt-depth
555# ||| / delay
556# TASK-PID CPU# |||| TIMESTAMP FUNCTION
557# | | | |||| | |
558 bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath
559 bash-1977 [000] .... 17284.993653: __close_fd <-sys_close
560 bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd
561 sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
562 bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
563 bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
564 bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
565 bash-1977 [000] .... 17284.993657: filp_close <-__close_fd
566 bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close
567 sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath
269 -------- 568 --------
270 569
271A header is printed with the tracer name that is represented by 570A header is printed with the tracer name that is represented by
272the trace. In this case the tracer is "function". Then a header 571the trace. In this case the tracer is "function". Then it shows the
273showing the format. Task name "bash", the task PID "4251", the 572number of events in the buffer as well as the total number of entries
274CPU that it was running on "01", the timestamp in <secs>.<usecs> 573that were written. The difference is the number of entries that were
275format, the function name that was traced "path_put" and the 574lost due to the buffer filling up (250280 - 140080 = 110200 events
276parent function that called this function "path_walk". The 575lost).
277timestamp is the time at which the function was entered. 576
577The header explains the content of the events. Task name "bash", the task
578PID "1977", the CPU that it was running on "000", the latency format
579(explained below), the timestamp in <secs>.<usecs> format, the
580function name that was traced "sys_close" and the parent function that
581called this function "system_call_fastpath". The timestamp is the time
582at which the function was entered.
278 583
279Latency trace format 584Latency trace format
280-------------------- 585--------------------
281 586
282When the latency-format option is enabled, the trace file gives 587When the latency-format option is enabled or when one of the latency
283somewhat more information to see why a latency happened. 588tracers is set, the trace file gives somewhat more information to see
284Here is a typical trace. 589why a latency happened. Here is a typical trace.
285 590
286# tracer: irqsoff 591# tracer: irqsoff
287# 592#
288irqsoff latency trace v1.1.5 on 2.6.26-rc8 593# irqsoff latency trace v1.1.5 on 3.8.0-test+
289-------------------------------------------------------------------- 594# --------------------------------------------------------------------
290 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 595# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
291 ----------------- 596# -----------------
292 | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) 597# | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
293 ----------------- 598# -----------------
294 => started at: apic_timer_interrupt 599# => started at: __lock_task_sighand
295 => ended at: do_softirq 600# => ended at: _raw_spin_unlock_irqrestore
296 601#
297# _------=> CPU# 602#
298# / _-----=> irqs-off 603# _------=> CPU#
299# | / _----=> need-resched 604# / _-----=> irqs-off
300# || / _---=> hardirq/softirq 605# | / _----=> need-resched
301# ||| / _--=> preempt-depth 606# || / _---=> hardirq/softirq
302# |||| / 607# ||| / _--=> preempt-depth
303# ||||| delay 608# |||| / delay
304# cmd pid ||||| time | caller 609# cmd pid ||||| time | caller
305# \ / ||||| \ | / 610# \ / ||||| \ | /
306 <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt) 611 ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand
307 <idle>-0 0d.s. 97us : __do_softirq (do_softirq) 612 ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
308 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) 613 ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
614 ps-6143 2d..1 306us : <stack trace>
615 => trace_hardirqs_on_caller
616 => trace_hardirqs_on
617 => _raw_spin_unlock_irqrestore
618 => do_task_stat
619 => proc_tgid_stat
620 => proc_single_show
621 => seq_read
622 => vfs_read
623 => sys_read
624 => system_call_fastpath
309 625
310 626
311This shows that the current tracer is "irqsoff" tracing the time 627This shows that the current tracer is "irqsoff" tracing the time
312for which interrupts were disabled. It gives the trace version 628for which interrupts were disabled. It gives the trace version (which
313and the version of the kernel upon which this was executed on 629never changes) and the version of the kernel upon which this was executed on
314(2.6.26-rc8). Then it displays the max latency in microsecs (97 630(3.10). Then it displays the max latency in microseconds (259 us). The number
315us). The number of trace entries displayed and the total number 631of trace entries displayed and the total number (both are four: #4/4).
316recorded (both are three: #3/3). The type of preemption that was 632VP, KP, SP, and HP are always zero and are reserved for later use.
317used (PREEMPT). VP, KP, SP, and HP are always zero and are 633#P is the number of online CPUs (#P:4).
318reserved for later use. #P is the number of online CPUS (#P:2).
319 634
320The task is the process that was running when the latency 635The task is the process that was running when the latency
321occurred. (swapper pid: 0). 636occurred. (ps pid: 6143).
322 637
323The start and stop (the functions in which the interrupts were 638The start and stop (the functions in which the interrupts were
324disabled and enabled respectively) that caused the latencies: 639disabled and enabled respectively) that caused the latencies:
325 640
326 apic_timer_interrupt is where the interrupts were disabled. 641 __lock_task_sighand is where the interrupts were disabled.
327 do_softirq is where they were enabled again. 642 _raw_spin_unlock_irqrestore is where they were enabled again.
328 643
329The next lines after the header are the trace itself. The header 644The next lines after the header are the trace itself. The header
330explains which is which. 645explains which is which.
@@ -367,16 +682,43 @@ The above is mostly meaningful for kernel developers.
367 682
368 The rest is the same as the 'trace' file. 683 The rest is the same as the 'trace' file.
369 684
685 Note, the latency tracers will usually end with a back trace
686 to easily find where the latency occurred.
370 687
371trace_options 688trace_options
372------------- 689-------------
373 690
374The trace_options file is used to control what gets printed in 691The trace_options file (or the options directory) is used to control
375the trace output. To see what is available, simply cat the file: 692what gets printed in the trace output, or manipulate the tracers.
693To see what is available, simply cat the file:
376 694
377 cat trace_options 695 cat trace_options
378 print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ 696print-parent
379 noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj 697nosym-offset
698nosym-addr
699noverbose
700noraw
701nohex
702nobin
703noblock
704nostacktrace
705trace_printk
706noftrace_preempt
707nobranch
708annotate
709nouserstacktrace
710nosym-userobj
711noprintk-msg-only
712context-info
713latency-format
714sleep-time
715graph-time
716record-cmd
717overwrite
718nodisable_on_free
719irq-info
720markers
721function-trace
380 722
381To disable one of the options, echo in the option prepended with 723To disable one of the options, echo in the option prepended with
382"no". 724"no".
@@ -428,13 +770,34 @@ Here are the available options:
428 770
429 bin - This will print out the formats in raw binary. 771 bin - This will print out the formats in raw binary.
430 772
431 block - TBD (needs update) 773 block - When set, reading trace_pipe will not block when polled.
432 774
433 stacktrace - This is one of the options that changes the trace 775 stacktrace - This is one of the options that changes the trace
434 itself. When a trace is recorded, so is the stack 776 itself. When a trace is recorded, so is the stack
435 of functions. This allows for back traces of 777 of functions. This allows for back traces of
436 trace sites. 778 trace sites.
437 779
780 trace_printk - Can disable trace_printk() from writing into the buffer.
781
782 branch - Enable branch tracing with the tracer.
783
784 annotate - It is sometimes confusing when the CPU buffers are full
785 and one CPU buffer had a lot of events recently, thus
786 a shorter time frame, were another CPU may have only had
787 a few events, which lets it have older events. When
788 the trace is reported, it shows the oldest events first,
789 and it may look like only one CPU ran (the one with the
790 oldest events). When the annotate option is set, it will
791 display when a new CPU buffer started:
792
793 <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
794 <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
795 <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
796##### CPU 2 buffer started ####
797 <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
798 <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
799 <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
800
438 userstacktrace - This option changes the trace. It records a 801 userstacktrace - This option changes the trace. It records a
439 stacktrace of the current userspace thread. 802 stacktrace of the current userspace thread.
440 803
@@ -451,9 +814,13 @@ Here are the available options:
451 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 814 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
452x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] 815x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
453 816
454 sched-tree - trace all tasks that are on the runqueue, at 817
455 every scheduling event. Will add overhead if 818 printk-msg-only - When set, trace_printk()s will only show the format
456 there's a lot of tasks running at once. 819 and not their parameters (if trace_bprintk() or
820 trace_bputs() was used to save the trace_printk()).
821
822 context-info - Show only the event data. Hides the comm, PID,
823 timestamp, CPU, and other useful data.
457 824
458 latency-format - This option changes the trace. When 825 latency-format - This option changes the trace. When
459 it is enabled, the trace displays 826 it is enabled, the trace displays
@@ -461,31 +828,61 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
461 latencies, as described in "Latency 828 latencies, as described in "Latency
462 trace format". 829 trace format".
463 830
831 sleep-time - When running function graph tracer, to include
832 the time a task schedules out in its function.
833 When enabled, it will account time the task has been
834 scheduled out as part of the function call.
835
836 graph-time - When running function graph tracer, to include the
837 time to call nested functions. When this is not set,
838 the time reported for the function will only include
839 the time the function itself executed for, not the time
840 for functions that it called.
841
842 record-cmd - When any event or tracer is enabled, a hook is enabled
843 in the sched_switch trace point to fill comm cache
844 with mapped pids and comms. But this may cause some
845 overhead, and if you only care about pids, and not the
846 name of the task, disabling this option can lower the
847 impact of tracing.
848
464 overwrite - This controls what happens when the trace buffer is 849 overwrite - This controls what happens when the trace buffer is
465 full. If "1" (default), the oldest events are 850 full. If "1" (default), the oldest events are
466 discarded and overwritten. If "0", then the newest 851 discarded and overwritten. If "0", then the newest
467 events are discarded. 852 events are discarded.
853 (see per_cpu/cpu0/stats for overrun and dropped)
468 854
469ftrace_enabled 855 disable_on_free - When the free_buffer is closed, tracing will
470-------------- 856 stop (tracing_on set to 0).
471 857
472The following tracers (listed below) give different output 858 irq-info - Shows the interrupt, preempt count, need resched data.
473depending on whether or not the sysctl ftrace_enabled is set. To 859 When disabled, the trace looks like:
474set ftrace_enabled, one can either use the sysctl function or
475set it via the proc file system interface.
476 860
477 sysctl kernel.ftrace_enabled=1 861# tracer: function
862#
863# entries-in-buffer/entries-written: 144405/9452052 #P:4
864#
865# TASK-PID CPU# TIMESTAMP FUNCTION
866# | | | | |
867 <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
868 <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89
869 <idle>-0 [002] 23636.756055: enqueue_task <-activate_task
478 870
479 or
480 871
481 echo 1 > /proc/sys/kernel/ftrace_enabled 872 markers - When set, the trace_marker is writable (only by root).
873 When disabled, the trace_marker will error with EINVAL
874 on write.
875
876
877 function-trace - The latency tracers will enable function tracing
878 if this option is enabled (default it is). When
879 it is disabled, the latency tracers do not trace
880 functions. This keeps the overhead of the tracer down
881 when performing latency tests.
482 882
483To disable ftrace_enabled simply replace the '1' with '0' in the 883 Note: Some tracers have their own options. They only appear
484above commands. 884 when the tracer is active.
485 885
486When ftrace_enabled is set the tracers will also record the
487functions that are within the trace. The descriptions of the
488tracers will also show an example with ftrace enabled.
489 886
490 887
491irqsoff 888irqsoff
@@ -506,95 +903,133 @@ new trace is saved.
506To reset the maximum, echo 0 into tracing_max_latency. Here is 903To reset the maximum, echo 0 into tracing_max_latency. Here is
507an example: 904an example:
508 905
906 # echo 0 > options/function-trace
509 # echo irqsoff > current_tracer 907 # echo irqsoff > current_tracer
510 # echo latency-format > trace_options
511 # echo 0 > tracing_max_latency
512 # echo 1 > tracing_on 908 # echo 1 > tracing_on
909 # echo 0 > tracing_max_latency
513 # ls -ltr 910 # ls -ltr
514 [...] 911 [...]
515 # echo 0 > tracing_on 912 # echo 0 > tracing_on
516 # cat trace 913 # cat trace
517# tracer: irqsoff 914# tracer: irqsoff
518# 915#
519irqsoff latency trace v1.1.5 on 2.6.26 916# irqsoff latency trace v1.1.5 on 3.8.0-test+
520-------------------------------------------------------------------- 917# --------------------------------------------------------------------
521 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 918# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
522 ----------------- 919# -----------------
523 | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0) 920# | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
524 ----------------- 921# -----------------
525 => started at: sys_setpgid 922# => started at: run_timer_softirq
526 => ended at: sys_setpgid 923# => ended at: run_timer_softirq
527 924#
528# _------=> CPU# 925#
529# / _-----=> irqs-off 926# _------=> CPU#
530# | / _----=> need-resched 927# / _-----=> irqs-off
531# || / _---=> hardirq/softirq 928# | / _----=> need-resched
532# ||| / _--=> preempt-depth 929# || / _---=> hardirq/softirq
533# |||| / 930# ||| / _--=> preempt-depth
534# ||||| delay 931# |||| / delay
535# cmd pid ||||| time | caller 932# cmd pid ||||| time | caller
536# \ / ||||| \ | / 933# \ / ||||| \ | /
537 bash-3730 1d... 0us : _write_lock_irq (sys_setpgid) 934 <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq
538 bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid) 935 <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq
539 bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid) 936 <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq
540 937 <idle>-0 0dNs3 25us : <stack trace>
541 938 => _raw_spin_unlock_irq
542Here we see that that we had a latency of 12 microsecs (which is 939 => run_timer_softirq
543very good). The _write_lock_irq in sys_setpgid disabled 940 => __do_softirq
544interrupts. The difference between the 12 and the displayed 941 => call_softirq
545timestamp 14us occurred because the clock was incremented 942 => do_softirq
943 => irq_exit
944 => smp_apic_timer_interrupt
945 => apic_timer_interrupt
946 => rcu_idle_exit
947 => cpu_idle
948 => rest_init
949 => start_kernel
950 => x86_64_start_reservations
951 => x86_64_start_kernel
952
953Here we see that that we had a latency of 16 microseconds (which is
954very good). The _raw_spin_lock_irq in run_timer_softirq disabled
955interrupts. The difference between the 16 and the displayed
956timestamp 25us occurred because the clock was incremented
546between the time of recording the max latency and the time of 957between the time of recording the max latency and the time of
547recording the function that had that latency. 958recording the function that had that latency.
548 959
549Note the above example had ftrace_enabled not set. If we set the 960Note the above example had function-trace not set. If we set
550ftrace_enabled, we get a much larger output: 961function-trace, we get a much larger output:
962
963 with echo 1 > options/function-trace
551 964
552# tracer: irqsoff 965# tracer: irqsoff
553# 966#
554irqsoff latency trace v1.1.5 on 2.6.26-rc8 967# irqsoff latency trace v1.1.5 on 3.8.0-test+
555-------------------------------------------------------------------- 968# --------------------------------------------------------------------
556 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 969# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
557 ----------------- 970# -----------------
558 | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0) 971# | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
559 ----------------- 972# -----------------
560 => started at: __alloc_pages_internal 973# => started at: ata_scsi_queuecmd
561 => ended at: __alloc_pages_internal 974# => ended at: ata_scsi_queuecmd
562 975#
563# _------=> CPU# 976#
564# / _-----=> irqs-off 977# _------=> CPU#
565# | / _----=> need-resched 978# / _-----=> irqs-off
566# || / _---=> hardirq/softirq 979# | / _----=> need-resched
567# ||| / _--=> preempt-depth 980# || / _---=> hardirq/softirq
568# |||| / 981# ||| / _--=> preempt-depth
569# ||||| delay 982# |||| / delay
570# cmd pid ||||| time | caller 983# cmd pid ||||| time | caller
571# \ / ||||| \ | / 984# \ / ||||| \ | /
572 ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal) 985 bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
573 ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist) 986 bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave
574 ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk) 987 bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd
575 ls-4339 0d..1 4us : add_preempt_count (_spin_lock) 988 bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev
576 ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk) 989 bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
577 ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue) 990 bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd
578 ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest) 991 bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd
579 ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk) 992 bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
580 ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue) 993 bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat
581 ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest)
582 ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk)
583 ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue)
584[...] 994[...]
585 ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue) 995 bash-2042 3d..1 67us : delay_tsc <-__delay
586 ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest) 996 bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
587 ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk) 997 bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc
588 ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue) 998 bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
589 ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest) 999 bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc
590 ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk) 1000 bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue
591 ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock) 1001 bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
592 ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal) 1002 bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
593 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) 1003 bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd
594 1004 bash-2042 3d..1 120us : <stack trace>
595 1005 => _raw_spin_unlock_irqrestore
596 1006 => ata_scsi_queuecmd
597Here we traced a 50 microsecond latency. But we also see all the 1007 => scsi_dispatch_cmd
1008 => scsi_request_fn
1009 => __blk_run_queue_uncond
1010 => __blk_run_queue
1011 => blk_queue_bio
1012 => generic_make_request
1013 => submit_bio
1014 => submit_bh
1015 => __ext3_get_inode_loc
1016 => ext3_iget
1017 => ext3_lookup
1018 => lookup_real
1019 => __lookup_hash
1020 => walk_component
1021 => lookup_last
1022 => path_lookupat
1023 => filename_lookup
1024 => user_path_at_empty
1025 => user_path_at
1026 => vfs_fstatat
1027 => vfs_stat
1028 => sys_newstat
1029 => system_call_fastpath
1030
1031
1032Here we traced a 71 microsecond latency. But we also see all the
598functions that were called during that time. Note that by 1033functions that were called during that time. Note that by
599enabling function tracing, we incur an added overhead. This 1034enabling function tracing, we incur an added overhead. This
600overhead may extend the latency times. But nevertheless, this 1035overhead may extend the latency times. But nevertheless, this
@@ -614,120 +1049,122 @@ Like the irqsoff tracer, it records the maximum latency for
614which preemption was disabled. The control of preemptoff tracer 1049which preemption was disabled. The control of preemptoff tracer
615is much like the irqsoff tracer. 1050is much like the irqsoff tracer.
616 1051
1052 # echo 0 > options/function-trace
617 # echo preemptoff > current_tracer 1053 # echo preemptoff > current_tracer
618 # echo latency-format > trace_options
619 # echo 0 > tracing_max_latency
620 # echo 1 > tracing_on 1054 # echo 1 > tracing_on
1055 # echo 0 > tracing_max_latency
621 # ls -ltr 1056 # ls -ltr
622 [...] 1057 [...]
623 # echo 0 > tracing_on 1058 # echo 0 > tracing_on
624 # cat trace 1059 # cat trace
625# tracer: preemptoff 1060# tracer: preemptoff
626# 1061#
627preemptoff latency trace v1.1.5 on 2.6.26-rc8 1062# preemptoff latency trace v1.1.5 on 3.8.0-test+
628-------------------------------------------------------------------- 1063# --------------------------------------------------------------------
629 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 1064# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
630 ----------------- 1065# -----------------
631 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) 1066# | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
632 ----------------- 1067# -----------------
633 => started at: do_IRQ 1068# => started at: do_IRQ
634 => ended at: __do_softirq 1069# => ended at: do_IRQ
635 1070#
636# _------=> CPU# 1071#
637# / _-----=> irqs-off 1072# _------=> CPU#
638# | / _----=> need-resched 1073# / _-----=> irqs-off
639# || / _---=> hardirq/softirq 1074# | / _----=> need-resched
640# ||| / _--=> preempt-depth 1075# || / _---=> hardirq/softirq
641# |||| / 1076# ||| / _--=> preempt-depth
642# ||||| delay 1077# |||| / delay
643# cmd pid ||||| time | caller 1078# cmd pid ||||| time | caller
644# \ / ||||| \ | / 1079# \ / ||||| \ | /
645 sshd-4261 0d.h. 0us+: irq_enter (do_IRQ) 1080 sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ
646 sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq) 1081 sshd-1991 1d..1 46us : irq_exit <-do_IRQ
647 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) 1082 sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ
1083 sshd-1991 1d..1 52us : <stack trace>
1084 => sub_preempt_count
1085 => irq_exit
1086 => do_IRQ
1087 => ret_from_intr
648 1088
649 1089
650This has some more changes. Preemption was disabled when an 1090This has some more changes. Preemption was disabled when an
651interrupt came in (notice the 'h'), and was enabled while doing 1091interrupt came in (notice the 'h'), and was enabled on exit.
652a softirq. (notice the 's'). But we also see that interrupts 1092But we also see that interrupts have been disabled when entering
653have been disabled when entering the preempt off section and 1093the preempt off section and leaving it (the 'd'). We do not know if
654leaving it (the 'd'). We do not know if interrupts were enabled 1094interrupts were enabled in the mean time or shortly after this
655in the mean time. 1095was over.
656 1096
657# tracer: preemptoff 1097# tracer: preemptoff
658# 1098#
659preemptoff latency trace v1.1.5 on 2.6.26-rc8 1099# preemptoff latency trace v1.1.5 on 3.8.0-test+
660-------------------------------------------------------------------- 1100# --------------------------------------------------------------------
661 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 1101# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
662 ----------------- 1102# -----------------
663 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) 1103# | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
664 ----------------- 1104# -----------------
665 => started at: remove_wait_queue 1105# => started at: wake_up_new_task
666 => ended at: __do_softirq 1106# => ended at: task_rq_unlock
667 1107#
668# _------=> CPU# 1108#
669# / _-----=> irqs-off 1109# _------=> CPU#
670# | / _----=> need-resched 1110# / _-----=> irqs-off
671# || / _---=> hardirq/softirq 1111# | / _----=> need-resched
672# ||| / _--=> preempt-depth 1112# || / _---=> hardirq/softirq
673# |||| / 1113# ||| / _--=> preempt-depth
674# ||||| delay 1114# |||| / delay
675# cmd pid ||||| time | caller 1115# cmd pid ||||| time | caller
676# \ / ||||| \ | / 1116# \ / ||||| \ | /
677 sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue) 1117 bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task
678 sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue) 1118 bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq
679 sshd-4261 0d..1 2us : do_IRQ (common_interrupt) 1119 bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair
680 sshd-4261 0d..1 2us : irq_enter (do_IRQ) 1120 bash-1994 1d..1 1us : source_load <-select_task_rq_fair
681 sshd-4261 0d..1 2us : idle_cpu (irq_enter) 1121 bash-1994 1d..1 1us : source_load <-select_task_rq_fair
682 sshd-4261 0d..1 3us : add_preempt_count (irq_enter)
683 sshd-4261 0d.h1 3us : idle_cpu (irq_enter)
684 sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ)
685[...] 1122[...]
686 sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock) 1123 bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt
687 sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq) 1124 bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter
688 sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq) 1125 bash-1994 1d..1 13us : add_preempt_count <-irq_enter
689 sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq) 1126 bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt
690 sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock) 1127 bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt
691 sshd-4261 0d.h1 14us : irq_exit (do_IRQ) 1128 bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt
692 sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit) 1129 bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock
693 sshd-4261 0d..2 15us : do_softirq (irq_exit) 1130 bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt
694 sshd-4261 0d... 15us : __do_softirq (do_softirq)
695 sshd-4261 0d... 16us : __local_bh_disable (__do_softirq)
696 sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable)
697 sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable)
698 sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable)
699 sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable)
700[...] 1131[...]
701 sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable) 1132 bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event
702 sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable) 1133 bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt
703 sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable) 1134 bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit
704 sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable) 1135 bash-1994 1d..2 36us : do_softirq <-irq_exit
705 sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip) 1136 bash-1994 1d..2 36us : __do_softirq <-call_softirq
706 sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip) 1137 bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq
707 sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable) 1138 bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq
708 sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable) 1139 bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq
1140 bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock
1141 bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq
709[...] 1142[...]
710 sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq) 1143 bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks
711 sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) 1144 bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq
1145 bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable
1146 bash-1994 1dN.2 82us : idle_cpu <-irq_exit
1147 bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit
1148 bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit
1149 bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1150 bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock
1151 bash-1994 1.N.1 104us : <stack trace>
1152 => sub_preempt_count
1153 => _raw_spin_unlock_irqrestore
1154 => task_rq_unlock
1155 => wake_up_new_task
1156 => do_fork
1157 => sys_clone
1158 => stub_clone
712 1159
713 1160
714The above is an example of the preemptoff trace with 1161The above is an example of the preemptoff trace with
715ftrace_enabled set. Here we see that interrupts were disabled 1162function-trace set. Here we see that interrupts were not disabled
716the entire time. The irq_enter code lets us know that we entered 1163the entire time. The irq_enter code lets us know that we entered
717an interrupt 'h'. Before that, the functions being traced still 1164an interrupt 'h'. Before that, the functions being traced still
718show that it is not in an interrupt, but we can see from the 1165show that it is not in an interrupt, but we can see from the
719functions themselves that this is not the case. 1166functions themselves that this is not the case.
720 1167
721Notice that __do_softirq when called does not have a
722preempt_count. It may seem that we missed a preempt enabling.
723What really happened is that the preempt count is held on the
724thread's stack and we switched to the softirq stack (4K stacks
725in effect). The code does not copy the preempt count, but
726because interrupts are disabled, we do not need to worry about
727it. Having a tracer like this is good for letting people know
728what really happens inside the kernel.
729
730
731preemptirqsoff 1168preemptirqsoff
732-------------- 1169--------------
733 1170
@@ -762,38 +1199,57 @@ tracer.
762Again, using this trace is much like the irqsoff and preemptoff 1199Again, using this trace is much like the irqsoff and preemptoff
763tracers. 1200tracers.
764 1201
1202 # echo 0 > options/function-trace
765 # echo preemptirqsoff > current_tracer 1203 # echo preemptirqsoff > current_tracer
766 # echo latency-format > trace_options
767 # echo 0 > tracing_max_latency
768 # echo 1 > tracing_on 1204 # echo 1 > tracing_on
1205 # echo 0 > tracing_max_latency
769 # ls -ltr 1206 # ls -ltr
770 [...] 1207 [...]
771 # echo 0 > tracing_on 1208 # echo 0 > tracing_on
772 # cat trace 1209 # cat trace
773# tracer: preemptirqsoff 1210# tracer: preemptirqsoff
774# 1211#
775preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 1212# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
776-------------------------------------------------------------------- 1213# --------------------------------------------------------------------
777 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 1214# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
778 ----------------- 1215# -----------------
779 | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0) 1216# | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
780 ----------------- 1217# -----------------
781 => started at: apic_timer_interrupt 1218# => started at: ata_scsi_queuecmd
782 => ended at: __do_softirq 1219# => ended at: ata_scsi_queuecmd
783 1220#
784# _------=> CPU# 1221#
785# / _-----=> irqs-off 1222# _------=> CPU#
786# | / _----=> need-resched 1223# / _-----=> irqs-off
787# || / _---=> hardirq/softirq 1224# | / _----=> need-resched
788# ||| / _--=> preempt-depth 1225# || / _---=> hardirq/softirq
789# |||| / 1226# ||| / _--=> preempt-depth
790# ||||| delay 1227# |||| / delay
791# cmd pid ||||| time | caller 1228# cmd pid ||||| time | caller
792# \ / ||||| \ | / 1229# \ / ||||| \ | /
793 ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt) 1230 ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
794 ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq) 1231 ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
795 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) 1232 ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd
796 1233 ls-2230 3...1 111us : <stack trace>
1234 => sub_preempt_count
1235 => _raw_spin_unlock_irqrestore
1236 => ata_scsi_queuecmd
1237 => scsi_dispatch_cmd
1238 => scsi_request_fn
1239 => __blk_run_queue_uncond
1240 => __blk_run_queue
1241 => blk_queue_bio
1242 => generic_make_request
1243 => submit_bio
1244 => submit_bh
1245 => ext3_bread
1246 => ext3_dir_bread
1247 => htree_dirblock_to_tree
1248 => ext3_htree_fill_tree
1249 => ext3_readdir
1250 => vfs_readdir
1251 => sys_getdents
1252 => system_call_fastpath
797 1253
798 1254
799The trace_hardirqs_off_thunk is called from assembly on x86 when 1255The trace_hardirqs_off_thunk is called from assembly on x86 when
@@ -802,105 +1258,158 @@ function tracing, we do not know if interrupts were enabled
802within the preemption points. We do see that it started with 1258within the preemption points. We do see that it started with
803preemption enabled. 1259preemption enabled.
804 1260
805Here is a trace with ftrace_enabled set: 1261Here is a trace with function-trace set:
806
807 1262
808# tracer: preemptirqsoff 1263# tracer: preemptirqsoff
809# 1264#
810preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 1265# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
811-------------------------------------------------------------------- 1266# --------------------------------------------------------------------
812 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 1267# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
813 ----------------- 1268# -----------------
814 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) 1269# | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
815 ----------------- 1270# -----------------
816 => started at: write_chan 1271# => started at: schedule
817 => ended at: __do_softirq 1272# => ended at: mutex_unlock
818 1273#
819# _------=> CPU# 1274#
820# / _-----=> irqs-off 1275# _------=> CPU#
821# | / _----=> need-resched 1276# / _-----=> irqs-off
822# || / _---=> hardirq/softirq 1277# | / _----=> need-resched
823# ||| / _--=> preempt-depth 1278# || / _---=> hardirq/softirq
824# |||| / 1279# ||| / _--=> preempt-depth
825# ||||| delay 1280# |||| / delay
826# cmd pid ||||| time | caller 1281# cmd pid ||||| time | caller
827# \ / ||||| \ | / 1282# \ / ||||| \ | /
828 ls-4473 0.N.. 0us : preempt_schedule (write_chan) 1283kworker/-59 3...1 0us : __schedule <-schedule
829 ls-4473 0dN.1 1us : _spin_lock (schedule) 1284kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
830 ls-4473 0dN.1 2us : add_preempt_count (_spin_lock) 1285kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq
831 ls-4473 0d..2 2us : put_prev_task_fair (schedule) 1286kworker/-59 3d..2 1us : deactivate_task <-__schedule
832[...] 1287kworker/-59 3d..2 1us : dequeue_task <-deactivate_task
833 ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts) 1288kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task
834 ls-4473 0d..2 13us : __switch_to (schedule) 1289kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task
835 sshd-4261 0d..2 14us : finish_task_switch (schedule) 1290kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair
836 sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch) 1291kworker/-59 3d..2 2us : update_min_vruntime <-update_curr
837 sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave) 1292kworker/-59 3d..2 3us : cpuacct_charge <-update_curr
838 sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set) 1293kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge
839 sshd-4261 0d..2 16us : do_IRQ (common_interrupt) 1294kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge
840 sshd-4261 0d..2 17us : irq_enter (do_IRQ) 1295kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair
841 sshd-4261 0d..2 17us : idle_cpu (irq_enter) 1296kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair
842 sshd-4261 0d..2 18us : add_preempt_count (irq_enter) 1297kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair
843 sshd-4261 0d.h2 18us : idle_cpu (irq_enter) 1298kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair
844 sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ) 1299kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair
845 sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq) 1300kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair
846 sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock) 1301kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule
847 sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq) 1302kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping
848 sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock) 1303kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule
849[...] 1304kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task
850 sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq) 1305kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair
851 sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock) 1306kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair
852 sshd-4261 0d.h2 29us : irq_exit (do_IRQ) 1307kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity
853 sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit) 1308 ls-2269 3d..2 7us : finish_task_switch <-__schedule
854 sshd-4261 0d..3 30us : do_softirq (irq_exit) 1309 ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch
855 sshd-4261 0d... 30us : __do_softirq (do_softirq) 1310 ls-2269 3d..2 8us : do_IRQ <-ret_from_intr
856 sshd-4261 0d... 31us : __local_bh_disable (__do_softirq) 1311 ls-2269 3d..2 8us : irq_enter <-do_IRQ
857 sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable) 1312 ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter
858 sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable) 1313 ls-2269 3d..2 9us : add_preempt_count <-irq_enter
1314 ls-2269 3d.h2 9us : exit_idle <-do_IRQ
859[...] 1315[...]
860 sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip) 1316 ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock
861 sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip) 1317 ls-2269 3d.h2 20us : irq_exit <-do_IRQ
862 sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt) 1318 ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit
863 sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt) 1319 ls-2269 3d..3 21us : do_softirq <-irq_exit
864 sshd-4261 0d.s3 45us : idle_cpu (irq_enter) 1320 ls-2269 3d..3 21us : __do_softirq <-call_softirq
865 sshd-4261 0d.s3 46us : add_preempt_count (irq_enter) 1321 ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq
866 sshd-4261 0d.H3 46us : idle_cpu (irq_enter) 1322 ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip
867 sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt) 1323 ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip
868 sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt) 1324 ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr
1325 ls-2269 3d.s5 31us : irq_enter <-do_IRQ
1326 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
869[...] 1327[...]
870 sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt) 1328 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
871 sshd-4261 0d.H3 82us : ktime_get (tick_program_event) 1329 ls-2269 3d.s5 32us : add_preempt_count <-irq_enter
872 sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get) 1330 ls-2269 3d.H5 32us : exit_idle <-do_IRQ
873 sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts) 1331 ls-2269 3d.H5 32us : handle_irq <-do_IRQ
874 sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts) 1332 ls-2269 3d.H5 32us : irq_to_desc <-handle_irq
875 sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event) 1333 ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq
876 sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event)
877 sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt)
878 sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit)
879 sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit)
880 sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable)
881[...] 1334[...]
882 sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action) 1335 ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
883 sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq) 1336 ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
884 sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq) 1337 ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq
885 sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq) 1338 ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable
886 sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable) 1339 ls-2269 3d..3 159us : idle_cpu <-irq_exit
887 sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq) 1340 ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit
888 sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) 1341 ls-2269 3d..3 160us : sub_preempt_count <-irq_exit
889 1342 ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock
890 1343 ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock
891This is a very interesting trace. It started with the preemption 1344 ls-2269 3d... 186us : <stack trace>
892of the ls task. We see that the task had the "need_resched" bit 1345 => __mutex_unlock_slowpath
893set via the 'N' in the trace. Interrupts were disabled before 1346 => mutex_unlock
894the spin_lock at the beginning of the trace. We see that a 1347 => process_output
895schedule took place to run sshd. When the interrupts were 1348 => n_tty_write
896enabled, we took an interrupt. On return from the interrupt 1349 => tty_write
897handler, the softirq ran. We took another interrupt while 1350 => vfs_write
898running the softirq as we see from the capital 'H'. 1351 => sys_write
1352 => system_call_fastpath
1353
1354This is an interesting trace. It started with kworker running and
1355scheduling out and ls taking over. But as soon as ls released the
1356rq lock and enabled interrupts (but not preemption) an interrupt
1357triggered. When the interrupt finished, it started running softirqs.
1358But while the softirq was running, another interrupt triggered.
1359When an interrupt is running inside a softirq, the annotation is 'H'.
899 1360
900 1361
901wakeup 1362wakeup
902------ 1363------
903 1364
1365One common case that people are interested in tracing is the
1366time it takes for a task that is woken to actually wake up.
1367Now for non Real-Time tasks, this can be arbitrary. But tracing
1368it none the less can be interesting.
1369
1370Without function tracing:
1371
1372 # echo 0 > options/function-trace
1373 # echo wakeup > current_tracer
1374 # echo 1 > tracing_on
1375 # echo 0 > tracing_max_latency
1376 # chrt -f 5 sleep 1
1377 # echo 0 > tracing_on
1378 # cat trace
1379# tracer: wakeup
1380#
1381# wakeup latency trace v1.1.5 on 3.8.0-test+
1382# --------------------------------------------------------------------
1383# latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1384# -----------------
1385# | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1386# -----------------
1387#
1388# _------=> CPU#
1389# / _-----=> irqs-off
1390# | / _----=> need-resched
1391# || / _---=> hardirq/softirq
1392# ||| / _--=> preempt-depth
1393# |||| / delay
1394# cmd pid ||||| time | caller
1395# \ / ||||| \ | /
1396 <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H
1397 <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1398 <idle>-0 3d..3 15us : __schedule <-schedule
1399 <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H
1400
1401The tracer only traces the highest priority task in the system
1402to avoid tracing the normal circumstances. Here we see that
1403the kworker with a nice priority of -20 (not very nice), took
1404just 15 microseconds from the time it woke up, to the time it
1405ran.
1406
1407Non Real-Time tasks are not that interesting. A more interesting
1408trace is to concentrate only on Real-Time tasks.
1409
1410wakeup_rt
1411---------
1412
904In a Real-Time environment it is very important to know the 1413In a Real-Time environment it is very important to know the
905wakeup time it takes for the highest priority task that is woken 1414wakeup time it takes for the highest priority task that is woken
906up to the time that it executes. This is also known as "schedule 1415up to the time that it executes. This is also known as "schedule
@@ -914,124 +1423,229 @@ Real-Time environments are interested in the worst case latency.
914That is the longest latency it takes for something to happen, 1423That is the longest latency it takes for something to happen,
915and not the average. We can have a very fast scheduler that may 1424and not the average. We can have a very fast scheduler that may
916only have a large latency once in a while, but that would not 1425only have a large latency once in a while, but that would not
917work well with Real-Time tasks. The wakeup tracer was designed 1426work well with Real-Time tasks. The wakeup_rt tracer was designed
918to record the worst case wakeups of RT tasks. Non-RT tasks are 1427to record the worst case wakeups of RT tasks. Non-RT tasks are
919not recorded because the tracer only records one worst case and 1428not recorded because the tracer only records one worst case and
920tracing non-RT tasks that are unpredictable will overwrite the 1429tracing non-RT tasks that are unpredictable will overwrite the
921worst case latency of RT tasks. 1430worst case latency of RT tasks (just run the normal wakeup
1431tracer for a while to see that effect).
922 1432
923Since this tracer only deals with RT tasks, we will run this 1433Since this tracer only deals with RT tasks, we will run this
924slightly differently than we did with the previous tracers. 1434slightly differently than we did with the previous tracers.
925Instead of performing an 'ls', we will run 'sleep 1' under 1435Instead of performing an 'ls', we will run 'sleep 1' under
926'chrt' which changes the priority of the task. 1436'chrt' which changes the priority of the task.
927 1437
928 # echo wakeup > current_tracer 1438 # echo 0 > options/function-trace
929 # echo latency-format > trace_options 1439 # echo wakeup_rt > current_tracer
930 # echo 0 > tracing_max_latency
931 # echo 1 > tracing_on 1440 # echo 1 > tracing_on
1441 # echo 0 > tracing_max_latency
932 # chrt -f 5 sleep 1 1442 # chrt -f 5 sleep 1
933 # echo 0 > tracing_on 1443 # echo 0 > tracing_on
934 # cat trace 1444 # cat trace
935# tracer: wakeup 1445# tracer: wakeup
936# 1446#
937wakeup latency trace v1.1.5 on 2.6.26-rc8 1447# tracer: wakeup_rt
938-------------------------------------------------------------------- 1448#
939 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 1449# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
940 ----------------- 1450# --------------------------------------------------------------------
941 | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5) 1451# latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
942 ----------------- 1452# -----------------
943 1453# | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
944# _------=> CPU# 1454# -----------------
945# / _-----=> irqs-off 1455#
946# | / _----=> need-resched 1456# _------=> CPU#
947# || / _---=> hardirq/softirq 1457# / _-----=> irqs-off
948# ||| / _--=> preempt-depth 1458# | / _----=> need-resched
949# |||| / 1459# || / _---=> hardirq/softirq
950# ||||| delay 1460# ||| / _--=> preempt-depth
951# cmd pid ||||| time | caller 1461# |||| / delay
952# \ / ||||| \ | / 1462# cmd pid ||||| time | caller
953 <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process) 1463# \ / ||||| \ | /
954 <idle>-0 1d..4 4us : schedule (cpu_idle) 1464 <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep
955 1465 <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
956 1466 <idle>-0 3d..3 5us : __schedule <-schedule
957Running this on an idle system, we see that it only took 4 1467 <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
958microseconds to perform the task switch. Note, since the trace 1468
959marker in the schedule is before the actual "switch", we stop 1469
960the tracing when the recorded task is about to schedule in. This 1470Running this on an idle system, we see that it only took 5 microseconds
961may change if we add a new marker at the end of the scheduler. 1471to perform the task switch. Note, since the trace point in the schedule
962 1472is before the actual "switch", we stop the tracing when the recorded task
963Notice that the recorded task is 'sleep' with the PID of 4901 1473is about to schedule in. This may change if we add a new marker at the
1474end of the scheduler.
1475
1476Notice that the recorded task is 'sleep' with the PID of 2389
964and it has an rt_prio of 5. This priority is user-space priority 1477and it has an rt_prio of 5. This priority is user-space priority
965and not the internal kernel priority. The policy is 1 for 1478and not the internal kernel priority. The policy is 1 for
966SCHED_FIFO and 2 for SCHED_RR. 1479SCHED_FIFO and 2 for SCHED_RR.
967 1480
968Doing the same with chrt -r 5 and ftrace_enabled set. 1481Note, that the trace data shows the internal priority (99 - rtprio).
969 1482
970# tracer: wakeup 1483 <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
1484
1485The 0:120:R means idle was running with a nice priority of 0 (120 - 20)
1486and in the running state 'R'. The sleep task was scheduled in with
14872389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
1488and it too is in the running state.
1489
1490Doing the same with chrt -r 5 and function-trace set.
1491
1492 echo 1 > options/function-trace
1493
1494# tracer: wakeup_rt
971# 1495#
972wakeup latency trace v1.1.5 on 2.6.26-rc8 1496# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
973-------------------------------------------------------------------- 1497# --------------------------------------------------------------------
974 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 1498# latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
975 ----------------- 1499# -----------------
976 | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5) 1500# | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
977 ----------------- 1501# -----------------
978 1502#
979# _------=> CPU# 1503# _------=> CPU#
980# / _-----=> irqs-off 1504# / _-----=> irqs-off
981# | / _----=> need-resched 1505# | / _----=> need-resched
982# || / _---=> hardirq/softirq 1506# || / _---=> hardirq/softirq
983# ||| / _--=> preempt-depth 1507# ||| / _--=> preempt-depth
984# |||| / 1508# |||| / delay
985# ||||| delay 1509# cmd pid ||||| time | caller
986# cmd pid ||||| time | caller 1510# \ / ||||| \ | /
987# \ / ||||| \ | / 1511 <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep
988ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process) 1512 <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
989ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb) 1513 <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup
990ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up) 1514 <idle>-0 3d.h3 3us : resched_task <-check_preempt_curr
991ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup) 1515 <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup
992ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr) 1516 <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up
993ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup) 1517 <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock
994ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up) 1518 <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up
995ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up) 1519 <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
996[...] 1520 <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
997ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt) 1521 <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer
998ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit) 1522 <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock
999ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit) 1523 <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt
1000ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq) 1524 <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock
1001[...] 1525 <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt
1002ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks) 1526 <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event
1003ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq) 1527 <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event
1004ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable) 1528 <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event
1005ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd) 1529 <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt
1006ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd) 1530 <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit
1007ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched) 1531 <idle>-0 3dN.2 9us : idle_cpu <-irq_exit
1008ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched) 1532 <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit
1009ksoftirq-7 1.N.2 33us : schedule (__cond_resched) 1533 <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
1010ksoftirq-7 1.N.2 33us : add_preempt_count (schedule) 1534 <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit
1011ksoftirq-7 1.N.3 34us : hrtick_clear (schedule) 1535 <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle
1012ksoftirq-7 1dN.3 35us : _spin_lock (schedule) 1536 <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
1013ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock) 1537 <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle
1014ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule) 1538 <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
1015ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair) 1539 <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit
1016[...] 1540 <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
1017ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline) 1541 <idle>-0 3dN.1 13us : update_cpu_load_nohz <-tick_nohz_idle_exit
1018ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock) 1542 <idle>-0 3dN.1 13us : _raw_spin_lock <-update_cpu_load_nohz
1019ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) 1543 <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock
1020ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) 1544 <idle>-0 3dN.2 13us : __update_cpu_load <-update_cpu_load_nohz
1021ksoftirq-7 1d..4 50us : schedule (__cond_resched) 1545 <idle>-0 3dN.2 14us : sched_avg_update <-__update_cpu_load
1022 1546 <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_cpu_load_nohz
1023The interrupt went off while running ksoftirqd. This task runs 1547 <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock
1024at SCHED_OTHER. Why did not we see the 'N' set early? This may 1548 <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit
1025be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K 1549 <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
1026stacks configured, the interrupt and softirq run with their own 1550 <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit
1027stack. Some information is held on the top of the task's stack 1551 <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel
1028(need_resched and preempt_count are both stored there). The 1552 <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
1029setting of the NEED_RESCHED bit is done directly to the task's 1553 <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
1030stack, but the reading of the NEED_RESCHED is done by looking at 1554 <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave
1031the current stack, which in this case is the stack for the hard 1555 <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16
1032interrupt. This hides the fact that NEED_RESCHED has been set. 1556 <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer
1033We do not see the 'N' until we switch back to the task's 1557 <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram
1034assigned stack. 1558 <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event
1559 <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event
1560 <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event
1561 <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
1562 <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1563 <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit
1564 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
1565 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
1566 <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
1567 <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
1568 <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
1569 <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
1570 <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave
1571 <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns
1572 <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns
1573 <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns
1574 <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event
1575 <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event
1576 <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event
1577 <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
1578 <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1579 <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit
1580 <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks
1581 <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle
1582 <idle>-0 3.N.. 25us : schedule <-cpu_idle
1583 <idle>-0 3.N.. 25us : __schedule <-preempt_schedule
1584 <idle>-0 3.N.. 26us : add_preempt_count <-__schedule
1585 <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule
1586 <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch
1587 <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch
1588 <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule
1589 <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq
1590 <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule
1591 <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task
1592 <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task
1593 <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt
1594 <idle>-0 3d..3 29us : __schedule <-preempt_schedule
1595 <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep
1596
1597This isn't that big of a trace, even with function tracing enabled,
1598so I included the entire trace.
1599
1600The interrupt went off while when the system was idle. Somewhere
1601before task_woken_rt() was called, the NEED_RESCHED flag was set,
1602this is indicated by the first occurrence of the 'N' flag.
1603
1604Latency tracing and events
1605--------------------------
1606As function tracing can induce a much larger latency, but without
1607seeing what happens within the latency it is hard to know what
1608caused it. There is a middle ground, and that is with enabling
1609events.
1610
1611 # echo 0 > options/function-trace
1612 # echo wakeup_rt > current_tracer
1613 # echo 1 > events/enable
1614 # echo 1 > tracing_on
1615 # echo 0 > tracing_max_latency
1616 # chrt -f 5 sleep 1
1617 # echo 0 > tracing_on
1618 # cat trace
1619# tracer: wakeup_rt
1620#
1621# wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1622# --------------------------------------------------------------------
1623# latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1624# -----------------
1625# | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
1626# -----------------
1627#
1628# _------=> CPU#
1629# / _-----=> irqs-off
1630# | / _----=> need-resched
1631# || / _---=> hardirq/softirq
1632# ||| / _--=> preempt-depth
1633# |||| / delay
1634# cmd pid ||||| time | caller
1635# \ / ||||| \ | /
1636 <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep
1637 <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1638 <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
1639 <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
1640 <idle>-0 2.N.2 2us : power_end: cpu_id=2
1641 <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2
1642 <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
1643 <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
1644 <idle>-0 2.N.2 5us : rcu_utilization: Start context switch
1645 <idle>-0 2.N.2 5us : rcu_utilization: End context switch
1646 <idle>-0 2d..3 6us : __schedule <-schedule
1647 <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep
1648
1035 1649
1036function 1650function
1037-------- 1651--------
@@ -1039,6 +1653,7 @@ function
1039This tracer is the function tracer. Enabling the function tracer 1653This tracer is the function tracer. Enabling the function tracer
1040can be done from the debug file system. Make sure the 1654can be done from the debug file system. Make sure the
1041ftrace_enabled is set; otherwise this tracer is a nop. 1655ftrace_enabled is set; otherwise this tracer is a nop.
1656See the "ftrace_enabled" section below.
1042 1657
1043 # sysctl kernel.ftrace_enabled=1 1658 # sysctl kernel.ftrace_enabled=1
1044 # echo function > current_tracer 1659 # echo function > current_tracer
@@ -1048,23 +1663,23 @@ ftrace_enabled is set; otherwise this tracer is a nop.
1048 # cat trace 1663 # cat trace
1049# tracer: function 1664# tracer: function
1050# 1665#
1051# TASK-PID CPU# TIMESTAMP FUNCTION 1666# entries-in-buffer/entries-written: 24799/24799 #P:4
1052# | | | | | 1667#
1053 bash-4003 [00] 123.638713: finish_task_switch <-schedule 1668# _-----=> irqs-off
1054 bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch 1669# / _----=> need-resched
1055 bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq 1670# | / _---=> hardirq/softirq
1056 bash-4003 [00] 123.638715: hrtick_set <-schedule 1671# || / _--=> preempt-depth
1057 bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set 1672# ||| / delay
1058 bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave 1673# TASK-PID CPU# |||| TIMESTAMP FUNCTION
1059 bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set 1674# | | | |||| | |
1060 bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore 1675 bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write
1061 bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set 1676 bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock
1062 bash-4003 [00] 123.638718: sub_preempt_count <-schedule 1677 bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify
1063 bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule 1678 bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify
1064 bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run 1679 bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify
1065 bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion 1680 bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock
1066 bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common 1681 bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock
1067 bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq 1682 bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify
1068[...] 1683[...]
1069 1684
1070 1685
@@ -1214,79 +1829,19 @@ int main (int argc, char **argv)
1214 return 0; 1829 return 0;
1215} 1830}
1216 1831
1832Or this simple script!
1217 1833
1218hw-branch-tracer (x86 only) 1834------
1219--------------------------- 1835#!/bin/bash
1220 1836
1221This tracer uses the x86 last branch tracing hardware feature to 1837debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts`
1222collect a branch trace on all cpus with relatively low overhead. 1838echo nop > $debugfs/tracing/current_tracer
1223 1839echo 0 > $debugfs/tracing/tracing_on
1224The tracer uses a fixed-size circular buffer per cpu and only 1840echo $$ > $debugfs/tracing/set_ftrace_pid
1225traces ring 0 branches. The trace file dumps that buffer in the 1841echo function > $debugfs/tracing/current_tracer
1226following format: 1842echo 1 > $debugfs/tracing/tracing_on
1227 1843exec "$@"
1228# tracer: hw-branch-tracer 1844------
1229#
1230# CPU# TO <- FROM
1231 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6
1232 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a
1233 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf
1234 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf
1235 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a
1236 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf
1237
1238
1239The tracer may be used to dump the trace for the oops'ing cpu on
1240a kernel oops into the system log. To enable this,
1241ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one
1242can either use the sysctl function or set it via the proc system
1243interface.
1244
1245 sysctl kernel.ftrace_dump_on_oops=n
1246
1247or
1248
1249 echo n > /proc/sys/kernel/ftrace_dump_on_oops
1250
1251If n = 1, ftrace will dump buffers of all CPUs, if n = 2 ftrace will
1252only dump the buffer of the CPU that triggered the oops.
1253
1254Here's an example of such a dump after a null pointer
1255dereference in a kernel module:
1256
1257[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
1258[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops]
1259[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0
1260[57848.106019] Oops: 0002 [#1] SMP
1261[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus
1262[57848.106019] Dumping ftrace buffer:
1263[57848.106019] ---------------------------------
1264[...]
1265[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24
1266[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165
1267[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165
1268[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165
1269[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165
1270[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops]
1271[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30
1272[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b
1273[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31
1274[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1
1275[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30
1276[...]
1277[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2
1278[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881
1279[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881
1280[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96
1281[...]
1282[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3
1283[57848.106019] ---------------------------------
1284[57848.106019] CPU 0
1285[57848.106019] Modules linked in: oops
1286[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23
1287[57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops]
1288[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246
1289[...]
1290 1845
1291 1846
1292function graph tracer 1847function graph tracer
@@ -1473,16 +2028,18 @@ starts of pointing to a simple return. (Enabling FTRACE will
1473include the -pg switch in the compiling of the kernel.) 2028include the -pg switch in the compiling of the kernel.)
1474 2029
1475At compile time every C file object is run through the 2030At compile time every C file object is run through the
1476recordmcount.pl script (located in the scripts directory). This 2031recordmcount program (located in the scripts directory). This
1477script will process the C object using objdump to find all the 2032program will parse the ELF headers in the C object to find all
1478locations in the .text section that call mcount. (Note, only the 2033the locations in the .text section that call mcount. (Note, only
1479.text section is processed, since processing other sections like 2034white listed .text sections are processed, since processing other
1480.init.text may cause races due to those sections being freed). 2035sections like .init.text may cause races due to those sections
2036being freed unexpectedly).
1481 2037
1482A new section called "__mcount_loc" is created that holds 2038A new section called "__mcount_loc" is created that holds
1483references to all the mcount call sites in the .text section. 2039references to all the mcount call sites in the .text section.
1484This section is compiled back into the original object. The 2040The recordmcount program re-links this section back into the
1485final linker will add all these references into a single table. 2041original object. The final linking stage of the kernel will add all these
2042references into a single table.
1486 2043
1487On boot up, before SMP is initialized, the dynamic ftrace code 2044On boot up, before SMP is initialized, the dynamic ftrace code
1488scans this table and updates all the locations into nops. It 2045scans this table and updates all the locations into nops. It
@@ -1493,13 +2050,25 @@ unloaded, it also removes its functions from the ftrace function
1493list. This is automatic in the module unload code, and the 2050list. This is automatic in the module unload code, and the
1494module author does not need to worry about it. 2051module author does not need to worry about it.
1495 2052
1496When tracing is enabled, kstop_machine is called to prevent 2053When tracing is enabled, the process of modifying the function
1497races with the CPUS executing code being modified (which can 2054tracepoints is dependent on architecture. The old method is to use
1498cause the CPU to do undesirable things), and the nops are 2055kstop_machine to prevent races with the CPUs executing code being
2056modified (which can cause the CPU to do undesirable things, especially
2057if the modified code crosses cache (or page) boundaries), and the nops are
1499patched back to calls. But this time, they do not call mcount 2058patched back to calls. But this time, they do not call mcount
1500(which is just a function stub). They now call into the ftrace 2059(which is just a function stub). They now call into the ftrace
1501infrastructure. 2060infrastructure.
1502 2061
2062The new method of modifying the function tracepoints is to place
2063a breakpoint at the location to be modified, sync all CPUs, modify
2064the rest of the instruction not covered by the breakpoint. Sync
2065all CPUs again, and then remove the breakpoint with the finished
2066version to the ftrace call site.
2067
2068Some archs do not even need to monkey around with the synchronization,
2069and can just slap the new code on top of the old without any
2070problems with other CPUs executing it at the same time.
2071
1503One special side-effect to the recording of the functions being 2072One special side-effect to the recording of the functions being
1504traced is that we can now selectively choose which functions we 2073traced is that we can now selectively choose which functions we
1505wish to trace and which ones we want the mcount calls to remain 2074wish to trace and which ones we want the mcount calls to remain
@@ -1530,20 +2099,28 @@ mutex_lock
1530 2099
1531If I am only interested in sys_nanosleep and hrtimer_interrupt: 2100If I am only interested in sys_nanosleep and hrtimer_interrupt:
1532 2101
1533 # echo sys_nanosleep hrtimer_interrupt \ 2102 # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
1534 > set_ftrace_filter
1535 # echo function > current_tracer 2103 # echo function > current_tracer
1536 # echo 1 > tracing_on 2104 # echo 1 > tracing_on
1537 # usleep 1 2105 # usleep 1
1538 # echo 0 > tracing_on 2106 # echo 0 > tracing_on
1539 # cat trace 2107 # cat trace
1540# tracer: ftrace 2108# tracer: function
2109#
2110# entries-in-buffer/entries-written: 5/5 #P:4
1541# 2111#
1542# TASK-PID CPU# TIMESTAMP FUNCTION 2112# _-----=> irqs-off
1543# | | | | | 2113# / _----=> need-resched
1544 usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt 2114# | / _---=> hardirq/softirq
1545 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call 2115# || / _--=> preempt-depth
1546 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt 2116# ||| / delay
2117# TASK-PID CPU# |||| TIMESTAMP FUNCTION
2118# | | | |||| | |
2119 usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath
2120 <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2121 usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2122 <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2123 <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
1547 2124
1548To see which functions are being traced, you can cat the file: 2125To see which functions are being traced, you can cat the file:
1549 2126
@@ -1571,20 +2148,25 @@ Note: It is better to use quotes to enclose the wild cards,
1571 2148
1572Produces: 2149Produces:
1573 2150
1574# tracer: ftrace 2151# tracer: function
1575# 2152#
1576# TASK-PID CPU# TIMESTAMP FUNCTION 2153# entries-in-buffer/entries-written: 897/897 #P:4
1577# | | | | | 2154#
1578 bash-4003 [00] 1480.611794: hrtimer_init <-copy_process 2155# _-----=> irqs-off
1579 bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set 2156# / _----=> need-resched
1580 bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear 2157# | / _---=> hardirq/softirq
1581 bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel 2158# || / _--=> preempt-depth
1582 <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt 2159# ||| / delay
1583 <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt 2160# TASK-PID CPU# |||| TIMESTAMP FUNCTION
1584 <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt 2161# | | | |||| | |
1585 <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt 2162 <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
1586 <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt 2163 <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
1587 2164 <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
2165 <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit
2166 <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2167 <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
2168 <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter
2169 <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem
1588 2170
1589Notice that we lost the sys_nanosleep. 2171Notice that we lost the sys_nanosleep.
1590 2172
@@ -1651,19 +2233,29 @@ traced.
1651 2233
1652Produces: 2234Produces:
1653 2235
1654# tracer: ftrace 2236# tracer: function
2237#
2238# entries-in-buffer/entries-written: 39608/39608 #P:4
1655# 2239#
1656# TASK-PID CPU# TIMESTAMP FUNCTION 2240# _-----=> irqs-off
1657# | | | | | 2241# / _----=> need-resched
1658 bash-4043 [01] 115.281644: finish_task_switch <-schedule 2242# | / _---=> hardirq/softirq
1659 bash-4043 [01] 115.281645: hrtick_set <-schedule 2243# || / _--=> preempt-depth
1660 bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set 2244# ||| / delay
1661 bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run 2245# TASK-PID CPU# |||| TIMESTAMP FUNCTION
1662 bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion 2246# | | | |||| | |
1663 bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run 2247 bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open
1664 bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop 2248 bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last
1665 bash-4043 [01] 115.281648: wake_up_process <-kthread_stop 2249 bash-1994 [000] .... 4342.324897: ima_file_check <-do_last
1666 bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process 2250 bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check
2251 bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement
2252 bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action
2253 bash-1994 [000] .... 4342.324899: do_truncate <-do_last
2254 bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate
2255 bash-1994 [000] .... 4342.324899: notify_change <-do_truncate
2256 bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change
2257 bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time
2258 bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time
1667 2259
1668We can see that there's no more lock or preempt tracing. 2260We can see that there's no more lock or preempt tracing.
1669 2261
@@ -1729,6 +2321,28 @@ this special filter via:
1729 echo > set_graph_function 2321 echo > set_graph_function
1730 2322
1731 2323
2324ftrace_enabled
2325--------------
2326
2327Note, the proc sysctl ftrace_enable is a big on/off switch for the
2328function tracer. By default it is enabled (when function tracing is
2329enabled in the kernel). If it is disabled, all function tracing is
2330disabled. This includes not only the function tracers for ftrace, but
2331also for any other uses (perf, kprobes, stack tracing, profiling, etc).
2332
2333Please disable this with care.
2334
2335This can be disable (and enabled) with:
2336
2337 sysctl kernel.ftrace_enabled=0
2338 sysctl kernel.ftrace_enabled=1
2339
2340 or
2341
2342 echo 0 > /proc/sys/kernel/ftrace_enabled
2343 echo 1 > /proc/sys/kernel/ftrace_enabled
2344
2345
1732Filter commands 2346Filter commands
1733--------------- 2347---------------
1734 2348
@@ -1763,12 +2377,58 @@ The following commands are supported:
1763 2377
1764 echo '__schedule_bug:traceoff:5' > set_ftrace_filter 2378 echo '__schedule_bug:traceoff:5' > set_ftrace_filter
1765 2379
2380 To always disable tracing when __schedule_bug is hit:
2381
2382 echo '__schedule_bug:traceoff' > set_ftrace_filter
2383
1766 These commands are cumulative whether or not they are appended 2384 These commands are cumulative whether or not they are appended
1767 to set_ftrace_filter. To remove a command, prepend it by '!' 2385 to set_ftrace_filter. To remove a command, prepend it by '!'
1768 and drop the parameter: 2386 and drop the parameter:
1769 2387
2388 echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
2389
2390 The above removes the traceoff command for __schedule_bug
2391 that have a counter. To remove commands without counters:
2392
1770 echo '!__schedule_bug:traceoff' > set_ftrace_filter 2393 echo '!__schedule_bug:traceoff' > set_ftrace_filter
1771 2394
2395- snapshot
2396 Will cause a snapshot to be triggered when the function is hit.
2397
2398 echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
2399
2400 To only snapshot once:
2401
2402 echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
2403
2404 To remove the above commands:
2405
2406 echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
2407 echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
2408
2409- enable_event/disable_event
2410 These commands can enable or disable a trace event. Note, because
2411 function tracing callbacks are very sensitive, when these commands
2412 are registered, the trace point is activated, but disabled in
2413 a "soft" mode. That is, the tracepoint will be called, but
2414 just will not be traced. The event tracepoint stays in this mode
2415 as long as there's a command that triggers it.
2416
2417 echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
2418 set_ftrace_filter
2419
2420 The format is:
2421
2422 <function>:enable_event:<system>:<event>[:count]
2423 <function>:disable_event:<system>:<event>[:count]
2424
2425 To remove the events commands:
2426
2427
2428 echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
2429 set_ftrace_filter
2430 echo '!schedule:disable_event:sched:sched_switch' > \
2431 set_ftrace_filter
1772 2432
1773trace_pipe 2433trace_pipe
1774---------- 2434----------
@@ -1787,28 +2447,31 @@ different. The trace is live.
1787 # cat trace 2447 # cat trace
1788# tracer: function 2448# tracer: function
1789# 2449#
1790# TASK-PID CPU# TIMESTAMP FUNCTION 2450# entries-in-buffer/entries-written: 0/0 #P:4
1791# | | | | | 2451#
2452# _-----=> irqs-off
2453# / _----=> need-resched
2454# | / _---=> hardirq/softirq
2455# || / _--=> preempt-depth
2456# ||| / delay
2457# TASK-PID CPU# |||| TIMESTAMP FUNCTION
2458# | | | |||| | |
1792 2459
1793 # 2460 #
1794 # cat /tmp/trace.out 2461 # cat /tmp/trace.out
1795 bash-4043 [00] 41.267106: finish_task_switch <-schedule 2462 bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write
1796 bash-4043 [00] 41.267106: hrtick_set <-schedule 2463 bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock
1797 bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set 2464 bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify
1798 bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run 2465 bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify
1799 bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion 2466 bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify
1800 bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run 2467 bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock
1801 bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop 2468 bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock
1802 bash-4043 [00] 41.267110: wake_up_process <-kthread_stop 2469 bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify
1803 bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process 2470 bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath
1804 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1805 2471
1806 2472
1807Note, reading the trace_pipe file will block until more input is 2473Note, reading the trace_pipe file will block until more input is
1808added. By changing the tracer, trace_pipe will issue an EOF. We 2474added.
1809needed to set the function tracer _before_ we "cat" the
1810trace_pipe file.
1811
1812 2475
1813trace entries 2476trace entries
1814------------- 2477-------------
@@ -1817,31 +2480,50 @@ Having too much or not enough data can be troublesome in
1817diagnosing an issue in the kernel. The file buffer_size_kb is 2480diagnosing an issue in the kernel. The file buffer_size_kb is
1818used to modify the size of the internal trace buffers. The 2481used to modify the size of the internal trace buffers. The
1819number listed is the number of entries that can be recorded per 2482number listed is the number of entries that can be recorded per
1820CPU. To know the full size, multiply the number of possible CPUS 2483CPU. To know the full size, multiply the number of possible CPUs
1821with the number of entries. 2484with the number of entries.
1822 2485
1823 # cat buffer_size_kb 2486 # cat buffer_size_kb
18241408 (units kilobytes) 24871408 (units kilobytes)
1825 2488
1826Note, to modify this, you must have tracing completely disabled. 2489Or simply read buffer_total_size_kb
1827To do that, echo "nop" into the current_tracer. If the 2490
1828current_tracer is not set to "nop", an EINVAL error will be 2491 # cat buffer_total_size_kb
1829returned. 24925632
2493
2494To modify the buffer, simple echo in a number (in 1024 byte segments).
1830 2495
1831 # echo nop > current_tracer
1832 # echo 10000 > buffer_size_kb 2496 # echo 10000 > buffer_size_kb
1833 # cat buffer_size_kb 2497 # cat buffer_size_kb
183410000 (units kilobytes) 249810000 (units kilobytes)
1835 2499
1836The number of pages which will be allocated is limited to a 2500It will try to allocate as much as possible. If you allocate too
1837percentage of available memory. Allocating too much will produce 2501much, it can cause Out-Of-Memory to trigger.
1838an error.
1839 2502
1840 # echo 1000000000000 > buffer_size_kb 2503 # echo 1000000000000 > buffer_size_kb
1841-bash: echo: write error: Cannot allocate memory 2504-bash: echo: write error: Cannot allocate memory
1842 # cat buffer_size_kb 2505 # cat buffer_size_kb
184385 250685
1844 2507
2508The per_cpu buffers can be changed individually as well:
2509
2510 # echo 10000 > per_cpu/cpu0/buffer_size_kb
2511 # echo 100 > per_cpu/cpu1/buffer_size_kb
2512
2513When the per_cpu buffers are not the same, the buffer_size_kb
2514at the top level will just show an X
2515
2516 # cat buffer_size_kb
2517X
2518
2519This is where the buffer_total_size_kb is useful:
2520
2521 # cat buffer_total_size_kb
252212916
2523
2524Writing to the top level buffer_size_kb will reset all the buffers
2525to be the same again.
2526
1845Snapshot 2527Snapshot
1846-------- 2528--------
1847CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature 2529CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
@@ -1925,7 +2607,188 @@ bash: echo: write error: Device or resource busy
1925 # cat snapshot 2607 # cat snapshot
1926cat: snapshot: Device or resource busy 2608cat: snapshot: Device or resource busy
1927 2609
2610
2611Instances
2612---------
2613In the debugfs tracing directory is a directory called "instances".
2614This directory can have new directories created inside of it using
2615mkdir, and removing directories with rmdir. The directory created
2616with mkdir in this directory will already contain files and other
2617directories after it is created.
2618
2619 # mkdir instances/foo
2620 # ls instances/foo
2621buffer_size_kb buffer_total_size_kb events free_buffer per_cpu
2622set_event snapshot trace trace_clock trace_marker trace_options
2623trace_pipe tracing_on
2624
2625As you can see, the new directory looks similar to the tracing directory
2626itself. In fact, it is very similar, except that the buffer and
2627events are agnostic from the main director, or from any other
2628instances that are created.
2629
2630The files in the new directory work just like the files with the
2631same name in the tracing directory except the buffer that is used
2632is a separate and new buffer. The files affect that buffer but do not
2633affect the main buffer with the exception of trace_options. Currently,
2634the trace_options affect all instances and the top level buffer
2635the same, but this may change in future releases. That is, options
2636may become specific to the instance they reside in.
2637
2638Notice that none of the function tracer files are there, nor is
2639current_tracer and available_tracers. This is because the buffers
2640can currently only have events enabled for them.
2641
2642 # mkdir instances/foo
2643 # mkdir instances/bar
2644 # mkdir instances/zoot
2645 # echo 100000 > buffer_size_kb
2646 # echo 1000 > instances/foo/buffer_size_kb
2647 # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
2648 # echo function > current_trace
2649 # echo 1 > instances/foo/events/sched/sched_wakeup/enable
2650 # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
2651 # echo 1 > instances/foo/events/sched/sched_switch/enable
2652 # echo 1 > instances/bar/events/irq/enable
2653 # echo 1 > instances/zoot/events/syscalls/enable
2654 # cat trace_pipe
2655CPU:2 [LOST 11745 EVENTS]
2656 bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
2657 bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
2658 bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
2659 bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
2660 bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
2661 bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
2662 bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
2663 bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
2664 bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
2665 bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
2666 bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process
2667[...]
2668
2669 # cat instances/foo/trace_pipe
2670 bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
2671 bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
2672 <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
2673 <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
2674 rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
2675 bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
2676 bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
2677 bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
2678 kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
2679 kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
2680[...]
2681
2682 # cat instances/bar/trace_pipe
2683 migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX]
2684 <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX]
2685 bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER]
2686 bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU]
2687 bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER]
2688 bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER]
2689 bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU]
2690 bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU]
2691 sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
2692 sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled
2693 sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0
2694 sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled
2695[...]
2696
2697 # cat instances/zoot/trace
2698# tracer: nop
2699#
2700# entries-in-buffer/entries-written: 18996/18996 #P:4
2701#
2702# _-----=> irqs-off
2703# / _----=> need-resched
2704# | / _---=> hardirq/softirq
2705# || / _--=> preempt-depth
2706# ||| / delay
2707# TASK-PID CPU# |||| TIMESTAMP FUNCTION
2708# | | | |||| | |
2709 bash-1998 [000] d... 140.733501: sys_write -> 0x2
2710 bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1)
2711 bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1
2712 bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
2713 bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1
2714 bash-1998 [000] d... 140.733510: sys_close(fd: a)
2715 bash-1998 [000] d... 140.733510: sys_close -> 0x0
2716 bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
2717 bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0
2718 bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
2719 bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0
2720
2721You can see that the trace of the top most trace buffer shows only
2722the function tracing. The foo instance displays wakeups and task
2723switches.
2724
2725To remove the instances, simply delete their directories:
2726
2727 # rmdir instances/foo
2728 # rmdir instances/bar
2729 # rmdir instances/zoot
2730
2731Note, if a process has a trace file open in one of the instance
2732directories, the rmdir will fail with EBUSY.
2733
2734
2735Stack trace
1928----------- 2736-----------
2737Since the kernel has a fixed sized stack, it is important not to
2738waste it in functions. A kernel developer must be conscience of
2739what they allocate on the stack. If they add too much, the system
2740can be in danger of a stack overflow, and corruption will occur,
2741usually leading to a system panic.
2742
2743There are some tools that check this, usually with interrupts
2744periodically checking usage. But if you can perform a check
2745at every function call that will become very useful. As ftrace provides
2746a function tracer, it makes it convenient to check the stack size
2747at every function call. This is enabled via the stack tracer.
2748
2749CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
2750To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
2751
2752 # echo 1 > /proc/sys/kernel/stack_tracer_enabled
2753
2754You can also enable it from the kernel command line to trace
2755the stack size of the kernel during boot up, by adding "stacktrace"
2756to the kernel command line parameter.
2757
2758After running it for a few minutes, the output looks like:
2759
2760 # cat stack_max_size
27612928
2762
2763 # cat stack_trace
2764 Depth Size Location (18 entries)
2765 ----- ---- --------
2766 0) 2928 224 update_sd_lb_stats+0xbc/0x4ac
2767 1) 2704 160 find_busiest_group+0x31/0x1f1
2768 2) 2544 256 load_balance+0xd9/0x662
2769 3) 2288 80 idle_balance+0xbb/0x130
2770 4) 2208 128 __schedule+0x26e/0x5b9
2771 5) 2080 16 schedule+0x64/0x66
2772 6) 2064 128 schedule_timeout+0x34/0xe0
2773 7) 1936 112 wait_for_common+0x97/0xf1
2774 8) 1824 16 wait_for_completion+0x1d/0x1f
2775 9) 1808 128 flush_work+0xfe/0x119
2776 10) 1680 16 tty_flush_to_ldisc+0x1e/0x20
2777 11) 1664 48 input_available_p+0x1d/0x5c
2778 12) 1616 48 n_tty_poll+0x6d/0x134
2779 13) 1568 64 tty_poll+0x64/0x7f
2780 14) 1504 880 do_select+0x31e/0x511
2781 15) 624 400 core_sys_select+0x177/0x216
2782 16) 224 96 sys_select+0x91/0xb9
2783 17) 128 128 system_call_fastpath+0x16/0x1b
2784
2785Note, if -mfentry is being used by gcc, functions get traced before
2786they set up the stack frame. This means that leaf level functions
2787are not tested by the stack tracer when -mfentry is used.
2788
2789Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
2790
2791---------
1929 2792
1930More details can be found in the source code, in the 2793More details can be found in the source code, in the
1931kernel/trace/*.c files. 2794kernel/trace/*.c files.