aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/ftrace.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/ftrace.txt')
-rw-r--r--Documentation/ftrace.txt305
1 files changed, 152 insertions, 153 deletions
diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index 77d3faa1a611..f218f616ff6b 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -4,9 +4,10 @@
4Copyright 2008 Red Hat Inc. 4Copyright 2008 Red Hat Inc.
5 Author: Steven Rostedt <srostedt@redhat.com> 5 Author: Steven Rostedt <srostedt@redhat.com>
6 License: The GNU Free Documentation License, Version 1.2 6 License: The GNU Free Documentation License, Version 1.2
7Reviewers: Elias Oltmanns and Randy Dunlap 7Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
8 John Kacur, and David Teigland.
8 9
9Writen for: 2.6.26-rc8 linux-2.6-tip.git tip/tracing/ftrace branch 10Written for: 2.6.27-rc1
10 11
11Introduction 12Introduction
12------------ 13------------
@@ -18,10 +19,11 @@ issues that take place outside of user-space.
18 19
19Although ftrace is the function tracer, it also includes an 20Although ftrace is the function tracer, it also includes an
20infrastructure that allows for other types of tracing. Some of the 21infrastructure that allows for other types of tracing. Some of the
21tracers that are currently in ftrace is a tracer to trace 22tracers that are currently in ftrace include a tracer to trace
22context switches, the time it takes for a high priority task to 23context switches, the time it takes for a high priority task to
23run after it was woken up, the time interrupts are disabled, and 24run after it was woken up, the time interrupts are disabled, and
24more. 25more (ftrace allows for tracer plugins, which means that the list of
26tracers can always grow).
25 27
26 28
27The File System 29The File System
@@ -35,6 +37,8 @@ To mount the debugfs system:
35 # mkdir /debug 37 # mkdir /debug
36 # mount -t debugfs nodev /debug 38 # mount -t debugfs nodev /debug
37 39
40(Note: it is more common to mount at /sys/kernel/debug, but for simplicity
41 this document will use /debug)
38 42
39That's it! (assuming that you have ftrace configured into your kernel) 43That's it! (assuming that you have ftrace configured into your kernel)
40 44
@@ -50,20 +54,19 @@ of ftrace. Here is a list of some of the key files:
50 54
51 available_tracers : This holds the different types of tracers that 55 available_tracers : This holds the different types of tracers that
52 have been compiled into the kernel. The tracers 56 have been compiled into the kernel. The tracers
53 listed here can be configured by echoing in their 57 listed here can be configured by echoing their name
54 name into current_tracer. 58 into current_tracer.
55 59
56 tracing_enabled : This sets or displays whether the current_tracer 60 tracing_enabled : This sets or displays whether the current_tracer
57 is activated and tracing or not. Echo 0 into this 61 is activated and tracing or not. Echo 0 into this
58 file to disable the tracer or 1 (or non-zero) to 62 file to disable the tracer or 1 to enable it.
59 enable it.
60 63
61 trace : This file holds the output of the trace in a human readable 64 trace : This file holds the output of the trace in a human readable
62 format. 65 format (described below).
63 66
64 latency_trace : This file shows the same trace but the information 67 latency_trace : This file shows the same trace but the information
65 is organized more to display possible latencies 68 is organized more to display possible latencies
66 in the system. 69 in the system (described below).
67 70
68 trace_pipe : The output is the same as the "trace" file but this 71 trace_pipe : The output is the same as the "trace" file but this
69 file is meant to be streamed with live tracing. 72 file is meant to be streamed with live tracing.
@@ -75,7 +78,7 @@ of ftrace. Here is a list of some of the key files:
75 file, it is consumed, and will not be read 78 file, it is consumed, and will not be read
76 again with a sequential read. The "trace" and 79 again with a sequential read. The "trace" and
77 "latency_trace" files are static, and if the 80 "latency_trace" files are static, and if the
78 tracer isn't adding more data, they will display 81 tracer is not adding more data, they will display
79 the same information every time they are read. 82 the same information every time they are read.
80 83
81 iter_ctrl : This file lets the user control the amount of data 84 iter_ctrl : This file lets the user control the amount of data
@@ -92,10 +95,10 @@ of ftrace. Here is a list of some of the key files:
92 95
93 trace_entries : This sets or displays the number of trace 96 trace_entries : This sets or displays the number of trace
94 entries each CPU buffer can hold. The tracer buffers 97 entries each CPU buffer can hold. The tracer buffers
95 are the same size for each CPU, so care must be 98 are the same size for each CPU. The displayed number
96 taken when modifying the trace_entries. The trace 99 is the size of the CPU buffer and not total size. The
97 buffers are allocated in pages (blocks of memory that 100 trace buffers are allocated in pages (blocks of memory
98 the kernel uses for allocation, usually 4 KB in size). 101 that the kernel uses for allocation, usually 4 KB in size).
99 Since each entry is smaller than a page, if the last 102 Since each entry is smaller than a page, if the last
100 allocated page has room for more entries than were 103 allocated page has room for more entries than were
101 requested, the rest of the page is used to allocate 104 requested, the rest of the page is used to allocate
@@ -112,20 +115,19 @@ of ftrace. Here is a list of some of the key files:
112 on specified CPUS. The format is a hex string 115 on specified CPUS. The format is a hex string
113 representing the CPUS. 116 representing the CPUS.
114 117
115 set_ftrace_filter : When dynamic ftrace is configured in, the 118 set_ftrace_filter : When dynamic ftrace is configured in (see the
116 code is dynamically modified to disable calling 119 section below "dynamic ftrace"), the code is dynamically
117 of the function profiler (mcount). This lets 120 modified (code text rewrite) to disable calling of the
118 tracing be configured in with practically no overhead 121 function profiler (mcount). This lets tracing be configured
119 in performance. This also has a side effect of 122 in with practically no overhead in performance. This also
120 enabling or disabling specific functions to be 123 has a side effect of enabling or disabling specific functions
121 traced. Echoing in names of functions into this 124 to be traced. Echoing names of functions into this file
122 file will limit the trace to only these functions. 125 will limit the trace to only those functions.
123 126
124 set_ftrace_notrace: This has the opposite effect that 127 set_ftrace_notrace: This has an effect opposite to that of
125 set_ftrace_filter has. Any function that is added 128 set_ftrace_filter. Any function that is added here will not
126 here will not be traced. If a function exists 129 be traced. If a function exists in both set_ftrace_filter
127 in both set_ftrace_filter and set_ftrace_notrace, 130 and set_ftrace_notrace, the function will _not_ be traced.
128 the function will _not_ be traced.
129 131
130 available_filter_functions : When a function is encountered the first 132 available_filter_functions : When a function is encountered the first
131 time by the dynamic tracer, it is recorded and 133 time by the dynamic tracer, it is recorded and
@@ -133,32 +135,31 @@ of ftrace. Here is a list of some of the key files:
133 lists the functions that have been recorded 135 lists the functions that have been recorded
134 by the dynamic tracer and these functions can 136 by the dynamic tracer and these functions can
135 be used to set the ftrace filter by the above 137 be used to set the ftrace filter by the above
136 "set_ftrace_filter" file. 138 "set_ftrace_filter" file. (See the section "dynamic ftrace"
139 below for more details).
137 140
138 141
139The Tracers 142The Tracers
140----------- 143-----------
141 144
142Here are the list of current tracers that can be configured. 145Here is the list of current tracers that may be configured.
143 146
144 ftrace - function tracer that uses mcount to trace all functions. 147 ftrace - function tracer that uses mcount to trace all functions.
145 It is possible to filter out which functions that are
146 to be traced when dynamic ftrace is configured in.
147 148
148 sched_switch - traces the context switches between tasks. 149 sched_switch - traces the context switches between tasks.
149 150
150 irqsoff - traces the areas that disable interrupts and saves off 151 irqsoff - traces the areas that disable interrupts and saves
151 the trace with the longest max latency. 152 the trace with the longest max latency.
152 See tracing_max_latency. When a new max is recorded, 153 See tracing_max_latency. When a new max is recorded,
153 it replaces the old trace. It is best to view this 154 it replaces the old trace. It is best to view this
154 trace with the latency_trace file. 155 trace via the latency_trace file.
155 156
156 preemptoff - Similar to irqsoff but traces and records the time 157 preemptoff - Similar to irqsoff but traces and records the amount of
157 preemption is disabled. 158 time for which preemption is disabled.
158 159
159 preemptirqsoff - Similar to irqsoff and preemptoff, but traces and 160 preemptirqsoff - Similar to irqsoff and preemptoff, but traces and
160 records the largest time irqs and/or preemption is 161 records the largest time for which irqs and/or preemption
161 disabled. 162 is disabled.
162 163
163 wakeup - Traces and records the max latency that it takes for 164 wakeup - Traces and records the max latency that it takes for
164 the highest priority task to get scheduled after 165 the highest priority task to get scheduled after
@@ -171,13 +172,13 @@ Here are the list of current tracers that can be configured.
171Examples of using the tracer 172Examples of using the tracer
172---------------------------- 173----------------------------
173 174
174Here are typical examples of using the tracers with only controlling 175Here are typical examples of using the tracers when controlling them only
175them with the debugfs interface (without using any user-land utilities). 176with the debugfs interface (without using any user-land utilities).
176 177
177Output format: 178Output format:
178-------------- 179--------------
179 180
180Here's an example of the output format of the file "trace" 181Here is an example of the output format of the file "trace"
181 182
182 -------- 183 --------
183# tracer: ftrace 184# tracer: ftrace
@@ -189,14 +190,15 @@ Here's an example of the output format of the file "trace"
189 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput 190 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
190 -------- 191 --------
191 192
192A header is printed with the trace that is represented. In this case 193A header is printed with the tracer name that is represented by the trace.
193the tracer is "ftrace". Then a header showing the format. Task name 194In this case the tracer is "ftrace". Then a header showing the format. Task
194"bash", the task PID "4251", the CPU that it was running on 195name "bash", the task PID "4251", the CPU that it was running on
195"01", the timestamp in <secs>.<usecs> format, the function name that was 196"01", the timestamp in <secs>.<usecs> format, the function name that was
196traced "path_put" and the parent function that called this function 197traced "path_put" and the parent function that called this function
197"path_walk". 198"path_walk". The timestamp is the time at which the function was
199entered.
198 200
199The sched_switch tracer also includes tracing of task wake ups and 201The sched_switch tracer also includes tracing of task wakeups and
200context switches. 202context switches.
201 203
202 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S 204 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S
@@ -206,7 +208,7 @@ context switches.
206 kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R 208 kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R
207 ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R 209 ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R
208 210
209Wake ups are represented by a "+" and the context switches show 211Wake ups are represented by a "+" and the context switches are shown as
210"==>". The format is: 212"==>". The format is:
211 213
212 Context switches: 214 Context switches:
@@ -221,7 +223,7 @@ Wake ups are represented by a "+" and the context switches show
221 223
222 <pid>:<prio>:<state> + <pid>:<prio>:<state> 224 <pid>:<prio>:<state> + <pid>:<prio>:<state>
223 225
224The prio is the internal kernel priority, which is inverse to the 226The prio is the internal kernel priority, which is the inverse of the
225priority that is usually displayed by user-space tools. Zero represents 227priority that is usually displayed by user-space tools. Zero represents
226the highest priority (99). Prio 100 starts the "nice" priorities with 228the highest priority (99). Prio 100 starts the "nice" priorities with
227100 being equal to nice -20 and 139 being nice 19. The prio "140" is 229100 being equal to nice -20 and 139 being nice 19. The prio "140" is
@@ -232,7 +234,7 @@ Latency trace format
232-------------------- 234--------------------
233 235
234For traces that display latency times, the latency_trace file gives 236For traces that display latency times, the latency_trace file gives
235a bit more information to see why a latency happened. Here's a typical 237somewhat more information to see why a latency happened. Here is a typical
236trace. 238trace.
237 239
238# tracer: irqsoff 240# tracer: irqsoff
@@ -260,21 +262,20 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8
260 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) 262 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq)
261 263
262 264
263vim:ft=help
264
265 265
266This shows that the current tracer is "irqsoff" tracing the time 266This shows that the current tracer is "irqsoff" tracing the time for which
267interrupts are disabled. It gives the trace version and the kernel 267interrupts were disabled. It gives the trace version and the version
268this was executed on (2.6.26-rc8). Then it displays the max latency 268of the kernel upon which this was executed on (2.6.26-rc8). Then it displays
269in microsecs (97 us). The number of trace entries displayed 269the max latency in microsecs (97 us). The number of trace entries displayed
270by the total number recorded (both are three: #3/3). The type of 270and the total number recorded (both are three: #3/3). The type of
271preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero 271preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero
272and reserved for later use. #P is the number of online CPUS (#P:2). 272and are reserved for later use. #P is the number of online CPUS (#P:2).
273 273
274The task is the process that was running when the latency happened. 274The task is the process that was running when the latency occurred.
275(swapper pid: 0). 275(swapper pid: 0).
276 276
277The start and stop that caused the latencies: 277The start and stop (the functions in which the interrupts were disabled and
278enabled respectively) that caused the latencies:
278 279
279 apic_timer_interrupt is where the interrupts were disabled. 280 apic_timer_interrupt is where the interrupts were disabled.
280 do_softirq is where they were enabled again. 281 do_softirq is where they were enabled again.
@@ -286,14 +287,14 @@ explains which is which.
286 287
287 pid: The PID of that process. 288 pid: The PID of that process.
288 289
289 CPU#: The CPU that the process was running on. 290 CPU#: The CPU which the process was running on.
290 291
291 irqs-off: 'd' interrupts are disabled. '.' otherwise. 292 irqs-off: 'd' interrupts are disabled. '.' otherwise.
292 293
293 need-resched: 'N' task need_resched is set, '.' otherwise. 294 need-resched: 'N' task need_resched is set, '.' otherwise.
294 295
295 hardirq/softirq: 296 hardirq/softirq:
296 'H' - hard irq happened inside a softirq. 297 'H' - hard irq occurred inside a softirq.
297 'h' - hard irq is running 298 'h' - hard irq is running
298 's' - soft irq is running 299 's' - soft irq is running
299 '.' - normal context. 300 '.' - normal context.
@@ -303,7 +304,7 @@ explains which is which.
303The above is mostly meaningful for kernel developers. 304The above is mostly meaningful for kernel developers.
304 305
305 time: This differs from the trace file output. The trace file output 306 time: This differs from the trace file output. The trace file output
306 included an absolute timestamp. The timestamp used by the 307 includes an absolute timestamp. The timestamp used by the
307 latency_trace file is relative to the start of the trace. 308 latency_trace file is relative to the start of the trace.
308 309
309 delay: This is just to help catch your eye a bit better. And 310 delay: This is just to help catch your eye a bit better. And
@@ -385,7 +386,7 @@ Here are the available options:
385sched_switch 386sched_switch
386------------ 387------------
387 388
388This tracer simply records schedule switches. Here's an example 389This tracer simply records schedule switches. Here is an example
389of how to use it. 390of how to use it.
390 391
391 # echo sched_switch > /debug/tracing/current_tracer 392 # echo sched_switch > /debug/tracing/current_tracer
@@ -421,8 +422,8 @@ the name of the trace and points to the options. The "FUNCTION"
421is a misnomer since here it represents the wake ups and context 422is a misnomer since here it represents the wake ups and context
422switches. 423switches.
423 424
424The sched_switch only lists the wake ups (represented with '+') 425The sched_switch file only lists the wake ups (represented with '+')
425and context switches ('==>') with the previous task or current 426and context switches ('==>') with the previous task or current task
426first followed by the next task or task waking up. The format for both 427first followed by the next task or task waking up. The format for both
427of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO 428of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO
428is the inverse of the actual priority with zero (0) being the highest 429is the inverse of the actual priority with zero (0) being the highest
@@ -437,7 +438,8 @@ The task states are:
437 438
438 R - running : wants to run, may not actually be running 439 R - running : wants to run, may not actually be running
439 S - sleep : process is waiting to be woken up (handles signals) 440 S - sleep : process is waiting to be woken up (handles signals)
440 D - deep sleep : process must be woken up (ignores signals) 441 D - disk sleep (uninterruptible sleep) : process must be woken up
442 (ignores signals)
441 T - stopped : process suspended 443 T - stopped : process suspended
442 t - traced : process is being traced (with something like gdb) 444 t - traced : process is being traced (with something like gdb)
443 Z - zombie : process waiting to be cleaned up 445 Z - zombie : process waiting to be cleaned up
@@ -447,8 +449,8 @@ The task states are:
447ftrace_enabled 449ftrace_enabled
448-------------- 450--------------
449 451
450The following tracers give different output depending on whether 452The following tracers (listed below) give different output depending
451or not the sysctl ftrace_enabled is set. To set ftrace_enabled, 453on whether or not the sysctl ftrace_enabled is set. To set ftrace_enabled,
452one can either use the sysctl function or set it via the proc 454one can either use the sysctl function or set it via the proc
453file system interface. 455file system interface.
454 456
@@ -475,13 +477,12 @@ interrupt from triggering or the mouse interrupt from letting the
475kernel know of a new mouse event. The result is a latency with the 477kernel know of a new mouse event. The result is a latency with the
476reaction time. 478reaction time.
477 479
478The irqsoff tracer tracks the time interrupts are disabled to the time 480The irqsoff tracer tracks the time for which interrupts are disabled.
479they are re-enabled. When a new maximum latency is hit, it saves off 481When a new maximum latency is hit, the tracer saves the trace leading up
480the trace so that it may be retrieved at a later time. Every time a 482to that latency point so that every time a new maximum is reached, the old
481new maximum in reached, the old saved trace is discarded and the new 483saved trace is discarded and the new trace is saved.
482trace is saved.
483 484
484To reset the maximum, echo 0 into tracing_max_latency. Here's an 485To reset the maximum, echo 0 into tracing_max_latency. Here is an
485example: 486example:
486 487
487 # echo irqsoff > /debug/tracing/current_tracer 488 # echo irqsoff > /debug/tracing/current_tracer
@@ -493,14 +494,14 @@ example:
493 # cat /debug/tracing/latency_trace 494 # cat /debug/tracing/latency_trace
494# tracer: irqsoff 495# tracer: irqsoff
495# 496#
496irqsoff latency trace v1.1.5 on 2.6.26-rc8 497irqsoff latency trace v1.1.5 on 2.6.26
497-------------------------------------------------------------------- 498--------------------------------------------------------------------
498 latency: 6 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 499 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
499 ----------------- 500 -----------------
500 | task: bash-4269 (uid:0 nice:0 policy:0 rt_prio:0) 501 | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
501 ----------------- 502 -----------------
502 => started at: copy_page_range 503 => started at: sys_setpgid
503 => ended at: copy_page_range 504 => ended at: sys_setpgid
504 505
505# _------=> CPU# 506# _------=> CPU#
506# / _-----=> irqs-off 507# / _-----=> irqs-off
@@ -511,21 +512,19 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8
511# ||||| delay 512# ||||| delay
512# cmd pid ||||| time | caller 513# cmd pid ||||| time | caller
513# \ / ||||| \ | / 514# \ / ||||| \ | /
514 bash-4269 1...1 0us+: _spin_lock (copy_page_range) 515 bash-3730 1d... 0us : _write_lock_irq (sys_setpgid)
515 bash-4269 1...1 7us : _spin_unlock (copy_page_range) 516 bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid)
516 bash-4269 1...2 7us : trace_preempt_on (copy_page_range) 517 bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid)
517
518 518
519vim:ft=help
520 519
521Here we see that that we had a latency of 6 microsecs (which is 520Here we see that that we had a latency of 12 microsecs (which is
522very good). The spin_lock in copy_page_range disabled interrupts. 521very good). The _write_lock_irq in sys_setpgid disabled interrupts.
523The difference between the 6 and the displayed timestamp 7us is 522The difference between the 12 and the displayed timestamp 14us occurred
524because the clock must have incremented between the time of recording 523because the clock was incremented between the time of recording the max
525the max latency and recording the function that had that latency. 524latency and the time of recording the function that had that latency.
526 525
527Note the above had ftrace_enabled not set. If we set the ftrace_enabled, 526Note the above example had ftrace_enabled not set. If we set the
528we get a much larger output: 527ftrace_enabled, we get a much larger output:
529 528
530# tracer: irqsoff 529# tracer: irqsoff
531# 530#
@@ -571,12 +570,10 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8
571 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) 570 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal)
572 571
573 572
574vim:ft=help
575
576 573
577Here we traced a 50 microsecond latency. But we also see all the 574Here we traced a 50 microsecond latency. But we also see all the
578functions that were called during that time. Note that by enabling 575functions that were called during that time. Note that by enabling
579function tracing, we endure an added overhead. This overhead may 576function tracing, we incur an added overhead. This overhead may
580extend the latency times. But nevertheless, this trace has provided 577extend the latency times. But nevertheless, this trace has provided
581some very helpful debugging information. 578some very helpful debugging information.
582 579
@@ -590,8 +587,9 @@ for preemption to be enabled again before it can preempt a lower
590priority task. 587priority task.
591 588
592The preemptoff tracer traces the places that disable preemption. 589The preemptoff tracer traces the places that disable preemption.
593Like the irqsoff, it records the maximum latency that preemption 590Like the irqsoff tracer, it records the maximum latency for which preemption
594was disabled. The control of preemptoff is much like the irqsoff. 591was disabled. The control of preemptoff tracer is much like the irqsoff
592tracer.
595 593
596 # echo preemptoff > /debug/tracing/current_tracer 594 # echo preemptoff > /debug/tracing/current_tracer
597 # echo 0 > /debug/tracing/tracing_max_latency 595 # echo 0 > /debug/tracing/tracing_max_latency
@@ -625,8 +623,6 @@ preemptoff latency trace v1.1.5 on 2.6.26-rc8
625 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) 623 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq)
626 624
627 625
628vim:ft=help
629
630This has some more changes. Preemption was disabled when an interrupt 626This has some more changes. Preemption was disabled when an interrupt
631came in (notice the 'h'), and was enabled while doing a softirq. 627came in (notice the 'h'), and was enabled while doing a softirq.
632(notice the 's'). But we also see that interrupts have been disabled 628(notice the 's'). But we also see that interrupts have been disabled
@@ -694,16 +690,16 @@ The above is an example of the preemptoff trace with ftrace_enabled
694set. Here we see that interrupts were disabled the entire time. 690set. Here we see that interrupts were disabled the entire time.
695The irq_enter code lets us know that we entered an interrupt 'h'. 691The irq_enter code lets us know that we entered an interrupt 'h'.
696Before that, the functions being traced still show that it is not 692Before that, the functions being traced still show that it is not
697in an interrupt, but we can see by the functions themselves that 693in an interrupt, but we can see from the functions themselves that
698this is not the case. 694this is not the case.
699 695
700Notice that the __do_softirq when called doesn't have a preempt_count. 696Notice that __do_softirq when called does not have a preempt_count.
701It may seem that we missed a preempt enabled. What really happened 697It may seem that we missed a preempt enabling. What really happened
702is that the preempt count is held on the threads stack and we 698is that the preempt count is held on the thread's stack and we
703switched to the softirq stack (4K stacks in effect). The code 699switched to the softirq stack (4K stacks in effect). The code
704does not copy the preempt count, but because interrupts are disabled, 700does not copy the preempt count, but because interrupts are disabled,
705we don't need to worry about it. Having a tracer like this is good 701we do not need to worry about it. Having a tracer like this is good
706to let people know what really happens inside the kernel. 702for letting people know what really happens inside the kernel.
707 703
708 704
709preemptirqsoff 705preemptirqsoff
@@ -713,7 +709,7 @@ Knowing the locations that have interrupts disabled or preemption
713disabled for the longest times is helpful. But sometimes we would 709disabled for the longest times is helpful. But sometimes we would
714like to know when either preemption and/or interrupts are disabled. 710like to know when either preemption and/or interrupts are disabled.
715 711
716The following code: 712Consider the following code:
717 713
718 local_irq_disable(); 714 local_irq_disable();
719 call_function_with_irqs_off(); 715 call_function_with_irqs_off();
@@ -769,12 +765,10 @@ preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
769 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) 765 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq)
770 766
771 767
772vim:ft=help
773
774 768
775The trace_hardirqs_off_thunk is called from assembly on x86 when 769The trace_hardirqs_off_thunk is called from assembly on x86 when
776interrupts are disabled in the assembly code. Without the function 770interrupts are disabled in the assembly code. Without the function
777tracing, we don't know if interrupts were enabled within the preemption 771tracing, we do not know if interrupts were enabled within the preemption
778points. We do see that it started with preemption enabled. 772points. We do see that it started with preemption enabled.
779 773
780Here is a trace with ftrace_enabled set: 774Here is a trace with ftrace_enabled set:
@@ -865,19 +859,19 @@ preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
865 859
866This is a very interesting trace. It started with the preemption of 860This is a very interesting trace. It started with the preemption of
867the ls task. We see that the task had the "need_resched" bit set 861the ls task. We see that the task had the "need_resched" bit set
868with the 'N' in the trace. Interrupts are disabled in the spin_lock 862via the 'N' in the trace. Interrupts were disabled before the spin_lock
869and the trace started. We see that a schedule took place to run 863at the beginning of the trace. We see that a schedule took place to run
870sshd. When the interrupts were enabled, we took an interrupt. 864sshd. When the interrupts were enabled, we took an interrupt.
871On return from the interrupt handler, the softirq ran. We took another 865On return from the interrupt handler, the softirq ran. We took another
872interrupt while running the softirq as we see with the capital 'H'. 866interrupt while running the softirq as we see from the capital 'H'.
873 867
874 868
875wakeup 869wakeup
876------ 870------
877 871
878In Real-Time environment it is very important to know the wakeup 872In a Real-Time environment it is very important to know the wakeup
879time it takes for the highest priority task that wakes up to the 873time it takes for the highest priority task that is woken up to the
880time it executes. This is also known as "schedule latency". 874time that it executes. This is also known as "schedule latency".
881I stress the point that this is about RT tasks. It is also important 875I stress the point that this is about RT tasks. It is also important
882to know the scheduling latency of non-RT tasks, but the average 876to know the scheduling latency of non-RT tasks, but the average
883schedule latency is better for non-RT tasks. Tools like 877schedule latency is better for non-RT tasks. Tools like
@@ -926,8 +920,6 @@ wakeup latency trace v1.1.5 on 2.6.26-rc8
926 <idle>-0 1d..4 4us : schedule (cpu_idle) 920 <idle>-0 1d..4 4us : schedule (cpu_idle)
927 921
928 922
929vim:ft=help
930
931 923
932Running this on an idle system, we see that it only took 4 microseconds 924Running this on an idle system, we see that it only took 4 microseconds
933to perform the task switch. Note, since the trace marker in the 925to perform the task switch. Note, since the trace marker in the
@@ -996,15 +988,15 @@ ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock)
996ksoftirq-7 1d..4 50us : schedule (__cond_resched) 988ksoftirq-7 1d..4 50us : schedule (__cond_resched)
997 989
998The interrupt went off while running ksoftirqd. This task runs at 990The interrupt went off while running ksoftirqd. This task runs at
999SCHED_OTHER. Why didn't we see the 'N' set early? This may be 991SCHED_OTHER. Why did not we see the 'N' set early? This may be
1000a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks 992a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks
1001configured, the interrupt and softirq runs with their own stack. 993configured, the interrupt and softirq run with their own stack.
1002Some information is held on the top of the task's stack (need_resched 994Some information is held on the top of the task's stack (need_resched
1003and preempt_count are both stored there). The setting of the NEED_RESCHED 995and preempt_count are both stored there). The setting of the NEED_RESCHED
1004bit is done directly to the task's stack, but the reading of the 996bit is done directly to the task's stack, but the reading of the
1005NEED_RESCHED is done by looking at the current stack, which in this case 997NEED_RESCHED is done by looking at the current stack, which in this case
1006is the stack for the hard interrupt. This hides the fact that NEED_RESCHED 998is the stack for the hard interrupt. This hides the fact that NEED_RESCHED
1007has been set. We don't see the 'N' until we switch back to the task's 999has been set. We do not see the 'N' until we switch back to the task's
1008assigned stack. 1000assigned stack.
1009 1001
1010ftrace 1002ftrace
@@ -1044,14 +1036,14 @@ this tracer is a nop.
1044[...] 1036[...]
1045 1037
1046 1038
1047Note: It is sometimes better to enable or disable tracing directly from 1039Note: ftrace uses ring buffers to store the above entries. The newest data
1048a program, because the buffer may be overflowed by the echo commands 1040may overwrite the oldest data. Sometimes using echo to stop the trace
1049before you get to the point you want to trace. It is also easier to 1041is not sufficient because the tracing could have overwritten the data
1050stop the tracing at the point that you hit the part that you are 1042that you wanted to record. For this reason, it is sometimes better to
1051interested in. Since the ftrace buffer is a ring buffer with the 1043disable tracing directly from a program. This allows you to stop the
1052oldest data being overwritten, usually it is sufficient to start the 1044tracing at the point that you hit the part that you are interested in.
1053tracer with an echo command but have you code stop it. Something 1045To disable the tracing directly from a C program, something like following
1054like the following is usually appropriate for this. 1046code snippet can be used:
1055 1047
1056int trace_fd; 1048int trace_fd;
1057[...] 1049[...]
@@ -1060,20 +1052,26 @@ int main(int argc, char *argv[]) {
1060 trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY); 1052 trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY);
1061 [...] 1053 [...]
1062 if (condition_hit()) { 1054 if (condition_hit()) {
1063 write(trace_fd, "0", 1); 1055 write(trace_fd, "0", 1);
1064 } 1056 }
1065 [...] 1057 [...]
1066} 1058}
1067 1059
1060Note: Here we hard coded the path name. The debugfs mount is not
1061guaranteed to be at /debug (and is more commonly at /sys/kernel/debug).
1062For simple one time traces, the above is sufficent. For anything else,
1063a search through /proc/mounts may be needed to find where the debugfs
1064file-system is mounted.
1068 1065
1069dynamic ftrace 1066dynamic ftrace
1070-------------- 1067--------------
1071 1068
1072If CONFIG_DYNAMIC_FTRACE is set, then the system will run with 1069If CONFIG_DYNAMIC_FTRACE is set, the system will run with
1073virtually no overhead when function tracing is disabled. The way 1070virtually no overhead when function tracing is disabled. The way
1074this works is the mcount function call (placed at the start of 1071this works is the mcount function call (placed at the start of
1075every kernel function, produced by the -pg switch in gcc), starts 1072every kernel function, produced by the -pg switch in gcc), starts
1076of pointing to a simple return. 1073of pointing to a simple return. (Enabling FTRACE will include the
1074-pg switch in the compiling of the kernel.)
1077 1075
1078When dynamic ftrace is initialized, it calls kstop_machine to make 1076When dynamic ftrace is initialized, it calls kstop_machine to make
1079the machine act like a uniprocessor so that it can freely modify code 1077the machine act like a uniprocessor so that it can freely modify code
@@ -1086,15 +1084,15 @@ Later on the ftraced kernel thread is awoken and will again call
1086kstop_machine if new functions have been recorded. The ftraced thread 1084kstop_machine if new functions have been recorded. The ftraced thread
1087will change all calls to mcount to "nop". Just calling mcount 1085will change all calls to mcount to "nop". Just calling mcount
1088and having mcount return has shown a 10% overhead. By converting 1086and having mcount return has shown a 10% overhead. By converting
1089it to a nop, there is no recordable overhead to the system. 1087it to a nop, there is no measurable overhead to the system.
1090 1088
1091One special side-effect to the recording of the functions being 1089One special side-effect to the recording of the functions being
1092traced, is that we can now selectively choose which functions we 1090traced is that we can now selectively choose which functions we
1093want to trace and which ones we want the mcount calls to remain as 1091wish to trace and which ones we want the mcount calls to remain as
1094nops. 1092nops.
1095 1093
1096Two files are used, one for enabling and one for disabling the tracing 1094Two files are used, one for enabling and one for disabling the tracing
1097of recorded functions. They are: 1095of specified functions. They are:
1098 1096
1099 set_ftrace_filter 1097 set_ftrace_filter
1100 1098
@@ -1116,7 +1114,7 @@ pick_next_task_fair
1116mutex_lock 1114mutex_lock
1117[...] 1115[...]
1118 1116
1119If I'm only interested in sys_nanosleep and hrtimer_interrupt: 1117If I am only interested in sys_nanosleep and hrtimer_interrupt:
1120 1118
1121 # echo sys_nanosleep hrtimer_interrupt \ 1119 # echo sys_nanosleep hrtimer_interrupt \
1122 > /debug/tracing/set_ftrace_filter 1120 > /debug/tracing/set_ftrace_filter
@@ -1133,21 +1131,21 @@ If I'm only interested in sys_nanosleep and hrtimer_interrupt:
1133 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call 1131 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call
1134 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt 1132 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
1135 1133
1136To see what functions are being traced, you can cat the file: 1134To see which functions are being traced, you can cat the file:
1137 1135
1138 # cat /debug/tracing/set_ftrace_filter 1136 # cat /debug/tracing/set_ftrace_filter
1139hrtimer_interrupt 1137hrtimer_interrupt
1140sys_nanosleep 1138sys_nanosleep
1141 1139
1142 1140
1143Perhaps this isn't enough. The filters also allow simple wild cards. 1141Perhaps this is not enough. The filters also allow simple wild cards.
1144Only the following are currently available 1142Only the following are currently available
1145 1143
1146 <match>* - will match functions that begin with <match> 1144 <match>* - will match functions that begin with <match>
1147 *<match> - will match functions that end with <match> 1145 *<match> - will match functions that end with <match>
1148 *<match>* - will match functions that have <match> in it 1146 *<match>* - will match functions that have <match> in it
1149 1147
1150Thats all the wild cards that are allowed. 1148These are the only wild cards which are supported.
1151 1149
1152 <match>*<match> will not work. 1150 <match>*<match> will not work.
1153 1151
@@ -1258,15 +1256,15 @@ calls that need to be converted into nops. If there are not any, then
1258it simply goes back to sleep. But if there are some, it will call 1256it simply goes back to sleep. But if there are some, it will call
1259kstop_machine to convert the calls to nops. 1257kstop_machine to convert the calls to nops.
1260 1258
1261There may be a case that you do not want this added latency. 1259There may be a case in which you do not want this added latency.
1262Perhaps you are doing some audio recording and this activity might 1260Perhaps you are doing some audio recording and this activity might
1263cause skips in the playback. There is an interface to disable 1261cause skips in the playback. There is an interface to disable
1264and enable the ftraced kernel thread. 1262and enable the "ftraced" kernel thread.
1265 1263
1266 # echo 0 > /debug/tracing/ftraced_enabled 1264 # echo 0 > /debug/tracing/ftraced_enabled
1267 1265
1268This will disable the calling of the kstop_machine to update the 1266This will disable the calling of kstop_machine to update the
1269mcount calls to nops. Remember that there's a large overhead 1267mcount calls to nops. Remember that there is a large overhead
1270to calling mcount. Without this kernel thread, that overhead will 1268to calling mcount. Without this kernel thread, that overhead will
1271exist. 1269exist.
1272 1270
@@ -1282,8 +1280,8 @@ that uses ftrace function recording.
1282trace_pipe 1280trace_pipe
1283---------- 1281----------
1284 1282
1285The trace_pipe outputs the same as trace, but the effect on the 1283The trace_pipe outputs the same content as the trace file, but the effect
1286tracing is different. Every read from trace_pipe is consumed. 1284on the tracing is different. Every read from trace_pipe is consumed.
1287This means that subsequent reads will be different. The trace 1285This means that subsequent reads will be different. The trace
1288is live. 1286is live.
1289 1287
@@ -1313,7 +1311,7 @@ is live.
1313 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up 1311 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1314 1312
1315 1313
1316Note, reading the trace_pipe will block until more input is added. 1314Note, reading the trace_pipe file will block until more input is added.
1317By changing the tracer, trace_pipe will issue an EOF. We needed 1315By changing the tracer, trace_pipe will issue an EOF. We needed
1318to set the ftrace tracer _before_ cating the trace_pipe file. 1316to set the ftrace tracer _before_ cating the trace_pipe file.
1319 1317
@@ -1322,7 +1320,7 @@ trace entries
1322------------- 1320-------------
1323 1321
1324Having too much or not enough data can be troublesome in diagnosing 1322Having too much or not enough data can be troublesome in diagnosing
1325some issue in the kernel. The file trace_entries is used to modify 1323an issue in the kernel. The file trace_entries is used to modify
1326the size of the internal trace buffers. The number listed 1324the size of the internal trace buffers. The number listed
1327is the number of entries that can be recorded per CPU. To know 1325is the number of entries that can be recorded per CPU. To know
1328the full size, multiply the number of possible CPUS with the 1326the full size, multiply the number of possible CPUS with the
@@ -1332,7 +1330,8 @@ number of entries.
133265620 133065620
1333 1331
1334Note, to modify this, you must have tracing completely disabled. To do that, 1332Note, to modify this, you must have tracing completely disabled. To do that,
1335echo "none" into the current_tracer. 1333echo "none" into the current_tracer. If the current_tracer is not set
1334to "none", an EINVAL error will be returned.
1336 1335
1337 # echo none > /debug/tracing/current_tracer 1336 # echo none > /debug/tracing/current_tracer
1338 # echo 100000 > /debug/tracing/trace_entries 1337 # echo 100000 > /debug/tracing/trace_entries
@@ -1341,18 +1340,18 @@ echo "none" into the current_tracer.
1341 1340
1342 1341
1343Notice that we echoed in 100,000 but the size is 100,045. The entries 1342Notice that we echoed in 100,000 but the size is 100,045. The entries
1344are held by individual pages. It allocates the number of pages it takes 1343are held in individual pages. It allocates the number of pages it takes
1345to fulfill the request. If more entries may fit on the last page 1344to fulfill the request. If more entries may fit on the last page
1346it will add them. 1345then they will be added.
1347 1346
1348 # echo 1 > /debug/tracing/trace_entries 1347 # echo 1 > /debug/tracing/trace_entries
1349 # cat /debug/tracing/trace_entries 1348 # cat /debug/tracing/trace_entries
135085 134985
1351 1350
1352This shows us that 85 entries can fit on a single page. 1351This shows us that 85 entries can fit in a single page.
1353 1352
1354The number of pages that will be allocated is a percentage of available 1353The number of pages which will be allocated is limited to a percentage
1355memory. Allocating too much will produce an error. 1354of available memory. Allocating too much will produce an error.
1356 1355
1357 # echo 1000000000000 > /debug/tracing/trace_entries 1356 # echo 1000000000000 > /debug/tracing/trace_entries
1358-bash: echo: write error: Cannot allocate memory 1357-bash: echo: write error: Cannot allocate memory