aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/ftrace.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/ftrace.txt')
-rw-r--r--Documentation/ftrace.txt403
1 files changed, 205 insertions, 198 deletions
diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index 13e4bf054c38..f218f616ff6b 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -2,8 +2,12 @@
2 ======================== 2 ========================
3 3
4Copyright 2008 Red Hat Inc. 4Copyright 2008 Red Hat Inc.
5Author: Steven Rostedt <srostedt@redhat.com> 5 Author: Steven Rostedt <srostedt@redhat.com>
6 License: The GNU Free Documentation License, Version 1.2
7Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
8 John Kacur, and David Teigland.
6 9
10Written for: 2.6.27-rc1
7 11
8Introduction 12Introduction
9------------ 13------------
@@ -15,10 +19,11 @@ issues that take place outside of user-space.
15 19
16Although ftrace is the function tracer, it also includes an 20Although ftrace is the function tracer, it also includes an
17infrastructure that allows for other types of tracing. Some of the 21infrastructure that allows for other types of tracing. Some of the
18tracers that are currently in ftrace is a tracer to trace 22tracers that are currently in ftrace include a tracer to trace
19context switches, the time it takes for a high priority task to 23context switches, the time it takes for a high priority task to
20run after it was woken up, the time interrupts are disabled, and 24run after it was woken up, the time interrupts are disabled, and
21more. 25more (ftrace allows for tracer plugins, which means that the list of
26tracers can always grow).
22 27
23 28
24The File System 29The File System
@@ -32,6 +37,8 @@ To mount the debugfs system:
32 # mkdir /debug 37 # mkdir /debug
33 # mount -t debugfs nodev /debug 38 # mount -t debugfs nodev /debug
34 39
40(Note: it is more common to mount at /sys/kernel/debug, but for simplicity
41 this document will use /debug)
35 42
36That's it! (assuming that you have ftrace configured into your kernel) 43That's it! (assuming that you have ftrace configured into your kernel)
37 44
@@ -46,21 +53,20 @@ of ftrace. Here is a list of some of the key files:
46 that is configured. 53 that is configured.
47 54
48 available_tracers : This holds the different types of tracers that 55 available_tracers : This holds the different types of tracers that
49 has been compiled into the kernel. The tracers 56 have been compiled into the kernel. The tracers
50 listed here can be configured by echoing in their 57 listed here can be configured by echoing their name
51 name into current_tracer. 58 into current_tracer.
52 59
53 tracing_enabled : This sets or displays whether the current_tracer 60 tracing_enabled : This sets or displays whether the current_tracer
54 is activated and tracing or not. Echo 0 into this 61 is activated and tracing or not. Echo 0 into this
55 file to disable the tracer or 1 (or non-zero) to 62 file to disable the tracer or 1 to enable it.
56 enable it.
57 63
58 trace : This file holds the output of the trace in a human readable 64 trace : This file holds the output of the trace in a human readable
59 format. 65 format (described below).
60 66
61 latency_trace : This file shows the same trace but the information 67 latency_trace : This file shows the same trace but the information
62 is organized more to display possible latencies 68 is organized more to display possible latencies
63 in the system. 69 in the system (described below).
64 70
65 trace_pipe : The output is the same as the "trace" file but this 71 trace_pipe : The output is the same as the "trace" file but this
66 file is meant to be streamed with live tracing. 72 file is meant to be streamed with live tracing.
@@ -72,7 +78,7 @@ of ftrace. Here is a list of some of the key files:
72 file, it is consumed, and will not be read 78 file, it is consumed, and will not be read
73 again with a sequential read. The "trace" and 79 again with a sequential read. The "trace" and
74 "latency_trace" files are static, and if the 80 "latency_trace" files are static, and if the
75 tracer isn't adding more data, they will display 81 tracer is not adding more data, they will display
76 the same information every time they are read. 82 the same information every time they are read.
77 83
78 iter_ctrl : This file lets the user control the amount of data 84 iter_ctrl : This file lets the user control the amount of data
@@ -89,12 +95,14 @@ of ftrace. Here is a list of some of the key files:
89 95
90 trace_entries : This sets or displays the number of trace 96 trace_entries : This sets or displays the number of trace
91 entries each CPU buffer can hold. The tracer buffers 97 entries each CPU buffer can hold. The tracer buffers
92 are the same size for each CPU, so care must be 98 are the same size for each CPU. The displayed number
93 taken when modifying the trace_entries. The number 99 is the size of the CPU buffer and not total size. The
94 of actually entries will be the number given 100 trace buffers are allocated in pages (blocks of memory
95 times the number of possible CPUS. The buffers 101 that the kernel uses for allocation, usually 4 KB in size).
96 are saved as individual pages, and the actual entries 102 Since each entry is smaller than a page, if the last
97 will always be rounded up to entries per page. 103 allocated page has room for more entries than were
104 requested, the rest of the page is used to allocate
105 entries.
98 106
99 This can only be updated when the current_tracer 107 This can only be updated when the current_tracer
100 is set to "none". 108 is set to "none".
@@ -107,20 +115,19 @@ of ftrace. Here is a list of some of the key files:
107 on specified CPUS. The format is a hex string 115 on specified CPUS. The format is a hex string
108 representing the CPUS. 116 representing the CPUS.
109 117
110 set_ftrace_filter : When dynamic ftrace is configured in, the 118 set_ftrace_filter : When dynamic ftrace is configured in (see the
111 code is dynamically modified to disable calling 119 section below "dynamic ftrace"), the code is dynamically
112 of the function profiler (mcount). This lets 120 modified (code text rewrite) to disable calling of the
113 tracing be configured in with practically no overhead 121 function profiler (mcount). This lets tracing be configured
114 in performance. This also has a side effect of 122 in with practically no overhead in performance. This also
115 enabling or disabling specific functions to be 123 has a side effect of enabling or disabling specific functions
116 traced. Echoing in names of functions into this 124 to be traced. Echoing names of functions into this file
117 file will limit the trace to only those files. 125 will limit the trace to only those functions.
118 126
119 set_ftrace_notrace: This has the opposite effect that 127 set_ftrace_notrace: This has an effect opposite to that of
120 set_ftrace_filter has. Any function that is added 128 set_ftrace_filter. Any function that is added here will not
121 here will not be traced. If a function exists 129 be traced. If a function exists in both set_ftrace_filter
122 in both set_ftrace_filter and set_ftrace_notrace 130 and set_ftrace_notrace, the function will _not_ be traced.
123 the function will _not_ bet traced.
124 131
125 available_filter_functions : When a function is encountered the first 132 available_filter_functions : When a function is encountered the first
126 time by the dynamic tracer, it is recorded and 133 time by the dynamic tracer, it is recorded and
@@ -128,32 +135,31 @@ of ftrace. Here is a list of some of the key files:
128 lists the functions that have been recorded 135 lists the functions that have been recorded
129 by the dynamic tracer and these functions can 136 by the dynamic tracer and these functions can
130 be used to set the ftrace filter by the above 137 be used to set the ftrace filter by the above
131 "set_ftrace_filter" file. 138 "set_ftrace_filter" file. (See the section "dynamic ftrace"
139 below for more details).
132 140
133 141
134The Tracers 142The Tracers
135----------- 143-----------
136 144
137Here are the list of current tracers that can be configured. 145Here is the list of current tracers that may be configured.
138 146
139 ftrace - function tracer that uses mcount to trace all functions. 147 ftrace - function tracer that uses mcount to trace all functions.
140 It is possible to filter out which functions that are
141 traced when dynamic ftrace is configured in.
142 148
143 sched_switch - traces the context switches between tasks. 149 sched_switch - traces the context switches between tasks.
144 150
145 irqsoff - traces the areas that disable interrupts and saves off 151 irqsoff - traces the areas that disable interrupts and saves
146 the trace with the longest max latency. 152 the trace with the longest max latency.
147 See tracing_max_latency. When a new max is recorded, 153 See tracing_max_latency. When a new max is recorded,
148 it replaces the old trace. It is best to view this 154 it replaces the old trace. It is best to view this
149 trace with the latency_trace file. 155 trace via the latency_trace file.
150 156
151 preemptoff - Similar to irqsoff but traces and records the time 157 preemptoff - Similar to irqsoff but traces and records the amount of
152 preemption is disabled. 158 time for which preemption is disabled.
153 159
154 preemptirqsoff - Similar to irqsoff and preemptoff, but traces and 160 preemptirqsoff - Similar to irqsoff and preemptoff, but traces and
155 records the largest time irqs and/or preemption is 161 records the largest time for which irqs and/or preemption
156 disabled. 162 is disabled.
157 163
158 wakeup - Traces and records the max latency that it takes for 164 wakeup - Traces and records the max latency that it takes for
159 the highest priority task to get scheduled after 165 the highest priority task to get scheduled after
@@ -166,13 +172,13 @@ Here are the list of current tracers that can be configured.
166Examples of using the tracer 172Examples of using the tracer
167---------------------------- 173----------------------------
168 174
169Here are typical examples of using the tracers with only controlling 175Here are typical examples of using the tracers when controlling them only
170them with the debugfs interface (without using any user-land utilities). 176with the debugfs interface (without using any user-land utilities).
171 177
172Output format: 178Output format:
173-------------- 179--------------
174 180
175Here's an example of the output format of the file "trace" 181Here is an example of the output format of the file "trace"
176 182
177 -------- 183 --------
178# tracer: ftrace 184# tracer: ftrace
@@ -184,14 +190,15 @@ Here's an example of the output format of the file "trace"
184 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput 190 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
185 -------- 191 --------
186 192
187A header is printed with the trace that is represented. In this case 193A header is printed with the tracer name that is represented by the trace.
188the tracer is "ftrace". Then a header showing the format. Task name 194In this case the tracer is "ftrace". Then a header showing the format. Task
189"bash", the task PID "4251", the CPU that it was running on 195name "bash", the task PID "4251", the CPU that it was running on
190"01", the timestamp in <secs>.<usecs> format, the function name that was 196"01", the timestamp in <secs>.<usecs> format, the function name that was
191traced "path_put" and the parent function that called this function 197traced "path_put" and the parent function that called this function
192"path_walk". 198"path_walk". The timestamp is the time at which the function was
199entered.
193 200
194The sched_switch tracer also includes tracing of task wake ups and 201The sched_switch tracer also includes tracing of task wakeups and
195context switches. 202context switches.
196 203
197 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S 204 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S
@@ -201,7 +208,7 @@ context switches.
201 kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R 208 kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R
202 ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R 209 ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R
203 210
204Wake ups are represented by a "+" and the context switches show 211Wake ups are represented by a "+" and the context switches are shown as
205"==>". The format is: 212"==>". The format is:
206 213
207 Context switches: 214 Context switches:
@@ -216,7 +223,7 @@ Wake ups are represented by a "+" and the context switches show
216 223
217 <pid>:<prio>:<state> + <pid>:<prio>:<state> 224 <pid>:<prio>:<state> + <pid>:<prio>:<state>
218 225
219The prio is the internal kernel priority, which is inverse to the 226The prio is the internal kernel priority, which is the inverse of the
220priority that is usually displayed by user-space tools. Zero represents 227priority that is usually displayed by user-space tools. Zero represents
221the highest priority (99). Prio 100 starts the "nice" priorities with 228the highest priority (99). Prio 100 starts the "nice" priorities with
222100 being equal to nice -20 and 139 being nice 19. The prio "140" is 229100 being equal to nice -20 and 139 being nice 19. The prio "140" is
@@ -227,7 +234,7 @@ Latency trace format
227-------------------- 234--------------------
228 235
229For traces that display latency times, the latency_trace file gives 236For traces that display latency times, the latency_trace file gives
230a bit more information to see why a latency happened. Here's a typical 237somewhat more information to see why a latency happened. Here is a typical
231trace. 238trace.
232 239
233# tracer: irqsoff 240# tracer: irqsoff
@@ -255,21 +262,20 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8
255 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) 262 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq)
256 263
257 264
258vim:ft=help
259
260 265
261This shows that the current tracer is "irqsoff" tracing the time 266This shows that the current tracer is "irqsoff" tracing the time for which
262interrupts are disabled. It gives the trace version and the kernel 267interrupts were disabled. It gives the trace version and the version
263this was executed on (2.6.26-rc8). Then it displays the max latency 268of the kernel upon which this was executed on (2.6.26-rc8). Then it displays
264in microsecs (97 us). The number of trace entries displayed 269the max latency in microsecs (97 us). The number of trace entries displayed
265by the total number recorded (both are three: #3/3). The type of 270and the total number recorded (both are three: #3/3). The type of
266preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero 271preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero
267and reserved for later use. #P is the number of online CPUS (#P:2). 272and are reserved for later use. #P is the number of online CPUS (#P:2).
268 273
269The task is the process that was running when the latency happened. 274The task is the process that was running when the latency occurred.
270(swapper pid: 0). 275(swapper pid: 0).
271 276
272The start and stop that caused the latencies: 277The start and stop (the functions in which the interrupts were disabled and
278enabled respectively) that caused the latencies:
273 279
274 apic_timer_interrupt is where the interrupts were disabled. 280 apic_timer_interrupt is where the interrupts were disabled.
275 do_softirq is where they were enabled again. 281 do_softirq is where they were enabled again.
@@ -281,14 +287,14 @@ explains which is which.
281 287
282 pid: The PID of that process. 288 pid: The PID of that process.
283 289
284 CPU#: The CPU that the process was running on. 290 CPU#: The CPU which the process was running on.
285 291
286 irqs-off: 'd' interrupts are disabled. '.' otherwise. 292 irqs-off: 'd' interrupts are disabled. '.' otherwise.
287 293
288 need-resched: 'N' task need_resched is set, '.' otherwise. 294 need-resched: 'N' task need_resched is set, '.' otherwise.
289 295
290 hardirq/softirq: 296 hardirq/softirq:
291 'H' - hard irq happened inside a softirq. 297 'H' - hard irq occurred inside a softirq.
292 'h' - hard irq is running 298 'h' - hard irq is running
293 's' - soft irq is running 299 's' - soft irq is running
294 '.' - normal context. 300 '.' - normal context.
@@ -297,13 +303,13 @@ explains which is which.
297 303
298The above is mostly meaningful for kernel developers. 304The above is mostly meaningful for kernel developers.
299 305
300 time: This differs from the trace output where as the trace output 306 time: This differs from the trace file output. The trace file output
301 contained a absolute timestamp. This timestamp is relative 307 includes an absolute timestamp. The timestamp used by the
302 to the start of the first entry in the the trace. 308 latency_trace file is relative to the start of the trace.
303 309
304 delay: This is just to help catch your eye a bit better. And 310 delay: This is just to help catch your eye a bit better. And
305 needs to be fixed to be only relative to the same CPU. 311 needs to be fixed to be only relative to the same CPU.
306 The marks is determined by the difference between this 312 The marks are determined by the difference between this
307 current trace and the next trace. 313 current trace and the next trace.
308 '!' - greater than preempt_mark_thresh (default 100) 314 '!' - greater than preempt_mark_thresh (default 100)
309 '+' - greater than 1 microsecond 315 '+' - greater than 1 microsecond
@@ -322,13 +328,13 @@ output. To see what is available, simply cat the file:
322 print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ 328 print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
323 noblock nostacktrace nosched-tree 329 noblock nostacktrace nosched-tree
324 330
325To disable one of the options, echo in the option appended with "no". 331To disable one of the options, echo in the option prepended with "no".
326 332
327 echo noprint-parent > /debug/tracing/iter_ctrl 333 echo noprint-parent > /debug/tracing/iter_ctrl
328 334
329To enable an option, leave off the "no". 335To enable an option, leave off the "no".
330 336
331 echo sym-offest > /debug/tracing/iter_ctrl 337 echo sym-offset > /debug/tracing/iter_ctrl
332 338
333Here are the available options: 339Here are the available options:
334 340
@@ -344,7 +350,7 @@ Here are the available options:
344 350
345 sym-offset - Display not only the function name, but also the offset 351 sym-offset - Display not only the function name, but also the offset
346 in the function. For example, instead of seeing just 352 in the function. For example, instead of seeing just
347 "ktime_get" you will see "ktime_get+0xb/0x20" 353 "ktime_get", you will see "ktime_get+0xb/0x20".
348 354
349 sym-offset: 355 sym-offset:
350 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0 356 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
@@ -364,7 +370,7 @@ Here are the available options:
364 user applications that can translate the raw numbers better than 370 user applications that can translate the raw numbers better than
365 having it done in the kernel. 371 having it done in the kernel.
366 372
367 hex - similar to raw, but the numbers will be in a hexadecimal format. 373 hex - Similar to raw, but the numbers will be in a hexadecimal format.
368 374
369 bin - This will print out the formats in raw binary. 375 bin - This will print out the formats in raw binary.
370 376
@@ -380,8 +386,8 @@ Here are the available options:
380sched_switch 386sched_switch
381------------ 387------------
382 388
383This tracer simply records schedule switches. Here's an example 389This tracer simply records schedule switches. Here is an example
384on how to implement it. 390of how to use it.
385 391
386 # echo sched_switch > /debug/tracing/current_tracer 392 # echo sched_switch > /debug/tracing/current_tracer
387 # echo 1 > /debug/tracing/tracing_enabled 393 # echo 1 > /debug/tracing/tracing_enabled
@@ -416,8 +422,8 @@ the name of the trace and points to the options. The "FUNCTION"
416is a misnomer since here it represents the wake ups and context 422is a misnomer since here it represents the wake ups and context
417switches. 423switches.
418 424
419The sched_switch only lists the wake ups (represented with '+') 425The sched_switch file only lists the wake ups (represented with '+')
420and context switches ('==>') with the previous task or current 426and context switches ('==>') with the previous task or current task
421first followed by the next task or task waking up. The format for both 427first followed by the next task or task waking up. The format for both
422of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO 428of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO
423is the inverse of the actual priority with zero (0) being the highest 429is the inverse of the actual priority with zero (0) being the highest
@@ -432,7 +438,8 @@ The task states are:
432 438
433 R - running : wants to run, may not actually be running 439 R - running : wants to run, may not actually be running
434 S - sleep : process is waiting to be woken up (handles signals) 440 S - sleep : process is waiting to be woken up (handles signals)
435 D - deep sleep : process must be woken up (ignores signals) 441 D - disk sleep (uninterruptible sleep) : process must be woken up
442 (ignores signals)
436 T - stopped : process suspended 443 T - stopped : process suspended
437 t - traced : process is being traced (with something like gdb) 444 t - traced : process is being traced (with something like gdb)
438 Z - zombie : process waiting to be cleaned up 445 Z - zombie : process waiting to be cleaned up
@@ -442,8 +449,8 @@ The task states are:
442ftrace_enabled 449ftrace_enabled
443-------------- 450--------------
444 451
445The following tracers give different output depending on whether 452The following tracers (listed below) give different output depending
446or not the sysctl ftrace_enabled is set. To set ftrace_enabled, 453on whether or not the sysctl ftrace_enabled is set. To set ftrace_enabled,
447one can either use the sysctl function or set it via the proc 454one can either use the sysctl function or set it via the proc
448file system interface. 455file system interface.
449 456
@@ -470,13 +477,12 @@ interrupt from triggering or the mouse interrupt from letting the
470kernel know of a new mouse event. The result is a latency with the 477kernel know of a new mouse event. The result is a latency with the
471reaction time. 478reaction time.
472 479
473The irqsoff tracer tracks the time interrupts are disabled and when 480The irqsoff tracer tracks the time for which interrupts are disabled.
474they are re-enabled. When a new maximum latency is hit, it saves off 481When a new maximum latency is hit, the tracer saves the trace leading up
475the trace so that it may be retrieved at a later time. Every time a 482to that latency point so that every time a new maximum is reached, the old
476new maximum in reached, the old saved trace is discarded and the new 483saved trace is discarded and the new trace is saved.
477trace is saved.
478 484
479To reset the maximum, echo 0 into tracing_max_latency. Here's an 485To reset the maximum, echo 0 into tracing_max_latency. Here is an
480example: 486example:
481 487
482 # echo irqsoff > /debug/tracing/current_tracer 488 # echo irqsoff > /debug/tracing/current_tracer
@@ -488,14 +494,14 @@ example:
488 # cat /debug/tracing/latency_trace 494 # cat /debug/tracing/latency_trace
489# tracer: irqsoff 495# tracer: irqsoff
490# 496#
491irqsoff latency trace v1.1.5 on 2.6.26-rc8 497irqsoff latency trace v1.1.5 on 2.6.26
492-------------------------------------------------------------------- 498--------------------------------------------------------------------
493 latency: 6 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) 499 latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
494 ----------------- 500 -----------------
495 | task: bash-4269 (uid:0 nice:0 policy:0 rt_prio:0) 501 | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0)
496 ----------------- 502 -----------------
497 => started at: copy_page_range 503 => started at: sys_setpgid
498 => ended at: copy_page_range 504 => ended at: sys_setpgid
499 505
500# _------=> CPU# 506# _------=> CPU#
501# / _-----=> irqs-off 507# / _-----=> irqs-off
@@ -506,21 +512,19 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8
506# ||||| delay 512# ||||| delay
507# cmd pid ||||| time | caller 513# cmd pid ||||| time | caller
508# \ / ||||| \ | / 514# \ / ||||| \ | /
509 bash-4269 1...1 0us+: _spin_lock (copy_page_range) 515 bash-3730 1d... 0us : _write_lock_irq (sys_setpgid)
510 bash-4269 1...1 7us : _spin_unlock (copy_page_range) 516 bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid)
511 bash-4269 1...2 7us : trace_preempt_on (copy_page_range) 517 bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid)
512 518
513 519
514vim:ft=help 520Here we see that that we had a latency of 12 microsecs (which is
521very good). The _write_lock_irq in sys_setpgid disabled interrupts.
522The difference between the 12 and the displayed timestamp 14us occurred
523because the clock was incremented between the time of recording the max
524latency and the time of recording the function that had that latency.
515 525
516Here we see that that we had a latency of 6 microsecs (which is 526Note the above example had ftrace_enabled not set. If we set the
517very good). The spin_lock in copy_page_range disabled interrupts. 527ftrace_enabled, we get a much larger output:
518The difference between the 6 and the displayed timestamp 7us is
519because the clock must have incremented between the time of recording
520the max latency and recording the function that had that latency.
521
522Note the above had ftrace_enabled not set. If we set the ftrace_enabled
523we get a much larger output:
524 528
525# tracer: irqsoff 529# tracer: irqsoff
526# 530#
@@ -566,27 +570,26 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8
566 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) 570 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal)
567 571
568 572
569vim:ft=help
570
571 573
572Here we traced a 50 microsecond latency. But we also see all the 574Here we traced a 50 microsecond latency. But we also see all the
573functions that were called during that time. Note that enabling 575functions that were called during that time. Note that by enabling
574function tracing we endure an added overhead. This overhead may 576function tracing, we incur an added overhead. This overhead may
575extend the latency times. But never the less, this trace has provided 577extend the latency times. But nevertheless, this trace has provided
576some very helpful debugging. 578some very helpful debugging information.
577 579
578 580
579preemptoff 581preemptoff
580---------- 582----------
581 583
582When preemption is disabled we may be able to receive interrupts but 584When preemption is disabled, we may be able to receive interrupts but
583the task can not be preempted and a higher priority task must wait 585the task cannot be preempted and a higher priority task must wait
584for preemption to be enabled again before it can preempt a lower 586for preemption to be enabled again before it can preempt a lower
585priority task. 587priority task.
586 588
587The preemptoff tracer traces the places that disables preemption. 589The preemptoff tracer traces the places that disable preemption.
588Like the irqsoff, it records the maximum latency that preemption 590Like the irqsoff tracer, it records the maximum latency for which preemption
589was disabled. The control of preemptoff is much like the irqsoff. 591was disabled. The control of preemptoff tracer is much like the irqsoff
592tracer.
590 593
591 # echo preemptoff > /debug/tracing/current_tracer 594 # echo preemptoff > /debug/tracing/current_tracer
592 # echo 0 > /debug/tracing/tracing_max_latency 595 # echo 0 > /debug/tracing/tracing_max_latency
@@ -620,8 +623,6 @@ preemptoff latency trace v1.1.5 on 2.6.26-rc8
620 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) 623 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq)
621 624
622 625
623vim:ft=help
624
625This has some more changes. Preemption was disabled when an interrupt 626This has some more changes. Preemption was disabled when an interrupt
626came in (notice the 'h'), and was enabled while doing a softirq. 627came in (notice the 'h'), and was enabled while doing a softirq.
627(notice the 's'). But we also see that interrupts have been disabled 628(notice the 's'). But we also see that interrupts have been disabled
@@ -689,16 +690,16 @@ The above is an example of the preemptoff trace with ftrace_enabled
689set. Here we see that interrupts were disabled the entire time. 690set. Here we see that interrupts were disabled the entire time.
690The irq_enter code lets us know that we entered an interrupt 'h'. 691The irq_enter code lets us know that we entered an interrupt 'h'.
691Before that, the functions being traced still show that it is not 692Before that, the functions being traced still show that it is not
692in an interrupt, but we can see by the functions themselves that 693in an interrupt, but we can see from the functions themselves that
693this is not the case. 694this is not the case.
694 695
695Notice that the __do_softirq when called doesn't have a preempt_count. 696Notice that __do_softirq when called does not have a preempt_count.
696It may seem that we missed a preempt enabled. What really happened 697It may seem that we missed a preempt enabling. What really happened
697is that the preempt count is held on the threads stack and we 698is that the preempt count is held on the thread's stack and we
698switched to the softirq stack (4K stacks in effect). The code 699switched to the softirq stack (4K stacks in effect). The code
699does not copy the preempt count, but because interrupts are disabled 700does not copy the preempt count, but because interrupts are disabled,
700we don't need to worry about it. Having a tracer like this is good 701we do not need to worry about it. Having a tracer like this is good
701to let people know what really happens inside the kernel. 702for letting people know what really happens inside the kernel.
702 703
703 704
704preemptirqsoff 705preemptirqsoff
@@ -708,7 +709,7 @@ Knowing the locations that have interrupts disabled or preemption
708disabled for the longest times is helpful. But sometimes we would 709disabled for the longest times is helpful. But sometimes we would
709like to know when either preemption and/or interrupts are disabled. 710like to know when either preemption and/or interrupts are disabled.
710 711
711The following code: 712Consider the following code:
712 713
713 local_irq_disable(); 714 local_irq_disable();
714 call_function_with_irqs_off(); 715 call_function_with_irqs_off();
@@ -732,7 +733,7 @@ To record this time, use the preemptirqsoff tracer.
732 733
733Again, using this trace is much like the irqsoff and preemptoff tracers. 734Again, using this trace is much like the irqsoff and preemptoff tracers.
734 735
735 # echo preemptoff > /debug/tracing/current_tracer 736 # echo preemptirqsoff > /debug/tracing/current_tracer
736 # echo 0 > /debug/tracing/tracing_max_latency 737 # echo 0 > /debug/tracing/tracing_max_latency
737 # echo 1 > /debug/tracing/tracing_enabled 738 # echo 1 > /debug/tracing/tracing_enabled
738 # ls -ltr 739 # ls -ltr
@@ -764,12 +765,10 @@ preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
764 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) 765 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq)
765 766
766 767
767vim:ft=help
768
769 768
770The trace_hardirqs_off_thunk is called from assembly on x86 when 769The trace_hardirqs_off_thunk is called from assembly on x86 when
771interrupts are disabled in the assembly code. Without the function 770interrupts are disabled in the assembly code. Without the function
772tracing, we don't know if interrupts were enabled within the preemption 771tracing, we do not know if interrupts were enabled within the preemption
773points. We do see that it started with preemption enabled. 772points. We do see that it started with preemption enabled.
774 773
775Here is a trace with ftrace_enabled set: 774Here is a trace with ftrace_enabled set:
@@ -860,25 +859,25 @@ preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
860 859
861This is a very interesting trace. It started with the preemption of 860This is a very interesting trace. It started with the preemption of
862the ls task. We see that the task had the "need_resched" bit set 861the ls task. We see that the task had the "need_resched" bit set
863with the 'N' in the trace. Interrupts are disabled in the spin_lock 862via the 'N' in the trace. Interrupts were disabled before the spin_lock
864and the trace started. We see that a schedule took place to run 863at the beginning of the trace. We see that a schedule took place to run
865sshd. When the interrupts were enabled we took an interrupt. 864sshd. When the interrupts were enabled, we took an interrupt.
866On return of the interrupt the softirq ran. We took another interrupt 865On return from the interrupt handler, the softirq ran. We took another
867while running the softirq as we see with the capital 'H'. 866interrupt while running the softirq as we see from the capital 'H'.
868 867
869 868
870wakeup 869wakeup
871------ 870------
872 871
873In Real-Time environment it is very important to know the wakeup 872In a Real-Time environment it is very important to know the wakeup
874time it takes for the highest priority task that wakes up to the 873time it takes for the highest priority task that is woken up to the
875time it executes. This is also known as "schedule latency". 874time that it executes. This is also known as "schedule latency".
876I stress the point that this is about RT tasks. It is also important 875I stress the point that this is about RT tasks. It is also important
877to know the scheduling latency of non-RT tasks, but the average 876to know the scheduling latency of non-RT tasks, but the average
878schedule latency is better for non-RT tasks. Tools like 877schedule latency is better for non-RT tasks. Tools like
879LatencyTop is more appropriate for such measurements. 878LatencyTop are more appropriate for such measurements.
880 879
881Real-Time environments is interested in the worst case latency. 880Real-Time environments are interested in the worst case latency.
882That is the longest latency it takes for something to happen, and 881That is the longest latency it takes for something to happen, and
883not the average. We can have a very fast scheduler that may only 882not the average. We can have a very fast scheduler that may only
884have a large latency once in a while, but that would not work well 883have a large latency once in a while, but that would not work well
@@ -889,8 +888,8 @@ tasks that are unpredictable will overwrite the worst case latency
889of RT tasks. 888of RT tasks.
890 889
891Since this tracer only deals with RT tasks, we will run this slightly 890Since this tracer only deals with RT tasks, we will run this slightly
892different than we did with the previous tracers. Instead of performing 891differently than we did with the previous tracers. Instead of performing
893an 'ls' we will run 'sleep 1' under 'chrt' which changes the 892an 'ls', we will run 'sleep 1' under 'chrt' which changes the
894priority of the task. 893priority of the task.
895 894
896 # echo wakeup > /debug/tracing/current_tracer 895 # echo wakeup > /debug/tracing/current_tracer
@@ -921,12 +920,10 @@ wakeup latency trace v1.1.5 on 2.6.26-rc8
921 <idle>-0 1d..4 4us : schedule (cpu_idle) 920 <idle>-0 1d..4 4us : schedule (cpu_idle)
922 921
923 922
924vim:ft=help
925
926 923
927Running this on an idle system we see that it only took 4 microseconds 924Running this on an idle system, we see that it only took 4 microseconds
928to perform the task switch. Note, since the trace marker in the 925to perform the task switch. Note, since the trace marker in the
929schedule is before the actual "switch" we stop the tracing when 926schedule is before the actual "switch", we stop the tracing when
930the recorded task is about to schedule in. This may change if 927the recorded task is about to schedule in. This may change if
931we add a new marker at the end of the scheduler. 928we add a new marker at the end of the scheduler.
932 929
@@ -991,13 +988,16 @@ ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock)
991ksoftirq-7 1d..4 50us : schedule (__cond_resched) 988ksoftirq-7 1d..4 50us : schedule (__cond_resched)
992 989
993The interrupt went off while running ksoftirqd. This task runs at 990The interrupt went off while running ksoftirqd. This task runs at
994SCHED_OTHER. Why didn't we see the 'N' set early? This may be 991SCHED_OTHER. Why did not we see the 'N' set early? This may be
995a harmless bug with x86_32 and 4K stacks. The need_reched() function 992a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K stacks
996that tests if we need to reschedule looks on the actual stack. 993configured, the interrupt and softirq run with their own stack.
997Where as the setting of the NEED_RESCHED bit happens on the 994Some information is held on the top of the task's stack (need_resched
998task's stack. But because we are in a hard interrupt, the test 995and preempt_count are both stored there). The setting of the NEED_RESCHED
999is with the interrupts stack which has that to be false. We don't 996bit is done directly to the task's stack, but the reading of the
1000see the 'N' until we switch back to the task's stack. 997NEED_RESCHED is done by looking at the current stack, which in this case
998is the stack for the hard interrupt. This hides the fact that NEED_RESCHED
999has been set. We do not see the 'N' until we switch back to the task's
1000assigned stack.
1001 1001
1002ftrace 1002ftrace
1003------ 1003------
@@ -1036,14 +1036,14 @@ this tracer is a nop.
1036[...] 1036[...]
1037 1037
1038 1038
1039Note: It is sometimes better to enable or disable tracing directly from 1039Note: ftrace uses ring buffers to store the above entries. The newest data
1040a program, because the buffer may be overflowed by the echo commands 1040may overwrite the oldest data. Sometimes using echo to stop the trace
1041before you get to the point you want to trace. It is also easier to 1041is not sufficient because the tracing could have overwritten the data
1042stop the tracing at the point that you hit the part that you are 1042that you wanted to record. For this reason, it is sometimes better to
1043interested in. Since the ftrace buffer is a ring buffer with the 1043disable tracing directly from a program. This allows you to stop the
1044oldest data being overwritten, usually it is sufficient to start the 1044tracing at the point that you hit the part that you are interested in.
1045tracer with an echo command but have you code stop it. Something 1045To disable the tracing directly from a C program, something like following
1046like the following is usually appropriate for this. 1046code snippet can be used:
1047 1047
1048int trace_fd; 1048int trace_fd;
1049[...] 1049[...]
@@ -1052,25 +1052,31 @@ int main(int argc, char *argv[]) {
1052 trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY); 1052 trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY);
1053 [...] 1053 [...]
1054 if (condition_hit()) { 1054 if (condition_hit()) {
1055 write(trace_fd, "0", 1); 1055 write(trace_fd, "0", 1);
1056 } 1056 }
1057 [...] 1057 [...]
1058} 1058}
1059 1059
1060Note: Here we hard coded the path name. The debugfs mount is not
1061guaranteed to be at /debug (and is more commonly at /sys/kernel/debug).
1062For simple one time traces, the above is sufficent. For anything else,
1063a search through /proc/mounts may be needed to find where the debugfs
1064file-system is mounted.
1060 1065
1061dynamic ftrace 1066dynamic ftrace
1062-------------- 1067--------------
1063 1068
1064If CONFIG_DYNAMIC_FTRACE is set, then the system will run with 1069If CONFIG_DYNAMIC_FTRACE is set, the system will run with
1065virtually no overhead when function tracing is disabled. The way 1070virtually no overhead when function tracing is disabled. The way
1066this works is the mcount function call (placed at the start of 1071this works is the mcount function call (placed at the start of
1067every kernel function, produced by the -pg switch in gcc), starts 1072every kernel function, produced by the -pg switch in gcc), starts
1068of pointing to a simple return. 1073of pointing to a simple return. (Enabling FTRACE will include the
1074-pg switch in the compiling of the kernel.)
1069 1075
1070When dynamic ftrace is initialized, it calls kstop_machine to make it 1076When dynamic ftrace is initialized, it calls kstop_machine to make
1071act like a uniprocessor so that it can freely modify code without 1077the machine act like a uniprocessor so that it can freely modify code
1072worrying about other processors executing that same code. At 1078without worrying about other processors executing that same code. At
1073initialization, the mcount calls are change to call a "record_ip" 1079initialization, the mcount calls are changed to call a "record_ip"
1074function. After this, the first time a kernel function is called, 1080function. After this, the first time a kernel function is called,
1075it has the calling address saved in a hash table. 1081it has the calling address saved in a hash table.
1076 1082
@@ -1078,15 +1084,15 @@ Later on the ftraced kernel thread is awoken and will again call
1078kstop_machine if new functions have been recorded. The ftraced thread 1084kstop_machine if new functions have been recorded. The ftraced thread
1079will change all calls to mcount to "nop". Just calling mcount 1085will change all calls to mcount to "nop". Just calling mcount
1080and having mcount return has shown a 10% overhead. By converting 1086and having mcount return has shown a 10% overhead. By converting
1081it to a nop, there is no recordable overhead to the system. 1087it to a nop, there is no measurable overhead to the system.
1082 1088
1083One special side-effect to the recording of the functions being 1089One special side-effect to the recording of the functions being
1084traced, is that we can now selectively choose which functions we 1090traced is that we can now selectively choose which functions we
1085want to trace and which ones we want the mcount calls to remain as 1091wish to trace and which ones we want the mcount calls to remain as
1086nops. 1092nops.
1087 1093
1088Two files that contain to the enabling and disabling of recorded 1094Two files are used, one for enabling and one for disabling the tracing
1089functions are: 1095of specified functions. They are:
1090 1096
1091 set_ftrace_filter 1097 set_ftrace_filter
1092 1098
@@ -1094,7 +1100,7 @@ and
1094 1100
1095 set_ftrace_notrace 1101 set_ftrace_notrace
1096 1102
1097A list of available functions that you can add to this files is listed 1103A list of available functions that you can add to these files is listed
1098in: 1104in:
1099 1105
1100 available_filter_functions 1106 available_filter_functions
@@ -1108,7 +1114,7 @@ pick_next_task_fair
1108mutex_lock 1114mutex_lock
1109[...] 1115[...]
1110 1116
1111If I'm only interested in sys_nanosleep and hrtimer_interrupt: 1117If I am only interested in sys_nanosleep and hrtimer_interrupt:
1112 1118
1113 # echo sys_nanosleep hrtimer_interrupt \ 1119 # echo sys_nanosleep hrtimer_interrupt \
1114 > /debug/tracing/set_ftrace_filter 1120 > /debug/tracing/set_ftrace_filter
@@ -1125,21 +1131,21 @@ If I'm only interested in sys_nanosleep and hrtimer_interrupt:
1125 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call 1131 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call
1126 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt 1132 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
1127 1133
1128To see what functions are being traced, you can cat the file: 1134To see which functions are being traced, you can cat the file:
1129 1135
1130 # cat /debug/tracing/set_ftrace_filter 1136 # cat /debug/tracing/set_ftrace_filter
1131hrtimer_interrupt 1137hrtimer_interrupt
1132sys_nanosleep 1138sys_nanosleep
1133 1139
1134 1140
1135Perhaps this isn't enough. The filters also allow simple wild cards. 1141Perhaps this is not enough. The filters also allow simple wild cards.
1136Only the following is currently available 1142Only the following are currently available
1137 1143
1138 <match>* - will match functions that begins with <match> 1144 <match>* - will match functions that begin with <match>
1139 *<match> - will match functions that end with <match> 1145 *<match> - will match functions that end with <match>
1140 *<match>* - will match functions that have <match> in it 1146 *<match>* - will match functions that have <match> in it
1141 1147
1142Thats all the wild cards that are allowed. 1148These are the only wild cards which are supported.
1143 1149
1144 <match>*<match> will not work. 1150 <match>*<match> will not work.
1145 1151
@@ -1187,7 +1193,7 @@ This is because the '>' and '>>' act just like they do in bash.
1187To rewrite the filters, use '>' 1193To rewrite the filters, use '>'
1188To append to the filters, use '>>' 1194To append to the filters, use '>>'
1189 1195
1190To clear out a filter so that all functions will be recorded again. 1196To clear out a filter so that all functions will be recorded again:
1191 1197
1192 # echo > /debug/tracing/set_ftrace_filter 1198 # echo > /debug/tracing/set_ftrace_filter
1193 # cat /debug/tracing/set_ftrace_filter 1199 # cat /debug/tracing/set_ftrace_filter
@@ -1246,24 +1252,24 @@ ftraced
1246 1252
1247As mentioned above, when dynamic ftrace is configured in, a kernel 1253As mentioned above, when dynamic ftrace is configured in, a kernel
1248thread wakes up once a second and checks to see if there are mcount 1254thread wakes up once a second and checks to see if there are mcount
1249calls that need to be converted into nops. If there is not, then 1255calls that need to be converted into nops. If there are not any, then
1250it simply goes back to sleep. But if there is, it will call 1256it simply goes back to sleep. But if there are some, it will call
1251kstop_machine to convert the calls to nops. 1257kstop_machine to convert the calls to nops.
1252 1258
1253There may be a case that you do not want this added latency. 1259There may be a case in which you do not want this added latency.
1254Perhaps you are doing some audio recording and this activity might 1260Perhaps you are doing some audio recording and this activity might
1255cause skips in the playback. There is an interface to disable 1261cause skips in the playback. There is an interface to disable
1256and enable the ftraced kernel thread. 1262and enable the "ftraced" kernel thread.
1257 1263
1258 # echo 0 > /debug/tracing/ftraced_enabled 1264 # echo 0 > /debug/tracing/ftraced_enabled
1259 1265
1260This will disable the calling of the kstop_machine to update the 1266This will disable the calling of kstop_machine to update the
1261mcount calls to nops. Remember that there's a large overhead 1267mcount calls to nops. Remember that there is a large overhead
1262to calling mcount. Without this kernel thread, that overhead will 1268to calling mcount. Without this kernel thread, that overhead will
1263exist. 1269exist.
1264 1270
1265Any write to the ftraced_enabled file will cause the kstop_machine 1271If there are recorded calls to mcount, any write to the ftraced_enabled
1266to run if there are recorded calls to mcount. This means that a 1272file will cause the kstop_machine to run. This means that a
1267user can manually perform the updates when they want to by simply 1273user can manually perform the updates when they want to by simply
1268echoing a '0' into the ftraced_enabled file. 1274echoing a '0' into the ftraced_enabled file.
1269 1275
@@ -1274,8 +1280,8 @@ that uses ftrace function recording.
1274trace_pipe 1280trace_pipe
1275---------- 1281----------
1276 1282
1277The trace_pipe outputs the same as trace, but the effect on the 1283The trace_pipe outputs the same content as the trace file, but the effect
1278tracing is different. Every read from trace_pipe is consumed. 1284on the tracing is different. Every read from trace_pipe is consumed.
1279This means that subsequent reads will be different. The trace 1285This means that subsequent reads will be different. The trace
1280is live. 1286is live.
1281 1287
@@ -1305,7 +1311,7 @@ is live.
1305 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up 1311 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1306 1312
1307 1313
1308Note, reading the trace_pipe will block until more input is added. 1314Note, reading the trace_pipe file will block until more input is added.
1309By changing the tracer, trace_pipe will issue an EOF. We needed 1315By changing the tracer, trace_pipe will issue an EOF. We needed
1310to set the ftrace tracer _before_ cating the trace_pipe file. 1316to set the ftrace tracer _before_ cating the trace_pipe file.
1311 1317
@@ -1314,8 +1320,8 @@ trace entries
1314------------- 1320-------------
1315 1321
1316Having too much or not enough data can be troublesome in diagnosing 1322Having too much or not enough data can be troublesome in diagnosing
1317some issue in the kernel. The file trace_entries is used to modify 1323an issue in the kernel. The file trace_entries is used to modify
1318the size of the internal trace buffers. The numbers listed 1324the size of the internal trace buffers. The number listed
1319is the number of entries that can be recorded per CPU. To know 1325is the number of entries that can be recorded per CPU. To know
1320the full size, multiply the number of possible CPUS with the 1326the full size, multiply the number of possible CPUS with the
1321number of entries. 1327number of entries.
@@ -1323,8 +1329,9 @@ number of entries.
1323 # cat /debug/tracing/trace_entries 1329 # cat /debug/tracing/trace_entries
132465620 133065620
1325 1331
1326Note, to modify this you must have tracing fulling disabled. To do that, 1332Note, to modify this, you must have tracing completely disabled. To do that,
1327echo "none" into the current_tracer. 1333echo "none" into the current_tracer. If the current_tracer is not set
1334to "none", an EINVAL error will be returned.
1328 1335
1329 # echo none > /debug/tracing/current_tracer 1336 # echo none > /debug/tracing/current_tracer
1330 # echo 100000 > /debug/tracing/trace_entries 1337 # echo 100000 > /debug/tracing/trace_entries
@@ -1333,18 +1340,18 @@ echo "none" into the current_tracer.
1333 1340
1334 1341
1335Notice that we echoed in 100,000 but the size is 100,045. The entries 1342Notice that we echoed in 100,000 but the size is 100,045. The entries
1336are held by individual pages. It allocates the number of pages it takes 1343are held in individual pages. It allocates the number of pages it takes
1337to fulfill the request. If more entries may fit on the last page 1344to fulfill the request. If more entries may fit on the last page
1338it will add them. 1345then they will be added.
1339 1346
1340 # echo 1 > /debug/tracing/trace_entries 1347 # echo 1 > /debug/tracing/trace_entries
1341 # cat /debug/tracing/trace_entries 1348 # cat /debug/tracing/trace_entries
134285 134985
1343 1350
1344This shows us that 85 entries can fit on a single page. 1351This shows us that 85 entries can fit in a single page.
1345 1352
1346The number of pages that will be allocated is a percentage of available 1353The number of pages which will be allocated is limited to a percentage
1347memory. Allocating too much will produces an error. 1354of available memory. Allocating too much will produce an error.
1348 1355
1349 # echo 1000000000000 > /debug/tracing/trace_entries 1356 # echo 1000000000000 > /debug/tracing/trace_entries
1350-bash: echo: write error: Cannot allocate memory 1357-bash: echo: write error: Cannot allocate memory