aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ftrace.txt1353
-rw-r--r--Documentation/networking/ip-sysctl.txt256
2 files changed, 1567 insertions, 42 deletions
diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
new file mode 100644
index 000000000000..13e4bf054c38
--- /dev/null
+++ b/Documentation/ftrace.txt
@@ -0,0 +1,1353 @@
1 ftrace - Function Tracer
2 ========================
3
4Copyright 2008 Red Hat Inc.
5Author: Steven Rostedt <srostedt@redhat.com>
6
7
8Introduction
9------------
10
11Ftrace is an internal tracer designed to help out developers and
12designers of systems to find what is going on inside the kernel.
13It can be used for debugging or analyzing latencies and performance
14issues that take place outside of user-space.
15
16Although ftrace is the function tracer, it also includes an
17infrastructure that allows for other types of tracing. Some of the
18tracers that are currently in ftrace is a tracer to trace
19context switches, the time it takes for a high priority task to
20run after it was woken up, the time interrupts are disabled, and
21more.
22
23
24The File System
25---------------
26
27Ftrace uses the debugfs file system to hold the control files as well
28as the files to display output.
29
30To mount the debugfs system:
31
32 # mkdir /debug
33 # mount -t debugfs nodev /debug
34
35
36That's it! (assuming that you have ftrace configured into your kernel)
37
38After mounting the debugfs, you can see a directory called
39"tracing". This directory contains the control and output files
40of ftrace. Here is a list of some of the key files:
41
42
43 Note: all time values are in microseconds.
44
45 current_tracer : This is used to set or display the current tracer
46 that is configured.
47
48 available_tracers : This holds the different types of tracers that
49 has been compiled into the kernel. The tracers
50 listed here can be configured by echoing in their
51 name into current_tracer.
52
53 tracing_enabled : This sets or displays whether the current_tracer
54 is activated and tracing or not. Echo 0 into this
55 file to disable the tracer or 1 (or non-zero) to
56 enable it.
57
58 trace : This file holds the output of the trace in a human readable
59 format.
60
61 latency_trace : This file shows the same trace but the information
62 is organized more to display possible latencies
63 in the system.
64
65 trace_pipe : The output is the same as the "trace" file but this
66 file is meant to be streamed with live tracing.
67 Reads from this file will block until new data
68 is retrieved. Unlike the "trace" and "latency_trace"
69 files, this file is a consumer. This means reading
70 from this file causes sequential reads to display
71 more current data. Once data is read from this
72 file, it is consumed, and will not be read
73 again with a sequential read. The "trace" and
74 "latency_trace" files are static, and if the
75 tracer isn't adding more data, they will display
76 the same information every time they are read.
77
78 iter_ctrl : This file lets the user control the amount of data
79 that is displayed in one of the above output
80 files.
81
82 trace_max_latency : Some of the tracers record the max latency.
83 For example, the time interrupts are disabled.
84 This time is saved in this file. The max trace
85 will also be stored, and displayed by either
86 "trace" or "latency_trace". A new max trace will
87 only be recorded if the latency is greater than
88 the value in this file. (in microseconds)
89
90 trace_entries : This sets or displays the number of trace
91 entries each CPU buffer can hold. The tracer buffers
92 are the same size for each CPU, so care must be
93 taken when modifying the trace_entries. The number
94 of actually entries will be the number given
95 times the number of possible CPUS. The buffers
96 are saved as individual pages, and the actual entries
97 will always be rounded up to entries per page.
98
99 This can only be updated when the current_tracer
100 is set to "none".
101
102 NOTE: It is planned on changing the allocated buffers
103 from being the number of possible CPUS to
104 the number of online CPUS.
105
106 tracing_cpumask : This is a mask that lets the user only trace
107 on specified CPUS. The format is a hex string
108 representing the CPUS.
109
110 set_ftrace_filter : When dynamic ftrace is configured in, the
111 code is dynamically modified to disable calling
112 of the function profiler (mcount). This lets
113 tracing be configured in with practically no overhead
114 in performance. This also has a side effect of
115 enabling or disabling specific functions to be
116 traced. Echoing in names of functions into this
117 file will limit the trace to only those files.
118
119 set_ftrace_notrace: This has the opposite effect that
120 set_ftrace_filter has. Any function that is added
121 here will not be traced. If a function exists
122 in both set_ftrace_filter and set_ftrace_notrace
123 the function will _not_ bet traced.
124
125 available_filter_functions : When a function is encountered the first
126 time by the dynamic tracer, it is recorded and
127 later the call is converted into a nop. This file
128 lists the functions that have been recorded
129 by the dynamic tracer and these functions can
130 be used to set the ftrace filter by the above
131 "set_ftrace_filter" file.
132
133
134The Tracers
135-----------
136
137Here are the list of current tracers that can be configured.
138
139 ftrace - function tracer that uses mcount to trace all functions.
140 It is possible to filter out which functions that are
141 traced when dynamic ftrace is configured in.
142
143 sched_switch - traces the context switches between tasks.
144
145 irqsoff - traces the areas that disable interrupts and saves off
146 the trace with the longest max latency.
147 See tracing_max_latency. When a new max is recorded,
148 it replaces the old trace. It is best to view this
149 trace with the latency_trace file.
150
151 preemptoff - Similar to irqsoff but traces and records the time
152 preemption is disabled.
153
154 preemptirqsoff - Similar to irqsoff and preemptoff, but traces and
155 records the largest time irqs and/or preemption is
156 disabled.
157
158 wakeup - Traces and records the max latency that it takes for
159 the highest priority task to get scheduled after
160 it has been woken up.
161
162 none - This is not a tracer. To remove all tracers from tracing
163 simply echo "none" into current_tracer.
164
165
166Examples of using the tracer
167----------------------------
168
169Here are typical examples of using the tracers with only controlling
170them with the debugfs interface (without using any user-land utilities).
171
172Output format:
173--------------
174
175Here's an example of the output format of the file "trace"
176
177 --------
178# tracer: ftrace
179#
180# TASK-PID CPU# TIMESTAMP FUNCTION
181# | | | | |
182 bash-4251 [01] 10152.583854: path_put <-path_walk
183 bash-4251 [01] 10152.583855: dput <-path_put
184 bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
185 --------
186
187A header is printed with the trace that is represented. In this case
188the tracer is "ftrace". Then a header showing the format. Task name
189"bash", the task PID "4251", the CPU that it was running on
190"01", the timestamp in <secs>.<usecs> format, the function name that was
191traced "path_put" and the parent function that called this function
192"path_walk".
193
194The sched_switch tracer also includes tracing of task wake ups and
195context switches.
196
197 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S
198 ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S
199 ksoftirqd/1-7 [01] 1453.070013: 7:115:R ==> 10:115:R
200 events/1-10 [01] 1453.070013: 10:115:S ==> 2916:115:R
201 kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R
202 ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R
203
204Wake ups are represented by a "+" and the context switches show
205"==>". The format is:
206
207 Context switches:
208
209 Previous task Next Task
210
211 <pid>:<prio>:<state> ==> <pid>:<prio>:<state>
212
213 Wake ups:
214
215 Current task Task waking up
216
217 <pid>:<prio>:<state> + <pid>:<prio>:<state>
218
219The prio is the internal kernel priority, which is inverse to the
220priority that is usually displayed by user-space tools. Zero represents
221the highest priority (99). Prio 100 starts the "nice" priorities with
222100 being equal to nice -20 and 139 being nice 19. The prio "140" is
223reserved for the idle task which is the lowest priority thread (pid 0).
224
225
226Latency trace format
227--------------------
228
229For traces that display latency times, the latency_trace file gives
230a bit more information to see why a latency happened. Here's a typical
231trace.
232
233# tracer: irqsoff
234#
235irqsoff latency trace v1.1.5 on 2.6.26-rc8
236--------------------------------------------------------------------
237 latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
238 -----------------
239 | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0)
240 -----------------
241 => started at: apic_timer_interrupt
242 => ended at: do_softirq
243
244# _------=> CPU#
245# / _-----=> irqs-off
246# | / _----=> need-resched
247# || / _---=> hardirq/softirq
248# ||| / _--=> preempt-depth
249# |||| /
250# ||||| delay
251# cmd pid ||||| time | caller
252# \ / ||||| \ | /
253 <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt)
254 <idle>-0 0d.s. 97us : __do_softirq (do_softirq)
255 <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq)
256
257
258vim:ft=help
259
260
261This shows that the current tracer is "irqsoff" tracing the time
262interrupts are disabled. It gives the trace version and the kernel
263this was executed on (2.6.26-rc8). Then it displays the max latency
264in microsecs (97 us). The number of trace entries displayed
265by the total number recorded (both are three: #3/3). The type of
266preemption that was used (PREEMPT). VP, KP, SP, and HP are always zero
267and reserved for later use. #P is the number of online CPUS (#P:2).
268
269The task is the process that was running when the latency happened.
270(swapper pid: 0).
271
272The start and stop that caused the latencies:
273
274 apic_timer_interrupt is where the interrupts were disabled.
275 do_softirq is where they were enabled again.
276
277The next lines after the header are the trace itself. The header
278explains which is which.
279
280 cmd: The name of the process in the trace.
281
282 pid: The PID of that process.
283
284 CPU#: The CPU that the process was running on.
285
286 irqs-off: 'd' interrupts are disabled. '.' otherwise.
287
288 need-resched: 'N' task need_resched is set, '.' otherwise.
289
290 hardirq/softirq:
291 'H' - hard irq happened inside a softirq.
292 'h' - hard irq is running
293 's' - soft irq is running
294 '.' - normal context.
295
296 preempt-depth: The level of preempt_disabled
297
298The above is mostly meaningful for kernel developers.
299
300 time: This differs from the trace output where as the trace output
301 contained a absolute timestamp. This timestamp is relative
302 to the start of the first entry in the the trace.
303
304 delay: This is just to help catch your eye a bit better. And
305 needs to be fixed to be only relative to the same CPU.
306 The marks is determined by the difference between this
307 current trace and the next trace.
308 '!' - greater than preempt_mark_thresh (default 100)
309 '+' - greater than 1 microsecond
310 ' ' - less than or equal to 1 microsecond.
311
312 The rest is the same as the 'trace' file.
313
314
315iter_ctrl
316---------
317
318The iter_ctrl file is used to control what gets printed in the trace
319output. To see what is available, simply cat the file:
320
321 cat /debug/tracing/iter_ctrl
322 print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \
323 noblock nostacktrace nosched-tree
324
325To disable one of the options, echo in the option appended with "no".
326
327 echo noprint-parent > /debug/tracing/iter_ctrl
328
329To enable an option, leave off the "no".
330
331 echo sym-offest > /debug/tracing/iter_ctrl
332
333Here are the available options:
334
335 print-parent - On function traces, display the calling function
336 as well as the function being traced.
337
338 print-parent:
339 bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul
340
341 noprint-parent:
342 bash-4000 [01] 1477.606694: simple_strtoul
343
344
345 sym-offset - Display not only the function name, but also the offset
346 in the function. For example, instead of seeing just
347 "ktime_get" you will see "ktime_get+0xb/0x20"
348
349 sym-offset:
350 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
351
352 sym-addr - this will also display the function address as well as
353 the function name.
354
355 sym-addr:
356 bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
357
358 verbose - This deals with the latency_trace file.
359
360 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
361 (+0.000ms): simple_strtoul (strict_strtoul)
362
363 raw - This will display raw numbers. This option is best for use with
364 user applications that can translate the raw numbers better than
365 having it done in the kernel.
366
367 hex - similar to raw, but the numbers will be in a hexadecimal format.
368
369 bin - This will print out the formats in raw binary.
370
371 block - TBD (needs update)
372
373 stacktrace - This is one of the options that changes the trace itself.
374 When a trace is recorded, so is the stack of functions.
375 This allows for back traces of trace sites.
376
377 sched-tree - TBD (any users??)
378
379
380sched_switch
381------------
382
383This tracer simply records schedule switches. Here's an example
384on how to implement it.
385
386 # echo sched_switch > /debug/tracing/current_tracer
387 # echo 1 > /debug/tracing/tracing_enabled
388 # sleep 1
389 # echo 0 > /debug/tracing/tracing_enabled
390 # cat /debug/tracing/trace
391
392# tracer: sched_switch
393#
394# TASK-PID CPU# TIMESTAMP FUNCTION
395# | | | | |
396 bash-3997 [01] 240.132281: 3997:120:R + 4055:120:R
397 bash-3997 [01] 240.132284: 3997:120:R ==> 4055:120:R
398 sleep-4055 [01] 240.132371: 4055:120:S ==> 3997:120:R
399 bash-3997 [01] 240.132454: 3997:120:R + 4055:120:S
400 bash-3997 [01] 240.132457: 3997:120:R ==> 4055:120:R
401 sleep-4055 [01] 240.132460: 4055:120:D ==> 3997:120:R
402 bash-3997 [01] 240.132463: 3997:120:R + 4055:120:D
403 bash-3997 [01] 240.132465: 3997:120:R ==> 4055:120:R
404 <idle>-0 [00] 240.132589: 0:140:R + 4:115:S
405 <idle>-0 [00] 240.132591: 0:140:R ==> 4:115:R
406 ksoftirqd/0-4 [00] 240.132595: 4:115:S ==> 0:140:R
407 <idle>-0 [00] 240.132598: 0:140:R + 4:115:S
408 <idle>-0 [00] 240.132599: 0:140:R ==> 4:115:R
409 ksoftirqd/0-4 [00] 240.132603: 4:115:S ==> 0:140:R
410 sleep-4055 [01] 240.133058: 4055:120:S ==> 3997:120:R
411 [...]
412
413
414As we have discussed previously about this format, the header shows
415the name of the trace and points to the options. The "FUNCTION"
416is a misnomer since here it represents the wake ups and context
417switches.
418
419The sched_switch only lists the wake ups (represented with '+')
420and context switches ('==>') with the previous task or current
421first followed by the next task or task waking up. The format for both
422of these is PID:KERNEL-PRIO:TASK-STATE. Remember that the KERNEL-PRIO
423is the inverse of the actual priority with zero (0) being the highest
424priority and the nice values starting at 100 (nice -20). Below is
425a quick chart to map the kernel priority to user land priorities.
426
427 Kernel priority: 0 to 99 ==> user RT priority 99 to 0
428 Kernel priority: 100 to 139 ==> user nice -20 to 19
429 Kernel priority: 140 ==> idle task priority
430
431The task states are:
432
433 R - running : wants to run, may not actually be running
434 S - sleep : process is waiting to be woken up (handles signals)
435 D - deep sleep : process must be woken up (ignores signals)
436 T - stopped : process suspended
437 t - traced : process is being traced (with something like gdb)
438 Z - zombie : process waiting to be cleaned up
439 X - unknown
440
441
442ftrace_enabled
443--------------
444
445The following tracers give different output depending on whether
446or not the sysctl ftrace_enabled is set. To set ftrace_enabled,
447one can either use the sysctl function or set it via the proc
448file system interface.
449
450 sysctl kernel.ftrace_enabled=1
451
452 or
453
454 echo 1 > /proc/sys/kernel/ftrace_enabled
455
456To disable ftrace_enabled simply replace the '1' with '0' in
457the above commands.
458
459When ftrace_enabled is set the tracers will also record the functions
460that are within the trace. The descriptions of the tracers
461will also show an example with ftrace enabled.
462
463
464irqsoff
465-------
466
467When interrupts are disabled, the CPU can not react to any other
468external event (besides NMIs and SMIs). This prevents the timer
469interrupt from triggering or the mouse interrupt from letting the
470kernel know of a new mouse event. The result is a latency with the
471reaction time.
472
473The irqsoff tracer tracks the time interrupts are disabled and when
474they are re-enabled. When a new maximum latency is hit, it saves off
475the trace so that it may be retrieved at a later time. Every time a
476new maximum in reached, the old saved trace is discarded and the new
477trace is saved.
478
479To reset the maximum, echo 0 into tracing_max_latency. Here's an
480example:
481
482 # echo irqsoff > /debug/tracing/current_tracer
483 # echo 0 > /debug/tracing/tracing_max_latency
484 # echo 1 > /debug/tracing/tracing_enabled
485 # ls -ltr
486 [...]
487 # echo 0 > /debug/tracing/tracing_enabled
488 # cat /debug/tracing/latency_trace
489# tracer: irqsoff
490#
491irqsoff latency trace v1.1.5 on 2.6.26-rc8
492--------------------------------------------------------------------
493 latency: 6 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
494 -----------------
495 | task: bash-4269 (uid:0 nice:0 policy:0 rt_prio:0)
496 -----------------
497 => started at: copy_page_range
498 => ended at: copy_page_range
499
500# _------=> CPU#
501# / _-----=> irqs-off
502# | / _----=> need-resched
503# || / _---=> hardirq/softirq
504# ||| / _--=> preempt-depth
505# |||| /
506# ||||| delay
507# cmd pid ||||| time | caller
508# \ / ||||| \ | /
509 bash-4269 1...1 0us+: _spin_lock (copy_page_range)
510 bash-4269 1...1 7us : _spin_unlock (copy_page_range)
511 bash-4269 1...2 7us : trace_preempt_on (copy_page_range)
512
513
514vim:ft=help
515
516Here we see that that we had a latency of 6 microsecs (which is
517very good). The spin_lock in copy_page_range disabled interrupts.
518The difference between the 6 and the displayed timestamp 7us is
519because the clock must have incremented between the time of recording
520the max latency and recording the function that had that latency.
521
522Note the above had ftrace_enabled not set. If we set the ftrace_enabled
523we get a much larger output:
524
525# tracer: irqsoff
526#
527irqsoff latency trace v1.1.5 on 2.6.26-rc8
528--------------------------------------------------------------------
529 latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
530 -----------------
531 | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0)
532 -----------------
533 => started at: __alloc_pages_internal
534 => ended at: __alloc_pages_internal
535
536# _------=> CPU#
537# / _-----=> irqs-off
538# | / _----=> need-resched
539# || / _---=> hardirq/softirq
540# ||| / _--=> preempt-depth
541# |||| /
542# ||||| delay
543# cmd pid ||||| time | caller
544# \ / ||||| \ | /
545 ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal)
546 ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist)
547 ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk)
548 ls-4339 0d..1 4us : add_preempt_count (_spin_lock)
549 ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk)
550 ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue)
551 ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest)
552 ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk)
553 ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue)
554 ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest)
555 ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk)
556 ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue)
557[...]
558 ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue)
559 ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest)
560 ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk)
561 ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue)
562 ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest)
563 ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk)
564 ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock)
565 ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal)
566 ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal)
567
568
569vim:ft=help
570
571
572Here we traced a 50 microsecond latency. But we also see all the
573functions that were called during that time. Note that enabling
574function tracing we endure an added overhead. This overhead may
575extend the latency times. But never the less, this trace has provided
576some very helpful debugging.
577
578
579preemptoff
580----------
581
582When preemption is disabled we may be able to receive interrupts but
583the task can not be preempted and a higher priority task must wait
584for preemption to be enabled again before it can preempt a lower
585priority task.
586
587The preemptoff tracer traces the places that disables preemption.
588Like the irqsoff, it records the maximum latency that preemption
589was disabled. The control of preemptoff is much like the irqsoff.
590
591 # echo preemptoff > /debug/tracing/current_tracer
592 # echo 0 > /debug/tracing/tracing_max_latency
593 # echo 1 > /debug/tracing/tracing_enabled
594 # ls -ltr
595 [...]
596 # echo 0 > /debug/tracing/tracing_enabled
597 # cat /debug/tracing/latency_trace
598# tracer: preemptoff
599#
600preemptoff latency trace v1.1.5 on 2.6.26-rc8
601--------------------------------------------------------------------
602 latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
603 -----------------
604 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
605 -----------------
606 => started at: do_IRQ
607 => ended at: __do_softirq
608
609# _------=> CPU#
610# / _-----=> irqs-off
611# | / _----=> need-resched
612# || / _---=> hardirq/softirq
613# ||| / _--=> preempt-depth
614# |||| /
615# ||||| delay
616# cmd pid ||||| time | caller
617# \ / ||||| \ | /
618 sshd-4261 0d.h. 0us+: irq_enter (do_IRQ)
619 sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq)
620 sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq)
621
622
623vim:ft=help
624
625This has some more changes. Preemption was disabled when an interrupt
626came in (notice the 'h'), and was enabled while doing a softirq.
627(notice the 's'). But we also see that interrupts have been disabled
628when entering the preempt off section and leaving it (the 'd').
629We do not know if interrupts were enabled in the mean time.
630
631# tracer: preemptoff
632#
633preemptoff latency trace v1.1.5 on 2.6.26-rc8
634--------------------------------------------------------------------
635 latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
636 -----------------
637 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
638 -----------------
639 => started at: remove_wait_queue
640 => ended at: __do_softirq
641
642# _------=> CPU#
643# / _-----=> irqs-off
644# | / _----=> need-resched
645# || / _---=> hardirq/softirq
646# ||| / _--=> preempt-depth
647# |||| /
648# ||||| delay
649# cmd pid ||||| time | caller
650# \ / ||||| \ | /
651 sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue)
652 sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue)
653 sshd-4261 0d..1 2us : do_IRQ (common_interrupt)
654 sshd-4261 0d..1 2us : irq_enter (do_IRQ)
655 sshd-4261 0d..1 2us : idle_cpu (irq_enter)
656 sshd-4261 0d..1 3us : add_preempt_count (irq_enter)
657 sshd-4261 0d.h1 3us : idle_cpu (irq_enter)
658 sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ)
659[...]
660 sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock)
661 sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq)
662 sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq)
663 sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq)
664 sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock)
665 sshd-4261 0d.h1 14us : irq_exit (do_IRQ)
666 sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit)
667 sshd-4261 0d..2 15us : do_softirq (irq_exit)
668 sshd-4261 0d... 15us : __do_softirq (do_softirq)
669 sshd-4261 0d... 16us : __local_bh_disable (__do_softirq)
670 sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable)
671 sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable)
672 sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable)
673 sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable)
674[...]
675 sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable)
676 sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable)
677 sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable)
678 sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable)
679 sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip)
680 sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip)
681 sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable)
682 sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable)
683[...]
684 sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq)
685 sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq)
686
687
688The above is an example of the preemptoff trace with ftrace_enabled
689set. Here we see that interrupts were disabled the entire time.
690The irq_enter code lets us know that we entered an interrupt 'h'.
691Before that, the functions being traced still show that it is not
692in an interrupt, but we can see by the functions themselves that
693this is not the case.
694
695Notice that the __do_softirq when called doesn't have a preempt_count.
696It may seem that we missed a preempt enabled. What really happened
697is that the preempt count is held on the threads stack and we
698switched to the softirq stack (4K stacks in effect). The code
699does not copy the preempt count, but because interrupts are disabled
700we don't need to worry about it. Having a tracer like this is good
701to let people know what really happens inside the kernel.
702
703
704preemptirqsoff
705--------------
706
707Knowing the locations that have interrupts disabled or preemption
708disabled for the longest times is helpful. But sometimes we would
709like to know when either preemption and/or interrupts are disabled.
710
711The following code:
712
713 local_irq_disable();
714 call_function_with_irqs_off();
715 preempt_disable();
716 call_function_with_irqs_and_preemption_off();
717 local_irq_enable();
718 call_function_with_preemption_off();
719 preempt_enable();
720
721The irqsoff tracer will record the total length of
722call_function_with_irqs_off() and
723call_function_with_irqs_and_preemption_off().
724
725The preemptoff tracer will record the total length of
726call_function_with_irqs_and_preemption_off() and
727call_function_with_preemption_off().
728
729But neither will trace the time that interrupts and/or preemption
730is disabled. This total time is the time that we can not schedule.
731To record this time, use the preemptirqsoff tracer.
732
733Again, using this trace is much like the irqsoff and preemptoff tracers.
734
735 # echo preemptoff > /debug/tracing/current_tracer
736 # echo 0 > /debug/tracing/tracing_max_latency
737 # echo 1 > /debug/tracing/tracing_enabled
738 # ls -ltr
739 [...]
740 # echo 0 > /debug/tracing/tracing_enabled
741 # cat /debug/tracing/latency_trace
742# tracer: preemptirqsoff
743#
744preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
745--------------------------------------------------------------------
746 latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
747 -----------------
748 | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0)
749 -----------------
750 => started at: apic_timer_interrupt
751 => ended at: __do_softirq
752
753# _------=> CPU#
754# / _-----=> irqs-off
755# | / _----=> need-resched
756# || / _---=> hardirq/softirq
757# ||| / _--=> preempt-depth
758# |||| /
759# ||||| delay
760# cmd pid ||||| time | caller
761# \ / ||||| \ | /
762 ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt)
763 ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq)
764 ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq)
765
766
767vim:ft=help
768
769
770The trace_hardirqs_off_thunk is called from assembly on x86 when
771interrupts are disabled in the assembly code. Without the function
772tracing, we don't know if interrupts were enabled within the preemption
773points. We do see that it started with preemption enabled.
774
775Here is a trace with ftrace_enabled set:
776
777
778# tracer: preemptirqsoff
779#
780preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
781--------------------------------------------------------------------
782 latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
783 -----------------
784 | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0)
785 -----------------
786 => started at: write_chan
787 => ended at: __do_softirq
788
789# _------=> CPU#
790# / _-----=> irqs-off
791# | / _----=> need-resched
792# || / _---=> hardirq/softirq
793# ||| / _--=> preempt-depth
794# |||| /
795# ||||| delay
796# cmd pid ||||| time | caller
797# \ / ||||| \ | /
798 ls-4473 0.N.. 0us : preempt_schedule (write_chan)
799 ls-4473 0dN.1 1us : _spin_lock (schedule)
800 ls-4473 0dN.1 2us : add_preempt_count (_spin_lock)
801 ls-4473 0d..2 2us : put_prev_task_fair (schedule)
802[...]
803 ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts)
804 ls-4473 0d..2 13us : __switch_to (schedule)
805 sshd-4261 0d..2 14us : finish_task_switch (schedule)
806 sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch)
807 sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave)
808 sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set)
809 sshd-4261 0d..2 16us : do_IRQ (common_interrupt)
810 sshd-4261 0d..2 17us : irq_enter (do_IRQ)
811 sshd-4261 0d..2 17us : idle_cpu (irq_enter)
812 sshd-4261 0d..2 18us : add_preempt_count (irq_enter)
813 sshd-4261 0d.h2 18us : idle_cpu (irq_enter)
814 sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ)
815 sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq)
816 sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock)
817 sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq)
818 sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock)
819[...]
820 sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq)
821 sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock)
822 sshd-4261 0d.h2 29us : irq_exit (do_IRQ)
823 sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit)
824 sshd-4261 0d..3 30us : do_softirq (irq_exit)
825 sshd-4261 0d... 30us : __do_softirq (do_softirq)
826 sshd-4261 0d... 31us : __local_bh_disable (__do_softirq)
827 sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable)
828 sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable)
829[...]
830 sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip)
831 sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip)
832 sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt)
833 sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt)
834 sshd-4261 0d.s3 45us : idle_cpu (irq_enter)
835 sshd-4261 0d.s3 46us : add_preempt_count (irq_enter)
836 sshd-4261 0d.H3 46us : idle_cpu (irq_enter)
837 sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt)
838 sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt)
839[...]
840 sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt)
841 sshd-4261 0d.H3 82us : ktime_get (tick_program_event)
842 sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get)
843 sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts)
844 sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts)
845 sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event)
846 sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event)
847 sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt)
848 sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit)
849 sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit)
850 sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable)
851[...]
852 sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action)
853 sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq)
854 sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq)
855 sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq)
856 sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable)
857 sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq)
858 sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq)
859
860
861This is a very interesting trace. It started with the preemption of
862the ls task. We see that the task had the "need_resched" bit set
863with the 'N' in the trace. Interrupts are disabled in the spin_lock
864and the trace started. We see that a schedule took place to run
865sshd. When the interrupts were enabled we took an interrupt.
866On return of the interrupt the softirq ran. We took another interrupt
867while running the softirq as we see with the capital 'H'.
868
869
870wakeup
871------
872
873In Real-Time environment it is very important to know the wakeup
874time it takes for the highest priority task that wakes up to the
875time it executes. This is also known as "schedule latency".
876I stress the point that this is about RT tasks. It is also important
877to know the scheduling latency of non-RT tasks, but the average
878schedule latency is better for non-RT tasks. Tools like
879LatencyTop is more appropriate for such measurements.
880
881Real-Time environments is interested in the worst case latency.
882That is the longest latency it takes for something to happen, and
883not the average. We can have a very fast scheduler that may only
884have a large latency once in a while, but that would not work well
885with Real-Time tasks. The wakeup tracer was designed to record
886the worst case wakeups of RT tasks. Non-RT tasks are not recorded
887because the tracer only records one worst case and tracing non-RT
888tasks that are unpredictable will overwrite the worst case latency
889of RT tasks.
890
891Since this tracer only deals with RT tasks, we will run this slightly
892different than we did with the previous tracers. Instead of performing
893an 'ls' we will run 'sleep 1' under 'chrt' which changes the
894priority of the task.
895
896 # echo wakeup > /debug/tracing/current_tracer
897 # echo 0 > /debug/tracing/tracing_max_latency
898 # echo 1 > /debug/tracing/tracing_enabled
899 # chrt -f 5 sleep 1
900 # echo 0 > /debug/tracing/tracing_enabled
901 # cat /debug/tracing/latency_trace
902# tracer: wakeup
903#
904wakeup latency trace v1.1.5 on 2.6.26-rc8
905--------------------------------------------------------------------
906 latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
907 -----------------
908 | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5)
909 -----------------
910
911# _------=> CPU#
912# / _-----=> irqs-off
913# | / _----=> need-resched
914# || / _---=> hardirq/softirq
915# ||| / _--=> preempt-depth
916# |||| /
917# ||||| delay
918# cmd pid ||||| time | caller
919# \ / ||||| \ | /
920 <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process)
921 <idle>-0 1d..4 4us : schedule (cpu_idle)
922
923
924vim:ft=help
925
926
927Running this on an idle system we see that it only took 4 microseconds
928to perform the task switch. Note, since the trace marker in the
929schedule is before the actual "switch" we stop the tracing when
930the recorded task is about to schedule in. This may change if
931we add a new marker at the end of the scheduler.
932
933Notice that the recorded task is 'sleep' with the PID of 4901 and it
934has an rt_prio of 5. This priority is user-space priority and not
935the internal kernel priority. The policy is 1 for SCHED_FIFO and 2
936for SCHED_RR.
937
938Doing the same with chrt -r 5 and ftrace_enabled set.
939
940# tracer: wakeup
941#
942wakeup latency trace v1.1.5 on 2.6.26-rc8
943--------------------------------------------------------------------
944 latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2)
945 -----------------
946 | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5)
947 -----------------
948
949# _------=> CPU#
950# / _-----=> irqs-off
951# | / _----=> need-resched
952# || / _---=> hardirq/softirq
953# ||| / _--=> preempt-depth
954# |||| /
955# ||||| delay
956# cmd pid ||||| time | caller
957# \ / ||||| \ | /
958ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process)
959ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb)
960ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up)
961ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup)
962ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr)
963ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup)
964ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up)
965ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up)
966[...]
967ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt)
968ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit)
969ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit)
970ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq)
971[...]
972ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks)
973ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq)
974ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable)
975ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd)
976ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd)
977ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched)
978ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched)
979ksoftirq-7 1.N.2 33us : schedule (__cond_resched)
980ksoftirq-7 1.N.2 33us : add_preempt_count (schedule)
981ksoftirq-7 1.N.3 34us : hrtick_clear (schedule)
982ksoftirq-7 1dN.3 35us : _spin_lock (schedule)
983ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock)
984ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule)
985ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair)
986[...]
987ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline)
988ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock)
989ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline)
990ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock)
991ksoftirq-7 1d..4 50us : schedule (__cond_resched)
992
993The interrupt went off while running ksoftirqd. This task runs at
994SCHED_OTHER. Why didn't we see the 'N' set early? This may be
995a harmless bug with x86_32 and 4K stacks. The need_reched() function
996that tests if we need to reschedule looks on the actual stack.
997Where as the setting of the NEED_RESCHED bit happens on the
998task's stack. But because we are in a hard interrupt, the test
999is with the interrupts stack which has that to be false. We don't
1000see the 'N' until we switch back to the task's stack.
1001
1002ftrace
1003------
1004
1005ftrace is not only the name of the tracing infrastructure, but it
1006is also a name of one of the tracers. The tracer is the function
1007tracer. Enabling the function tracer can be done from the
1008debug file system. Make sure the ftrace_enabled is set otherwise
1009this tracer is a nop.
1010
1011 # sysctl kernel.ftrace_enabled=1
1012 # echo ftrace > /debug/tracing/current_tracer
1013 # echo 1 > /debug/tracing/tracing_enabled
1014 # usleep 1
1015 # echo 0 > /debug/tracing/tracing_enabled
1016 # cat /debug/tracing/trace
1017# tracer: ftrace
1018#
1019# TASK-PID CPU# TIMESTAMP FUNCTION
1020# | | | | |
1021 bash-4003 [00] 123.638713: finish_task_switch <-schedule
1022 bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch
1023 bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq
1024 bash-4003 [00] 123.638715: hrtick_set <-schedule
1025 bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set
1026 bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave
1027 bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set
1028 bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore
1029 bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set
1030 bash-4003 [00] 123.638718: sub_preempt_count <-schedule
1031 bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule
1032 bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run
1033 bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion
1034 bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common
1035 bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq
1036[...]
1037
1038
1039Note: It is sometimes better to enable or disable tracing directly from
1040a program, because the buffer may be overflowed by the echo commands
1041before you get to the point you want to trace. It is also easier to
1042stop the tracing at the point that you hit the part that you are
1043interested in. Since the ftrace buffer is a ring buffer with the
1044oldest data being overwritten, usually it is sufficient to start the
1045tracer with an echo command but have you code stop it. Something
1046like the following is usually appropriate for this.
1047
1048int trace_fd;
1049[...]
1050int main(int argc, char *argv[]) {
1051 [...]
1052 trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY);
1053 [...]
1054 if (condition_hit()) {
1055 write(trace_fd, "0", 1);
1056 }
1057 [...]
1058}
1059
1060
1061dynamic ftrace
1062--------------
1063
1064If CONFIG_DYNAMIC_FTRACE is set, then the system will run with
1065virtually no overhead when function tracing is disabled. The way
1066this works is the mcount function call (placed at the start of
1067every kernel function, produced by the -pg switch in gcc), starts
1068of pointing to a simple return.
1069
1070When dynamic ftrace is initialized, it calls kstop_machine to make it
1071act like a uniprocessor so that it can freely modify code without
1072worrying about other processors executing that same code. At
1073initialization, the mcount calls are change to call a "record_ip"
1074function. After this, the first time a kernel function is called,
1075it has the calling address saved in a hash table.
1076
1077Later on the ftraced kernel thread is awoken and will again call
1078kstop_machine if new functions have been recorded. The ftraced thread
1079will change all calls to mcount to "nop". Just calling mcount
1080and having mcount return has shown a 10% overhead. By converting
1081it to a nop, there is no recordable overhead to the system.
1082
1083One special side-effect to the recording of the functions being
1084traced, is that we can now selectively choose which functions we
1085want to trace and which ones we want the mcount calls to remain as
1086nops.
1087
1088Two files that contain to the enabling and disabling of recorded
1089functions are:
1090
1091 set_ftrace_filter
1092
1093and
1094
1095 set_ftrace_notrace
1096
1097A list of available functions that you can add to this files is listed
1098in:
1099
1100 available_filter_functions
1101
1102 # cat /debug/tracing/available_filter_functions
1103put_prev_task_idle
1104kmem_cache_create
1105pick_next_task_rt
1106get_online_cpus
1107pick_next_task_fair
1108mutex_lock
1109[...]
1110
1111If I'm only interested in sys_nanosleep and hrtimer_interrupt:
1112
1113 # echo sys_nanosleep hrtimer_interrupt \
1114 > /debug/tracing/set_ftrace_filter
1115 # echo ftrace > /debug/tracing/current_tracer
1116 # echo 1 > /debug/tracing/tracing_enabled
1117 # usleep 1
1118 # echo 0 > /debug/tracing/tracing_enabled
1119 # cat /debug/tracing/trace
1120# tracer: ftrace
1121#
1122# TASK-PID CPU# TIMESTAMP FUNCTION
1123# | | | | |
1124 usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt
1125 usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call
1126 <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt
1127
1128To see what functions are being traced, you can cat the file:
1129
1130 # cat /debug/tracing/set_ftrace_filter
1131hrtimer_interrupt
1132sys_nanosleep
1133
1134
1135Perhaps this isn't enough. The filters also allow simple wild cards.
1136Only the following is currently available
1137
1138 <match>* - will match functions that begins with <match>
1139 *<match> - will match functions that end with <match>
1140 *<match>* - will match functions that have <match> in it
1141
1142Thats all the wild cards that are allowed.
1143
1144 <match>*<match> will not work.
1145
1146 # echo hrtimer_* > /debug/tracing/set_ftrace_filter
1147
1148Produces:
1149
1150# tracer: ftrace
1151#
1152# TASK-PID CPU# TIMESTAMP FUNCTION
1153# | | | | |
1154 bash-4003 [00] 1480.611794: hrtimer_init <-copy_process
1155 bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set
1156 bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear
1157 bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel
1158 <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt
1159 <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt
1160 <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt
1161 <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt
1162 <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt
1163
1164
1165Notice that we lost the sys_nanosleep.
1166
1167 # cat /debug/tracing/set_ftrace_filter
1168hrtimer_run_queues
1169hrtimer_run_pending
1170hrtimer_init
1171hrtimer_cancel
1172hrtimer_try_to_cancel
1173hrtimer_forward
1174hrtimer_start
1175hrtimer_reprogram
1176hrtimer_force_reprogram
1177hrtimer_get_next_event
1178hrtimer_interrupt
1179hrtimer_nanosleep
1180hrtimer_wakeup
1181hrtimer_get_remaining
1182hrtimer_get_res
1183hrtimer_init_sleeper
1184
1185
1186This is because the '>' and '>>' act just like they do in bash.
1187To rewrite the filters, use '>'
1188To append to the filters, use '>>'
1189
1190To clear out a filter so that all functions will be recorded again.
1191
1192 # echo > /debug/tracing/set_ftrace_filter
1193 # cat /debug/tracing/set_ftrace_filter
1194 #
1195
1196Again, now we want to append.
1197
1198 # echo sys_nanosleep > /debug/tracing/set_ftrace_filter
1199 # cat /debug/tracing/set_ftrace_filter
1200sys_nanosleep
1201 # echo hrtimer_* >> /debug/tracing/set_ftrace_filter
1202 # cat /debug/tracing/set_ftrace_filter
1203hrtimer_run_queues
1204hrtimer_run_pending
1205hrtimer_init
1206hrtimer_cancel
1207hrtimer_try_to_cancel
1208hrtimer_forward
1209hrtimer_start
1210hrtimer_reprogram
1211hrtimer_force_reprogram
1212hrtimer_get_next_event
1213hrtimer_interrupt
1214sys_nanosleep
1215hrtimer_nanosleep
1216hrtimer_wakeup
1217hrtimer_get_remaining
1218hrtimer_get_res
1219hrtimer_init_sleeper
1220
1221
1222The set_ftrace_notrace prevents those functions from being traced.
1223
1224 # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace
1225
1226Produces:
1227
1228# tracer: ftrace
1229#
1230# TASK-PID CPU# TIMESTAMP FUNCTION
1231# | | | | |
1232 bash-4043 [01] 115.281644: finish_task_switch <-schedule
1233 bash-4043 [01] 115.281645: hrtick_set <-schedule
1234 bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set
1235 bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run
1236 bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion
1237 bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run
1238 bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop
1239 bash-4043 [01] 115.281648: wake_up_process <-kthread_stop
1240 bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process
1241
1242We can see that there's no more lock or preempt tracing.
1243
1244ftraced
1245-------
1246
1247As mentioned above, when dynamic ftrace is configured in, a kernel
1248thread wakes up once a second and checks to see if there are mcount
1249calls that need to be converted into nops. If there is not, then
1250it simply goes back to sleep. But if there is, it will call
1251kstop_machine to convert the calls to nops.
1252
1253There may be a case that you do not want this added latency.
1254Perhaps you are doing some audio recording and this activity might
1255cause skips in the playback. There is an interface to disable
1256and enable the ftraced kernel thread.
1257
1258 # echo 0 > /debug/tracing/ftraced_enabled
1259
1260This will disable the calling of the kstop_machine to update the
1261mcount calls to nops. Remember that there's a large overhead
1262to calling mcount. Without this kernel thread, that overhead will
1263exist.
1264
1265Any write to the ftraced_enabled file will cause the kstop_machine
1266to run if there are recorded calls to mcount. This means that a
1267user can manually perform the updates when they want to by simply
1268echoing a '0' into the ftraced_enabled file.
1269
1270The updates are also done at the beginning of enabling a tracer
1271that uses ftrace function recording.
1272
1273
1274trace_pipe
1275----------
1276
1277The trace_pipe outputs the same as trace, but the effect on the
1278tracing is different. Every read from trace_pipe is consumed.
1279This means that subsequent reads will be different. The trace
1280is live.
1281
1282 # echo ftrace > /debug/tracing/current_tracer
1283 # cat /debug/tracing/trace_pipe > /tmp/trace.out &
1284[1] 4153
1285 # echo 1 > /debug/tracing/tracing_enabled
1286 # usleep 1
1287 # echo 0 > /debug/tracing/tracing_enabled
1288 # cat /debug/tracing/trace
1289# tracer: ftrace
1290#
1291# TASK-PID CPU# TIMESTAMP FUNCTION
1292# | | | | |
1293
1294 #
1295 # cat /tmp/trace.out
1296 bash-4043 [00] 41.267106: finish_task_switch <-schedule
1297 bash-4043 [00] 41.267106: hrtick_set <-schedule
1298 bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set
1299 bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run
1300 bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion
1301 bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run
1302 bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop
1303 bash-4043 [00] 41.267110: wake_up_process <-kthread_stop
1304 bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process
1305 bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up
1306
1307
1308Note, reading the trace_pipe will block until more input is added.
1309By changing the tracer, trace_pipe will issue an EOF. We needed
1310to set the ftrace tracer _before_ cating the trace_pipe file.
1311
1312
1313trace entries
1314-------------
1315
1316Having too much or not enough data can be troublesome in diagnosing
1317some issue in the kernel. The file trace_entries is used to modify
1318the size of the internal trace buffers. The numbers listed
1319is the number of entries that can be recorded per CPU. To know
1320the full size, multiply the number of possible CPUS with the
1321number of entries.
1322
1323 # cat /debug/tracing/trace_entries
132465620
1325
1326Note, to modify this you must have tracing fulling disabled. To do that,
1327echo "none" into the current_tracer.
1328
1329 # echo none > /debug/tracing/current_tracer
1330 # echo 100000 > /debug/tracing/trace_entries
1331 # cat /debug/tracing/trace_entries
1332100045
1333
1334
1335Notice that we echoed in 100,000 but the size is 100,045. The entries
1336are held by individual pages. It allocates the number of pages it takes
1337to fulfill the request. If more entries may fit on the last page
1338it will add them.
1339
1340 # echo 1 > /debug/tracing/trace_entries
1341 # cat /debug/tracing/trace_entries
134285
1343
1344This shows us that 85 entries can fit on a single page.
1345
1346The number of pages that will be allocated is a percentage of available
1347memory. Allocating too much will produces an error.
1348
1349 # echo 1000000000000 > /debug/tracing/trace_entries
1350-bash: echo: write error: Cannot allocate memory
1351 # cat /debug/tracing/trace_entries
135285
1353
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 17f1f91af35c..946b66e1b652 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -148,9 +148,9 @@ tcp_available_congestion_control - STRING
148 but not loaded. 148 but not loaded.
149 149
150tcp_base_mss - INTEGER 150tcp_base_mss - INTEGER
151 The initial value of search_low to be used by Packetization Layer 151 The initial value of search_low to be used by the packetization layer
152 Path MTU Discovery (MTU probing). If MTU probing is enabled, 152 Path MTU discovery (MTU probing). If MTU probing is enabled,
153 this is the inital MSS used by the connection. 153 this is the initial MSS used by the connection.
154 154
155tcp_congestion_control - STRING 155tcp_congestion_control - STRING
156 Set the congestion control algorithm to be used for new 156 Set the congestion control algorithm to be used for new
@@ -185,10 +185,9 @@ tcp_frto - INTEGER
185 timeouts. It is particularly beneficial in wireless environments 185 timeouts. It is particularly beneficial in wireless environments
186 where packet loss is typically due to random radio interference 186 where packet loss is typically due to random radio interference
187 rather than intermediate router congestion. F-RTO is sender-side 187 rather than intermediate router congestion. F-RTO is sender-side
188 only modification. Therefore it does not require any support from 188 only modification. Therefore it does not require any support from
189 the peer, but in a typical case, however, where wireless link is 189 the peer.
190 the local access link and most of the data flows downlink, the 190
191 faraway servers should have F-RTO enabled to take advantage of it.
192 If set to 1, basic version is enabled. 2 enables SACK enhanced 191 If set to 1, basic version is enabled. 2 enables SACK enhanced
193 F-RTO if flow uses SACK. The basic version can be used also when 192 F-RTO if flow uses SACK. The basic version can be used also when
194 SACK is in use though scenario(s) with it exists where F-RTO 193 SACK is in use though scenario(s) with it exists where F-RTO
@@ -276,7 +275,7 @@ tcp_mem - vector of 3 INTEGERs: min, pressure, max
276 memory. 275 memory.
277 276
278tcp_moderate_rcvbuf - BOOLEAN 277tcp_moderate_rcvbuf - BOOLEAN
279 If set, TCP performs receive buffer autotuning, attempting to 278 If set, TCP performs receive buffer auto-tuning, attempting to
280 automatically size the buffer (no greater than tcp_rmem[2]) to 279 automatically size the buffer (no greater than tcp_rmem[2]) to
281 match the size required by the path for full throughput. Enabled by 280 match the size required by the path for full throughput. Enabled by
282 default. 281 default.
@@ -336,7 +335,7 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
336 pressure. 335 pressure.
337 Default: 8K 336 Default: 8K
338 337
339 default: default size of receive buffer used by TCP sockets. 338 default: initial size of receive buffer used by TCP sockets.
340 This value overrides net.core.rmem_default used by other protocols. 339 This value overrides net.core.rmem_default used by other protocols.
341 Default: 87380 bytes. This value results in window of 65535 with 340 Default: 87380 bytes. This value results in window of 65535 with
342 default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit 341 default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit
@@ -344,8 +343,10 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max
344 343
345 max: maximal size of receive buffer allowed for automatically 344 max: maximal size of receive buffer allowed for automatically
346 selected receiver buffers for TCP socket. This value does not override 345 selected receiver buffers for TCP socket. This value does not override
347 net.core.rmem_max, "static" selection via SO_RCVBUF does not use this. 346 net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables
348 Default: 87380*2 bytes. 347 automatic tuning of that socket's receive buffer size, in which
348 case this value is ignored.
349 Default: between 87380B and 4MB, depending on RAM size.
349 350
350tcp_sack - BOOLEAN 351tcp_sack - BOOLEAN
351 Enable select acknowledgments (SACKS). 352 Enable select acknowledgments (SACKS).
@@ -358,7 +359,7 @@ tcp_slow_start_after_idle - BOOLEAN
358 Default: 1 359 Default: 1
359 360
360tcp_stdurg - BOOLEAN 361tcp_stdurg - BOOLEAN
361 Use the Host requirements interpretation of the TCP urg pointer field. 362 Use the Host requirements interpretation of the TCP urgent pointer field.
362 Most hosts use the older BSD interpretation, so if you turn this on 363 Most hosts use the older BSD interpretation, so if you turn this on
363 Linux might not communicate correctly with them. 364 Linux might not communicate correctly with them.
364 Default: FALSE 365 Default: FALSE
@@ -371,12 +372,12 @@ tcp_synack_retries - INTEGER
371tcp_syncookies - BOOLEAN 372tcp_syncookies - BOOLEAN
372 Only valid when the kernel was compiled with CONFIG_SYNCOOKIES 373 Only valid when the kernel was compiled with CONFIG_SYNCOOKIES
373 Send out syncookies when the syn backlog queue of a socket 374 Send out syncookies when the syn backlog queue of a socket
374 overflows. This is to prevent against the common 'syn flood attack' 375 overflows. This is to prevent against the common 'SYN flood attack'
375 Default: FALSE 376 Default: FALSE
376 377
377 Note, that syncookies is fallback facility. 378 Note, that syncookies is fallback facility.
378 It MUST NOT be used to help highly loaded servers to stand 379 It MUST NOT be used to help highly loaded servers to stand
379 against legal connection rate. If you see synflood warnings 380 against legal connection rate. If you see SYN flood warnings
380 in your logs, but investigation shows that they occur 381 in your logs, but investigation shows that they occur
381 because of overload with legal connections, you should tune 382 because of overload with legal connections, you should tune
382 another parameters until this warning disappear. 383 another parameters until this warning disappear.
@@ -386,7 +387,7 @@ tcp_syncookies - BOOLEAN
386 to use TCP extensions, can result in serious degradation 387 to use TCP extensions, can result in serious degradation
387 of some services (f.e. SMTP relaying), visible not by you, 388 of some services (f.e. SMTP relaying), visible not by you,
388 but your clients and relays, contacting you. While you see 389 but your clients and relays, contacting you. While you see
389 synflood warnings in logs not being really flooded, your server 390 SYN flood warnings in logs not being really flooded, your server
390 is seriously misconfigured. 391 is seriously misconfigured.
391 392
392tcp_syn_retries - INTEGER 393tcp_syn_retries - INTEGER
@@ -419,19 +420,21 @@ tcp_window_scaling - BOOLEAN
419 Enable window scaling as defined in RFC1323. 420 Enable window scaling as defined in RFC1323.
420 421
421tcp_wmem - vector of 3 INTEGERs: min, default, max 422tcp_wmem - vector of 3 INTEGERs: min, default, max
422 min: Amount of memory reserved for send buffers for TCP socket. 423 min: Amount of memory reserved for send buffers for TCP sockets.
423 Each TCP socket has rights to use it due to fact of its birth. 424 Each TCP socket has rights to use it due to fact of its birth.
424 Default: 4K 425 Default: 4K
425 426
426 default: Amount of memory allowed for send buffers for TCP socket 427 default: initial size of send buffer used by TCP sockets. This
427 by default. This value overrides net.core.wmem_default used 428 value overrides net.core.wmem_default used by other protocols.
428 by other protocols, it is usually lower than net.core.wmem_default. 429 It is usually lower than net.core.wmem_default.
429 Default: 16K 430 Default: 16K
430 431
431 max: Maximal amount of memory allowed for automatically selected 432 max: Maximal amount of memory allowed for automatically tuned
432 send buffers for TCP socket. This value does not override 433 send buffers for TCP sockets. This value does not override
433 net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. 434 net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables
434 Default: 128K 435 automatic tuning of that socket's send buffer size, in which case
436 this value is ignored.
437 Default: between 64K and 4MB, depending on RAM size.
435 438
436tcp_workaround_signed_windows - BOOLEAN 439tcp_workaround_signed_windows - BOOLEAN
437 If set, assume no receipt of a window scaling option means the 440 If set, assume no receipt of a window scaling option means the
@@ -1060,24 +1063,193 @@ bridge-nf-filter-pppoe-tagged - BOOLEAN
1060 Default: 1 1063 Default: 1
1061 1064
1062 1065
1063UNDOCUMENTED: 1066proc/sys/net/sctp/* Variables:
1067
1068addip_enable - BOOLEAN
1069 Enable or disable extension of Dynamic Address Reconfiguration
1070 (ADD-IP) functionality specified in RFC5061. This extension provides
1071 the ability to dynamically add and remove new addresses for the SCTP
1072 associations.
1073
1074 1: Enable extension.
1075
1076 0: Disable extension.
1077
1078 Default: 0
1079
1080addip_noauth_enable - BOOLEAN
1081 Dynamic Address Reconfiguration (ADD-IP) requires the use of
1082 authentication to protect the operations of adding or removing new
1083 addresses. This requirement is mandated so that unauthorized hosts
1084 would not be able to hijack associations. However, older
1085 implementations may not have implemented this requirement while
1086 allowing the ADD-IP extension. For reasons of interoperability,
1087 we provide this variable to control the enforcement of the
1088 authentication requirement.
1089
1090 1: Allow ADD-IP extension to be used without authentication. This
1091 should only be set in a closed environment for interoperability
1092 with older implementations.
1093
1094 0: Enforce the authentication requirement
1095
1096 Default: 0
1097
1098auth_enable - BOOLEAN
1099 Enable or disable Authenticated Chunks extension. This extension
1100 provides the ability to send and receive authenticated chunks and is
1101 required for secure operation of Dynamic Address Reconfiguration
1102 (ADD-IP) extension.
1103
1104 1: Enable this extension.
1105 0: Disable this extension.
1106
1107 Default: 0
1108
1109prsctp_enable - BOOLEAN
1110 Enable or disable the Partial Reliability extension (RFC3758) which
1111 is used to notify peers that a given DATA should no longer be expected.
1112
1113 1: Enable extension
1114 0: Disable
1115
1116 Default: 1
1117
1118max_burst - INTEGER
1119 The limit of the number of new packets that can be initially sent. It
1120 controls how bursty the generated traffic can be.
1121
1122 Default: 4
1123
1124association_max_retrans - INTEGER
1125 Set the maximum number for retransmissions that an association can
1126 attempt deciding that the remote end is unreachable. If this value
1127 is exceeded, the association is terminated.
1128
1129 Default: 10
1130
1131max_init_retransmits - INTEGER
1132 The maximum number of retransmissions of INIT and COOKIE-ECHO chunks
1133 that an association will attempt before declaring the destination
1134 unreachable and terminating.
1135
1136 Default: 8
1137
1138path_max_retrans - INTEGER
1139 The maximum number of retransmissions that will be attempted on a given
1140 path. Once this threshold is exceeded, the path is considered
1141 unreachable, and new traffic will use a different path when the
1142 association is multihomed.
1143
1144 Default: 5
1145
1146rto_initial - INTEGER
1147 The initial round trip timeout value in milliseconds that will be used
1148 in calculating round trip times. This is the initial time interval
1149 for retransmissions.
1150
1151 Default: 3000
1064 1152
1065dev_weight FIXME 1153rto_max - INTEGER
1066discovery_slots FIXME 1154 The maximum value (in milliseconds) of the round trip timeout. This
1067discovery_timeout FIXME 1155 is the largest time interval that can elapse between retransmissions.
1068fast_poll_increase FIXME 1156
1069ip6_queue_maxlen FIXME 1157 Default: 60000
1070lap_keepalive_time FIXME 1158
1071lo_cong FIXME 1159rto_min - INTEGER
1072max_baud_rate FIXME 1160 The minimum value (in milliseconds) of the round trip timeout. This
1073max_dgram_qlen FIXME 1161 is the smallest time interval the can elapse between retransmissions.
1074max_noreply_time FIXME 1162
1075max_tx_data_size FIXME 1163 Default: 1000
1076max_tx_window FIXME 1164
1077min_tx_turn_time FIXME 1165hb_interval - INTEGER
1078mod_cong FIXME 1166 The interval (in milliseconds) between HEARTBEAT chunks. These chunks
1079no_cong FIXME 1167 are sent at the specified interval on idle paths to probe the state of
1080no_cong_thresh FIXME 1168 a given path between 2 associations.
1081slot_timeout FIXME 1169
1082warn_noreply_time FIXME 1170 Default: 30000
1171
1172sack_timeout - INTEGER
1173 The amount of time (in milliseconds) that the implementation will wait
1174 to send a SACK.
1175
1176 Default: 200
1177
1178valid_cookie_life - INTEGER
1179 The default lifetime of the SCTP cookie (in milliseconds). The cookie
1180 is used during association establishment.
1181
1182 Default: 60000
1183
1184cookie_preserve_enable - BOOLEAN
1185 Enable or disable the ability to extend the lifetime of the SCTP cookie
1186 that is used during the establishment phase of SCTP association
1187
1188 1: Enable cookie lifetime extension.
1189 0: Disable
1190
1191 Default: 1
1192
1193rcvbuf_policy - INTEGER
1194 Determines if the receive buffer is attributed to the socket or to
1195 association. SCTP supports the capability to create multiple
1196 associations on a single socket. When using this capability, it is
1197 possible that a single stalled association that's buffering a lot
1198 of data may block other associations from delivering their data by
1199 consuming all of the receive buffer space. To work around this,
1200 the rcvbuf_policy could be set to attribute the receiver buffer space
1201 to each association instead of the socket. This prevents the described
1202 blocking.
1203
1204 1: rcvbuf space is per association
1205 0: recbuf space is per socket
1206
1207 Default: 0
1208
1209sndbuf_policy - INTEGER
1210 Similar to rcvbuf_policy above, this applies to send buffer space.
1211
1212 1: Send buffer is tracked per association
1213 0: Send buffer is tracked per socket.
1214
1215 Default: 0
1216
1217sctp_mem - vector of 3 INTEGERs: min, pressure, max
1218 Number of pages allowed for queueing by all SCTP sockets.
1219
1220 min: Below this number of pages SCTP is not bothered about its
1221 memory appetite. When amount of memory allocated by SCTP exceeds
1222 this number, SCTP starts to moderate memory usage.
1223
1224 pressure: This value was introduced to follow format of tcp_mem.
1225
1226 max: Number of pages allowed for queueing by all SCTP sockets.
1227
1228 Default is calculated at boot time from amount of available memory.
1229
1230sctp_rmem - vector of 3 INTEGERs: min, default, max
1231 See tcp_rmem for a description.
1232
1233sctp_wmem - vector of 3 INTEGERs: min, default, max
1234 See tcp_wmem for a description.
1235
1236UNDOCUMENTED:
1083 1237
1238/proc/sys/net/core/*
1239 dev_weight FIXME
1240
1241/proc/sys/net/unix/*
1242 max_dgram_qlen FIXME
1243
1244/proc/sys/net/irda/*
1245 fast_poll_increase FIXME
1246 warn_noreply_time FIXME
1247 discovery_slots FIXME
1248 slot_timeout FIXME
1249 max_baud_rate FIXME
1250 discovery_timeout FIXME
1251 lap_keepalive_time FIXME
1252 max_noreply_time FIXME
1253 max_tx_data_size FIXME
1254 max_tx_window FIXME
1255 min_tx_turn_time FIXME