aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/trace
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/trace')
-rw-r--r--Documentation/trace/events-kmem.txt107
-rw-r--r--Documentation/trace/events.txt219
-rw-r--r--Documentation/trace/ftrace-design.txt233
-rw-r--r--Documentation/trace/ftrace.txt76
-rw-r--r--Documentation/trace/function-graph-fold.vim42
-rw-r--r--Documentation/trace/postprocess/trace-pagealloc-postprocess.pl418
-rw-r--r--Documentation/trace/power.txt17
-rw-r--r--Documentation/trace/ring-buffer-design.txt955
-rw-r--r--Documentation/trace/tracepoint-analysis.txt327
9 files changed, 2330 insertions, 64 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 000000000000..6ef2a8652e17
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
1 Subsystem Trace Points: kmem
2
3The tracing system kmem captures events related to object and page allocation
4within the kernel. Broadly speaking there are four major subheadings.
5
6 o Slab allocation of small objects of unknown type (kmalloc)
7 o Slab allocation of small objects of known type
8 o Page allocation
9 o Per-CPU Allocator Activity
10 o External Fragmentation
11
12This document will describe what each of the tracepoints are and why they
13might be useful.
14
151. Slab allocation of small objects of unknown type
16===================================================
17kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
18kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
19kfree call_site=%lx ptr=%p
20
21Heavy activity for these events may indicate that a specific cache is
22justified, particularly if kmalloc slab pages are getting significantly
23internal fragmented as a result of the allocation pattern. By correlating
24kmalloc with kfree, it may be possible to identify memory leaks and where
25the allocation sites were.
26
27
282. Slab allocation of small objects of known type
29=================================================
30kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
31kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
32kmem_cache_free call_site=%lx ptr=%p
33
34These events are similar in usage to the kmalloc-related events except that
35it is likely easier to pin the event down to a specific cache. At the time
36of writing, no information is available on what slab is being allocated from,
37but the call_site can usually be used to extrapolate that information
38
393. Page allocation
40==================
41mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
42mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
43mm_page_free_direct page=%p pfn=%lu order=%d
44mm_pagevec_free page=%p pfn=%lu order=%d cold=%d
45
46These four events deal with page allocation and freeing. mm_page_alloc is
47a simple indicator of page allocator activity. Pages may be allocated from
48the per-CPU allocator (high performance) or the buddy allocator.
49
50If pages are allocated directly from the buddy allocator, the
51mm_page_alloc_zone_locked event is triggered. This event is important as high
52amounts of activity imply high activity on the zone->lock. Taking this lock
53impairs performance by disabling interrupts, dirtying cache lines between
54CPUs and serialising many CPUs.
55
56When a page is freed directly by the caller, the mm_page_free_direct event
57is triggered. Significant amounts of activity here could indicate that the
58callers should be batching their activities.
59
60When pages are freed using a pagevec, the mm_pagevec_free is
61triggered. Broadly speaking, pages are taken off the LRU lock in bulk and
62freed in batch with a pagevec. Significant amounts of activity here could
63indicate that the system is under memory pressure and can also indicate
64contention on the zone->lru_lock.
65
664. Per-CPU Allocator Activity
67=============================
68mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
69mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d
70
71In front of the page allocator is a per-cpu page allocator. It exists only
72for order-0 pages, reduces contention on the zone->lock and reduces the
73amount of writing on struct page.
74
75When a per-CPU list is empty or pages of the wrong type are allocated,
76the zone->lock will be taken once and the per-CPU list refilled. The event
77triggered is mm_page_alloc_zone_locked for each page allocated with the
78event indicating whether it is for a percpu_refill or not.
79
80When the per-CPU list is too full, a number of pages are freed, each one
81which triggers a mm_page_pcpu_drain event.
82
83The individual nature of the events are so that pages can be tracked
84between allocation and freeing. A number of drain or refill pages that occur
85consecutively imply the zone->lock being taken once. Large amounts of PCP
86refills and drains could imply an imbalance between CPUs where too much work
87is being concentrated in one place. It could also indicate that the per-CPU
88lists should be a larger size. Finally, large amounts of refills on one CPU
89and drains on another could be a factor in causing large amounts of cache
90line bounces due to writes between CPUs and worth investigating if pages
91can be allocated and freed on the same CPU through some algorithm change.
92
935. External Fragmentation
94=========================
95mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
96
97External fragmentation affects whether a high-order allocation will be
98successful or not. For some types of hardware, this is important although
99it is avoided where possible. If the system is using huge pages and needs
100to be able to resize the pool over the lifetime of the system, this value
101is important.
102
103Large numbers of this event implies that memory is fragmenting and
104high-order allocations will start failing at some time in the future. One
105means of reducing the occurange of this event is to increase the size of
106min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
107pageblock_size is usually the size of the default hugepage size.
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
index f157d7594ea7..02ac6ed38b2d 100644
--- a/Documentation/trace/events.txt
+++ b/Documentation/trace/events.txt
@@ -1,7 +1,7 @@
1 Event Tracing 1 Event Tracing
2 2
3 Documentation written by Theodore Ts'o 3 Documentation written by Theodore Ts'o
4 Updated by Li Zefan 4 Updated by Li Zefan and Tom Zanussi
5 5
61. Introduction 61. Introduction
7=============== 7===============
@@ -22,12 +22,12 @@ tracing information should be printed.
22--------------------------------- 22---------------------------------
23 23
24The events which are available for tracing can be found in the file 24The events which are available for tracing can be found in the file
25/debug/tracing/available_events. 25/sys/kernel/debug/tracing/available_events.
26 26
27To enable a particular event, such as 'sched_wakeup', simply echo it 27To enable a particular event, such as 'sched_wakeup', simply echo it
28to /debug/tracing/set_event. For example: 28to /sys/kernel/debug/tracing/set_event. For example:
29 29
30 # echo sched_wakeup >> /debug/tracing/set_event 30 # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event
31 31
32[ Note: '>>' is necessary, otherwise it will firstly disable 32[ Note: '>>' is necessary, otherwise it will firstly disable
33 all the events. ] 33 all the events. ]
@@ -35,15 +35,15 @@ to /debug/tracing/set_event. For example:
35To disable an event, echo the event name to the set_event file prefixed 35To disable an event, echo the event name to the set_event file prefixed
36with an exclamation point: 36with an exclamation point:
37 37
38 # echo '!sched_wakeup' >> /debug/tracing/set_event 38 # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event
39 39
40To disable all events, echo an empty line to the set_event file: 40To disable all events, echo an empty line to the set_event file:
41 41
42 # echo > /debug/tracing/set_event 42 # echo > /sys/kernel/debug/tracing/set_event
43 43
44To enable all events, echo '*:*' or '*:' to the set_event file: 44To enable all events, echo '*:*' or '*:' to the set_event file:
45 45
46 # echo *:* > /debug/tracing/set_event 46 # echo *:* > /sys/kernel/debug/tracing/set_event
47 47
48The events are organized into subsystems, such as ext4, irq, sched, 48The events are organized into subsystems, such as ext4, irq, sched,
49etc., and a full event name looks like this: <subsystem>:<event>. The 49etc., and a full event name looks like this: <subsystem>:<event>. The
@@ -52,29 +52,29 @@ file. All of the events in a subsystem can be specified via the syntax
52"<subsystem>:*"; for example, to enable all irq events, you can use the 52"<subsystem>:*"; for example, to enable all irq events, you can use the
53command: 53command:
54 54
55 # echo 'irq:*' > /debug/tracing/set_event 55 # echo 'irq:*' > /sys/kernel/debug/tracing/set_event
56 56
572.2 Via the 'enable' toggle 572.2 Via the 'enable' toggle
58--------------------------- 58---------------------------
59 59
60The events available are also listed in /debug/tracing/events/ hierarchy 60The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy
61of directories. 61of directories.
62 62
63To enable event 'sched_wakeup': 63To enable event 'sched_wakeup':
64 64
65 # echo 1 > /debug/tracing/events/sched/sched_wakeup/enable 65 # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
66 66
67To disable it: 67To disable it:
68 68
69 # echo 0 > /debug/tracing/events/sched/sched_wakeup/enable 69 # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
70 70
71To enable all events in sched subsystem: 71To enable all events in sched subsystem:
72 72
73 # echo 1 > /debug/tracing/events/sched/enable 73 # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
74 74
75To eanble all events: 75To enable all events:
76 76
77 # echo 1 > /debug/tracing/events/enable 77 # echo 1 > /sys/kernel/debug/tracing/events/enable
78 78
79When reading one of these enable files, there are four results: 79When reading one of these enable files, there are four results:
80 80
@@ -83,8 +83,199 @@ When reading one of these enable files, there are four results:
83 X - there is a mixture of events enabled and disabled 83 X - there is a mixture of events enabled and disabled
84 ? - this file does not affect any event 84 ? - this file does not affect any event
85 85
862.3 Boot option
87---------------
88
89In order to facilitate early boot debugging, use boot option:
90
91 trace_event=[event-list]
92
93The format of this boot option is the same as described in section 2.1.
94
863. Defining an event-enabled tracepoint 953. Defining an event-enabled tracepoint
87======================================= 96=======================================
88 97
89See The example provided in samples/trace_events 98See The example provided in samples/trace_events
90 99
1004. Event formats
101================
102
103Each trace event has a 'format' file associated with it that contains
104a description of each field in a logged event. This information can
105be used to parse the binary trace stream, and is also the place to
106find the field names that can be used in event filters (see section 5).
107
108It also displays the format string that will be used to print the
109event in text mode, along with the event name and ID used for
110profiling.
111
112Every event has a set of 'common' fields associated with it; these are
113the fields prefixed with 'common_'. The other fields vary between
114events and correspond to the fields defined in the TRACE_EVENT
115definition for that event.
116
117Each field in the format has the form:
118
119 field:field-type field-name; offset:N; size:N;
120
121where offset is the offset of the field in the trace record and size
122is the size of the data item, in bytes.
123
124For example, here's the information displayed for the 'sched_wakeup'
125event:
126
127# cat /debug/tracing/events/sched/sched_wakeup/format
128
129name: sched_wakeup
130ID: 60
131format:
132 field:unsigned short common_type; offset:0; size:2;
133 field:unsigned char common_flags; offset:2; size:1;
134 field:unsigned char common_preempt_count; offset:3; size:1;
135 field:int common_pid; offset:4; size:4;
136 field:int common_tgid; offset:8; size:4;
137
138 field:char comm[TASK_COMM_LEN]; offset:12; size:16;
139 field:pid_t pid; offset:28; size:4;
140 field:int prio; offset:32; size:4;
141 field:int success; offset:36; size:4;
142 field:int cpu; offset:40; size:4;
143
144print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid,
145 REC->prio, REC->success, REC->cpu
146
147This event contains 10 fields, the first 5 common and the remaining 5
148event-specific. All the fields for this event are numeric, except for
149'comm' which is a string, a distinction important for event filtering.
150
1515. Event filtering
152==================
153
154Trace events can be filtered in the kernel by associating boolean
155'filter expressions' with them. As soon as an event is logged into
156the trace buffer, its fields are checked against the filter expression
157associated with that event type. An event with field values that
158'match' the filter will appear in the trace output, and an event whose
159values don't match will be discarded. An event with no filter
160associated with it matches everything, and is the default when no
161filter has been set for an event.
162
1635.1 Expression syntax
164---------------------
165
166A filter expression consists of one or more 'predicates' that can be
167combined using the logical operators '&&' and '||'. A predicate is
168simply a clause that compares the value of a field contained within a
169logged event with a constant value and returns either 0 or 1 depending
170on whether the field value matched (1) or didn't match (0):
171
172 field-name relational-operator value
173
174Parentheses can be used to provide arbitrary logical groupings and
175double-quotes can be used to prevent the shell from interpreting
176operators as shell metacharacters.
177
178The field-names available for use in filters can be found in the
179'format' files for trace events (see section 4).
180
181The relational-operators depend on the type of the field being tested:
182
183The operators available for numeric fields are:
184
185==, !=, <, <=, >, >=
186
187And for string fields they are:
188
189==, !=
190
191Currently, only exact string matches are supported.
192
193Currently, the maximum number of predicates in a filter is 16.
194
1955.2 Setting filters
196-------------------
197
198A filter for an individual event is set by writing a filter expression
199to the 'filter' file for the given event.
200
201For example:
202
203# cd /debug/tracing/events/sched/sched_wakeup
204# echo "common_preempt_count > 4" > filter
205
206A slightly more involved example:
207
208# cd /debug/tracing/events/sched/sched_signal_send
209# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
210
211If there is an error in the expression, you'll get an 'Invalid
212argument' error when setting it, and the erroneous string along with
213an error message can be seen by looking at the filter e.g.:
214
215# cd /debug/tracing/events/sched/sched_signal_send
216# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
217-bash: echo: write error: Invalid argument
218# cat filter
219((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
220^
221parse_error: Field not found
222
223Currently the caret ('^') for an error always appears at the beginning of
224the filter string; the error message should still be useful though
225even without more accurate position info.
226
2275.3 Clearing filters
228--------------------
229
230To clear the filter for an event, write a '0' to the event's filter
231file.
232
233To clear the filters for all events in a subsystem, write a '0' to the
234subsystem's filter file.
235
2365.3 Subsystem filters
237---------------------
238
239For convenience, filters for every event in a subsystem can be set or
240cleared as a group by writing a filter expression into the filter file
241at the root of the subsytem. Note however, that if a filter for any
242event within the subsystem lacks a field specified in the subsystem
243filter, or if the filter can't be applied for any other reason, the
244filter for that event will retain its previous setting. This can
245result in an unintended mixture of filters which could lead to
246confusing (to the user who might think different filters are in
247effect) trace output. Only filters that reference just the common
248fields can be guaranteed to propagate successfully to all events.
249
250Here are a few subsystem filter examples that also illustrate the
251above points:
252
253Clear the filters on all events in the sched subsytem:
254
255# cd /sys/kernel/debug/tracing/events/sched
256# echo 0 > filter
257# cat sched_switch/filter
258none
259# cat sched_wakeup/filter
260none
261
262Set a filter using only common fields for all events in the sched
263subsytem (all events end up with the same filter):
264
265# cd /sys/kernel/debug/tracing/events/sched
266# echo common_pid == 0 > filter
267# cat sched_switch/filter
268common_pid == 0
269# cat sched_wakeup/filter
270common_pid == 0
271
272Attempt to set a filter using a non-common field for all events in the
273sched subsytem (all events but those that have a prev_pid field retain
274their old filters):
275
276# cd /sys/kernel/debug/tracing/events/sched
277# echo prev_pid == 0 > filter
278# cat sched_switch/filter
279prev_pid == 0
280# cat sched_wakeup/filter
281common_pid == 0
diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt
new file mode 100644
index 000000000000..7003e10f10f5
--- /dev/null
+++ b/Documentation/trace/ftrace-design.txt
@@ -0,0 +1,233 @@
1 function tracer guts
2 ====================
3
4Introduction
5------------
6
7Here we will cover the architecture pieces that the common function tracing
8code relies on for proper functioning. Things are broken down into increasing
9complexity so that you can start simple and at least get basic functionality.
10
11Note that this focuses on architecture implementation details only. If you
12want more explanation of a feature in terms of common code, review the common
13ftrace.txt file.
14
15
16Prerequisites
17-------------
18
19Ftrace relies on these features being implemented:
20 STACKTRACE_SUPPORT - implement save_stack_trace()
21 TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h
22
23
24HAVE_FUNCTION_TRACER
25--------------------
26
27You will need to implement the mcount and the ftrace_stub functions.
28
29The exact mcount symbol name will depend on your toolchain. Some call it
30"mcount", "_mcount", or even "__mcount". You can probably figure it out by
31running something like:
32 $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount
33 call mcount
34We'll make the assumption below that the symbol is "mcount" just to keep things
35nice and simple in the examples.
36
37Keep in mind that the ABI that is in effect inside of the mcount function is
38*highly* architecture/toolchain specific. We cannot help you in this regard,
39sorry. Dig up some old documentation and/or find someone more familiar than
40you to bang ideas off of. Typically, register usage (argument/scratch/etc...)
41is a major issue at this point, especially in relation to the location of the
42mcount call (before/after function prologue). You might also want to look at
43how glibc has implemented the mcount function for your architecture. It might
44be (semi-)relevant.
45
46The mcount function should check the function pointer ftrace_trace_function
47to see if it is set to ftrace_stub. If it is, there is nothing for you to do,
48so return immediately. If it isn't, then call that function in the same way
49the mcount function normally calls __mcount_internal -- the first argument is
50the "frompc" while the second argument is the "selfpc" (adjusted to remove the
51size of the mcount call that is embedded in the function).
52
53For example, if the function foo() calls bar(), when the bar() function calls
54mcount(), the arguments mcount() will pass to the tracer are:
55 "frompc" - the address bar() will use to return to foo()
56 "selfpc" - the address bar() (with _mcount() size adjustment)
57
58Also keep in mind that this mcount function will be called *a lot*, so
59optimizing for the default case of no tracer will help the smooth running of
60your system when tracing is disabled. So the start of the mcount function is
61typically the bare min with checking things before returning. That also means
62the code flow should usually kept linear (i.e. no branching in the nop case).
63This is of course an optimization and not a hard requirement.
64
65Here is some pseudo code that should help (these functions should actually be
66implemented in assembly):
67
68void ftrace_stub(void)
69{
70 return;
71}
72
73void mcount(void)
74{
75 /* save any bare state needed in order to do initial checking */
76
77 extern void (*ftrace_trace_function)(unsigned long, unsigned long);
78 if (ftrace_trace_function != ftrace_stub)
79 goto do_trace;
80
81 /* restore any bare state */
82
83 return;
84
85do_trace:
86
87 /* save all state needed by the ABI (see paragraph above) */
88
89 unsigned long frompc = ...;
90 unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
91 ftrace_trace_function(frompc, selfpc);
92
93 /* restore all state needed by the ABI */
94}
95
96Don't forget to export mcount for modules !
97extern void mcount(void);
98EXPORT_SYMBOL(mcount);
99
100
101HAVE_FUNCTION_TRACE_MCOUNT_TEST
102-------------------------------
103
104This is an optional optimization for the normal case when tracing is turned off
105in the system. If you do not enable this Kconfig option, the common ftrace
106code will take care of doing the checking for you.
107
108To support this feature, you only need to check the function_trace_stop
109variable in the mcount function. If it is non-zero, there is no tracing to be
110done at all, so you can return.
111
112This additional pseudo code would simply be:
113void mcount(void)
114{
115 /* save any bare state needed in order to do initial checking */
116
117+ if (function_trace_stop)
118+ return;
119
120 extern void (*ftrace_trace_function)(unsigned long, unsigned long);
121 if (ftrace_trace_function != ftrace_stub)
122...
123
124
125HAVE_FUNCTION_GRAPH_TRACER
126--------------------------
127
128Deep breath ... time to do some real work. Here you will need to update the
129mcount function to check ftrace graph function pointers, as well as implement
130some functions to save (hijack) and restore the return address.
131
132The mcount function should check the function pointers ftrace_graph_return
133(compare to ftrace_stub) and ftrace_graph_entry (compare to
134ftrace_graph_entry_stub). If either of those are not set to the relevant stub
135function, call the arch-specific function ftrace_graph_caller which in turn
136calls the arch-specific function prepare_ftrace_return. Neither of these
137function names are strictly required, but you should use them anyways to stay
138consistent across the architecture ports -- easier to compare & contrast
139things.
140
141The arguments to prepare_ftrace_return are slightly different than what are
142passed to ftrace_trace_function. The second argument "selfpc" is the same,
143but the first argument should be a pointer to the "frompc". Typically this is
144located on the stack. This allows the function to hijack the return address
145temporarily to have it point to the arch-specific function return_to_handler.
146That function will simply call the common ftrace_return_to_handler function and
147that will return the original return address with which, you can return to the
148original call site.
149
150Here is the updated mcount pseudo code:
151void mcount(void)
152{
153...
154 if (ftrace_trace_function != ftrace_stub)
155 goto do_trace;
156
157+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
158+ extern void (*ftrace_graph_return)(...);
159+ extern void (*ftrace_graph_entry)(...);
160+ if (ftrace_graph_return != ftrace_stub ||
161+ ftrace_graph_entry != ftrace_graph_entry_stub)
162+ ftrace_graph_caller();
163+#endif
164
165 /* restore any bare state */
166...
167
168Here is the pseudo code for the new ftrace_graph_caller assembly function:
169#ifdef CONFIG_FUNCTION_GRAPH_TRACER
170void ftrace_graph_caller(void)
171{
172 /* save all state needed by the ABI */
173
174 unsigned long *frompc = &...;
175 unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
176 prepare_ftrace_return(frompc, selfpc);
177
178 /* restore all state needed by the ABI */
179}
180#endif
181
182For information on how to implement prepare_ftrace_return(), simply look at
183the x86 version. The only architecture-specific piece in it is the setup of
184the fault recovery table (the asm(...) code). The rest should be the same
185across architectures.
186
187Here is the pseudo code for the new return_to_handler assembly function. Note
188that the ABI that applies here is different from what applies to the mcount
189code. Since you are returning from a function (after the epilogue), you might
190be able to skimp on things saved/restored (usually just registers used to pass
191return values).
192
193#ifdef CONFIG_FUNCTION_GRAPH_TRACER
194void return_to_handler(void)
195{
196 /* save all state needed by the ABI (see paragraph above) */
197
198 void (*original_return_point)(void) = ftrace_return_to_handler();
199
200 /* restore all state needed by the ABI */
201
202 /* this is usually either a return or a jump */
203 original_return_point();
204}
205#endif
206
207
208HAVE_FTRACE_NMI_ENTER
209---------------------
210
211If you can't trace NMI functions, then skip this option.
212
213<details to be filled>
214
215
216HAVE_FTRACE_SYSCALLS
217---------------------
218
219<details to be filled>
220
221
222HAVE_FTRACE_MCOUNT_RECORD
223-------------------------
224
225See scripts/recordmcount.pl for more info.
226
227<details to be filled>
228
229
230HAVE_DYNAMIC_FTRACE
231---------------------
232
233<details to be filled>
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index a39b3c749de5..957b22fde2df 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -26,6 +26,12 @@ disabled, and more (ftrace allows for tracer plugins, which
26means that the list of tracers can always grow). 26means that the list of tracers can always grow).
27 27
28 28
29Implementation Details
30----------------------
31
32See ftrace-design.txt for details for arch porters and such.
33
34
29The File System 35The File System
30--------------- 36---------------
31 37
@@ -85,26 +91,19 @@ of ftrace. Here is a list of some of the key files:
85 This file holds the output of the trace in a human 91 This file holds the output of the trace in a human
86 readable format (described below). 92 readable format (described below).
87 93
88 latency_trace:
89
90 This file shows the same trace but the information
91 is organized more to display possible latencies
92 in the system (described below).
93
94 trace_pipe: 94 trace_pipe:
95 95
96 The output is the same as the "trace" file but this 96 The output is the same as the "trace" file but this
97 file is meant to be streamed with live tracing. 97 file is meant to be streamed with live tracing.
98 Reads from this file will block until new data 98 Reads from this file will block until new data is
99 is retrieved. Unlike the "trace" and "latency_trace" 99 retrieved. Unlike the "trace" file, this file is a
100 files, this file is a consumer. This means reading 100 consumer. This means reading from this file causes
101 from this file causes sequential reads to display 101 sequential reads to display more current data. Once
102 more current data. Once data is read from this 102 data is read from this file, it is consumed, and
103 file, it is consumed, and will not be read 103 will not be read again with a sequential read. The
104 again with a sequential read. The "trace" and 104 "trace" file is static, and if the tracer is not
105 "latency_trace" files are static, and if the 105 adding more data,they will display the same
106 tracer is not adding more data, they will display 106 information every time they are read.
107 the same information every time they are read.
108 107
109 trace_options: 108 trace_options:
110 109
@@ -117,10 +116,10 @@ of ftrace. Here is a list of some of the key files:
117 Some of the tracers record the max latency. 116 Some of the tracers record the max latency.
118 For example, the time interrupts are disabled. 117 For example, the time interrupts are disabled.
119 This time is saved in this file. The max trace 118 This time is saved in this file. The max trace
120 will also be stored, and displayed by either 119 will also be stored, and displayed by "trace".
121 "trace" or "latency_trace". A new max trace will 120 A new max trace will only be recorded if the
122 only be recorded if the latency is greater than 121 latency is greater than the value in this
123 the value in this file. (in microseconds) 122 file. (in microseconds)
124 123
125 buffer_size_kb: 124 buffer_size_kb:
126 125
@@ -134,7 +133,7 @@ of ftrace. Here is a list of some of the key files:
134 than requested, the rest of the page will be used, 133 than requested, the rest of the page will be used,
135 making the actual allocation bigger than requested. 134 making the actual allocation bigger than requested.
136 ( Note, the size may not be a multiple of the page size 135 ( Note, the size may not be a multiple of the page size
137 due to buffer managment overhead. ) 136 due to buffer management overhead. )
138 137
139 This can only be updated when the current_tracer 138 This can only be updated when the current_tracer
140 is set to "nop". 139 is set to "nop".
@@ -210,7 +209,7 @@ Here is the list of current tracers that may be configured.
210 the trace with the longest max latency. 209 the trace with the longest max latency.
211 See tracing_max_latency. When a new max is recorded, 210 See tracing_max_latency. When a new max is recorded,
212 it replaces the old trace. It is best to view this 211 it replaces the old trace. It is best to view this
213 trace via the latency_trace file. 212 trace with the latency-format option enabled.
214 213
215 "preemptoff" 214 "preemptoff"
216 215
@@ -307,8 +306,8 @@ the lowest priority thread (pid 0).
307Latency trace format 306Latency trace format
308-------------------- 307--------------------
309 308
310For traces that display latency times, the latency_trace file 309When the latency-format option is enabled, the trace file gives
311gives somewhat more information to see why a latency happened. 310somewhat more information to see why a latency happened.
312Here is a typical trace. 311Here is a typical trace.
313 312
314# tracer: irqsoff 313# tracer: irqsoff
@@ -380,9 +379,10 @@ explains which is which.
380 379
381The above is mostly meaningful for kernel developers. 380The above is mostly meaningful for kernel developers.
382 381
383 time: This differs from the trace file output. The trace file output 382 time: When the latency-format option is enabled, the trace file
384 includes an absolute timestamp. The timestamp used by the 383 output includes a timestamp relative to the start of the
385 latency_trace file is relative to the start of the trace. 384 trace. This differs from the output when latency-format
385 is disabled, which includes an absolute timestamp.
386 386
387 delay: This is just to help catch your eye a bit better. And 387 delay: This is just to help catch your eye a bit better. And
388 needs to be fixed to be only relative to the same CPU. 388 needs to be fixed to be only relative to the same CPU.
@@ -440,7 +440,8 @@ Here are the available options:
440 sym-addr: 440 sym-addr:
441 bash-4000 [01] 1477.606694: simple_strtoul <c0339346> 441 bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
442 442
443 verbose - This deals with the latency_trace file. 443 verbose - This deals with the trace file when the
444 latency-format option is enabled.
444 445
445 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ 446 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
446 (+0.000ms): simple_strtoul (strict_strtoul) 447 (+0.000ms): simple_strtoul (strict_strtoul)
@@ -472,7 +473,7 @@ Here are the available options:
472 the app is no longer running 473 the app is no longer running
473 474
474 The lookup is performed when you read 475 The lookup is performed when you read
475 trace,trace_pipe,latency_trace. Example: 476 trace,trace_pipe. Example:
476 477
477 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 478 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
478x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] 479x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
@@ -481,6 +482,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
481 every scheduling event. Will add overhead if 482 every scheduling event. Will add overhead if
482 there's a lot of tasks running at once. 483 there's a lot of tasks running at once.
483 484
485 latency-format - This option changes the trace. When
486 it is enabled, the trace displays
487 additional information about the
488 latencies, as described in "Latency
489 trace format".
484 490
485sched_switch 491sched_switch
486------------ 492------------
@@ -596,12 +602,13 @@ To reset the maximum, echo 0 into tracing_max_latency. Here is
596an example: 602an example:
597 603
598 # echo irqsoff > current_tracer 604 # echo irqsoff > current_tracer
605 # echo latency-format > trace_options
599 # echo 0 > tracing_max_latency 606 # echo 0 > tracing_max_latency
600 # echo 1 > tracing_enabled 607 # echo 1 > tracing_enabled
601 # ls -ltr 608 # ls -ltr
602 [...] 609 [...]
603 # echo 0 > tracing_enabled 610 # echo 0 > tracing_enabled
604 # cat latency_trace 611 # cat trace
605# tracer: irqsoff 612# tracer: irqsoff
606# 613#
607irqsoff latency trace v1.1.5 on 2.6.26 614irqsoff latency trace v1.1.5 on 2.6.26
@@ -703,12 +710,13 @@ which preemption was disabled. The control of preemptoff tracer
703is much like the irqsoff tracer. 710is much like the irqsoff tracer.
704 711
705 # echo preemptoff > current_tracer 712 # echo preemptoff > current_tracer
713 # echo latency-format > trace_options
706 # echo 0 > tracing_max_latency 714 # echo 0 > tracing_max_latency
707 # echo 1 > tracing_enabled 715 # echo 1 > tracing_enabled
708 # ls -ltr 716 # ls -ltr
709 [...] 717 [...]
710 # echo 0 > tracing_enabled 718 # echo 0 > tracing_enabled
711 # cat latency_trace 719 # cat trace
712# tracer: preemptoff 720# tracer: preemptoff
713# 721#
714preemptoff latency trace v1.1.5 on 2.6.26-rc8 722preemptoff latency trace v1.1.5 on 2.6.26-rc8
@@ -850,12 +858,13 @@ Again, using this trace is much like the irqsoff and preemptoff
850tracers. 858tracers.
851 859
852 # echo preemptirqsoff > current_tracer 860 # echo preemptirqsoff > current_tracer
861 # echo latency-format > trace_options
853 # echo 0 > tracing_max_latency 862 # echo 0 > tracing_max_latency
854 # echo 1 > tracing_enabled 863 # echo 1 > tracing_enabled
855 # ls -ltr 864 # ls -ltr
856 [...] 865 [...]
857 # echo 0 > tracing_enabled 866 # echo 0 > tracing_enabled
858 # cat latency_trace 867 # cat trace
859# tracer: preemptirqsoff 868# tracer: preemptirqsoff
860# 869#
861preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 870preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8
@@ -1012,11 +1021,12 @@ Instead of performing an 'ls', we will run 'sleep 1' under
1012'chrt' which changes the priority of the task. 1021'chrt' which changes the priority of the task.
1013 1022
1014 # echo wakeup > current_tracer 1023 # echo wakeup > current_tracer
1024 # echo latency-format > trace_options
1015 # echo 0 > tracing_max_latency 1025 # echo 0 > tracing_max_latency
1016 # echo 1 > tracing_enabled 1026 # echo 1 > tracing_enabled
1017 # chrt -f 5 sleep 1 1027 # chrt -f 5 sleep 1
1018 # echo 0 > tracing_enabled 1028 # echo 0 > tracing_enabled
1019 # cat latency_trace 1029 # cat trace
1020# tracer: wakeup 1030# tracer: wakeup
1021# 1031#
1022wakeup latency trace v1.1.5 on 2.6.26-rc8 1032wakeup latency trace v1.1.5 on 2.6.26-rc8
diff --git a/Documentation/trace/function-graph-fold.vim b/Documentation/trace/function-graph-fold.vim
new file mode 100644
index 000000000000..0544b504c8b0
--- /dev/null
+++ b/Documentation/trace/function-graph-fold.vim
@@ -0,0 +1,42 @@
1" Enable folding for ftrace function_graph traces.
2"
3" To use, :source this file while viewing a function_graph trace, or use vim's
4" -S option to load from the command-line together with a trace. You can then
5" use the usual vim fold commands, such as "za", to open and close nested
6" functions. While closed, a fold will show the total time taken for a call,
7" as would normally appear on the line with the closing brace. Folded
8" functions will not include finish_task_switch(), so folding should remain
9" relatively sane even through a context switch.
10"
11" Note that this will almost certainly only work well with a
12" single-CPU trace (e.g. trace-cmd report --cpu 1).
13
14function! FunctionGraphFoldExpr(lnum)
15 let line = getline(a:lnum)
16 if line[-1:] == '{'
17 if line =~ 'finish_task_switch() {$'
18 return '>1'
19 endif
20 return 'a1'
21 elseif line[-1:] == '}'
22 return 's1'
23 else
24 return '='
25 endif
26endfunction
27
28function! FunctionGraphFoldText()
29 let s = split(getline(v:foldstart), '|', 1)
30 if getline(v:foldend+1) =~ 'finish_task_switch() {$'
31 let s[2] = ' task switch '
32 else
33 let e = split(getline(v:foldend), '|', 1)
34 let s[2] = e[2]
35 endif
36 return join(s, '|')
37endfunction
38
39setlocal foldexpr=FunctionGraphFoldExpr(v:lnum)
40setlocal foldtext=FunctionGraphFoldText()
41setlocal foldcolumn=12
42setlocal foldmethod=expr
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100644
index 000000000000..7df50e8cf4d9
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,418 @@
1#!/usr/bin/perl
2# This is a POC (proof of concept or piece of crap, take your pick) for reading the
3# text representation of trace output related to page allocation. It makes an attempt
4# to extract some high-level information on what is going on. The accuracy of the parser
5# may vary considerably
6#
7# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
8# other options
9# --prepend-parent Report on the parent proc and PID
10# --read-procstat If the trace lacks process info, get it from /proc
11# --ignore-pid Aggregate processes of the same name together
12#
13# Copyright (c) IBM Corporation 2009
14# Author: Mel Gorman <mel@csn.ul.ie>
15use strict;
16use Getopt::Long;
17
18# Tracepoint events
19use constant MM_PAGE_ALLOC => 1;
20use constant MM_PAGE_FREE_DIRECT => 2;
21use constant MM_PAGEVEC_FREE => 3;
22use constant MM_PAGE_PCPU_DRAIN => 4;
23use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5;
24use constant MM_PAGE_ALLOC_EXTFRAG => 6;
25use constant EVENT_UNKNOWN => 7;
26
27# Constants used to track state
28use constant STATE_PCPU_PAGES_DRAINED => 8;
29use constant STATE_PCPU_PAGES_REFILLED => 9;
30
31# High-level events extrapolated from tracepoints
32use constant HIGH_PCPU_DRAINS => 10;
33use constant HIGH_PCPU_REFILLS => 11;
34use constant HIGH_EXT_FRAGMENT => 12;
35use constant HIGH_EXT_FRAGMENT_SEVERE => 13;
36use constant HIGH_EXT_FRAGMENT_MODERATE => 14;
37use constant HIGH_EXT_FRAGMENT_CHANGED => 15;
38
39my %perprocesspid;
40my %perprocess;
41my $opt_ignorepid;
42my $opt_read_procstat;
43my $opt_prepend_parent;
44
45# Catch sigint and exit on request
46my $sigint_report = 0;
47my $sigint_exit = 0;
48my $sigint_pending = 0;
49my $sigint_received = 0;
50sub sigint_handler {
51 my $current_time = time;
52 if ($current_time - 2 > $sigint_received) {
53 print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
54 $sigint_report = 1;
55 } else {
56 if (!$sigint_exit) {
57 print "Second SIGINT received quickly, exiting\n";
58 }
59 $sigint_exit++;
60 }
61
62 if ($sigint_exit > 3) {
63 print "Many SIGINTs received, exiting now without report\n";
64 exit;
65 }
66
67 $sigint_received = $current_time;
68 $sigint_pending = 1;
69}
70$SIG{INT} = "sigint_handler";
71
72# Parse command line options
73GetOptions(
74 'ignore-pid' => \$opt_ignorepid,
75 'read-procstat' => \$opt_read_procstat,
76 'prepend-parent' => \$opt_prepend_parent,
77);
78
79# Defaults for dynamically discovered regex's
80my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
81
82# Dyanically discovered regex
83my $regex_fragdetails;
84
85# Static regex used. Specified like this for readability and for use with /o
86# (process_pid) (cpus ) ( time ) (tpoint ) (details)
87my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
88my $regex_statname = '[-0-9]*\s\((.*)\).*';
89my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
90
91sub generate_traceevent_regex {
92 my $event = shift;
93 my $default = shift;
94 my $regex;
95
96 # Read the event format or use the default
97 if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
98 $regex = $default;
99 } else {
100 my $line;
101 while (!eof(FORMAT)) {
102 $line = <FORMAT>;
103 if ($line =~ /^print fmt:\s"(.*)",.*/) {
104 $regex = $1;
105 $regex =~ s/%p/\([0-9a-f]*\)/g;
106 $regex =~ s/%d/\([-0-9]*\)/g;
107 $regex =~ s/%lu/\([0-9]*\)/g;
108 }
109 }
110 }
111
112 # Verify fields are in the right order
113 my $tuple;
114 foreach $tuple (split /\s/, $regex) {
115 my ($key, $value) = split(/=/, $tuple);
116 my $expected = shift;
117 if ($key ne $expected) {
118 print("WARNING: Format not as expected '$key' != '$expected'");
119 $regex =~ s/$key=\((.*)\)/$key=$1/;
120 }
121 }
122
123 if (defined shift) {
124 die("Fewer fields than expected in format");
125 }
126
127 return $regex;
128}
129$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
130 $regex_fragdetails_default,
131 "page", "pfn",
132 "alloc_order", "fallback_order", "pageblock_order",
133 "alloc_migratetype", "fallback_migratetype",
134 "fragmenting", "change_ownership");
135
136sub read_statline($) {
137 my $pid = $_[0];
138 my $statline;
139
140 if (open(STAT, "/proc/$pid/stat")) {
141 $statline = <STAT>;
142 close(STAT);
143 }
144
145 if ($statline eq '') {
146 $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
147 }
148
149 return $statline;
150}
151
152sub guess_process_pid($$) {
153 my $pid = $_[0];
154 my $statline = $_[1];
155
156 if ($pid == 0) {
157 return "swapper-0";
158 }
159
160 if ($statline !~ /$regex_statname/o) {
161 die("Failed to math stat line for process name :: $statline");
162 }
163 return "$1-$pid";
164}
165
166sub parent_info($$) {
167 my $pid = $_[0];
168 my $statline = $_[1];
169 my $ppid;
170
171 if ($pid == 0) {
172 return "NOPARENT-0";
173 }
174
175 if ($statline !~ /$regex_statppid/o) {
176 die("Failed to match stat line process ppid:: $statline");
177 }
178
179 # Read the ppid stat line
180 $ppid = $1;
181 return guess_process_pid($ppid, read_statline($ppid));
182}
183
184sub process_events {
185 my $traceevent;
186 my $process_pid;
187 my $cpus;
188 my $timestamp;
189 my $tracepoint;
190 my $details;
191 my $statline;
192
193 # Read each line of the event log
194EVENT_PROCESS:
195 while ($traceevent = <STDIN>) {
196 if ($traceevent =~ /$regex_traceevent/o) {
197 $process_pid = $1;
198 $tracepoint = $4;
199
200 if ($opt_read_procstat || $opt_prepend_parent) {
201 $process_pid =~ /(.*)-([0-9]*)$/;
202 my $process = $1;
203 my $pid = $2;
204
205 $statline = read_statline($pid);
206
207 if ($opt_read_procstat && $process eq '') {
208 $process_pid = guess_process_pid($pid, $statline);
209 }
210
211 if ($opt_prepend_parent) {
212 $process_pid = parent_info($pid, $statline) . " :: $process_pid";
213 }
214 }
215
216 # Unnecessary in this script. Uncomment if required
217 # $cpus = $2;
218 # $timestamp = $3;
219 } else {
220 next;
221 }
222
223 # Perl Switch() sucks majorly
224 if ($tracepoint eq "mm_page_alloc") {
225 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
226 } elsif ($tracepoint eq "mm_page_free_direct") {
227 $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
228 } elsif ($tracepoint eq "mm_pagevec_free") {
229 $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
230 } elsif ($tracepoint eq "mm_page_pcpu_drain") {
231 $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
232 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
233 } elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
234 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
235 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
236 } elsif ($tracepoint eq "mm_page_alloc_extfrag") {
237
238 # Extract the details of the event now
239 $details = $5;
240
241 my ($page, $pfn);
242 my ($alloc_order, $fallback_order, $pageblock_order);
243 my ($alloc_migratetype, $fallback_migratetype);
244 my ($fragmenting, $change_ownership);
245
246 if ($details !~ /$regex_fragdetails/o) {
247 print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
248 next;
249 }
250
251 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
252 $page = $1;
253 $pfn = $2;
254 $alloc_order = $3;
255 $fallback_order = $4;
256 $pageblock_order = $5;
257 $alloc_migratetype = $6;
258 $fallback_migratetype = $7;
259 $fragmenting = $8;
260 $change_ownership = $9;
261
262 if ($fragmenting) {
263 $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
264 if ($fallback_order <= 3) {
265 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
266 } else {
267 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
268 }
269 }
270 if ($change_ownership) {
271 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
272 }
273 } else {
274 $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
275 }
276
277 # Catch a full pcpu drain event
278 if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
279 $tracepoint ne "mm_page_pcpu_drain") {
280
281 $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
282 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
283 }
284
285 # Catch a full pcpu refill event
286 if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
287 $tracepoint ne "mm_page_alloc_zone_locked") {
288 $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
289 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
290 }
291
292 if ($sigint_pending) {
293 last EVENT_PROCESS;
294 }
295 }
296}
297
298sub dump_stats {
299 my $hashref = shift;
300 my %stats = %$hashref;
301
302 # Dump per-process stats
303 my $process_pid;
304 my $max_strlen = 0;
305
306 # Get the maximum process name
307 foreach $process_pid (keys %perprocesspid) {
308 my $len = length($process_pid);
309 if ($len > $max_strlen) {
310 $max_strlen = $len;
311 }
312 }
313 $max_strlen += 2;
314
315 printf("\n");
316 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
317 "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
318 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
319 "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", "");
320
321 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
322 "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
323
324 foreach $process_pid (keys %stats) {
325 # Dump final aggregates
326 if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
327 $stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
328 $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
329 }
330 if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
331 $stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
332 $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
333 }
334
335 printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n",
336 $process_pid,
337 $stats{$process_pid}->{MM_PAGE_ALLOC},
338 $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
339 $stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
340 $stats{$process_pid}->{MM_PAGEVEC_FREE},
341 $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
342 $stats{$process_pid}->{HIGH_PCPU_DRAINS},
343 $stats{$process_pid}->{HIGH_PCPU_REFILLS},
344 $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
345 $stats{$process_pid}->{HIGH_EXT_FRAG},
346 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
347 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
348 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
349 $stats{$process_pid}->{EVENT_UNKNOWN});
350 }
351}
352
353sub aggregate_perprocesspid() {
354 my $process_pid;
355 my $process;
356 undef %perprocess;
357
358 foreach $process_pid (keys %perprocesspid) {
359 $process = $process_pid;
360 $process =~ s/-([0-9])*$//;
361 if ($process eq '') {
362 $process = "NO_PROCESS_NAME";
363 }
364
365 $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
366 $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
367 $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
368 $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
369 $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
370 $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
371 $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
372 $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
373 $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
374 $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
375 $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
376 $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
377 $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
378 }
379}
380
381sub report() {
382 if (!$opt_ignorepid) {
383 dump_stats(\%perprocesspid);
384 } else {
385 aggregate_perprocesspid();
386 dump_stats(\%perprocess);
387 }
388}
389
390# Process events or signals until neither is available
391sub signal_loop() {
392 my $sigint_processed;
393 do {
394 $sigint_processed = 0;
395 process_events();
396
397 # Handle pending signals if any
398 if ($sigint_pending) {
399 my $current_time = time;
400
401 if ($sigint_exit) {
402 print "Received exit signal\n";
403 $sigint_pending = 0;
404 }
405 if ($sigint_report) {
406 if ($current_time >= $sigint_received + 2) {
407 report();
408 $sigint_report = 0;
409 $sigint_pending = 0;
410 $sigint_processed = 1;
411 }
412 }
413 }
414 } while ($sigint_pending || $sigint_processed);
415}
416
417signal_loop();
418report();
diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt
deleted file mode 100644
index cd805e16dc27..000000000000
--- a/Documentation/trace/power.txt
+++ /dev/null
@@ -1,17 +0,0 @@
1The power tracer collects detailed information about C-state and P-state
2transitions, instead of just looking at the high-level "average"
3information.
4
5There is a helper script found in scrips/tracing/power.pl in the kernel
6sources which can be used to parse this information and create a
7Scalable Vector Graphics (SVG) picture from the trace data.
8
9To use this tracer:
10
11 echo 0 > /sys/kernel/debug/tracing/tracing_enabled
12 echo power > /sys/kernel/debug/tracing/current_tracer
13 echo 1 > /sys/kernel/debug/tracing/tracing_enabled
14 sleep 1
15 echo 0 > /sys/kernel/debug/tracing/tracing_enabled
16 cat /sys/kernel/debug/tracing/trace | \
17 perl scripts/tracing/power.pl > out.sv
diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt
new file mode 100644
index 000000000000..5b1d23d604c5
--- /dev/null
+++ b/Documentation/trace/ring-buffer-design.txt
@@ -0,0 +1,955 @@
1 Lockless Ring Buffer Design
2 ===========================
3
4Copyright 2009 Red Hat Inc.
5 Author: Steven Rostedt <srostedt@redhat.com>
6 License: The GNU Free Documentation License, Version 1.2
7 (dual licensed under the GPL v2)
8Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto,
9 and Frederic Weisbecker.
10
11
12Written for: 2.6.31
13
14Terminology used in this Document
15---------------------------------
16
17tail - where new writes happen in the ring buffer.
18
19head - where new reads happen in the ring buffer.
20
21producer - the task that writes into the ring buffer (same as writer)
22
23writer - same as producer
24
25consumer - the task that reads from the buffer (same as reader)
26
27reader - same as consumer.
28
29reader_page - A page outside the ring buffer used solely (for the most part)
30 by the reader.
31
32head_page - a pointer to the page that the reader will use next
33
34tail_page - a pointer to the page that will be written to next
35
36commit_page - a pointer to the page with the last finished non nested write.
37
38cmpxchg - hardware assisted atomic transaction that performs the following:
39
40 A = B iff previous A == C
41
42 R = cmpxchg(A, C, B) is saying that we replace A with B if and only if
43 current A is equal to C, and we put the old (current) A into R
44
45 R gets the previous A regardless if A is updated with B or not.
46
47 To see if the update was successful a compare of R == C may be used.
48
49The Generic Ring Buffer
50-----------------------
51
52The ring buffer can be used in either an overwrite mode or in
53producer/consumer mode.
54
55Producer/consumer mode is where the producer were to fill up the
56buffer before the consumer could free up anything, the producer
57will stop writing to the buffer. This will lose most recent events.
58
59Overwrite mode is where the produce were to fill up the buffer
60before the consumer could free up anything, the producer will
61overwrite the older data. This will lose the oldest events.
62
63No two writers can write at the same time (on the same per cpu buffer),
64but a writer may interrupt another writer, but it must finish writing
65before the previous writer may continue. This is very important to the
66algorithm. The writers act like a "stack". The way interrupts works
67enforces this behavior.
68
69
70 writer1 start
71 <preempted> writer2 start
72 <preempted> writer3 start
73 writer3 finishes
74 writer2 finishes
75 writer1 finishes
76
77This is very much like a writer being preempted by an interrupt and
78the interrupt doing a write as well.
79
80Readers can happen at any time. But no two readers may run at the
81same time, nor can a reader preempt/interrupt another reader. A reader
82can not preempt/interrupt a writer, but it may read/consume from the
83buffer at the same time as a writer is writing, but the reader must be
84on another processor to do so. A reader may read on its own processor
85and can be preempted by a writer.
86
87A writer can preempt a reader, but a reader can not preempt a writer.
88But a reader can read the buffer at the same time (on another processor)
89as a writer.
90
91The ring buffer is made up of a list of pages held together by a link list.
92
93At initialization a reader page is allocated for the reader that is not
94part of the ring buffer.
95
96The head_page, tail_page and commit_page are all initialized to point
97to the same page.
98
99The reader page is initialized to have its next pointer pointing to
100the head page, and its previous pointer pointing to a page before
101the head page.
102
103The reader has its own page to use. At start up time, this page is
104allocated but is not attached to the list. When the reader wants
105to read from the buffer, if its page is empty (like it is on start up)
106it will swap its page with the head_page. The old reader page will
107become part of the ring buffer and the head_page will be removed.
108The page after the inserted page (old reader_page) will become the
109new head page.
110
111Once the new page is given to the reader, the reader could do what
112it wants with it, as long as a writer has left that page.
113
114A sample of how the reader page is swapped: Note this does not
115show the head page in the buffer, it is for demonstrating a swap
116only.
117
118 +------+
119 |reader| RING BUFFER
120 |page |
121 +------+
122 +---+ +---+ +---+
123 | |-->| |-->| |
124 | |<--| |<--| |
125 +---+ +---+ +---+
126 ^ | ^ |
127 | +-------------+ |
128 +-----------------+
129
130
131 +------+
132 |reader| RING BUFFER
133 |page |-------------------+
134 +------+ v
135 | +---+ +---+ +---+
136 | | |-->| |-->| |
137 | | |<--| |<--| |<-+
138 | +---+ +---+ +---+ |
139 | ^ | ^ | |
140 | | +-------------+ | |
141 | +-----------------+ |
142 +------------------------------------+
143
144 +------+
145 |reader| RING BUFFER
146 |page |-------------------+
147 +------+ <---------------+ v
148 | ^ +---+ +---+ +---+
149 | | | |-->| |-->| |
150 | | | | | |<--| |<-+
151 | | +---+ +---+ +---+ |
152 | | | ^ | |
153 | | +-------------+ | |
154 | +-----------------------------+ |
155 +------------------------------------+
156
157 +------+
158 |buffer| RING BUFFER
159 |page |-------------------+
160 +------+ <---------------+ v
161 | ^ +---+ +---+ +---+
162 | | | | | |-->| |
163 | | New | | | |<--| |<-+
164 | | Reader +---+ +---+ +---+ |
165 | | page ----^ | |
166 | | | |
167 | +-----------------------------+ |
168 +------------------------------------+
169
170
171
172It is possible that the page swapped is the commit page and the tail page,
173if what is in the ring buffer is less than what is held in a buffer page.
174
175
176 reader page commit page tail page
177 | | |
178 v | |
179 +---+ | |
180 | |<----------+ |
181 | |<------------------------+
182 | |------+
183 +---+ |
184 |
185 v
186 +---+ +---+ +---+ +---+
187<---| |--->| |--->| |--->| |--->
188--->| |<---| |<---| |<---| |<---
189 +---+ +---+ +---+ +---+
190
191This case is still valid for this algorithm.
192When the writer leaves the page, it simply goes into the ring buffer
193since the reader page still points to the next location in the ring
194buffer.
195
196
197The main pointers:
198
199 reader page - The page used solely by the reader and is not part
200 of the ring buffer (may be swapped in)
201
202 head page - the next page in the ring buffer that will be swapped
203 with the reader page.
204
205 tail page - the page where the next write will take place.
206
207 commit page - the page that last finished a write.
208
209The commit page only is updated by the outer most writer in the
210writer stack. A writer that preempts another writer will not move the
211commit page.
212
213When data is written into the ring buffer, a position is reserved
214in the ring buffer and passed back to the writer. When the writer
215is finished writing data into that position, it commits the write.
216
217Another write (or a read) may take place at anytime during this
218transaction. If another write happens it must finish before continuing
219with the previous write.
220
221
222 Write reserve:
223
224 Buffer page
225 +---------+
226 |written |
227 +---------+ <--- given back to writer (current commit)
228 |reserved |
229 +---------+ <--- tail pointer
230 | empty |
231 +---------+
232
233 Write commit:
234
235 Buffer page
236 +---------+
237 |written |
238 +---------+
239 |written |
240 +---------+ <--- next positon for write (current commit)
241 | empty |
242 +---------+
243
244
245 If a write happens after the first reserve:
246
247 Buffer page
248 +---------+
249 |written |
250 +---------+ <-- current commit
251 |reserved |
252 +---------+ <--- given back to second writer
253 |reserved |
254 +---------+ <--- tail pointer
255
256 After second writer commits:
257
258
259 Buffer page
260 +---------+
261 |written |
262 +---------+ <--(last full commit)
263 |reserved |
264 +---------+
265 |pending |
266 |commit |
267 +---------+ <--- tail pointer
268
269 When the first writer commits:
270
271 Buffer page
272 +---------+
273 |written |
274 +---------+
275 |written |
276 +---------+
277 |written |
278 +---------+ <--(last full commit and tail pointer)
279
280
281The commit pointer points to the last write location that was
282committed without preempting another write. When a write that
283preempted another write is committed, it only becomes a pending commit
284and will not be a full commit till all writes have been committed.
285
286The commit page points to the page that has the last full commit.
287The tail page points to the page with the last write (before
288committing).
289
290The tail page is always equal to or after the commit page. It may
291be several pages ahead. If the tail page catches up to the commit
292page then no more writes may take place (regardless of the mode
293of the ring buffer: overwrite and produce/consumer).
294
295The order of pages are:
296
297 head page
298 commit page
299 tail page
300
301Possible scenario:
302 tail page
303 head page commit page |
304 | | |
305 v v v
306 +---+ +---+ +---+ +---+
307<---| |--->| |--->| |--->| |--->
308--->| |<---| |<---| |<---| |<---
309 +---+ +---+ +---+ +---+
310
311There is a special case that the head page is after either the commit page
312and possibly the tail page. That is when the commit (and tail) page has been
313swapped with the reader page. This is because the head page is always
314part of the ring buffer, but the reader page is not. When ever there
315has been less than a full page that has been committed inside the ring buffer,
316and a reader swaps out a page, it will be swapping out the commit page.
317
318
319 reader page commit page tail page
320 | | |
321 v | |
322 +---+ | |
323 | |<----------+ |
324 | |<------------------------+
325 | |------+
326 +---+ |
327 |
328 v
329 +---+ +---+ +---+ +---+
330<---| |--->| |--->| |--->| |--->
331--->| |<---| |<---| |<---| |<---
332 +---+ +---+ +---+ +---+
333 ^
334 |
335 head page
336
337
338In this case, the head page will not move when the tail and commit
339move back into the ring buffer.
340
341The reader can not swap a page into the ring buffer if the commit page
342is still on that page. If the read meets the last commit (real commit
343not pending or reserved), then there is nothing more to read.
344The buffer is considered empty until another full commit finishes.
345
346When the tail meets the head page, if the buffer is in overwrite mode,
347the head page will be pushed ahead one. If the buffer is in producer/consumer
348mode, the write will fail.
349
350Overwrite mode:
351
352 tail page
353 |
354 v
355 +---+ +---+ +---+ +---+
356<---| |--->| |--->| |--->| |--->
357--->| |<---| |<---| |<---| |<---
358 +---+ +---+ +---+ +---+
359 ^
360 |
361 head page
362
363
364 tail page
365 |
366 v
367 +---+ +---+ +---+ +---+
368<---| |--->| |--->| |--->| |--->
369--->| |<---| |<---| |<---| |<---
370 +---+ +---+ +---+ +---+
371 ^
372 |
373 head page
374
375
376 tail page
377 |
378 v
379 +---+ +---+ +---+ +---+
380<---| |--->| |--->| |--->| |--->
381--->| |<---| |<---| |<---| |<---
382 +---+ +---+ +---+ +---+
383 ^
384 |
385 head page
386
387Note, the reader page will still point to the previous head page.
388But when a swap takes place, it will use the most recent head page.
389
390
391Making the Ring Buffer Lockless:
392--------------------------------
393
394The main idea behind the lockless algorithm is to combine the moving
395of the head_page pointer with the swapping of pages with the reader.
396State flags are placed inside the pointer to the page. To do this,
397each page must be aligned in memory by 4 bytes. This will allow the 2
398least significant bits of the address to be used as flags. Since
399they will always be zero for the address. To get the address,
400simply mask out the flags.
401
402 MASK = ~3
403
404 address & MASK
405
406Two flags will be kept by these two bits:
407
408 HEADER - the page being pointed to is a head page
409
410 UPDATE - the page being pointed to is being updated by a writer
411 and was or is about to be a head page.
412
413
414 reader page
415 |
416 v
417 +---+
418 | |------+
419 +---+ |
420 |
421 v
422 +---+ +---+ +---+ +---+
423<---| |--->| |-H->| |--->| |--->
424--->| |<---| |<---| |<---| |<---
425 +---+ +---+ +---+ +---+
426
427
428The above pointer "-H->" would have the HEADER flag set. That is
429the next page is the next page to be swapped out by the reader.
430This pointer means the next page is the head page.
431
432When the tail page meets the head pointer, it will use cmpxchg to
433change the pointer to the UPDATE state:
434
435
436 tail page
437 |
438 v
439 +---+ +---+ +---+ +---+
440<---| |--->| |-H->| |--->| |--->
441--->| |<---| |<---| |<---| |<---
442 +---+ +---+ +---+ +---+
443
444 tail page
445 |
446 v
447 +---+ +---+ +---+ +---+
448<---| |--->| |-U->| |--->| |--->
449--->| |<---| |<---| |<---| |<---
450 +---+ +---+ +---+ +---+
451
452"-U->" represents a pointer in the UPDATE state.
453
454Any access to the reader will need to take some sort of lock to serialize
455the readers. But the writers will never take a lock to write to the
456ring buffer. This means we only need to worry about a single reader,
457and writes only preempt in "stack" formation.
458
459When the reader tries to swap the page with the ring buffer, it
460will also use cmpxchg. If the flag bit in the pointer to the
461head page does not have the HEADER flag set, the compare will fail
462and the reader will need to look for the new head page and try again.
463Note, the flag UPDATE and HEADER are never set at the same time.
464
465The reader swaps the reader page as follows:
466
467 +------+
468 |reader| RING BUFFER
469 |page |
470 +------+
471 +---+ +---+ +---+
472 | |--->| |--->| |
473 | |<---| |<---| |
474 +---+ +---+ +---+
475 ^ | ^ |
476 | +---------------+ |
477 +-----H-------------+
478
479The reader sets the reader page next pointer as HEADER to the page after
480the head page.
481
482
483 +------+
484 |reader| RING BUFFER
485 |page |-------H-----------+
486 +------+ v
487 | +---+ +---+ +---+
488 | | |--->| |--->| |
489 | | |<---| |<---| |<-+
490 | +---+ +---+ +---+ |
491 | ^ | ^ | |
492 | | +---------------+ | |
493 | +-----H-------------+ |
494 +--------------------------------------+
495
496It does a cmpxchg with the pointer to the previous head page to make it
497point to the reader page. Note that the new pointer does not have the HEADER
498flag set. This action atomically moves the head page forward.
499
500 +------+
501 |reader| RING BUFFER
502 |page |-------H-----------+
503 +------+ v
504 | ^ +---+ +---+ +---+
505 | | | |-->| |-->| |
506 | | | |<--| |<--| |<-+
507 | | +---+ +---+ +---+ |
508 | | | ^ | |
509 | | +-------------+ | |
510 | +-----------------------------+ |
511 +------------------------------------+
512
513After the new head page is set, the previous pointer of the head page is
514updated to the reader page.
515
516 +------+
517 |reader| RING BUFFER
518 |page |-------H-----------+
519 +------+ <---------------+ v
520 | ^ +---+ +---+ +---+
521 | | | |-->| |-->| |
522 | | | | | |<--| |<-+
523 | | +---+ +---+ +---+ |
524 | | | ^ | |
525 | | +-------------+ | |
526 | +-----------------------------+ |
527 +------------------------------------+
528
529 +------+
530 |buffer| RING BUFFER
531 |page |-------H-----------+ <--- New head page
532 +------+ <---------------+ v
533 | ^ +---+ +---+ +---+
534 | | | | | |-->| |
535 | | New | | | |<--| |<-+
536 | | Reader +---+ +---+ +---+ |
537 | | page ----^ | |
538 | | | |
539 | +-----------------------------+ |
540 +------------------------------------+
541
542Another important point. The page that the reader page points back to
543by its previous pointer (the one that now points to the new head page)
544never points back to the reader page. That is because the reader page is
545not part of the ring buffer. Traversing the ring buffer via the next pointers
546will always stay in the ring buffer. Traversing the ring buffer via the
547prev pointers may not.
548
549Note, the way to determine a reader page is simply by examining the previous
550pointer of the page. If the next pointer of the previous page does not
551point back to the original page, then the original page is a reader page:
552
553
554 +--------+
555 | reader | next +----+
556 | page |-------->| |<====== (buffer page)
557 +--------+ +----+
558 | | ^
559 | v | next
560 prev | +----+
561 +------------->| |
562 +----+
563
564The way the head page moves forward:
565
566When the tail page meets the head page and the buffer is in overwrite mode
567and more writes take place, the head page must be moved forward before the
568writer may move the tail page. The way this is done is that the writer
569performs a cmpxchg to convert the pointer to the head page from the HEADER
570flag to have the UPDATE flag set. Once this is done, the reader will
571not be able to swap the head page from the buffer, nor will it be able to
572move the head page, until the writer is finished with the move.
573
574This eliminates any races that the reader can have on the writer. The reader
575must spin, and this is why the reader can not preempt the writer.
576
577 tail page
578 |
579 v
580 +---+ +---+ +---+ +---+
581<---| |--->| |-H->| |--->| |--->
582--->| |<---| |<---| |<---| |<---
583 +---+ +---+ +---+ +---+
584
585 tail page
586 |
587 v
588 +---+ +---+ +---+ +---+
589<---| |--->| |-U->| |--->| |--->
590--->| |<---| |<---| |<---| |<---
591 +---+ +---+ +---+ +---+
592
593The following page will be made into the new head page.
594
595 tail page
596 |
597 v
598 +---+ +---+ +---+ +---+
599<---| |--->| |-U->| |-H->| |--->
600--->| |<---| |<---| |<---| |<---
601 +---+ +---+ +---+ +---+
602
603After the new head page has been set, we can set the old head page
604pointer back to NORMAL.
605
606 tail page
607 |
608 v
609 +---+ +---+ +---+ +---+
610<---| |--->| |--->| |-H->| |--->
611--->| |<---| |<---| |<---| |<---
612 +---+ +---+ +---+ +---+
613
614After the head page has been moved, the tail page may now move forward.
615
616 tail page
617 |
618 v
619 +---+ +---+ +---+ +---+
620<---| |--->| |--->| |-H->| |--->
621--->| |<---| |<---| |<---| |<---
622 +---+ +---+ +---+ +---+
623
624
625The above are the trivial updates. Now for the more complex scenarios.
626
627
628As stated before, if enough writes preempt the first write, the
629tail page may make it all the way around the buffer and meet the commit
630page. At this time, we must start dropping writes (usually with some kind
631of warning to the user). But what happens if the commit was still on the
632reader page? The commit page is not part of the ring buffer. The tail page
633must account for this.
634
635
636 reader page commit page
637 | |
638 v |
639 +---+ |
640 | |<----------+
641 | |
642 | |------+
643 +---+ |
644 |
645 v
646 +---+ +---+ +---+ +---+
647<---| |--->| |-H->| |--->| |--->
648--->| |<---| |<---| |<---| |<---
649 +---+ +---+ +---+ +---+
650 ^
651 |
652 tail page
653
654If the tail page were to simply push the head page forward, the commit when
655leaving the reader page would not be pointing to the correct page.
656
657The solution to this is to test if the commit page is on the reader page
658before pushing the head page. If it is, then it can be assumed that the
659tail page wrapped the buffer, and we must drop new writes.
660
661This is not a race condition, because the commit page can only be moved
662by the outter most writer (the writer that was preempted).
663This means that the commit will not move while a writer is moving the
664tail page. The reader can not swap the reader page if it is also being
665used as the commit page. The reader can simply check that the commit
666is off the reader page. Once the commit page leaves the reader page
667it will never go back on it unless a reader does another swap with the
668buffer page that is also the commit page.
669
670
671Nested writes
672-------------
673
674In the pushing forward of the tail page we must first push forward
675the head page if the head page is the next page. If the head page
676is not the next page, the tail page is simply updated with a cmpxchg.
677
678Only writers move the tail page. This must be done atomically to protect
679against nested writers.
680
681 temp_page = tail_page
682 next_page = temp_page->next
683 cmpxchg(tail_page, temp_page, next_page)
684
685The above will update the tail page if it is still pointing to the expected
686page. If this fails, a nested write pushed it forward, the the current write
687does not need to push it.
688
689
690 temp page
691 |
692 v
693 tail page
694 |
695 v
696 +---+ +---+ +---+ +---+
697<---| |--->| |--->| |--->| |--->
698--->| |<---| |<---| |<---| |<---
699 +---+ +---+ +---+ +---+
700
701Nested write comes in and moves the tail page forward:
702
703 tail page (moved by nested writer)
704 temp page |
705 | |
706 v v
707 +---+ +---+ +---+ +---+
708<---| |--->| |--->| |--->| |--->
709--->| |<---| |<---| |<---| |<---
710 +---+ +---+ +---+ +---+
711
712The above would fail the cmpxchg, but since the tail page has already
713been moved forward, the writer will just try again to reserve storage
714on the new tail page.
715
716But the moving of the head page is a bit more complex.
717
718 tail page
719 |
720 v
721 +---+ +---+ +---+ +---+
722<---| |--->| |-H->| |--->| |--->
723--->| |<---| |<---| |<---| |<---
724 +---+ +---+ +---+ +---+
725
726The write converts the head page pointer to UPDATE.
727
728 tail page
729 |
730 v
731 +---+ +---+ +---+ +---+
732<---| |--->| |-U->| |--->| |--->
733--->| |<---| |<---| |<---| |<---
734 +---+ +---+ +---+ +---+
735
736But if a nested writer preempts here. It will see that the next
737page is a head page, but it is also nested. It will detect that
738it is nested and will save that information. The detection is the
739fact that it sees the UPDATE flag instead of a HEADER or NORMAL
740pointer.
741
742The nested writer will set the new head page pointer.
743
744 tail page
745 |
746 v
747 +---+ +---+ +---+ +---+
748<---| |--->| |-U->| |-H->| |--->
749--->| |<---| |<---| |<---| |<---
750 +---+ +---+ +---+ +---+
751
752But it will not reset the update back to normal. Only the writer
753that converted a pointer from HEAD to UPDATE will convert it back
754to NORMAL.
755
756 tail page
757 |
758 v
759 +---+ +---+ +---+ +---+
760<---| |--->| |-U->| |-H->| |--->
761--->| |<---| |<---| |<---| |<---
762 +---+ +---+ +---+ +---+
763
764After the nested writer finishes, the outer most writer will convert
765the UPDATE pointer to NORMAL.
766
767
768 tail page
769 |
770 v
771 +---+ +---+ +---+ +---+
772<---| |--->| |--->| |-H->| |--->
773--->| |<---| |<---| |<---| |<---
774 +---+ +---+ +---+ +---+
775
776
777It can be even more complex if several nested writes came in and moved
778the tail page ahead several pages:
779
780
781(first writer)
782
783 tail page
784 |
785 v
786 +---+ +---+ +---+ +---+
787<---| |--->| |-H->| |--->| |--->
788--->| |<---| |<---| |<---| |<---
789 +---+ +---+ +---+ +---+
790
791The write converts the head page pointer to UPDATE.
792
793 tail page
794 |
795 v
796 +---+ +---+ +---+ +---+
797<---| |--->| |-U->| |--->| |--->
798--->| |<---| |<---| |<---| |<---
799 +---+ +---+ +---+ +---+
800
801Next writer comes in, and sees the update and sets up the new
802head page.
803
804(second writer)
805
806 tail page
807 |
808 v
809 +---+ +---+ +---+ +---+
810<---| |--->| |-U->| |-H->| |--->
811--->| |<---| |<---| |<---| |<---
812 +---+ +---+ +---+ +---+
813
814The nested writer moves the tail page forward. But does not set the old
815update page to NORMAL because it is not the outer most writer.
816
817 tail page
818 |
819 v
820 +---+ +---+ +---+ +---+
821<---| |--->| |-U->| |-H->| |--->
822--->| |<---| |<---| |<---| |<---
823 +---+ +---+ +---+ +---+
824
825Another writer preempts and sees the page after the tail page is a head page.
826It changes it from HEAD to UPDATE.
827
828(third writer)
829
830 tail page
831 |
832 v
833 +---+ +---+ +---+ +---+
834<---| |--->| |-U->| |-U->| |--->
835--->| |<---| |<---| |<---| |<---
836 +---+ +---+ +---+ +---+
837
838The writer will move the head page forward:
839
840
841(third writer)
842
843 tail page
844 |
845 v
846 +---+ +---+ +---+ +---+
847<---| |--->| |-U->| |-U->| |-H->
848--->| |<---| |<---| |<---| |<---
849 +---+ +---+ +---+ +---+
850
851But now that the third writer did change the HEAD flag to UPDATE it
852will convert it to normal:
853
854
855(third writer)
856
857 tail page
858 |
859 v
860 +---+ +---+ +---+ +---+
861<---| |--->| |-U->| |--->| |-H->
862--->| |<---| |<---| |<---| |<---
863 +---+ +---+ +---+ +---+
864
865
866Then it will move the tail page, and return back to the second writer.
867
868
869(second writer)
870
871 tail page
872 |
873 v
874 +---+ +---+ +---+ +---+
875<---| |--->| |-U->| |--->| |-H->
876--->| |<---| |<---| |<---| |<---
877 +---+ +---+ +---+ +---+
878
879
880The second writer will fail to move the tail page because it was already
881moved, so it will try again and add its data to the new tail page.
882It will return to the first writer.
883
884
885(first writer)
886
887 tail page
888 |
889 v
890 +---+ +---+ +---+ +---+
891<---| |--->| |-U->| |--->| |-H->
892--->| |<---| |<---| |<---| |<---
893 +---+ +---+ +---+ +---+
894
895The first writer can not know atomically test if the tail page moved
896while it updates the HEAD page. It will then update the head page to
897what it thinks is the new head page.
898
899
900(first writer)
901
902 tail page
903 |
904 v
905 +---+ +---+ +---+ +---+
906<---| |--->| |-U->| |-H->| |-H->
907--->| |<---| |<---| |<---| |<---
908 +---+ +---+ +---+ +---+
909
910Since the cmpxchg returns the old value of the pointer the first writer
911will see it succeeded in updating the pointer from NORMAL to HEAD.
912But as we can see, this is not good enough. It must also check to see
913if the tail page is either where it use to be or on the next page:
914
915
916(first writer)
917
918 A B tail page
919 | | |
920 v v v
921 +---+ +---+ +---+ +---+
922<---| |--->| |-U->| |-H->| |-H->
923--->| |<---| |<---| |<---| |<---
924 +---+ +---+ +---+ +---+
925
926If tail page != A and tail page does not equal B, then it must reset the
927pointer back to NORMAL. The fact that it only needs to worry about
928nested writers, it only needs to check this after setting the HEAD page.
929
930
931(first writer)
932
933 A B tail page
934 | | |
935 v v v
936 +---+ +---+ +---+ +---+
937<---| |--->| |-U->| |--->| |-H->
938--->| |<---| |<---| |<---| |<---
939 +---+ +---+ +---+ +---+
940
941Now the writer can update the head page. This is also why the head page must
942remain in UPDATE and only reset by the outer most writer. This prevents
943the reader from seeing the incorrect head page.
944
945
946(first writer)
947
948 A B tail page
949 | | |
950 v v v
951 +---+ +---+ +---+ +---+
952<---| |--->| |--->| |--->| |-H->
953--->| |<---| |<---| |<---| |<---
954 +---+ +---+ +---+ +---+
955
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 000000000000..5eb4e487e667
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
1 Notes on Analysing Behaviour Using Events and Tracepoints
2
3 Documentation written by Mel Gorman
4 PCL information heavily based on email from Ingo Molnar
5
61. Introduction
7===============
8
9Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
10creating custom kernel modules to register probe functions using the event
11tracing infrastructure.
12
13Simplistically, tracepoints will represent an important event that when can
14be taken in conjunction with other tracepoints to build a "Big Picture" of
15what is going on within the system. There are a large number of methods for
16gathering and interpreting these events. Lacking any current Best Practises,
17this document describes some of the methods that can be used.
18
19This document assumes that debugfs is mounted on /sys/kernel/debug and that
20the appropriate tracing options have been configured into the kernel. It is
21assumed that the PCL tool tools/perf has been installed and is in your path.
22
232. Listing Available Events
24===========================
25
262.1 Standard Utilities
27----------------------
28
29All possible events are visible from /sys/kernel/debug/tracing/events. Simply
30calling
31
32 $ find /sys/kernel/debug/tracing/events -type d
33
34will give a fair indication of the number of events available.
35
362.2 PCL
37-------
38
39Discovery and enumeration of all counters and events, including tracepoints
40are available with the perf tool. Getting a list of available events is a
41simple case of
42
43 $ perf list 2>&1 | grep Tracepoint
44 ext4:ext4_free_inode [Tracepoint event]
45 ext4:ext4_request_inode [Tracepoint event]
46 ext4:ext4_allocate_inode [Tracepoint event]
47 ext4:ext4_write_begin [Tracepoint event]
48 ext4:ext4_ordered_write_end [Tracepoint event]
49 [ .... remaining output snipped .... ]
50
51
522. Enabling Events
53==================
54
552.1 System-Wide Event Enabling
56------------------------------
57
58See Documentation/trace/events.txt for a proper description on how events
59can be enabled system-wide. A short example of enabling all events related
60to page allocation would look something like
61
62 $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
63
642.2 System-Wide Event Enabling with SystemTap
65---------------------------------------------
66
67In SystemTap, tracepoints are accessible using the kernel.trace() function
68call. The following is an example that reports every 5 seconds what processes
69were allocating the pages.
70
71 global page_allocs
72
73 probe kernel.trace("mm_page_alloc") {
74 page_allocs[execname()]++
75 }
76
77 function print_count() {
78 printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
79 foreach (proc in page_allocs-)
80 printf("%-25d %s\n", page_allocs[proc], proc)
81 printf ("\n")
82 delete page_allocs
83 }
84
85 probe timer.s(5) {
86 print_count()
87 }
88
892.3 System-Wide Event Enabling with PCL
90---------------------------------------
91
92By specifying the -a switch and analysing sleep, the system-wide events
93for a duration of time can be examined.
94
95 $ perf stat -a \
96 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
97 -e kmem:mm_pagevec_free \
98 sleep 10
99 Performance counter stats for 'sleep 10':
100
101 9630 kmem:mm_page_alloc
102 2143 kmem:mm_page_free_direct
103 7424 kmem:mm_pagevec_free
104
105 10.002577764 seconds time elapsed
106
107Similarly, one could execute a shell and exit it as desired to get a report
108at that point.
109
1102.4 Local Event Enabling
111------------------------
112
113Documentation/trace/ftrace.txt describes how to enable events on a per-thread
114basis using set_ftrace_pid.
115
1162.5 Local Event Enablement with PCL
117-----------------------------------
118
119Events can be activate and tracked for the duration of a process on a local
120basis using PCL such as follows.
121
122 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
123 -e kmem:mm_pagevec_free ./hackbench 10
124 Time: 0.909
125
126 Performance counter stats for './hackbench 10':
127
128 17803 kmem:mm_page_alloc
129 12398 kmem:mm_page_free_direct
130 4827 kmem:mm_pagevec_free
131
132 0.973913387 seconds time elapsed
133
1343. Event Filtering
135==================
136
137Documentation/trace/ftrace.txt covers in-depth how to filter events in
138ftrace. Obviously using grep and awk of trace_pipe is an option as well
139as any script reading trace_pipe.
140
1414. Analysing Event Variances with PCL
142=====================================
143
144Any workload can exhibit variances between runs and it can be important
145to know what the standard deviation in. By and large, this is left to the
146performance analyst to do it by hand. In the event that the discrete event
147occurrences are useful to the performance analyst, then perf can be used.
148
149 $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
150 -e kmem:mm_pagevec_free ./hackbench 10
151 Time: 0.890
152 Time: 0.895
153 Time: 0.915
154 Time: 1.001
155 Time: 0.899
156
157 Performance counter stats for './hackbench 10' (5 runs):
158
159 16630 kmem:mm_page_alloc ( +- 3.542% )
160 11486 kmem:mm_page_free_direct ( +- 4.771% )
161 4730 kmem:mm_pagevec_free ( +- 2.325% )
162
163 0.982653002 seconds time elapsed ( +- 1.448% )
164
165In the event that some higher-level event is required that depends on some
166aggregation of discrete events, then a script would need to be developed.
167
168Using --repeat, it is also possible to view how events are fluctuating over
169time on a system wide basis using -a and sleep.
170
171 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
172 -e kmem:mm_pagevec_free \
173 -a --repeat 10 \
174 sleep 1
175 Performance counter stats for 'sleep 1' (10 runs):
176
177 1066 kmem:mm_page_alloc ( +- 26.148% )
178 182 kmem:mm_page_free_direct ( +- 5.464% )
179 890 kmem:mm_pagevec_free ( +- 30.079% )
180
181 1.002251757 seconds time elapsed ( +- 0.005% )
182
1835. Higher-Level Analysis with Helper Scripts
184============================================
185
186When events are enabled the events that are triggering can be read from
187/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
188options exist as well. By post-processing the output, further information can
189be gathered on-line as appropriate. Examples of post-processing might include
190
191 o Reading information from /proc for the PID that triggered the event
192 o Deriving a higher-level event from a series of lower-level events.
193 o Calculate latencies between two events
194
195Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
196script that can read trace_pipe from STDIN or a copy of a trace. When used
197on-line, it can be interrupted once to generate a report without existing
198and twice to exit.
199
200Simplistically, the script just reads STDIN and counts up events but it
201also can do more such as
202
203 o Derive high-level events from many low-level events. If a number of pages
204 are freed to the main allocator from the per-CPU lists, it recognises
205 that as one per-CPU drain even though there is no specific tracepoint
206 for that event
207 o It can aggregate based on PID or individual process number
208 o In the event memory is getting externally fragmented, it reports
209 on whether the fragmentation event was severe or moderate.
210 o When receiving an event about a PID, it can record who the parent was so
211 that if large numbers of events are coming from very short-lived
212 processes, the parent process responsible for creating all the helpers
213 can be identified
214
2156. Lower-Level Analysis with PCL
216================================
217
218There may also be a requirement to identify what functions with a program
219were generating events within the kernel. To begin this sort of analysis, the
220data must be recorded. At the time of writing, this required root
221
222 $ perf record -c 1 \
223 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
224 -e kmem:mm_pagevec_free \
225 ./hackbench 10
226 Time: 0.894
227 [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
228
229Note the use of '-c 1' to set the event period to sample. The default sample
230period is quite high to minimise overhead but the information collected can be
231very coarse as a result.
232
233This record outputted a file called perf.data which can be analysed using
234perf report.
235
236 $ perf report
237 # Samples: 30922
238 #
239 # Overhead Command Shared Object
240 # ........ ......... ................................
241 #
242 87.27% hackbench [vdso]
243 6.85% hackbench /lib/i686/cmov/libc-2.9.so
244 2.62% hackbench /lib/ld-2.9.so
245 1.52% perf [vdso]
246 1.22% hackbench ./hackbench
247 0.48% hackbench [kernel]
248 0.02% perf /lib/i686/cmov/libc-2.9.so
249 0.01% perf /usr/bin/perf
250 0.01% perf /lib/ld-2.9.so
251 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
252 #
253 # (For more details, try: perf report --sort comm,dso,symbol)
254 #
255
256According to this, the vast majority of events occured triggered on events
257within the VDSO. With simple binaries, this will often be the case so lets
258take a slightly different example. In the course of writing this, it was
259noticed that X was generating an insane amount of page allocations so lets look
260at it
261
262 $ perf record -c 1 -f \
263 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
264 -e kmem:mm_pagevec_free \
265 -p `pidof X`
266
267This was interrupted after a few seconds and
268
269 $ perf report
270 # Samples: 27666
271 #
272 # Overhead Command Shared Object
273 # ........ ....... .......................................
274 #
275 51.95% Xorg [vdso]
276 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
277 0.09% Xorg /lib/i686/cmov/libc-2.9.so
278 0.01% Xorg [kernel]
279 #
280 # (For more details, try: perf report --sort comm,dso,symbol)
281 #
282
283So, almost half of the events are occuring in a library. To get an idea which
284symbol.
285
286 $ perf report --sort comm,dso,symbol
287 # Samples: 27666
288 #
289 # Overhead Command Shared Object Symbol
290 # ........ ....... ....................................... ......
291 #
292 51.95% Xorg [vdso] [.] 0x000000ffffe424
293 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
294 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
295 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
296 0.01% Xorg [kernel] [k] read_hpet
297 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
298 0.00% Xorg [kernel] [k] ftrace_trace_userstack
299
300To see where within the function pixmanFillsse2 things are going wrong
301
302 $ perf annotate pixmanFillsse2
303 [ ... ]
304 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
305 : }
306 :
307 : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
308 : _mm_store_si128 (__m128i *__P, __m128i __B) : {
309 : *__P = __B;
310 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
311 0.00 : 34ef5: ff
312 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
313 0.00 : 34efd: ff
314 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
315 0.00 : 34f05: ff
316 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
317 0.00 : 34f0d: ff
318 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
319 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
320 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
321 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
322
323At a glance, it looks like the time is being spent copying pixmaps to
324the card. Further investigation would be needed to determine why pixmaps
325are being copied around so much but a starting point would be to take an
326ancient build of libpixmap out of the library path where it was totally
327forgotten about from months ago!