aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/trace
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/trace')
-rw-r--r--Documentation/trace/events-kmem.txt107
-rw-r--r--Documentation/trace/events.txt210
-rw-r--r--Documentation/trace/ftrace-design.txt233
-rw-r--r--Documentation/trace/ftrace.txt8
-rw-r--r--Documentation/trace/postprocess/trace-pagealloc-postprocess.pl418
-rw-r--r--Documentation/trace/power.txt17
-rw-r--r--Documentation/trace/tracepoint-analysis.txt327
7 files changed, 1288 insertions, 32 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 000000000000..6ef2a8652e17
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
1 Subsystem Trace Points: kmem
2
3The tracing system kmem captures events related to object and page allocation
4within the kernel. Broadly speaking there are four major subheadings.
5
6 o Slab allocation of small objects of unknown type (kmalloc)
7 o Slab allocation of small objects of known type
8 o Page allocation
9 o Per-CPU Allocator Activity
10 o External Fragmentation
11
12This document will describe what each of the tracepoints are and why they
13might be useful.
14
151. Slab allocation of small objects of unknown type
16===================================================
17kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
18kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
19kfree call_site=%lx ptr=%p
20
21Heavy activity for these events may indicate that a specific cache is
22justified, particularly if kmalloc slab pages are getting significantly
23internal fragmented as a result of the allocation pattern. By correlating
24kmalloc with kfree, it may be possible to identify memory leaks and where
25the allocation sites were.
26
27
282. Slab allocation of small objects of known type
29=================================================
30kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
31kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
32kmem_cache_free call_site=%lx ptr=%p
33
34These events are similar in usage to the kmalloc-related events except that
35it is likely easier to pin the event down to a specific cache. At the time
36of writing, no information is available on what slab is being allocated from,
37but the call_site can usually be used to extrapolate that information
38
393. Page allocation
40==================
41mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
42mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
43mm_page_free_direct page=%p pfn=%lu order=%d
44mm_pagevec_free page=%p pfn=%lu order=%d cold=%d
45
46These four events deal with page allocation and freeing. mm_page_alloc is
47a simple indicator of page allocator activity. Pages may be allocated from
48the per-CPU allocator (high performance) or the buddy allocator.
49
50If pages are allocated directly from the buddy allocator, the
51mm_page_alloc_zone_locked event is triggered. This event is important as high
52amounts of activity imply high activity on the zone->lock. Taking this lock
53impairs performance by disabling interrupts, dirtying cache lines between
54CPUs and serialising many CPUs.
55
56When a page is freed directly by the caller, the mm_page_free_direct event
57is triggered. Significant amounts of activity here could indicate that the
58callers should be batching their activities.
59
60When pages are freed using a pagevec, the mm_pagevec_free is
61triggered. Broadly speaking, pages are taken off the LRU lock in bulk and
62freed in batch with a pagevec. Significant amounts of activity here could
63indicate that the system is under memory pressure and can also indicate
64contention on the zone->lru_lock.
65
664. Per-CPU Allocator Activity
67=============================
68mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
69mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d
70
71In front of the page allocator is a per-cpu page allocator. It exists only
72for order-0 pages, reduces contention on the zone->lock and reduces the
73amount of writing on struct page.
74
75When a per-CPU list is empty or pages of the wrong type are allocated,
76the zone->lock will be taken once and the per-CPU list refilled. The event
77triggered is mm_page_alloc_zone_locked for each page allocated with the
78event indicating whether it is for a percpu_refill or not.
79
80When the per-CPU list is too full, a number of pages are freed, each one
81which triggers a mm_page_pcpu_drain event.
82
83The individual nature of the events are so that pages can be tracked
84between allocation and freeing. A number of drain or refill pages that occur
85consecutively imply the zone->lock being taken once. Large amounts of PCP
86refills and drains could imply an imbalance between CPUs where too much work
87is being concentrated in one place. It could also indicate that the per-CPU
88lists should be a larger size. Finally, large amounts of refills on one CPU
89and drains on another could be a factor in causing large amounts of cache
90line bounces due to writes between CPUs and worth investigating if pages
91can be allocated and freed on the same CPU through some algorithm change.
92
935. External Fragmentation
94=========================
95mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
96
97External fragmentation affects whether a high-order allocation will be
98successful or not. For some types of hardware, this is important although
99it is avoided where possible. If the system is using huge pages and needs
100to be able to resize the pool over the lifetime of the system, this value
101is important.
102
103Large numbers of this event implies that memory is fragmenting and
104high-order allocations will start failing at some time in the future. One
105means of reducing the occurange of this event is to increase the size of
106min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
107pageblock_size is usually the size of the default hugepage size.
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
index 2bcc8d4dea29..02ac6ed38b2d 100644
--- a/Documentation/trace/events.txt
+++ b/Documentation/trace/events.txt
@@ -1,7 +1,7 @@
1 Event Tracing 1 Event Tracing
2 2
3 Documentation written by Theodore Ts'o 3 Documentation written by Theodore Ts'o
4 Updated by Li Zefan 4 Updated by Li Zefan and Tom Zanussi
5 5
61. Introduction 61. Introduction
7=============== 7===============
@@ -22,12 +22,12 @@ tracing information should be printed.
22--------------------------------- 22---------------------------------
23 23
24The events which are available for tracing can be found in the file 24The events which are available for tracing can be found in the file
25/debug/tracing/available_events. 25/sys/kernel/debug/tracing/available_events.
26 26
27To enable a particular event, such as 'sched_wakeup', simply echo it 27To enable a particular event, such as 'sched_wakeup', simply echo it
28to /debug/tracing/set_event. For example: 28to /sys/kernel/debug/tracing/set_event. For example:
29 29
30 # echo sched_wakeup >> /debug/tracing/set_event 30 # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event
31 31
32[ Note: '>>' is necessary, otherwise it will firstly disable 32[ Note: '>>' is necessary, otherwise it will firstly disable
33 all the events. ] 33 all the events. ]
@@ -35,15 +35,15 @@ to /debug/tracing/set_event. For example:
35To disable an event, echo the event name to the set_event file prefixed 35To disable an event, echo the event name to the set_event file prefixed
36with an exclamation point: 36with an exclamation point:
37 37
38 # echo '!sched_wakeup' >> /debug/tracing/set_event 38 # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event
39 39
40To disable all events, echo an empty line to the set_event file: 40To disable all events, echo an empty line to the set_event file:
41 41
42 # echo > /debug/tracing/set_event 42 # echo > /sys/kernel/debug/tracing/set_event
43 43
44To enable all events, echo '*:*' or '*:' to the set_event file: 44To enable all events, echo '*:*' or '*:' to the set_event file:
45 45
46 # echo *:* > /debug/tracing/set_event 46 # echo *:* > /sys/kernel/debug/tracing/set_event
47 47
48The events are organized into subsystems, such as ext4, irq, sched, 48The events are organized into subsystems, such as ext4, irq, sched,
49etc., and a full event name looks like this: <subsystem>:<event>. The 49etc., and a full event name looks like this: <subsystem>:<event>. The
@@ -52,29 +52,29 @@ file. All of the events in a subsystem can be specified via the syntax
52"<subsystem>:*"; for example, to enable all irq events, you can use the 52"<subsystem>:*"; for example, to enable all irq events, you can use the
53command: 53command:
54 54
55 # echo 'irq:*' > /debug/tracing/set_event 55 # echo 'irq:*' > /sys/kernel/debug/tracing/set_event
56 56
572.2 Via the 'enable' toggle 572.2 Via the 'enable' toggle
58--------------------------- 58---------------------------
59 59
60The events available are also listed in /debug/tracing/events/ hierarchy 60The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy
61of directories. 61of directories.
62 62
63To enable event 'sched_wakeup': 63To enable event 'sched_wakeup':
64 64
65 # echo 1 > /debug/tracing/events/sched/sched_wakeup/enable 65 # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
66 66
67To disable it: 67To disable it:
68 68
69 # echo 0 > /debug/tracing/events/sched/sched_wakeup/enable 69 # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable
70 70
71To enable all events in sched subsystem: 71To enable all events in sched subsystem:
72 72
73 # echo 1 > /debug/tracing/events/sched/enable 73 # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
74 74
75To eanble all events: 75To enable all events:
76 76
77 # echo 1 > /debug/tracing/events/enable 77 # echo 1 > /sys/kernel/debug/tracing/events/enable
78 78
79When reading one of these enable files, there are four results: 79When reading one of these enable files, there are four results:
80 80
@@ -97,3 +97,185 @@ The format of this boot option is the same as described in section 2.1.
97 97
98See The example provided in samples/trace_events 98See The example provided in samples/trace_events
99 99
1004. Event formats
101================
102
103Each trace event has a 'format' file associated with it that contains
104a description of each field in a logged event. This information can
105be used to parse the binary trace stream, and is also the place to
106find the field names that can be used in event filters (see section 5).
107
108It also displays the format string that will be used to print the
109event in text mode, along with the event name and ID used for
110profiling.
111
112Every event has a set of 'common' fields associated with it; these are
113the fields prefixed with 'common_'. The other fields vary between
114events and correspond to the fields defined in the TRACE_EVENT
115definition for that event.
116
117Each field in the format has the form:
118
119 field:field-type field-name; offset:N; size:N;
120
121where offset is the offset of the field in the trace record and size
122is the size of the data item, in bytes.
123
124For example, here's the information displayed for the 'sched_wakeup'
125event:
126
127# cat /debug/tracing/events/sched/sched_wakeup/format
128
129name: sched_wakeup
130ID: 60
131format:
132 field:unsigned short common_type; offset:0; size:2;
133 field:unsigned char common_flags; offset:2; size:1;
134 field:unsigned char common_preempt_count; offset:3; size:1;
135 field:int common_pid; offset:4; size:4;
136 field:int common_tgid; offset:8; size:4;
137
138 field:char comm[TASK_COMM_LEN]; offset:12; size:16;
139 field:pid_t pid; offset:28; size:4;
140 field:int prio; offset:32; size:4;
141 field:int success; offset:36; size:4;
142 field:int cpu; offset:40; size:4;
143
144print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid,
145 REC->prio, REC->success, REC->cpu
146
147This event contains 10 fields, the first 5 common and the remaining 5
148event-specific. All the fields for this event are numeric, except for
149'comm' which is a string, a distinction important for event filtering.
150
1515. Event filtering
152==================
153
154Trace events can be filtered in the kernel by associating boolean
155'filter expressions' with them. As soon as an event is logged into
156the trace buffer, its fields are checked against the filter expression
157associated with that event type. An event with field values that
158'match' the filter will appear in the trace output, and an event whose
159values don't match will be discarded. An event with no filter
160associated with it matches everything, and is the default when no
161filter has been set for an event.
162
1635.1 Expression syntax
164---------------------
165
166A filter expression consists of one or more 'predicates' that can be
167combined using the logical operators '&&' and '||'. A predicate is
168simply a clause that compares the value of a field contained within a
169logged event with a constant value and returns either 0 or 1 depending
170on whether the field value matched (1) or didn't match (0):
171
172 field-name relational-operator value
173
174Parentheses can be used to provide arbitrary logical groupings and
175double-quotes can be used to prevent the shell from interpreting
176operators as shell metacharacters.
177
178The field-names available for use in filters can be found in the
179'format' files for trace events (see section 4).
180
181The relational-operators depend on the type of the field being tested:
182
183The operators available for numeric fields are:
184
185==, !=, <, <=, >, >=
186
187And for string fields they are:
188
189==, !=
190
191Currently, only exact string matches are supported.
192
193Currently, the maximum number of predicates in a filter is 16.
194
1955.2 Setting filters
196-------------------
197
198A filter for an individual event is set by writing a filter expression
199to the 'filter' file for the given event.
200
201For example:
202
203# cd /debug/tracing/events/sched/sched_wakeup
204# echo "common_preempt_count > 4" > filter
205
206A slightly more involved example:
207
208# cd /debug/tracing/events/sched/sched_signal_send
209# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter
210
211If there is an error in the expression, you'll get an 'Invalid
212argument' error when setting it, and the erroneous string along with
213an error message can be seen by looking at the filter e.g.:
214
215# cd /debug/tracing/events/sched/sched_signal_send
216# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter
217-bash: echo: write error: Invalid argument
218# cat filter
219((sig >= 10 && sig < 15) || dsig == 17) && comm != bash
220^
221parse_error: Field not found
222
223Currently the caret ('^') for an error always appears at the beginning of
224the filter string; the error message should still be useful though
225even without more accurate position info.
226
2275.3 Clearing filters
228--------------------
229
230To clear the filter for an event, write a '0' to the event's filter
231file.
232
233To clear the filters for all events in a subsystem, write a '0' to the
234subsystem's filter file.
235
2365.3 Subsystem filters
237---------------------
238
239For convenience, filters for every event in a subsystem can be set or
240cleared as a group by writing a filter expression into the filter file
241at the root of the subsytem. Note however, that if a filter for any
242event within the subsystem lacks a field specified in the subsystem
243filter, or if the filter can't be applied for any other reason, the
244filter for that event will retain its previous setting. This can
245result in an unintended mixture of filters which could lead to
246confusing (to the user who might think different filters are in
247effect) trace output. Only filters that reference just the common
248fields can be guaranteed to propagate successfully to all events.
249
250Here are a few subsystem filter examples that also illustrate the
251above points:
252
253Clear the filters on all events in the sched subsytem:
254
255# cd /sys/kernel/debug/tracing/events/sched
256# echo 0 > filter
257# cat sched_switch/filter
258none
259# cat sched_wakeup/filter
260none
261
262Set a filter using only common fields for all events in the sched
263subsytem (all events end up with the same filter):
264
265# cd /sys/kernel/debug/tracing/events/sched
266# echo common_pid == 0 > filter
267# cat sched_switch/filter
268common_pid == 0
269# cat sched_wakeup/filter
270common_pid == 0
271
272Attempt to set a filter using a non-common field for all events in the
273sched subsytem (all events but those that have a prev_pid field retain
274their old filters):
275
276# cd /sys/kernel/debug/tracing/events/sched
277# echo prev_pid == 0 > filter
278# cat sched_switch/filter
279prev_pid == 0
280# cat sched_wakeup/filter
281common_pid == 0
diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt
new file mode 100644
index 000000000000..7003e10f10f5
--- /dev/null
+++ b/Documentation/trace/ftrace-design.txt
@@ -0,0 +1,233 @@
1 function tracer guts
2 ====================
3
4Introduction
5------------
6
7Here we will cover the architecture pieces that the common function tracing
8code relies on for proper functioning. Things are broken down into increasing
9complexity so that you can start simple and at least get basic functionality.
10
11Note that this focuses on architecture implementation details only. If you
12want more explanation of a feature in terms of common code, review the common
13ftrace.txt file.
14
15
16Prerequisites
17-------------
18
19Ftrace relies on these features being implemented:
20 STACKTRACE_SUPPORT - implement save_stack_trace()
21 TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h
22
23
24HAVE_FUNCTION_TRACER
25--------------------
26
27You will need to implement the mcount and the ftrace_stub functions.
28
29The exact mcount symbol name will depend on your toolchain. Some call it
30"mcount", "_mcount", or even "__mcount". You can probably figure it out by
31running something like:
32 $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount
33 call mcount
34We'll make the assumption below that the symbol is "mcount" just to keep things
35nice and simple in the examples.
36
37Keep in mind that the ABI that is in effect inside of the mcount function is
38*highly* architecture/toolchain specific. We cannot help you in this regard,
39sorry. Dig up some old documentation and/or find someone more familiar than
40you to bang ideas off of. Typically, register usage (argument/scratch/etc...)
41is a major issue at this point, especially in relation to the location of the
42mcount call (before/after function prologue). You might also want to look at
43how glibc has implemented the mcount function for your architecture. It might
44be (semi-)relevant.
45
46The mcount function should check the function pointer ftrace_trace_function
47to see if it is set to ftrace_stub. If it is, there is nothing for you to do,
48so return immediately. If it isn't, then call that function in the same way
49the mcount function normally calls __mcount_internal -- the first argument is
50the "frompc" while the second argument is the "selfpc" (adjusted to remove the
51size of the mcount call that is embedded in the function).
52
53For example, if the function foo() calls bar(), when the bar() function calls
54mcount(), the arguments mcount() will pass to the tracer are:
55 "frompc" - the address bar() will use to return to foo()
56 "selfpc" - the address bar() (with _mcount() size adjustment)
57
58Also keep in mind that this mcount function will be called *a lot*, so
59optimizing for the default case of no tracer will help the smooth running of
60your system when tracing is disabled. So the start of the mcount function is
61typically the bare min with checking things before returning. That also means
62the code flow should usually kept linear (i.e. no branching in the nop case).
63This is of course an optimization and not a hard requirement.
64
65Here is some pseudo code that should help (these functions should actually be
66implemented in assembly):
67
68void ftrace_stub(void)
69{
70 return;
71}
72
73void mcount(void)
74{
75 /* save any bare state needed in order to do initial checking */
76
77 extern void (*ftrace_trace_function)(unsigned long, unsigned long);
78 if (ftrace_trace_function != ftrace_stub)
79 goto do_trace;
80
81 /* restore any bare state */
82
83 return;
84
85do_trace:
86
87 /* save all state needed by the ABI (see paragraph above) */
88
89 unsigned long frompc = ...;
90 unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
91 ftrace_trace_function(frompc, selfpc);
92
93 /* restore all state needed by the ABI */
94}
95
96Don't forget to export mcount for modules !
97extern void mcount(void);
98EXPORT_SYMBOL(mcount);
99
100
101HAVE_FUNCTION_TRACE_MCOUNT_TEST
102-------------------------------
103
104This is an optional optimization for the normal case when tracing is turned off
105in the system. If you do not enable this Kconfig option, the common ftrace
106code will take care of doing the checking for you.
107
108To support this feature, you only need to check the function_trace_stop
109variable in the mcount function. If it is non-zero, there is no tracing to be
110done at all, so you can return.
111
112This additional pseudo code would simply be:
113void mcount(void)
114{
115 /* save any bare state needed in order to do initial checking */
116
117+ if (function_trace_stop)
118+ return;
119
120 extern void (*ftrace_trace_function)(unsigned long, unsigned long);
121 if (ftrace_trace_function != ftrace_stub)
122...
123
124
125HAVE_FUNCTION_GRAPH_TRACER
126--------------------------
127
128Deep breath ... time to do some real work. Here you will need to update the
129mcount function to check ftrace graph function pointers, as well as implement
130some functions to save (hijack) and restore the return address.
131
132The mcount function should check the function pointers ftrace_graph_return
133(compare to ftrace_stub) and ftrace_graph_entry (compare to
134ftrace_graph_entry_stub). If either of those are not set to the relevant stub
135function, call the arch-specific function ftrace_graph_caller which in turn
136calls the arch-specific function prepare_ftrace_return. Neither of these
137function names are strictly required, but you should use them anyways to stay
138consistent across the architecture ports -- easier to compare & contrast
139things.
140
141The arguments to prepare_ftrace_return are slightly different than what are
142passed to ftrace_trace_function. The second argument "selfpc" is the same,
143but the first argument should be a pointer to the "frompc". Typically this is
144located on the stack. This allows the function to hijack the return address
145temporarily to have it point to the arch-specific function return_to_handler.
146That function will simply call the common ftrace_return_to_handler function and
147that will return the original return address with which, you can return to the
148original call site.
149
150Here is the updated mcount pseudo code:
151void mcount(void)
152{
153...
154 if (ftrace_trace_function != ftrace_stub)
155 goto do_trace;
156
157+#ifdef CONFIG_FUNCTION_GRAPH_TRACER
158+ extern void (*ftrace_graph_return)(...);
159+ extern void (*ftrace_graph_entry)(...);
160+ if (ftrace_graph_return != ftrace_stub ||
161+ ftrace_graph_entry != ftrace_graph_entry_stub)
162+ ftrace_graph_caller();
163+#endif
164
165 /* restore any bare state */
166...
167
168Here is the pseudo code for the new ftrace_graph_caller assembly function:
169#ifdef CONFIG_FUNCTION_GRAPH_TRACER
170void ftrace_graph_caller(void)
171{
172 /* save all state needed by the ABI */
173
174 unsigned long *frompc = &...;
175 unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
176 prepare_ftrace_return(frompc, selfpc);
177
178 /* restore all state needed by the ABI */
179}
180#endif
181
182For information on how to implement prepare_ftrace_return(), simply look at
183the x86 version. The only architecture-specific piece in it is the setup of
184the fault recovery table (the asm(...) code). The rest should be the same
185across architectures.
186
187Here is the pseudo code for the new return_to_handler assembly function. Note
188that the ABI that applies here is different from what applies to the mcount
189code. Since you are returning from a function (after the epilogue), you might
190be able to skimp on things saved/restored (usually just registers used to pass
191return values).
192
193#ifdef CONFIG_FUNCTION_GRAPH_TRACER
194void return_to_handler(void)
195{
196 /* save all state needed by the ABI (see paragraph above) */
197
198 void (*original_return_point)(void) = ftrace_return_to_handler();
199
200 /* restore all state needed by the ABI */
201
202 /* this is usually either a return or a jump */
203 original_return_point();
204}
205#endif
206
207
208HAVE_FTRACE_NMI_ENTER
209---------------------
210
211If you can't trace NMI functions, then skip this option.
212
213<details to be filled>
214
215
216HAVE_FTRACE_SYSCALLS
217---------------------
218
219<details to be filled>
220
221
222HAVE_FTRACE_MCOUNT_RECORD
223-------------------------
224
225See scripts/recordmcount.pl for more info.
226
227<details to be filled>
228
229
230HAVE_DYNAMIC_FTRACE
231---------------------
232
233<details to be filled>
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 355d0f1f8c50..957b22fde2df 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -26,6 +26,12 @@ disabled, and more (ftrace allows for tracer plugins, which
26means that the list of tracers can always grow). 26means that the list of tracers can always grow).
27 27
28 28
29Implementation Details
30----------------------
31
32See ftrace-design.txt for details for arch porters and such.
33
34
29The File System 35The File System
30--------------- 36---------------
31 37
@@ -127,7 +133,7 @@ of ftrace. Here is a list of some of the key files:
127 than requested, the rest of the page will be used, 133 than requested, the rest of the page will be used,
128 making the actual allocation bigger than requested. 134 making the actual allocation bigger than requested.
129 ( Note, the size may not be a multiple of the page size 135 ( Note, the size may not be a multiple of the page size
130 due to buffer managment overhead. ) 136 due to buffer management overhead. )
131 137
132 This can only be updated when the current_tracer 138 This can only be updated when the current_tracer
133 is set to "nop". 139 is set to "nop".
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100644
index 000000000000..7df50e8cf4d9
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,418 @@
1#!/usr/bin/perl
2# This is a POC (proof of concept or piece of crap, take your pick) for reading the
3# text representation of trace output related to page allocation. It makes an attempt
4# to extract some high-level information on what is going on. The accuracy of the parser
5# may vary considerably
6#
7# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
8# other options
9# --prepend-parent Report on the parent proc and PID
10# --read-procstat If the trace lacks process info, get it from /proc
11# --ignore-pid Aggregate processes of the same name together
12#
13# Copyright (c) IBM Corporation 2009
14# Author: Mel Gorman <mel@csn.ul.ie>
15use strict;
16use Getopt::Long;
17
18# Tracepoint events
19use constant MM_PAGE_ALLOC => 1;
20use constant MM_PAGE_FREE_DIRECT => 2;
21use constant MM_PAGEVEC_FREE => 3;
22use constant MM_PAGE_PCPU_DRAIN => 4;
23use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5;
24use constant MM_PAGE_ALLOC_EXTFRAG => 6;
25use constant EVENT_UNKNOWN => 7;
26
27# Constants used to track state
28use constant STATE_PCPU_PAGES_DRAINED => 8;
29use constant STATE_PCPU_PAGES_REFILLED => 9;
30
31# High-level events extrapolated from tracepoints
32use constant HIGH_PCPU_DRAINS => 10;
33use constant HIGH_PCPU_REFILLS => 11;
34use constant HIGH_EXT_FRAGMENT => 12;
35use constant HIGH_EXT_FRAGMENT_SEVERE => 13;
36use constant HIGH_EXT_FRAGMENT_MODERATE => 14;
37use constant HIGH_EXT_FRAGMENT_CHANGED => 15;
38
39my %perprocesspid;
40my %perprocess;
41my $opt_ignorepid;
42my $opt_read_procstat;
43my $opt_prepend_parent;
44
45# Catch sigint and exit on request
46my $sigint_report = 0;
47my $sigint_exit = 0;
48my $sigint_pending = 0;
49my $sigint_received = 0;
50sub sigint_handler {
51 my $current_time = time;
52 if ($current_time - 2 > $sigint_received) {
53 print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
54 $sigint_report = 1;
55 } else {
56 if (!$sigint_exit) {
57 print "Second SIGINT received quickly, exiting\n";
58 }
59 $sigint_exit++;
60 }
61
62 if ($sigint_exit > 3) {
63 print "Many SIGINTs received, exiting now without report\n";
64 exit;
65 }
66
67 $sigint_received = $current_time;
68 $sigint_pending = 1;
69}
70$SIG{INT} = "sigint_handler";
71
72# Parse command line options
73GetOptions(
74 'ignore-pid' => \$opt_ignorepid,
75 'read-procstat' => \$opt_read_procstat,
76 'prepend-parent' => \$opt_prepend_parent,
77);
78
79# Defaults for dynamically discovered regex's
80my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
81
82# Dyanically discovered regex
83my $regex_fragdetails;
84
85# Static regex used. Specified like this for readability and for use with /o
86# (process_pid) (cpus ) ( time ) (tpoint ) (details)
87my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
88my $regex_statname = '[-0-9]*\s\((.*)\).*';
89my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
90
91sub generate_traceevent_regex {
92 my $event = shift;
93 my $default = shift;
94 my $regex;
95
96 # Read the event format or use the default
97 if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
98 $regex = $default;
99 } else {
100 my $line;
101 while (!eof(FORMAT)) {
102 $line = <FORMAT>;
103 if ($line =~ /^print fmt:\s"(.*)",.*/) {
104 $regex = $1;
105 $regex =~ s/%p/\([0-9a-f]*\)/g;
106 $regex =~ s/%d/\([-0-9]*\)/g;
107 $regex =~ s/%lu/\([0-9]*\)/g;
108 }
109 }
110 }
111
112 # Verify fields are in the right order
113 my $tuple;
114 foreach $tuple (split /\s/, $regex) {
115 my ($key, $value) = split(/=/, $tuple);
116 my $expected = shift;
117 if ($key ne $expected) {
118 print("WARNING: Format not as expected '$key' != '$expected'");
119 $regex =~ s/$key=\((.*)\)/$key=$1/;
120 }
121 }
122
123 if (defined shift) {
124 die("Fewer fields than expected in format");
125 }
126
127 return $regex;
128}
129$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
130 $regex_fragdetails_default,
131 "page", "pfn",
132 "alloc_order", "fallback_order", "pageblock_order",
133 "alloc_migratetype", "fallback_migratetype",
134 "fragmenting", "change_ownership");
135
136sub read_statline($) {
137 my $pid = $_[0];
138 my $statline;
139
140 if (open(STAT, "/proc/$pid/stat")) {
141 $statline = <STAT>;
142 close(STAT);
143 }
144
145 if ($statline eq '') {
146 $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
147 }
148
149 return $statline;
150}
151
152sub guess_process_pid($$) {
153 my $pid = $_[0];
154 my $statline = $_[1];
155
156 if ($pid == 0) {
157 return "swapper-0";
158 }
159
160 if ($statline !~ /$regex_statname/o) {
161 die("Failed to math stat line for process name :: $statline");
162 }
163 return "$1-$pid";
164}
165
166sub parent_info($$) {
167 my $pid = $_[0];
168 my $statline = $_[1];
169 my $ppid;
170
171 if ($pid == 0) {
172 return "NOPARENT-0";
173 }
174
175 if ($statline !~ /$regex_statppid/o) {
176 die("Failed to match stat line process ppid:: $statline");
177 }
178
179 # Read the ppid stat line
180 $ppid = $1;
181 return guess_process_pid($ppid, read_statline($ppid));
182}
183
184sub process_events {
185 my $traceevent;
186 my $process_pid;
187 my $cpus;
188 my $timestamp;
189 my $tracepoint;
190 my $details;
191 my $statline;
192
193 # Read each line of the event log
194EVENT_PROCESS:
195 while ($traceevent = <STDIN>) {
196 if ($traceevent =~ /$regex_traceevent/o) {
197 $process_pid = $1;
198 $tracepoint = $4;
199
200 if ($opt_read_procstat || $opt_prepend_parent) {
201 $process_pid =~ /(.*)-([0-9]*)$/;
202 my $process = $1;
203 my $pid = $2;
204
205 $statline = read_statline($pid);
206
207 if ($opt_read_procstat && $process eq '') {
208 $process_pid = guess_process_pid($pid, $statline);
209 }
210
211 if ($opt_prepend_parent) {
212 $process_pid = parent_info($pid, $statline) . " :: $process_pid";
213 }
214 }
215
216 # Unnecessary in this script. Uncomment if required
217 # $cpus = $2;
218 # $timestamp = $3;
219 } else {
220 next;
221 }
222
223 # Perl Switch() sucks majorly
224 if ($tracepoint eq "mm_page_alloc") {
225 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
226 } elsif ($tracepoint eq "mm_page_free_direct") {
227 $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
228 } elsif ($tracepoint eq "mm_pagevec_free") {
229 $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
230 } elsif ($tracepoint eq "mm_page_pcpu_drain") {
231 $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
232 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
233 } elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
234 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
235 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
236 } elsif ($tracepoint eq "mm_page_alloc_extfrag") {
237
238 # Extract the details of the event now
239 $details = $5;
240
241 my ($page, $pfn);
242 my ($alloc_order, $fallback_order, $pageblock_order);
243 my ($alloc_migratetype, $fallback_migratetype);
244 my ($fragmenting, $change_ownership);
245
246 if ($details !~ /$regex_fragdetails/o) {
247 print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
248 next;
249 }
250
251 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
252 $page = $1;
253 $pfn = $2;
254 $alloc_order = $3;
255 $fallback_order = $4;
256 $pageblock_order = $5;
257 $alloc_migratetype = $6;
258 $fallback_migratetype = $7;
259 $fragmenting = $8;
260 $change_ownership = $9;
261
262 if ($fragmenting) {
263 $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
264 if ($fallback_order <= 3) {
265 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
266 } else {
267 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
268 }
269 }
270 if ($change_ownership) {
271 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
272 }
273 } else {
274 $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
275 }
276
277 # Catch a full pcpu drain event
278 if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
279 $tracepoint ne "mm_page_pcpu_drain") {
280
281 $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
282 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
283 }
284
285 # Catch a full pcpu refill event
286 if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
287 $tracepoint ne "mm_page_alloc_zone_locked") {
288 $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
289 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
290 }
291
292 if ($sigint_pending) {
293 last EVENT_PROCESS;
294 }
295 }
296}
297
298sub dump_stats {
299 my $hashref = shift;
300 my %stats = %$hashref;
301
302 # Dump per-process stats
303 my $process_pid;
304 my $max_strlen = 0;
305
306 # Get the maximum process name
307 foreach $process_pid (keys %perprocesspid) {
308 my $len = length($process_pid);
309 if ($len > $max_strlen) {
310 $max_strlen = $len;
311 }
312 }
313 $max_strlen += 2;
314
315 printf("\n");
316 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
317 "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
318 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
319 "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", "");
320
321 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
322 "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
323
324 foreach $process_pid (keys %stats) {
325 # Dump final aggregates
326 if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
327 $stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
328 $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
329 }
330 if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
331 $stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
332 $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
333 }
334
335 printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n",
336 $process_pid,
337 $stats{$process_pid}->{MM_PAGE_ALLOC},
338 $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
339 $stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
340 $stats{$process_pid}->{MM_PAGEVEC_FREE},
341 $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
342 $stats{$process_pid}->{HIGH_PCPU_DRAINS},
343 $stats{$process_pid}->{HIGH_PCPU_REFILLS},
344 $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
345 $stats{$process_pid}->{HIGH_EXT_FRAG},
346 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
347 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
348 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
349 $stats{$process_pid}->{EVENT_UNKNOWN});
350 }
351}
352
353sub aggregate_perprocesspid() {
354 my $process_pid;
355 my $process;
356 undef %perprocess;
357
358 foreach $process_pid (keys %perprocesspid) {
359 $process = $process_pid;
360 $process =~ s/-([0-9])*$//;
361 if ($process eq '') {
362 $process = "NO_PROCESS_NAME";
363 }
364
365 $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
366 $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
367 $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
368 $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
369 $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
370 $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
371 $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
372 $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
373 $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
374 $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
375 $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
376 $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
377 $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
378 }
379}
380
381sub report() {
382 if (!$opt_ignorepid) {
383 dump_stats(\%perprocesspid);
384 } else {
385 aggregate_perprocesspid();
386 dump_stats(\%perprocess);
387 }
388}
389
390# Process events or signals until neither is available
391sub signal_loop() {
392 my $sigint_processed;
393 do {
394 $sigint_processed = 0;
395 process_events();
396
397 # Handle pending signals if any
398 if ($sigint_pending) {
399 my $current_time = time;
400
401 if ($sigint_exit) {
402 print "Received exit signal\n";
403 $sigint_pending = 0;
404 }
405 if ($sigint_report) {
406 if ($current_time >= $sigint_received + 2) {
407 report();
408 $sigint_report = 0;
409 $sigint_pending = 0;
410 $sigint_processed = 1;
411 }
412 }
413 }
414 } while ($sigint_pending || $sigint_processed);
415}
416
417signal_loop();
418report();
diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt
deleted file mode 100644
index cd805e16dc27..000000000000
--- a/Documentation/trace/power.txt
+++ /dev/null
@@ -1,17 +0,0 @@
1The power tracer collects detailed information about C-state and P-state
2transitions, instead of just looking at the high-level "average"
3information.
4
5There is a helper script found in scrips/tracing/power.pl in the kernel
6sources which can be used to parse this information and create a
7Scalable Vector Graphics (SVG) picture from the trace data.
8
9To use this tracer:
10
11 echo 0 > /sys/kernel/debug/tracing/tracing_enabled
12 echo power > /sys/kernel/debug/tracing/current_tracer
13 echo 1 > /sys/kernel/debug/tracing/tracing_enabled
14 sleep 1
15 echo 0 > /sys/kernel/debug/tracing/tracing_enabled
16 cat /sys/kernel/debug/tracing/trace | \
17 perl scripts/tracing/power.pl > out.sv
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 000000000000..5eb4e487e667
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
1 Notes on Analysing Behaviour Using Events and Tracepoints
2
3 Documentation written by Mel Gorman
4 PCL information heavily based on email from Ingo Molnar
5
61. Introduction
7===============
8
9Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
10creating custom kernel modules to register probe functions using the event
11tracing infrastructure.
12
13Simplistically, tracepoints will represent an important event that when can
14be taken in conjunction with other tracepoints to build a "Big Picture" of
15what is going on within the system. There are a large number of methods for
16gathering and interpreting these events. Lacking any current Best Practises,
17this document describes some of the methods that can be used.
18
19This document assumes that debugfs is mounted on /sys/kernel/debug and that
20the appropriate tracing options have been configured into the kernel. It is
21assumed that the PCL tool tools/perf has been installed and is in your path.
22
232. Listing Available Events
24===========================
25
262.1 Standard Utilities
27----------------------
28
29All possible events are visible from /sys/kernel/debug/tracing/events. Simply
30calling
31
32 $ find /sys/kernel/debug/tracing/events -type d
33
34will give a fair indication of the number of events available.
35
362.2 PCL
37-------
38
39Discovery and enumeration of all counters and events, including tracepoints
40are available with the perf tool. Getting a list of available events is a
41simple case of
42
43 $ perf list 2>&1 | grep Tracepoint
44 ext4:ext4_free_inode [Tracepoint event]
45 ext4:ext4_request_inode [Tracepoint event]
46 ext4:ext4_allocate_inode [Tracepoint event]
47 ext4:ext4_write_begin [Tracepoint event]
48 ext4:ext4_ordered_write_end [Tracepoint event]
49 [ .... remaining output snipped .... ]
50
51
522. Enabling Events
53==================
54
552.1 System-Wide Event Enabling
56------------------------------
57
58See Documentation/trace/events.txt for a proper description on how events
59can be enabled system-wide. A short example of enabling all events related
60to page allocation would look something like
61
62 $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
63
642.2 System-Wide Event Enabling with SystemTap
65---------------------------------------------
66
67In SystemTap, tracepoints are accessible using the kernel.trace() function
68call. The following is an example that reports every 5 seconds what processes
69were allocating the pages.
70
71 global page_allocs
72
73 probe kernel.trace("mm_page_alloc") {
74 page_allocs[execname()]++
75 }
76
77 function print_count() {
78 printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
79 foreach (proc in page_allocs-)
80 printf("%-25d %s\n", page_allocs[proc], proc)
81 printf ("\n")
82 delete page_allocs
83 }
84
85 probe timer.s(5) {
86 print_count()
87 }
88
892.3 System-Wide Event Enabling with PCL
90---------------------------------------
91
92By specifying the -a switch and analysing sleep, the system-wide events
93for a duration of time can be examined.
94
95 $ perf stat -a \
96 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
97 -e kmem:mm_pagevec_free \
98 sleep 10
99 Performance counter stats for 'sleep 10':
100
101 9630 kmem:mm_page_alloc
102 2143 kmem:mm_page_free_direct
103 7424 kmem:mm_pagevec_free
104
105 10.002577764 seconds time elapsed
106
107Similarly, one could execute a shell and exit it as desired to get a report
108at that point.
109
1102.4 Local Event Enabling
111------------------------
112
113Documentation/trace/ftrace.txt describes how to enable events on a per-thread
114basis using set_ftrace_pid.
115
1162.5 Local Event Enablement with PCL
117-----------------------------------
118
119Events can be activate and tracked for the duration of a process on a local
120basis using PCL such as follows.
121
122 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
123 -e kmem:mm_pagevec_free ./hackbench 10
124 Time: 0.909
125
126 Performance counter stats for './hackbench 10':
127
128 17803 kmem:mm_page_alloc
129 12398 kmem:mm_page_free_direct
130 4827 kmem:mm_pagevec_free
131
132 0.973913387 seconds time elapsed
133
1343. Event Filtering
135==================
136
137Documentation/trace/ftrace.txt covers in-depth how to filter events in
138ftrace. Obviously using grep and awk of trace_pipe is an option as well
139as any script reading trace_pipe.
140
1414. Analysing Event Variances with PCL
142=====================================
143
144Any workload can exhibit variances between runs and it can be important
145to know what the standard deviation in. By and large, this is left to the
146performance analyst to do it by hand. In the event that the discrete event
147occurrences are useful to the performance analyst, then perf can be used.
148
149 $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
150 -e kmem:mm_pagevec_free ./hackbench 10
151 Time: 0.890
152 Time: 0.895
153 Time: 0.915
154 Time: 1.001
155 Time: 0.899
156
157 Performance counter stats for './hackbench 10' (5 runs):
158
159 16630 kmem:mm_page_alloc ( +- 3.542% )
160 11486 kmem:mm_page_free_direct ( +- 4.771% )
161 4730 kmem:mm_pagevec_free ( +- 2.325% )
162
163 0.982653002 seconds time elapsed ( +- 1.448% )
164
165In the event that some higher-level event is required that depends on some
166aggregation of discrete events, then a script would need to be developed.
167
168Using --repeat, it is also possible to view how events are fluctuating over
169time on a system wide basis using -a and sleep.
170
171 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
172 -e kmem:mm_pagevec_free \
173 -a --repeat 10 \
174 sleep 1
175 Performance counter stats for 'sleep 1' (10 runs):
176
177 1066 kmem:mm_page_alloc ( +- 26.148% )
178 182 kmem:mm_page_free_direct ( +- 5.464% )
179 890 kmem:mm_pagevec_free ( +- 30.079% )
180
181 1.002251757 seconds time elapsed ( +- 0.005% )
182
1835. Higher-Level Analysis with Helper Scripts
184============================================
185
186When events are enabled the events that are triggering can be read from
187/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
188options exist as well. By post-processing the output, further information can
189be gathered on-line as appropriate. Examples of post-processing might include
190
191 o Reading information from /proc for the PID that triggered the event
192 o Deriving a higher-level event from a series of lower-level events.
193 o Calculate latencies between two events
194
195Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
196script that can read trace_pipe from STDIN or a copy of a trace. When used
197on-line, it can be interrupted once to generate a report without existing
198and twice to exit.
199
200Simplistically, the script just reads STDIN and counts up events but it
201also can do more such as
202
203 o Derive high-level events from many low-level events. If a number of pages
204 are freed to the main allocator from the per-CPU lists, it recognises
205 that as one per-CPU drain even though there is no specific tracepoint
206 for that event
207 o It can aggregate based on PID or individual process number
208 o In the event memory is getting externally fragmented, it reports
209 on whether the fragmentation event was severe or moderate.
210 o When receiving an event about a PID, it can record who the parent was so
211 that if large numbers of events are coming from very short-lived
212 processes, the parent process responsible for creating all the helpers
213 can be identified
214
2156. Lower-Level Analysis with PCL
216================================
217
218There may also be a requirement to identify what functions with a program
219were generating events within the kernel. To begin this sort of analysis, the
220data must be recorded. At the time of writing, this required root
221
222 $ perf record -c 1 \
223 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
224 -e kmem:mm_pagevec_free \
225 ./hackbench 10
226 Time: 0.894
227 [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
228
229Note the use of '-c 1' to set the event period to sample. The default sample
230period is quite high to minimise overhead but the information collected can be
231very coarse as a result.
232
233This record outputted a file called perf.data which can be analysed using
234perf report.
235
236 $ perf report
237 # Samples: 30922
238 #
239 # Overhead Command Shared Object
240 # ........ ......... ................................
241 #
242 87.27% hackbench [vdso]
243 6.85% hackbench /lib/i686/cmov/libc-2.9.so
244 2.62% hackbench /lib/ld-2.9.so
245 1.52% perf [vdso]
246 1.22% hackbench ./hackbench
247 0.48% hackbench [kernel]
248 0.02% perf /lib/i686/cmov/libc-2.9.so
249 0.01% perf /usr/bin/perf
250 0.01% perf /lib/ld-2.9.so
251 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
252 #
253 # (For more details, try: perf report --sort comm,dso,symbol)
254 #
255
256According to this, the vast majority of events occured triggered on events
257within the VDSO. With simple binaries, this will often be the case so lets
258take a slightly different example. In the course of writing this, it was
259noticed that X was generating an insane amount of page allocations so lets look
260at it
261
262 $ perf record -c 1 -f \
263 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
264 -e kmem:mm_pagevec_free \
265 -p `pidof X`
266
267This was interrupted after a few seconds and
268
269 $ perf report
270 # Samples: 27666
271 #
272 # Overhead Command Shared Object
273 # ........ ....... .......................................
274 #
275 51.95% Xorg [vdso]
276 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
277 0.09% Xorg /lib/i686/cmov/libc-2.9.so
278 0.01% Xorg [kernel]
279 #
280 # (For more details, try: perf report --sort comm,dso,symbol)
281 #
282
283So, almost half of the events are occuring in a library. To get an idea which
284symbol.
285
286 $ perf report --sort comm,dso,symbol
287 # Samples: 27666
288 #
289 # Overhead Command Shared Object Symbol
290 # ........ ....... ....................................... ......
291 #
292 51.95% Xorg [vdso] [.] 0x000000ffffe424
293 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
294 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
295 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
296 0.01% Xorg [kernel] [k] read_hpet
297 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
298 0.00% Xorg [kernel] [k] ftrace_trace_userstack
299
300To see where within the function pixmanFillsse2 things are going wrong
301
302 $ perf annotate pixmanFillsse2
303 [ ... ]
304 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
305 : }
306 :
307 : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
308 : _mm_store_si128 (__m128i *__P, __m128i __B) : {
309 : *__P = __B;
310 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
311 0.00 : 34ef5: ff
312 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
313 0.00 : 34efd: ff
314 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
315 0.00 : 34f05: ff
316 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
317 0.00 : 34f0d: ff
318 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
319 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
320 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
321 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
322
323At a glance, it looks like the time is being spent copying pixmaps to
324the card. Further investigation would be needed to determine why pixmaps
325are being copied around so much but a starting point would be to take an
326ancient build of libpixmap out of the library path where it was totally
327forgotten about from months ago!