aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/trace
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/trace')
-rw-r--r--Documentation/trace/events-kmem.txt107
-rw-r--r--Documentation/trace/events.txt2
-rw-r--r--Documentation/trace/ftrace.txt2
-rw-r--r--Documentation/trace/postprocess/trace-pagealloc-postprocess.pl418
-rw-r--r--Documentation/trace/power.txt17
-rw-r--r--Documentation/trace/tracepoint-analysis.txt327
6 files changed, 854 insertions, 19 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
new file mode 100644
index 000000000000..6ef2a8652e17
--- /dev/null
+++ b/Documentation/trace/events-kmem.txt
@@ -0,0 +1,107 @@
1 Subsystem Trace Points: kmem
2
3The tracing system kmem captures events related to object and page allocation
4within the kernel. Broadly speaking there are four major subheadings.
5
6 o Slab allocation of small objects of unknown type (kmalloc)
7 o Slab allocation of small objects of known type
8 o Page allocation
9 o Per-CPU Allocator Activity
10 o External Fragmentation
11
12This document will describe what each of the tracepoints are and why they
13might be useful.
14
151. Slab allocation of small objects of unknown type
16===================================================
17kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
18kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
19kfree call_site=%lx ptr=%p
20
21Heavy activity for these events may indicate that a specific cache is
22justified, particularly if kmalloc slab pages are getting significantly
23internal fragmented as a result of the allocation pattern. By correlating
24kmalloc with kfree, it may be possible to identify memory leaks and where
25the allocation sites were.
26
27
282. Slab allocation of small objects of known type
29=================================================
30kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s
31kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d
32kmem_cache_free call_site=%lx ptr=%p
33
34These events are similar in usage to the kmalloc-related events except that
35it is likely easier to pin the event down to a specific cache. At the time
36of writing, no information is available on what slab is being allocated from,
37but the call_site can usually be used to extrapolate that information
38
393. Page allocation
40==================
41mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s
42mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
43mm_page_free_direct page=%p pfn=%lu order=%d
44mm_pagevec_free page=%p pfn=%lu order=%d cold=%d
45
46These four events deal with page allocation and freeing. mm_page_alloc is
47a simple indicator of page allocator activity. Pages may be allocated from
48the per-CPU allocator (high performance) or the buddy allocator.
49
50If pages are allocated directly from the buddy allocator, the
51mm_page_alloc_zone_locked event is triggered. This event is important as high
52amounts of activity imply high activity on the zone->lock. Taking this lock
53impairs performance by disabling interrupts, dirtying cache lines between
54CPUs and serialising many CPUs.
55
56When a page is freed directly by the caller, the mm_page_free_direct event
57is triggered. Significant amounts of activity here could indicate that the
58callers should be batching their activities.
59
60When pages are freed using a pagevec, the mm_pagevec_free is
61triggered. Broadly speaking, pages are taken off the LRU lock in bulk and
62freed in batch with a pagevec. Significant amounts of activity here could
63indicate that the system is under memory pressure and can also indicate
64contention on the zone->lru_lock.
65
664. Per-CPU Allocator Activity
67=============================
68mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d
69mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d
70
71In front of the page allocator is a per-cpu page allocator. It exists only
72for order-0 pages, reduces contention on the zone->lock and reduces the
73amount of writing on struct page.
74
75When a per-CPU list is empty or pages of the wrong type are allocated,
76the zone->lock will be taken once and the per-CPU list refilled. The event
77triggered is mm_page_alloc_zone_locked for each page allocated with the
78event indicating whether it is for a percpu_refill or not.
79
80When the per-CPU list is too full, a number of pages are freed, each one
81which triggers a mm_page_pcpu_drain event.
82
83The individual nature of the events are so that pages can be tracked
84between allocation and freeing. A number of drain or refill pages that occur
85consecutively imply the zone->lock being taken once. Large amounts of PCP
86refills and drains could imply an imbalance between CPUs where too much work
87is being concentrated in one place. It could also indicate that the per-CPU
88lists should be a larger size. Finally, large amounts of refills on one CPU
89and drains on another could be a factor in causing large amounts of cache
90line bounces due to writes between CPUs and worth investigating if pages
91can be allocated and freed on the same CPU through some algorithm change.
92
935. External Fragmentation
94=========================
95mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d
96
97External fragmentation affects whether a high-order allocation will be
98successful or not. For some types of hardware, this is important although
99it is avoided where possible. If the system is using huge pages and needs
100to be able to resize the pool over the lifetime of the system, this value
101is important.
102
103Large numbers of this event implies that memory is fragmenting and
104high-order allocations will start failing at some time in the future. One
105means of reducing the occurange of this event is to increase the size of
106min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where
107pageblock_size is usually the size of the default hugepage size.
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt
index 78c45a87be57..02ac6ed38b2d 100644
--- a/Documentation/trace/events.txt
+++ b/Documentation/trace/events.txt
@@ -72,7 +72,7 @@ To enable all events in sched subsystem:
72 72
73 # echo 1 > /sys/kernel/debug/tracing/events/sched/enable 73 # echo 1 > /sys/kernel/debug/tracing/events/sched/enable
74 74
75To eanble all events: 75To enable all events:
76 76
77 # echo 1 > /sys/kernel/debug/tracing/events/enable 77 # echo 1 > /sys/kernel/debug/tracing/events/enable
78 78
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 1b6292bbdd6d..957b22fde2df 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -133,7 +133,7 @@ of ftrace. Here is a list of some of the key files:
133 than requested, the rest of the page will be used, 133 than requested, the rest of the page will be used,
134 making the actual allocation bigger than requested. 134 making the actual allocation bigger than requested.
135 ( Note, the size may not be a multiple of the page size 135 ( Note, the size may not be a multiple of the page size
136 due to buffer managment overhead. ) 136 due to buffer management overhead. )
137 137
138 This can only be updated when the current_tracer 138 This can only be updated when the current_tracer
139 is set to "nop". 139 is set to "nop".
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
new file mode 100644
index 000000000000..7df50e8cf4d9
--- /dev/null
+++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl
@@ -0,0 +1,418 @@
1#!/usr/bin/perl
2# This is a POC (proof of concept or piece of crap, take your pick) for reading the
3# text representation of trace output related to page allocation. It makes an attempt
4# to extract some high-level information on what is going on. The accuracy of the parser
5# may vary considerably
6#
7# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe
8# other options
9# --prepend-parent Report on the parent proc and PID
10# --read-procstat If the trace lacks process info, get it from /proc
11# --ignore-pid Aggregate processes of the same name together
12#
13# Copyright (c) IBM Corporation 2009
14# Author: Mel Gorman <mel@csn.ul.ie>
15use strict;
16use Getopt::Long;
17
18# Tracepoint events
19use constant MM_PAGE_ALLOC => 1;
20use constant MM_PAGE_FREE_DIRECT => 2;
21use constant MM_PAGEVEC_FREE => 3;
22use constant MM_PAGE_PCPU_DRAIN => 4;
23use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5;
24use constant MM_PAGE_ALLOC_EXTFRAG => 6;
25use constant EVENT_UNKNOWN => 7;
26
27# Constants used to track state
28use constant STATE_PCPU_PAGES_DRAINED => 8;
29use constant STATE_PCPU_PAGES_REFILLED => 9;
30
31# High-level events extrapolated from tracepoints
32use constant HIGH_PCPU_DRAINS => 10;
33use constant HIGH_PCPU_REFILLS => 11;
34use constant HIGH_EXT_FRAGMENT => 12;
35use constant HIGH_EXT_FRAGMENT_SEVERE => 13;
36use constant HIGH_EXT_FRAGMENT_MODERATE => 14;
37use constant HIGH_EXT_FRAGMENT_CHANGED => 15;
38
39my %perprocesspid;
40my %perprocess;
41my $opt_ignorepid;
42my $opt_read_procstat;
43my $opt_prepend_parent;
44
45# Catch sigint and exit on request
46my $sigint_report = 0;
47my $sigint_exit = 0;
48my $sigint_pending = 0;
49my $sigint_received = 0;
50sub sigint_handler {
51 my $current_time = time;
52 if ($current_time - 2 > $sigint_received) {
53 print "SIGINT received, report pending. Hit ctrl-c again to exit\n";
54 $sigint_report = 1;
55 } else {
56 if (!$sigint_exit) {
57 print "Second SIGINT received quickly, exiting\n";
58 }
59 $sigint_exit++;
60 }
61
62 if ($sigint_exit > 3) {
63 print "Many SIGINTs received, exiting now without report\n";
64 exit;
65 }
66
67 $sigint_received = $current_time;
68 $sigint_pending = 1;
69}
70$SIG{INT} = "sigint_handler";
71
72# Parse command line options
73GetOptions(
74 'ignore-pid' => \$opt_ignorepid,
75 'read-procstat' => \$opt_read_procstat,
76 'prepend-parent' => \$opt_prepend_parent,
77);
78
79# Defaults for dynamically discovered regex's
80my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])';
81
82# Dyanically discovered regex
83my $regex_fragdetails;
84
85# Static regex used. Specified like this for readability and for use with /o
86# (process_pid) (cpus ) ( time ) (tpoint ) (details)
87my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)';
88my $regex_statname = '[-0-9]*\s\((.*)\).*';
89my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*';
90
91sub generate_traceevent_regex {
92 my $event = shift;
93 my $default = shift;
94 my $regex;
95
96 # Read the event format or use the default
97 if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) {
98 $regex = $default;
99 } else {
100 my $line;
101 while (!eof(FORMAT)) {
102 $line = <FORMAT>;
103 if ($line =~ /^print fmt:\s"(.*)",.*/) {
104 $regex = $1;
105 $regex =~ s/%p/\([0-9a-f]*\)/g;
106 $regex =~ s/%d/\([-0-9]*\)/g;
107 $regex =~ s/%lu/\([0-9]*\)/g;
108 }
109 }
110 }
111
112 # Verify fields are in the right order
113 my $tuple;
114 foreach $tuple (split /\s/, $regex) {
115 my ($key, $value) = split(/=/, $tuple);
116 my $expected = shift;
117 if ($key ne $expected) {
118 print("WARNING: Format not as expected '$key' != '$expected'");
119 $regex =~ s/$key=\((.*)\)/$key=$1/;
120 }
121 }
122
123 if (defined shift) {
124 die("Fewer fields than expected in format");
125 }
126
127 return $regex;
128}
129$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag",
130 $regex_fragdetails_default,
131 "page", "pfn",
132 "alloc_order", "fallback_order", "pageblock_order",
133 "alloc_migratetype", "fallback_migratetype",
134 "fragmenting", "change_ownership");
135
136sub read_statline($) {
137 my $pid = $_[0];
138 my $statline;
139
140 if (open(STAT, "/proc/$pid/stat")) {
141 $statline = <STAT>;
142 close(STAT);
143 }
144
145 if ($statline eq '') {
146 $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0";
147 }
148
149 return $statline;
150}
151
152sub guess_process_pid($$) {
153 my $pid = $_[0];
154 my $statline = $_[1];
155
156 if ($pid == 0) {
157 return "swapper-0";
158 }
159
160 if ($statline !~ /$regex_statname/o) {
161 die("Failed to math stat line for process name :: $statline");
162 }
163 return "$1-$pid";
164}
165
166sub parent_info($$) {
167 my $pid = $_[0];
168 my $statline = $_[1];
169 my $ppid;
170
171 if ($pid == 0) {
172 return "NOPARENT-0";
173 }
174
175 if ($statline !~ /$regex_statppid/o) {
176 die("Failed to match stat line process ppid:: $statline");
177 }
178
179 # Read the ppid stat line
180 $ppid = $1;
181 return guess_process_pid($ppid, read_statline($ppid));
182}
183
184sub process_events {
185 my $traceevent;
186 my $process_pid;
187 my $cpus;
188 my $timestamp;
189 my $tracepoint;
190 my $details;
191 my $statline;
192
193 # Read each line of the event log
194EVENT_PROCESS:
195 while ($traceevent = <STDIN>) {
196 if ($traceevent =~ /$regex_traceevent/o) {
197 $process_pid = $1;
198 $tracepoint = $4;
199
200 if ($opt_read_procstat || $opt_prepend_parent) {
201 $process_pid =~ /(.*)-([0-9]*)$/;
202 my $process = $1;
203 my $pid = $2;
204
205 $statline = read_statline($pid);
206
207 if ($opt_read_procstat && $process eq '') {
208 $process_pid = guess_process_pid($pid, $statline);
209 }
210
211 if ($opt_prepend_parent) {
212 $process_pid = parent_info($pid, $statline) . " :: $process_pid";
213 }
214 }
215
216 # Unnecessary in this script. Uncomment if required
217 # $cpus = $2;
218 # $timestamp = $3;
219 } else {
220 next;
221 }
222
223 # Perl Switch() sucks majorly
224 if ($tracepoint eq "mm_page_alloc") {
225 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++;
226 } elsif ($tracepoint eq "mm_page_free_direct") {
227 $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++;
228 } elsif ($tracepoint eq "mm_pagevec_free") {
229 $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++;
230 } elsif ($tracepoint eq "mm_page_pcpu_drain") {
231 $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++;
232 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++;
233 } elsif ($tracepoint eq "mm_page_alloc_zone_locked") {
234 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++;
235 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++;
236 } elsif ($tracepoint eq "mm_page_alloc_extfrag") {
237
238 # Extract the details of the event now
239 $details = $5;
240
241 my ($page, $pfn);
242 my ($alloc_order, $fallback_order, $pageblock_order);
243 my ($alloc_migratetype, $fallback_migratetype);
244 my ($fragmenting, $change_ownership);
245
246 if ($details !~ /$regex_fragdetails/o) {
247 print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n";
248 next;
249 }
250
251 $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++;
252 $page = $1;
253 $pfn = $2;
254 $alloc_order = $3;
255 $fallback_order = $4;
256 $pageblock_order = $5;
257 $alloc_migratetype = $6;
258 $fallback_migratetype = $7;
259 $fragmenting = $8;
260 $change_ownership = $9;
261
262 if ($fragmenting) {
263 $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++;
264 if ($fallback_order <= 3) {
265 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++;
266 } else {
267 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++;
268 }
269 }
270 if ($change_ownership) {
271 $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++;
272 }
273 } else {
274 $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++;
275 }
276
277 # Catch a full pcpu drain event
278 if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} &&
279 $tracepoint ne "mm_page_pcpu_drain") {
280
281 $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++;
282 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
283 }
284
285 # Catch a full pcpu refill event
286 if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} &&
287 $tracepoint ne "mm_page_alloc_zone_locked") {
288 $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++;
289 $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
290 }
291
292 if ($sigint_pending) {
293 last EVENT_PROCESS;
294 }
295 }
296}
297
298sub dump_stats {
299 my $hashref = shift;
300 my %stats = %$hashref;
301
302 # Dump per-process stats
303 my $process_pid;
304 my $max_strlen = 0;
305
306 # Get the maximum process name
307 foreach $process_pid (keys %perprocesspid) {
308 my $len = length($process_pid);
309 if ($len > $max_strlen) {
310 $max_strlen = $len;
311 }
312 }
313 $max_strlen += 2;
314
315 printf("\n");
316 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
317 "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown");
318 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
319 "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", "");
320
321 printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n",
322 "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", "");
323
324 foreach $process_pid (keys %stats) {
325 # Dump final aggregates
326 if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) {
327 $stats{$process_pid}->{HIGH_PCPU_DRAINS}++;
328 $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0;
329 }
330 if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) {
331 $stats{$process_pid}->{HIGH_PCPU_REFILLS}++;
332 $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0;
333 }
334
335 printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n",
336 $process_pid,
337 $stats{$process_pid}->{MM_PAGE_ALLOC},
338 $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED},
339 $stats{$process_pid}->{MM_PAGE_FREE_DIRECT},
340 $stats{$process_pid}->{MM_PAGEVEC_FREE},
341 $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN},
342 $stats{$process_pid}->{HIGH_PCPU_DRAINS},
343 $stats{$process_pid}->{HIGH_PCPU_REFILLS},
344 $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG},
345 $stats{$process_pid}->{HIGH_EXT_FRAG},
346 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED},
347 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE},
348 $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE},
349 $stats{$process_pid}->{EVENT_UNKNOWN});
350 }
351}
352
353sub aggregate_perprocesspid() {
354 my $process_pid;
355 my $process;
356 undef %perprocess;
357
358 foreach $process_pid (keys %perprocesspid) {
359 $process = $process_pid;
360 $process =~ s/-([0-9])*$//;
361 if ($process eq '') {
362 $process = "NO_PROCESS_NAME";
363 }
364
365 $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC};
366 $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED};
367 $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT};
368 $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE};
369 $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN};
370 $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS};
371 $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS};
372 $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG};
373 $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG};
374 $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED};
375 $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE};
376 $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE};
377 $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN};
378 }
379}
380
381sub report() {
382 if (!$opt_ignorepid) {
383 dump_stats(\%perprocesspid);
384 } else {
385 aggregate_perprocesspid();
386 dump_stats(\%perprocess);
387 }
388}
389
390# Process events or signals until neither is available
391sub signal_loop() {
392 my $sigint_processed;
393 do {
394 $sigint_processed = 0;
395 process_events();
396
397 # Handle pending signals if any
398 if ($sigint_pending) {
399 my $current_time = time;
400
401 if ($sigint_exit) {
402 print "Received exit signal\n";
403 $sigint_pending = 0;
404 }
405 if ($sigint_report) {
406 if ($current_time >= $sigint_received + 2) {
407 report();
408 $sigint_report = 0;
409 $sigint_pending = 0;
410 $sigint_processed = 1;
411 }
412 }
413 }
414 } while ($sigint_pending || $sigint_processed);
415}
416
417signal_loop();
418report();
diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt
deleted file mode 100644
index cd805e16dc27..000000000000
--- a/Documentation/trace/power.txt
+++ /dev/null
@@ -1,17 +0,0 @@
1The power tracer collects detailed information about C-state and P-state
2transitions, instead of just looking at the high-level "average"
3information.
4
5There is a helper script found in scrips/tracing/power.pl in the kernel
6sources which can be used to parse this information and create a
7Scalable Vector Graphics (SVG) picture from the trace data.
8
9To use this tracer:
10
11 echo 0 > /sys/kernel/debug/tracing/tracing_enabled
12 echo power > /sys/kernel/debug/tracing/current_tracer
13 echo 1 > /sys/kernel/debug/tracing/tracing_enabled
14 sleep 1
15 echo 0 > /sys/kernel/debug/tracing/tracing_enabled
16 cat /sys/kernel/debug/tracing/trace | \
17 perl scripts/tracing/power.pl > out.sv
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
new file mode 100644
index 000000000000..5eb4e487e667
--- /dev/null
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -0,0 +1,327 @@
1 Notes on Analysing Behaviour Using Events and Tracepoints
2
3 Documentation written by Mel Gorman
4 PCL information heavily based on email from Ingo Molnar
5
61. Introduction
7===============
8
9Tracepoints (see Documentation/trace/tracepoints.txt) can be used without
10creating custom kernel modules to register probe functions using the event
11tracing infrastructure.
12
13Simplistically, tracepoints will represent an important event that when can
14be taken in conjunction with other tracepoints to build a "Big Picture" of
15what is going on within the system. There are a large number of methods for
16gathering and interpreting these events. Lacking any current Best Practises,
17this document describes some of the methods that can be used.
18
19This document assumes that debugfs is mounted on /sys/kernel/debug and that
20the appropriate tracing options have been configured into the kernel. It is
21assumed that the PCL tool tools/perf has been installed and is in your path.
22
232. Listing Available Events
24===========================
25
262.1 Standard Utilities
27----------------------
28
29All possible events are visible from /sys/kernel/debug/tracing/events. Simply
30calling
31
32 $ find /sys/kernel/debug/tracing/events -type d
33
34will give a fair indication of the number of events available.
35
362.2 PCL
37-------
38
39Discovery and enumeration of all counters and events, including tracepoints
40are available with the perf tool. Getting a list of available events is a
41simple case of
42
43 $ perf list 2>&1 | grep Tracepoint
44 ext4:ext4_free_inode [Tracepoint event]
45 ext4:ext4_request_inode [Tracepoint event]
46 ext4:ext4_allocate_inode [Tracepoint event]
47 ext4:ext4_write_begin [Tracepoint event]
48 ext4:ext4_ordered_write_end [Tracepoint event]
49 [ .... remaining output snipped .... ]
50
51
522. Enabling Events
53==================
54
552.1 System-Wide Event Enabling
56------------------------------
57
58See Documentation/trace/events.txt for a proper description on how events
59can be enabled system-wide. A short example of enabling all events related
60to page allocation would look something like
61
62 $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done
63
642.2 System-Wide Event Enabling with SystemTap
65---------------------------------------------
66
67In SystemTap, tracepoints are accessible using the kernel.trace() function
68call. The following is an example that reports every 5 seconds what processes
69were allocating the pages.
70
71 global page_allocs
72
73 probe kernel.trace("mm_page_alloc") {
74 page_allocs[execname()]++
75 }
76
77 function print_count() {
78 printf ("%-25s %-s\n", "#Pages Allocated", "Process Name")
79 foreach (proc in page_allocs-)
80 printf("%-25d %s\n", page_allocs[proc], proc)
81 printf ("\n")
82 delete page_allocs
83 }
84
85 probe timer.s(5) {
86 print_count()
87 }
88
892.3 System-Wide Event Enabling with PCL
90---------------------------------------
91
92By specifying the -a switch and analysing sleep, the system-wide events
93for a duration of time can be examined.
94
95 $ perf stat -a \
96 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
97 -e kmem:mm_pagevec_free \
98 sleep 10
99 Performance counter stats for 'sleep 10':
100
101 9630 kmem:mm_page_alloc
102 2143 kmem:mm_page_free_direct
103 7424 kmem:mm_pagevec_free
104
105 10.002577764 seconds time elapsed
106
107Similarly, one could execute a shell and exit it as desired to get a report
108at that point.
109
1102.4 Local Event Enabling
111------------------------
112
113Documentation/trace/ftrace.txt describes how to enable events on a per-thread
114basis using set_ftrace_pid.
115
1162.5 Local Event Enablement with PCL
117-----------------------------------
118
119Events can be activate and tracked for the duration of a process on a local
120basis using PCL such as follows.
121
122 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
123 -e kmem:mm_pagevec_free ./hackbench 10
124 Time: 0.909
125
126 Performance counter stats for './hackbench 10':
127
128 17803 kmem:mm_page_alloc
129 12398 kmem:mm_page_free_direct
130 4827 kmem:mm_pagevec_free
131
132 0.973913387 seconds time elapsed
133
1343. Event Filtering
135==================
136
137Documentation/trace/ftrace.txt covers in-depth how to filter events in
138ftrace. Obviously using grep and awk of trace_pipe is an option as well
139as any script reading trace_pipe.
140
1414. Analysing Event Variances with PCL
142=====================================
143
144Any workload can exhibit variances between runs and it can be important
145to know what the standard deviation in. By and large, this is left to the
146performance analyst to do it by hand. In the event that the discrete event
147occurrences are useful to the performance analyst, then perf can be used.
148
149 $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct
150 -e kmem:mm_pagevec_free ./hackbench 10
151 Time: 0.890
152 Time: 0.895
153 Time: 0.915
154 Time: 1.001
155 Time: 0.899
156
157 Performance counter stats for './hackbench 10' (5 runs):
158
159 16630 kmem:mm_page_alloc ( +- 3.542% )
160 11486 kmem:mm_page_free_direct ( +- 4.771% )
161 4730 kmem:mm_pagevec_free ( +- 2.325% )
162
163 0.982653002 seconds time elapsed ( +- 1.448% )
164
165In the event that some higher-level event is required that depends on some
166aggregation of discrete events, then a script would need to be developed.
167
168Using --repeat, it is also possible to view how events are fluctuating over
169time on a system wide basis using -a and sleep.
170
171 $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
172 -e kmem:mm_pagevec_free \
173 -a --repeat 10 \
174 sleep 1
175 Performance counter stats for 'sleep 1' (10 runs):
176
177 1066 kmem:mm_page_alloc ( +- 26.148% )
178 182 kmem:mm_page_free_direct ( +- 5.464% )
179 890 kmem:mm_pagevec_free ( +- 30.079% )
180
181 1.002251757 seconds time elapsed ( +- 0.005% )
182
1835. Higher-Level Analysis with Helper Scripts
184============================================
185
186When events are enabled the events that are triggering can be read from
187/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary
188options exist as well. By post-processing the output, further information can
189be gathered on-line as appropriate. Examples of post-processing might include
190
191 o Reading information from /proc for the PID that triggered the event
192 o Deriving a higher-level event from a series of lower-level events.
193 o Calculate latencies between two events
194
195Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example
196script that can read trace_pipe from STDIN or a copy of a trace. When used
197on-line, it can be interrupted once to generate a report without existing
198and twice to exit.
199
200Simplistically, the script just reads STDIN and counts up events but it
201also can do more such as
202
203 o Derive high-level events from many low-level events. If a number of pages
204 are freed to the main allocator from the per-CPU lists, it recognises
205 that as one per-CPU drain even though there is no specific tracepoint
206 for that event
207 o It can aggregate based on PID or individual process number
208 o In the event memory is getting externally fragmented, it reports
209 on whether the fragmentation event was severe or moderate.
210 o When receiving an event about a PID, it can record who the parent was so
211 that if large numbers of events are coming from very short-lived
212 processes, the parent process responsible for creating all the helpers
213 can be identified
214
2156. Lower-Level Analysis with PCL
216================================
217
218There may also be a requirement to identify what functions with a program
219were generating events within the kernel. To begin this sort of analysis, the
220data must be recorded. At the time of writing, this required root
221
222 $ perf record -c 1 \
223 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
224 -e kmem:mm_pagevec_free \
225 ./hackbench 10
226 Time: 0.894
227 [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ]
228
229Note the use of '-c 1' to set the event period to sample. The default sample
230period is quite high to minimise overhead but the information collected can be
231very coarse as a result.
232
233This record outputted a file called perf.data which can be analysed using
234perf report.
235
236 $ perf report
237 # Samples: 30922
238 #
239 # Overhead Command Shared Object
240 # ........ ......... ................................
241 #
242 87.27% hackbench [vdso]
243 6.85% hackbench /lib/i686/cmov/libc-2.9.so
244 2.62% hackbench /lib/ld-2.9.so
245 1.52% perf [vdso]
246 1.22% hackbench ./hackbench
247 0.48% hackbench [kernel]
248 0.02% perf /lib/i686/cmov/libc-2.9.so
249 0.01% perf /usr/bin/perf
250 0.01% perf /lib/ld-2.9.so
251 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so
252 #
253 # (For more details, try: perf report --sort comm,dso,symbol)
254 #
255
256According to this, the vast majority of events occured triggered on events
257within the VDSO. With simple binaries, this will often be the case so lets
258take a slightly different example. In the course of writing this, it was
259noticed that X was generating an insane amount of page allocations so lets look
260at it
261
262 $ perf record -c 1 -f \
263 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \
264 -e kmem:mm_pagevec_free \
265 -p `pidof X`
266
267This was interrupted after a few seconds and
268
269 $ perf report
270 # Samples: 27666
271 #
272 # Overhead Command Shared Object
273 # ........ ....... .......................................
274 #
275 51.95% Xorg [vdso]
276 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1
277 0.09% Xorg /lib/i686/cmov/libc-2.9.so
278 0.01% Xorg [kernel]
279 #
280 # (For more details, try: perf report --sort comm,dso,symbol)
281 #
282
283So, almost half of the events are occuring in a library. To get an idea which
284symbol.
285
286 $ perf report --sort comm,dso,symbol
287 # Samples: 27666
288 #
289 # Overhead Command Shared Object Symbol
290 # ........ ....... ....................................... ......
291 #
292 51.95% Xorg [vdso] [.] 0x000000ffffe424
293 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2
294 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc
295 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f
296 0.01% Xorg [kernel] [k] read_hpet
297 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path
298 0.00% Xorg [kernel] [k] ftrace_trace_userstack
299
300To see where within the function pixmanFillsse2 things are going wrong
301
302 $ perf annotate pixmanFillsse2
303 [ ... ]
304 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax)
305 : }
306 :
307 : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _
308 : _mm_store_si128 (__m128i *__P, __m128i __B) : {
309 : *__P = __B;
310 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax)
311 0.00 : 34ef5: ff
312 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax)
313 0.00 : 34efd: ff
314 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax)
315 0.00 : 34f05: ff
316 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax)
317 0.00 : 34f0d: ff
318 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax)
319 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax)
320 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax)
321 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax)
322
323At a glance, it looks like the time is being spent copying pixmaps to
324the card. Further investigation would be needed to determine why pixmaps
325are being copied around so much but a starting point would be to take an
326ancient build of libpixmap out of the library path where it was totally
327forgotten about from months ago!