diff options
Diffstat (limited to 'Documentation/trace')
-rw-r--r-- | Documentation/trace/events-kmem.txt | 107 | ||||
-rw-r--r-- | Documentation/trace/events.txt | 2 | ||||
-rw-r--r-- | Documentation/trace/ftrace.txt | 2 | ||||
-rw-r--r-- | Documentation/trace/postprocess/trace-pagealloc-postprocess.pl | 418 | ||||
-rw-r--r-- | Documentation/trace/power.txt | 17 | ||||
-rw-r--r-- | Documentation/trace/tracepoint-analysis.txt | 327 |
6 files changed, 854 insertions, 19 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt new file mode 100644 index 000000000000..6ef2a8652e17 --- /dev/null +++ b/Documentation/trace/events-kmem.txt | |||
@@ -0,0 +1,107 @@ | |||
1 | Subsystem Trace Points: kmem | ||
2 | |||
3 | The tracing system kmem captures events related to object and page allocation | ||
4 | within the kernel. Broadly speaking there are four major subheadings. | ||
5 | |||
6 | o Slab allocation of small objects of unknown type (kmalloc) | ||
7 | o Slab allocation of small objects of known type | ||
8 | o Page allocation | ||
9 | o Per-CPU Allocator Activity | ||
10 | o External Fragmentation | ||
11 | |||
12 | This document will describe what each of the tracepoints are and why they | ||
13 | might be useful. | ||
14 | |||
15 | 1. Slab allocation of small objects of unknown type | ||
16 | =================================================== | ||
17 | kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s | ||
18 | kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d | ||
19 | kfree call_site=%lx ptr=%p | ||
20 | |||
21 | Heavy activity for these events may indicate that a specific cache is | ||
22 | justified, particularly if kmalloc slab pages are getting significantly | ||
23 | internal fragmented as a result of the allocation pattern. By correlating | ||
24 | kmalloc with kfree, it may be possible to identify memory leaks and where | ||
25 | the allocation sites were. | ||
26 | |||
27 | |||
28 | 2. Slab allocation of small objects of known type | ||
29 | ================================================= | ||
30 | kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s | ||
31 | kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d | ||
32 | kmem_cache_free call_site=%lx ptr=%p | ||
33 | |||
34 | These events are similar in usage to the kmalloc-related events except that | ||
35 | it is likely easier to pin the event down to a specific cache. At the time | ||
36 | of writing, no information is available on what slab is being allocated from, | ||
37 | but the call_site can usually be used to extrapolate that information | ||
38 | |||
39 | 3. Page allocation | ||
40 | ================== | ||
41 | mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s | ||
42 | mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d | ||
43 | mm_page_free_direct page=%p pfn=%lu order=%d | ||
44 | mm_pagevec_free page=%p pfn=%lu order=%d cold=%d | ||
45 | |||
46 | These four events deal with page allocation and freeing. mm_page_alloc is | ||
47 | a simple indicator of page allocator activity. Pages may be allocated from | ||
48 | the per-CPU allocator (high performance) or the buddy allocator. | ||
49 | |||
50 | If pages are allocated directly from the buddy allocator, the | ||
51 | mm_page_alloc_zone_locked event is triggered. This event is important as high | ||
52 | amounts of activity imply high activity on the zone->lock. Taking this lock | ||
53 | impairs performance by disabling interrupts, dirtying cache lines between | ||
54 | CPUs and serialising many CPUs. | ||
55 | |||
56 | When a page is freed directly by the caller, the mm_page_free_direct event | ||
57 | is triggered. Significant amounts of activity here could indicate that the | ||
58 | callers should be batching their activities. | ||
59 | |||
60 | When pages are freed using a pagevec, the mm_pagevec_free is | ||
61 | triggered. Broadly speaking, pages are taken off the LRU lock in bulk and | ||
62 | freed in batch with a pagevec. Significant amounts of activity here could | ||
63 | indicate that the system is under memory pressure and can also indicate | ||
64 | contention on the zone->lru_lock. | ||
65 | |||
66 | 4. Per-CPU Allocator Activity | ||
67 | ============================= | ||
68 | mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d | ||
69 | mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d | ||
70 | |||
71 | In front of the page allocator is a per-cpu page allocator. It exists only | ||
72 | for order-0 pages, reduces contention on the zone->lock and reduces the | ||
73 | amount of writing on struct page. | ||
74 | |||
75 | When a per-CPU list is empty or pages of the wrong type are allocated, | ||
76 | the zone->lock will be taken once and the per-CPU list refilled. The event | ||
77 | triggered is mm_page_alloc_zone_locked for each page allocated with the | ||
78 | event indicating whether it is for a percpu_refill or not. | ||
79 | |||
80 | When the per-CPU list is too full, a number of pages are freed, each one | ||
81 | which triggers a mm_page_pcpu_drain event. | ||
82 | |||
83 | The individual nature of the events are so that pages can be tracked | ||
84 | between allocation and freeing. A number of drain or refill pages that occur | ||
85 | consecutively imply the zone->lock being taken once. Large amounts of PCP | ||
86 | refills and drains could imply an imbalance between CPUs where too much work | ||
87 | is being concentrated in one place. It could also indicate that the per-CPU | ||
88 | lists should be a larger size. Finally, large amounts of refills on one CPU | ||
89 | and drains on another could be a factor in causing large amounts of cache | ||
90 | line bounces due to writes between CPUs and worth investigating if pages | ||
91 | can be allocated and freed on the same CPU through some algorithm change. | ||
92 | |||
93 | 5. External Fragmentation | ||
94 | ========================= | ||
95 | mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d | ||
96 | |||
97 | External fragmentation affects whether a high-order allocation will be | ||
98 | successful or not. For some types of hardware, this is important although | ||
99 | it is avoided where possible. If the system is using huge pages and needs | ||
100 | to be able to resize the pool over the lifetime of the system, this value | ||
101 | is important. | ||
102 | |||
103 | Large numbers of this event implies that memory is fragmenting and | ||
104 | high-order allocations will start failing at some time in the future. One | ||
105 | means of reducing the occurange of this event is to increase the size of | ||
106 | min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where | ||
107 | pageblock_size is usually the size of the default hugepage size. | ||
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt index 78c45a87be57..02ac6ed38b2d 100644 --- a/Documentation/trace/events.txt +++ b/Documentation/trace/events.txt | |||
@@ -72,7 +72,7 @@ To enable all events in sched subsystem: | |||
72 | 72 | ||
73 | # echo 1 > /sys/kernel/debug/tracing/events/sched/enable | 73 | # echo 1 > /sys/kernel/debug/tracing/events/sched/enable |
74 | 74 | ||
75 | To eanble all events: | 75 | To enable all events: |
76 | 76 | ||
77 | # echo 1 > /sys/kernel/debug/tracing/events/enable | 77 | # echo 1 > /sys/kernel/debug/tracing/events/enable |
78 | 78 | ||
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 1b6292bbdd6d..957b22fde2df 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt | |||
@@ -133,7 +133,7 @@ of ftrace. Here is a list of some of the key files: | |||
133 | than requested, the rest of the page will be used, | 133 | than requested, the rest of the page will be used, |
134 | making the actual allocation bigger than requested. | 134 | making the actual allocation bigger than requested. |
135 | ( Note, the size may not be a multiple of the page size | 135 | ( Note, the size may not be a multiple of the page size |
136 | due to buffer managment overhead. ) | 136 | due to buffer management overhead. ) |
137 | 137 | ||
138 | This can only be updated when the current_tracer | 138 | This can only be updated when the current_tracer |
139 | is set to "nop". | 139 | is set to "nop". |
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl new file mode 100644 index 000000000000..7df50e8cf4d9 --- /dev/null +++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl | |||
@@ -0,0 +1,418 @@ | |||
1 | #!/usr/bin/perl | ||
2 | # This is a POC (proof of concept or piece of crap, take your pick) for reading the | ||
3 | # text representation of trace output related to page allocation. It makes an attempt | ||
4 | # to extract some high-level information on what is going on. The accuracy of the parser | ||
5 | # may vary considerably | ||
6 | # | ||
7 | # Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe | ||
8 | # other options | ||
9 | # --prepend-parent Report on the parent proc and PID | ||
10 | # --read-procstat If the trace lacks process info, get it from /proc | ||
11 | # --ignore-pid Aggregate processes of the same name together | ||
12 | # | ||
13 | # Copyright (c) IBM Corporation 2009 | ||
14 | # Author: Mel Gorman <mel@csn.ul.ie> | ||
15 | use strict; | ||
16 | use Getopt::Long; | ||
17 | |||
18 | # Tracepoint events | ||
19 | use constant MM_PAGE_ALLOC => 1; | ||
20 | use constant MM_PAGE_FREE_DIRECT => 2; | ||
21 | use constant MM_PAGEVEC_FREE => 3; | ||
22 | use constant MM_PAGE_PCPU_DRAIN => 4; | ||
23 | use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5; | ||
24 | use constant MM_PAGE_ALLOC_EXTFRAG => 6; | ||
25 | use constant EVENT_UNKNOWN => 7; | ||
26 | |||
27 | # Constants used to track state | ||
28 | use constant STATE_PCPU_PAGES_DRAINED => 8; | ||
29 | use constant STATE_PCPU_PAGES_REFILLED => 9; | ||
30 | |||
31 | # High-level events extrapolated from tracepoints | ||
32 | use constant HIGH_PCPU_DRAINS => 10; | ||
33 | use constant HIGH_PCPU_REFILLS => 11; | ||
34 | use constant HIGH_EXT_FRAGMENT => 12; | ||
35 | use constant HIGH_EXT_FRAGMENT_SEVERE => 13; | ||
36 | use constant HIGH_EXT_FRAGMENT_MODERATE => 14; | ||
37 | use constant HIGH_EXT_FRAGMENT_CHANGED => 15; | ||
38 | |||
39 | my %perprocesspid; | ||
40 | my %perprocess; | ||
41 | my $opt_ignorepid; | ||
42 | my $opt_read_procstat; | ||
43 | my $opt_prepend_parent; | ||
44 | |||
45 | # Catch sigint and exit on request | ||
46 | my $sigint_report = 0; | ||
47 | my $sigint_exit = 0; | ||
48 | my $sigint_pending = 0; | ||
49 | my $sigint_received = 0; | ||
50 | sub sigint_handler { | ||
51 | my $current_time = time; | ||
52 | if ($current_time - 2 > $sigint_received) { | ||
53 | print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; | ||
54 | $sigint_report = 1; | ||
55 | } else { | ||
56 | if (!$sigint_exit) { | ||
57 | print "Second SIGINT received quickly, exiting\n"; | ||
58 | } | ||
59 | $sigint_exit++; | ||
60 | } | ||
61 | |||
62 | if ($sigint_exit > 3) { | ||
63 | print "Many SIGINTs received, exiting now without report\n"; | ||
64 | exit; | ||
65 | } | ||
66 | |||
67 | $sigint_received = $current_time; | ||
68 | $sigint_pending = 1; | ||
69 | } | ||
70 | $SIG{INT} = "sigint_handler"; | ||
71 | |||
72 | # Parse command line options | ||
73 | GetOptions( | ||
74 | 'ignore-pid' => \$opt_ignorepid, | ||
75 | 'read-procstat' => \$opt_read_procstat, | ||
76 | 'prepend-parent' => \$opt_prepend_parent, | ||
77 | ); | ||
78 | |||
79 | # Defaults for dynamically discovered regex's | ||
80 | my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])'; | ||
81 | |||
82 | # Dyanically discovered regex | ||
83 | my $regex_fragdetails; | ||
84 | |||
85 | # Static regex used. Specified like this for readability and for use with /o | ||
86 | # (process_pid) (cpus ) ( time ) (tpoint ) (details) | ||
87 | my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; | ||
88 | my $regex_statname = '[-0-9]*\s\((.*)\).*'; | ||
89 | my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; | ||
90 | |||
91 | sub generate_traceevent_regex { | ||
92 | my $event = shift; | ||
93 | my $default = shift; | ||
94 | my $regex; | ||
95 | |||
96 | # Read the event format or use the default | ||
97 | if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { | ||
98 | $regex = $default; | ||
99 | } else { | ||
100 | my $line; | ||
101 | while (!eof(FORMAT)) { | ||
102 | $line = <FORMAT>; | ||
103 | if ($line =~ /^print fmt:\s"(.*)",.*/) { | ||
104 | $regex = $1; | ||
105 | $regex =~ s/%p/\([0-9a-f]*\)/g; | ||
106 | $regex =~ s/%d/\([-0-9]*\)/g; | ||
107 | $regex =~ s/%lu/\([0-9]*\)/g; | ||
108 | } | ||
109 | } | ||
110 | } | ||
111 | |||
112 | # Verify fields are in the right order | ||
113 | my $tuple; | ||
114 | foreach $tuple (split /\s/, $regex) { | ||
115 | my ($key, $value) = split(/=/, $tuple); | ||
116 | my $expected = shift; | ||
117 | if ($key ne $expected) { | ||
118 | print("WARNING: Format not as expected '$key' != '$expected'"); | ||
119 | $regex =~ s/$key=\((.*)\)/$key=$1/; | ||
120 | } | ||
121 | } | ||
122 | |||
123 | if (defined shift) { | ||
124 | die("Fewer fields than expected in format"); | ||
125 | } | ||
126 | |||
127 | return $regex; | ||
128 | } | ||
129 | $regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag", | ||
130 | $regex_fragdetails_default, | ||
131 | "page", "pfn", | ||
132 | "alloc_order", "fallback_order", "pageblock_order", | ||
133 | "alloc_migratetype", "fallback_migratetype", | ||
134 | "fragmenting", "change_ownership"); | ||
135 | |||
136 | sub read_statline($) { | ||
137 | my $pid = $_[0]; | ||
138 | my $statline; | ||
139 | |||
140 | if (open(STAT, "/proc/$pid/stat")) { | ||
141 | $statline = <STAT>; | ||
142 | close(STAT); | ||
143 | } | ||
144 | |||
145 | if ($statline eq '') { | ||
146 | $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; | ||
147 | } | ||
148 | |||
149 | return $statline; | ||
150 | } | ||
151 | |||
152 | sub guess_process_pid($$) { | ||
153 | my $pid = $_[0]; | ||
154 | my $statline = $_[1]; | ||
155 | |||
156 | if ($pid == 0) { | ||
157 | return "swapper-0"; | ||
158 | } | ||
159 | |||
160 | if ($statline !~ /$regex_statname/o) { | ||
161 | die("Failed to math stat line for process name :: $statline"); | ||
162 | } | ||
163 | return "$1-$pid"; | ||
164 | } | ||
165 | |||
166 | sub parent_info($$) { | ||
167 | my $pid = $_[0]; | ||
168 | my $statline = $_[1]; | ||
169 | my $ppid; | ||
170 | |||
171 | if ($pid == 0) { | ||
172 | return "NOPARENT-0"; | ||
173 | } | ||
174 | |||
175 | if ($statline !~ /$regex_statppid/o) { | ||
176 | die("Failed to match stat line process ppid:: $statline"); | ||
177 | } | ||
178 | |||
179 | # Read the ppid stat line | ||
180 | $ppid = $1; | ||
181 | return guess_process_pid($ppid, read_statline($ppid)); | ||
182 | } | ||
183 | |||
184 | sub process_events { | ||
185 | my $traceevent; | ||
186 | my $process_pid; | ||
187 | my $cpus; | ||
188 | my $timestamp; | ||
189 | my $tracepoint; | ||
190 | my $details; | ||
191 | my $statline; | ||
192 | |||
193 | # Read each line of the event log | ||
194 | EVENT_PROCESS: | ||
195 | while ($traceevent = <STDIN>) { | ||
196 | if ($traceevent =~ /$regex_traceevent/o) { | ||
197 | $process_pid = $1; | ||
198 | $tracepoint = $4; | ||
199 | |||
200 | if ($opt_read_procstat || $opt_prepend_parent) { | ||
201 | $process_pid =~ /(.*)-([0-9]*)$/; | ||
202 | my $process = $1; | ||
203 | my $pid = $2; | ||
204 | |||
205 | $statline = read_statline($pid); | ||
206 | |||
207 | if ($opt_read_procstat && $process eq '') { | ||
208 | $process_pid = guess_process_pid($pid, $statline); | ||
209 | } | ||
210 | |||
211 | if ($opt_prepend_parent) { | ||
212 | $process_pid = parent_info($pid, $statline) . " :: $process_pid"; | ||
213 | } | ||
214 | } | ||
215 | |||
216 | # Unnecessary in this script. Uncomment if required | ||
217 | # $cpus = $2; | ||
218 | # $timestamp = $3; | ||
219 | } else { | ||
220 | next; | ||
221 | } | ||
222 | |||
223 | # Perl Switch() sucks majorly | ||
224 | if ($tracepoint eq "mm_page_alloc") { | ||
225 | $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++; | ||
226 | } elsif ($tracepoint eq "mm_page_free_direct") { | ||
227 | $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++; | ||
228 | } elsif ($tracepoint eq "mm_pagevec_free") { | ||
229 | $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++; | ||
230 | } elsif ($tracepoint eq "mm_page_pcpu_drain") { | ||
231 | $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++; | ||
232 | $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++; | ||
233 | } elsif ($tracepoint eq "mm_page_alloc_zone_locked") { | ||
234 | $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++; | ||
235 | $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++; | ||
236 | } elsif ($tracepoint eq "mm_page_alloc_extfrag") { | ||
237 | |||
238 | # Extract the details of the event now | ||
239 | $details = $5; | ||
240 | |||
241 | my ($page, $pfn); | ||
242 | my ($alloc_order, $fallback_order, $pageblock_order); | ||
243 | my ($alloc_migratetype, $fallback_migratetype); | ||
244 | my ($fragmenting, $change_ownership); | ||
245 | |||
246 | if ($details !~ /$regex_fragdetails/o) { | ||
247 | print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n"; | ||
248 | next; | ||
249 | } | ||
250 | |||
251 | $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++; | ||
252 | $page = $1; | ||
253 | $pfn = $2; | ||
254 | $alloc_order = $3; | ||
255 | $fallback_order = $4; | ||
256 | $pageblock_order = $5; | ||
257 | $alloc_migratetype = $6; | ||
258 | $fallback_migratetype = $7; | ||
259 | $fragmenting = $8; | ||
260 | $change_ownership = $9; | ||
261 | |||
262 | if ($fragmenting) { | ||
263 | $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++; | ||
264 | if ($fallback_order <= 3) { | ||
265 | $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++; | ||
266 | } else { | ||
267 | $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++; | ||
268 | } | ||
269 | } | ||
270 | if ($change_ownership) { | ||
271 | $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++; | ||
272 | } | ||
273 | } else { | ||
274 | $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; | ||
275 | } | ||
276 | |||
277 | # Catch a full pcpu drain event | ||
278 | if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} && | ||
279 | $tracepoint ne "mm_page_pcpu_drain") { | ||
280 | |||
281 | $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++; | ||
282 | $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; | ||
283 | } | ||
284 | |||
285 | # Catch a full pcpu refill event | ||
286 | if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} && | ||
287 | $tracepoint ne "mm_page_alloc_zone_locked") { | ||
288 | $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++; | ||
289 | $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; | ||
290 | } | ||
291 | |||
292 | if ($sigint_pending) { | ||
293 | last EVENT_PROCESS; | ||
294 | } | ||
295 | } | ||
296 | } | ||
297 | |||
298 | sub dump_stats { | ||
299 | my $hashref = shift; | ||
300 | my %stats = %$hashref; | ||
301 | |||
302 | # Dump per-process stats | ||
303 | my $process_pid; | ||
304 | my $max_strlen = 0; | ||
305 | |||
306 | # Get the maximum process name | ||
307 | foreach $process_pid (keys %perprocesspid) { | ||
308 | my $len = length($process_pid); | ||
309 | if ($len > $max_strlen) { | ||
310 | $max_strlen = $len; | ||
311 | } | ||
312 | } | ||
313 | $max_strlen += 2; | ||
314 | |||
315 | printf("\n"); | ||
316 | printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", | ||
317 | "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown"); | ||
318 | printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", | ||
319 | "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", ""); | ||
320 | |||
321 | printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", | ||
322 | "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", ""); | ||
323 | |||
324 | foreach $process_pid (keys %stats) { | ||
325 | # Dump final aggregates | ||
326 | if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) { | ||
327 | $stats{$process_pid}->{HIGH_PCPU_DRAINS}++; | ||
328 | $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; | ||
329 | } | ||
330 | if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) { | ||
331 | $stats{$process_pid}->{HIGH_PCPU_REFILLS}++; | ||
332 | $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; | ||
333 | } | ||
334 | |||
335 | printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n", | ||
336 | $process_pid, | ||
337 | $stats{$process_pid}->{MM_PAGE_ALLOC}, | ||
338 | $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}, | ||
339 | $stats{$process_pid}->{MM_PAGE_FREE_DIRECT}, | ||
340 | $stats{$process_pid}->{MM_PAGEVEC_FREE}, | ||
341 | $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN}, | ||
342 | $stats{$process_pid}->{HIGH_PCPU_DRAINS}, | ||
343 | $stats{$process_pid}->{HIGH_PCPU_REFILLS}, | ||
344 | $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}, | ||
345 | $stats{$process_pid}->{HIGH_EXT_FRAG}, | ||
346 | $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}, | ||
347 | $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}, | ||
348 | $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}, | ||
349 | $stats{$process_pid}->{EVENT_UNKNOWN}); | ||
350 | } | ||
351 | } | ||
352 | |||
353 | sub aggregate_perprocesspid() { | ||
354 | my $process_pid; | ||
355 | my $process; | ||
356 | undef %perprocess; | ||
357 | |||
358 | foreach $process_pid (keys %perprocesspid) { | ||
359 | $process = $process_pid; | ||
360 | $process =~ s/-([0-9])*$//; | ||
361 | if ($process eq '') { | ||
362 | $process = "NO_PROCESS_NAME"; | ||
363 | } | ||
364 | |||
365 | $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}; | ||
366 | $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}; | ||
367 | $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}; | ||
368 | $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}; | ||
369 | $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}; | ||
370 | $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}; | ||
371 | $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}; | ||
372 | $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}; | ||
373 | $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}; | ||
374 | $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}; | ||
375 | $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}; | ||
376 | $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}; | ||
377 | $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN}; | ||
378 | } | ||
379 | } | ||
380 | |||
381 | sub report() { | ||
382 | if (!$opt_ignorepid) { | ||
383 | dump_stats(\%perprocesspid); | ||
384 | } else { | ||
385 | aggregate_perprocesspid(); | ||
386 | dump_stats(\%perprocess); | ||
387 | } | ||
388 | } | ||
389 | |||
390 | # Process events or signals until neither is available | ||
391 | sub signal_loop() { | ||
392 | my $sigint_processed; | ||
393 | do { | ||
394 | $sigint_processed = 0; | ||
395 | process_events(); | ||
396 | |||
397 | # Handle pending signals if any | ||
398 | if ($sigint_pending) { | ||
399 | my $current_time = time; | ||
400 | |||
401 | if ($sigint_exit) { | ||
402 | print "Received exit signal\n"; | ||
403 | $sigint_pending = 0; | ||
404 | } | ||
405 | if ($sigint_report) { | ||
406 | if ($current_time >= $sigint_received + 2) { | ||
407 | report(); | ||
408 | $sigint_report = 0; | ||
409 | $sigint_pending = 0; | ||
410 | $sigint_processed = 1; | ||
411 | } | ||
412 | } | ||
413 | } | ||
414 | } while ($sigint_pending || $sigint_processed); | ||
415 | } | ||
416 | |||
417 | signal_loop(); | ||
418 | report(); | ||
diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt deleted file mode 100644 index cd805e16dc27..000000000000 --- a/Documentation/trace/power.txt +++ /dev/null | |||
@@ -1,17 +0,0 @@ | |||
1 | The power tracer collects detailed information about C-state and P-state | ||
2 | transitions, instead of just looking at the high-level "average" | ||
3 | information. | ||
4 | |||
5 | There is a helper script found in scrips/tracing/power.pl in the kernel | ||
6 | sources which can be used to parse this information and create a | ||
7 | Scalable Vector Graphics (SVG) picture from the trace data. | ||
8 | |||
9 | To use this tracer: | ||
10 | |||
11 | echo 0 > /sys/kernel/debug/tracing/tracing_enabled | ||
12 | echo power > /sys/kernel/debug/tracing/current_tracer | ||
13 | echo 1 > /sys/kernel/debug/tracing/tracing_enabled | ||
14 | sleep 1 | ||
15 | echo 0 > /sys/kernel/debug/tracing/tracing_enabled | ||
16 | cat /sys/kernel/debug/tracing/trace | \ | ||
17 | perl scripts/tracing/power.pl > out.sv | ||
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt new file mode 100644 index 000000000000..5eb4e487e667 --- /dev/null +++ b/Documentation/trace/tracepoint-analysis.txt | |||
@@ -0,0 +1,327 @@ | |||
1 | Notes on Analysing Behaviour Using Events and Tracepoints | ||
2 | |||
3 | Documentation written by Mel Gorman | ||
4 | PCL information heavily based on email from Ingo Molnar | ||
5 | |||
6 | 1. Introduction | ||
7 | =============== | ||
8 | |||
9 | Tracepoints (see Documentation/trace/tracepoints.txt) can be used without | ||
10 | creating custom kernel modules to register probe functions using the event | ||
11 | tracing infrastructure. | ||
12 | |||
13 | Simplistically, tracepoints will represent an important event that when can | ||
14 | be taken in conjunction with other tracepoints to build a "Big Picture" of | ||
15 | what is going on within the system. There are a large number of methods for | ||
16 | gathering and interpreting these events. Lacking any current Best Practises, | ||
17 | this document describes some of the methods that can be used. | ||
18 | |||
19 | This document assumes that debugfs is mounted on /sys/kernel/debug and that | ||
20 | the appropriate tracing options have been configured into the kernel. It is | ||
21 | assumed that the PCL tool tools/perf has been installed and is in your path. | ||
22 | |||
23 | 2. Listing Available Events | ||
24 | =========================== | ||
25 | |||
26 | 2.1 Standard Utilities | ||
27 | ---------------------- | ||
28 | |||
29 | All possible events are visible from /sys/kernel/debug/tracing/events. Simply | ||
30 | calling | ||
31 | |||
32 | $ find /sys/kernel/debug/tracing/events -type d | ||
33 | |||
34 | will give a fair indication of the number of events available. | ||
35 | |||
36 | 2.2 PCL | ||
37 | ------- | ||
38 | |||
39 | Discovery and enumeration of all counters and events, including tracepoints | ||
40 | are available with the perf tool. Getting a list of available events is a | ||
41 | simple case of | ||
42 | |||
43 | $ perf list 2>&1 | grep Tracepoint | ||
44 | ext4:ext4_free_inode [Tracepoint event] | ||
45 | ext4:ext4_request_inode [Tracepoint event] | ||
46 | ext4:ext4_allocate_inode [Tracepoint event] | ||
47 | ext4:ext4_write_begin [Tracepoint event] | ||
48 | ext4:ext4_ordered_write_end [Tracepoint event] | ||
49 | [ .... remaining output snipped .... ] | ||
50 | |||
51 | |||
52 | 2. Enabling Events | ||
53 | ================== | ||
54 | |||
55 | 2.1 System-Wide Event Enabling | ||
56 | ------------------------------ | ||
57 | |||
58 | See Documentation/trace/events.txt for a proper description on how events | ||
59 | can be enabled system-wide. A short example of enabling all events related | ||
60 | to page allocation would look something like | ||
61 | |||
62 | $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done | ||
63 | |||
64 | 2.2 System-Wide Event Enabling with SystemTap | ||
65 | --------------------------------------------- | ||
66 | |||
67 | In SystemTap, tracepoints are accessible using the kernel.trace() function | ||
68 | call. The following is an example that reports every 5 seconds what processes | ||
69 | were allocating the pages. | ||
70 | |||
71 | global page_allocs | ||
72 | |||
73 | probe kernel.trace("mm_page_alloc") { | ||
74 | page_allocs[execname()]++ | ||
75 | } | ||
76 | |||
77 | function print_count() { | ||
78 | printf ("%-25s %-s\n", "#Pages Allocated", "Process Name") | ||
79 | foreach (proc in page_allocs-) | ||
80 | printf("%-25d %s\n", page_allocs[proc], proc) | ||
81 | printf ("\n") | ||
82 | delete page_allocs | ||
83 | } | ||
84 | |||
85 | probe timer.s(5) { | ||
86 | print_count() | ||
87 | } | ||
88 | |||
89 | 2.3 System-Wide Event Enabling with PCL | ||
90 | --------------------------------------- | ||
91 | |||
92 | By specifying the -a switch and analysing sleep, the system-wide events | ||
93 | for a duration of time can be examined. | ||
94 | |||
95 | $ perf stat -a \ | ||
96 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
97 | -e kmem:mm_pagevec_free \ | ||
98 | sleep 10 | ||
99 | Performance counter stats for 'sleep 10': | ||
100 | |||
101 | 9630 kmem:mm_page_alloc | ||
102 | 2143 kmem:mm_page_free_direct | ||
103 | 7424 kmem:mm_pagevec_free | ||
104 | |||
105 | 10.002577764 seconds time elapsed | ||
106 | |||
107 | Similarly, one could execute a shell and exit it as desired to get a report | ||
108 | at that point. | ||
109 | |||
110 | 2.4 Local Event Enabling | ||
111 | ------------------------ | ||
112 | |||
113 | Documentation/trace/ftrace.txt describes how to enable events on a per-thread | ||
114 | basis using set_ftrace_pid. | ||
115 | |||
116 | 2.5 Local Event Enablement with PCL | ||
117 | ----------------------------------- | ||
118 | |||
119 | Events can be activate and tracked for the duration of a process on a local | ||
120 | basis using PCL such as follows. | ||
121 | |||
122 | $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
123 | -e kmem:mm_pagevec_free ./hackbench 10 | ||
124 | Time: 0.909 | ||
125 | |||
126 | Performance counter stats for './hackbench 10': | ||
127 | |||
128 | 17803 kmem:mm_page_alloc | ||
129 | 12398 kmem:mm_page_free_direct | ||
130 | 4827 kmem:mm_pagevec_free | ||
131 | |||
132 | 0.973913387 seconds time elapsed | ||
133 | |||
134 | 3. Event Filtering | ||
135 | ================== | ||
136 | |||
137 | Documentation/trace/ftrace.txt covers in-depth how to filter events in | ||
138 | ftrace. Obviously using grep and awk of trace_pipe is an option as well | ||
139 | as any script reading trace_pipe. | ||
140 | |||
141 | 4. Analysing Event Variances with PCL | ||
142 | ===================================== | ||
143 | |||
144 | Any workload can exhibit variances between runs and it can be important | ||
145 | to know what the standard deviation in. By and large, this is left to the | ||
146 | performance analyst to do it by hand. In the event that the discrete event | ||
147 | occurrences are useful to the performance analyst, then perf can be used. | ||
148 | |||
149 | $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct | ||
150 | -e kmem:mm_pagevec_free ./hackbench 10 | ||
151 | Time: 0.890 | ||
152 | Time: 0.895 | ||
153 | Time: 0.915 | ||
154 | Time: 1.001 | ||
155 | Time: 0.899 | ||
156 | |||
157 | Performance counter stats for './hackbench 10' (5 runs): | ||
158 | |||
159 | 16630 kmem:mm_page_alloc ( +- 3.542% ) | ||
160 | 11486 kmem:mm_page_free_direct ( +- 4.771% ) | ||
161 | 4730 kmem:mm_pagevec_free ( +- 2.325% ) | ||
162 | |||
163 | 0.982653002 seconds time elapsed ( +- 1.448% ) | ||
164 | |||
165 | In the event that some higher-level event is required that depends on some | ||
166 | aggregation of discrete events, then a script would need to be developed. | ||
167 | |||
168 | Using --repeat, it is also possible to view how events are fluctuating over | ||
169 | time on a system wide basis using -a and sleep. | ||
170 | |||
171 | $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
172 | -e kmem:mm_pagevec_free \ | ||
173 | -a --repeat 10 \ | ||
174 | sleep 1 | ||
175 | Performance counter stats for 'sleep 1' (10 runs): | ||
176 | |||
177 | 1066 kmem:mm_page_alloc ( +- 26.148% ) | ||
178 | 182 kmem:mm_page_free_direct ( +- 5.464% ) | ||
179 | 890 kmem:mm_pagevec_free ( +- 30.079% ) | ||
180 | |||
181 | 1.002251757 seconds time elapsed ( +- 0.005% ) | ||
182 | |||
183 | 5. Higher-Level Analysis with Helper Scripts | ||
184 | ============================================ | ||
185 | |||
186 | When events are enabled the events that are triggering can be read from | ||
187 | /sys/kernel/debug/tracing/trace_pipe in human-readable format although binary | ||
188 | options exist as well. By post-processing the output, further information can | ||
189 | be gathered on-line as appropriate. Examples of post-processing might include | ||
190 | |||
191 | o Reading information from /proc for the PID that triggered the event | ||
192 | o Deriving a higher-level event from a series of lower-level events. | ||
193 | o Calculate latencies between two events | ||
194 | |||
195 | Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example | ||
196 | script that can read trace_pipe from STDIN or a copy of a trace. When used | ||
197 | on-line, it can be interrupted once to generate a report without existing | ||
198 | and twice to exit. | ||
199 | |||
200 | Simplistically, the script just reads STDIN and counts up events but it | ||
201 | also can do more such as | ||
202 | |||
203 | o Derive high-level events from many low-level events. If a number of pages | ||
204 | are freed to the main allocator from the per-CPU lists, it recognises | ||
205 | that as one per-CPU drain even though there is no specific tracepoint | ||
206 | for that event | ||
207 | o It can aggregate based on PID or individual process number | ||
208 | o In the event memory is getting externally fragmented, it reports | ||
209 | on whether the fragmentation event was severe or moderate. | ||
210 | o When receiving an event about a PID, it can record who the parent was so | ||
211 | that if large numbers of events are coming from very short-lived | ||
212 | processes, the parent process responsible for creating all the helpers | ||
213 | can be identified | ||
214 | |||
215 | 6. Lower-Level Analysis with PCL | ||
216 | ================================ | ||
217 | |||
218 | There may also be a requirement to identify what functions with a program | ||
219 | were generating events within the kernel. To begin this sort of analysis, the | ||
220 | data must be recorded. At the time of writing, this required root | ||
221 | |||
222 | $ perf record -c 1 \ | ||
223 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
224 | -e kmem:mm_pagevec_free \ | ||
225 | ./hackbench 10 | ||
226 | Time: 0.894 | ||
227 | [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ] | ||
228 | |||
229 | Note the use of '-c 1' to set the event period to sample. The default sample | ||
230 | period is quite high to minimise overhead but the information collected can be | ||
231 | very coarse as a result. | ||
232 | |||
233 | This record outputted a file called perf.data which can be analysed using | ||
234 | perf report. | ||
235 | |||
236 | $ perf report | ||
237 | # Samples: 30922 | ||
238 | # | ||
239 | # Overhead Command Shared Object | ||
240 | # ........ ......... ................................ | ||
241 | # | ||
242 | 87.27% hackbench [vdso] | ||
243 | 6.85% hackbench /lib/i686/cmov/libc-2.9.so | ||
244 | 2.62% hackbench /lib/ld-2.9.so | ||
245 | 1.52% perf [vdso] | ||
246 | 1.22% hackbench ./hackbench | ||
247 | 0.48% hackbench [kernel] | ||
248 | 0.02% perf /lib/i686/cmov/libc-2.9.so | ||
249 | 0.01% perf /usr/bin/perf | ||
250 | 0.01% perf /lib/ld-2.9.so | ||
251 | 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so | ||
252 | # | ||
253 | # (For more details, try: perf report --sort comm,dso,symbol) | ||
254 | # | ||
255 | |||
256 | According to this, the vast majority of events occured triggered on events | ||
257 | within the VDSO. With simple binaries, this will often be the case so lets | ||
258 | take a slightly different example. In the course of writing this, it was | ||
259 | noticed that X was generating an insane amount of page allocations so lets look | ||
260 | at it | ||
261 | |||
262 | $ perf record -c 1 -f \ | ||
263 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
264 | -e kmem:mm_pagevec_free \ | ||
265 | -p `pidof X` | ||
266 | |||
267 | This was interrupted after a few seconds and | ||
268 | |||
269 | $ perf report | ||
270 | # Samples: 27666 | ||
271 | # | ||
272 | # Overhead Command Shared Object | ||
273 | # ........ ....... ....................................... | ||
274 | # | ||
275 | 51.95% Xorg [vdso] | ||
276 | 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 | ||
277 | 0.09% Xorg /lib/i686/cmov/libc-2.9.so | ||
278 | 0.01% Xorg [kernel] | ||
279 | # | ||
280 | # (For more details, try: perf report --sort comm,dso,symbol) | ||
281 | # | ||
282 | |||
283 | So, almost half of the events are occuring in a library. To get an idea which | ||
284 | symbol. | ||
285 | |||
286 | $ perf report --sort comm,dso,symbol | ||
287 | # Samples: 27666 | ||
288 | # | ||
289 | # Overhead Command Shared Object Symbol | ||
290 | # ........ ....... ....................................... ...... | ||
291 | # | ||
292 | 51.95% Xorg [vdso] [.] 0x000000ffffe424 | ||
293 | 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2 | ||
294 | 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc | ||
295 | 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f | ||
296 | 0.01% Xorg [kernel] [k] read_hpet | ||
297 | 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path | ||
298 | 0.00% Xorg [kernel] [k] ftrace_trace_userstack | ||
299 | |||
300 | To see where within the function pixmanFillsse2 things are going wrong | ||
301 | |||
302 | $ perf annotate pixmanFillsse2 | ||
303 | [ ... ] | ||
304 | 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax) | ||
305 | : } | ||
306 | : | ||
307 | : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _ | ||
308 | : _mm_store_si128 (__m128i *__P, __m128i __B) : { | ||
309 | : *__P = __B; | ||
310 | 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax) | ||
311 | 0.00 : 34ef5: ff | ||
312 | 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax) | ||
313 | 0.00 : 34efd: ff | ||
314 | 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax) | ||
315 | 0.00 : 34f05: ff | ||
316 | 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax) | ||
317 | 0.00 : 34f0d: ff | ||
318 | 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax) | ||
319 | 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax) | ||
320 | 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax) | ||
321 | 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax) | ||
322 | |||
323 | At a glance, it looks like the time is being spent copying pixmaps to | ||
324 | the card. Further investigation would be needed to determine why pixmaps | ||
325 | are being copied around so much but a starting point would be to take an | ||
326 | ancient build of libpixmap out of the library path where it was totally | ||
327 | forgotten about from months ago! | ||