author    | Andrea Bastoni <bastoni@cs.unc.edu> | 2010-05-30 19:16:45 -0400
committer | Andrea Bastoni <bastoni@cs.unc.edu> | 2010-05-30 19:16:45 -0400
commit    | ada47b5fe13d89735805b566185f4885f5a3f750 (patch)
tree      | 644b88f8a71896307d71438e9b3af49126ffb22b /Documentation/trace
parent    | 43e98717ad40a4ae64545b5ba047c7b86aa44f4f (diff)
parent    | 3280f21d43ee541f97f8cda5792150d2dbec20d5 (diff)
Merge branch 'wip-2.6.34' into old-private-master (archived-private-master)
Diffstat (limited to 'Documentation/trace')
-rw-r--r-- | Documentation/trace/events-kmem.txt         |  14
-rw-r--r-- | Documentation/trace/ftrace-design.txt       |  52
-rw-r--r-- | Documentation/trace/ftrace.txt              |   4
-rw-r--r-- | Documentation/trace/kprobetrace.txt         | 156
-rw-r--r-- | Documentation/trace/mmiotrace.txt           |  15
-rw-r--r-- | Documentation/trace/ring-buffer-design.txt  |  56
-rw-r--r-- | Documentation/trace/tracepoint-analysis.txt |  60
7 files changed, 271 insertions, 86 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt
index 6ef2a8652e17..aa82ee4a5a87 100644
--- a/Documentation/trace/events-kmem.txt
+++ b/Documentation/trace/events-kmem.txt
@@ -1,7 +1,7 @@ | |||
1 | Subsystem Trace Points: kmem | 1 | Subsystem Trace Points: kmem |
2 | 2 | ||
3 | The tracing system kmem captures events related to object and page allocation | 3 | The kmem tracing system captures events related to object and page allocation |
4 | within the kernel. Broadly speaking there are four major subheadings. | 4 | within the kernel. Broadly speaking there are five major subheadings. |
5 | 5 | ||
6 | o Slab allocation of small objects of unknown type (kmalloc) | 6 | o Slab allocation of small objects of unknown type (kmalloc) |
7 | o Slab allocation of small objects of known type | 7 | o Slab allocation of small objects of known type |
@@ -9,7 +9,7 @@ within the kernel. Broadly speaking there are four major subheadings. | |||
9 | o Per-CPU Allocator Activity | 9 | o Per-CPU Allocator Activity |
10 | o External Fragmentation | 10 | o External Fragmentation |
11 | 11 | ||
12 | This document will describe what each of the tracepoints are and why they | 12 | This document describes what each of the tracepoints is and why they |
13 | might be useful. | 13 | might be useful. |
14 | 14 | ||
15 | 1. Slab allocation of small objects of unknown type | 15 | 1. Slab allocation of small objects of unknown type |
@@ -34,7 +34,7 @@ kmem_cache_free call_site=%lx ptr=%p | |||
34 | These events are similar in usage to the kmalloc-related events except that | 34 | These events are similar in usage to the kmalloc-related events except that |
35 | it is likely easier to pin the event down to a specific cache. At the time | 35 | it is likely easier to pin the event down to a specific cache. At the time |
36 | of writing, no information is available on what slab is being allocated from, | 36 | of writing, no information is available on what slab is being allocated from, |
37 | but the call_site can usually be used to extrapolate that information | 37 | but the call_site can usually be used to extrapolate that information. |
38 | 38 | ||
39 | 3. Page allocation | 39 | 3. Page allocation |
40 | ================== | 40 | ================== |
@@ -80,9 +80,9 @@ event indicating whether it is for a percpu_refill or not. | |||
80 | When the per-CPU list is too full, a number of pages are freed, each one | 80 | When the per-CPU list is too full, a number of pages are freed, each one |
81 | which triggers a mm_page_pcpu_drain event. | 81 | which triggers a mm_page_pcpu_drain event. |
82 | 82 | ||
83 | The individual nature of the events are so that pages can be tracked | 83 | The individual nature of the events is so that pages can be tracked |
84 | between allocation and freeing. A number of drain or refill pages that occur | 84 | between allocation and freeing. A number of drain or refill pages that occur |
85 | consecutively imply the zone->lock being taken once. Large amounts of PCP | 85 | consecutively imply the zone->lock being taken once. Large amounts of per-CPU |
86 | refills and drains could imply an imbalance between CPUs where too much work | 86 | refills and drains could imply an imbalance between CPUs where too much work |
87 | is being concentrated in one place. It could also indicate that the per-CPU | 87 | is being concentrated in one place. It could also indicate that the per-CPU |
88 | lists should be a larger size. Finally, large amounts of refills on one CPU | 88 | lists should be a larger size. Finally, large amounts of refills on one CPU |
@@ -102,6 +102,6 @@ is important. | |||
102 | 102 | ||
103 | Large numbers of this event implies that memory is fragmenting and | 103 | Large numbers of this event implies that memory is fragmenting and |
104 | high-order allocations will start failing at some time in the future. One | 104 | high-order allocations will start failing at some time in the future. One |
105 | means of reducing the occurange of this event is to increase the size of | 105 | means of reducing the occurrence of this event is to increase the size of |
106 | min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where | 106 | min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where |
107 | pageblock_size is usually the size of the default hugepage size. | 107 | pageblock_size is usually the size of the default hugepage size. |
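As a hedged worked example of that increment (the figures below are illustrative assumptions, not values taken from this document): on a machine whose default hugepage size is 2MB, pageblock_size is 2048kB, so with 2 online nodes one increment would be

    3 * 2048kB * 2 = 12288kB

added to /proc/sys/vm/min_free_kbytes.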
diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt
index 7003e10f10f5..f1f81afee8a0 100644
--- a/Documentation/trace/ftrace-design.txt
+++ b/Documentation/trace/ftrace-design.txt
@@ -1,5 +1,6 @@ | |||
1 | function tracer guts | 1 | function tracer guts |
2 | ==================== | 2 | ==================== |
3 | By Mike Frysinger | ||
3 | 4 | ||
4 | Introduction | 5 | Introduction |
5 | ------------ | 6 | ------------ |
@@ -53,14 +54,14 @@ size of the mcount call that is embedded in the function). | |||
53 | For example, if the function foo() calls bar(), when the bar() function calls | 54 | For example, if the function foo() calls bar(), when the bar() function calls |
54 | mcount(), the arguments mcount() will pass to the tracer are: | 55 | mcount(), the arguments mcount() will pass to the tracer are: |
55 | "frompc" - the address bar() will use to return to foo() | 56 | "frompc" - the address bar() will use to return to foo() |
56 | "selfpc" - the address bar() (with _mcount() size adjustment) | 57 | "selfpc" - the address bar() (with mcount() size adjustment) |
57 | 58 | ||
58 | Also keep in mind that this mcount function will be called *a lot*, so | 59 | Also keep in mind that this mcount function will be called *a lot*, so |
59 | optimizing for the default case of no tracer will help the smooth running of | 60 | optimizing for the default case of no tracer will help the smooth running of |
60 | your system when tracing is disabled. So the start of the mcount function is | 61 | your system when tracing is disabled. So the start of the mcount function is |
61 | typically the bare min with checking things before returning. That also means | 62 | typically the bare minimum with checking things before returning. That also |
62 | the code flow should usually kept linear (i.e. no branching in the nop case). | 63 | means the code flow should usually be kept linear (i.e. no branching in the nop |
63 | This is of course an optimization and not a hard requirement. | 64 | case). This is of course an optimization and not a hard requirement. |
64 | 65 | ||
65 | Here is some pseudo code that should help (these functions should actually be | 66 | Here is some pseudo code that should help (these functions should actually be |
66 | implemented in assembly): | 67 | implemented in assembly): |
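As a condensed sketch of the shape that paragraph describes (the file's full pseudo code sits just after this hunk; the names here follow that pseudo code, so treat this only as an illustration):

    void mcount(void)
    {
            /* save only the bare minimum of state needed for the check */

            if (ftrace_trace_function == ftrace_stub)
                    goto out;       /* common case: no tracer registered */

            /* save all state required by the ABI, then call the tracer */
            unsigned long frompc = ...;     /* return address into the caller */
            unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE;
            ftrace_trace_function(frompc, selfpc);

    out:
            /* restore state and return to the traced function */
    }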
@@ -131,10 +132,10 @@ some functions to save (hijack) and restore the return address. | |||
131 | 132 | ||
132 | The mcount function should check the function pointers ftrace_graph_return | 133 | The mcount function should check the function pointers ftrace_graph_return |
133 | (compare to ftrace_stub) and ftrace_graph_entry (compare to | 134 | (compare to ftrace_stub) and ftrace_graph_entry (compare to |
134 | ftrace_graph_entry_stub). If either of those are not set to the relevant stub | 135 | ftrace_graph_entry_stub). If either of those is not set to the relevant stub |
135 | function, call the arch-specific function ftrace_graph_caller which in turn | 136 | function, call the arch-specific function ftrace_graph_caller which in turn |
136 | calls the arch-specific function prepare_ftrace_return. Neither of these | 137 | calls the arch-specific function prepare_ftrace_return. Neither of these |
137 | function names are strictly required, but you should use them anyways to stay | 138 | function names is strictly required, but you should use them anyway to stay |
138 | consistent across the architecture ports -- easier to compare & contrast | 139 | consistent across the architecture ports -- easier to compare & contrast |
139 | things. | 140 | things. |
140 | 141 | ||
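In pseudo C, the check that paragraph describes looks roughly like this (the real test is done in the mcount assembly, so take this only as an illustration):

    if (ftrace_graph_return != ftrace_stub ||
        ftrace_graph_entry != ftrace_graph_entry_stub)
            ftrace_graph_caller();  /* ends up in prepare_ftrace_return() */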
@@ -144,7 +145,7 @@ but the first argument should be a pointer to the "frompc". Typically this is | |||
144 | located on the stack. This allows the function to hijack the return address | 145 | located on the stack. This allows the function to hijack the return address |
145 | temporarily to have it point to the arch-specific function return_to_handler. | 146 | temporarily to have it point to the arch-specific function return_to_handler. |
146 | That function will simply call the common ftrace_return_to_handler function and | 147 | That function will simply call the common ftrace_return_to_handler function and |
147 | that will return the original return address with which, you can return to the | 148 | that will return the original return address with which you can return to the |
148 | original call site. | 149 | original call site. |
149 | 150 | ||
150 | Here is the updated mcount pseudo code: | 151 | Here is the updated mcount pseudo code: |
@@ -173,14 +174,16 @@ void ftrace_graph_caller(void) | |||
173 | 174 | ||
174 | unsigned long *frompc = &...; | 175 | unsigned long *frompc = &...; |
175 | unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; | 176 | unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; |
176 | prepare_ftrace_return(frompc, selfpc); | 177 | /* passing frame pointer up is optional -- see below */ |
178 | prepare_ftrace_return(frompc, selfpc, frame_pointer); | ||
177 | 179 | ||
178 | /* restore all state needed by the ABI */ | 180 | /* restore all state needed by the ABI */ |
179 | } | 181 | } |
180 | #endif | 182 | #endif |
181 | 183 | ||
182 | For information on how to implement prepare_ftrace_return(), simply look at | 184 | For information on how to implement prepare_ftrace_return(), simply look at the |
183 | the x86 version. The only architecture-specific piece in it is the setup of | 185 | x86 version (the frame pointer passing is optional; see the next section for |
186 | more information). The only architecture-specific piece in it is the setup of | ||
184 | the fault recovery table (the asm(...) code). The rest should be the same | 187 | the fault recovery table (the asm(...) code). The rest should be the same |
185 | across architectures. | 188 | across architectures. |
186 | 189 | ||
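A heavily simplified sketch of that hijack, modelled loosely on the x86 port (the prototype, the graph-entry bookkeeping and the fault recovery table are approximated or omitted here, so treat the details as assumptions):

    void prepare_ftrace_return(unsigned long *parent, unsigned long selfpc,
                               unsigned long frame_pointer)
    {
            unsigned long old = *parent;    /* the original "frompc" */

            /* make the traced function return into the tracer instead */
            *parent = (unsigned long)&return_to_handler;

            /* record the original return address; on failure, undo the hijack */
            if (ftrace_push_return_trace(old, selfpc, ..., frame_pointer) == -EBUSY)
                    *parent = old;
    }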
@@ -205,6 +208,23 @@ void return_to_handler(void) | |||
205 | #endif | 208 | #endif |
206 | 209 | ||
207 | 210 | ||
211 | HAVE_FUNCTION_GRAPH_FP_TEST | ||
212 | --------------------------- | ||
213 | |||
214 | An arch may pass in a unique value (frame pointer) to both the entering and | ||
215 | exiting of a function. On exit, the value is compared and if it does not | ||
216 | match, then it will panic the kernel. This is largely a sanity check for bad | ||
217 | code generation with gcc. If gcc for your port sanely updates the frame | ||
218 | pointer under different optimization levels, then ignore this option. | ||
219 | |||
220 | However, adding support for it isn't terribly difficult. In your assembly code | ||
221 | that calls prepare_ftrace_return(), pass the frame pointer as the 3rd argument. | ||
222 | Then in the C version of that function, do what the x86 port does and pass it | ||
223 | along to ftrace_push_return_trace() instead of a stub value of 0. | ||
224 | |||
225 | Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer. | ||
226 | |||
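In sketch form, the exit side mirrors the entry side shown above (the prototype is an approximation of the x86 one, not an exact signature):

    /* in the return_to_handler glue: hand back the same frame pointer */
    original_ret = ftrace_return_to_handler(frame_pointer);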
227 | |||
208 | HAVE_FTRACE_NMI_ENTER | 228 | HAVE_FTRACE_NMI_ENTER |
209 | --------------------- | 229 | --------------------- |
210 | 230 | ||
@@ -213,10 +233,18 @@ If you can't trace NMI functions, then skip this option. | |||
213 | <details to be filled> | 233 | <details to be filled> |
214 | 234 | ||
215 | 235 | ||
216 | HAVE_FTRACE_SYSCALLS | 236 | HAVE_SYSCALL_TRACEPOINTS |
217 | --------------------- | 237 | --------------------- |
218 | 238 | ||
219 | <details to be filled> | 239 | You need very few things to get syscall tracing working in an arch. |
240 | |||
241 | - Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig). | ||
242 | - Have a NR_syscalls variable in <asm/unistd.h> that provides the number | ||
243 | of syscalls supported by the arch. | ||
244 | - Support the TIF_SYSCALL_TRACEPOINT thread flag. | ||
245 | - Call the trace_sys_enter() and trace_sys_exit() tracepoints from the | ||
246 | arch's ptrace syscall tracing path (see the sketch below). | ||
247 | - Tag this arch as HAVE_SYSCALL_TRACEPOINTS. | ||
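As a rough illustration of the last two points, modelled on the x86 port (the flag test and the registers used for the syscall number and return value are arch-specific assumptions):

    /* in the arch's syscall_trace_enter() path */
    if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
            trace_sys_enter(regs, regs->orig_ax);   /* syscall number */

    /* in the arch's syscall_trace_leave() path */
    if (unlikely(test_thread_flag(TIF_SYSCALL_TRACEPOINT)))
            trace_sys_exit(regs, regs->ax);         /* return value */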
220 | 248 | ||
221 | 249 | ||
222 | HAVE_FTRACE_MCOUNT_RECORD | 250 | HAVE_FTRACE_MCOUNT_RECORD |
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt
index 8179692fbb90..03485bfbd797 100644
--- a/Documentation/trace/ftrace.txt
+++ b/Documentation/trace/ftrace.txt
@@ -1588,7 +1588,7 @@ module author does not need to worry about it. | |||
1588 | 1588 | ||
1589 | When tracing is enabled, kstop_machine is called to prevent | 1589 | When tracing is enabled, kstop_machine is called to prevent |
1590 | races with the CPUS executing code being modified (which can | 1590 | races with the CPUS executing code being modified (which can |
1591 | cause the CPU to do undesireable things), and the nops are | 1591 | cause the CPU to do undesirable things), and the nops are |
1592 | patched back to calls. But this time, they do not call mcount | 1592 | patched back to calls. But this time, they do not call mcount |
1593 | (which is just a function stub). They now call into the ftrace | 1593 | (which is just a function stub). They now call into the ftrace |
1594 | infrastructure. | 1594 | infrastructure. |
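Conceptually, each traced function's mcount call site moves through the states sketched below (this is only an illustration of the text above, not the actual patching code):

    void some_traced_function(void)
    {
            /*
             * as compiled with -pg:        call mcount
             * at boot (tracing off):       nop                <- patched by ftrace
             * when tracing is enabled:     call <ftrace trampoline>
             */

            /* ... normal function body ... */
    }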
@@ -1625,7 +1625,7 @@ If I am only interested in sys_nanosleep and hrtimer_interrupt: | |||
1625 | 1625 | ||
1626 | # echo sys_nanosleep hrtimer_interrupt \ | 1626 | # echo sys_nanosleep hrtimer_interrupt \ |
1627 | > set_ftrace_filter | 1627 | > set_ftrace_filter |
1628 | # echo ftrace > current_tracer | 1628 | # echo function > current_tracer |
1629 | # echo 1 > tracing_enabled | 1629 | # echo 1 > tracing_enabled |
1630 | # usleep 1 | 1630 | # usleep 1 |
1631 | # echo 0 > tracing_enabled | 1631 | # echo 0 > tracing_enabled |
diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt
new file mode 100644
index 000000000000..a9100b28eb84
--- /dev/null
+++ b/Documentation/trace/kprobetrace.txt
@@ -0,0 +1,156 @@ | |||
1 | Kprobe-based Event Tracing | ||
2 | ========================== | ||
3 | |||
4 | Documentation is written by Masami Hiramatsu | ||
5 | |||
6 | |||
7 | Overview | ||
8 | -------- | ||
9 | These events are similar to tracepoint-based events. Instead of tracepoints, | ||
10 | they are based on kprobes (kprobe and kretprobe), so they can probe wherever | ||
11 | kprobes can probe (that is, the body of any function except __kprobes | ||
12 | functions). Unlike tracepoint-based events, they can be added and removed | ||
13 | dynamically, on the fly. | ||
14 | |||
15 | To enable this feature, build your kernel with CONFIG_KPROBE_TRACING=y. | ||
16 | |||
17 | Similar to the events tracer, this doesn't need to be activated via | ||
19 | current_tracer. Instead, add probe points via | ||
19 | /sys/kernel/debug/tracing/kprobe_events, and enable it via | ||
20 | /sys/kernel/debug/tracing/events/kprobes/<EVENT>/enabled. | ||
21 | |||
22 | |||
23 | Synopsis of kprobe_events | ||
24 | ------------------------- | ||
25 | p[:[GRP/]EVENT] SYMBOL[+offs]|MEMADDR [FETCHARGS] : Set a probe | ||
26 | r[:[GRP/]EVENT] SYMBOL[+0] [FETCHARGS] : Set a return probe | ||
27 | -:[GRP/]EVENT : Clear a probe | ||
28 | |||
29 | GRP : Group name. If omitted, use "kprobes" for it. | ||
30 | EVENT : Event name. If omitted, the event name is generated | ||
31 | based on SYMBOL+offs or MEMADDR. | ||
32 | SYMBOL[+offs] : Symbol+offset where the probe is inserted. | ||
33 | MEMADDR : Address where the probe is inserted. | ||
34 | |||
35 | FETCHARGS : Arguments. Each probe can have up to 128 args. | ||
36 | %REG : Fetch register REG | ||
37 | @ADDR : Fetch memory at ADDR (ADDR should be in kernel) | ||
38 | @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) | ||
39 | $stackN : Fetch Nth entry of stack (N >= 0) | ||
40 | $stack : Fetch stack address. | ||
41 | $retval : Fetch return value.(*) | ||
42 | +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) | ||
43 | NAME=FETCHARG: Set NAME as the argument name of FETCHARG. | ||
44 | |||
45 | (*) only for return probe. | ||
46 | (**) this is useful for fetching a field of data structures. | ||
47 | |||
48 | |||
49 | Per-Probe Event Filtering | ||
50 | ------------------------- | ||
51 | The per-probe event filtering feature allows you to set a different filter on | ||
52 | each probe and to choose which arguments are shown in the trace buffer. If an | ||
53 | event name is specified right after 'p:' or 'r:' in kprobe_events, an event is | ||
54 | added under tracing/events/kprobes/<EVENT>, a directory in which you can see | ||
55 | 'id', 'enabled', 'format' and 'filter'. | ||
56 | |||
57 | enabled: | ||
58 | You can enable/disable the probe by writing 1 or 0 on it. | ||
59 | |||
60 | format: | ||
61 | This shows the format of this probe event. | ||
62 | |||
63 | filter: | ||
64 | You can write filtering rules of this event. | ||
65 | |||
66 | id: | ||
67 | This shows the id of this probe event. | ||
68 | |||
69 | |||
70 | Event Profiling | ||
71 | --------------- | ||
72 | You can check the total number of probe hits and probe miss-hits via | ||
73 | /sys/kernel/debug/tracing/kprobe_profile. | ||
74 | The first column is event name, the second is the number of probe hits, | ||
75 | the third is the number of probe miss-hits. | ||
76 | |||
77 | |||
78 | Usage examples | ||
79 | -------------- | ||
80 | To add a probe as a new event, write a new definition to kprobe_events | ||
81 | as below. | ||
82 | |||
83 | echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events | ||
84 | |||
85 | This sets a kprobe at the top of the do_sys_open() function, recording the | ||
86 | 1st to 4th arguments as the "myprobe" event. Note that which register/stack | ||
87 | entry is assigned to each function argument depends on the arch-specific ABI. | ||
88 | If you are unsure of the ABI, please try the probe subcommand of perf-tools | ||
89 | (you can find it under tools/perf/). | ||
90 | As this example shows, users can choose more familiar names for each argument. | ||
91 | |||
92 | echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/debug/tracing/kprobe_events | ||
93 | |||
94 | This sets a kretprobe on the return point of the do_sys_open() function, | ||
95 | recording its return value as the "myretprobe" event. | ||
96 | You can see the format of these events via | ||
97 | /sys/kernel/debug/tracing/events/kprobes/<EVENT>/format. | ||
98 | |||
99 | cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format | ||
100 | name: myprobe | ||
101 | ID: 780 | ||
102 | format: | ||
103 | field:unsigned short common_type; offset:0; size:2; signed:0; | ||
104 | field:unsigned char common_flags; offset:2; size:1; signed:0; | ||
105 | field:unsigned char common_preempt_count; offset:3; size:1;signed:0; | ||
106 | field:int common_pid; offset:4; size:4; signed:1; | ||
107 | field:int common_lock_depth; offset:8; size:4; signed:1; | ||
108 | |||
109 | field:unsigned long __probe_ip; offset:12; size:4; signed:0; | ||
110 | field:int __probe_nargs; offset:16; size:4; signed:1; | ||
111 | field:unsigned long dfd; offset:20; size:4; signed:0; | ||
112 | field:unsigned long filename; offset:24; size:4; signed:0; | ||
113 | field:unsigned long flags; offset:28; size:4; signed:0; | ||
114 | field:unsigned long mode; offset:32; size:4; signed:0; | ||
115 | |||
116 | |||
117 | print fmt: "(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx", REC->__probe_ip, | ||
118 | REC->dfd, REC->filename, REC->flags, REC->mode | ||
119 | |||
120 | You can see that the event has 4 arguments as in the expressions you specified. | ||
121 | |||
122 | echo > /sys/kernel/debug/tracing/kprobe_events | ||
123 | |||
124 | This clears all probe points. | ||
125 | |||
126 | Or, | ||
127 | |||
128 | echo -:myprobe >> kprobe_events | ||
129 | |||
130 | This clears probe points selectively. | ||
131 | |||
132 | Right after definition, each event is disabled by default. To trace these | ||
133 | events, you need to enable them. | ||
134 | |||
135 | echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable | ||
136 | echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable | ||
137 | |||
138 | And you can see the traced information via /sys/kernel/debug/tracing/trace. | ||
139 | |||
140 | cat /sys/kernel/debug/tracing/trace | ||
141 | # tracer: nop | ||
142 | # | ||
143 | # TASK-PID CPU# TIMESTAMP FUNCTION | ||
144 | # | | | | | | ||
145 | <...>-1447 [001] 1038282.286875: myprobe: (do_sys_open+0x0/0xd6) dfd=3 filename=7fffd1ec4440 flags=8000 mode=0 | ||
146 | <...>-1447 [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe | ||
147 | <...>-1447 [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6 | ||
148 | <...>-1447 [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3 | ||
149 | <...>-1447 [001] 1038282.286969: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=4041c6 flags=98800 mode=10 | ||
150 | <...>-1447 [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3 | ||
151 | |||
152 | |||
153 | Each line shows when the kernel hits an event, and <- SYMBOL means the kernel | ||
154 | returns from SYMBOL (e.g. "sys_open+0x1b/0x1d <- do_sys_open" means the kernel | ||
155 | returns from do_sys_open to sys_open+0x1b). | ||
156 | |||
diff --git a/Documentation/trace/mmiotrace.txt b/Documentation/trace/mmiotrace.txt
index 162effbfbdec..664e7386d89e 100644
--- a/Documentation/trace/mmiotrace.txt
+++ b/Documentation/trace/mmiotrace.txt
@@ -44,7 +44,8 @@ Check for lost events. | |||
44 | Usage | 44 | Usage |
45 | ----- | 45 | ----- |
46 | 46 | ||
47 | Make sure debugfs is mounted to /sys/kernel/debug. If not, (requires root privileges) | 47 | Make sure debugfs is mounted to /sys/kernel/debug. |
48 | If not (requires root privileges): | ||
48 | $ mount -t debugfs debugfs /sys/kernel/debug | 49 | $ mount -t debugfs debugfs /sys/kernel/debug |
49 | 50 | ||
50 | Check that the driver you are about to trace is not loaded. | 51 | Check that the driver you are about to trace is not loaded. |
@@ -91,7 +92,7 @@ $ dmesg > dmesg.txt | |||
91 | $ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt | 92 | $ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt |
92 | and then send the .tar.gz file. The trace compresses considerably. Replace | 93 | and then send the .tar.gz file. The trace compresses considerably. Replace |
93 | "pciid" and "nick" with the PCI ID or model name of your piece of hardware | 94 | "pciid" and "nick" with the PCI ID or model name of your piece of hardware |
94 | under investigation and your nick name. | 95 | under investigation and your nickname. |
95 | 96 | ||
96 | 97 | ||
97 | How Mmiotrace Works | 98 | How Mmiotrace Works |
@@ -100,7 +101,7 @@ How Mmiotrace Works | |||
100 | Access to hardware IO-memory is gained by mapping addresses from PCI bus by | 101 | Access to hardware IO-memory is gained by mapping addresses from PCI bus by |
101 | calling one of the ioremap_*() functions. Mmiotrace is hooked into the | 102 | calling one of the ioremap_*() functions. Mmiotrace is hooked into the |
102 | __ioremap() function and gets called whenever a mapping is created. Mapping is | 103 | __ioremap() function and gets called whenever a mapping is created. Mapping is |
103 | an event that is recorded into the trace log. Note, that ISA range mappings | 104 | an event that is recorded into the trace log. Note that ISA range mappings |
104 | are not caught, since the mapping always exists and is returned directly. | 105 | are not caught, since the mapping always exists and is returned directly. |
105 | 106 | ||
106 | MMIO accesses are recorded via page faults. Just before __ioremap() returns, | 107 | MMIO accesses are recorded via page faults. Just before __ioremap() returns, |
@@ -122,11 +123,11 @@ Trace Log Format | |||
122 | ---------------- | 123 | ---------------- |
123 | 124 | ||
124 | The raw log is text and easily filtered with e.g. grep and awk. One record is | 125 | The raw log is text and easily filtered with e.g. grep and awk. One record is |
125 | one line in the log. A record starts with a keyword, followed by keyword | 126 | one line in the log. A record starts with a keyword, followed by keyword- |
126 | dependant arguments. Arguments are separated by a space, or continue until the | 127 | dependent arguments. Arguments are separated by a space, or continue until the |
127 | end of line. The format for version 20070824 is as follows: | 128 | end of line. The format for version 20070824 is as follows: |
128 | 129 | ||
129 | Explanation Keyword Space separated arguments | 130 | Explanation Keyword Space-separated arguments |
130 | --------------------------------------------------------------------------- | 131 | --------------------------------------------------------------------------- |
131 | 132 | ||
132 | read event R width, timestamp, map id, physical, value, PC, PID | 133 | read event R width, timestamp, map id, physical, value, PC, PID |
@@ -136,7 +137,7 @@ iounmap event UNMAP timestamp, map id, PC, PID | |||
136 | marker MARK timestamp, text | 137 | marker MARK timestamp, text |
137 | version VERSION the string "20070824" | 138 | version VERSION the string "20070824" |
138 | info for reader LSPCI one line from lspci -v | 139 | info for reader LSPCI one line from lspci -v |
139 | PCI address map PCIDEV space separated /proc/bus/pci/devices data | 140 | PCI address map PCIDEV space-separated /proc/bus/pci/devices data |
140 | unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID | 141 | unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID |
141 | 142 | ||
142 | Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual | 143 | Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual |
diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt
index 5b1d23d604c5..d299ff31df57 100644
--- a/Documentation/trace/ring-buffer-design.txt
+++ b/Documentation/trace/ring-buffer-design.txt
@@ -33,9 +33,9 @@ head_page - a pointer to the page that the reader will use next | |||
33 | 33 | ||
34 | tail_page - a pointer to the page that will be written to next | 34 | tail_page - a pointer to the page that will be written to next |
35 | 35 | ||
36 | commit_page - a pointer to the page with the last finished non nested write. | 36 | commit_page - a pointer to the page with the last finished non-nested write. |
37 | 37 | ||
38 | cmpxchg - hardware assisted atomic transaction that performs the following: | 38 | cmpxchg - hardware-assisted atomic transaction that performs the following: |
39 | 39 | ||
40 | A = B iff previous A == C | 40 | A = B iff previous A == C |
41 | 41 | ||
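The real cmpxchg is a single hardware-assisted atomic operation; the non-atomic C model below is only meant to unpack the "A = B iff previous A == C" notation:

    unsigned long cmpxchg_model(unsigned long *A, unsigned long C,
                                unsigned long B)
    {
            unsigned long old = *A;

            if (old == C)
                    *A = B;         /* A = B only if A was still C */

            return old;             /* caller checks old == C to see if it won */
    }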
@@ -52,15 +52,15 @@ The Generic Ring Buffer | |||
52 | The ring buffer can be used in either an overwrite mode or in | 52 | The ring buffer can be used in either an overwrite mode or in |
53 | producer/consumer mode. | 53 | producer/consumer mode. |
54 | 54 | ||
55 | Producer/consumer mode is where the producer were to fill up the | 55 | Producer/consumer mode is where if the producer were to fill up the |
56 | buffer before the consumer could free up anything, the producer | 56 | buffer before the consumer could free up anything, the producer |
57 | will stop writing to the buffer. This will lose most recent events. | 57 | will stop writing to the buffer. This will lose most recent events. |
58 | 58 | ||
59 | Overwrite mode is where the produce were to fill up the buffer | 59 | Overwrite mode is where if the producer were to fill up the buffer |
60 | before the consumer could free up anything, the producer will | 60 | before the consumer could free up anything, the producer will |
61 | overwrite the older data. This will lose the oldest events. | 61 | overwrite the older data. This will lose the oldest events. |
62 | 62 | ||
63 | No two writers can write at the same time (on the same per cpu buffer), | 63 | No two writers can write at the same time (on the same per-cpu buffer), |
64 | but a writer may interrupt another writer, but it must finish writing | 64 | but a writer may interrupt another writer, but it must finish writing |
65 | before the previous writer may continue. This is very important to the | 65 | before the previous writer may continue. This is very important to the |
66 | algorithm. The writers act like a "stack". The way interrupts works | 66 | algorithm. The writers act like a "stack". The way interrupts works |
@@ -79,16 +79,16 @@ the interrupt doing a write as well. | |||
79 | 79 | ||
80 | Readers can happen at any time. But no two readers may run at the | 80 | Readers can happen at any time. But no two readers may run at the |
81 | same time, nor can a reader preempt/interrupt another reader. A reader | 81 | same time, nor can a reader preempt/interrupt another reader. A reader |
82 | can not preempt/interrupt a writer, but it may read/consume from the | 82 | cannot preempt/interrupt a writer, but it may read/consume from the |
83 | buffer at the same time as a writer is writing, but the reader must be | 83 | buffer at the same time as a writer is writing, but the reader must be |
84 | on another processor to do so. A reader may read on its own processor | 84 | on another processor to do so. A reader may read on its own processor |
85 | and can be preempted by a writer. | 85 | and can be preempted by a writer. |
86 | 86 | ||
87 | A writer can preempt a reader, but a reader can not preempt a writer. | 87 | A writer can preempt a reader, but a reader cannot preempt a writer. |
88 | But a reader can read the buffer at the same time (on another processor) | 88 | But a reader can read the buffer at the same time (on another processor) |
89 | as a writer. | 89 | as a writer. |
90 | 90 | ||
91 | The ring buffer is made up of a list of pages held together by a link list. | 91 | The ring buffer is made up of a list of pages held together by a linked list. |
92 | 92 | ||
93 | At initialization a reader page is allocated for the reader that is not | 93 | At initialization a reader page is allocated for the reader that is not |
94 | part of the ring buffer. | 94 | part of the ring buffer. |
@@ -102,7 +102,7 @@ the head page. | |||
102 | 102 | ||
103 | The reader has its own page to use. At start up time, this page is | 103 | The reader has its own page to use. At start up time, this page is |
104 | allocated but is not attached to the list. When the reader wants | 104 | allocated but is not attached to the list. When the reader wants |
105 | to read from the buffer, if its page is empty (like it is on start up) | 105 | to read from the buffer, if its page is empty (like it is on start-up), |
106 | it will swap its page with the head_page. The old reader page will | 106 | it will swap its page with the head_page. The old reader page will |
107 | become part of the ring buffer and the head_page will be removed. | 107 | become part of the ring buffer and the head_page will be removed. |
108 | The page after the inserted page (old reader_page) will become the | 108 | The page after the inserted page (old reader_page) will become the |
@@ -206,7 +206,7 @@ The main pointers: | |||
206 | 206 | ||
207 | commit page - the page that last finished a write. | 207 | commit page - the page that last finished a write. |
208 | 208 | ||
209 | The commit page only is updated by the outer most writer in the | 209 | The commit page only is updated by the outermost writer in the |
210 | writer stack. A writer that preempts another writer will not move the | 210 | writer stack. A writer that preempts another writer will not move the |
211 | commit page. | 211 | commit page. |
212 | 212 | ||
@@ -281,7 +281,7 @@ with the previous write. | |||
281 | The commit pointer points to the last write location that was | 281 | The commit pointer points to the last write location that was |
282 | committed without preempting another write. When a write that | 282 | committed without preempting another write. When a write that |
283 | preempted another write is committed, it only becomes a pending commit | 283 | preempted another write is committed, it only becomes a pending commit |
284 | and will not be a full commit till all writes have been committed. | 284 | and will not be a full commit until all writes have been committed. |
285 | 285 | ||
286 | The commit page points to the page that has the last full commit. | 286 | The commit page points to the page that has the last full commit. |
287 | The tail page points to the page with the last write (before | 287 | The tail page points to the page with the last write (before |
@@ -292,7 +292,7 @@ be several pages ahead. If the tail page catches up to the commit | |||
292 | page then no more writes may take place (regardless of the mode | 292 | page then no more writes may take place (regardless of the mode |
293 | of the ring buffer: overwrite and produce/consumer). | 293 | of the ring buffer: overwrite and produce/consumer). |
294 | 294 | ||
295 | The order of pages are: | 295 | The order of pages is: |
296 | 296 | ||
297 | head page | 297 | head page |
298 | commit page | 298 | commit page |
@@ -311,7 +311,7 @@ Possible scenario: | |||
311 | There is a special case that the head page is after either the commit page | 311 | There is a special case that the head page is after either the commit page |
312 | and possibly the tail page. That is when the commit (and tail) page has been | 312 | and possibly the tail page. That is when the commit (and tail) page has been |
313 | swapped with the reader page. This is because the head page is always | 313 | swapped with the reader page. This is because the head page is always |
314 | part of the ring buffer, but the reader page is not. When ever there | 314 | part of the ring buffer, but the reader page is not. Whenever there |
315 | has been less than a full page that has been committed inside the ring buffer, | 315 | has been less than a full page that has been committed inside the ring buffer, |
316 | and a reader swaps out a page, it will be swapping out the commit page. | 316 | and a reader swaps out a page, it will be swapping out the commit page. |
317 | 317 | ||
@@ -338,7 +338,7 @@ and a reader swaps out a page, it will be swapping out the commit page. | |||
338 | In this case, the head page will not move when the tail and commit | 338 | In this case, the head page will not move when the tail and commit |
339 | move back into the ring buffer. | 339 | move back into the ring buffer. |
340 | 340 | ||
341 | The reader can not swap a page into the ring buffer if the commit page | 341 | The reader cannot swap a page into the ring buffer if the commit page |
342 | is still on that page. If the read meets the last commit (real commit | 342 | is still on that page. If the read meets the last commit (real commit |
343 | not pending or reserved), then there is nothing more to read. | 343 | not pending or reserved), then there is nothing more to read. |
344 | The buffer is considered empty until another full commit finishes. | 344 | The buffer is considered empty until another full commit finishes. |
@@ -395,7 +395,7 @@ The main idea behind the lockless algorithm is to combine the moving | |||
395 | of the head_page pointer with the swapping of pages with the reader. | 395 | of the head_page pointer with the swapping of pages with the reader. |
396 | State flags are placed inside the pointer to the page. To do this, | 396 | State flags are placed inside the pointer to the page. To do this, |
397 | each page must be aligned in memory by 4 bytes. This will allow the 2 | 397 | each page must be aligned in memory by 4 bytes. This will allow the 2 |
398 | least significant bits of the address to be used as flags. Since | 398 | least significant bits of the address to be used as flags, since |
399 | they will always be zero for the address. To get the address, | 399 | they will always be zero for the address. To get the address, |
400 | simply mask out the flags. | 400 | simply mask out the flags. |
401 | 401 | ||
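A short sketch of that masking (RB_FLAG_MASK and ptr_val are illustrative names, not the ring buffer's own):

    #define RB_FLAG_MASK    3UL     /* the 2 least significant bits */

    unsigned long flags = ptr_val & RB_FLAG_MASK;   /* HEADER/UPDATE/NORMAL state */
    unsigned long addr  = ptr_val & ~RB_FLAG_MASK;  /* the 4-byte aligned page address */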
@@ -460,7 +460,7 @@ When the reader tries to swap the page with the ring buffer, it | |||
460 | will also use cmpxchg. If the flag bit in the pointer to the | 460 | will also use cmpxchg. If the flag bit in the pointer to the |
461 | head page does not have the HEADER flag set, the compare will fail | 461 | head page does not have the HEADER flag set, the compare will fail |
462 | and the reader will need to look for the new head page and try again. | 462 | and the reader will need to look for the new head page and try again. |
463 | Note, the flag UPDATE and HEADER are never set at the same time. | 463 | Note, the flags UPDATE and HEADER are never set at the same time. |
464 | 464 | ||
465 | The reader swaps the reader page as follows: | 465 | The reader swaps the reader page as follows: |
466 | 466 | ||
@@ -539,7 +539,7 @@ updated to the reader page. | |||
539 | | +-----------------------------+ | | 539 | | +-----------------------------+ | |
540 | +------------------------------------+ | 540 | +------------------------------------+ |
541 | 541 | ||
542 | Another important point. The page that the reader page points back to | 542 | Another important point: The page that the reader page points back to |
543 | by its previous pointer (the one that now points to the new head page) | 543 | by its previous pointer (the one that now points to the new head page) |
544 | never points back to the reader page. That is because the reader page is | 544 | never points back to the reader page. That is because the reader page is |
545 | not part of the ring buffer. Traversing the ring buffer via the next pointers | 545 | not part of the ring buffer. Traversing the ring buffer via the next pointers |
@@ -572,7 +572,7 @@ not be able to swap the head page from the buffer, nor will it be able to | |||
572 | move the head page, until the writer is finished with the move. | 572 | move the head page, until the writer is finished with the move. |
573 | 573 | ||
574 | This eliminates any races that the reader can have on the writer. The reader | 574 | This eliminates any races that the reader can have on the writer. The reader |
575 | must spin, and this is why the reader can not preempt the writer. | 575 | must spin, and this is why the reader cannot preempt the writer. |
576 | 576 | ||
577 | tail page | 577 | tail page |
578 | | | 578 | | |
@@ -659,9 +659,9 @@ before pushing the head page. If it is, then it can be assumed that the | |||
659 | tail page wrapped the buffer, and we must drop new writes. | 659 | tail page wrapped the buffer, and we must drop new writes. |
660 | 660 | ||
661 | This is not a race condition, because the commit page can only be moved | 661 | This is not a race condition, because the commit page can only be moved |
662 | by the outter most writer (the writer that was preempted). | 662 | by the outermost writer (the writer that was preempted). |
663 | This means that the commit will not move while a writer is moving the | 663 | This means that the commit will not move while a writer is moving the |
664 | tail page. The reader can not swap the reader page if it is also being | 664 | tail page. The reader cannot swap the reader page if it is also being |
665 | used as the commit page. The reader can simply check that the commit | 665 | used as the commit page. The reader can simply check that the commit |
666 | is off the reader page. Once the commit page leaves the reader page | 666 | is off the reader page. Once the commit page leaves the reader page |
667 | it will never go back on it unless a reader does another swap with the | 667 | it will never go back on it unless a reader does another swap with the |
@@ -733,7 +733,7 @@ The write converts the head page pointer to UPDATE. | |||
733 | --->| |<---| |<---| |<---| |<--- | 733 | --->| |<---| |<---| |<---| |<--- |
734 | +---+ +---+ +---+ +---+ | 734 | +---+ +---+ +---+ +---+ |
735 | 735 | ||
736 | But if a nested writer preempts here. It will see that the next | 736 | But if a nested writer preempts here, it will see that the next |
737 | page is a head page, but it is also nested. It will detect that | 737 | page is a head page, but it is also nested. It will detect that |
738 | it is nested and will save that information. The detection is the | 738 | it is nested and will save that information. The detection is the |
739 | fact that it sees the UPDATE flag instead of a HEADER or NORMAL | 739 | fact that it sees the UPDATE flag instead of a HEADER or NORMAL |
@@ -761,7 +761,7 @@ to NORMAL. | |||
761 | --->| |<---| |<---| |<---| |<--- | 761 | --->| |<---| |<---| |<---| |<--- |
762 | +---+ +---+ +---+ +---+ | 762 | +---+ +---+ +---+ +---+ |
763 | 763 | ||
764 | After the nested writer finishes, the outer most writer will convert | 764 | After the nested writer finishes, the outermost writer will convert |
765 | the UPDATE pointer to NORMAL. | 765 | the UPDATE pointer to NORMAL. |
766 | 766 | ||
767 | 767 | ||
@@ -812,7 +812,7 @@ head page. | |||
812 | +---+ +---+ +---+ +---+ | 812 | +---+ +---+ +---+ +---+ |
813 | 813 | ||
814 | The nested writer moves the tail page forward. But does not set the old | 814 | The nested writer moves the tail page forward. But does not set the old |
815 | update page to NORMAL because it is not the outer most writer. | 815 | update page to NORMAL because it is not the outermost writer. |
816 | 816 | ||
817 | tail page | 817 | tail page |
818 | | | 818 | | |
@@ -892,7 +892,7 @@ It will return to the first writer. | |||
892 | --->| |<---| |<---| |<---| |<--- | 892 | --->| |<---| |<---| |<---| |<--- |
893 | +---+ +---+ +---+ +---+ | 893 | +---+ +---+ +---+ +---+ |
894 | 894 | ||
895 | The first writer can not know atomically test if the tail page moved | 895 | The first writer cannot know atomically if the tail page moved |
896 | while it updates the HEAD page. It will then update the head page to | 896 | while it updates the HEAD page. It will then update the head page to |
897 | what it thinks is the new head page. | 897 | what it thinks is the new head page. |
898 | 898 | ||
@@ -923,9 +923,9 @@ if the tail page is either where it use to be or on the next page: | |||
923 | --->| |<---| |<---| |<---| |<--- | 923 | --->| |<---| |<---| |<---| |<--- |
924 | +---+ +---+ +---+ +---+ | 924 | +---+ +---+ +---+ +---+ |
925 | 925 | ||
926 | If tail page != A and tail page does not equal B, then it must reset the | 926 | If tail page != A and tail page != B, then it must reset the pointer |
927 | pointer back to NORMAL. The fact that it only needs to worry about | 927 | back to NORMAL. The fact that it only needs to worry about nested |
928 | nested writers, it only needs to check this after setting the HEAD page. | 928 | writers means that it only needs to check this after setting the HEAD page. |
929 | 929 | ||
930 | 930 | ||
931 | (first writer) | 931 | (first writer) |
@@ -939,7 +939,7 @@ nested writers, it only needs to check this after setting the HEAD page. | |||
939 | +---+ +---+ +---+ +---+ | 939 | +---+ +---+ +---+ +---+ |
940 | 940 | ||
941 | Now the writer can update the head page. This is also why the head page must | 941 | Now the writer can update the head page. This is also why the head page must |
942 | remain in UPDATE and only reset by the outer most writer. This prevents | 942 | remain in UPDATE and only reset by the outermost writer. This prevents |
943 | the reader from seeing the incorrect head page. | 943 | the reader from seeing the incorrect head page. |
944 | 944 | ||
945 | 945 | ||
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt
index 5eb4e487e667..87bee3c129ba 100644
--- a/Documentation/trace/tracepoint-analysis.txt
+++ b/Documentation/trace/tracepoint-analysis.txt
@@ -10,8 +10,8 @@ Tracepoints (see Documentation/trace/tracepoints.txt) can be used without | |||
10 | creating custom kernel modules to register probe functions using the event | 10 | creating custom kernel modules to register probe functions using the event |
11 | tracing infrastructure. | 11 | tracing infrastructure. |
12 | 12 | ||
13 | Simplistically, tracepoints will represent an important event that when can | 13 | Simplistically, tracepoints represent important events that can be |
14 | be taken in conjunction with other tracepoints to build a "Big Picture" of | 14 | taken in conjunction with other tracepoints to build a "Big Picture" of |
15 | what is going on within the system. There are a large number of methods for | 15 | what is going on within the system. There are a large number of methods for |
16 | gathering and interpreting these events. Lacking any current Best Practises, | 16 | gathering and interpreting these events. Lacking any current Best Practises, |
17 | this document describes some of the methods that can be used. | 17 | this document describes some of the methods that can be used. |
@@ -33,12 +33,12 @@ calling | |||
33 | 33 | ||
34 | will give a fair indication of the number of events available. | 34 | will give a fair indication of the number of events available. |
35 | 35 | ||
36 | 2.2 PCL | 36 | 2.2 PCL (Performance Counters for Linux) |
37 | ------- | 37 | ------- |
38 | 38 | ||
39 | Discovery and enumeration of all counters and events, including tracepoints | 39 | Discovery and enumeration of all counters and events, including tracepoints, |
40 | are available with the perf tool. Getting a list of available events is a | 40 | are available with the perf tool. Getting a list of available events is a |
41 | simple case of | 41 | simple case of: |
42 | 42 | ||
43 | $ perf list 2>&1 | grep Tracepoint | 43 | $ perf list 2>&1 | grep Tracepoint |
44 | ext4:ext4_free_inode [Tracepoint event] | 44 | ext4:ext4_free_inode [Tracepoint event] |
@@ -49,19 +49,19 @@ simple case of | |||
49 | [ .... remaining output snipped .... ] | 49 | [ .... remaining output snipped .... ] |
50 | 50 | ||
51 | 51 | ||
52 | 2. Enabling Events | 52 | 3. Enabling Events |
53 | ================== | 53 | ================== |
54 | 54 | ||
55 | 2.1 System-Wide Event Enabling | 55 | 3.1 System-Wide Event Enabling |
56 | ------------------------------ | 56 | ------------------------------ |
57 | 57 | ||
58 | See Documentation/trace/events.txt for a proper description on how events | 58 | See Documentation/trace/events.txt for a proper description on how events |
59 | can be enabled system-wide. A short example of enabling all events related | 59 | can be enabled system-wide. A short example of enabling all events related |
60 | to page allocation would look something like | 60 | to page allocation would look something like: |
61 | 61 | ||
62 | $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done | 62 | $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done |
63 | 63 | ||
64 | 2.2 System-Wide Event Enabling with SystemTap | 64 | 3.2 System-Wide Event Enabling with SystemTap |
65 | --------------------------------------------- | 65 | --------------------------------------------- |
66 | 66 | ||
67 | In SystemTap, tracepoints are accessible using the kernel.trace() function | 67 | In SystemTap, tracepoints are accessible using the kernel.trace() function |
@@ -86,7 +86,7 @@ were allocating the pages. | |||
86 | print_count() | 86 | print_count() |
87 | } | 87 | } |
88 | 88 | ||
89 | 2.3 System-Wide Event Enabling with PCL | 89 | 3.3 System-Wide Event Enabling with PCL |
90 | --------------------------------------- | 90 | --------------------------------------- |
91 | 91 | ||
92 | By specifying the -a switch and analysing sleep, the system-wide events | 92 | By specifying the -a switch and analysing sleep, the system-wide events |
@@ -107,16 +107,16 @@ for a duration of time can be examined. | |||
107 | Similarly, one could execute a shell and exit it as desired to get a report | 107 | Similarly, one could execute a shell and exit it as desired to get a report |
108 | at that point. | 108 | at that point. |
109 | 109 | ||
110 | 2.4 Local Event Enabling | 110 | 3.4 Local Event Enabling |
111 | ------------------------ | 111 | ------------------------ |
112 | 112 | ||
113 | Documentation/trace/ftrace.txt describes how to enable events on a per-thread | 113 | Documentation/trace/ftrace.txt describes how to enable events on a per-thread |
114 | basis using set_ftrace_pid. | 114 | basis using set_ftrace_pid. |
115 | 115 | ||
116 | 2.5 Local Event Enablement with PCL | 116 | 3.5 Local Event Enablement with PCL |
117 | ----------------------------------- | 117 | ----------------------------------- |
118 | 118 | ||
119 | Events can be activate and tracked for the duration of a process on a local | 119 | Events can be activated and tracked for the duration of a process on a local |
120 | basis using PCL such as follows. | 120 | basis using PCL such as follows. |
121 | 121 | ||
122 | $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | 122 | $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ |
@@ -131,18 +131,18 @@ basis using PCL such as follows. | |||
131 | 131 | ||
132 | 0.973913387 seconds time elapsed | 132 | 0.973913387 seconds time elapsed |
133 | 133 | ||
134 | 3. Event Filtering | 134 | 4. Event Filtering |
135 | ================== | 135 | ================== |
136 | 136 | ||
137 | Documentation/trace/ftrace.txt covers in-depth how to filter events in | 137 | Documentation/trace/ftrace.txt covers in-depth how to filter events in |
138 | ftrace. Obviously using grep and awk of trace_pipe is an option as well | 138 | ftrace. Obviously using grep and awk of trace_pipe is an option as well |
139 | as any script reading trace_pipe. | 139 | as any script reading trace_pipe. |
140 | 140 | ||
141 | 4. Analysing Event Variances with PCL | 141 | 5. Analysing Event Variances with PCL |
142 | ===================================== | 142 | ===================================== |
143 | 143 | ||
144 | Any workload can exhibit variances between runs and it can be important | 144 | Any workload can exhibit variances between runs and it can be important |
145 | to know what the standard deviation in. By and large, this is left to the | 145 | to know what the standard deviation is. By and large, this is left to the |
146 | performance analyst to do it by hand. In the event that the discrete event | 146 | performance analyst to do it by hand. In the event that the discrete event |
147 | occurrences are useful to the performance analyst, then perf can be used. | 147 | occurrences are useful to the performance analyst, then perf can be used. |
148 | 148 | ||
@@ -166,7 +166,7 @@ In the event that some higher-level event is required that depends on some | |||
166 | aggregation of discrete events, then a script would need to be developed. | 166 | aggregation of discrete events, then a script would need to be developed. |
167 | 167 | ||
168 | Using --repeat, it is also possible to view how events are fluctuating over | 168 | Using --repeat, it is also possible to view how events are fluctuating over |
169 | time on a system wide basis using -a and sleep. | 169 | time on a system-wide basis using -a and sleep. |
170 | 170 | ||
171 | $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | 171 | $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ |
172 | -e kmem:mm_pagevec_free \ | 172 | -e kmem:mm_pagevec_free \ |
@@ -180,7 +180,7 @@ time on a system wide basis using -a and sleep. | |||
180 | 180 | ||
181 | 1.002251757 seconds time elapsed ( +- 0.005% ) | 181 | 1.002251757 seconds time elapsed ( +- 0.005% ) |
182 | 182 | ||
183 | 5. Higher-Level Analysis with Helper Scripts | 183 | 6. Higher-Level Analysis with Helper Scripts |
184 | ============================================ | 184 | ============================================ |
185 | 185 | ||
186 | When events are enabled the events that are triggering can be read from | 186 | When events are enabled the events that are triggering can be read from |
@@ -190,11 +190,11 @@ be gathered on-line as appropriate. Examples of post-processing might include | |||
190 | 190 | ||
191 | o Reading information from /proc for the PID that triggered the event | 191 | o Reading information from /proc for the PID that triggered the event |
192 | o Deriving a higher-level event from a series of lower-level events. | 192 | o Deriving a higher-level event from a series of lower-level events. |
193 | o Calculate latencies between two events | 193 | o Calculating latencies between two events |
194 | 194 | ||
195 | Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example | 195 | Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example |
196 | script that can read trace_pipe from STDIN or a copy of a trace. When used | 196 | script that can read trace_pipe from STDIN or a copy of a trace. When used |
197 | on-line, it can be interrupted once to generate a report without existing | 197 | on-line, it can be interrupted once to generate a report without exiting |
198 | and twice to exit. | 198 | and twice to exit. |
199 | 199 | ||
200 | Simplistically, the script just reads STDIN and counts up events but it | 200 | Simplistically, the script just reads STDIN and counts up events but it |
@@ -212,12 +212,12 @@ also can do more such as | |||
212 | processes, the parent process responsible for creating all the helpers | 212 | processes, the parent process responsible for creating all the helpers |
213 | can be identified | 213 | can be identified |
214 | 214 | ||
215 | 6. Lower-Level Analysis with PCL | 215 | 7. Lower-Level Analysis with PCL |
216 | ================================ | 216 | ================================ |
217 | 217 | ||
218 | There may also be a requirement to identify what functions with a program | 218 | There may also be a requirement to identify what functions within a program |
219 | were generating events within the kernel. To begin this sort of analysis, the | 219 | were generating events within the kernel. To begin this sort of analysis, the |
220 | data must be recorded. At the time of writing, this required root | 220 | data must be recorded. At the time of writing, this required root: |
221 | 221 | ||
222 | $ perf record -c 1 \ | 222 | $ perf record -c 1 \ |
223 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | 223 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ |
@@ -253,11 +253,11 @@ perf report. | |||
253 | # (For more details, try: perf report --sort comm,dso,symbol) | 253 | # (For more details, try: perf report --sort comm,dso,symbol) |
254 | # | 254 | # |
255 | 255 | ||
256 | According to this, the vast majority of events occured triggered on events | 256 | According to this, the vast majority of events triggered on events |
257 | within the VDSO. With simple binaries, this will often be the case so lets | 257 | within the VDSO. With simple binaries, this will often be the case so let's |
258 | take a slightly different example. In the course of writing this, it was | 258 | take a slightly different example. In the course of writing this, it was |
259 | noticed that X was generating an insane amount of page allocations so lets look | 259 | noticed that X was generating an insane amount of page allocations so let's look |
260 | at it | 260 | at it: |
261 | 261 | ||
262 | $ perf record -c 1 -f \ | 262 | $ perf record -c 1 -f \ |
263 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | 263 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ |
@@ -280,8 +280,8 @@ This was interrupted after a few seconds and | |||
280 | # (For more details, try: perf report --sort comm,dso,symbol) | 280 | # (For more details, try: perf report --sort comm,dso,symbol) |
281 | # | 281 | # |
282 | 282 | ||
283 | So, almost half of the events are occuring in a library. To get an idea which | 283 | So, almost half of the events are occurring in a library. To get an idea which |
284 | symbol. | 284 | symbol: |
285 | 285 | ||
286 | $ perf report --sort comm,dso,symbol | 286 | $ perf report --sort comm,dso,symbol |
287 | # Samples: 27666 | 287 | # Samples: 27666 |
@@ -297,7 +297,7 @@ symbol. | |||
297 | 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path | 297 | 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path |
298 | 0.00% Xorg [kernel] [k] ftrace_trace_userstack | 298 | 0.00% Xorg [kernel] [k] ftrace_trace_userstack |
299 | 299 | ||
300 | To see where within the function pixmanFillsse2 things are going wrong | 300 | To see where within the function pixmanFillsse2 things are going wrong: |
301 | 301 | ||
302 | $ perf annotate pixmanFillsse2 | 302 | $ perf annotate pixmanFillsse2 |
303 | [ ... ] | 303 | [ ... ] |