Diffstat (limited to 'Documentation/trace')
-rw-r--r-- | Documentation/trace/events-kmem.txt | 107 | ||||
-rw-r--r-- | Documentation/trace/events.txt | 219 | ||||
-rw-r--r-- | Documentation/trace/ftrace-design.txt | 233 | ||||
-rw-r--r-- | Documentation/trace/ftrace.txt | 76 | ||||
-rw-r--r-- | Documentation/trace/function-graph-fold.vim | 42 | ||||
-rw-r--r-- | Documentation/trace/postprocess/trace-pagealloc-postprocess.pl | 418 | ||||
-rw-r--r-- | Documentation/trace/power.txt | 17 | ||||
-rw-r--r-- | Documentation/trace/ring-buffer-design.txt | 955 | ||||
-rw-r--r-- | Documentation/trace/tracepoint-analysis.txt | 327 |
9 files changed, 2330 insertions, 64 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt new file mode 100644 index 000000000000..6ef2a8652e17 --- /dev/null +++ b/Documentation/trace/events-kmem.txt | |||
@@ -0,0 +1,107 @@ | |||
1 | Subsystem Trace Points: kmem | ||
2 | |||
3 | The tracing system kmem captures events related to object and page allocation | ||
4 | within the kernel. Broadly speaking there are five major subheadings. | ||
5 | |||
6 | o Slab allocation of small objects of unknown type (kmalloc) | ||
7 | o Slab allocation of small objects of known type | ||
8 | o Page allocation | ||
9 | o Per-CPU Allocator Activity | ||
10 | o External Fragmentation | ||
11 | |||
12 | This document describes what each of the tracepoints is and why it | ||
13 | might be useful. | ||
14 | |||
15 | 1. Slab allocation of small objects of unknown type | ||
16 | =================================================== | ||
17 | kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s | ||
18 | kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d | ||
19 | kfree call_site=%lx ptr=%p | ||
20 | |||
21 | Heavy activity for these events may indicate that a specific cache is | ||
22 | justified, particularly if kmalloc slab pages are getting significantly | ||
23 | internally fragmented as a result of the allocation pattern. By correlating | ||
24 | kmalloc with kfree, it may be possible to identify memory leaks and where | ||
25 | the allocation sites were. | ||
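
The kmalloc/kfree correlation described above can be sketched as a small
shell helper. This is only an illustration and not part of the kernel tree:
the function name unmatched_kmalloc is hypothetical, and the sketch assumes
the text output of the trace file, where each line carries the event name
followed by a colon and the key=value fields listed above. Allocations made
before tracing started, or freed after it stopped, will be reported as
false positives.

```shell
# Hypothetical helper: list pointers that were kmalloc'ed but never kfree'd.
# Reads ftrace text output on stdin and prints "ptr=... call_site=..." lines.
unmatched_kmalloc() {
    awk '
    {
        event = ""; ptr = ""; site = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "kmalloc:" || $i == "kmalloc_node:") event = "alloc"
            else if ($i == "kfree:") event = "free"
            else if ($i ~ /^ptr=/) ptr = substr($i, 5)
            else if ($i ~ /^call_site=/) site = substr($i, 11)
        }
        # Remember the allocation site until a matching kfree is seen.
        if (event == "alloc" && ptr != "") live[ptr] = site
        else if (event == "free") delete live[ptr]
    }
    END {
        for (p in live) print "ptr=" p " call_site=" live[p]
    }' | sort
}
```

A typical use would be to pipe a saved copy of the trace file into the
function and then resolve the reported call_site addresses to symbols.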
26 | |||
27 | |||
28 | 2. Slab allocation of small objects of known type | ||
29 | ================================================= | ||
30 | kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s | ||
31 | kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d | ||
32 | kmem_cache_free call_site=%lx ptr=%p | ||
33 | |||
34 | These events are similar in usage to the kmalloc-related events except that | ||
35 | it is likely easier to pin the event down to a specific cache. At the time | ||
36 | of writing, no information is available on what slab is being allocated from, | ||
37 | but the call_site can usually be used to extrapolate that information. | ||
38 | |||
39 | 3. Page allocation | ||
40 | ================== | ||
41 | mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s | ||
42 | mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d | ||
43 | mm_page_free_direct page=%p pfn=%lu order=%d | ||
44 | mm_pagevec_free page=%p pfn=%lu order=%d cold=%d | ||
45 | |||
46 | These four events deal with page allocation and freeing. mm_page_alloc is | ||
47 | a simple indicator of page allocator activity. Pages may be allocated from | ||
48 | the per-CPU allocator (high performance) or the buddy allocator. | ||
49 | |||
50 | If pages are allocated directly from the buddy allocator, the | ||
51 | mm_page_alloc_zone_locked event is triggered. This event is important as high | ||
52 | amounts of activity imply high activity on the zone->lock. Taking this lock | ||
53 | impairs performance by disabling interrupts, dirtying cache lines between | ||
54 | CPUs and serialising many CPUs. | ||
55 | |||
56 | When a page is freed directly by the caller, the mm_page_free_direct event | ||
57 | is triggered. Significant amounts of activity here could indicate that the | ||
58 | callers should be batching their activities. | ||
59 | |||
60 | When pages are freed using a pagevec, the mm_pagevec_free event is | ||
61 | triggered. Broadly speaking, pages are taken off the LRU list in bulk and | ||
62 | freed in batch with a pagevec. Significant amounts of activity here could | ||
63 | indicate that the system is under memory pressure and can also indicate | ||
64 | contention on the zone->lru_lock. | ||
65 | |||
66 | 4. Per-CPU Allocator Activity | ||
67 | ============================= | ||
68 | mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d | ||
69 | mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d | ||
70 | |||
71 | In front of the page allocator is a per-cpu page allocator. It exists only | ||
72 | for order-0 pages, reduces contention on the zone->lock and reduces the | ||
73 | amount of writing on struct page. | ||
74 | |||
75 | When a per-CPU list is empty or pages of the wrong type are allocated, | ||
76 | the zone->lock will be taken once and the per-CPU list refilled. The event | ||
77 | triggered is mm_page_alloc_zone_locked for each page allocated with the | ||
78 | event indicating whether it is for a percpu_refill or not. | ||
79 | |||
80 | When the per-CPU list is too full, a number of pages are freed, each of | ||
81 | which triggers a mm_page_pcpu_drain event. | ||
82 | |||
83 | The events are recorded individually so that pages can be tracked | ||
84 | between allocation and freeing. A number of drain or refill events that occur | ||
85 | consecutively imply the zone->lock being taken once. Large amounts of PCP | ||
86 | refills and drains could imply an imbalance between CPUs where too much work | ||
87 | is being concentrated in one place. It could also indicate that the per-CPU | ||
88 | lists should be a larger size. Finally, large amounts of refills on one CPU | ||
89 | and drains on another could be a factor in causing large amounts of cache | ||
90 | line bounces due to writes between CPUs; it is worth investigating whether | ||
91 | pages can be allocated and freed on the same CPU through some algorithm change. | ||
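
To look for the refill/drain imbalance described above, the events can be
counted per CPU. The following is a minimal sketch under stated assumptions:
pcp_activity_by_cpu is a hypothetical name, the input is the text output of
the trace file, and only mm_page_alloc_zone_locked events with
percpu_refill=1 are counted as refills.

```shell
# Hypothetical helper: count per-CPU list refills and drains for each CPU.
# Reads ftrace text output on stdin and prints "<cpu> <refill|drain> <count>".
pcp_activity_by_cpu() {
    awk '
    {
        kind = ""; cpu = ""; refill = ""
        for (i = 1; i <= NF; i++) {
            if ($i == "mm_page_alloc_zone_locked:") kind = "refill"
            else if ($i == "mm_page_pcpu_drain:") kind = "drain"
            else if ($i ~ /^cpu=/) cpu = substr($i, 5)
            else if ($i ~ /^percpu_refill=/) refill = substr($i, 15)
        }
        # Buddy allocations that are not per-CPU refills are ignored.
        if (kind == "refill" && refill != "1") next
        if (kind != "" && cpu != "") count[cpu " " kind]++
    }
    END {
        for (k in count) print k, count[k]
    }' | sort
}
```

Many refills on one CPU paired with many drains on another would be the
pattern worth a closer look.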
92 | |||
93 | 5. External Fragmentation | ||
94 | ========================= | ||
95 | mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d | ||
96 | |||
97 | External fragmentation affects whether a high-order allocation will be | ||
98 | successful or not. For some types of hardware, this is important although | ||
99 | it is avoided where possible. If the system is using huge pages and needs | ||
100 | to be able to resize the pool over the lifetime of the system, this value | ||
101 | is important. | ||
102 | |||
103 | Large numbers of this event imply that memory is fragmenting and | ||
104 | high-order allocations will start failing at some time in the future. One | ||
105 | means of reducing the occurrence of this event is to increase the size of | ||
106 | min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where | ||
107 | pageblock_size is usually the default hugepage size. | ||
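
The increment above is plain arithmetic and can be sketched as a shell
function. The name min_free_kbytes_step is hypothetical, and the caller is
assumed to supply the pageblock size in kB (usually the default hugepage
size) and the number of online nodes.

```shell
# Hypothetical helper: print one min_free_kbytes increment in kB.
# usage: min_free_kbytes_step PAGEBLOCK_KB NR_ONLINE_NODES
min_free_kbytes_step() {
    echo $(( 3 * $1 * $2 ))
}
```

For example, with 2048 kB pageblocks and 2 online nodes,
"min_free_kbytes_step 2048 2" prints 12288, so min_free_kbytes would be
raised in steps of 12288 kB.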
diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt index f157d7594ea7..02ac6ed38b2d 100644 --- a/Documentation/trace/events.txt +++ b/Documentation/trace/events.txt | |||
@@ -1,7 +1,7 @@ | |||
1 | Event Tracing | 1 | Event Tracing |
2 | 2 | ||
3 | Documentation written by Theodore Ts'o | 3 | Documentation written by Theodore Ts'o |
4 | Updated by Li Zefan | 4 | Updated by Li Zefan and Tom Zanussi |
5 | 5 | ||
6 | 1. Introduction | 6 | 1. Introduction |
7 | =============== | 7 | =============== |
@@ -22,12 +22,12 @@ tracing information should be printed. | |||
22 | --------------------------------- | 22 | --------------------------------- |
23 | 23 | ||
24 | The events which are available for tracing can be found in the file | 24 | The events which are available for tracing can be found in the file |
25 | /debug/tracing/available_events. | 25 | /sys/kernel/debug/tracing/available_events. |
26 | 26 | ||
27 | To enable a particular event, such as 'sched_wakeup', simply echo it | 27 | To enable a particular event, such as 'sched_wakeup', simply echo it |
28 | to /debug/tracing/set_event. For example: | 28 | to /sys/kernel/debug/tracing/set_event. For example: |
29 | 29 | ||
30 | # echo sched_wakeup >> /debug/tracing/set_event | 30 | # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event |
31 | 31 | ||
32 | [ Note: '>>' is necessary, otherwise it will firstly disable | 32 | [ Note: '>>' is necessary, otherwise it will firstly disable |
33 | all the events. ] | 33 | all the events. ] |
@@ -35,15 +35,15 @@ to /debug/tracing/set_event. For example: | |||
35 | To disable an event, echo the event name to the set_event file prefixed | 35 | To disable an event, echo the event name to the set_event file prefixed |
36 | with an exclamation point: | 36 | with an exclamation point: |
37 | 37 | ||
38 | # echo '!sched_wakeup' >> /debug/tracing/set_event | 38 | # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event |
39 | 39 | ||
40 | To disable all events, echo an empty line to the set_event file: | 40 | To disable all events, echo an empty line to the set_event file: |
41 | 41 | ||
42 | # echo > /debug/tracing/set_event | 42 | # echo > /sys/kernel/debug/tracing/set_event |
43 | 43 | ||
44 | To enable all events, echo '*:*' or '*:' to the set_event file: | 44 | To enable all events, echo '*:*' or '*:' to the set_event file: |
45 | 45 | ||
46 | # echo *:* > /debug/tracing/set_event | 46 | # echo *:* > /sys/kernel/debug/tracing/set_event |
47 | 47 | ||
48 | The events are organized into subsystems, such as ext4, irq, sched, | 48 | The events are organized into subsystems, such as ext4, irq, sched, |
49 | etc., and a full event name looks like this: <subsystem>:<event>. The | 49 | etc., and a full event name looks like this: <subsystem>:<event>. The |
@@ -52,29 +52,29 @@ file. All of the events in a subsystem can be specified via the syntax | |||
52 | "<subsystem>:*"; for example, to enable all irq events, you can use the | 52 | "<subsystem>:*"; for example, to enable all irq events, you can use the |
53 | command: | 53 | command: |
54 | 54 | ||
55 | # echo 'irq:*' > /debug/tracing/set_event | 55 | # echo 'irq:*' > /sys/kernel/debug/tracing/set_event |
56 | 56 | ||
57 | 2.2 Via the 'enable' toggle | 57 | 2.2 Via the 'enable' toggle |
58 | --------------------------- | 58 | --------------------------- |
59 | 59 | ||
60 | The events available are also listed in /debug/tracing/events/ hierarchy | 60 | The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy |
61 | of directories. | 61 | of directories. |
62 | 62 | ||
63 | To enable event 'sched_wakeup': | 63 | To enable event 'sched_wakeup': |
64 | 64 | ||
65 | # echo 1 > /debug/tracing/events/sched/sched_wakeup/enable | 65 | # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable |
66 | 66 | ||
67 | To disable it: | 67 | To disable it: |
68 | 68 | ||
69 | # echo 0 > /debug/tracing/events/sched/sched_wakeup/enable | 69 | # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable |
70 | 70 | ||
71 | To enable all events in sched subsystem: | 71 | To enable all events in sched subsystem: |
72 | 72 | ||
73 | # echo 1 > /debug/tracing/events/sched/enable | 73 | # echo 1 > /sys/kernel/debug/tracing/events/sched/enable |
74 | 74 | ||
75 | To eanble all events: | 75 | To enable all events: |
76 | 76 | ||
77 | # echo 1 > /debug/tracing/events/enable | 77 | # echo 1 > /sys/kernel/debug/tracing/events/enable |
78 | 78 | ||
79 | When reading one of these enable files, there are four results: | 79 | When reading one of these enable files, there are four results: |
80 | 80 | ||
@@ -83,8 +83,199 @@ When reading one of these enable files, there are four results: | |||
83 | X - there is a mixture of events enabled and disabled | 83 | X - there is a mixture of events enabled and disabled |
84 | ? - this file does not affect any event | 84 | ? - this file does not affect any event |
85 | 85 | ||
86 | 2.3 Boot option | ||
87 | --------------- | ||
88 | |||
89 | In order to facilitate early boot debugging, use boot option: | ||
90 | |||
91 | trace_event=[event-list] | ||
92 | |||
93 | The format of this boot option is the same as described in section 2.1. | ||
94 | |||
86 | 3. Defining an event-enabled tracepoint | 95 | 3. Defining an event-enabled tracepoint |
87 | ======================================= | 96 | ======================================= |
88 | 97 | ||
89 | See The example provided in samples/trace_events | 98 | See The example provided in samples/trace_events |
90 | 99 | ||
100 | 4. Event formats | ||
101 | ================ | ||
102 | |||
103 | Each trace event has a 'format' file associated with it that contains | ||
104 | a description of each field in a logged event. This information can | ||
105 | be used to parse the binary trace stream, and is also the place to | ||
106 | find the field names that can be used in event filters (see section 5). | ||
107 | |||
108 | It also displays the format string that will be used to print the | ||
109 | event in text mode, along with the event name and ID used for | ||
110 | profiling. | ||
111 | |||
112 | Every event has a set of 'common' fields associated with it; these are | ||
113 | the fields prefixed with 'common_'. The other fields vary between | ||
114 | events and correspond to the fields defined in the TRACE_EVENT | ||
115 | definition for that event. | ||
116 | |||
117 | Each field in the format has the form: | ||
118 | |||
119 | field:field-type field-name; offset:N; size:N; | ||
120 | |||
121 | where offset is the offset of the field in the trace record and size | ||
122 | is the size of the data item, in bytes. | ||
123 | |||
124 | For example, here's the information displayed for the 'sched_wakeup' | ||
125 | event: | ||
126 | |||
127 | # cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format | ||
128 | |||
129 | name: sched_wakeup | ||
130 | ID: 60 | ||
131 | format: | ||
132 | field:unsigned short common_type; offset:0; size:2; | ||
133 | field:unsigned char common_flags; offset:2; size:1; | ||
134 | field:unsigned char common_preempt_count; offset:3; size:1; | ||
135 | field:int common_pid; offset:4; size:4; | ||
136 | field:int common_tgid; offset:8; size:4; | ||
137 | |||
138 | field:char comm[TASK_COMM_LEN]; offset:12; size:16; | ||
139 | field:pid_t pid; offset:28; size:4; | ||
140 | field:int prio; offset:32; size:4; | ||
141 | field:int success; offset:36; size:4; | ||
142 | field:int cpu; offset:40; size:4; | ||
143 | |||
144 | print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid, | ||
145 | REC->prio, REC->success, REC->cpu | ||
146 | |||
147 | This event contains 10 fields, the first 5 common and the remaining 5 | ||
148 | event-specific. All the fields for this event are numeric, except for | ||
149 | 'comm' which is a string, a distinction important for event filtering. | ||
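
As a rough illustration of how a 'format' file can be consumed, the sketch
below reduces each field line to its name, offset and size. The function
name list_event_fields is hypothetical, and the parsing assumes the exact
layout shown in the sched_wakeup example above (the field name is taken to
be the last word of the declaration, with any array suffix removed).

```shell
# Hypothetical helper: print "<field-name> <offset> <size>" for every field.
# Reads the contents of an event's format file on stdin.
list_event_fields() {
    awk -F';' '
    /^[ \t]*field:/ {
        decl = $1; off = $2; size = $3
        sub(/^[ \t]*field:/, "", decl)
        # The field name is the last word of the C declaration.
        n = split(decl, words, " ")
        name = words[n]
        sub(/\[.*$/, "", name)          # drop an array suffix such as [16]
        sub(/^[ \t]*offset:/, "", off)
        sub(/^[ \t]*size:/, "", size)
        print name, off, size
    }'
}
```

The names printed in the first column are the ones that can be used in
filter expressions.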
150 | |||
151 | 5. Event filtering | ||
152 | ================== | ||
153 | |||
154 | Trace events can be filtered in the kernel by associating boolean | ||
155 | 'filter expressions' with them. As soon as an event is logged into | ||
156 | the trace buffer, its fields are checked against the filter expression | ||
157 | associated with that event type. An event with field values that | ||
158 | 'match' the filter will appear in the trace output, and an event whose | ||
159 | values don't match will be discarded. An event with no filter | ||
160 | associated with it matches everything, and is the default when no | ||
161 | filter has been set for an event. | ||
162 | |||
163 | 5.1 Expression syntax | ||
164 | --------------------- | ||
165 | |||
166 | A filter expression consists of one or more 'predicates' that can be | ||
167 | combined using the logical operators '&&' and '||'. A predicate is | ||
168 | simply a clause that compares the value of a field contained within a | ||
169 | logged event with a constant value and returns either 0 or 1 depending | ||
170 | on whether the field value matched (1) or didn't match (0): | ||
171 | |||
172 | field-name relational-operator value | ||
173 | |||
174 | Parentheses can be used to provide arbitrary logical groupings and | ||
175 | double-quotes can be used to prevent the shell from interpreting | ||
176 | operators as shell metacharacters. | ||
177 | |||
178 | The field-names available for use in filters can be found in the | ||
179 | 'format' files for trace events (see section 4). | ||
180 | |||
181 | The relational-operators depend on the type of the field being tested: | ||
182 | |||
183 | The operators available for numeric fields are: | ||
184 | |||
185 | ==, !=, <, <=, >, >= | ||
186 | |||
187 | And for string fields they are: | ||
188 | |||
189 | ==, != | ||
190 | |||
191 | Currently, only exact string matches are supported. | ||
192 | |||
193 | Currently, the maximum number of predicates in a filter is 16. | ||
194 | |||
195 | 5.2 Setting filters | ||
196 | ------------------- | ||
197 | |||
198 | A filter for an individual event is set by writing a filter expression | ||
199 | to the 'filter' file for the given event. | ||
200 | |||
201 | For example: | ||
202 | |||
203 | # cd /sys/kernel/debug/tracing/events/sched/sched_wakeup | ||
204 | # echo "common_preempt_count > 4" > filter | ||
205 | |||
206 | A slightly more involved example: | ||
207 | |||
208 | # cd /sys/kernel/debug/tracing/events/sched/sched_signal_send | ||
209 | # echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter | ||
210 | |||
211 | If there is an error in the expression, you'll get an 'Invalid | ||
212 | argument' error when setting it, and the erroneous string along with | ||
213 | an error message can be seen by looking at the filter e.g.: | ||
214 | |||
215 | # cd /sys/kernel/debug/tracing/events/sched/sched_signal_send | ||
216 | # echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter | ||
217 | -bash: echo: write error: Invalid argument | ||
218 | # cat filter | ||
219 | ((sig >= 10 && sig < 15) || dsig == 17) && comm != bash | ||
220 | ^ | ||
221 | parse_error: Field not found | ||
222 | |||
223 | Currently the caret ('^') for an error always appears at the beginning of | ||
224 | the filter string; the error message should still be useful though | ||
225 | even without more accurate position info. | ||
226 | |||
227 | 5.3 Clearing filters | ||
228 | -------------------- | ||
229 | |||
230 | To clear the filter for an event, write a '0' to the event's filter | ||
231 | file. | ||
232 | |||
233 | To clear the filters for all events in a subsystem, write a '0' to the | ||
234 | subsystem's filter file. | ||
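
The set, check and clear steps above can be wrapped in two small shell
functions. This is a sketch only: set_event_filter and clear_event_filter
are hypothetical names, and the first argument is assumed to be an event
(or subsystem) directory that contains a 'filter' file.

```shell
# Hypothetical helper: write a filter expression to <dir>/filter.
# On failure, show the parse error that the kernel left in the filter file.
# usage: set_event_filter EVENT_DIR 'EXPRESSION'
set_event_filter() {
    event_dir=$1
    filter_expr=$2
    if ! printf '%s\n' "$filter_expr" > "$event_dir/filter" 2>/dev/null; then
        cat "$event_dir/filter" >&2
        return 1
    fi
}

# Hypothetical helper: clear the filter by writing '0' to <dir>/filter.
# usage: clear_event_filter EVENT_DIR
clear_event_filter() {
    echo 0 > "$1/filter"
}
```

Passing the expression as a single quoted argument keeps the shell from
interpreting operators such as '&&', '>' and '<'.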
235 | |||
236 | 5.4 Subsystem filters | ||
237 | --------------------- | ||
238 | |||
239 | For convenience, filters for every event in a subsystem can be set or | ||
240 | cleared as a group by writing a filter expression into the filter file | ||
241 | at the root of the subsystem. Note however, that if any | ||
242 | event within the subsystem lacks a field specified in the subsystem | ||
243 | filter, or if the filter can't be applied for any other reason, the | ||
244 | filter for that event will retain its previous setting. This can | ||
245 | result in an unintended mixture of filters which could lead to | ||
246 | confusing (to the user who might think different filters are in | ||
247 | effect) trace output. Only filters that reference just the common | ||
248 | fields can be guaranteed to propagate successfully to all events. | ||
249 | |||
250 | Here are a few subsystem filter examples that also illustrate the | ||
251 | above points: | ||
252 | |||
253 | Clear the filters on all events in the sched subsystem: | ||
254 | |||
255 | # cd /sys/kernel/debug/tracing/events/sched | ||
256 | # echo 0 > filter | ||
257 | # cat sched_switch/filter | ||
258 | none | ||
259 | # cat sched_wakeup/filter | ||
260 | none | ||
261 | |||
262 | Set a filter using only common fields for all events in the sched | ||
263 | subsystem (all events end up with the same filter): | ||
264 | |||
265 | # cd /sys/kernel/debug/tracing/events/sched | ||
266 | # echo common_pid == 0 > filter | ||
267 | # cat sched_switch/filter | ||
268 | common_pid == 0 | ||
269 | # cat sched_wakeup/filter | ||
270 | common_pid == 0 | ||
271 | |||
272 | Attempt to set a filter using a non-common field for all events in the | ||
273 | sched subsystem (all events but those that have a prev_pid field retain | ||
274 | their old filters): | ||
275 | |||
276 | # cd /sys/kernel/debug/tracing/events/sched | ||
277 | # echo prev_pid == 0 > filter | ||
278 | # cat sched_switch/filter | ||
279 | prev_pid == 0 | ||
280 | # cat sched_wakeup/filter | ||
281 | common_pid == 0 | ||
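
Because a subsystem filter may apply to only some events, it can help to
list the filter of every event after setting one. The sketch below does
this; show_subsystem_filters is a hypothetical name, and the argument is
assumed to be a subsystem directory such as the sched directory used above.

```shell
# Hypothetical helper: print "<event>: <filter>" for each event directory
# under the given subsystem directory.
# usage: show_subsystem_filters SUBSYSTEM_DIR
show_subsystem_filters() {
    for f in "$1"/*/filter; do
        [ -f "$f" ] || continue
        event=$(basename "$(dirname "$f")")
        printf '%s: %s\n' "$event" "$(cat "$f")"
    done
}
```

After the last example above, the output would show 'prev_pid == 0' for
sched_switch and 'common_pid == 0' for sched_wakeup, which makes the
mixture of filters visible at a glance.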
diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt new file mode 100644 index 000000000000..7003e10f10f5 --- /dev/null +++ b/Documentation/trace/ftrace-design.txt | |||
@@ -0,0 +1,233 @@ | |||
1 | function tracer guts | ||
2 | ==================== | ||
3 | |||
4 | Introduction | ||
5 | ------------ | ||
6 | |||
7 | Here we will cover the architecture pieces that the common function tracing | ||
8 | code relies on for proper functioning. Things are broken down into increasing | ||
9 | complexity so that you can start simple and at least get basic functionality. | ||
10 | |||
11 | Note that this focuses on architecture implementation details only. If you | ||
12 | want more explanation of a feature in terms of common code, review the common | ||
13 | ftrace.txt file. | ||
14 | |||
15 | |||
16 | Prerequisites | ||
17 | ------------- | ||
18 | |||
19 | Ftrace relies on these features being implemented: | ||
20 | STACKTRACE_SUPPORT - implement save_stack_trace() | ||
21 | TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h | ||
22 | |||
23 | |||
24 | HAVE_FUNCTION_TRACER | ||
25 | -------------------- | ||
26 | |||
27 | You will need to implement the mcount and the ftrace_stub functions. | ||
28 | |||
29 | The exact mcount symbol name will depend on your toolchain. Some call it | ||
30 | "mcount", "_mcount", or even "__mcount". You can probably figure it out by | ||
31 | running something like: | ||
32 | $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount | ||
33 | call mcount | ||
34 | We'll make the assumption below that the symbol is "mcount" just to keep things | ||
35 | nice and simple in the examples. | ||
36 | |||
37 | Keep in mind that the ABI that is in effect inside of the mcount function is | ||
38 | *highly* architecture/toolchain specific. We cannot help you in this regard, | ||
39 | sorry. Dig up some old documentation and/or find someone more familiar than | ||
40 | you to bang ideas off of. Typically, register usage (argument/scratch/etc...) | ||
41 | is a major issue at this point, especially in relation to the location of the | ||
42 | mcount call (before/after function prologue). You might also want to look at | ||
43 | how glibc has implemented the mcount function for your architecture. It might | ||
44 | be (semi-)relevant. | ||
45 | |||
46 | The mcount function should check the function pointer ftrace_trace_function | ||
47 | to see if it is set to ftrace_stub. If it is, there is nothing for you to do, | ||
48 | so return immediately. If it isn't, then call that function in the same way | ||
49 | the mcount function normally calls __mcount_internal -- the first argument is | ||
50 | the "frompc" while the second argument is the "selfpc" (adjusted to remove the | ||
51 | size of the mcount call that is embedded in the function). | ||
52 | |||
53 | For example, if the function foo() calls bar(), when the bar() function calls | ||
54 | mcount(), the arguments mcount() will pass to the tracer are: | ||
55 | "frompc" - the address bar() will use to return to foo() | ||
56 | "selfpc" - the address of bar() (with mcount() size adjustment) | ||
57 | |||
58 | Also keep in mind that this mcount function will be called *a lot*, so | ||
59 | optimizing for the default case of no tracer will help the smooth running of | ||
60 | your system when tracing is disabled. So the start of the mcount function is | ||
61 | typically the bare min with checking things before returning. That also means | ||
62 | the code flow should usually be kept linear (i.e. no branching in the nop case). | ||
63 | This is of course an optimization and not a hard requirement. | ||
64 | |||
65 | Here is some pseudo code that should help (these functions should actually be | ||
66 | implemented in assembly): | ||
67 | |||
68 | void ftrace_stub(void) | ||
69 | { | ||
70 | return; | ||
71 | } | ||
72 | |||
73 | void mcount(void) | ||
74 | { | ||
75 | /* save any bare state needed in order to do initial checking */ | ||
76 | |||
77 | extern void (*ftrace_trace_function)(unsigned long, unsigned long); | ||
78 | if (ftrace_trace_function != ftrace_stub) | ||
79 | goto do_trace; | ||
80 | |||
81 | /* restore any bare state */ | ||
82 | |||
83 | return; | ||
84 | |||
85 | do_trace: | ||
86 | |||
87 | /* save all state needed by the ABI (see paragraph above) */ | ||
88 | |||
89 | unsigned long frompc = ...; | ||
90 | unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; | ||
91 | ftrace_trace_function(frompc, selfpc); | ||
92 | |||
93 | /* restore all state needed by the ABI */ | ||
94 | } | ||
95 | |||
96 | Don't forget to export mcount for modules! | ||
97 | extern void mcount(void); | ||
98 | EXPORT_SYMBOL(mcount); | ||
99 | |||
100 | |||
101 | HAVE_FUNCTION_TRACE_MCOUNT_TEST | ||
102 | ------------------------------- | ||
103 | |||
104 | This is an optional optimization for the normal case when tracing is turned off | ||
105 | in the system. If you do not enable this Kconfig option, the common ftrace | ||
106 | code will take care of doing the checking for you. | ||
107 | |||
108 | To support this feature, you only need to check the function_trace_stop | ||
109 | variable in the mcount function. If it is non-zero, there is no tracing to be | ||
110 | done at all, so you can return. | ||
111 | |||
112 | This additional pseudo code would simply be: | ||
113 | void mcount(void) | ||
114 | { | ||
115 | /* save any bare state needed in order to do initial checking */ | ||
116 | |||
117 | + if (function_trace_stop) | ||
118 | + return; | ||
119 | |||
120 | extern void (*ftrace_trace_function)(unsigned long, unsigned long); | ||
121 | if (ftrace_trace_function != ftrace_stub) | ||
122 | ... | ||
123 | |||
124 | |||
125 | HAVE_FUNCTION_GRAPH_TRACER | ||
126 | -------------------------- | ||
127 | |||
128 | Deep breath ... time to do some real work. Here you will need to update the | ||
129 | mcount function to check ftrace graph function pointers, as well as implement | ||
130 | some functions to save (hijack) and restore the return address. | ||
131 | |||
132 | The mcount function should check the function pointers ftrace_graph_return | ||
133 | (compare to ftrace_stub) and ftrace_graph_entry (compare to | ||
134 | ftrace_graph_entry_stub). If either of those is not set to the relevant stub | ||
135 | function, call the arch-specific function ftrace_graph_caller which in turn | ||
136 | calls the arch-specific function prepare_ftrace_return. Neither of these | ||
137 | function names is strictly required, but you should use them anyway to stay | ||
138 | consistent across the architecture ports -- easier to compare & contrast | ||
139 | things. | ||
140 | |||
141 | The arguments to prepare_ftrace_return are slightly different from what is | ||
142 | passed to ftrace_trace_function. The second argument "selfpc" is the same, | ||
143 | but the first argument should be a pointer to the "frompc". Typically this is | ||
144 | located on the stack. This allows the function to hijack the return address | ||
145 | temporarily to have it point to the arch-specific function return_to_handler. | ||
146 | That function will simply call the common ftrace_return_to_handler function, | ||
147 | which returns the original return address so that you can return to the | ||
148 | original call site. | ||
149 | |||
150 | Here is the updated mcount pseudo code: | ||
151 | void mcount(void) | ||
152 | { | ||
153 | ... | ||
154 | if (ftrace_trace_function != ftrace_stub) | ||
155 | goto do_trace; | ||
156 | |||
157 | +#ifdef CONFIG_FUNCTION_GRAPH_TRACER | ||
158 | + extern void (*ftrace_graph_return)(...); | ||
159 | + extern void (*ftrace_graph_entry)(...); | ||
160 | + if (ftrace_graph_return != ftrace_stub || | ||
161 | + ftrace_graph_entry != ftrace_graph_entry_stub) | ||
162 | + ftrace_graph_caller(); | ||
163 | +#endif | ||
164 | |||
165 | /* restore any bare state */ | ||
166 | ... | ||
167 | |||
168 | Here is the pseudo code for the new ftrace_graph_caller assembly function: | ||
169 | #ifdef CONFIG_FUNCTION_GRAPH_TRACER | ||
170 | void ftrace_graph_caller(void) | ||
171 | { | ||
172 | /* save all state needed by the ABI */ | ||
173 | |||
174 | unsigned long *frompc = &...; | ||
175 | unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; | ||
176 | prepare_ftrace_return(frompc, selfpc); | ||
177 | |||
178 | /* restore all state needed by the ABI */ | ||
179 | } | ||
180 | #endif | ||
181 | |||
182 | For information on how to implement prepare_ftrace_return(), simply look at | ||
183 | the x86 version. The only architecture-specific piece in it is the setup of | ||
184 | the fault recovery table (the asm(...) code). The rest should be the same | ||
185 | across architectures. | ||
186 | |||
187 | Here is the pseudo code for the new return_to_handler assembly function. Note | ||
188 | that the ABI that applies here is different from what applies to the mcount | ||
189 | code. Since you are returning from a function (after the epilogue), you might | ||
190 | be able to skimp on things saved/restored (usually just registers used to pass | ||
191 | return values). | ||
192 | |||
193 | #ifdef CONFIG_FUNCTION_GRAPH_TRACER | ||
194 | void return_to_handler(void) | ||
195 | { | ||
196 | /* save all state needed by the ABI (see paragraph above) */ | ||
197 | |||
198 | void (*original_return_point)(void) = ftrace_return_to_handler(); | ||
199 | |||
200 | /* restore all state needed by the ABI */ | ||
201 | |||
202 | /* this is usually either a return or a jump */ | ||
203 | original_return_point(); | ||
204 | } | ||
205 | #endif | ||
206 | |||
207 | |||
208 | HAVE_FTRACE_NMI_ENTER | ||
209 | --------------------- | ||
210 | |||
211 | If you can't trace NMI functions, then skip this option. | ||
212 | |||
213 | <details to be filled> | ||
214 | |||
215 | |||
216 | HAVE_FTRACE_SYSCALLS | ||
217 | -------------------- | ||
218 | |||
219 | <details to be filled> | ||
220 | |||
221 | |||
222 | HAVE_FTRACE_MCOUNT_RECORD | ||
223 | ------------------------- | ||
224 | |||
225 | See scripts/recordmcount.pl for more info. | ||
226 | |||
227 | <details to be filled> | ||
228 | |||
229 | |||
230 | HAVE_DYNAMIC_FTRACE | ||
231 | ------------------- | ||
232 | |||
233 | <details to be filled> | ||
diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index a39b3c749de5..957b22fde2df 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt | |||
@@ -26,6 +26,12 @@ disabled, and more (ftrace allows for tracer plugins, which | |||
26 | means that the list of tracers can always grow). | 26 | means that the list of tracers can always grow). |
27 | 27 | ||
28 | 28 | ||
29 | Implementation Details | ||
30 | ---------------------- | ||
31 | |||
32 | See ftrace-design.txt for details for arch porters and such. | ||
33 | |||
34 | |||
29 | The File System | 35 | The File System |
30 | --------------- | 36 | --------------- |
31 | 37 | ||
@@ -85,26 +91,19 @@ of ftrace. Here is a list of some of the key files: | |||
85 | This file holds the output of the trace in a human | 91 | This file holds the output of the trace in a human |
86 | readable format (described below). | 92 | readable format (described below). |
87 | 93 | ||
88 | latency_trace: | ||
89 | |||
90 | This file shows the same trace but the information | ||
91 | is organized more to display possible latencies | ||
92 | in the system (described below). | ||
93 | |||
94 | trace_pipe: | 94 | trace_pipe: |
95 | 95 | ||
96 | The output is the same as the "trace" file but this | 96 | The output is the same as the "trace" file but this |
97 | file is meant to be streamed with live tracing. | 97 | file is meant to be streamed with live tracing. |
98 | Reads from this file will block until new data | 98 | Reads from this file will block until new data is |
99 | is retrieved. Unlike the "trace" and "latency_trace" | 99 | retrieved. Unlike the "trace" file, this file is a |
100 | files, this file is a consumer. This means reading | 100 | consumer. This means reading from this file causes |
101 | from this file causes sequential reads to display | 101 | sequential reads to display more current data. Once |
102 | more current data. Once data is read from this | 102 | data is read from this file, it is consumed, and |
103 | file, it is consumed, and will not be read | 103 | will not be read again with a sequential read. The |
104 | again with a sequential read. The "trace" and | 104 | "trace" file is static, and if the tracer is not |
105 | "latency_trace" files are static, and if the | 105 | adding more data, it will display the same |
106 | tracer is not adding more data, they will display | 106 | information every time it is read. |
107 | the same information every time they are read. | ||
108 | 107 | ||
109 | trace_options: | 108 | trace_options: |
110 | 109 | ||
@@ -117,10 +116,10 @@ of ftrace. Here is a list of some of the key files: | |||
117 | Some of the tracers record the max latency. | 116 | Some of the tracers record the max latency. |
118 | For example, the time interrupts are disabled. | 117 | For example, the time interrupts are disabled. |
119 | This time is saved in this file. The max trace | 118 | This time is saved in this file. The max trace |
120 | will also be stored, and displayed by either | 119 | will also be stored, and displayed by "trace". |
121 | "trace" or "latency_trace". A new max trace will | 120 | A new max trace will only be recorded if the |
122 | only be recorded if the latency is greater than | 121 | latency is greater than the value in this |
123 | the value in this file. (in microseconds) | 122 | file. (in microseconds) |
124 | 123 | ||
125 | buffer_size_kb: | 124 | buffer_size_kb: |
126 | 125 | ||
@@ -134,7 +133,7 @@ of ftrace. Here is a list of some of the key files: | |||
134 | than requested, the rest of the page will be used, | 133 | than requested, the rest of the page will be used, |
135 | making the actual allocation bigger than requested. | 134 | making the actual allocation bigger than requested. |
136 | ( Note, the size may not be a multiple of the page size | 135 | ( Note, the size may not be a multiple of the page size |
137 | due to buffer managment overhead. ) | 136 | due to buffer management overhead. ) |
138 | 137 | ||
139 | This can only be updated when the current_tracer | 138 | This can only be updated when the current_tracer |
140 | is set to "nop". | 139 | is set to "nop". |
@@ -210,7 +209,7 @@ Here is the list of current tracers that may be configured. | |||
210 | the trace with the longest max latency. | 209 | the trace with the longest max latency. |
211 | See tracing_max_latency. When a new max is recorded, | 210 | See tracing_max_latency. When a new max is recorded, |
212 | it replaces the old trace. It is best to view this | 211 | it replaces the old trace. It is best to view this |
213 | trace via the latency_trace file. | 212 | trace with the latency-format option enabled. |
214 | 213 | ||
215 | "preemptoff" | 214 | "preemptoff" |
216 | 215 | ||
@@ -307,8 +306,8 @@ the lowest priority thread (pid 0). | |||
307 | Latency trace format | 306 | Latency trace format |
308 | -------------------- | 307 | -------------------- |
309 | 308 | ||
310 | For traces that display latency times, the latency_trace file | 309 | When the latency-format option is enabled, the trace file gives |
311 | gives somewhat more information to see why a latency happened. | 310 | somewhat more information to see why a latency happened. |
312 | Here is a typical trace. | 311 | Here is a typical trace. |
313 | 312 | ||
314 | # tracer: irqsoff | 313 | # tracer: irqsoff |
@@ -380,9 +379,10 @@ explains which is which. | |||
380 | 379 | ||
381 | The above is mostly meaningful for kernel developers. | 380 | The above is mostly meaningful for kernel developers. |
382 | 381 | ||
383 | time: This differs from the trace file output. The trace file output | 382 | time: When the latency-format option is enabled, the trace file |
384 | includes an absolute timestamp. The timestamp used by the | 383 | output includes a timestamp relative to the start of the |
385 | latency_trace file is relative to the start of the trace. | 384 | trace. This differs from the output when latency-format |
385 | is disabled, which includes an absolute timestamp. | ||
386 | 386 | ||
387 | delay: This is just to help catch your eye a bit better. And | 387 | delay: This is just to help catch your eye a bit better. And |
388 | needs to be fixed to be only relative to the same CPU. | 388 | needs to be fixed to be only relative to the same CPU. |
@@ -440,7 +440,8 @@ Here are the available options: | |||
440 | sym-addr: | 440 | sym-addr: |
441 | bash-4000 [01] 1477.606694: simple_strtoul <c0339346> | 441 | bash-4000 [01] 1477.606694: simple_strtoul <c0339346> |
442 | 442 | ||
443 | verbose - This deals with the latency_trace file. | 443 | verbose - This deals with the trace file when the |
444 | latency-format option is enabled. | ||
444 | 445 | ||
445 | bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ | 446 | bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ |
446 | (+0.000ms): simple_strtoul (strict_strtoul) | 447 | (+0.000ms): simple_strtoul (strict_strtoul) |
@@ -472,7 +473,7 @@ Here are the available options: | |||
472 | the app is no longer running | 473 | the app is no longer running |
473 | 474 | ||
474 | The lookup is performed when you read | 475 | The lookup is performed when you read |
475 | trace,trace_pipe,latency_trace. Example: | 476 | trace,trace_pipe. Example: |
476 | 477 | ||
477 | a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 | 478 | a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 |
478 | x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] | 479 | x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] |
@@ -481,6 +482,11 @@ x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] | |||
481 | every scheduling event. Will add overhead if | 482 | every scheduling event. Will add overhead if |
482 | there's a lot of tasks running at once. | 483 | there's a lot of tasks running at once. |
483 | 484 | ||
485 | latency-format - This option changes the trace. When | ||
486 | it is enabled, the trace displays | ||
487 | additional information about the | ||
488 | latencies, as described in "Latency | ||
489 | trace format". | ||
484 | 490 | ||
485 | sched_switch | 491 | sched_switch |
486 | ------------ | 492 | ------------ |
@@ -596,12 +602,13 @@ To reset the maximum, echo 0 into tracing_max_latency. Here is | |||
596 | an example: | 602 | an example: |
597 | 603 | ||
598 | # echo irqsoff > current_tracer | 604 | # echo irqsoff > current_tracer |
605 | # echo latency-format > trace_options | ||
599 | # echo 0 > tracing_max_latency | 606 | # echo 0 > tracing_max_latency |
600 | # echo 1 > tracing_enabled | 607 | # echo 1 > tracing_enabled |
601 | # ls -ltr | 608 | # ls -ltr |
602 | [...] | 609 | [...] |
603 | # echo 0 > tracing_enabled | 610 | # echo 0 > tracing_enabled |
604 | # cat latency_trace | 611 | # cat trace |
605 | # tracer: irqsoff | 612 | # tracer: irqsoff |
606 | # | 613 | # |
607 | irqsoff latency trace v1.1.5 on 2.6.26 | 614 | irqsoff latency trace v1.1.5 on 2.6.26 |
@@ -703,12 +710,13 @@ which preemption was disabled. The control of preemptoff tracer | |||
703 | is much like the irqsoff tracer. | 710 | is much like the irqsoff tracer. |
704 | 711 | ||
705 | # echo preemptoff > current_tracer | 712 | # echo preemptoff > current_tracer |
713 | # echo latency-format > trace_options | ||
706 | # echo 0 > tracing_max_latency | 714 | # echo 0 > tracing_max_latency |
707 | # echo 1 > tracing_enabled | 715 | # echo 1 > tracing_enabled |
708 | # ls -ltr | 716 | # ls -ltr |
709 | [...] | 717 | [...] |
710 | # echo 0 > tracing_enabled | 718 | # echo 0 > tracing_enabled |
711 | # cat latency_trace | 719 | # cat trace |
712 | # tracer: preemptoff | 720 | # tracer: preemptoff |
713 | # | 721 | # |
714 | preemptoff latency trace v1.1.5 on 2.6.26-rc8 | 722 | preemptoff latency trace v1.1.5 on 2.6.26-rc8 |
@@ -850,12 +858,13 @@ Again, using this trace is much like the irqsoff and preemptoff | |||
850 | tracers. | 858 | tracers. |
851 | 859 | ||
852 | # echo preemptirqsoff > current_tracer | 860 | # echo preemptirqsoff > current_tracer |
861 | # echo latency-format > trace_options | ||
853 | # echo 0 > tracing_max_latency | 862 | # echo 0 > tracing_max_latency |
854 | # echo 1 > tracing_enabled | 863 | # echo 1 > tracing_enabled |
855 | # ls -ltr | 864 | # ls -ltr |
856 | [...] | 865 | [...] |
857 | # echo 0 > tracing_enabled | 866 | # echo 0 > tracing_enabled |
858 | # cat latency_trace | 867 | # cat trace |
859 | # tracer: preemptirqsoff | 868 | # tracer: preemptirqsoff |
860 | # | 869 | # |
861 | preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 | 870 | preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 |
@@ -1012,11 +1021,12 @@ Instead of performing an 'ls', we will run 'sleep 1' under | |||
1012 | 'chrt' which changes the priority of the task. | 1021 | 'chrt' which changes the priority of the task. |
1013 | 1022 | ||
1014 | # echo wakeup > current_tracer | 1023 | # echo wakeup > current_tracer |
1024 | # echo latency-format > trace_options | ||
1015 | # echo 0 > tracing_max_latency | 1025 | # echo 0 > tracing_max_latency |
1016 | # echo 1 > tracing_enabled | 1026 | # echo 1 > tracing_enabled |
1017 | # chrt -f 5 sleep 1 | 1027 | # chrt -f 5 sleep 1 |
1018 | # echo 0 > tracing_enabled | 1028 | # echo 0 > tracing_enabled |
1019 | # cat latency_trace | 1029 | # cat trace |
1020 | # tracer: wakeup | 1030 | # tracer: wakeup |
1021 | # | 1031 | # |
1022 | wakeup latency trace v1.1.5 on 2.6.26-rc8 | 1032 | wakeup latency trace v1.1.5 on 2.6.26-rc8 |
diff --git a/Documentation/trace/function-graph-fold.vim b/Documentation/trace/function-graph-fold.vim new file mode 100644 index 000000000000..0544b504c8b0 --- /dev/null +++ b/Documentation/trace/function-graph-fold.vim | |||
@@ -0,0 +1,42 @@ | |||
1 | " Enable folding for ftrace function_graph traces. | ||
2 | " | ||
3 | " To use, :source this file while viewing a function_graph trace, or use vim's | ||
4 | " -S option to load from the command-line together with a trace. You can then | ||
5 | " use the usual vim fold commands, such as "za", to open and close nested | ||
6 | " functions. While closed, a fold will show the total time taken for a call, | ||
7 | " as would normally appear on the line with the closing brace. Folded | ||
8 | " functions will not include finish_task_switch(), so folding should remain | ||
9 | " relatively sane even through a context switch. | ||
10 | " | ||
11 | " Note that this will almost certainly only work well with a | ||
12 | " single-CPU trace (e.g. trace-cmd report --cpu 1). | ||
13 | |||
14 | function! FunctionGraphFoldExpr(lnum) | ||
15 | let line = getline(a:lnum) | ||
16 | if line[-1:] == '{' | ||
17 | if line =~ 'finish_task_switch() {$' | ||
18 | return '>1' | ||
19 | endif | ||
20 | return 'a1' | ||
21 | elseif line[-1:] == '}' | ||
22 | return 's1' | ||
23 | else | ||
24 | return '=' | ||
25 | endif | ||
26 | endfunction | ||
27 | |||
28 | function! FunctionGraphFoldText() | ||
29 | let s = split(getline(v:foldstart), '|', 1) | ||
30 | if getline(v:foldend+1) =~ 'finish_task_switch() {$' | ||
31 | let s[2] = ' task switch ' | ||
32 | else | ||
33 | let e = split(getline(v:foldend), '|', 1) | ||
34 | let s[2] = e[2] | ||
35 | endif | ||
36 | return join(s, '|') | ||
37 | endfunction | ||
38 | |||
39 | setlocal foldexpr=FunctionGraphFoldExpr(v:lnum) | ||
40 | setlocal foldtext=FunctionGraphFoldText() | ||
41 | setlocal foldcolumn=12 | ||
42 | setlocal foldmethod=expr | ||
diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl new file mode 100644 index 000000000000..7df50e8cf4d9 --- /dev/null +++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl | |||
@@ -0,0 +1,418 @@ | |||
1 | #!/usr/bin/perl | ||
2 | # This is a POC (proof of concept or piece of crap, take your pick) for reading the | ||
3 | # text representation of trace output related to page allocation. It makes an attempt | ||
4 | # to extract some high-level information on what is going on. The accuracy of the parser | ||
5 | # may vary considerably | ||
6 | # | ||
7 | # Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe | ||
8 | # other options | ||
9 | # --prepend-parent Report on the parent proc and PID | ||
10 | # --read-procstat If the trace lacks process info, get it from /proc | ||
11 | # --ignore-pid Aggregate processes of the same name together | ||
12 | # | ||
13 | # Copyright (c) IBM Corporation 2009 | ||
14 | # Author: Mel Gorman <mel@csn.ul.ie> | ||
15 | use strict; | ||
16 | use Getopt::Long; | ||
17 | |||
18 | # Tracepoint events | ||
19 | use constant MM_PAGE_ALLOC => 1; | ||
20 | use constant MM_PAGE_FREE_DIRECT => 2; | ||
21 | use constant MM_PAGEVEC_FREE => 3; | ||
22 | use constant MM_PAGE_PCPU_DRAIN => 4; | ||
23 | use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5; | ||
24 | use constant MM_PAGE_ALLOC_EXTFRAG => 6; | ||
25 | use constant EVENT_UNKNOWN => 7; | ||
26 | |||
27 | # Constants used to track state | ||
28 | use constant STATE_PCPU_PAGES_DRAINED => 8; | ||
29 | use constant STATE_PCPU_PAGES_REFILLED => 9; | ||
30 | |||
31 | # High-level events extrapolated from tracepoints | ||
32 | use constant HIGH_PCPU_DRAINS => 10; | ||
33 | use constant HIGH_PCPU_REFILLS => 11; | ||
34 | use constant HIGH_EXT_FRAGMENT => 12; | ||
35 | use constant HIGH_EXT_FRAGMENT_SEVERE => 13; | ||
36 | use constant HIGH_EXT_FRAGMENT_MODERATE => 14; | ||
37 | use constant HIGH_EXT_FRAGMENT_CHANGED => 15; | ||
38 | |||
39 | my %perprocesspid; | ||
40 | my %perprocess; | ||
41 | my $opt_ignorepid; | ||
42 | my $opt_read_procstat; | ||
43 | my $opt_prepend_parent; | ||
44 | |||
45 | # Catch sigint and exit on request | ||
46 | my $sigint_report = 0; | ||
47 | my $sigint_exit = 0; | ||
48 | my $sigint_pending = 0; | ||
49 | my $sigint_received = 0; | ||
50 | sub sigint_handler { | ||
51 | my $current_time = time; | ||
52 | if ($current_time - 2 > $sigint_received) { | ||
53 | print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; | ||
54 | $sigint_report = 1; | ||
55 | } else { | ||
56 | if (!$sigint_exit) { | ||
57 | print "Second SIGINT received quickly, exiting\n"; | ||
58 | } | ||
59 | $sigint_exit++; | ||
60 | } | ||
61 | |||
62 | if ($sigint_exit > 3) { | ||
63 | print "Many SIGINTs received, exiting now without report\n"; | ||
64 | exit; | ||
65 | } | ||
66 | |||
67 | $sigint_received = $current_time; | ||
68 | $sigint_pending = 1; | ||
69 | } | ||
70 | $SIG{INT} = "sigint_handler"; | ||
71 | |||
72 | # Parse command line options | ||
73 | GetOptions( | ||
74 | 'ignore-pid' => \$opt_ignorepid, | ||
75 | 'read-procstat' => \$opt_read_procstat, | ||
76 | 'prepend-parent' => \$opt_prepend_parent, | ||
77 | ); | ||
78 | |||
79 | # Defaults for dynamically discovered regex's | ||
80 | my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])'; | ||
81 | |||
82 | # Dynamically discovered regex | ||
83 | my $regex_fragdetails; | ||
84 | |||
85 | # Static regex used. Specified like this for readability and for use with /o | ||
86 | # (process_pid) (cpus ) ( time ) (tpoint ) (details) | ||
87 | my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; | ||
88 | my $regex_statname = '[-0-9]*\s\((.*)\).*'; | ||
89 | my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; | ||
90 | |||
91 | sub generate_traceevent_regex { | ||
92 | my $event = shift; | ||
93 | my $default = shift; | ||
94 | my $regex; | ||
95 | |||
96 | # Read the event format or use the default | ||
97 | if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { | ||
98 | $regex = $default; | ||
99 | } else { | ||
100 | my $line; | ||
101 | while (!eof(FORMAT)) { | ||
102 | $line = <FORMAT>; | ||
103 | if ($line =~ /^print fmt:\s"(.*)",.*/) { | ||
104 | $regex = $1; | ||
105 | $regex =~ s/%p/\([0-9a-f]*\)/g; | ||
106 | $regex =~ s/%d/\([-0-9]*\)/g; | ||
107 | $regex =~ s/%lu/\([0-9]*\)/g; | ||
108 | } | ||
109 | } | ||
110 | } | ||
111 | |||
112 | # Verify fields are in the right order | ||
113 | my $tuple; | ||
114 | foreach $tuple (split /\s/, $regex) { | ||
115 | my ($key, $value) = split(/=/, $tuple); | ||
116 | my $expected = shift; | ||
117 | if ($key ne $expected) { | ||
118 | print("WARNING: Format not as expected '$key' != '$expected'\n"); | ||
119 | $regex =~ s/$key=\((.*)\)/$key=$1/; | ||
120 | } | ||
121 | } | ||
122 | |||
123 | if (defined shift) { | ||
124 | die("Fewer fields than expected in format"); | ||
125 | } | ||
126 | |||
127 | return $regex; | ||
128 | } | ||
129 | $regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag", | ||
130 | $regex_fragdetails_default, | ||
131 | "page", "pfn", | ||
132 | "alloc_order", "fallback_order", "pageblock_order", | ||
133 | "alloc_migratetype", "fallback_migratetype", | ||
134 | "fragmenting", "change_ownership"); | ||
135 | |||
136 | sub read_statline($) { | ||
137 | my $pid = $_[0]; | ||
138 | my $statline; | ||
139 | |||
140 | if (open(STAT, "/proc/$pid/stat")) { | ||
141 | $statline = <STAT>; | ||
142 | close(STAT); | ||
143 | } | ||
144 | |||
145 | if ($statline eq '') { | ||
146 | $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; | ||
147 | } | ||
148 | |||
149 | return $statline; | ||
150 | } | ||
151 | |||
152 | sub guess_process_pid($$) { | ||
153 | my $pid = $_[0]; | ||
154 | my $statline = $_[1]; | ||
155 | |||
156 | if ($pid == 0) { | ||
157 | return "swapper-0"; | ||
158 | } | ||
159 | |||
160 | if ($statline !~ /$regex_statname/o) { | ||
161 | die("Failed to match stat line for process name :: $statline"); | ||
162 | } | ||
163 | return "$1-$pid"; | ||
164 | } | ||
165 | |||
166 | sub parent_info($$) { | ||
167 | my $pid = $_[0]; | ||
168 | my $statline = $_[1]; | ||
169 | my $ppid; | ||
170 | |||
171 | if ($pid == 0) { | ||
172 | return "NOPARENT-0"; | ||
173 | } | ||
174 | |||
175 | if ($statline !~ /$regex_statppid/o) { | ||
176 | die("Failed to match stat line process ppid:: $statline"); | ||
177 | } | ||
178 | |||
179 | # Read the ppid stat line | ||
180 | $ppid = $1; | ||
181 | return guess_process_pid($ppid, read_statline($ppid)); | ||
182 | } | ||
183 | |||
184 | sub process_events { | ||
185 | my $traceevent; | ||
186 | my $process_pid; | ||
187 | my $cpus; | ||
188 | my $timestamp; | ||
189 | my $tracepoint; | ||
190 | my $details; | ||
191 | my $statline; | ||
192 | |||
193 | # Read each line of the event log | ||
194 | EVENT_PROCESS: | ||
195 | while ($traceevent = <STDIN>) { | ||
196 | if ($traceevent =~ /$regex_traceevent/o) { | ||
197 | $process_pid = $1; | ||
198 | $tracepoint = $4; | ||
199 | |||
200 | if ($opt_read_procstat || $opt_prepend_parent) { | ||
201 | $process_pid =~ /(.*)-([0-9]*)$/; | ||
202 | my $process = $1; | ||
203 | my $pid = $2; | ||
204 | |||
205 | $statline = read_statline($pid); | ||
206 | |||
207 | if ($opt_read_procstat && $process eq '') { | ||
208 | $process_pid = guess_process_pid($pid, $statline); | ||
209 | } | ||
210 | |||
211 | if ($opt_prepend_parent) { | ||
212 | $process_pid = parent_info($pid, $statline) . " :: $process_pid"; | ||
213 | } | ||
214 | } | ||
215 | |||
216 | # Unnecessary in this script. Uncomment if required | ||
217 | # $cpus = $2; | ||
218 | # $timestamp = $3; | ||
219 | } else { | ||
220 | next; | ||
221 | } | ||
222 | |||
223 | # Perl Switch() sucks majorly | ||
224 | if ($tracepoint eq "mm_page_alloc") { | ||
225 | $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++; | ||
226 | } elsif ($tracepoint eq "mm_page_free_direct") { | ||
227 | $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}++; | ||
228 | } elsif ($tracepoint eq "mm_pagevec_free") { | ||
229 | $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}++; | ||
230 | } elsif ($tracepoint eq "mm_page_pcpu_drain") { | ||
231 | $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++; | ||
232 | $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++; | ||
233 | } elsif ($tracepoint eq "mm_page_alloc_zone_locked") { | ||
234 | $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++; | ||
235 | $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++; | ||
236 | } elsif ($tracepoint eq "mm_page_alloc_extfrag") { | ||
237 | |||
238 | # Extract the details of the event now | ||
239 | $details = $5; | ||
240 | |||
241 | my ($page, $pfn); | ||
242 | my ($alloc_order, $fallback_order, $pageblock_order); | ||
243 | my ($alloc_migratetype, $fallback_migratetype); | ||
244 | my ($fragmenting, $change_ownership); | ||
245 | |||
246 | if ($details !~ /$regex_fragdetails/o) { | ||
247 | print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n"; | ||
248 | next; | ||
249 | } | ||
250 | |||
251 | $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++; | ||
252 | $page = $1; | ||
253 | $pfn = $2; | ||
254 | $alloc_order = $3; | ||
255 | $fallback_order = $4; | ||
256 | $pageblock_order = $5; | ||
257 | $alloc_migratetype = $6; | ||
258 | $fallback_migratetype = $7; | ||
259 | $fragmenting = $8; | ||
260 | $change_ownership = $9; | ||
261 | |||
262 | if ($fragmenting) { | ||
263 | $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++; | ||
264 | if ($fallback_order <= 3) { | ||
265 | $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++; | ||
266 | } else { | ||
267 | $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++; | ||
268 | } | ||
269 | } | ||
270 | if ($change_ownership) { | ||
271 | $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++; | ||
272 | } | ||
273 | } else { | ||
274 | $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; | ||
275 | } | ||
276 | |||
277 | # Catch a full pcpu drain event | ||
278 | if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} && | ||
279 | $tracepoint ne "mm_page_pcpu_drain") { | ||
280 | |||
281 | $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++; | ||
282 | $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; | ||
283 | } | ||
284 | |||
285 | # Catch a full pcpu refill event | ||
286 | if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} && | ||
287 | $tracepoint ne "mm_page_alloc_zone_locked") { | ||
288 | $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++; | ||
289 | $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; | ||
290 | } | ||
291 | |||
292 | if ($sigint_pending) { | ||
293 | last EVENT_PROCESS; | ||
294 | } | ||
295 | } | ||
296 | } | ||
297 | |||
298 | sub dump_stats { | ||
299 | my $hashref = shift; | ||
300 | my %stats = %$hashref; | ||
301 | |||
302 | # Dump per-process stats | ||
303 | my $process_pid; | ||
304 | my $max_strlen = 0; | ||
305 | |||
306 | # Get the maximum process name | ||
307 | foreach $process_pid (keys %perprocesspid) { | ||
308 | my $len = length($process_pid); | ||
309 | if ($len > $max_strlen) { | ||
310 | $max_strlen = $len; | ||
311 | } | ||
312 | } | ||
313 | $max_strlen += 2; | ||
314 | |||
315 | printf("\n"); | ||
316 | printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", | ||
317 | "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown"); | ||
318 | printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", | ||
319 | "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", ""); | ||
320 | |||
321 | printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", | ||
322 | "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", ""); | ||
323 | |||
324 | foreach $process_pid (keys %stats) { | ||
325 | # Dump final aggregates | ||
326 | if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) { | ||
327 | $stats{$process_pid}->{HIGH_PCPU_DRAINS}++; | ||
328 | $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; | ||
329 | } | ||
330 | if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) { | ||
331 | $stats{$process_pid}->{HIGH_PCPU_REFILLS}++; | ||
332 | $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; | ||
333 | } | ||
334 | |||
335 | printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n", | ||
336 | $process_pid, | ||
337 | $stats{$process_pid}->{MM_PAGE_ALLOC}, | ||
338 | $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}, | ||
339 | $stats{$process_pid}->{MM_PAGE_FREE_DIRECT}, | ||
340 | $stats{$process_pid}->{MM_PAGEVEC_FREE}, | ||
341 | $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN}, | ||
342 | $stats{$process_pid}->{HIGH_PCPU_DRAINS}, | ||
343 | $stats{$process_pid}->{HIGH_PCPU_REFILLS}, | ||
344 | $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}, | ||
345 | $stats{$process_pid}->{HIGH_EXT_FRAG}, | ||
346 | $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}, | ||
347 | $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}, | ||
348 | $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}, | ||
349 | $stats{$process_pid}->{EVENT_UNKNOWN}); | ||
350 | } | ||
351 | } | ||
352 | |||
353 | sub aggregate_perprocesspid() { | ||
354 | my $process_pid; | ||
355 | my $process; | ||
356 | undef %perprocess; | ||
357 | |||
358 | foreach $process_pid (keys %perprocesspid) { | ||
359 | $process = $process_pid; | ||
360 | $process =~ s/-([0-9])*$//; | ||
361 | if ($process eq '') { | ||
362 | $process = "NO_PROCESS_NAME"; | ||
363 | } | ||
364 | |||
365 | $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}; | ||
366 | $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}; | ||
367 | $perprocess{$process}->{MM_PAGE_FREE_DIRECT} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_DIRECT}; | ||
368 | $perprocess{$process}->{MM_PAGEVEC_FREE} += $perprocesspid{$process_pid}->{MM_PAGEVEC_FREE}; | ||
369 | $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}; | ||
370 | $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}; | ||
371 | $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}; | ||
372 | $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}; | ||
373 | $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}; | ||
374 | $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}; | ||
375 | $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}; | ||
376 | $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}; | ||
377 | $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN}; | ||
378 | } | ||
379 | } | ||
380 | |||
381 | sub report() { | ||
382 | if (!$opt_ignorepid) { | ||
383 | dump_stats(\%perprocesspid); | ||
384 | } else { | ||
385 | aggregate_perprocesspid(); | ||
386 | dump_stats(\%perprocess); | ||
387 | } | ||
388 | } | ||
389 | |||
390 | # Process events or signals until neither is available | ||
391 | sub signal_loop() { | ||
392 | my $sigint_processed; | ||
393 | do { | ||
394 | $sigint_processed = 0; | ||
395 | process_events(); | ||
396 | |||
397 | # Handle pending signals if any | ||
398 | if ($sigint_pending) { | ||
399 | my $current_time = time; | ||
400 | |||
401 | if ($sigint_exit) { | ||
402 | print "Received exit signal\n"; | ||
403 | $sigint_pending = 0; | ||
404 | } | ||
405 | if ($sigint_report) { | ||
406 | if ($current_time >= $sigint_received + 2) { | ||
407 | report(); | ||
408 | $sigint_report = 0; | ||
409 | $sigint_pending = 0; | ||
410 | $sigint_processed = 1; | ||
411 | } | ||
412 | } | ||
413 | } | ||
414 | } while ($sigint_pending || $sigint_processed); | ||
415 | } | ||
416 | |||
417 | signal_loop(); | ||
418 | report(); | ||
diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt deleted file mode 100644 index cd805e16dc27..000000000000 --- a/Documentation/trace/power.txt +++ /dev/null | |||
@@ -1,17 +0,0 @@ | |||
1 | The power tracer collects detailed information about C-state and P-state | ||
2 | transitions, instead of just looking at the high-level "average" | ||
3 | information. | ||
4 | |||
5 | There is a helper script found in scripts/tracing/power.pl in the kernel | ||
6 | sources which can be used to parse this information and create a | ||
7 | Scalable Vector Graphics (SVG) picture from the trace data. | ||
8 | |||
9 | To use this tracer: | ||
10 | |||
11 | echo 0 > /sys/kernel/debug/tracing/tracing_enabled | ||
12 | echo power > /sys/kernel/debug/tracing/current_tracer | ||
13 | echo 1 > /sys/kernel/debug/tracing/tracing_enabled | ||
14 | sleep 1 | ||
15 | echo 0 > /sys/kernel/debug/tracing/tracing_enabled | ||
16 | cat /sys/kernel/debug/tracing/trace | \ | ||
17 | perl scripts/tracing/power.pl > out.sv | ||
diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt new file mode 100644 index 000000000000..5b1d23d604c5 --- /dev/null +++ b/Documentation/trace/ring-buffer-design.txt | |||
@@ -0,0 +1,955 @@ | |||
1 | Lockless Ring Buffer Design | ||
2 | =========================== | ||
3 | |||
4 | Copyright 2009 Red Hat Inc. | ||
5 | Author: Steven Rostedt <srostedt@redhat.com> | ||
6 | License: The GNU Free Documentation License, Version 1.2 | ||
7 | (dual licensed under the GPL v2) | ||
8 | Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, | ||
9 | and Frederic Weisbecker. | ||
10 | |||
11 | |||
12 | Written for: 2.6.31 | ||
13 | |||
14 | Terminology used in this Document | ||
15 | --------------------------------- | ||
16 | |||
17 | tail - where new writes happen in the ring buffer. | ||
18 | |||
19 | head - where new reads happen in the ring buffer. | ||
20 | |||
21 | producer - the task that writes into the ring buffer (same as writer) | ||
22 | |||
23 | writer - same as producer | ||
24 | |||
25 | consumer - the task that reads from the buffer (same as reader) | ||
26 | |||
27 | reader - same as consumer. | ||
28 | |||
29 | reader_page - A page outside the ring buffer used solely (for the most part) | ||
30 | by the reader. | ||
31 | |||
32 | head_page - a pointer to the page that the reader will use next | ||
33 | |||
34 | tail_page - a pointer to the page that will be written to next | ||
35 | |||
36 | commit_page - a pointer to the page with the last finished non-nested write. | ||
37 | |||
38 | cmpxchg - hardware assisted atomic transaction that performs the following: | ||
39 | |||
40 | A = B iff previous A == C | ||
41 | |||
42 | R = cmpxchg(A, C, B) is saying that we replace A with B if and only if | ||
43 | current A is equal to C, and we put the old (current) A into R | ||
44 | |||
45 | R gets the previous A regardless of whether A is updated with B or not. | ||
46 | |||
47 | To see if the update was successful a compare of R == C may be used. | ||
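
The cmpxchg semantics described above can be sketched in user-space C11.
This is only an illustration: cmpxchg_sketch() is a hypothetical name, and
the C11 atomic_compare_exchange_strong() call stands in for the hardware
assisted operation the text describes.

```c
#include <stdatomic.h>
#include <stdint.h>

/*
 * Sketch of R = cmpxchg(A, C, B): store B into *a only if *a == c,
 * and return the value *a held before the operation in either case.
 * cmpxchg_sketch() is a hypothetical helper, not the kernel macro.
 */
static uintptr_t cmpxchg_sketch(_Atomic uintptr_t *a, uintptr_t c, uintptr_t b)
{
	uintptr_t expected = c;

	/* On failure, expected is overwritten with the current value of *a. */
	atomic_compare_exchange_strong(a, &expected, b);

	/* On success expected is still c, which was the previous A. */
	return expected;
}
```

A caller tests for success by comparing the return value with C, exactly
as the text says: if (cmpxchg_sketch(&a, c, b) == c) the update happened.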
48 | |||
49 | The Generic Ring Buffer | ||
50 | ----------------------- | ||
51 | |||
52 | The ring buffer can be used in either an overwrite mode or in | ||
53 | producer/consumer mode. | ||
54 | |||
55 | Producer/consumer mode is where, if the producer were to fill up the | ||
56 | buffer before the consumer could free up anything, the producer | ||
57 | will stop writing to the buffer. This will lose the most recent events. | ||
58 | |||
59 | Overwrite mode is where, if the producer were to fill up the buffer | ||
60 | before the consumer could free up anything, the producer will | ||
61 | overwrite the older data. This will lose the oldest events. | ||
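
The difference between the two modes can be shown with a toy,
single-threaded ring of integers. This is a sketch only: it is not the
page based design described in this document, and struct toy_ring,
toy_write() and toy_read() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 4

/* Toy model only: one int per slot, no pages, no concurrency. */
struct toy_ring {
	int slot[RING_SLOTS];
	size_t head;		/* index of the oldest unread entry */
	size_t count;		/* number of unread entries */
	bool overwrite;		/* true: overwrite mode, false: producer/consumer */
};

static bool toy_write(struct toy_ring *r, int value)
{
	if (r->count == RING_SLOTS) {
		if (!r->overwrite)
			return false;	/* producer/consumer: the newest event is lost */
		/* overwrite: drop the oldest entry to make room */
		r->head = (r->head + 1) % RING_SLOTS;
		r->count--;
	}
	r->slot[(r->head + r->count) % RING_SLOTS] = value;
	r->count++;
	return true;
}

static bool toy_read(struct toy_ring *r, int *value)
{
	if (r->count == 0)
		return false;		/* nothing to consume */
	*value = r->slot[r->head];
	r->head = (r->head + 1) % RING_SLOTS;
	r->count--;
	return true;
}
```

Writing five values into a four slot ring fails on the fifth write in
producer/consumer mode, while in overwrite mode the fifth write succeeds
and the first value is the one that is gone.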
62 | |||
63 | No two writers can write at the same time (on the same per cpu buffer), | ||
64 | but a writer may interrupt another writer. The interrupting writer must | ||
65 | finish writing before the previous writer may continue. This is very | ||
66 | important to the algorithm. The writers act like a "stack". The way | ||
67 | interrupts work enforces this behavior. | ||
68 | |||
69 | |||
70 | writer1 start | ||
71 | <preempted> writer2 start | ||
72 | <preempted> writer3 start | ||
73 | writer3 finishes | ||
74 | writer2 finishes | ||
75 | writer1 finishes | ||
76 | |||
77 | This is very much like a writer being preempted by an interrupt and | ||
78 | the interrupt doing a write as well. | ||
79 | |||
80 | Readers can happen at any time. But no two readers may run at the | ||
81 | same time, nor can a reader preempt/interrupt another reader. A reader | ||
82 | can not preempt/interrupt a writer, but it may read/consume from the | ||
83 | buffer at the same time as a writer is writing, but the reader must be | ||
84 | on another processor to do so. A reader may read on its own processor | ||
85 | and can be preempted by a writer. | ||
86 | |||
87 | A writer can preempt a reader, but a reader can not preempt a writer. | ||
88 | But a reader can read the buffer at the same time (on another processor) | ||
89 | as a writer. | ||
90 | |||
91 | The ring buffer is made up of a list of pages held together by a linked list. | ||
92 | |||
93 | At initialization a reader page is allocated for the reader that is not | ||
94 | part of the ring buffer. | ||
95 | |||
96 | The head_page, tail_page and commit_page are all initialized to point | ||
97 | to the same page. | ||
98 | |||
99 | The reader page is initialized to have its next pointer pointing to | ||
100 | the head page, and its previous pointer pointing to a page before | ||
101 | the head page. | ||
102 | |||
103 | The reader has its own page to use. At start up time, this page is | ||
104 | allocated but is not attached to the list. When the reader wants | ||
105 | to read from the buffer, if its page is empty (like it is on start up) | ||
106 | it will swap its page with the head_page. The old reader page will | ||
107 | become part of the ring buffer and the head_page will be removed. | ||
108 | The page after the inserted page (old reader_page) will become the | ||
109 | new head page. | ||
110 | |||
111 | Once the new page is given to the reader, the reader could do what | ||
112 | it wants with it, as long as a writer has left that page. | ||
113 | |||
114 | A sample of how the reader page is swapped: Note this does not | ||
115 | show the head page in the buffer, it is for demonstrating a swap | ||
116 | only. | ||
117 | |||
118 | +------+ | ||
119 | |reader| RING BUFFER | ||
120 | |page | | ||
121 | +------+ | ||
122 | +---+ +---+ +---+ | ||
123 | | |-->| |-->| | | ||
124 | | |<--| |<--| | | ||
125 | +---+ +---+ +---+ | ||
126 | ^ | ^ | | ||
127 | | +-------------+ | | ||
128 | +-----------------+ | ||
129 | |||
130 | |||
131 | +------+ | ||
132 | |reader| RING BUFFER | ||
133 | |page |-------------------+ | ||
134 | +------+ v | ||
135 | | +---+ +---+ +---+ | ||
136 | | | |-->| |-->| | | ||
137 | | | |<--| |<--| |<-+ | ||
138 | | +---+ +---+ +---+ | | ||
139 | | ^ | ^ | | | ||
140 | | | +-------------+ | | | ||
141 | | +-----------------+ | | ||
142 | +------------------------------------+ | ||
143 | |||
144 | +------+ | ||
145 | |reader| RING BUFFER | ||
146 | |page |-------------------+ | ||
147 | +------+ <---------------+ v | ||
148 | | ^ +---+ +---+ +---+ | ||
149 | | | | |-->| |-->| | | ||
150 | | | | | | |<--| |<-+ | ||
151 | | | +---+ +---+ +---+ | | ||
152 | | | | ^ | | | ||
153 | | | +-------------+ | | | ||
154 | | +-----------------------------+ | | ||
155 | +------------------------------------+ | ||
156 | |||
157 | +------+ | ||
158 | |buffer| RING BUFFER | ||
159 | |page |-------------------+ | ||
160 | +------+ <---------------+ v | ||
161 | | ^ +---+ +---+ +---+ | ||
162 | | | | | | |-->| | | ||
163 | | | New | | | |<--| |<-+ | ||
164 | | | Reader +---+ +---+ +---+ | | ||
165 | | | page ----^ | | | ||
166 | | | | | | ||
167 | | +-----------------------------+ | | ||
168 | +------------------------------------+ | ||
169 | |||
170 | |||
171 | |||
172 | It is possible that the page swapped is the commit page and the tail page, | ||
173 | if what is in the ring buffer is less than what is held in a buffer page. | ||
174 | |||
175 | |||
176 | reader page commit page tail page | ||
177 | | | | | ||
178 | v | | | ||
179 | +---+ | | | ||
180 | | |<----------+ | | ||
181 | | |<------------------------+ | ||
182 | | |------+ | ||
183 | +---+ | | ||
184 | | | ||
185 | v | ||
186 | +---+ +---+ +---+ +---+ | ||
187 | <---| |--->| |--->| |--->| |---> | ||
188 | --->| |<---| |<---| |<---| |<--- | ||
189 | +---+ +---+ +---+ +---+ | ||
190 | |||
191 | This case is still valid for this algorithm. | ||
192 | When the writer leaves the page, it simply goes into the ring buffer | ||
193 | since the reader page still points to the next location in the ring | ||
194 | buffer. | ||
195 | |||
196 | |||
197 | The main pointers: | ||
198 | |||
199 | reader page - The page that is used solely by the reader and is not part | ||
200 | of the ring buffer (may be swapped in) | ||
201 | |||
202 | head page - the next page in the ring buffer that will be swapped | ||
203 | with the reader page. | ||
204 | |||
205 | tail page - the page where the next write will take place. | ||
206 | |||
207 | commit page - the page that last finished a write. | ||
208 | |||
209 | The commit page is only updated by the outer most writer in the | ||
210 | writer stack. A writer that preempts another writer will not move the | ||
211 | commit page. | ||
212 | |||
213 | When data is written into the ring buffer, a position is reserved | ||
214 | in the ring buffer and passed back to the writer. When the writer | ||
215 | is finished writing data into that position, it commits the write. | ||
216 | |||
217 | Another write (or a read) may take place at any time during this | ||
218 | transaction. If another write happens it must finish before continuing | ||
219 | with the previous write. | ||
220 | |||
221 | |||
222 | Write reserve: | ||
223 | |||
224 | Buffer page | ||
225 | +---------+ | ||
226 | |written | | ||
227 | +---------+ <--- given back to writer (current commit) | ||
228 | |reserved | | ||
229 | +---------+ <--- tail pointer | ||
230 | | empty | | ||
231 | +---------+ | ||
232 | |||
233 | Write commit: | ||
234 | |||
235 | Buffer page | ||
236 | +---------+ | ||
237 | |written | | ||
238 | +---------+ | ||
239 | |written | | ||
240 | +---------+ <--- next position for write (current commit) | ||
241 | | empty | | ||
242 | +---------+ | ||
243 | |||
244 | |||
245 | If a write happens after the first reserve: | ||
246 | |||
247 | Buffer page | ||
248 | +---------+ | ||
249 | |written | | ||
250 | +---------+ <-- current commit | ||
251 | |reserved | | ||
252 | +---------+ <--- given back to second writer | ||
253 | |reserved | | ||
254 | +---------+ <--- tail pointer | ||
255 | |||
256 | After second writer commits: | ||
257 | |||
258 | |||
259 | Buffer page | ||
260 | +---------+ | ||
261 | |written | | ||
262 | +---------+ <--(last full commit) | ||
263 | |reserved | | ||
264 | +---------+ | ||
265 | |pending | | ||
266 | |commit | | ||
267 | +---------+ <--- tail pointer | ||
268 | |||
269 | When the first writer commits: | ||
270 | |||
271 | Buffer page | ||
272 | +---------+ | ||
273 | |written | | ||
274 | +---------+ | ||
275 | |written | | ||
276 | +---------+ | ||
277 | |written | | ||
278 | +---------+ <--(last full commit and tail pointer) | ||
279 | |||
280 | |||
281 | The commit pointer points to the last write location that was | ||
282 | committed without preempting another write. When a write that | ||
283 | preempted another write is committed, it only becomes a pending commit | ||
284 | and will not be a full commit till all writes have been committed. | ||
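
The reserve and commit sequence above can be modelled with a nesting
counter. This is a sketch under stated assumptions: the counter, struct
page_state, reserve_sketch() and commit_sketch() are hypothetical and
chosen for illustration, and plain (non-atomic) updates are used because
the writers are assumed to nest strictly like a stack on one CPU.

```c
#include <stddef.h>

/* Hypothetical per-page state; offsets are byte positions in the page. */
struct page_state {
	size_t tail;		/* end of the last reserved area */
	size_t commit;		/* end of the last full commit */
	unsigned int nesting;	/* number of writers between reserve and commit */
};

/* Reserve len bytes and return the offset handed back to the writer. */
static size_t reserve_sketch(struct page_state *p, size_t len)
{
	size_t pos = p->tail;

	p->nesting++;
	p->tail += len;
	return pos;
}

/*
 * Commit a write. A nested writer only leaves a pending commit; the
 * commit pointer moves when the outer most writer finishes.
 */
static void commit_sketch(struct page_state *p)
{
	if (--p->nesting == 0)
		p->commit = p->tail;
}
```

With two nested writers, the inner commit leaves the commit pointer where
it was, and the outer commit moves it past both writes at once, matching
the "pending commit" and "last full commit" diagrams above.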
285 | |||
286 | The commit page points to the page that has the last full commit. | ||
287 | The tail page points to the page with the last write (before | ||
288 | committing). | ||
289 | |||
290 | The tail page is always equal to or after the commit page. It may | ||
291 | be several pages ahead. If the tail page catches up to the commit | ||
292 | page then no more writes may take place (regardless of the mode | ||
293 | of the ring buffer: overwrite and producer/consumer). | ||
294 | |||
295 | The order of pages is: | ||
296 | |||
297 | head page | ||
298 | commit page | ||
299 | tail page | ||
300 | |||
301 | Possible scenario: | ||
302 | tail page | ||
303 | head page commit page | | ||
304 | | | | | ||
305 | v v v | ||
306 | +---+ +---+ +---+ +---+ | ||
307 | <---| |--->| |--->| |--->| |---> | ||
308 | --->| |<---| |<---| |<---| |<--- | ||
309 | +---+ +---+ +---+ +---+ | ||
310 | |||
311 | There is a special case where the head page is after the commit page | ||
312 | and possibly the tail page. That is when the commit (and tail) page has been | ||
313 | swapped with the reader page. This is because the head page is always | ||
314 | part of the ring buffer, but the reader page is not. Whenever there | ||
315 | has been less than a full page that has been committed inside the ring buffer, | ||
316 | and a reader swaps out a page, it will be swapping out the commit page. | ||
317 | |||
318 | |||
319 | reader page commit page tail page | ||
320 | | | | | ||
321 | v | | | ||
322 | +---+ | | | ||
323 | | |<----------+ | | ||
324 | | |<------------------------+ | ||
325 | | |------+ | ||
326 | +---+ | | ||
327 | | | ||
328 | v | ||
329 | +---+ +---+ +---+ +---+ | ||
330 | <---| |--->| |--->| |--->| |---> | ||
331 | --->| |<---| |<---| |<---| |<--- | ||
332 | +---+ +---+ +---+ +---+ | ||
333 | ^ | ||
334 | | | ||
335 | head page | ||
336 | |||
337 | |||
338 | In this case, the head page will not move when the tail and commit | ||
339 | move back into the ring buffer. | ||
340 | |||
341 | The reader can not swap a page into the ring buffer if the commit page | ||
342 | is still on that page. If the read meets the last commit (real commit | ||
343 | not pending or reserved), then there is nothing more to read. | ||
344 | The buffer is considered empty until another full commit finishes. | ||
345 | |||
346 | When the tail meets the head page, if the buffer is in overwrite mode, | ||
347 | the head page will be pushed ahead one. If the buffer is in producer/consumer | ||
348 | mode, the write will fail. | ||
349 | |||
350 | Overwrite mode: | ||
351 | |||
352 | tail page | ||
353 | | | ||
354 | v | ||
355 | +---+ +---+ +---+ +---+ | ||
356 | <---| |--->| |--->| |--->| |---> | ||
357 | --->| |<---| |<---| |<---| |<--- | ||
358 | +---+ +---+ +---+ +---+ | ||
359 | ^ | ||
360 | | | ||
361 | head page | ||
362 | |||
363 | |||
364 | tail page | ||
365 | | | ||
366 | v | ||
367 | +---+ +---+ +---+ +---+ | ||
368 | <---| |--->| |--->| |--->| |---> | ||
369 | --->| |<---| |<---| |<---| |<--- | ||
370 | +---+ +---+ +---+ +---+ | ||
371 | ^ | ||
372 | | | ||
373 | head page | ||
374 | |||
375 | |||
376 | tail page | ||
377 | | | ||
378 | v | ||
379 | +---+ +---+ +---+ +---+ | ||
380 | <---| |--->| |--->| |--->| |---> | ||
381 | --->| |<---| |<---| |<---| |<--- | ||
382 | +---+ +---+ +---+ +---+ | ||
383 | ^ | ||
384 | | | ||
385 | head page | ||
386 | |||
387 | Note, the reader page will still point to the previous head page. | ||
388 | But when a swap takes place, it will use the most recent head page. | ||
389 | |||
390 | |||
391 | Making the Ring Buffer Lockless: | ||
392 | -------------------------------- | ||
393 | |||
394 | The main idea behind the lockless algorithm is to combine the moving | ||
395 | of the head_page pointer with the swapping of pages with the reader. | ||
396 | State flags are placed inside the pointer to the page. To do this, | ||
397 | each page must be aligned in memory by 4 bytes. This will allow the 2 | ||
398 | least significant bits of the address to be used as flags, since | ||
399 | they will always be zero for the address. To get the address, | ||
400 | simply mask out the flags. | ||
401 | |||
402 | MASK = ~3 | ||
403 | |||
404 | address & MASK | ||
405 | |||
406 | Two flags will be kept by these two bits: | ||
407 | |||
408 | HEADER - the page being pointed to is a head page | ||
409 | |||
410 | UPDATE - the page being pointed to is being updated by a writer | ||
411 | and was or is about to be a head page. | ||
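
The flag encoding can be sketched as follows. The text only says that the
two low bits carry the flags; the specific bit values and the helper names
below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit assignment; the text does not state the actual values. */
#define FLAG_HEADER	((uintptr_t)1)
#define FLAG_UPDATE	((uintptr_t)2)
#define FLAG_MASK	((uintptr_t)3)

/* address & MASK, with MASK = ~3 as in the text */
static uintptr_t ptr_address(uintptr_t val)
{
	return val & ~FLAG_MASK;
}

/* The two low bits: 0 (NORMAL), FLAG_HEADER or FLAG_UPDATE */
static uintptr_t ptr_flags(uintptr_t val)
{
	return val & FLAG_MASK;
}

/* Build a flagged pointer from a 4 byte aligned address. */
static uintptr_t ptr_with_flag(uintptr_t addr, uintptr_t flag)
{
	assert(ptr_flags(addr) == 0);	/* low bits must be free */
	return addr | flag;
}
```

Because pages are aligned by at least 4 bytes, ptr_address() always
recovers the original address no matter which flag was stored.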
412 | |||
413 | |||
414 | reader page | ||
415 | | | ||
416 | v | ||
417 | +---+ | ||
418 | | |------+ | ||
419 | +---+ | | ||
420 | | | ||
421 | v | ||
422 | +---+ +---+ +---+ +---+ | ||
423 | <---| |--->| |-H->| |--->| |---> | ||
424 | --->| |<---| |<---| |<---| |<--- | ||
425 | +---+ +---+ +---+ +---+ | ||
426 | |||
427 | |||
428 | The above pointer "-H->" would have the HEADER flag set. That is, | ||
429 | the page it points to is the next page to be swapped out by the reader. | ||
430 | This pointer means the next page is the head page. | ||
431 | |||
432 | When the tail page meets the head pointer, it will use cmpxchg to | ||
433 | change the pointer to the UPDATE state: | ||
434 | |||
435 | |||
436 | tail page | ||
437 | | | ||
438 | v | ||
439 | +---+ +---+ +---+ +---+ | ||
440 | <---| |--->| |-H->| |--->| |---> | ||
441 | --->| |<---| |<---| |<---| |<--- | ||
442 | +---+ +---+ +---+ +---+ | ||
443 | |||
444 | tail page | ||
445 | | | ||
446 | v | ||
447 | +---+ +---+ +---+ +---+ | ||
448 | <---| |--->| |-U->| |--->| |---> | ||
449 | --->| |<---| |<---| |<---| |<--- | ||
450 | +---+ +---+ +---+ +---+ | ||
451 | |||
452 | "-U->" represents a pointer in the UPDATE state. | ||
453 | |||
454 | Any access to the reader will need to take some sort of lock to serialize | ||
455 | the readers. But the writers will never take a lock to write to the | ||
456 | ring buffer. This means we only need to worry about a single reader, | ||
457 | and writes only preempt in "stack" formation. | ||
458 | |||
459 | When the reader tries to swap the page with the ring buffer, it | ||
460 | will also use cmpxchg. If the flag bit in the pointer to the | ||
461 | head page does not have the HEADER flag set, the compare will fail | ||
462 | and the reader will need to look for the new head page and try again. | ||
463 | Note, the flag UPDATE and HEADER are never set at the same time. | ||
464 | |||
465 | The reader swaps the reader page as follows: | ||
466 | |||
467 | +------+ | ||
468 | |reader| RING BUFFER | ||
469 | |page | | ||
470 | +------+ | ||
471 | +---+ +---+ +---+ | ||
472 | | |--->| |--->| | | ||
473 | | |<---| |<---| | | ||
474 | +---+ +---+ +---+ | ||
475 | ^ | ^ | | ||
476 | | +---------------+ | | ||
477 | +-----H-------------+ | ||
478 | |||
479 | The reader sets the reader page next pointer as HEADER to the page after | ||
480 | the head page. | ||
481 | |||
482 | |||
483 | +------+ | ||
484 | |reader| RING BUFFER | ||
485 | |page |-------H-----------+ | ||
486 | +------+ v | ||
487 | | +---+ +---+ +---+ | ||
488 | | | |--->| |--->| | | ||
489 | | | |<---| |<---| |<-+ | ||
490 | | +---+ +---+ +---+ | | ||
491 | | ^ | ^ | | | ||
492 | | | +---------------+ | | | ||
493 | | +-----H-------------+ | | ||
494 | +--------------------------------------+ | ||
495 | |||
496 | It does a cmpxchg with the pointer to the previous head page to make it | ||
497 | point to the reader page. Note that the new pointer does not have the HEADER | ||
498 | flag set. This action atomically moves the head page forward. | ||
499 | |||
500 | +------+ | ||
501 | |reader| RING BUFFER | ||
502 | |page |-------H-----------+ | ||
503 | +------+ v | ||
504 | | ^ +---+ +---+ +---+ | ||
505 | | | | |-->| |-->| | | ||
506 | | | | |<--| |<--| |<-+ | ||
507 | | | +---+ +---+ +---+ | | ||
508 | | | | ^ | | | ||
509 | | | +-------------+ | | | ||
510 | | +-----------------------------+ | | ||
511 | +------------------------------------+ | ||
512 | |||
513 | After the new head page is set, the previous pointer of the head page is | ||
514 | updated to the reader page. | ||
515 | |||
516 | +------+ | ||
517 | |reader| RING BUFFER | ||
518 | |page |-------H-----------+ | ||
519 | +------+ <---------------+ v | ||
520 | | ^ +---+ +---+ +---+ | ||
521 | | | | |-->| |-->| | | ||
522 | | | | | | |<--| |<-+ | ||
523 | | | +---+ +---+ +---+ | | ||
524 | | | | ^ | | | ||
525 | | | +-------------+ | | | ||
526 | | +-----------------------------+ | | ||
527 | +------------------------------------+ | ||
528 | |||
529 | +------+ | ||
530 | |buffer| RING BUFFER | ||
531 | |page |-------H-----------+ <--- New head page | ||
532 | +------+ <---------------+ v | ||
533 | | ^ +---+ +---+ +---+ | ||
534 | | | | | | |-->| | | ||
535 | | | New | | | |<--| |<-+ | ||
536 | | | Reader +---+ +---+ +---+ | | ||
537 | | | page ----^ | | | ||
538 | | | | | | ||
539 | | +-----------------------------+ | | ||
540 | +------------------------------------+ | ||
541 | |||
542 | Another important point. The page that the reader page points back to | ||
543 | by its previous pointer (the one that now points to the new head page) | ||
544 | never points back to the reader page. That is because the reader page is | ||
545 | not part of the ring buffer. Traversing the ring buffer via the next pointers | ||
546 | will always stay in the ring buffer. Traversing the ring buffer via the | ||
547 | prev pointers may not. | ||
548 | |||
549 | Note, the way to determine a reader page is simply by examining the previous | ||
550 | pointer of the page. If the next pointer of the previous page does not | ||
551 | point back to the original page, then the original page is a reader page: | ||
552 | |||
553 | |||
554 | +--------+ | ||
555 | | reader | next +----+ | ||
556 | | page |-------->| |<====== (buffer page) | ||
557 | +--------+ +----+ | ||
558 | | | ^ | ||
559 | | v | next | ||
560 | prev | +----+ | ||
561 | +------------->| | | ||
562 | +----+ | ||
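
The reader page test described above can be sketched in a few lines.
struct bpage and is_reader_page_sketch() are hypothetical names, and the
sketch assumes the flag bits have already been masked out of the next
pointers.

```c
#include <stdbool.h>

/* Hypothetical page with plain (unflagged) list pointers. */
struct bpage {
	struct bpage *next;
	struct bpage *prev;
};

/*
 * A page inside the ring buffer is pointed back to by the page before
 * it. The reader page is not: page->prev->next is the head page.
 */
static bool is_reader_page_sketch(const struct bpage *page)
{
	return page->prev->next != page;
}
```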
563 | |||
564 | The way the head page moves forward: | ||
565 | |||
566 | When the tail page meets the head page and the buffer is in overwrite mode | ||
567 | and more writes take place, the head page must be moved forward before the | ||
568 | writer may move the tail page. The way this is done is that the writer | ||
569 | performs a cmpxchg to convert the pointer to the head page from the HEADER | ||
570 | flag to have the UPDATE flag set. Once this is done, the reader will | ||
571 | not be able to swap the head page from the buffer, nor will it be able to | ||
572 | move the head page, until the writer is finished with the move. | ||
573 | |||
574 | This eliminates any races that the reader can have on the writer. The reader | ||
575 | must spin, and this is why the reader can not preempt the writer. | ||
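
The interaction between the writer's HEADER to UPDATE conversion and the
reader's swap can be sketched with C11 atomics. The flag values and the
function names are assumptions for illustration; "link" stands for the
next pointer of the page before the head page.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit assignment for the two flags. */
#define FLAG_HEADER	((uintptr_t)1)
#define FLAG_UPDATE	((uintptr_t)2)

/* Writer: convert the pointer to the head page from HEADER to UPDATE. */
static bool writer_claim_head(_Atomic uintptr_t *link, uintptr_t head_addr)
{
	uintptr_t expected = head_addr | FLAG_HEADER;

	return atomic_compare_exchange_strong(link, &expected,
					      head_addr | FLAG_UPDATE);
}

/*
 * Reader: replace the head page with the reader page. This only
 * succeeds while the pointer still carries the HEADER flag, and the
 * new pointer is stored without the HEADER flag.
 */
static bool reader_swap_head(_Atomic uintptr_t *link, uintptr_t head_addr,
			     uintptr_t reader_addr)
{
	uintptr_t expected = head_addr | FLAG_HEADER;

	return atomic_compare_exchange_strong(link, &expected, reader_addr);
}
```

Once writer_claim_head() has succeeded, reader_swap_head() fails on the
same pointer, so the reader has to look for the new head page and retry.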
576 | |||
577 | tail page | ||
578 | | | ||
579 | v | ||
580 | +---+ +---+ +---+ +---+ | ||
581 | <---| |--->| |-H->| |--->| |---> | ||
582 | --->| |<---| |<---| |<---| |<--- | ||
583 | +---+ +---+ +---+ +---+ | ||
584 | |||
585 | tail page | ||
586 | | | ||
587 | v | ||
588 | +---+ +---+ +---+ +---+ | ||
589 | <---| |--->| |-U->| |--->| |---> | ||
590 | --->| |<---| |<---| |<---| |<--- | ||
591 | +---+ +---+ +---+ +---+ | ||
592 | |||
593 | The following page will be made into the new head page. | ||
594 | |||
595 | tail page | ||
596 | | | ||
597 | v | ||
598 | +---+ +---+ +---+ +---+ | ||
599 | <---| |--->| |-U->| |-H->| |---> | ||
600 | --->| |<---| |<---| |<---| |<--- | ||
601 | +---+ +---+ +---+ +---+ | ||
602 | |||
603 | After the new head page has been set, we can set the old head page | ||
604 | pointer back to NORMAL. | ||
605 | |||
606 | tail page | ||
607 | | | ||
608 | v | ||
609 | +---+ +---+ +---+ +---+ | ||
610 | <---| |--->| |--->| |-H->| |---> | ||
611 | --->| |<---| |<---| |<---| |<--- | ||
612 | +---+ +---+ +---+ +---+ | ||
613 | |||
614 | After the head page has been moved, the tail page may now move forward. | ||
615 | |||
616 | tail page | ||
617 | | | ||
618 | v | ||
619 | +---+ +---+ +---+ +---+ | ||
620 | <---| |--->| |--->| |-H->| |---> | ||
621 | --->| |<---| |<---| |<---| |<--- | ||
622 | +---+ +---+ +---+ +---+ | ||
623 | |||
624 | |||
625 | The above are the trivial updates. Now for the more complex scenarios. | ||
626 | |||
627 | |||
628 | As stated before, if enough writes preempt the first write, the | ||
629 | tail page may make it all the way around the buffer and meet the commit | ||
630 | page. At this time, we must start dropping writes (usually with some kind | ||
631 | of warning to the user). But what happens if the commit was still on the | ||
632 | reader page? The commit page is not part of the ring buffer. The tail page | ||
633 | must account for this. | ||
634 | |||
635 | |||
636 | reader page commit page | ||
637 | | | | ||
638 | v | | ||
639 | +---+ | | ||
640 | | |<----------+ | ||
641 | | | | ||
642 | | |------+ | ||
643 | +---+ | | ||
644 | | | ||
645 | v | ||
646 | +---+ +---+ +---+ +---+ | ||
647 | <---| |--->| |-H->| |--->| |---> | ||
648 | --->| |<---| |<---| |<---| |<--- | ||
649 | +---+ +---+ +---+ +---+ | ||
650 | ^ | ||
651 | | | ||
652 | tail page | ||
653 | |||
654 | If the tail page were to simply push the head page forward, the commit when | ||
655 | leaving the reader page would not be pointing to the correct page. | ||
656 | |||
657 | The solution to this is to test if the commit page is on the reader page | ||
658 | before pushing the head page. If it is, then it can be assumed that the | ||
659 | tail page wrapped the buffer, and we must drop new writes. | ||
660 | |||
661 | This is not a race condition, because the commit page can only be moved | ||
662 | by the outer most writer (the writer that was preempted). | ||
663 | This means that the commit will not move while a writer is moving the | ||
664 | tail page. The reader can not swap the reader page if it is also being | ||
665 | used as the commit page. The reader can simply check that the commit | ||
666 | is off the reader page. Once the commit page leaves the reader page | ||
667 | it will never go back on it unless a reader does another swap with the | ||
668 | buffer page that is also the commit page. | ||
669 | |||
670 | |||
671 | Nested writes | ||
672 | ------------- | ||
673 | |||
674 | In the pushing forward of the tail page we must first push forward | ||
675 | the head page if the head page is the next page. If the head page | ||
676 | is not the next page, the tail page is simply updated with a cmpxchg. | ||
677 | |||
678 | Only writers move the tail page. This must be done atomically to protect | ||
679 | against nested writers. | ||
680 | |||
681 | temp_page = tail_page | ||
682 | next_page = temp_page->next | ||
683 | cmpxchg(tail_page, temp_page, next_page) | ||
684 | |||
685 | The above will update the tail page if it is still pointing to the expected | ||
686 | page. If this fails, a nested write pushed it forward, so the current write | ||
687 | does not need to push it. | ||
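
The three line pseudocode above maps onto C11 atomics roughly as follows.
struct page_sketch and try_advance_tail() are hypothetical names; the
snapshot (temp_page) is passed in so the effect of a nested writer can be
seen.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct page_sketch {
	struct page_sketch *next;
};

/*
 * Move *tail_page from temp_page to temp_page->next, but only if it
 * still points to temp_page. Returns false when a nested writer has
 * already pushed the tail page forward.
 */
static bool try_advance_tail(_Atomic(struct page_sketch *) *tail_page,
			     struct page_sketch *temp_page)
{
	struct page_sketch *expected = temp_page;
	struct page_sketch *next_page = temp_page->next;

	return atomic_compare_exchange_strong(tail_page, &expected, next_page);
}
```

If the outer writer took its snapshot before being preempted, its own
call fails harmlessly after the nested writer has moved the tail page,
and it simply retries its reservation on the new tail page.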
688 | |||
689 | |||
690 | temp page | ||
691 | | | ||
692 | v | ||
693 | tail page | ||
694 | | | ||
695 | v | ||
696 | +---+ +---+ +---+ +---+ | ||
697 | <---| |--->| |--->| |--->| |---> | ||
698 | --->| |<---| |<---| |<---| |<--- | ||
699 | +---+ +---+ +---+ +---+ | ||
700 | |||
701 | Nested write comes in and moves the tail page forward: | ||
702 | |||
703 | tail page (moved by nested writer) | ||
704 | temp page | | ||
705 | | | | ||
706 | v v | ||
707 | +---+ +---+ +---+ +---+ | ||
708 | <---| |--->| |--->| |--->| |---> | ||
709 | --->| |<---| |<---| |<---| |<--- | ||
710 | +---+ +---+ +---+ +---+ | ||
711 | |||
712 | The above would fail the cmpxchg, but since the tail page has already | ||
713 | been moved forward, the writer will just try again to reserve storage | ||
714 | on the new tail page. | ||
715 | |||
716 | But the moving of the head page is a bit more complex. | ||
717 | |||
718 | tail page | ||
719 | | | ||
720 | v | ||
721 | +---+ +---+ +---+ +---+ | ||
722 | <---| |--->| |-H->| |--->| |---> | ||
723 | --->| |<---| |<---| |<---| |<--- | ||
724 | +---+ +---+ +---+ +---+ | ||
725 | |||
726 | The write converts the head page pointer to UPDATE. | ||
727 | |||
728 | tail page | ||
729 | | | ||
730 | v | ||
731 | +---+ +---+ +---+ +---+ | ||
732 | <---| |--->| |-U->| |--->| |---> | ||
733 | --->| |<---| |<---| |<---| |<--- | ||
734 | +---+ +---+ +---+ +---+ | ||
735 | |||
736 | But if a nested writer preempts here, it will see that the next | ||
737 | page is a head page, but it is also nested. It will detect that | ||
738 | it is nested and will save that information. The detection is the | ||
739 | fact that it sees the UPDATE flag instead of a HEADER or NORMAL | ||
740 | pointer. | ||
741 | |||
742 | The nested writer will set the new head page pointer. | ||
743 | |||
744 | tail page | ||
745 | | | ||
746 | v | ||
747 | +---+ +---+ +---+ +---+ | ||
748 | <---| |--->| |-U->| |-H->| |---> | ||
749 | --->| |<---| |<---| |<---| |<--- | ||
750 | +---+ +---+ +---+ +---+ | ||
751 | |||
752 | But it will not reset the update back to normal. Only the writer | ||
753 | that converted a pointer from HEAD to UPDATE will convert it back | ||
754 | to NORMAL. | ||
755 | |||
756 | tail page | ||
757 | | | ||
758 | v | ||
759 | +---+ +---+ +---+ +---+ | ||
760 | <---| |--->| |-U->| |-H->| |---> | ||
761 | --->| |<---| |<---| |<---| |<--- | ||
762 | +---+ +---+ +---+ +---+ | ||
763 | |||
764 | After the nested writer finishes, the outer most writer will convert | ||
765 | the UPDATE pointer to NORMAL. | ||
766 | |||
767 | |||
768 | tail page | ||
769 | | | ||
770 | v | ||
771 | +---+ +---+ +---+ +---+ | ||
772 | <---| |--->| |--->| |-H->| |---> | ||
773 | --->| |<---| |<---| |<---| |<--- | ||
774 | +---+ +---+ +---+ +---+ | ||
775 | |||
776 | |||
777 | It can be even more complex if several nested writes came in and moved | ||
778 | the tail page ahead several pages: | ||
779 | |||
780 | |||
781 | (first writer) | ||
782 | |||
783 | tail page | ||
784 | | | ||
785 | v | ||
786 | +---+ +---+ +---+ +---+ | ||
787 | <---| |--->| |-H->| |--->| |---> | ||
788 | --->| |<---| |<---| |<---| |<--- | ||
789 | +---+ +---+ +---+ +---+ | ||
790 | |||
791 | The write converts the head page pointer to UPDATE. | ||
792 | |||
793 | tail page | ||
794 | | | ||
795 | v | ||
796 | +---+ +---+ +---+ +---+ | ||
797 | <---| |--->| |-U->| |--->| |---> | ||
798 | --->| |<---| |<---| |<---| |<--- | ||
799 | +---+ +---+ +---+ +---+ | ||
800 | |||
801 | Next writer comes in, and sees the update and sets up the new | ||
802 | head page. | ||
803 | |||
804 | (second writer) | ||
805 | |||
806 | tail page | ||
807 | | | ||
808 | v | ||
809 | +---+ +---+ +---+ +---+ | ||
810 | <---| |--->| |-U->| |-H->| |---> | ||
811 | --->| |<---| |<---| |<---| |<--- | ||
812 | +---+ +---+ +---+ +---+ | ||
813 | |||
814 | The nested writer moves the tail page forward. But does not set the old | ||
815 | update page to NORMAL because it is not the outer most writer. | ||
816 | |||
817 | tail page | ||
818 | | | ||
819 | v | ||
820 | +---+ +---+ +---+ +---+ | ||
821 | <---| |--->| |-U->| |-H->| |---> | ||
822 | --->| |<---| |<---| |<---| |<--- | ||
823 | +---+ +---+ +---+ +---+ | ||
824 | |||
825 | Another writer preempts and sees the page after the tail page is a head page. | ||
826 | It changes it from HEAD to UPDATE. | ||
827 | |||
828 | (third writer) | ||
829 | |||
830 | tail page | ||
831 | | | ||
832 | v | ||
833 | +---+ +---+ +---+ +---+ | ||
834 | <---| |--->| |-U->| |-U->| |---> | ||
835 | --->| |<---| |<---| |<---| |<--- | ||
836 | +---+ +---+ +---+ +---+ | ||
837 | |||
838 | The writer will move the head page forward: | ||
839 | |||
840 | |||
841 | (third writer) | ||
842 | |||
843 | tail page | ||
844 | | | ||
845 | v | ||
846 | +---+ +---+ +---+ +---+ | ||
847 | <---| |--->| |-U->| |-U->| |-H-> | ||
848 | --->| |<---| |<---| |<---| |<--- | ||
849 | +---+ +---+ +---+ +---+ | ||
850 | |||
851 | But now that the third writer did change the HEAD flag to UPDATE it | ||
852 | will convert it to normal: | ||
853 | |||
854 | |||
855 | (third writer) | ||
856 | |||
857 | tail page | ||
858 | | | ||
859 | v | ||
860 | +---+ +---+ +---+ +---+ | ||
861 | <---| |--->| |-U->| |--->| |-H-> | ||
862 | --->| |<---| |<---| |<---| |<--- | ||
863 | +---+ +---+ +---+ +---+ | ||
864 | |||
865 | |||
866 | Then it will move the tail page, and return back to the second writer. | ||
867 | |||
868 | |||
869 | (second writer) | ||
870 | |||
871 | tail page | ||
872 | | | ||
873 | v | ||
874 | +---+ +---+ +---+ +---+ | ||
875 | <---| |--->| |-U->| |--->| |-H-> | ||
876 | --->| |<---| |<---| |<---| |<--- | ||
877 | +---+ +---+ +---+ +---+ | ||
878 | |||
879 | |||
880 | The second writer will fail to move the tail page because it was already | ||
881 | moved, so it will try again and add its data to the new tail page. | ||
882 | It will return to the first writer. | ||
883 | |||
884 | |||
885 | (first writer) | ||
886 | |||
887 | tail page | ||
888 | | | ||
889 | v | ||
890 | +---+ +---+ +---+ +---+ | ||
891 | <---| |--->| |-U->| |--->| |-H-> | ||
892 | --->| |<---| |<---| |<---| |<--- | ||
893 | +---+ +---+ +---+ +---+ | ||
894 | |||
895 | The first writer can not atomically test if the tail page moved | ||
896 | while it updates the HEAD page. It will then update the head page to | ||
897 | what it thinks is the new head page. | ||
898 | |||
899 | |||
900 | (first writer) | ||
901 | |||
902 | tail page | ||
903 | | | ||
904 | v | ||
905 | +---+ +---+ +---+ +---+ | ||
906 | <---| |--->| |-U->| |-H->| |-H-> | ||
907 | --->| |<---| |<---| |<---| |<--- | ||
908 | +---+ +---+ +---+ +---+ | ||
909 | |||
910 | Since the cmpxchg returns the old value of the pointer the first writer | ||
911 | will see it succeeded in updating the pointer from NORMAL to HEAD. | ||
912 | But as we can see, this is not good enough. It must also check to see | ||
913 | if the tail page is either where it used to be or on the next page: | ||
914 | |||
915 | |||
916 | (first writer) | ||
917 | |||
918 | A B tail page | ||
919 | | | | | ||
920 | v v v | ||
921 | +---+ +---+ +---+ +---+ | ||
922 | <---| |--->| |-U->| |-H->| |-H-> | ||
923 | --->| |<---| |<---| |<---| |<--- | ||
924 | +---+ +---+ +---+ +---+ | ||
925 | |||
926 | If tail page != A and tail page != B, then it must reset the | ||
927 | pointer back to NORMAL. Because it only needs to worry about | ||
928 | nested writers, it only needs to check this after setting the HEAD page. | ||
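
The check described above can be sketched as a small helper. All names
here are hypothetical, and the flag value is an assumption; "link" is the
pointer on which the first writer just set the HEAD flag, and a and b are
the pages labelled A and B in the diagram.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_HEADER	((uintptr_t)1)	/* assumed bit value */

/*
 * Called right after the writer set the HEAD flag on *link. Returns
 * true if the flag may stay. If the tail page is neither A nor B,
 * nested writers moved it further, so the pointer is reset to NORMAL.
 */
static bool keep_new_head(_Atomic uintptr_t *link, uintptr_t new_head,
			  const void *tail, const void *a, const void *b)
{
	uintptr_t expected = new_head | FLAG_HEADER;

	if (tail == a || tail == b)
		return true;

	/* Undo our own update: HEAD back to NORMAL. */
	atomic_compare_exchange_strong(link, &expected, new_head);
	return false;
}
```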
929 | |||
930 | |||
931 | (first writer) | ||
932 | |||
933 | A B tail page | ||
934 | | | | | ||
935 | v v v | ||
936 | +---+ +---+ +---+ +---+ | ||
937 | <---| |--->| |-U->| |--->| |-H-> | ||
938 | --->| |<---| |<---| |<---| |<--- | ||
939 | +---+ +---+ +---+ +---+ | ||
940 | |||
941 | Now the writer can update the head page. This is also why the head page must | ||
942 | remain in UPDATE and only reset by the outer most writer. This prevents | ||
943 | the reader from seeing the incorrect head page. | ||
944 | |||
945 | |||
946 | (first writer) | ||
947 | |||
948 | A B tail page | ||
949 | | | | | ||
950 | v v v | ||
951 | +---+ +---+ +---+ +---+ | ||
952 | <---| |--->| |--->| |--->| |-H-> | ||
953 | --->| |<---| |<---| |<---| |<--- | ||
954 | +---+ +---+ +---+ +---+ | ||
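
The tail page check walked through above can be modelled in a few lines.
This is only an illustrative sketch, not the kernel implementation: the
dictionary of flags, the cmpxchg() helper and first_writer_set_head() are
hypothetical names, and pages are represented by plain integers.

```python
# Hypothetical model: flags[i] is the flag stored in the pointer leaving page i.
NORMAL, HEAD, UPDATE = "NORMAL", "HEAD", "UPDATE"


def cmpxchg(flags, page, old, new):
    """Mimic cmpxchg: store `new` only if the current value is `old`.

    Always returns the value that was seen before the attempt.
    """
    seen = flags[page]
    if seen == old:
        flags[page] = new
    return seen


def first_writer_set_head(flags, ptr_page, tail, a, b):
    """Try to flag the pointer leaving ptr_page as HEAD, then validate the tail.

    a is where the tail page used to be and b is the page after it.
    Returns True if the HEAD flag was kept, False if it was never set
    or had to be reset to NORMAL.
    """
    if cmpxchg(flags, ptr_page, NORMAL, HEAD) != NORMAL:
        return False
    # The cmpxchg succeeded, but nested writers may have moved the tail.
    if tail != a and tail != b:
        flags[ptr_page] = NORMAL
        return False
    return True
```

With the layout from the diagrams (tail page on the fourth page, A and B on
the second and third), the call reports failure and leaves the pointer
NORMAL, which matches the state shown before the outermost writer clears
UPDATE.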
955 | |||
diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt new file mode 100644 index 000000000000..5eb4e487e667 --- /dev/null +++ b/Documentation/trace/tracepoint-analysis.txt | |||
@@ -0,0 +1,327 @@ | |||
1 | Notes on Analysing Behaviour Using Events and Tracepoints | ||
2 | |||
3 | Documentation written by Mel Gorman | ||
4 | PCL information heavily based on email from Ingo Molnar | ||
5 | |||
6 | 1. Introduction | ||
7 | =============== | ||
8 | |||
9 | Tracepoints (see Documentation/trace/tracepoints.txt) can be used without | ||
10 | creating custom kernel modules to register probe functions using the event | ||
11 | tracing infrastructure. | ||
12 | |||
13 | Simplistically, tracepoints represent important events that can be | ||
14 | taken in conjunction with other tracepoints to build a "Big Picture" of | ||
15 | what is going on within the system. There are a large number of methods for | ||
16 | gathering and interpreting these events. Lacking any current Best Practises, | ||
17 | this document describes some of the methods that can be used. | ||
18 | |||
19 | This document assumes that debugfs is mounted on /sys/kernel/debug and that | ||
20 | the appropriate tracing options have been configured into the kernel. It is | ||
21 | assumed that the PCL tool tools/perf has been installed and is in your path. | ||
22 | |||
23 | 2. Listing Available Events | ||
24 | =========================== | ||
25 | |||
26 | 2.1 Standard Utilities | ||
27 | ---------------------- | ||
28 | |||
29 | All possible events are visible from /sys/kernel/debug/tracing/events. Simply | ||
30 | calling | ||
31 | |||
32 | $ find /sys/kernel/debug/tracing/events -type d | ||
33 | |||
34 | will give a fair indication of the number of events available. | ||
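
For scripted use, the same directory walk can be turned into a list of
subsystem:event names. The sketch below assumes each event is a directory
two levels below the events root that contains an "enable" file;
list_events() is a hypothetical helper name.

```python
import os


def list_events(events_root="/sys/kernel/debug/tracing/events"):
    """Return sorted 'subsystem:event' names found under events_root."""
    names = []
    for dirpath, _dirnames, filenames in os.walk(events_root):
        parts = os.path.relpath(dirpath, events_root).split(os.sep)
        # Assumption: an event directory is <subsystem>/<event> and holds
        # an 'enable' file; the root and subsystem levels are skipped.
        if len(parts) == 2 and "enable" in filenames:
            names.append(":".join(parts))
    return sorted(names)
```

Calling list_events() with no argument reads the default debugfs location
and normally needs sufficient privileges to traverse it.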
35 | |||
36 | 2.2 PCL | ||
37 | ------- | ||
38 | |||
39 | Discovery and enumeration of all counters and events, including tracepoints, | ||
40 | is available with the perf tool. Getting a list of available events is a | ||
41 | simple case of | ||
42 | |||
43 | $ perf list 2>&1 | grep Tracepoint | ||
44 | ext4:ext4_free_inode [Tracepoint event] | ||
45 | ext4:ext4_request_inode [Tracepoint event] | ||
46 | ext4:ext4_allocate_inode [Tracepoint event] | ||
47 | ext4:ext4_write_begin [Tracepoint event] | ||
48 | ext4:ext4_ordered_write_end [Tracepoint event] | ||
49 | [ .... remaining output snipped .... ] | ||
50 | |||
51 | |||
52 | 2. Enabling Events | ||
53 | ================== | ||
54 | |||
55 | 2.1 System-Wide Event Enabling | ||
56 | ------------------------------ | ||
57 | |||
58 | See Documentation/trace/events.txt for a proper description on how events | ||
59 | can be enabled system-wide. A short example of enabling all events related | ||
60 | to page allocation would look something like | ||
61 | |||
62 | $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done | ||
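
The same loop can be written as a small script, which is easier to extend
with error handling. This is a sketch under stated assumptions:
enable_events() is a hypothetical helper, and unlike the grep above it
matches the pattern only against the path relative to the events directory.

```python
import os


def enable_events(events_root, pattern="mm_", value="1"):
    """Write value to every 'enable' file whose relative path contains pattern.

    Returns the sorted list of files that were written.
    """
    touched = []
    for dirpath, _dirnames, filenames in os.walk(events_root):
        if "enable" not in filenames:
            continue
        path = os.path.join(dirpath, "enable")
        # Match on the path below events_root so the root itself is ignored.
        if pattern in os.path.relpath(path, events_root):
            with open(path, "w") as f:
                f.write(value)
            touched.append(path)
    return sorted(touched)
```

Passing value="0" turns the same set of events off again.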
63 | |||
64 | 2.2 System-Wide Event Enabling with SystemTap | ||
65 | --------------------------------------------- | ||
66 | |||
67 | In SystemTap, tracepoints are accessible using the kernel.trace() function | ||
68 | call. The following is an example that reports every 5 seconds what processes | ||
69 | were allocating the pages. | ||
70 | |||
71 | global page_allocs | ||
72 | |||
73 | probe kernel.trace("mm_page_alloc") { | ||
74 | page_allocs[execname()]++ | ||
75 | } | ||
76 | |||
77 | function print_count() { | ||
78 | printf ("%-25s %-s\n", "#Pages Allocated", "Process Name") | ||
79 | foreach (proc in page_allocs-) | ||
80 | printf("%-25d %s\n", page_allocs[proc], proc) | ||
81 | printf ("\n") | ||
82 | delete page_allocs | ||
83 | } | ||
84 | |||
85 | probe timer.s(5) { | ||
86 | print_count() | ||
87 | } | ||
88 | |||
89 | 2.3 System-Wide Event Enabling with PCL | ||
90 | --------------------------------------- | ||
91 | |||
92 | By specifying the -a switch and analysing sleep, the system-wide events | ||
93 | for a duration of time can be examined. | ||
94 | |||
95 | $ perf stat -a \ | ||
96 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
97 | -e kmem:mm_pagevec_free \ | ||
98 | sleep 10 | ||
99 | Performance counter stats for 'sleep 10': | ||
100 | |||
101 | 9630 kmem:mm_page_alloc | ||
102 | 2143 kmem:mm_page_free_direct | ||
103 | 7424 kmem:mm_pagevec_free | ||
104 | |||
105 | 10.002577764 seconds time elapsed | ||
106 | |||
107 | Similarly, one could execute a shell and exit it as desired to get a report | ||
108 | at that point. | ||
109 | |||
110 | 2.4 Local Event Enabling | ||
111 | ------------------------ | ||
112 | |||
113 | Documentation/trace/ftrace.txt describes how to enable events on a per-thread | ||
114 | basis using set_ftrace_pid. | ||
115 | |||
116 | 2.5 Local Event Enablement with PCL | ||
117 | ----------------------------------- | ||
118 | |||
119 | Events can be activated and tracked for the duration of a process on a local | ||
120 | basis using PCL as follows. | ||
121 | |||
122 | $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
123 | -e kmem:mm_pagevec_free ./hackbench 10 | ||
124 | Time: 0.909 | ||
125 | |||
126 | Performance counter stats for './hackbench 10': | ||
127 | |||
128 | 17803 kmem:mm_page_alloc | ||
129 | 12398 kmem:mm_page_free_direct | ||
130 | 4827 kmem:mm_pagevec_free | ||
131 | |||
132 | 0.973913387 seconds time elapsed | ||
133 | |||
134 | 3. Event Filtering | ||
135 | ================== | ||
136 | |||
137 | Documentation/trace/ftrace.txt covers in-depth how to filter events in | ||
138 | ftrace. Obviously using grep and awk of trace_pipe is an option as well | ||
139 | as any script reading trace_pipe. | ||
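
As an illustration of the "any script" option, the sketch below keeps only
the lines for a chosen set of events. The line layout in the regular
expression ("task-pid [cpu] timestamp: event: fields") is an assumption
about the human-readable trace_pipe output and may need adjusting;
filter_events() is a hypothetical name.

```python
import re

# Assumed layout: "<task>-<pid> [<cpu>] <timestamp>: <event>: <fields>"
LINE_RE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<event>\w+):\s*(?P<fields>.*)$"
)


def filter_events(lines, wanted):
    """Yield a dict of parsed fields for each line whose event is in wanted."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("event") in wanted:
            yield m.groupdict()
```

It can be fed an open file object for trace_pipe or a saved copy of a
trace, since both are simply iterables of lines.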
140 | |||
141 | 4. Analysing Event Variances with PCL | ||
142 | ===================================== | ||
143 | |||
144 | Any workload can exhibit variances between runs and it can be important | ||
145 | to know what the standard deviation is. By and large, this is left to the | ||
146 | performance analyst to do it by hand. In the event that the discrete event | ||
147 | occurrences are useful to the performance analyst, then perf can be used. | ||
148 | |||
149 | $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
150 | -e kmem:mm_pagevec_free ./hackbench 10 | ||
151 | Time: 0.890 | ||
152 | Time: 0.895 | ||
153 | Time: 0.915 | ||
154 | Time: 1.001 | ||
155 | Time: 0.899 | ||
156 | |||
157 | Performance counter stats for './hackbench 10' (5 runs): | ||
158 | |||
159 | 16630 kmem:mm_page_alloc ( +- 3.542% ) | ||
160 | 11486 kmem:mm_page_free_direct ( +- 4.771% ) | ||
161 | 4730 kmem:mm_pagevec_free ( +- 2.325% ) | ||
162 | |||
163 | 0.982653002 seconds time elapsed ( +- 1.448% ) | ||
164 | |||
165 | In the event that some higher-level event is required that depends on some | ||
166 | aggregation of discrete events, then a script would need to be developed. | ||
167 | |||
168 | Using --repeat, it is also possible to view how events are fluctuating over | ||
169 | time on a system wide basis using -a and sleep. | ||
170 | |||
171 | $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
172 | -e kmem:mm_pagevec_free \ | ||
173 | -a --repeat 10 \ | ||
174 | sleep 1 | ||
175 | Performance counter stats for 'sleep 1' (10 runs): | ||
176 | |||
177 | 1066 kmem:mm_page_alloc ( +- 26.148% ) | ||
178 | 182 kmem:mm_page_free_direct ( +- 5.464% ) | ||
179 | 890 kmem:mm_pagevec_free ( +- 30.079% ) | ||
180 | |||
181 | 1.002251757 seconds time elapsed ( +- 0.005% ) | ||
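
When per-run figures are collected by hand rather than with --repeat, the
spread can be computed with a short script. The sketch below uses the
sample standard deviation; summarise() is a hypothetical helper, and the
percentage it prints is not claimed to match how perf derives its own
"+-" column.

```python
import statistics


def summarise(values):
    """Return (mean, sample standard deviation, deviation as % of mean)."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values) if len(values) > 1 else 0.0
    rel = 100.0 * stdev / mean if mean else 0.0
    return mean, stdev, rel


# The five hackbench times reported in the --repeat 5 run above.
times = [0.890, 0.895, 0.915, 1.001, 0.899]
mean, stdev, rel = summarise(times)
print(f"{mean:.3f}s +- {rel:.2f}%")
```

The same function works for event counts gathered from separate perf stat
invocations.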
182 | |||
183 | 5. Higher-Level Analysis with Helper Scripts | ||
184 | ============================================ | ||
185 | |||
186 | When events are enabled the events that are triggering can be read from | ||
187 | /sys/kernel/debug/tracing/trace_pipe in human-readable format although binary | ||
188 | options exist as well. By post-processing the output, further information can | ||
189 | be gathered on-line as appropriate. Examples of post-processing might include | ||
190 | |||
191 | o Reading information from /proc for the PID that triggered the event | ||
192 | o Deriving a higher-level event from a series of lower-level events. | ||
193 | o Calculate latencies between two events | ||
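
The last item, calculating latencies between two events, might look like
the following sketch. It assumes the trace has already been parsed into
(timestamp, pid, event name) tuples in time order and pairs each start
event with the next end event from the same PID; latencies() is a
hypothetical name.

```python
def latencies(events, start, end):
    """Pair start/end events per PID and return (pid, elapsed) tuples.

    events is an iterable of (timestamp, pid, event_name) in time order.
    An end event with no pending start for that PID is ignored.
    """
    pending = {}
    out = []
    for ts, pid, name in events:
        if name == start:
            pending[pid] = ts
        elif name == end and pid in pending:
            out.append((pid, ts - pending.pop(pid)))
    return out
```

A repeated start event for a PID overwrites the earlier one, so only the
most recent start is measured.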
194 | |||
195 | Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example | ||
196 | script that can read trace_pipe from STDIN or a copy of a trace. When used | ||
197 | on-line, it can be interrupted once to generate a report without exiting | ||
198 | and twice to exit. | ||
199 | |||
200 | Simplistically, the script just reads STDIN and counts up events but it | ||
201 | also can do more such as | ||
202 | |||
203 | o Derive high-level events from many low-level events. If a number of pages | ||
204 | are freed to the main allocator from the per-CPU lists, it recognises | ||
205 | that as one per-CPU drain even though there is no specific tracepoint | ||
206 | for that event | ||
207 | o It can aggregate based on PID or individual process number | ||
208 | o In the event memory is getting externally fragmented, it reports | ||
209 | on whether the fragmentation event was severe or moderate. | ||
210 | o When receiving an event about a PID, it can record who the parent was so | ||
211 | that if large numbers of events are coming from very short-lived | ||
212 | processes, the parent process responsible for creating all the helpers | ||
213 | can be identified | ||
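
The first two capabilities in this list can be sketched briefly. This is
not how trace-pagealloc-postprocess.pl is implemented; it is a simplified
illustration that assumes a run of consecutive drain events from one PID
counts as a single drain, and that the drain tracepoint is named
mm_page_pcpu_drain. count_with_drains() is a hypothetical name.

```python
from collections import Counter


def count_with_drains(events, drain_event="mm_page_pcpu_drain"):
    """Aggregate events per PID and derive one 'drain' per run of drain events.

    events is an iterable of (pid, event_name) in trace order.
    Returns (per_pid, drains): per_pid counts every (pid, event_name),
    drains counts derived high-level drains per pid.
    """
    per_pid = Counter()
    drains = Counter()
    prev = None
    for pid, name in events:
        per_pid[(pid, name)] += 1
        # A new drain starts when the previous line was not the same
        # drain event from the same PID.
        if name == drain_event and prev != (pid, name):
            drains[pid] += 1
        prev = (pid, name)
    return per_pid, drains
```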
214 | |||
215 | 6. Lower-Level Analysis with PCL | ||
216 | ================================ | ||
217 | |||
218 | There may also be a requirement to identify what functions within a program | ||
219 | were generating events within the kernel. To begin this sort of analysis, the | ||
220 | data must be recorded. At the time of writing, this required root: | ||
221 | |||
222 | $ perf record -c 1 \ | ||
223 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
224 | -e kmem:mm_pagevec_free \ | ||
225 | ./hackbench 10 | ||
226 | Time: 0.894 | ||
227 | [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ] | ||
228 | |||
229 | Note the use of '-c 1' to set the event period to sample. The default sample | ||
230 | period is quite high to minimise overhead but the information collected can be | ||
231 | very coarse as a result. | ||
232 | |||
233 | This record outputted a file called perf.data which can be analysed using | ||
234 | perf report. | ||
235 | |||
236 | $ perf report | ||
237 | # Samples: 30922 | ||
238 | # | ||
239 | # Overhead Command Shared Object | ||
240 | # ........ ......... ................................ | ||
241 | # | ||
242 | 87.27% hackbench [vdso] | ||
243 | 6.85% hackbench /lib/i686/cmov/libc-2.9.so | ||
244 | 2.62% hackbench /lib/ld-2.9.so | ||
245 | 1.52% perf [vdso] | ||
246 | 1.22% hackbench ./hackbench | ||
247 | 0.48% hackbench [kernel] | ||
248 | 0.02% perf /lib/i686/cmov/libc-2.9.so | ||
249 | 0.01% perf /usr/bin/perf | ||
250 | 0.01% perf /lib/ld-2.9.so | ||
251 | 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so | ||
252 | # | ||
253 | # (For more details, try: perf report --sort comm,dso,symbol) | ||
254 | # | ||
255 | |||
256 | According to this, the vast majority of events were triggered within | ||
257 | the VDSO. With simple binaries, this will often be the case so let's | ||
258 | take a slightly different example. In the course of writing this, it was | ||
259 | noticed that X was generating an insane amount of page allocations so let's look | ||
260 | at it: | ||
261 | |||
262 | $ perf record -c 1 -f \ | ||
263 | -e kmem:mm_page_alloc -e kmem:mm_page_free_direct \ | ||
264 | -e kmem:mm_pagevec_free \ | ||
265 | -p `pidof X` | ||
266 | |||
267 | This was interrupted after a few seconds and | ||
268 | |||
269 | $ perf report | ||
270 | # Samples: 27666 | ||
271 | # | ||
272 | # Overhead Command Shared Object | ||
273 | # ........ ....... ....................................... | ||
274 | # | ||
275 | 51.95% Xorg [vdso] | ||
276 | 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 | ||
277 | 0.09% Xorg /lib/i686/cmov/libc-2.9.so | ||
278 | 0.01% Xorg [kernel] | ||
279 | # | ||
280 | # (For more details, try: perf report --sort comm,dso,symbol) | ||
281 | # | ||
282 | |||
283 | So, almost half of the events are occurring in a library. To get an idea of | ||
284 | which symbol: | ||
285 | |||
286 | $ perf report --sort comm,dso,symbol | ||
287 | # Samples: 27666 | ||
288 | # | ||
289 | # Overhead Command Shared Object Symbol | ||
290 | # ........ ....... ....................................... ...... | ||
291 | # | ||
292 | 51.95% Xorg [vdso] [.] 0x000000ffffe424 | ||
293 | 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2 | ||
294 | 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc | ||
295 | 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f | ||
296 | 0.01% Xorg [kernel] [k] read_hpet | ||
297 | 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path | ||
298 | 0.00% Xorg [kernel] [k] ftrace_trace_userstack | ||
299 | |||
300 | To see where within the function pixmanFillsse2 things are going wrong | ||
301 | |||
302 | $ perf annotate pixmanFillsse2 | ||
303 | [ ... ] | ||
304 | 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax) | ||
305 | : } | ||
306 | : | ||
307 | : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _ | ||
308 | : _mm_store_si128 (__m128i *__P, __m128i __B) : { | ||
309 | : *__P = __B; | ||
310 | 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax) | ||
311 | 0.00 : 34ef5: ff | ||
312 | 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax) | ||
313 | 0.00 : 34efd: ff | ||
314 | 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax) | ||
315 | 0.00 : 34f05: ff | ||
316 | 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax) | ||
317 | 0.00 : 34f0d: ff | ||
318 | 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax) | ||
319 | 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax) | ||
320 | 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax) | ||
321 | 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax) | ||
322 | |||
323 | At a glance, it looks like the time is being spent copying pixmaps to | ||
324 | the card. Further investigation would be needed to determine why pixmaps | ||
325 | are being copied around so much but a starting point would be to take an | ||
326 | ancient build of libpixman out of the library path where it was totally | ||
327 | forgotten about from months ago! | ||