diff options
author | Stephane Eranian <eranian@google.com> | 2011-08-25 09:58:03 -0400 |
---|---|---|
committer | Ingo Molnar <mingo@elte.hu> | 2011-08-29 06:28:33 -0400 |
commit | a8d757ef076f0f95f13a918808824058de25b3eb (patch) | |
tree | 3c1151ef886d9b72d0a7b7b267d9f37c72d5f475 /kernel/sched.c | |
parent | c6a389f123b9f68d605bb7e0f9b32ec1e3e14132 (diff) |
perf events: Fix slow and broken cgroup context switch code
The current cgroup context switch code was incorrect leading
to bogus counts. Furthermore, as soon as there was an active
cgroup event on a CPU, the context switch cost on that CPU
would increase by a significant amount as demonstrated by a
simple ping/pong example:
$ ./pong
Both processes pinned to CPU1, running for 10s
10684.51 ctxsw/s
Now start a cgroup perf stat:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
$ ./pong
Both processes pinned to CPU1, running for 10s
6674.61 ctxsw/s
That's a 37% penalty.
Note that pong is not even in the monitored cgroup.
The results shown by perf stat are bogus:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 100
Performance counter stats for 'sleep 100':
CPU1 <not counted> cycles test
CPU1 16,984,189,138 cycles # 0.000 GHz
The second 'cycles' event should report a count @ CPU clock
(here 2.4GHz) as it is counting across all cgroups.
The patch below fixes the bogus accounting and bypasses any
cgroup switches in case the outgoing and incoming tasks are
in the same cgroup.
With this patch the same test now yields:
$ ./pong
Both processes pinned to CPU1, running for 10s
10775.30 ctxsw/s
Start perf stat with cgroup:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Run pong outside the cgroup:
$ /pong
Both processes pinned to CPU1, running for 10s
10687.80 ctxsw/s
The penalty is now less than 2%.
And the results for perf stat are correct:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 <not counted> cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz
Now perf stat reports the correct counts for
for the non cgroup event.
If we run pong inside the cgroup, then we also get the
correct counts:
$ perf stat -e cycles,cycles -A -a -G test -C 1 -- sleep 10
Performance counter stats for 'sleep 10':
CPU1 22,297,726,205 cycles test # 0.000 GHz
CPU1 23,933,981,448 cycles # 0.000 GHz
10.001457237 seconds time elapsed
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Diffstat (limited to 'kernel/sched.c')
-rw-r--r-- | kernel/sched.c | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/kernel/sched.c b/kernel/sched.c index ccacdbdecf45..0408cdc6d572 100644 --- a/kernel/sched.c +++ b/kernel/sched.c | |||
@@ -3065,7 +3065,7 @@ static void finish_task_switch(struct rq *rq, struct task_struct *prev) | |||
3065 | #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW | 3065 | #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW |
3066 | local_irq_disable(); | 3066 | local_irq_disable(); |
3067 | #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ | 3067 | #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ |
3068 | perf_event_task_sched_in(current); | 3068 | perf_event_task_sched_in(prev, current); |
3069 | #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW | 3069 | #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW |
3070 | local_irq_enable(); | 3070 | local_irq_enable(); |
3071 | #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ | 3071 | #endif /* __ARCH_WANT_INTERRUPTS_ON_CTXSW */ |