author | Peter Zijlstra <peterz@infradead.org> | 2015-09-29 08:45:09 -0400 |
---|---|---|
committer | Ingo Molnar <mingo@kernel.org> | 2015-10-06 11:05:17 -0400 |
commit | 95913d97914f44db2b81271c2e2ebd4d2ac2df83 (patch) | |
tree | d29d5b8aa7e0815068a39d4303447a2f2258464d /kernel | |
parent | 049e6dde7e57f0054fdc49102e7ef4830c698b46 (diff) |
sched/core: Fix TASK_DEAD race in finish_task_switch()
So the problem this patch is trying to address is as follows:

	CPU0				CPU1

	context_switch(A, B)
					ttwu(A)
					  LOCK A->pi_lock
					  A->on_cpu == 0
	finish_task_switch(A)
	  prev_state = A->state  <-.
	  WMB                      |
	  A->on_cpu = 0;           |
	  UNLOCK rq0->lock         |
				   |	context_switch(C, A)
				   `--	  A->state = TASK_DEAD
	  prev_state == TASK_DEAD
	    put_task_struct(A)
					context_switch(A, C)
					finish_task_switch(A)
					  A->state == TASK_DEAD
					    put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.
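
To make the double drop concrete, the interleaving can be replayed deterministically in plain C. This is only a single-threaded model under simplifying assumptions, not kernel code: `struct task`, its `usage` counter, `put_task()`, `replay()`, `check_replay()` and the enum values are all hypothetical stand-ins, and the `late_load` flag simply models the load of A->state completing after the store to A->on_cpu.

```c
#include <assert.h>

/* Hypothetical state values, chosen for illustration only. */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_DEAD };

/* Hypothetical, stripped-down stand-in for struct task_struct. */
struct task {
	enum task_state state;
	int on_cpu;
	int usage;	/* the one reference the scheduler must drop at exit */
};

void put_task(struct task *t)
{
	t->usage--;	/* the real put_task_struct() frees the task at zero */
}

/*
 * Replay the two-CPU schedule from the diagram in a fixed order.
 * late_load != 0 models the load of A->state drifting past the
 * store to A->on_cpu. Returns the final reference count.
 */
int replay(int late_load)
{
	struct task a = { .state = TASK_INTERRUPTIBLE, .on_cpu = 1, .usage = 1 };
	enum task_state prev_state = TASK_INTERRUPTIBLE;

	/* CPU0: finish_task_switch(A) */
	if (!late_load)
		prev_state = a.state;	/* load performed in program order */
	a.on_cpu = 0;			/* CPU1's wakeup may now proceed */

	/* CPU1: ttwu(A), A runs, then exits */
	a.state = TASK_RUNNING;
	a.state = TASK_DEAD;

	/* CPU0: the delayed load completes and sees CPU1's store */
	if (late_load)
		prev_state = a.state;
	if (prev_state == TASK_DEAD)
		put_task(&a);		/* bogus drop */

	/* CPU1: finish_task_switch(A) after A's final schedule() */
	if (a.state == TASK_DEAD)
		put_task(&a);		/* legitimate drop */

	return a.usage;
}

/* Sanity check of both schedules. */
void check_replay(void)
{
	assert(replay(0) == 0);		/* in-order load: exactly one drop */
	assert(replay(1) == -1);	/* late load: reference dropped twice */
}
```

In the model a count of -1 stands for the use-after-free: the second drop touches an object that the first drop already released.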
Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.
We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.
The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.
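
For readers more familiar with C11 atomics than with kernel barriers, the ordering the release store provides can be sketched in userspace. This is an analogy under stated assumptions, not the kernel implementation: `struct task`, `waker_then_exit()`, `handoff_once()` and the enum values are hypothetical, POSIX threads stand in for CPUs, and the acquire load on the waiting side is the nearest C11 equivalent of the control dependency plus rmb in try_to_wake_up().

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical state values, chosen for illustration only. */
enum { TASK_INTERRUPTIBLE = 1, TASK_DEAD = 2 };

/* Hypothetical, stripped-down stand-in for struct task_struct. */
struct task {
	atomic_int state;
	atomic_int on_cpu;
};

/* Plays CPU1: wait until the task is off its old CPU, then let it die. */
static void *waker_then_exit(void *arg)
{
	struct task *t = arg;

	/* Acquire pairs with the release store in handoff_once(). */
	while (atomic_load_explicit(&t->on_cpu, memory_order_acquire))
		;
	atomic_store_explicit(&t->state, TASK_DEAD, memory_order_relaxed);
	return NULL;
}

/* Plays CPU0: sample the state, then hand the task off. Returns prev_state. */
int handoff_once(void)
{
	struct task t;
	pthread_t cpu1;
	int prev_state;

	atomic_init(&t.state, TASK_INTERRUPTIBLE);
	atomic_init(&t.on_cpu, 1);
	if (pthread_create(&cpu1, NULL, waker_then_exit, &t) != 0)
		return -1;

	prev_state = atomic_load_explicit(&t.state, memory_order_relaxed);
	/* Release: the load above cannot be reordered after this store. */
	atomic_store_explicit(&t.on_cpu, 0, memory_order_release);

	pthread_join(cpu1, NULL);
	assert(atomic_load(&t.state) == TASK_DEAD);
	return prev_state;
}
```

Because the load of `state` is sequenced before the release store, and the other thread writes `TASK_DEAD` only after its acquire load has seen `on_cpu == 0`, `handoff_once()` can never return `TASK_DEAD`. A store-only barrier gives no such guarantee for the earlier load.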
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v3.1+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'kernel')
-rw-r--r-- | kernel/sched/core.c | 10 |
-rw-r--r-- | kernel/sched/sched.h | 5 |
2 files changed, 8 insertions, 7 deletions
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 615953141951..10a8faa1b0d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2517,11 +2517,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could rave with its RUNNING -> DEAD
+	 * transition, resulting in a double drop.
 	 */
 	prev_state = prev->state;
 	vtime_task_switch(prev);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 68cda117574c..6d2a119c7ad9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,9 +1078,10 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
-	smp_wmb();
-	prev->on_cpu = 0;
+	smp_store_release(&prev->on_cpu, 0);
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */