author    Peter Zijlstra <peterz@infradead.org>  2015-09-29 08:45:09 -0400
committer Ingo Molnar <mingo@kernel.org>  2015-10-06 11:05:17 -0400
commit    95913d97914f44db2b81271c2e2ebd4d2ac2df83
tree      d29d5b8aa7e0815068a39d4303447a2f2258464d
parent    049e6dde7e57f0054fdc49102e7ef4830c698b46

sched/core: Fix TASK_DEAD race in finish_task_switch()

So the problem this patch is trying to address is as follows:

        CPU0                            CPU1

        context_switch(A, B)
                                        ttwu(A)
                                          LOCK A->pi_lock
                                          A->on_cpu == 0
        finish_task_switch(A)
          prev_state = A->state  <-.
          WMB                      |
          A->on_cpu = 0;           |
          UNLOCK rq0->lock         |
                                   |    context_switch(C, A)
                                   `--  A->state = TASK_DEAD
          prev_state == TASK_DEAD
            put_task_struct(A)
                                        context_switch(A, C)
                                        finish_task_switch(A)
                                          A->state == TASK_DEAD
                                            put_task_struct(A)

The argument being that the WMB will allow the load of A->state on CPU0
to cross over and observe CPU1's store of A->state, which will then
result in a double-drop and use-after-free.

Now the comment states (and this was true once upon a long time ago)
that we need to observe A->state while holding rq->lock because that
will order us against the wakeup; however the wakeup will not in fact
acquire (that) rq->lock; it takes A->pi_lock these days.

We can obviously fix this by upgrading the WMB to an MB, but that is
expensive, so we'd rather avoid that.

The alternative this patch takes is: smp_store_release(&A->on_cpu, 0),
which avoids the MB on some archs, but not important ones like ARM.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: <stable@vger.kernel.org> # v3.1+
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Cc: manfred@colorfullife.com
Cc: will.deacon@arm.com
Fixes: e4a52bcb9a18 ("sched: Remove rq->lock from the first half of ttwu()")
Link: http://lkml.kernel.org/r/20150929124509.GG3816@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>

Diffstat (limited to 'kernel')
-rw-r--r-- kernel/sched/core.c  | 10
-rw-r--r-- kernel/sched/sched.h |  5
2 files changed, 8 insertions, 7 deletions

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 615953141951..10a8faa1b0d4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2517,11 +2517,11 @@ static struct rq *finish_task_switch(struct task_struct *prev)
 	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
 	 * schedule one last time. The schedule call will never return, and
 	 * the scheduled task must drop that reference.
-	 * The test for TASK_DEAD must occur while the runqueue locks are
-	 * still held, otherwise prev could be scheduled on another cpu, die
-	 * there before we look at prev->state, and then the reference would
-	 * be dropped twice.
-	 *		Manfred Spraul <manfred@colorfullife.com>
+	 *
+	 * We must observe prev->state before clearing prev->on_cpu (in
+	 * finish_lock_switch), otherwise a concurrent wakeup can get prev
+	 * running on another CPU and we could rave with its RUNNING -> DEAD
+	 * transition, resulting in a double drop.
 	 */
 	prev_state = prev->state;
 	vtime_task_switch(prev);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 68cda117574c..6d2a119c7ad9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1078,9 +1078,10 @@ static inline void finish_lock_switch(struct rq *rq, struct task_struct *prev)
 	 * After ->on_cpu is cleared, the task can be moved to a different CPU.
 	 * We must ensure this doesn't happen until the switch is completely
 	 * finished.
+	 *
+	 * Pairs with the control dependency and rmb in try_to_wake_up().
 	 */
-	smp_wmb();
-	prev->on_cpu = 0;
+	smp_store_release(&prev->on_cpu, 0);
 #endif
 #ifdef CONFIG_DEBUG_SPINLOCK
 	/* this is a valid case when another task releases the spinlock */
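
The ordering the patch relies on can be sketched in userspace with C11 atomics. This is only a model, not kernel code: struct model_task, MODEL_RUNNING, MODEL_DEAD, run_model() and both thread functions are hypothetical names made up for the illustration. A memory_order_release store stands in for smp_store_release(), and a memory_order_acquire load is assumed as a stand-in for the control dependency and rmb in try_to_wake_up(). The point is that the load of ->state is sequenced before the release store of ->on_cpu, so a waker that sees on_cpu == 0 cannot have its later store to ->state observed by that load. A write-only barrier such as smp_wmb() orders stores against stores and gives no such guarantee for the load.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical stand-ins for TASK_RUNNING / TASK_DEAD; values are arbitrary. */
enum { MODEL_RUNNING = 0, MODEL_DEAD = 1 };

/* Hypothetical model of the two task_struct fields involved in the race. */
struct model_task {
	atomic_int state;
	atomic_int on_cpu;
	int prev_state;		/* what "CPU0" sampled; read only after join */
};

/* "CPU0": sample ->state, then publish ->on_cpu = 0 with release semantics. */
static void *model_finish_task_switch(void *arg)
{
	struct model_task *prev = arg;

	prev->prev_state = atomic_load_explicit(&prev->state,
						memory_order_relaxed);
	/* Models smp_store_release(&prev->on_cpu, 0). */
	atomic_store_explicit(&prev->on_cpu, 0, memory_order_release);
	return NULL;
}

/* "CPU1": wait until ->on_cpu is clear, then let the task run and die. */
static void *model_wake_and_die(void *arg)
{
	struct model_task *t = arg;

	/* Acquire load: assumed stand-in for the control dependency + rmb. */
	while (atomic_load_explicit(&t->on_cpu, memory_order_acquire))
		;
	atomic_store_explicit(&t->state, MODEL_DEAD, memory_order_relaxed);
	return NULL;
}

/* Runs both sides once; returns the state "CPU0" sampled, or -1 on error. */
int run_model(void)
{
	struct model_task a;
	pthread_t cpu0, cpu1;

	atomic_init(&a.state, MODEL_RUNNING);
	atomic_init(&a.on_cpu, 1);
	a.prev_state = -1;

	if (pthread_create(&cpu1, NULL, model_wake_and_die, &a))
		return -1;
	if (pthread_create(&cpu0, NULL, model_finish_task_switch, &a)) {
		atomic_store(&a.on_cpu, 0);	/* unblock cpu1 before bailing out */
		pthread_join(cpu1, NULL);
		return -1;
	}
	pthread_join(cpu0, NULL);
	pthread_join(cpu1, NULL);

	/* Both threads are done, so the release store must be visible here. */
	assert(atomic_load(&a.on_cpu) == 0);
	return a.prev_state;
}
```

Built with something like cc -std=c11 -pthread and called from a small main(), run_model() should return MODEL_RUNNING on every run, since the release/acquire pair orders the sampling of ->state before the waker's store. The model shows the guaranteed ordering only; it does not reproduce the original double drop.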