author:    Linus Torvalds <torvalds@linux-foundation.org>  2007-06-18 12:34:40 -0400
committer: Linus Torvalds <torvalds@woody.linux-foundation.org>  2007-06-18 14:52:55 -0400
commit:    fa490cfd15d7ce0900097cc4e60cfd7a76381138
tree:      37c0294ed6f6f9e0362db974c4136979a37d9ecd /kernel/sched.c
parent:    a0f98a1cb7d27c656de450ba56efd31bdc59065e
Fix possible runqueue lock starvation in wait_task_inactive()
Miklos Szeredi reported very long pauses (several seconds, sometimes more)
on his T60 (with a Core2Duo) which he managed to track down to
wait_task_inactive()'s open-coded busy-loop. He observed that an interrupt
on one core tries to acquire the runqueue-lock but does not succeed in
doing so for a very long time - while wait_task_inactive() on the other
core loops waiting for the first core to deschedule a task (which it won't
do while spinning in an interrupt handler).

This rewrites wait_task_inactive() to do all its waiting optimistically
without any locks taken at all, and then just double-check the end result
with the proper runqueue lock held over just a very short section. If
there were races in the optimistic wait, or a preemption event scheduled
the process away, we simply re-synchronize, and start over.

So the code now looks like this:

	repeat:
		/* Unlocked, optimistic looping! */
		rq = task_rq(p);
		while (task_running(rq, p))
			cpu_relax();

		/* Get the *real* values */
		rq = task_rq_lock(p, &flags);
		running = task_running(rq, p);
		array = p->array;
		task_rq_unlock(rq, &flags);

		/* Check them.. */
		if (unlikely(running)) {
			cpu_relax();
			goto repeat;
		}

		/* Preempted away? Yield if so.. */
		if (unlikely(array)) {
			yield();
			goto repeat;
		}

Basically, that first "while()" loop is done entirely without any locking
at all (and doesn't check for the case where the target process might have
been preempted away), and so it's possibly "incorrect", but we don't
really care. Both the runqueue used and the "task_running()" check might
be the wrong tests, but they won't oops - they just mean that we could
possibly get the wrong results due to lack of locking and exit the loop
early in the case of a race condition.

So once we've exited the loop, we then get the proper (and careful) rq
lock, and check the running/runnable state _safely_. And if it turns out
that our quick-and-dirty and unsafe loop was wrong after all, we just go
back and try it all again.

(The patch also adds a lot of comments, which is the actual bulk of it
all, to make it more obvious why we can do these things without holding
the locks.)

Thanks to Miklos for all the testing and tracking it down.

Tested-by: Miklos Szeredi <miklos@szeredi.hu>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
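The same shape - an optimistic unlocked check, then a short locked
re-check, retrying on any race - is useful outside the scheduler too.
As a rough userspace analogue (a minimal sketch, not kernel code:
"struct worker", its fields, and wait_worker_inactive() are all
hypothetical names), the pattern looks like this with pthreads and C11
atomics:

	#include <pthread.h>
	#include <sched.h>
	#include <stdatomic.h>
	#include <stdbool.h>

	/* Hypothetical shared state, standing in for the runqueue fields. */
	struct worker {
		pthread_mutex_t lock;	/* plays the role of the runqueue lock */
		atomic_bool running;	/* task_running() analogue */
		atomic_bool queued;	/* p->array analogue: runnable, maybe preempted */
	};

	/* Wait until the worker is neither running nor queued. */
	void wait_worker_inactive(struct worker *w)
	{
		bool running, queued;

	repeat:
		/*
		 * Unlocked, optimistic spin. A stale read here cannot
		 * crash anything; at worst we leave the loop early and
		 * the locked re-check below sends us back.
		 */
		while (atomic_load(&w->running))
			sched_yield();	/* userspace stand-in for cpu_relax() */

		/* Take the lock and sample the *real* values. */
		pthread_mutex_lock(&w->lock);
		running = atomic_load(&w->running);
		queued = atomic_load(&w->queued);
		pthread_mutex_unlock(&w->lock);

		/* The optimistic spin raced with a wakeup: start over. */
		if (running)
			goto repeat;

		/* Not running but still queued ("preempted"): give it time. */
		if (queued) {
			sched_yield();
			goto repeat;
		}
		/*
		 * Neither running nor queued - done, assuming (as in the
		 * kernel case) nothing re-queues the worker behind our back.
		 */
	}

Note that the lock is held only for the two reads, so any contended
acquirer - the interrupt handler in Miklos's trace - waits for at most
that short critical section.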
Diffstat (limited to 'kernel/sched.c')
-rw-r--r--  kernel/sched.c  69
1 file changed, 60 insertions(+), 9 deletions(-)
diff --git a/kernel/sched.c b/kernel/sched.c
index 49be34e1f0b8..a7475913b009 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1159,21 +1159,72 @@ void wait_task_inactive(struct task_struct *p)
 {
 	unsigned long flags;
 	struct rq *rq;
-	int preempted;
+	struct prio_array *array;
+	int running;
 
 repeat:
+	/*
+	 * We do the initial early heuristics without holding
+	 * any task-queue locks at all. We'll only try to get
+	 * the runqueue lock when things look like they will
+	 * work out!
+	 */
+	rq = task_rq(p);
+
+	/*
+	 * If the task is actively running on another CPU
+	 * still, just relax and busy-wait without holding
+	 * any locks.
+	 *
+	 * NOTE! Since we don't hold any locks, it's not
+	 * even sure that "rq" stays as the right runqueue!
+	 * But we don't care, since "task_running()" will
+	 * return false if the runqueue has changed and p
+	 * is actually now running somewhere else!
+	 */
+	while (task_running(rq, p))
+		cpu_relax();
+
+	/*
+	 * Ok, time to look more closely! We need the rq
+	 * lock now, to be *sure*. If we're wrong, we'll
+	 * just go back and repeat.
+	 */
 	rq = task_rq_lock(p, &flags);
-	/* Must be off runqueue entirely, not preempted. */
-	if (unlikely(p->array || task_running(rq, p))) {
-		/* If it's preempted, we yield. It could be a while. */
-		preempted = !task_running(rq, p);
-		task_rq_unlock(rq, &flags);
+	running = task_running(rq, p);
+	array = p->array;
+	task_rq_unlock(rq, &flags);
+
+	/*
+	 * Was it really running after all now that we
+	 * checked with the proper locks actually held?
+	 *
+	 * Oops. Go back and try again..
+	 */
+	if (unlikely(running)) {
 		cpu_relax();
-		if (preempted)
-			yield();
 		goto repeat;
 	}
-	task_rq_unlock(rq, &flags);
+
+	/*
+	 * It's not enough that it's not actively running,
+	 * it must be off the runqueue _entirely_, and not
+	 * preempted!
+	 *
+	 * So if it was still runnable (but just not actively
+	 * running right now), it's preempted, and we should
+	 * yield - it could be a while.
+	 */
+	if (unlikely(array)) {
+		yield();
+		goto repeat;
+	}
+
+	/*
+	 * Ahh, all good. It wasn't running, and it wasn't
+	 * runnable, which means that it will never become
+	 * running in the future either. We're all done!
+	 */
 }
 
 /***
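For the curious, the starvation itself is easy to demonstrate in
userspace: mutexes make no fairness promise, so a thread that drops and
immediately re-takes a lock in a tight loop - the way the old
wait_task_inactive() took and released the runqueue lock - can keep
beating a second thread to it. A minimal sketch (hypothetical demo code;
the hog/victim names are invented, and how badly the victim loses varies
with platform and libc):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdio.h>

	static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
	static atomic_long reacquisitions;
	static atomic_bool victim_done;

	/* Take and drop the lock in a tight loop, like the old busy-loop. */
	static void *hog(void *unused)
	{
		while (!atomic_load(&victim_done)) {
			pthread_mutex_lock(&lock);
			atomic_fetch_add(&reacquisitions, 1);
			pthread_mutex_unlock(&lock);
			/* No pause here: immediately re-contend. */
		}
		return NULL;
	}

	/* A single acquisition, like the interrupt handler in the report. */
	static void *victim(void *unused)
	{
		pthread_mutex_lock(&lock);
		pthread_mutex_unlock(&lock);
		atomic_store(&victim_done, true);
		return NULL;
	}

	int main(void)
	{
		pthread_t h, v;

		pthread_create(&h, NULL, hog, NULL);
		pthread_create(&v, NULL, victim, NULL);
		pthread_join(v, NULL);
		pthread_join(h, NULL);
		printf("hog re-took the lock %ld times before the victim got in\n",
		       atomic_load(&reacquisitions));
		return 0;
	}

The kernel case was worse than anything this can show, because the
victim was an interrupt handler spinning for the lock while the hog's
loop was itself waiting for that handler's core to deschedule a task -
which it cannot do while it is still spinning.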