-rw-r--r--  kernel/sched/rt.c  81
1 files changed, 81 insertions, 0 deletions
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 9f3e40226dec..979b7341008a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1927,6 +1927,87 @@ static int find_next_push_cpu(struct rq *rq)
 #define RT_PUSH_IPI_EXECUTING	1
 #define RT_PUSH_IPI_RESTART	2
 
+/*
+ * When a high priority task schedules out from a CPU and a lower priority
+ * task is scheduled in, a check is made to see if there are any RT tasks
+ * on other CPUs that are waiting to run because a higher priority RT task
+ * is currently running on its CPU. In this case, the CPU with multiple RT
+ * tasks queued on it (overloaded) needs to be notified that a CPU has opened
+ * up that may be able to run one of its non-running queued RT tasks.
+ *
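A minimal user-space model of that check, as a sketch only: the names here (struct cpu_state, check_for_waiting_rt, notify_overloaded_cpu) are made up for illustration, and in the kernel this decision is made inside the scheduler's context switch path rather than by an open-coded loop.

#include <stdbool.h>

struct cpu_state {
	int  curr_prio;		/* priority of the task now running */
	int  prev_prio;		/* priority of the task that just left */
	bool rt_overloaded;	/* more than one runnable RT task queued */
};

static void notify_overloaded_cpu(int overloaded_cpu, int open_cpu)
{
	/* Stand-in for the IPI described above. */
	(void)overloaded_cpu;
	(void)open_cpu;
}

static void check_for_waiting_rt(struct cpu_state *cpus, int ncpus, int this_cpu)
{
	/*
	 * In this toy model a larger number means a higher priority. The check
	 * only matters when this CPU just dropped to a lower priority task.
	 */
	if (cpus[this_cpu].curr_prio >= cpus[this_cpu].prev_prio)
		return;

	for (int cpu = 0; cpu < ncpus; cpu++) {
		if (cpu != this_cpu && cpus[cpu].rt_overloaded) {
			/* Notify only the first (next) overloaded CPU found. */
			notify_overloaded_cpu(cpu, this_cpu);
			break;
		}
	}
}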
+ * On large CPU boxes, there's the case that several CPUs could schedule
+ * a lower priority task at the same time, in which case each of them will
+ * look for any overloaded CPUs that it could pull a task from. To do this,
+ * the runqueue lock must be taken from that overloaded CPU. Having 10s of
+ * CPUs all fighting for a single overloaded CPU's runqueue lock can produce
+ * a large latency.
+ * (This has actually been observed on large boxes running cyclictest).
+ * Instead of taking the runqueue lock of the overloaded CPU, each of the
+ * CPUs that scheduled a lower priority task simply sends an IPI to the
+ * overloaded CPU. An IPI is much cheaper than taking a runqueue lock with
+ * lots of contention. The overloaded CPU will look to push its non-running
+ * RT task off, and if it does, it can then ignore the other IPIs coming
+ * in, and just pass those IPIs off to any other overloaded CPU.
+ *
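For contrast, a toy model of the lock-based pull whose contention is described above; struct model_rq and pull_one_task are invented names, and the pthread mutex merely stands in for the per-CPU runqueue lock that many CPUs would otherwise fight over.

#include <pthread.h>

struct model_rq {
	pthread_mutex_t lock;	/* stands in for the per-CPU runqueue lock */
	int nr_queued_rt;	/* RT tasks queued but not running */
};

static int pull_one_task(struct model_rq *busiest)
{
	int pulled = 0;

	/* Every CPU that just scheduled a lower priority task serializes here. */
	pthread_mutex_lock(&busiest->lock);
	if (busiest->nr_queued_rt > 0) {
		busiest->nr_queued_rt--;
		pulled = 1;
	}
	pthread_mutex_unlock(&busiest->lock);
	return pulled;
}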
+ * When a CPU schedules a lower priority task, it only sends an IPI to
+ * the "next" CPU that has overloaded RT tasks. This prevents IPI storms,
+ * as having 10 CPUs scheduling lower priority tasks and 10 CPUs with
+ * RT overloaded tasks would cause 100 IPIs to go out at once.
+ *
+ * The overloaded RT CPU, when receiving an IPI, will try to push off its
+ * overloaded RT tasks and then send an IPI to the next CPU that has
+ * overloaded RT tasks. This stops when all CPUs with overloaded RT tasks
+ * have completed. Just because a CPU may have pushed off its own overloaded
+ * RT task does not mean it should stop sending the IPI around to other
+ * overloaded CPUs. There may be another RT task waiting to run on one of
+ * those CPUs that is of higher priority than the one that was just
+ * pushed.
+ *
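A user-space sketch of that handler flow, under the assumption that a simple boolean array and the helpers below (try_to_push_one_rt_task, send_push_ipi) stand in for the real runqueue state and IPI machinery; the final test models stopping once the scan has wrapped back past the CPU that started the round.

#include <stdbool.h>

#define NCPUS 8

static bool rt_overloaded[NCPUS];	/* CPUs with more than one runnable RT task */

static void try_to_push_one_rt_task(int cpu) { (void)cpu; }		/* push stand-in */
static void send_push_ipi(int cpu, int src) { (void)cpu; (void)src; }	/* IPI stand-in */

/* Scan forward from 'after' with wrap-around; -1 means no overloaded CPU left. */
static int next_overloaded_cpu(int after)
{
	for (int i = 1; i <= NCPUS; i++) {
		int cpu = (after + i) % NCPUS;
		if (rt_overloaded[cpu])
			return cpu;
	}
	return -1;
}

/* Model of the IPI handler: push one task if possible, then pass the IPI along. */
static void handle_push_ipi(int this_cpu, int src_cpu)
{
	try_to_push_one_rt_task(this_cpu);

	int next = next_overloaded_cpu(this_cpu);

	/* Stop once the scan has wrapped back past the CPU that started the round. */
	if (next < 0 ||
	    (next - src_cpu + NCPUS) % NCPUS <= (this_cpu - src_cpu + NCPUS) % NCPUS)
		return;
	send_push_ipi(next, src_cpu);
}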
+ * An optimization that could possibly be made is to make a CPU array similar
+ * to the cpupri array mask of all running RT tasks, but for the overloaded
+ * case. Then the IPI could be sent to only the CPU with the highest priority
+ * RT task waiting, and that CPU could send off further IPIs to the CPU with
+ * the next highest waiting task. Since the overloaded case is much less likely
+ * to happen, the complexity of this implementation may not be worth it.
+ * Instead, just send an IPI around to all overloaded CPUs.
+ *
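The optimization above is explicitly not implemented; purely as an illustration of the idea, a per-priority map of overloaded CPUs might look roughly like this (higher index means higher priority in this toy model, which is not the kernel's internal convention):

#include <stdbool.h>

#define NR_RT_PRIO	100
#define NCPUS		8

struct overload_pri_map {
	/* overloaded[prio][cpu]: cpu has a waiting RT task at this priority */
	bool overloaded[NR_RT_PRIO][NCPUS];
};

/* Return the CPU with the highest priority waiting RT task, or -1 if none. */
static int highest_prio_overloaded_cpu(const struct overload_pri_map *m)
{
	for (int prio = NR_RT_PRIO - 1; prio >= 0; prio--)
		for (int cpu = 0; cpu < NCPUS; cpu++)
			if (m->overloaded[prio][cpu])
				return cpu;
	return -1;
}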
+ * The rq->rt.push_flags holds the status of the IPI that is going around.
+ * A run queue can only send out a single IPI at a time. The possible flags
+ * for rq->rt.push_flags are:
+ *
+ *    (None or zero):		No IPI is going around for the current rq
+ *    RT_PUSH_IPI_EXECUTING:	An IPI for the rq is being passed around
+ *    RT_PUSH_IPI_RESTART:	The priority of the running task for the rq
+ *				has changed, and the IPI should restart
+ *				circulating the overloaded CPUs again.
+ *
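A small model of those states, as a sketch: the flag values match the #defines visible at the top of this hunk, but the structure and helper below are illustrative rather than the patch's actual code.

#define RT_PUSH_IPI_EXECUTING	1
#define RT_PUSH_IPI_RESTART	2

struct model_rt_rq {
	int push_flags;		/* 0, EXECUTING, or EXECUTING | RESTART */
	int push_cpu;		/* CPU the IPI is currently being sent to */
};

/* Called when this rq schedules a lower priority task. */
static int start_or_restart_ipi(struct model_rt_rq *rt)
{
	if (rt->push_flags & RT_PUSH_IPI_EXECUTING) {
		/* An IPI is already circulating: ask it to start over. */
		rt->push_flags |= RT_PUSH_IPI_RESTART;
		return 0;	/* no new IPI needs to be sent */
	}
	rt->push_flags = RT_PUSH_IPI_EXECUTING;
	return 1;		/* caller sends the first IPI and sets push_cpu */
}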
+ * rq->rt.push_cpu contains the CPU that is being sent the IPI. It is updated
+ * before sending to the next CPU.
+ *
+ * Instead of having all CPUs that schedule a lower priority task send
+ * an IPI to the same "first" CPU in the RT overload mask, they send it
+ * to the next overloaded CPU after their own CPU. This helps distribute
+ * the work when there's more than one overloaded CPU and multiple CPUs
+ * scheduling in lower priority tasks.
+ *
+ * When a rq schedules a lower priority task than what was currently
+ * running, the next CPU with overloaded RT tasks is examined first.
+ * That is, if CPUs 1 and 5 are overloaded, and CPU 3 schedules a lower
+ * priority task, it will send an IPI first to CPU 5, then CPU 5 will
+ * send to CPU 1 if it is still overloaded. CPU 1 will clear the
+ * rq->rt.push_flags if RT_PUSH_IPI_RESTART is not set.
+ *
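That hop sequence can be traced with a tiny standalone program; the modular stop test below is only a model of the real cutoff logic.

#include <stdbool.h>
#include <stdio.h>

#define NCPUS 8

int main(void)
{
	bool overloaded[NCPUS] = { false };
	overloaded[1] = overloaded[5] = true;

	int src = 3;	/* CPU that scheduled the lower priority task */
	int cpu = src;

	for (;;) {
		int next = -1;

		/* Find the next overloaded CPU after 'cpu', wrapping around. */
		for (int i = 1; i <= NCPUS; i++) {
			int c = (cpu + i) % NCPUS;
			if (overloaded[c]) {
				next = c;
				break;
			}
		}
		/* Stop when nobody is left or the scan has passed the source. */
		if (next < 0 ||
		    (next - src + NCPUS) % NCPUS <= (cpu - src + NCPUS) % NCPUS)
			break;

		printf("IPI: CPU %d -> CPU %d\n", cpu, next);
		cpu = next;
	}
	return 0;
}

Running it prints the two hops, "IPI: CPU 3 -> CPU 5" and "IPI: CPU 5 -> CPU 1", and then stops, matching the description above.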
+ * The first CPU to notice IPI_RESTART is set will clear that flag and then
+ * send an IPI to the next overloaded CPU after the rq->cpu and not the next
+ * CPU after push_cpu. That is, if CPUs 1, 4 and 5 are overloaded when CPU 3
+ * schedules a lower priority task, and the IPI_RESTART gets set while the
+ * handling is being done on CPU 5, it will clear the flag and send it back to
+ * CPU 4 instead of CPU 1.
+ *
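A sketch of that restart decision, again with invented structure and helper names:

#define RT_PUSH_IPI_EXECUTING	1
#define RT_PUSH_IPI_RESTART	2

struct model_src_rq {
	int cpu;	/* CPU that started this IPI round */
	int push_flags;
	int push_cpu;	/* CPU the IPI was last sent to */
};

/* Decide where the next scan for an overloaded CPU should begin. */
static int next_search_start(struct model_src_rq *src)
{
	if (src->push_flags & RT_PUSH_IPI_RESTART) {
		/* Restart requested: begin again from the source CPU itself. */
		src->push_flags &= ~RT_PUSH_IPI_RESTART;
		return src->cpu;
	}
	/* Otherwise continue from wherever the IPI last went. */
	return src->push_cpu;
}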
+ * Note, the above logic can be disabled by turning off the sched_feature
+ * RT_PUSH_IPI. Then the rq lock of the overloaded CPU will simply be
+ * taken by the CPU requesting a pull and the waiting RT task will be pulled
+ * by that CPU. This may be fine for machines with few CPUs.
+ */
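On kernels built with CONFIG_SCHED_DEBUG, scheduler features of this kind are typically toggled through /sys/kernel/debug/sched_features (for example by writing NO_RT_PUSH_IPI), though that interface is not part of this hunk. A toy model of the resulting branch, with made-up names and a mutex standing in for the runqueue lock:

#include <stdbool.h>
#include <pthread.h>

static bool rt_push_ipi = true;		/* models the RT_PUSH_IPI scheduler feature */

struct model_rq {
	pthread_mutex_t lock;		/* stands in for the runqueue lock */
	int nr_queued_rt;
};

static void send_push_ipi_to(struct model_rq *rq) { (void)rq; }	/* IPI stand-in */

static void notify_or_pull(struct model_rq *overloaded)
{
	if (rt_push_ipi) {
		/* Feature on: just poke the overloaded CPU, no remote lock taken. */
		send_push_ipi_to(overloaded);
		return;
	}
	/* Feature off: take the remote rq lock and pull the waiting task directly. */
	pthread_mutex_lock(&overloaded->lock);
	if (overloaded->nr_queued_rt > 0)
		overloaded->nr_queued_rt--;
	pthread_mutex_unlock(&overloaded->lock);
}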
 static void tell_cpu_to_push(struct rq *rq)
 {
 	int cpu;