Diffstat (limited to 'Documentation/scheduler/sched-design-CFS.txt')
-rw-r--r-- Documentation/scheduler/sched-design-CFS.txt 371
1 files changed, 218 insertions, 153 deletions
diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt
index 88bcb8767335..b2aa856339a7 100644
--- a/Documentation/scheduler/sched-design-CFS.txt
+++ b/Documentation/scheduler/sched-design-CFS.txt
@@ -1,151 +1,218 @@
1 =============
2 CFS Scheduler
3 =============
1 4
2This is the CFS scheduler.
3
480% of CFS's design can be summed up in a single sentence: CFS basically
5models an "ideal, precise multi-tasking CPU" on real hardware.
6
7"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100%
8physical power and which can run each task at precise equal speed, in
9parallel, each at 1/nr_running speed. For example: if there are 2 tasks
10running then it runs each at 50% physical power - totally in parallel.
11
12On real hardware, we can run only a single task at once, so while that
13one task runs, the other tasks that are waiting for the CPU are at a
14disadvantage - the current task gets an unfair amount of CPU time. In
15CFS this fairness imbalance is expressed and tracked via the per-task
16p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
17time the task should now run on the CPU for it to become completely fair
18and balanced.
19
20( small detail: on 'ideal' hardware, the p->wait_runtime value would
21 always be zero - no task would ever get 'out of balance' from the
22 'ideal' share of CPU time. )
23
24CFS's task picking logic is based on this p->wait_runtime value and it
25is thus very simple: it always tries to run the task with the largest
26p->wait_runtime value. In other words, CFS tries to run the task with
27the 'gravest need' for more CPU time. So CFS always tries to split up
28CPU time between runnable tasks as close to 'ideal multitasking
29hardware' as possible.
30
31Most of the rest of CFS's design just falls out of this really simple
32concept, with a few add-on embellishments like nice levels,
33multiprocessing and various algorithm variants to recognize sleepers.
34
35In practice it works like this: the system runs a task a bit, and when
36the task schedules (or a scheduler tick happens) the task's CPU usage is
37'accounted for': the (small) time it just spent using the physical CPU
38is deducted from p->wait_runtime. [minus the 'fair share' it would have
39gotten anyway]. Once p->wait_runtime gets low enough so that another
40task becomes the 'leftmost task' of the time-ordered rbtree it maintains
41(plus a small amount of 'granularity' distance relative to the leftmost
42task so that we do not over-schedule tasks and trash the cache) then the
43new leftmost task is picked and the current task is preempted.
44
45The rq->fair_clock value tracks the 'CPU time a runnable task would have
46fairly gotten, had it been runnable during that time'. So by using
47rq->fair_clock values we can accurately timestamp and measure the
48'expected CPU time' a task should have gotten. All runnable tasks are
49sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
50CFS picks the 'leftmost' task and sticks to it. As the system progresses
51forwards, newly woken tasks are put into the tree more and more to the
52right - slowly but surely giving a chance for every task to become the
53'leftmost task' and thus get on the CPU within a deterministic amount of
54time.
55
56Some implementation details:
57
58 - the introduction of Scheduling Classes: an extensible hierarchy of
59 scheduler modules. These modules encapsulate scheduling policy
60 details and are handled by the scheduler core without the core
61 code assuming about them too much.
62
63 - sched_fair.c implements the 'CFS desktop scheduler': it is a
64 replacement for the vanilla scheduler's SCHED_OTHER interactivity
65 code.
66
67 I'd like to give credit to Con Kolivas for the general approach here:
68 he has proven via RSDL/SD that 'fair scheduling' is possible and that
69 it results in better desktop scheduling. Kudos Con!
70
71 The CFS patch uses a completely different approach and implementation
72 from RSDL/SD. My goal was to make CFS's interactivity quality exceed
73 that of RSDL/SD, which is a high standard to meet :-) Testing
74 feedback is welcome to decide this one way or another. [ and, in any
75 case, all of SD's logic could be added via a kernel/sched_sd.c module
76 as well, if Con is interested in such an approach. ]
77
78 CFS's design is quite radical: it does not use runqueues, it uses a
79 time-ordered rbtree to build a 'timeline' of future task execution,
80 and thus has no 'array switch' artifacts (by which both the vanilla
81 scheduler and RSDL/SD are affected).
82
83 CFS uses nanosecond granularity accounting and does not rely on any
84 jiffies or other HZ detail. Thus the CFS scheduler has no notion of
85 'timeslices' and has no heuristics whatsoever. There is only one
86 central tunable (you have to switch on CONFIG_SCHED_DEBUG):
87
88 /proc/sys/kernel/sched_granularity_ns
89
90 which can be used to tune the scheduler from 'desktop' (low
91 latencies) to 'server' (good batching) workloads. It defaults to a
92 setting suitable for desktop workloads. SCHED_BATCH is handled by the
93 CFS scheduler module too.
94
95 Due to its design, the CFS scheduler is not prone to any of the
96 'attacks' that exist today against the heuristics of the stock
97 scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
98 work fine and do not impact interactivity and produce the expected
99 behavior.
100
101 the CFS scheduler has a much stronger handling of nice levels and
102 SCHED_BATCH: both types of workloads should be isolated much more
103 agressively than under the vanilla scheduler.
104
105 ( another detail: due to nanosec accounting and timeline sorting,
106 sched_yield() support is very simple under CFS, and in fact under
107 CFS sched_yield() behaves much better than under any other
108 scheduler i have tested so far. )
109
110 - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
111 way than the vanilla scheduler does. It uses 100 runqueues (for all
112 100 RT priority levels, instead of 140 in the vanilla scheduler)
113 and it needs no expired array.
114
115 - reworked/sanitized SMP load-balancing: the runqueue-walking
116 assumptions are gone from the load-balancing code now, and
117 iterators of the scheduling modules are used. The balancing code got
118 quite a bit simpler as a result.
119
120
121Group scheduler extension to CFS
122================================
123
124Normally the scheduler operates on individual tasks and strives to provide
125fair CPU time to each task. Sometimes, it may be desirable to group tasks
126and provide fair CPU time to each such task group. For example, it may
127be desirable to first provide fair CPU time to each user on the system
128and then to each task belonging to a user.
129
130CONFIG_FAIR_GROUP_SCHED strives to achieve exactly that. It lets
131SCHED_NORMAL/BATCH tasks be be grouped and divides CPU time fairly among such
132groups. At present, there are two (mutually exclusive) mechanisms to group
133tasks for CPU bandwidth control purpose:
134
135 - Based on user id (CONFIG_FAIR_USER_SCHED)
136 In this option, tasks are grouped according to their user id.
137 - Based on "cgroup" pseudo filesystem (CONFIG_FAIR_CGROUP_SCHED)
138 This options lets the administrator create arbitrary groups
139 of tasks, using the "cgroup" pseudo filesystem. See
140 Documentation/cgroups.txt for more information about this
141 filesystem.
142 5
143Only one of these options to group tasks can be chosen and not both.
6 1. OVERVIEW
7
8CFS stands for "Completely Fair Scheduler," and is the new "desktop" process
9scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the
10replacement for the previous vanilla scheduler's SCHED_OTHER interactivity
11code.
12
1380% of CFS's design can be summed up in a single sentence: CFS basically models
14an "ideal, precise multi-tasking CPU" on real hardware.
15
16"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical
17power and which can run each task at precise equal speed, in parallel, each at
181/nr_running speed. For example: if there are 2 tasks running, then it runs
19each at 50% physical power --- i.e., actually in parallel.
20
21On real hardware, we can run only a single task at once, so we have to
22introduce the concept of "virtual runtime." The virtual runtime of a task
23specifies when its next timeslice would start execution on the ideal
24multi-tasking CPU described above. In practice, the virtual runtime of a task
25is its actual runtime normalized to the total number of running tasks.
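
The two ideas above can be illustrated with a small sketch. It is written in Python purely for illustration (the kernel is C), the function names are hypothetical and do not exist in the kernel, and task weights are ignored: every task is assumed to have equal weight.

```python
from fractions import Fraction


def ideal_speed(nr_running):
    """Speed of each task on the ideal multi-tasking CPU: 1/nr_running."""
    if nr_running < 1:
        raise ValueError("need at least one running task")
    return Fraction(1, nr_running)


def normalized_runtime_ns(actual_runtime_ns, nr_running):
    """Simplified 'virtual runtime': actual runtime normalized to the
    number of running tasks (assumption: all tasks have equal weight)."""
    return actual_runtime_ns * ideal_speed(nr_running)


# Two running tasks: each progresses at half speed, so 10 ms of wall
# time corresponds to 5 ms of progress per task.
print(ideal_speed(2), normalized_runtime_ns(10_000_000, 2))
```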
26
27
28
29 2. FEW IMPLEMENTATION DETAILS
30
31In CFS the virtual runtime is expressed and tracked via the per-task
32p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately
33timestamp and measure the "expected CPU time" a task should have gotten.
34
35[ small detail: on "ideal" hardware, at any time all tasks would have the same
36 p->se.vruntime value --- i.e., tasks would execute simultaneously and no task
37 would ever get "out of balance" from the "ideal" share of CPU time. ]
38
39CFS's task picking logic is based on this p->se.vruntime value and it is thus
40very simple: it always tries to run the task with the smallest p->se.vruntime
41value (i.e., the task which executed least so far). CFS always tries to split
42up CPU time between runnable tasks as close to "ideal multitasking hardware" as
43possible.
44
45Most of the rest of CFS's design just falls out of this really simple concept,
46with a few add-on embellishments like nice levels, multiprocessing and various
47algorithm variants to recognize sleepers.
48
49
50
51 3. THE RBTREE
52
53CFS's design is quite radical: it does not use the old data structures for the
54runqueues, but it uses a time-ordered rbtree to build a "timeline" of future
55task execution, and thus has no "array switch" artifacts (by which both the
56previous vanilla scheduler and RSDL/SD are affected).
57
58CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically
59increasing value tracking the smallest vruntime among all tasks in the
60runqueue. The total amount of work done by the system is tracked using
61min_vruntime; that value is used to place newly activated entities on the left
62side of the tree as much as possible.
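
A minimal sketch of how such a monotonic floor could behave, in Python for illustration only. The class and method names are hypothetical, a plain dict stands in for the rbtree, and placing a new entity exactly at min_vruntime is a simplifying assumption rather than the kernel's actual placement logic.

```python
class ToyCfsRq:
    """Toy runqueue: tracks per-task vruntime and a monotonic minimum."""

    def __init__(self):
        self.min_vruntime = 0
        self.vruntime = {}  # task name -> vruntime in nanoseconds

    def update_min_vruntime(self):
        # The floor follows the smallest queued vruntime but never
        # moves backwards.
        if self.vruntime:
            smallest = min(self.vruntime.values())
            self.min_vruntime = max(self.min_vruntime, smallest)

    def place_new(self, name):
        # Assumption: a newly activated entity starts at the floor, so
        # it lands as far left in the timeline as possible.
        self.update_min_vruntime()
        self.vruntime[name] = self.min_vruntime


rq = ToyCfsRq()
rq.vruntime = {"A": 100, "B": 250}
rq.place_new("C")
print(rq.min_vruntime, rq.vruntime["C"])
```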
63
64The total weight of the running tasks in the runqueue is accounted through the
65rq->cfs.load value, which is the sum of the weights of the tasks queued on the
66runqueue.
67
68CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the
69p->se.vruntime key (there is a subtraction using rq->cfs.min_vruntime to
70account for possible wraparounds). CFS picks the "leftmost" task from this
71tree and sticks to it.
72As the system progresses forwards, the executed tasks are put into the tree
73more and more to the right --- slowly but surely giving a chance for every task
74to become the "leftmost task" and thus get on the CPU within a deterministic
75amount of time.
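
The timeline behaviour described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the kernel's code: a binary heap (heapq) stands in for the rbtree, and the function name, the fixed slice length and the tie-breaking sequence number are all made up for the example.

```python
import heapq


def run_timeline(task_names, slice_ns, nr_picks):
    """Pick the 'leftmost' task nr_picks times and return the run order."""
    # Entries are (vruntime, sequence, name); the sequence number only
    # breaks ties between tasks with equal vruntime.
    timeline = [(0, seq, name) for seq, name in enumerate(task_names)]
    heapq.heapify(timeline)
    order = []
    for _ in range(nr_picks):
        vruntime, seq, name = heapq.heappop(timeline)  # smallest vruntime
        order.append(name)
        # The task ran for one slice, so it is reinserted further right.
        heapq.heappush(timeline, (vruntime + slice_ns, seq, name))
    return order


print(run_timeline(["A", "B", "C"], 1_000_000, 6))
```

With equal slices every task reaches the leftmost position once per round, which is the "every task gets its chance" property the paragraph describes.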
76
77Summing up, CFS works like this: it runs a task a bit, and when the task
78schedules (or a scheduler tick happens) the task's CPU usage is "accounted
79for": the (small) time it just spent using the physical CPU is added to
80p->se.vruntime. Once p->se.vruntime gets high enough so that another task
81becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a
82small amount of "granularity" distance relative to the leftmost task so that we
83do not over-schedule tasks and thrash the cache), then the new leftmost task is
84picked and the current task is preempted.
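
The accounting loop with a granularity threshold might look roughly like the Python sketch below. It is a simplified model: the function name and parameters are hypothetical, a dict scan replaces the rbtree lookup, all tasks are assumed equally weighted, and preemption is only checked at tick boundaries.

```python
def run_ticks(vruntime_ns, tick_ns, granularity_ns, nr_ticks):
    """Simulate nr_ticks scheduler ticks; return (trace, final vruntimes)."""
    vr = dict(vruntime_ns)

    def leftmost():
        # Smallest vruntime wins; the name breaks ties deterministically.
        return min(vr, key=lambda name: (vr[name], name))

    current = leftmost()
    trace = []
    for _ in range(nr_ticks):
        vr[current] += tick_ns  # account the CPU time just used
        trace.append(current)
        candidate = leftmost()
        # Preempt only once the current task is ahead of the leftmost
        # task by more than the granularity distance.
        if candidate != current and vr[current] - vr[candidate] > granularity_ns:
            current = candidate
    return trace, vr


trace, final = run_ticks({"A": 0, "B": 0}, 1_000_000, 2_000_000, 8)
print("".join(trace), final)
```

A larger granularity_ns lets each task run longer before it is preempted (better batching); a smaller one switches more often (lower latency).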
85
86
87
88 4. SOME FEATURES OF CFS
89
90CFS uses nanosecond granularity accounting and does not rely on any jiffies or
91other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the
92way the previous scheduler had, and has no heuristics whatsoever. There is
93only one central tunable (you have to switch on CONFIG_SCHED_DEBUG):
94
95 /proc/sys/kernel/sched_granularity_ns
96
97which can be used to tune the scheduler from "desktop" (i.e., low latencies) to
98"server" (i.e., good batching) workloads. It defaults to a setting suitable
99for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too.
100
101Due to its design, the CFS scheduler is not prone to any of the "attacks" that
102exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c,
103chew.c, ring-test.c, massive_intr.c all work fine and do not impact
104interactivity and produce the expected behavior.
105
106The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH
107than the previous vanilla scheduler: both types of workloads are isolated much
108more aggressively.
109
110SMP load-balancing has been reworked/sanitized: the runqueue-walking
111assumptions are gone from the load-balancing code now, and iterators of the
112scheduling modules are used. The balancing code got quite a bit simpler as a
113result.
114
115
116
117 5. SCHEDULING CLASSES
118
119The new CFS scheduler has been designed in such a way as to introduce "Scheduling
120Classes," an extensible hierarchy of scheduler modules. These modules
121encapsulate scheduling policy details and are handled by the scheduler core
122without the core code assuming too much about them.
123
124sched_fair.c implements the CFS scheduler described above.
144 125
145Group scheduler tunables:
126sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
127the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT
128priority levels, instead of 140 in the previous scheduler) and it needs no
129expired array.
146 130
147When CONFIG_FAIR_USER_SCHED is defined, a directory is created in sysfs for
148each new user and a "cpu_share" file is added in that directory.
131Scheduling classes are implemented through the sched_class structure, which
132contains hooks to functions that must be called whenever an interesting event
133occurs.
134
135This is the (partial) list of the hooks:
136
137 - enqueue_task(...)
138
139 Called when a task enters a runnable state.
140 It puts the scheduling entity (task) into the red-black tree and
141 increments the nr_running variable.
142
143 - dequeue_task(...)
144
145 When a task is no longer runnable, this function is called to remove the
146 corresponding scheduling entity from the red-black tree. It decrements
147 the nr_running variable.
148
149 - yield_task(...)
150
151 This function is basically just a dequeue followed by an enqueue, unless the
152 compat_yield sysctl is turned on; in that case, it places the scheduling
153 entity at the right-most end of the red-black tree.
154
155 - check_preempt_curr(...)
156
157 This function checks if a task that entered the runnable state should
158 preempt the currently running task.
159
160 - pick_next_task(...)
161
162 This function chooses the most appropriate task eligible to run next.
163
164 - set_curr_task(...)
165
166 This function is called when a task changes its scheduling class or changes
167 its task group.
168
169 - task_tick(...)
170
171 This function is mostly called from time tick functions; it might lead to a
172 process switch. This drives the running preemption.
173
174 - task_new(...)
175
176 The core scheduler gives the scheduling module an opportunity to manage new
177 task startup. The CFS scheduling module uses it for group scheduling, while
178 the scheduling module for a real-time task does not use it.
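
To make the hook idea concrete, here is a toy model in Python. It is only a sketch: the class names, the pick_next() helper and the "first class with a runnable task wins" ordering are assumptions for illustration, and only three of the hooks listed above are modelled.

```python
class FifoClass:
    """Toy real-time class: tasks run in arrival order."""

    def __init__(self):
        self.queue = []

    def enqueue_task(self, task):
        self.queue.append(task)

    def pick_next_task(self):
        return self.queue[0] if self.queue else None

    def task_tick(self, task, delta_ns):
        pass  # nothing to account in this toy model


class FairClass:
    """Toy fair class: a dict of vruntimes stands in for the rbtree."""

    def __init__(self):
        self.vruntime = {}
        self.nr_running = 0

    def enqueue_task(self, task, vruntime=0):
        self.vruntime[task] = vruntime
        self.nr_running += 1

    def pick_next_task(self):
        if not self.vruntime:
            return None
        return min(self.vruntime, key=lambda t: (self.vruntime[t], t))

    def task_tick(self, task, delta_ns):
        self.vruntime[task] += delta_ns


def pick_next(classes):
    """Hypothetical core: ask each class in order, knowing no policy details."""
    for sched_class in classes:
        task = sched_class.pick_next_task()
        if task is not None:
            return task
    return None


rt, fair = FifoClass(), FairClass()
fair.enqueue_task("editor", 5)
fair.enqueue_task("compiler", 9)
print(pick_next([rt, fair]))
```

The point of the design is visible in pick_next(): the core only calls hooks and never inspects how a class orders its tasks.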
179
180
181
182 6. GROUP SCHEDULER EXTENSIONS TO CFS
183
184Normally, the scheduler operates on individual tasks and strives to provide
185fair CPU time to each task. Sometimes, it may be desirable to group tasks and
186provide fair CPU time to each such task group. For example, it may be
187desirable to first provide fair CPU time to each user on the system and then to
188each task belonging to a user.
189
190CONFIG_GROUP_SCHED strives to achieve exactly that. It lets tasks be
191grouped and divides CPU time fairly among such groups.
192
193CONFIG_RT_GROUP_SCHED permits grouping real-time (i.e., SCHED_FIFO and
194SCHED_RR) tasks.
195
196CONFIG_FAIR_GROUP_SCHED permits grouping CFS (i.e., SCHED_NORMAL and
197SCHED_BATCH) tasks.
198
199At present, there are two (mutually exclusive) mechanisms to group tasks for
200CPU bandwidth control purposes:
201
202 - Based on user id (CONFIG_USER_SCHED)
203
204 With this option, tasks are grouped according to their user id.
205
206 - Based on "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED)
207
208 This option needs CONFIG_CGROUPS to be defined, and lets the administrator
209 create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See
210 Documentation/cgroups.txt for more information about this filesystem.
211
212Only one of these two options to group tasks can be chosen.
213
214When CONFIG_USER_SCHED is defined, a directory is created in sysfs for each new
215user and a "cpu_share" file is added in that directory.
149 216
150 217 # cd /sys/kernel/uids
151 218 # cat 512/cpu_share # Display user 512's CPU share
@@ -155,16 +222,14 @@ each new user and a "cpu_share" file is added in that directory.
155 222 2048
156 223 #
157 224
158CPU bandwidth between two users are divided in the ratio of their CPU shares.
159For ex: if you would like user "root" to get twice the bandwidth of user
160"guest", then set the cpu_share for both the users such that "root"'s
161cpu_share is twice "guest"'s cpu_share
225CPU bandwidth between two users is divided in the ratio of their CPU shares.
226For example: if you would like user "root" to get twice the bandwidth of user
227"guest," then set the cpu_share for both the users such that "root"'s cpu_share
228is twice "guest"'s cpu_share.
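
The share arithmetic can be checked with a short Python sketch. The function name is hypothetical, and the result assumes both users keep the CPU busy the whole time; an idle user's unused share is not modelled.

```python
from fractions import Fraction


def bandwidth_split(cpu_share):
    """Map each user to its fraction of CPU bandwidth, given cpu_share values."""
    total = sum(cpu_share.values())
    if total <= 0:
        raise ValueError("total cpu_share must be positive")
    return {user: Fraction(share, total) for user, share in cpu_share.items()}


# "root" has twice the cpu_share of "guest", so it gets 2/3 of the
# bandwidth against 1/3 for "guest".
print(bandwidth_split({"root": 2048, "guest": 1024}))
```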
162
163 229
164When CONFIG_FAIR_CGROUP_SCHED is defined, a "cpu.shares" file is created
165for each group created using the pseudo filesystem. See example steps
166below to create task groups and modify their CPU share using the "cgroups"
167pseudo filesystem
230When CONFIG_CGROUP_SCHED is defined, a "cpu.shares" file is created for each
231group created using the pseudo filesystem. See example steps below to create
232task groups and modify their CPU share using the "cgroups" pseudo filesystem.
168 233
169 234 # mkdir /dev/cpuctl
170 235 # mount -t cgroup -ocpu none /dev/cpuctl