Diffstat (limited to 'Documentation/scheduler/sched-design-CFS.txt')
-rw-r--r-- | Documentation/scheduler/sched-design-CFS.txt | 371 |
1 files changed, 218 insertions, 153 deletions
diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt
index 88bcb8767335..b2aa856339a7 100644
--- a/Documentation/scheduler/sched-design-CFS.txt
+++ b/Documentation/scheduler/sched-design-CFS.txt
@@ -1,151 +1,218 @@ | |||
1 | ============= | ||
2 | CFS Scheduler | ||
3 | ============= | ||
1 | 4 | ||
2 | This is the CFS scheduler. | ||
3 | |||
4 | 80% of CFS's design can be summed up in a single sentence: CFS basically | ||
5 | models an "ideal, precise multi-tasking CPU" on real hardware. | ||
6 | |||
7 | "Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% | ||
8 | physical power and which can run each task at precise equal speed, in | ||
9 | parallel, each at 1/nr_running speed. For example: if there are 2 tasks | ||
10 | running then it runs each at 50% physical power - totally in parallel. | ||
11 | |||
12 | On real hardware, we can run only a single task at once, so while that | ||
13 | one task runs, the other tasks that are waiting for the CPU are at a | ||
14 | disadvantage - the current task gets an unfair amount of CPU time. In | ||
15 | CFS this fairness imbalance is expressed and tracked via the per-task | ||
16 | p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of | ||
17 | time the task should now run on the CPU for it to become completely fair | ||
18 | and balanced. | ||
19 | |||
20 | ( small detail: on 'ideal' hardware, the p->wait_runtime value would | ||
21 | always be zero - no task would ever get 'out of balance' from the | ||
22 | 'ideal' share of CPU time. ) | ||
23 | |||
24 | CFS's task picking logic is based on this p->wait_runtime value and it | ||
25 | is thus very simple: it always tries to run the task with the largest | ||
26 | p->wait_runtime value. In other words, CFS tries to run the task with | ||
27 | the 'gravest need' for more CPU time. So CFS always tries to split up | ||
28 | CPU time between runnable tasks as close to 'ideal multitasking | ||
29 | hardware' as possible. | ||
30 | |||
31 | Most of the rest of CFS's design just falls out of this really simple | ||
32 | concept, with a few add-on embellishments like nice levels, | ||
33 | multiprocessing and various algorithm variants to recognize sleepers. | ||
34 | |||
35 | In practice it works like this: the system runs a task a bit, and when | ||
36 | the task schedules (or a scheduler tick happens) the task's CPU usage is | ||
37 | 'accounted for': the (small) time it just spent using the physical CPU | ||
38 | is deducted from p->wait_runtime. [minus the 'fair share' it would have | ||
39 | gotten anyway]. Once p->wait_runtime gets low enough so that another | ||
40 | task becomes the 'leftmost task' of the time-ordered rbtree it maintains | ||
41 | (plus a small amount of 'granularity' distance relative to the leftmost | ||
42 | task so that we do not over-schedule tasks and trash the cache) then the | ||
43 | new leftmost task is picked and the current task is preempted. | ||
44 | |||
45 | The rq->fair_clock value tracks the 'CPU time a runnable task would have | ||
46 | fairly gotten, had it been runnable during that time'. So by using | ||
47 | rq->fair_clock values we can accurately timestamp and measure the | ||
48 | 'expected CPU time' a task should have gotten. All runnable tasks are | ||
49 | sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and | ||
50 | CFS picks the 'leftmost' task and sticks to it. As the system progresses | ||
51 | forwards, newly woken tasks are put into the tree more and more to the | ||
52 | right - slowly but surely giving a chance for every task to become the | ||
53 | 'leftmost task' and thus get on the CPU within a deterministic amount of | ||
54 | time. | ||
55 | |||
56 | Some implementation details: | ||
57 | |||
58 | - the introduction of Scheduling Classes: an extensible hierarchy of | ||
59 | scheduler modules. These modules encapsulate scheduling policy | ||
60 | details and are handled by the scheduler core without the core | ||
61 | code assuming about them too much. | ||
62 | |||
63 | - sched_fair.c implements the 'CFS desktop scheduler': it is a | ||
64 | replacement for the vanilla scheduler's SCHED_OTHER interactivity | ||
65 | code. | ||
66 | |||
67 | I'd like to give credit to Con Kolivas for the general approach here: | ||
68 | he has proven via RSDL/SD that 'fair scheduling' is possible and that | ||
69 | it results in better desktop scheduling. Kudos Con! | ||
70 | |||
71 | The CFS patch uses a completely different approach and implementation | ||
72 | from RSDL/SD. My goal was to make CFS's interactivity quality exceed | ||
73 | that of RSDL/SD, which is a high standard to meet :-) Testing | ||
74 | feedback is welcome to decide this one way or another. [ and, in any | ||
75 | case, all of SD's logic could be added via a kernel/sched_sd.c module | ||
76 | as well, if Con is interested in such an approach. ] | ||
77 | |||
78 | CFS's design is quite radical: it does not use runqueues, it uses a | ||
79 | time-ordered rbtree to build a 'timeline' of future task execution, | ||
80 | and thus has no 'array switch' artifacts (by which both the vanilla | ||
81 | scheduler and RSDL/SD are affected). | ||
82 | |||
83 | CFS uses nanosecond granularity accounting and does not rely on any | ||
84 | jiffies or other HZ detail. Thus the CFS scheduler has no notion of | ||
85 | 'timeslices' and has no heuristics whatsoever. There is only one | ||
86 | central tunable (you have to switch on CONFIG_SCHED_DEBUG): | ||
87 | |||
88 | /proc/sys/kernel/sched_granularity_ns | ||
89 | |||
90 | which can be used to tune the scheduler from 'desktop' (low | ||
91 | latencies) to 'server' (good batching) workloads. It defaults to a | ||
92 | setting suitable for desktop workloads. SCHED_BATCH is handled by the | ||
93 | CFS scheduler module too. | ||
94 | |||
95 | Due to its design, the CFS scheduler is not prone to any of the | ||
96 | 'attacks' that exist today against the heuristics of the stock | ||
97 | scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all | ||
98 | work fine and do not impact interactivity and produce the expected | ||
99 | behavior. | ||
100 | |||
101 | the CFS scheduler has a much stronger handling of nice levels and | ||
102 | SCHED_BATCH: both types of workloads should be isolated much more | ||
103 | agressively than under the vanilla scheduler. | ||
104 | |||
105 | ( another detail: due to nanosec accounting and timeline sorting, | ||
106 | sched_yield() support is very simple under CFS, and in fact under | ||
107 | CFS sched_yield() behaves much better than under any other | ||
108 | scheduler i have tested so far. ) | ||
109 | |||
110 | - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler | ||
111 | way than the vanilla scheduler does. It uses 100 runqueues (for all | ||
112 | 100 RT priority levels, instead of 140 in the vanilla scheduler) | ||
113 | and it needs no expired array. | ||
114 | |||
115 | - reworked/sanitized SMP load-balancing: the runqueue-walking | ||
116 | assumptions are gone from the load-balancing code now, and | ||
117 | iterators of the scheduling modules are used. The balancing code got | ||
118 | quite a bit simpler as a result. | ||
119 | |||
120 | |||
121 | Group scheduler extension to CFS | ||
122 | ================================ | ||
123 | |||
124 | Normally the scheduler operates on individual tasks and strives to provide | ||
125 | fair CPU time to each task. Sometimes, it may be desirable to group tasks | ||
126 | and provide fair CPU time to each such task group. For example, it may | ||
127 | be desirable to first provide fair CPU time to each user on the system | ||
128 | and then to each task belonging to a user. | ||
129 | |||
130 | CONFIG_FAIR_GROUP_SCHED strives to achieve exactly that. It lets | ||
131 | SCHED_NORMAL/BATCH tasks be be grouped and divides CPU time fairly among such | ||
132 | groups. At present, there are two (mutually exclusive) mechanisms to group | ||
133 | tasks for CPU bandwidth control purpose: | ||
134 | |||
135 | - Based on user id (CONFIG_FAIR_USER_SCHED) | ||
136 | In this option, tasks are grouped according to their user id. | ||
137 | - Based on "cgroup" pseudo filesystem (CONFIG_FAIR_CGROUP_SCHED) | ||
138 | This options lets the administrator create arbitrary groups | ||
139 | of tasks, using the "cgroup" pseudo filesystem. See | ||
140 | Documentation/cgroups.txt for more information about this | ||
141 | filesystem. | ||
142 | 5 | ||
143 | Only one of these options to group tasks can be chosen and not both. | 6 | 1. OVERVIEW |
7 | |||
8 | CFS stands for "Completely Fair Scheduler," and is the new "desktop" process | ||
9 | scheduler implemented by Ingo Molnar and merged in Linux 2.6.23. It is the | ||
10 | replacement for the previous vanilla scheduler's SCHED_OTHER interactivity | ||
11 | code. | ||
12 | |||
13 | 80% of CFS's design can be summed up in a single sentence: CFS basically models | ||
14 | an "ideal, precise multi-tasking CPU" on real hardware. | ||
15 | |||
16 | "Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% physical | ||
17 | power and which can run each task at precise equal speed, in parallel, each at | ||
18 | 1/nr_running speed. For example: if there are 2 tasks running, then it runs | ||
19 | each at 50% physical power --- i.e., actually in parallel. | ||
20 | |||
21 | On real hardware, we can run only a single task at once, so we have to | ||
22 | introduce the concept of "virtual runtime." The virtual runtime of a task | ||
23 | specifies when its next timeslice would start execution on the ideal | ||
24 | multi-tasking CPU described above. In practice, the virtual runtime of a task | ||
25 | is its actual runtime normalized to the total number of running tasks. | ||
26 | |||
27 | |||
28 | |||
29 | 2. FEW IMPLEMENTATION DETAILS | ||
30 | |||
31 | In CFS the virtual runtime is expressed and tracked via the per-task | ||
32 | p->se.vruntime (nanosec-unit) value. This way, it's possible to accurately | ||
33 | timestamp and measure the "expected CPU time" a task should have gotten. | ||
34 | |||
35 | [ small detail: on "ideal" hardware, at any time all tasks would have the same | ||
36 | p->se.vruntime value --- i.e., tasks would execute simultaneously and no task | ||
37 | would ever get "out of balance" from the "ideal" share of CPU time. ] | ||
38 | |||
39 | CFS's task picking logic is based on this p->se.vruntime value and it is thus | ||
40 | very simple: it always tries to run the task with the smallest p->se.vruntime | ||
41 | value (i.e., the task which executed least so far). CFS always tries to split | ||
42 | up CPU time between runnable tasks as close to "ideal multitasking hardware" as | ||
43 | possible. | ||
44 | |||
45 | Most of the rest of CFS's design just falls out of this really simple concept, | ||
46 | with a few add-on embellishments like nice levels, multiprocessing and various | ||
47 | algorithm variants to recognize sleepers. | ||
48 | |||
49 | |||
50 | |||
51 | 3. THE RBTREE | ||
52 | |||
53 | CFS's design is quite radical: it does not use the old data structures for the | ||
54 | runqueues, but it uses a time-ordered rbtree to build a "timeline" of future | ||
55 | task execution, and thus has no "array switch" artifacts (by which both the | ||
56 | previous vanilla scheduler and RSDL/SD are affected). | ||
57 | |||
58 | CFS also maintains the rq->cfs.min_vruntime value, which is a monotonically | ||
59 | increasing value tracking the smallest vruntime among all tasks in the | ||
60 | runqueue. The total amount of work done by the system is tracked using | ||
61 | min_vruntime; that value is used to place newly activated entities on the left | ||
62 | side of the tree as much as possible. | ||
63 | |||
64 | The total number of running tasks in the runqueue is accounted through the | ||
65 | rq->cfs.load value, which is the sum of the weights of the tasks queued on the | ||
66 | runqueue. | ||
67 | |||
68 | CFS maintains a time-ordered rbtree, where all runnable tasks are sorted by the | ||
69 | p->se.vruntime key (there is a subtraction using rq->cfs.min_vruntime to | ||
70 | account for possible wraparounds). CFS picks the "leftmost" task from this | ||
71 | tree and sticks to it. | ||
72 | As the system progresses forwards, the executed tasks are put into the tree | ||
73 | more and more to the right --- slowly but surely giving a chance for every task | ||
74 | to become the "leftmost task" and thus get on the CPU within a deterministic | ||
75 | amount of time. | ||
76 | |||
77 | Summing up, CFS works like this: it runs a task a bit, and when the task | ||
78 | schedules (or a scheduler tick happens) the task's CPU usage is "accounted | ||
79 | for": the (small) time it just spent using the physical CPU is added to | ||
80 | p->se.vruntime. Once p->se.vruntime gets high enough so that another task | ||
81 | becomes the "leftmost task" of the time-ordered rbtree it maintains (plus a | ||
82 | small amount of "granularity" distance relative to the leftmost task so that we | ||
83 | do not over-schedule tasks and thrash the cache), then the new leftmost task is | ||
84 | picked and the current task is preempted. | ||
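The accounting loop summarized above can be sketched in a few lines. This is an illustrative Python model, not kernel code: a binary heap stands in for the rbtree, every task has the same weight, the "granularity" check is left out, and the names `Task`, `run_cfs` and `slice_ns` are hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    vruntime: int                     # nanoseconds; the only field used for ordering
    name: str = field(compare=False)  # label only, ignored when comparing tasks


def run_cfs(tasks, slice_ns, steps):
    """Run the task with the smallest vruntime for slice_ns, `steps` times."""
    timeline = list(tasks)
    heapq.heapify(timeline)           # the heap root plays the "leftmost" node
    order = []
    for _ in range(steps):
        current = heapq.heappop(timeline)  # pick the leftmost task
        order.append(current.name)
        current.vruntime += slice_ns       # account the CPU time it just used
        heapq.heappush(timeline, current)  # reinsert it further to the right
    return order


tasks = [Task(0, "A"), Task(0, "B"), Task(0, "C")]
order = run_cfs(tasks, slice_ns=1_000_000, steps=6)
print(order)
```

Because a task that has just run always moves to the right of tasks that have not, every task gets on the CPU once per round in this model, whatever order ties are broken in.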
85 | |||
86 | |||
87 | |||
88 | 4. SOME FEATURES OF CFS | ||
89 | |||
90 | CFS uses nanosecond granularity accounting and does not rely on any jiffies or | ||
91 | other HZ detail. Thus the CFS scheduler has no notion of "timeslices" in the | ||
92 | way the previous scheduler had, and has no heuristics whatsoever. There is | ||
93 | only one central tunable (you have to switch on CONFIG_SCHED_DEBUG): | ||
94 | |||
95 | /proc/sys/kernel/sched_granularity_ns | ||
96 | |||
97 | which can be used to tune the scheduler from "desktop" (i.e., low latencies) to | ||
98 | "server" (i.e., good batching) workloads. It defaults to a setting suitable | ||
99 | for desktop workloads. SCHED_BATCH is handled by the CFS scheduler module too. | ||
100 | |||
101 | Due to its design, the CFS scheduler is not prone to any of the "attacks" that | ||
102 | exist today against the heuristics of the stock scheduler: fiftyp.c, thud.c, | ||
103 | chew.c, ring-test.c, massive_intr.c all work fine and do not impact | ||
104 | interactivity and produce the expected behavior. | ||
105 | |||
106 | The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH | ||
107 | than the previous vanilla scheduler: both types of workloads are isolated much | ||
108 | more aggressively. | ||
109 | |||
110 | SMP load-balancing has been reworked/sanitized: the runqueue-walking | ||
111 | assumptions are gone from the load-balancing code now, and iterators of the | ||
112 | scheduling modules are used. The balancing code got quite a bit simpler as a | ||
113 | result. | ||
114 | |||
115 | |||
116 | |||
117 | 5. SCHEDULING CLASSES | ||
118 | |||
119 | The new CFS scheduler has been designed in such a way to introduce "Scheduling | ||
120 | Classes," an extensible hierarchy of scheduler modules. These modules | ||
121 | encapsulate scheduling policy details and are handled by the scheduler core | ||
122 | without the core code assuming too much about them. | ||
123 | |||
124 | sched_fair.c implements the CFS scheduler described above. | ||
144 | 125 | ||
145 | Group scheduler tunables: | 126 | sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than |
127 | the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT | ||
128 | priority levels, instead of 140 in the previous scheduler) and it needs no | ||
129 | expired array. | ||
146 | 130 | ||
147 | When CONFIG_FAIR_USER_SCHED is defined, a directory is created in sysfs for | 131 | Scheduling classes are implemented through the sched_class structure, which |
148 | each new user and a "cpu_share" file is added in that directory. | 132 | contains hooks to functions that must be called whenever an interesting event |
133 | occurs. | ||
134 | |||
135 | This is the (partial) list of the hooks: | ||
136 | |||
137 | - enqueue_task(...) | ||
138 | |||
139 | Called when a task enters a runnable state. | ||
140 | It puts the scheduling entity (task) into the red-black tree and | ||
141 | increments the nr_running variable. | ||
142 | |||
143 | - dequeue_task(...) | ||
144 | |||
145 | When a task is no longer runnable, this function is called to keep the | ||
146 | corresponding scheduling entity out of the red-black tree. It decrements | ||
147 | the nr_running variable. | ||
148 | |||
149 | - yield_task(...) | ||
150 | |||
151 | This function is basically just a dequeue followed by an enqueue, unless the | ||
152 | compat_yield sysctl is turned on; in that case, it places the scheduling | ||
153 | entity at the right-most end of the red-black tree. | ||
154 | |||
155 | - check_preempt_curr(...) | ||
156 | |||
157 | This function checks if a task that entered the runnable state should | ||
158 | preempt the currently running task. | ||
159 | |||
160 | - pick_next_task(...) | ||
161 | |||
162 | This function chooses the most appropriate task eligible to run next. | ||
163 | |||
164 | - set_curr_task(...) | ||
165 | |||
166 | This function is called when a task changes its scheduling class or changes | ||
167 | its task group. | ||
168 | |||
169 | - task_tick(...) | ||
170 | |||
171 | This function is mostly called from time tick functions; it might lead to a | ||
172 | process switch. This drives the running preemption. | ||
173 | |||
174 | - task_new(...) | ||
175 | |||
176 | The core scheduler gives the scheduling module an opportunity to manage new | ||
177 | task startup. The CFS scheduling module uses it for group scheduling, while | ||
178 | the scheduling module for a real-time task does not use it. | ||
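To make the role of the first few hooks concrete, here is a rough Python model of a scheduling class. It is a sketch under stated assumptions, not the kernel's `sched_class`: a sorted list stands in for the red-black tree, tasks are identified by a plain name, and the class name `FairClassSketch` and the method signatures are hypothetical.

```python
import bisect


class FairClassSketch:
    """Toy scheduling class: keeps runnable tasks ordered by vruntime."""

    def __init__(self):
        self.timeline = []    # sorted (vruntime, name) pairs; rbtree stand-in
        self.nr_running = 0

    def enqueue_task(self, name, vruntime):
        # Called when a task becomes runnable: insert it and count it.
        bisect.insort(self.timeline, (vruntime, name))
        self.nr_running += 1

    def dequeue_task(self, name):
        # Called when a task stops being runnable: remove it and uncount it.
        remaining = [entry for entry in self.timeline if entry[1] != name]
        if len(remaining) != len(self.timeline):
            self.timeline = remaining
            self.nr_running -= 1

    def pick_next_task(self):
        # The "leftmost" entry (smallest vruntime) runs next, if any.
        return self.timeline[0][1] if self.timeline else None


sched = FairClassSketch()
sched.enqueue_task("editor", 30)
sched.enqueue_task("compiler", 10)
print(sched.pick_next_task(), sched.nr_running)
```

The scheduler core in this picture only calls the hooks; it does not need to know how the class orders its tasks internally.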
179 | |||
180 | |||
181 | |||
182 | 6. GROUP SCHEDULER EXTENSIONS TO CFS | ||
183 | |||
184 | Normally, the scheduler operates on individual tasks and strives to provide | ||
185 | fair CPU time to each task. Sometimes, it may be desirable to group tasks and | ||
186 | provide fair CPU time to each such task group. For example, it may be | ||
187 | desirable to first provide fair CPU time to each user on the system and then to | ||
188 | each task belonging to a user. | ||
189 | |||
190 | CONFIG_GROUP_SCHED strives to achieve exactly that. It lets tasks be | ||
191 | grouped and divides CPU time fairly among such groups. | ||
192 | |||
193 | CONFIG_RT_GROUP_SCHED permits grouping real-time (i.e., SCHED_FIFO and | ||
194 | SCHED_RR) tasks. | ||
195 | |||
196 | CONFIG_FAIR_GROUP_SCHED permits grouping CFS (i.e., SCHED_NORMAL and | ||
197 | SCHED_BATCH) tasks. | ||
198 | |||
199 | At present, there are two (mutually exclusive) mechanisms to group tasks for | ||
200 | CPU bandwidth control purposes: | ||
201 | |||
202 | - Based on user id (CONFIG_USER_SCHED) | ||
203 | |||
204 | With this option, tasks are grouped according to their user id. | ||
205 | |||
206 | - Based on "cgroup" pseudo filesystem (CONFIG_CGROUP_SCHED) | ||
207 | |||
208 | This option needs CONFIG_CGROUPS to be defined, and lets the administrator | ||
209 | create arbitrary groups of tasks, using the "cgroup" pseudo filesystem. See | ||
210 | Documentation/cgroups.txt for more information about this filesystem. | ||
211 | |||
212 | Only one of these options to group tasks can be chosen and not both. | ||
213 | |||
214 | When CONFIG_USER_SCHED is defined, a directory is created in sysfs for each new | ||
215 | user and a "cpu_share" file is added in that directory. | ||
149 | 216 | ||
150 | # cd /sys/kernel/uids | 217 | # cd /sys/kernel/uids |
151 | # cat 512/cpu_share # Display user 512's CPU share | 218 | # cat 512/cpu_share # Display user 512's CPU share |
@@ -155,16 +222,14 @@ each new user and a "cpu_share" file is added in that directory. | |||
155 | 2048 | 222 | 2048 |
156 | # | 223 | # |
157 | 224 | ||
158 | CPU bandwidth between two users are divided in the ratio of their CPU shares. | 225 | CPU bandwidth between two users is divided in the ratio of their CPU shares. |
159 | For ex: if you would like user "root" to get twice the bandwidth of user | 226 | For example: if you would like user "root" to get twice the bandwidth of user |
160 | "guest", then set the cpu_share for both the users such that "root"'s | 227 | "guest," then set the cpu_share for both the users such that "root"'s cpu_share |
161 | cpu_share is twice "guest"'s cpu_share | 228 | is twice "guest"'s cpu_share. |
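The ratio rule can be checked with a small calculation. The sketch below is Python and purely illustrative: it assumes both users always have runnable tasks competing for the same CPU, and the helper name `bandwidth_fractions` and the share values are hypothetical.

```python
def bandwidth_fractions(shares):
    """Return each group's fraction of CPU bandwidth, proportional to its share."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


# "root" is given twice the cpu_share of "guest".
fractions = bandwidth_fractions({"root": 2048, "guest": 1024})
print(fractions)
```

Only the ratio between the shares matters in this model: 2048 against 1024 gives the same split as 2 against 1.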
162 | |||
163 | 229 | ||
164 | When CONFIG_FAIR_CGROUP_SCHED is defined, a "cpu.shares" file is created | 230 | When CONFIG_CGROUP_SCHED is defined, a "cpu.shares" file is created for each |
165 | for each group created using the pseudo filesystem. See example steps | 231 | group created using the pseudo filesystem. See example steps below to create |
166 | below to create task groups and modify their CPU share using the "cgroups" | 232 | task groups and modify their CPU share using the "cgroups" pseudo filesystem. |
167 | pseudo filesystem | ||
168 | 233 | ||
169 | # mkdir /dev/cpuctl | 234 | # mkdir /dev/cpuctl |
170 | # mount -t cgroup -ocpu none /dev/cpuctl | 235 | # mount -t cgroup -ocpu none /dev/cpuctl |