diff options
author | Dario Faggioli <raistlin@linux.it> | 2014-01-27 06:20:15 -0500 |
---|---|---|
committer | Ingo Molnar <mingo@kernel.org> | 2014-01-28 07:08:40 -0500 |
commit | 712e5e34aef449ab680b35c0d9016f59b0a4494c (patch) | |
tree | cf578361c799de1e7b25e3b646f769d90d047c40 /Documentation | |
parent | b78f4c7c4402d1070dc7ccaca6001abb4ab4053a (diff) |
sched/deadline: Add sched_dl documentation
Add in Documentation/scheduler/ some hints about the design
choices, the usage and the future possible developments of the
sched_dl scheduling class and of the SCHED_DEADLINE policy.
Reviewed-by: Henrik Austad <henrik@austad.us>
Signed-off-by: Dario Faggioli <raistlin@linux.it>
Signed-off-by: Juri Lelli <juri.lelli@gmail.com>
[ Re-wrote sections 2 and 3. ]
Signed-off-by: Luca Abeni <luca.abeni@unitn.it>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1390821615-23247-1-git-send-email-juri.lelli@gmail.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/scheduler/00-INDEX | 2 | ||||
-rw-r--r-- | Documentation/scheduler/sched-deadline.txt | 281 |
2 files changed, 283 insertions, 0 deletions
diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX index d2651c47ae27..46702e4f89c9 100644 --- a/Documentation/scheduler/00-INDEX +++ b/Documentation/scheduler/00-INDEX | |||
@@ -10,5 +10,7 @@ sched-nice-design.txt | |||
10 | - How and why the scheduler's nice levels are implemented. | 10 | - How and why the scheduler's nice levels are implemented. |
11 | sched-rt-group.txt | 11 | sched-rt-group.txt |
12 | - real-time group scheduling. | 12 | - real-time group scheduling. |
13 | sched-deadline.txt | ||
14 | - deadline scheduling. | ||
13 | sched-stats.txt | 15 | sched-stats.txt |
14 | - information on schedstats (Linux Scheduler Statistics). | 16 | - information on schedstats (Linux Scheduler Statistics). |
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt new file mode 100644 index 000000000000..18adc92a6b3b --- /dev/null +++ b/Documentation/scheduler/sched-deadline.txt | |||
@@ -0,0 +1,281 @@ | |||
1 | Deadline Task Scheduling | ||
2 | ------------------------ | ||
3 | |||
4 | CONTENTS | ||
5 | ======== | ||
6 | |||
7 | 0. WARNING | ||
8 | 1. Overview | ||
9 | 2. Scheduling algorithm | ||
10 | 3. Scheduling Real-Time Tasks | ||
11 | 4. Bandwidth management | ||
12 | 4.1 System-wide settings | ||
13 | 4.2 Task interface | ||
14 | 4.3 Default behavior | ||
15 | 5. Tasks CPU affinity | ||
16 | 5.1 SCHED_DEADLINE and cpusets HOWTO | ||
17 | 6. Future plans | ||
18 | |||
19 | |||
20 | 0. WARNING | ||
21 | ========== | ||
22 | |||
23 | Fiddling with these settings can result in an unpredictable or even unstable | ||
24 | system behavior. As for -rt (group) scheduling, it is assumed that root users | ||
25 | know what they're doing. | ||
26 | |||
27 | |||
28 | 1. Overview | ||
29 | =========== | ||
30 | |||
31 | The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is | ||
32 | basically an implementation of the Earliest Deadline First (EDF) scheduling | ||
33 | algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS) | ||
34 | that makes it possible to isolate the behavior of tasks between each other. | ||
35 | |||
36 | |||
37 | 2. Scheduling algorithm | ||
38 | ================== | ||
39 | |||
40 | SCHED_DEADLINE uses three parameters, named "runtime", "period", and | ||
41 | "deadline" to schedule tasks. A SCHED_DEADLINE task is guaranteed to receive | ||
42 | "runtime" microseconds of execution time every "period" microseconds, and | ||
43 | these "runtime" microseconds are available within "deadline" microseconds | ||
44 | from the beginning of the period. In order to implement this behaviour, | ||
45 | every time the task wakes up, the scheduler computes a "scheduling deadline" | ||
46 | consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then | ||
47 | scheduled using EDF[1] on these scheduling deadlines (the task with the | ||
48 | smallest scheduling deadline is selected for execution). Notice that this | ||
49 | guaranteed is respected if a proper "admission control" strategy (see Section | ||
50 | "4. Bandwidth management") is used. | ||
51 | |||
52 | Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so | ||
53 | that each task runs for at most its runtime every period, avoiding any | ||
54 | interference between different tasks (bandwidth isolation), while the EDF[1] | ||
55 | algorithm selects the task with the smallest scheduling deadline as the one | ||
56 | to be executed first. Thanks to this feature, also tasks that do not | ||
57 | strictly comply with the "traditional" real-time task model (see Section 3) | ||
58 | can effectively use the new policy. | ||
59 | |||
60 | In more details, the CBS algorithm assigns scheduling deadlines to | ||
61 | tasks in the following way: | ||
62 | |||
63 | - Each SCHED_DEADLINE task is characterised by the "runtime", | ||
64 | "deadline", and "period" parameters; | ||
65 | |||
66 | - The state of the task is described by a "scheduling deadline", and | ||
67 | a "current runtime". These two parameters are initially set to 0; | ||
68 | |||
69 | - When a SCHED_DEADLINE task wakes up (becomes ready for execution), | ||
70 | the scheduler checks if | ||
71 | |||
72 | current runtime runtime | ||
73 | ---------------------------------- > ---------------- | ||
74 | scheduling deadline - current time period | ||
75 | |||
76 | then, if the scheduling deadline is smaller than the current time, or | ||
77 | this condition is verified, the scheduling deadline and the | ||
78 | current budget are re-initialised as | ||
79 | |||
80 | scheduling deadline = current time + deadline | ||
81 | current runtime = runtime | ||
82 | |||
83 | otherwise, the scheduling deadline and the current runtime are | ||
84 | left unchanged; | ||
85 | |||
86 | - When a SCHED_DEADLINE task executes for an amount of time t, its | ||
87 | current runtime is decreased as | ||
88 | |||
89 | current runtime = current runtime - t | ||
90 | |||
91 | (technically, the runtime is decreased at every tick, or when the | ||
92 | task is descheduled / preempted); | ||
93 | |||
94 | - When the current runtime becomes less or equal than 0, the task is | ||
95 | said to be "throttled" (also known as "depleted" in real-time literature) | ||
96 | and cannot be scheduled until its scheduling deadline. The "replenishment | ||
97 | time" for this task (see next item) is set to be equal to the current | ||
98 | value of the scheduling deadline; | ||
99 | |||
100 | - When the current time is equal to the replenishment time of a | ||
101 | throttled task, the scheduling deadline and the current runtime are | ||
102 | updated as | ||
103 | |||
104 | scheduling deadline = scheduling deadline + period | ||
105 | current runtime = current runtime + runtime | ||
106 | |||
107 | |||
108 | 3. Scheduling Real-Time Tasks | ||
109 | ============================= | ||
110 | |||
111 | * BIG FAT WARNING ****************************************************** | ||
112 | * | ||
113 | * This section contains a (not-thorough) summary on classical deadline | ||
114 | * scheduling theory, and how it applies to SCHED_DEADLINE. | ||
115 | * The reader can "safely" skip to Section 4 if only interested in seeing | ||
116 | * how the scheduling policy can be used. Anyway, we strongly recommend | ||
117 | * to come back here and continue reading (once the urge for testing is | ||
118 | * satisfied :P) to be sure of fully understanding all technical details. | ||
119 | ************************************************************************ | ||
120 | |||
121 | There are no limitations on what kind of task can exploit this new | ||
122 | scheduling discipline, even if it must be said that it is particularly | ||
123 | suited for periodic or sporadic real-time tasks that need guarantees on their | ||
124 | timing behavior, e.g., multimedia, streaming, control applications, etc. | ||
125 | |||
126 | A typical real-time task is composed of a repetition of computation phases | ||
127 | (task instances, or jobs) which are activated on a periodic or sporadic | ||
128 | fashion. | ||
129 | Each job J_j (where J_j is the j^th job of the task) is characterised by an | ||
130 | arrival time r_j (the time when the job starts), an amount of computation | ||
131 | time c_j needed to finish the job, and a job absolute deadline d_j, which | ||
132 | is the time within which the job should be finished. The maximum execution | ||
133 | time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task. | ||
134 | A real-time task can be periodic with period P if r_{j+1} = r_j + P, or | ||
135 | sporadic with minimum inter-arrival time P is r_{j+1} >= r_j + P. Finally, | ||
136 | d_j = r_j + D, where D is the task's relative deadline. | ||
137 | |||
138 | SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that | ||
139 | the jobs' deadlines of a task are respected. In order to do this, a task | ||
140 | must be scheduled by setting: | ||
141 | |||
142 | - runtime >= WCET | ||
143 | - deadline = D | ||
144 | - period <= P | ||
145 | |||
146 | IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines | ||
147 | and the absolute deadlines (d_j) coincide, so a proper admission control | ||
148 | allows to respect the jobs' absolute deadlines for this task (this is what is | ||
149 | called "hard schedulability property" and is an extension of Lemma 1 of [2]). | ||
150 | |||
151 | References: | ||
152 | 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogram- | ||
153 | ming in a hard-real-time environment. Journal of the Association for | ||
154 | Computing Machinery, 20(1), 1973. | ||
155 | 2 - L. Abeni , G. Buttazzo. Integrating Multimedia Applications in Hard | ||
156 | Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems | ||
157 | Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf | ||
158 | 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab | ||
159 | Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps | ||
160 | |||
161 | 4. Bandwidth management | ||
162 | ======================= | ||
163 | |||
164 | In order for the -deadline scheduling to be effective and useful, it is | ||
165 | important to have some method to keep the allocation of the available CPU | ||
166 | bandwidth to the tasks under control. | ||
167 | This is usually called "admission control" and if it is not performed at all, | ||
168 | no guarantee can be given on the actual scheduling of the -deadline tasks. | ||
169 | |||
170 | Since when RT-throttling has been introduced each task group has a bandwidth | ||
171 | associated, calculated as a certain amount of runtime over a period. | ||
172 | Moreover, to make it possible to manipulate such bandwidth, readable/writable | ||
173 | controls have been added to both procfs (for system wide settings) and cgroupfs | ||
174 | (for per-group settings). | ||
175 | Therefore, the same interface is being used for controlling the bandwidth | ||
176 | distrubution to -deadline tasks. | ||
177 | |||
178 | However, more discussion is needed in order to figure out how we want to manage | ||
179 | SCHED_DEADLINE bandwidth at the task group level. Therefore, SCHED_DEADLINE | ||
180 | uses (for now) a less sophisticated, but actually very sensible, mechanism to | ||
181 | ensure that a certain utilization cap is not overcome per each root_domain. | ||
182 | |||
183 | Another main difference between deadline bandwidth management and RT-throttling | ||
184 | is that -deadline tasks have bandwidth on their own (while -rt ones don't!), | ||
185 | and thus we don't need an higher level throttling mechanism to enforce the | ||
186 | desired bandwidth. | ||
187 | |||
188 | 4.1 System wide settings | ||
189 | ------------------------ | ||
190 | |||
191 | The system wide settings are configured under the /proc virtual file system. | ||
192 | |||
193 | For now the -rt knobs are used for dl admission control and the -deadline | ||
194 | runtime is accounted against the -rt runtime. We realise that this isn't | ||
195 | entirely desirable; however, it is better to have a small interface for now, | ||
196 | and be able to change it easily later. The ideal situation (see 5.) is to run | ||
197 | -rt tasks from a -deadline server; in which case the -rt bandwidth is a direct | ||
198 | subset of dl_bw. | ||
199 | |||
200 | This means that, for a root_domain comprising M CPUs, -deadline tasks | ||
201 | can be created while the sum of their bandwidths stays below: | ||
202 | |||
203 | M * (sched_rt_runtime_us / sched_rt_period_us) | ||
204 | |||
205 | It is also possible to disable this bandwidth management logic, and | ||
206 | be thus free of oversubscribing the system up to any arbitrary level. | ||
207 | This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us. | ||
208 | |||
209 | |||
210 | 4.2 Task interface | ||
211 | ------------------ | ||
212 | |||
213 | Specifying a periodic/sporadic task that executes for a given amount of | ||
214 | runtime at each instance, and that is scheduled according to the urgency of | ||
215 | its own timing constraints needs, in general, a way of declaring: | ||
216 | - a (maximum/typical) instance execution time, | ||
217 | - a minimum interval between consecutive instances, | ||
218 | - a time constraint by which each instance must be completed. | ||
219 | |||
220 | Therefore: | ||
221 | * a new struct sched_attr, containing all the necessary fields is | ||
222 | provided; | ||
223 | * the new scheduling related syscalls that manipulate it, i.e., | ||
224 | sched_setattr() and sched_getattr() are implemented. | ||
225 | |||
226 | |||
227 | 4.3 Default behavior | ||
228 | --------------------- | ||
229 | |||
230 | The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to | ||
231 | 950000. With rt_period equal to 1000000, by default, it means that -deadline | ||
232 | tasks can use at most 95%, multiplied by the number of CPUs that compose the | ||
233 | root_domain, for each root_domain. | ||
234 | |||
235 | A -deadline task cannot fork. | ||
236 | |||
237 | 5. Tasks CPU affinity | ||
238 | ===================== | ||
239 | |||
240 | -deadline tasks cannot have an affinity mask smaller that the entire | ||
241 | root_domain they are created on. However, affinities can be specified | ||
242 | through the cpuset facility (Documentation/cgroups/cpusets.txt). | ||
243 | |||
244 | 5.1 SCHED_DEADLINE and cpusets HOWTO | ||
245 | ------------------------------------ | ||
246 | |||
247 | An example of a simple configuration (pin a -deadline task to CPU0) | ||
248 | follows (rt-app is used to create a -deadline task). | ||
249 | |||
250 | mkdir /dev/cpuset | ||
251 | mount -t cgroup -o cpuset cpuset /dev/cpuset | ||
252 | cd /dev/cpuset | ||
253 | mkdir cpu0 | ||
254 | echo 0 > cpu0/cpuset.cpus | ||
255 | echo 0 > cpu0/cpuset.mems | ||
256 | echo 1 > cpuset.cpu_exclusive | ||
257 | echo 0 > cpuset.sched_load_balance | ||
258 | echo 1 > cpu0/cpuset.cpu_exclusive | ||
259 | echo 1 > cpu0/cpuset.mem_exclusive | ||
260 | echo $$ > cpu0/tasks | ||
261 | rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify | ||
262 | task affinity) | ||
263 | |||
264 | 6. Future plans | ||
265 | =============== | ||
266 | |||
267 | Still missing: | ||
268 | |||
269 | - refinements to deadline inheritance, especially regarding the possibility | ||
270 | of retaining bandwidth isolation among non-interacting tasks. This is | ||
271 | being studied from both theoretical and practical points of view, and | ||
272 | hopefully we should be able to produce some demonstrative code soon; | ||
273 | - (c)group based bandwidth management, and maybe scheduling; | ||
274 | - access control for non-root users (and related security concerns to | ||
275 | address), which is the best way to allow unprivileged use of the mechanisms | ||
276 | and how to prevent non-root users "cheat" the system? | ||
277 | |||
278 | As already discussed, we are planning also to merge this work with the EDF | ||
279 | throttling patches [https://lkml.org/lkml/2010/2/23/239] but we still are in | ||
280 | the preliminary phases of the merge and we really seek feedback that would | ||
281 | help us decide on the direction it should take. | ||