 -rw-r--r--   Documentation/scheduler/00-INDEX             |    2
 -rw-r--r--   Documentation/scheduler/sched-deadline.txt   |  281
 -rw-r--r--   kernel/sched/core.c                          |    4
 -rw-r--r--   kernel/sched/deadline.c                      |    3
 4 files changed, 288 insertions, 2 deletions
diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
index d2651c47ae27..46702e4f89c9 100644
--- a/Documentation/scheduler/00-INDEX
+++ b/Documentation/scheduler/00-INDEX
@@ -10,5 +10,7 @@ sched-nice-design.txt
 	- How and why the scheduler's nice levels are implemented.
 sched-rt-group.txt
 	- real-time group scheduling.
+sched-deadline.txt
+	- deadline scheduling.
 sched-stats.txt
 	- information on schedstats (Linux Scheduler Statistics).
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
new file mode 100644
index 000000000000..18adc92a6b3b
--- /dev/null
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -0,0 +1,281 @@
Deadline Task Scheduling
------------------------

CONTENTS
========

 0. WARNING
 1. Overview
 2. Scheduling algorithm
 3. Scheduling Real-Time Tasks
 4. Bandwidth management
   4.1 System-wide settings
   4.2 Task interface
   4.3 Default behavior
 5. Tasks CPU affinity
   5.1 SCHED_DEADLINE and cpusets HOWTO
 6. Future plans


0. WARNING
==========

Fiddling with these settings can result in unpredictable or even unstable
system behavior. As for -rt (group) scheduling, it is assumed that root users
know what they're doing.


1. Overview
===========

The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
basically an implementation of the Earliest Deadline First (EDF) scheduling
algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
that makes it possible to isolate the behavior of tasks from each other.


2. Scheduling algorithm
=======================

SCHED_DEADLINE uses three parameters, named "runtime", "period", and
"deadline", to schedule tasks. A SCHED_DEADLINE task is guaranteed to receive
"runtime" microseconds of execution time every "period" microseconds, and
these "runtime" microseconds are available within "deadline" microseconds
from the beginning of the period. In order to implement this behaviour,
every time the task wakes up, the scheduler computes a "scheduling deadline"
consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
scheduled using EDF[1] on these scheduling deadlines (the task with the
smallest scheduling deadline is selected for execution). Notice that this
guarantee is respected if a proper "admission control" strategy (see Section
"4. Bandwidth management") is used.

Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
that each task runs for at most its runtime every period, avoiding any
interference between different tasks (bandwidth isolation), while the EDF[1]
algorithm selects the task with the smallest scheduling deadline as the one
to be executed first. Thanks to this feature, tasks that do not strictly
comply with the "traditional" real-time task model (see Section 3) can also
effectively use the new policy.

In more detail, the CBS algorithm assigns scheduling deadlines to tasks in
the following way (an illustrative C sketch of these rules follows the list):

 - Each SCHED_DEADLINE task is characterised by the "runtime",
   "deadline", and "period" parameters;

 - The state of the task is described by a "scheduling deadline", and
   a "current runtime". These two parameters are initially set to 0;

 - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
   the scheduler checks if

                 current runtime                    runtime
          ----------------------------------  >  -----------
          scheduling deadline - current time        period

   then, if the scheduling deadline is smaller than the current time, or
   this condition is verified, the scheduling deadline and the
   current runtime are re-initialised as

         scheduling deadline = current time + deadline
         current runtime = runtime

   otherwise, the scheduling deadline and the current runtime are
   left unchanged;

 - When a SCHED_DEADLINE task executes for an amount of time t, its
   current runtime is decreased as

         current runtime = current runtime - t

   (technically, the runtime is decreased at every tick, or when the
   task is descheduled / preempted);

 - When the current runtime becomes less than or equal to 0, the task is
   said to be "throttled" (also known as "depleted" in real-time literature)
   and cannot be scheduled until its scheduling deadline. The "replenishment
   time" for this task (see next item) is set to be equal to the current
   value of the scheduling deadline;

 - When the current time is equal to the replenishment time of a
   throttled task, the scheduling deadline and the current runtime are
   updated as

         scheduling deadline = scheduling deadline + period
         current runtime = current runtime + runtime

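The following is a minimal, illustrative sketch of these rules in C. It is
not the kernel implementation (that lives in kernel/sched/deadline.c): the
type and function names are invented for the example, all times are assumed
to be expressed in nanoseconds, and the cross-multiplication in the wakeup
check is assumed not to overflow 64 bits (the in-kernel code is more careful
about this).

#include <stdint.h>

/*
 * Sketch of the CBS state and update rules described above; purely
 * illustrative, not the kernel's data structures.
 */
struct cbs_state {
	uint64_t runtime;        /* task parameter: budget per period  */
	uint64_t deadline;       /* task parameter: relative deadline  */
	uint64_t period;         /* task parameter: reservation period */
	int64_t  cur_runtime;    /* current (remaining) runtime        */
	uint64_t sched_deadline; /* current scheduling deadline        */
};

/* Rule applied when the task wakes up (becomes ready for execution). */
void cbs_wakeup(struct cbs_state *s, uint64_t now)
{
	/*
	 * Re-initialise deadline and runtime if the old scheduling deadline
	 * is in the past, or if the remaining bandwidth
	 *     cur_runtime / (sched_deadline - now)
	 * exceeds the reserved bandwidth runtime / period (the comparison
	 * is cross-multiplied to avoid divisions).
	 */
	if (s->sched_deadline <= now ||
	    s->cur_runtime * (int64_t)s->period >
	    (int64_t)(s->runtime * (s->sched_deadline - now))) {
		s->sched_deadline = now + s->deadline;
		s->cur_runtime = (int64_t)s->runtime;
	}
	/* otherwise, both are left unchanged */
}

/* Rule applied while the task executes, for every t nanoseconds consumed. */
void cbs_account(struct cbs_state *s, uint64_t t)
{
	s->cur_runtime -= (int64_t)t;
	/*
	 * cur_runtime <= 0 means the task is throttled and cannot run again
	 * until its replenishment time (the current scheduling deadline).
	 */
}

/* Rule applied when the replenishment time of a throttled task arrives. */
void cbs_replenish(struct cbs_state *s)
{
	s->sched_deadline += s->period;
	s->cur_runtime += (int64_t)s->runtime;
}
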

3. Scheduling Real-Time Tasks
=============================

 * BIG FAT WARNING ******************************************************
 *
 * This section contains a (not-thorough) summary of classical deadline
 * scheduling theory, and how it applies to SCHED_DEADLINE.
 * The reader can "safely" skip to Section 4 if only interested in seeing
 * how the scheduling policy can be used. Anyway, we strongly recommend
 * coming back here and continuing reading (once the urge for testing is
 * satisfied :P) to be sure of fully understanding all technical details.
 ************************************************************************

There are no limitations on what kind of task can exploit this new
scheduling discipline, even if it must be said that it is particularly
suited for periodic or sporadic real-time tasks that need guarantees on their
timing behavior, e.g., multimedia, streaming, control applications, etc.

A typical real-time task is composed of a repetition of computation phases
(task instances, or jobs) which are activated in a periodic or sporadic
fashion.
Each job J_j (where J_j is the j^th job of the task) is characterised by an
arrival time r_j (the time when the job starts), an amount of computation
time c_j needed to finish the job, and a job absolute deadline d_j, which
is the time within which the job should be finished. The maximum execution
time max_j{c_j} is called the "Worst Case Execution Time" (WCET) of the task.
A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
d_j = r_j + D, where D is the task's relative deadline.

SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
their jobs' deadlines are respected. In order to do this, a task
must be scheduled by setting:

  - runtime >= WCET
  - deadline = D
  - period <= P

IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
and the absolute deadlines (d_j) coincide, so a proper admission control
makes it possible to respect the jobs' absolute deadlines for this task (this
is what is called the "hard schedulability property" and is an extension of
Lemma 1 of [2]).
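
As a concrete (made-up) instance, consider a sporadic task whose jobs arrive
at least 100 ms apart (P = 100 ms), execute for at most 20 ms (WCET = 20 ms)
and must complete before the next arrival (D = 100 ms): setting
runtime = 20 ms and deadline = period = 100 ms satisfies the conditions
above, so, once admitted, every job finishes by its absolute deadline d_j.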

References:
  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming
      in a hard-real-time environment. Journal of the Association for
      Computing Machinery, 20(1), 1973.
  2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard
      Real-Time Systems. Proceedings of the 19th IEEE Real-Time Systems
      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
      Technical Report. http://xoomer.virgilio.it/lucabe72/pubs/tr-98-01.ps

4. Bandwidth management
=======================

In order for the -deadline scheduling to be effective and useful, it is
important to have some method to keep the allocation of the available CPU
bandwidth to the tasks under control.
This is usually called "admission control": if it is not performed at all,
no guarantee can be given on the actual scheduling of the -deadline tasks.

Since RT-throttling was introduced, each task group has had an associated
bandwidth, calculated as a certain amount of runtime over a period.
Moreover, to make it possible to manipulate such bandwidth, readable/writable
controls have been added to both procfs (for system-wide settings) and
cgroupfs (for per-group settings).
Therefore, the same interface is being used for controlling the bandwidth
distribution to -deadline tasks.

However, more discussion is needed in order to figure out how we want to
manage SCHED_DEADLINE bandwidth at the task group level. Therefore,
SCHED_DEADLINE uses (for now) a less sophisticated, but actually very
sensible, mechanism to ensure that a certain utilization cap is not exceeded
in each root_domain.

Another main difference between deadline bandwidth management and
RT-throttling is that -deadline tasks have bandwidth on their own (while -rt
ones don't!), and thus we don't need a higher-level throttling mechanism to
enforce the desired bandwidth.

4.1 System-wide settings
------------------------

The system-wide settings are configured under the /proc virtual file system.

For now the -rt knobs are used for dl admission control and the -deadline
runtime is accounted against the -rt runtime. We realise that this isn't
entirely desirable; however, it is better to have a small interface for now,
and to be able to change it easily later. The ideal situation (see Section 6,
"Future plans") is to run -rt tasks from a -deadline server; in which case
the -rt bandwidth is a direct subset of dl_bw.

This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created while the sum of their bandwidths stays below:

   M * (sched_rt_runtime_us / sched_rt_period_us)

It is also possible to disable this bandwidth management logic, and thus be
free to oversubscribe the system up to any arbitrary level. This is done by
writing -1 to /proc/sys/kernel/sched_rt_runtime_us.
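
For example, with the default values sched_rt_runtime_us = 950000 and
sched_rt_period_us = 1000000 (see Section 4.3), a root_domain spanning 4 CPUs
accepts -deadline tasks as long as the sum of their runtime/period ratios
stays below 4 * (950000 / 1000000) = 3.8.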


4.2 Task interface
------------------

Specifying a periodic/sporadic task that executes for a given amount of
runtime at each instance, and that is scheduled according to the urgency of
its own timing constraints, needs, in general, a way of declaring:
  - a (maximum/typical) instance execution time,
  - a minimum interval between consecutive instances,
  - a time constraint by which each instance must be completed.

Therefore:
 * a new struct sched_attr, containing all the necessary fields, is
   provided;
 * the new scheduling-related syscalls that manipulate it, i.e.,
   sched_setattr() and sched_getattr(), are implemented.

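For illustration only (not part of this patch), a minimal program along the
following lines can switch itself to SCHED_DEADLINE. It assumes kernel
headers that define __NR_sched_setattr and declares struct sched_attr
locally, since the C library may not provide a wrapper or the structure; the
field names follow those introduced by this series, and the three timing
parameters are passed in nanoseconds.

#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

/* Local copy of the structure used by the new syscalls. */
struct sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;		/* used by SCHED_NORMAL, SCHED_BATCH */
	uint32_t sched_priority;	/* used by SCHED_FIFO, SCHED_RR */
	uint64_t sched_runtime;		/* used by SCHED_DEADLINE (ns) */
	uint64_t sched_deadline;	/* (ns) */
	uint64_t sched_period;		/* (ns) */
};

static int sched_setattr(pid_t pid, const struct sched_attr *attr,
			 unsigned int flags)
{
	return syscall(__NR_sched_setattr, pid, attr, flags);
}

int main(void)
{
	struct sched_attr attr = {
		.size		= sizeof(attr),
		.sched_policy	= SCHED_DEADLINE,
		.sched_runtime	=  10 * 1000 * 1000,	/*  10 ms */
		.sched_deadline	= 100 * 1000 * 1000,	/* 100 ms */
		.sched_period	= 100 * 1000 * 1000,	/* 100 ms */
	};

	/*
	 * pid 0 means the calling task; this needs root privileges, and
	 * admission control must accept the requested bandwidth (here 10%).
	 */
	if (sched_setattr(0, &attr, 0)) {
		perror("sched_setattr");
		return 1;
	}

	for (;;) {
		/* the periodic/sporadic work of the task would go here */
	}
}

sched_getattr() can be used in a similar way to read the parameters back.
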

4.3 Default behavior
--------------------

The default value for the SCHED_DEADLINE bandwidth is rt_runtime equal to
950000; with rt_period equal to 1000000, this means that, by default,
-deadline tasks can use at most 95% of the CPU time, multiplied by the
number of CPUs that compose the root_domain, for each root_domain.

A -deadline task cannot fork.

5. Tasks CPU affinity
=====================

-deadline tasks cannot have an affinity mask smaller than the entire
root_domain they are created on. However, affinities can be specified
through the cpuset facility (Documentation/cgroups/cpusets.txt).

5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------

An example of a simple configuration (pin a -deadline task to CPU0)
follows (rt-app is used to create a -deadline task).

mkdir /dev/cpuset
mount -t cgroup -o cpuset cpuset /dev/cpuset
cd /dev/cpuset
mkdir cpu0
echo 0 > cpu0/cpuset.cpus
echo 0 > cpu0/cpuset.mems
echo 1 > cpuset.cpu_exclusive
echo 0 > cpuset.sched_load_balance
echo 1 > cpu0/cpuset.cpu_exclusive
echo 1 > cpu0/cpuset.mem_exclusive
echo $$ > cpu0/tasks
rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
task affinity)

6. Future plans
===============

Still missing:

 - refinements to deadline inheritance, especially regarding the possibility
   of retaining bandwidth isolation among non-interacting tasks. This is
   being studied from both theoretical and practical points of view, and
   hopefully we should be able to produce some demonstrative code soon;
 - (c)group based bandwidth management, and maybe scheduling;
 - access control for non-root users (and the related security concerns to
   address): what is the best way to allow unprivileged use of the
   mechanisms, and how can non-root users be prevented from "cheating" the
   system?

As already discussed, we are also planning to merge this work with the EDF
throttling patches [https://lkml.org/lkml/2010/2/23/239], but we are still in
the preliminary phases of the merge and we really seek feedback that would
help us decide on the direction it should take.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7fea865a810d..656cd70eb577 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4347,7 +4347,9 @@ SYSCALL_DEFINE2(sched_rr_get_interval, pid_t, pid,
 		goto out_unlock;
 
 	rq = task_rq_lock(p, &flags);
-	time_slice = p->sched_class->get_rr_interval(rq, p);
+	time_slice = 0;
+	if (p->sched_class->get_rr_interval)
+		time_slice = p->sched_class->get_rr_interval(rq, p);
 	task_rq_unlock(rq, p, &flags);
 
 	rcu_read_unlock();
diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
index 0de248202879..0dd5e0971a07 100644
--- a/kernel/sched/deadline.c
+++ b/kernel/sched/deadline.c
@@ -351,7 +351,8 @@ static void replenish_dl_entity(struct sched_dl_entity *dl_se,
 	 * disrupting the schedulability of the system. Otherwise, we should
 	 * refill the runtime and set the deadline a period in the future,
 	 * because keeping the current (absolute) deadline of the task would
-	 * result in breaking guarantees promised to other tasks.
+	 * result in breaking guarantees promised to other tasks (refer to
+	 * Documentation/scheduler/sched-deadline.txt for more information).
 	 *
 	 * This function returns true if:
 	 *
