| author | Viktor Radnai <viktor.radnai@gmail.com> | 2008-04-19 13:45:01 -0400 |
|---|---|---|
| committer | Ingo Molnar <mingo@elte.hu> | 2008-04-19 13:45:01 -0400 |
| commit | b9b158fe1ca2c166ff918de30cb098eafcae487a (patch) | |
| tree | 4320cfc00f7910444061479f652d0f94787b7c31 | |
| parent | c24b7c524421f9ea9d9ebab55f80cfb1f3fb77a3 (diff) | |
sched: better rt-group documentation
Viktor was nice enough to enhance the document based on my replies to
his questions on the subject.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
| -rw-r--r-- | Documentation/scheduler/sched-rt-group.txt | 188 | ||||
| -rw-r--r-- | init/Kconfig | 7 |
2 files changed, 160 insertions, 35 deletions
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index 1c6332f4543c..14f901f639ee 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt | |||
| @@ -1,59 +1,177 @@ | |||
| 1 | Real-Time group scheduling | ||
| 2 | -------------------------- | ||
| 1 | 3 | ||
| 4 | CONTENTS | ||
| 5 | ======== | ||
| 2 | 6 | ||
| 3 | Real-Time group scheduling. | 7 | 1. Overview |
| 8 | 1.1 The problem | ||
| 9 | 1.2 The solution | ||
| 10 | 2. The interface | ||
| 11 | 2.1 System-wide settings | ||
| 12 | 2.2 Default behaviour | ||
| 13 | 2.3 Basis for grouping tasks | ||
| 14 | 3. Future plans | ||
| 4 | 15 | ||
| 5 | The problem space: | ||
| 6 | 16 | ||
| 7 | In order to schedule multiple groups of realtime tasks each group must | 17 | 1. Overview |
| 8 | be assigned a fixed portion of the CPU time available. Without a minimum | 18 | =========== |
| 9 | guarantee a realtime group can obviously fall short. A fuzzy upper limit | ||
| 10 | is of no use since it cannot be relied upon. Which leaves us with just | ||
| 11 | the single fixed portion. | ||
| 12 | 19 | ||
| 13 | CPU time is divided by means of specifying how much time can be spent | ||
| 14 | running in a given period. Say a frame fixed realtime renderer must | ||
| 15 | deliver 25 frames a second, which yields a period of 0.04s. Now say | ||
| 16 | it will also have to play some music and respond to input, leaving it | ||
| 17 | with around 80% for the graphics. We can then give this group a runtime | ||
| 18 | of 0.8 * 0.04s = 0.032s. | ||
| 19 | 20 | ||
| 20 | This way the graphics group will have a 0.04s period with a 0.032s runtime | 21 | 1.1 The problem |
| 21 | limit. | 22 | --------------- |
| 22 | 23 | ||
| 23 | Now if the audio thread needs to refill the DMA buffer every 0.005s, but | 24 | Realtime scheduling is all about determinism: a group has to be able to rely on |
| 24 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s | 25 | the amount of bandwidth (e.g. CPU time) being constant. In order to schedule |
| 25 | = 0.00015s. | 26 | multiple groups of realtime tasks, each group must be assigned a fixed portion |
| 27 | of the CPU time available. Without a minimum guarantee a realtime group can | ||
| 28 | obviously fall short. A fuzzy upper limit is of no use since it cannot be | ||
| 29 | relied upon. Which leaves us with just the single fixed portion. | ||
| 26 | 30 | ||
| 31 | 1.2 The solution | ||
| 32 | ---------------- | ||
| 27 | 33 | ||
| 28 | The Interface: | 34 | CPU time is divided by means of specifying how much time can be spent running |
| 35 | in a given period. We allocate this "run time" for each realtime group which | ||
| 36 | the other realtime groups will not be permitted to use. | ||
| 29 | 37 | ||
| 30 | system wide: | 38 | Any time not allocated to a realtime group will be used to run normal priority |
| 39 | tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by | ||
| 40 | SCHED_OTHER. | ||
| 31 | 41 | ||
| 32 | /proc/sys/kernel/sched_rt_period_ms | 42 | Let's consider an example: a frame fixed realtime renderer must deliver 25 |
| 33 | /proc/sys/kernel/sched_rt_runtime_us | 43 | frames a second, which yields a period of 0.04s per frame. Now say it will also |
| 44 | have to play some music and respond to input, leaving it with around 80% CPU | ||
| 45 | time dedicated for the graphics. We can then give this group a run time of 0.8 | ||
| 46 | * 0.04s = 0.032s. | ||
| 34 | 47 | ||
| 35 | CONFIG_FAIR_USER_SCHED | 48 | This way the graphics group will have a 0.04s period with a 0.032s run time |
| 49 | limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but | ||
| 50 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = | ||
| 51 | 0.00015s. So this group can be scheduled with a period of 0.005s and a run time | ||
| 52 | of 0.00015s. | ||
| 36 | 53 | ||
| 37 | /sys/kernel/uids/<uid>/cpu_rt_runtime_us | 54 | The remaining CPU time will be used for user input and other tasks. Because |
| 55 | realtime tasks have explicitly allocated the CPU time they need to perform | ||
| 56 | their tasks, buffer underruns in the graphics or audio can be eliminated. | ||
| 38 | 57 | ||
| 39 | or | 58 | NOTE: the above example is not fully implemented as of yet (2.6.25). We still |
| 59 | lack an EDF scheduler to make non-uniform periods usable. | ||
| 40 | 60 | ||
| 41 | CONFIG_FAIR_CGROUP_SCHED | ||
| 42 | 61 | ||
| 43 | /cgroup/<cgroup>/cpu.rt_runtime_us | 62 | 2. The Interface |
| 63 | ================ | ||
| 44 | 64 | ||
| 45 | [ time is specified in us because the interface is s32; this gives an | ||
| 46 | operating range of ~35m to 1us ] | ||
| 47 | 65 | ||
| 48 | The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ]. | 66 | 2.1 System wide settings |
| 67 | ------------------------ | ||
| 49 | 68 | ||
| 50 | A runtime of -1 specifies runtime == period, ie. no limit. | 69 | The system wide settings are configured under the /proc virtual file system: |
| 51 | 70 | ||
| 52 | New groups get the period from /proc/sys/kernel/sched_rt_period_us and | 71 | /proc/sys/kernel/sched_rt_period_us: |
| 53 | a runtime of 0. | 72 | The scheduling period that is equivalent to 100% CPU bandwidth |
| 54 | 73 | ||
| 55 | Settings are constrained to: | 74 | /proc/sys/kernel/sched_rt_runtime_us: |
| 75 | A global limit on how much time realtime scheduling may use. Even without | ||
| 76 | CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime | ||
| 77 | processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth | ||
| 78 | available to all realtime groups. | ||
| 79 | |||
| 80 | * Time is specified in us because the interface is s32. This gives an | ||
| 81 | operating range from 1us to about 35 minutes. | ||
| 82 | * sched_rt_period_us takes values from 1 to INT_MAX. | ||
| 83 | * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1). | ||
| 84 | * A run time of -1 specifies runtime == period, ie. no limit. | ||
| 85 | |||
| 86 | |||
| 87 | 2.2 Default behaviour | ||
| 88 | --------------------- | ||
| 89 | |||
| 90 | The default values are 1000000 (1s) for sched_rt_period_us and 950000 | ||
| 91 | (0.95s) for sched_rt_runtime_us. This leaves 0.05s of each period for | ||
| 92 | SCHED_OTHER (non-RT tasks). These defaults were chosen so that run-away | ||
| 93 | realtime tasks will not lock up the machine but leave a little time to recover | ||
| 94 | it. By setting runtime to -1 you'd get the old behaviour back. | ||
| 95 | |||
| 96 | By default all bandwidth is assigned to the root group and new groups get the | ||
| 97 | period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you | ||
| 98 | want to assign bandwidth to another group, reduce the root group's bandwidth | ||
| 99 | and assign some or all of the difference to another group. | ||
| 100 | |||
| 101 | Realtime group scheduling means you have to assign a portion of total CPU | ||
| 102 | bandwidth to the group before it will accept realtime tasks. Therefore you will | ||
| 103 | not be able to run realtime tasks as any user other than root until you have | ||
| 104 | done that, even if the user has the rights to run processes with realtime | ||
| 105 | priority! | ||
| 106 | |||
| 107 | |||
| 108 | 2.3 Basis for grouping tasks | ||
| 109 | ---------------------------- | ||
| 110 | |||
| 111 | There are two compile-time settings for allocating CPU bandwidth. These are | ||
| 112 | configured using the "Basis for grouping tasks" multiple choice menu under | ||
| 113 | General setup > Group CPU Scheduler: | ||
| 114 | |||
| 115 | a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id") | ||
| 116 | |||
| 117 | This lets you use the virtual files under | ||
| 118 | "/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control the CPU time reserved for | ||
| 119 | each user. | ||
| 120 | |||
| 121 | The other option is: | ||
| 122 | |||
| 123 | b. CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") | ||
| 124 | |||
| 125 | This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us" | ||
| 126 | to control the CPU time reserved for each control group instead. | ||
| 127 | |||
| 128 | For more information on working with control groups, you should read | ||
| 129 | Documentation/cgroups.txt as well. | ||
| 130 | |||
| 131 | Group settings are checked against the following limits in order to keep the configuration | ||
| 132 | schedulable: | ||
| 56 | 133 | ||
| 57 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period | 134 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period |
| 58 | 135 | ||
| 59 | in order to keep the configuration schedulable. | 136 | For now, this can be simplified to just the following (but see Future plans): |
| 137 | |||
| 138 | \Sum_{i} runtime_{i} <= global_runtime | ||
| 139 | |||
| 140 | |||
| 141 | 3. Future plans | ||
| 142 | =============== | ||
| 143 | |||
| 144 | There is work in progress to make the scheduling period for each group | ||
| 145 | ("/sys/kernel/uids/<uid>/cpu_rt_period_us" or | ||
| 146 | "/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well. | ||
| 147 | |||
| 148 | The constraint on the period is that a subgroup must have a smaller or | ||
| 149 | equal period to its parent. But realistically it's not very useful _yet_ | ||
| 150 | as it's prone to starvation without deadline scheduling. | ||
| 151 | |||
| 152 | Consider two sibling groups A and B; both have 50% bandwidth, but A's | ||
| 153 | period is twice the length of B's. | ||
| 154 | |||
| 155 | * group A: period=100000us, runtime=50000us | ||
| 156 | - this runs for 0.05s once every 0.1s | ||
| 157 | |||
| 158 | * group B: period= 50000us, runtime=25000us | ||
| 159 | - this runs for 0.025s twice every 0.1s (or once every 0.05 sec). | ||
| 160 | |||
| 161 | This means that currently a while (1) loop in A will run for the full period of | ||
| 162 | B and can starve B's tasks (assuming they are of lower priority) for a whole | ||
| 163 | period. | ||
| 164 | |||
| 165 | The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring | ||
| 166 | full deadline scheduling to the linux kernel. Deadline scheduling the above | ||
| 167 | groups and treating end of the period as a deadline will ensure that they both | ||
| 168 | get their allocated time. | ||
| 169 | |||
| 170 | Implementing SCHED_EDF might take a while to complete. Priority Inheritance is | ||
| 171 | the biggest challenge as the current linux PI infrastructure is geared towards | ||
| 172 | the limited static priority levels 0-139. With deadline scheduling you need to | ||
| 173 | do deadline inheritance (since priority is inversely proportional to the | ||
| 174 | deadline delta (deadline - now)). | ||
| 175 | |||
| 176 | This means the whole PI machinery will have to be reworked - and that is one of | ||
| 177 | the most complex pieces of code we have. | ||
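
The arithmetic in section 1.2 and the simplified limit in section 2.3 of the new document can be sketched in a few lines of Python. This is only an illustration of the formulas in the text, not kernel code: the helper names `runtime_us` and `is_schedulable` are made up for this sketch, and the group run times used in the check are hypothetical values chosen to sit on either side of the default 950000us global run time.

```python
from fractions import Fraction


def runtime_us(period_us: int, cpu_share: Fraction) -> int:
    """Run time budget in microseconds for a group that needs
    `cpu_share` of every period (hypothetical helper)."""
    return int(period_us * cpu_share)


def is_schedulable(group_runtimes_us, global_runtime_us: int) -> bool:
    """Simplified check from section 2.3: with one shared global period,
    the group run times must not add up to more than the global run time."""
    return sum(group_runtimes_us) <= global_runtime_us


# Section 1.2: 25 frames a second gives a 0.04s (40000us) period,
# of which the graphics group gets 80%.
graphics = runtime_us(40_000, Fraction(80, 100))

# Section 1.2: the audio thread refills its buffer every 0.005s (5000us)
# and needs about 3% of that period.
audio = runtime_us(5_000, Fraction(3, 100))

print(graphics, audio)

# Hypothetical group budgets checked against the default global run time.
print(is_schedulable([500_000, 400_000], 950_000))
print(is_schedulable([500_000, 500_000], 950_000))
```

Exact fractions are used for the shares so that values such as 3% do not pick up floating point rounding before the result is truncated to whole microseconds.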
diff --git a/init/Kconfig b/init/Kconfig index 7fccf09bb95a..ba3a389fab94 100644 --- a/init/Kconfig +++ b/init/Kconfig | |||
| @@ -328,6 +328,13 @@ config RT_GROUP_SCHED | |||
| 328 | depends on EXPERIMENTAL | 328 | depends on EXPERIMENTAL |
| 329 | depends on GROUP_SCHED | 329 | depends on GROUP_SCHED |
| 330 | default n | 330 | default n |
| 331 | help | ||
| 332 | This feature lets you explicitly allocate real CPU bandwidth | ||
| 333 | to users or control groups (depending on the "Basis for grouping tasks" | ||
| 334 | setting below). If enabled, it will also make it impossible to | ||
| 335 | schedule realtime tasks for non-root users until you allocate | ||
| 336 | realtime bandwidth for them. | ||
| 337 | See Documentation/scheduler/sched-rt-group.txt for more information. | ||
| 331 | 338 | ||
| 332 | choice | 339 | choice |
| 333 | depends on GROUP_SCHED | 340 | depends on GROUP_SCHED |
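
As a rough aid for reading the system-wide settings that the documentation in this commit describes, the following Python sketch turns a period and run time pair into the fraction of CPU time available to realtime tasks. The function names are hypothetical, the input validation is an assumption of this sketch rather than a statement about what the kernel accepts, and only the two /proc paths come from the document itself.

```python
from pathlib import Path

# Paths taken from section 2.1 of the document.
PERIOD_PATH = "/proc/sys/kernel/sched_rt_period_us"
RUNTIME_PATH = "/proc/sys/kernel/sched_rt_runtime_us"


def read_setting(path: str) -> int:
    """Read one integer setting from a /proc file (hypothetical helper)."""
    return int(Path(path).read_text())


def rt_bandwidth(period_us: int, runtime_us: int) -> float:
    """Fraction of each period that realtime tasks may use.

    A run time of -1 means runtime == period, i.e. no limit.
    The range checks below are assumptions made for this sketch.
    """
    if period_us < 1:
        raise ValueError("period must be at least 1us")
    if runtime_us < -1:
        raise ValueError("run time must be -1 or non-negative")
    if runtime_us == -1:
        return 1.0
    return runtime_us / period_us


# Defaults quoted in section 2.2: a 1s period and 0.95s of run time.
default_share = rt_bandwidth(1_000_000, 950_000)
left_for_other_us = 1_000_000 - 950_000

print(default_share, left_for_other_us)
print(rt_bandwidth(1_000_000, -1))
```

On a running system the two values could be obtained with `read_setting(PERIOD_PATH)` and `read_setting(RUNTIME_PATH)`, provided those files exist on the kernel in question.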
