diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/cpusets.txt | 72 | ||||
-rw-r--r-- | Documentation/scheduler/sched-rt-group.txt | 188 |
2 files changed, 223 insertions, 37 deletions
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index ad2bb3b3acc1..aa854b9b18cd 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt | |||
@@ -8,6 +8,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. | |||
8 | Modified by Paul Jackson <pj@sgi.com> | 8 | Modified by Paul Jackson <pj@sgi.com> |
9 | Modified by Christoph Lameter <clameter@sgi.com> | 9 | Modified by Christoph Lameter <clameter@sgi.com> |
10 | Modified by Paul Menage <menage@google.com> | 10 | Modified by Paul Menage <menage@google.com> |
11 | Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> | ||
11 | 12 | ||
12 | CONTENTS: | 13 | CONTENTS: |
13 | ========= | 14 | ========= |
@@ -20,7 +21,8 @@ CONTENTS: | |||
20 | 1.5 What is memory_pressure ? | 21 | 1.5 What is memory_pressure ? |
21 | 1.6 What is memory spread ? | 22 | 1.6 What is memory spread ? |
22 | 1.7 What is sched_load_balance ? | 23 | 1.7 What is sched_load_balance ? |
23 | 1.8 How do I use cpusets ? | 24 | 1.8 What is sched_relax_domain_level ? |
25 | 1.9 How do I use cpusets ? | ||
24 | 2. Usage Examples and Syntax | 26 | 2. Usage Examples and Syntax |
25 | 2.1 Basic Usage | 27 | 2.1 Basic Usage |
26 | 2.2 Adding/removing cpus | 28 | 2.2 Adding/removing cpus |
@@ -497,7 +499,73 @@ the cpuset code to update these sched domains, it compares the new | |||
497 | partition requested with the current, and updates its sched domains, | 499 | partition requested with the current, and updates its sched domains, |
498 | removing the old and adding the new, for each change. | 500 | removing the old and adding the new, for each change. |
499 | 501 | ||
500 | 1.8 How do I use cpusets ? | 502 | |
503 | 1.8 What is sched_relax_domain_level ? | ||
504 | -------------------------------------- | ||
505 | |||
506 | In sched domain, the scheduler migrates tasks in 2 ways; periodic load | ||
507 | balance on tick, and at time of some schedule events. | ||
508 | |||
509 | When a task is woken up, scheduler try to move the task on idle CPU. | ||
510 | For example, if a task A running on CPU X activates another task B | ||
511 | on the same CPU X, and if CPU Y is X's sibling and performing idle, | ||
512 | then scheduler migrate task B to CPU Y so that task B can start on | ||
513 | CPU Y without waiting task A on CPU X. | ||
514 | |||
515 | And if a CPU run out of tasks in its runqueue, the CPU try to pull | ||
516 | extra tasks from other busy CPUs to help them before it is going to | ||
517 | be idle. | ||
518 | |||
519 | Of course it takes some searching cost to find movable tasks and/or | ||
520 | idle CPUs, the scheduler might not search all CPUs in the domain | ||
521 | everytime. In fact, in some architectures, the searching ranges on | ||
522 | events are limited in the same socket or node where the CPU locates, | ||
523 | while the load balance on tick searchs all. | ||
524 | |||
525 | For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | ||
526 | is idle while CPU X and the siblings are busy, scheduler can't migrate | ||
527 | woken task B from X to Z since it is out of its searching range. | ||
528 | As the result, task B on CPU X need to wait task A or wait load balance | ||
529 | on the next tick. For some applications in special situation, waiting | ||
530 | 1 tick may be too long. | ||
531 | |||
532 | The 'sched_relax_domain_level' file allows you to request changing | ||
533 | this searching range as you like. This file takes int value which | ||
534 | indicates size of searching range in levels ideally as follows, | ||
535 | otherwise initial value -1 that indicates the cpuset has no request. | ||
536 | |||
537 | -1 : no request. use system default or follow request of others. | ||
538 | 0 : no search. | ||
539 | 1 : search siblings (hyperthreads in a core). | ||
540 | 2 : search cores in a package. | ||
541 | 3 : search cpus in a node [= system wide on non-NUMA system] | ||
542 | ( 4 : search nodes in a chunk of node [on NUMA system] ) | ||
543 | ( 5~ : search system wide [on NUMA system]) | ||
544 | |||
545 | This file is per-cpuset and affect the sched domain where the cpuset | ||
546 | belongs to. Therefore if the flag 'sched_load_balance' of a cpuset | ||
547 | is disabled, then 'sched_relax_domain_level' have no effect since | ||
548 | there is no sched domain belonging the cpuset. | ||
549 | |||
550 | If multiple cpusets are overlapping and hence they form a single sched | ||
551 | domain, the largest value among those is used. Be careful, if one | ||
552 | requests 0 and others are -1 then 0 is used. | ||
553 | |||
554 | Note that modifying this file will have both good and bad effects, | ||
555 | and whether it is acceptable or not will be depend on your situation. | ||
556 | Don't modify this file if you are not sure. | ||
557 | |||
558 | If your situation is: | ||
559 | - The migration costs between each cpu can be assumed considerably | ||
560 | small(for you) due to your special application's behavior or | ||
561 | special hardware support for CPU cache etc. | ||
562 | - The searching cost doesn't have impact(for you) or you can make | ||
563 | the searching cost enough small by managing cpuset to compact etc. | ||
564 | - The latency is required even it sacrifices cache hit rate etc. | ||
565 | then increasing 'sched_relax_domain_level' would benefit you. | ||
566 | |||
567 | |||
568 | 1.9 How do I use cpusets ? | ||
501 | -------------------------- | 569 | -------------------------- |
502 | 570 | ||
503 | In order to minimize the impact of cpusets on critical kernel | 571 | In order to minimize the impact of cpusets on critical kernel |
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index 1c6332f4543c..14f901f639ee 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt | |||
@@ -1,59 +1,177 @@ | |||
1 | Real-Time group scheduling | ||
2 | -------------------------- | ||
1 | 3 | ||
4 | CONTENTS | ||
5 | ======== | ||
2 | 6 | ||
3 | Real-Time group scheduling. | 7 | 1. Overview |
8 | 1.1 The problem | ||
9 | 1.2 The solution | ||
10 | 2. The interface | ||
11 | 2.1 System-wide settings | ||
12 | 2.2 Default behaviour | ||
13 | 2.3 Basis for grouping tasks | ||
14 | 3. Future plans | ||
4 | 15 | ||
5 | The problem space: | ||
6 | 16 | ||
7 | In order to schedule multiple groups of realtime tasks each group must | 17 | 1. Overview |
8 | be assigned a fixed portion of the CPU time available. Without a minimum | 18 | =========== |
9 | guarantee a realtime group can obviously fall short. A fuzzy upper limit | ||
10 | is of no use since it cannot be relied upon. Which leaves us with just | ||
11 | the single fixed portion. | ||
12 | 19 | ||
13 | CPU time is divided by means of specifying how much time can be spent | ||
14 | running in a given period. Say a frame fixed realtime renderer must | ||
15 | deliver 25 frames a second, which yields a period of 0.04s. Now say | ||
16 | it will also have to play some music and respond to input, leaving it | ||
17 | with around 80% for the graphics. We can then give this group a runtime | ||
18 | of 0.8 * 0.04s = 0.032s. | ||
19 | 20 | ||
20 | This way the graphics group will have a 0.04s period with a 0.032s runtime | 21 | 1.1 The problem |
21 | limit. | 22 | --------------- |
22 | 23 | ||
23 | Now if the audio thread needs to refill the DMA buffer every 0.005s, but | 24 | Realtime scheduling is all about determinism, a group has to be able to rely on |
24 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s | 25 | the amount of bandwidth (eg. CPU time) being constant. In order to schedule |
25 | = 0.00015s. | 26 | multiple groups of realtime tasks, each group must be assigned a fixed portion |
27 | of the CPU time available. Without a minimum guarantee a realtime group can | ||
28 | obviously fall short. A fuzzy upper limit is of no use since it cannot be | ||
29 | relied upon. Which leaves us with just the single fixed portion. | ||
26 | 30 | ||
31 | 1.2 The solution | ||
32 | ---------------- | ||
27 | 33 | ||
28 | The Interface: | 34 | CPU time is divided by means of specifying how much time can be spent running |
35 | in a given period. We allocate this "run time" for each realtime group which | ||
36 | the other realtime groups will not be permitted to use. | ||
29 | 37 | ||
30 | system wide: | 38 | Any time not allocated to a realtime group will be used to run normal priority |
39 | tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by | ||
40 | SCHED_OTHER. | ||
31 | 41 | ||
32 | /proc/sys/kernel/sched_rt_period_ms | 42 | Let's consider an example: a frame fixed realtime renderer must deliver 25 |
33 | /proc/sys/kernel/sched_rt_runtime_us | 43 | frames a second, which yields a period of 0.04s per frame. Now say it will also |
44 | have to play some music and respond to input, leaving it with around 80% CPU | ||
45 | time dedicated for the graphics. We can then give this group a run time of 0.8 | ||
46 | * 0.04s = 0.032s. | ||
34 | 47 | ||
35 | CONFIG_FAIR_USER_SCHED | 48 | This way the graphics group will have a 0.04s period with a 0.032s run time |
49 | limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but | ||
50 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = | ||
51 | 0.00015s. So this group can be scheduled with a period of 0.005s and a run time | ||
52 | of 0.00015s. | ||
36 | 53 | ||
37 | /sys/kernel/uids/<uid>/cpu_rt_runtime_us | 54 | The remaining CPU time will be used for user input and other tass. Because |
55 | realtime tasks have explicitly allocated the CPU time they need to perform | ||
56 | their tasks, buffer underruns in the graphocs or audio can be eliminated. | ||
38 | 57 | ||
39 | or | 58 | NOTE: the above example is not fully implemented as of yet (2.6.25). We still |
59 | lack an EDF scheduler to make non-uniform periods usable. | ||
40 | 60 | ||
41 | CONFIG_FAIR_CGROUP_SCHED | ||
42 | 61 | ||
43 | /cgroup/<cgroup>/cpu.rt_runtime_us | 62 | 2. The Interface |
63 | ================ | ||
44 | 64 | ||
45 | [ time is specified in us because the interface is s32; this gives an | ||
46 | operating range of ~35m to 1us ] | ||
47 | 65 | ||
48 | The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ]. | 66 | 2.1 System wide settings |
67 | ------------------------ | ||
49 | 68 | ||
50 | A runtime of -1 specifies runtime == period, ie. no limit. | 69 | The system wide settings are configured under the /proc virtual file system: |
51 | 70 | ||
52 | New groups get the period from /proc/sys/kernel/sched_rt_period_us and | 71 | /proc/sys/kernel/sched_rt_period_us: |
53 | a runtime of 0. | 72 | The scheduling period that is equivalent to 100% CPU bandwidth |
54 | 73 | ||
55 | Settings are constrained to: | 74 | /proc/sys/kernel/sched_rt_runtime_us: |
75 | A global limit on how much time realtime scheduling may use. Even without | ||
76 | CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime | ||
77 | processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth | ||
78 | available to all realtime groups. | ||
79 | |||
80 | * Time is specified in us because the interface is s32. This gives an | ||
81 | operating range from 1us to about 35 minutes. | ||
82 | * sched_rt_period_us takes values from 1 to INT_MAX. | ||
83 | * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1). | ||
84 | * A run time of -1 specifies runtime == period, ie. no limit. | ||
85 | |||
86 | |||
87 | 2.2 Default behaviour | ||
88 | --------------------- | ||
89 | |||
90 | The default values for sched_rt_period_us (1000000 or 1s) and | ||
91 | sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by | ||
92 | SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away | ||
93 | realtime tasks will not lock up the machine but leave a little time to recover | ||
94 | it. By setting runtime to -1 you'd get the old behaviour back. | ||
95 | |||
96 | By default all bandwidth is assigned to the root group and new groups get the | ||
97 | period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you | ||
98 | want to assign bandwidth to another group, reduce the root group's bandwidth | ||
99 | and assign some or all of the difference to another group. | ||
100 | |||
101 | Realtime group scheduling means you have to assign a portion of total CPU | ||
102 | bandwidth to the group before it will accept realtime tasks. Therefore you will | ||
103 | not be able to run realtime tasks as any user other than root until you have | ||
104 | done that, even if the user has the rights to run processes with realtime | ||
105 | priority! | ||
106 | |||
107 | |||
108 | 2.3 Basis for grouping tasks | ||
109 | ---------------------------- | ||
110 | |||
111 | There are two compile-time settings for allocating CPU bandwidth. These are | ||
112 | configured using the "Basis for grouping tasks" multiple choice menu under | ||
113 | General setup > Group CPU Scheduler: | ||
114 | |||
115 | a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id") | ||
116 | |||
117 | This lets you use the virtual files under | ||
118 | "/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for | ||
119 | each user . | ||
120 | |||
121 | The other option is: | ||
122 | |||
123 | .o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") | ||
124 | |||
125 | This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us" | ||
126 | to control the CPU time reserved for each control group instead. | ||
127 | |||
128 | For more information on working with control groups, you should read | ||
129 | Documentation/cgroups.txt as well. | ||
130 | |||
131 | Group settings are checked against the following limits in order to keep the configuration | ||
132 | schedulable: | ||
56 | 133 | ||
57 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period | 134 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period |
58 | 135 | ||
59 | in order to keep the configuration schedulable. | 136 | For now, this can be simplified to just the following (but see Future plans): |
137 | |||
138 | \Sum_{i} runtime_{i} <= global_runtime | ||
139 | |||
140 | |||
141 | 3. Future plans | ||
142 | =============== | ||
143 | |||
144 | There is work in progress to make the scheduling period for each group | ||
145 | ("/sys/kernel/uids/<uid>/cpu_rt_period_us" or | ||
146 | "/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well. | ||
147 | |||
148 | The constraint on the period is that a subgroup must have a smaller or | ||
149 | equal period to its parent. But realistically its not very useful _yet_ | ||
150 | as its prone to starvation without deadline scheduling. | ||
151 | |||
152 | Consider two sibling groups A and B; both have 50% bandwidth, but A's | ||
153 | period is twice the length of B's. | ||
154 | |||
155 | * group A: period=100000us, runtime=10000us | ||
156 | - this runs for 0.01s once every 0.1s | ||
157 | |||
158 | * group B: period= 50000us, runtime=10000us | ||
159 | - this runs for 0.01s twice every 0.1s (or once every 0.05 sec). | ||
160 | |||
161 | This means that currently a while (1) loop in A will run for the full period of | ||
162 | B and can starve B's tasks (assuming they are of lower priority) for a whole | ||
163 | period. | ||
164 | |||
165 | The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring | ||
166 | full deadline scheduling to the linux kernel. Deadline scheduling the above | ||
167 | groups and treating end of the period as a deadline will ensure that they both | ||
168 | get their allocated time. | ||
169 | |||
170 | Implementing SCHED_EDF might take a while to complete. Priority Inheritance is | ||
171 | the biggest challenge as the current linux PI infrastructure is geared towards | ||
172 | the limited static priority levels 0-139. With deadline scheduling you need to | ||
173 | do deadline inheritance (since priority is inversely proportional to the | ||
174 | deadline delta (deadline - now). | ||
175 | |||
176 | This means the whole PI machinery will have to be reworked - and that is one of | ||
177 | the most complex pieces of code we have. | ||