Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/DocBook/kernel-locking.tmpl             |   6
-rw-r--r--  Documentation/cpusets.txt                             |  72
-rw-r--r--  Documentation/feature-removal-schedule.txt            |  15
-rw-r--r--  Documentation/kernel-parameters.txt                   |  10
-rw-r--r--  Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c |  96
-rw-r--r--  Documentation/prctl/disable-tsc-on-off-stress-test.c  |  95
-rw-r--r--  Documentation/prctl/disable-tsc-test.c                |  94
-rw-r--r--  Documentation/scheduler/sched-rt-group.txt            | 188
8 files changed, 535 insertions, 41 deletions
diff --git a/Documentation/DocBook/kernel-locking.tmpl b/Documentation/DocBook/kernel-locking.tmpl
index 2e9d6b41f034..435413ca40dc 100644
--- a/Documentation/DocBook/kernel-locking.tmpl
+++ b/Documentation/DocBook/kernel-locking.tmpl
@@ -241,7 +241,7 @@
 </para>
 <para>
 The third type is a semaphore
-(<filename class="headerfile">include/asm/semaphore.h</filename>): it
+(<filename class="headerfile">include/linux/semaphore.h</filename>): it
 can have more than one holder at any time (the number decided at
 initialization time), although it is most commonly used as a
 single-holder lock (a mutex). If you can't get a semaphore, your
@@ -290,7 +290,7 @@
 <para>
 If you have a data structure which is only ever accessed from
 user context, then you can use a simple semaphore
-(<filename>linux/asm/semaphore.h</filename>) to protect it. This
+(<filename>linux/linux/semaphore.h</filename>) to protect it. This
 is the most trivial case: you initialize the semaphore to the number
 of resources available (usually 1), and call
 <function>down_interruptible()</function> to grab the semaphore, and
@@ -1656,7 +1656,7 @@ the amount of locking which needs to be done.
 #include &lt;linux/slab.h&gt;
 #include &lt;linux/string.h&gt;
 +#include &lt;linux/rcupdate.h&gt;
-#include &lt;asm/semaphore.h&gt;
+#include &lt;linux/semaphore.h&gt;
 #include &lt;asm/errno.h&gt;
 
 struct object
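
To make the semaphore usage described in the hunks above concrete, here is a
minimal sketch (an illustration only, not part of the patch; the structure and
function names are invented for the example). It uses only sema_init(),
down_interruptible() and up() from linux/semaphore.h:

    #include <linux/semaphore.h>
    #include <linux/errno.h>

    static struct semaphore cache_sem;	/* protects cache_count */
    static int cache_count;

    static void cache_init(void)
    {
    	/* initialized to 1, i.e. used as a single-holder lock (a mutex) */
    	sema_init(&cache_sem, 1);
    }

    static int cache_add(int delta)
    {
    	/* sleep until the semaphore is ours; back out on a signal */
    	if (down_interruptible(&cache_sem))
    		return -EINTR;
    	cache_count += delta;
    	up(&cache_sem);
    	return 0;
    }
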
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index ad2bb3b3acc1..aa854b9b18cd 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -8,6 +8,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
 Modified by Paul Jackson <pj@sgi.com>
 Modified by Christoph Lameter <clameter@sgi.com>
 Modified by Paul Menage <menage@google.com>
+Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
 
 CONTENTS:
 =========
@@ -20,7 +21,8 @@ CONTENTS:
   1.5 What is memory_pressure ?
   1.6 What is memory spread ?
   1.7 What is sched_load_balance ?
-  1.8 How do I use cpusets ?
+  1.8 What is sched_relax_domain_level ?
+  1.9 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -497,7 +499,73 @@ the cpuset code to update these sched domains, it compares the new
 partition requested with the current, and updates its sched domains,
 removing the old and adding the new, for each change.
 
-1.8 How do I use cpusets ?
+
+1.8 What is sched_relax_domain_level ?
+--------------------------------------
+
+Within a sched domain, the scheduler migrates tasks in two ways: periodic
+load balancing on the tick, and at certain scheduling events.
+
+When a task is woken up, the scheduler tries to move it to an idle CPU.
+For example, if task A running on CPU X activates another task B on the
+same CPU X, and CPU Y is X's sibling and idle, the scheduler migrates
+task B to CPU Y so that B can start on CPU Y without waiting for task A
+on CPU X.
+
+Likewise, if a CPU runs out of tasks in its runqueue, it tries to pull
+extra tasks from other busy CPUs to help them before it goes idle
+itself.
+
+Of course it costs some search time to find movable tasks and/or idle
+CPUs, so the scheduler might not search all CPUs in the domain every
+time. In fact, on some architectures the search range for these events
+is limited to the same socket or node as the CPU, while the load
+balancing on the tick searches all of them.
+
+For example, assume CPU Z is relatively far from CPU X. Even if CPU Z
+is idle while CPU X and its siblings are busy, the scheduler can't
+migrate woken task B from X to Z because Z is out of its search range.
+As a result, task B on CPU X has to wait for task A or for the load
+balance on the next tick. For some applications in special situations,
+waiting one tick may be too long.
+
+The 'sched_relax_domain_level' file allows you to request changing this
+search range as you like. It takes an integer value indicating the size
+of the search range in levels, ideally as follows; otherwise it holds
+the initial value -1, which indicates that the cpuset has no request:
+
+ -1 : no request. use system default or follow request of others.
+  0 : no search.
+  1 : search siblings (hyperthreads in a core).
+  2 : search cores in a package.
+  3 : search cpus in a node [= system wide on non-NUMA system]
+ ( 4 : search nodes in a chunk of node [on NUMA system] )
+ ( 5~ : search system wide [on NUMA system] )
+
+This file is per-cpuset and affects the sched domain that the cpuset
+belongs to. Therefore if the flag 'sched_load_balance' of a cpuset is
+disabled, 'sched_relax_domain_level' has no effect since there is no
+sched domain belonging to that cpuset.
+
+If multiple cpusets overlap and hence form a single sched domain, the
+largest value among them is used. Be careful: if one cpuset requests 0
+and the others request -1, then 0 is used.
+
+Note that modifying this file can have both good and bad effects, and
+whether it is acceptable or not depends on your situation. Don't modify
+this file if you are not sure.
+
+If your situation is:
+ - the migration costs between CPUs can be assumed to be small (for you)
+   due to your application's behaviour or special hardware support for
+   CPU caches,
+ - the search cost has no impact (for you), or you can keep it small
+   enough, e.g. by keeping cpusets compact, and
+ - low latency is required even at the cost of cache hit rate,
+then increasing 'sched_relax_domain_level' would benefit you.
+
+
+1.9 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel
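
As a concrete illustration of the file described above, here is a minimal
sketch in C (not part of the patch; the cpuset mount point /dev/cpuset and the
cpuset name "rt_set" are assumptions for the example):

    #include <stdio.h>
    #include <stdlib.h>

    /* Request a wider wake-up search range for one cpuset.
     * Assumes the cpuset filesystem is mounted at /dev/cpuset and
     * that a cpuset named "rt_set" already exists.
     */
    int main(void)
    {
    	const char *path = "/dev/cpuset/rt_set/sched_relax_domain_level";
    	FILE *f = fopen(path, "w");

    	if (!f) {
    		perror(path);
    		return EXIT_FAILURE;
    	}
    	/* 2: search all cores in the package on wake-up/idle events */
    	fprintf(f, "%d\n", 2);
    	if (fclose(f) != 0) {
    		perror(path);
    		return EXIT_FAILURE;
    	}
    	return EXIT_SUCCESS;
    }
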
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index af0e9393bf68..b45ea28abc99 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -282,6 +282,13 @@ Why: Not used in-tree. The current out-of-tree users used it to
 	out-of-tree driver.
 Who:	Thomas Gleixner <tglx@linutronix.de>
 
+---------------------------
+
+What:	usedac i386 kernel parameter
+When:	2.6.27
+Why:	replaced by allowdac and no dac combination
+Who:	Glauber Costa <gcosta@redhat.com>
+
 ---------------------------
 
 What:	/sys/o2cb symlink
@@ -291,3 +298,11 @@ Why: /sys/fs/o2cb is the proper location for this information - /sys/o2cb
 	ocfs2-tools. 2 years should be sufficient time to phase in new versions
 	which know to look in /sys/fs/o2cb.
 Who:	ocfs2-devel@oss.oracle.com
+
+---------------------------
+
+What:	asm/semaphore.h
+When:	2.6.26
+Why:	Implementation became generic; users should now include
+	linux/semaphore.h instead.
+Who:	Matthew Wilcox <willy@linux.intel.com>
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 4b0f1ae31a4c..f4839606988b 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1280,8 +1280,16 @@ and is between 256 and 4096 characters. It is defined in the file
 	noexec		[IA-64]
 
 	noexec		[X86-32,X86-64]
+			On X86-32 available only on PAE configured kernels.
 			noexec=on: enable non-executable mappings (default)
-			noexec=off: disable nn-executable mappings
+			noexec=off: disable non-executable mappings
+
+	noexec32	[X86-64]
+			This affects only 32-bit executables.
+			noexec32=on: enable non-executable mappings (default)
+				read doesn't imply executable mappings
+			noexec32=off: disable non-executable mappings
+				read implies executable mappings
 
 	nofxsr		[BUGS=X86-32] Disables x86 floating point extended
 			register save and restore. The kernel will only save
diff --git a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
new file mode 100644
index 000000000000..f8e8e95e81fd
--- /dev/null
+++ b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c
@@ -0,0 +1,96 @@
+/*
+ * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
+ *
+ * Tests if the control register is updated correctly
+ * at context switches
+ *
+ * Warning: this test will cause a very high load for a few seconds
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <signal.h>
+#include <inttypes.h>
+#include <wait.h>
+
+
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+
+/* Get/set the process' ability to use the timestamp counter instruction */
+#ifndef PR_GET_TSC
+#define PR_GET_TSC	25
+#define PR_SET_TSC	26
+# define PR_TSC_ENABLE		1   /* allow the use of the timestamp counter */
+# define PR_TSC_SIGSEGV		2   /* throw a SIGSEGV instead of reading the TSC */
+#endif
+
+uint64_t rdtsc(void) {
+	uint32_t lo, hi;
+	/* We cannot use "=A", since this would use %rax on x86_64 */
+	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
+	return (uint64_t)hi << 32 | lo;
+}
+
+void sigsegv_expect(int sig)
+{
+	/* */
+}
+
+void segvtask(void)
+{
+	if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
+	{
+		perror("prctl");
+		exit(0);
+	}
+	signal(SIGSEGV, sigsegv_expect);
+	alarm(10);
+	rdtsc();
+	fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
+	exit(0);
+}
+
+
+void sigsegv_fail(int sig)
+{
+	fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
+	exit(0);
+}
+
+void rdtsctask(void)
+{
+	if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
+	{
+		perror("prctl");
+		exit(0);
+	}
+	signal(SIGSEGV, sigsegv_fail);
+	alarm(10);
+	for(;;) rdtsc();
+}
+
+
+int main(int argc, char **argv)
+{
+	int n_tasks = 100, i;
+
+	fprintf(stderr, "[No further output means we're all right]\n");
+
+	for (i=0; i<n_tasks; i++)
+		if (fork() == 0)
+		{
+			if (i & 1)
+				segvtask();
+			else
+				rdtsctask();
+		}
+
+	for (i=0; i<n_tasks; i++)
+		wait(NULL);
+
+	exit(0);
+}
+
diff --git a/Documentation/prctl/disable-tsc-on-off-stress-test.c b/Documentation/prctl/disable-tsc-on-off-stress-test.c
new file mode 100644
index 000000000000..1fcd91445375
--- /dev/null
+++ b/Documentation/prctl/disable-tsc-on-off-stress-test.c
@@ -0,0 +1,95 @@
+/*
+ * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
+ *
+ * Tests if the control register is updated correctly
+ * when set with prctl()
+ *
+ * Warning: this test will cause a very high load for a few seconds
+ *
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <signal.h>
+#include <inttypes.h>
+#include <wait.h>
+
+
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+
+/* Get/set the process' ability to use the timestamp counter instruction */
+#ifndef PR_GET_TSC
+#define PR_GET_TSC	25
+#define PR_SET_TSC	26
+# define PR_TSC_ENABLE		1   /* allow the use of the timestamp counter */
+# define PR_TSC_SIGSEGV		2   /* throw a SIGSEGV instead of reading the TSC */
+#endif
+
+/* snippet from wikipedia :-) */
+
+uint64_t rdtsc(void) {
+	uint32_t lo, hi;
+	/* We cannot use "=A", since this would use %rax on x86_64 */
+	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
+	return (uint64_t)hi << 32 | lo;
+}
+
+int should_segv = 0;
+
+void sigsegv_cb(int sig)
+{
+	if (!should_segv)
+	{
+		fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n");
+		exit(0);
+	}
+	if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0)
+	{
+		perror("prctl");
+		exit(0);
+	}
+	should_segv = 0;
+
+	rdtsc();
+}
+
+void task(void)
+{
+	signal(SIGSEGV, sigsegv_cb);
+	alarm(10);
+	for(;;)
+	{
+		rdtsc();
+		if (should_segv)
+		{
+			fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n");
+			exit(0);
+		}
+		if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0)
+		{
+			perror("prctl");
+			exit(0);
+		}
+		should_segv = 1;
+	}
+}
+
+
+int main(int argc, char **argv)
+{
+	int n_tasks = 100, i;
+
+	fprintf(stderr, "[No further output means we're all right]\n");
+
+	for (i=0; i<n_tasks; i++)
+		if (fork() == 0)
+			task();
+
+	for (i=0; i<n_tasks; i++)
+		wait(NULL);
+
+	exit(0);
+}
+
diff --git a/Documentation/prctl/disable-tsc-test.c b/Documentation/prctl/disable-tsc-test.c
new file mode 100644
index 000000000000..843c81eac235
--- /dev/null
+++ b/Documentation/prctl/disable-tsc-test.c
@@ -0,0 +1,94 @@
+/*
+ * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...)
+ *
+ * Basic test of the behaviour of PR_GET_TSC and PR_SET_TSC
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <unistd.h>
+#include <signal.h>
+#include <inttypes.h>
+
+
+#include <sys/prctl.h>
+#include <linux/prctl.h>
+
+/* Get/set the process' ability to use the timestamp counter instruction */
+#ifndef PR_GET_TSC
+#define PR_GET_TSC	25
+#define PR_SET_TSC	26
+# define PR_TSC_ENABLE		1   /* allow the use of the timestamp counter */
+# define PR_TSC_SIGSEGV		2   /* throw a SIGSEGV instead of reading the TSC */
+#endif
+
+const char *tsc_names[] =
+{
+	[0] = "[not set]",
+	[PR_TSC_ENABLE] = "PR_TSC_ENABLE",
+	[PR_TSC_SIGSEGV] = "PR_TSC_SIGSEGV",
+};
+
+uint64_t rdtsc(void) {
+	uint32_t lo, hi;
+	/* We cannot use "=A", since this would use %rax on x86_64 */
+	__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
+	return (uint64_t)hi << 32 | lo;
+}
+
+void sigsegv_cb(int sig)
+{
+	int tsc_val = 0;
+
+	printf("[ SIG_SEGV ]\n");
+	printf("prctl(PR_GET_TSC, &tsc_val); ");
+	fflush(stdout);
+
+	if (prctl(PR_GET_TSC, &tsc_val) == -1)
+		perror("prctl");
+
+	printf("tsc_val == %s\n", tsc_names[tsc_val]);
+	printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n");
+	fflush(stdout);
+	if (prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1)
+		perror("prctl");
+
+	printf("rdtsc() == ");
+}
+
+int main(int argc, char **argv)
+{
+	int tsc_val = 0;
+
+	signal(SIGSEGV, sigsegv_cb);
+
+	printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
+	printf("prctl(PR_GET_TSC, &tsc_val); ");
+	fflush(stdout);
+
+	if (prctl(PR_GET_TSC, &tsc_val) == -1)
+		perror("prctl");
+
+	printf("tsc_val == %s\n", tsc_names[tsc_val]);
+	printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
+	printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n");
+	fflush(stdout);
+
+	if (prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1)
+		perror("prctl");
+
+	printf("rdtsc() == %llu\n", (unsigned long long)rdtsc());
+	printf("prctl(PR_SET_TSC, PR_TSC_SIGSEGV)\n");
+	fflush(stdout);
+
+	if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) == -1)
+		perror("prctl");
+
+	printf("rdtsc() == ");
+	fflush(stdout);
+	printf("%llu\n", (unsigned long long)rdtsc());
+	fflush(stdout);
+
+	exit(EXIT_SUCCESS);
+}
+
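
All three programs above are self-contained user-space C files: each can be
built with an ordinary C compiler on an x86 machine and run as a normal user.
When the kernel handles PR_SET_TSC correctly, the two stress tests print
nothing beyond their initial notice, while disable-tsc-test prints the TSC
value before and after each prctl() call, with a "[ SIG_SEGV ]" report while
reading the TSC is disabled.
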
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt
index 1c6332f4543c..14f901f639ee 100644
--- a/Documentation/scheduler/sched-rt-group.txt
+++ b/Documentation/scheduler/sched-rt-group.txt
@@ -1,59 +1,177 @@
-
-
-Real-Time group scheduling.
-
-The problem space:
-
-In order to schedule multiple groups of realtime tasks each group must
-be assigned a fixed portion of the CPU time available. Without a minimum
-guarantee a realtime group can obviously fall short. A fuzzy upper limit
-is of no use since it cannot be relied upon. Which leaves us with just
-the single fixed portion.
-
-CPU time is divided by means of specifying how much time can be spent
-running in a given period. Say a frame fixed realtime renderer must
-deliver 25 frames a second, which yields a period of 0.04s. Now say
-it will also have to play some music and respond to input, leaving it
-with around 80% for the graphics. We can then give this group a runtime
-of 0.8 * 0.04s = 0.032s.
-
-This way the graphics group will have a 0.04s period with a 0.032s runtime
-limit.
-
-Now if the audio thread needs to refill the DMA buffer every 0.005s, but
-needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s
-= 0.00015s.
-
-
-The Interface:
-
-system wide:
-
-/proc/sys/kernel/sched_rt_period_ms
-/proc/sys/kernel/sched_rt_runtime_us
-
-CONFIG_FAIR_USER_SCHED
-
-/sys/kernel/uids/<uid>/cpu_rt_runtime_us
-
-or
-
-CONFIG_FAIR_CGROUP_SCHED
-
-/cgroup/<cgroup>/cpu.rt_runtime_us
-
-[ time is specified in us because the interface is s32; this gives an
-  operating range of ~35m to 1us ]
-
-The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ].
-
-A runtime of -1 specifies runtime == period, ie. no limit.
-
-New groups get the period from /proc/sys/kernel/sched_rt_period_us and
-a runtime of 0.
-
-Settings are constrained to:
+			Real-Time group scheduling
+			--------------------------
+
+CONTENTS
+========
+
+1. Overview
+  1.1 The problem
+  1.2 The solution
+2. The interface
+  2.1 System-wide settings
+  2.2 Default behaviour
+  2.3 Basis for grouping tasks
+3. Future plans
+
+
+1. Overview
+===========
+
+
+1.1 The problem
+---------------
+
+Realtime scheduling is all about determinism: a group has to be able to rely on
+the amount of bandwidth (e.g. CPU time) being constant. In order to schedule
+multiple groups of realtime tasks, each group must be assigned a fixed portion
+of the CPU time available. Without a minimum guarantee a realtime group can
+obviously fall short. A fuzzy upper limit is of no use since it cannot be
+relied upon. Which leaves us with just the single fixed portion.
+
+1.2 The solution
+----------------
+
+CPU time is divided by means of specifying how much time can be spent running
+in a given period. We allocate this "run time" for each realtime group, which
+the other realtime groups will not be permitted to use.
+
+Any time not allocated to a realtime group will be used to run normal priority
+tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
+SCHED_OTHER.
+
+Let's consider an example: a frame-fixed realtime renderer must deliver 25
+frames a second, which yields a period of 0.04s per frame. Now say it will also
+have to play some music and respond to input, leaving it with around 80% CPU
+time dedicated to the graphics. We can then give this group a run time of 0.8
+* 0.04s = 0.032s.
+
+This way the graphics group will have a 0.04s period with a 0.032s run time
+limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
+needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
+0.00015s. So this group can be scheduled with a period of 0.005s and a run time
+of 0.00015s.
+
+The remaining CPU time will be used for user input and other tasks. Because
+realtime tasks have explicitly allocated the CPU time they need to perform
+their tasks, buffer underruns in the graphics or audio can be eliminated.
+
+NOTE: the above example is not fully implemented as of yet (2.6.25). We still
+lack an EDF scheduler to make non-uniform periods usable.
+
+
+2. The Interface
+================
+
+
+2.1 System wide settings
+------------------------
+
+The system wide settings are configured under the /proc virtual file system:
+
+/proc/sys/kernel/sched_rt_period_us:
+  The scheduling period that is equivalent to 100% CPU bandwidth
+
+/proc/sys/kernel/sched_rt_runtime_us:
+  A global limit on how much time realtime scheduling may use. Even without
+  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
+  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
+  available to all realtime groups.
+
+  * Time is specified in us because the interface is s32. This gives an
+    operating range from 1us to about 35 minutes.
+  * sched_rt_period_us takes values from 1 to INT_MAX.
+  * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
+  * A run time of -1 specifies runtime == period, ie. no limit.
+
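+As an illustration of these two files (an editorial sketch, not part of the
+patch), a small user-space program could read them and report the fraction of
+CPU bandwidth currently reserved for realtime tasks:
+
+    #include <stdio.h>
+    #include <stdlib.h>
+
+    /* Read a single integer from a /proc file. */
+    static long read_long(const char *path)
+    {
+    	FILE *f = fopen(path, "r");
+    	long val;
+
+    	if (!f || fscanf(f, "%ld", &val) != 1) {
+    		perror(path);
+    		exit(EXIT_FAILURE);
+    	}
+    	fclose(f);
+    	return val;
+    }
+
+    int main(void)
+    {
+    	long period = read_long("/proc/sys/kernel/sched_rt_period_us");
+    	long runtime = read_long("/proc/sys/kernel/sched_rt_runtime_us");
+
+    	if (runtime < 0)	/* -1 means runtime == period, no limit */
+    		runtime = period;
+    	printf("realtime bandwidth: %ld/%ld us (%.1f%%)\n",
+    	       runtime, period, 100.0 * runtime / period);
+    	return 0;
+    }
+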
-
-   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
-
-in order to keep the configuration schedulable.
+
+2.2 Default behaviour
+---------------------
+
+The default values are 1000000 (1s) for sched_rt_period_us and 950000
+(0.95s) for sched_rt_runtime_us. This leaves 0.05s to be used by
+SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
+realtime task will not lock up the machine but leave a little time to recover
+it. By setting runtime to -1 you'd get the old behaviour back.
+
+By default all bandwidth is assigned to the root group and new groups get the
+period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
+want to assign bandwidth to another group, reduce the root group's bandwidth
+and assign some or all of the difference to another group.
+
+Realtime group scheduling means you have to assign a portion of total CPU
+bandwidth to the group before it will accept realtime tasks. Therefore you will
+not be able to run realtime tasks as any user other than root until you have
+done that, even if the user has the rights to run processes with realtime
+priority!
+
+
+2.3 Basis for grouping tasks
+----------------------------
+
+There are two compile-time settings for allocating CPU bandwidth. These are
+configured using the "Basis for grouping tasks" multiple choice menu under
+General setup > Group CPU Scheduler:
+
+a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id")
+
+This lets you use the virtual files under
+"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control the CPU time reserved
+for each user.
+
+The other option is:
+
+b. CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
+
+This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us"
+to control the CPU time reserved for each control group instead.
+
+For more information on working with control groups, you should read
+Documentation/cgroups.txt as well.
+
+Group settings are checked against the following limits in order to keep
+the configuration schedulable:
+
+   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
+
+For now, this can be simplified to just the following (but see Future plans):
+
+   \Sum_{i} runtime_{i} <= global_runtime
+
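+As an illustration of the simplified rule above (an editorial sketch, not part
+of the patch; the group runtimes below are invented), a configuration can be
+checked before it is written to the per-group cpu.rt_runtime_us files:
+
+    #include <stdio.h>
+
+    /* Check the simplified admission rule:
+     *   \Sum_{i} runtime_{i} <= global_runtime
+     * All values are in microseconds, as in the /proc and cgroup files.
+     */
+    static int schedulable(const long *runtimes, int n, long global_runtime)
+    {
+    	long sum = 0;
+    	int i;
+
+    	for (i = 0; i < n; i++)
+    		sum += runtimes[i];
+    	return sum <= global_runtime;
+    }
+
+    int main(void)
+    {
+    	/* hypothetical groups: graphics, audio, misc */
+    	long runtimes[] = { 320000, 150000, 100000 };
+    	long global_runtime = 950000;	/* the default sched_rt_runtime_us */
+
+    	printf("configuration is %sschedulable\n",
+    	       schedulable(runtimes, 3, global_runtime) ? "" : "NOT ");
+    	return 0;
+    }
+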
+
+3. Future plans
+===============
+
+There is work in progress to make the scheduling period for each group
+("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
+"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.
+
+The constraint on the period is that a subgroup must have a smaller or
+equal period to its parent. But realistically it's not very useful _yet_
+as it's prone to starvation without deadline scheduling.
+
+Consider two sibling groups A and B; both have 50% bandwidth, but A's
+period is twice the length of B's.
+
+* group A: period=100000us, runtime=50000us
+	- this runs for 0.05s once every 0.1s
+
+* group B: period= 50000us, runtime=25000us
+	- this runs for 0.025s twice every 0.1s (or once every 0.05 sec).
+
+This means that currently a while (1) loop in A will run for the full period of
+B and can starve B's tasks (assuming they are of lower priority) for a whole
+period.
+
+The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
+full deadline scheduling to the Linux kernel. Deadline scheduling the above
+groups and treating the end of the period as a deadline will ensure that they
+both get their allocated time.
+
+Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
+the biggest challenge, as the current Linux PI infrastructure is geared towards
+the limited static priority levels 0-139. With deadline scheduling you need to
+do deadline inheritance (since priority is inversely proportional to the
+deadline delta (deadline - now)).
+
+This means the whole PI machinery will have to be reworked - and that is one of
+the most complex pieces of code we have.