diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/DocBook/kernel-locking.tmpl | 6 | ||||
-rw-r--r-- | Documentation/cpusets.txt | 72 | ||||
-rw-r--r-- | Documentation/feature-removal-schedule.txt | 15 | ||||
-rw-r--r-- | Documentation/kernel-parameters.txt | 10 | ||||
-rw-r--r-- | Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c | 96 | ||||
-rw-r--r-- | Documentation/prctl/disable-tsc-on-off-stress-test.c | 95 | ||||
-rw-r--r-- | Documentation/prctl/disable-tsc-test.c | 94 | ||||
-rw-r--r-- | Documentation/scheduler/sched-rt-group.txt | 188 |
8 files changed, 535 insertions, 41 deletions
diff --git a/Documentation/DocBook/kernel-locking.tmpl b/Documentation/DocBook/kernel-locking.tmpl index 2e9d6b41f034..435413ca40dc 100644 --- a/Documentation/DocBook/kernel-locking.tmpl +++ b/Documentation/DocBook/kernel-locking.tmpl | |||
@@ -241,7 +241,7 @@ | |||
241 | </para> | 241 | </para> |
242 | <para> | 242 | <para> |
243 | The third type is a semaphore | 243 | The third type is a semaphore |
244 | (<filename class="headerfile">include/asm/semaphore.h</filename>): it | 244 | (<filename class="headerfile">include/linux/semaphore.h</filename>): it |
245 | can have more than one holder at any time (the number decided at | 245 | can have more than one holder at any time (the number decided at |
246 | initialization time), although it is most commonly used as a | 246 | initialization time), although it is most commonly used as a |
247 | single-holder lock (a mutex). If you can't get a semaphore, your | 247 | single-holder lock (a mutex). If you can't get a semaphore, your |
@@ -290,7 +290,7 @@ | |||
290 | <para> | 290 | <para> |
291 | If you have a data structure which is only ever accessed from | 291 | If you have a data structure which is only ever accessed from |
292 | user context, then you can use a simple semaphore | 292 | user context, then you can use a simple semaphore |
293 | (<filename>linux/asm/semaphore.h</filename>) to protect it. This | 293 | (<filename>linux/linux/semaphore.h</filename>) to protect it. This |
294 | is the most trivial case: you initialize the semaphore to the number | 294 | is the most trivial case: you initialize the semaphore to the number |
295 | of resources available (usually 1), and call | 295 | of resources available (usually 1), and call |
296 | <function>down_interruptible()</function> to grab the semaphore, and | 296 | <function>down_interruptible()</function> to grab the semaphore, and |
@@ -1656,7 +1656,7 @@ the amount of locking which needs to be done. | |||
1656 | #include <linux/slab.h> | 1656 | #include <linux/slab.h> |
1657 | #include <linux/string.h> | 1657 | #include <linux/string.h> |
1658 | +#include <linux/rcupdate.h> | 1658 | +#include <linux/rcupdate.h> |
1659 | #include <asm/semaphore.h> | 1659 | #include <linux/semaphore.h> |
1660 | #include <asm/errno.h> | 1660 | #include <asm/errno.h> |
1661 | 1661 | ||
1662 | struct object | 1662 | struct object |
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index ad2bb3b3acc1..aa854b9b18cd 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt | |||
@@ -8,6 +8,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. | |||
8 | Modified by Paul Jackson <pj@sgi.com> | 8 | Modified by Paul Jackson <pj@sgi.com> |
9 | Modified by Christoph Lameter <clameter@sgi.com> | 9 | Modified by Christoph Lameter <clameter@sgi.com> |
10 | Modified by Paul Menage <menage@google.com> | 10 | Modified by Paul Menage <menage@google.com> |
11 | Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> | ||
11 | 12 | ||
12 | CONTENTS: | 13 | CONTENTS: |
13 | ========= | 14 | ========= |
@@ -20,7 +21,8 @@ CONTENTS: | |||
20 | 1.5 What is memory_pressure ? | 21 | 1.5 What is memory_pressure ? |
21 | 1.6 What is memory spread ? | 22 | 1.6 What is memory spread ? |
22 | 1.7 What is sched_load_balance ? | 23 | 1.7 What is sched_load_balance ? |
23 | 1.8 How do I use cpusets ? | 24 | 1.8 What is sched_relax_domain_level ? |
25 | 1.9 How do I use cpusets ? | ||
24 | 2. Usage Examples and Syntax | 26 | 2. Usage Examples and Syntax |
25 | 2.1 Basic Usage | 27 | 2.1 Basic Usage |
26 | 2.2 Adding/removing cpus | 28 | 2.2 Adding/removing cpus |
@@ -497,7 +499,73 @@ the cpuset code to update these sched domains, it compares the new | |||
497 | partition requested with the current, and updates its sched domains, | 499 | partition requested with the current, and updates its sched domains, |
498 | removing the old and adding the new, for each change. | 500 | removing the old and adding the new, for each change. |
499 | 501 | ||
500 | 1.8 How do I use cpusets ? | 502 | |
503 | 1.8 What is sched_relax_domain_level ? | ||
504 | -------------------------------------- | ||
505 | |||
506 | In sched domain, the scheduler migrates tasks in 2 ways; periodic load | ||
507 | balance on tick, and at time of some schedule events. | ||
508 | |||
509 | When a task is woken up, scheduler try to move the task on idle CPU. | ||
510 | For example, if a task A running on CPU X activates another task B | ||
511 | on the same CPU X, and if CPU Y is X's sibling and performing idle, | ||
512 | then scheduler migrate task B to CPU Y so that task B can start on | ||
513 | CPU Y without waiting task A on CPU X. | ||
514 | |||
515 | And if a CPU run out of tasks in its runqueue, the CPU try to pull | ||
516 | extra tasks from other busy CPUs to help them before it is going to | ||
517 | be idle. | ||
518 | |||
519 | Of course it takes some searching cost to find movable tasks and/or | ||
520 | idle CPUs, the scheduler might not search all CPUs in the domain | ||
521 | everytime. In fact, in some architectures, the searching ranges on | ||
522 | events are limited in the same socket or node where the CPU locates, | ||
523 | while the load balance on tick searchs all. | ||
524 | |||
525 | For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | ||
526 | is idle while CPU X and the siblings are busy, scheduler can't migrate | ||
527 | woken task B from X to Z since it is out of its searching range. | ||
528 | As the result, task B on CPU X need to wait task A or wait load balance | ||
529 | on the next tick. For some applications in special situation, waiting | ||
530 | 1 tick may be too long. | ||
531 | |||
532 | The 'sched_relax_domain_level' file allows you to request changing | ||
533 | this searching range as you like. This file takes int value which | ||
534 | indicates size of searching range in levels ideally as follows, | ||
535 | otherwise initial value -1 that indicates the cpuset has no request. | ||
536 | |||
537 | -1 : no request. use system default or follow request of others. | ||
538 | 0 : no search. | ||
539 | 1 : search siblings (hyperthreads in a core). | ||
540 | 2 : search cores in a package. | ||
541 | 3 : search cpus in a node [= system wide on non-NUMA system] | ||
542 | ( 4 : search nodes in a chunk of node [on NUMA system] ) | ||
543 | ( 5~ : search system wide [on NUMA system]) | ||
544 | |||
545 | This file is per-cpuset and affect the sched domain where the cpuset | ||
546 | belongs to. Therefore if the flag 'sched_load_balance' of a cpuset | ||
547 | is disabled, then 'sched_relax_domain_level' have no effect since | ||
548 | there is no sched domain belonging the cpuset. | ||
549 | |||
550 | If multiple cpusets are overlapping and hence they form a single sched | ||
551 | domain, the largest value among those is used. Be careful, if one | ||
552 | requests 0 and others are -1 then 0 is used. | ||
553 | |||
554 | Note that modifying this file will have both good and bad effects, | ||
555 | and whether it is acceptable or not will be depend on your situation. | ||
556 | Don't modify this file if you are not sure. | ||
557 | |||
558 | If your situation is: | ||
559 | - The migration costs between each cpu can be assumed considerably | ||
560 | small(for you) due to your special application's behavior or | ||
561 | special hardware support for CPU cache etc. | ||
562 | - The searching cost doesn't have impact(for you) or you can make | ||
563 | the searching cost enough small by managing cpuset to compact etc. | ||
564 | - The latency is required even it sacrifices cache hit rate etc. | ||
565 | then increasing 'sched_relax_domain_level' would benefit you. | ||
566 | |||
567 | |||
568 | 1.9 How do I use cpusets ? | ||
501 | -------------------------- | 569 | -------------------------- |
502 | 570 | ||
503 | In order to minimize the impact of cpusets on critical kernel | 571 | In order to minimize the impact of cpusets on critical kernel |
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index af0e9393bf68..b45ea28abc99 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt | |||
@@ -282,6 +282,13 @@ Why: Not used in-tree. The current out-of-tree users used it to | |||
282 | out-of-tree driver. | 282 | out-of-tree driver. |
283 | Who: Thomas Gleixner <tglx@linutronix.de> | 283 | Who: Thomas Gleixner <tglx@linutronix.de> |
284 | 284 | ||
285 | ---------------------------- | ||
286 | |||
287 | What: usedac i386 kernel parameter | ||
288 | When: 2.6.27 | ||
289 | Why: replaced by allowdac and no dac combination | ||
290 | Who: Glauber Costa <gcosta@redhat.com> | ||
291 | |||
285 | --------------------------- | 292 | --------------------------- |
286 | 293 | ||
287 | What: /sys/o2cb symlink | 294 | What: /sys/o2cb symlink |
@@ -291,3 +298,11 @@ Why: /sys/fs/o2cb is the proper location for this information - /sys/o2cb | |||
291 | ocfs2-tools. 2 years should be sufficient time to phase in new versions | 298 | ocfs2-tools. 2 years should be sufficient time to phase in new versions |
292 | which know to look in /sys/fs/o2cb. | 299 | which know to look in /sys/fs/o2cb. |
293 | Who: ocfs2-devel@oss.oracle.com | 300 | Who: ocfs2-devel@oss.oracle.com |
301 | |||
302 | --------------------------- | ||
303 | |||
304 | What: asm/semaphore.h | ||
305 | When: 2.6.26 | ||
306 | Why: Implementation became generic; users should now include | ||
307 | linux/semaphore.h instead. | ||
308 | Who: Matthew Wilcox <willy@linux.intel.com> | ||
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 4b0f1ae31a4c..f4839606988b 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt | |||
@@ -1280,8 +1280,16 @@ and is between 256 and 4096 characters. It is defined in the file | |||
1280 | noexec [IA-64] | 1280 | noexec [IA-64] |
1281 | 1281 | ||
1282 | noexec [X86-32,X86-64] | 1282 | noexec [X86-32,X86-64] |
1283 | On X86-32 available only on PAE configured kernels. | ||
1283 | noexec=on: enable non-executable mappings (default) | 1284 | noexec=on: enable non-executable mappings (default) |
1284 | noexec=off: disable nn-executable mappings | 1285 | noexec=off: disable non-executable mappings |
1286 | |||
1287 | noexec32 [X86-64] | ||
1288 | This affects only 32-bit executables. | ||
1289 | noexec32=on: enable non-executable mappings (default) | ||
1290 | read doesn't imply executable mappings | ||
1291 | noexec32=off: disable non-executable mappings | ||
1292 | read implies executable mappings | ||
1285 | 1293 | ||
1286 | nofxsr [BUGS=X86-32] Disables x86 floating point extended | 1294 | nofxsr [BUGS=X86-32] Disables x86 floating point extended |
1287 | register save and restore. The kernel will only save | 1295 | register save and restore. The kernel will only save |
diff --git a/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c new file mode 100644 index 000000000000..f8e8e95e81fd --- /dev/null +++ b/Documentation/prctl/disable-tsc-ctxt-sw-stress-test.c | |||
@@ -0,0 +1,96 @@ | |||
1 | /* | ||
2 | * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...) | ||
3 | * | ||
4 | * Tests if the control register is updated correctly | ||
5 | * at context switches | ||
6 | * | ||
7 | * Warning: this test will cause a very high load for a few seconds | ||
8 | * | ||
9 | */ | ||
10 | |||
11 | #include <stdio.h> | ||
12 | #include <stdlib.h> | ||
13 | #include <unistd.h> | ||
14 | #include <signal.h> | ||
15 | #include <inttypes.h> | ||
16 | #include <wait.h> | ||
17 | |||
18 | |||
19 | #include <sys/prctl.h> | ||
20 | #include <linux/prctl.h> | ||
21 | |||
22 | /* Get/set the process' ability to use the timestamp counter instruction */ | ||
23 | #ifndef PR_GET_TSC | ||
24 | #define PR_GET_TSC 25 | ||
25 | #define PR_SET_TSC 26 | ||
26 | # define PR_TSC_ENABLE 1 /* allow the use of the timestamp counter */ | ||
27 | # define PR_TSC_SIGSEGV 2 /* throw a SIGSEGV instead of reading the TSC */ | ||
28 | #endif | ||
29 | |||
30 | uint64_t rdtsc() { | ||
31 | uint32_t lo, hi; | ||
32 | /* We cannot use "=A", since this would use %rax on x86_64 */ | ||
33 | __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi)); | ||
34 | return (uint64_t)hi << 32 | lo; | ||
35 | } | ||
36 | |||
37 | void sigsegv_expect(int sig) | ||
38 | { | ||
39 | /* */ | ||
40 | } | ||
41 | |||
42 | void segvtask(void) | ||
43 | { | ||
44 | if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0) | ||
45 | { | ||
46 | perror("prctl"); | ||
47 | exit(0); | ||
48 | } | ||
49 | signal(SIGSEGV, sigsegv_expect); | ||
50 | alarm(10); | ||
51 | rdtsc(); | ||
52 | fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n"); | ||
53 | exit(0); | ||
54 | } | ||
55 | |||
56 | |||
57 | void sigsegv_fail(int sig) | ||
58 | { | ||
59 | fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n"); | ||
60 | exit(0); | ||
61 | } | ||
62 | |||
63 | void rdtsctask(void) | ||
64 | { | ||
65 | if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0) | ||
66 | { | ||
67 | perror("prctl"); | ||
68 | exit(0); | ||
69 | } | ||
70 | signal(SIGSEGV, sigsegv_fail); | ||
71 | alarm(10); | ||
72 | for(;;) rdtsc(); | ||
73 | } | ||
74 | |||
75 | |||
76 | int main(int argc, char **argv) | ||
77 | { | ||
78 | int n_tasks = 100, i; | ||
79 | |||
80 | fprintf(stderr, "[No further output means we're allright]\n"); | ||
81 | |||
82 | for (i=0; i<n_tasks; i++) | ||
83 | if (fork() == 0) | ||
84 | { | ||
85 | if (i & 1) | ||
86 | segvtask(); | ||
87 | else | ||
88 | rdtsctask(); | ||
89 | } | ||
90 | |||
91 | for (i=0; i<n_tasks; i++) | ||
92 | wait(NULL); | ||
93 | |||
94 | exit(0); | ||
95 | } | ||
96 | |||
diff --git a/Documentation/prctl/disable-tsc-on-off-stress-test.c b/Documentation/prctl/disable-tsc-on-off-stress-test.c new file mode 100644 index 000000000000..1fcd91445375 --- /dev/null +++ b/Documentation/prctl/disable-tsc-on-off-stress-test.c | |||
@@ -0,0 +1,95 @@ | |||
1 | /* | ||
2 | * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...) | ||
3 | * | ||
4 | * Tests if the control register is updated correctly | ||
5 | * when set with prctl() | ||
6 | * | ||
7 | * Warning: this test will cause a very high load for a few seconds | ||
8 | * | ||
9 | */ | ||
10 | |||
11 | #include <stdio.h> | ||
12 | #include <stdlib.h> | ||
13 | #include <unistd.h> | ||
14 | #include <signal.h> | ||
15 | #include <inttypes.h> | ||
16 | #include <wait.h> | ||
17 | |||
18 | |||
19 | #include <sys/prctl.h> | ||
20 | #include <linux/prctl.h> | ||
21 | |||
22 | /* Get/set the process' ability to use the timestamp counter instruction */ | ||
23 | #ifndef PR_GET_TSC | ||
24 | #define PR_GET_TSC 25 | ||
25 | #define PR_SET_TSC 26 | ||
26 | # define PR_TSC_ENABLE 1 /* allow the use of the timestamp counter */ | ||
27 | # define PR_TSC_SIGSEGV 2 /* throw a SIGSEGV instead of reading the TSC */ | ||
28 | #endif | ||
29 | |||
30 | /* snippet from wikipedia :-) */ | ||
31 | |||
32 | uint64_t rdtsc() { | ||
33 | uint32_t lo, hi; | ||
34 | /* We cannot use "=A", since this would use %rax on x86_64 */ | ||
35 | __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi)); | ||
36 | return (uint64_t)hi << 32 | lo; | ||
37 | } | ||
38 | |||
39 | int should_segv = 0; | ||
40 | |||
41 | void sigsegv_cb(int sig) | ||
42 | { | ||
43 | if (!should_segv) | ||
44 | { | ||
45 | fprintf(stderr, "FATAL ERROR, rdtsc() failed while enabled\n"); | ||
46 | exit(0); | ||
47 | } | ||
48 | if (prctl(PR_SET_TSC, PR_TSC_ENABLE) < 0) | ||
49 | { | ||
50 | perror("prctl"); | ||
51 | exit(0); | ||
52 | } | ||
53 | should_segv = 0; | ||
54 | |||
55 | rdtsc(); | ||
56 | } | ||
57 | |||
58 | void task(void) | ||
59 | { | ||
60 | signal(SIGSEGV, sigsegv_cb); | ||
61 | alarm(10); | ||
62 | for(;;) | ||
63 | { | ||
64 | rdtsc(); | ||
65 | if (should_segv) | ||
66 | { | ||
67 | fprintf(stderr, "FATAL ERROR, rdtsc() succeeded while disabled\n"); | ||
68 | exit(0); | ||
69 | } | ||
70 | if (prctl(PR_SET_TSC, PR_TSC_SIGSEGV) < 0) | ||
71 | { | ||
72 | perror("prctl"); | ||
73 | exit(0); | ||
74 | } | ||
75 | should_segv = 1; | ||
76 | } | ||
77 | } | ||
78 | |||
79 | |||
80 | int main(int argc, char **argv) | ||
81 | { | ||
82 | int n_tasks = 100, i; | ||
83 | |||
84 | fprintf(stderr, "[No further output means we're allright]\n"); | ||
85 | |||
86 | for (i=0; i<n_tasks; i++) | ||
87 | if (fork() == 0) | ||
88 | task(); | ||
89 | |||
90 | for (i=0; i<n_tasks; i++) | ||
91 | wait(NULL); | ||
92 | |||
93 | exit(0); | ||
94 | } | ||
95 | |||
diff --git a/Documentation/prctl/disable-tsc-test.c b/Documentation/prctl/disable-tsc-test.c new file mode 100644 index 000000000000..843c81eac235 --- /dev/null +++ b/Documentation/prctl/disable-tsc-test.c | |||
@@ -0,0 +1,94 @@ | |||
1 | /* | ||
2 | * Tests for prctl(PR_GET_TSC, ...) / prctl(PR_SET_TSC, ...) | ||
3 | * | ||
4 | * Basic test to test behaviour of PR_GET_TSC and PR_SET_TSC | ||
5 | */ | ||
6 | |||
7 | #include <stdio.h> | ||
8 | #include <stdlib.h> | ||
9 | #include <unistd.h> | ||
10 | #include <signal.h> | ||
11 | #include <inttypes.h> | ||
12 | |||
13 | |||
14 | #include <sys/prctl.h> | ||
15 | #include <linux/prctl.h> | ||
16 | |||
17 | /* Get/set the process' ability to use the timestamp counter instruction */ | ||
18 | #ifndef PR_GET_TSC | ||
19 | #define PR_GET_TSC 25 | ||
20 | #define PR_SET_TSC 26 | ||
21 | # define PR_TSC_ENABLE 1 /* allow the use of the timestamp counter */ | ||
22 | # define PR_TSC_SIGSEGV 2 /* throw a SIGSEGV instead of reading the TSC */ | ||
23 | #endif | ||
24 | |||
25 | const char *tsc_names[] = | ||
26 | { | ||
27 | [0] = "[not set]", | ||
28 | [PR_TSC_ENABLE] = "PR_TSC_ENABLE", | ||
29 | [PR_TSC_SIGSEGV] = "PR_TSC_SIGSEGV", | ||
30 | }; | ||
31 | |||
32 | uint64_t rdtsc() { | ||
33 | uint32_t lo, hi; | ||
34 | /* We cannot use "=A", since this would use %rax on x86_64 */ | ||
35 | __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi)); | ||
36 | return (uint64_t)hi << 32 | lo; | ||
37 | } | ||
38 | |||
39 | void sigsegv_cb(int sig) | ||
40 | { | ||
41 | int tsc_val = 0; | ||
42 | |||
43 | printf("[ SIG_SEGV ]\n"); | ||
44 | printf("prctl(PR_GET_TSC, &tsc_val); "); | ||
45 | fflush(stdout); | ||
46 | |||
47 | if ( prctl(PR_GET_TSC, &tsc_val) == -1) | ||
48 | perror("prctl"); | ||
49 | |||
50 | printf("tsc_val == %s\n", tsc_names[tsc_val]); | ||
51 | printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n"); | ||
52 | fflush(stdout); | ||
53 | if ( prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1) | ||
54 | perror("prctl"); | ||
55 | |||
56 | printf("rdtsc() == "); | ||
57 | } | ||
58 | |||
59 | int main(int argc, char **argv) | ||
60 | { | ||
61 | int tsc_val = 0; | ||
62 | |||
63 | signal(SIGSEGV, sigsegv_cb); | ||
64 | |||
65 | printf("rdtsc() == %llu\n", (unsigned long long)rdtsc()); | ||
66 | printf("prctl(PR_GET_TSC, &tsc_val); "); | ||
67 | fflush(stdout); | ||
68 | |||
69 | if ( prctl(PR_GET_TSC, &tsc_val) == -1) | ||
70 | perror("prctl"); | ||
71 | |||
72 | printf("tsc_val == %s\n", tsc_names[tsc_val]); | ||
73 | printf("rdtsc() == %llu\n", (unsigned long long)rdtsc()); | ||
74 | printf("prctl(PR_SET_TSC, PR_TSC_ENABLE)\n"); | ||
75 | fflush(stdout); | ||
76 | |||
77 | if ( prctl(PR_SET_TSC, PR_TSC_ENABLE) == -1) | ||
78 | perror("prctl"); | ||
79 | |||
80 | printf("rdtsc() == %llu\n", (unsigned long long)rdtsc()); | ||
81 | printf("prctl(PR_SET_TSC, PR_TSC_SIGSEGV)\n"); | ||
82 | fflush(stdout); | ||
83 | |||
84 | if ( prctl(PR_SET_TSC, PR_TSC_SIGSEGV) == -1) | ||
85 | perror("prctl"); | ||
86 | |||
87 | printf("rdtsc() == "); | ||
88 | fflush(stdout); | ||
89 | printf("%llu\n", (unsigned long long)rdtsc()); | ||
90 | fflush(stdout); | ||
91 | |||
92 | exit(EXIT_SUCCESS); | ||
93 | } | ||
94 | |||
diff --git a/Documentation/scheduler/sched-rt-group.txt b/Documentation/scheduler/sched-rt-group.txt index 1c6332f4543c..14f901f639ee 100644 --- a/Documentation/scheduler/sched-rt-group.txt +++ b/Documentation/scheduler/sched-rt-group.txt | |||
@@ -1,59 +1,177 @@ | |||
1 | Real-Time group scheduling | ||
2 | -------------------------- | ||
1 | 3 | ||
4 | CONTENTS | ||
5 | ======== | ||
2 | 6 | ||
3 | Real-Time group scheduling. | 7 | 1. Overview |
8 | 1.1 The problem | ||
9 | 1.2 The solution | ||
10 | 2. The interface | ||
11 | 2.1 System-wide settings | ||
12 | 2.2 Default behaviour | ||
13 | 2.3 Basis for grouping tasks | ||
14 | 3. Future plans | ||
4 | 15 | ||
5 | The problem space: | ||
6 | 16 | ||
7 | In order to schedule multiple groups of realtime tasks each group must | 17 | 1. Overview |
8 | be assigned a fixed portion of the CPU time available. Without a minimum | 18 | =========== |
9 | guarantee a realtime group can obviously fall short. A fuzzy upper limit | ||
10 | is of no use since it cannot be relied upon. Which leaves us with just | ||
11 | the single fixed portion. | ||
12 | 19 | ||
13 | CPU time is divided by means of specifying how much time can be spent | ||
14 | running in a given period. Say a frame fixed realtime renderer must | ||
15 | deliver 25 frames a second, which yields a period of 0.04s. Now say | ||
16 | it will also have to play some music and respond to input, leaving it | ||
17 | with around 80% for the graphics. We can then give this group a runtime | ||
18 | of 0.8 * 0.04s = 0.032s. | ||
19 | 20 | ||
20 | This way the graphics group will have a 0.04s period with a 0.032s runtime | 21 | 1.1 The problem |
21 | limit. | 22 | --------------- |
22 | 23 | ||
23 | Now if the audio thread needs to refill the DMA buffer every 0.005s, but | 24 | Realtime scheduling is all about determinism, a group has to be able to rely on |
24 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s | 25 | the amount of bandwidth (eg. CPU time) being constant. In order to schedule |
25 | = 0.00015s. | 26 | multiple groups of realtime tasks, each group must be assigned a fixed portion |
27 | of the CPU time available. Without a minimum guarantee a realtime group can | ||
28 | obviously fall short. A fuzzy upper limit is of no use since it cannot be | ||
29 | relied upon. Which leaves us with just the single fixed portion. | ||
26 | 30 | ||
31 | 1.2 The solution | ||
32 | ---------------- | ||
27 | 33 | ||
28 | The Interface: | 34 | CPU time is divided by means of specifying how much time can be spent running |
35 | in a given period. We allocate this "run time" for each realtime group which | ||
36 | the other realtime groups will not be permitted to use. | ||
29 | 37 | ||
30 | system wide: | 38 | Any time not allocated to a realtime group will be used to run normal priority |
39 | tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by | ||
40 | SCHED_OTHER. | ||
31 | 41 | ||
32 | /proc/sys/kernel/sched_rt_period_ms | 42 | Let's consider an example: a frame fixed realtime renderer must deliver 25 |
33 | /proc/sys/kernel/sched_rt_runtime_us | 43 | frames a second, which yields a period of 0.04s per frame. Now say it will also |
44 | have to play some music and respond to input, leaving it with around 80% CPU | ||
45 | time dedicated for the graphics. We can then give this group a run time of 0.8 | ||
46 | * 0.04s = 0.032s. | ||
34 | 47 | ||
35 | CONFIG_FAIR_USER_SCHED | 48 | This way the graphics group will have a 0.04s period with a 0.032s run time |
49 | limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but | ||
50 | needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s = | ||
51 | 0.00015s. So this group can be scheduled with a period of 0.005s and a run time | ||
52 | of 0.00015s. | ||
36 | 53 | ||
37 | /sys/kernel/uids/<uid>/cpu_rt_runtime_us | 54 | The remaining CPU time will be used for user input and other tass. Because |
55 | realtime tasks have explicitly allocated the CPU time they need to perform | ||
56 | their tasks, buffer underruns in the graphocs or audio can be eliminated. | ||
38 | 57 | ||
39 | or | 58 | NOTE: the above example is not fully implemented as of yet (2.6.25). We still |
59 | lack an EDF scheduler to make non-uniform periods usable. | ||
40 | 60 | ||
41 | CONFIG_FAIR_CGROUP_SCHED | ||
42 | 61 | ||
43 | /cgroup/<cgroup>/cpu.rt_runtime_us | 62 | 2. The Interface |
63 | ================ | ||
44 | 64 | ||
45 | [ time is specified in us because the interface is s32; this gives an | ||
46 | operating range of ~35m to 1us ] | ||
47 | 65 | ||
48 | The period takes values in [ 1, INT_MAX ], runtime in [ -1, INT_MAX - 1 ]. | 66 | 2.1 System wide settings |
67 | ------------------------ | ||
49 | 68 | ||
50 | A runtime of -1 specifies runtime == period, ie. no limit. | 69 | The system wide settings are configured under the /proc virtual file system: |
51 | 70 | ||
52 | New groups get the period from /proc/sys/kernel/sched_rt_period_us and | 71 | /proc/sys/kernel/sched_rt_period_us: |
53 | a runtime of 0. | 72 | The scheduling period that is equivalent to 100% CPU bandwidth |
54 | 73 | ||
55 | Settings are constrained to: | 74 | /proc/sys/kernel/sched_rt_runtime_us: |
75 | A global limit on how much time realtime scheduling may use. Even without | ||
76 | CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime | ||
77 | processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth | ||
78 | available to all realtime groups. | ||
79 | |||
80 | * Time is specified in us because the interface is s32. This gives an | ||
81 | operating range from 1us to about 35 minutes. | ||
82 | * sched_rt_period_us takes values from 1 to INT_MAX. | ||
83 | * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1). | ||
84 | * A run time of -1 specifies runtime == period, ie. no limit. | ||
85 | |||
86 | |||
87 | 2.2 Default behaviour | ||
88 | --------------------- | ||
89 | |||
90 | The default values for sched_rt_period_us (1000000 or 1s) and | ||
91 | sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by | ||
92 | SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away | ||
93 | realtime tasks will not lock up the machine but leave a little time to recover | ||
94 | it. By setting runtime to -1 you'd get the old behaviour back. | ||
95 | |||
96 | By default all bandwidth is assigned to the root group and new groups get the | ||
97 | period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you | ||
98 | want to assign bandwidth to another group, reduce the root group's bandwidth | ||
99 | and assign some or all of the difference to another group. | ||
100 | |||
101 | Realtime group scheduling means you have to assign a portion of total CPU | ||
102 | bandwidth to the group before it will accept realtime tasks. Therefore you will | ||
103 | not be able to run realtime tasks as any user other than root until you have | ||
104 | done that, even if the user has the rights to run processes with realtime | ||
105 | priority! | ||
106 | |||
107 | |||
108 | 2.3 Basis for grouping tasks | ||
109 | ---------------------------- | ||
110 | |||
111 | There are two compile-time settings for allocating CPU bandwidth. These are | ||
112 | configured using the "Basis for grouping tasks" multiple choice menu under | ||
113 | General setup > Group CPU Scheduler: | ||
114 | |||
115 | a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id") | ||
116 | |||
117 | This lets you use the virtual files under | ||
118 | "/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for | ||
119 | each user . | ||
120 | |||
121 | The other option is: | ||
122 | |||
123 | .o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups") | ||
124 | |||
125 | This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us" | ||
126 | to control the CPU time reserved for each control group instead. | ||
127 | |||
128 | For more information on working with control groups, you should read | ||
129 | Documentation/cgroups.txt as well. | ||
130 | |||
131 | Group settings are checked against the following limits in order to keep the configuration | ||
132 | schedulable: | ||
56 | 133 | ||
57 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period | 134 | \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period |
58 | 135 | ||
59 | in order to keep the configuration schedulable. | 136 | For now, this can be simplified to just the following (but see Future plans): |
137 | |||
138 | \Sum_{i} runtime_{i} <= global_runtime | ||
139 | |||
140 | |||
141 | 3. Future plans | ||
142 | =============== | ||
143 | |||
144 | There is work in progress to make the scheduling period for each group | ||
145 | ("/sys/kernel/uids/<uid>/cpu_rt_period_us" or | ||
146 | "/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well. | ||
147 | |||
148 | The constraint on the period is that a subgroup must have a smaller or | ||
149 | equal period to its parent. But realistically its not very useful _yet_ | ||
150 | as its prone to starvation without deadline scheduling. | ||
151 | |||
152 | Consider two sibling groups A and B; both have 50% bandwidth, but A's | ||
153 | period is twice the length of B's. | ||
154 | |||
155 | * group A: period=100000us, runtime=10000us | ||
156 | - this runs for 0.01s once every 0.1s | ||
157 | |||
158 | * group B: period= 50000us, runtime=10000us | ||
159 | - this runs for 0.01s twice every 0.1s (or once every 0.05 sec). | ||
160 | |||
161 | This means that currently a while (1) loop in A will run for the full period of | ||
162 | B and can starve B's tasks (assuming they are of lower priority) for a whole | ||
163 | period. | ||
164 | |||
165 | The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring | ||
166 | full deadline scheduling to the linux kernel. Deadline scheduling the above | ||
167 | groups and treating end of the period as a deadline will ensure that they both | ||
168 | get their allocated time. | ||
169 | |||
170 | Implementing SCHED_EDF might take a while to complete. Priority Inheritance is | ||
171 | the biggest challenge as the current linux PI infrastructure is geared towards | ||
172 | the limited static priority levels 0-139. With deadline scheduling you need to | ||
173 | do deadline inheritance (since priority is inversely proportional to the | ||
174 | deadline delta (deadline - now). | ||
175 | |||
176 | This means the whole PI machinery will have to be reworked - and that is one of | ||
177 | the most complex pieces of code we have. | ||