diff options
author | J. Bruce Fields <bfields@citi.umich.edu> | 2008-02-07 03:13:37 -0500 |
---|---|---|
committer | Linus Torvalds <torvalds@woody.linux-foundation.org> | 2008-02-07 11:42:17 -0500 |
commit | 9b8eae7248dad42091204f83ed3448e661456af1 (patch) | |
tree | 1e300d41f8aaa9c258c179024ba63799a79f5a6f /Documentation/scheduler | |
parent | d3cf91d0e201962a6367191e5926f5b0920b0339 (diff) |
Documentation: create new scheduler/ subdirectory
The top-level Documentation/ directory is unmanageably large, so we
should take any obvious opportunities to move stuff into subdirectories.
These sched-*.txt files seem an obvious easy case.
Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'Documentation/scheduler')
-rw-r--r-- | Documentation/scheduler/00-INDEX | 16 | ||||
-rw-r--r-- | Documentation/scheduler/sched-arch.txt | 89 | ||||
-rw-r--r-- | Documentation/scheduler/sched-coding.txt | 126 | ||||
-rw-r--r-- | Documentation/scheduler/sched-design-CFS.txt | 186 | ||||
-rw-r--r-- | Documentation/scheduler/sched-design.txt | 165 | ||||
-rw-r--r-- | Documentation/scheduler/sched-domains.txt | 70 | ||||
-rw-r--r-- | Documentation/scheduler/sched-nice-design.txt | 108 | ||||
-rw-r--r-- | Documentation/scheduler/sched-stats.txt | 156 |
8 files changed, 916 insertions, 0 deletions
diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX new file mode 100644 index 000000000000..b5f5ca069b2d --- /dev/null +++ b/Documentation/scheduler/00-INDEX | |||
@@ -0,0 +1,16 @@ | |||
1 | 00-INDEX | ||
2 | - this file. | ||
3 | sched-arch.txt | ||
4 | - CPU Scheduler implementation hints for architecture specific code. | ||
5 | sched-coding.txt | ||
6 | - reference for various scheduler-related methods in the O(1) scheduler. | ||
7 | sched-design.txt | ||
8 | - goals, design and implementation of the Linux O(1) scheduler. | ||
9 | sched-design-CFS.txt | ||
10 | - goals, design and implementation of the Complete Fair Scheduler. | ||
11 | sched-domains.txt | ||
12 | - information on scheduling domains. | ||
13 | sched-nice-design.txt | ||
14 | - How and why the scheduler's nice levels are implemented. | ||
15 | sched-stats.txt | ||
16 | - information on schedstats (Linux Scheduler Statistics). | ||
diff --git a/Documentation/scheduler/sched-arch.txt b/Documentation/scheduler/sched-arch.txt new file mode 100644 index 000000000000..941615a9769b --- /dev/null +++ b/Documentation/scheduler/sched-arch.txt | |||
@@ -0,0 +1,89 @@ | |||
1 | CPU Scheduler implementation hints for architecture specific code | ||
2 | |||
3 | Nick Piggin, 2005 | ||
4 | |||
5 | Context switch | ||
6 | ============== | ||
7 | 1. Runqueue locking | ||
8 | By default, the switch_to arch function is called with the runqueue | ||
9 | locked. This is usually not a problem unless switch_to may need to | ||
10 | take the runqueue lock. This is usually due to a wake up operation in | ||
11 | the context switch. See include/asm-ia64/system.h for an example. | ||
12 | |||
13 | To request the scheduler call switch_to with the runqueue unlocked, | ||
14 | you must `#define __ARCH_WANT_UNLOCKED_CTXSW` in a header file | ||
15 | (typically the one where switch_to is defined). | ||
16 | |||
17 | Unlocked context switches introduce only a very minor performance | ||
18 | penalty to the core scheduler implementation in the CONFIG_SMP case. | ||
19 | |||
20 | 2. Interrupt status | ||
21 | By default, the switch_to arch function is called with interrupts | ||
22 | disabled. Interrupts may be enabled over the call if it is likely to | ||
23 | introduce a significant interrupt latency by adding the line | ||
24 | `#define __ARCH_WANT_INTERRUPTS_ON_CTXSW` in the same place as for | ||
25 | unlocked context switches. This define also implies | ||
26 | `__ARCH_WANT_UNLOCKED_CTXSW`. See include/asm-arm/system.h for an | ||
27 | example. | ||
28 | |||
29 | |||
30 | CPU idle | ||
31 | ======== | ||
32 | Your cpu_idle routines need to obey the following rules: | ||
33 | |||
34 | 1. Preempt should now disabled over idle routines. Should only | ||
35 | be enabled to call schedule() then disabled again. | ||
36 | |||
37 | 2. need_resched/TIF_NEED_RESCHED is only ever set, and will never | ||
38 | be cleared until the running task has called schedule(). Idle | ||
39 | threads need only ever query need_resched, and may never set or | ||
40 | clear it. | ||
41 | |||
42 | 3. When cpu_idle finds (need_resched() == 'true'), it should call | ||
43 | schedule(). It should not call schedule() otherwise. | ||
44 | |||
45 | 4. The only time interrupts need to be disabled when checking | ||
46 | need_resched is if we are about to sleep the processor until | ||
47 | the next interrupt (this doesn't provide any protection of | ||
48 | need_resched, it prevents losing an interrupt). | ||
49 | |||
50 | 4a. Common problem with this type of sleep appears to be: | ||
51 | local_irq_disable(); | ||
52 | if (!need_resched()) { | ||
53 | local_irq_enable(); | ||
54 | *** resched interrupt arrives here *** | ||
55 | __asm__("sleep until next interrupt"); | ||
56 | } | ||
57 | |||
58 | 5. TIF_POLLING_NRFLAG can be set by idle routines that do not | ||
59 | need an interrupt to wake them up when need_resched goes high. | ||
60 | In other words, they must be periodically polling need_resched, | ||
61 | although it may be reasonable to do some background work or enter | ||
62 | a low CPU priority. | ||
63 | |||
64 | 5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter | ||
65 | an interrupt sleep, it needs to be cleared then a memory | ||
66 | barrier issued (followed by a test of need_resched with | ||
67 | interrupts disabled, as explained in 3). | ||
68 | |||
69 | arch/i386/kernel/process.c has examples of both polling and | ||
70 | sleeping idle functions. | ||
71 | |||
72 | |||
73 | Possible arch/ problems | ||
74 | ======================= | ||
75 | |||
76 | Possible arch problems I found (and either tried to fix or didn't): | ||
77 | |||
78 | h8300 - Is such sleeping racy vs interrupts? (See #4a). | ||
79 | The H8/300 manual I found indicates yes, however disabling IRQs | ||
80 | over the sleep mean only NMIs can wake it up, so can't fix easily | ||
81 | without doing spin waiting. | ||
82 | |||
83 | ia64 - is safe_halt call racy vs interrupts? (does it sleep?) (See #4a) | ||
84 | |||
85 | sh64 - Is sleeping racy vs interrupts? (See #4a) | ||
86 | |||
87 | sparc - IRQs on at this point(?), change local_irq_save to _disable. | ||
88 | - TODO: needs secondary CPUs to disable preempt (See #1) | ||
89 | |||
diff --git a/Documentation/scheduler/sched-coding.txt b/Documentation/scheduler/sched-coding.txt new file mode 100644 index 000000000000..cbd8db752acf --- /dev/null +++ b/Documentation/scheduler/sched-coding.txt | |||
@@ -0,0 +1,126 @@ | |||
1 | Reference for various scheduler-related methods in the O(1) scheduler | ||
2 | Robert Love <rml@tech9.net>, MontaVista Software | ||
3 | |||
4 | |||
5 | Note most of these methods are local to kernel/sched.c - this is by design. | ||
6 | The scheduler is meant to be self-contained and abstracted away. This document | ||
7 | is primarily for understanding the scheduler, not interfacing to it. Some of | ||
8 | the discussed interfaces, however, are general process/scheduling methods. | ||
9 | They are typically defined in include/linux/sched.h. | ||
10 | |||
11 | |||
12 | Main Scheduling Methods | ||
13 | ----------------------- | ||
14 | |||
15 | void load_balance(runqueue_t *this_rq, int idle) | ||
16 | Attempts to pull tasks from one cpu to another to balance cpu usage, | ||
17 | if needed. This method is called explicitly if the runqueues are | ||
18 | imbalanced or periodically by the timer tick. Prior to calling, | ||
19 | the current runqueue must be locked and interrupts disabled. | ||
20 | |||
21 | void schedule() | ||
22 | The main scheduling function. Upon return, the highest priority | ||
23 | process will be active. | ||
24 | |||
25 | |||
26 | Locking | ||
27 | ------- | ||
28 | |||
29 | Each runqueue has its own lock, rq->lock. When multiple runqueues need | ||
30 | to be locked, lock acquires must be ordered by ascending &runqueue value. | ||
31 | |||
32 | A specific runqueue is locked via | ||
33 | |||
34 | task_rq_lock(task_t pid, unsigned long *flags) | ||
35 | |||
36 | which disables preemption, disables interrupts, and locks the runqueue pid is | ||
37 | running on. Likewise, | ||
38 | |||
39 | task_rq_unlock(task_t pid, unsigned long *flags) | ||
40 | |||
41 | unlocks the runqueue pid is running on, restores interrupts to their previous | ||
42 | state, and reenables preemption. | ||
43 | |||
44 | The routines | ||
45 | |||
46 | double_rq_lock(runqueue_t *rq1, runqueue_t *rq2) | ||
47 | |||
48 | and | ||
49 | |||
50 | double_rq_unlock(runqueue_t *rq1, runqueue_t *rq2) | ||
51 | |||
52 | safely lock and unlock, respectively, the two specified runqueues. They do | ||
53 | not, however, disable and restore interrupts. Users are required to do so | ||
54 | manually before and after calls. | ||
55 | |||
56 | |||
57 | Values | ||
58 | ------ | ||
59 | |||
60 | MAX_PRIO | ||
61 | The maximum priority of the system, stored in the task as task->prio. | ||
62 | Lower priorities are higher. Normal (non-RT) priorities range from | ||
63 | MAX_RT_PRIO to (MAX_PRIO - 1). | ||
64 | MAX_RT_PRIO | ||
65 | The maximum real-time priority of the system. Valid RT priorities | ||
66 | range from 0 to (MAX_RT_PRIO - 1). | ||
67 | MAX_USER_RT_PRIO | ||
68 | The maximum real-time priority that is exported to user-space. Should | ||
69 | always be equal to or less than MAX_RT_PRIO. Setting it less allows | ||
70 | kernel threads to have higher priorities than any user-space task. | ||
71 | MIN_TIMESLICE | ||
72 | MAX_TIMESLICE | ||
73 | Respectively, the minimum and maximum timeslices (quanta) of a process. | ||
74 | |||
75 | Data | ||
76 | ---- | ||
77 | |||
78 | struct runqueue | ||
79 | The main per-CPU runqueue data structure. | ||
80 | struct task_struct | ||
81 | The main per-process data structure. | ||
82 | |||
83 | |||
84 | General Methods | ||
85 | --------------- | ||
86 | |||
87 | cpu_rq(cpu) | ||
88 | Returns the runqueue of the specified cpu. | ||
89 | this_rq() | ||
90 | Returns the runqueue of the current cpu. | ||
91 | task_rq(pid) | ||
92 | Returns the runqueue which holds the specified pid. | ||
93 | cpu_curr(cpu) | ||
94 | Returns the task currently running on the given cpu. | ||
95 | rt_task(pid) | ||
96 | Returns true if pid is real-time, false if not. | ||
97 | |||
98 | |||
99 | Process Control Methods | ||
100 | ----------------------- | ||
101 | |||
102 | void set_user_nice(task_t *p, long nice) | ||
103 | Sets the "nice" value of task p to the given value. | ||
104 | int setscheduler(pid_t pid, int policy, struct sched_param *param) | ||
105 | Sets the scheduling policy and parameters for the given pid. | ||
106 | int set_cpus_allowed(task_t *p, unsigned long new_mask) | ||
107 | Sets a given task's CPU affinity and migrates it to a proper cpu. | ||
108 | Callers must have a valid reference to the task and assure the | ||
109 | task not exit prematurely. No locks can be held during the call. | ||
110 | set_task_state(tsk, state_value) | ||
111 | Sets the given task's state to the given value. | ||
112 | set_current_state(state_value) | ||
113 | Sets the current task's state to the given value. | ||
114 | void set_tsk_need_resched(struct task_struct *tsk) | ||
115 | Sets need_resched in the given task. | ||
116 | void clear_tsk_need_resched(struct task_struct *tsk) | ||
117 | Clears need_resched in the given task. | ||
118 | void set_need_resched() | ||
119 | Sets need_resched in the current task. | ||
120 | void clear_need_resched() | ||
121 | Clears need_resched in the current task. | ||
122 | int need_resched() | ||
123 | Returns true if need_resched is set in the current task, false | ||
124 | otherwise. | ||
125 | yield() | ||
126 | Place the current process at the end of the runqueue and call schedule. | ||
diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt new file mode 100644 index 000000000000..88bcb8767335 --- /dev/null +++ b/Documentation/scheduler/sched-design-CFS.txt | |||
@@ -0,0 +1,186 @@ | |||
1 | |||
2 | This is the CFS scheduler. | ||
3 | |||
4 | 80% of CFS's design can be summed up in a single sentence: CFS basically | ||
5 | models an "ideal, precise multi-tasking CPU" on real hardware. | ||
6 | |||
7 | "Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100% | ||
8 | physical power and which can run each task at precise equal speed, in | ||
9 | parallel, each at 1/nr_running speed. For example: if there are 2 tasks | ||
10 | running then it runs each at 50% physical power - totally in parallel. | ||
11 | |||
12 | On real hardware, we can run only a single task at once, so while that | ||
13 | one task runs, the other tasks that are waiting for the CPU are at a | ||
14 | disadvantage - the current task gets an unfair amount of CPU time. In | ||
15 | CFS this fairness imbalance is expressed and tracked via the per-task | ||
16 | p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of | ||
17 | time the task should now run on the CPU for it to become completely fair | ||
18 | and balanced. | ||
19 | |||
20 | ( small detail: on 'ideal' hardware, the p->wait_runtime value would | ||
21 | always be zero - no task would ever get 'out of balance' from the | ||
22 | 'ideal' share of CPU time. ) | ||
23 | |||
24 | CFS's task picking logic is based on this p->wait_runtime value and it | ||
25 | is thus very simple: it always tries to run the task with the largest | ||
26 | p->wait_runtime value. In other words, CFS tries to run the task with | ||
27 | the 'gravest need' for more CPU time. So CFS always tries to split up | ||
28 | CPU time between runnable tasks as close to 'ideal multitasking | ||
29 | hardware' as possible. | ||
30 | |||
31 | Most of the rest of CFS's design just falls out of this really simple | ||
32 | concept, with a few add-on embellishments like nice levels, | ||
33 | multiprocessing and various algorithm variants to recognize sleepers. | ||
34 | |||
35 | In practice it works like this: the system runs a task a bit, and when | ||
36 | the task schedules (or a scheduler tick happens) the task's CPU usage is | ||
37 | 'accounted for': the (small) time it just spent using the physical CPU | ||
38 | is deducted from p->wait_runtime. [minus the 'fair share' it would have | ||
39 | gotten anyway]. Once p->wait_runtime gets low enough so that another | ||
40 | task becomes the 'leftmost task' of the time-ordered rbtree it maintains | ||
41 | (plus a small amount of 'granularity' distance relative to the leftmost | ||
42 | task so that we do not over-schedule tasks and trash the cache) then the | ||
43 | new leftmost task is picked and the current task is preempted. | ||
44 | |||
45 | The rq->fair_clock value tracks the 'CPU time a runnable task would have | ||
46 | fairly gotten, had it been runnable during that time'. So by using | ||
47 | rq->fair_clock values we can accurately timestamp and measure the | ||
48 | 'expected CPU time' a task should have gotten. All runnable tasks are | ||
49 | sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and | ||
50 | CFS picks the 'leftmost' task and sticks to it. As the system progresses | ||
51 | forwards, newly woken tasks are put into the tree more and more to the | ||
52 | right - slowly but surely giving a chance for every task to become the | ||
53 | 'leftmost task' and thus get on the CPU within a deterministic amount of | ||
54 | time. | ||
55 | |||
56 | Some implementation details: | ||
57 | |||
58 | - the introduction of Scheduling Classes: an extensible hierarchy of | ||
59 | scheduler modules. These modules encapsulate scheduling policy | ||
60 | details and are handled by the scheduler core without the core | ||
61 | code assuming about them too much. | ||
62 | |||
63 | - sched_fair.c implements the 'CFS desktop scheduler': it is a | ||
64 | replacement for the vanilla scheduler's SCHED_OTHER interactivity | ||
65 | code. | ||
66 | |||
67 | I'd like to give credit to Con Kolivas for the general approach here: | ||
68 | he has proven via RSDL/SD that 'fair scheduling' is possible and that | ||
69 | it results in better desktop scheduling. Kudos Con! | ||
70 | |||
71 | The CFS patch uses a completely different approach and implementation | ||
72 | from RSDL/SD. My goal was to make CFS's interactivity quality exceed | ||
73 | that of RSDL/SD, which is a high standard to meet :-) Testing | ||
74 | feedback is welcome to decide this one way or another. [ and, in any | ||
75 | case, all of SD's logic could be added via a kernel/sched_sd.c module | ||
76 | as well, if Con is interested in such an approach. ] | ||
77 | |||
78 | CFS's design is quite radical: it does not use runqueues, it uses a | ||
79 | time-ordered rbtree to build a 'timeline' of future task execution, | ||
80 | and thus has no 'array switch' artifacts (by which both the vanilla | ||
81 | scheduler and RSDL/SD are affected). | ||
82 | |||
83 | CFS uses nanosecond granularity accounting and does not rely on any | ||
84 | jiffies or other HZ detail. Thus the CFS scheduler has no notion of | ||
85 | 'timeslices' and has no heuristics whatsoever. There is only one | ||
86 | central tunable (you have to switch on CONFIG_SCHED_DEBUG): | ||
87 | |||
88 | /proc/sys/kernel/sched_granularity_ns | ||
89 | |||
90 | which can be used to tune the scheduler from 'desktop' (low | ||
91 | latencies) to 'server' (good batching) workloads. It defaults to a | ||
92 | setting suitable for desktop workloads. SCHED_BATCH is handled by the | ||
93 | CFS scheduler module too. | ||
94 | |||
95 | Due to its design, the CFS scheduler is not prone to any of the | ||
96 | 'attacks' that exist today against the heuristics of the stock | ||
97 | scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all | ||
98 | work fine and do not impact interactivity and produce the expected | ||
99 | behavior. | ||
100 | |||
101 | the CFS scheduler has a much stronger handling of nice levels and | ||
102 | SCHED_BATCH: both types of workloads should be isolated much more | ||
103 | agressively than under the vanilla scheduler. | ||
104 | |||
105 | ( another detail: due to nanosec accounting and timeline sorting, | ||
106 | sched_yield() support is very simple under CFS, and in fact under | ||
107 | CFS sched_yield() behaves much better than under any other | ||
108 | scheduler i have tested so far. ) | ||
109 | |||
110 | - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler | ||
111 | way than the vanilla scheduler does. It uses 100 runqueues (for all | ||
112 | 100 RT priority levels, instead of 140 in the vanilla scheduler) | ||
113 | and it needs no expired array. | ||
114 | |||
115 | - reworked/sanitized SMP load-balancing: the runqueue-walking | ||
116 | assumptions are gone from the load-balancing code now, and | ||
117 | iterators of the scheduling modules are used. The balancing code got | ||
118 | quite a bit simpler as a result. | ||
119 | |||
120 | |||
121 | Group scheduler extension to CFS | ||
122 | ================================ | ||
123 | |||
124 | Normally the scheduler operates on individual tasks and strives to provide | ||
125 | fair CPU time to each task. Sometimes, it may be desirable to group tasks | ||
126 | and provide fair CPU time to each such task group. For example, it may | ||
127 | be desirable to first provide fair CPU time to each user on the system | ||
128 | and then to each task belonging to a user. | ||
129 | |||
130 | CONFIG_FAIR_GROUP_SCHED strives to achieve exactly that. It lets | ||
131 | SCHED_NORMAL/BATCH tasks be be grouped and divides CPU time fairly among such | ||
132 | groups. At present, there are two (mutually exclusive) mechanisms to group | ||
133 | tasks for CPU bandwidth control purpose: | ||
134 | |||
135 | - Based on user id (CONFIG_FAIR_USER_SCHED) | ||
136 | In this option, tasks are grouped according to their user id. | ||
137 | - Based on "cgroup" pseudo filesystem (CONFIG_FAIR_CGROUP_SCHED) | ||
138 | This options lets the administrator create arbitrary groups | ||
139 | of tasks, using the "cgroup" pseudo filesystem. See | ||
140 | Documentation/cgroups.txt for more information about this | ||
141 | filesystem. | ||
142 | |||
143 | Only one of these options to group tasks can be chosen and not both. | ||
144 | |||
145 | Group scheduler tunables: | ||
146 | |||
147 | When CONFIG_FAIR_USER_SCHED is defined, a directory is created in sysfs for | ||
148 | each new user and a "cpu_share" file is added in that directory. | ||
149 | |||
150 | # cd /sys/kernel/uids | ||
151 | # cat 512/cpu_share # Display user 512's CPU share | ||
152 | 1024 | ||
153 | # echo 2048 > 512/cpu_share # Modify user 512's CPU share | ||
154 | # cat 512/cpu_share # Display user 512's CPU share | ||
155 | 2048 | ||
156 | # | ||
157 | |||
158 | CPU bandwidth between two users are divided in the ratio of their CPU shares. | ||
159 | For ex: if you would like user "root" to get twice the bandwidth of user | ||
160 | "guest", then set the cpu_share for both the users such that "root"'s | ||
161 | cpu_share is twice "guest"'s cpu_share | ||
162 | |||
163 | |||
164 | When CONFIG_FAIR_CGROUP_SCHED is defined, a "cpu.shares" file is created | ||
165 | for each group created using the pseudo filesystem. See example steps | ||
166 | below to create task groups and modify their CPU share using the "cgroups" | ||
167 | pseudo filesystem | ||
168 | |||
169 | # mkdir /dev/cpuctl | ||
170 | # mount -t cgroup -ocpu none /dev/cpuctl | ||
171 | # cd /dev/cpuctl | ||
172 | |||
173 | # mkdir multimedia # create "multimedia" group of tasks | ||
174 | # mkdir browser # create "browser" group of tasks | ||
175 | |||
176 | # #Configure the multimedia group to receive twice the CPU bandwidth | ||
177 | # #that of browser group | ||
178 | |||
179 | # echo 2048 > multimedia/cpu.shares | ||
180 | # echo 1024 > browser/cpu.shares | ||
181 | |||
182 | # firefox & # Launch firefox and move it to "browser" group | ||
183 | # echo <firefox_pid> > browser/tasks | ||
184 | |||
185 | # #Launch gmplayer (or your favourite movie player) | ||
186 | # echo <movie_player_pid> > multimedia/tasks | ||
diff --git a/Documentation/scheduler/sched-design.txt b/Documentation/scheduler/sched-design.txt new file mode 100644 index 000000000000..1605bf0cba8b --- /dev/null +++ b/Documentation/scheduler/sched-design.txt | |||
@@ -0,0 +1,165 @@ | |||
1 | Goals, Design and Implementation of the | ||
2 | new ultra-scalable O(1) scheduler | ||
3 | |||
4 | |||
5 | This is an edited version of an email Ingo Molnar sent to | ||
6 | lkml on 4 Jan 2002. It describes the goals, design, and | ||
7 | implementation of Ingo's new ultra-scalable O(1) scheduler. | ||
8 | Last Updated: 18 April 2002. | ||
9 | |||
10 | |||
11 | Goal | ||
12 | ==== | ||
13 | |||
14 | The main goal of the new scheduler is to keep all the good things we know | ||
15 | and love about the current Linux scheduler: | ||
16 | |||
17 | - good interactive performance even during high load: if the user | ||
18 | types or clicks then the system must react instantly and must execute | ||
19 | the user tasks smoothly, even during considerable background load. | ||
20 | |||
21 | - good scheduling/wakeup performance with 1-2 runnable processes. | ||
22 | |||
23 | - fairness: no process should stay without any timeslice for any | ||
24 | unreasonable amount of time. No process should get an unjustly high | ||
25 | amount of CPU time. | ||
26 | |||
27 | - priorities: less important tasks can be started with lower priority, | ||
28 | more important tasks with higher priority. | ||
29 | |||
30 | - SMP efficiency: no CPU should stay idle if there is work to do. | ||
31 | |||
32 | - SMP affinity: processes which run on one CPU should stay affine to | ||
33 | that CPU. Processes should not bounce between CPUs too frequently. | ||
34 | |||
35 | - plus additional scheduler features: RT scheduling, CPU binding. | ||
36 | |||
37 | and the goal is also to add a few new things: | ||
38 | |||
39 | - fully O(1) scheduling. Are you tired of the recalculation loop | ||
40 | blowing the L1 cache away every now and then? Do you think the goodness | ||
41 | loop is taking a bit too long to finish if there are lots of runnable | ||
42 | processes? This new scheduler takes no prisoners: wakeup(), schedule(), | ||
43 | the timer interrupt are all O(1) algorithms. There is no recalculation | ||
44 | loop. There is no goodness loop either. | ||
45 | |||
46 | - 'perfect' SMP scalability. With the new scheduler there is no 'big' | ||
47 | runqueue_lock anymore - it's all per-CPU runqueues and locks - two | ||
48 | tasks on two separate CPUs can wake up, schedule and context-switch | ||
49 | completely in parallel, without any interlocking. All | ||
50 | scheduling-relevant data is structured for maximum scalability. | ||
51 | |||
52 | - better SMP affinity. The old scheduler has a particular weakness that | ||
53 | causes the random bouncing of tasks between CPUs if/when higher | ||
54 | priority/interactive tasks, this was observed and reported by many | ||
55 | people. The reason is that the timeslice recalculation loop first needs | ||
56 | every currently running task to consume its timeslice. But when this | ||
57 | happens on eg. an 8-way system, then this property starves an | ||
58 | increasing number of CPUs from executing any process. Once the last | ||
59 | task that has a timeslice left has finished using up that timeslice, | ||
60 | the recalculation loop is triggered and other CPUs can start executing | ||
61 | tasks again - after having idled around for a number of timer ticks. | ||
62 | The more CPUs, the worse this effect. | ||
63 | |||
64 | Furthermore, this same effect causes the bouncing effect as well: | ||
65 | whenever there is such a 'timeslice squeeze' of the global runqueue, | ||
66 | idle processors start executing tasks which are not affine to that CPU. | ||
67 | (because the affine tasks have finished off their timeslices already.) | ||
68 | |||
69 | The new scheduler solves this problem by distributing timeslices on a | ||
70 | per-CPU basis, without having any global synchronization or | ||
71 | recalculation. | ||
72 | |||
73 | - batch scheduling. A significant proportion of computing-intensive tasks | ||
74 | benefit from batch-scheduling, where timeslices are long and processes | ||
75 | are roundrobin scheduled. The new scheduler does such batch-scheduling | ||
76 | of the lowest priority tasks - so nice +19 jobs will get | ||
77 | 'batch-scheduled' automatically. With this scheduler, nice +19 jobs are | ||
78 | in essence SCHED_IDLE, from an interactiveness point of view. | ||
79 | |||
80 | - handle extreme loads more smoothly, without breakdown and scheduling | ||
81 | storms. | ||
82 | |||
83 | - O(1) RT scheduling. For those RT folks who are paranoid about the | ||
84 | O(nr_running) property of the goodness loop and the recalculation loop. | ||
85 | |||
86 | - run fork()ed children before the parent. Andrea has pointed out the | ||
87 | advantages of this a few months ago, but patches for this feature | ||
88 | do not work with the old scheduler as well as they should, | ||
89 | because idle processes often steal the new child before the fork()ing | ||
90 | CPU gets to execute it. | ||
91 | |||
92 | |||
93 | Design | ||
94 | ====== | ||
95 | |||
96 | The core of the new scheduler contains the following mechanisms: | ||
97 | |||
98 | - *two* priority-ordered 'priority arrays' per CPU. There is an 'active' | ||
99 | array and an 'expired' array. The active array contains all tasks that | ||
100 | are affine to this CPU and have timeslices left. The expired array | ||
101 | contains all tasks which have used up their timeslices - but this array | ||
102 | is kept sorted as well. The active and expired array is not accessed | ||
103 | directly, it's accessed through two pointers in the per-CPU runqueue | ||
104 | structure. If all active tasks are used up then we 'switch' the two | ||
105 | pointers and from now on the ready-to-go (former-) expired array is the | ||
106 | active array - and the empty active array serves as the new collector | ||
107 | for expired tasks. | ||
108 | |||
109 | - there is a 64-bit bitmap cache for array indices. Finding the highest | ||
110 | priority task is thus a matter of two x86 BSFL bit-search instructions. | ||
111 | |||
112 | the split-array solution enables us to have an arbitrary number of active | ||
113 | and expired tasks, and the recalculation of timeslices can be done | ||
114 | immediately when the timeslice expires. Because the arrays are always | ||
115 | access through the pointers in the runqueue, switching the two arrays can | ||
116 | be done very quickly. | ||
117 | |||
118 | this is a hybride priority-list approach coupled with roundrobin | ||
119 | scheduling and the array-switch method of distributing timeslices. | ||
120 | |||
121 | - there is a per-task 'load estimator'. | ||
122 | |||
123 | one of the toughest things to get right is good interactive feel during | ||
124 | heavy system load. While playing with various scheduler variants i found | ||
125 | that the best interactive feel is achieved not by 'boosting' interactive | ||
126 | tasks, but by 'punishing' tasks that want to use more CPU time than there | ||
127 | is available. This method is also much easier to do in an O(1) fashion. | ||
128 | |||
129 | to establish the actual 'load' the task contributes to the system, a | ||
130 | complex-looking but pretty accurate method is used: there is a 4-entry | ||
131 | 'history' ringbuffer of the task's activities during the last 4 seconds. | ||
132 | This ringbuffer is operated without much overhead. The entries tell the | ||
133 | scheduler a pretty accurate load-history of the task: has it used up more | ||
134 | CPU time or less during the past N seconds. [the size '4' and the interval | ||
135 | of 4x 1 seconds was found by lots of experimentation - this part is | ||
136 | flexible and can be changed in both directions.] | ||
137 | |||
138 | the penalty a task gets for generating more load than the CPU can handle | ||
139 | is a priority decrease - there is a maximum amount to this penalty | ||
140 | relative to their static priority, so even fully CPU-bound tasks will | ||
141 | observe each other's priorities, and will share the CPU accordingly. | ||
142 | |||
143 | the SMP load-balancer can be extended/switched with additional parallel | ||
144 | computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs | ||
145 | can be supported easily by changing the load-balancer. Right now it's | ||
146 | tuned for my SMP systems. | ||
147 | |||
148 | i skipped the prev->mm == next->mm advantage - no workload i know of shows | ||
149 | any sensitivity to this. It can be added back by sacrificing O(1) | ||
150 | schedule() [the current and one-lower priority list can be searched for a | ||
151 | that->mm == current->mm condition], but costs a fair number of cycles | ||
152 | during a number of important workloads, so i wanted to avoid this as much | ||
153 | as possible. | ||
154 | |||
155 | - the SMP idle-task startup code was still racy and the new scheduler | ||
156 | triggered this. So i streamlined the idle-setup code a bit. We do not call | ||
157 | into schedule() before all processors have started up fully and all idle | ||
158 | threads are in place. | ||
159 | |||
160 | - the patch also cleans up a number of aspects of sched.c - moves code | ||
161 | into other areas of the kernel where it's appropriate, and simplifies | ||
162 | certain code paths and data constructs. As a result, the new scheduler's | ||
163 | code is smaller than the old one. | ||
164 | |||
165 | Ingo | ||
diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.txt new file mode 100644 index 000000000000..a9e990ab980f --- /dev/null +++ b/Documentation/scheduler/sched-domains.txt | |||
@@ -0,0 +1,70 @@ | |||
1 | Each CPU has a "base" scheduling domain (struct sched_domain). These are | ||
2 | accessed via cpu_sched_domain(i) and this_sched_domain() macros. The domain | ||
3 | hierarchy is built from these base domains via the ->parent pointer. ->parent | ||
4 | MUST be NULL terminated, and domain structures should be per-CPU as they | ||
5 | are locklessly updated. | ||
6 | |||
7 | Each scheduling domain spans a number of CPUs (stored in the ->span field). | ||
8 | A domain's span MUST be a superset of it child's span (this restriction could | ||
9 | be relaxed if the need arises), and a base domain for CPU i MUST span at least | ||
10 | i. The top domain for each CPU will generally span all CPUs in the system | ||
11 | although strictly it doesn't have to, but this could lead to a case where some | ||
12 | CPUs will never be given tasks to run unless the CPUs allowed mask is | ||
13 | explicitly set. A sched domain's span means "balance process load among these | ||
14 | CPUs". | ||
15 | |||
16 | Each scheduling domain must have one or more CPU groups (struct sched_group) | ||
17 | which are organised as a circular one way linked list from the ->groups | ||
18 | pointer. The union of cpumasks of these groups MUST be the same as the | ||
19 | domain's span. The intersection of cpumasks from any two of these groups | ||
20 | MUST be the empty set. The group pointed to by the ->groups pointer MUST | ||
21 | contain the CPU to which the domain belongs. Groups may be shared among | ||
22 | CPUs as they contain read only data after they have been set up. | ||
23 | |||
24 | Balancing within a sched domain occurs between groups. That is, each group | ||
25 | is treated as one entity. The load of a group is defined as the sum of the | ||
26 | load of each of its member CPUs, and only when the load of a group becomes | ||
27 | out of balance are tasks moved between groups. | ||
28 | |||
29 | In kernel/sched.c, rebalance_tick is run periodically on each CPU. This | ||
30 | function takes its CPU's base sched domain and checks to see if has reached | ||
31 | its rebalance interval. If so, then it will run load_balance on that domain. | ||
32 | rebalance_tick then checks the parent sched_domain (if it exists), and the | ||
33 | parent of the parent and so forth. | ||
34 | |||
35 | *** Implementing sched domains *** | ||
36 | The "base" domain will "span" the first level of the hierarchy. In the case | ||
37 | of SMT, you'll span all siblings of the physical CPU, with each group being | ||
38 | a single virtual CPU. | ||
39 | |||
40 | In SMP, the parent of the base domain will span all physical CPUs in the | ||
41 | node. Each group being a single physical CPU. Then with NUMA, the parent | ||
42 | of the SMP domain will span the entire machine, with each group having the | ||
43 | cpumask of a node. Or, you could do multi-level NUMA or Opteron, for example, | ||
44 | might have just one domain covering its one NUMA level. | ||
45 | |||
46 | The implementor should read comments in include/linux/sched.h: | ||
47 | struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of | ||
48 | the specifics and what to tune. | ||
49 | |||
50 | For SMT, the architecture must define CONFIG_SCHED_SMT and provide a | ||
51 | cpumask_t cpu_sibling_map[NR_CPUS], where cpu_sibling_map[i] is the mask of | ||
52 | all "i"'s siblings as well as "i" itself. | ||
53 | |||
54 | Architectures may retain the regular override the default SD_*_INIT flags | ||
55 | while using the generic domain builder in kernel/sched.c if they wish to | ||
56 | retain the traditional SMT->SMP->NUMA topology (or some subset of that). This | ||
57 | can be done by #define'ing ARCH_HASH_SCHED_TUNE. | ||
58 | |||
59 | Alternatively, the architecture may completely override the generic domain | ||
60 | builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting your | ||
61 | arch_init_sched_domains function. This function will attach domains to all | ||
62 | CPUs using cpu_attach_domain. | ||
63 | |||
64 | Implementors should change the line | ||
65 | #undef SCHED_DOMAIN_DEBUG | ||
66 | to | ||
67 | #define SCHED_DOMAIN_DEBUG | ||
68 | in kernel/sched.c as this enables an error checking parse of the sched domains | ||
69 | which should catch most possible errors (described above). It also prints out | ||
70 | the domain structure in a visual format. | ||
diff --git a/Documentation/scheduler/sched-nice-design.txt b/Documentation/scheduler/sched-nice-design.txt new file mode 100644 index 000000000000..e2bae5a577e3 --- /dev/null +++ b/Documentation/scheduler/sched-nice-design.txt | |||
@@ -0,0 +1,108 @@ | |||
1 | This document explains the thinking about the revamped and streamlined | ||
2 | nice-levels implementation in the new Linux scheduler. | ||
3 | |||
4 | Nice levels were always pretty weak under Linux and people continuously | ||
5 | pestered us to make nice +19 tasks use up much less CPU time. | ||
6 | |||
7 | Unfortunately that was not that easy to implement under the old | ||
8 | scheduler, (otherwise we'd have done it long ago) because nice level | ||
9 | support was historically coupled to timeslice length, and timeslice | ||
10 | units were driven by the HZ tick, so the smallest timeslice was 1/HZ. | ||
11 | |||
12 | In the O(1) scheduler (in 2003) we changed negative nice levels to be | ||
13 | much stronger than they were before in 2.4 (and people were happy about | ||
14 | that change), and we also intentionally calibrated the linear timeslice | ||
15 | rule so that nice +19 level would be _exactly_ 1 jiffy. To better | ||
16 | understand it, the timeslice graph went like this (cheesy ASCII art | ||
17 | alert!): | ||
18 | |||
19 | |||
20 | A | ||
21 | \ | [timeslice length] | ||
22 | \ | | ||
23 | \ | | ||
24 | \ | | ||
25 | \ | | ||
26 | \|___100msecs | ||
27 | |^ . _ | ||
28 | | ^ . _ | ||
29 | | ^ . _ | ||
30 | -*----------------------------------*-----> [nice level] | ||
31 | -20 | +19 | ||
32 | | | ||
33 | | | ||
34 | |||
35 | So that if someone wanted to really renice tasks, +19 would give a much | ||
36 | bigger hit than the normal linear rule would do. (The solution of | ||
37 | changing the ABI to extend priorities was discarded early on.) | ||
38 | |||
39 | This approach worked to some degree for some time, but later on with | ||
40 | HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which | ||
41 | we felt to be a bit excessive. Excessive _not_ because it's too small of | ||
42 | a CPU utilization, but because it causes too frequent (once per | ||
43 | millisec) rescheduling. (and would thus trash the cache, etc. Remember, | ||
44 | this was long ago when hardware was weaker and caches were smaller, and | ||
45 | people were running number crunching apps at nice +19.) | ||
46 | |||
47 | So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the | ||
48 | right minimal granularity - and this translates to 5% CPU utilization. | ||
49 | But the fundamental HZ-sensitive property for nice+19 still remained, | ||
50 | and we never got a single complaint about nice +19 being too _weak_ in | ||
51 | terms of CPU utilization, we only got complaints about it (still) being | ||
52 | too _strong_ :-) | ||
53 | |||
54 | To sum it up: we always wanted to make nice levels more consistent, but | ||
55 | within the constraints of HZ and jiffies and their nasty design level | ||
56 | coupling to timeslices and granularity it was not really viable. | ||
57 | |||
58 | The second (less frequent but still periodically occuring) complaint | ||
59 | about Linux's nice level support was its assymetry around the origo | ||
60 | (which you can see demonstrated in the picture above), or more | ||
61 | accurately: the fact that nice level behavior depended on the _absolute_ | ||
62 | nice level as well, while the nice API itself is fundamentally | ||
63 | "relative": | ||
64 | |||
65 | int nice(int inc); | ||
66 | |||
67 | asmlinkage long sys_nice(int increment) | ||
68 | |||
69 | (the first one is the glibc API, the second one is the syscall API.) | ||
70 | Note that the 'inc' is relative to the current nice level. Tools like | ||
71 | bash's "nice" command mirror this relative API. | ||
72 | |||
73 | With the old scheduler, if you for example started a niced task with +1 | ||
74 | and another task with +2, the CPU split between the two tasks would | ||
75 | depend on the nice level of the parent shell - if it was at nice -10 the | ||
76 | CPU split was different than if it was at +5 or +10. | ||
77 | |||
78 | A third complaint against Linux's nice level support was that negative | ||
79 | nice levels were not 'punchy enough', so lots of people had to resort to | ||
80 | run audio (and other multimedia) apps under RT priorities such as | ||
81 | SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation | ||
82 | proof, and a buggy SCHED_FIFO app can also lock up the system for good. | ||
83 | |||
84 | The new scheduler in v2.6.23 addresses all three types of complaints: | ||
85 | |||
86 | To address the first complaint (of nice levels being not "punchy" | ||
87 | enough), the scheduler was decoupled from 'time slice' and HZ concepts | ||
88 | (and granularity was made a separate concept from nice levels) and thus | ||
89 | it was possible to implement better and more consistent nice +19 | ||
90 | support: with the new scheduler nice +19 tasks get a HZ-independent | ||
91 | 1.5%, instead of the variable 3%-5%-9% range they got in the old | ||
92 | scheduler. | ||
93 | |||
94 | To address the second complaint (of nice levels not being consistent), | ||
95 | the new scheduler makes nice(1) have the same CPU utilization effect on | ||
96 | tasks, regardless of their absolute nice levels. So on the new | ||
97 | scheduler, running a nice +10 and a nice 11 task has the same CPU | ||
98 | utilization "split" between them as running a nice -5 and a nice -4 | ||
99 | task. (one will get 55% of the CPU, the other 45%.) That is why nice | ||
100 | levels were changed to be "multiplicative" (or exponential) - that way | ||
101 | it does not matter which nice level you start out from, the 'relative | ||
102 | result' will always be the same. | ||
103 | |||
104 | The third complaint (of negative nice levels not being "punchy" enough | ||
105 | and forcing audio apps to run under the more dangerous SCHED_FIFO | ||
106 | scheduling policy) is addressed by the new scheduler almost | ||
107 | automatically: stronger negative nice levels are an automatic | ||
108 | side-effect of the recalibrated dynamic range of nice levels. | ||
diff --git a/Documentation/scheduler/sched-stats.txt b/Documentation/scheduler/sched-stats.txt new file mode 100644 index 000000000000..442e14d35dea --- /dev/null +++ b/Documentation/scheduler/sched-stats.txt | |||
@@ -0,0 +1,156 @@ | |||
1 | Version 14 of schedstats includes support for sched_domains, which hit the | ||
2 | mainline kernel in 2.6.20 although it is identical to the stats from version | ||
3 | 12 which was in the kernel from 2.6.13-2.6.19 (version 13 never saw a kernel | ||
4 | release). Some counters make more sense to be per-runqueue; other to be | ||
5 | per-domain. Note that domains (and their associated information) will only | ||
6 | be pertinent and available on machines utilizing CONFIG_SMP. | ||
7 | |||
8 | In version 14 of schedstat, there is at least one level of domain | ||
9 | statistics for each cpu listed, and there may well be more than one | ||
10 | domain. Domains have no particular names in this implementation, but | ||
11 | the highest numbered one typically arbitrates balancing across all the | ||
12 | cpus on the machine, while domain0 is the most tightly focused domain, | ||
13 | sometimes balancing only between pairs of cpus. At this time, there | ||
14 | are no architectures which need more than three domain levels. The first | ||
15 | field in the domain stats is a bit map indicating which cpus are affected | ||
16 | by that domain. | ||
17 | |||
18 | These fields are counters, and only increment. Programs which make use | ||
19 | of these will need to start with a baseline observation and then calculate | ||
20 | the change in the counters at each subsequent observation. A perl script | ||
21 | which does this for many of the fields is available at | ||
22 | |||
23 | http://eaglet.rain.com/rick/linux/schedstat/ | ||
24 | |||
25 | Note that any such script will necessarily be version-specific, as the main | ||
26 | reason to change versions is changes in the output format. For those wishing | ||
27 | to write their own scripts, the fields are described here. | ||
28 | |||
29 | CPU statistics | ||
30 | -------------- | ||
31 | cpu<N> 1 2 3 4 5 6 7 8 9 10 11 12 | ||
32 | |||
33 | NOTE: In the sched_yield() statistics, the active queue is considered empty | ||
34 | if it has only one process in it, since obviously the process calling | ||
35 | sched_yield() is that process. | ||
36 | |||
37 | First four fields are sched_yield() statistics: | ||
38 | 1) # of times both the active and the expired queue were empty | ||
39 | 2) # of times just the active queue was empty | ||
40 | 3) # of times just the expired queue was empty | ||
41 | 4) # of times sched_yield() was called | ||
42 | |||
43 | Next three are schedule() statistics: | ||
44 | 5) # of times we switched to the expired queue and reused it | ||
45 | 6) # of times schedule() was called | ||
46 | 7) # of times schedule() left the processor idle | ||
47 | |||
48 | Next two are try_to_wake_up() statistics: | ||
49 | 8) # of times try_to_wake_up() was called | ||
50 | 9) # of times try_to_wake_up() was called to wake up the local cpu | ||
51 | |||
52 | Next three are statistics describing scheduling latency: | ||
53 | 10) sum of all time spent running by tasks on this processor (in jiffies) | ||
54 | 11) sum of all time spent waiting to run by tasks on this processor (in | ||
55 | jiffies) | ||
56 | 12) # of timeslices run on this cpu | ||
57 | |||
58 | |||
59 | Domain statistics | ||
60 | ----------------- | ||
61 | One of these is produced per domain for each cpu described. (Note that if | ||
62 | CONFIG_SMP is not defined, *no* domains are utilized and these lines | ||
63 | will not appear in the output.) | ||
64 | |||
65 | domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 | ||
66 | |||
67 | The first field is a bit mask indicating what cpus this domain operates over. | ||
68 | |||
69 | The next 24 are a variety of load_balance() statistics in grouped into types | ||
70 | of idleness (idle, busy, and newly idle): | ||
71 | |||
72 | 1) # of times in this domain load_balance() was called when the | ||
73 | cpu was idle | ||
74 | 2) # of times in this domain load_balance() checked but found | ||
75 | the load did not require balancing when the cpu was idle | ||
76 | 3) # of times in this domain load_balance() tried to move one or | ||
77 | more tasks and failed, when the cpu was idle | ||
78 | 4) sum of imbalances discovered (if any) with each call to | ||
79 | load_balance() in this domain when the cpu was idle | ||
80 | 5) # of times in this domain pull_task() was called when the cpu | ||
81 | was idle | ||
82 | 6) # of times in this domain pull_task() was called even though | ||
83 | the target task was cache-hot when idle | ||
84 | 7) # of times in this domain load_balance() was called but did | ||
85 | not find a busier queue while the cpu was idle | ||
86 | 8) # of times in this domain a busier queue was found while the | ||
87 | cpu was idle but no busier group was found | ||
88 | |||
89 | 9) # of times in this domain load_balance() was called when the | ||
90 | cpu was busy | ||
91 | 10) # of times in this domain load_balance() checked but found the | ||
92 | load did not require balancing when busy | ||
93 | 11) # of times in this domain load_balance() tried to move one or | ||
94 | more tasks and failed, when the cpu was busy | ||
95 | 12) sum of imbalances discovered (if any) with each call to | ||
96 | load_balance() in this domain when the cpu was busy | ||
97 | 13) # of times in this domain pull_task() was called when busy | ||
98 | 14) # of times in this domain pull_task() was called even though the | ||
99 | target task was cache-hot when busy | ||
100 | 15) # of times in this domain load_balance() was called but did not | ||
101 | find a busier queue while the cpu was busy | ||
102 | 16) # of times in this domain a busier queue was found while the cpu | ||
103 | was busy but no busier group was found | ||
104 | |||
105 | 17) # of times in this domain load_balance() was called when the | ||
106 | cpu was just becoming idle | ||
107 | 18) # of times in this domain load_balance() checked but found the | ||
108 | load did not require balancing when the cpu was just becoming idle | ||
109 | 19) # of times in this domain load_balance() tried to move one or more | ||
110 | tasks and failed, when the cpu was just becoming idle | ||
111 | 20) sum of imbalances discovered (if any) with each call to | ||
112 | load_balance() in this domain when the cpu was just becoming idle | ||
113 | 21) # of times in this domain pull_task() was called when newly idle | ||
114 | 22) # of times in this domain pull_task() was called even though the | ||
115 | target task was cache-hot when just becoming idle | ||
116 | 23) # of times in this domain load_balance() was called but did not | ||
117 | find a busier queue while the cpu was just becoming idle | ||
118 | 24) # of times in this domain a busier queue was found while the cpu | ||
119 | was just becoming idle but no busier group was found | ||
120 | |||
121 | Next three are active_load_balance() statistics: | ||
122 | 25) # of times active_load_balance() was called | ||
123 | 26) # of times active_load_balance() tried to move a task and failed | ||
124 | 27) # of times active_load_balance() successfully moved a task | ||
125 | |||
126 | Next three are sched_balance_exec() statistics: | ||
127 | 28) sbe_cnt is not used | ||
128 | 29) sbe_balanced is not used | ||
129 | 30) sbe_pushed is not used | ||
130 | |||
131 | Next three are sched_balance_fork() statistics: | ||
132 | 31) sbf_cnt is not used | ||
133 | 32) sbf_balanced is not used | ||
134 | 33) sbf_pushed is not used | ||
135 | |||
136 | Next three are try_to_wake_up() statistics: | ||
137 | 34) # of times in this domain try_to_wake_up() awoke a task that | ||
138 | last ran on a different cpu in this domain | ||
139 | 35) # of times in this domain try_to_wake_up() moved a task to the | ||
140 | waking cpu because it was cache-cold on its own cpu anyway | ||
141 | 36) # of times in this domain try_to_wake_up() started passive balancing | ||
142 | |||
143 | /proc/<pid>/schedstat | ||
144 | ---------------- | ||
145 | schedstats also adds a new /proc/<pid/schedstat file to include some of | ||
146 | the same information on a per-process level. There are three fields in | ||
147 | this file correlating for that process to: | ||
148 | 1) time spent on the cpu | ||
149 | 2) time spent waiting on a runqueue | ||
150 | 3) # of timeslices run on this cpu | ||
151 | |||
152 | A program could be easily written to make use of these extra fields to | ||
153 | report on how well a particular process or set of processes is faring | ||
154 | under the scheduler's policies. A simple version of such a program is | ||
155 | available at | ||
156 | http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c | ||