Diffstat (limited to 'Documentation/scheduler')
-rw-r--r--  Documentation/scheduler/00-INDEX                 16
-rw-r--r--  Documentation/scheduler/sched-arch.txt           89
-rw-r--r--  Documentation/scheduler/sched-coding.txt        126
-rw-r--r--  Documentation/scheduler/sched-design-CFS.txt    186
-rw-r--r--  Documentation/scheduler/sched-design.txt        165
-rw-r--r--  Documentation/scheduler/sched-domains.txt        70
-rw-r--r--  Documentation/scheduler/sched-nice-design.txt   108
-rw-r--r--  Documentation/scheduler/sched-stats.txt         156
8 files changed, 916 insertions, 0 deletions
diff --git a/Documentation/scheduler/00-INDEX b/Documentation/scheduler/00-INDEX
new file mode 100644
index 000000000000..b5f5ca069b2d
--- /dev/null
+++ b/Documentation/scheduler/00-INDEX
@@ -0,0 +1,16 @@
00-INDEX
	- this file.
sched-arch.txt
	- CPU Scheduler implementation hints for architecture specific code.
sched-coding.txt
	- reference for various scheduler-related methods in the O(1) scheduler.
sched-design.txt
	- goals, design and implementation of the Linux O(1) scheduler.
sched-design-CFS.txt
	- goals, design and implementation of the Completely Fair Scheduler.
sched-domains.txt
	- information on scheduling domains.
sched-nice-design.txt
	- how and why the scheduler's nice levels are implemented.
sched-stats.txt
	- information on schedstats (Linux Scheduler Statistics).
diff --git a/Documentation/scheduler/sched-arch.txt b/Documentation/scheduler/sched-arch.txt
new file mode 100644
index 000000000000..941615a9769b
--- /dev/null
+++ b/Documentation/scheduler/sched-arch.txt
@@ -0,0 +1,89 @@
	CPU Scheduler implementation hints for architecture specific code

			     Nick Piggin, 2005

Context switch
==============
1. Runqueue locking
By default, the switch_to arch function is called with the runqueue
locked. This is usually not a problem unless switch_to may need to
take the runqueue lock. This is usually due to a wake up operation in
the context switch. See include/asm-ia64/system.h for an example.

To request that the scheduler call switch_to with the runqueue unlocked,
you must `#define __ARCH_WANT_UNLOCKED_CTXSW` in a header file
(typically the one where switch_to is defined).

Unlocked context switches introduce only a very minor performance
penalty to the core scheduler implementation in the CONFIG_SMP case.

2. Interrupt status
By default, the switch_to arch function is called with interrupts
disabled. Interrupts may be enabled over the call if keeping them
disabled is likely to introduce a significant interrupt latency, by
adding the line `#define __ARCH_WANT_INTERRUPTS_ON_CTXSW` in the same
place as for unlocked context switches. This define also implies
`__ARCH_WANT_UNLOCKED_CTXSW`. See include/asm-arm/system.h for an
example.


CPU idle
========
Your cpu_idle routines need to obey the following rules:

1. Preempt should now be disabled over idle routines. It should only
   be enabled to call schedule() and then disabled again.

2. need_resched/TIF_NEED_RESCHED is only ever set, and will never
   be cleared until the running task has called schedule(). Idle
   threads need only ever query need_resched, and may never set or
   clear it.

3. When cpu_idle finds (need_resched() == 'true'), it should call
   schedule(). It should not call schedule() otherwise.

4. The only time interrupts need to be disabled when checking
   need_resched is if we are about to sleep the processor until
   the next interrupt (this doesn't provide any protection of
   need_resched; it prevents losing an interrupt).

	4a. A common problem with this type of sleep appears to be:

	local_irq_disable();
	if (!need_resched()) {
		local_irq_enable();
		*** resched interrupt arrives here ***
		__asm__("sleep until next interrupt");
	}

5. TIF_POLLING_NRFLAG can be set by idle routines that do not
   need an interrupt to wake them up when need_resched goes high.
   In other words, they must be periodically polling need_resched,
   although it may be reasonable to do some background work or enter
   a low CPU priority.

	5a. If TIF_POLLING_NRFLAG is set, and we do decide to enter
	an interrupt sleep, it needs to be cleared and then a memory
	barrier issued (followed by a test of need_resched with
	interrupts disabled, as explained in 3). See the sketch
	after this list.

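As a purely illustrative sketch of the pattern described in 5a (the helper
functions and the sleep primitive below are placeholders for whatever the
architecture provides, not real kernel interfaces):

	while (!need_resched()) {
		do_polling_idle_work();		/* placeholder */

		if (decided_to_sleep()) {	/* placeholder */
			clear_thread_flag(TIF_POLLING_NRFLAG);
			smp_mb();	/* flag clear must be visible
					   before need_resched is re-tested */
			local_irq_disable();
			if (!need_resched())
				/* arch sleep that re-enables interrupts */
				arch_sleep_until_interrupt();
			else
				local_irq_enable();
			set_thread_flag(TIF_POLLING_NRFLAG);
		}
	}
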
arch/i386/kernel/process.c has examples of both polling and
sleeping idle functions.


Possible arch/ problems
=======================

Possible arch problems I found (and either tried to fix or didn't):

h8300 - Is such sleeping racy vs interrupts? (See #4a).
	The H8/300 manual I found indicates yes, however disabling IRQs
	over the sleep means only NMIs can wake it up, so it can't be fixed
	easily without doing spin waiting.

ia64 - is safe_halt call racy vs interrupts? (does it sleep?) (See #4a)

sh64 - Is sleeping racy vs interrupts? (See #4a)

sparc - IRQs on at this point(?), change local_irq_save to _disable.
      - TODO: needs secondary CPUs to disable preempt (See #1)
89
diff --git a/Documentation/scheduler/sched-coding.txt b/Documentation/scheduler/sched-coding.txt
new file mode 100644
index 000000000000..cbd8db752acf
--- /dev/null
+++ b/Documentation/scheduler/sched-coding.txt
@@ -0,0 +1,126 @@
	Reference for various scheduler-related methods in the O(1) scheduler
	    Robert Love <rml@tech9.net>, MontaVista Software


Note that most of these methods are local to kernel/sched.c - this is by design.
The scheduler is meant to be self-contained and abstracted away. This document
is primarily for understanding the scheduler, not for interfacing with it. Some of
the discussed interfaces, however, are general process/scheduling methods.
They are typically defined in include/linux/sched.h.


Main Scheduling Methods
-----------------------

void load_balance(runqueue_t *this_rq, int idle)
	Attempts to pull tasks from one cpu to another to balance cpu usage,
	if needed. This method is called explicitly if the runqueues are
	imbalanced or periodically by the timer tick. Prior to calling,
	the current runqueue must be locked and interrupts disabled.

void schedule()
	The main scheduling function. Upon return, the highest priority
	process will be active.


Locking
-------

Each runqueue has its own lock, rq->lock. When multiple runqueues need
to be locked, lock acquires must be ordered by ascending &runqueue value.

A specific runqueue is locked via

	task_rq_lock(task_t pid, unsigned long *flags)

which disables preemption, disables interrupts, and locks the runqueue pid is
running on. Likewise,

	task_rq_unlock(task_t pid, unsigned long *flags)

unlocks the runqueue pid is running on, restores interrupts to their previous
state, and reenables preemption.

The routines

	double_rq_lock(runqueue_t *rq1, runqueue_t *rq2)

and

	double_rq_unlock(runqueue_t *rq1, runqueue_t *rq2)

safely lock and unlock, respectively, the two specified runqueues. They do
not, however, disable and restore interrupts. Users are required to do so
manually before and after calls.

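A minimal sketch of the two usage patterns, following the prototypes as
described above (the variable names are illustrative and not taken from
kernel/sched.c):

	unsigned long flags;

	/* single runqueue: preemption and interrupts are handled for us */
	task_rq_lock(p, &flags);
	/* ... inspect or modify p's runqueue state ... */
	task_rq_unlock(p, &flags);

	/* two runqueues: the caller handles interrupts explicitly */
	local_irq_save(flags);
	double_rq_lock(rq1, rq2);
	/* ... e.g. move tasks between rq1 and rq2 ... */
	double_rq_unlock(rq1, rq2);
	local_irq_restore(flags);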

Values
------

MAX_PRIO
	The maximum priority of the system, stored in the task as task->prio.
	Lower values mean higher priority. Normal (non-RT) priorities range from
	MAX_RT_PRIO to (MAX_PRIO - 1).
MAX_RT_PRIO
	The maximum real-time priority of the system. Valid RT priorities
	range from 0 to (MAX_RT_PRIO - 1).
MAX_USER_RT_PRIO
	The maximum real-time priority that is exported to user-space. Should
	always be equal to or less than MAX_RT_PRIO. Setting it less allows
	kernel threads to have higher priorities than any user-space task.
MIN_TIMESLICE
MAX_TIMESLICE
	Respectively, the minimum and maximum timeslices (quanta) of a process.

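For concreteness, the stock definitions at the time look roughly like the
following (shown for illustration only; include/linux/sched.h is the
authoritative source):

	#define MAX_USER_RT_PRIO	100
	#define MAX_RT_PRIO		MAX_USER_RT_PRIO
	#define MAX_PRIO		(MAX_RT_PRIO + 40)

	/* RT priorities:      0 .. MAX_RT_PRIO-1          (0 .. 99)
	 * Normal priorities:  MAX_RT_PRIO .. MAX_PRIO-1   (100 .. 139),
	 * mapped from nice -20 .. +19.
	 */
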
Data
----

struct runqueue
	The main per-CPU runqueue data structure.
struct task_struct
	The main per-process data structure.


General Methods
---------------

cpu_rq(cpu)
	Returns the runqueue of the specified cpu.
this_rq()
	Returns the runqueue of the current cpu.
task_rq(pid)
	Returns the runqueue which holds the specified pid.
cpu_curr(cpu)
	Returns the task currently running on the given cpu.
rt_task(pid)
	Returns true if pid is real-time, false if not.


Process Control Methods
-----------------------

void set_user_nice(task_t *p, long nice)
	Sets the "nice" value of task p to the given value.
int setscheduler(pid_t pid, int policy, struct sched_param *param)
	Sets the scheduling policy and parameters for the given pid.
int set_cpus_allowed(task_t *p, unsigned long new_mask)
	Sets a given task's CPU affinity and migrates it to a proper cpu.
	Callers must have a valid reference to the task and ensure the
	task does not exit prematurely. No locks can be held during the call.
set_task_state(tsk, state_value)
	Sets the given task's state to the given value.
set_current_state(state_value)
	Sets the current task's state to the given value.
void set_tsk_need_resched(struct task_struct *tsk)
	Sets need_resched in the given task.
void clear_tsk_need_resched(struct task_struct *tsk)
	Clears need_resched in the given task.
void set_need_resched()
	Sets need_resched in the current task.
void clear_need_resched()
	Clears need_resched in the current task.
int need_resched()
	Returns true if need_resched is set in the current task, false
	otherwise.
yield()
	Places the current process at the end of the runqueue and calls schedule().
diff --git a/Documentation/scheduler/sched-design-CFS.txt b/Documentation/scheduler/sched-design-CFS.txt
new file mode 100644
index 000000000000..88bcb8767335
--- /dev/null
+++ b/Documentation/scheduler/sched-design-CFS.txt
@@ -0,0 +1,186 @@

This is the CFS scheduler.

80% of CFS's design can be summed up in a single sentence: CFS basically
models an "ideal, precise multi-tasking CPU" on real hardware.

"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100%
physical power and which can run each task at precisely equal speed, in
parallel, each at 1/nr_running speed. For example: if there are 2 tasks
running then it runs each at 50% physical power - totally in parallel.

On real hardware, we can run only a single task at once, so while that
one task runs, the other tasks that are waiting for the CPU are at a
disadvantage - the current task gets an unfair amount of CPU time. In
CFS this fairness imbalance is expressed and tracked via the per-task
p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
time the task should now run on the CPU for it to become completely fair
and balanced.

( small detail: on 'ideal' hardware, the p->wait_runtime value would
  always be zero - no task would ever get 'out of balance' from the
  'ideal' share of CPU time. )

CFS's task picking logic is based on this p->wait_runtime value and it
is thus very simple: it always tries to run the task with the largest
p->wait_runtime value. In other words, CFS tries to run the task with
the 'gravest need' for more CPU time. So CFS always tries to split up
CPU time between runnable tasks as close to 'ideal multitasking
hardware' as possible.

Most of the rest of CFS's design just falls out of this really simple
concept, with a few add-on embellishments like nice levels,
multiprocessing and various algorithm variants to recognize sleepers.

In practice it works like this: the system runs a task a bit, and when
the task schedules (or a scheduler tick happens) the task's CPU usage is
'accounted for': the (small) time it just spent using the physical CPU
is deducted from p->wait_runtime [minus the 'fair share' it would have
gotten anyway]. Once p->wait_runtime gets low enough so that another
task becomes the 'leftmost task' of the time-ordered rbtree it maintains
(plus a small amount of 'granularity' distance relative to the leftmost
task so that we do not over-schedule tasks and thrash the cache), the
new leftmost task is picked and the current task is preempted.

The rq->fair_clock value tracks the 'CPU time a runnable task would have
fairly gotten, had it been runnable during that time'. So by using
rq->fair_clock values we can accurately timestamp and measure the
'expected CPU time' a task should have gotten. All runnable tasks are
sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
CFS picks the 'leftmost' task and sticks to it. As the system progresses
forwards, newly woken tasks are put into the tree more and more to the
right - slowly but surely giving a chance for every task to become the
'leftmost task' and thus get on the CPU within a deterministic amount of
time.

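As a purely illustrative sketch (not the kernel's actual code), the ordering
key described above can be written as:

	/*
	 * The rbtree is ordered by this key; a smaller key means the task
	 * sits further to the left and gets picked sooner.
	 */
	static inline long long cfs_sort_key(long long rq_fair_clock,
					     long long p_wait_runtime)
	{
		return rq_fair_clock - p_wait_runtime;
	}

A task that is owed a lot of CPU time (a large p->wait_runtime) therefore
gets a small key and moves towards the left of the tree.
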
Some implementation details:

 - the introduction of Scheduling Classes: an extensible hierarchy of
   scheduler modules. These modules encapsulate scheduling policy
   details and are handled by the scheduler core without the core
   code assuming too much about them.

 - sched_fair.c implements the 'CFS desktop scheduler': it is a
   replacement for the vanilla scheduler's SCHED_OTHER interactivity
   code.

   I'd like to give credit to Con Kolivas for the general approach here:
   he has proven via RSDL/SD that 'fair scheduling' is possible and that
   it results in better desktop scheduling. Kudos Con!

   The CFS patch uses a completely different approach and implementation
   from RSDL/SD. My goal was to make CFS's interactivity quality exceed
   that of RSDL/SD, which is a high standard to meet :-) Testing
   feedback is welcome to decide this one way or another. [ and, in any
   case, all of SD's logic could be added via a kernel/sched_sd.c module
   as well, if Con is interested in such an approach. ]

   CFS's design is quite radical: it does not use runqueues, it uses a
   time-ordered rbtree to build a 'timeline' of future task execution,
   and thus has no 'array switch' artifacts (by which both the vanilla
   scheduler and RSDL/SD are affected).

   CFS uses nanosecond granularity accounting and does not rely on any
   jiffies or other HZ detail. Thus the CFS scheduler has no notion of
   'timeslices' and has no heuristics whatsoever. There is only one
   central tunable (you have to switch on CONFIG_SCHED_DEBUG):

	/proc/sys/kernel/sched_granularity_ns

   which can be used to tune the scheduler from 'desktop' (low
   latencies) to 'server' (good batching) workloads. It defaults to a
   setting suitable for desktop workloads. SCHED_BATCH is handled by the
   CFS scheduler module too.

   Due to its design, the CFS scheduler is not prone to any of the
   'attacks' that exist today against the heuristics of the stock
   scheduler: fiftyp.c, thud.c, chew.c, ring-test.c and massive_intr.c all
   work fine, do not impact interactivity, and produce the expected
   behavior.

   The CFS scheduler has a much stronger handling of nice levels and
   SCHED_BATCH: both types of workloads should be isolated much more
   aggressively than under the vanilla scheduler.

   ( another detail: due to nanosec accounting and timeline sorting,
     sched_yield() support is very simple under CFS, and in fact under
     CFS sched_yield() behaves much better than under any other
     scheduler i have tested so far. )

 - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
   way than the vanilla scheduler does. It uses 100 runqueues (for all
   100 RT priority levels, instead of 140 in the vanilla scheduler)
   and it needs no expired array.

 - reworked/sanitized SMP load-balancing: the runqueue-walking
   assumptions are gone from the load-balancing code now, and
   iterators of the scheduling modules are used. The balancing code got
   quite a bit simpler as a result.


Group scheduler extension to CFS
================================

Normally the scheduler operates on individual tasks and strives to provide
fair CPU time to each task. Sometimes, it may be desirable to group tasks
and provide fair CPU time to each such task group. For example, it may
be desirable to first provide fair CPU time to each user on the system
and then to each task belonging to a user.

CONFIG_FAIR_GROUP_SCHED strives to achieve exactly that. It lets
SCHED_NORMAL/BATCH tasks be grouped and divides CPU time fairly among such
groups. At present, there are two (mutually exclusive) mechanisms to group
tasks for CPU bandwidth control purposes:

	- Based on user id (CONFIG_FAIR_USER_SCHED)
	  In this option, tasks are grouped according to their user id.
	- Based on "cgroup" pseudo filesystem (CONFIG_FAIR_CGROUP_SCHED)
	  This option lets the administrator create arbitrary groups
	  of tasks, using the "cgroup" pseudo filesystem. See
	  Documentation/cgroups.txt for more information about this
	  filesystem.

Only one of these options to group tasks can be chosen, not both.

Group scheduler tunables:

When CONFIG_FAIR_USER_SCHED is defined, a directory is created in sysfs for
each new user and a "cpu_share" file is added in that directory.

	# cd /sys/kernel/uids
	# cat 512/cpu_share		# Display user 512's CPU share
	1024
	# echo 2048 > 512/cpu_share	# Modify user 512's CPU share
	# cat 512/cpu_share		# Display user 512's CPU share
	2048
	#

CPU bandwidth between two users is divided in the ratio of their CPU shares.
For example: if you would like user "root" to get twice the bandwidth of user
"guest", then set the cpu_share for both users such that "root"'s
cpu_share is twice "guest"'s cpu_share.


When CONFIG_FAIR_CGROUP_SCHED is defined, a "cpu.shares" file is created
for each group created using the pseudo filesystem. See the example steps
below to create task groups and modify their CPU share using the "cgroups"
pseudo filesystem.

	# mkdir /dev/cpuctl
	# mount -t cgroup -ocpu none /dev/cpuctl
	# cd /dev/cpuctl

	# mkdir multimedia	# create "multimedia" group of tasks
	# mkdir browser		# create "browser" group of tasks

	# # Configure the multimedia group to receive twice the CPU bandwidth
	# # of the browser group

	# echo 2048 > multimedia/cpu.shares
	# echo 1024 > browser/cpu.shares

	# firefox &		# Launch firefox and move it to "browser" group
	# echo <firefox_pid> > browser/tasks

	# # Launch gmplayer (or your favourite movie player)
	# echo <movie_player_pid> > multimedia/tasks
diff --git a/Documentation/scheduler/sched-design.txt b/Documentation/scheduler/sched-design.txt
new file mode 100644
index 000000000000..1605bf0cba8b
--- /dev/null
+++ b/Documentation/scheduler/sched-design.txt
@@ -0,0 +1,165 @@
	     Goals, Design and Implementation of the
	        new ultra-scalable O(1) scheduler


  This is an edited version of an email Ingo Molnar sent to
  lkml on 4 Jan 2002.  It describes the goals, design, and
  implementation of Ingo's new ultra-scalable O(1) scheduler.
  Last Updated: 18 April 2002.


Goal
====

The main goal of the new scheduler is to keep all the good things we know
and love about the current Linux scheduler:

 - good interactive performance even during high load: if the user
   types or clicks then the system must react instantly and must execute
   the user tasks smoothly, even during considerable background load.

 - good scheduling/wakeup performance with 1-2 runnable processes.

 - fairness: no process should stay without any timeslice for any
   unreasonable amount of time. No process should get an unjustly high
   amount of CPU time.

 - priorities: less important tasks can be started with lower priority,
   more important tasks with higher priority.

 - SMP efficiency: no CPU should stay idle if there is work to do.

 - SMP affinity: processes which run on one CPU should stay affine to
   that CPU. Processes should not bounce between CPUs too frequently.

 - plus additional scheduler features: RT scheduling, CPU binding.

and the goal is also to add a few new things:

 - fully O(1) scheduling. Are you tired of the recalculation loop
   blowing the L1 cache away every now and then? Do you think the goodness
   loop is taking a bit too long to finish if there are lots of runnable
   processes? This new scheduler takes no prisoners: wakeup(), schedule(),
   the timer interrupt are all O(1) algorithms. There is no recalculation
   loop. There is no goodness loop either.

 - 'perfect' SMP scalability. With the new scheduler there is no 'big'
   runqueue_lock anymore - it's all per-CPU runqueues and locks - two
   tasks on two separate CPUs can wake up, schedule and context-switch
   completely in parallel, without any interlocking. All
   scheduling-relevant data is structured for maximum scalability.

 - better SMP affinity. The old scheduler has a particular weakness that
   causes random bouncing of tasks between CPUs when higher
   priority/interactive tasks are present; this was observed and reported
   by many people. The reason is that the timeslice recalculation loop
   first needs every currently running task to consume its timeslice. But
   when this happens on eg. an 8-way system, then this property starves an
   increasing number of CPUs from executing any process. Once the last
   task that has a timeslice left has finished using up that timeslice,
   the recalculation loop is triggered and other CPUs can start executing
   tasks again - after having idled around for a number of timer ticks.
   The more CPUs, the worse this effect.

   Furthermore, this same effect causes the bouncing effect as well:
   whenever there is such a 'timeslice squeeze' of the global runqueue,
   idle processors start executing tasks which are not affine to that CPU.
   (because the affine tasks have finished off their timeslices already.)

   The new scheduler solves this problem by distributing timeslices on a
   per-CPU basis, without having any global synchronization or
   recalculation.

 - batch scheduling. A significant proportion of computing-intensive tasks
   benefit from batch-scheduling, where timeslices are long and processes
   are roundrobin scheduled. The new scheduler does such batch-scheduling
   of the lowest priority tasks - so nice +19 jobs will get
   'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
   in essence SCHED_IDLE, from an interactiveness point of view.

 - handle extreme loads more smoothly, without breakdown and scheduling
   storms.

 - O(1) RT scheduling. For those RT folks who are paranoid about the
   O(nr_running) property of the goodness loop and the recalculation loop.

 - run fork()ed children before the parent. Andrea has pointed out the
   advantages of this a few months ago, but patches for this feature
   do not work with the old scheduler as well as they should,
   because idle processes often steal the new child before the fork()ing
   CPU gets to execute it.


Design
======

The core of the new scheduler contains the following mechanisms:

 - *two* priority-ordered 'priority arrays' per CPU. There is an 'active'
   array and an 'expired' array. The active array contains all tasks that
   are affine to this CPU and have timeslices left. The expired array
   contains all tasks which have used up their timeslices - but this array
   is kept sorted as well. The active and expired arrays are not accessed
   directly, they are accessed through two pointers in the per-CPU runqueue
   structure. If all active tasks are used up then we 'switch' the two
   pointers and from now on the ready-to-go (former-) expired array is the
   active array - and the empty active array serves as the new collector
   for expired tasks.

 - there is a 64-bit bitmap cache for array indices. Finding the highest
   priority task is thus a matter of two x86 BSFL bit-search instructions.

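The following is a small, self-contained illustration of that lookup (plain
userspace C with a made-up priority count; not the kernel's actual code):
one bit per priority level, and picking the highest-priority non-empty queue
reduces to a find-first-set.

	#include <stdio.h>

	#define NPRIO		140
	#define BITS_PER_WORD	(8 * (int)sizeof(unsigned long))
	#define NWORDS		((NPRIO + BITS_PER_WORD - 1) / BITS_PER_WORD)

	static unsigned long bitmap[NWORDS];	/* bit N set => queue N non-empty */

	static int highest_prio(void)
	{
		int w;

		for (w = 0; w < NWORDS; w++)
			if (bitmap[w])
				return w * BITS_PER_WORD +
				       __builtin_ffsl(bitmap[w]) - 1;
		return -1;			/* nothing runnable */
	}

	int main(void)
	{
		bitmap[120 / BITS_PER_WORD] |= 1UL << (120 % BITS_PER_WORD);
		bitmap[101 / BITS_PER_WORD] |= 1UL << (101 % BITS_PER_WORD);
		printf("next priority to run: %d\n", highest_prio());	/* 101 */
		return 0;
	}
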
the split-array solution enables us to have an arbitrary number of active
and expired tasks, and the recalculation of timeslices can be done
immediately when the timeslice expires. Because the arrays are always
accessed through the pointers in the runqueue, switching the two arrays can
be done very quickly.

this is a hybrid priority-list approach coupled with roundrobin
scheduling and the array-switch method of distributing timeslices.

 - there is a per-task 'load estimator'.

one of the toughest things to get right is good interactive feel during
heavy system load. While playing with various scheduler variants i found
that the best interactive feel is achieved not by 'boosting' interactive
tasks, but by 'punishing' tasks that want to use more CPU time than there
is available. This method is also much easier to do in an O(1) fashion.

to establish the actual 'load' the task contributes to the system, a
complex-looking but pretty accurate method is used: there is a 4-entry
'history' ringbuffer of the task's activities during the last 4 seconds.
This ringbuffer is operated without much overhead. The entries tell the
scheduler a pretty accurate load-history of the task: has it used up more
CPU time or less during the past N seconds. [the size '4' and the interval
of 4x 1 seconds was found by lots of experimentation - this part is
flexible and can be changed in both directions.]

the penalty a task gets for generating more load than the CPU can handle
is a priority decrease - there is a maximum amount to this penalty
relative to their static priority, so even fully CPU-bound tasks will
observe each other's priorities, and will share the CPU accordingly.

the SMP load-balancer can be extended/switched with additional parallel
computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
can be supported easily by changing the load-balancer. Right now it's
tuned for my SMP systems.

i skipped the prev->mm == next->mm advantage - no workload i know of shows
any sensitivity to this. It can be added back by sacrificing O(1)
schedule() [the current and one-lower priority list can be searched for a
that->mm == current->mm condition], but costs a fair number of cycles
during a number of important workloads, so i wanted to avoid this as much
as possible.

- the SMP idle-task startup code was still racy and the new scheduler
triggered this. So i streamlined the idle-setup code a bit. We do not call
into schedule() before all processors have started up fully and all idle
threads are in place.

- the patch also cleans up a number of aspects of sched.c - moves code
into other areas of the kernel where it's appropriate, and simplifies
certain code paths and data constructs. As a result, the new scheduler's
code is smaller than the old one.

	Ingo
diff --git a/Documentation/scheduler/sched-domains.txt b/Documentation/scheduler/sched-domains.txt
new file mode 100644
index 000000000000..a9e990ab980f
--- /dev/null
+++ b/Documentation/scheduler/sched-domains.txt
@@ -0,0 +1,70 @@
Each CPU has a "base" scheduling domain (struct sched_domain). These are
accessed via the cpu_sched_domain(i) and this_sched_domain() macros. The domain
hierarchy is built from these base domains via the ->parent pointer. ->parent
MUST be NULL terminated, and domain structures should be per-CPU as they
are locklessly updated.

Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and a base domain for CPU i MUST span at least
i. The top domain for each CPU will generally span all CPUs in the system,
although strictly it doesn't have to, but this could lead to a case where some
CPUs will never be given tasks to run unless the CPUs allowed mask is
explicitly set. A sched domain's span means "balance process load among these
CPUs".

Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular one way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read only data after they have been set up.

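The invariants above can be illustrated with a small, self-contained check
(plain bitmasks standing in for cpumask_t; this is not kernel code):

	#include <assert.h>

	/* groups must be pairwise disjoint and together cover the span */
	static void check_domain(unsigned long span,
				 const unsigned long *group, int ngroups)
	{
		unsigned long seen = 0;
		int i;

		for (i = 0; i < ngroups; i++) {
			assert((group[i] & seen) == 0);	/* no overlap  */
			seen |= group[i];
		}
		assert(seen == span);			/* exact cover */
	}

	int main(void)
	{
		/* e.g. a 4-cpu SMP domain whose groups are single cpus */
		unsigned long group[] = { 0x1, 0x2, 0x4, 0x8 };

		check_domain(0xf, group, 4);
		return 0;
	}
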
Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

In kernel/sched.c, rebalance_tick is run periodically on each CPU. This
function takes its CPU's base sched domain and checks to see if it has reached
its rebalance interval. If so, then it will run load_balance on that domain.
rebalance_tick then checks the parent sched_domain (if it exists), and the
parent of the parent, and so forth.

*** Implementing sched domains ***
The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Or, you could do multi-level NUMA; Opteron, for example,
might have just one domain covering its one NUMA level.

The implementor should read comments in include/linux/sched.h:
struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
the specifics and what to tune.

For SMT, the architecture must define CONFIG_SCHED_SMT and provide a
cpumask_t cpu_sibling_map[NR_CPUS], where cpu_sibling_map[i] is the mask of
all "i"'s siblings as well as "i" itself.

Architectures may override the default SD_*_INIT flags while using the
generic domain builder in kernel/sched.c if they wish to retain the
traditional SMT->SMP->NUMA topology (or some subset of that). This
can be done by #define'ing ARCH_HASH_SCHED_TUNE.

Alternatively, the architecture may completely override the generic domain
builder by #define'ing ARCH_HASH_SCHED_DOMAIN, and exporting its
arch_init_sched_domains function. This function will attach domains to all
CPUs using cpu_attach_domain.

Implementors should change the line
#undef SCHED_DOMAIN_DEBUG
to
#define SCHED_DOMAIN_DEBUG
in kernel/sched.c, as this enables an error checking parse of the sched domains
which should catch most possible errors (described above). It also prints out
the domain structure in a visual format.
diff --git a/Documentation/scheduler/sched-nice-design.txt b/Documentation/scheduler/sched-nice-design.txt
new file mode 100644
index 000000000000..e2bae5a577e3
--- /dev/null
+++ b/Documentation/scheduler/sched-nice-design.txt
@@ -0,0 +1,108 @@
This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that the nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!):


                     A
                \    |  [timeslice length]
                 \   |
                  \  |
                   \ |
                    \|___100msecs
                     |^ . _
                     |      ^ . _
                     |            ^ . _
     -*----------------------------------*-----> [nice level]
    -20              |                  +19
                     |
                     |

So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling (and would thus thrash the cache, etc. Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number crunching apps at nice +19).

So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
right minimal granularity - and this translates to 5% CPU utilization.
But the fundamental HZ-sensitive property for nice +19 still remained,
and we never got a single complaint about nice +19 being too _weak_ in
terms of CPU utilization; we only got complaints about it (still) being
too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies and their nasty design-level
coupling to timeslices and granularity it was not really viable.

The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative":

   int nice(int inc);

   asmlinkage long sys_nice(int increment)

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.

The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of nice levels being not "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels) and thus
it was possible to implement better and more consistent nice +19
support: with the new scheduler nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (one will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.

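As a worked illustration of that multiplicative rule (the factor below is
an assumption for illustration; the scheduler actually uses a precomputed
weight table, so the exact numbers differ slightly): if every nice step
scales a task's weight by roughly 1.25, then any two tasks one nice level
apart split the CPU the same way, no matter where on the nice scale they sit.

	#include <stdio.h>

	int main(void)
	{
		double step = 1.25;	/* assumed weight ratio per nice level */
		double w_lo = step;	/* e.g. the nice +10 (or nice -5) task */
		double w_hi = 1.0;	/* e.g. the nice +11 (or nice -4) task */

		printf("lower-nice task:  %.1f%%\n", 100.0 * w_lo / (w_lo + w_hi));
		printf("higher-nice task: %.1f%%\n", 100.0 * w_hi / (w_lo + w_hi));
		return 0;	/* roughly the 55%/45% split described above */
	}
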
The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.
diff --git a/Documentation/scheduler/sched-stats.txt b/Documentation/scheduler/sched-stats.txt
new file mode 100644
index 000000000000..442e14d35dea
--- /dev/null
+++ b/Documentation/scheduler/sched-stats.txt
@@ -0,0 +1,156 @@
Version 14 of schedstats includes support for sched_domains, which hit the
mainline kernel in 2.6.20, although it is identical to the stats from version
12 which was in the kernel from 2.6.13-2.6.19 (version 13 never saw a kernel
release). Some counters make more sense to be per-runqueue; others to be
per-domain. Note that domains (and their associated information) will only
be pertinent and available on machines utilizing CONFIG_SMP.

In version 14 of schedstat, there is at least one level of domain
statistics for each cpu listed, and there may well be more than one
domain. Domains have no particular names in this implementation, but
the highest numbered one typically arbitrates balancing across all the
cpus on the machine, while domain0 is the most tightly focused domain,
sometimes balancing only between pairs of cpus. At this time, there
are no architectures which need more than three domain levels. The first
field in the domain stats is a bit map indicating which cpus are affected
by that domain.

These fields are counters, and only increment. Programs which make use
of these will need to start with a baseline observation and then calculate
the change in the counters at each subsequent observation. A perl script
which does this for many of the fields is available at

    http://eaglet.rain.com/rick/linux/schedstat/

Note that any such script will necessarily be version-specific, as the main
reason to change versions is changes in the output format. For those wishing
to write their own scripts, the fields are described here.

CPU statistics
--------------
cpu<N> 1 2 3 4 5 6 7 8 9 10 11 12

NOTE: In the sched_yield() statistics, the active queue is considered empty
    if it has only one process in it, since obviously the process calling
    sched_yield() is that process.

First four fields are sched_yield() statistics:
     1) # of times both the active and the expired queue were empty
     2) # of times just the active queue was empty
     3) # of times just the expired queue was empty
     4) # of times sched_yield() was called

Next three are schedule() statistics:
     5) # of times we switched to the expired queue and reused it
     6) # of times schedule() was called
     7) # of times schedule() left the processor idle

Next two are try_to_wake_up() statistics:
     8) # of times try_to_wake_up() was called
     9) # of times try_to_wake_up() was called to wake up the local cpu

Next three are statistics describing scheduling latency:
    10) sum of all time spent running by tasks on this processor (in jiffies)
    11) sum of all time spent waiting to run by tasks on this processor (in
        jiffies)
    12) # of timeslices run on this cpu


Domain statistics
-----------------
One of these is produced per domain for each cpu described. (Note that if
CONFIG_SMP is not defined, *no* domains are utilized and these lines
will not appear in the output.)

domain<N> <cpumask> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

The first field is a bit mask indicating what cpus this domain operates over.

The next 24 are a variety of load_balance() statistics grouped into types
of idleness (idle, busy, and newly idle):

     1) # of times in this domain load_balance() was called when the
        cpu was idle
     2) # of times in this domain load_balance() checked but found
        the load did not require balancing when the cpu was idle
     3) # of times in this domain load_balance() tried to move one or
        more tasks and failed, when the cpu was idle
     4) sum of imbalances discovered (if any) with each call to
        load_balance() in this domain when the cpu was idle
     5) # of times in this domain pull_task() was called when the cpu
        was idle
     6) # of times in this domain pull_task() was called even though
        the target task was cache-hot when idle
     7) # of times in this domain load_balance() was called but did
        not find a busier queue while the cpu was idle
     8) # of times in this domain a busier queue was found while the
        cpu was idle but no busier group was found

     9) # of times in this domain load_balance() was called when the
        cpu was busy
    10) # of times in this domain load_balance() checked but found the
        load did not require balancing when busy
    11) # of times in this domain load_balance() tried to move one or
        more tasks and failed, when the cpu was busy
    12) sum of imbalances discovered (if any) with each call to
        load_balance() in this domain when the cpu was busy
    13) # of times in this domain pull_task() was called when busy
    14) # of times in this domain pull_task() was called even though the
        target task was cache-hot when busy
    15) # of times in this domain load_balance() was called but did not
        find a busier queue while the cpu was busy
    16) # of times in this domain a busier queue was found while the cpu
        was busy but no busier group was found

    17) # of times in this domain load_balance() was called when the
        cpu was just becoming idle
    18) # of times in this domain load_balance() checked but found the
        load did not require balancing when the cpu was just becoming idle
    19) # of times in this domain load_balance() tried to move one or more
        tasks and failed, when the cpu was just becoming idle
    20) sum of imbalances discovered (if any) with each call to
        load_balance() in this domain when the cpu was just becoming idle
    21) # of times in this domain pull_task() was called when newly idle
    22) # of times in this domain pull_task() was called even though the
        target task was cache-hot when just becoming idle
    23) # of times in this domain load_balance() was called but did not
        find a busier queue while the cpu was just becoming idle
    24) # of times in this domain a busier queue was found while the cpu
        was just becoming idle but no busier group was found

   Next three are active_load_balance() statistics:
    25) # of times active_load_balance() was called
    26) # of times active_load_balance() tried to move a task and failed
    27) # of times active_load_balance() successfully moved a task

   Next three are sched_balance_exec() statistics:
    28) sbe_cnt is not used
    29) sbe_balanced is not used
    30) sbe_pushed is not used

   Next three are sched_balance_fork() statistics:
    31) sbf_cnt is not used
    32) sbf_balanced is not used
    33) sbf_pushed is not used

   Next three are try_to_wake_up() statistics:
    34) # of times in this domain try_to_wake_up() awoke a task that
        last ran on a different cpu in this domain
    35) # of times in this domain try_to_wake_up() moved a task to the
        waking cpu because it was cache-cold on its own cpu anyway
    36) # of times in this domain try_to_wake_up() started passive balancing

/proc/<pid>/schedstat
---------------------
schedstats also adds a new /proc/<pid>/schedstat file to include some of
the same information on a per-process level. There are three fields in
this file, correlating for that process to:
     1) time spent on the cpu
     2) time spent waiting on a runqueue
     3) # of timeslices run on this cpu

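A minimal sketch of reading those three fields for one process (illustrative
only; the units of the first two fields depend on the kernel's clock):

	#include <stdio.h>

	int main(int argc, char **argv)
	{
		char path[64];
		unsigned long long run, wait, slices;
		FILE *f;

		snprintf(path, sizeof(path), "/proc/%s/schedstat",
			 argc > 1 ? argv[1] : "self");
		f = fopen(path, "r");
		if (!f || fscanf(f, "%llu %llu %llu", &run, &wait, &slices) != 3) {
			perror(path);
			return 1;
		}
		fclose(f);
		printf("run %llu  wait %llu  timeslices %llu\n", run, wait, slices);
		return 0;
	}
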
A program could be easily written to make use of these extra fields to
report on how well a particular process or set of processes is faring
under the scheduler's policies. A simple version of such a program is
available at

    http://eaglet.rain.com/rick/linux/schedstat/v12/latency.c