Documentation: create new scheduler/ subdirectory

The top-level Documentation/ directory is unmanageably large, so we should take any obvious opportunities to move stuff into subdirectories. These sched-*.txt files seem an obvious easy case. Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu> Cc: Ingo Molnar <mingo@elte.hu> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: J. Bruce Fields <bfields@citi.umich.edu> 2008-02-07 03:13:37 -0500
committer: Linus Torvalds <torvalds@woody.linux-foundation.org> 2008-02-07 11:42:17 -0500
commit: 9b8eae7248dad42091204f83ed3448e661456af1 (patch)
tree: 1e300d41f8aaa9c258c179024ba63799a79f5a6f /Documentation/scheduler/sched-design.txt
parent: d3cf91d0e201962a6367191e5926f5b0920b0339 (diff)
1 files changed, 165 insertions, 0 deletions
diff --git a/Documentation/scheduler/sched-design.txt b/Documentation/scheduler/sched-design.txt
new file mode 100644
index 000000000000..1605bf0cba8b
--- /dev/null
+++ b/Documentation/scheduler/sched-design.txt
@@ -0,0 +1,165 @@
+                   Goals, Design and Implementation of the
+                      new ultra-scalable O(1) scheduler
+  This is an edited version of an email Ingo Molnar sent to
+  lkml on 4 Jan 2002.  It describes the goals, design, and
+  implementation of Ingo's new ultra-scalable O(1) scheduler.
+  Last Updated: 18 April 2002.
+Goal
+====
+The main goal of the new scheduler is to keep all the good things we know
+and love about the current Linux scheduler:
+ - good interactive performance even during high load: if the user
+   types or clicks then the system must react instantly and must execute
+   the user tasks smoothly, even during considerable background load.
+ - good scheduling/wakeup performance with 1-2 runnable processes.
+ - fairness: no process should stay without any timeslice for any
+   unreasonable amount of time. No process should get an unjustly high
+   amount of CPU time.
+ - priorities: less important tasks can be started with lower priority,
+   more important tasks with higher priority.
+ - SMP efficiency: no CPU should stay idle if there is work to do.
+ - SMP affinity: processes which run on one CPU should stay affine to
+   that CPU. Processes should not bounce between CPUs too frequently.
+ - plus additional scheduler features: RT scheduling, CPU binding.
+and the goal is also to add a few new things:
+ - fully O(1) scheduling. Are you tired of the recalculation loop
+   blowing the L1 cache away every now and then? Do you think the goodness
+   loop is taking a bit too long to finish if there are lots of runnable
+   processes? This new scheduler takes no prisoners: wakeup(), schedule(),
+   the timer interrupt are all O(1) algorithms. There is no recalculation
+   loop. There is no goodness loop either.
+ - 'perfect' SMP scalability. With the new scheduler there is no 'big'
+   runqueue_lock anymore - it's all per-CPU runqueues and locks - two
+   tasks on two separate CPUs can wake up, schedule and context-switch
+   completely in parallel, without any interlocking. All
+   scheduling-relevant data is structured for maximum scalability.
+ - better SMP affinity. The old scheduler has a particular weakness that
+   causes the random bouncing of tasks between CPUs if/when higher
+   priority/interactive tasks, this was observed and reported by many
+   people. The reason is that the timeslice recalculation loop first needs
+   every currently running task to consume its timeslice. But when this
+   happens on eg. an 8-way system, then this property starves an
+   increasing number of CPUs from executing any process. Once the last
+   task that has a timeslice left has finished using up that timeslice,
+   the recalculation loop is triggered and other CPUs can start executing
+   tasks again - after having idled around for a number of timer ticks.
+   The more CPUs, the worse this effect.
+   Furthermore, this same effect causes the bouncing effect as well:
+   whenever there is such a 'timeslice squeeze' of the global runqueue,
+   idle processors start executing tasks which are not affine to that CPU.
+   (because the affine tasks have finished off their timeslices already.)
+   The new scheduler solves this problem by distributing timeslices on a
+   per-CPU basis, without having any global synchronization or
+   recalculation.
+ - batch scheduling. A significant proportion of computing-intensive tasks
+   benefit from batch-scheduling, where timeslices are long and processes
+   are roundrobin scheduled. The new scheduler does such batch-scheduling
+   of the lowest priority tasks - so nice +19 jobs will get
+   'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
+   in essence SCHED_IDLE, from an interactiveness point of view.
+ - handle extreme loads more smoothly, without breakdown and scheduling
+   storms.
+ - O(1) RT scheduling. For those RT folks who are paranoid about the
+   O(nr_running) property of the goodness loop and the recalculation loop.
+ - run fork()ed children before the parent. Andrea has pointed out the
+   advantages of this a few months ago, but patches for this feature
+   do not work with the old scheduler as well as they should,
+   because idle processes often steal the new child before the fork()ing
+   CPU gets to execute it.
+Design
+======
+The core of the new scheduler contains the following mechanisms:
+ - *two* priority-ordered 'priority arrays' per CPU. There is an 'active'
+   array and an 'expired' array. The active array contains all tasks that
+   are affine to this CPU and have timeslices left. The expired array
+   contains all tasks which have used up their timeslices - but this array
+   is kept sorted as well. The active and expired array is not accessed
+   directly, it's accessed through two pointers in the per-CPU runqueue
+   structure. If all active tasks are used up then we 'switch' the two
+   pointers and from now on the ready-to-go (former-) expired array is the
+   active array - and the empty active array serves as the new collector
+   for expired tasks.
+ - there is a 64-bit bitmap cache for array indices. Finding the highest
+   priority task is thus a matter of two x86 BSFL bit-search instructions.
+the split-array solution enables us to have an arbitrary number of active
+and expired tasks, and the recalculation of timeslices can be done
+immediately when the timeslice expires. Because the arrays are always
+access through the pointers in the runqueue, switching the two arrays can
+be done very quickly.
+this is a hybride priority-list approach coupled with roundrobin
+scheduling and the array-switch method of distributing timeslices.
+ - there is a per-task 'load estimator'.
+one of the toughest things to get right is good interactive feel during
+heavy system load. While playing with various scheduler variants i found
+that the best interactive feel is achieved not by 'boosting' interactive
+tasks, but by 'punishing' tasks that want to use more CPU time than there
+is available. This method is also much easier to do in an O(1) fashion.
+to establish the actual 'load' the task contributes to the system, a
+complex-looking but pretty accurate method is used: there is a 4-entry
+'history' ringbuffer of the task's activities during the last 4 seconds.
+This ringbuffer is operated without much overhead. The entries tell the
+scheduler a pretty accurate load-history of the task: has it used up more
+CPU time or less during the past N seconds. [the size '4' and the interval
+of 4x 1 seconds was found by lots of experimentation - this part is
+flexible and can be changed in both directions.]
+the penalty a task gets for generating more load than the CPU can handle
+is a priority decrease - there is a maximum amount to this penalty
+relative to their static priority, so even fully CPU-bound tasks will
+observe each other's priorities, and will share the CPU accordingly.
+the SMP load-balancer can be extended/switched with additional parallel
+computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
+can be supported easily by changing the load-balancer. Right now it's
+tuned for my SMP systems.
+i skipped the prev->mm == next->mm advantage - no workload i know of shows
+any sensitivity to this. It can be added back by sacrificing O(1)
+schedule() [the current and one-lower priority list can be searched for a
+that->mm == current->mm condition], but costs a fair number of cycles
+during a number of important workloads, so i wanted to avoid this as much
+as possible.
+- the SMP idle-task startup code was still racy and the new scheduler
+triggered this. So i streamlined the idle-setup code a bit. We do not call
+into schedule() before all processors have started up fully and all idle
+threads are in place.
+- the patch also cleans up a number of aspects of sched.c - moves code
+into other areas of the kernel where it's appropriate, and simplifies
+certain code paths and data constructs. As a result, the new scheduler's
+code is smaller than the old one.
+        Ingo
author	J. Bruce Fields <bfields@citi.umich.edu>	2008-02-07 03:13:37 -0500
committer	Linus Torvalds <torvalds@woody.linux-foundation.org>	2008-02-07 11:42:17 -0500
commit	9b8eae7248dad42091204f83ed3448e661456af1 (patch)
tree	1e300d41f8aaa9c258c179024ba63799a79f5a6f /Documentation/scheduler/sched-design.txt
parent	d3cf91d0e201962a6367191e5926f5b0920b0339 (diff)

diff --git a/Documentation/scheduler/sched-design.txt b/Documentation/scheduler/sched-design.txt new file mode 100644 index 000000000000..1605bf0cba8b --- /dev/null +++ b/Documentation/scheduler/sched-design.txt
@@ -0,0 +1,165 @@
	1	Goals, Design and Implementation of the
	2	new ultra-scalable O(1) scheduler
	3
	4
	5	This is an edited version of an email Ingo Molnar sent to
	6	lkml on 4 Jan 2002. It describes the goals, design, and
	7	implementation of Ingo's new ultra-scalable O(1) scheduler.
	8	Last Updated: 18 April 2002.
	9
	10
	11	Goal
	12	====
	13
	14	The main goal of the new scheduler is to keep all the good things we know
	15	and love about the current Linux scheduler:
	16
	17	- good interactive performance even during high load: if the user
	18	types or clicks then the system must react instantly and must execute
	19	the user tasks smoothly, even during considerable background load.
	20
	21	- good scheduling/wakeup performance with 1-2 runnable processes.
	22
	23	- fairness: no process should stay without any timeslice for any
	24	unreasonable amount of time. No process should get an unjustly high
	25	amount of CPU time.
	26
	27	- priorities: less important tasks can be started with lower priority,
	28	more important tasks with higher priority.
	29
	30	- SMP efficiency: no CPU should stay idle if there is work to do.
	31
	32	- SMP affinity: processes which run on one CPU should stay affine to
	33	that CPU. Processes should not bounce between CPUs too frequently.
	34
	35	- plus additional scheduler features: RT scheduling, CPU binding.
	36
	37	and the goal is also to add a few new things:
	38
	39	- fully O(1) scheduling. Are you tired of the recalculation loop
	40	blowing the L1 cache away every now and then? Do you think the goodness
	41	loop is taking a bit too long to finish if there are lots of runnable
	42	processes? This new scheduler takes no prisoners: wakeup(), schedule(),
	43	the timer interrupt are all O(1) algorithms. There is no recalculation
	44	loop. There is no goodness loop either.
	45
	46	- 'perfect' SMP scalability. With the new scheduler there is no 'big'
	47	runqueue_lock anymore - it's all per-CPU runqueues and locks - two
	48	tasks on two separate CPUs can wake up, schedule and context-switch
	49	completely in parallel, without any interlocking. All
	50	scheduling-relevant data is structured for maximum scalability.
	51
	52	- better SMP affinity. The old scheduler has a particular weakness that
	53	causes the random bouncing of tasks between CPUs if/when higher
	54	priority/interactive tasks, this was observed and reported by many
	55	people. The reason is that the timeslice recalculation loop first needs
	56	every currently running task to consume its timeslice. But when this
	57	happens on eg. an 8-way system, then this property starves an
	58	increasing number of CPUs from executing any process. Once the last
	59	task that has a timeslice left has finished using up that timeslice,
	60	the recalculation loop is triggered and other CPUs can start executing
	61	tasks again - after having idled around for a number of timer ticks.
	62	The more CPUs, the worse this effect.
	63
	64	Furthermore, this same effect causes the bouncing effect as well:
	65	whenever there is such a 'timeslice squeeze' of the global runqueue,
	66	idle processors start executing tasks which are not affine to that CPU.
	67	(because the affine tasks have finished off their timeslices already.)
	68
	69	The new scheduler solves this problem by distributing timeslices on a
	70	per-CPU basis, without having any global synchronization or
	71	recalculation.
	72
	73	- batch scheduling. A significant proportion of computing-intensive tasks
	74	benefit from batch-scheduling, where timeslices are long and processes
	75	are roundrobin scheduled. The new scheduler does such batch-scheduling
	76	of the lowest priority tasks - so nice +19 jobs will get
	77	'batch-scheduled' automatically. With this scheduler, nice +19 jobs are
	78	in essence SCHED_IDLE, from an interactiveness point of view.
	79
	80	- handle extreme loads more smoothly, without breakdown and scheduling
	81	storms.
	82
	83	- O(1) RT scheduling. For those RT folks who are paranoid about the
	84	O(nr_running) property of the goodness loop and the recalculation loop.
	85
	86	- run fork()ed children before the parent. Andrea has pointed out the
	87	advantages of this a few months ago, but patches for this feature
	88	do not work with the old scheduler as well as they should,
	89	because idle processes often steal the new child before the fork()ing
	90	CPU gets to execute it.
	91
	92
	93	Design
	94	======
	95
	96	The core of the new scheduler contains the following mechanisms:
	97
	98	- two priority-ordered 'priority arrays' per CPU. There is an 'active'
	99	array and an 'expired' array. The active array contains all tasks that
	100	are affine to this CPU and have timeslices left. The expired array
	101	contains all tasks which have used up their timeslices - but this array
	102	is kept sorted as well. The active and expired array is not accessed
	103	directly, it's accessed through two pointers in the per-CPU runqueue
	104	structure. If all active tasks are used up then we 'switch' the two
	105	pointers and from now on the ready-to-go (former-) expired array is the
	106	active array - and the empty active array serves as the new collector
	107	for expired tasks.
	108
	109	- there is a 64-bit bitmap cache for array indices. Finding the highest
	110	priority task is thus a matter of two x86 BSFL bit-search instructions.
	111
	112	the split-array solution enables us to have an arbitrary number of active
	113	and expired tasks, and the recalculation of timeslices can be done
	114	immediately when the timeslice expires. Because the arrays are always
	115	access through the pointers in the runqueue, switching the two arrays can
	116	be done very quickly.
	117
	118	this is a hybride priority-list approach coupled with roundrobin
	119	scheduling and the array-switch method of distributing timeslices.
	120
	121	- there is a per-task 'load estimator'.
	122
	123	one of the toughest things to get right is good interactive feel during
	124	heavy system load. While playing with various scheduler variants i found
	125	that the best interactive feel is achieved not by 'boosting' interactive
	126	tasks, but by 'punishing' tasks that want to use more CPU time than there
	127	is available. This method is also much easier to do in an O(1) fashion.
	128
	129	to establish the actual 'load' the task contributes to the system, a
	130	complex-looking but pretty accurate method is used: there is a 4-entry
	131	'history' ringbuffer of the task's activities during the last 4 seconds.
	132	This ringbuffer is operated without much overhead. The entries tell the
	133	scheduler a pretty accurate load-history of the task: has it used up more
	134	CPU time or less during the past N seconds. [the size '4' and the interval
	135	of 4x 1 seconds was found by lots of experimentation - this part is
	136	flexible and can be changed in both directions.]
	137
	138	the penalty a task gets for generating more load than the CPU can handle
	139	is a priority decrease - there is a maximum amount to this penalty
	140	relative to their static priority, so even fully CPU-bound tasks will
	141	observe each other's priorities, and will share the CPU accordingly.
	142
	143	the SMP load-balancer can be extended/switched with additional parallel
	144	computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs
	145	can be supported easily by changing the load-balancer. Right now it's
	146	tuned for my SMP systems.
	147
	148	i skipped the prev->mm == next->mm advantage - no workload i know of shows
	149	any sensitivity to this. It can be added back by sacrificing O(1)
	150	schedule() [the current and one-lower priority list can be searched for a
	151	that->mm == current->mm condition], but costs a fair number of cycles
	152	during a number of important workloads, so i wanted to avoid this as much
	153	as possible.
	154
	155	- the SMP idle-task startup code was still racy and the new scheduler
	156	triggered this. So i streamlined the idle-setup code a bit. We do not call
	157	into schedule() before all processors have started up fully and all idle
	158	threads are in place.
	159
	160	- the patch also cleans up a number of aspects of sched.c - moves code
	161	into other areas of the kernel where it's appropriate, and simplifies
	162	certain code paths and data constructs. As a result, the new scheduler's
	163	code is smaller than the old one.
	164
	165	Ingo