Documentation: create new scheduler/ subdirectory

The top-level Documentation/ directory is unmanageably large, so we should take any obvious opportunities to move stuff into subdirectories. These sched-*.txt files seem an obvious easy case.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
Cc: Ingo Molnar <mingo@elte.hu>
Acked-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: J. Bruce Fields <bfields@citi.umich.edu> 2008-02-07 03:13:37 -0500
committer: Linus Torvalds <torvalds@woody.linux-foundation.org> 2008-02-07 11:42:17 -0500
commit: 9b8eae7248dad42091204f83ed3448e661456af1 (patch)
tree: 1e300d41f8aaa9c258c179024ba63799a79f5a6f /Documentation/scheduler/sched-nice-design.txt
parent: d3cf91d0e201962a6367191e5926f5b0920b0339 (diff)
1 files changed, 108 insertions, 0 deletions
diff --git a/Documentation/scheduler/sched-nice-design.txt b/Documentation/scheduler/sched-nice-design.txt
new file mode 100644
index 000000000000..e2bae5a577e3
--- /dev/null
+++ b/Documentation/scheduler/sched-nice-design.txt
@@ -0,0 +1,108 @@
+This document explains the thinking about the revamped and streamlined
+nice-levels implementation in the new Linux scheduler.
+
+Nice levels were always pretty weak under Linux and people continuously
+pestered us to make nice +19 tasks use up much less CPU time.
+
+Unfortunately that was not that easy to implement under the old
+scheduler (otherwise we'd have done it long ago), because nice level
+support was historically coupled to timeslice length, and timeslice
+units were driven by the HZ tick, so the smallest timeslice was 1/HZ.
+
+In the O(1) scheduler (in 2003) we changed negative nice levels to be
+much stronger than they were before in 2.4 (and people were happy about
+that change), and we also intentionally calibrated the linear timeslice
+rule so that nice +19 level would be _exactly_ 1 jiffy. To better
+understand it, the timeslice graph went like this (cheesy ASCII art
+alert!):
+
+
+                   A
+             \     | [timeslice length]
+              \    |
+               \   |
+                \  |
+                 \ |
+                  \|___100msecs
+                   |^ . _
+                   |      ^ . _
+                   |            ^ . _
+ -*----------------------------------*-----> [nice level]
+ -20               |                +19
+                   |
+                   |
+
+So that if someone wanted to really renice tasks, +19 would give a much
+bigger hit than the normal linear rule would do. (The solution of
+changing the ABI to extend priorities was discarded early on.)
+
+This approach worked to some degree for some time, but later on with
+HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
+we felt to be a bit excessive. Excessive _not_ because it's too small of
+a CPU utilization, but because it causes too frequent (once per
+millisec) rescheduling, which would thrash the cache, etc. (Remember,
+this was long ago when hardware was weaker and caches were smaller, and
+people were running number crunching apps at nice +19.)
+
+So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the
+right minimal granularity - and this translates to 5% CPU utilization.
+But the fundamental HZ-sensitive property for nice +19 still remained,
+and we never got a single complaint about nice +19 tasks getting too
+_little_ CPU time; we only got complaints about them (still) getting
+too _much_ :-)
+
+To sum it up: we always wanted to make nice levels more consistent, but
+within the constraints of HZ and jiffies and their nasty design level
+coupling to timeslices and granularity it was not really viable.
+
+The second (less frequent but still periodically occurring) complaint
+about Linux's nice level support was its asymmetry around the origin
+(which you can see demonstrated in the picture above), or more
+accurately: the fact that nice level behavior depended on the _absolute_
+nice level as well, while the nice API itself is fundamentally
+"relative":
+
+   int nice(int inc);
+
+   asmlinkage long sys_nice(int increment)
+
+(the first one is the glibc API, the second one is the syscall API.)
+Note that the 'inc' is relative to the current nice level. Tools like
+the "nice" command mirror this relative API.
+
+With the old scheduler, if you for example started a niced task with +1
+and another task with +2, the CPU split between the two tasks would
+depend on the nice level of the parent shell - if it was at nice -10 the
+CPU split was different than if it was at +5 or +10.
+
+A third complaint against Linux's nice level support was that negative
+nice levels were not 'punchy enough', so lots of people had to resort to
+running audio (and other multimedia) apps under RT priorities such as
+SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
+proof, and a buggy SCHED_FIFO app can also lock up the system for good.
+
+The new scheduler in v2.6.23 addresses all three types of complaints:
+
+To address the first complaint (of nice +19 tasks using up too much CPU
+time), the scheduler was decoupled from 'time slice' and HZ concepts
+(and granularity was made a separate concept from nice levels) and thus
+it was possible to implement better and more consistent nice +19
+support: with the new scheduler nice +19 tasks get a HZ-independent
+1.5%, instead of the variable 3%-5%-9% range they got in the old
+scheduler.
+
+To address the second complaint (of nice levels not being consistent),
+the new scheduler makes nice(1) have the same CPU utilization effect on
+tasks, regardless of their absolute nice levels. So on the new
+scheduler, running a nice +10 and a nice +11 task has the same CPU
+utilization "split" between them as running a nice -5 and a nice -4
+task. (One will get 55% of the CPU, the other 45%.) That is why nice
+levels were changed to be "multiplicative" (or exponential) - that way
+it does not matter which nice level you start out from, the 'relative
+result' will always be the same.
+
+The third complaint (of negative nice levels not being "punchy" enough
+and forcing audio apps to run under the more dangerous SCHED_FIFO
+scheduling policy) is addressed by the new scheduler almost
+automatically: stronger negative nice levels are an automatic
+side-effect of the recalibrated dynamic range of nice levels.