Diffstat (limited to 'Documentation/sched-nice-design.txt')
| -rw-r--r-- | Documentation/sched-nice-design.txt | 108 |
1 files changed, 108 insertions, 0 deletions
diff --git a/Documentation/sched-nice-design.txt b/Documentation/sched-nice-design.txt
new file mode 100644
index 000000000000..e2bae5a577e3
--- /dev/null
+++ b/Documentation/sched-nice-design.txt
| @@ -0,0 +1,108 @@ | |||
| 1 | This document explains the thinking about the revamped and streamlined | ||
| 2 | nice-levels implementation in the new Linux scheduler. | ||
| 3 | |||
| 4 | Nice levels were always pretty weak under Linux and people continuously | ||
| 5 | pestered us to make nice +19 tasks use up much less CPU time. | ||
| 6 | |||
| 7 | Unfortunately that was not that easy to implement under the old | ||
| 8 | scheduler (otherwise we'd have done it long ago), because nice level | ||
| 9 | support was historically coupled to timeslice length, and timeslice | ||
| 10 | units were driven by the HZ tick, so the smallest timeslice was 1/HZ. | ||
| 11 | |||
| 12 | In the O(1) scheduler (in 2003) we changed negative nice levels to be | ||
| 13 | much stronger than they were before in 2.4 (and people were happy about | ||
| 14 | that change), and we also intentionally calibrated the linear timeslice | ||
| 15 | rule so that nice +19 level would be _exactly_ 1 jiffy. To better | ||
| 16 | understand it, the timeslice graph went like this (cheesy ASCII art | ||
| 17 | alert!): | ||
| 18 | |||
| 19 | |||
| 20 | A | ||
| 21 | \ | [timeslice length] | ||
| 22 | \ | | ||
| 23 | \ | | ||
| 24 | \ | | ||
| 25 | \ | | ||
| 26 | \|___100msecs | ||
| 27 | |^ . _ | ||
| 28 | | ^ . _ | ||
| 29 | | ^ . _ | ||
| 30 | -*----------------------------------*-----> [nice level] | ||
| 31 | -20 | +19 | ||
| 32 | | | ||
| 33 | | | ||
| 34 | |||
| 35 | So that if someone wanted to really renice tasks, +19 would give a much | ||
| 36 | bigger hit than the normal linear rule would do. (The solution of | ||
| 37 | changing the ABI to extend priorities was discarded early on.) | ||
| 38 | |||
| 39 | This approach worked to some degree for some time, but later on with | ||
| 40 | HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which | ||
| 41 | we felt to be a bit excessive. Excessive _not_ because it's too small a | ||
| 42 | CPU utilization, but because it causes too frequent (once per | ||
| 43 | millisec) rescheduling, which would thrash the cache, etc. (Remember, | ||
| 44 | this was long ago when hardware was weaker and caches were smaller, and | ||
| 45 | people were running number crunching apps at nice +19.) | ||
| 46 | |||
| 47 | So for HZ=1000 we changed nice +19 to 5msecs, because that felt like the | ||
| 48 | right minimal granularity - and this translates to 5% CPU utilization. | ||
| 49 | But the fundamental HZ-sensitive property for nice +19 still remained, | ||
| 50 | and we never got a single complaint about nice +19 being too _weak_ in | ||
| 51 | terms of CPU utilization, we only got complaints about it (still) being | ||
| 52 | too _strong_ :-) | ||
| 53 | |||
| 54 | To sum it up: we always wanted to make nice levels more consistent, but | ||
| 55 | within the constraints of HZ and jiffies and their nasty design level | ||
| 56 | coupling to timeslices and granularity it was not really viable. | ||
| 57 | |||
| 58 | The second (less frequent but still periodically occurring) complaint | ||
| 59 | about Linux's nice level support was its asymmetry around the origin | ||
| 60 | (which you can see demonstrated in the picture above), or more | ||
| 61 | accurately: the fact that nice level behavior depended on the _absolute_ | ||
| 62 | nice level as well, while the nice API itself is fundamentally | ||
| 63 | "relative": | ||
| 64 | |||
| 65 | int nice(int inc); | ||
| 66 | |||
| 67 | asmlinkage long sys_nice(int increment) | ||
| 68 | |||
| 69 | (the first one is the glibc API, the second one is the syscall API.) | ||
| 70 | Note that the 'inc' is relative to the current nice level. Tools like | ||
| 71 | bash's "nice" command mirror this relative API. | ||
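
To make the relative semantics concrete, here is a minimal sketch of a caller applying an increment to its own nice level. The helper name renice_self() is hypothetical and only for illustration; nice() and errno are the standard libc interfaces. Because -1 is a legitimate return value from nice(), the sketch clears errno first and checks it afterwards.

```c
#include <errno.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any kernel or libc API): apply a
 * relative increment to the calling process's nice level.
 * Returns 0 on success and stores the resulting level in *new_level,
 * or returns -1 if nice() reported an error.
 */
static int renice_self(int inc, int *new_level)
{
	int level;

	errno = 0;                 /* -1 can be a valid nice level */
	level = nice(inc);         /* 'inc' is added to the current level */
	if (level == -1 && errno != 0)
		return -1;

	*new_level = level;
	return 0;
}
```

Calling renice_self(1, &lvl) from a shell-started process at nice 0 leaves it at +1, while the same call from a process already at +5 leaves it at +6: the caller never names an absolute level.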
| 72 | |||
| 73 | With the old scheduler, if you for example started a niced task with +1 | ||
| 74 | and another task with +2, the CPU split between the two tasks would | ||
| 75 | depend on the nice level of the parent shell - if it was at nice -10 the | ||
| 76 | CPU split was different than if it was at +5 or +10. | ||
| 77 | |||
| 78 | A third complaint against Linux's nice level support was that negative | ||
| 79 | nice levels were not 'punchy enough', so lots of people had to resort to | ||
| 80 | running audio (and other multimedia) apps under RT priorities such as | ||
| 81 | SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation | ||
| 82 | proof, and a buggy SCHED_FIFO app can also lock up the system for good. | ||
| 83 | |||
| 84 | The new scheduler in v2.6.23 addresses all three types of complaints: | ||
| 85 | |||
| 86 | To address the first complaint (of nice +19 not being "punchy" | ||
| 87 | enough), the scheduler was decoupled from 'time slice' and HZ concepts | ||
| 88 | (and granularity was made a separate concept from nice levels) and thus | ||
| 89 | it was possible to implement better and more consistent nice +19 | ||
| 90 | support: with the new scheduler nice +19 tasks get a HZ-independent | ||
| 91 | 1.5%, instead of the variable 3%-5%-9% range they got in the old | ||
| 92 | scheduler. | ||
| 93 | |||
| 94 | To address the second complaint (of nice levels not being consistent), | ||
| 95 | the new scheduler makes nice(1) have the same CPU utilization effect on | ||
| 96 | tasks, regardless of their absolute nice levels. So on the new | ||
| 97 | scheduler, running a nice +10 and a nice +11 task has the same CPU | ||
| 98 | utilization "split" between them as running a nice -5 and a nice -4 | ||
| 99 | task. (one will get 55% of the CPU, the other 45%.) That is why nice | ||
| 100 | levels were changed to be "multiplicative" (or exponential) - that way | ||
| 101 | it does not matter which nice level you start out from, the 'relative | ||
| 102 | result' will always be the same. | ||
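
The multiplicative rule can be sketched in a few lines of C. This is an illustration only, not the scheduler's code: it assumes a per-level multiplier of 1.25 (a value not stated in this document, chosen because it yields roughly the 55%/45% split quoted above), and the helper names nice_weight() and cpu_share() are hypothetical.

```c
/* Assumed multiplier: one nice step scales a task's weight by 1.25. */
#define NICE_STEP 1.25

/* Hypothetical helper: weight of a task relative to nice 0 (weight 1.0). */
static double nice_weight(int nice)
{
	double w = 1.0;
	int i;

	for (i = 0; i < nice; i++)	/* positive nice: lighter weight */
		w /= NICE_STEP;
	for (i = 0; i > nice; i--)	/* negative nice: heavier weight */
		w *= NICE_STEP;
	return w;
}

/* Fraction of the CPU task A gets when competing only with task B. */
static double cpu_share(int nice_a, int nice_b)
{
	double wa = nice_weight(nice_a);
	double wb = nice_weight(nice_b);

	return wa / (wa + wb);
}
```

Under this assumption, cpu_share(10, 11) and cpu_share(-5, -4) both come out at about 0.556, because only the ratio of the two weights matters and that ratio depends solely on the difference in nice levels. The same model gives cpu_share(19, 0) of roughly 0.014, which is in line with the HZ-independent figure of about 1.5% for a nice +19 task competing with a nice 0 task.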
| 103 | |||
| 104 | The third complaint (of negative nice levels not being "punchy" enough | ||
| 105 | and forcing audio apps to run under the more dangerous SCHED_FIFO | ||
| 106 | scheduling policy) is addressed by the new scheduler almost | ||
| 107 | automatically: stronger negative nice levels are an automatic | ||
| 108 | side-effect of the recalibrated dynamic range of nice levels. | ||
