diff options
Diffstat (limited to 'Documentation/scheduler/sched-design.txt')
| -rw-r--r-- | Documentation/scheduler/sched-design.txt | 165 |
1 files changed, 0 insertions, 165 deletions
diff --git a/Documentation/scheduler/sched-design.txt b/Documentation/scheduler/sched-design.txt deleted file mode 100644 index 1605bf0cba8b..000000000000 --- a/Documentation/scheduler/sched-design.txt +++ /dev/null | |||
| @@ -1,165 +0,0 @@ | |||
| 1 | Goals, Design and Implementation of the | ||
| 2 | new ultra-scalable O(1) scheduler | ||
| 3 | |||
| 4 | |||
| 5 | This is an edited version of an email Ingo Molnar sent to | ||
| 6 | lkml on 4 Jan 2002. It describes the goals, design, and | ||
| 7 | implementation of Ingo's new ultra-scalable O(1) scheduler. | ||
| 8 | Last Updated: 18 April 2002. | ||
| 9 | |||
| 10 | |||
| 11 | Goal | ||
| 12 | ==== | ||
| 13 | |||
| 14 | The main goal of the new scheduler is to keep all the good things we know | ||
| 15 | and love about the current Linux scheduler: | ||
| 16 | |||
| 17 | - good interactive performance even during high load: if the user | ||
| 18 | types or clicks then the system must react instantly and must execute | ||
| 19 | the user tasks smoothly, even during considerable background load. | ||
| 20 | |||
| 21 | - good scheduling/wakeup performance with 1-2 runnable processes. | ||
| 22 | |||
| 23 | - fairness: no process should stay without any timeslice for any | ||
| 24 | unreasonable amount of time. No process should get an unjustly high | ||
| 25 | amount of CPU time. | ||
| 26 | |||
| 27 | - priorities: less important tasks can be started with lower priority, | ||
| 28 | more important tasks with higher priority. | ||
| 29 | |||
| 30 | - SMP efficiency: no CPU should stay idle if there is work to do. | ||
| 31 | |||
| 32 | - SMP affinity: processes which run on one CPU should stay affine to | ||
| 33 | that CPU. Processes should not bounce between CPUs too frequently. | ||
| 34 | |||
| 35 | - plus additional scheduler features: RT scheduling, CPU binding. | ||
| 36 | |||
| 37 | and the goal is also to add a few new things: | ||
| 38 | |||
| 39 | - fully O(1) scheduling. Are you tired of the recalculation loop | ||
| 40 | blowing the L1 cache away every now and then? Do you think the goodness | ||
| 41 | loop is taking a bit too long to finish if there are lots of runnable | ||
| 42 | processes? This new scheduler takes no prisoners: wakeup(), schedule(), | ||
| 43 | the timer interrupt are all O(1) algorithms. There is no recalculation | ||
| 44 | loop. There is no goodness loop either. | ||
| 45 | |||
| 46 | - 'perfect' SMP scalability. With the new scheduler there is no 'big' | ||
| 47 | runqueue_lock anymore - it's all per-CPU runqueues and locks - two | ||
| 48 | tasks on two separate CPUs can wake up, schedule and context-switch | ||
| 49 | completely in parallel, without any interlocking. All | ||
| 50 | scheduling-relevant data is structured for maximum scalability. | ||
| 51 | |||
| 52 | - better SMP affinity. The old scheduler has a particular weakness that | ||
| 53 | causes the random bouncing of tasks between CPUs if/when higher | ||
| 54 | priority/interactive tasks, this was observed and reported by many | ||
| 55 | people. The reason is that the timeslice recalculation loop first needs | ||
| 56 | every currently running task to consume its timeslice. But when this | ||
| 57 | happens on eg. an 8-way system, then this property starves an | ||
| 58 | increasing number of CPUs from executing any process. Once the last | ||
| 59 | task that has a timeslice left has finished using up that timeslice, | ||
| 60 | the recalculation loop is triggered and other CPUs can start executing | ||
| 61 | tasks again - after having idled around for a number of timer ticks. | ||
| 62 | The more CPUs, the worse this effect. | ||
| 63 | |||
| 64 | Furthermore, this same effect causes the bouncing effect as well: | ||
| 65 | whenever there is such a 'timeslice squeeze' of the global runqueue, | ||
| 66 | idle processors start executing tasks which are not affine to that CPU. | ||
| 67 | (because the affine tasks have finished off their timeslices already.) | ||
| 68 | |||
| 69 | The new scheduler solves this problem by distributing timeslices on a | ||
| 70 | per-CPU basis, without having any global synchronization or | ||
| 71 | recalculation. | ||
| 72 | |||
| 73 | - batch scheduling. A significant proportion of computing-intensive tasks | ||
| 74 | benefit from batch-scheduling, where timeslices are long and processes | ||
| 75 | are roundrobin scheduled. The new scheduler does such batch-scheduling | ||
| 76 | of the lowest priority tasks - so nice +19 jobs will get | ||
| 77 | 'batch-scheduled' automatically. With this scheduler, nice +19 jobs are | ||
| 78 | in essence SCHED_IDLE, from an interactiveness point of view. | ||
| 79 | |||
| 80 | - handle extreme loads more smoothly, without breakdown and scheduling | ||
| 81 | storms. | ||
| 82 | |||
| 83 | - O(1) RT scheduling. For those RT folks who are paranoid about the | ||
| 84 | O(nr_running) property of the goodness loop and the recalculation loop. | ||
| 85 | |||
| 86 | - run fork()ed children before the parent. Andrea has pointed out the | ||
| 87 | advantages of this a few months ago, but patches for this feature | ||
| 88 | do not work with the old scheduler as well as they should, | ||
| 89 | because idle processes often steal the new child before the fork()ing | ||
| 90 | CPU gets to execute it. | ||
| 91 | |||
| 92 | |||
| 93 | Design | ||
| 94 | ====== | ||
| 95 | |||
| 96 | The core of the new scheduler contains the following mechanisms: | ||
| 97 | |||
| 98 | - *two* priority-ordered 'priority arrays' per CPU. There is an 'active' | ||
| 99 | array and an 'expired' array. The active array contains all tasks that | ||
| 100 | are affine to this CPU and have timeslices left. The expired array | ||
| 101 | contains all tasks which have used up their timeslices - but this array | ||
| 102 | is kept sorted as well. The active and expired array is not accessed | ||
| 103 | directly, it's accessed through two pointers in the per-CPU runqueue | ||
| 104 | structure. If all active tasks are used up then we 'switch' the two | ||
| 105 | pointers and from now on the ready-to-go (former-) expired array is the | ||
| 106 | active array - and the empty active array serves as the new collector | ||
| 107 | for expired tasks. | ||
| 108 | |||
| 109 | - there is a 64-bit bitmap cache for array indices. Finding the highest | ||
| 110 | priority task is thus a matter of two x86 BSFL bit-search instructions. | ||
| 111 | |||
| 112 | the split-array solution enables us to have an arbitrary number of active | ||
| 113 | and expired tasks, and the recalculation of timeslices can be done | ||
| 114 | immediately when the timeslice expires. Because the arrays are always | ||
| 115 | access through the pointers in the runqueue, switching the two arrays can | ||
| 116 | be done very quickly. | ||
| 117 | |||
| 118 | this is a hybride priority-list approach coupled with roundrobin | ||
| 119 | scheduling and the array-switch method of distributing timeslices. | ||
| 120 | |||
| 121 | - there is a per-task 'load estimator'. | ||
| 122 | |||
| 123 | one of the toughest things to get right is good interactive feel during | ||
| 124 | heavy system load. While playing with various scheduler variants i found | ||
| 125 | that the best interactive feel is achieved not by 'boosting' interactive | ||
| 126 | tasks, but by 'punishing' tasks that want to use more CPU time than there | ||
| 127 | is available. This method is also much easier to do in an O(1) fashion. | ||
| 128 | |||
| 129 | to establish the actual 'load' the task contributes to the system, a | ||
| 130 | complex-looking but pretty accurate method is used: there is a 4-entry | ||
| 131 | 'history' ringbuffer of the task's activities during the last 4 seconds. | ||
| 132 | This ringbuffer is operated without much overhead. The entries tell the | ||
| 133 | scheduler a pretty accurate load-history of the task: has it used up more | ||
| 134 | CPU time or less during the past N seconds. [the size '4' and the interval | ||
| 135 | of 4x 1 seconds was found by lots of experimentation - this part is | ||
| 136 | flexible and can be changed in both directions.] | ||
| 137 | |||
| 138 | the penalty a task gets for generating more load than the CPU can handle | ||
| 139 | is a priority decrease - there is a maximum amount to this penalty | ||
| 140 | relative to their static priority, so even fully CPU-bound tasks will | ||
| 141 | observe each other's priorities, and will share the CPU accordingly. | ||
| 142 | |||
| 143 | the SMP load-balancer can be extended/switched with additional parallel | ||
| 144 | computing and cache hierarchy concepts: NUMA scheduling, multi-core CPUs | ||
| 145 | can be supported easily by changing the load-balancer. Right now it's | ||
| 146 | tuned for my SMP systems. | ||
| 147 | |||
| 148 | i skipped the prev->mm == next->mm advantage - no workload i know of shows | ||
| 149 | any sensitivity to this. It can be added back by sacrificing O(1) | ||
| 150 | schedule() [the current and one-lower priority list can be searched for a | ||
| 151 | that->mm == current->mm condition], but costs a fair number of cycles | ||
| 152 | during a number of important workloads, so i wanted to avoid this as much | ||
| 153 | as possible. | ||
| 154 | |||
| 155 | - the SMP idle-task startup code was still racy and the new scheduler | ||
| 156 | triggered this. So i streamlined the idle-setup code a bit. We do not call | ||
| 157 | into schedule() before all processors have started up fully and all idle | ||
| 158 | threads are in place. | ||
| 159 | |||
| 160 | - the patch also cleans up a number of aspects of sched.c - moves code | ||
| 161 | into other areas of the kernel where it's appropriate, and simplifies | ||
| 162 | certain code paths and data constructs. As a result, the new scheduler's | ||
| 163 | code is smaller than the old one. | ||
| 164 | |||
| 165 | Ingo | ||
