aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/kernel-parameters.txt43
-rw-r--r--Documentation/sched-design-CFS.txt119
2 files changed, 119 insertions, 43 deletions
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index af50f9bbe68e..4d880b3d1f35 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1014,49 +1014,6 @@ and is between 256 and 4096 characters. It is defined in the file
1014 1014
1015 mga= [HW,DRM] 1015 mga= [HW,DRM]
1016 1016
1017 migration_cost=
1018 [KNL,SMP] debug: override scheduler migration costs
1019 Format: <level-1-usecs>,<level-2-usecs>,...
1020 This debugging option can be used to override the
1021 default scheduler migration cost matrix. The numbers
1022 are indexed by 'CPU domain distance'.
1023 E.g. migration_cost=1000,2000,3000 on an SMT NUMA
1024 box will set up an intra-core migration cost of
1025 1 msec, an inter-core migration cost of 2 msecs,
1026 and an inter-node migration cost of 3 msecs.
1027
1028 WARNING: using the wrong values here can break
1029 scheduler performance, so it's only for scheduler
1030 development purposes, not production environments.
1031
1032 migration_debug=
1033 [KNL,SMP] migration cost auto-detect verbosity
1034 Format=<0|1|2>
1035 If a system's migration matrix reported at bootup
1036 seems erroneous then this option can be used to
1037 increase verbosity of the detection process.
1038 We default to 0 (no extra messages), 1 will print
1039 some more information, and 2 will be really
1040 verbose (probably only useful if you also have a
1041 serial console attached to the system).
1042
1043 migration_factor=
1044 [KNL,SMP] multiply/divide migration costs by a factor
1045 Format=<percent>
1046 This debug option can be used to proportionally
1047 increase or decrease the auto-detected migration
1048 costs for all entries of the migration matrix.
1049 E.g. migration_factor=150 will increase migration
1050 costs by 50%. (and thus the scheduler will be less
1051 eager migrating cache-hot tasks)
1052 migration_factor=80 will decrease migration costs
1053 by 20%. (thus the scheduler will be more eager to
1054 migrate tasks)
1055
1056 WARNING: using the wrong values here can break
1057 scheduler performance, so it's only for scheduler
1058 development purposes, not production environments.
1059
1060 mousedev.tap_time= 1017 mousedev.tap_time=
1061 [MOUSE] Maximum time between finger touching and 1018 [MOUSE] Maximum time between finger touching and
1062 leaving touchpad surface for touch to be considered 1019 leaving touchpad surface for touch to be considered
diff --git a/Documentation/sched-design-CFS.txt b/Documentation/sched-design-CFS.txt
new file mode 100644
index 000000000000..16feebb7bdc0
--- /dev/null
+++ b/Documentation/sched-design-CFS.txt
@@ -0,0 +1,119 @@
1
2This is the CFS scheduler.
3
480% of CFS's design can be summed up in a single sentence: CFS basically
5models an "ideal, precise multi-tasking CPU" on real hardware.
6
7"Ideal multi-tasking CPU" is a (non-existent :-)) CPU that has 100%
8physical power and which can run each task at precise equal speed, in
9parallel, each at 1/nr_running speed. For example: if there are 2 tasks
10running then it runs each at 50% physical power - totally in parallel.
11
12On real hardware, we can run only a single task at once, so while that
13one task runs, the other tasks that are waiting for the CPU are at a
14disadvantage - the current task gets an unfair amount of CPU time. In
15CFS this fairness imbalance is expressed and tracked via the per-task
16p->wait_runtime (nanosec-unit) value. "wait_runtime" is the amount of
17time the task should now run on the CPU for it to become completely fair
18and balanced.
19
20( small detail: on 'ideal' hardware, the p->wait_runtime value would
21 always be zero - no task would ever get 'out of balance' from the
22 'ideal' share of CPU time. )
23
24CFS's task picking logic is based on this p->wait_runtime value and it
25is thus very simple: it always tries to run the task with the largest
26p->wait_runtime value. In other words, CFS tries to run the task with
27the 'gravest need' for more CPU time. So CFS always tries to split up
28CPU time between runnable tasks as close to 'ideal multitasking
29hardware' as possible.
30
31Most of the rest of CFS's design just falls out of this really simple
32concept, with a few add-on embellishments like nice levels,
33multiprocessing and various algorithm variants to recognize sleepers.
34
35In practice it works like this: the system runs a task a bit, and when
36the task schedules (or a scheduler tick happens) the task's CPU usage is
37'accounted for': the (small) time it just spent using the physical CPU
38is deducted from p->wait_runtime. [minus the 'fair share' it would have
39gotten anyway]. Once p->wait_runtime gets low enough so that another
40task becomes the 'leftmost task' of the time-ordered rbtree it maintains
41(plus a small amount of 'granularity' distance relative to the leftmost
42task so that we do not over-schedule tasks and trash the cache) then the
43new leftmost task is picked and the current task is preempted.
44
45The rq->fair_clock value tracks the 'CPU time a runnable task would have
46fairly gotten, had it been runnable during that time'. So by using
47rq->fair_clock values we can accurately timestamp and measure the
48'expected CPU time' a task should have gotten. All runnable tasks are
49sorted in the rbtree by the "rq->fair_clock - p->wait_runtime" key, and
50CFS picks the 'leftmost' task and sticks to it. As the system progresses
51forwards, newly woken tasks are put into the tree more and more to the
52right - slowly but surely giving a chance for every task to become the
53'leftmost task' and thus get on the CPU within a deterministic amount of
54time.
55
56Some implementation details:
57
58 - the introduction of Scheduling Classes: an extensible hierarchy of
59 scheduler modules. These modules encapsulate scheduling policy
60 details and are handled by the scheduler core without the core
61 code assuming about them too much.
62
63 - sched_fair.c implements the 'CFS desktop scheduler': it is a
64 replacement for the vanilla scheduler's SCHED_OTHER interactivity
65 code.
66
67 I'd like to give credit to Con Kolivas for the general approach here:
68 he has proven via RSDL/SD that 'fair scheduling' is possible and that
69 it results in better desktop scheduling. Kudos Con!
70
71 The CFS patch uses a completely different approach and implementation
72 from RSDL/SD. My goal was to make CFS's interactivity quality exceed
73 that of RSDL/SD, which is a high standard to meet :-) Testing
74 feedback is welcome to decide this one way or another. [ and, in any
75 case, all of SD's logic could be added via a kernel/sched_sd.c module
76 as well, if Con is interested in such an approach. ]
77
78 CFS's design is quite radical: it does not use runqueues, it uses a
79 time-ordered rbtree to build a 'timeline' of future task execution,
80 and thus has no 'array switch' artifacts (by which both the vanilla
81 scheduler and RSDL/SD are affected).
82
83 CFS uses nanosecond granularity accounting and does not rely on any
84 jiffies or other HZ detail. Thus the CFS scheduler has no notion of
85 'timeslices' and has no heuristics whatsoever. There is only one
86 central tunable:
87
88 /proc/sys/kernel/sched_granularity_ns
89
90 which can be used to tune the scheduler from 'desktop' (low
91 latencies) to 'server' (good batching) workloads. It defaults to a
92 setting suitable for desktop workloads. SCHED_BATCH is handled by the
93 CFS scheduler module too.
94
95 Due to its design, the CFS scheduler is not prone to any of the
96 'attacks' that exist today against the heuristics of the stock
97 scheduler: fiftyp.c, thud.c, chew.c, ring-test.c, massive_intr.c all
98 work fine and do not impact interactivity and produce the expected
99 behavior.
100
101 the CFS scheduler has a much stronger handling of nice levels and
102 SCHED_BATCH: both types of workloads should be isolated much more
103 agressively than under the vanilla scheduler.
104
105 ( another detail: due to nanosec accounting and timeline sorting,
106 sched_yield() support is very simple under CFS, and in fact under
107 CFS sched_yield() behaves much better than under any other
108 scheduler i have tested so far. )
109
110 - sched_rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler
111 way than the vanilla scheduler does. It uses 100 runqueues (for all
112 100 RT priority levels, instead of 140 in the vanilla scheduler)
113 and it needs no expired array.
114
115 - reworked/sanitized SMP load-balancing: the runqueue-walking
116 assumptions are gone from the load-balancing code now, and
117 iterators of the scheduling modules are used. The balancing code got
118 quite a bit simpler as a result.
119