Diffstat (limited to 'Documentation/sched-domains.txt')
-rw-r--r-- | Documentation/sched-domains.txt | 70 |
1 files changed, 70 insertions, 0 deletions
diff --git a/Documentation/sched-domains.txt b/Documentation/sched-domains.txt
new file mode 100644
index 000000000000..a9e990ab980f
--- /dev/null
+++ b/Documentation/sched-domains.txt
@@ -0,0 +1,70 @@
1 | Each CPU has a "base" scheduling domain (struct sched_domain). These are | ||
2 | accessed via cpu_sched_domain(i) and this_sched_domain() macros. The domain | ||
3 | hierarchy is built from these base domains via the ->parent pointer. ->parent | ||
4 | MUST be NULL terminated, and domain structures should be per-CPU as they | ||
5 | are locklessly updated. | ||
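
As an illustration of the ->parent chain described above, here is a minimal sketch in plain user-space C. The type chain_domain and the function chain_depth are hypothetical stand-ins, not the kernel's struct sched_domain or any kernel API; only the NULL-terminated ->parent idea comes from the text.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct sched_domain; only ->parent is modelled. */
struct chain_domain {
    struct chain_domain *parent; /* NULL at the top of the hierarchy */
};

/*
 * Count the levels from a CPU's base domain up to its top domain.
 * The loop terminates only because ->parent is NULL terminated.
 */
static int chain_depth(const struct chain_domain *base)
{
    const struct chain_domain *sd;
    int depth = 0;

    for (sd = base; sd != NULL; sd = sd->parent)
        depth++;
    return depth;
}
```

With three domains linked base -> mid -> top, chain_depth(&base) walks all three levels and stops at the NULL parent of the top domain.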
6 | |||
7 | Each scheduling domain spans a number of CPUs (stored in the ->span field). | ||
8 | A domain's span MUST be a superset of its child's span (this restriction could | ||
9 | be relaxed if the need arises), and a base domain for CPU i MUST span at least | ||
10 | i. The top domain for each CPU will generally span all CPUs in the system. | ||
11 | Strictly it doesn't have to, but a narrower top domain could lead to a case | ||
12 | where some CPUs will never be given tasks to run unless the CPUs allowed mask | ||
13 | is explicitly set. A sched domain's span means "balance process load among these | ||
14 | CPUs". | ||
15 | |||
16 | Each scheduling domain must have one or more CPU groups (struct sched_group) | ||
17 | which are organised as a circular one way linked list from the ->groups | ||
18 | pointer. The union of cpumasks of these groups MUST be the same as the | ||
19 | domain's span. The intersection of cpumasks from any two of these groups | ||
20 | MUST be the empty set. The group pointed to by the ->groups pointer MUST | ||
21 | contain the CPU to which the domain belongs. Groups may be shared among | ||
22 | CPUs as they contain read only data after they have been set up. | ||
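
The three group rules above (union equals the span, groups are pairwise disjoint, the first group contains the owning CPU) can be sketched as one pass around the circular list. The types grp_group and grp_domain and the function grp_groups_valid are hypothetical; masks are again modelled as unsigned long.

```c
#include <stddef.h>

typedef unsigned long grp_mask; /* assumption: bit i set => CPU i in mask */

/* Hypothetical stand-in for struct sched_group. */
struct grp_group {
    struct grp_group *next; /* circular, one-way list */
    grp_mask cpumask;
};

/* Hypothetical stand-in for struct sched_domain. */
struct grp_domain {
    grp_mask span;
    struct grp_group *groups; /* first group; must contain the owning CPU */
};

/* Return 1 if the groups of `sd` satisfy all three rules for `cpu`. */
static int grp_groups_valid(const struct grp_domain *sd, int cpu)
{
    const struct grp_group *g = sd->groups;
    grp_mask seen = 0;

    if (g == NULL || !(g->cpumask & (1UL << cpu)))
        return 0;
    do {
        if (g->cpumask & seen) /* overlaps a group already visited */
            return 0;
        seen |= g->cpumask;
        g = g->next;
    } while (g != NULL && g != sd->groups);

    if (g == NULL) /* list was not circular */
        return 0;
    return seen == sd->span;
}
```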
23 | |||
24 | Balancing within a sched domain occurs between groups. That is, each group | ||
25 | is treated as one entity. The load of a group is defined as the sum of the | ||
26 | load of each of its member CPUs, and only when the load of a group becomes | ||
27 | out of balance are tasks moved between groups. | ||
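
To make the definition of group load concrete, here is a small sketch that sums per-CPU loads over a group's cpumask. The per-CPU load array, LOAD_NR_CPUS and load_group_load are assumptions made for illustration; the kernel's actual load accounting is not shown in this document.

```c
#define LOAD_NR_CPUS 8 /* assumption: toy machine size */

/*
 * Load of a group = sum of the loads of its member CPUs.
 * `cpumask` has bit i set when CPU i belongs to the group.
 */
static unsigned long load_group_load(unsigned long cpumask,
                                     const unsigned long cpu_load[LOAD_NR_CPUS])
{
    unsigned long sum = 0;
    int cpu;

    for (cpu = 0; cpu < LOAD_NR_CPUS; cpu++) {
        if (cpumask & (1UL << cpu))
            sum += cpu_load[cpu];
    }
    return sum;
}
```

For example, with per-CPU loads {4, 0, 1, 1, ...}, a group covering CPUs 0-1 has load 4 and a group covering CPUs 2-3 has load 2, even though CPU 1 is idle. Balancing compares these group totals, not individual CPUs.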
28 | |||
29 | In kernel/sched.c, rebalance_tick is run periodically on each CPU. This | ||
30 | function takes its CPU's base sched domain and checks to see if it has reached | ||
31 | its rebalance interval. If so, then it will run load_balance on that domain. | ||
32 | rebalance_tick then checks the parent sched_domain (if it exists), and the | ||
33 | parent of the parent and so forth. | ||
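
The walk described above can be sketched as follows. This is not the code in kernel/sched.c: tick_domain, its fields last_balance, balance_interval and balance_count, and the functions tick_load_balance and tick_rebalance are all hypothetical names chosen for illustration, and the stand-in load_balance only counts how often it is called.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct sched_domain. */
struct tick_domain {
    struct tick_domain *parent;
    unsigned long last_balance;     /* assumed: time of the last balance */
    unsigned long balance_interval; /* assumed: ticks between balances */
    int balance_count;              /* counts stand-in load_balance calls */
};

/* Stand-in for load_balance: records that a balance was attempted. */
static void tick_load_balance(struct tick_domain *sd)
{
    sd->balance_count++;
}

/*
 * Start at the CPU's base domain and walk up through ->parent.
 * Each domain is balanced only when its own interval has elapsed.
 */
static void tick_rebalance(struct tick_domain *base, unsigned long now)
{
    struct tick_domain *sd;

    for (sd = base; sd != NULL; sd = sd->parent) {
        if (now - sd->last_balance >= sd->balance_interval) {
            tick_load_balance(sd);
            sd->last_balance = now;
        }
    }
}
```

Giving higher domains a longer interval means the cheaper, lower-level balancing runs more often than balancing across the whole machine.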
34 | |||
35 | *** Implementing sched domains *** | ||
36 | The "base" domain will "span" the first level of the hierarchy. In the case | ||
37 | of SMT, you'll span all siblings of the physical CPU, with each group being | ||
38 | a single virtual CPU. | ||
39 | |||
40 | In SMP, the parent of the base domain will span all physical CPUs in the | ||
41 | node, with each group being a single physical CPU. Then with NUMA, the parent | ||
42 | of the SMP domain will span the entire machine, with each group having the | ||
43 | cpumask of a node. Alternatively, you could do multi-level NUMA; Opteron, for | ||
44 | example, might have just one domain covering its one NUMA level. | ||
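
As a worked example of the SMT->SMP->NUMA layering, the sketch below computes the span of each level for a hypothetical machine with 2 nodes, 2 physical CPUs per node and 2 SMT siblings per physical CPU. It assumes logical CPUs are numbered contiguously (siblings adjacent, then physical CPUs, then nodes), which real hardware need not do. All names are illustrative.

```c
typedef unsigned long topo_mask; /* assumption: bit i set => CPU i in mask */

#define TOPO_THREADS_PER_CPU 2 /* SMT siblings per physical CPU */
#define TOPO_CPUS_PER_NODE   2 /* physical CPUs per node */
#define TOPO_NODES           2
#define TOPO_PER_NODE (TOPO_THREADS_PER_CPU * TOPO_CPUS_PER_NODE)

/* Base (SMT) domain: all siblings of the physical CPU holding `cpu`. */
static topo_mask topo_smt_span(int cpu)
{
    int first = cpu - cpu % TOPO_THREADS_PER_CPU;

    return ((1UL << TOPO_THREADS_PER_CPU) - 1) << first;
}

/* SMP domain: every logical CPU in the node holding `cpu`. */
static topo_mask topo_smp_span(int cpu)
{
    int first = cpu - cpu % TOPO_PER_NODE;

    return ((1UL << TOPO_PER_NODE) - 1) << first;
}

/* NUMA domain: the entire machine. */
static topo_mask topo_numa_span(void)
{
    return (1UL << (TOPO_PER_NODE * TOPO_NODES)) - 1;
}
```

For logical CPU 5 this gives an SMT span of 0x30 (CPUs 4-5), an SMP span of 0xF0 (CPUs 4-7) and a NUMA span of 0xFF, so each level is a superset of the one below it.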
45 | |||
46 | The implementor should read comments in include/linux/sched.h: | ||
47 | struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of | ||
48 | the specifics and what to tune. | ||
49 | |||
50 | For SMT, the architecture must define CONFIG_SCHED_SMT and provide a | ||
51 | cpumask_t cpu_sibling_map[NR_CPUS], where cpu_sibling_map[i] is the mask of | ||
52 | all "i"'s siblings as well as "i" itself. | ||
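
The shape of cpu_sibling_map can be illustrated with a user-space sketch. It models cpumask_t as an unsigned long and derives the map from a hypothetical table sib_core_id giving the physical CPU of each logical CPU; how a real architecture discovers its siblings is outside the scope of this text.

```c
#define SIB_NR_CPUS 4 /* assumption: toy machine size */

typedef unsigned long sib_mask; /* stand-in for cpumask_t */

/*
 * Fill map[i] with the mask of all of i's siblings plus i itself.
 * Two logical CPUs are siblings when they share a physical CPU id.
 */
static void sib_build_map(const int sib_core_id[SIB_NR_CPUS],
                          sib_mask map[SIB_NR_CPUS])
{
    int i, j;

    for (i = 0; i < SIB_NR_CPUS; i++) {
        map[i] = 0;
        for (j = 0; j < SIB_NR_CPUS; j++) {
            if (sib_core_id[j] == sib_core_id[i])
                map[i] |= 1UL << j;
        }
    }
}
```

With physical CPU ids {0, 0, 1, 1}, logical CPUs 0 and 1 both get the mask 0x3 and logical CPUs 2 and 3 both get 0xC. Note that each entry always includes its own CPU.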
53 | |||
54 | Architectures may override the default SD_*_INIT flags | ||
55 | while using the generic domain builder in kernel/sched.c if they wish to | ||
56 | retain the traditional SMT->SMP->NUMA topology (or some subset of that). This | ||
57 | can be done by #define'ing ARCH_HAS_SCHED_TUNE. | ||
58 | |||
59 | Alternatively, the architecture may completely override the generic domain | ||
60 | builder by #define'ing ARCH_HAS_SCHED_DOMAIN, and exporting its | ||
61 | arch_init_sched_domains function. This function will attach domains to all | ||
62 | CPUs using cpu_attach_domain. | ||
63 | |||
64 | Implementors should change the line | ||
65 | #undef SCHED_DOMAIN_DEBUG | ||
66 | to | ||
67 | #define SCHED_DOMAIN_DEBUG | ||
68 | in kernel/sched.c as this enables an error checking parse of the sched domains | ||
69 | which should catch most possible errors (described above). It also prints out | ||
70 | the domain structure in a visual format. | ||