Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/cpusets.txt | 141
1 file changed, 139 insertions(+), 2 deletions(-)
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index 85eeab5e7e32..141bef1c8599 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -19,7 +19,8 @@ CONTENTS:
   1.4 What are exclusive cpusets ?
   1.5 What is memory_pressure ?
   1.6 What is memory spread ?
-  1.7 How do I use cpusets ?
+  1.7 What is sched_load_balance ?
+  1.8 How do I use cpusets ?
  2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -359,8 +360,144 @@ policy, especially for jobs that might have one thread reading in the
 data set, the memory allocation across the nodes in the jobs cpuset
 can become very uneven.
 
+1.7 What is sched_load_balance ?
+--------------------------------
 
-1.7 How do I use cpusets ?
+The kernel scheduler (kernel/sched.c) automatically load balances
+tasks.  If one CPU is underutilized, kernel code running on that
+CPU will look for tasks on other more overloaded CPUs and move those
+tasks to itself, within the constraints of such placement mechanisms
+as cpusets and sched_setaffinity.
+
+The algorithmic cost of load balancing and its impact on key shared
+kernel data structures such as the task list increases more than
+linearly with the number of CPUs being balanced.  So the scheduler
+has support to partition the system's CPUs into a number of sched
+domains such that it only load balances within each sched domain.
+Each sched domain covers some subset of the CPUs in the system;
+no two sched domains overlap; some CPUs might not be in any sched
+domain and hence won't be load balanced.
+
+Put simply, it costs less to balance between two smaller sched domains
+than one big one, but doing so means that overloads in one of the
+two domains won't be load balanced to the other one.
+
+By default, there is one sched domain covering all CPUs, except those
+marked isolated using the kernel boot time "isolcpus=" argument.
+
+This default load balancing across all CPUs is not well suited for
+the following two situations:
+ 1) On large systems, load balancing across many CPUs is expensive.
+    If the system is managed using cpusets to place independent jobs
+    on separate sets of CPUs, full load balancing is unnecessary.
+ 2) Systems supporting realtime on some CPUs need to minimize
+    system overhead on those CPUs, including avoiding task load
+    balancing if that is not needed.
+
+When the per-cpuset flag "sched_load_balance" is enabled (the default
+setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
+be contained in a single sched domain, ensuring that load balancing
+can move a task (not otherwise pinned, as by sched_setaffinity)
+from any CPU in that cpuset to any other.
+
+When the per-cpuset flag "sched_load_balance" is disabled, the
+scheduler will avoid load balancing across the CPUs in that cpuset,
+--except-- in so far as is necessary because some overlapping cpuset
+has "sched_load_balance" enabled.
+
+So, for example, if the top cpuset has the flag "sched_load_balance"
+enabled, then the scheduler will have one sched domain covering all
+CPUs, and the setting of the "sched_load_balance" flag in any other
+cpusets won't matter, as we're already fully load balancing.
+
+Therefore in the above two situations, the top cpuset flag
+"sched_load_balance" should be disabled, and only some of the smaller,
+child cpusets should have this flag enabled.
+
+When doing this, you don't usually want to leave any unpinned tasks in
+the top cpuset that might use non-trivial amounts of CPU, as such tasks
+may be artificially constrained to some subset of CPUs, depending on
+the particulars of this flag setting in descendant cpusets.  Even if
+such a task could use spare CPU cycles in some other CPUs, the kernel
+scheduler might not consider the possibility of load balancing that
+task to that underused CPU.
+
+Of course, tasks pinned to a particular CPU can be left in a cpuset
+that disables "sched_load_balance", as those tasks aren't going anywhere
+else anyway.
+
+There is an impedance mismatch here, between cpusets and sched domains.
+Cpusets are hierarchical and nest.  Sched domains are flat; they don't
+overlap and each CPU is in at most one sched domain.
+
+It is necessary for sched domains to be flat because load balancing
+across partially overlapping sets of CPUs would risk unstable dynamics
+that would be beyond our understanding.  So if each of two partially
+overlapping cpusets enables the flag 'sched_load_balance', then we
+form a single sched domain that is a superset of both.  We won't move
+a task to a CPU outside its cpuset, but the scheduler load balancing
+code might waste some compute cycles considering that possibility.
+
+This mismatch is why there is not a simple one-to-one relation
+between which cpusets have the flag "sched_load_balance" enabled,
+and the sched domain configuration.  If a cpuset enables the flag, it
+will get balancing across all its CPUs, but if it disables the flag,
+it will only be assured of no load balancing if no other overlapping
+cpuset enables the flag.
+
+If two cpusets have partially overlapping 'cpus' allowed, and only
+one of them has this flag enabled, then the other may find its
+tasks only partially load balanced, just on the overlapping CPUs.
+This is just the general case of the top_cpuset example given a few
+paragraphs above.  In the general case, as in the top cpuset case,
+don't leave tasks that might use non-trivial amounts of CPU in
+such partially load balanced cpusets, as they may be artificially
+constrained to some subset of the CPUs allowed to them, for lack of
+load balancing to the other CPUs.
+
+1.7.1 sched_load_balance implementation details.
+------------------------------------------------
+
+The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
+to most cpuset flags.)  When enabled for a cpuset, the kernel will
+ensure that it can load balance across all the CPUs in that cpuset
+(makes sure that all the CPUs in the cpus_allowed of that cpuset are
+in the same sched domain.)
+
+If two overlapping cpusets both have 'sched_load_balance' enabled,
+then they will be (must be) both in the same sched domain.
+
+If, as is the default, the top cpuset has 'sched_load_balance' enabled,
+then by the above that means there is a single sched domain covering
+the whole system, regardless of any other cpuset settings.
+
+The kernel commits to user space that it will avoid load balancing
+where it can.  It will pick as fine a granularity partition of sched
+domains as it can while still providing load balancing for any set
+of CPUs allowed to a cpuset having 'sched_load_balance' enabled.
+
+The internal kernel cpuset to scheduler interface passes from the
+cpuset code to the scheduler code a partition of the load balanced
+CPUs in the system.  This partition is a set of subsets (represented
+as an array of cpumask_t) of CPUs, pairwise disjoint, that cover all
+the CPUs that must be load balanced.
+
+Whenever the 'sched_load_balance' flag changes, or CPUs come or go
+from a cpuset with this flag enabled, or a cpuset with this flag
+enabled is removed, the cpuset code builds a new such partition and
+passes it to the scheduler sched domain setup code, to have the sched
+domains rebuilt as necessary.
+
+This partition exactly defines what sched domains the scheduler should
+set up - one sched domain for each element (cpumask_t) in the partition.
+
+The scheduler remembers the currently active sched domain partitions.
+When the scheduler routine partition_sched_domains() is invoked from
+the cpuset code to update these sched domains, it compares the new
+partition requested with the current, and updates its sched domains,
+removing the old and adding the new, for each change.
+
+1.8 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel