diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/cpusets.txt | 72 |
1 files changed, 70 insertions, 2 deletions
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt index ad2bb3b3acc1..aa854b9b18cd 100644 --- a/Documentation/cpusets.txt +++ b/Documentation/cpusets.txt | |||
@@ -8,6 +8,7 @@ Portions Copyright (c) 2004-2006 Silicon Graphics, Inc. | |||
8 | Modified by Paul Jackson <pj@sgi.com> | 8 | Modified by Paul Jackson <pj@sgi.com> |
9 | Modified by Christoph Lameter <clameter@sgi.com> | 9 | Modified by Christoph Lameter <clameter@sgi.com> |
10 | Modified by Paul Menage <menage@google.com> | 10 | Modified by Paul Menage <menage@google.com> |
11 | Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com> | ||
11 | 12 | ||
12 | CONTENTS: | 13 | CONTENTS: |
13 | ========= | 14 | ========= |
@@ -20,7 +21,8 @@ CONTENTS: | |||
20 | 1.5 What is memory_pressure ? | 21 | 1.5 What is memory_pressure ? |
21 | 1.6 What is memory spread ? | 22 | 1.6 What is memory spread ? |
22 | 1.7 What is sched_load_balance ? | 23 | 1.7 What is sched_load_balance ? |
23 | 1.8 How do I use cpusets ? | 24 | 1.8 What is sched_relax_domain_level ? |
25 | 1.9 How do I use cpusets ? | ||
24 | 2. Usage Examples and Syntax | 26 | 2. Usage Examples and Syntax |
25 | 2.1 Basic Usage | 27 | 2.1 Basic Usage |
26 | 2.2 Adding/removing cpus | 28 | 2.2 Adding/removing cpus |
@@ -497,7 +499,73 @@ the cpuset code to update these sched domains, it compares the new | |||
497 | partition requested with the current, and updates its sched domains, | 499 | partition requested with the current, and updates its sched domains, |
498 | removing the old and adding the new, for each change. | 500 | removing the old and adding the new, for each change. |
499 | 501 | ||
500 | 1.8 How do I use cpusets ? | 502 | |
503 | 1.8 What is sched_relax_domain_level ? | ||
504 | -------------------------------------- | ||
505 | |||
506 | In sched domain, the scheduler migrates tasks in 2 ways; periodic load | ||
507 | balance on tick, and at time of some schedule events. | ||
508 | |||
509 | When a task is woken up, scheduler try to move the task on idle CPU. | ||
510 | For example, if a task A running on CPU X activates another task B | ||
511 | on the same CPU X, and if CPU Y is X's sibling and performing idle, | ||
512 | then scheduler migrate task B to CPU Y so that task B can start on | ||
513 | CPU Y without waiting task A on CPU X. | ||
514 | |||
515 | And if a CPU run out of tasks in its runqueue, the CPU try to pull | ||
516 | extra tasks from other busy CPUs to help them before it is going to | ||
517 | be idle. | ||
518 | |||
519 | Of course it takes some searching cost to find movable tasks and/or | ||
520 | idle CPUs, the scheduler might not search all CPUs in the domain | ||
521 | everytime. In fact, in some architectures, the searching ranges on | ||
522 | events are limited in the same socket or node where the CPU locates, | ||
523 | while the load balance on tick searchs all. | ||
524 | |||
525 | For example, assume CPU Z is relatively far from CPU X. Even if CPU Z | ||
526 | is idle while CPU X and the siblings are busy, scheduler can't migrate | ||
527 | woken task B from X to Z since it is out of its searching range. | ||
528 | As the result, task B on CPU X need to wait task A or wait load balance | ||
529 | on the next tick. For some applications in special situation, waiting | ||
530 | 1 tick may be too long. | ||
531 | |||
532 | The 'sched_relax_domain_level' file allows you to request changing | ||
533 | this searching range as you like. This file takes int value which | ||
534 | indicates size of searching range in levels ideally as follows, | ||
535 | otherwise initial value -1 that indicates the cpuset has no request. | ||
536 | |||
537 | -1 : no request. use system default or follow request of others. | ||
538 | 0 : no search. | ||
539 | 1 : search siblings (hyperthreads in a core). | ||
540 | 2 : search cores in a package. | ||
541 | 3 : search cpus in a node [= system wide on non-NUMA system] | ||
542 | ( 4 : search nodes in a chunk of node [on NUMA system] ) | ||
543 | ( 5~ : search system wide [on NUMA system]) | ||
544 | |||
545 | This file is per-cpuset and affect the sched domain where the cpuset | ||
546 | belongs to. Therefore if the flag 'sched_load_balance' of a cpuset | ||
547 | is disabled, then 'sched_relax_domain_level' have no effect since | ||
548 | there is no sched domain belonging the cpuset. | ||
549 | |||
550 | If multiple cpusets are overlapping and hence they form a single sched | ||
551 | domain, the largest value among those is used. Be careful, if one | ||
552 | requests 0 and others are -1 then 0 is used. | ||
553 | |||
554 | Note that modifying this file will have both good and bad effects, | ||
555 | and whether it is acceptable or not will be depend on your situation. | ||
556 | Don't modify this file if you are not sure. | ||
557 | |||
558 | If your situation is: | ||
559 | - The migration costs between each cpu can be assumed considerably | ||
560 | small(for you) due to your special application's behavior or | ||
561 | special hardware support for CPU cache etc. | ||
562 | - The searching cost doesn't have impact(for you) or you can make | ||
563 | the searching cost enough small by managing cpuset to compact etc. | ||
564 | - The latency is required even it sacrifices cache hit rate etc. | ||
565 | then increasing 'sched_relax_domain_level' would benefit you. | ||
566 | |||
567 | |||
568 | 1.9 How do I use cpusets ? | ||
501 | -------------------------- | 569 | -------------------------- |
502 | 570 | ||
503 | In order to minimize the impact of cpusets on critical kernel | 571 | In order to minimize the impact of cpusets on critical kernel |