Diffstat (limited to 'Documentation/cpusets.txt')
-rw-r--r-- | Documentation/cpusets.txt | 76 |
1 files changed, 74 insertions, 2 deletions
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index 30c41459953c..159e2a0c3e80 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -18,7 +18,8 @@ CONTENTS: | |||
18 | 1.4 What are exclusive cpusets ? | 18 | 1.4 What are exclusive cpusets ? |
19 | 1.5 What does notify_on_release do ? | 19 | 1.5 What does notify_on_release do ? |
20 | 1.6 What is memory_pressure ? | 20 | 1.6 What is memory_pressure ? |
21 | 1.7 How do I use cpusets ? | 21 | 1.7 What is memory spread ? |
22 | 1.8 How do I use cpusets ? | ||
22 | 2. Usage Examples and Syntax | 23 | 2. Usage Examples and Syntax |
23 | 2.1 Basic Usage | 24 | 2.1 Basic Usage |
24 | 2.2 Adding/removing cpus | 25 | 2.2 Adding/removing cpus |
@@ -317,7 +318,78 @@ the tasks in the cpuset, in units of reclaims attempted per second, | |||
317 | times 1000. | 318 | times 1000. |
318 | 319 | ||
319 | 320 | ||
320 | 1.7 How do I use cpusets ? | 321 | 1.7 What is memory spread ? |
322 | --------------------------- | ||
323 | There are two boolean flag files per cpuset that control where the | ||
324 | kernel allocates pages for the file system buffers and related | |
325 | in-kernel data structures. They are called 'memory_spread_page' and | |
326 | 'memory_spread_slab'. | ||
327 | |||
328 | If the per-cpuset boolean flag file 'memory_spread_page' is set, then | ||
329 | the kernel will spread the file system buffers (page cache) evenly | ||
330 | over all the nodes that the faulting task is allowed to use, instead | ||
331 | of preferring to put those pages on the node where the task is running. | ||
332 | |||
333 | If the per-cpuset boolean flag file 'memory_spread_slab' is set, | ||
334 | then the kernel will spread some file system related slab caches, | ||
335 | such as for inodes and dentries, evenly over all the nodes that the | |
336 | faulting task is allowed to use, instead of preferring to put those | ||
337 | pages on the node where the task is running. | ||
338 | |||
339 | The setting of these flags does not affect anonymous data segment or | ||
340 | stack segment pages of a task. | ||
341 | |||
342 | By default, both kinds of memory spreading are off, and memory | ||
343 | pages are allocated on the node local to where the task is running, | ||
344 | except perhaps as modified by the task's NUMA mempolicy or cpuset | |
345 | configuration, so long as sufficient free memory pages are available. | ||
346 | |||
347 | When new cpusets are created, they inherit the memory spread settings | ||
348 | of their parent. | ||
349 | |||
350 | Setting memory spreading causes allocations for the affected page | ||
351 | or slab caches to ignore the task's NUMA mempolicy and be spread | |
352 | instead. Tasks using mbind() or set_mempolicy() calls to set NUMA | ||
353 | mempolicies will not notice any change in these calls as a result of | ||
354 | their containing cpuset's memory spread settings. If memory spreading | |
355 | is turned off, then the currently specified NUMA mempolicy once again | ||
356 | applies to memory page allocations. | ||
357 | |||
358 | Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag | ||
359 | files. By default they contain "0", meaning that the feature is off | ||
360 | for that cpuset. If a "1" is written to that file, then that turns | ||
361 | the named feature on. | ||
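
As a rough illustration of the flag-file interface described above, the following C sketch writes "1" or "0" to a flag file given by path. The helper name set_cpuset_flag() is hypothetical and not part of any kernel or library API, and the example path in the usage note assumes the cpuset file system is mounted at /dev/cpuset with a cpuset named "mycpuset".

```c
#include <stdio.h>

/*
 * Hypothetical helper (illustration only): write "1" or "0" to a cpuset
 * boolean flag file such as memory_spread_page or memory_spread_slab.
 * Returns 0 on success, -1 if the file could not be opened or written.
 */
static int set_cpuset_flag(const char *flag_path, int enable)
{
    FILE *f = fopen(flag_path, "w");
    if (f == NULL)
        return -1;

    /* fputs() returns EOF on a write error */
    int ok = fputs(enable ? "1" : "0", f) >= 0;

    /* fclose() flushes the buffer, so a failure here is also a write failure */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

A call such as set_cpuset_flag("/dev/cpuset/mycpuset/memory_spread_page", 1) would then turn page cache spreading on for that (assumed) cpuset, provided the caller has permission to write the file.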
362 | |||
363 | The implementation is simple. | ||
364 | |||
365 | Setting the flag 'memory_spread_page' turns on a per-process flag | ||
366 | PF_SPREAD_PAGE for each task that is in that cpuset or subsequently | ||
367 | joins that cpuset. The page allocation calls for the page cache | ||
368 | are modified to perform an inline check for this PF_SPREAD_PAGE task | |
369 | flag, and if set, a call to a new routine cpuset_mem_spread_node() | ||
370 | returns the node to prefer for the allocation. | ||
371 | |||
372 | Similarly, setting 'memory_spread_slab' turns on the flag | |
373 | PF_SPREAD_SLAB, and appropriately marked slab caches will allocate | ||
374 | pages from the node returned by cpuset_mem_spread_node(). | ||
375 | |||
376 | The cpuset_mem_spread_node() routine is also simple. It uses the | ||
377 | value of a per-task rotor cpuset_mem_spread_rotor to select the next | ||
378 | node in the current task's mems_allowed to prefer for the allocation. | |
379 | |||
380 | This memory placement policy is also known (in other contexts) as | ||
381 | round-robin or interleave. | ||
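
The rotor-based, round-robin selection described above can be modeled in user space roughly as follows. This is a simplified sketch, not the kernel's cpuset_mem_spread_node(): the struct task_model type, the MAX_NODES limit, and the 64-bit mask standing in for mems_allowed are all assumptions made for illustration.

```c
#include <assert.h>

#define MAX_NODES 64  /* assumed limit for this model only */

/* Hypothetical user-space model of a task; not the kernel's task_struct. */
struct task_model {
    unsigned long long mems_allowed; /* bit n set => node n is allowed */
    int spread_rotor;                /* node chosen by the previous call */
};

/*
 * Advance the per-task rotor to the next allowed node, wrapping around
 * at MAX_NODES, and return that node as the one to prefer.
 */
static int model_mem_spread_node(struct task_model *t)
{
    /* An empty mask would make the loop below spin forever. */
    assert(t->mems_allowed != 0);

    int node = t->spread_rotor;
    do {
        node = (node + 1) % MAX_NODES;
    } while (!((t->mems_allowed >> node) & 1ULL));

    t->spread_rotor = node;
    return node;
}
```

Because the rotor is kept per task, each task cycles through its own allowed nodes independently, so successive page cache or slab allocations by one task land on successive nodes rather than piling up on the node where the task happens to run.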
382 | |||
383 | This policy can provide substantial improvements for jobs that need | ||
384 | to place thread local data on the corresponding node, but that need | ||
385 | to access large file system data sets that need to be spread across | ||
386 | the several nodes in the job's cpuset in order to fit. Without this | |
387 | policy, especially for jobs that might have one thread reading in the | ||
388 | data set, the memory allocation across the nodes in the job's cpuset | |
389 | can become very uneven. | ||
390 | |||
391 | |||
392 | 1.8 How do I use cpusets ? | ||
321 | -------------------------- | 393 | -------------------------- |
322 | 394 | ||
323 | In order to minimize the impact of cpusets on critical kernel | 395 | In order to minimize the impact of cpusets on critical kernel |