author Paul Jackson <pj@sgi.com> 2006-03-24 06:16:03 -0500
committer Linus Torvalds <torvalds@g5.osdl.org> 2006-03-24 10:33:22 -0500
commit 825a46af5ac171f9f41f794a0a00165588ba1589
tree b690fe9d809d7b047f0393097fc79892e1217d98 /Documentation
parent 8a39cc60bfa5a72f32d975729a354daca124f6de
[PATCH] cpuset memory spread basic implementation
This patch provides the implementation and cpuset interface for an
alternative memory allocation policy that can be applied to certain kinds
of memory allocations, such as the page cache (file system buffers) and
some slab caches (such as inode caches).

The policy is called "memory spreading." If enabled, it spreads out these
kinds of memory allocations over all the nodes allowed to a task, instead
of preferring to place them on the node where the task is executing.

All other kinds of allocations, including anonymous pages for a task's
stack and data regions, are not affected by this policy choice, and
continue to be allocated preferring the node local to execution, as
modified by the NUMA mempolicy.

There are two boolean flag files per cpuset that control where the kernel
allocates pages for the file system buffers and related in-kernel data
structures. They are called 'memory_spread_page' and 'memory_spread_slab'.

If the per-cpuset boolean flag file 'memory_spread_page' is set, then the
kernel will spread the file system buffers (page cache) evenly over all the
nodes that the faulting task is allowed to use, instead of preferring to
put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'memory_spread_slab' is set, then the
kernel will spread some file system related slab caches, such as those for
inodes and dentries, evenly over all the nodes that the faulting task is
allowed to use, instead of preferring to put those pages on the node where
the task is running.

The implementation is simple. Setting the cpuset flags 'memory_spread_page'
or 'memory_spread_slab' turns on the per-process flags PF_SPREAD_PAGE or
PF_SPREAD_SLAB, respectively, for each task that is in the cpuset or
subsequently joins that cpuset. In subsequent patches, the page allocation
calls for the affected page cache and slab caches are modified to perform
an inline check for these flags, and if set, a call to a new routine
cpuset_mem_spread_node() returns the node to prefer for the allocation.

The cpuset_mem_spread_node() routine is also simple. It uses the value of a
per-task rotor cpuset_mem_spread_rotor to select the next node in the
current task's mems_allowed to prefer for the allocation.

This policy can provide substantial improvements for jobs that need to
place thread-local data on the corresponding node, but that also need to
access large file system data sets that must be spread across the several
nodes in the job's cpuset in order to fit. Without this patch, especially
for jobs that might have one thread reading in the data set, the memory
allocation across the nodes in the job's cpuset can become very uneven.

A couple of Copyright year ranges are updated as well. And a couple of
email addresses that can be found in the MAINTAINERS file are removed.

Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
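
The flag check and rotor described in the commit message can be sketched
in user-space C. This is only an illustration under stated assumptions, not
the kernel code: struct task, mem_spread_node(), page_cache_node(),
MAX_NODES, the bitmask form of mems_allowed and the flag values are all
hypothetical stand-ins. The real cpuset_mem_spread_node() works on the
current task and its nodemask.

```c
/* Hypothetical stand-ins for this sketch; not the kernel's definitions. */
#define MAX_NODES      8
#define PF_SPREAD_PAGE 0x1u
#define PF_SPREAD_SLAB 0x2u

struct task {
    unsigned int flags;        /* PF_SPREAD_* bits set via the cpuset */
    unsigned int mems_allowed; /* bit n set => node n is allowed */
    int spread_rotor;          /* last node handed out by spreading */
    int local_node;            /* node the task is running on */
};

/*
 * Pick the next allowed node after the rotor, wrapping around, and
 * remember it so the following call continues from there.
 */
int mem_spread_node(struct task *t)
{
    for (int step = 1; step <= MAX_NODES; step++) {
        int node = (t->spread_rotor + step) % MAX_NODES;

        if (t->mems_allowed & (1u << node)) {
            t->spread_rotor = node;
            return node;
        }
    }
    return t->local_node; /* no node allowed: fall back to local */
}

/* Inline-check pattern: spread only when the per-task flag is set. */
int page_cache_node(struct task *t)
{
    if (t->flags & PF_SPREAD_PAGE)
        return mem_spread_node(t);
    return t->local_node;
}
```

With nodes 0, 2 and 3 allowed and the rotor starting at 0, successive calls
to page_cache_node() with PF_SPREAD_PAGE set yield 2, 3, 0, 2, and so on.
With the flag clear, every call yields the local node.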
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/cpusets.txt 76
1 files changed, 74 insertions, 2 deletions
diff --git a/Documentation/cpusets.txt b/Documentation/cpusets.txt
index 30c41459953c..159e2a0c3e80 100644
--- a/Documentation/cpusets.txt
+++ b/Documentation/cpusets.txt
@@ -18,7 +18,8 @@ CONTENTS:
   1.4 What are exclusive cpusets ?
   1.5 What does notify_on_release do ?
   1.6 What is memory_pressure ?
-  1.7 How do I use cpusets ?
+  1.7 What is memory spread ?
+  1.8 How do I use cpusets ?
 2. Usage Examples and Syntax
   2.1 Basic Usage
   2.2 Adding/removing cpus
@@ -317,7 +318,78 @@ the tasks in the cpuset, in units of reclaims attempted per second,
 times 1000.
 
 
-1.7 How do I use cpusets ?
+1.7 What is memory spread ?
+---------------------------
+There are two boolean flag files per cpuset that control where the
+kernel allocates pages for the file system buffers and related
+in-kernel data structures. They are called 'memory_spread_page' and
+'memory_spread_slab'.
+
+If the per-cpuset boolean flag file 'memory_spread_page' is set, then
+the kernel will spread the file system buffers (page cache) evenly
+over all the nodes that the faulting task is allowed to use, instead
+of preferring to put those pages on the node where the task is running.
+
+If the per-cpuset boolean flag file 'memory_spread_slab' is set,
+then the kernel will spread some file system related slab caches,
+such as for inodes and dentries, evenly over all the nodes that the
+faulting task is allowed to use, instead of preferring to put those
+pages on the node where the task is running.
+
+The setting of these flags does not affect anonymous data segment or
+stack segment pages of a task.
+
+By default, both kinds of memory spreading are off, and memory
+pages are allocated on the node local to where the task is running,
+except perhaps as modified by the task's NUMA mempolicy or cpuset
+configuration, so long as sufficient free memory pages are available.
+
+When new cpusets are created, they inherit the memory spread settings
+of their parent.
+
+Setting memory spreading causes allocations for the affected page
+or slab caches to ignore the task's NUMA mempolicy and be spread
+instead. Tasks using mbind() or set_mempolicy() calls to set NUMA
+mempolicies will not notice any change in these calls as a result of
+their containing task's memory spread settings. If memory spreading
+is turned off, then the currently specified NUMA mempolicy once again
+applies to memory page allocations.
+
+Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
+files. By default they contain "0", meaning that the feature is off
+for that cpuset. If a "1" is written to that file, then that turns
+the named feature on.
+
+The implementation is simple.
+
+Setting the flag 'memory_spread_page' turns on a per-process flag
+PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
+joins that cpuset. The page allocation calls for the page cache
+are modified to perform an inline check for this PF_SPREAD_PAGE task
+flag, and if set, a call to a new routine cpuset_mem_spread_node()
+returns the node to prefer for the allocation.
+
+Similarly, setting 'memory_spread_slab' turns on the flag
+PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
+pages from the node returned by cpuset_mem_spread_node().
+
+The cpuset_mem_spread_node() routine is also simple. It uses the
+value of a per-task rotor cpuset_mem_spread_rotor to select the next
+node in the current task's mems_allowed to prefer for the allocation.
+
+This memory placement policy is also known (in other contexts) as
+round-robin or interleave.
+
+This policy can provide substantial improvements for jobs that need
+to place thread-local data on the corresponding node, but that need
+to access large file system data sets that need to be spread across
+the several nodes in the job's cpuset in order to fit. Without this
+policy, especially for jobs that might have one thread reading in the
+data set, the memory allocation across the nodes in the job's cpuset
+can become very uneven.
+
+
+1.8 How do I use cpusets ?
 --------------------------
 
 In order to minimize the impact of cpusets on critical kernel