author David Rientjes <rientjes@google.com> 2008-04-28 05:12:31 -0400
committer Linus Torvalds <torvalds@linux-foundation.org> 2008-04-28 11:58:19 -0400
commit 65d66fc02ed9433b957588071b60425b12628e25
tree 8737b2e5d018dc9e9d310d9b032fbeeecd588e62
parent 4c50bc0116cf3cc35e7152d6a8424b4db65f52d6

mempolicy: update NUMA memory policy documentation

Updates Documentation/vm/numa_memory_policy.txt and
Documentation/filesystems/tmpfs.txt to describe optional mempolicy mode
flags.

Cc: Christoph Lameter <clameter@sgi.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Paul Jackson <pj@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

-rw-r--r-- Documentation/filesystems/tmpfs.txt | 12
-rw-r--r-- Documentation/vm/numa_memory_policy.txt | 131
2 files changed, 112 insertions, 31 deletions
diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt
index 145e44086358..222437efd75a 100644
--- a/Documentation/filesystems/tmpfs.txt
+++ b/Documentation/filesystems/tmpfs.txt
@@ -92,6 +92,18 @@ NodeList format is a comma-separated list of decimal numbers and ranges,
 a range being two hyphen-separated decimal numbers, the smallest and
 largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15
 
+NUMA memory allocation policies have optional flags that can be used in
+conjunction with their modes. These optional flags can be specified
+when tmpfs is mounted by appending them to the mode before the NodeList.
+See Documentation/vm/numa_memory_policy.txt for a list of all available
+memory allocation policy mode flags.
+
+        =static         is equivalent to        MPOL_F_STATIC_NODES
+        =relative       is equivalent to        MPOL_F_RELATIVE_NODES
+
+For example, mpol=bind=static:NodeList, is the equivalent of an
+allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES.
+
 Note that trying to mount a tmpfs with an mpol option will fail if the
 running kernel does not support NUMA; and will fail if its nodelist
 specifies a node which is not online. If your system relies on that
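
The added tmpfs text says a flag is appended to the mode, before the NodeList. The sketch below only illustrates that string layout. `build_mpol_option` is a hypothetical helper name, not something the kernel or this patch provides, and the buffer handling is an assumption made for the example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format a tmpfs "mpol=" mount option as
 * mpol=<mode>[=<flag>]:<nodelist>, e.g. "mpol=bind=static:0-3".
 * 'flag' may be NULL or "" when no mode flag is wanted.
 * Returns 0 on success, -1 if the result does not fit in 'buf'.
 */
int build_mpol_option(char *buf, size_t size, const char *mode,
                      const char *flag, const char *nodelist)
{
    int n;

    if (flag != NULL && flag[0] != '\0')
        n = snprintf(buf, size, "mpol=%s=%s:%s", mode, flag, nodelist);
    else
        n = snprintf(buf, size, "mpol=%s:%s", mode, nodelist);

    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

The resulting string would be passed as the data argument of mount(2) for a tmpfs mount. That call needs sufficient privileges and, per the text above, fails on a kernel without NUMA support or when the nodelist names an offline node.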
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 1278e685d650..706410dfb9e5 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -135,9 +135,11 @@ most general to most specific:
 
 Components of Memory Policies
 
-   A Linux memory policy is a tuple consisting of a "mode" and an optional set
-   of nodes. The mode determine the behavior of the policy, while the
-   optional set of nodes can be viewed as the arguments to the behavior.
+   A Linux memory policy consists of a "mode", optional mode flags, and an
+   optional set of nodes. The mode determines the behavior of the policy,
+   the optional mode flags determine the behavior of the mode, and the
+   optional set of nodes can be viewed as the arguments to the policy
+   behavior.
 
    Internally, memory policies are implemented by a reference counted
    structure, struct mempolicy. Details of this structure will be discussed
@@ -179,7 +181,8 @@ Components of Memory Policies
             on a non-shared region of the address space. However, see
             MPOL_PREFERRED below.
 
-            The Default mode does not use the optional set of nodes.
+            It is an error for the set of nodes specified for this policy to
+            be non-empty.
 
         MPOL_BIND: This mode specifies that memory must come from the
         set of nodes specified by the policy. Memory will be allocated from
@@ -226,6 +229,80 @@ Components of Memory Policies
             the temporary interleaved system default policy works in this
             mode.
 
+   Linux memory policy supports the following optional mode flags:
+
+        MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by
+        the user should not be remapped if the task or VMA's set of allowed
+        nodes changes after the memory policy has been defined.
+
+            Without this flag, anytime a mempolicy is rebound because of a
+            change in the set of allowed nodes, the node (Preferred) or
+            nodemask (Bind, Interleave) is remapped to the new set of
+            allowed nodes. This may result in nodes being used that were
+            previously undesired.
+
+            With this flag, if the user-specified nodes overlap with the
+            nodes allowed by the task's cpuset, then the memory policy is
+            applied to their intersection. If the two sets of nodes do not
+            overlap, the Default policy is used.
+
+            For example, consider a task that is attached to a cpuset with
+            mems 1-3 that sets an Interleave policy over the same set. If
+            the cpuset's mems change to 3-5, the Interleave will now occur
+            over nodes 3, 4, and 5. With this flag, however, since only node
+            3 is allowed from the user's nodemask, the "interleave" only
+            occurs over that node. If no nodes from the user's nodemask are
+            now allowed, the Default behavior is used.
+
+            MPOL_F_STATIC_NODES cannot be used with MPOL_F_RELATIVE_NODES.
+
+        MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed
+        by the user will be mapped relative to the set of the task or VMA's
+        set of allowed nodes. The kernel stores the user-passed nodemask,
+        and if the allowed nodes changes, then that original nodemask will
+        be remapped relative to the new set of allowed nodes.
+
+            Without this flag (and without MPOL_F_STATIC_NODES), anytime a
+            mempolicy is rebound because of a change in the set of allowed
+            nodes, the node (Preferred) or nodemask (Bind, Interleave) is
+            remapped to the new set of allowed nodes. That remap may not
+            preserve the relative nature of the user's passed nodemask to its
+            set of allowed nodes upon successive rebinds: a nodemask of
+            1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
+            allowed nodes is restored to its original state.
+
+            With this flag, the remap is done so that the node numbers from
+            the user's passed nodemask are relative to the set of allowed
+            nodes. In other words, if nodes 0, 2, and 4 are set in the user's
+            nodemask, the policy will be effected over the first (and in the
+            Bind or Interleave case, the third and fifth) nodes in the set of
+            allowed nodes. The nodemask passed by the user represents nodes
+            relative to task or VMA's set of allowed nodes.
+
+            If the user's nodemask includes nodes that are outside the range
+            of the new set of allowed nodes (for example, node 5 is set in
+            the user's nodemask when the set of allowed nodes is only 0-3),
+            then the remap wraps around to the beginning of the nodemask and,
+            if not already set, sets the node in the mempolicy nodemask.
+
+            For example, consider a task that is attached to a cpuset with
+            mems 2-5 that sets an Interleave policy over the same set with
+            MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
+            interleave now occurs over nodes 3,5-6. If the cpuset's mems
+            then change to 0,2-3,5, then the interleave occurs over nodes
+            0,3,5.
+
+            Thanks to the consistent remapping, applications preparing
+            nodemasks to specify memory policies using this flag should
+            disregard their current, actual cpuset imposed memory placement
+            and prepare the nodemask as if they were always located on
+            memory nodes 0 to N-1, where N is the number of memory nodes the
+            policy is intended to manage. Let the kernel then remap to the
+            set of memory nodes allowed by the task's cpuset, as that may
+            change over time.
+
+            MPOL_F_RELATIVE_NODES cannot be used with MPOL_F_STATIC_NODES.
+
 MEMORY POLICY APIs
 
 Linux supports 3 system calls for controlling memory policy. These APIS
@@ -246,7 +323,9 @@ Set [Task] Memory Policy:
         Set's the calling task's "task/process memory policy" to mode
         specified by the 'mode' argument and the set of nodes defined
         by 'nmask'. 'nmask' points to a bit mask of node ids containing
-        at least 'maxnode' ids.
+        at least 'maxnode' ids. Optional mode flags may be passed by
+        combining the 'mode' argument with the flag (for example:
+        MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).
 
         See the set_mempolicy(2) man page for more details
 
@@ -298,29 +377,19 @@ MEMORY POLICIES AND CPUSETS
 Memory policies work within cpusets as described above. For memory policies
 that require a node or set of nodes, the nodes are restricted to the set of
 nodes whose memories are allowed by the cpuset constraints. If the nodemask
-specified for the policy contains nodes that are not allowed by the cpuset, or
-the intersection of the set of nodes specified for the policy and the set of
-nodes with memory is the empty set, the policy is considered invalid
-and cannot be installed.
-
-The interaction of memory policies and cpusets can be problematic for a
-couple of reasons:
-
-1) the memory policy APIs take physical node id's as arguments. As mentioned
-   above, it is illegal to specify nodes that are not allowed in the cpuset.
-   The application must query the allowed nodes using the get_mempolicy()
-   API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and
-   restrict itself to those nodes. However, the resources available to a
-   cpuset can be changed by the system administrator, or a workload manager
-   application, at any time. So, a task may still get errors attempting to
-   specify policy nodes, and must query the allowed memories again.
-
-2) when tasks in two cpusets share access to a memory region, such as shared
-   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
-   MAP_SHARED flags, and any of the tasks install shared policy on the region,
-   only nodes whose memories are allowed in both cpusets may be used in the
-   policies. Obtaining this information requires "stepping outside" the
-   memory policy APIs to use the cpuset information and requires that one
-   know in what cpusets other task might be attaching to the shared region.
-   Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
-   allocation is the only valid policy.
+specified for the policy contains nodes that are not allowed by the cpuset and
+MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
+specified for the policy and the set of nodes with memory is used. If the
+result is the empty set, the policy is considered invalid and cannot be
+installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
+onto and folded into the task's set of allowed nodes as previously described.
+
+The interaction of memory policies and cpusets can be problematic when tasks
+in two cpusets share access to a memory region, such as shared memory segments
+created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
+any of the tasks install shared policy on the region, only nodes whose
+memories are allowed in both cpusets may be used in the policies. Obtaining
+this information requires "stepping outside" the memory policy APIs to use the
+cpuset information and requires that one know in what cpusets other task might
+be attaching to the shared region. Furthermore, if the cpusets' allowed
+memory sets are disjoint, "local" allocation is the only valid policy.
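
The two mode flags documented in this patch can be illustrated with a small user-space model. This is a sketch built only from the prose above (intersection for MPOL_F_STATIC_NODES; "fold" then "map onto" for MPOL_F_RELATIVE_NODES). It is not the kernel's code. The function names and the use of a single 64-bit word as a nodemask are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Count the set bits (nodes) in a 64-bit node mask. */
static int mask_weight(uint64_t mask)
{
    int n = 0;

    for (; mask != 0; mask &= mask - 1)
        n++;
    return n;
}

/* Fold: every set bit b moves to b % size (the wrap-around step). */
static uint64_t mask_fold(uint64_t mask, int size)
{
    uint64_t out = 0;

    for (int bit = 0; bit < 64; bit++)
        if (mask & (UINT64_C(1) << bit))
            out |= UINT64_C(1) << (bit % size);
    return out;
}

/* Onto: bit n of 'mask' selects the n-th (0-based) set bit of 'allowed'. */
static uint64_t mask_onto(uint64_t mask, uint64_t allowed)
{
    uint64_t out = 0;
    int n = 0;

    for (int bit = 0; bit < 64; bit++) {
        if (!(allowed & (UINT64_C(1) << bit)))
            continue;
        if (mask & (UINT64_C(1) << n))
            out |= UINT64_C(1) << bit;
        n++;
    }
    return out;
}

/* Model of MPOL_F_RELATIVE_NODES: user bits are positions within 'allowed'. */
uint64_t relative_nodes(uint64_t user, uint64_t allowed)
{
    assert(allowed != 0); /* an empty allowed set has nothing to map onto */
    return mask_onto(mask_fold(user, mask_weight(allowed)), allowed);
}

/* Model of MPOL_F_STATIC_NODES: plain intersection, 0 means "use Default". */
uint64_t static_nodes(uint64_t user, uint64_t allowed)
{
    return user & allowed;
}
```

With a user mask of nodes 0, 2 and 4 and allowed nodes 3-7, `relative_nodes` selects the first, third and fifth allowed nodes (3, 5 and 7). With node 5 set and allowed nodes 0-3, the fold wraps 5 around to position 1. For the static case, a user mask of 1-3 against allowed nodes 3-5 leaves only node 3, as in the MPOL_F_STATIC_NODES example in the patch.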