Diffstat (limited to 'Documentation/vm')
-rw-r--r-- | Documentation/vm/numa_memory_policy.txt | 131 |
1 files changed, 100 insertions, 31 deletions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 1278e685d650..706410dfb9e5 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -135,9 +135,11 @@ most general to most specific: | |||
135 | 135 | ||
136 | Components of Memory Policies | 136 | Components of Memory Policies |
137 | 137 | ||
138 | A Linux memory policy is a tuple consisting of a "mode" and an optional set | 138 | A Linux memory policy consists of a "mode", optional mode flags, and an |
139 | of nodes. The mode determine the behavior of the policy, while the | 139 | optional set of nodes. The mode determines the behavior of the policy, |
140 | optional set of nodes can be viewed as the arguments to the behavior. | 140 | the optional mode flags determine the behavior of the mode, and the |
141 | optional set of nodes can be viewed as the arguments to the policy | ||
142 | behavior. | ||
141 | 143 | ||
142 | Internally, memory policies are implemented by a reference counted | 144 | Internally, memory policies are implemented by a reference counted |
143 | structure, struct mempolicy. Details of this structure will be discussed | 145 | structure, struct mempolicy. Details of this structure will be discussed |
@@ -179,7 +181,8 @@ Components of Memory Policies | |||
179 | on a non-shared region of the address space. However, see | 181 | on a non-shared region of the address space. However, see |
180 | MPOL_PREFERRED below. | 182 | MPOL_PREFERRED below. |
181 | 183 | ||
182 | The Default mode does not use the optional set of nodes. | 184 | It is an error for the set of nodes specified for this policy to |
185 | be non-empty. | ||
183 | 186 | ||
184 | MPOL_BIND: This mode specifies that memory must come from the | 187 | MPOL_BIND: This mode specifies that memory must come from the |
185 | set of nodes specified by the policy. Memory will be allocated from | 188 | set of nodes specified by the policy. Memory will be allocated from |
@@ -226,6 +229,80 @@ Components of Memory Policies | |||
226 | the temporary interleaved system default policy works in this | 229 | the temporary interleaved system default policy works in this |
227 | mode. | 230 | mode. |
228 | 231 | ||
232 | Linux memory policy supports the following optional mode flags: | ||
233 | |||
234 | MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by | ||
235 | the user should not be remapped if the task or VMA's set of allowed | ||
236 | nodes changes after the memory policy has been defined. | ||
237 | |||
238 | Without this flag, anytime a mempolicy is rebound because of a | ||
239 | change in the set of allowed nodes, the node (Preferred) or | ||
240 | nodemask (Bind, Interleave) is remapped to the new set of | ||
241 | allowed nodes. This may result in nodes being used that were | ||
242 | previously undesired. | ||
243 | |||
244 | With this flag, if the user-specified nodes overlap with the | ||
245 | nodes allowed by the task's cpuset, then the memory policy is | ||
246 | applied to their intersection. If the two sets of nodes do not | ||
247 | overlap, the Default policy is used. | ||
248 | |||
249 | For example, consider a task that is attached to a cpuset with | ||
250 | mems 1-3 that sets an Interleave policy over the same set. If | ||
251 | the cpuset's mems change to 3-5, the Interleave will now occur | ||
252 | over nodes 3, 4, and 5. With this flag, however, since only node | ||
253 | 3 is allowed from the user's nodemask, the "interleave" only | ||
254 | occurs over that node. If no nodes from the user's nodemask are | ||
255 | now allowed, the Default behavior is used. | ||
256 | |||
257 | MPOL_F_STATIC_NODES cannot be used with MPOL_F_RELATIVE_NODES. | ||
258 | |||
259 | MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed | ||
260 | by the user will be mapped relative to the task or VMA's set of | ||
261 | allowed nodes. The kernel stores the user-passed nodemask, and if | ||
262 | the set of allowed nodes changes, then that original nodemask will | ||
263 | be remapped relative to the new set of allowed nodes. | ||
264 | |||
265 | Without this flag (and without MPOL_F_STATIC_NODES), anytime a | ||
266 | mempolicy is rebound because of a change in the set of allowed | ||
267 | nodes, the node (Preferred) or nodemask (Bind, Interleave) is | ||
268 | remapped to the new set of allowed nodes. That remap may not | ||
269 | preserve the relative nature of the user's passed nodemask to its | ||
270 | set of allowed nodes upon successive rebinds: a nodemask of | ||
271 | 1,3,5 may be remapped to 7-9 and then to 1-3 if the set of | ||
272 | allowed nodes is restored to its original state. | ||
273 | |||
274 | With this flag, the remap is done so that the node numbers from | ||
275 | the user's passed nodemask are relative to the set of allowed | ||
276 | nodes. In other words, if nodes 0, 2, and 4 are set in the user's | ||
277 | nodemask, the policy will be effected over the first (and in the | ||
278 | Bind or Interleave case, the third and fifth) nodes in the set of | ||
279 | allowed nodes. The nodemask passed by the user represents nodes | ||
280 | relative to the task or VMA's set of allowed nodes. | ||
281 | |||
282 | If the user's nodemask includes nodes that are outside the range | ||
283 | of the new set of allowed nodes (for example, node 5 is set in | ||
284 | the user's nodemask when the set of allowed nodes is only 0-3), | ||
285 | then the remap wraps around to the beginning of the nodemask and, | ||
286 | if not already set, sets the node in the mempolicy nodemask. | ||
287 | |||
288 | For example, consider a task that is attached to a cpuset with | ||
289 | mems 2-5 that sets an Interleave policy over the same set with | ||
290 | MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the | ||
291 | interleave now occurs over nodes 3,5-7. If the cpuset's mems | ||
292 | then change to 0,2-3,5, then the interleave occurs over nodes | ||
293 | 0,2-3,5. | ||
294 | |||
295 | Thanks to the consistent remapping, applications preparing | ||
296 | nodemasks to specify memory policies using this flag should | ||
297 | disregard their current, actual cpuset imposed memory placement | ||
298 | and prepare the nodemask as if they were always located on | ||
299 | memory nodes 0 to N-1, where N is the number of memory nodes the | ||
300 | policy is intended to manage. Let the kernel then remap to the | ||
301 | set of memory nodes allowed by the task's cpuset, as that may | ||
302 | change over time. | ||
303 | |||
304 | MPOL_F_RELATIVE_NODES cannot be used with MPOL_F_STATIC_NODES. | ||
305 | |||
229 | MEMORY POLICY APIs | 306 | MEMORY POLICY APIs |
230 | 307 | ||
231 | Linux supports 3 system calls for controlling memory policy. These APIS | 308 | Linux supports 3 system calls for controlling memory policy. These APIS |
@@ -246,7 +323,9 @@ Set [Task] Memory Policy: | |||
246 | Set's the calling task's "task/process memory policy" to mode | 323 | Set's the calling task's "task/process memory policy" to mode |
247 | specified by the 'mode' argument and the set of nodes defined | 324 | specified by the 'mode' argument and the set of nodes defined |
248 | by 'nmask'. 'nmask' points to a bit mask of node ids containing | 325 | by 'nmask'. 'nmask' points to a bit mask of node ids containing |
249 | at least 'maxnode' ids. | 326 | at least 'maxnode' ids. Optional mode flags may be passed by |
327 | combining the 'mode' argument with the flag (for example: | ||
328 | MPOL_INTERLEAVE | MPOL_F_STATIC_NODES). | ||
250 | 329 | ||
251 | See the set_mempolicy(2) man page for more details | 330 | See the set_mempolicy(2) man page for more details |
252 | 331 | ||
@@ -298,29 +377,19 @@ MEMORY POLICIES AND CPUSETS | |||
298 | Memory policies work within cpusets as described above. For memory policies | 377 | Memory policies work within cpusets as described above. For memory policies |
299 | that require a node or set of nodes, the nodes are restricted to the set of | 378 | that require a node or set of nodes, the nodes are restricted to the set of |
300 | nodes whose memories are allowed by the cpuset constraints. If the nodemask | 379 | nodes whose memories are allowed by the cpuset constraints. If the nodemask |
301 | specified for the policy contains nodes that are not allowed by the cpuset, or | 380 | specified for the policy contains nodes that are not allowed by the cpuset and |
302 | the intersection of the set of nodes specified for the policy and the set of | 381 | MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes |
303 | nodes with memory is the empty set, the policy is considered invalid | 382 | specified for the policy and the set of nodes with memory is used. If the |
304 | and cannot be installed. | 383 | result is the empty set, the policy is considered invalid and cannot be |
305 | 384 | installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped | |
306 | The interaction of memory policies and cpusets can be problematic for a | 385 | onto and folded into the task's set of allowed nodes as previously described. |
307 | couple of reasons: | 386 | |
308 | 387 | The interaction of memory policies and cpusets can be problematic when tasks | |
309 | 1) the memory policy APIs take physical node id's as arguments. As mentioned | 388 | in two cpusets share access to a memory region, such as shared memory segments |
310 | above, it is illegal to specify nodes that are not allowed in the cpuset. | 389 | created by shmget() of mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and |
311 | The application must query the allowed nodes using the get_mempolicy() | 390 | any of the tasks install shared policy on the region, only nodes whose |
312 | API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and | 391 | memories are allowed in both cpusets may be used in the policies. Obtaining |
313 | restrict itself to those nodes. However, the resources available to a | 392 | this information requires "stepping outside" the memory policy APIs to use the |
314 | cpuset can be changed by the system administrator, or a workload manager | 393 | cpuset information and requires that one know in what cpusets other task might |
315 | application, at any time. So, a task may still get errors attempting to | 394 | be attaching to the shared region. Furthermore, if the cpusets' allowed |
316 | specify policy nodes, and must query the allowed memories again. | 395 | memory sets are disjoint, "local" allocation is the only valid policy. |
317 | |||
318 | 2) when tasks in two cpusets share access to a memory region, such as shared | ||
319 | memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and | ||
320 | MAP_SHARED flags, and any of the tasks install shared policy on the region, | ||
321 | only nodes whose memories are allowed in both cpusets may be used in the | ||
322 | policies. Obtaining this information requires "stepping outside" the | ||
323 | memory policy APIs to use the cpuset information and requires that one | ||
324 | know in what cpusets other task might be attaching to the shared region. | ||
325 | Furthermore, if the cpusets' allowed memory sets are disjoint, "local" | ||
326 | allocation is the only valid policy. | ||
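
The two mode flags described in this patch can be modelled in a few lines of Python. This is only a sketch of the behavior as the documentation text describes it, not the kernel implementation: the function names (fold, onto, relative_nodes, static_nodes) are hypothetical, node sets are represented as plain Python sets of node ids, and the wrap-around is assumed to be a modulo over the number of allowed nodes.

```python
def fold(user_mask, size):
    """Wrap node ids into the range 0..size-1 (assumed modulo wrap-around)."""
    return {node % size for node in user_mask}


def onto(positions, allowed):
    """Map position n to the n-th lowest node id in the allowed set."""
    ordered = sorted(allowed)
    return {ordered[p] for p in positions if p < len(ordered)}


def relative_nodes(user_mask, allowed):
    """Model of MPOL_F_RELATIVE_NODES: the user's mask is relative to 'allowed'."""
    if not allowed:
        return set()
    return onto(fold(user_mask, len(allowed)), allowed)


def static_nodes(user_mask, allowed):
    """Model of MPOL_F_STATIC_NODES: the user's mask is never remapped.

    An empty result means the Default policy would be used instead.
    """
    return set(user_mask) & set(allowed)


if __name__ == "__main__":
    # Static: interleave over 1-3, cpuset mems change to 3-5.
    print(sorted(static_nodes({1, 2, 3}, {3, 4, 5})))            # [3]

    # Relative: nodes 0, 2 and 4 select the first, third and fifth allowed node.
    print(sorted(relative_nodes({0, 2, 4}, {4, 5, 6, 7, 8, 9})))  # [4, 6, 8]

    # Relative: user mask 2-5, cpuset mems change to 3-7, then to 0,2-3,5.
    print(sorted(relative_nodes({2, 3, 4, 5}, {3, 4, 5, 6, 7})))  # [3, 5, 6, 7]
    print(sorted(relative_nodes({2, 3, 4, 5}, {0, 2, 3, 5})))     # [0, 2, 3, 5]
```

Because relative_nodes always starts from the stored user mask rather than from the previous result, restoring the original set of allowed nodes restores the original placement, which is the consistency property the MPOL_F_RELATIVE_NODES text relies on.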