1 files changed, 201 insertions, 80 deletions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index dd4986497996..bad16d3f6a47 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -135,77 +135,58 @@ most general to most specific:
 Components of Memory Policies
-    A Linux memory policy is a tuple consisting of a "mode" and an optional set
+    A Linux memory policy consists of a "mode", optional mode flags, and an
-    of nodes.  The mode determine the behavior of the policy, while the
+    optional set of nodes.  The mode determines the behavior of the policy,
-    optional set of nodes can be viewed as the arguments to the behavior.
+    the optional mode flags determine the behavior of the mode, and the
+    optional set of nodes can be viewed as the arguments to the policy
+    behavior.
   Internally, memory policies are implemented by a reference counted
   structure, struct mempolicy.  Details of this structure will be discussed
   in context, below, as required to explain the behavior.
-        Note:  in some functions AND in the struct mempolicy itself, the mode
-        is called "policy".  However, to avoid confusion with the policy tuple,
-        this document will continue to use the term "mode".
   Linux memory policy supports the following 4 behavioral modes:
-        Default Mode--MPOL_DEFAULT:  The behavior specified by this mode is
+        Default Mode--MPOL_DEFAULT:  This mode is only used in the memory
-        context or scope dependent.
+        policy APIs.  Internally, MPOL_DEFAULT is converted to the NULL
+        memory policy in all policy scopes.  Any existing non-default policy
-            As mentioned in the Policy Scope section above, during normal
+        will simply be removed when MPOL_DEFAULT is specified.  As a result,
-            system operation, the System Default Policy is hard coded to
+        MPOL_DEFAULT means "fall back to the next most specific policy scope."
-            contain the Default mode.
-            In this context, default mode means "local" allocation--that is
-            attempt to allocate the page from the node associated with the cpu
-            where the fault occurs.  If the "local" node has no memory, or the
-            node's memory can be exhausted [no free pages available], local
-            allocation will "fallback to"--attempt to allocate pages from--
-            "nearby" nodes, in order of increasing "distance".
-                Implementation detail -- subject to change:  "Fallback" uses
+            For example, a NULL or default task policy will fall back to the
-                a per node list of sibling nodes--called zonelists--built at
+            system default policy.  A NULL or default vma policy will fall
-                boot time, or when nodes or memory are added or removed from
+            back to the task policy.
-                the system [memory hotplug].  These per node zonelist are
-                constructed with nodes in order of increasing distance based
-                on information provided by the platform firmware.
-            When a task/process policy or a shared policy contains the Default
+            When specified in one of the memory policy APIs, the Default mode
-            mode, this also means "local allocation", as described above.
+            does not use the optional set of nodes.
-            In the context of a VMA, Default mode means "fall back to task
+            It is an error for the set of nodes specified for this policy to
-            policy"--which may or may not specify Default mode.  Thus, Default
+            be non-empty.
-            mode can not be counted on to mean local allocation when used
-            on a non-shared region of the address space.  However, see
-            MPOL_PREFERRED below.
-            The Default mode does not use the optional set of nodes.
        MPOL_BIND:  This mode specifies that memory must come from the
-        set of nodes specified by the policy.
+        set of nodes specified by the policy.  Memory will be allocated from
+        the node in the set with sufficient free memory that is closest to
-            The memory policy APIs do not specify an order in which the nodes
+        the node where the allocation takes place.
-            will be searched.  However, unlike "local allocation", the Bind
-            policy does not consider the distance between the nodes.  Rather,
-            allocations will fallback to the nodes specified by the policy in
-            order of numeric node id.  Like everything in Linux, this is subject
-            to change.
        MPOL_PREFERRED:  This mode specifies that the allocation should be
        attempted from the single node specified in the policy.  If that
-        allocation fails, the kernel will search other nodes, exactly as
+        allocation fails, the kernel will search other nodes, in order of
-        it would for a local allocation that started at the preferred node
+        increasing distance from the preferred node based on information
-        in increasing distance from the preferred node.  "Local" allocation
+        provided by the platform firmware.
-        policy can be viewed as a Preferred policy that starts at the node
        containing the cpu where the allocation takes place.
            Internally, the Preferred policy uses a single node--the
-            preferred_node member of struct mempolicy.  A "distinguished
+            preferred_node member of struct mempolicy.  When the internal
-            value of this preferred_node, currently '-1', is interpreted
+            mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and
-            as "the node containing the cpu where the allocation takes
+            the policy is interpreted as local allocation.  "Local" allocation
-            place"--local allocation.  This is the way to specify
+            policy can be viewed as a Preferred policy that starts at the node
-            local allocation for a specific range of addresses--i.e. for
+            containing the cpu where the allocation takes place.
-            VMA policies.
+            It is possible for the user to specify that local allocation is
+            always preferred by passing an empty nodemask with this mode.
+            If an empty nodemask is passed, the policy cannot use the
+            MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described
+            below.
        MPOL_INTERLEAVED:  This mode specifies that page allocations be
        interleaved, on a page granularity, across the nodes specified in
@@ -231,6 +212,154 @@ Components of Memory Policies
            the temporary interleaved system default policy works in this
            mode.
+   Linux memory policy supports the following optional mode flags:
+        MPOL_F_STATIC_NODES:  This flag specifies that the nodemask passed by
+        the user should not be remapped if the task or VMA's set of allowed
+        nodes changes after the memory policy has been defined.
+            Without this flag, anytime a mempolicy is rebound because of a
+            change in the set of allowed nodes, the node (Preferred) or
+            nodemask (Bind, Interleave) is remapped to the new set of
+            allowed nodes.  This may result in nodes being used that were
+            previously undesired.
+            With this flag, if the user-specified nodes overlap with the
+            nodes allowed by the task's cpuset, then the memory policy is
+            applied to their intersection.  If the two sets of nodes do not
+            overlap, the Default policy is used.
+            For example, consider a task that is attached to a cpuset with
+            mems 1-3 that sets an Interleave policy over the same set.  If
+            the cpuset's mems change to 3-5, the Interleave will, without
+            this flag, occur over nodes 3, 4, and 5.  With this flag, since
+            only node 3 is allowed from the user's nodemask, the "interleave"
+            only occurs over that node.  If no nodes from the user's nodemask
+            are now allowed, the Default behavior is used.
+            MPOL_F_STATIC_NODES cannot be combined with the
+            MPOL_F_RELATIVE_NODES flag.  It also cannot be used for
+            MPOL_PREFERRED policies that were created with an empty nodemask
+            (local allocation).
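The static-nodes rule above reduces to a set intersection.  The following
userspace sketch is not kernel code: it models a nodemask as a 64-bit word,
and the helper names mask_range() and static_effective_nodes() are
hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: bitmask with nodes lo..hi (inclusive) set; hi < 64. */
static uint64_t mask_range(unsigned lo, unsigned hi)
{
	uint64_t m = 0;

	assert(lo <= hi && hi < 64);
	for (unsigned n = lo; n <= hi; n++)
		m |= UINT64_C(1) << n;
	return m;
}

/*
 * Hypothetical model of MPOL_F_STATIC_NODES: the user's nodemask is never
 * remapped, and the policy applies to its intersection with the allowed
 * nodes.  A result of 0 stands for "no overlap", the case in which the
 * text says the Default policy is used.
 */
static uint64_t static_effective_nodes(uint64_t user, uint64_t allowed)
{
	return user & allowed;
}
```

Calling static_effective_nodes(mask_range(1, 3), mask_range(3, 5)) leaves
only node 3 set, which matches the cpuset example above.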
+        MPOL_F_RELATIVE_NODES:  This flag specifies that the nodemask passed
+        by the user will be mapped relative to the task or VMA's set of
+        allowed nodes.  The kernel stores the user-passed nodemask, and if
+        the set of allowed nodes changes, then that original nodemask will
+        be remapped relative to the new set of allowed nodes.
+            Without this flag (and without MPOL_F_STATIC_NODES), anytime a
+            mempolicy is rebound because of a change in the set of allowed
+            nodes, the node (Preferred) or nodemask (Bind, Interleave) is
+            remapped to the new set of allowed nodes.  That remap may not
+            preserve the relative nature of the user's passed nodemask to its
+            set of allowed nodes upon successive rebinds: a nodemask of
+            1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
+            allowed nodes is restored to its original state.
+            With this flag, the remap is done so that the node numbers from
+            the user's passed nodemask are relative to the set of allowed
+            nodes.  In other words, if nodes 0, 2, and 4 are set in the user's
+            nodemask, the policy will be effected over the first (and in the
+            Bind or Interleave case, the third and fifth) nodes in the set of
+            allowed nodes.  The nodemask passed by the user represents nodes
+            relative to the task or VMA's set of allowed nodes.
+            If the user's nodemask includes nodes that are outside the range
+            of the new set of allowed nodes (for example, node 5 is set in
+            the user's nodemask when the set of allowed nodes is only 0-3),
+            then the remap wraps around to the beginning of the nodemask and,
+            if not already set, sets the node in the mempolicy nodemask.
+            For example, consider a task that is attached to a cpuset with
+            mems 2-5 that sets an Interleave policy over the same set with
+            MPOL_F_RELATIVE_NODES.  If the cpuset's mems change to 3-7, the
+            interleave now occurs over nodes 3,5-7.  If the cpuset's mems
+            then change to 0,2-3,5, then the interleave occurs over nodes
+            0,2-3,5.
+            Thanks to the consistent remapping, applications preparing
+            nodemasks to specify memory policies using this flag should
+            disregard their current, actual cpuset imposed memory placement
+            and prepare the nodemask as if they were always located on
+            memory nodes 0 to N-1, where N is the number of memory nodes the
+            policy is intended to manage.  Let the kernel then remap to the
+            set of memory nodes allowed by the task's cpuset, as that may
+            change over time.
+            MPOL_F_RELATIVE_NODES cannot be combined with the
+            MPOL_F_STATIC_NODES flag.  It also cannot be used for
+            MPOL_PREFERRED policies that were created with an empty nodemask
+            (local allocation).
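The relative remap, including the wrap-around case, can be sketched in
userspace as follows.  This is an illustrative model and not the kernel's
implementation: it assumes that bit i of the user's nodemask selects the
(i mod N)-th allowed node, where N is the number of allowed nodes, and the
function name relative_effective_nodes() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the MPOL_F_RELATIVE_NODES remap.  Nodemasks are
 * 64-bit words; bit n set means node n is present.
 */
static uint64_t relative_effective_nodes(uint64_t user, uint64_t allowed)
{
	unsigned allowed_node[64];
	unsigned n_allowed = 0;
	uint64_t result = 0;

	/* List the allowed nodes in ascending order. */
	for (unsigned node = 0; node < 64; node++)
		if (allowed & (UINT64_C(1) << node))
			allowed_node[n_allowed++] = node;

	assert(n_allowed > 0);

	/*
	 * Map each relative position onto the matching allowed node.  The
	 * modulo models the wrap-around for positions past the end of the
	 * allowed set.
	 */
	for (unsigned i = 0; i < 64; i++)
		if (user & (UINT64_C(1) << i))
			result |= UINT64_C(1) << allowed_node[i % n_allowed];

	return result;
}
```

With a user nodemask of 0,2,4 and allowed nodes 3-7, the function selects
nodes 3, 5 and 7.  Shrinking the allowed set to 0-3 folds relative node 4
back onto the first allowed node, leaving nodes 0 and 2.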
+MEMORY POLICY REFERENCE COUNTING
+To resolve use/free races, struct mempolicy contains an atomic reference
+count field.  Internal interfaces, mpol_get()/mpol_put(), increment and
+decrement this reference count, respectively.  mpol_put() will only free
+the structure back to the mempolicy kmem cache when the reference count
+goes to zero.
+When a new memory policy is allocated, its reference count is initialized
+to '1', representing the reference held by the task that is installing the
+new policy.  When a pointer to a memory policy structure is stored in another
+structure, another reference is added, as the task's reference will be dropped
+on completion of the policy installation.
+During run-time "usage" of the policy, we attempt to minimize atomic operations
+on the reference count, as this can lead to cache lines bouncing between cpus
+and NUMA nodes.  "Usage" here means one of the following:
+1) querying of the policy, either by the task itself [using the get_mempolicy()
+   API discussed below] or by another task using the /proc/<pid>/numa_maps
+   interface.
+2) examination of the policy to determine the policy mode and associated node
+   or node lists, if any, for page allocation.  This is considered a "hot
+   path".  Note that for MPOL_BIND, the "usage" extends across the entire
+   allocation process, which may sleep during page reclamation, because the
+   BIND policy nodemask is used, by reference, to filter ineligible nodes.
+We can avoid taking an extra reference during the usages listed above as
+follows:
+1) we never need to get/free the system default policy as this is never
+   changed nor freed, once the system is up and running.
+2) for querying the policy, we do not need to take an extra reference on the
+   target task's task policy nor vma policies because we always acquire the
+   task's mm's mmap_sem for read during the query.  The set_mempolicy() and
+   mbind() APIs [see below] always acquire the mmap_sem for write when
+   installing or replacing task or vma policies.  Thus, there is no possibility
+   of a task or thread freeing a policy while another task or thread is
+   querying it.
+3) Page allocation usage of task or vma policy occurs in the fault path where
+   we hold the mmap_sem for read.  Again, because replacing the task or vma
+   policy requires that the mmap_sem be held for write, the policy can't be
+   freed out from under us while we're using it for page allocation.
+4) Shared policies require special consideration.  One task can replace a
+   shared memory policy while another task, with a distinct mmap_sem, is
+   querying or allocating a page based on the policy.  To resolve this
+   potential race, the shared policy infrastructure adds an extra reference
+   to the shared policy during lookup while holding a spin lock on the shared
+   policy management structure.  This requires that we drop this extra
+   reference when we're finished "using" the policy.  We must drop the
+   extra reference on shared policies in the same query/allocation paths
+   used for non-shared policies.  For this reason, shared policies are marked
+   as such, and the extra reference is dropped "conditionally"--i.e., only
+   for shared policies.
+   Because of this extra reference counting, and because we must look up
+   shared policies in a tree structure under spinlock, shared policies are
+   more expensive to use in the page allocation path.  This is especially
+   true for shared policies on shared memory regions shared by tasks running
+   on different NUMA nodes.  This extra overhead can be avoided by always
+   falling back to task or system default policy for shared memory regions,
+   or by prefaulting the entire shared memory region into memory and locking
+   it down.  However, this might not be appropriate for all applications.
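The conditional reference handling described above can be modelled in
userspace with C11 atomics.  This is a sketch under stated assumptions, not
the kernel's code: struct policy and the policy_*() helpers are hypothetical
stand-ins for struct mempolicy and its interfaces, and the locking that
protects the real lookup is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct mempolicy. */
struct policy {
	atomic_int refcnt;	/* models the atomic reference count */
	bool shared;		/* models "shared policies are marked as such" */
};

static struct policy *policy_new(bool shared)
{
	struct policy *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	atomic_init(&p->refcnt, 1);	/* reference held by the installer */
	p->shared = shared;
	return p;
}

static void policy_get(struct policy *p)
{
	atomic_fetch_add(&p->refcnt, 1);
}

/* Drop one reference; returns true if it was the last and p was freed. */
static bool policy_put(struct policy *p)
{
	int old = atomic_fetch_sub(&p->refcnt, 1);

	assert(old > 0);
	if (old == 1) {
		free(p);
		return true;
	}
	return false;
}

/* Lookup takes an extra reference only for shared policies. */
static struct policy *policy_lookup(struct policy *p)
{
	if (p->shared)
		policy_get(p);
	return p;
}

/* Drop the lookup reference "conditionally", only for shared policies. */
static bool policy_cond_put(struct policy *p)
{
	return p->shared ? policy_put(p) : false;
}
```

In this model the query and allocation paths call policy_lookup() and
policy_cond_put() for every policy, and only shared policies pay for the
two atomic operations.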
 MEMORY POLICY APIs
 Linux supports 3 system calls for controlling memory policy.  These APIS
@@ -251,7 +380,9 @@ Set [Task] Memory Policy:
        Set's the calling task's "task/process memory policy" to mode
        specified by the 'mode' argument and the set of nodes defined
        by 'nmask'.  'nmask' points to a bit mask of node ids containing
-        at least 'maxnode' ids.
+        at least 'maxnode' ids.  Optional mode flags may be passed by
+        combining the 'mode' argument with the flag (for example:
+        MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).
        See the set_mempolicy(2) man page for more details
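As a sketch of the flag-passing convention described above, the following
userspace fragment ORs MPOL_F_STATIC_NODES into the mode and invokes the
system call directly.  The wrapper name set_static_interleave() is
hypothetical, and the constants are repeated from the kernel's
<linux/mempolicy.h> so that the fragment builds without libnuma headers.
The call can fail, for example on a kernel without NUMA support or inside
a restricted container, so callers should check the return value and errno.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values repeated from the kernel's <linux/mempolicy.h>. */
#ifndef MPOL_INTERLEAVE
#define MPOL_INTERLEAVE 3
#endif
#ifndef MPOL_F_STATIC_NODES
#define MPOL_F_STATIC_NODES (1 << 15)
#endif

/* The flag occupies bits that the mode value does not use. */
static_assert((MPOL_INTERLEAVE & MPOL_F_STATIC_NODES) == 0,
	      "mode flag must not overlap the mode value");

/*
 * Hypothetical wrapper: request interleaving over the nodes set in
 * 'nodes' (bit n = node n), with the nodemask kept static across cpuset
 * changes.  Returns 0 on success, or -1 with errno set.
 */
static long set_static_interleave(unsigned long nodes)
{
	int mode = MPOL_INTERLEAVE | MPOL_F_STATIC_NODES;
	/* Upper bound on the number of mask bits the kernel examines. */
	unsigned long maxnode = 8 * sizeof(nodes);

	return syscall(SYS_set_mempolicy, mode, &nodes, maxnode);
}
```

For instance, set_static_interleave(0x3UL) asks for interleaving over nodes
0 and 1; whether it succeeds depends on the nodes present and allowed on
the running system.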
@@ -303,29 +434,19 @@ MEMORY POLICIES AND CPUSETS
 Memory policies work within cpusets as described above.  For memory policies
 that require a node or set of nodes, the nodes are restricted to the set of
 nodes whose memories are allowed by the cpuset constraints.  If the nodemask
-specified for the policy contains nodes that are not allowed by the cpuset, or
+specified for the policy contains nodes that are not allowed by the cpuset and
-the intersection of the set of nodes specified for the policy and the set of
+MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
-nodes with memory is the empty set, the policy is considered invalid
+specified for the policy and the set of nodes with memory is used.  If the
-and cannot be installed.
+result is the empty set, the policy is considered invalid and cannot be
+installed.  If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
-The interaction of memory policies and cpusets can be problematic for a
+onto and folded into the task's set of allowed nodes as previously described.
-couple of reasons:
+The interaction of memory policies and cpusets can be problematic when tasks
-1) the memory policy APIs take physical node id's as arguments.  As mentioned
+in two cpusets share access to a memory region, such as shared memory segments
-   above, it is illegal to specify nodes that are not allowed in the cpuset.
+created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags.  If
-   The application must query the allowed nodes using the get_mempolicy()
+any of the tasks install shared policy on the region, only nodes whose
-   API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and
+memories are allowed in both cpusets may be used in the policies.  Obtaining
-   restrict itself to those nodes.  However, the resources available to a
+this information requires "stepping outside" the memory policy APIs to use the
-   cpuset can be changed by the system administrator, or a workload manager
+cpuset information and requires that one know in what cpusets other tasks might
-   application, at any time.  So, a task may still get errors attempting to
+be attaching to the shared region.  Furthermore, if the cpusets' allowed
-   specify policy nodes, and must query the allowed memories again.
+memory sets are disjoint, "local" allocation is the only valid policy.
-2) when tasks in two cpusets share access to a memory region, such as shared
-   memory segments created by shmget() of mmap() with the MAP_ANONYMOUS and
-   MAP_SHARED flags, and any of the tasks install shared policy on the region,
-   only nodes whose memories are allowed in both cpusets may be used in the
-   policies.  Obtaining this information requires "stepping outside" the
-   memory policy APIs to use the cpuset information and requires that one
-   know in what cpusets other task might be attaching to the shared region.
-   Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
-   allocation is the only valid policy.

diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt index dd4986497996..bad16d3f6a47 100644 --- a/Documentation/vm/numa_memory_policy.txt +++ b/Documentation/vm/numa_memory_policy.txt
@@ -135,77 +135,58 @@ most general to most specific:
135		135
136	Components of Memory Policies	136	Components of Memory Policies
137		137
138	A Linux memory policy is a tuple consisting of a "mode" and an optional set	138	A Linux memory policy consists of a "mode", optional mode flags, and an
139	of nodes. The mode determine the behavior of the policy, while the	139	optional set of nodes. The mode determines the behavior of the policy,
140	optional set of nodes can be viewed as the arguments to the behavior.	140	the optional mode flags determine the behavior of the mode, and the
		141	optional set of nodes can be viewed as the arguments to the policy
		142	behavior.
141		143
142	Internally, memory policies are implemented by a reference counted	144	Internally, memory policies are implemented by a reference counted
143	structure, struct mempolicy. Details of this structure will be discussed	145	structure, struct mempolicy. Details of this structure will be discussed
144	in context, below, as required to explain the behavior.	146	in context, below, as required to explain the behavior.
145		147
146	Note: in some functions AND in the struct mempolicy itself, the mode
147	is called "policy". However, to avoid confusion with the policy tuple,
148	this document will continue to use the term "mode".
149
150	Linux memory policy supports the following 4 behavioral modes:	148	Linux memory policy supports the following 4 behavioral modes:
151		149
152	Default Mode--MPOL_DEFAULT: The behavior specified by this mode is	150	Default Mode--MPOL_DEFAULT: This mode is only used in the memory
153	context or scope dependent.	151	policy APIs. Internally, MPOL_DEFAULT is converted to the NULL
154		152	memory policy in all policy scopes. Any existing non-default policy
155	As mentioned in the Policy Scope section above, during normal	153	will simply be removed when MPOL_DEFAULT is specified. As a result,
156	system operation, the System Default Policy is hard coded to	154	MPOL_DEFAULT means "fall back to the next most specific policy scope."
157	contain the Default mode.
158
159	In this context, default mode means "local" allocation--that is
160	attempt to allocate the page from the node associated with the cpu
161	where the fault occurs. If the "local" node has no memory, or the
162	node's memory can be exhausted [no free pages available], local
163	allocation will "fallback to"--attempt to allocate pages from--
164	"nearby" nodes, in order of increasing "distance".
165		155
166	Implementation detail -- subject to change: "Fallback" uses	156	For example, a NULL or default task policy will fall back to the
167	a per node list of sibling nodes--called zonelists--built at	157	system default policy. A NULL or default vma policy will fall
168	boot time, or when nodes or memory are added or removed from	158	back to the task policy.
169	the system [memory hotplug]. These per node zonelist are
170	constructed with nodes in order of increasing distance based
171	on information provided by the platform firmware.
172		159
173	When a task/process policy or a shared policy contains the Default	160	When specified in one of the memory policy APIs, the Default mode
174	mode, this also means "local allocation", as described above.	161	does not use the optional set of nodes.
175		162
176	In the context of a VMA, Default mode means "fall back to task	163	It is an error for the set of nodes specified for this policy to
177	policy"--which may or may not specify Default mode. Thus, Default	164	be non-empty.
178	mode can not be counted on to mean local allocation when used
179	on a non-shared region of the address space. However, see
180	MPOL_PREFERRED below.
181
182	The Default mode does not use the optional set of nodes.
183		165
184	MPOL_BIND: This mode specifies that memory must come from the	166	MPOL_BIND: This mode specifies that memory must come from the
185	set of nodes specified by the policy.	167	set of nodes specified by the policy. Memory will be allocated from
186		168	the node in the set with sufficient free memory that is closest to
187	The memory policy APIs do not specify an order in which the nodes	169	the node where the allocation takes place.
188	will be searched. However, unlike "local allocation", the Bind
189	policy does not consider the distance between the nodes. Rather,
190	allocations will fallback to the nodes specified by the policy in
191	order of numeric node id. Like everything in Linux, this is subject
192	to change.
193		170
194	MPOL_PREFERRED: This mode specifies that the allocation should be	171	MPOL_PREFERRED: This mode specifies that the allocation should be
195	attempted from the single node specified in the policy. If that	172	attempted from the single node specified in the policy. If that
196	allocation fails, the kernel will search other nodes, exactly as	173	allocation fails, the kernel will search other nodes, in order of
197	it would for a local allocation that started at the preferred node	174	increasing distance from the preferred node based on information
198	in increasing distance from the preferred node. "Local" allocation	175	provided by the platform firmware.
199	policy can be viewed as a Preferred policy that starts at the node
200	containing the cpu where the allocation takes place.	176	containing the cpu where the allocation takes place.
201		177
202	Internally, the Preferred policy uses a single node--the	178	Internally, the Preferred policy uses a single node--the
203	preferred_node member of struct mempolicy. A "distinguished	179	preferred_node member of struct mempolicy. When the internal
204	value of this preferred_node, currently '-1', is interpreted	180	mode flag MPOL_F_LOCAL is set, the preferred_node is ignored and
205	as "the node containing the cpu where the allocation takes	181	the policy is interpreted as local allocation. "Local" allocation
206	place"--local allocation. This is the way to specify	182	policy can be viewed as a Preferred policy that starts at the node
207	local allocation for a specific range of addresses--i.e. for	183	containing the cpu where the allocation takes place.
208	VMA policies.	184
		185	It is possible for the user to specify that local allocation is
		186	always preferred by passing an empty nodemask with this mode.
		187	If an empty nodemask is passed, the policy cannot use the
		188	MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags described
		189	below.
209		190
210	MPOL_INTERLEAVED: This mode specifies that page allocations be	191	MPOL_INTERLEAVED: This mode specifies that page allocations be
211	interleaved, on a page granularity, across the nodes specified in	192	interleaved, on a page granularity, across the nodes specified in
@@ -231,6 +212,154 @@ Components of Memory Policies
231	the temporary interleaved system default policy works in this	212	the temporary interleaved system default policy works in this
232	mode.	213	mode.
233		214
		215	Linux memory policy supports the following optional mode flags:
		216
		217	MPOL_F_STATIC_NODES: This flag specifies that the nodemask passed by
		218	the user should not be remapped if the task or VMA's set of allowed
		219	nodes changes after the memory policy has been defined.
		220
		221	Without this flag, anytime a mempolicy is rebound because of a
		222	change in the set of allowed nodes, the node (Preferred) or
		223	nodemask (Bind, Interleave) is remapped to the new set of
		224	allowed nodes. This may result in nodes being used that were
		225	previously undesired.
		226
		227	With this flag, if the user-specified nodes overlap with the
		228	nodes allowed by the task's cpuset, then the memory policy is
		229	applied to their intersection. If the two sets of nodes do not
		230	overlap, the Default policy is used.
		231
		232	For example, consider a task that is attached to a cpuset with
		233	mems 1-3 that sets an Interleave policy over the same set. If
		234	the cpuset's mems change to 3-5, the Interleave will now occur
		235	over nodes 3, 4, and 5. With this flag, however, since only node
		236	3 is allowed from the user's nodemask, the "interleave" only
		237	occurs over that node. If no nodes from the user's nodemask are
		238	now allowed, the Default behavior is used.
		239
		240	MPOL_F_STATIC_NODES cannot be combined with the
		241	MPOL_F_RELATIVE_NODES flag. It also cannot be used for
		242	MPOL_PREFERRED policies that were created with an empty nodemask
		243	(local allocation).
		244
		245	MPOL_F_RELATIVE_NODES: This flag specifies that the nodemask passed
		246	by the user will be mapped relative to the set of the task or VMA's
		247	set of allowed nodes. The kernel stores the user-passed nodemask,
		248	and if the allowed nodes changes, then that original nodemask will
		249	be remapped relative to the new set of allowed nodes.
		250
		251	Without this flag (and without MPOL_F_STATIC_NODES), anytime a
		252	mempolicy is rebound because of a change in the set of allowed
		253	nodes, the node (Preferred) or nodemask (Bind, Interleave) is
		254	remapped to the new set of allowed nodes. That remap may not
		255	preserve the relative nature of the user's passed nodemask to its
		256	set of allowed nodes upon successive rebinds: a nodemask of
		257	1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
		258	allowed nodes is restored to its original state.
		259
		260	With this flag, the remap is done so that the node numbers from
		261	the user's passed nodemask are relative to the set of allowed
		262	nodes. In other words, if nodes 0, 2, and 4 are set in the user's
		263	nodemask, the policy will be effected over the first (and in the
		264	Bind or Interleave case, the third and fifth) nodes in the set of
		265	allowed nodes. The nodemask passed by the user represents nodes
		266	relative to task or VMA's set of allowed nodes.
		267
		268	If the user's nodemask includes nodes that are outside the range
		269	of the new set of allowed nodes (for example, node 5 is set in
		270	the user's nodemask when the set of allowed nodes is only 0-3),
		271	then the remap wraps around to the beginning of the nodemask and,
		272	if not already set, sets the node in the mempolicy nodemask.
		273
		274	For example, consider a task that is attached to a cpuset with
		275	mems 2-5 that sets an Interleave policy over the same set with
		276	MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
		277	interleave now occurs over nodes 3,5-6. If the cpuset's mems
		278	then change to 0,2-3,5, then the interleave occurs over nodes
		279	0,3,5.
		280
		281	Thanks to the consistent remapping, applications preparing
		282	nodemasks to specify memory policies using this flag should
		283	disregard their current, actual cpuset imposed memory placement
		284	and prepare the nodemask as if they were always located on
		285	memory nodes 0 to N-1, where N is the number of memory nodes the
		286	policy is intended to manage. Let the kernel then remap to the
		287	set of memory nodes allowed by the task's cpuset, as that may
		288	change over time.
		289
		290	MPOL_F_RELATIVE_NODES cannot be combined with the
		291	MPOL_F_STATIC_NODES flag. It also cannot be used for
		292	MPOL_PREFERRED policies that were created with an empty nodemask
		293	(local allocation).
		294
		295	MEMORY POLICY REFERENCE COUNTING
		296
		297	To resolve use/free races, struct mempolicy contains an atomic reference
		298	count field. Internal interfaces, mpol_get()/mpol_put() increment and
		299	decrement this reference count, respectively. mpol_put() will only free
		300	the structure back to the mempolicy kmem cache when the reference count
		301	goes to zero.
		302
		303	When a new memory policy is allocated, it's reference count is initialized
		304	to '1', representing the reference held by the task that is installing the
		305	new policy. When a pointer to a memory policy structure is stored in another
		306	structure, another reference is added, as the task's reference will be dropped
		307	on completion of the policy installation.
		308
		309	During run-time "usage" of the policy, we attempt to minimize atomic operations
		310	on the reference count, as this can lead to cache lines bouncing between cpus
		311	and NUMA nodes. "Usage" here means one of the following:
		312
		313	1) querying of the policy, either by the task itself [using the get_mempolicy()
		314	API discussed below] or by another task using the /proc/<pid>/numa_maps
		315	interface.
		316
		317	2) examination of the policy to determine the policy mode and associated node
		318	or node lists, if any, for page allocation. This is considered a "hot
		319	path". Note that for MPOL_BIND, the "usage" extends across the entire
		320	allocation process, which may sleep during page reclamation, because the
		321	BIND policy nodemask is used, by reference, to filter ineligible nodes.
		322
		323	We can avoid taking an extra reference during the usages listed above as
		324	follows:
		325
		326	1) we never need to get/free the system default policy as this is never
		327	changed nor freed, once the system is up and running.
		328
		329	2) for querying the policy, we do not need to take an extra reference on the
		330	target task's task policy nor vma policies because we always acquire the
		331	task's mm's mmap_sem for read during the query. The set_mempolicy() and
		332	mbind() APIs [see below] always acquire the mmap_sem for write when
		333	installing or replacing task or vma policies. Thus, there is no possibility
		334	of a task or thread freeing a policy while another task or thread is
		335	querying it.
		336
		337	3) Page allocation usage of task or vma policy occurs in the fault path where
		338	we hold the mmap_sem for read. Again, because replacing the task or vma
		339	policy requires that the mmap_sem be held for write, the policy can't be
		340	freed out from under us while we're using it for page allocation.
		341
		342	4) Shared policies require special consideration. One task can replace a
		343	shared memory policy while another task, with a distinct mmap_sem, is
		344	querying or allocating a page based on the policy. To resolve this
		345	potential race, the shared policy infrastructure adds an extra reference
		346	to the shared policy during lookup while holding a spin lock on the shared
		347	policy management structure. This requires that we drop this extra
		348	reference when we're finished "using" the policy. We must drop the
		349	extra reference on shared policies in the same query/allocation paths
		350	used for non-shared policies. For this reason, shared policies are marked
		351	as such, and the extra reference is dropped "conditionally"--i.e., only
		352	for shared policies.
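The "conditional" drop described in item 4 can be sketched as follows. This is a user-space illustration only: struct cond_policy, the SKETCH_F_SHARED marker and the helper names are hypothetical, the spin lock held during lookup is omitted, and freeing the structure at a count of zero is left out to keep the example short.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_F_SHARED	0x1	/* hypothetical "this is a shared policy" marker */

/* Hypothetical stand-in for struct mempolicy. */
struct cond_policy {
	atomic_int refcnt;
	unsigned int flags;
};

/*
 * Lookup of a shared policy takes an extra reference. In the kernel
 * this would happen while holding the spin lock on the shared policy
 * management structure; locking is omitted here.
 */
static struct cond_policy *shared_lookup(struct cond_policy *pol)
{
	assert(pol->flags & SKETCH_F_SHARED);
	atomic_fetch_add(&pol->refcnt, 1);
	return pol;
}

/* Only shared policies carry an extra reference that must be dropped. */
static bool policy_needs_put(const struct cond_policy *pol)
{
	return (pol->flags & SKETCH_F_SHARED) != 0;
}

/*
 * Called at the end of a query or allocation path that serves both
 * shared and non-shared policies: drop the reference only if one was
 * taken during lookup.
 */
static void policy_cond_put(struct cond_policy *pol)
{
	if (policy_needs_put(pol))
		atomic_fetch_sub(&pol->refcnt, 1);
}
```

The point of the marker is that a common exit path can call policy_cond_put() unconditionally, while task and vma policies, which are protected by the mmap_sem instead, never pay for the atomic operation.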
		353
		354	Because of this extra reference counting, and because we must look up
		355	shared policies in a tree structure under spinlock, shared policies are
		356	more expensive to use in the page allocation path. This is especially
		357	true for shared policies on shared memory regions shared by tasks running
		358	on different NUMA nodes. This extra overhead can be avoided by always
		359	falling back to task or system default policy for shared memory regions,
		360	or by prefaulting the entire shared memory region into memory and locking
		361	it down. However, this might not be appropriate for all applications.
		362
234	MEMORY POLICY APIs	363	MEMORY POLICY APIs
235		364
236	Linux supports 3 system calls for controlling memory policy. These APIs	365	Linux supports 3 system calls for controlling memory policy. These APIs
@@ -251,7 +380,9 @@ Set [Task] Memory Policy:
251	Sets the calling task's "task/process memory policy" to mode	380	Sets the calling task's "task/process memory policy" to mode
252	specified by the 'mode' argument and the set of nodes defined	381	specified by the 'mode' argument and the set of nodes defined
253	by 'nmask'. 'nmask' points to a bit mask of node ids containing	382	by 'nmask'. 'nmask' points to a bit mask of node ids containing
254	at least 'maxnode' ids.	383	at least 'maxnode' ids. Optional mode flags may be passed by
		384	combining the 'mode' argument with the flag (for example:
		385	MPOL_INTERLEAVE | MPOL_F_STATIC_NODES).
255		386
256	See the set_mempolicy(2) man page for more details	387	See the set_mempolicy(2) man page for more details
257		388
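As a rough illustration of passing a mode flag, the sketch below ORs MPOL_F_STATIC_NODES into the mode before invoking the raw system call. The helper names policy_mode_with_flags() and set_interleave_static() are hypothetical. The constants are defined locally, with the values used by the kernel's uapi header, in case <numaif.h> is not installed; the call goes through syscall(2) rather than the libnuma wrapper for the same reason.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback definitions, matching include/uapi/linux/mempolicy.h. */
#ifndef MPOL_INTERLEAVE
#define MPOL_INTERLEAVE		3
#endif
#ifndef MPOL_F_STATIC_NODES
#define MPOL_F_STATIC_NODES	(1 << 15)
#endif

/* Hypothetical helper: mode flags travel in the 'mode' argument. */
static int policy_mode_with_flags(int mode, int flags)
{
	return mode | flags;
}

/*
 * Hypothetical helper: request interleaving over the nodes in 'nmask',
 * asking the kernel not to remap them when the cpuset changes.
 * Returns 0 on success, or -1 with errno set (for example EINVAL if
 * the nodemask is not acceptable on the running system).
 */
static long set_interleave_static(const unsigned long *nmask,
				  unsigned long maxnode)
{
	int mode = policy_mode_with_flags(MPOL_INTERLEAVE,
					  MPOL_F_STATIC_NODES);

	return syscall(SYS_set_mempolicy, mode, nmask, maxnode);
}
```

A caller might pass a one-word mask such as 0x3 (nodes 0 and 1) with maxnode set to the number of bits in the mask, and should check the return value, since the request can fail on systems without those nodes.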
@@ -303,29 +434,19 @@ MEMORY POLICIES AND CPUSETS
303	Memory policies work within cpusets as described above. For memory policies	434	Memory policies work within cpusets as described above. For memory policies
304	that require a node or set of nodes, the nodes are restricted to the set of	435	that require a node or set of nodes, the nodes are restricted to the set of
305	nodes whose memories are allowed by the cpuset constraints. If the nodemask	436	nodes whose memories are allowed by the cpuset constraints. If the nodemask
306	specified for the policy contains nodes that are not allowed by the cpuset, or	437	specified for the policy contains nodes that are not allowed by the cpuset and
307	the intersection of the set of nodes specified for the policy and the set of	438	MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
308	nodes with memory is the empty set, the policy is considered invalid	439	specified for the policy and the set of nodes with memory is used. If the
309	and cannot be installed.	440	result is the empty set, the policy is considered invalid and cannot be
310		441	installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
311	The interaction of memory policies and cpusets can be problematic for a	442	onto and folded into the task's set of allowed nodes as previously described.
312	couple of reasons:	443
313		444	The interaction of memory policies and cpusets can be problematic: when tasks
314	1) the memory policy APIs take physical node id's as arguments. As mentioned	445	in two cpusets share access to a memory region, such as shared memory segments
315	above, it is illegal to specify nodes that are not allowed in the cpuset.	446	created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags, and
316	The application must query the allowed nodes using the get_mempolicy()	447	any of the tasks install shared policy on the region, only nodes whose
317	API with the MPOL_F_MEMS_ALLOWED flag to determine the allowed nodes and	448	memories are allowed in both cpusets may be used in the policies. Obtaining
318	restrict itself to those nodes. However, the resources available to a	449	this information requires "stepping outside" the memory policy APIs to use the
319	cpuset can be changed by the system administrator, or a workload manager	450	cpuset information and requires that one know in what cpusets other tasks might
320	application, at any time. So, a task may still get errors attempting to	451	be attaching to the shared region. Furthermore, if the cpusets' allowed
321	specify policy nodes, and must query the allowed memories again.	452	memory sets are disjoint, "local" allocation is the only valid policy.
322
323	2) when tasks in two cpusets share access to a memory region, such as shared
324	memory segments created by shmget() or mmap() with the MAP_ANONYMOUS and
325	MAP_SHARED flags, and any of the tasks install shared policy on the region,
326	only nodes whose memories are allowed in both cpusets may be used in the
327	policies. Obtaining this information requires "stepping outside" the
328	memory policy APIs to use the cpuset information and requires that one
329	know in what cpusets other tasks might be attaching to the shared region.
330	Furthermore, if the cpusets' allowed memory sets are disjoint, "local"
331	allocation is the only valid policy.