mempolicy: rework mempolicy Reference Counting [yet again]

After further discussion with Christoph Lameter, it has become clear that my
earlier attempts to clean up the mempolicy reference counting were a bit of
overkill in some areas, resulting in superfluous ref/unref in what are
usually fast paths.  In other areas, further inspection reveals that I
botched the unref for interleave policies.

A separate patch, suitable for upstream/stable trees, fixes up the known
errors in the previous attempt to fix reference counting.

This patch reworks the memory policy reference counting and, one hopes,
simplifies the code.  Maybe I'll get it right this time.

See the update to the numa_memory_policy.txt document for a discussion of
memory policy reference counting that motivates this patch.

Summary:

Lookup of a mempolicy, based on (vma, address), need only add a reference
for a shared policy, and we need only unref the policy when finished if it
is a shared policy.  So, this patch backs out all of the unneeded extra
reference counting added by my previous attempt.  It then unrefs only shared
policies when we're finished with them, using the mpol_cond_put()
[conditional put] helper function introduced by this patch.

Note that shmem_swapin() calls read_swap_cache_async() with a dummy vma
containing just the policy.  read_swap_cache_async() can call
alloc_page_vma() multiple times, so we can't let alloc_page_vma() unref the
shared policy in this case.  To avoid this, we make a copy of any non-null
shared policy and remove the MPOL_F_SHARED flag from the copy.  This copy
occurs before reading a page [or multiple pages] from swap, so the overhead
should not be an issue here.

I introduced a new static inline function "mpol_cond_copy()" to copy the
shared policy to an on-stack policy and remove the flags that would require
a conditional free.  The current implementation of mpol_cond_copy() assumes
that the struct mempolicy contains no pointers to dynamically allocated
structures that must be duplicated or reference counted during copy.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andi Kleen <ak@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Lee Schermerhorn <lee.schermerhorn@hp.com> 2008-04-28 05:13:16 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2008-04-28 11:58:24 -0400
commit: 52cd3b074050dd664380b5e8cfc85d4a6ed8ad48 (patch)
tree: fcfcf55c0e81376ea34919fab26e29bedd7f3b88 /Documentation/vm
parent: a6020ed759404372e8be2b276e85e51735472cc9 (diff)
1 files changed, 68 insertions, 0 deletions
diff --git a/Documentation/vm/numa_memory_policy.txt b/Documentation/vm/numa_memory_policy.txt
index 27b9507a3769..6719d642653f 100644
--- a/Documentation/vm/numa_memory_policy.txt
+++ b/Documentation/vm/numa_memory_policy.txt
@@ -311,6 +311,74 @@ Components of Memory Policies
            MPOL_PREFERRED policies that were created with an empty nodemask
            (local allocation).
 
+MEMORY POLICY REFERENCE COUNTING
+
+To resolve use/free races, struct mempolicy contains an atomic reference
+count field.  Internal interfaces, mpol_get()/mpol_put(), increment and
+decrement this reference count, respectively.  mpol_put() will only free
+the structure back to the mempolicy kmem cache when the reference count
+goes to zero.
+
+When a new memory policy is allocated, its reference count is initialized
+to '1', representing the reference held by the task that is installing the
+new policy.  When a pointer to a memory policy structure is stored in another
+structure, another reference is added, as the task's reference will be dropped
+on completion of the policy installation.
+
+During run-time "usage" of the policy, we attempt to minimize atomic operations
+on the reference count, as this can lead to cache lines bouncing between CPUs
+and NUMA nodes.  "Usage" here means one of the following:
+
+1) querying of the policy, either by the task itself [using the get_mempolicy()
+   API discussed below] or by another task using the /proc/<pid>/numa_maps
+   interface.
+
+2) examination of the policy to determine the policy mode and associated node
+   or node lists, if any, for page allocation.  This is considered a "hot
+   path".  Note that for MPOL_BIND, the "usage" extends across the entire
+   allocation process, which may sleep during page reclamation, because the
+   BIND policy nodemask is used, by reference, to filter ineligible nodes.
+
+We can avoid taking an extra reference during the usages listed above as
+follows:
+
+1) we never need to get/free the system default policy as this is never
+   changed nor freed, once the system is up and running.
+
+2) for querying the policy, we do not need to take an extra reference on the
+   target task's task policy nor vma policies because we always acquire the
+   task's mm's mmap_sem for read during the query.  The set_mempolicy() and
+   mbind() APIs [see below] always acquire the mmap_sem for write when
+   installing or replacing task or vma policies.  Thus, there is no possibility
+   of a task or thread freeing a policy while another task or thread is
+   querying it.
+
+3) Page allocation usage of task or vma policy occurs in the fault path where
+   we hold the mmap_sem for read.  Again, because replacing the task or vma
+   policy requires that the mmap_sem be held for write, the policy can't be
+   freed out from under us while we're using it for page allocation.
+
+4) Shared policies require special consideration.  One task can replace a
+   shared memory policy while another task, with a distinct mmap_sem, is
+   querying or allocating a page based on the policy.  To resolve this
+   potential race, the shared policy infrastructure adds an extra reference
+   to the shared policy during lookup while holding a spin lock on the shared
+   policy management structure.  This requires that we drop this extra
+   reference when we're finished "using" the policy.  We must drop the
+   extra reference on shared policies in the same query/allocation paths
+   used for non-shared policies.  For this reason, shared policies are marked
+   as such, and the extra reference is dropped "conditionally"--i.e., only
+   for shared policies.
+
+   Because of this extra reference counting, and because we must look up
+   shared policies in a tree structure under spinlock, shared policies are
+   more expensive to use in the page allocation path.  This is especially
+   true for shared policies on shared memory regions shared by tasks running
+   on different NUMA nodes.  This extra overhead can be avoided by always
+   falling back to task or system default policy for shared memory regions,
+   or by prefaulting the entire shared memory region into memory and locking
+   it down.  However, this might not be appropriate for all applications.
+
 MEMORY POLICY APIs
 
 Linux supports 3 system calls for controlling memory policy.  These APIS
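
Illustrative sketch (not part of the patch): the conditional put and
conditional copy described in the commit message can be modelled in userspace
C roughly as follows.  This is a simplified stand-in under stated assumptions,
not the kernel implementation.  Only the names mpol_get(), mpol_put(),
mpol_cond_put(), mpol_cond_copy() and MPOL_F_SHARED come from the patch
description; the struct layout, the flag value, and the mpol_new(),
policy_lookup() and freed_count helpers are hypothetical and exist only to
make the example self-contained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical flag value; only the flag's name appears in the patch. */
#define MPOL_F_SHARED 0x1u

/* Simplified userspace stand-in for struct mempolicy (assumed layout). */
struct mempolicy {
	atomic_int refcnt;	/* protects against use/free races */
	unsigned int flags;	/* MPOL_F_SHARED marks a shared policy */
	int mode;		/* placeholder for the policy mode */
};

static int freed_count;	/* illustration only: counts freed policies */

/* Hypothetical allocator: a new policy starts with one reference. */
static struct mempolicy *mpol_new(int mode, unsigned int flags)
{
	struct mempolicy *pol = malloc(sizeof(*pol));

	if (!pol)
		return NULL;
	atomic_init(&pol->refcnt, 1);
	pol->flags = flags;
	pol->mode = mode;
	return pol;
}

static void mpol_get(struct mempolicy *pol)
{
	if (pol)
		atomic_fetch_add(&pol->refcnt, 1);
}

/* Drop one reference; free the policy when the count reaches zero. */
static void mpol_put(struct mempolicy *pol)
{
	int old;

	if (!pol)
		return;
	old = atomic_fetch_sub(&pol->refcnt, 1);
	assert(old > 0);
	if (old == 1) {
		free(pol);
		freed_count++;
	}
}

/* Hypothetical lookup: only shared policies gain an extra reference. */
static struct mempolicy *policy_lookup(struct mempolicy *pol)
{
	if (pol && (pol->flags & MPOL_F_SHARED))
		mpol_get(pol);
	return pol;
}

/* Conditional put: unref only policies marked as shared. */
static void mpol_cond_put(struct mempolicy *pol)
{
	if (pol && (pol->flags & MPOL_F_SHARED))
		mpol_put(pol);
}

/*
 * Conditional copy: for a shared policy, copy it into caller-provided
 * storage, clear MPOL_F_SHARED on the copy, and drop the lookup reference
 * on the original.  Non-shared policies are returned unchanged.  This
 * assumes the struct holds no pointers that need duplicating.
 */
static struct mempolicy *mpol_cond_copy(struct mempolicy *tompol,
					struct mempolicy *frompol)
{
	if (!frompol || !(frompol->flags & MPOL_F_SHARED))
		return frompol;
	atomic_init(&tompol->refcnt, 1);
	tompol->mode = frompol->mode;
	tompol->flags = frompol->flags & ~MPOL_F_SHARED;
	mpol_put(frompol);
	return tompol;
}
```

In this model a caller pairs policy_lookup() with mpol_cond_put(), which is a
no-op for task and vma policies.  A caller that must pass the policy to
several allocations, as in the shmem_swapin() case, would instead call
mpol_cond_copy() once with an on-stack struct mempolicy; later conditional
puts on the copy then do nothing because the copy no longer carries
MPOL_F_SHARED.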