hugetlb: introduce nr_overcommit_hugepages sysctl

While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:

1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.

2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.

To ease the administration, and to help make way for per-node control of
the static & dynamic hugepage pool, I added a separate sysctl,
nr_overcommit_hugepages. This value serves as a high watermark for the
overall hugepage pool, while nr_hugepages serves as a low watermark. The
boolean sysctl can then be removed, as the condition

nr_overcommit_hugepages > 0

indicates the same administrative setting as

hugetlb_dynamic_pool == 1

Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.

A few caveats:

1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has been allocated. Another process could
then try to grow the pool, fail to convert a surplus huge page to a
normal huge page, and instead allocate a fresh huge page. I believe this
is benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.

2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls is increased
sufficiently, or the surplus huge pages go out of use and are freed.

Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.

Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
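
The second caveat in the message above can be illustrated with a small userspace model. This is only a sketch, not kernel code: struct pool_model and the three helper functions are hypothetical names invented for illustration, and the shrink helper assumes every page is in use, so none can be freed immediately. Only the comparison in may_add_surplus() is taken from the check this patch adds to alloc_buddy_huge_page().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical snapshot of the global pool counters. The field names echo
 * the kernel variables touched by the patch, but this struct does not
 * exist in mm/hugetlb.c.
 */
struct pool_model {
	unsigned long nr_huge_pages;       /* all huge pages in the pool */
	unsigned long surplus_huge_pages;  /* pages above the static pool */
	unsigned long nr_overcommit;       /* models nr_overcommit_hugepages */
};

/* Static (persistent) pages are the total minus the surplus. */
static unsigned long persistent_pages(const struct pool_model *p)
{
	assert(p->surplus_huge_pages <= p->nr_huge_pages);
	return p->nr_huge_pages - p->surplus_huge_pages;
}

/* Same comparison as the check added to alloc_buddy_huge_page(). */
static bool may_add_surplus(const struct pool_model *p)
{
	return p->surplus_huge_pages < p->nr_overcommit;
}

/*
 * Model of shrinking the static pool to `count` pages while every page is
 * in use (an assumption of this sketch): the excess pages are moved to
 * surplus state regardless of the overcommit value.
 */
static void shrink_static_pool_in_use(struct pool_model *p,
				      unsigned long count)
{
	unsigned long persistent = persistent_pages(p);

	if (count < persistent)
		p->surplus_huge_pages += persistent - count;
}
```

Starting from 10 static pages with an overcommit value of 2, shrinking the static pool to 6 leaves 4 surplus pages in this model. That exceeds the overcommit value, so may_add_surplus() reports false until the overcommit value is raised above 4 or the surplus pages are freed.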
author: Nishanth Aravamudan <nacc@us.ibm.com> 2007-12-17 19:20:12 -0500
committer: Linus Torvalds <torvalds@woody.linux-foundation.org> 2007-12-17 22:28:17 -0500
commit: d1c3fb1f8f29c41b0d098d7cfb3c32939043631f (patch)
tree: b91983662da7ec4c28ac0788e835c2d51eea20e1 /mm/hugetlb.c
parent: 7a3f595cc8298df14a7c71b0d876bafd8e9e1cbf (diff)
1 files changed, 61 insertions, 6 deletions
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6f978218c2c..3a790651475 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -32,6 +32,7 @@ static unsigned int surplus_huge_pages_node[MAX_NUMNODES];
 static gfp_t htlb_alloc_mask = GFP_HIGHUSER;
 unsigned long hugepages_treat_as_movable;
 int hugetlb_dynamic_pool;
+unsigned long nr_overcommit_huge_pages;
 static int hugetlb_next_nid;
 /*
@@ -227,22 +228,62 @@ static struct page *alloc_buddy_huge_page(struct vm_area_struct *vma,
                                                unsigned long address)
 {
        struct page *page;
+        unsigned int nid;
        /* Check if the dynamic pool is enabled */
        if (!hugetlb_dynamic_pool)
                return NULL;
+        /*
+         * Assume we will successfully allocate the surplus page to
+         * prevent racing processes from causing the surplus to exceed
+         * overcommit
+         *
+         * This however introduces a different race, where a process B
+         * tries to grow the static hugepage pool while alloc_pages() is
+         * called by process A. B will only examine the per-node
+         * counters in determining if surplus huge pages can be
+         * converted to normal huge pages in adjust_pool_surplus(). A
+         * won't be able to increment the per-node counter, until the
+         * lock is dropped by B, but B doesn't drop hugetlb_lock until
+         * no more huge pages can be converted from surplus to normal
+         * state (and doesn't try to convert again). Thus, we have a
+         * case where a surplus huge page exists, the pool is grown, and
+         * the surplus huge page still exists after, even though it
+         * should just have been converted to a normal huge page. This
+         * does not leak memory, though, as the hugepage will be freed
+         * once it is out of use. It also does not allow the counters to
+         * go out of whack in adjust_pool_surplus() as we don't modify
+         * the node values until we've gotten the hugepage and only the
+         * per-node value is checked there.
+         */
+        spin_lock(&hugetlb_lock);
+        if (surplus_huge_pages >= nr_overcommit_huge_pages) {
+                spin_unlock(&hugetlb_lock);
+                return NULL;
+        } else {
+                nr_huge_pages++;
+                surplus_huge_pages++;
+        }
+        spin_unlock(&hugetlb_lock);
        page = alloc_pages(htlb_alloc_mask|__GFP_COMP|__GFP_NOWARN,
                                        HUGETLB_PAGE_ORDER);
+        spin_lock(&hugetlb_lock);
        if (page) {
+                nid = page_to_nid(page);
                set_compound_page_dtor(page, free_huge_page);
-                spin_lock(&hugetlb_lock);
-                nr_huge_pages++;
-                nr_huge_pages_node[page_to_nid(page)]++;
-                surplus_huge_pages++;
-                surplus_huge_pages_node[page_to_nid(page)]++;
-                spin_unlock(&hugetlb_lock);
+                /*
+                 * We incremented the global counters already
+                 */
+                nr_huge_pages_node[nid]++;
+                surplus_huge_pages_node[nid]++;
+        } else {
+        } else {
+                nr_huge_pages--;
+                surplus_huge_pages--;
        }
+        spin_unlock(&hugetlb_lock);
        return page;
 }
@@ -481,6 +522,12 @@ static unsigned long set_max_huge_pages(unsigned long count)
         * Increase the pool size
         * First take pages out of surplus state.  Then make up the
         * remaining difference by allocating fresh huge pages.
+         *
+         * We might race with alloc_buddy_huge_page() here and be unable
+         * to convert a surplus huge page to a normal huge page. That is
+         * not critical, though, it just means the overall size of the
+         * pool might be one hugepage larger than it needs to be, but
+         * within all the constraints specified by the sysctls.
         */
        spin_lock(&hugetlb_lock);
        while (surplus_huge_pages && count > persistent_huge_pages) {
@@ -509,6 +556,14 @@ static unsigned long set_max_huge_pages(unsigned long count)
         * to keep enough around to satisfy reservations).  Then place
         * pages into surplus state as needed so the pool will shrink
         * to the desired size as pages become free.
+         *
+         * By placing pages into the surplus state independent of the
+         * overcommit value, we are allowing the surplus pool size to
+         * exceed overcommit. There are few sane options here. Since
+         * alloc_buddy_huge_page() is checking the global counter,
+         * though, we'll note that we're not allowed to exceed surplus
+         * and won't grow the pool anywhere else. Not until one of the
+         * sysctls are changed, or the surplus pages go out of use.
         */
        min_count = resv_huge_pages + nr_huge_pages - free_huge_pages;
        min_count = max(count, min_count);
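
The counter handling in alloc_buddy_huge_page() follows a reserve, allocate, then commit-or-roll-back pattern. The single-threaded userspace model below is a sketch of that pattern only. struct pool, pool_lock(), pool_unlock(), alloc_surplus() and the two allocator callbacks are hypothetical names, the "lock" is a plain flag that merely asserts correct pairing, and the allocator callback stands in for alloc_pages() by returning a node id or -1.

```c
#include <assert.h>

#define MODEL_NODES 2	/* hypothetical node count for this sketch */

struct pool {
	int locked;				/* stand-in for hugetlb_lock */
	unsigned long nr_huge_pages;		/* global total */
	unsigned long surplus_huge_pages;	/* global surplus */
	unsigned long nr_overcommit;		/* overcommit limit */
	unsigned long nr_node[MODEL_NODES];	/* per-node totals */
	unsigned long surplus_node[MODEL_NODES];/* per-node surplus */
};

static void pool_lock(struct pool *p)   { assert(!p->locked); p->locked = 1; }
static void pool_unlock(struct pool *p) { assert(p->locked);  p->locked = 0; }

/* Allocator stand-in: returns the node id of the new page, or -1. */
typedef int (*alloc_fn)(const struct pool *p);

static int alloc_surplus(struct pool *p, alloc_fn do_alloc)
{
	int nid;

	/* Step 1: reserve a slot in the global counters under the lock. */
	pool_lock(p);
	if (p->surplus_huge_pages >= p->nr_overcommit) {
		pool_unlock(p);
		return -1;
	}
	p->nr_huge_pages++;
	p->surplus_huge_pages++;
	pool_unlock(p);

	/* Step 2: allocate with the lock dropped. */
	nid = do_alloc(p);

	/* Step 3: commit the per-node counters, or undo the reservation. */
	pool_lock(p);
	if (nid >= 0) {
		assert(nid < MODEL_NODES);
		p->nr_node[nid]++;
		p->surplus_node[nid]++;
	} else {
		p->nr_huge_pages--;
		p->surplus_huge_pages--;
	}
	pool_unlock(p);
	return nid;
}

/* Example allocators; both check that the lock is not held. */
static int alloc_ok(const struct pool *p)   { assert(!p->locked); return 1; }
static int alloc_fail(const struct pool *p) { assert(!p->locked); return -1; }
```

Because the global counters are bumped before the allocation, a concurrent caller sees the reservation and cannot push the surplus past the limit. The per-node counters change only after the page exists, which is why the comment in the patch says a pool resize that looks only at per-node values cannot be confused by an in-flight reservation.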