author Mel Gorman <mel@csn.ul.ie> 2008-07-24 00:27:22 -0400
committer Linus Torvalds <torvalds@linux-foundation.org> 2008-07-24 13:47:16 -0400
commit fc1b8a73dd71226902a11928dd5500326e101df9
tree 524abe44f0da15bef579cbeffad0a78d26571f72 /mm
parent 9109fb7b3520de187ebc3646c209d66a233f7169

hugetlb: move hugetlb_acct_memory()

This is a patchset to give reliable behaviour to a process that successfully calls mmap(MAP_PRIVATE) on a hugetlbfs file. Currently, it is possible for the process to be killed due to a small hugepage pool size even if it calls mlock().

MAP_SHARED mappings on hugetlbfs reserve huge pages at mmap() time. This guarantees all future faults against the mapping will succeed. This allows local allocations at first use improving NUMA locality whilst retaining reliability.

MAP_PRIVATE mappings do not reserve pages. This can result in an application being SIGKILLed later if a huge page is not available at fault time. This makes huge pages usage very ill-advised in some cases as the unexpected application failure cannot be detected and handled as it is immediately fatal. Although an application may force instantiation of the pages using mlock(), this may lead to poor memory placement and the process may still be killed when performing COW.

This patchset introduces a reliability guarantee for the process which creates a private mapping, i.e. the process that calls mmap() on a hugetlbfs file successfully. The first patch of the set is a purely mechanical code move to make later diffs easier to read. The second patch will guarantee faults up until the process calls fork(). After patch two, as long as the child keeps the mappings, the parent is no longer guaranteed to be reliable. Patch 3 guarantees that the parent will always successfully COW by unmapping the pages from the child in the event there are insufficient pages in the hugepage pool to allocate a new page, be it via a static or dynamic pool.

Existing hugepage-aware applications are unlikely to be affected by this change. For much of hugetlbfs's history, pages were pre-faulted at mmap() time or mmap() failed which acts in a reserve-like manner. If the pool is sized correctly already so that parent and child can fault reliably, the application will not even notice the reserves. It's only when the pool is too small for the application to function perfectly reliably that the reserves come into play.

Credit goes to Andy Whitcroft for cleaning up a number of mistakes during review before the patches were released.

This patch:

A later patch in this set needs to call hugetlb_acct_memory() before it is defined. This patch moves the function without modification. This makes later diffs easier to read.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

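For illustration only, and not part of this patch: the kind of mapping the patchset is concerned with can be sketched from userspace roughly as follows. The helper name map_private_hugetlb() is hypothetical. The sketch assumes the path names a file on a mounted hugetlbfs and that len is a multiple of the huge page size; neither is checked here.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `len` bytes of a hugetlbfs file privately.
 * Returns the mapping, or NULL if open() or mmap() fails.
 * Assumes `path` is on a mounted hugetlbfs and `len` is a multiple of
 * the huge page size; both are the caller's responsibility.
 */
static void *map_private_hugetlb(const char *path, size_t len)
{
	void *addr;
	int fd;

	assert(path != NULL && len > 0);

	fd = open(path, O_CREAT | O_RDWR, 0600);
	if (fd < 0)
		return NULL;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping keeps its own reference to the file */

	return addr == MAP_FAILED ? NULL : addr;
}
```

A caller would treat a NULL return as the point at which failure can still be handled. The patchset described above aims to make that the only point at which the creating process needs to handle it, subject to the fork() caveats given for patches two and three.
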
Diffstat (limited to 'mm')
-rw-r--r-- mm/hugetlb.c 82
1 files changed, 41 insertions, 41 deletions
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2c5c9ee4220d..a4dbba8965f3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -716,6 +716,47 @@ unsigned long hugetlb_total_pages(void)
 	return nr_huge_pages * (HPAGE_SIZE / PAGE_SIZE);
 }
 
+static int hugetlb_acct_memory(long delta)
+{
+	int ret = -ENOMEM;
+
+	spin_lock(&hugetlb_lock);
+	/*
+	 * When cpuset is configured, it breaks the strict hugetlb page
+	 * reservation as the accounting is done on a global variable. Such
+	 * reservation is completely rubbish in the presence of cpuset because
+	 * the reservation is not checked against page availability for the
+	 * current cpuset. Application can still potentially OOM'ed by kernel
+	 * with lack of free htlb page in cpuset that the task is in.
+	 * Attempt to enforce strict accounting with cpuset is almost
+	 * impossible (or too ugly) because cpuset is too fluid that
+	 * task or memory node can be dynamically moved between cpusets.
+	 *
+	 * The change of semantics for shared hugetlb mapping with cpuset is
+	 * undesirable. However, in order to preserve some of the semantics,
+	 * we fall back to check against current free page availability as
+	 * a best attempt and hopefully to minimize the impact of changing
+	 * semantics that cpuset has.
+	 */
+	if (delta > 0) {
+		if (gather_surplus_pages(delta) < 0)
+			goto out;
+
+		if (delta > cpuset_mems_nr(free_huge_pages_node)) {
+			return_unused_surplus_pages(delta);
+			goto out;
+		}
+	}
+
+	ret = 0;
+	if (delta < 0)
+		return_unused_surplus_pages((unsigned long) -delta);
+
+out:
+	spin_unlock(&hugetlb_lock);
+	return ret;
+}
+
 /*
  * We cannot handle pagefaults against hugetlb pages at all. They cause
  * handle_mm_fault() to try to instantiate regular-sized pages in the
@@ -1248,47 +1289,6 @@ static long region_truncate(struct list_head *head, long end)
 	return chg;
 }
 
-static int hugetlb_acct_memory(long delta)
-{
-	int ret = -ENOMEM;
-
-	spin_lock(&hugetlb_lock);
-	/*
-	 * When cpuset is configured, it breaks the strict hugetlb page
-	 * reservation as the accounting is done on a global variable. Such
-	 * reservation is completely rubbish in the presence of cpuset because
-	 * the reservation is not checked against page availability for the
-	 * current cpuset. Application can still potentially OOM'ed by kernel
-	 * with lack of free htlb page in cpuset that the task is in.
-	 * Attempt to enforce strict accounting with cpuset is almost
-	 * impossible (or too ugly) because cpuset is too fluid that
-	 * task or memory node can be dynamically moved between cpusets.
-	 *
-	 * The change of semantics for shared hugetlb mapping with cpuset is
-	 * undesirable. However, in order to preserve some of the semantics,
-	 * we fall back to check against current free page availability as
-	 * a best attempt and hopefully to minimize the impact of changing
-	 * semantics that cpuset has.
-	 */
-	if (delta > 0) {
-		if (gather_surplus_pages(delta) < 0)
-			goto out;
-
-		if (delta > cpuset_mems_nr(free_huge_pages_node)) {
-			return_unused_surplus_pages(delta);
-			goto out;
-		}
-	}
-
-	ret = 0;
-	if (delta < 0)
-		return_unused_surplus_pages((unsigned long) -delta);
-
-out:
-	spin_unlock(&hugetlb_lock);
-	return ret;
-}
-
 int hugetlb_reserve_pages(struct inode *inode, long from, long to)
 {
 	long ret, chg;
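
The sign convention of the moved function (a positive delta tries to reserve pages and may fail with -ENOMEM, a negative delta gives pages back and succeeds) can be illustrated with a small userspace model. This is a deliberately simplified sketch and not the kernel code: struct pool_model and pool_acct() are hypothetical names, the model is single-threaded where the kernel holds hugetlb_lock, and it ignores surplus pages and cpusets entirely.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the pool state; not a kernel structure. */
struct pool_model {
	long free_pages;	/* pages that could back reservations */
	long reserved;		/* pages already promised to mappings */
};

/*
 * Adjust the reservation count by `delta` pages.
 * delta > 0: reserve; fails with -ENOMEM and leaves the state unchanged
 *            if fewer than `delta` unreserved pages remain.
 * delta < 0: release an earlier reservation; always succeeds.
 */
static int pool_acct(struct pool_model *p, long delta)
{
	if (delta > 0 && delta > p->free_pages - p->reserved)
		return -ENOMEM;

	p->reserved += delta;
	assert(p->reserved >= 0);	/* never release more than was reserved */
	return 0;
}
```

With a model pool of four free pages, reserving three succeeds, a further request for two fails with -ENOMEM without changing the count, and releasing the three with a delta of -3 returns the model to its initial state.
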