author		Mel Gorman <mel@csn.ul.ie>	2009-06-16 18:33:20 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 22:47:45 -0400
commit		90afa5de6f3fa89a733861e843377302479fcf7e
tree		2870878fa3361c27551b5a18c4732073ae1432bd
parent		84a892456046921a40646114deed65e2df93a1bc
vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim
A bug was brought to my attention against a distro kernel, but it affects mainline and I believe problems like this have been reported in various guises on the mailing lists, although I don't have specific examples at the moment.

The reported problem was that malloc() stalled for a long time (minutes in some cases) if a large tmpfs mount was occupying a large percentage of memory overall. The pages did not get cleaned or reclaimed by zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists were uselessly scanned frequently, making the CPU spin at near 100%.

This patchset intends to address that bug and bring the behaviour of zone_reclaim() more in line with the expectations that surfaced during investigation. It is based on top of mmotm and takes advantage of Kosaki's work with respect to zone_reclaim().

Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the scan should go ahead. The broken heuristic is what was causing the malloc() stall, as it uselessly scanned the LRU constantly. Currently, zone_reclaim assumes zone_reclaim_mode is 1 and historically it could not deal with tmpfs pages at all. This fixes up the heuristic so that an unnecessary scan is more likely to be correctly avoided.

Patch 2 notes that zone_reclaim() returning a failure automatically means the zone is marked full. This is not always true. It could have failed because the GFP mask or zone_reclaim_mode were unsuitable.

Patch 3 introduces a counter zreclaim_failed that will increment each time the zone_reclaim scan-avoidance heuristics fail. If that counter is rapidly increasing, then zone_reclaim_mode should be set to 0 as a temporary resolution and a bug reported, because the scan-avoidance heuristic is still broken.

This patch:

On NUMA machines, the administrator can configure zone_reclaim_mode, which is a more targeted form of direct reclaim. On machines with large NUMA distances, for example, zone_reclaim_mode defaults to 1, meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met.

There is a heuristic that determines if the scan is worthwhile, but the problem is that the heuristic is not being properly applied and is basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can manifest as high CPU usage as the LRU list is scanned uselessly.

Historically, once enabled, the heuristic depended on NR_FILE_PAGES, which may include swapcache pages that the reclaim_mode cannot deal with. Patch vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included pages that were not file-backed, such as swapcache, and made a calculation based on the inactive, active and mapped files. This is far superior when zone_reclaim==1, but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a reasonable starting figure.

This patch alters how zone_reclaim() works out how many pages it might be able to reclaim given the current reclaim_mode. Depending on whether RECLAIM_SWAP is set in the reclaim_mode, it will either consider NR_FILE_PAGES as potential candidates, or else use NR_INACTIVE_FILE + NR_ACTIVE_FILE - NR_FILE_MAPPED to discount swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set, then NR_FILE_DIRTY pages are not candidates. If RECLAIM_SWAP is not set, then NR_FILE_MAPPED pages are not.
[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
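As a rough illustration of the estimate described above, the sketch below mirrors the patch's accounting in a standalone user-space program. The struct zone_counters, the sample page counts and the min_unmapped_pages threshold are hypothetical stand-ins for zone_page_state() and zone->min_unmapped_pages; the RECLAIM_* bit values follow the kernel's definitions (zone_reclaim_mode 1 is RECLAIM_ZONE, 2 is RECLAIM_WRITE, 4 is RECLAIM_SWAP). This is a sketch under those assumptions, not the kernel implementation itself.

/*
 * Hypothetical user-space sketch of the reclaimable-page estimate; not
 * kernel code. struct zone_counters stands in for zone_page_state() lookups.
 */
#include <stdio.h>

#define RECLAIM_ZONE	(1<<0)	/* Run reclaim against the zone */
#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
#define RECLAIM_SWAP	(1<<2)	/* Swap pages out during reclaim */

struct zone_counters {		/* hypothetical stand-in for per-zone vmstat */
	long nr_file_pages;	/* NR_FILE_PAGES: includes swapcache/tmpfs */
	long nr_inactive_file;	/* NR_INACTIVE_FILE */
	long nr_active_file;	/* NR_ACTIVE_FILE */
	long nr_file_mapped;	/* NR_FILE_MAPPED */
	long nr_file_dirty;	/* NR_FILE_DIRTY */
};

static long pagecache_reclaimable(const struct zone_counters *z, int mode)
{
	long reclaimable, delta = 0;

	if (mode & RECLAIM_SWAP) {
		/* Swapcache and tmpfs pages are also candidates */
		reclaimable = z->nr_file_pages;
	} else {
		/* Only unmapped file-backed pages; guard against underflow */
		long file_lru = z->nr_inactive_file + z->nr_active_file;

		reclaimable = file_lru > z->nr_file_mapped ?
				file_lru - z->nr_file_mapped : 0;
	}

	/* Without RECLAIM_WRITE, dirty pages cannot be cleaned here */
	if (!(mode & RECLAIM_WRITE))
		delta += z->nr_file_dirty;

	if (delta > reclaimable)
		delta = reclaimable;

	return reclaimable - delta;
}

int main(void)
{
	/* Made-up zone dominated by a mapped tmpfs mount */
	struct zone_counters z = {
		.nr_file_pages = 200000, .nr_inactive_file = 30000,
		.nr_active_file = 20000, .nr_file_mapped = 45000,
		.nr_file_dirty = 1000,
	};
	long min_unmapped_pages = 10000;	/* e.g. 1% of a 1M-page zone */
	int modes[] = { RECLAIM_ZONE, RECLAIM_ZONE | RECLAIM_SWAP };

	for (int i = 0; i < 2; i++) {
		long est = pagecache_reclaimable(&z, modes[i]);

		printf("mode=%d: reclaimable=%ld -> %s\n", modes[i], est,
		       est > min_unmapped_pages ? "scan" : "skip scan");
	}
	return 0;
}

With only RECLAIM_ZONE set, the mapped tmpfs pages and dirty pages are discounted, the estimate falls below the threshold and the useless scan is skipped; OR'ing in RECLAIM_SWAP makes all of NR_FILE_PAGES candidates and the scan proceeds.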
-rw-r--r--	Documentation/sysctl/vm.txt	12
-rw-r--r--	mm/vmscan.c	52
2 files changed, 53 insertions, 11 deletions
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 0ea5adbc5b16..c4de6359d440 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -315,10 +315,14 @@ min_unmapped_ratio:
 
 This is available only on NUMA kernels.
 
-A percentage of the total pages in each zone.  Zone reclaim will only
-occur if more than this percentage of pages are file backed and unmapped.
-This is to insure that a minimal amount of local pages is still available for
-file I/O even if the node is overallocated.
+This is a percentage of the total pages in each zone. Zone reclaim will
+only occur if more than this percentage of pages are in a state that
+zone_reclaim_mode allows to be reclaimed.
+
+If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
+against all file-backed unmapped pages including swapcache pages and tmpfs
+files. Otherwise, only unmapped pages backed by normal files but not tmpfs
+files and similar are considered.
 
 The default is 1 percent.
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 057e44b97aa1..79a98d98ed33 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2356,6 +2356,48 @@ int sysctl_min_unmapped_ratio = 1;
  */
 int sysctl_min_slab_ratio = 5;
 
+static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+{
+	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
+		zone_page_state(zone, NR_ACTIVE_FILE);
+
+	/*
+	 * It's possible for there to be more file mapped pages than
+	 * accounted for by the pages on the file LRU lists because
+	 * tmpfs pages accounted for as ANON can also be FILE_MAPPED
+	 */
+	return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
+}
+
+/* Work out how many page cache pages we can reclaim in this reclaim_mode */
+static long zone_pagecache_reclaimable(struct zone *zone)
+{
+	long nr_pagecache_reclaimable;
+	long delta = 0;
+
+	/*
+	 * If RECLAIM_SWAP is set, then all file pages are considered
+	 * potentially reclaimable. Otherwise, we have to worry about
+	 * pages like swapcache and zone_unmapped_file_pages() provides
+	 * a better estimate
+	 */
+	if (zone_reclaim_mode & RECLAIM_SWAP)
+		nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+	else
+		nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
+
+	/* If we can't clean pages, remove dirty pages from consideration */
+	if (!(zone_reclaim_mode & RECLAIM_WRITE))
+		delta += zone_page_state(zone, NR_FILE_DIRTY);
+
+	/* Watch for any possible underflows due to delta */
+	if (unlikely(delta > nr_pagecache_reclaimable))
+		delta = nr_pagecache_reclaimable;
+
+	return nr_pagecache_reclaimable - delta;
+}
+
 /*
  * Try to free up some pages from this zone through reclaim.
  */
@@ -2390,9 +2432,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_page_state(zone, NR_FILE_PAGES) -
-	    zone_page_state(zone, NR_FILE_MAPPED) >
-	    zone->min_unmapped_pages) {
+	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
@@ -2450,10 +2490,8 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	 * if less than a specified percentage of the zone is used by
 	 * unmapped file backed pages.
 	 */
-	if (zone_page_state(zone, NR_FILE_PAGES) -
-	    zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
-	    && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
-			<= zone->min_slab_pages)
+	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
+	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return 0;
 
 	if (zone_is_all_unreclaimable(zone))