author		Mel Gorman <mel@csn.ul.ie>	2009-06-16 18:33:20 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2009-06-16 22:47:45 -0400
commit		90afa5de6f3fa89a733861e843377302479fcf7e
tree		2870878fa3361c27551b5a18c4732073ae1432bd
parent		84a892456046921a40646114deed65e2df93a1bc
vmscan: properly account for the number of page cache pages zone_reclaim() can reclaim
A bug was brought to my attention against a distro kernel, but it affects mainline and I believe problems like this have been reported in various guises on the mailing lists, although I don't have specific examples at the moment.

The reported problem was that malloc() stalled for a long time (minutes in some cases) if a large tmpfs mount was occupying a large percentage of memory overall. The pages did not get cleaned or reclaimed by zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists were uselessly scanned frequently, making the CPU spin at near 100%.

This patchset intends to address that bug and bring the behaviour of zone_reclaim() more in line with the expectations that surfaced during investigation. It is based on top of mmotm and takes advantage of Kosaki's work with respect to zone_reclaim().

Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the scan should go ahead. The broken heuristic is what was causing the malloc() stall, as it uselessly scanned the LRU constantly. Currently, zone_reclaim assumes zone_reclaim_mode is 1 and historically it could not deal with tmpfs pages at all. This fixes up the heuristic so that an unnecessary scan is more likely to be correctly avoided.

Patch 2 notes that zone_reclaim() returning a failure automatically means the zone is marked full. This is not always true. It could have failed because the GFP mask or zone_reclaim_mode were unsuitable.

Patch 3 introduces a counter zreclaim_failed that will increment each time the zone_reclaim scan-avoidance heuristics fail. If that counter is rapidly increasing, then zone_reclaim_mode should be set to 0 as a temporary resolution and a bug reported, because the scan-avoidance heuristic is still broken.

This patch:

On NUMA machines, the administrator can configure zone_reclaim_mode, which is a more targeted form of direct reclaim. On machines with large NUMA distances, for example, zone_reclaim_mode defaults to 1, meaning that clean unmapped pages will be reclaimed if the zone watermarks are not being met.

There is a heuristic that determines if the scan is worthwhile, but the problem is that the heuristic is not being properly applied and is basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of proper detection can manifest as high CPU usage as the LRU list is scanned uselessly.

Historically, once enabled, the heuristic depended on NR_FILE_PAGES, which may include swapcache pages that the reclaim_mode cannot deal with. Patch vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included pages that were not file-backed, such as swapcache, and made a calculation based on the inactive, active and mapped files. This is far superior when zone_reclaim==1, but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a reasonable starting figure.

This patch alters how zone_reclaim() works out how many pages it might be able to reclaim given the current reclaim_mode. Depending on whether RECLAIM_SWAP is set in the reclaim_mode, it will either consider NR_FILE_PAGES as potential candidates, or else use NR_INACTIVE_FILE + NR_ACTIVE_FILE - NR_FILE_MAPPED to discount swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set, then NR_FILE_DIRTY pages are not candidates. If RECLAIM_SWAP is not set, then NR_FILE_MAPPED pages are not.
[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
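As a rough illustration of the estimate described above, the sketch below mirrors the patch's accounting in a standalone user-space program. The struct zone_counters, the sample page counts and the min_unmapped_pages threshold are hypothetical stand-ins for zone_page_state() and zone->min_unmapped_pages; the RECLAIM_* bit values follow the kernel's definitions (zone_reclaim_mode 1 is RECLAIM_ZONE, 2 is RECLAIM_WRITE, 4 is RECLAIM_SWAP). This is a sketch under those assumptions, not the kernel implementation itself.

/*
 * Hypothetical user-space sketch of the reclaimable-page estimate; not
 * kernel code. struct zone_counters stands in for zone_page_state() lookups.
 */
#include <stdio.h>

#define RECLAIM_ZONE	(1<<0)	/* Run reclaim against the zone */
#define RECLAIM_WRITE	(1<<1)	/* Writeout pages during reclaim */
#define RECLAIM_SWAP	(1<<2)	/* Swap pages out during reclaim */

struct zone_counters {		/* hypothetical stand-in for per-zone vmstat */
	long nr_file_pages;	/* NR_FILE_PAGES: includes swapcache/tmpfs */
	long nr_inactive_file;	/* NR_INACTIVE_FILE */
	long nr_active_file;	/* NR_ACTIVE_FILE */
	long nr_file_mapped;	/* NR_FILE_MAPPED */
	long nr_file_dirty;	/* NR_FILE_DIRTY */
};

static long pagecache_reclaimable(const struct zone_counters *z, int mode)
{
	long reclaimable, delta = 0;

	if (mode & RECLAIM_SWAP) {
		/* Swapcache and tmpfs pages are also candidates */
		reclaimable = z->nr_file_pages;
	} else {
		/* Only unmapped file-backed pages; guard against underflow */
		long file_lru = z->nr_inactive_file + z->nr_active_file;

		reclaimable = file_lru > z->nr_file_mapped ?
				file_lru - z->nr_file_mapped : 0;
	}

	/* Without RECLAIM_WRITE, dirty pages cannot be cleaned here */
	if (!(mode & RECLAIM_WRITE))
		delta += z->nr_file_dirty;

	if (delta > reclaimable)
		delta = reclaimable;

	return reclaimable - delta;
}

int main(void)
{
	/* Made-up zone dominated by a mapped tmpfs mount */
	struct zone_counters z = {
		.nr_file_pages = 200000, .nr_inactive_file = 30000,
		.nr_active_file = 20000, .nr_file_mapped = 45000,
		.nr_file_dirty = 1000,
	};
	long min_unmapped_pages = 10000;	/* e.g. 1% of a 1M-page zone */
	int modes[] = { RECLAIM_ZONE, RECLAIM_ZONE | RECLAIM_SWAP };

	for (int i = 0; i < 2; i++) {
		long est = pagecache_reclaimable(&z, modes[i]);

		printf("mode=%d: reclaimable=%ld -> %s\n", modes[i], est,
		       est > min_unmapped_pages ? "scan" : "skip scan");
	}
	return 0;
}

With only RECLAIM_ZONE set, the mapped tmpfs pages and dirty pages are discounted, the estimate falls below the threshold and the useless scan is skipped; OR'ing in RECLAIM_SWAP makes all of NR_FILE_PAGES candidates and the scan proceeds.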
-rw-r--r--	Documentation/sysctl/vm.txt	12
-rw-r--r--	mm/vmscan.c	52
2 files changed, 53 insertions, 11 deletions
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 0ea5adbc5b16..c4de6359d440 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -315,10 +315,14 @@ min_unmapped_ratio:
 
 This is available only on NUMA kernels.
 
-A percentage of the total pages in each zone.  Zone reclaim will only
-occur if more than this percentage of pages are file backed and unmapped.
-This is to insure that a minimal amount of local pages is still available for
-file I/O even if the node is overallocated.
+This is a percentage of the total pages in each zone. Zone reclaim will
+only occur if more than this percentage of pages are in a state that
+zone_reclaim_mode allows to be reclaimed.
+
+If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
+against all file-backed unmapped pages including swapcache pages and tmpfs
+files. Otherwise, only unmapped pages backed by normal files but not tmpfs
+files and similar are considered.
 
 The default is 1 percent.
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 057e44b97aa1..79a98d98ed33 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2356,6 +2356,48 @@ int sysctl_min_unmapped_ratio = 1;
  */
 int sysctl_min_slab_ratio = 5;
 
+static inline unsigned long zone_unmapped_file_pages(struct zone *zone)
+{
+	unsigned long file_mapped = zone_page_state(zone, NR_FILE_MAPPED);
+	unsigned long file_lru = zone_page_state(zone, NR_INACTIVE_FILE) +
+		zone_page_state(zone, NR_ACTIVE_FILE);
+
+	/*
+	 * It's possible for there to be more file mapped pages than
+	 * accounted for by the pages on the file LRU lists because
+	 * tmpfs pages accounted for as ANON can also be FILE_MAPPED
+	 */
+	return (file_lru > file_mapped) ? (file_lru - file_mapped) : 0;
+}
+
+/* Work out how many page cache pages we can reclaim in this reclaim_mode */
+static long zone_pagecache_reclaimable(struct zone *zone)
+{
+	long nr_pagecache_reclaimable;
+	long delta = 0;
+
+	/*
+	 * If RECLAIM_SWAP is set, then all file pages are considered
+	 * potentially reclaimable. Otherwise, we have to worry about
+	 * pages like swapcache and zone_unmapped_file_pages() provides
+	 * a better estimate
+	 */
+	if (zone_reclaim_mode & RECLAIM_SWAP)
+		nr_pagecache_reclaimable = zone_page_state(zone, NR_FILE_PAGES);
+	else
+		nr_pagecache_reclaimable = zone_unmapped_file_pages(zone);
+
+	/* If we can't clean pages, remove dirty pages from consideration */
+	if (!(zone_reclaim_mode & RECLAIM_WRITE))
+		delta += zone_page_state(zone, NR_FILE_DIRTY);
+
+	/* Watch for any possible underflows due to delta */
+	if (unlikely(delta > nr_pagecache_reclaimable))
+		delta = nr_pagecache_reclaimable;
+
+	return nr_pagecache_reclaimable - delta;
+}
+
 /*
  * Try to free up some pages from this zone through reclaim.
  */
@@ -2390,9 +2432,7 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	reclaim_state.reclaimed_slab = 0;
 	p->reclaim_state = &reclaim_state;
 
-	if (zone_page_state(zone, NR_FILE_PAGES) -
-	    zone_page_state(zone, NR_FILE_MAPPED) >
-	    zone->min_unmapped_pages) {
+	if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
 		/*
 		 * Free memory by calling shrink zone with increasing
 		 * priorities until we have enough memory freed.
@@ -2450,10 +2490,8 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
 	 * if less than a specified percentage of the zone is used by
 	 * unmapped file backed pages.
 	 */
-	if (zone_page_state(zone, NR_FILE_PAGES) -
-	    zone_page_state(zone, NR_FILE_MAPPED) <= zone->min_unmapped_pages
-	    && zone_page_state(zone, NR_SLAB_RECLAIMABLE)
-			<= zone->min_slab_pages)
+	if (zone_pagecache_reclaimable(zone) <= zone->min_unmapped_pages &&
+	    zone_page_state(zone, NR_SLAB_RECLAIMABLE) <= zone->min_slab_pages)
 		return 0;
 
 	if (zone_is_all_unreclaimable(zone))