vmscan: raise the bar to PAGEOUT_IO_SYNC stalls

Fix "system goes unresponsive under memory pressure and lots of
dirty/writeback pages" bug.

        http://lkml.org/lkml/2010/4/4/86

In the above thread, Andreas Mohr described that

        Invoking any command locked up for minutes (note that I'm
        talking about attempted additional I/O to the _other_,
        _unaffected_ main system HDD - such as loading some shell
        binaries -, NOT the external SSD18M!!).

This happens when both of the following conditions are met:
- the system is under memory pressure
- something is writing heavily to a slow device

OOM also happens in Andreas' system. The OOM trace shows that 3 processes
are stuck in wait_on_page_writeback() in the direct reclaim path: one in
do_fork() and the other two in unix_stream_sendmsg(). They are blocked on
this condition:

        (sc->order && priority < DEF_PRIORITY - 2)

which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too
permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim
for the order-1 fork() allocation runs into a range of 512KB
hard-to-reclaim LRU pages, it will be stalled.

It's a severe problem in three ways.

Firstly, it can easily happen in daily desktop usage. vmscan priority can
easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if the
system has 50% globally reclaimable pages, it still has a good chance of
containing 0.1% sized hard-to-reclaim ranges. For example, a simple dd can
easily create a big range (up to 20%) of dirty pages in the LRU lists. And
order-1 to order-3 allocations are more than common with SLUB. Try
"grep -v '1 :' /proc/slabinfo" to get the list of high order slab caches.
For example, the order-1 radix_tree_node slab cache may stall applications
at swap-in time; the order-3 inode cache on most filesystems may stall
applications when trying to read some file; the order-2 proc_inode_cache
may stall applications when trying to open a /proc file.

Secondly, once triggered, it will stall unrelated processes (not doing IO
at all) in the system. This "one slow USB device stalls the whole system"
avalanching effect is very bad.

Thirdly, once stalled, the stall time can be intolerably long for the
users. When there are 20MB of queued writeback pages and USB 1.1 is
writing them at 1MB/s, wait_on_page_writeback() will be stuck for up to 20
seconds. Not to mention that it may be called multiple times.

So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below
DEF_PRIORITY/3, or 6.25% of the LRU size. As the default dirty throttle
ratio is 20%, it will hardly be triggered by pure dirty pages. We'd better
treat PAGEOUT_IO_SYNC as a last-resort workaround, because its stall time
is so uncomfortably long (it easily goes beyond 1s).

The bar is only raised for (order <= PAGE_ALLOC_COSTLY_ORDER) allocations,
which are easy to satisfy in 1TB memory boxes. So, although 6.25% of
memory could be an awful lot of pages to scan on a system with 1TB of
memory, it won't really have to busy scan that much.

Andreas tested an older version of this patch and reported that it mostly
fixed his problem. Mel Gorman helped improve it and KOSAKI Motohiro will
fix it further in the next patch.

Reported-by: Andreas Mohr <andi@lisas.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Wu Fengguang <fengguang.wu@intel.com> 2010-08-09 20:20:01 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2010-08-09 23:45:03 -0400
commit: e31f3698cd3499e676f6b0ea12e3528f569c4fa3 (patch)
tree: 0133cc0e11384c7293bdf0812ee04996a02c8826 /mm/vmscan.c
parent: 51980ac9e72fb5f22c81b7798d65b691125d70ee (diff)
1 files changed, 43 insertions, 8 deletions
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 154b37a33731..ec5ddccbf82e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1234,6 +1234,47 @@ static noinline_for_stack void update_isolated_counts(struct zone *zone,
 }
 
 /*
+ * Returns true if the caller should wait to clean dirty/writeback pages.
+ *
+ * If we are direct reclaiming for contiguous pages and we do not reclaim
+ * everything in the list, try again and wait for writeback IO to complete.
+ * This will stall high-order allocations noticeably. Only do that when really
+ * need to free the pages under high memory pressure.
+ */
+static inline bool should_reclaim_stall(unsigned long nr_taken,
+                                        unsigned long nr_freed,
+                                        int priority,
+                                        struct scan_control *sc)
+{
+        int lumpy_stall_priority;
+
+        /* kswapd should not stall on sync IO */
+        if (current_is_kswapd())
+                return false;
+
+        /* Only stall on lumpy reclaim */
+        if (!sc->lumpy_reclaim_mode)
+                return false;
+
+        /* If we have relaimed everything on the isolated list, no stall */
+        if (nr_freed == nr_taken)
+                return false;
+
+        /*
+         * For high-order allocations, there are two stall thresholds.
+         * High-cost allocations stall immediately where as lower
+         * order allocations such as stacks require the scanning
+         * priority to be much higher before stalling.
+         */
+        if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+                lumpy_stall_priority = DEF_PRIORITY;
+        else
+                lumpy_stall_priority = DEF_PRIORITY / 3;
+
+        return priority <= lumpy_stall_priority;
+}
+
+/*
 * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
 * of reclaimed pages
 */
@@ -1298,14 +1339,8 @@ shrink_inactive_list(unsigned long nr_to_scan, struct zone *zone,
 
         nr_reclaimed = shrink_page_list(&page_list, sc, PAGEOUT_IO_ASYNC);
 
-        /*
-         * If we are direct reclaiming for contiguous pages and we do
-         * not reclaim everything in the list, try again and wait
-         * for IO to complete. This will stall high-order allocations
-         * but that should be acceptable to the caller
-         */
-        if (nr_reclaimed < nr_taken && !current_is_kswapd() &&
-                        sc->lumpy_reclaim_mode) {
+        /* Check if we should syncronously wait for writeback */
+        if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
                 congestion_wait(BLK_RW_ASYNC, HZ/10);
 
                 /*