author	Hugh Dickins <hughd@google.com>	2012-07-31 19:45:59 -0400
committer	Linus Torvalds <torvalds@linux-foundation.org>	2012-07-31 21:42:49 -0400
commit	c3b94f44fcb0725471ecebb701c077a0ed67bd07 (patch)
tree	526dd574ec3d35bc39a4a05759a8f4c33f91abb3 /mm
parent	e62e384e9da8d9a0c599795464a7e76fd490931c (diff)
memcg: further prevent OOM with too many dirty pages
The may_enter_fs test turns out to be too restrictive: though I saw no problem with it when testing on 3.5-rc6, it very soon OOMed when I tested on 3.5-rc6-mm1. I don't know what the difference there is, perhaps I just slightly changed the way I started off the testing: dd if=/dev/zero of=/mnt/temp bs=1M count=1024; rm -f /mnt/temp; sync repeatedly, in 20M memory.limit_in_bytes cgroup to ext4 on USB stick.

ext4 (and gfs2 and xfs) turn out to allocate new pages for writing with AOP_FLAG_NOFS: that seems a little worrying, and it's unclear to me why the transaction needs to be started even before allocating pagecache memory. But it may not be worth worrying about these days: if direct reclaim avoids FS writeback, does __GFP_FS now mean anything?

Anyway, we insisted on the may_enter_fs test to avoid hangs with the loop device; but since that also masks off __GFP_IO, we can test for __GFP_IO directly, ignoring may_enter_fs and __GFP_FS.

But even so, the test still OOMs sometimes: when originally testing on 3.5-rc6, it OOMed about one time in five or ten; when testing just now on 3.5-rc6-mm1, it OOMed on the first iteration.

This residual problem comes from an accumulation of pages under ordinary writeback, not marked PageReclaim, so rightly not causing the memcg check to wait on their writeback: these too can prevent shrink_page_list() from freeing any pages, so many times that memcg reclaim fails and OOMs.

Deal with these in the same way as direct reclaim now deals with dirty FS pages: mark them PageReclaim. It is appropriate to rotate these to tail of list when writepage completes, but more importantly, the PageReclaim flag makes memcg reclaim wait on them if encountered again.

Increment NR_VMSCAN_IMMEDIATE? That's arguable: I chose not.

Setting PageReclaim here may occasionally race with end_page_writeback() clearing it: lru_deactivate_fn() already faced the same race, and correctly concluded that the window is small and the issue non-critical.
With these changes, the test runs indefinitely without OOMing on ext4, ext3 and ext2: I'll move on to test with other filesystems later.

Trivia: invert conditions for a clearer block without an else, and goto keep_locked to do the unlock_page.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ying Han <yinghan@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Fengguang Wu <fengguang.wu@intel.com>
Acked-by: Michal Hocko <mhocko@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'mm')
 mm/vmscan.c | 33 ++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ca43aa00ea0e..e37e68725090 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -723,23 +723,38 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 			/*
 			 * memcg doesn't have any dirty pages throttling so we
 			 * could easily OOM just because too many pages are in
-			 * writeback from reclaim and there is nothing else to
-			 * reclaim.
+			 * writeback and there is nothing else to reclaim.
 			 *
-			 * Check may_enter_fs, certainly because a loop driver
+			 * Check __GFP_IO, certainly because a loop driver
 			 * thread might enter reclaim, and deadlock if it waits
 			 * on a page for which it is needed to do the write
 			 * (loop masks off __GFP_IO|__GFP_FS for this reason);
 			 * but more thought would probably show more reasons.
+			 *
+			 * Don't require __GFP_FS, since we're not going into
+			 * the FS, just waiting on its writeback completion.
+			 * Worryingly, ext4 gfs2 and xfs allocate pages with
+			 * grab_cache_page_write_begin(,,AOP_FLAG_NOFS), so
+			 * testing may_enter_fs here is liable to OOM on them.
 			 */
-			if (!global_reclaim(sc) && PageReclaim(page) &&
-					may_enter_fs)
-				wait_on_page_writeback(page);
-			else {
+			if (global_reclaim(sc) ||
+			    !PageReclaim(page) || !(sc->gfp_mask & __GFP_IO)) {
+				/*
+				 * This is slightly racy - end_page_writeback()
+				 * might have just cleared PageReclaim, then
+				 * setting PageReclaim here end up interpreted
+				 * as PageReadahead - but that does not matter
+				 * enough to care. What we do want is for this
+				 * page to have PageReclaim set next time memcg
+				 * reclaim reaches the tests above, so it will
+				 * then wait_on_page_writeback() to avoid OOM;
+				 * and it's also appropriate in global reclaim.
+				 */
+				SetPageReclaim(page);
 				nr_writeback++;
-				unlock_page(page);
-				goto keep;
+				goto keep_locked;
 			}
+			wait_on_page_writeback(page);
 		}
 
 		references = page_check_references(page, sc);