path: root/mm
author	Johannes Weiner <jweiner@redhat.com>	2012-01-10 18:07:49 -0500
committer	Linus Torvalds <torvalds@linux-foundation.org>	2012-01-10 19:30:43 -0500
commit	a756cf5908530e8b40bdf569eb48b40139e8d7fd (patch)
tree	ba9df151d5468098c7eae563ce09faea6a539fc0 /mm
parent	ccafa2879fb8d13b8031337a8743eac4189e5d6e (diff)
mm: try to distribute dirty pages fairly across zones
The maximum number of dirty pages that exist in the system at any time is determined by a number of pages considered dirtyable and a user-configured percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all the zones in the system minus their lowmem reserves and high watermarks, so that the system can retain a healthy number of free pages without having to reclaim dirty pages.

But there is a flaw in that we have a zoned page allocator which does not care about the global state but rather the state of individual memory zones. And right now there is nothing that prevents one zone from filling up with dirty pages while other zones are spared, which frequently leads to situations where kswapd, in order to restore the watermark of free pages, does indeed have to write pages from that zone's LRU list. This can interfere so badly with IO from the flusher threads that major filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim already, taking away the VM's only possibility to keep such a zone balanced, aside from hoping the flushers will soon clean pages from that zone.

Enter per-zone dirty limits. They are to a zone's dirtyable memory what the global limit is to the global amount of dirtyable memory, and try to make sure that no single zone receives more than its fair share of the globally allowed dirty pages in the first place. As the number of pages considered dirtyable excludes the zones' lowmem reserves and high watermarks, the maximum number of dirty pages in a zone is such that the zone can always be balanced without requiring page cleaning.

As this is a placement decision in the page allocator and pages are dirtied only after the allocation, this patch allows allocators to pass __GFP_WRITE when they know in advance that the page will be written to and become dirty soon. The page allocator will then attempt to allocate from the first zone of the zonelist - which on NUMA is determined by the task's NUMA memory policy - that has not exceeded its dirty limit.

At first glance, it would appear that the diversion to lower zones can increase pressure on them, but this is not the case. With a full higher zone, allocations will be diverted to lower zones eventually anyway, so it is more of a shift in the timing of the lower zone allocations. Workloads that previously could fit their dirty pages completely in the higher zone may be forced to allocate from lower zones, but the number of pages that "spill over" is itself limited by the lower zones' dirty constraints, and thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for NUMA configurations where the zones allowed for allocation are in sum not big enough to trigger the global dirty limits, wake up the flusher threads and remedy the situation. Because of this, an allocation that could not succeed on any of the considered zones is allowed to ignore the dirty limits before going into direct reclaim or even failing the allocation, until a future patch changes the global dirty throttling and flusher thread activation so that they take individual zone states into account.
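[Editor's note] As a minimal sketch of what an allocation site that knows it is about to dirty a page cache page might look like, the helper below passes the __GFP_WRITE hint from this patch. The helper name grab_page_for_write is hypothetical and not part of this patch; mapping_gfp_mask(), __page_cache_alloc(), add_to_page_cache_lru() and page_cache_release() are the existing page cache primitives of that era.

	/*
	 * Illustrative sketch only: allocate a page cache page the caller
	 * intends to dirty shortly, hinting the allocator with __GFP_WRITE
	 * so it prefers a zone that is still within its dirty limit.
	 */
	#include <linux/gfp.h>
	#include <linux/pagemap.h>

	static struct page *grab_page_for_write(struct address_space *mapping,
						pgoff_t index)
	{
		gfp_t gfp = mapping_gfp_mask(mapping) | __GFP_WRITE;
		struct page *page = __page_cache_alloc(gfp);

		if (!page)
			return NULL;
		if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
			page_cache_release(page);
			return NULL;
		}
		return page;
	}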
Test results:

15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

           seconds nr_vmscan_write
                   (stddev)              min|     median|        max
xfs
vanilla:    549.747( 3.492)            0.000|      0.000|      0.000
patched:    550.996( 3.802)            0.000|      0.000|      0.000

fuse-ntfs
vanilla:   1183.094(53.178)        54349.000|  59341.000|  65163.000
patched:    558.049(17.914)            0.000|      0.000|     43.000

btrfs
vanilla:    573.679(14.015)       156657.000| 460178.000| 606926.000
patched:    563.365(11.368)            0.000|      0.000|   1362.000

ext4
vanilla:    561.197(15.782)            0.000|2725438.000|4143837.000
patched:    568.806(17.496)            0.000|      0.000|      0.000

Signed-off-by: Johannes Weiner <jweiner@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Diffstat (limited to 'mm')
-rw-r--r--	mm/page-writeback.c	82
-rw-r--r--	mm/page_alloc.c	29
2 files changed, 111 insertions, 0 deletions
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 433fa990fe8..5cdd4f2b0c9 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -147,6 +147,24 @@ static struct prop_descriptor vm_completions;
  * clamping level.
  */
 
+/*
+ * In a memory zone, there is a certain amount of pages we consider
+ * available for the page cache, which is essentially the number of
+ * free and reclaimable pages, minus some zone reserves to protect
+ * lowmem and the ability to uphold the zone's watermarks without
+ * requiring writeback.
+ *
+ * This number of dirtyable pages is the base value against which the
+ * user-configurable dirty ratio determines the effective number of
+ * pages that are allowed to be actually dirtied - per individual zone,
+ * or globally by using the sum of dirtyable pages over all zones.
+ *
+ * Because the user is allowed to specify the dirty limit globally as
+ * absolute number of bytes, calculating the per-zone dirty limit can
+ * require translating the configured limit into a percentage of
+ * global dirtyable memory first.
+ */
+
 static unsigned long highmem_dirtyable_memory(unsigned long total)
 {
 #ifdef CONFIG_HIGHMEM
@@ -232,6 +250,70 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
 	trace_global_dirty_state(background, dirty);
 }
 
+/**
+ * zone_dirtyable_memory - number of dirtyable pages in a zone
+ * @zone: the zone
+ *
+ * Returns the zone's number of pages potentially available for dirty
+ * page cache.  This is the base value for the per-zone dirty limits.
+ */
+static unsigned long zone_dirtyable_memory(struct zone *zone)
+{
+	/*
+	 * The effective global number of dirtyable pages may exclude
+	 * highmem as a big-picture measure to keep the ratio between
+	 * dirty memory and lowmem reasonable.
+	 *
+	 * But this function is purely about the individual zone and a
+	 * highmem zone can hold its share of dirty pages, so we don't
+	 * care about vm_highmem_is_dirtyable here.
+	 */
+	return zone_page_state(zone, NR_FREE_PAGES) +
+	       zone_reclaimable_pages(zone) -
+	       zone->dirty_balance_reserve;
+}
+
+/**
+ * zone_dirty_limit - maximum number of dirty pages allowed in a zone
+ * @zone: the zone
+ *
+ * Returns the maximum number of dirty pages allowed in a zone, based
+ * on the zone's dirtyable memory.
+ */
+static unsigned long zone_dirty_limit(struct zone *zone)
+{
+	unsigned long zone_memory = zone_dirtyable_memory(zone);
+	struct task_struct *tsk = current;
+	unsigned long dirty;
+
+	if (vm_dirty_bytes)
+		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE) *
+			zone_memory / global_dirtyable_memory();
+	else
+		dirty = vm_dirty_ratio * zone_memory / 100;
+
+	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk))
+		dirty += dirty / 4;
+
+	return dirty;
+}
+
+/**
+ * zone_dirty_ok - tells whether a zone is within its dirty limits
+ * @zone: the zone to check
+ *
+ * Returns %true when the dirty pages in @zone are within the zone's
+ * dirty limit, %false if the limit is exceeded.
+ */
+bool zone_dirty_ok(struct zone *zone)
+{
+	unsigned long limit = zone_dirty_limit(zone);
+
+	return zone_page_state(zone, NR_FILE_DIRTY) +
+	       zone_page_state(zone, NR_UNSTABLE_NFS) +
+	       zone_page_state(zone, NR_WRITEBACK) <= limit;
+}
+
 /*
  * couple the period to the dirty_ratio:
  *
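[Editor's note] To make the arithmetic in zone_dirty_limit() above concrete, here is a small standalone model. The zone size, global dirtyable total, ratio and page size are illustrative assumptions, not values measured on the test machine.

	/* Standalone model of the per-zone dirty limit arithmetic.
	 * All inputs are made-up illustration values, not kernel defaults. */
	#include <stdio.h>

	int main(void)
	{
		unsigned long global_dirtyable = 900000;  /* pages, assumed */
		unsigned long zone_dirtyable   = 200000;  /* pages, assumed */
		unsigned long vm_dirty_ratio   = 20;      /* percent, assumed */
		unsigned long vm_dirty_bytes   = 0;       /* 0: use the ratio */
		unsigned long page_size        = 4096;
		unsigned long dirty;

		if (vm_dirty_bytes)
			/* scale the absolute byte limit to this zone's share */
			dirty = ((vm_dirty_bytes + page_size - 1) / page_size) *
				zone_dirtyable / global_dirtyable;
		else
			dirty = vm_dirty_ratio * zone_dirtyable / 100;

		printf("per-zone dirty limit: %lu pages\n", dirty); /* 40000 */
		return 0;
	}

With these inputs the zone may hold at most 40000 dirty pages, i.e. the same 20% that applies globally, scaled to the zone's own dirtyable memory.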
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2cb9eb71e28..4f95bcf0f2b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1735,6 +1735,35 @@ zonelist_scan:
 		if ((alloc_flags & ALLOC_CPUSET) &&
 			!cpuset_zone_allowed_softwall(zone, gfp_mask))
 				continue;
+		/*
+		 * When allocating a page cache page for writing, we
+		 * want to get it from a zone that is within its dirty
+		 * limit, such that no single zone holds more than its
+		 * proportional share of globally allowed dirty pages.
+		 * The dirty limits take into account the zone's
+		 * lowmem reserves and high watermark so that kswapd
+		 * should be able to balance it without having to
+		 * write pages from its LRU list.
+		 *
+		 * This may look like it could increase pressure on
+		 * lower zones by failing allocations in higher zones
+		 * before they are full.  But the pages that do spill
+		 * over are limited as the lower zones are protected
+		 * by this very same mechanism.  It should not become
+		 * a practical burden to them.
+		 *
+		 * XXX: For now, allow allocations to potentially
+		 * exceed the per-zone dirty limit in the slowpath
+		 * (ALLOC_WMARK_LOW unset) before going into reclaim,
+		 * which is important when on a NUMA setup the allowed
+		 * zones are together not big enough to reach the
+		 * global limit.  The proper fix for these situations
+		 * will require awareness of zones in the
+		 * dirty-throttling and the flusher threads.
+		 */
+		if ((alloc_flags & ALLOC_WMARK_LOW) &&
+		    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
+			goto this_zone_full;
 
 		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
 		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
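[Editor's note] The selection behavior of the fast-path check added above can be modeled outside the kernel. In the sketch below the flag bit values, zone names and page counts are assumptions chosen purely for illustration: a zone over its dirty limit is skipped while ALLOC_WMARK_LOW is set, and the gate is bypassed entirely in the slowpath.

	/* Userspace mock of the fast-path dirty gate; not kernel code. */
	#include <stdbool.h>
	#include <stdio.h>

	#define ALLOC_WMARK_LOW 0x01   /* assumed bit value */
	#define __GFP_WRITE     0x02   /* assumed bit value */

	struct zone_model {
		const char *name;
		unsigned long dirty;   /* dirty + writeback + unstable pages */
		unsigned long limit;   /* result of the per-zone limit calc */
	};

	static bool zone_dirty_ok(const struct zone_model *z)
	{
		return z->dirty <= z->limit;
	}

	static const char *pick_zone(const struct zone_model *zl, int n,
				     int alloc_flags, int gfp_mask)
	{
		for (int i = 0; i < n; i++) {
			/* fast path only: skip zones over their dirty limit */
			if ((alloc_flags & ALLOC_WMARK_LOW) &&
			    (gfp_mask & __GFP_WRITE) && !zone_dirty_ok(&zl[i]))
				continue;
			return zl[i].name;
		}
		return "none (enter slowpath, dirty limits ignored)";
	}

	int main(void)
	{
		const struct zone_model zl[] = {
			{ "HighMem", 50000, 40000 },  /* over its limit  */
			{ "Normal",  10000, 30000 },  /* within its limit */
		};

		printf("fast path picks: %s\n",
		       pick_zone(zl, 2, ALLOC_WMARK_LOW, __GFP_WRITE));
		return 0;
	}

Running this prints "fast path picks: Normal": the preferred HighMem zone is passed over because it already holds more than its share of dirty pages.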