author		Mel Gorman <mgorman@techsingularity.net>	2018-12-28 03:35:52 -0500
committer	Linus Torvalds <torvalds@linux-foundation.org>	2018-12-28 15:11:48 -0500
commit		1c30844d2dfe272d58c8fc000960b835d13aa2ac (patch)
tree		148a724047f2c9a98a6cb55d105a031f4e79efd7
parent		0a79cdad5eb213b3a629e624565b1b3bf9192b7c (diff)
mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation-causing
event occurs. The boosting will stall allocations that would decrease free
memory below the boosted low watermark, and kswapd is woken, if the
calling context allows, to reclaim an amount of memory relative to the
size of the high watermark and the watermark_boost_factor until the boost
is cleared. When kswapd finishes, it wakes kcompactd at the pageblock
order to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance. Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)

Note that external fragmentation-causing events are massively reduced by
this patch, whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)
4.20-rc3+patch1-4:                    13464 (95% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)

As before, there is a massive reduction in external fragmentation events,
some jitter on latencies and an increase in THP allocation success rates.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  215698
4.20-rc3+patch:                     200210 (7% reduction)
4.20-rc3+patch1-4:                   14263 (93% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)

There is a 93% reduction in fragmentation causing events, there is a big
reduction in the huge page fault latency and allocation success rate is
higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  166352
4.20-rc3+patch:                     147463 (11% reduction)
4.20-rc3+patch1-4:                   11095 (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)

There is a large reduction in fragmentation events with some jitter around
the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.

Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
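For reference, a minimal standalone C sketch of the boost arithmetic the changelog describes (and that the boost_watermark() helper added to mm/page_alloc.c below implements): each fragmentation event grows the boost by one pageblock, capped at watermark_boost_factor fractions of 10,000 of the zone's high watermark. The plain types and example numbers here are illustrative only; this is not kernel code.

/*
 * Illustrative sketch (not kernel code) of the watermark boost arithmetic
 * described above; the patch's boost_watermark() does the same thing with
 * zone->_watermark[WMARK_HIGH], pageblock_nr_pages and mult_frac().
 */
#include <stdio.h>

static unsigned long boost_after_event(unsigned long boost,
				       unsigned long high_wmark,
				       unsigned long pageblock_pages,
				       int boost_factor)
{
	unsigned long max_boost;

	if (!boost_factor)
		return 0;			/* feature disabled */

	/* cap: boost_factor is in fractions of 10,000 of the high watermark */
	max_boost = high_wmark * boost_factor / 10000;
	if (max_boost < pageblock_pages)
		max_boost = pageblock_pages;	/* at least one pageblock */

	boost += pageblock_pages;		/* one pageblock per event */
	return boost < max_boost ? boost : max_boost;
}

int main(void)
{
	/* example numbers: 10000-page high watermark, 512-page (2MB) pageblocks */
	unsigned long boost = 0;

	for (int event = 0; event < 40; event++)
		boost = boost_after_event(boost, 10000, 512, 15000);

	/* settles at 15000 pages, i.e. 150% of the high watermark */
	printf("boost = %lu pages\n", boost);
	return 0;
}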
-rw-r--r--	Documentation/sysctl/vm.txt	21
-rw-r--r--	include/linux/mm.h		1
-rw-r--r--	include/linux/mmzone.h		11
-rw-r--r--	kernel/sysctl.c			8
-rw-r--r--	mm/page_alloc.c			43
-rw-r--r--	mm/vmscan.c			133
6 files changed, 202 insertions, 15 deletions
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 7d73882e2c27..187ce4f599a2 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -63,6 +63,7 @@ Currently, these files are in /proc/sys/vm:
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- watermark_boost_factor
 - watermark_scale_factor
 - zone_reclaim_mode
 
@@ -856,6 +857,26 @@ ten times more freeable objects than there are.
 
 =============================================================
 
+watermark_boost_factor:
+
+This factor controls the level of reclaim when memory is being fragmented.
+It defines the percentage of the high watermark of a zone that will be
+reclaimed if pages of different mobility are being mixed within pageblocks.
+The intent is that compaction has less work to do in the future and to
+increase the success rate of future high-order allocations such as SLUB
+allocations, THP and hugetlbfs pages.
+
+To make it sensible with respect to the watermark_scale_factor parameter,
+the unit is in fractions of 10,000. The default value of 15,000 means
+that up to 150% of the high watermark will be reclaimed in the event of
+a pageblock being mixed due to fragmentation. The level of reclaim is
+determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblocks
+worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
+
+=============================================================
+
 watermark_scale_factor:
 
 This factor controls the aggressiveness of kswapd. It defines the
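The tunable documented above is exposed as a regular vm sysctl. As a small usage sketch (not part of the patch, and assuming the default procfs layout), a process can read the current value from /proc/sys/vm/watermark_boost_factor; writing 0 there as root disables boosting, per the documentation text above.

/*
 * Illustrative helper (not part of the patch): read the new tunable from
 * /proc/sys/vm like any other vm sysctl. Writing requires root.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/vm/watermark_boost_factor";
	FILE *f = fopen(path, "r");
	int factor;

	if (!f || fscanf(f, "%d", &factor) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);

	/* default is 15000: boost up to 150% of a zone's high watermark */
	printf("%s = %d (%d%% of the high watermark, 0 disables boosting)\n",
	       path, factor, factor / 100);
	return 0;
}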
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1d2be4c2d34a..031b2ce983f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2256,6 +2256,7 @@ extern void zone_pcp_reset(struct zone *zone);
 
 /* page_alloc.c */
 extern int min_free_kbytes;
+extern int watermark_boost_factor;
 extern int watermark_scale_factor;
 
 /* nommu.c */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dcf1b66a96ab..5b4bfb90fb94 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -269,10 +269,10 @@ enum zone_watermarks {
 	NR_WMARK
 };
 
-#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
-#define wmark_pages(z, i) (z->_watermark[i])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -364,6 +364,7 @@ struct zone {
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long _watermark[NR_WMARK];
+	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
 
@@ -890,6 +891,8 @@ static inline int is_highmem(struct zone *zone)
 struct ctl_table;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5fc724e4e454..1825f712e73b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1463,6 +1463,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "watermark_boost_factor",
+		.data		= &watermark_boost_factor,
+		.maxlen		= sizeof(watermark_boost_factor),
+		.mode		= 0644,
+		.proc_handler	= watermark_boost_factor_sysctl_handler,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "watermark_scale_factor",
 		.data		= &watermark_scale_factor,
 		.maxlen		= sizeof(watermark_scale_factor),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 32b3e121a388..80373eca453d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -262,6 +262,7 @@ compound_page_dtor * const compound_page_dtors[] = {
 
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;
 
 static unsigned long nr_kernel_pages __meminitdata;
@@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }
 
+static inline void boost_watermark(struct zone *zone)
+{
+	unsigned long max_boost;
+
+	if (!watermark_boost_factor)
+		return;
+
+	max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
+			watermark_boost_factor, 10000);
+	max_boost = max(pageblock_nr_pages, max_boost);
+
+	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+		max_boost);
+}
+
 /*
  * This function implements actual steal behaviour. If order is large enough,
  * we can steal whole pageblock. If not, we first move freepages in this
@@ -2138,7 +2154,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
  * itself, so pages freed in the future will be put on the correct free list.
  */
 static void steal_suitable_fallback(struct zone *zone, struct page *page,
-					int start_type, bool whole_block)
+		unsigned int alloc_flags, int start_type, bool whole_block)
 {
 	unsigned int current_order = page_order(page);
 	struct free_area *area;
@@ -2160,6 +2176,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 		goto single_page;
 	}
 
+	/*
+	 * Boost watermarks to increase reclaim pressure to reduce the
+	 * likelihood of future fallbacks. Wake kswapd now as the node
+	 * may be balanced overall and kswapd will not wake naturally.
+	 */
+	boost_watermark(zone);
+	if (alloc_flags & ALLOC_KSWAPD)
+		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+
 	/* We are not allowed to try stealing from the whole block */
 	if (!whole_block)
 		goto single_page;
@@ -2443,7 +2468,8 @@ do_steal:
 	page = list_first_entry(&area->free_list[fallback_mt],
 						struct page, lru);
 
-	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
+	steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
+								can_steal);
 
 	trace_mm_page_alloc_extfrag(page, order, current_order,
 		start_migratetype, fallback_mt);
@@ -7454,6 +7480,7 @@ static void __setup_per_zone_wmarks(void)
 
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+		zone->watermark_boost = 0;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -7554,6 +7581,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 24ab1f7394ab..bd8971a29204 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -88,6 +88,9 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
+	/* e.g. boosted watermark reclaim leaves slabs alone */
+	unsigned int may_shrinkslab:1;
+
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
@@ -2756,8 +2759,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 			node_lru_pages += lru_pages;
 
-			shrink_slab(sc->gfp_mask, pgdat->node_id,
-				    memcg, sc->priority);
+			if (sc->may_shrinkslab) {
+				shrink_slab(sc->gfp_mask, pgdat->node_id,
+					    memcg, sc->priority);
+			}
 
 			/* Record the group's reclaim efficiency */
 			vmpressure(sc->gfp_mask, memcg, false,
@@ -3239,6 +3244,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.may_shrinkslab = 1,
 	};
 
 	/*
@@ -3283,6 +3289,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
+		.may_shrinkslab = 1,
 	};
 	unsigned long lru_pages;
 
@@ -3329,6 +3336,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = may_swap,
+		.may_shrinkslab = 1,
 	};
 
 	/*
@@ -3379,6 +3387,30 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
+static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx)
+{
+	int i;
+	struct zone *zone;
+
+	/*
+	 * Check for watermark boosts top-down as the higher zones
+	 * are more likely to be boosted. Both watermarks and boosts
+	 * should not be checked at the same time as reclaim would
+	 * start prematurely when there is no boosting and a lower
+	 * zone is balanced.
+	 */
+	for (i = classzone_idx; i >= 0; i--) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		if (zone->watermark_boost)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Returns true if there is an eligible zone balanced for the request order
  * and classzone_idx
@@ -3389,6 +3421,10 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long mark = -1;
 	struct zone *zone;
 
+	/*
+	 * Check watermarks bottom-up as lower zones are more likely to
+	 * meet watermarks.
+	 */
 	for (i = 0; i <= classzone_idx; i++) {
 		zone = pgdat->node_zones + i;
 
@@ -3517,14 +3553,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	unsigned long pflags;
+	unsigned long nr_boost_reclaim;
+	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
+	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
-		.priority = DEF_PRIORITY,
-		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
-		.may_swap = 1,
 	};
 
 	psi_memstall_enter(&pflags);
@@ -3532,9 +3568,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 	count_vm_event(PAGEOUTRUN);
 
+	/*
+	 * Account for the reclaim boost. Note that the zone boost is left in
+	 * place so that parallel allocations that are near the watermark will
+	 * stall or direct reclaim until kswapd is finished.
+	 */
+	nr_boost_reclaim = 0;
+	for (i = 0; i <= classzone_idx; i++) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		nr_boost_reclaim += zone->watermark_boost;
+		zone_boosts[i] = zone->watermark_boost;
+	}
+	boosted = nr_boost_reclaim;
+
+restart:
+	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
+		bool balanced;
 		bool ret;
 
 		sc.reclaim_idx = classzone_idx;
@@ -3561,13 +3616,40 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		}
 
 		/*
-		 * Only reclaim if there are no eligible zones. Note that
-		 * sc.reclaim_idx is not used as buffer_heads_over_limit may
-		 * have adjusted it.
+		 * If the pgdat is imbalanced then ignore boosting and preserve
+		 * the watermarks for a later time and restart. Note that the
+		 * zone watermarks will be still reset at the end of balancing
+		 * on the grounds that the normal reclaim should be enough to
+		 * re-evaluate if boosting is required when kswapd next wakes.
+		 */
+		balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
+		if (!balanced && nr_boost_reclaim) {
+			nr_boost_reclaim = 0;
+			goto restart;
+		}
+
+		/*
+		 * If boosting is not active then only reclaim if there are no
+		 * eligible zones. Note that sc.reclaim_idx is not used as
+		 * buffer_heads_over_limit may have adjusted it.
 		 */
-		if (pgdat_balanced(pgdat, sc.order, classzone_idx))
+		if (!nr_boost_reclaim && balanced)
 			goto out;
 
+		/* Limit the priority of boosting to avoid reclaim writeback */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
+			raise_priority = false;
+
+		/*
+		 * Do not writeback or swap pages for boosted reclaim. The
+		 * intent is to relieve pressure not issue sub-optimal IO
+		 * from reclaim context. If no pages are reclaimed, the
+		 * reclaim will be aborted.
+		 */
+		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
+		sc.may_swap = !nr_boost_reclaim;
+		sc.may_shrinkslab = !nr_boost_reclaim;
+
 		/*
 		 * Do some background aging of the anon list, to give
 		 * pages a chance to be referenced before reclaiming. All
@@ -3619,6 +3701,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * progress in reclaiming pages
 		 */
 		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);
+
+		/*
+		 * If reclaim made no progress for a boost, stop reclaim as
+		 * IO cannot be queued and it could be an infinite loop in
+		 * extreme circumstances.
+		 */
+		if (nr_boost_reclaim && !nr_reclaimed)
+			break;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3627,6 +3719,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		pgdat->kswapd_failures++;
 
 out:
+	/* If reclaim was boosted, account for the reclaim done in this pass */
+	if (boosted) {
+		unsigned long flags;
+
+		for (i = 0; i <= classzone_idx; i++) {
+			if (!zone_boosts[i])
+				continue;
+
+			/* Increments are under the zone lock */
+			zone = pgdat->node_zones + i;
+			spin_lock_irqsave(&zone->lock, flags);
+			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+
+		/*
+		 * As there is now likely space, wakeup kcompact to defragment
+		 * pageblocks.
+		 */
+		wakeup_kcompactd(pgdat, pageblock_order, classzone_idx);
+	}
+
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
 	psi_memstall_leave(&pflags);
@@ -3855,7 +3969,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 
 	/* Hopeless node, leave it to direct reclaim if possible */
 	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
-	    pgdat_balanced(pgdat, order, classzone_idx)) {
+	    (pgdat_balanced(pgdat, order, classzone_idx) &&
+	     !pgdat_watermark_boosted(pgdat, classzone_idx))) {
 		/*
 		 * There may be plenty of free memory available, but it's too
 		 * fragmented for high-order allocations. Wake up kcompactd