author		Mel Gorman <mgorman@techsingularity.net>	2018-12-28 03:35:52 -0500
committer	Linus Torvalds <torvalds@linux-foundation.org>	2018-12-28 15:11:48 -0500
commit		1c30844d2dfe272d58c8fc000960b835d13aa2ac (patch)
tree		148a724047f2c9a98a6cb55d105a031f4e79efd7
parent		0a79cdad5eb213b3a629e624565b1b3bf9192b7c (diff)
mm: reclaim small amounts of memory when an external fragmentation event occurs
An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

The kernel reduces the probability of such events by increasing the
watermark sizes by calling set_recommended_min_free_kbytes early in the
lifetime of the system. This works reasonably well in general but if
there are enough sparsely populated pageblocks then the problem can still
occur as enough memory is free overall and kswapd stays asleep.

This patch introduces a watermark_boost_factor sysctl that allows a zone
watermark to be temporarily boosted when an external fragmentation-causing
event occurs. The boosting will stall allocations that would decrease free
memory below the boosted low watermark, and kswapd is woken, if the
calling context allows, to reclaim an amount of memory relative to the
size of the high watermark and the watermark_boost_factor until the boost
is cleared. When kswapd finishes, it wakes kcompactd at the pageblock
order to clean some of the pageblocks that may have been affected by the
fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
from reclaim context during this operation to avoid excessive system
disruption in the name of fragmentation avoidance. Care is taken so that
kswapd will do normal reclaim work if the system is really low on memory.

This was evaluated using the same workloads as "mm, page_alloc: Spread
allocations across zones before introducing fragmentation".

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)
4.20-rc3+patch1-4:                    18421 (98% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)

Note that external fragmentation-causing events are massively reduced by
this patch, whether in comparison to the previous kernel or the vanilla
kernel. The fault latency for huge pages appears to be increased but that
is only because THP allocations were successful with the patch applied.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:   291392
4.20-rc3+patch:                      191187 (34% reduction)
4.20-rc3+patch1-4:                    13464 (95% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)

As before, there is a massive reduction in external fragmentation events,
some jitter on latencies and an increase in THP allocation success rates.
2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  215698
4.20-rc3+patch:                     200210 (7% reduction)
4.20-rc3+patch1-4:                   14263 (93% reduction)

                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)

There is a 93% reduction in fragmentation causing events, there is a big
reduction in the huge page fault latency and allocation success rate is
higher.

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  166352
4.20-rc3+patch:                     147463 (11% reduction)
4.20-rc3+patch1-4:                   11095 (93% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                 lowzone-v5r8             boost-v5r8
Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)

                              4.20.0-rc3             4.20.0-rc3
                            lowzone-v5r8             boost-v5r8
Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)

There is a large reduction in fragmentation events with some jitter around
the latencies and success rates. As before, the high THP allocation
success rate does mean the system is under a lot of pressure. However, as
the fragmentation events are reduced, it would be expected that the
long-term allocation success rate would be higher.

Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
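For reference, a minimal standalone C sketch of the boost arithmetic the changelog describes (and that the boost_watermark() helper added to mm/page_alloc.c below implements): each fragmentation event grows the boost by one pageblock, capped at watermark_boost_factor fractions of 10,000 of the zone's high watermark. The plain types and example numbers here are illustrative only; this is not kernel code.

/*
 * Illustrative sketch (not kernel code) of the watermark boost arithmetic
 * described above; the patch's boost_watermark() does the same thing with
 * zone->_watermark[WMARK_HIGH], pageblock_nr_pages and mult_frac().
 */
#include <stdio.h>

static unsigned long boost_after_event(unsigned long boost,
				       unsigned long high_wmark,
				       unsigned long pageblock_pages,
				       int boost_factor)
{
	unsigned long max_boost;

	if (!boost_factor)
		return 0;			/* feature disabled */

	/* cap: boost_factor is in fractions of 10,000 of the high watermark */
	max_boost = high_wmark * boost_factor / 10000;
	if (max_boost < pageblock_pages)
		max_boost = pageblock_pages;	/* at least one pageblock */

	boost += pageblock_pages;		/* one pageblock per event */
	return boost < max_boost ? boost : max_boost;
}

int main(void)
{
	/* example numbers: 10000-page high watermark, 512-page (2MB) pageblocks */
	unsigned long boost = 0;

	for (int event = 0; event < 40; event++)
		boost = boost_after_event(boost, 10000, 512, 15000);

	/* settles at 15000 pages, i.e. 150% of the high watermark */
	printf("boost = %lu pages\n", boost);
	return 0;
}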
-rw-r--r--	Documentation/sysctl/vm.txt	21
-rw-r--r--	include/linux/mm.h		1
-rw-r--r--	include/linux/mmzone.h		11
-rw-r--r--	kernel/sysctl.c			8
-rw-r--r--	mm/page_alloc.c			43
-rw-r--r--	mm/vmscan.c			133
6 files changed, 202 insertions, 15 deletions
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 7d73882e2c27..187ce4f599a2 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -63,6 +63,7 @@ Currently, these files are in /proc/sys/vm:
 - swappiness
 - user_reserve_kbytes
 - vfs_cache_pressure
+- watermark_boost_factor
 - watermark_scale_factor
 - zone_reclaim_mode
 
@@ -856,6 +857,26 @@ ten times more freeable objects than there are.
 
 =============================================================
 
+watermark_boost_factor:
+
+This factor controls the level of reclaim when memory is being fragmented.
+It defines the percentage of the high watermark of a zone that will be
+reclaimed if pages of different mobility are being mixed within pageblocks.
+The intent is that compaction has less work to do in the future and to
+increase the success rate of future high-order allocations such as SLUB
+allocations, THP and hugetlbfs pages.
+
+To make it sensible with respect to the watermark_scale_factor parameter,
+the unit is in fractions of 10,000. The default value of 15,000 means
+that up to 150% of the high watermark will be reclaimed in the event of
+a pageblock being mixed due to fragmentation. The level of reclaim is
+determined by the number of fragmentation events that occurred in the
+recent past. If this value is smaller than a pageblock then a pageblocks
+worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
+of 0 will disable the feature.
+
+=============================================================
+
 watermark_scale_factor:
 
 This factor controls the aggressiveness of kswapd. It defines the
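The tunable documented above is exposed as a regular vm sysctl. As a small usage sketch (not part of the patch, and assuming the default procfs layout), a process can read the current value from /proc/sys/vm/watermark_boost_factor; writing 0 there as root disables boosting, per the documentation text above.

/*
 * Illustrative helper (not part of the patch): read the new tunable from
 * /proc/sys/vm like any other vm sysctl. Writing requires root.
 */
#include <stdio.h>

int main(void)
{
	const char *path = "/proc/sys/vm/watermark_boost_factor";
	FILE *f = fopen(path, "r");
	int factor;

	if (!f || fscanf(f, "%d", &factor) != 1) {
		perror(path);
		return 1;
	}
	fclose(f);

	/* default is 15000: boost up to 150% of a zone's high watermark */
	printf("%s = %d (%d%% of the high watermark, 0 disables boosting)\n",
	       path, factor, factor / 100);
	return 0;
}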
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1d2be4c2d34a..031b2ce983f9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2256,6 +2256,7 @@ extern void zone_pcp_reset(struct zone *zone);
 
 /* page_alloc.c */
 extern int min_free_kbytes;
+extern int watermark_boost_factor;
 extern int watermark_scale_factor;
 
 /* nommu.c */
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index dcf1b66a96ab..5b4bfb90fb94 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -269,10 +269,10 @@ enum zone_watermarks {
 	NR_WMARK
 };
 
-#define min_wmark_pages(z) (z->_watermark[WMARK_MIN])
-#define low_wmark_pages(z) (z->_watermark[WMARK_LOW])
-#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH])
-#define wmark_pages(z, i) (z->_watermark[i])
+#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
+#define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
+#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
 struct per_cpu_pages {
 	int count;		/* number of pages in the list */
@@ -364,6 +364,7 @@ struct zone {
 
 	/* zone watermarks, access with *_wmark_pages(zone) macros */
 	unsigned long _watermark[NR_WMARK];
+	unsigned long watermark_boost;
 
 	unsigned long nr_reserved_highatomic;
 
@@ -890,6 +891,8 @@ static inline int is_highmem(struct zone *zone)
 struct ctl_table;
 int min_free_kbytes_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
+int watermark_boost_factor_sysctl_handler(struct ctl_table *, int,
+					void __user *, size_t *, loff_t *);
 int watermark_scale_factor_sysctl_handler(struct ctl_table *, int,
 					void __user *, size_t *, loff_t *);
 extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 5fc724e4e454..1825f712e73b 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1463,6 +1463,14 @@ static struct ctl_table vm_table[] = {
 		.extra1		= &zero,
 	},
 	{
+		.procname	= "watermark_boost_factor",
+		.data		= &watermark_boost_factor,
+		.maxlen		= sizeof(watermark_boost_factor),
+		.mode		= 0644,
+		.proc_handler	= watermark_boost_factor_sysctl_handler,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "watermark_scale_factor",
 		.data		= &watermark_scale_factor,
 		.maxlen		= sizeof(watermark_scale_factor),
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 32b3e121a388..80373eca453d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -262,6 +262,7 @@ compound_page_dtor * const compound_page_dtors[] = {
 
 int min_free_kbytes = 1024;
 int user_min_free_kbytes = -1;
+int watermark_boost_factor __read_mostly = 15000;
 int watermark_scale_factor = 10;
 
 static unsigned long nr_kernel_pages __meminitdata;
@@ -2129,6 +2130,21 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
 	return false;
 }
 
+static inline void boost_watermark(struct zone *zone)
+{
+	unsigned long max_boost;
+
+	if (!watermark_boost_factor)
+		return;
+
+	max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
+			watermark_boost_factor, 10000);
+	max_boost = max(pageblock_nr_pages, max_boost);
+
+	zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+		max_boost);
+}
+
 /*
  * This function implements actual steal behaviour. If order is large enough,
  * we can steal whole pageblock. If not, we first move freepages in this
@@ -2138,7 +2154,7 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
  * itself, so pages freed in the future will be put on the correct free list.
  */
 static void steal_suitable_fallback(struct zone *zone, struct page *page,
-					int start_type, bool whole_block)
+		unsigned int alloc_flags, int start_type, bool whole_block)
 {
 	unsigned int current_order = page_order(page);
 	struct free_area *area;
@@ -2160,6 +2176,15 @@ static void steal_suitable_fallback(struct zone *zone, struct page *page,
 		goto single_page;
 	}
 
+	/*
+	 * Boost watermarks to increase reclaim pressure to reduce the
+	 * likelihood of future fallbacks. Wake kswapd now as the node
+	 * may be balanced overall and kswapd will not wake naturally.
+	 */
+	boost_watermark(zone);
+	if (alloc_flags & ALLOC_KSWAPD)
+		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
+
 	/* We are not allowed to try stealing from the whole block */
 	if (!whole_block)
 		goto single_page;
@@ -2443,7 +2468,8 @@ do_steal:
 	page = list_first_entry(&area->free_list[fallback_mt],
 						struct page, lru);
 
-	steal_suitable_fallback(zone, page, start_migratetype, can_steal);
+	steal_suitable_fallback(zone, page, alloc_flags, start_migratetype,
+								can_steal);
 
 	trace_mm_page_alloc_extfrag(page, order, current_order,
 		start_migratetype, fallback_mt);
@@ -7454,6 +7480,7 @@ static void __setup_per_zone_wmarks(void)
 
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
+		zone->watermark_boost = 0;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
@@ -7554,6 +7581,18 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *table, int write,
 	return 0;
 }
 
+int watermark_boost_factor_sysctl_handler(struct ctl_table *table, int write,
+	void __user *buffer, size_t *length, loff_t *ppos)
+{
+	int rc;
+
+	rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+	if (rc)
+		return rc;
+
+	return 0;
+}
+
 int watermark_scale_factor_sysctl_handler(struct ctl_table *table, int write,
 	void __user *buffer, size_t *length, loff_t *ppos)
 {
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 24ab1f7394ab..bd8971a29204 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -88,6 +88,9 @@ struct scan_control {
 	/* Can pages be swapped as part of reclaim? */
 	unsigned int may_swap:1;
 
+	/* e.g. boosted watermark reclaim leaves slabs alone */
+	unsigned int may_shrinkslab:1;
+
 	/*
 	 * Cgroups are not reclaimed below their configured memory.low,
 	 * unless we threaten to OOM. If any cgroups are skipped due to
@@ -2756,8 +2759,10 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
 			node_lru_pages += lru_pages;
 
-			shrink_slab(sc->gfp_mask, pgdat->node_id,
-				    memcg, sc->priority);
+			if (sc->may_shrinkslab) {
+				shrink_slab(sc->gfp_mask, pgdat->node_id,
+					    memcg, sc->priority);
+			}
 
 			/* Record the group's reclaim efficiency */
 			vmpressure(sc->gfp_mask, memcg, false,
@@ -3239,6 +3244,7 @@ unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = 1,
+		.may_shrinkslab = 1,
 	};
 
 	/*
@@ -3283,6 +3289,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup *memcg,
 		.may_unmap = 1,
 		.reclaim_idx = MAX_NR_ZONES - 1,
 		.may_swap = !noswap,
+		.may_shrinkslab = 1,
 	};
 	unsigned long lru_pages;
 
@@ -3329,6 +3336,7 @@ unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
 		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
 		.may_swap = may_swap,
+		.may_shrinkslab = 1,
 	};
 
 	/*
@@ -3379,6 +3387,30 @@ static void age_active_anon(struct pglist_data *pgdat,
 	} while (memcg);
 }
 
+static bool pgdat_watermark_boosted(pg_data_t *pgdat, int classzone_idx)
+{
+	int i;
+	struct zone *zone;
+
+	/*
+	 * Check for watermark boosts top-down as the higher zones
+	 * are more likely to be boosted. Both watermarks and boosts
+	 * should not be checked at the same time as reclaim would
+	 * start prematurely when there is no boosting and a lower
+	 * zone is balanced.
+	 */
+	for (i = classzone_idx; i >= 0; i--) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		if (zone->watermark_boost)
+			return true;
+	}
+
+	return false;
+}
+
 /*
  * Returns true if there is an eligible zone balanced for the request order
  * and classzone_idx
@@ -3389,6 +3421,10 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long mark = -1;
 	struct zone *zone;
 
+	/*
+	 * Check watermarks bottom-up as lower zones are more likely to
+	 * meet watermarks.
+	 */
 	for (i = 0; i <= classzone_idx; i++) {
 		zone = pgdat->node_zones + i;
 
@@ -3517,14 +3553,14 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
 	unsigned long pflags;
+	unsigned long nr_boost_reclaim;
+	unsigned long zone_boosts[MAX_NR_ZONES] = { 0, };
+	bool boosted;
 	struct zone *zone;
 	struct scan_control sc = {
 		.gfp_mask = GFP_KERNEL,
 		.order = order,
-		.priority = DEF_PRIORITY,
-		.may_writepage = !laptop_mode,
 		.may_unmap = 1,
-		.may_swap = 1,
 	};
 
 	psi_memstall_enter(&pflags);
@@ -3532,9 +3568,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 
 	count_vm_event(PAGEOUTRUN);
 
+	/*
+	 * Account for the reclaim boost. Note that the zone boost is left in
+	 * place so that parallel allocations that are near the watermark will
+	 * stall or direct reclaim until kswapd is finished.
+	 */
+	nr_boost_reclaim = 0;
+	for (i = 0; i <= classzone_idx; i++) {
+		zone = pgdat->node_zones + i;
+		if (!managed_zone(zone))
+			continue;
+
+		nr_boost_reclaim += zone->watermark_boost;
+		zone_boosts[i] = zone->watermark_boost;
+	}
+	boosted = nr_boost_reclaim;
+
+restart:
+	sc.priority = DEF_PRIORITY;
 	do {
 		unsigned long nr_reclaimed = sc.nr_reclaimed;
 		bool raise_priority = true;
+		bool balanced;
 		bool ret;
 
 		sc.reclaim_idx = classzone_idx;
@@ -3561,13 +3616,40 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		}
 
 		/*
-		 * Only reclaim if there are no eligible zones. Note that
-		 * sc.reclaim_idx is not used as buffer_heads_over_limit may
-		 * have adjusted it.
+		 * If the pgdat is imbalanced then ignore boosting and preserve
+		 * the watermarks for a later time and restart. Note that the
+		 * zone watermarks will be still reset at the end of balancing
+		 * on the grounds that the normal reclaim should be enough to
+		 * re-evaluate if boosting is required when kswapd next wakes.
+		 */
+		balanced = pgdat_balanced(pgdat, sc.order, classzone_idx);
+		if (!balanced && nr_boost_reclaim) {
+			nr_boost_reclaim = 0;
+			goto restart;
+		}
+
+		/*
+		 * If boosting is not active then only reclaim if there are no
+		 * eligible zones. Note that sc.reclaim_idx is not used as
+		 * buffer_heads_over_limit may have adjusted it.
 		 */
-		if (pgdat_balanced(pgdat, sc.order, classzone_idx))
+		if (!nr_boost_reclaim && balanced)
 			goto out;
 
+		/* Limit the priority of boosting to avoid reclaim writeback */
+		if (nr_boost_reclaim && sc.priority == DEF_PRIORITY - 2)
+			raise_priority = false;
+
+		/*
+		 * Do not writeback or swap pages for boosted reclaim. The
+		 * intent is to relieve pressure not issue sub-optimal IO
+		 * from reclaim context. If no pages are reclaimed, the
+		 * reclaim will be aborted.
+		 */
+		sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
+		sc.may_swap = !nr_boost_reclaim;
+		sc.may_shrinkslab = !nr_boost_reclaim;
+
 		/*
 		 * Do some background aging of the anon list, to give
 		 * pages a chance to be referenced before reclaiming. All
@@ -3619,6 +3701,16 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		 * progress in reclaiming pages
 		 */
 		nr_reclaimed = sc.nr_reclaimed - nr_reclaimed;
+		nr_boost_reclaim -= min(nr_boost_reclaim, nr_reclaimed);
+
+		/*
+		 * If reclaim made no progress for a boost, stop reclaim as
+		 * IO cannot be queued and it could be an infinite loop in
+		 * extreme circumstances.
+		 */
+		if (nr_boost_reclaim && !nr_reclaimed)
+			break;
+
 		if (raise_priority || !nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1);
@@ -3627,6 +3719,28 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 		pgdat->kswapd_failures++;
 
 out:
+	/* If reclaim was boosted, account for the reclaim done in this pass */
+	if (boosted) {
+		unsigned long flags;
+
+		for (i = 0; i <= classzone_idx; i++) {
+			if (!zone_boosts[i])
+				continue;
+
+			/* Increments are under the zone lock */
+			zone = pgdat->node_zones + i;
+			spin_lock_irqsave(&zone->lock, flags);
+			zone->watermark_boost -= min(zone->watermark_boost, zone_boosts[i]);
+			spin_unlock_irqrestore(&zone->lock, flags);
+		}
+
+		/*
+		 * As there is now likely space, wakeup kcompact to defragment
+		 * pageblocks.
+		 */
+		wakeup_kcompactd(pgdat, pageblock_order, classzone_idx);
+	}
+
 	snapshot_refaults(NULL, pgdat);
 	__fs_reclaim_release();
 	psi_memstall_leave(&pflags);
@@ -3855,7 +3969,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
 
 	/* Hopeless node, leave it to direct reclaim if possible */
 	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
-	    pgdat_balanced(pgdat, order, classzone_idx)) {
+	    (pgdat_balanced(pgdat, order, classzone_idx) &&
+	     !pgdat_watermark_boosted(pgdat, classzone_idx))) {
 		/*
 		 * There may be plenty of free memory available, but it's too
 		 * fragmented for high-order allocations. Wake up kcompactd