path: root/mm/vmscan.c
author    Vlastimil Babka <vbabka@suse.cz>  2016-03-17 17:18:15 -0400
committer Linus Torvalds <torvalds@linux-foundation.org>  2016-03-17 18:09:34 -0400
commit    accf62422b3a67fce8ce086aa81c8300ddbf42be
tree      65a007c6a29f92726e0483de334df05ee24a1a63 /mm/vmscan.c
parent    e888ca3545dc6823603b976e40b62af2c68b6fcc

mm, kswapd: replace kswapd compaction with waking up kcompactd

Similarly to direct reclaim/compaction, kswapd attempts to combine
reclaim and compaction to make memory allocation of a given order
possible.  The details differ from direct reclaim, e.g. in having the
high watermark as a goal.  The code involved in kswapd's
reclaim/compaction decisions has evolved to be quite complex.  Testing
reveals that it doesn't actually work in at least one scenario, and
closer inspection suggests that it could be greatly simplified without
compromising on the goal (make high-order page available) or efficiency
(don't reclaim too much).  The simplification relies on doing all
compaction in kcompactd, which is simply woken up when high watermarks
are reached by kswapd's reclaim.

The scenario where kswapd compaction doesn't work was found with mmtests
test stress-highalloc configured to attempt order-9 allocations without
direct reclaim, just waking up kswapd.  There was no compaction attempt
from kswapd during the whole test.  Some added instrumentation shows
what happens:

 - balance_pgdat() sets end_zone to Normal, as it's not balanced

 - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
   it cannot reclaim anything, so sc.nr_reclaimed is 0

 - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
   it merely checks if high watermarks were reached for base pages.
   This is true, so no reclaim is attempted.  For DMA, testorder=0
   wasn't used, as compaction_suitable() returned COMPACT_SKIPPED

 - even though the pgdat_needs_compaction flag wasn't set to false, no
   compaction happens due to the condition sc.nr_reclaimed >
   nr_attempted being false (as 0 < 99)

 - priority-- due to nr_reclaimed being 0, repeat until priority reaches
   0; pgdat_balanced() is false as only the small zone DMA appears
   balanced (curiously in that check, watermark appears OK and
   compaction_suitable() returns COMPACT_PARTIAL, because a lower
   classzone_idx is used there)

Now, even if it was decided that reclaim shouldn't be attempted on the
DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
nr_attempted=0) is also false.  The condition really should use >= as
the comment suggests.  Then there is a mismatch in the check for setting
pgdat_needs_compaction to false using low watermark, while the rest uses
high watermark, and who knows what other subtlety.  Hopefully this
demonstrates that this is unsustainable.

Luckily we can simplify this a lot.  The reclaim/compaction decisions
make sense for the direct reclaim scenario, but in kswapd, our primary
goal is to reach the high watermark in order-0 pages.  Afterwards we can
attempt compaction just once.  Unlike direct reclaim, we don't reclaim
extra pages (over the high watermark); the current code already
disallows it for good reasons.

After this patch, we simply wake up kcompactd to process the pgdat,
after we have either succeeded or failed to reach the high watermarks in
kswapd, which goes to sleep.  We pass kswapd's order and classzone_idx,
so kcompactd can apply the same criteria to determine which zones are
worth compacting.  Note that we use the classzone_idx from
wakeup_kswapd(), not balanced_classzone_idx which can include higher
zones that kswapd tried to balance too, but didn't consider them in
pgdat_balanced().

Since kswapd now cannot create high-order pages itself, we need to
adjust how it determines the zones to be balanced.
The key element here is adding a "highorder" parameter to
zone_balanced, which, when set to false, makes it consider only the
order-0 watermark instead of the desired higher order (this was done
previously by kswapd_shrink_zone(), but not elsewhere).  This false is
passed for example in pgdat_balanced().  Importantly, wakeup_kswapd()
uses true to make sure kswapd and thus kcompactd are woken up for a
high-order allocation failure.

The last thing is to decide what to do with pageblock_skip bitmap
handling.  Compaction maintains a pageblock_skip bitmap to record
pageblocks where isolation recently failed.  This bitmap can be reset in
three ways:

1) direct compaction is restarting after going through the full
   deferred cycle

2) kswapd goes to sleep, and some other direct compaction has
   previously finished scanning the whole zone and set
   zone->compact_blockskip_flush.  Note that a successful direct
   compaction clears this flag.

3) compaction was invoked manually via trigger in /proc

The case 2) is somewhat fuzzy to begin with, but after introducing
kcompactd we should update it.  The check for direct compaction in 1),
and to set the flush flag in 2) use current_is_kswapd(), which doesn't
work for kcompactd.  Thus, this patch adds bool direct_compaction to
compact_control to use in 2).  For the case 1) we remove the check
completely - unlike the former kswapd compaction, kcompactd does use the
deferred compaction functionality, so flushing tied to restarting from
deferred compaction makes sense here.

Note that when kswapd goes to sleep, kcompactd is woken up, so it will
see the flushed pageblock_skip bits.  This is different from when the
former kswapd compaction observed the bits and I believe it makes more
sense.  Kcompactd can afford to be more thorough than a direct
compaction trying to limit allocation latency, or kswapd whose primary
goal is to reclaim.

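The effect of the "highorder" parameter described above can be
illustrated with a small userspace model.  This is only a sketch under
simplifying assumptions: struct model_zone, model_watermark_ok() and
model_zone_balanced() are hypothetical names, compaction is assumed to
be enabled, and the watermark check ignores the lowmem reserves and
allocation flags that the real zone_watermark_ok_safe() handles.

```c
#include <stdbool.h>

#define MODEL_MAX_ORDER 11	/* assumption: orders 0..10 */

/* Hypothetical stand-in for struct zone: only what the sketch needs. */
struct model_zone {
	unsigned long high_wmark;		/* high watermark, in pages */
	unsigned long nr_free[MODEL_MAX_ORDER];	/* free blocks per order */
};

static unsigned long model_free_pages(const struct model_zone *z)
{
	unsigned long total = 0;
	int o;

	for (o = 0; o < MODEL_MAX_ORDER; o++)
		total += z->nr_free[o] << o;
	return total;
}

/*
 * Simplified watermark check: enough free pages overall and, for
 * order > 0, at least one free block of that order or larger.
 */
static bool model_watermark_ok(const struct model_zone *z, int order,
			       unsigned long mark)
{
	unsigned long free = model_free_pages(z);
	unsigned long extra = (1UL << order) - 1;
	int o;

	if (free < extra || free - extra <= mark)
		return false;
	if (order == 0)
		return true;
	for (o = order; o < MODEL_MAX_ORDER; o++)
		if (z->nr_free[o])
			return true;
	return false;
}

/* Mirrors the shape of the patched zone_balanced(), CONFIG_COMPACTION=y. */
static bool model_zone_balanced(const struct model_zone *z, int order,
				bool highorder, unsigned long balance_gap)
{
	unsigned long mark = z->high_wmark + balance_gap;

	if (!highorder) {
		/* Only require room for the request in order-0 pages. */
		mark += 1UL << order;
		order = 0;
	}
	return model_watermark_ok(z, order, mark);
}
```

In this model, a zone with a high watermark of 1000 pages, 4000 free
base pages and no free order-9 block is balanced for an order-9 request
when highorder is false (kswapd may sleep and leave the rest to
kcompactd), but not when highorder is true (a wakeup is still
warranted).
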
For testing, I used stress-highalloc configured to do order-9
allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
phases 1 and 2 work as usual):

stress-highalloc
                       4.5-rc1+before          4.5-rc1+after
                            -nodirect              -nodirect
Success 1 Min          1.00 (  0.00%)         5.00 (-66.67%)
Success 1 Mean         1.40 (  0.00%)         6.20 (-55.00%)
Success 1 Max          2.00 (  0.00%)         7.00 (-16.67%)
Success 2 Min          1.00 (  0.00%)         5.00 (-66.67%)
Success 2 Mean         1.80 (  0.00%)         6.40 (-52.38%)
Success 2 Max          3.00 (  0.00%)         7.00 (-16.67%)
Success 3 Min         34.00 (  0.00%)        62.00 (  1.59%)
Success 3 Mean        41.80 (  0.00%)        63.80 (  1.24%)
Success 3 Max         53.00 (  0.00%)        65.00 (  2.99%)

User                          3166.67                3181.09
System                        1153.37                1158.25
Elapsed                       1768.53                1799.37

                              4.5-rc1+before   4.5-rc1+after
                                   -nodirect       -nodirect
Direct pages scanned                   32938           32797
Kswapd pages scanned                 2183166         2202613
Kswapd pages reclaimed               2152359         2143524
Direct pages reclaimed                 32735           32545
Percentage direct scans                   1%              1%
THP fault alloc                          579             612
THP collapse alloc                       304             316
THP splits                                 0               0
THP fault fallback                       793             778
THP collapse fail                         11              16
Compaction stalls                       1013            1007
Compaction success                        92              67
Compaction failures                      920             939
Page migrate success                  238457          721374
Page migrate failure                   23021           23469
Compaction pages isolated             504695         1479924
Compaction migrate scanned            661390         8812554
Compaction free scanned             13476658        84327916
Compaction cost                          262             838

After this patch we see improvements in allocation success rate
(especially for phase 3) along with increased compaction activity.  The
compaction stalls (direct compaction) in the interfering kernel builds
(probably THP's) also decreased somewhat thanks to kcompactd activity,
yet THP alloc successes improved a bit.

Note that elapsed and user time aren't so useful for this benchmark,
because of the background interference being unpredictable.  They are
there just to quickly spot some major unexpected differences.  System
time is somewhat more useful and that didn't increase.

Also (after adjusting mmtests' ftrace monitor):

Time kswapd awake                    2547781         2269241
Time kcompactd awake                       0          119253
Time direct compacting                939937          557649
Time kswapd compacting                     0               0
Time kcompactd compacting                  0          119099

The decrease of overall time spent compacting appears to not match the
increased compaction stats.  I suspect the tasks get rescheduled and
since the ftrace monitor doesn't see that, the reported time is wall
time, not CPU time.  But arguably direct compactors care about overall
latency anyway, whether busy compacting or waiting for CPU doesn't
matter.  And that latency seems to have almost halved.

It's also interesting how much time kswapd spent awake just going
through all the priorities and failing to even try compacting, over and
over.

We can also configure stress-highalloc to perform both direct
reclaim/compaction and wake up kswapd/kcompactd, by using
GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

stress-highalloc
                       4.5-rc1+before          4.5-rc1+after
                              -direct                -direct
Success 1 Min          4.00 (  0.00%)         9.00 (-50.00%)
Success 1 Mean         8.00 (  0.00%)        10.00 (-19.05%)
Success 1 Max         12.00 (  0.00%)        11.00 ( 15.38%)
Success 2 Min          4.00 (  0.00%)         9.00 (-50.00%)
Success 2 Mean         8.20 (  0.00%)        10.00 (-16.28%)
Success 2 Max         13.00 (  0.00%)        11.00 (  8.33%)
Success 3 Min         75.00 (  0.00%)        74.00 (  1.33%)
Success 3 Mean        75.60 (  0.00%)        75.20 (  0.53%)
Success 3 Max         77.00 (  0.00%)        76.00 (  0.00%)

User                          3344.73                3246.04
System                        1194.24                1172.29
Elapsed                       1838.04                1836.76

                              4.5-rc1+before   4.5-rc1+after
                                     -direct         -direct
Direct pages scanned                  125146          120966
Kswapd pages scanned                 2119757         2135012
Kswapd pages reclaimed               2073183         2108388
Direct pages reclaimed                124909          120577
Percentage direct scans                   5%              5%
THP fault alloc                          599             652
THP collapse alloc                       323             354
THP splits                                 0               0
THP fault fallback                       806             793
THP collapse fail                         17              16
Compaction stalls                       2457            2025
Compaction success                       906             518
Compaction failures                     1551            1507
Page migrate success                 2031423         2360608
Page migrate failure                   32845           40852
Compaction pages isolated            4129761         4802025
Compaction migrate scanned          11996712        21750613
Compaction free scanned            214970969       344372001
Compaction cost                         2271            2694

In this scenario, this patch doesn't change the overall success rate as
direct compaction already tries all it can.  There's however significant
reduction in direct compaction stalls (that is, the number of
allocations that went into direct compaction).  The number of successes
(i.e. direct compaction stalls that ended up with successful
allocation) is reduced by the same number.  This means the offload to
kcompactd is working as expected, and direct compaction is reduced
either due to detecting contention, or compaction deferred by kcompactd.
In the previous version of this patchset there was some apparent
reduction of success rate, but the changes in this version (such as
using sync compaction only), new baseline kernel, and/or averaging
results from 5 executions (my bet), made this go away.

Ftrace-based stats seem to roughly agree:

Time kswapd awake                    2532984         2326824
Time kcompactd awake                       0          257916
Time direct compacting                864839          735130
Time kswapd compacting                     0               0
Time kcompactd compacting                  0          257585

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Diffstat (limited to 'mm/vmscan.c')
-rw-r--r-- mm/vmscan.c 147
1 files changed, 48 insertions, 99 deletions
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5dcc71140108..f87cfaa955a8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2968,18 +2968,23 @@ static void age_active_anon(struct zone *zone, struct scan_control *sc)
 	} while (memcg);
 }
 
-static bool zone_balanced(struct zone *zone, int order,
+static bool zone_balanced(struct zone *zone, int order, bool highorder,
 			unsigned long balance_gap, int classzone_idx)
 {
-	if (!zone_watermark_ok_safe(zone, order, high_wmark_pages(zone) +
-				    balance_gap, classzone_idx))
-		return false;
+	unsigned long mark = high_wmark_pages(zone) + balance_gap;
 
-	if (IS_ENABLED(CONFIG_COMPACTION) && order && compaction_suitable(zone,
-				order, 0, classzone_idx) == COMPACT_SKIPPED)
-		return false;
+	/*
+	 * When checking from pgdat_balanced(), kswapd should stop and sleep
+	 * when it reaches the high order-0 watermark and let kcompactd take
+	 * over. Other callers such as wakeup_kswapd() want to determine the
+	 * true high-order watermark.
+	 */
+	if (IS_ENABLED(CONFIG_COMPACTION) && !highorder) {
+		mark += (1UL << order);
+		order = 0;
+	}
 
-	return true;
+	return zone_watermark_ok_safe(zone, order, mark, classzone_idx);
 }
 
 /*
@@ -3029,7 +3034,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int classzone_idx)
 			continue;
 		}
 
-		if (zone_balanced(zone, order, 0, i))
+		if (zone_balanced(zone, order, false, 0, i))
 			balanced_pages += zone->managed_pages;
 		else if (!order)
 			return false;
@@ -3083,10 +3088,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order, long remaining,
  */
 static bool kswapd_shrink_zone(struct zone *zone,
 			       int classzone_idx,
-			       struct scan_control *sc,
-			       unsigned long *nr_attempted)
+			       struct scan_control *sc)
 {
-	int testorder = sc->order;
 	unsigned long balance_gap;
 	bool lowmem_pressure;
 
@@ -3094,17 +3097,6 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	sc->nr_to_reclaim = max(SWAP_CLUSTER_MAX, high_wmark_pages(zone));
 
 	/*
-	 * Kswapd reclaims only single pages with compaction enabled. Trying
-	 * too hard to reclaim until contiguous free pages have become
-	 * available can hurt performance by evicting too much useful data
-	 * from memory. Do not reclaim more than needed for compaction.
-	 */
-	if (IS_ENABLED(CONFIG_COMPACTION) && sc->order &&
-			compaction_suitable(zone, sc->order, 0, classzone_idx)
-							!= COMPACT_SKIPPED)
-		testorder = 0;
-
-	/*
 	 * We put equal pressure on every zone, unless one zone has way too
 	 * many pages free already. The "too many pages" is defined as the
 	 * high wmark plus a "gap" where the gap is either the low
@@ -3118,15 +3110,12 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * reclaim is necessary
 	 */
 	lowmem_pressure = (buffer_heads_over_limit && is_highmem(zone));
-	if (!lowmem_pressure && zone_balanced(zone, testorder,
+	if (!lowmem_pressure && zone_balanced(zone, sc->order, false,
 						balance_gap, classzone_idx))
 		return true;
 
 	shrink_zone(zone, sc, zone_idx(zone) == classzone_idx);
 
-	/* Account for the number of pages attempted to reclaim */
-	*nr_attempted += sc->nr_to_reclaim;
-
 	clear_bit(ZONE_WRITEBACK, &zone->flags);
 
 	/*
@@ -3136,7 +3125,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
 	 * waits.
 	 */
 	if (zone_reclaimable(zone) &&
-	    zone_balanced(zone, testorder, 0, classzone_idx)) {
+	    zone_balanced(zone, sc->order, false, 0, classzone_idx)) {
 		clear_bit(ZONE_CONGESTED, &zone->flags);
 		clear_bit(ZONE_DIRTY, &zone->flags);
 	}
@@ -3148,7 +3137,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
  * For kswapd, balance_pgdat() will work across all this node's zones until
  * they are all at high_wmark_pages(zone).
  *
- * Returns the final order kswapd was reclaiming at
+ * Returns the highest zone idx kswapd was reclaiming at
  *
  * There is special handling here for zones which are full of pinned pages.
  * This can happen if the pages are all mlocked, or if they are all used by
@@ -3165,8 +3154,7 @@ static bool kswapd_shrink_zone(struct zone *zone,
  * interoperates with the page allocator fallback scheme to ensure that aging
  * of pages is balanced across the zones.
  */
-static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
-							int *classzone_idx)
+static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
 {
 	int i;
 	int end_zone = 0;	/* Inclusive. 0 = ZONE_DMA */
@@ -3183,9 +3171,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 	count_vm_event(PAGEOUTRUN);
 
 	do {
-		unsigned long nr_attempted = 0;
 		bool raise_priority = true;
-		bool pgdat_needs_compaction = (order > 0);
 
 		sc.nr_reclaimed = 0;
 
@@ -3220,7 +3206,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				break;
 			}
 
-			if (!zone_balanced(zone, order, 0, 0)) {
+			if (!zone_balanced(zone, order, false, 0, 0)) {
 				end_zone = i;
 				break;
 			} else {
@@ -3236,24 +3222,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 		if (i < 0)
 			goto out;
 
-		for (i = 0; i <= end_zone; i++) {
-			struct zone *zone = pgdat->node_zones + i;
-
-			if (!populated_zone(zone))
-				continue;
-
-			/*
-			 * If any zone is currently balanced then kswapd will
-			 * not call compaction as it is expected that the
-			 * necessary pages are already available.
-			 */
-			if (pgdat_needs_compaction &&
-					zone_watermark_ok(zone, order,
-						low_wmark_pages(zone),
-						*classzone_idx, 0))
-				pgdat_needs_compaction = false;
-		}
-
 		/*
 		 * If we're getting trouble reclaiming, start doing writepage
 		 * even in laptop mode.
@@ -3297,8 +3265,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 			 * that that high watermark would be met at 100%
 			 * efficiency.
 			 */
-			if (kswapd_shrink_zone(zone, end_zone,
-					       &sc, &nr_attempted))
+			if (kswapd_shrink_zone(zone, end_zone, &sc))
 				raise_priority = false;
 		}
 
@@ -3311,49 +3278,29 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order,
 				pfmemalloc_watermark_ok(pgdat))
 			wake_up_all(&pgdat->pfmemalloc_wait);
 
-		/*
-		 * Fragmentation may mean that the system cannot be rebalanced
-		 * for high-order allocations in all zones. If twice the
-		 * allocation size has been reclaimed and the zones are still
-		 * not balanced then recheck the watermarks at order-0 to
-		 * prevent kswapd reclaiming excessively. Assume that a
-		 * process requested a high-order can direct reclaim/compact.
-		 */
-		if (order && sc.nr_reclaimed >= 2UL << order)
-			order = sc.order = 0;
-
 		/* Check if kswapd should be suspending */
 		if (try_to_freeze() || kthread_should_stop())
 			break;
 
 		/*
-		 * Compact if necessary and kswapd is reclaiming at least the
-		 * high watermark number of pages as requsted
-		 */
-		if (pgdat_needs_compaction && sc.nr_reclaimed > nr_attempted)
-			compact_pgdat(pgdat, order);
-
-		/*
 		 * Raise priority if scanning rate is too low or there was no
 		 * progress in reclaiming pages
 		 */
 		if (raise_priority || !sc.nr_reclaimed)
 			sc.priority--;
 	} while (sc.priority >= 1 &&
-		 !pgdat_balanced(pgdat, order, *classzone_idx));
+			!pgdat_balanced(pgdat, order, classzone_idx));
 
 out:
 	/*
-	 * Return the order we were reclaiming at so prepare_kswapd_sleep()
-	 * makes a decision on the order we were last reclaiming at. However,
-	 * if another caller entered the allocator slow path while kswapd
-	 * was awake, order will remain at the higher level
+	 * Return the highest zone idx we were reclaiming at so
+	 * prepare_kswapd_sleep() makes the same decisions as here.
 	 */
-	*classzone_idx = end_zone;
-	return order;
+	return end_zone;
 }
 
-static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
+static void kswapd_try_to_sleep(pg_data_t *pgdat, int order,
+				int classzone_idx, int balanced_classzone_idx)
 {
 	long remaining = 0;
 	DEFINE_WAIT(wait);
@@ -3364,7 +3311,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
 
 	/* Try to sleep for a short interval */
-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining,
+						balanced_classzone_idx)) {
 		remaining = schedule_timeout(HZ/10);
 		finish_wait(&pgdat->kswapd_wait, &wait);
 		prepare_to_wait(&pgdat->kswapd_wait, &wait, TASK_INTERRUPTIBLE);
@@ -3374,7 +3322,8 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 	 * After a short sleep, check if it was a premature sleep. If not, then
 	 * go fully to sleep until explicitly woken up.
 	 */
-	if (prepare_kswapd_sleep(pgdat, order, remaining, classzone_idx)) {
+	if (prepare_kswapd_sleep(pgdat, order, remaining,
+						balanced_classzone_idx)) {
 		trace_mm_vmscan_kswapd_sleep(pgdat->node_id);
 
 		/*
@@ -3395,6 +3344,12 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 		 */
 		reset_isolation_suitable(pgdat);
 
+		/*
+		 * We have freed the memory, now we should compact it to make
+		 * allocation of the requested order possible.
+		 */
+		wakeup_kcompactd(pgdat, order, classzone_idx);
+
 		if (!kthread_should_stop())
 			schedule();
 
@@ -3424,7 +3379,6 @@ static void kswapd_try_to_sleep(pg_data_t *pgdat, int order, int classzone_idx)
 static int kswapd(void *p)
 {
 	unsigned long order, new_order;
-	unsigned balanced_order;
 	int classzone_idx, new_classzone_idx;
 	int balanced_classzone_idx;
 	pg_data_t *pgdat = (pg_data_t*)p;
@@ -3457,23 +3411,19 @@ static int kswapd(void *p)
 	set_freezable();
 
 	order = new_order = 0;
-	balanced_order = 0;
 	classzone_idx = new_classzone_idx = pgdat->nr_zones - 1;
 	balanced_classzone_idx = classzone_idx;
 	for ( ; ; ) {
 		bool ret;
 
 		/*
-		 * If the last balance_pgdat was unsuccessful it's unlikely a
-		 * new request of a similar or harder type will succeed soon
-		 * so consider going to sleep on the basis we reclaimed at
+		 * While we were reclaiming, there might have been another
+		 * wakeup, so check the values.
 		 */
-		if (balanced_order == new_order) {
-			new_order = pgdat->kswapd_max_order;
-			new_classzone_idx = pgdat->classzone_idx;
-			pgdat->kswapd_max_order = 0;
-			pgdat->classzone_idx = pgdat->nr_zones - 1;
-		}
+		new_order = pgdat->kswapd_max_order;
+		new_classzone_idx = pgdat->classzone_idx;
+		pgdat->kswapd_max_order = 0;
+		pgdat->classzone_idx = pgdat->nr_zones - 1;
 
 		if (order < new_order || classzone_idx > new_classzone_idx) {
 			/*
@@ -3483,7 +3433,7 @@ static int kswapd(void *p)
 			order = new_order;
 			classzone_idx = new_classzone_idx;
 		} else {
-			kswapd_try_to_sleep(pgdat, balanced_order,
+			kswapd_try_to_sleep(pgdat, order, classzone_idx,
 						balanced_classzone_idx);
 			order = pgdat->kswapd_max_order;
 			classzone_idx = pgdat->classzone_idx;
@@ -3503,9 +3453,8 @@ static int kswapd(void *p)
 		 */
 		if (!ret) {
 			trace_mm_vmscan_kswapd_wake(pgdat->node_id, order);
-			balanced_classzone_idx = classzone_idx;
-			balanced_order = balance_pgdat(pgdat, order,
-						&balanced_classzone_idx);
+			balanced_classzone_idx = balance_pgdat(pgdat, order,
+								classzone_idx);
 		}
 	}
 
@@ -3535,7 +3484,7 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
 	}
 	if (!waitqueue_active(&pgdat->kswapd_wait))
 		return;
-	if (zone_balanced(zone, order, 0, 0))
+	if (zone_balanced(zone, order, true, 0, 0))
 		return;
 
 	trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
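
As an aside to the failure scenario in the commit message, the compaction
trigger removed from balance_pgdat() by this patch can be checked against
the reported numbers.  This is a minimal sketch: both function names are
hypothetical, and the >= variant is only the alternative the commit
message mentions, not code from this patch.

```c
#include <stdbool.h>

/* Trigger as written in the removed balance_pgdat() code. */
static bool old_compaction_trigger(bool pgdat_needs_compaction,
				   unsigned long nr_reclaimed,
				   unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed > nr_attempted;
}

/* Hypothetical variant using >=, as the removed comment implied. */
static bool ge_compaction_trigger(bool pgdat_needs_compaction,
				  unsigned long nr_reclaimed,
				  unsigned long nr_attempted)
{
	return pgdat_needs_compaction && nr_reclaimed >= nr_attempted;
}
```

With nr_reclaimed=0 the old trigger stays false for both nr_attempted=99
(reclaim attempted on the DMA zone) and nr_attempted=0 (no reclaim
attempted at all), so compact_pgdat() was never reached in the reported
scenario.  The >= variant would fire only in the second case.
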