author Mel Gorman <mgorman@techsingularity.net> 2018-12-28 03:35:41 -0500
committer Linus Torvalds <torvalds@linux-foundation.org> 2018-12-28 15:11:48 -0500
commit 6bb154504f8b496780ec53ec81aba957a12981fa
tree dbed6f65d0ff903edb43788301f2969fa37d76df /mm/page_alloc.c
parent f29d8e9c0191a2a02500945db505e5c89159c3f4
mm, page_alloc: spread allocations across zones before introducing fragmentation

Patch series "Fragmentation avoidance improvements", v5.

It has been noted before that fragmentation avoidance (aka
anti-fragmentation) is not perfect. Given sufficient time or an adverse
workload, memory gets fragmented and the long-term success of high-order
allocations degrades. This series defines an adverse workload, a
definition of external fragmentation events (including serious ones) and
a set of patches that reduces the level of those fragmentation events.

The details of the workload and the consequences are described in more
detail in the changelogs. However, from patch 1, this is a high-level
summary of the adverse workload. The exact details are found in the
mmtests implementation.

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch)
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate is
   not used (fallocate=none). With multiple IO issuers this creates a mix
   of slab and page cache allocations over time. The total size of the
   files is 150% of physical memory so that the slabs and page cache
   pages get mixed
3. Warm up a number of fio read-only threads accessing the same files
   created in step 2. This part runs for the same length of time it took
   to create the files. It'll fault back in old data and further
   interleave slab and page cache allocations. As it's now low on memory
   due to step 2, fragmentation occurs as pageblocks get stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads. The number of
   threads is (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP threads
   contending with fio, any other threads or forcing cross-NUMA
   scheduling. Note that the test has not been used on a machine with
   fewer than 8 cores.
   The benchmark records whether huge pages were allocated and what the
   fault latency was in microseconds
5. Measure the number of events potentially causing external
   fragmentation, the fault latency and the huge page allocation success
   rate.
6. Cleanup

Overall the series reduces external fragmentation causing events by over
94% on 1 and 2 socket machines, which in turn impacts high-order
allocation success rates over the long term. There are differences in
latencies and high-order allocation success rates. Latencies are a mixed
bag as they are vulnerable to exact system state and whether allocations
succeeded so they are treated as a secondary metric.

Patch 1 uses lower zones if they are populated and have free memory
	instead of fragmenting a higher zone. It's special cased to
	handle a Normal->DMA32 fallback with the reasons explained
	in the changelog.

Patch 2-4 boosts watermarks temporarily when an external fragmentation
	event occurs. kswapd wakes to reclaim a small amount of old
	memory and then wakes kcompactd on completion to recover the
	system slightly. This introduces some overhead in the slowpath.
	The level of boosting can be tuned or disabled depending on the
	tolerance for fragmentation vs allocation latency.

Patch 5 stalls some movable allocation requests to let kswapd from
	patch 4 make some progress. The duration of the stalls is very
	low but it is possible to tune the system to avoid fragmentation
	events if larger stalls can be tolerated.

The bulk of the improvement in fragmentation avoidance is from patches
1-4 but patch 5 can deal with a rare corner case and provides the option
of tuning a system for THP allocation success rates in exchange for some
stalls to control fragmentation.

This patch (of 5):

The page allocator zone lists are iterated based on the watermarks of
each zone which does not take anti-fragmentation into account. On x86,
node 0 may have multiple zones while other nodes have one zone.
A consequence is that tasks running on node 0 may fragment ZONE_NORMAL
even though ZONE_DMA32 has plenty of free memory. This patch special
cases the allocator fast path such that it'll try an allocation from a
lower local zone before fragmenting a higher zone. In this case, stealing
of pageblocks or orders larger than a pageblock is still allowed in the
fast path as they are uninteresting from a fragmentation point of view.

This was evaluated using a benchmark designed to fragment memory before
attempting THP allocations. It's implemented in mmtests as the following
configurations

  configs/config-global-dhp__workload_thpfioscale
  configs/config-global-dhp__workload_thpfioscale-defrag
  configs/config-global-dhp__workload_thpfioscale-madvhugepage

e.g. from mmtests

  ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

The broad details of the workload are as follows:

1. Create an XFS filesystem (not specified in the configuration but done
   as part of the testing for this patch).
2. Start 4 fio threads that write a number of 64K files inefficiently.
   Inefficiently means that files are created on first access and not
   created in advance (fio parameter create_on_open=1) and fallocate is
   not used (fallocate=none). With multiple IO issuers this creates a mix
   of slab and page cache allocations over time. The total size of the
   files is 150% of physical memory so that the slabs and page cache
   pages get mixed.
3. Warm up a number of fio read-only processes accessing the same files
   created in step 2. This part runs for the same length of time it took
   to create the files. It'll refault old data and further interleave
   slab and page cache allocations. As it's now low on memory due to
   step 2, fragmentation occurs as pageblocks get stolen.
4. While step 3 is still running, start a process that tries to allocate
   75% of memory as huge pages with a number of threads.
   The number of threads is (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid
   THP threads contending with fio, any other threads or forcing
   cross-NUMA scheduling. Note that the test has not been used on a
   machine with fewer than 8 cores. The benchmark records whether huge
   pages were allocated and what the fault latency was in microseconds.
5. Measure the number of events potentially causing external
   fragmentation, the fault latency and the huge page allocation success
   rate.
6. Cleanup the test files.

Note that due to the use of IO and page cache, this benchmark is not
suitable for running on large machines where the time to fragment memory
may be excessive. Also note that while this is one mix that generates
fragmentation, it's not the only mix that generates fragmentation.
Differences in workload that are more slab-intensive or whether SLUB is
used with high-order pages may yield different results.

When the page allocator fragments memory, it records the event using the
mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
a pageblock order (order-9 on 64-bit x86) then it's considered to be an
"external fragmentation event" that may cause issues in the future.
Hence, the primary metric here is the number of external fragmentation
events that occur with order < 9. The secondary metrics are allocation
latency and huge page allocation success rates, but note that differences
in latencies and in the success rate can also affect the number of
external fragmentation events, which is why they are secondary metrics.
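
As a rough illustration of the two rules above, the thread-count formula
from step 4 and the "external fragmentation event" test can be written as
a few lines of userspace C. This is only a sketch: the function names and
the PAGEBLOCK_ORDER_X86_64 constant are invented for this example and are
not kernel or mmtests identifiers, and plain integer division is an
assumption about how the formula is rounded.

```c
#include <stdbool.h>

/* Pageblock order on 64-bit x86, as stated in the changelog. */
#define PAGEBLOCK_ORDER_X86_64	9

/*
 * Number of THP allocating threads for the benchmark:
 * (NR_CPUS_SOCKET - NR_FIO_THREADS) / 4, with integer division assumed.
 */
static int thp_threads(int nr_cpus_socket, int nr_fio_threads)
{
	return (nr_cpus_socket - nr_fio_threads) / 4;
}

/*
 * A fallback counts as an external fragmentation event when the order of
 * the stolen page is smaller than a pageblock.
 */
static bool is_extfrag_event(int fallback_order)
{
	return fallback_order < PAGEBLOCK_ORDER_X86_64;
}
```

With hypothetical inputs, thp_threads(8, 4) evaluates to 1 and
thp_threads(24, 4) to 5. A fallback that steals an order-3 page is
counted as an event, while one that steals a whole order-9 pageblock is
not.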

1-socket Skylake machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 1 THP allocating thread
--------------------------------------

4.20-rc3 extfrag events < order 9:   804694
4.20-rc3+patch:                      408912 (49% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)

                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Percentage huge-1             0.00 (   0.00%)        0.00 (   0.00%)

Fault latencies are slightly reduced while allocation success rates
remain at zero as this configuration does not make any special effort to
allocate THP and fio is heavily active at the time and either filling
memory or keeping pages resident. However, a 49% reduction of serious
fragmentation events reduces the chances of external fragmentation being
a problem in the future.

Vlastimil asked during review for a breakdown of the allocation types
that are falling back.

vanilla
   3816 MIGRATE_UNMOVABLE
 800845 MIGRATE_MOVABLE
     33 MIGRATE_RECLAIMABLE

patch
    735 MIGRATE_UNMOVABLE
 408135 MIGRATE_MOVABLE
     42 MIGRATE_RECLAIMABLE

The majority of the fallbacks are due to movable allocations and this is
consistent for the workload throughout the series so will not be
presented again as the primary source of fallbacks is movable
allocations.

Movable allocations are sometimes considered "ok" to fall back because
they can be migrated. The problem is that they can fill an
unmovable/reclaimable pageblock, causing those allocations to fall back
later and polluting pageblocks with pages that cannot move. If there is a
movable fallback, it is pretty much guaranteed to affect an
unmovable/reclaimable pageblock and while it might not be enough to
actually cause an unmovable/reclaimable fallback in the future, we cannot
know that in advance so the patch takes the only option available to it.
Hence, it's important to control them.
This point is also consistent throughout the series and will not be
repeated.

1-socket Skylake machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  291392
4.20-rc3+patch:                     191187 (34% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)

thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Percentage huge-1            78.57 (   0.00%)       77.64 (  -1.18%)

Fragmentation events were reduced quite a bit although this is known to
be a little variable. The latencies and allocation success rates are
similar but they were already quite high.

2-socket Haswell machine
config-global-dhp__workload_thpfioscale XFS (no special madvise)
4 fio threads, 5 THP allocating threads
----------------------------------------------------------------

4.20-rc3 extfrag events < order 9:  215698
4.20-rc3+patch:                     200210 (7% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)

                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Percentage huge-5             1.15 (   0.00%)        0.78 ( -31.88%)

The reduction of external fragmentation events is slight and this is
partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
allocations can now spill over to remote nodes instead of fragmenting
local memory.
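
The reduction percentages quoted for the three configurations so far
follow directly from the raw event counts. A minimal sketch of that
arithmetic, using only counts reported above (pct_reduction is a name
invented for this example):

```c
/*
 * Percentage by which the event count dropped from "before" to "after".
 * The caller is expected to pass a non-zero "before".
 */
static double pct_reduction(unsigned long before, unsigned long after)
{
	return 100.0 * ((double)before - (double)after) / (double)before;
}
```

pct_reduction(804694, 408912) is roughly 49.2, pct_reduction(291392,
191187) roughly 34.4 and pct_reduction(215698, 200210) roughly 7.2, which
round to the 49%, 34% and 7% figures in the changelog.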

2-socket Haswell machine
global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
-----------------------------------------------------------------

4.20-rc3 extfrag events < order 9: 166352
4.20-rc3+patch:                    147463 (11% reduction)

thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*

thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
Percentage huge-5            96.82 (   0.00%)       95.14 (  -1.74%)

There was a slight reduction in external fragmentation events although
the latencies were higher. The allocation success rate is high enough
that the system is struggling and there is quite a lot of parallel
reclaim and compaction activity. There is also a certain degree of luck
on whether processes start on node 0 or not for this patch but the
relevance is reduced later in the series.

Overall, the patch reduces the number of external fragmentation causing
events so the success of THP over long periods of time would be improved
for this adverse workload.

Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Diffstat (limited to 'mm/page_alloc.c')
-rw-r--r-- mm/page_alloc.c 108
1 files changed, 96 insertions, 12 deletions
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 298b449a03c7..251b8a0c9c5d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2375,20 +2375,30 @@ static bool unreserve_highatomic_pageblock(const struct alloc_context *ac,
  * condition simpler.
  */
 static __always_inline bool
-__rmqueue_fallback(struct zone *zone, int order, int start_migratetype)
+__rmqueue_fallback(struct zone *zone, int order, int start_migratetype,
+						unsigned int alloc_flags)
 {
 	struct free_area *area;
 	int current_order;
+	int min_order = order;
 	struct page *page;
 	int fallback_mt;
 	bool can_steal;
 
 	/*
+	 * Do not steal pages from freelists belonging to other pageblocks
+	 * i.e. orders < pageblock_order. If there are no local zones free,
+	 * the zonelists will be reiterated without ALLOC_NOFRAGMENT.
+	 */
+	if (alloc_flags & ALLOC_NOFRAGMENT)
+		min_order = pageblock_order;
+
+	/*
 	 * Find the largest available free page in the other list. This roughly
 	 * approximates finding the pageblock with the most free pages, which
 	 * would be too costly to do exactly.
 	 */
-	for (current_order = MAX_ORDER - 1; current_order >= order;
+	for (current_order = MAX_ORDER - 1; current_order >= min_order;
 				--current_order) {
 		area = &(zone->free_area[current_order]);
 		fallback_mt = find_suitable_fallback(area, current_order,
@@ -2447,7 +2457,8 @@ do_steal:
  * Call me with the zone->lock already held.
  */
 static __always_inline struct page *
-__rmqueue(struct zone *zone, unsigned int order, int migratetype)
+__rmqueue(struct zone *zone, unsigned int order, int migratetype,
+						unsigned int alloc_flags)
 {
 	struct page *page;
 
@@ -2457,7 +2468,8 @@ retry:
 		if (migratetype == MIGRATE_MOVABLE)
 			page = __rmqueue_cma_fallback(zone, order);
 
-		if (!page && __rmqueue_fallback(zone, order, migratetype))
+		if (!page && __rmqueue_fallback(zone, order, migratetype,
+								alloc_flags))
 			goto retry;
 	}
 
@@ -2472,13 +2484,14 @@ retry:
  */
 static int rmqueue_bulk(struct zone *zone, unsigned int order,
 			unsigned long count, struct list_head *list,
-			int migratetype)
+			int migratetype, unsigned int alloc_flags)
 {
 	int i, alloced = 0;
 
 	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order, migratetype);
+		struct page *page = __rmqueue(zone, order, migratetype,
+								alloc_flags);
 		if (unlikely(page == NULL))
 			break;
 
@@ -2934,6 +2947,7 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
 
 /* Remove page from the per-cpu list, caller must protect the list */
 static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+			unsigned int alloc_flags,
 			struct per_cpu_pages *pcp,
 			struct list_head *list)
 {
@@ -2943,7 +2957,7 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 	if (list_empty(list)) {
 		pcp->count += rmqueue_bulk(zone, 0,
 				pcp->batch, list,
-				migratetype);
+				migratetype, alloc_flags);
 		if (unlikely(list_empty(list)))
 			return NULL;
 	}
@@ -2959,7 +2973,8 @@ static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
 /* Lock and remove page from the per-cpu list */
 static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 			struct zone *zone, unsigned int order,
-			gfp_t gfp_flags, int migratetype)
+			gfp_t gfp_flags, int migratetype,
+			unsigned int alloc_flags)
 {
 	struct per_cpu_pages *pcp;
 	struct list_head *list;
@@ -2969,7 +2984,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
 	local_irq_save(flags);
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list = &pcp->lists[migratetype];
-	page = __rmqueue_pcplist(zone, migratetype, pcp, list);
+	page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
 	if (page) {
 		__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
 		zone_statistics(preferred_zone, zone);
@@ -2992,7 +3007,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 
 	if (likely(order == 0)) {
 		page = rmqueue_pcplist(preferred_zone, zone, order,
-				gfp_flags, migratetype);
+				gfp_flags, migratetype, alloc_flags);
 		goto out;
 	}
 
@@ -3011,7 +3026,7 @@ struct page *rmqueue(struct zone *preferred_zone,
 				trace_mm_page_alloc_zone_locked(page, order, migratetype);
 		}
 		if (!page)
-			page = __rmqueue(zone, order, migratetype);
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
 	} while (page && check_new_pages(page, order));
 	spin_unlock(&zone->lock);
 	if (!page)
@@ -3253,6 +3268,40 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 }
 #endif	/* CONFIG_NUMA */
 
+#ifdef CONFIG_ZONE_DMA32
+/*
+ * The restriction on ZONE_DMA32 as being a suitable zone to use to avoid
+ * fragmentation is subtle. If the preferred zone was HIGHMEM then
+ * premature use of a lower zone may cause lowmem pressure problems that
+ * are worse than fragmentation. If the next zone is ZONE_DMA then it is
+ * probably too small. It only makes sense to spread allocations to avoid
+ * fragmentation between the Normal and DMA32 zones.
+ */
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone)
+{
+	if (zone_idx(zone) != ZONE_NORMAL)
+		return 0;
+
+	/*
+	 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
+	 * the pointer is within zone->zone_pgdat->node_zones[]. Also assume
+	 * on UMA that if Normal is populated then so is DMA32.
+	 */
+	BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
+	if (nr_online_nodes > 1 && !populated_zone(--zone))
+		return 0;
+
+	return ALLOC_NOFRAGMENT;
+}
+#else
+static inline unsigned int
+alloc_flags_nofragment(struct zone *zone)
+{
+	return 0;
+}
+#endif
+
 /*
  * get_page_from_freelist goes through the zonelist trying to allocate
  * a page.
@@ -3261,14 +3310,18 @@ static struct page *
 get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 						const struct alloc_context *ac)
 {
-	struct zoneref *z = ac->preferred_zoneref;
+	struct zoneref *z;
 	struct zone *zone;
 	struct pglist_data *last_pgdat_dirty_limit = NULL;
+	bool no_fallback;
 
+retry:
 	/*
 	 * Scan zonelist, looking for a zone with enough free.
 	 * See also __cpuset_node_allowed() comment in kernel/cpuset.c.
 	 */
+	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
+	z = ac->preferred_zoneref;
 	for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		struct page *page;
@@ -3307,6 +3360,22 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
 			}
 		}
 
+		if (no_fallback && nr_online_nodes > 1 &&
+		    zone != ac->preferred_zoneref->zone) {
+			int local_nid;
+
+			/*
+			 * If moving to a remote node, retry but allow
+			 * fragmenting fallbacks. Locality is more important
+			 * than fragmentation avoidance.
+			 */
+			local_nid = zone_to_nid(ac->preferred_zoneref->zone);
+			if (zone_to_nid(zone) != local_nid) {
+				alloc_flags &= ~ALLOC_NOFRAGMENT;
+				goto retry;
+			}
+		}
+
 		mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac_classzone_idx(ac), alloc_flags)) {
@@ -3374,6 +3443,15 @@ try_this_zone:
 		}
 	}
 
+	/*
+	 * It's possible on a UMA machine to get through all zones that are
+	 * fragmented. If avoiding fragmentation, reset and try again.
+	 */
+	if (no_fallback) {
+		alloc_flags &= ~ALLOC_NOFRAGMENT;
+		goto retry;
+	}
+
 	return NULL;
 }
 
@@ -4369,6 +4447,12 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
 
 	finalise_ac(gfp_mask, &ac);
 
+	/*
+	 * Forbid the first pass from falling back to types that fragment
+	 * memory until all local zones are considered.
+	 */
+	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone);
+
 	/* First allocation attempt */
 	page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
 	if (likely(page))
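
The control flow this patch adds to get_page_from_freelist() and
__rmqueue_fallback() can be modelled in userspace as a two-pass scan: the
first pass refuses fallbacks below the pageblock order, and the scan
restarts without that restriction either when it is about to leave the
local node or when every zone has been tried. The following is a toy
sketch only. struct toy_zone, the toy_* functions and the TOY_* constants
are invented for illustration, the flag value is arbitrary, and
watermarks, cpusets and the nr_online_nodes checks are deliberately left
out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only; the real definitions live in the kernel. */
#define TOY_ALLOC_NOFRAGMENT	0x1u
#define TOY_PAGEBLOCK_ORDER	9

/* Hypothetical, heavily simplified stand-in for struct zone. */
struct toy_zone {
	int nid;		/* NUMA node the zone belongs to */
	int own_max_order;	/* largest free order of the wanted type, -1 if none */
	int other_max_order;	/* largest free order of other types, -1 if none */
};

/*
 * Try one zone. Stealing from another migratetype is only permitted at
 * or above min_order, which is raised to the pageblock order while
 * TOY_ALLOC_NOFRAGMENT is set.
 */
static bool toy_try_zone(const struct toy_zone *z, int order,
			 unsigned int alloc_flags, bool *fragmented)
{
	int min_order = order;

	if (z->own_max_order >= order) {
		*fragmented = false;
		return true;
	}

	if (alloc_flags & TOY_ALLOC_NOFRAGMENT)
		min_order = TOY_PAGEBLOCK_ORDER;

	if (z->other_max_order >= order && z->other_max_order >= min_order) {
		/* Stealing less than a pageblock is a fragmentation event. */
		*fragmented = z->other_max_order < TOY_PAGEBLOCK_ORDER;
		return true;
	}
	return false;
}

/*
 * Walk the zonelist, where zones[0] is the preferred zone. Returns the
 * index of the zone used, or -1 if every zone failed, in which case
 * *fragmented is left untouched.
 */
static int toy_get_page(const struct toy_zone *zones, size_t nr, int order,
			unsigned int alloc_flags, bool *fragmented)
{
	size_t i;

retry:
	for (i = 0; i < nr; i++) {
		/* Locality wins: drop the restriction before going remote. */
		if ((alloc_flags & TOY_ALLOC_NOFRAGMENT) &&
		    zones[i].nid != zones[0].nid) {
			alloc_flags &= ~TOY_ALLOC_NOFRAGMENT;
			goto retry;
		}
		if (toy_try_zone(&zones[i], order, alloc_flags, fragmented))
			return (int)i;
	}

	/* All zones failed without fragmenting: allow it and rescan. */
	if (alloc_flags & TOY_ALLOC_NOFRAGMENT) {
		alloc_flags &= ~TOY_ALLOC_NOFRAGMENT;
		goto retry;
	}
	return -1;
}
```

With a local "Normal" zone that can only satisfy an order-0 request by
stealing an order-3 page and a local "DMA32" zone that has free pages of
the right type, the model picks the second zone when TOY_ALLOC_NOFRAGMENT
is set and the first zone (recording a fragmentation event) when it is
not. If the second zone sits on a different node, the model retries and
fragments the local zone instead, matching the "locality is more
important than fragmentation avoidance" comment in the patch.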