author    Andrea Arcangeli <aarcange@redhat.com>    2019-08-13 18:37:53 -0400
committer Linus Torvalds <torvalds@linux-foundation.org>    2019-08-13 19:06:52 -0400
commit    a8282608c88e08b1782141026eab61204c1e533f (patch)
tree      d1705bf3b5d7429bb8c5027eaeae49e49bee8465
parent    92717d429b38e4f9f934eed7e605cc42858f1839 (diff)
Revert "mm, thp: restore node-local hugepage allocations"
This reverts commit 2f0799a0ffc033b ("mm, thp: restore node-local hugepage allocations").

Commit 2f0799a0ffc033b was rightfully applied to avoid the risk of a severe regression that was reported by the kernel test robot at the end of the merge window. Now we understand that the regression was a false positive, caused by a significant increase in fairness during a swap thrashing benchmark. So it's safe to re-apply the fix and continue improving the code from there. The benchmark that reported the regression is very useful, but it provides a meaningful result only when there is no significant alteration in fairness during the workload. The removal of __GFP_THISNODE increased fairness.

__GFP_THISNODE cannot be used in the generic page fault path for new memory allocations under the MPOL_DEFAULT mempolicy, or the allocation behavior significantly deviates from what the MPOL_DEFAULT semantics are supposed to be for THP and 4k allocations alike.

Setting THP defrag to "always" or using MADV_HUGEPAGE (with THP defrag set to "madvise") was never meant to provide an implicit MPOL_BIND on the "current" node the task is running on; doing so caused swap storms and a much more aggressive behavior than even zone_reclaim_mode = 3.

Any workload that could have benefited from __GFP_THISNODE now has to enable zone_reclaim_mode = 1, 2 or 3 instead. __GFP_THISNODE implicitly provided the zone_reclaim_mode behavior, but it only did so if THP was enabled: if THP was disabled, there would have been no chance to get any 4k page from the current node if the current node was full of pagecache, which further shows how this __GFP_THISNODE was misplaced in MADV_HUGEPAGE. MADV_HUGEPAGE was never intended to provide any zone_reclaim_mode semantics; in fact the two are orthogonal: zone_reclaim_mode = 1|2|3 must work exactly the same with MADV_HUGEPAGE set or not.

The performance characteristics of memory depend on the hardware details. The numbers below were obtained on the Naples/EPYC architecture, and the N/A projections extend them to show what we should aim for in the future as a good THP NUMA locality default. The benchmark used exercises random memory seeks (note: the cost of the page faults is not part of the measurement).

D0 THP | D0 4k  | D1 THP | D1 4k  | D2 THP | D2 4k  | D3 THP | D3 4k | ...
0%     | +43%   | +45%   | +106%  | +131%  | +224%  | N/A    | N/A

D0 means distance zero (i.e. local memory), D1 means distance one (i.e. intra-socket memory), D2 means distance two (i.e. inter-socket memory), etc.

For the guest physical memory allocated by qemu and for the guest mode kernel, the performance characteristics of RAM are more complex, and an ideal default could be:

D0 THP | D1 THP | D0 4k  | D2 THP | D1 4k  | D3 THP | D2 4k | D3 4k | ...
0%     | +58%   | +101%  | N/A    | +222%  | N/A    | N/A   | N/A

NOTE: the N/A entries are projections and haven't been measured yet; the measurement in this case was done on a 1950x with only two NUMA nodes. The THP case here means THP was used both in the host and in the guest.

After applying this commit, the THP NUMA locality order that we get out of MADV_HUGEPAGE is:

D0 THP | D1 THP | D2 THP | D3 THP | ... | D0 4k | D1 4k | D2 4k | D3 4k | ...

Before this commit it was:

D0 THP | D0 4k | D1 4k | D2 4k | D3 4k | ...

Even if we ignore the breakage of large workloads that can't fit in a single node, caused by the implicit "current node" mbind of __GFP_THISNODE, the THP NUMA locality order provided by __GFP_THISNODE was still not the one we shall aim for in the long term (i.e. the first one at the top).
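As an illustration of the opt-in above, a minimal userspace sketch follows; the 1 GiB mapping size and the sysctl value are arbitrary examples, not part of this patch:

	/* Opt in to node reclaim globally, e.g.:
	 *   echo 1 > /proc/sys/vm/zone_reclaim_mode
	 */
	#include <stddef.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 1UL << 30;	/* 1 GiB region, arbitrary */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			return 1;
		/* Hint THP for the range; NUMA placement stays with mempolicy */
		madvise(p, len, MADV_HUGEPAGE);
		return 0;
	}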
After this commit is applied, we can introduce a new allocator multi-order API to replace the two alloc_pages_vma() calls in the page fault path with a single multi-order call:

	unsigned int order = (1 << HPAGE_PMD_ORDER) | (1 << 0);
	page = alloc_pages_multi_order(..., &order);
	if (!page)
		goto out;
	if (!(order & (1 << 0))) {
		VM_WARN_ON(order != 1 << HPAGE_PMD_ORDER);
		/* THP fault */
	} else {
		VM_WARN_ON(order != 1 << 0);
		/* 4k fallback */
	}

The page allocator logic then has to be altered so that when it fails on any zone with order 9, it tries again with order 0 on the same zone before falling back to the next zone in the zonelist.

After that we need to do more measurements and evaluate whether adding an opt-in feature for guest mode is worth it, to swap "DN 4k | DN+1 THP" with "DN+1 THP | DN 4k" at every NUMA distance crossing.

Link: http://lkml.kernel.org/r/20190503223146.2312-3-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: David Rientjes <rientjes@google.com>
Cc: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
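The zonelist walk described above could look roughly like the following stand-alone sketch; struct zone and try_alloc() are illustrative placeholders, not existing kernel APIs:

	#include <stdbool.h>
	#include <stddef.h>

	struct zone;	/* stand-in for the kernel's struct zone */
	bool try_alloc(struct zone *z, unsigned int order, void **page);

	/* Walk the zonelist; try order 9 and then order 0 in each zone
	 * before moving on to the next (possibly more remote) zone. */
	void *alloc_multi_order(struct zone **zonelist, size_t nr_zones,
				unsigned int *orders)
	{
		void *page;

		for (size_t i = 0; i < nr_zones; i++) {
			if ((*orders & (1u << 9)) &&
			    try_alloc(zonelist[i], 9, &page)) {
				*orders = 1u << 9;	/* THP succeeded */
				return page;
			}
			if ((*orders & (1u << 0)) &&
			    try_alloc(zonelist[i], 0, &page)) {
				*orders = 1u << 0;	/* 4k fallback */
				return page;
			}
		}
		return NULL;
	}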
-rw-r--r--include/linux/mempolicy.h2
-rw-r--r--mm/huge_memory.c42
-rw-r--r--mm/mempolicy.c2
3 files changed, 29 insertions, 17 deletions
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 5228c62af416..bac395f1d00a 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -139,6 +139,8 @@ struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
 struct mempolicy *get_task_policy(struct task_struct *p);
 struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr);
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+		unsigned long addr);
 bool vma_policy_mof(struct vm_area_struct *vma);
 
 extern void numa_default_policy(void);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f7e388b8662d..738065f765ab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -647,27 +647,37 @@ release:
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma, unsigned long addr)
 {
 	const bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
-	const gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT | __GFP_THISNODE;
+	gfp_t this_node = 0;
 
-	/* Always do synchronous compaction */
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		return GFP_TRANSHUGE | __GFP_THISNODE |
-		       (vma_madvised ? 0 : __GFP_NORETRY);
-
-	/* Kick kcompactd and fail quickly */
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | __GFP_KSWAPD_RECLAIM;
-
-	/* Synchronous compaction if madvised, otherwise kick kcompactd */
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM :
-							     __GFP_KSWAPD_RECLAIM);
-
-	/* Only do synchronous compaction if madvised */
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
-		return gfp_mask | (vma_madvised ? __GFP_DIRECT_RECLAIM : 0);
-
-	return gfp_mask;
+#ifdef CONFIG_NUMA
+	struct mempolicy *pol;
+	/*
+	 * __GFP_THISNODE is used only when __GFP_DIRECT_RECLAIM is not
+	 * specified, to express a general desire to stay on the current
+	 * node for optimistic allocation attempts. If the defrag mode
+	 * and/or madvise hint requires the direct reclaim then we prefer
+	 * to fallback to other node rather than node reclaim because that
+	 * can lead to excessive reclaim even though there is free memory
+	 * on other nodes. We expect that NUMA preferences are specified
+	 * by memory policies.
+	 */
+	pol = get_vma_policy(vma, addr);
+	if (pol->mode != MPOL_BIND)
+		this_node = __GFP_THISNODE;
+	mpol_cond_put(pol);
+#endif
+
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM | this_node;
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_OR_MADV_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+							     __GFP_KSWAPD_RECLAIM | this_node);
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | (vma_madvised ? __GFP_DIRECT_RECLAIM :
+							     this_node);
+	return GFP_TRANSHUGE_LIGHT | this_node;
 }
 
 /* Caller must hold page table lock. */
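For illustration, with the hunk above an explicit MPOL_BIND policy on a range makes get_vma_policy() report MPOL_BIND, so __GFP_THISNODE is left out and the bind nodemask alone steers placement. A minimal userspace sketch (node 0 is an arbitrary choice; build against libnuma for mbind()):

	#include <sys/mman.h>
	#include <numaif.h>	/* mbind(), MPOL_BIND */

	static void bind_range_to_node0(void *p, size_t len)
	{
		unsigned long nodemask = 1UL << 0;	/* node 0 only */

		/* get_vma_policy() now sees MPOL_BIND for this range, so
		 * alloc_hugepage_direct_gfpmask() leaves this_node at 0 */
		mbind(p, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
		madvise(p, len, MADV_HUGEPAGE);
	}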
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 9c9877a43d58..65e0874fce17 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1734,7 +1734,7 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * freeing by another task. It is the caller's responsibility to free the
  * extra reference for shared policies.
  */
-static struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
+struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 		unsigned long addr)
 {
 	struct mempolicy *pol = __get_vma_policy(vma, addr);
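Since get_vma_policy() may return a shared policy carrying an extra reference, any new caller pairs it with mpol_cond_put(), as the huge_memory.c hunk above does:

	struct mempolicy *pol = get_vma_policy(vma, addr);
	/* ... inspect pol->mode ... */
	mpol_cond_put(pol);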