mm/free_pcppages_bulk: prefetch buddy while not holding lock

When a page is freed back to the global pool, its buddy is checked to
see whether the two can be merged. This requires accessing the buddy's
page structure, and that access can take a long time if it is cache
cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster. Since we *always* do buddy merging and
check an order-0 page's buddy to try to merge it when it goes into the
main allocator, the cacheline will always come in, i.e. the prefetched
data will never be unused.

Normally, the number of prefetches will be pcp->batch (default=31, with
an upper limit of (PAGE_SHIFT * 8)=96 on x86_64), but when all of the
pcp's pages are drained it will be pcp->count, which has an upper limit
of pcp->high. pcp->high, although it has a default value of 186
(pcp->batch=31 * 6), can be changed by the user through
/proc/sys/vm/percpu_pagelist_fraction, and there is no software upper
limit, so it could be large, e.g. several thousand. For this reason,
only the buddy structures of the first pcp->batch pages are prefetched,
to avoid excessive prefetching.

In the meantime, there are two concerns:

1. the prefetch could potentially evict existing cachelines, especially
   from the L1D cache since it is not huge;
2. there is some additional instruction overhead, namely calculating
   the buddy pfn twice.

For 1, it's hard to say: this microbenchmark shows a good result, but
the actual benefit of this patch will be workload/CPU dependent. For 2,
since the calculation is an XOR on two local variables, it's expected
that in many cases the cycles spent will be offset by reduced memory
latency later. This is especially true for NUMA machines, where multiple
CPUs contend on zone->lock and the most time-consuming part under
zone->lock is waiting for the 'struct page' cachelines of the
to-be-freed pages and their buddies.
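The "XOR on two local variables" mentioned above can be illustrated with a small userspace sketch. It assumes the buddy of a block at a given order is found by flipping that bit of the pfn, which is what the kernel's __find_buddy_pfn() is understood to do. The names find_buddy_pfn and order0_buddy_offset are illustrative only and are not kernel functions.

```c
#include <assert.h>
#include <limits.h>

/*
 * Userspace sketch of the buddy calculation. find_buddy_pfn is a
 * hypothetical stand-in for the kernel's __find_buddy_pfn(); this is an
 * illustration, not the kernel source.
 */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
	/* The shift is only defined for order < bits in unsigned long. */
	assert(order < sizeof(unsigned long) * CHAR_BIT);

	/* Buddies differ only in bit 'order' of the pfn, so flip that bit. */
	return pfn ^ (1UL << order);
}

/*
 * Signed distance, in pages, from a page to its order-0 buddy.
 * It is +1 for an even pfn and -1 for an odd pfn, which is why the buddy's
 * struct page can be reached with simple pointer arithmetic on 'page'.
 */
static long order0_buddy_offset(unsigned long pfn)
{
	unsigned long buddy = find_buddy_pfn(pfn, 0);

	/* Avoid unsigned wrap-around by subtracting in the safe direction. */
	return buddy > pfn ? (long)(buddy - pfn) : -(long)(pfn - buddy);
}
```

Under these assumptions, pfn 8 and pfn 9 are each other's order-0 buddies, so the extra work done outside zone->lock is a single XOR plus an add on values already held in registers.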

Test with will-it-scale/page_fault1 full load:

kernel      Broadwell(2S)    Skylake(2S)     Broadwell(4S)    Skylake(4S)
v4.16-rc2+   9034215         7971818         13667135         15677465
patch2/3     9536374 +5.6%   8314710 +4.3%   14070408 +3.0%   16675866 +6.4%
this patch  10180856 +6.8%   8506369 +2.3%   14756865 +4.9%   17325324 +3.9%

Note: this patch's performance improvement percentage is relative to
patch2/3.

(Changelog stolen from Dave Hansen and Mel Gorman's comments at
http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)

[aaron.lu@intel.com: use helper function, avoid disordering pages]
  Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
  Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
[aaron.lu@intel.com: v4]
  Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
  Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
Suggested-by: Ying Huang <ying.huang@intel.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kemi Wang <kemi.wang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

author: Aaron Lu <aaron.lu@intel.com> 2018-04-05 19:24:14 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2018-04-06 00:36:26 -0400
commit: 97334162e4d79f866edd7308aac0ab3ab7a103f7 (patch)
tree: d46537be9687742b45a27690adbee2b45446da62 /mm/page_alloc.c
parent: 0a5f4e5b45625e75db85b4968fc4c232d8091143 (diff)
1 files changed, 22 insertions, 0 deletions
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e29a6ba050c8..f6005b7c3446 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1063,6 +1063,15 @@ static bool bulkfree_pcp_prepare(struct page *page)
 }
 #endif /* CONFIG_DEBUG_VM */
 
+static inline void prefetch_buddy(struct page *page)
+{
+        unsigned long pfn = page_to_pfn(page);
+        unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
+        struct page *buddy = page + (buddy_pfn - pfn);
+
+        prefetch(buddy);
+}
+
 /*
  * Frees a number of pages from the PCP lists
  * Assumes all pages on list are in same zone, and of same order.
@@ -1079,6 +1088,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 {
        int migratetype = 0;
        int batch_free = 0;
+        int prefetch_nr = 0;
        bool isolated_pageblocks;
        struct page *page, *tmp;
        LIST_HEAD(head);
@@ -1114,6 +1124,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
                                continue;
 
                        list_add_tail(&page->lru, &head);
+
+                        /*
+                         * We are going to put the page back to the global
+                         * pool, prefetch its buddy to speed up later access
+                         * under zone->lock. It is believed the overhead of
+                         * an additional test and calculating buddy_pfn here
+                         * can be offset by reduced memory latency later. To
+                         * avoid excessive prefetching due to large count, only
+                         * prefetch buddy for the first pcp->batch nr of pages.
+                         */
+                        if (prefetch_nr++ < pcp->batch)
+                                prefetch_buddy(page);
                } while (--count && --batch_free && !list_empty(list));
        }
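
The capped prefetch in the last hunk can be sketched in userspace as follows. This is a minimal illustration under stated assumptions, not the kernel code: struct fake_page and prefetch_buddies_capped are hypothetical names, the array mem_map stands in for the kernel's page array, and __builtin_prefetch is the GCC/Clang builtin used here in place of the kernel's prefetch(). The caller is assumed to take the lock only after this loop has run.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the array layout matters. */
struct fake_page {
	unsigned long flags;
};

/*
 * Walk 'count' pfns that are about to be freed and prefetch the order-0
 * buddy of each one, but only for the first 'batch' entries. This mirrors
 * the prefetch_nr cap in the patch: a large count must not turn into
 * thousands of prefetches.
 *
 * Assumption: mem_map covers every pfn in pfns[] and its buddy (pfn ^ 1).
 * Returns the number of prefetches issued.
 */
static size_t prefetch_buddies_capped(const struct fake_page *mem_map,
				      const unsigned long *pfns,
				      size_t count, size_t batch)
{
	size_t prefetch_nr = 0;

	for (size_t i = 0; i < count; i++) {
		/* ... the real code moves the page to a private list here ... */

		if (prefetch_nr < batch) {
			/* Order-0 buddy: flip the lowest bit of the pfn. */
			__builtin_prefetch(&mem_map[pfns[i] ^ 1UL]);
			prefetch_nr++;
		}
	}

	return prefetch_nr;
}
```

With five pfns and a batch of 3, the sketch issues three prefetches; with a batch of 31 it issues five, one per page. The cap therefore only matters when count exceeds the batch size, which matches the drain-everything case described in the changelog.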