mm: send one IPI per CPU to TLB flush all entries after unmapping pages

An IPI is sent to flush remote TLBs when a page is unmapped that was potentially accessed by other CPUs. There are many circumstances where this happens but the obvious one is kswapd reclaiming pages belonging to a running process as kswapd and the task are likely running on separate CPUs.

On small machines, this is not a significant problem but as machines get larger with more cores and more memory, the cost of these IPIs can be high. This patch uses a simple structure that tracks CPUs that potentially have TLB entries for pages being unmapped. When the unmapping is complete, the full TLB is flushed on the assumption that a refill cost is lower than flushing individual entries.

Architectures wishing to do this must give the following guarantee.

        If a clean page is unmapped and not immediately flushed, the
        architecture must guarantee that a write to that linear address
        from a CPU with a cached TLB entry will trap a page fault.

This is essentially what the kernel already depends on but the window is much larger with this patch applied and is worth highlighting. The architecture should consider whether the cost of the full TLB flush is higher than sending an IPI to flush each individual entry. An additional architecture helper called flush_tlb_local is required. It's a trivial wrapper with some accounting in the x86 case.

The impact of this patch depends on the workload as measuring any benefit requires both mapped pages co-located on the LRU and memory pressure. The case with the biggest impact is multiple processes reading mapped pages taken from the vm-scalability test suite. The test case uses NR_CPU readers of mapped files that consume 10*RAM.
Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs

                                           4.2.0-rc1               4.2.0-rc1
                                             vanilla            flushfull-v7
Ops lru-file-mmap-read-elapsed       159.62 (  0.00%)       120.68 ( 24.40%)
Ops lru-file-mmap-read-time_range     30.59 (  0.00%)         2.80 ( 90.85%)
Ops lru-file-mmap-read-time_stddv      6.70 (  0.00%)         0.64 ( 90.38%)

           4.2.0-rc1     4.2.0-rc1
             vanilla  flushfull-v7
User          581.00        611.43
System       5804.93       4111.76
Elapsed       161.03        122.12

This shows that the readers completed 24.40% faster with 29% less system CPU time. From vmstats, it is known that the vanilla kernel was interrupted roughly 900K times per second during the steady phase of the test and the patched kernel was interrupted 180K times per second.

The impact is lower on a single socket machine.

                                           4.2.0-rc1               4.2.0-rc1
                                             vanilla            flushfull-v7
Ops lru-file-mmap-read-elapsed        25.33 (  0.00%)        20.38 ( 19.54%)
Ops lru-file-mmap-read-time_range      0.91 (  0.00%)         1.44 (-58.24%)
Ops lru-file-mmap-read-time_stddv      0.28 (  0.00%)         0.47 (-65.34%)

           4.2.0-rc1     4.2.0-rc1
             vanilla  flushfull-v7
User           58.09         57.64
System        111.82         76.56
Elapsed        27.29         22.55

It's still a noticeable improvement with vmstat showing interrupts went from roughly 500K per second to 45K per second.

The patch will have no impact on workloads that have no memory pressure or relatively few mapped pages. It will have an unpredictable impact on the workload running on the CPU being flushed as it'll depend on how many TLB entries need to be refilled and how long that takes. Worst case, the TLB will be completely cleared of active entries when the target PFNs were not resident at all.
[sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Mel Gorman <mgorman@suse.de> 2015-09-04 18:47:32 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2015-09-04 19:54:41 -0400
commit: 72b252aed506b8f1a03f7abd29caef4cdf6a043b (patch)
tree: a0825c463af7ebca1b172ceb26ab8ea3eb6ff602 /include/linux/sched.h
parent: 5b74283ab251b9db55cbbe31d19ca72482103290 (diff)
1 files changed, 16 insertions, 0 deletions
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 119823decc46..3c602c20c717 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1344,6 +1344,18 @@ enum perf_event_task_context {
        perf_nr_task_contexts,
 };
 
+/* Track pages that require TLB flushes */
+struct tlbflush_unmap_batch {
+        /*
+         * Each bit set is a CPU that potentially has a TLB entry for one of
+         * the PFNs being flushed. See set_tlb_ubc_flush_pending().
+         */
+        struct cpumask cpumask;
+
+        /* True if any bit in cpumask is set */
+        bool flush_required;
+};
+
 struct task_struct {
        volatile long state;    /* -1 unrunnable, 0 runnable, >0 stopped */
        void *stack;
@@ -1700,6 +1712,10 @@ struct task_struct {
        unsigned long numa_pages_migrated;
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+        struct tlbflush_unmap_batch tlb_ubc;
+#endif
+
        struct rcu_head rcu;
 
        /*