author	Rik van Riel <riel@redhat.com>	2014-01-27 17:03:44 -0500
committer	Ingo Molnar <mingo@kernel.org>	2014-01-28 07:17:07 -0500
commit	10f39042711ba21773763f267b4943a2c66c8bef (patch)
tree	394f8399c6f9b980f5673e1034b125b91844f662 /mm
parent	20e07dea286a90f096a779706861472d296397c6 (diff)
sched/numa, mm: Use active_nodes nodemask to limit numa migrations
Use the active_nodes nodemask to make smarter decisions on NUMA migrations.

In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria:

1) keep private memory local to each thread
2) avoid excessive NUMA migration of pages
3) distribute shared memory across the active nodes, to maximize memory bandwidth available to the workload

This patch accomplishes that by implementing the following policy for NUMA migrations:

1) always migrate on a private fault
2) never migrate to a node that is not in the set of active nodes for the numa_group
3) always migrate from a node outside of the set of active nodes, to a node that is in that set
4) within the set of active nodes in the numa_group, only migrate from a node with more NUMA page faults, to a node with fewer NUMA page faults, with a 25% margin to avoid ping-ponging

This results in most pages of a workload ending up on the actively used nodes, with reduced ping-ponging of pages between those nodes.

Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Chegu Vinod <chegu_vinod@hp.com>
Link: http://lkml.kernel.org/r/1390860228-21539-6-git-send-email-riel@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
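For illustration only, here is a minimal user-space C sketch of the migration policy described above. The struct layout, node_is_active() helper and fault counters are hypothetical stand-ins, not the kernel's numa_group bookkeeping; in the actual patch the decision is made by should_numa_migrate_memory(), the helper called from the mempolicy hunk below.

/*
 * Sketch of the NUMA migration policy from the commit message.
 * Simplified stand-in types; not the kernel implementation.
 */
#include <stdbool.h>
#include <stdio.h>

#define MAX_NODES 8

struct numa_group_sketch {
	bool active[MAX_NODES];			/* nodes the workload actively runs on */
	unsigned long faults[MAX_NODES];	/* recent NUMA fault counts per node */
};

static bool node_is_active(const struct numa_group_sketch *ng, int nid)
{
	return ng->active[nid];
}

/*
 * Decide whether a page currently on src_nid should migrate to dst_nid.
 * Rule 1 (private faults always migrate) is assumed to be handled by the
 * caller before the group policy is consulted.
 */
static bool sketch_should_migrate(const struct numa_group_sketch *ng,
				  int src_nid, int dst_nid)
{
	/* Rule 2: never migrate to a node outside the active set. */
	if (!node_is_active(ng, dst_nid))
		return false;

	/* Rule 3: always migrate from outside the active set into it. */
	if (!node_is_active(ng, src_nid))
		return true;

	/*
	 * Rule 4: within the active set, migrate only towards a node with
	 * fewer NUMA faults, keeping a 25% margin to avoid ping-ponging.
	 */
	return ng->faults[dst_nid] < ng->faults[src_nid] * 3 / 4;
}

int main(void)
{
	struct numa_group_sketch ng = {
		.active = { [0] = true, [1] = true },
		.faults = { [0] = 1000, [1] = 400, [2] = 50 },
	};

	/* Node 2 is outside the active set: migrate into active node 1. */
	printf("2 -> 1: %d\n", sketch_should_migrate(&ng, 2, 1));
	/* Node 0 has well over 25%% more faults than node 1: migrate. */
	printf("0 -> 1: %d\n", sketch_should_migrate(&ng, 0, 1));
	/* Migrating 1 -> 0 would go against the fault gradient: stay put. */
	printf("1 -> 0: %d\n", sketch_should_migrate(&ng, 1, 0));
	return 0;
}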
Diffstat (limited to 'mm')
-rw-r--r--	mm/mempolicy.c	29
1 file changed, 1 insertion(+), 28 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 68d5c7f7164e..784c11ef7719 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2377,37 +2377,10 @@ int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long
 
 	/* Migrate the page towards the node whose CPU is referencing it */
 	if (pol->flags & MPOL_F_MORON) {
-		int last_cpupid;
-		int this_cpupid;
-
 		polnid = thisnid;
-		this_cpupid = cpu_pid_to_cpupid(thiscpu, current->pid);
 
-		/*
-		 * Multi-stage node selection is used in conjunction
-		 * with a periodic migration fault to build a temporal
-		 * task<->page relation. By using a two-stage filter we
-		 * remove short/unlikely relations.
-		 *
-		 * Using P(p) ~ n_p / n_t as per frequentist
-		 * probability, we can equate a task's usage of a
-		 * particular page (n_p) per total usage of this
-		 * page (n_t) (in a given time-span) to a probability.
-		 *
-		 * Our periodic faults will sample this probability and
-		 * getting the same result twice in a row, given these
-		 * samples are fully independent, is then given by
-		 * P(n)^2, provided our sample period is sufficiently
-		 * short compared to the usage pattern.
-		 *
-		 * This quadric squishes small probabilities, making
-		 * it less likely we act on an unlikely task<->page
-		 * relation.
-		 */
-		last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
-		if (!cpupid_pid_unset(last_cpupid) && cpupid_to_nid(last_cpupid) != thisnid) {
+		if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
 			goto out;
-		}
 	}
 
 	if (curnid != polnid)