author     Linus Torvalds <torvalds@linux-foundation.org>  2012-12-16 17:33:25 -0500
committer  Linus Torvalds <torvalds@linux-foundation.org>  2012-12-16 18:18:08 -0500
commit     3d59eebc5e137bd89c6351e4c70e90ba1d0dc234 (patch)
tree       b4ddfd0b057454a7437a3b4e3074a3b8b4b03817 /mm/mempolicy.c
parent     11520e5e7c1855fc3bf202bb3be35a39d9efa034 (diff)
parent     4fc3f1d66b1ef0d7b8dc11f4ff1cc510f78b37d6 (diff)
Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma
Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
"There are three implementations for NUMA balancing, this tree
(balancenuma), numacore which has been developed in tip/master and
autonuma which is in aa.git.
In almost all respects balancenuma is the dumbest of the three because
its main impact is on the VM side with no attempt to be smart about
scheduling. In the interest of getting the ball rolling, it would be
desirable to see this much merged for 3.8 with the view to building
scheduler smarts on top and adapting the VM where required for 3.9.
The most recent comparisons available from different people are:
mel: https://lkml.org/lkml/2012/12/9/108
mingo: https://lkml.org/lkml/2012/12/7/331
tglx: https://lkml.org/lkml/2012/12/10/437
srikar: https://lkml.org/lkml/2012/12/10/397
The results are a mixed bag. In my own tests, balancenuma does
reasonably well. It's dumb as rocks and does not regress against
mainline. On the other hand, Ingo's tests show that balancenuma is
incapable of converging for the workloads driven by perf, which is bad
but is potentially explained by the lack of scheduler smarts. Thomas'
results show balancenuma improves on mainline but falls far short of
numacore or autonuma. Srikar's results indicate we all suffer on a
large machine with imbalanced node sizes.
My own testing showed that recent numacore results have improved
dramatically, particularly in the last week, but not universally.
We've butted heads heavily on system CPU usage and high levels of
migration even when the results show that overall performance is
better. There are also cases where it regresses. Of interest is that
for specjbb in some configurations it will regress for lower numbers
of warehouses and show gains for higher numbers, which is not reported
by the tool by default and is sometimes missed in reports. Recently I
reported that the JVM was crashing with NullPointerExceptions under
numacore, but it is currently unclear what the source of this problem
is. Initially I thought it was in how numacore batch-handles PTEs, but
I no longer think this is the case. It's possible numacore is just
able to trigger it due to higher rates of migration.
These reports were quite late in the cycle so I/we would like to start
with this tree as it contains much of the code we can agree on and has
not changed significantly over the last 2-3 weeks."
* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
mm/rmap: Convert the struct anon_vma::mutex to an rwsem
mm: migrate: Account a transhuge page properly when rate limiting
mm: numa: Account for failed allocations and isolations as migration failures
mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
mm: numa: Add THP migration for the NUMA working set scanning fault case.
mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
mm: sched: numa: Control enabling and disabling of NUMA balancing
mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
mm: numa: migrate: Set last_nid on newly allocated page
mm: numa: split_huge_page: Transfer last_nid on tail page
mm: numa: Introduce last_nid to the page frame
sched: numa: Slowly increase the scanning period as NUMA faults are handled
mm: numa: Rate limit setting of pte_numa if node is saturated
mm: numa: Rate limit the amount of memory that is migrated between nodes
mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
mm: numa: Migrate pages handled during a pmd_numa hinting fault
mm: numa: Migrate on reference policy
...
Diffstat (limited to 'mm/mempolicy.c')
-rw-r--r--  mm/mempolicy.c  283
1 file changed, 259 insertions(+), 24 deletions(-)
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index aaf54566cb6b..d1b315e98627 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -90,6 +90,7 @@
 #include <linux/syscalls.h>
 #include <linux/ctype.h>
 #include <linux/mm_inline.h>
+#include <linux/mmu_notifier.h>
 
 #include <asm/tlbflush.h>
 #include <asm/uaccess.h>
@@ -117,6 +118,26 @@ static struct mempolicy default_policy = {
 	.flags = MPOL_F_LOCAL,
 };
 
+static struct mempolicy preferred_node_policy[MAX_NUMNODES];
+
+static struct mempolicy *get_task_policy(struct task_struct *p)
+{
+	struct mempolicy *pol = p->mempolicy;
+	int node;
+
+	if (!pol) {
+		node = numa_node_id();
+		if (node != -1)
+			pol = &preferred_node_policy[node];
+
+		/* preferred_node_policy is not initialised early in boot */
+		if (!pol->mode)
+			pol = NULL;
+	}
+
+	return pol;
+}
+
 static const struct mempolicy_operations {
 	int (*create)(struct mempolicy *pol, const nodemask_t *nodes);
 	/*
@@ -254,7 +275,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	if (mode == MPOL_DEFAULT) {
 		if (nodes && !nodes_empty(*nodes))
 			return ERR_PTR(-EINVAL);
-		return NULL;	/* simply delete any existing policy */
+		return NULL;
 	}
 	VM_BUG_ON(!nodes);
 
@@ -269,6 +290,10 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 			     (flags & MPOL_F_RELATIVE_NODES)))
 				return ERR_PTR(-EINVAL);
 		}
+	} else if (mode == MPOL_LOCAL) {
+		if (!nodes_empty(*nodes))
+			return ERR_PTR(-EINVAL);
+		mode = MPOL_PREFERRED;
 	} else if (nodes_empty(*nodes))
 		return ERR_PTR(-EINVAL);
 	policy = kmem_cache_alloc(policy_cache, GFP_KERNEL);
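With this hunk MPOL_LOCAL stops being a parse-time alias and becomes a mode
mpol_new() accepts directly, internally rewritten to MPOL_PREFERRED with an
empty nodemask. A minimal userspace sketch, assuming uapi headers from a
kernel with this series applied (which expose MPOL_LOCAL as a real mode):

	/* Hypothetical example, not part of the patch: ask for local
	 * allocation explicitly via set_mempolicy(2). */
	#include <stdio.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/mempolicy.h>	/* MPOL_LOCAL, from this series' uapi */

	int main(void)
	{
		/* MPOL_LOCAL requires an empty nodemask; mpol_new() above
		 * rewrites it to MPOL_PREFERRED with local allocation. */
		if (syscall(SYS_set_mempolicy, MPOL_LOCAL, NULL, 0))
			perror("set_mempolicy(MPOL_LOCAL)");
		return 0;
	}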
@@ -561,6 +586,36 @@ static inline int check_pgd_range(struct vm_area_struct *vma,
 	return 0;
 }
 
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+/*
+ * This is used to mark a range of virtual addresses to be inaccessible.
+ * These are later cleared by a NUMA hinting fault. Depending on these
+ * faults, pages may be migrated for better NUMA placement.
+ *
+ * This is assuming that NUMA faults are handled using PROT_NONE. If
+ * an architecture makes a different choice, it will need further
+ * changes to the core.
+ */
+unsigned long change_prot_numa(struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	int nr_updated;
+	BUILD_BUG_ON(_PAGE_NUMA != _PAGE_PROTNONE);
+
+	nr_updated = change_protection(vma, addr, end, vma->vm_page_prot, 0, 1);
+	if (nr_updated)
+		count_vm_numa_events(NUMA_PTE_UPDATES, nr_updated);
+
+	return nr_updated;
+}
+#else
+static unsigned long change_prot_numa(struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	return 0;
+}
+#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
+
 /*
  * Check if all pages in a range are on a set of nodes.
  * If pagelist != NULL then isolate pages from the LRU and
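The consumer of these pte_numa markings is the NUMA hinting fault handler
added elsewhere in the series (mm/memory.c, not in this file's diff). Below
is a condensed, illustrative sketch of that flow, loosely modeled on the
series' do_numa_page(); the function name is made up, and locking,
refcounting and vmstat accounting are omitted:

	/* Illustrative sketch only -- not verbatim kernel code. */
	static int numa_hinting_fault_sketch(struct mm_struct *mm,
			struct vm_area_struct *vma, unsigned long addr,
			pte_t pte, pte_t *ptep)
	{
		struct page *page;
		int target_nid;

		pte = pte_mknonnuma(pte);	/* clear _PAGE_NUMA (_PAGE_PROTNONE) */
		set_pte_at(mm, addr, ptep, pte);
		update_mmu_cache(vma, addr, ptep);

		page = vm_normal_page(vma, addr, pte);
		if (!page)
			return 0;

		/* mpol_misplaced() is added later in this diff */
		target_nid = mpol_misplaced(page, vma, addr);
		if (target_nid != -1)
			migrate_misplaced_page(page, target_nid);

		return 0;
	}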
@@ -579,22 +634,32 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		return ERR_PTR(-EFAULT);
 	prev = NULL;
 	for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
+		unsigned long endvma = vma->vm_end;
+
+		if (endvma > end)
+			endvma = end;
+		if (vma->vm_start > start)
+			start = vma->vm_start;
+
 		if (!(flags & MPOL_MF_DISCONTIG_OK)) {
 			if (!vma->vm_next && vma->vm_end < end)
 				return ERR_PTR(-EFAULT);
 			if (prev && prev->vm_end < vma->vm_start)
 				return ERR_PTR(-EFAULT);
 		}
-		if (!is_vm_hugetlb_page(vma) &&
-		    ((flags & MPOL_MF_STRICT) ||
+
+		if (is_vm_hugetlb_page(vma))
+			goto next;
+
+		if (flags & MPOL_MF_LAZY) {
+			change_prot_numa(vma, start, endvma);
+			goto next;
+		}
+
+		if ((flags & MPOL_MF_STRICT) ||
 		    ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
-		     vma_migratable(vma)))) {
-			unsigned long endvma = vma->vm_end;
+		     vma_migratable(vma))) {
 
-			if (endvma > end)
-				endvma = end;
-			if (vma->vm_start > start)
-				start = vma->vm_start;
 			err = check_pgd_range(vma, start, endvma, nodes,
 						flags, private);
 			if (err) {
@@ -602,6 +667,7 @@ check_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 			break;
 		}
 	}
+next:
 		prev = vma;
 	}
 	return first;
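Seen from userspace, MPOL_MF_LAZY (wired up in the do_mbind() hunks below)
makes this check_range() walk take the change_prot_numa() branch instead of
isolating pages for synchronous migration. A minimal sketch, assuming a
two-node machine and a kernel from this series whose uapi still accepts
MPOL_MF_LAZY (later kernels masked the flag out of the ABI):

	/* Hypothetical example, not part of the patch. */
	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>
	#include <linux/mempolicy.h>	/* MPOL_BIND, MPOL_MF_LAZY */

	int main(void)
	{
		size_t len = 16 << 20;			/* 16MB */
		unsigned long nodemask = 1UL << 1;	/* node 1 only */
		void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (buf == MAP_FAILED)
			return 1;
		memset(buf, 1, len);	/* fault pages in wherever we run */

		/* MPOL_MF_LAZY: no isolation now; check_range() calls
		 * change_prot_numa() so the next access takes a NUMA
		 * hinting fault and migrates the page then. */
		if (syscall(SYS_mbind, buf, len, MPOL_BIND, &nodemask,
			    8 * sizeof(nodemask), MPOL_MF_LAZY))
			perror("mbind(MPOL_MF_LAZY)");

		memset(buf, 2, len);	/* touching re-faults and migrates */
		return 0;
	}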
@@ -961,7 +1027,8 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-							false, MIGRATE_SYNC);
+							false, MIGRATE_SYNC,
+							MR_SYSCALL);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1133,8 +1200,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 	int err;
 	LIST_HEAD(pagelist);
 
-	if (flags & ~(unsigned long)(MPOL_MF_STRICT |
-				     MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+	if (flags & ~(unsigned long)MPOL_MF_VALID)
 		return -EINVAL;
 	if ((flags & MPOL_MF_MOVE_ALL) && !capable(CAP_SYS_NICE))
 		return -EPERM;
@@ -1157,6 +1223,9 @@ static long do_mbind(unsigned long start, unsigned long len,
 	if (IS_ERR(new))
 		return PTR_ERR(new);
 
+	if (flags & MPOL_MF_LAZY)
+		new->flags |= MPOL_F_MOF;
+
 	/*
 	 * If we are using the default policy then operation
 	 * on discontinuous address spaces is okay after all
@@ -1193,21 +1262,24 @@ static long do_mbind(unsigned long start, unsigned long len,
 	vma = check_range(mm, start, end, nmask,
 			  flags | MPOL_MF_INVERT, &pagelist);
 
-	err = PTR_ERR(vma);
-	if (!IS_ERR(vma)) {
-		int nr_failed = 0;
-
+	err = PTR_ERR(vma);	/* maybe ... */
+	if (!IS_ERR(vma))
 		err = mbind_range(mm, start, end, new);
 
+	if (!err) {
+		int nr_failed = 0;
+
 		if (!list_empty(&pagelist)) {
+			WARN_ON_ONCE(flags & MPOL_MF_LAZY);
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
-						false, MIGRATE_SYNC);
+						false, MIGRATE_SYNC,
+						MR_MEMPOLICY_MBIND);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
 
-		if (!err && nr_failed && (flags & MPOL_MF_STRICT))
+		if (nr_failed && (flags & MPOL_MF_STRICT))
 			err = -EIO;
 	} else
 		putback_lru_pages(&pagelist);
@@ -1546,7 +1618,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
 struct mempolicy *get_vma_policy(struct task_struct *task,
 		struct vm_area_struct *vma, unsigned long addr)
 {
-	struct mempolicy *pol = task->mempolicy;
+	struct mempolicy *pol = get_task_policy(task);
 
 	if (vma) {
 		if (vma->vm_ops && vma->vm_ops->get_policy) {
@@ -1956,7 +2028,7 @@ retry_cpuset:
  */
 struct page *alloc_pages_current(gfp_t gfp, unsigned order)
 {
-	struct mempolicy *pol = current->mempolicy;
+	struct mempolicy *pol = get_task_policy(current);
 	struct page *page;
 	unsigned int cpuset_mems_cookie;
 
@@ -2140,6 +2212,115 @@ static void sp_free(struct sp_node *n)
 	kmem_cache_free(sn_cache, n);
 }
 
+/**
+ * mpol_misplaced - check whether current page node is valid in policy
+ *
+ * @page   - page to be checked
+ * @vma    - vm area where page mapped
+ * @addr   - virtual address where page mapped
+ *
+ * Lookup current policy node id for vma,addr and "compare to" page's
+ * node id.
+ *
+ * Returns:
+ *	-1	- not misplaced, page is in the right node
+ *	node	- node id where the page should be
+ *
+ * Policy determination "mimics" alloc_page_vma().
+ * Called from fault path where we know the vma and faulting address.
+ */
+int mpol_misplaced(struct page *page, struct vm_area_struct *vma, unsigned long addr)
+{
+	struct mempolicy *pol;
+	struct zone *zone;
+	int curnid = page_to_nid(page);
+	unsigned long pgoff;
+	int polnid = -1;
+	int ret = -1;
+
+	BUG_ON(!vma);
+
+	pol = get_vma_policy(current, vma, addr);
+	if (!(pol->flags & MPOL_F_MOF))
+		goto out;
+
+	switch (pol->mode) {
+	case MPOL_INTERLEAVE:
+		BUG_ON(addr >= vma->vm_end);
+		BUG_ON(addr < vma->vm_start);
+
+		pgoff = vma->vm_pgoff;
+		pgoff += (addr - vma->vm_start) >> PAGE_SHIFT;
+		polnid = offset_il_node(pol, vma, pgoff);
+		break;
+
+	case MPOL_PREFERRED:
+		if (pol->flags & MPOL_F_LOCAL)
+			polnid = numa_node_id();
+		else
+			polnid = pol->v.preferred_node;
+		break;
+
+	case MPOL_BIND:
+		/*
+		 * allows binding to multiple nodes.
+		 * use current page if in policy nodemask,
+		 * else select nearest allowed node, if any.
+		 * If no allowed nodes, use current [!misplaced].
+		 */
+		if (node_isset(curnid, pol->v.nodes))
+			goto out;
+		(void)first_zones_zonelist(
+				node_zonelist(numa_node_id(), GFP_HIGHUSER),
+				gfp_zone(GFP_HIGHUSER),
+				&pol->v.nodes, &zone);
+		polnid = zone->node;
+		break;
+
+	default:
+		BUG();
+	}
+
+	/* Migrate the page towards the node whose CPU is referencing it */
+	if (pol->flags & MPOL_F_MORON) {
+		int last_nid;
+
+		polnid = numa_node_id();
+
+		/*
+		 * Multi-stage node selection is used in conjunction
+		 * with a periodic migration fault to build a temporal
+		 * task<->page relation. By using a two-stage filter we
+		 * remove short/unlikely relations.
+		 *
+		 * Using P(p) ~ n_p / n_t as per frequentist
+		 * probability, we can equate a task's usage of a
+		 * particular page (n_p) per total usage of this
+		 * page (n_t) (in a given time-span) to a probability.
+		 *
+		 * Our periodic faults will sample this probability and
+		 * getting the same result twice in a row, given these
+		 * samples are fully independent, is then given by
+		 * P(n)^2, provided our sample period is sufficiently
+		 * short compared to the usage pattern.
+		 *
+		 * This quadric squishes small probabilities, making
+		 * it less likely we act on an unlikely task<->page
+		 * relation.
+		 */
+		last_nid = page_xchg_last_nid(page, polnid);
+		if (last_nid != polnid)
+			goto out;
+	}
+
+	if (curnid != polnid)
+		ret = polnid;
+out:
+	mpol_cond_put(pol);
+
+	return ret;
+}
+
 static void sp_delete(struct shared_policy *sp, struct sp_node *n)
 {
 	pr_debug("deleting %lx-l%lx\n", n->start, n->end);
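As a quick worked example of the two-stage filter in the comment above
(illustrative numbers, not from the patch): with p = n_p / n_t the
probability that a given hinting fault on the page comes from this task,
requiring two consecutive faults to agree on the node means migration is
acted on with probability about p^2:

	p = 0.9 (task dominates the page)  ->  p^2 = 0.81, usually migrated
	p = 0.1 (transient relation)       ->  p^2 = 0.01, almost never migrated

So strong task<->page relations survive nearly intact while weak ones are
suppressed by an order of magnitude.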
@@ -2305,6 +2486,50 @@ void mpol_free_shared_policy(struct shared_policy *p)
 	mutex_unlock(&p->mutex);
 }
 
+#ifdef CONFIG_NUMA_BALANCING
+static bool __initdata numabalancing_override;
+
+static void __init check_numabalancing_enable(void)
+{
+	bool numabalancing_default = false;
+
+	if (IS_ENABLED(CONFIG_NUMA_BALANCING_DEFAULT_ENABLED))
+		numabalancing_default = true;
+
+	if (nr_node_ids > 1 && !numabalancing_override) {
+		printk(KERN_INFO "Enabling automatic NUMA balancing. "
+			"Configure with numa_balancing= or sysctl");
+		set_numabalancing_state(numabalancing_default);
+	}
+}
+
+static int __init setup_numabalancing(char *str)
+{
+	int ret = 0;
+	if (!str)
+		goto out;
+	numabalancing_override = true;
+
+	if (!strcmp(str, "enable")) {
+		set_numabalancing_state(true);
+		ret = 1;
+	} else if (!strcmp(str, "disable")) {
+		set_numabalancing_state(false);
+		ret = 1;
+	}
+out:
+	if (!ret)
+		printk(KERN_WARNING "Unable to parse numa_balancing=\n");
+
+	return ret;
+}
+__setup("numa_balancing=", setup_numabalancing);
+#else
+static inline void __init check_numabalancing_enable(void)
+{
+}
+#endif /* CONFIG_NUMA_BALANCING */
+
 /* assumes fs == KERNEL_DS */
 void __init numa_policy_init(void)
 {
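The printk above points at the runtime counterpart of the numa_balancing=
boot parameter: a kernel.numa_balancing sysctl added elsewhere in the
series. A small sketch for querying it, assuming it is exposed under
/proc/sys as usual:

	/* Illustrative only: query automatic NUMA balancing at runtime. */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/sys/kernel/numa_balancing", "r");
		int enabled;

		if (!f) {
			perror("kernel.numa_balancing not available");
			return 1;
		}
		if (fscanf(f, "%d", &enabled) == 1)
			printf("automatic NUMA balancing: %s\n",
			       enabled ? "enabled" : "disabled");
		fclose(f);
		return 0;
	}

Booting with numa_balancing=disable additionally sets
numabalancing_override, so check_numabalancing_enable() will not override
the choice at init time.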
@@ -2320,6 +2545,15 @@ void __init numa_policy_init(void)
 				     sizeof(struct sp_node),
 				     0, SLAB_PANIC, NULL);
 
+	for_each_node(nid) {
+		preferred_node_policy[nid] = (struct mempolicy) {
+			.refcnt = ATOMIC_INIT(1),
+			.mode = MPOL_PREFERRED,
+			.flags = MPOL_F_MOF | MPOL_F_MORON,
+			.v = { .preferred_node = nid, },
+		};
+	}
+
 	/*
 	 * Set interleaving policy for system init. Interleaving is only
 	 * enabled across suitably sized nodes (default is >= 16MB), or
@@ -2346,6 +2580,8 @@ void __init numa_policy_init(void)
 
 	if (do_set_mempolicy(MPOL_INTERLEAVE, 0, &interleave_nodes))
 		printk("numa_policy_init: interleaving failed\n");
+
+	check_numabalancing_enable();
 }
 
 /* Reset policy of current process to default */
@@ -2362,14 +2598,13 @@ void numa_default_policy(void)
  * "local" is pseudo-policy:  MPOL_PREFERRED with MPOL_F_LOCAL flag
  * Used only for mpol_parse_str() and mpol_to_str()
  */
-#define MPOL_LOCAL MPOL_MAX
 static const char * const policy_modes[] =
 {
 	[MPOL_DEFAULT]    = "default",
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
-	[MPOL_LOCAL]      = "local"
+	[MPOL_LOCAL]      = "local",
 };
 
 
@@ -2415,12 +2650,12 @@ int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context)
 	if (flags)
 		*flags++ = '\0';	/* terminate mode string */
 
-	for (mode = 0; mode <= MPOL_LOCAL; mode++) {
+	for (mode = 0; mode < MPOL_MAX; mode++) {
 		if (!strcmp(str, policy_modes[mode])) {
 			break;
 		}
 	}
-	if (mode > MPOL_LOCAL)
+	if (mode >= MPOL_MAX)
 		goto out;
 
 	switch (mode) {