Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma

Pull Automatic NUMA Balancing bare-bones from Mel Gorman:

"There are three implementations for NUMA balancing: this tree (balancenuma), numacore, which has been developed in tip/master, and autonuma, which is in aa.git.

In almost all respects balancenuma is the dumbest of the three because its main impact is on the VM side with no attempt to be smart about scheduling. In the interest of getting the ball rolling, it would be desirable to see this much merged for 3.8 with the view to building scheduler smarts on top and adapting the VM where required for 3.9.

The most recent comparisons available from different people are:

  mel:    https://lkml.org/lkml/2012/12/9/108
  mingo:  https://lkml.org/lkml/2012/12/7/331
  tglx:   https://lkml.org/lkml/2012/12/10/437
  srikar: https://lkml.org/lkml/2012/12/10/397

The results are a mixed bag. In my own tests, balancenuma does reasonably well. It's dumb as rocks and does not regress against mainline. On the other hand, Ingo's tests show that balancenuma is incapable of converging for these workloads driven by perf, which is bad but is potentially explained by the lack of scheduler smarts. Thomas' results show balancenuma improves on mainline but falls far short of numacore or autonuma. Srikar's results indicate we all suffer on a large machine with imbalanced node sizes.

My own testing showed that recent numacore results have improved dramatically, particularly in the last week, but not universally. We've butted heads heavily on system CPU usage and high levels of migration even when it shows that overall performance is better. There are also cases where it regresses. Of interest is that for specjbb in some configurations it will regress for lower numbers of warehouses and show gains for higher numbers, which is not reported by the tool by default and is sometimes missed in reports.

Recently I reported for numacore that the JVM was crashing with NullPointerExceptions, but currently it's unclear what the source of this problem is. Initially I thought it was in how numacore batch-handles PTEs, but I no longer think this is the case. It's possible numacore is just able to trigger it due to higher rates of migration.

These reports were quite late in the cycle, so I/we would like to start with this tree as it contains much of the code we can agree on and has not changed significantly over the last 2-3 weeks."

* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  mm: migrate: Account a transhuge page properly when rate limiting
  mm: numa: Account for failed allocations and isolations as migration failures
  mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
  mm: numa: Add THP migration for the NUMA working set scanning fault case.
  mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
  mm: sched: numa: Control enabling and disabling of NUMA balancing
  mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  mm: numa: migrate: Set last_nid on newly allocated page
  mm: numa: split_huge_page: Transfer last_nid on tail page
  mm: numa: Introduce last_nid to the page frame
  sched: numa: Slowly increase the scanning period as NUMA faults are handled
  mm: numa: Rate limit setting of pte_numa if node is saturated
  mm: numa: Rate limit the amount of memory that is migrated between nodes
  mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  mm: numa: Migrate pages handled during a pmd_numa hinting fault
  mm: numa: Migrate on reference policy
  ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2012-12-16 17:33:25 -0500
committer: Linus Torvalds <torvalds@linux-foundation.org> 2012-12-16 18:18:08 -0500
commit: 3d59eebc5e137bd89c6351e4c70e90ba1d0dc234
tree: b4ddfd0b057454a7437a3b4e3074a3b8b4b03817 /include/asm-generic
parent: 11520e5e7c1855fc3bf202bb3be35a39d9efa034
parent: 4fc3f1d66b1ef0d7b8dc11f4ff1cc510f78b37d6
1 file changed, 110 insertions, 0 deletions
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 284e80831d2c..701beab27aab 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -219,6 +219,10 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 #define move_pte(pte, prot, old_addr, new_addr) (pte)
 #endif
+#ifndef pte_accessible
+# define pte_accessible(pte)            ((void)(pte),1)
+#endif
 #ifndef flush_tlb_fix_spurious_fault
 #define flush_tlb_fix_spurious_fault(vma, address) flush_tlb_page(vma, address)
 #endif
@@ -580,6 +584,112 @@ static inline int pmd_trans_unstable(pmd_t *pmd)
 #endif
 }
+#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_USES_NUMA_PROT_NONE
+/*
+ * _PAGE_NUMA works identically to _PAGE_PROTNONE (it's actually the
+ * same bit too). It's set only when _PAGE_PRESENT is not set and it's
+ * never set if _PAGE_PRESENT is set.
+ *
+ * pte/pmd_present() returns true if pte/pmd_numa returns true. Page
+ * fault triggers on those regions if pte/pmd_numa returns true
+ * (because _PAGE_PRESENT is not set).
+ */
+#ifndef pte_numa
+static inline int pte_numa(pte_t pte)
+{
+        return (pte_flags(pte) &
+                (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+#ifndef pmd_numa
+static inline int pmd_numa(pmd_t pmd)
+{
+        return (pmd_flags(pmd) &
+                (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA;
+}
+#endif
+/*
+ * pte/pmd_mknonnuma sets the _PAGE_ACCESSED bitflag automatically
+ * because they're called by the NUMA hinting minor page fault. If we
+ * didn't set the _PAGE_ACCESSED bitflag here, the TLB miss handler
+ * would be forced to set it later while filling the TLB after we
+ * return to userland. That would trigger a second write to memory
+ * that we optimize away by setting _PAGE_ACCESSED here.
+ */
+#ifndef pte_mknonnuma
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+        pte = pte_clear_flags(pte, _PAGE_NUMA);
+        return pte_set_flags(pte, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+#endif
+#ifndef pmd_mknonnuma
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+        pmd = pmd_clear_flags(pmd, _PAGE_NUMA);
+        return pmd_set_flags(pmd, _PAGE_PRESENT|_PAGE_ACCESSED);
+}
+#endif
+#ifndef pte_mknuma
+static inline pte_t pte_mknuma(pte_t pte)
+{
+        pte = pte_set_flags(pte, _PAGE_NUMA);
+        return pte_clear_flags(pte, _PAGE_PRESENT);
+}
+#endif
+#ifndef pmd_mknuma
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+        pmd = pmd_set_flags(pmd, _PAGE_NUMA);
+        return pmd_clear_flags(pmd, _PAGE_PRESENT);
+}
+#endif
+#else
+extern int pte_numa(pte_t pte);
+extern int pmd_numa(pmd_t pmd);
+extern pte_t pte_mknonnuma(pte_t pte);
+extern pmd_t pmd_mknonnuma(pmd_t pmd);
+extern pte_t pte_mknuma(pte_t pte);
+extern pmd_t pmd_mknuma(pmd_t pmd);
+#endif /* CONFIG_ARCH_USES_NUMA_PROT_NONE */
+#else
+static inline int pmd_numa(pmd_t pmd)
+{
+        return 0;
+}
+static inline int pte_numa(pte_t pte)
+{
+        return 0;
+}
+static inline pte_t pte_mknonnuma(pte_t pte)
+{
+        return pte;
+}
+static inline pmd_t pmd_mknonnuma(pmd_t pmd)
+{
+        return pmd;
+}
+static inline pte_t pte_mknuma(pte_t pte)
+{
+        return pte;
+}
+static inline pmd_t pmd_mknuma(pmd_t pmd)
+{
+        return pmd;
+}
+#endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_MMU */
 #endif /* !__ASSEMBLY__ */