Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma

Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
 "There are three implementations for NUMA balancing: this tree
  (balancenuma), numacore which has been developed in tip/master, and
  autonuma which is in aa.git.

  In almost all respects balancenuma is the dumbest of the three because
  its main impact is on the VM side with no attempt to be smart about
  scheduling.  In the interest of getting the ball rolling, it would be
  desirable to see this much merged for 3.8 with the view to building
  scheduler smarts on top and adapting the VM where required for 3.9.

  The most recent set of comparisons available from different people are

    mel:    https://lkml.org/lkml/2012/12/9/108
    mingo:  https://lkml.org/lkml/2012/12/7/331
    tglx:   https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

  The results are a mixed bag.  In my own tests, balancenuma does
  reasonably well.  It's dumb as rocks and does not regress against
  mainline.  On the other hand, Ingo's tests show that balancenuma is
  incapable of converging for the workloads driven by perf, which is bad
  but is potentially explained by the lack of scheduler smarts.  Thomas'
  results show balancenuma improves on mainline but falls far short of
  numacore or autonuma.  Srikar's results indicate we all suffer on a
  large machine with imbalanced node sizes.

  My own testing showed that recent numacore results have improved
  dramatically, particularly in the last week, but not universally.
  We've butted heads heavily on system CPU usage and high levels of
  migration even when it shows that overall performance is better.
  There are also cases where it regresses.  Of interest is that for
  specjbb in some configurations it will regress for lower numbers of
  warehouses and show gains for higher numbers, which is not reported by
  the tool by default and sometimes missed in reports.  Recently I
  reported for numacore that the JVM was crashing with
  NullPointerExceptions but currently it's unclear what the source of
  this problem is.  Initially I thought it was in how numacore batch
  handles PTEs but I no longer think this is the case.  It's possible
  numacore is just able to trigger it due to higher rates of migration.

  These reports were quite late in the cycle so I/we would like to start
  with this tree as it contains much of the code we can agree on and has
  not changed significantly over the last 2-3 weeks."

* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  mm: migrate: Account a transhuge page properly when rate limiting
  mm: numa: Account for failed allocations and isolations as migration failures
  mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
  mm: numa: Add THP migration for the NUMA working set scanning fault case.
  mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
  mm: sched: numa: Control enabling and disabling of NUMA balancing
  mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  mm: numa: migrate: Set last_nid on newly allocated page
  mm: numa: split_huge_page: Transfer last_nid on tail page
  mm: numa: Introduce last_nid to the page frame
  sched: numa: Slowly increase the scanning period as NUMA faults are handled
  mm: numa: Rate limit setting of pte_numa if node is saturated
  mm: numa: Rate limit the amount of memory that is migrated between nodes
  mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  mm: numa: Migrate pages handled during a pmd_numa hinting fault
  mm: numa: Migrate on reference policy
  ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2012-12-16 17:33:25 -0500
committer: Linus Torvalds <torvalds@linux-foundation.org> 2012-12-16 18:18:08 -0500
commit: 3d59eebc5e137bd89c6351e4c70e90ba1d0dc234 (patch)
tree: b4ddfd0b057454a7437a3b4e3074a3b8b4b03817 /include/linux/sched.h
parent: 11520e5e7c1855fc3bf202bb3be35a39d9efa034 (diff)
parent: 4fc3f1d66b1ef0d7b8dc11f4ff1cc510f78b37d6 (diff)
1 files changed, 27 insertions, 0 deletions
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2c2f3072beef..b089c92c609b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1527,6 +1527,14 @@ struct task_struct {
        short il_next;
        short pref_node_fork;
 #endif
+#ifdef CONFIG_NUMA_BALANCING
+        int numa_scan_seq;
+        int numa_migrate_seq;
+        unsigned int numa_scan_period;
+        u64 node_stamp;                 /* migration stamp  */
+        struct callback_head numa_work;
+#endif /* CONFIG_NUMA_BALANCING */
+
        struct rcu_head rcu;

        /*
@@ -1601,6 +1609,18 @@ struct task_struct {
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
 #define tsk_cpus_allowed(tsk) (&(tsk)->cpus_allowed)

+#ifdef CONFIG_NUMA_BALANCING
+extern void task_numa_fault(int node, int pages, bool migrated);
+extern void set_numabalancing_state(bool enabled);
+#else
+static inline void task_numa_fault(int node, int pages, bool migrated)
+{
+}
+static inline void set_numabalancing_state(bool enabled)
+{
+}
+#endif
+
 /*
 * Priority of a process goes from 0..MAX_PRIO-1, valid RT
 * priority is 0..MAX_RT_PRIO-1, and SCHED_NORMAL/SCHED_BATCH
@@ -2030,6 +2050,13 @@ enum sched_tunable_scaling {
 };
 extern enum sched_tunable_scaling sysctl_sched_tunable_scaling;

+extern unsigned int sysctl_numa_balancing_scan_delay;
+extern unsigned int sysctl_numa_balancing_scan_period_min;
+extern unsigned int sysctl_numa_balancing_scan_period_max;
+extern unsigned int sysctl_numa_balancing_scan_period_reset;
+extern unsigned int sysctl_numa_balancing_scan_size;
+extern unsigned int sysctl_numa_balancing_settle_count;
+
 #ifdef CONFIG_SCHED_DEBUG
 extern unsigned int sysctl_sched_migration_cost;
 extern unsigned int sysctl_sched_nr_migrate;
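
The second hunk above declares task_numa_fault() and set_numabalancing_state() as real externs when CONFIG_NUMA_BALANCING is set and as empty static inline stubs otherwise, so callers never need their own #ifdef. A minimal userspace sketch of that pattern follows. It is an illustration under stated assumptions, not kernel code: the function body, the faults_recorded counter, and handle_hinting_fault() are hypothetical stand-ins, and only the task_numa_fault() signature is taken from the diff.

```c
#include <stdbool.h>

/* Hypothetical counter, for illustration only; not a kernel variable. */
static int faults_recorded;

#ifdef CONFIG_NUMA_BALANCING
/*
 * Toy body (an assumption). The real task_numa_fault() is defined
 * elsewhere in the kernel and is not part of this diff.
 */
static void task_numa_fault(int node, int pages, bool migrated)
{
	(void)node;
	(void)migrated;
	faults_recorded += pages;
}
#else
/* Same shape as the stub in the diff: an empty static inline. */
static inline void task_numa_fault(int node, int pages, bool migrated)
{
	(void)node;
	(void)pages;
	(void)migrated;
}
#endif

/*
 * Hypothetical caller. It contains no #ifdef of its own; with the option
 * off, the call resolves to the empty stub and does nothing.
 */
static int handle_hinting_fault(int node, int pages, bool migrated)
{
	task_numa_fault(node, pages, migrated);
	return faults_recorded;
}
```

Built as-is, handle_hinting_fault() leaves the counter untouched because the stub is selected. Building with -DCONFIG_NUMA_BALANCING selects the toy body instead, and the same caller then accumulates the page counts.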