author     Linus Torvalds <torvalds@linux-foundation.org>  2012-12-16 17:33:25 -0500
committer  Linus Torvalds <torvalds@linux-foundation.org>  2012-12-16 18:18:08 -0500
commit     3d59eebc5e137bd89c6351e4c70e90ba1d0dc234
tree       b4ddfd0b057454a7437a3b4e3074a3b8b4b03817 /arch/x86
parent     11520e5e7c1855fc3bf202bb3be35a39d9efa034
parent     4fc3f1d66b1ef0d7b8dc11f4ff1cc510f78b37d6
Merge tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma

Pull Automatic NUMA Balancing bare-bones from Mel Gorman:
 "There are three implementations for NUMA balancing, this tree
  (balancenuma), numacore which has been developed in tip/master and
  autonuma which is in aa.git.

  In almost all respects balancenuma is the dumbest of the three
  because its main impact is on the VM side with no attempt to be smart
  about scheduling. In the interest of getting the ball rolling, it
  would be desirable to see this much merged for 3.8 with the view to
  building scheduler smarts on top and adapting the VM where required
  for 3.9.

  The most recent set of comparisons available from different people
  are

    mel:    https://lkml.org/lkml/2012/12/9/108
    mingo:  https://lkml.org/lkml/2012/12/7/331
    tglx:   https://lkml.org/lkml/2012/12/10/437
    srikar: https://lkml.org/lkml/2012/12/10/397

  The results are a mixed bag. In my own tests, balancenuma does
  reasonably well. It's dumb as rocks and does not regress against
  mainline. On the other hand, Ingo's tests show that balancenuma is
  incapable of converging for the workloads driven by perf, which is
  bad but is potentially explained by the lack of scheduler smarts.
  Thomas' results show balancenuma improves on mainline but falls far
  short of numacore or autonuma. Srikar's results indicate we all
  suffer on a large machine with imbalanced node sizes.

  My own testing showed that recent numacore results have improved
  dramatically, particularly in the last week but not universally.
  We've butted heads heavily on system CPU usage and high levels of
  migration even when it shows that overall performance is better.
  There are also cases where it regresses. Of interest is that for
  specjbb in some configurations it will regress for lower numbers of
  warehouses and show gains for higher numbers, which is not reported
  by the tool by default and sometimes missed in reports. Recently I
  reported for numacore that the JVM was crashing with
  NullPointerExceptions but currently it's unclear what the source of
  this problem is. Initially I thought it was in how numacore batch
  handles PTEs but I no longer think this is the case. It's possible
  numacore is just able to trigger it due to higher rates of migration.

  These reports were quite late in the cycle so I/we would like to
  start with this tree as it contains much of the code we can agree on
  and has not changed significantly over the last 2-3 weeks."

* tag 'balancenuma-v11' of git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux-balancenuma: (50 commits)
  mm/rmap, migration: Make rmap_walk_anon() and try_to_unmap_anon() more scalable
  mm/rmap: Convert the struct anon_vma::mutex to an rwsem
  mm: migrate: Account a transhuge page properly when rate limiting
  mm: numa: Account for failed allocations and isolations as migration failures
  mm: numa: Add THP migration for the NUMA working set scanning fault case build fix
  mm: numa: Add THP migration for the NUMA working set scanning fault case.
  mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node
  mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG
  mm: sched: numa: Control enabling and disabling of NUMA balancing
  mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate
  mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships
  mm: numa: migrate: Set last_nid on newly allocated page
  mm: numa: split_huge_page: Transfer last_nid on tail page
  mm: numa: Introduce last_nid to the page frame
  sched: numa: Slowly increase the scanning period as NUMA faults are handled
  mm: numa: Rate limit setting of pte_numa if node is saturated
  mm: numa: Rate limit the amount of memory that is migrated between nodes
  mm: numa: Structures for Migrate On Fault per NUMA migration rate limiting
  mm: numa: Migrate pages handled during a pmd_numa hinting fault
  mm: numa: Migrate on reference policy
  ...

Diffstat (limited to 'arch/x86')
-rw-r--r--  arch/x86/Kconfig                       2
-rw-r--r--  arch/x86/include/asm/pgtable.h        17
-rw-r--r--  arch/x86/include/asm/pgtable_types.h  20
-rw-r--r--  arch/x86/mm/pgtable.c                  8
4 files changed, 44 insertions, 3 deletions
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 65a872bf72f9..97f8c5ad8c2d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -22,6 +22,8 @@ config X86
 	def_bool y
 	select HAVE_AOUT if X86_32
 	select HAVE_UNSTABLE_SCHED_CLOCK
+	select ARCH_SUPPORTS_NUMA_BALANCING
+	select ARCH_WANTS_PROT_NUMA_PROT_NONE
 	select HAVE_IDE
 	select HAVE_OPROFILE
 	select HAVE_PCSPKR_PLATFORM
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index a1f780d45f76..5199db2923d3 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -404,7 +404,14 @@ static inline int pte_same(pte_t a, pte_t b)
 
 static inline int pte_present(pte_t a)
 {
-	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
+	return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE |
+			       _PAGE_NUMA);
+}
+
+#define pte_accessible pte_accessible
+static inline int pte_accessible(pte_t a)
+{
+	return pte_flags(a) & _PAGE_PRESENT;
 }
 
 static inline int pte_hidden(pte_t pte)
@@ -420,7 +427,8 @@ static inline int pmd_present(pmd_t pmd)
 	 * the _PAGE_PSE flag will remain set at all times while the
 	 * _PAGE_PRESENT bit is clear).
 	 */
-	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
+	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE |
+				 _PAGE_NUMA);
 }
 
 static inline int pmd_none(pmd_t pmd)
@@ -479,6 +487,11 @@ static inline pte_t *pte_offset_kernel(pmd_t *pmd, unsigned long address)
 
 static inline int pmd_bad(pmd_t pmd)
 {
+#ifdef CONFIG_NUMA_BALANCING
+	/* pmd_numa check */
+	if ((pmd_flags(pmd) & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)
+		return 0;
+#endif
 	return (pmd_flags(pmd) & ~_PAGE_USER) != _KERNPG_TABLE;
 }
 
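
The effect of the three predicate changes above can be tried out in user space. What follows is a minimal sketch under stated assumptions, not the kernel code: the bit positions are the usual x86 ones and are assumed here rather than taken from this diff, `pteval_t` is reduced to a plain `uint64_t`, `SK_KERNPG_TABLE` assumes the macro continues with the dirty bit, and every `SK_`/`sketch_` name is hypothetical. Unlike the kernel helpers, the sketch normalises its results to 0 or 1.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;	/* stand-in for the kernel type */

/* Assumed x86 bit positions, for illustration only. */
#define SK_PAGE_PRESENT		((pteval_t)1 << 0)
#define SK_PAGE_RW		((pteval_t)1 << 1)
#define SK_PAGE_USER		((pteval_t)1 << 2)
#define SK_PAGE_ACCESSED	((pteval_t)1 << 5)
#define SK_PAGE_DIRTY		((pteval_t)1 << 6)
#define SK_PAGE_PROTNONE	((pteval_t)1 << 8)
#define SK_PAGE_NUMA		SK_PAGE_PROTNONE	/* shared bit, as in the patch */
#define SK_KERNPG_TABLE		(SK_PAGE_PRESENT | SK_PAGE_RW | \
				 SK_PAGE_ACCESSED | SK_PAGE_DIRTY)

/* "Present" for the VM: mapped, PROT_NONE, or marked for a NUMA hint. */
static inline int sketch_pte_present(pteval_t flags)
{
	return (flags & (SK_PAGE_PRESENT | SK_PAGE_PROTNONE |
			 SK_PAGE_NUMA)) != 0;
}

/* "Accessible" for the hardware: only the real present bit counts. */
static inline int sketch_pte_accessible(pteval_t flags)
{
	return (flags & SK_PAGE_PRESENT) != 0;
}

static inline int sketch_pmd_bad(pteval_t flags)
{
	/* pmd_numa check: NUMA bit set while the present bit is clear */
	if ((flags & (SK_PAGE_NUMA | SK_PAGE_PRESENT)) == SK_PAGE_NUMA)
		return 0;
	return (flags & ~SK_PAGE_USER) != SK_KERNPG_TABLE;
}

/* Self-check of the cases the patch is concerned with. */
static inline void sketch_selfcheck(void)
{
	/* A pte whose present bit was swapped for the NUMA bit. */
	pteval_t numa_pte = SK_PAGE_RW | SK_PAGE_USER | SK_PAGE_NUMA;
	/* A page-table pmd that received the same treatment. */
	pteval_t numa_pmd = ((SK_KERNPG_TABLE | SK_PAGE_USER) &
			     ~SK_PAGE_PRESENT) | SK_PAGE_NUMA;

	assert(sketch_pte_present(numa_pte));
	assert(!sketch_pte_accessible(numa_pte));
	assert(!sketch_pmd_bad(numa_pmd));
}
```

Calling `sketch_selfcheck()` from a test program exercises the interesting case: an entry marked for a NUMA hinting fault still counts as present for the VM but is not accessible to the hardware, and a pmd marked the same way is not reported as bad even though its flags no longer equal the table pattern.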
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc9505d..3c32db8c539d 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
 #define _PAGE_FILE	(_AT(pteval_t, 1) << _PAGE_BIT_FILE)
 #define _PAGE_PROTNONE	(_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
 
+/*
+ * _PAGE_NUMA indicates that this page will trigger a numa hinting
+ * minor page fault to gather numa placement statistics (see
+ * pte_numa()). The bit picked (8) is within the range between
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present.
+ *
+ * The bit picked must be always zero when the pmd is present and not
+ * present, so that we don't lose information when we set it while
+ * atomically clearing the present bit.
+ *
+ * Because we shared the same bit (8) with _PAGE_PROTNONE this can be
+ * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE
+ * couldn't reach, like handle_mm_fault() (see access_error in
+ * arch/x86/mm/fault.c, the vma protection must not be PROT_NONE for
+ * handle_mm_fault() to be invoked).
+ */
+#define _PAGE_NUMA	_PAGE_PROTNONE
+
 #define _PAGE_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_USER |	\
 			 _PAGE_ACCESSED | _PAGE_DIRTY)
 #define _KERNPG_TABLE	(_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED |	\
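
The comment in this hunk argues that one bit can carry two meanings because the surrounding context tells them apart. The following user-space sketch illustrates that reasoning only. It is not how the kernel structures the decision: the `HINT_` names, the enum, and the `vma_is_prot_none` parameter are hypothetical, the bit positions are assumed x86 values, and the "NUMA bit set, present bit clear" test is borrowed from the `pmd_bad()` hunk in pgtable.h.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t pteval_t;	/* stand-in for the kernel type */

/* Assumed x86 bit positions, for illustration only. */
#define HINT_PAGE_PRESENT	((pteval_t)1 << 0)
#define HINT_PAGE_PROTNONE	((pteval_t)1 << 8)
#define HINT_PAGE_NUMA		HINT_PAGE_PROTNONE	/* same bit */

enum hint_meaning {
	HINT_MAPPED,		/* hardware present bit is set */
	HINT_NOT_PRESENT,	/* neither bit set (e.g. a swap entry) */
	HINT_PROT_NONE,		/* bit 8 read as PROT_NONE */
	HINT_NUMA_FAULT,	/* bit 8 read as a NUMA hinting marker */
};

/*
 * Decide what bit 8 means for an entry. The VMA protection stands in
 * for "which code path are we on": per the comment, a PROT_NONE VMA
 * never reaches handle_mm_fault(), so anything that does reach it may
 * read the bit as _PAGE_NUMA.
 */
static inline enum hint_meaning
hint_classify(pteval_t flags, int vma_is_prot_none)
{
	if (flags & HINT_PAGE_PRESENT)
		return HINT_MAPPED;
	if (!(flags & HINT_PAGE_NUMA))
		return HINT_NOT_PRESENT;
	return vma_is_prot_none ? HINT_PROT_NONE : HINT_NUMA_FAULT;
}

/* Self-check of the four outcomes. */
static inline void hint_selfcheck(void)
{
	assert(hint_classify(HINT_PAGE_PRESENT, 0) == HINT_MAPPED);
	assert(hint_classify(0, 0) == HINT_NOT_PRESENT);
	assert(hint_classify(HINT_PAGE_NUMA, 1) == HINT_PROT_NONE);
	assert(hint_classify(HINT_PAGE_NUMA, 0) == HINT_NUMA_FAULT);
}
```

The identical flag value `HINT_PAGE_NUMA` yields two different answers depending on the VMA protection, which is the property that lets the patch reuse the bit without changing the swap entry format.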
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 217eb705fac0..e27fbf887f3b 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -301,6 +301,13 @@ void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 	free_page((unsigned long)pgd);
 }
 
+/*
+ * Used to set accessed or dirty bits in the page table entries
+ * on other architectures. On x86, the accessed and dirty bits
+ * are tracked by hardware. However, do_wp_page calls this function
+ * to also make the pte writeable at the same time the dirty bit is
+ * set. In that case we do actually need to write the PTE.
+ */
 int ptep_set_access_flags(struct vm_area_struct *vma,
 			  unsigned long address, pte_t *ptep,
 			  pte_t entry, int dirty)
@@ -310,7 +317,6 @@ int ptep_set_access_flags(struct vm_area_struct *vma,
 	if (changed && dirty) {
 		*ptep = entry;
 		pte_update_defer(vma->vm_mm, address, ptep);
-		flush_tlb_page(vma, address);
 	}
 
 	return changed;