author		Shaohua Li <shaohua.li@intel.com>	2011-01-16 21:52:07 -0500
committer	Ingo Molnar <mingo@elte.hu>	2011-02-14 07:03:08 -0500
commit		70e4a369733a21e3d16b059a6ccdad22a344bf57 (patch)
tree		bb103a7ea3199320dc0b7e5fdf69fe594b863e05 /arch
parent		3a09fb4570a1cce11472b8e5da3f6ee409f529d5 (diff)
x86: Scale up the number of TLB invalidate vectors with NR_CPUS, up to 32
Make the maximum number of TLB invalidate vectors scale linearly with
NR_CPUS, up to a cap of 32 vectors.
We currently have only 8 vectors for TLB invalidation, and that is clearly
inadequate. If we have a lot of CPUs, the CPUs have to share the 8 vectors,
and tlbstate_lock is used to protect them. flush_tlb_page() is heavily used
in page reclaim, which causes a lot of lock contention on tlbstate_lock.

Andi Kleen suggested increasing the number of vectors to 32, which should be
enough on current typical systems to reduce the tlbstate_lock contention.
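To make the sharing concrete, below is a simplified sketch of the
sender-to-vector mapping described above (loosely modeled on
flush_tlb_others_ipi() in arch/x86/mm/tlb.c of this era; the function name
_sketch and the elided IPI details mark it as an illustration, not the
exact kernel source):

	/*
	 * Sketch: the sender CPU picks a vector slot by modulo, and only
	 * serializes on that slot's tlbstate_lock when there are more
	 * CPUs than vectors.
	 */
	static void flush_tlb_others_sketch(const struct cpumask *cpumask,
					    struct mm_struct *mm, unsigned long va)
	{
		unsigned int sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
		union smp_flush_state *f = &flush_state[sender];

		if (nr_cpu_ids > NUM_INVALIDATE_TLB_VECTORS)
			raw_spin_lock(&f->tlbstate_lock);

		/* ... send IPI on INVALIDATE_TLB_VECTOR_START + sender,
		 *     wait for acknowledgement, unlock ... */
	}

With 64 CPUs and 8 vectors, up to 8 CPUs serialize on each lock; with 32
vectors that drops to 2, which is where the reduced contention comes from.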
My test system has 4 sockets, 64 CPUs and 64G of memory. My workload
creates 64 processes; each process mmaps and reads a big, empty sparse
file. The total size of the files is 2*total_mem, so this causes a lot
of page reclaim.
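A standalone approximation of one such process might look like the sketch
below (the file name sparse.dat and the 2G per-process size are assumptions
for illustration; the actual test binary is usemem):

	/* Sketch: map a large sparse file and touch every page.  With the
	 * files totalling 2*total_mem, this drives continuous page reclaim
	 * and hence remote TLB flushes. */
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		size_t size = 2UL << 30;	/* 2G per process: an assumption */
		int fd = open("sparse.dat", O_RDWR | O_CREAT, 0644);
		volatile char sum = 0;

		if (fd < 0 || ftruncate(fd, size) < 0)	/* sparse: no blocks written */
			return 1;

		char *p = mmap(NULL, size, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		for (size_t off = 0; off < size; off += 4096)
			sum += p[off];		/* fault in every page */

		munmap(p, size);
		close(fd);
		return sum != 0;	/* keep the loop from being optimized away */
	}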
Below is the result I get from perf call-graph profiling:
without the patch:
------------------
24.25% usemem [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--42.15%-- native_flush_tlb_others
with the patch:
------------------
14.96% usemem [kernel] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--13.89%-- native_flush_tlb_others
So this significantly reduces the tlbstate_lock contention.
Suggested-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1295232727.1949.709.camel@sli10-conroe>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Diffstat (limited to 'arch')
 arch/x86/include/asm/irq_vectors.h | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/irq_vectors.h b/arch/x86/include/asm/irq_vectors.h
index 42f0d4a30f1b..4980f48bbbb7 100644
--- a/arch/x86/include/asm/irq_vectors.h
+++ b/arch/x86/include/asm/irq_vectors.h
@@ -17,8 +17,8 @@
  * Vectors   0 ...  31 : system traps and exceptions - hardcoded events
  * Vectors  32 ... 127 : device interrupts
  * Vector  128         : legacy int80 syscall interface
- * Vectors 129 ... 229 : device interrupts
- * Vectors 230 ... 255 : special interrupts
+ * Vectors 129 ... INVALIDATE_TLB_VECTOR_START-1 : device interrupts
+ * Vectors INVALIDATE_TLB_VECTOR_START ... 255 : special interrupts
  *
  * 64-bit x86 has per CPU IDT tables, 32-bit has one shared IDT table.
  *
@@ -124,8 +124,13 @@
  */
 #define LOCAL_TIMER_VECTOR		0xef
 
-/* f0-f7 used for spreading out TLB flushes: */
-#define NUM_INVALIDATE_TLB_VECTORS	8
+/* up to 32 vectors used for spreading out TLB flushes: */
+#if NR_CPUS <= 32
+# define NUM_INVALIDATE_TLB_VECTORS	NR_CPUS
+#else
+# define NUM_INVALIDATE_TLB_VECTORS	32
+#endif
+
 #define INVALIDATE_TLB_VECTOR_END	0xee
 #define INVALIDATE_TLB_VECTOR_START	\
 	(INVALIDATE_TLB_VECTOR_END - NUM_INVALIDATE_TLB_VECTORS + 1)
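For reference, a worked example of the new macros, with the values computed
from the definitions above (not part of the patch):

	/*
	 * NR_CPUS = 8:  NUM_INVALIDATE_TLB_VECTORS = 8
	 *               INVALIDATE_TLB_VECTOR_START = 0xee - 8 + 1  = 0xe7
	 * NR_CPUS = 64: NUM_INVALIDATE_TLB_VECTORS = 32 (capped)
	 *               INVALIDATE_TLB_VECTOR_START = 0xee - 32 + 1 = 0xcf
	 *
	 * So on large systems vectors 0xcf..0xee are reserved for TLB
	 * flushes, and the 129 ... INVALIDATE_TLB_VECTOR_START-1
	 * device-interrupt range shrinks accordingly.
	 */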