author | Linus Torvalds <torvalds@linux-foundation.org> | 2014-08-04 20:15:45 -0400 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2014-08-04 20:15:45 -0400 |
commit | ce4747963252a30613ebf1c1df3d83b9526a342e (patch) | |
tree | 6c61d1b1045a72965006324ae3805280be296e53 /Documentation | |
parent | 76f09aa464a1913efd596dd0edbf88f932fde08c (diff) | |
parent | a5102476a24bce364b74f1110005542a2c964103 (diff) |
Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 mm changes from Ingo Molnar:
"The main change in this cycle is the rework of the TLB range flushing
code, to simplify, fix and consolidate the code. By Dave Hansen"
* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Set TLB flush tunable to sane value (33)
x86/mm: New tunable for single vs full TLB flush
x86/mm: Add tracepoints for TLB flushes
x86/mm: Unify remote INVLPG code
x86/mm: Fix missed global TLB flush stat
x86/mm: Rip out complicated, out-of-date, buggy TLB flushing
x86/mm: Clean up the TLB flushing code
x86/smep: Be more informative when signalling an SMEP fault
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/x86/tlb.txt | 75 |
1 files changed, 75 insertions, 0 deletions
diff --git a/Documentation/x86/tlb.txt b/Documentation/x86/tlb.txt
new file mode 100644
index 000000000000..2b3a82e69151
--- /dev/null
+++ b/Documentation/x86/tlb.txt
@@ -0,0 +1,75 @@ | |||
1 | When the kernel unmaps or modifies the attributes of a range of | ||
2 | memory, it has two choices: | ||
3 | 1. Flush the entire TLB with a two-instruction sequence. This is | ||
4 | a quick operation, but it causes collateral damage: TLB entries | ||
5 | from areas other than the one we are trying to flush will be | ||
6 | destroyed and must be refilled later, at some cost. | ||
7 | 2. Use the invlpg instruction to invalidate a single page at a | ||
8 | time. This could potentially cost many more instructions, but | ||
9 | it is a much more precise operation, causing no collateral | ||
10 | damage to other TLB entries. | ||
11 | |||
12 | Which method to use depends on a few things: | ||
13 | 1. The size of the flush being performed. A flush of the entire | ||
14 | address space is obviously better performed by flushing the | ||
15 | entire TLB than doing 2^48/PAGE_SIZE individual flushes. | ||
16 | 2. The contents of the TLB. If the TLB is empty, then there will | ||
17 | be no collateral damage caused by doing the global flush, and | ||
18 | all of the individual flushes will have ended up being wasted | ||
19 | work. | ||
20 | 3. The size of the TLB. The larger the TLB, the more collateral | ||
21 | damage we do with a full flush. So, the larger the TLB, the | ||
22 | more attractive an individual flush looks. Data and | ||
23 | instructions have separate TLBs, as do different page sizes. | ||
24 | 4. The microarchitecture. The TLB has become a multi-level | ||
25 | cache on modern CPUs, and the global flushes have become more | ||
26 | expensive relative to single-page flushes. | ||
27 | |||
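To make the first factor concrete, the arithmetic behind "2^48/PAGE_SIZE individual flushes" can be worked out directly. The sketch below is illustrative Python, not kernel code. It assumes 4 KiB base pages (a PAGE_SHIFT of 12) and a 48-bit virtual address space; the helper name pages_in_range is hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def pages_in_range(start: int, end: int) -> int:
    """Number of base pages touched by the byte range [start, end)."""
    if end <= start:
        return 0
    first_page = start >> PAGE_SHIFT
    last_page = (end - 1) >> PAGE_SHIFT
    return last_page - first_page + 1


# One invlpg per page over a whole 48-bit address space:
print(pages_in_range(0, 1 << 48))  # 68719476736, i.e. 2**36
```

Each of those pages would need its own invlpg, which is why a flush of the whole address space is better served by a single full flush.
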
28 | There is obviously no way the kernel can know all these things, | ||
29 | especially the contents of the TLB during a given flush. The | ||
30 | sizes of the flush will vary greatly depending on the workload as | ||
31 | well. There is essentially no "right" point to choose. | ||
32 | |||
33 | You may be doing too many individual invalidations if you see the | ||
34 | invlpg instruction (or instructions _near_ it) show up high in | ||
35 | profiles. If you believe that individual invalidations are being | ||
36 | called too often, you can lower the tunable: | ||
37 | |||
38 | /sys/kernel/debug/x86/tlb_single_page_flush_ceiling | ||
39 | |||
40 | This will cause us to do the global flush for more cases. | ||
41 | Lowering it to 0 will disable the use of the individual flushes. | ||
42 | Setting it to 1 is a very conservative setting and it should | ||
43 | never need to be 0 under normal circumstances. | ||
44 | |||
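The effect of the tunable can be modelled in a few lines. This is a minimal sketch, not the kernel's implementation: it assumes a full flush is chosen when the range covers strictly more pages than the ceiling, which matches the behaviour described above for the values 0 and 1. The default of 33 is taken from the commit title in this merge; the name choose_flush and the 4 KiB page size are assumptions.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB base pages


def choose_flush(start: int, end: int, ceiling: int = 33) -> str:
    """Return "full" or "single" for a flush of the page-aligned range [start, end)."""
    pages = (end - start) >> PAGE_SHIFT
    # More pages than the ceiling: a single full flush is assumed to be
    # cheaper than issuing one invlpg per page.
    if pages > ceiling:
        return "full"
    return "single"
```

Under this model, a ceiling of 0 sends every non-empty range to the full flush, and a ceiling of 1 allows individual invalidation only for single-page ranges.
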
45 | Despite the fact that a single individual flush on x86 is | ||
46 | guaranteed to flush a full 2MB [1], hugetlbfs always uses the full | ||
47 | flushes. THP is treated exactly the same as normal memory. | ||
48 | |||
49 | You might see invlpg inside of flush_tlb_mm_range() show up in | ||
50 | profiles, or you can use the trace_tlb_flush() tracepoints to | ||
51 | determine how long the flush operations are taking. | ||
52 | |||
53 | Essentially, you are balancing the cycles you spend doing invlpg | ||
54 | with the cycles that you spend refilling the TLB later. | ||
55 | |||
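That balance can be written down as a toy cost model. Every parameter here is a hypothetical input that would have to be measured on the machine in question; the kernel cannot know entries_lost in advance, which is why it relies on a simple page-count ceiling instead. The function name is illustrative only.

```python
def cheaper_strategy(pages: int, invlpg_cycles: int, full_flush_cycles: int,
                     entries_lost: int, refill_cycles: int) -> str:
    """Compare the estimated cycle cost of per-page invlpg against a full flush.

    pages             -- pages in the range being flushed
    invlpg_cycles     -- assumed cost of one invlpg
    full_flush_cycles -- assumed cost of the full-flush instruction sequence
    entries_lost      -- unrelated TLB entries destroyed by a full flush
    refill_cycles     -- assumed cost of refilling one destroyed entry
    """
    single_cost = pages * invlpg_cycles
    full_cost = full_flush_cycles + entries_lost * refill_cycles
    return "single" if single_cost <= full_cost else "full"
```

With entries_lost set to 0 (an empty TLB), the full flush wins as soon as the range is large enough that pages * invlpg_cycles exceeds full_flush_cycles.
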
56 | You can measure how expensive TLB refills are by using | ||
57 | performance counters and 'perf stat', like this: | ||
58 | |||
59 | perf stat -e | ||
60 | cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/, | ||
61 | cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/, | ||
62 | cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/, | ||
63 | cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/, | ||
64 | cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/, | ||
65 | cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/ | ||
66 | |||
67 | That works on an IvyBridge-era CPU (i5-3320M). Different CPUs | ||
68 | may have differently-named counters, but they should at least | ||
69 | be there in some form. You can use pmu-tools 'ocperf list' | ||
70 | (https://github.com/andikleen/pmu-tools) to find the right | ||
71 | counters for a given CPU. | ||
72 | |||
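Once the counters have been collected, a rough per-refill cost falls out of a division. The sketch below assumes, as the event names suggest, that each *_walk_duration counter holds cycles spent in page walks and each *_walk_completed counter holds the number of walks that completed. The function name is hypothetical, and the counter values must come from your own 'perf stat' run.

```python
def average_walk_cycles(walk_duration: int, walk_completed: int) -> float:
    """Mean cycles per completed page walk, or 0.0 if no walks completed."""
    if walk_completed == 0:
        return 0.0
    return walk_duration / walk_completed
```

Computing this separately for the dtlb_load, dtlb_store and itlb pairs gives one estimate per TLB of what each refill after a full flush costs.
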
73 | 1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation" | ||
74 | says: "One execution of INVLPG is sufficient even for a page | ||
75 | with size greater than 4 KBytes." | ||