Merge branch 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 mm changes from Ingo Molnar:
 "The main change in this cycle is the rework of the TLB range flushing
  code, to simplify, fix and consolidate the code.  By Dave Hansen"

* 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Set TLB flush tunable to sane value (33)
  x86/mm: New tunable for single vs full TLB flush
  x86/mm: Add tracepoints for TLB flushes
  x86/mm: Unify remote INVLPG code
  x86/mm: Fix missed global TLB flush stat
  x86/mm: Rip out complicated, out-of-date, buggy TLB flushing
  x86/mm: Clean up the TLB flushing code
  x86/smep: Be more informative when signalling an SMEP fault
author: Linus Torvalds <torvalds@linux-foundation.org> 2014-08-04 20:15:45 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2014-08-04 20:15:45 -0400
commit: ce4747963252a30613ebf1c1df3d83b9526a342e (patch)
tree: 6c61d1b1045a72965006324ae3805280be296e53 /Documentation
parent: 76f09aa464a1913efd596dd0edbf88f932fde08c (diff)
parent: a5102476a24bce364b74f1110005542a2c964103 (diff)
1 files changed, 75 insertions, 0 deletions
diff --git a/Documentation/x86/tlb.txt b/Documentation/x86/tlb.txt
new file mode 100644
index 000000000000..2b3a82e69151
--- /dev/null
+++ b/Documentation/x86/tlb.txt
@@ -0,0 +1,75 @@
+When the kernel unmaps or modifies the attributes of a range of
+memory, it has two choices:
+ 1. Flush the entire TLB with a two-instruction sequence.  This is
+    a quick operation, but it causes collateral damage: TLB entries
+    from areas other than the one we are trying to flush will be
+    destroyed and must be refilled later, at some cost.
+ 2. Use the invlpg instruction to invalidate a single page at a
+    time.  This could potentially cost many more instructions, but
+    it is a much more precise operation, causing no collateral
+    damage to other TLB entries.
+
+Which method to use depends on a few things:
+ 1. The size of the flush being performed.  A flush of the entire
+    address space is obviously better performed by flushing the
+    entire TLB than doing 2^48/PAGE_SIZE individual flushes.
+ 2. The contents of the TLB.  If the TLB is empty, then there will
+    be no collateral damage caused by doing the global flush, and
+    all of the individual flushes will have ended up being wasted
+    work.
+ 3. The size of the TLB.  The larger the TLB, the more collateral
+    damage we do with a full flush.  So, the larger the TLB, the
+    more attractive an individual flush looks.  Data and
+    instructions have separate TLBs, as do different page sizes.
+ 4. The microarchitecture.  The TLB has become a multi-level
+    cache on modern CPUs, and the global flushes have become more
+    expensive relative to single-page flushes.
+
+There is obviously no way the kernel can know all these things,
+especially the contents of the TLB during a given flush.  The
+sizes of the flush will vary greatly depending on the workload as
+well.  There is essentially no "right" point to choose.
+
+You may be doing too many individual invalidations if you see the
+invlpg instruction (or instructions _near_ it) show up high in
+profiles.  If you believe that individual invalidations are being
+called too often, you can lower the tunable:
+
+        /sys/kernel/debug/x86/tlb_single_page_flush_ceiling
+
+This will cause us to do the global flush for more cases.
+Lowering it to 0 will disable the use of the individual flushes.
+Setting it to 1 is a very conservative setting and it should
+never need to be 0 under normal circumstances.
+
+Despite the fact that a single individual flush on x86 is
+guaranteed to flush a full 2MB [1], hugetlbfs always uses the full
+flushes.  THP is treated exactly the same as normal memory.
+
+You might see invlpg inside of flush_tlb_mm_range() show up in
+profiles, or you can use the trace_tlb_flush() tracepoints to
+determine how long the flush operations are taking.
+
+Essentially, you are balancing the cycles you spend doing invlpg
+with the cycles that you spend refilling the TLB later.
+
+You can measure how expensive TLB refills are by using
+performance counters and 'perf stat', like this:
+
+perf stat -e
+        cpu/event=0x8,umask=0x84,name=dtlb_load_misses_walk_duration/,
+        cpu/event=0x8,umask=0x82,name=dtlb_load_misses_walk_completed/,
+        cpu/event=0x49,umask=0x4,name=dtlb_store_misses_walk_duration/,
+        cpu/event=0x49,umask=0x2,name=dtlb_store_misses_walk_completed/,
+        cpu/event=0x85,umask=0x4,name=itlb_misses_walk_duration/,
+        cpu/event=0x85,umask=0x2,name=itlb_misses_walk_completed/
+
+That works on an IvyBridge-era CPU (i5-3320M).  Different CPUs
+may have differently-named counters, but they should at least
+be there in some form.  You can use pmu-tools 'ocperf list'
+(https://github.com/andikleen/pmu-tools) to find the right
+counters for a given CPU.
+
+1. A footnote in Intel's SDM "4.10.4.2 Recommended Invalidation"
+   says: "One execution of INVLPG is sufficient even for a page
+   with size greater than 4 KBytes."
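
The trade-off described in tlb.txt above can be sketched as a small
user-space C model.  This is an illustration under stated assumptions,
not the kernel's code.  The names pages_in_range(), choose_flush() and
avg_walk_cycles() are hypothetical, 4 KiB base pages are assumed, and
the comparison rule (a range larger than the ceiling gets a full flush)
is inferred from the text's statements that lowering the tunable causes
more global flushes and that 0 disables individual flushes.  The last
helper divides a walk_duration count by the matching walk_completed
count from the 'perf stat' output to estimate, roughly, the average
number of cycles per page walk.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT 12u  /* assumption: 4 KiB base pages */

enum flush_kind { FLUSH_FULL, FLUSH_PER_PAGE };

/* Number of base pages covered by the page-aligned range [start, end). */
static inline uint64_t pages_in_range(uint64_t start, uint64_t end)
{
	assert(start <= end);
	assert(((start | end) & ((1ull << PAGE_SHIFT) - 1)) == 0);
	return (end - start) >> PAGE_SHIFT;
}

/*
 * Assumed decision rule: a range larger than the ceiling is flushed
 * with one full TLB flush, anything else page by page (invlpg).
 */
static inline enum flush_kind choose_flush(uint64_t pages, uint64_t ceiling)
{
	return pages > ceiling ? FLUSH_FULL : FLUSH_PER_PAGE;
}

/*
 * Average cycles per completed page walk, from one *_walk_duration
 * reading and the matching *_walk_completed reading.  Returns 0.0
 * when no walk completed, to avoid dividing by zero.
 */
static inline double avg_walk_cycles(uint64_t walk_duration,
				     uint64_t walk_completed)
{
	if (walk_completed == 0)
		return 0.0;
	return (double)walk_duration / (double)walk_completed;
}
```

With the ceiling of 33 named in the merged commit titles,
choose_flush(33, 33) selects per-page invalidation and
choose_flush(34, 33) selects a full flush.  A ceiling of 0 sends every
non-empty range to the full flush.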