author | Dave Hansen <dave.hansen@linux.intel.com> | 2018-01-05 12:44:36 -0500 |
---|---|---|
committer | Thomas Gleixner <tglx@linutronix.de> | 2018-01-06 15:39:10 -0500 |
commit | 01c9b17bf673b05bb401b76ec763e9730ccf1376 | |
tree | e44529cd44bf30899a3182ca874099d0b8c20655 | |
parent | de53c3786a3ce162a1c815d0c04c766c23ec9c0a | |
x86/Documentation: Add PTI description
Add some details about how PTI works, what some of the downsides
are, and how to debug it when things go wrong.
Also document the kernel parameter: 'pti/nopti'.
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Andi Lutomirsky <luto@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20180105174436.1BC6FA2B@viggo.jf.intel.com
-rw-r--r-- | Documentation/admin-guide/kernel-parameters.txt | 21 |
-rw-r--r-- | Documentation/x86/pti.txt | 186 |
2 files changed, 200 insertions, 7 deletions
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 520fdec15bbb..905991745d26 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2685,8 +2685,6 @@
2685 | steal time is computed, but won't influence scheduler | 2685 | steal time is computed, but won't influence scheduler |
2686 | behaviour | 2686 | behaviour |
2687 | | 2687 | |
2688 | nopti [X86-64] Disable kernel page table isolation | | |
2689 | | | |
2690 | nolapic [X86-32,APIC] Do not enable or use the local APIC. | 2688 | nolapic [X86-32,APIC] Do not enable or use the local APIC. |
2691 | | 2689 | |
2692 | nolapic_timer [X86-32,APIC] Do not use the local APIC timer. | 2690 | nolapic_timer [X86-32,APIC] Do not use the local APIC timer. |
@@ -3255,11 +3253,20 @@
3255 | pt. [PARIDE] | 3253 | pt. [PARIDE] |
3256 | See Documentation/blockdev/paride.txt. | 3254 | See Documentation/blockdev/paride.txt. |
3257 | | 3255 | |
3258 | pti= [X86_64] | 3256 | pti= [X86_64] Control Page Table Isolation of user and |
3259 | Control user/kernel address space isolation: | 3257 | kernel address spaces. Disabling this feature |
3260 | on - enable | 3258 | removes hardening, but improves performance of |
3261 | off - disable | 3259 | system calls and interrupts. |
3262 | auto - default setting | 3260 | |
| | 3261 | on - unconditionally enable |
| | 3262 | off - unconditionally disable |
| | 3263 | auto - kernel detects whether your CPU model is |
| | 3264 | vulnerable to issues that PTI mitigates |
| | 3265 | |
| | 3266 | Not specifying this option is equivalent to pti=auto. |
| | 3267 | |
| | 3268 | nopti [X86_64] |
| | 3269 | Equivalent to pti=off |
3263 | | 3270 | |
3264 | pty.legacy_count= | 3271 | pty.legacy_count= |
3265 | [KNL] Number of legacy pty's. Overwrites compiled-in | 3272 | [KNL] Number of legacy pty's. Overwrites compiled-in |
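
The parameter semantics documented in the kernel-parameters.txt change above can be modelled in a few lines of Python: 'pti=' accepts on, off or auto, 'nopti' is equivalent to 'pti=off', and specifying neither is equivalent to 'pti=auto'. This is only a sketch of the documented behaviour, not the kernel's parser. The function name pti_enabled() and the cpu_vulnerable flag are hypothetical, and two details are assumptions that the patch does not state: the last matching option on the command line wins, and unrecognized 'pti=' values are ignored.

```python
def pti_enabled(cmdline: str, cpu_vulnerable: bool) -> bool:
    """Model of the documented 'pti=' / 'nopti' semantics (hypothetical helper).

    cpu_vulnerable stands in for the kernel's own detection of whether the
    CPU model is vulnerable to the issues that PTI mitigates.
    """
    mode = "auto"  # documented: not specifying the option equals pti=auto

    for arg in cmdline.split():
        if arg == "nopti":
            # documented: nopti is equivalent to pti=off
            mode = "off"
        elif arg.startswith("pti="):
            value = arg[len("pti="):]
            # assumption: values other than on/off/auto are ignored
            if value in ("on", "off", "auto"):
                mode = value  # assumption: the last option wins

    if mode == "on":
        return True   # unconditionally enable
    if mode == "off":
        return False  # unconditionally disable
    return cpu_vulnerable  # auto: follow the vulnerability detection


# Example: an explicit pti=on enables isolation even on an unaffected CPU.
print(pti_enabled("root=/dev/sda1 pti=on", cpu_vulnerable=False))
```

With an empty command line the result simply follows cpu_vulnerable, which matches the documented default of pti=auto.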
diff --git a/Documentation/x86/pti.txt b/Documentation/x86/pti.txt
new file mode 100644
index 000000000000..d11eff61fc9a
--- /dev/null
+++ b/Documentation/x86/pti.txt
@@ -0,0 +1,186 @@
1 | Overview | ||
2 | ======== | ||
3 | |||
4 | Page Table Isolation (pti, previously known as KAISER[1]) is a | ||
5 | countermeasure against attacks on the shared user/kernel address | ||
6 | space such as the "Meltdown" approach[2]. | ||
7 | |||
8 | To mitigate this class of attacks, we create an independent set of | ||
9 | page tables for use only when running userspace applications. When | ||
10 | the kernel is entered via syscalls, interrupts or exceptions, the | ||
11 | page tables are switched to the full "kernel" copy. When the system | ||
12 | switches back to user mode, the user copy is used again. | ||
13 | |||
14 | The userspace page tables contain only a minimal amount of kernel | ||
15 | data: only what is needed to enter/exit the kernel such as the | ||
16 | entry/exit functions themselves and the interrupt descriptor table | ||
17 | (IDT). There are a few strictly unnecessary things that get mapped | ||
18 | such as the first C function when entering an interrupt (see | ||
19 | comments in pti.c). | ||
20 | |||
21 | This approach helps to ensure that side-channel attacks leveraging | ||
22 | the paging structures do not function when PTI is enabled. It can be | ||
23 | enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. | ||
24 | Once enabled at compile-time, it can be disabled at boot with the | ||
25 | 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). | ||
26 | |||
27 | Page Table Management | ||
28 | ===================== | ||
29 | |||
30 | When PTI is enabled, the kernel manages two sets of page tables. | ||
31 | The first set is very similar to the single set which is present in | ||
32 | kernels without PTI. This includes a complete mapping of userspace | ||
33 | that the kernel can use for things like copy_to_user(). | ||
34 | |||
35 | Although _complete_, the user portion of the kernel page tables is | ||
36 | crippled by setting the NX bit in the top level. This ensures | ||
37 | that any missed kernel->user CR3 switch will immediately crash | ||
38 | userspace upon executing its first instruction. | ||
39 | |||
40 | The userspace page tables map only the kernel data needed to enter | ||
41 | and exit the kernel. This data is entirely contained in the 'struct | ||
42 | cpu_entry_area' structure which is placed in the fixmap which gives | ||
43 | each CPU's copy of the area a compile-time-fixed virtual address. | ||
44 | |||
45 | For new userspace mappings, the kernel makes the entries in its | ||
46 | page tables like normal. The only difference is when the kernel | ||
47 | makes entries in the top (PGD) level. In addition to setting the | ||
48 | entry in the main kernel PGD, a copy of the entry is made in the | ||
49 | userspace page tables' PGD. | ||
50 | |||
51 | This sharing at the PGD level also inherently shares all the lower | ||
52 | layers of the page tables. This leaves a single, shared set of | ||
53 | userspace page tables to manage. One PTE to lock, one set of | ||
54 | accessed bits, dirty bits, etc... | ||
55 | |||
56 | Overhead | ||
57 | ======== | ||
58 | |||
59 | Protection against side-channel attacks is important. But, | ||
60 | this protection comes at a cost: | ||
61 | |||
62 | 1. Increased Memory Use | ||
63 | a. Each process now needs an order-1 PGD instead of order-0. | ||
64 | (Consumes an additional 4k per process). | ||
65 | b. The 'cpu_entry_area' structure must be 2MB in size and 2MB | ||
66 | aligned so that it can be mapped by setting a single PMD | ||
67 | entry. This consumes nearly 2MB of RAM once the kernel | ||
68 | is decompressed, but no space in the kernel image itself. | ||
69 | |||
70 | 2. Runtime Cost | ||
71 | a. CR3 manipulation to switch between the page table copies | ||
72 | must be done at interrupt, syscall, and exception entry | ||
73 | and exit (it can be skipped when the kernel is interrupted, | ||
74 | though.) Moves to CR3 are on the order of a hundred | ||
75 | cycles, and are required at every entry and exit. | ||
76 | b. A "trampoline" must be used for SYSCALL entry. This | ||
77 | trampoline depends on a smaller set of resources than the | ||
78 | non-PTI SYSCALL entry code, so requires mapping fewer | ||
79 | things into the userspace page tables. The downside is | ||
80 | that stacks must be switched at entry time. | ||
81 | c. Global pages are disabled for all kernel structures not | ||
82 | mapped into both kernel and userspace page tables. This | ||
83 | feature of the MMU allows different processes to share TLB | ||
84 | entries mapping the kernel. Losing the feature means more | ||
85 | TLB misses after a context switch. The actual loss of | ||
86 | performance is very small, however, never exceeding 1%. | ||
87 | d. Process Context IDentifiers (PCID) is a CPU feature that | ||
88 | allows us to skip flushing the entire TLB when switching page | ||
89 | tables by setting a special bit in CR3 when the page tables | ||
90 | are changed. This makes switching the page tables (at context | ||
91 | switch, or kernel entry/exit) cheaper. But, on systems with | ||
92 | PCID support, the context switch code must flush both the user | ||
93 | and kernel entries out of the TLB. The user PCID TLB flush is | ||
94 | deferred until the exit to userspace, minimizing the cost. | ||
95 | See intel.com/sdm for the gory PCID/INVPCID details. | ||
96 | e. The userspace page tables must be populated for each new | ||
97 | process. Even without PTI, the shared kernel mappings | ||
98 | are created by copying top-level (PGD) entries into each | ||
99 | new process. But, with PTI, there are now *two* kernel | ||
100 | mappings: one in the kernel page tables that maps everything | ||
101 | and one for the entry/exit structures. At fork(), we need to | ||
102 | copy both. | ||
103 | f. In addition to the fork()-time copying, there must also | ||
104 | be an update to the userspace PGD any time a set_pgd() is done | ||
105 | on a PGD used to map userspace. This ensures that the kernel | ||
106 | and userspace copies always map the same userspace | ||
107 | memory. | ||
108 | g. On systems without PCID support, each CR3 write flushes | ||
109 | the entire TLB. That means that each syscall, interrupt | ||
110 | or exception flushes the TLB. | ||
111 | h. INVPCID is a TLB-flushing instruction which allows flushing | ||
112 | of TLB entries for non-current PCIDs. Some systems support | ||
113 | PCIDs, but do not support INVPCID. On these systems, addresses | ||
114 | can only be flushed from the TLB for the current PCID. When | ||
115 | flushing a kernel address, we need to flush all PCIDs, so a | ||
116 | single kernel address flush will require a TLB-flushing CR3 | ||
117 | write upon the next use of every PCID. | ||
118 | |||
119 | Possible Future Work | ||
120 | ==================== | ||
121 | 1. We can be more careful about not actually writing to CR3 | ||
122 | unless its value is actually changed. | ||
123 | 2. Allow PTI to be enabled/disabled at runtime in addition to the | ||
124 | boot-time switching. | ||
125 | |||
126 | Testing | ||
127 | ======== | ||
128 | |||
129 | To test stability of PTI, the following test procedure is recommended, | ||
130 | ideally doing all of these in parallel: | ||
131 | |||
132 | 1. Set CONFIG_DEBUG_ENTRY=y | ||
133 | 2. Run several copies of all of the tools/testing/selftests/x86/ tests | ||
134 | (excluding MPX and protection_keys) in a loop on multiple CPUs for | ||
135 | several minutes. These tests frequently uncover corner cases in the | ||
136 | kernel entry code. In general, old kernels might cause these tests | ||
137 | themselves to crash, but they should never crash the kernel. | ||
138 | 3. Run the 'perf' tool in a mode (top or record) that generates many | ||
139 | frequent performance monitoring non-maskable interrupts (see "NMI" | ||
140 | in /proc/interrupts). This exercises the NMI entry/exit code which | ||
141 | is known to trigger bugs in code paths that did not expect to be | ||
142 | interrupted, including nested NMIs. Using "-c" boosts the rate of | ||
143 | NMIs, and using two -c with separate counters encourages nested NMIs | ||
144 | and less deterministic behavior. | ||
145 | |||
146 | while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done | ||
147 | |||
148 | 4. Launch a KVM virtual machine. | ||
149 | 5. Run 32-bit binaries on systems supporting the SYSCALL instruction. | ||
150 | This has been a lightly-tested code path and needs extra scrutiny. | ||
151 | |||
152 | Debugging | ||
153 | ========= | ||
154 | |||
155 | Bugs in PTI cause a few different signatures of crashes | ||
156 | that are worth noting here. | ||
157 | |||
158 | * Failures of the selftests/x86 code. Usually a bug in one of the | ||
159 | more obscure corners of entry_64.S | ||
160 | * Crashes in early boot, especially around CPU bringup. Bugs | ||
161 | in the trampoline code or mappings cause these. | ||
162 | * Crashes at the first interrupt. Caused by bugs in entry_64.S, | ||
163 | like screwing up a page table switch. Also caused by | ||
164 | incorrectly mapping the IRQ handler entry code. | ||
165 | * Crashes at the first NMI. The NMI code is separate from main | ||
166 | interrupt handlers and can have bugs that do not affect | ||
167 | normal interrupts. Also caused by incorrectly mapping NMI | ||
168 | code. NMIs that interrupt the entry code must be very | ||
169 | careful and can be the cause of crashes that show up when | ||
170 | running perf. | ||
171 | * Kernel crashes at the first exit to userspace. entry_64.S | ||
172 | bugs, or failing to map some of the exit code. | ||
173 | * Crashes at first interrupt that interrupts userspace. The paths | ||
174 | in entry_64.S that return to userspace are sometimes separate | ||
175 | from the ones that return to the kernel. | ||
176 | * Double faults: overflowing the kernel stack because of page | ||
177 | faults upon page faults. Caused by touching non-pti-mapped | ||
178 | data in the entry code, or forgetting to switch to kernel | ||
179 | CR3 before calling into C functions which are not pti-mapped. | ||
180 | * Userspace segfaults early in boot, sometimes manifesting | ||
181 | as mount(8) failing to mount the rootfs. These have | ||
182 | tended to be TLB invalidation issues. Usually invalidating | ||
183 | the wrong PCID, or otherwise missing an invalidation. | ||
184 | |||
185 | 1. https://gruss.cc/files/kaiser.pdf | ||
186 | 2. https://meltdownattack.com/meltdown.pdf | ||
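
The memory cost in item 1a of the Overhead section of pti.txt can be worked through with a short Python sketch. An order-n allocation is 2**n contiguous pages, so moving the PGD from order-0 to order-1 doubles it from 4k to 8k, which is the additional 4k per process quoted in the text. The helper names below are hypothetical. The layout used here (kernel PGD in the first page of an 8k-aligned pair, userspace PGD in the second page, so that the two addresses differ only in bit 12) is an assumption that pti.txt itself does not spell out.

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT  # 4k pages on x86-64


def alloc_size(order: int) -> int:
    """Bytes in a page allocation of the given order (2**order pages)."""
    return PAGE_SIZE << order


def user_pgd(kernel_pgd: int) -> int:
    """Address of the userspace PGD, assuming it is the second page of the pair."""
    if kernel_pgd % alloc_size(1) != 0:
        raise ValueError("kernel PGD must be 8k aligned in this model")
    return kernel_pgd | PAGE_SIZE  # set bit 12


def kernel_pgd(any_pgd: int) -> int:
    """Address of the kernel PGD for either half of the pair (clear bit 12)."""
    return any_pgd & ~PAGE_SIZE


def extra_pgd_bytes(nr_processes: int) -> int:
    """Additional PGD memory with PTI: order-1 minus order-0, per process."""
    return nr_processes * (alloc_size(1) - alloc_size(0))


# Example with a made-up, 8k-aligned PGD address.
kpgd = 0x1234000
print(hex(user_pgd(kpgd)), extra_pgd_bytes(1000))
```

Under these assumptions, 1000 processes cost an extra 4096000 bytes of PGD memory, roughly 4MB, which is small next to the runtime costs listed in item 2.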