Diffstat (limited to 'Documentation/x86/intel_mpx.txt')
-rw-r--r-- | Documentation/x86/intel_mpx.txt | 234 |
1 files changed, 234 insertions, 0 deletions
diff --git a/Documentation/x86/intel_mpx.txt b/Documentation/x86/intel_mpx.txt new file mode 100644 index 000000000000..4472ed2ad921 --- /dev/null +++ b/Documentation/x86/intel_mpx.txt | |||
@@ -0,0 +1,234 @@ | |||
1 | 1. Intel(R) MPX Overview | ||
2 | ======================== | ||
3 | |||
4 | Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability | ||
5 | introduced into Intel Architecture. Intel MPX provides hardware features | ||
6 | that can be used in conjunction with compiler changes to check memory | ||
7 | references, for those references whose compile-time normal intentions are | ||
8 | usurped at runtime due to buffer overflow or underflow. | ||
9 | |||
10 | For more information, please refer to Intel(R) Architecture Instruction | ||
11 | Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection | ||
12 | Extensions. | ||
13 | |||
14 | Note: Currently no hardware with MPX ISA is available but it is always | ||
15 | possible to use SDE (Intel(R) Software Development Emulator) instead, which | ||
16 | can be downloaded from | ||
17 | http://software.intel.com/en-us/articles/intel-software-development-emulator | ||
18 | |||
19 | |||
20 | 2. How to get the advantage of MPX | ||
21 | ================================== | ||
22 | |||
23 | For MPX to work, changes are required in the kernel, binutils and compiler. | ||
24 | No source changes are required for applications, just a recompile. | ||
25 | |||
26 | There are a lot of moving parts that all need to work right. The following | ||
27 | is how we expect the compiler, application and kernel to work together. | ||
28 | |||
29 | 1) Application developer compiles with -fmpx. The compiler will add the | ||
30 | instrumentation as well as some setup code called early after the app | ||
31 | starts. New instruction prefixes are noops for old CPUs. | ||
32 | 2) That setup code allocates (virtual) space for the "bounds directory", | ||
33 | points the "bndcfgu" register to the directory and notifies the kernel | ||
34 | (via the new prctl(PR_MPX_ENABLE_MANAGEMENT)) that the app will be using | ||
35 | MPX. | ||
36 | 3) The kernel detects that the CPU has MPX, allows the new prctl() to | ||
37 | succeed, and notes the location of the bounds directory. Userspace is | ||
38 | expected to keep the bounds directory at that location. We note it | ||
39 | instead of reading it each time because the 'xsave' operation needed | ||
40 | to access the bounds directory register is an expensive operation. | ||
41 | 4) If the application needs to spill bounds out of the 4 registers, it | ||
42 | issues a bndstx instruction. Since the bounds directory is empty at | ||
43 | this point, a bounds fault (#BR) is raised, the kernel allocates a | ||
44 | bounds table (in the user address space) and makes the relevant entry | ||
45 | in the bounds directory point to the new table. | ||
46 | 5) If the application violates the bounds specified in the bounds registers, | ||
47 | a separate kind of #BR is raised which will deliver a signal with | ||
48 | information about the violation in the 'struct siginfo'. | ||
49 | 6) Whenever memory is freed, we know that it can no longer contain valid | ||
50 | pointers, and we attempt to free the associated space in the bounds | ||
51 | tables. If an entire table becomes unused, we will attempt to free | ||
52 | the table and remove the entry in the directory. | ||
53 | |||
54 | To summarize, there are essentially three things interacting here: | ||
55 | |||
56 | GCC with -fmpx: | ||
57 | * enables annotation of code with MPX instructions and prefixes | ||
58 | * inserts code early in the application to call in to the "gcc runtime" | ||
59 | GCC MPX Runtime: | ||
60 | * Checks for hardware MPX support in cpuid leaf | ||
61 | * allocates virtual space for the bounds directory (malloc() essentially) | ||
62 | * points the hardware BNDCFGU register at the directory | ||
63 | * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to | ||
64 | start managing the bounds directories | ||
65 | Kernel MPX Code: | ||
66 | * Checks for hardware MPX support in cpuid leaf | ||
67 | * Handles #BR exceptions and sends SIGSEGV to the app when it violates | ||
68 | bounds, like during a buffer overflow. | ||
69 | * When bounds are spilled in to an unallocated bounds table, the kernel | ||
70 | notices in the #BR exception, allocates the virtual space, then | ||
71 | updates the bounds directory to point to the new table. It keeps | ||
72 | special track of the memory with a VM_MPX flag. | ||
73 | * Frees unused bounds tables at the time that the memory they described | ||
74 | is unmapped. | ||
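
Both the runtime and the kernel start by checking CPUID, as listed above. The following is a minimal userspace sketch of that check, not the actual runtime or kernel code. The leaf and bit position (leaf 7, subleaf 0, EBX bit 14) come from Intel's instruction set reference rather than from this document, and the function name cpu_has_mpx() is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>	/* GCC/Clang helper for the CPUID instruction */
#endif

/* CPUID.(EAX=7,ECX=0):EBX bit 14 advertises MPX (per Intel's reference,
 * not stated in this document). */
#define CPUID_LEAF7_EBX_MPX (1u << 14)

static_assert(CPUID_LEAF7_EBX_MPX == 0x4000u, "MPX is EBX bit 14");

/* Hypothetical helper: true if the CPU reports the MPX feature bit. */
bool cpu_has_mpx(void)
{
#if defined(__x86_64__) || defined(__i386__)
	unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

	/* __get_cpuid_count() returns 0 when leaf 7 is not supported. */
	if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
		return false;
	return (ebx & CPUID_LEAF7_EBX_MPX) != 0;
#else
	return false;	/* MPX only exists on x86 */
#endif
}
```

This only tests the feature bit. Whether the operating system has enabled the MPX register state is a separate question that this sketch does not answer.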
75 | |||
76 | |||
77 | 3. How does MPX kernel code work | ||
78 | ================================ | ||
79 | |||
80 | Handling #BR faults caused by MPX | ||
81 | --------------------------------- | ||
82 | |||
83 | When MPX is enabled, there are 2 new situations that can generate | ||
84 | #BR faults. | ||
85 | * new bounds tables (BT) need to be allocated to save bounds. | ||
86 | * bounds violation caused by MPX instructions. | ||
87 | |||
88 | We hook the #BR handler to handle these two new situations. | ||
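
The two situations can be told apart by the error code that the hardware leaves in the BNDSTATUS register. The sketch below illustrates that dispatch in plain C. The encoding (low two bits: 1 for a bounds violation, 2 for an invalid bounds directory entry) is taken from Intel's MPX reference and is an assumption as far as this document is concerned; the enum and function names are hypothetical and this is not the kernel's handler.

```c
#include <assert.h>
#include <stdint.h>

/* Low two bits of BNDSTATUS carry the error code (assumed from Intel's
 * MPX reference; this document does not spell out the encoding). */
#define BNDSTATUS_ERROR_MASK UINT64_C(0x3)

enum br_cause {
	BR_NOT_MPX = 0,			/* e.g. a legacy BOUND instruction */
	BR_BOUNDS_VIOLATION = 1,	/* deliver a signal to the app */
	BR_MISSING_BOUNDS_TABLE = 2,	/* allocate a bounds table */
	BR_UNKNOWN = 3,
};

static_assert(BR_MISSING_BOUNDS_TABLE == 2, "must match the error code");

/* Hypothetical helper: classify a #BR from the saved BNDSTATUS value. */
enum br_cause classify_br(uint64_t bndstatus)
{
	switch (bndstatus & BNDSTATUS_ERROR_MASK) {
	case 0:
		return BR_NOT_MPX;
	case 1:
		return BR_BOUNDS_VIOLATION;
	case 2:
		return BR_MISSING_BOUNDS_TABLE;
	default:
		return BR_UNKNOWN;
	}
}
```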
89 | |||
90 | On-demand kernel allocation of bounds tables | ||
91 | -------------------------------------------- | ||
92 | |||
93 | MPX only has 4 hardware registers for storing bounds information. If | ||
94 | MPX-enabled code needs more than these 4 registers, it needs to spill | ||
95 | them somewhere. It has two special instructions for this which allow | ||
96 | the bounds to be moved between the bounds registers and some new "bounds | ||
97 | tables". | ||
98 | |||
99 | #BR exceptions are a new class of exceptions just for MPX. They are | ||
100 | similar conceptually to a page fault and will be raised by the MPX | ||
101 | hardware both during bounds violations and when the tables are not | ||
102 | present. The kernel handles those #BR exceptions for not-present tables | ||
103 | by carving the space out of the normal process's address space and then | ||
104 | pointing the bounds directory over to it. | ||
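
To make "pointing the bounds directory over to it" concrete, here is a rough sketch of how the address of a stored pointer selects a directory entry and a table entry on 64-bit. The layout constants (8-byte directory entries indexed by address bits 47:20, 32-byte table entries indexed by bits 19:3) are taken from Intel's MPX reference, not from this document, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit layout, assumed from Intel's MPX reference. */
#define BD_INDEX_SHIFT	20u	/* directory index = address bits 47:20 */
#define BD_INDEX_BITS	28u
#define BD_ENTRY_SIZE	8u	/* bytes per bounds directory entry */
#define BT_INDEX_SHIFT	3u	/* table index = address bits 19:3 */
#define BT_INDEX_BITS	17u
#define BT_ENTRY_SIZE	32u	/* bytes per bounds table entry */

/* Sanity check: 2^28 entries * 8 bytes is the 2GB directory. */
static_assert((UINT64_C(1) << BD_INDEX_BITS) * BD_ENTRY_SIZE ==
	      (UINT64_C(2) << 30), "bounds directory should be 2GB");

/* Index of the directory entry for the memory location ptr_addr. */
uint64_t bd_index(uint64_t ptr_addr)
{
	return (ptr_addr >> BD_INDEX_SHIFT) &
	       ((UINT64_C(1) << BD_INDEX_BITS) - 1);
}

/* Index of the table entry for ptr_addr within its bounds table. */
uint64_t bt_index(uint64_t ptr_addr)
{
	return (ptr_addr >> BT_INDEX_SHIFT) &
	       ((UINT64_C(1) << BT_INDEX_BITS) - 1);
}

/* Address of the directory entry the kernel fills in after allocating
 * a table for ptr_addr. */
uint64_t bd_entry_addr(uint64_t bd_base, uint64_t ptr_addr)
{
	return bd_base + bd_index(ptr_addr) * BD_ENTRY_SIZE;
}
```

Under these assumptions each bounds table describes 1MB of address space, and each 8-byte pointer slot needs a 32-byte table entry.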
105 | |||
106 | The tables need to be accessed and controlled by userspace because | ||
107 | the instructions for moving bounds in and out of them are extremely | ||
108 | frequent. They potentially happen every time a register points to | ||
109 | memory. Any direct kernel involvement (like a syscall) to access the | ||
110 | tables would obviously destroy performance. | ||
111 | |||
112 | Why not do this in userspace? MPX does not strictly require anything in | ||
113 | the kernel. It can theoretically be done completely from userspace. Here | ||
114 | are a few ways this could be done. We don't think any of them are practical | ||
115 | in the real-world, but here they are. | ||
116 | |||
117 | Q: Can virtual space simply be reserved for the bounds tables so that we | ||
118 | never have to allocate them? | ||
119 | A: An MPX-enabled application may create a lot of bounds tables in | ||
120 | process address space to save bounds information. These tables can take | ||
121 | up huge swaths of memory (as much as 80% of the memory on the system) | ||
122 | even if we clean them up aggressively. In the worst-case scenario, the | ||
123 | tables can be 4x the size of the data structure being tracked. IOW, a | ||
124 | 1-page structure can require 4 bounds-table pages. An X-GB virtual | ||
125 | area needs 4*X GB of virtual space, plus 2GB for the bounds directory. | ||
126 | If we were to preallocate them for the 128TB of user virtual address | ||
127 | space, we would need to reserve 512TB+2GB, which is larger than the | ||
128 | entire virtual address space today. This means they cannot be reserved | ||
129 | ahead of time. Also, a single process's pre-populated bounds directory | ||
130 | consumes 2GB of virtual *AND* physical memory. IOW, it's completely | ||
131 | infeasible to prepopulate bounds directories. | ||
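
The arithmetic in this answer can be checked with a few lines of C. This is only a worked version of the numbers quoted above (4x table overhead plus a 2GB directory). The 256TB figure for the total 48-bit virtual address space is an assumption added here for comparison, and the names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (UINT64_C(1) << 30)
#define TiB (UINT64_C(1) << 40)

#define MPX_BT_OVERHEAD	4u		/* worst case: tables are 4x the data */
#define MPX_BD_SIZE	(2 * GiB)	/* bounds directory */
#define VA_SPACE_48BIT	(256 * TiB)	/* assumption: 48-bit addressing */

/* Virtual space needed to pre-reserve tables for tracked_bytes of data. */
uint64_t mpx_worst_case_reserve(uint64_t tracked_bytes)
{
	return MPX_BT_OVERHEAD * tracked_bytes + MPX_BD_SIZE;
}

/* The claim in the text: reserving for 128TB of user space does not fit. */
static_assert(MPX_BT_OVERHEAD * (128 * TiB) + MPX_BD_SIZE > VA_SPACE_48BIT,
	      "pre-reservation exceeds the whole virtual address space");
```

For example, mpx_worst_case_reserve(128 * TiB) evaluates to 512TB + 2GB, matching the figure in the text.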
132 | |||
133 | Q: Can we preallocate bounds table space at the same time memory is | ||
134 | allocated which might contain pointers that might eventually need | ||
135 | bounds tables? | ||
136 | A: This would work if we could hook the site of each and every memory | ||
137 | allocation syscall. This can be done for small, constrained applications. | ||
138 | But, it isn't practical at a larger scale since a given app has no | ||
139 | way of controlling how all the parts of the app might allocate memory | ||
140 | (think libraries). The kernel is really the only place to intercept | ||
141 | these calls. | ||
142 | |||
143 | Q: Could a bounds fault be handed to userspace and the tables allocated | ||
144 | there in a signal handler instead of in the kernel? | ||
145 | A: mmap() is not on the list of async-signal-safe functions, and even | ||
146 | if mmap() would work it still requires locking or nasty tricks to | ||
147 | keep track of the allocation state there. | ||
148 | |||
149 | Having ruled out all of the userspace-only approaches for managing | ||
150 | bounds tables that we could think of, we create them on demand in | ||
151 | the kernel. | ||
152 | |||
153 | Decoding MPX instructions | ||
154 | ------------------------- | ||
155 | |||
156 | If a #BR is generated due to a bounds violation caused by MPX, | ||
157 | we need to decode the MPX instruction to get the violation address and | ||
158 | set this address into the extended struct siginfo. | ||
159 | |||
160 | The _sigfault field of struct siginfo is extended as follows: | ||
161 | |||
162 |     /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */ | ||
163 |     struct { | ||
164 |         void __user *_addr; /* faulting insn/memory ref. */ | ||
165 | #ifdef __ARCH_SI_TRAPNO | ||
166 |         int _trapno; /* TRAP # which caused the signal */ | ||
167 | #endif | ||
168 |         short _addr_lsb; /* LSB of the reported address */ | ||
169 |         struct { | ||
170 |             void __user *_lower; | ||
171 |             void __user *_upper; | ||
172 |         } _addr_bnd; | ||
173 |     } _sigfault; | ||
174 | |||
175 | The '_addr' field refers to the violation address, and the new '_addr_bnd' | ||
176 | field refers to the upper/lower bounds in effect when the #BR was raised. | ||
177 | |||
178 | Glibc will also be updated to support this new siginfo, so users | ||
179 | can get the violation address and bounds when bounds violations occur. | ||
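
From userspace, the extended siginfo would be consumed in a SIGSEGV handler. The sketch below assumes the C library exposes the new fields as si_lower and si_upper and the si_code value as SEGV_BNDERR (names used by later glibc and kernel headers, not given in this document), and guards their use so it still builds where they are missing. The handler and installer names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Details of the last fault, recorded for inspection in a debugger. */
static void *volatile last_addr;
static void *volatile last_lower;
static void *volatile last_upper;

static void segv_handler(int sig, siginfo_t *info, void *ucontext)
{
	static const char msg[] = "SIGSEGV received\n";
	ssize_t n;

	(void)sig;
	(void)ucontext;
	last_addr = info->si_addr;	/* violation address */
#if defined(SEGV_BNDERR) && defined(si_lower) && defined(si_upper)
	if (info->si_code == SEGV_BNDERR) {
		last_lower = info->si_lower;	/* bounds that were violated */
		last_upper = info->si_upper;
	}
#endif
	/* Only async-signal-safe calls are used in the handler. */
	n = write(STDERR_FILENO, msg, sizeof(msg) - 1);
	(void)n;
	_exit(1);
}

/* Hypothetical helper: returns 0 on success, -1 on failure. */
int install_bounds_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO;	/* request the siginfo_t argument */
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```

Calling install_bounds_handler() early in main() is enough to have the handler run on a bounds violation.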
180 | |||
181 | Cleanup unused bounds tables | ||
182 | ---------------------------- | ||
183 | |||
184 | When a BNDSTX instruction attempts to save bounds to a bounds directory | ||
185 | entry marked as invalid, a #BR is generated. This is an indication that | ||
186 | no bounds table exists for this entry. In this case the fault handler | ||
187 | will allocate a new bounds table on demand. | ||
188 | |||
189 | Since the kernel allocated those tables on-demand without userspace | ||
190 | knowledge, it is also responsible for freeing them when the associated | ||
191 | mappings go away. | ||
192 | |||
193 | The solution is to hook do_munmap() and check whether the process is | ||
194 | MPX-enabled. If it is, the bounds tables covering the virtual address | ||
195 | region being unmapped are freed as well. | ||
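
As an illustration of which tables an unmap can release, the sketch below computes the bounds directory entries whose tables lie entirely inside an unmapped range. It assumes each bounds table describes 1MB of address space (from Intel's 64-bit MPX layout, not from this document). It is deliberately simplified: real cleanup also has to deal with tables only partly covered by the range and with neighbouring mappings that still use them. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one bounds table describes 1MB of virtual address space. */
#define BT_COVERAGE_SHIFT	20u
#define BT_COVERAGE		(UINT64_C(1) << BT_COVERAGE_SHIFT)

static_assert(BT_COVERAGE == 0x100000, "1MB per bounds table");

struct bd_range {
	uint64_t first;	/* first directory index fully covered */
	uint64_t count;	/* number of fully covered entries, may be 0 */
};

/* Directory entries whose 1MB window lies entirely within [start, end). */
struct bd_range fully_unmapped_tables(uint64_t start, uint64_t end)
{
	struct bd_range r = { 0, 0 };
	/* Round start up and end down to 1MB boundaries. */
	uint64_t first = (start + BT_COVERAGE - 1) >> BT_COVERAGE_SHIFT;
	uint64_t last = end >> BT_COVERAGE_SHIFT;	/* exclusive */

	if (end > start && last > first) {
		r.first = first;
		r.count = last - first;
	}
	return r;
}
```

For example, unmapping 0x100000..0x300000 fully covers two tables starting at directory index 1, while unmapping 0x180000..0x280000 fully covers none.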
196 | |||
197 | Adding new prctl commands | ||
198 | ------------------------- | ||
199 | |||
200 | Two new prctl commands are added to enable and disable MPX bounds table | ||
201 | management in the kernel. | ||
202 | |||
203 |     #define PR_MPX_ENABLE_MANAGEMENT 43 | ||
204 |     #define PR_MPX_DISABLE_MANAGEMENT 44 | ||
205 | |||
206 | The runtime library in userspace is responsible for allocating the bounds | ||
207 | directory, so the kernel has to use the XSAVE instruction to get the base | ||
208 | of the bounds directory from the BNDCFGU register. | ||
209 | |||
210 | But XSAVE is expected to be very expensive. To avoid paying that cost | ||
211 | repeatedly, we get the base of the bounds directory once, during | ||
212 | PR_MPX_ENABLE_MANAGEMENT command execution, and save it into struct | ||
213 | mm_struct for future use. | ||
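
A userspace runtime would issue these commands roughly as sketched below, after it has allocated the bounds directory and pointed BNDCFGU at it. The command values are the ones defined above; passing zero for the unused prctl() arguments and the wrapper names are assumptions of this sketch. On a kernel or CPU without MPX support the call is expected to fail, and the wrapper then returns the negated errno value.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Values from this document; newer headers may already define them. */
#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT	43
#endif
#ifndef PR_MPX_DISABLE_MANAGEMENT
#define PR_MPX_DISABLE_MANAGEMENT	44
#endif

/* Ask the kernel to start managing bounds tables for this process.
 * Returns 0 on success or a negative errno value on failure. */
int mpx_enable_management(void)
{
	if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0) == -1)
		return -errno;
	return 0;
}

/* Ask the kernel to stop managing bounds tables for this process.
 * Returns 0 on success or a negative errno value on failure. */
int mpx_disable_management(void)
{
	if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0) == -1)
		return -errno;
	return 0;
}
```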
214 | |||
215 | |||
216 | 4. Special rules | ||
217 | ================ | ||
218 | |||
219 | 1) If userspace is requesting help from the kernel to do the management | ||
220 | of bounds tables, it may not create or modify entries in the bounds directory. | ||
221 | |||
222 | Users can certainly allocate bounds tables themselves, forcibly point the | ||
223 | bounds directory at them through the XSAVE instruction, and then set the | ||
224 | valid bit of the bounds directory entry to make it valid. But the kernel | ||
225 | will decline to assist in managing these tables. | ||
226 | |||
227 | 2) Userspace may not take multiple bounds directory entries and point | ||
228 | them at the same bounds table. | ||
229 | |||
230 | This is allowed architecturally. For more information, see "Intel(R) | ||
231 | Architecture Instruction Set Extensions Programming Reference" (9.3.4). | ||
232 | |||
233 | However, if users did this, the kernel might be fooled into unmapping an | ||
234 | in-use bounds table since it does not recognize sharing. | ||