author    Dave Hansen <dave.hansen@linux.intel.com>  2014-11-14 10:18:29 -0500
committer Thomas Gleixner <tglx@linutronix.de>  2014-11-17 18:58:53 -0500
commit    fe3d197f84319d3bce379a9c0dc17b1f48ad358c (patch)
tree      68f479a165c25dcd3867648b5d923e6ec80316d4
parent    fcc7ffd67991b63029ca54925644753d534ddc5f (diff)
x86, mpx: On-demand kernel allocation of bounds tables
This is really the meat of the MPX patch set. If there is one patch to
review in the entire series, this is the one. There is a new ABI here,
and this kernel code also interacts with userspace memory in a
relatively unusual manner (small FAQ below).

Long Description:

This patch adds two prctl() commands that enable or disable kernel
management of bounds tables, including on-demand kernel allocation
(see the patch "on-demand kernel allocation of bounds tables") and
cleanup (see the patch "cleanup unused bound tables"). Applications do
not strictly need the kernel to manage bounds tables, and we expect
some applications to use MPX without taking advantage of this kernel
support. This means the kernel cannot simply infer whether an
application needs bounds table management from the MPX registers. The
prctl() is an explicit signal from userspace.

PR_MPX_ENABLE_MANAGEMENT is a signal from userspace requesting the
kernel's help in managing bounds tables. PR_MPX_DISABLE_MANAGEMENT is
the opposite, meaning that userspace no longer wants the kernel's
help. With PR_MPX_DISABLE_MANAGEMENT, the kernel will not allocate or
free bounds tables even if the CPU supports MPX.

PR_MPX_ENABLE_MANAGEMENT fetches the base address of the bounds
directory out of a userspace register (bndcfgu) and caches it in a new
field (->bd_addr) in the 'mm_struct'. PR_MPX_DISABLE_MANAGEMENT sets
"bd_addr" to an invalid address. Using this scheme, we can use
"bd_addr" to determine whether kernel management of bounds tables is
enabled.

Also, the only way to access the bndcfgu register is via an xsave,
which can be expensive. Caching "bd_addr" like this also helps reduce
the cost of those xsaves when doing table cleanup at munmap() time.
Unfortunately, we cannot apply this optimization at #BR fault time
because we need an xsave to get the value of BNDSTATUS.

==== Why does the hardware even have these Bounds Tables? ====

MPX has only 4 hardware registers for storing bounds information. If
MPX-enabled code needs more than these 4 registers, it needs to spill
them somewhere. It has two special instructions for this which allow
the bounds to be moved between the bounds registers and some new
"bounds tables". These instructions can raise #BR exceptions, which
are conceptually similar to page faults: the MPX hardware raises them
both on bounds violations and when the tables are not present. This
patch handles those #BR exceptions for not-present tables by carving
the space out of the normal process's address space (essentially
calling the new mmap() interface introduced earlier in this patch set)
and then pointing the bounds directory over to it.

The tables *need* to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register pointing to
memory is dereferenced. Any direct kernel involvement (like a syscall)
to access the tables would obviously destroy performance.

==== Why not do this in userspace? ====

This patch is obviously doing this allocation in the kernel. However,
MPX does not strictly *require* anything in the kernel. It can
theoretically be done completely from userspace. Here are a few ways
this *could* be done. I don't think any of them are practical in the
real world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so that
   we never have to allocate them?
A: As noted earlier, these tables are *HUGE*. An X-GB virtual area
   needs 4*X GB of virtual space, plus 2GB for the bounds directory.
   If we were to preallocate them for the 128TB of user virtual
   address space, we would need to reserve 512TB+2GB, which is larger
   than the entire virtual address space today. This means they cannot
   be reserved ahead of time. Also, a single process's pre-populated
   bounds directory consumes 2GB of virtual *AND* physical memory.
   IOW, it's completely infeasible to prepopulate bounds directories.

Q: Can we preallocate bounds table space at the same time memory is
   allocated which might contain pointers that might eventually need
   bounds tables?
A: This would work if we could hook the site of each and every memory
   allocation syscall. This can be done for small, constrained
   applications. But, it isn't practical at a larger scale since a
   given app has no way of controlling how all the parts of the app
   might allocate memory (think libraries). The kernel is really the
   only place to intercept these calls.

Q: Could a bounds fault be handed to userspace and the tables
   allocated there in a signal handler instead of in the kernel?
A: (thanks to tglx) mmap() is not on the list of safe async handler
   functions, and even if mmap() would work, it still requires locking
   or nasty tricks to keep track of the allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in the
kernel.

Based-on-patch-by: Qiaowei Ren <qiaowei.ren@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: linux-mm@kvack.org
Cc: linux-mips@linux-mips.org
Cc: Dave Hansen <dave@sr71.net>
Link: http://lkml.kernel.org/r/20141114151829.AD4310DE@viggo.jf.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
-rw-r--r--  arch/x86/include/asm/mmu_context.h |   7
-rw-r--r--  arch/x86/include/asm/mpx.h         |  41
-rw-r--r--  arch/x86/include/asm/processor.h   |  18
-rw-r--r--  arch/x86/kernel/setup.c            |   2
-rw-r--r--  arch/x86/kernel/traps.c            |  85
-rw-r--r--  arch/x86/mm/mpx.c                  | 223
-rw-r--r--  fs/exec.c                          |   2
-rw-r--r--  include/asm-generic/mmu_context.h  |   5
-rw-r--r--  include/linux/mm_types.h           |   4
-rw-r--r--  include/uapi/linux/prctl.h         |   6
-rw-r--r--  kernel/sys.c                       |  12

11 files changed, 399 insertions, 6 deletions
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 166af2a8e865..0b0ba91ff1ef 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -10,6 +10,7 @@
 #include <asm/pgalloc.h>
 #include <asm/tlbflush.h>
 #include <asm/paravirt.h>
+#include <asm/mpx.h>
 #ifndef CONFIG_PARAVIRT
 #include <asm-generic/mm_hooks.h>
 
@@ -102,4 +103,10 @@ do { \
 } while (0)
 #endif
 
+static inline void arch_bprm_mm_init(struct mm_struct *mm,
+				     struct vm_area_struct *vma)
+{
+	mpx_mm_init(mm);
+}
+
 #endif /* _ASM_X86_MMU_CONTEXT_H */
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index 35bcb1cddf40..05eecbf8a484 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -5,6 +5,14 @@
 #include <asm/ptrace.h>
 #include <asm/insn.h>
 
+/*
+ * NULL is theoretically a valid place to put the bounds
+ * directory, so point this at an invalid address.
+ */
+#define MPX_INVALID_BOUNDS_DIR	((void __user *)-1)
+#define MPX_BNDCFG_ENABLE_FLAG	0x1
+#define MPX_BD_ENTRY_VALID_FLAG	0x1
+
 #ifdef CONFIG_X86_64
 
 /* upper 28 bits [47:20] of the virtual address in 64-bit used to
@@ -18,6 +26,7 @@
 #define MPX_BT_ENTRY_OFFSET	17
 #define MPX_BT_ENTRY_SHIFT	5
 #define MPX_IGN_BITS		3
+#define MPX_BD_ENTRY_TAIL	3
 
 #else
 
@@ -26,23 +35,55 @@
 #define MPX_BT_ENTRY_OFFSET	10
 #define MPX_BT_ENTRY_SHIFT	4
 #define MPX_IGN_BITS		2
+#define MPX_BD_ENTRY_TAIL	2
 
 #endif
 
 #define MPX_BD_SIZE_BYTES (1UL<<(MPX_BD_ENTRY_OFFSET+MPX_BD_ENTRY_SHIFT))
 #define MPX_BT_SIZE_BYTES (1UL<<(MPX_BT_ENTRY_OFFSET+MPX_BT_ENTRY_SHIFT))
 
+#define MPX_BNDSTA_TAIL		2
+#define MPX_BNDCFG_TAIL		12
+#define MPX_BNDSTA_ADDR_MASK	(~((1UL<<MPX_BNDSTA_TAIL)-1))
+#define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
+#define MPX_BT_ADDR_MASK	(~((1UL<<MPX_BD_ENTRY_TAIL)-1))
+
+#define MPX_BNDCFG_ADDR_MASK	(~((1UL<<MPX_BNDCFG_TAIL)-1))
 #define MPX_BNDSTA_ERROR_CODE	0x3
 
 #ifdef CONFIG_X86_INTEL_MPX
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
				struct xsave_struct *xsave_buf);
+int mpx_handle_bd_fault(struct xsave_struct *xsave_buf);
+static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
+{
+	return (mm->bd_addr != MPX_INVALID_BOUNDS_DIR);
+}
+static inline void mpx_mm_init(struct mm_struct *mm)
+{
+	/*
+	 * NULL is theoretically a valid place to put the bounds
+	 * directory, so point this at an invalid address.
+	 */
+	mm->bd_addr = MPX_INVALID_BOUNDS_DIR;
+}
 #else
 static inline siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
					      struct xsave_struct *xsave_buf)
 {
	return NULL;
 }
+static inline int mpx_handle_bd_fault(struct xsave_struct *xsave_buf)
+{
+	return -EINVAL;
+}
+static inline int kernel_managing_mpx_tables(struct mm_struct *mm)
+{
+	return 0;
+}
+static inline void mpx_mm_init(struct mm_struct *mm)
+{
+}
 #endif /* CONFIG_X86_INTEL_MPX */
 
 #endif /* _ASM_X86_MPX_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 6571aaabacb9..9617a1716813 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -954,6 +954,24 @@ extern void start_thread(struct pt_regs *regs, unsigned long new_ip,
 extern int get_tsc_mode(unsigned long adr);
 extern int set_tsc_mode(unsigned int val);
 
+/* Register/unregister a process' MPX related resource */
+#define MPX_ENABLE_MANAGEMENT(tsk)	mpx_enable_management((tsk))
+#define MPX_DISABLE_MANAGEMENT(tsk)	mpx_disable_management((tsk))
+
+#ifdef CONFIG_X86_INTEL_MPX
+extern int mpx_enable_management(struct task_struct *tsk);
+extern int mpx_disable_management(struct task_struct *tsk);
+#else
+static inline int mpx_enable_management(struct task_struct *tsk)
+{
+	return -EINVAL;
+}
+static inline int mpx_disable_management(struct task_struct *tsk)
+{
+	return -EINVAL;
+}
+#endif /* CONFIG_X86_INTEL_MPX */
+
 extern u16 amd_get_nb_id(int cpu);
 
 static inline uint32_t hypervisor_cpuid_base(const char *sig, uint32_t leaves)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index ab08aa2276fb..214245d6b996 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -960,6 +960,8 @@ void __init setup_arch(char **cmdline_p)
 	init_mm.end_data = (unsigned long) _edata;
 	init_mm.brk = _brk_end;
 
+	mpx_mm_init(&init_mm);
+
 	code_resource.start = __pa_symbol(_text);
 	code_resource.end = __pa_symbol(_etext)-1;
 	data_resource.start = __pa_symbol(_etext);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 0d0e922fafc1..651d5d4f7558 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -60,6 +60,7 @@
 #include <asm/fixmap.h>
 #include <asm/mach_traps.h>
 #include <asm/alternative.h>
+#include <asm/mpx.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
@@ -228,7 +229,6 @@ dotraplinkage void do_##name(struct pt_regs *regs, long error_code) \
 
 DO_ERROR(X86_TRAP_DE, SIGFPE, "divide error", divide_error)
 DO_ERROR(X86_TRAP_OF, SIGSEGV, "overflow", overflow)
-DO_ERROR(X86_TRAP_BR, SIGSEGV, "bounds", bounds)
 DO_ERROR(X86_TRAP_UD, SIGILL, "invalid opcode", invalid_op)
 DO_ERROR(X86_TRAP_OLD_MF, SIGFPE, "coprocessor segment overrun",coprocessor_segment_overrun)
 DO_ERROR(X86_TRAP_TS, SIGSEGV, "invalid TSS", invalid_TSS)
@@ -278,6 +278,89 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 }
 #endif
 
+dotraplinkage void do_bounds(struct pt_regs *regs, long error_code)
+{
+	struct task_struct *tsk = current;
+	struct xsave_struct *xsave_buf;
+	enum ctx_state prev_state;
+	struct bndcsr *bndcsr;
+	siginfo_t *info;
+
+	prev_state = exception_enter();
+	if (notify_die(DIE_TRAP, "bounds", regs, error_code,
+			X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
+		goto exit;
+	conditional_sti(regs);
+
+	if (!user_mode(regs))
+		die("bounds", regs, error_code);
+
+	if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
+		/* The exception is not from Intel MPX */
+		goto exit_trap;
+	}
+
+	/*
+	 * We need to look at BNDSTATUS to resolve this exception.
+	 * It is not directly accessible, though, so we need to
+	 * do an xsave and then pull it out of the xsave buffer.
+	 */
+	fpu_save_init(&tsk->thread.fpu);
+	xsave_buf = &(tsk->thread.fpu.state->xsave);
+	bndcsr = get_xsave_addr(xsave_buf, XSTATE_BNDCSR);
+	if (!bndcsr)
+		goto exit_trap;
+
+	/*
+	 * The error code field of the BNDSTATUS register communicates
+	 * status information of a bound range exception #BR or an
+	 * operation involving a bound directory.
+	 */
+	switch (bndcsr->bndstatus & MPX_BNDSTA_ERROR_CODE) {
+	case 2:	/* Bound directory has invalid entry. */
+		if (mpx_handle_bd_fault(xsave_buf))
+			goto exit_trap;
+		break; /* Success, it was handled */
+	case 1: /* Bound violation. */
+		info = mpx_generate_siginfo(regs, xsave_buf);
+		if (IS_ERR(info)) {
+			/*
+			 * We failed to decode the MPX instruction.  Act
+			 * as if the exception was not caused by MPX.
+			 */
+			goto exit_trap;
+		}
+		/*
+		 * Success, we decoded the instruction and retrieved
+		 * an 'info' containing the address being accessed
+		 * which caused the exception.  This information
+		 * allows an application to possibly handle the
+		 * #BR exception itself.
+		 */
+		do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, info);
+		kfree(info);
+		break;
+	case 0: /* No exception caused by Intel MPX operations. */
+		goto exit_trap;
+	default:
+		die("bounds", regs, error_code);
+	}
+
+exit:
+	exception_exit(prev_state);
+	return;
+exit_trap:
+	/*
+	 * This path out is for all the cases where we could not
+	 * handle the exception in some way (like allocating a
+	 * table or telling userspace about it).  We will also end
+	 * up here if the kernel has MPX turned off at compile
+	 * time.
+	 */
+	do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
+	exception_exit(prev_state);
+}
+
 dotraplinkage void
 do_general_protection(struct pt_regs *regs, long error_code)
 {
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index 9009e094d686..96266375441e 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -10,8 +10,12 @@
 #include <linux/syscalls.h>
 #include <linux/sched/sysctl.h>
 
+#include <asm/i387.h>
+#include <asm/insn.h>
 #include <asm/mman.h>
 #include <asm/mpx.h>
+#include <asm/processor.h>
+#include <asm/fpu-internal.h>
 
 static const char *mpx_mapping_name(struct vm_area_struct *vma)
 {
@@ -266,10 +270,11 @@ bad_opcode:
 siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
				struct xsave_struct *xsave_buf)
 {
+	struct bndreg *bndregs, *bndreg;
+	siginfo_t *info = NULL;
 	struct insn insn;
 	uint8_t bndregno;
 	int err;
-	siginfo_t *info;
 
 	err = mpx_insn_decode(&insn, regs);
 	if (err)
@@ -285,6 +290,15 @@ siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
 		err = -EINVAL;
 		goto err_out;
 	}
+	/* get the bndregs _area_ of the xsave structure */
+	bndregs = get_xsave_addr(xsave_buf, XSTATE_BNDREGS);
+	if (!bndregs) {
+		err = -EINVAL;
+		goto err_out;
+	}
+	/* now go select the individual register in the set of 4 */
+	bndreg = &bndregs[bndregno];
+
 	info = kzalloc(sizeof(*info), GFP_KERNEL);
 	if (!info) {
 		err = -ENOMEM;
@@ -300,10 +314,8 @@ siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
	 * complains when casting from integers to different-size
	 * pointers.
	 */
-	info->si_lower = (void __user *)(unsigned long)
-		(xsave_buf->bndreg[bndregno].lower_bound);
-	info->si_upper = (void __user *)(unsigned long)
-		(~xsave_buf->bndreg[bndregno].upper_bound);
+	info->si_lower = (void __user *)(unsigned long)bndreg->lower_bound;
+	info->si_upper = (void __user *)(unsigned long)~bndreg->upper_bound;
 	info->si_addr_lsb = 0;
 	info->si_signo = SIGSEGV;
 	info->si_errno = 0;
@@ -319,5 +331,206 @@ siginfo_t *mpx_generate_siginfo(struct pt_regs *regs,
 	}
 	return info;
 err_out:
+	/* info might be NULL, but kfree() handles that */
+	kfree(info);
 	return ERR_PTR(err);
 }
+
+static __user void *task_get_bounds_dir(struct task_struct *tsk)
+{
+	struct bndcsr *bndcsr;
+
+	if (!cpu_feature_enabled(X86_FEATURE_MPX))
+		return MPX_INVALID_BOUNDS_DIR;
+
+	/*
+	 * The bounds directory pointer is stored in a register
+	 * only accessible if we first do an xsave.
+	 */
+	fpu_save_init(&tsk->thread.fpu);
+	bndcsr = get_xsave_addr(&tsk->thread.fpu.state->xsave, XSTATE_BNDCSR);
+	if (!bndcsr)
+		return MPX_INVALID_BOUNDS_DIR;
+
+	/*
+	 * Make sure the register looks valid by checking the
+	 * enable bit.
+	 */
+	if (!(bndcsr->bndcfgu & MPX_BNDCFG_ENABLE_FLAG))
+		return MPX_INVALID_BOUNDS_DIR;
+
+	/*
+	 * Lastly, mask off the low bits used for configuration
+	 * flags, and return the address of the bounds table.
+	 */
+	return (void __user *)(unsigned long)
+		(bndcsr->bndcfgu & MPX_BNDCFG_ADDR_MASK);
+}
+
+int mpx_enable_management(struct task_struct *tsk)
+{
+	void __user *bd_base = MPX_INVALID_BOUNDS_DIR;
+	struct mm_struct *mm = tsk->mm;
+	int ret = 0;
+
+	/*
+	 * The runtime in userspace will be responsible for allocation of
+	 * the bounds directory. Then, it will save the base of the bounds
+	 * directory into the XSAVE/XRSTOR save area and enable MPX through
+	 * the XRSTOR instruction.
+	 *
+	 * fpu_xsave() is expected to be very expensive. Storing the bounds
+	 * directory here means that we do not have to do xsave in the unmap
+	 * path; we can just use mm->bd_addr instead.
+	 */
+	bd_base = task_get_bounds_dir(tsk);
+	down_write(&mm->mmap_sem);
+	mm->bd_addr = bd_base;
+	if (mm->bd_addr == MPX_INVALID_BOUNDS_DIR)
+		ret = -ENXIO;
+
+	up_write(&mm->mmap_sem);
+	return ret;
+}
+
+int mpx_disable_management(struct task_struct *tsk)
+{
+	struct mm_struct *mm = current->mm;
+
+	if (!cpu_feature_enabled(X86_FEATURE_MPX))
+		return -ENXIO;
+
+	down_write(&mm->mmap_sem);
+	mm->bd_addr = MPX_INVALID_BOUNDS_DIR;
+	up_write(&mm->mmap_sem);
+	return 0;
+}
+
+/*
+ * With 32-bit mode, MPX_BD_SIZE_BYTES is 4MB, and the size of each
+ * bounds table is 16KB. With 64-bit mode, MPX_BD_SIZE_BYTES is 2GB,
+ * and the size of each bounds table is 4MB.
+ */
+static int allocate_bt(long __user *bd_entry)
+{
+	unsigned long expected_old_val = 0;
+	unsigned long actual_old_val = 0;
+	unsigned long bt_addr;
+	int ret = 0;
+
+	/*
+	 * Carve the virtual space out of userspace for the new
+	 * bounds table:
+	 */
+	bt_addr = mpx_mmap(MPX_BT_SIZE_BYTES);
+	if (IS_ERR((void *)bt_addr))
+		return PTR_ERR((void *)bt_addr);
+	/*
+	 * Set the valid flag (kinda like _PAGE_PRESENT in a pte)
+	 */
+	bt_addr = bt_addr | MPX_BD_ENTRY_VALID_FLAG;
+
+	/*
+	 * Go poke the address of the new bounds table in to the
+	 * bounds directory entry out in userspace memory.  Note:
+	 * we may race with another CPU instantiating the same table.
+	 * In that case the cmpxchg will see an unexpected
+	 * 'actual_old_val'.
+	 *
+	 * This can fault, but that's OK because we do not hold
+	 * mmap_sem at this point, unlike some of the other parts
+	 * of the MPX code that have to pagefault_disable().
+	 */
+	ret = user_atomic_cmpxchg_inatomic(&actual_old_val, bd_entry,
+					   expected_old_val, bt_addr);
+	if (ret)
+		goto out_unmap;
+
+	/*
+	 * The user_atomic_cmpxchg_inatomic() will only return nonzero
+	 * for faults, *not* if the cmpxchg itself fails.  Now we must
+	 * verify that the cmpxchg itself completed successfully.
+	 */
+	/*
+	 * We expected an empty 'expected_old_val', but instead found
+	 * an apparently valid entry.  Assume we raced with another
+	 * thread to instantiate this table and declare success.
+	 */
+	if (actual_old_val & MPX_BD_ENTRY_VALID_FLAG) {
+		ret = 0;
+		goto out_unmap;
+	}
+	/*
+	 * We found a non-empty bd_entry but it did not have the
+	 * VALID_FLAG set.  Return an error which will result in
+	 * a SEGV since this probably means that somebody scribbled
+	 * some invalid data in to a bounds table.
+	 */
+	if (expected_old_val != actual_old_val) {
+		ret = -EINVAL;
+		goto out_unmap;
+	}
+	return 0;
+out_unmap:
+	vm_munmap(bt_addr & MPX_BT_ADDR_MASK, MPX_BT_SIZE_BYTES);
+	return ret;
+}
+
+/*
+ * When a BNDSTX instruction attempts to save bounds to a bounds
+ * table, it will first attempt to look up the table in the
+ * first-level bounds directory. If it does not find a table in
+ * the directory, a #BR is generated and we get here in order to
+ * allocate a new table.
+ *
+ * With 32-bit mode, the size of the BD is 4MB, and the size of each
+ * bounds table is 16KB. With 64-bit mode, the size of the BD is 2GB,
+ * and the size of each bounds table is 4MB.
+ */
+static int do_mpx_bt_fault(struct xsave_struct *xsave_buf)
+{
+	unsigned long bd_entry, bd_base;
+	struct bndcsr *bndcsr;
+
+	bndcsr = get_xsave_addr(xsave_buf, XSTATE_BNDCSR);
+	if (!bndcsr)
+		return -EINVAL;
+	/*
+	 * Mask off the preserve and enable bits
+	 */
+	bd_base = bndcsr->bndcfgu & MPX_BNDCFG_ADDR_MASK;
+	/*
+	 * The hardware provides the address of the missing or invalid
+	 * entry via BNDSTATUS, so we don't have to go look it up.
+	 */
+	bd_entry = bndcsr->bndstatus & MPX_BNDSTA_ADDR_MASK;
+	/*
+	 * Make sure the directory entry is within where we think
+	 * the directory is.
+	 */
+	if ((bd_entry < bd_base) ||
+	    (bd_entry >= bd_base + MPX_BD_SIZE_BYTES))
+		return -EINVAL;
+
+	return allocate_bt((long __user *)bd_entry);
+}
+
+int mpx_handle_bd_fault(struct xsave_struct *xsave_buf)
+{
+	/*
+	 * Userspace never asked us to manage the bounds tables,
+	 * so refuse to help.
+	 */
+	if (!kernel_managing_mpx_tables(current->mm))
+		return -EINVAL;
+
+	if (do_mpx_bt_fault(xsave_buf)) {
+		force_sig(SIGSEGV, current);
+		/*
+		 * The force_sig() is essentially "handling" this
+		 * exception, so we do not pass up the error
+		 * from do_mpx_bt_fault().
+		 */
+	}
+	return 0;
+}
diff --git a/fs/exec.c b/fs/exec.c
index 7302b75a9820..65d4f5c70ef4 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -60,6 +60,7 @@
 #include <asm/uaccess.h>
 #include <asm/mmu_context.h>
 #include <asm/tlb.h>
+#include <asm/mpx.h>
 
 #include <trace/events/task.h>
 #include "internal.h"
@@ -277,6 +278,7 @@ static int __bprm_mm_init(struct linux_binprm *bprm)
 		goto err;
 
 	mm->stack_vm = mm->total_vm = 1;
+	arch_bprm_mm_init(mm, vma);
 	up_write(&mm->mmap_sem);
 	bprm->p = vma->vm_end - sizeof(void *);
 	return 0;
diff --git a/include/asm-generic/mmu_context.h b/include/asm-generic/mmu_context.h
index a7eec910ba6c..1f2a8f9c9264 100644
--- a/include/asm-generic/mmu_context.h
+++ b/include/asm-generic/mmu_context.h
@@ -42,4 +42,9 @@ static inline void activate_mm(struct mm_struct *prev_mm,
 {
 }
 
+static inline void arch_bprm_mm_init(struct mm_struct *mm,
+				     struct vm_area_struct *vma)
+{
+}
+
 #endif /* __ASM_GENERIC_MMU_CONTEXT_H */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6e0b286649f1..004e9d17b47e 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -454,6 +454,10 @@ struct mm_struct {
 	bool tlb_flush_pending;
 #endif
 	struct uprobes_state uprobes_state;
+#ifdef CONFIG_X86_INTEL_MPX
+	/* address of the bounds directory */
+	void __user *bd_addr;
+#endif
 };
 
 static inline void mm_init_cpumask(struct mm_struct *mm)
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index 513df75d0fc9..89f63503f903 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -179,4 +179,10 @@ struct prctl_mm_map {
 #define PR_SET_THP_DISABLE	41
 #define PR_GET_THP_DISABLE	42
 
+/*
+ * Tell the kernel to start/stop helping userspace manage bounds tables.
+ */
+#define PR_MPX_ENABLE_MANAGEMENT	43
+#define PR_MPX_DISABLE_MANAGEMENT	44
+
 #endif /* _LINUX_PRCTL_H */
diff --git a/kernel/sys.c b/kernel/sys.c
index 1eaa2f0b0246..a8c9f5a7dda6 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -91,6 +91,12 @@
 #ifndef SET_TSC_CTL
 # define SET_TSC_CTL(a)		(-EINVAL)
 #endif
+#ifndef MPX_ENABLE_MANAGEMENT
+# define MPX_ENABLE_MANAGEMENT(a)	(-EINVAL)
+#endif
+#ifndef MPX_DISABLE_MANAGEMENT
+# define MPX_DISABLE_MANAGEMENT(a)	(-EINVAL)
+#endif
 
 /*
  * this is where the system-wide overflow UID and GID are defined, for
@@ -2203,6 +2209,12 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
 		me->mm->def_flags &= ~VM_NOHUGEPAGE;
 		up_write(&me->mm->mmap_sem);
 		break;
+	case PR_MPX_ENABLE_MANAGEMENT:
+		error = MPX_ENABLE_MANAGEMENT(me);
+		break;
+	case PR_MPX_DISABLE_MANAGEMENT:
+		error = MPX_DISABLE_MANAGEMENT(me);
+		break;
 	default:
 		error = -EINVAL;
 		break;