x86, entry: Switch stacks on a paranoid entry from userspace

This causes all non-NMI, non-double-fault kernel entries from userspace to run on the normal kernel stack. Double-fault is exempt to minimize confusion if we double-fault directly from userspace due to a bad kernel stack.

This is, surprisingly, simpler and shorter than the current code. It removes the IMO rather frightening paranoid_userspace path, and it makes sync_regs much simpler.

There is no risk of stack overflow due to this change -- the kernel stack that we switch to is empty.

This will also enable us to create non-atomic sections within machine checks from userspace, which will simplify memory failure handling. It will also allow the upcoming fsgsbase code to be simplified, because it doesn't need to worry about usergs when scheduling in paranoid_exit, as that code no longer exists.

Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Tony Luck <tony.luck@intel.com>
Acked-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
author: Andy Lutomirski <luto@amacapital.net> 2014-11-11 15:49:41 -0500
committer: Andy Lutomirski <luto@amacapital.net> 2015-01-02 13:22:45 -0500
commit: 48e08d0fb265b007ebbb29a72297ff7e40938969 (patch)
tree: 424a8207cc53c2b0dfbd9fb12bee15952ce822ae /Documentation/x86
parent: 734d16801349fbe951d2f780191d32c5b8a892d1 (diff)
2 files changed, 17 insertions, 9 deletions
diff --git a/Documentation/x86/entry_64.txt b/Documentation/x86/entry_64.txt
index 4a1c5c2dc5a9..9132b86176a3 100644
--- a/Documentation/x86/entry_64.txt
+++ b/Documentation/x86/entry_64.txt
@@ -78,9 +78,6 @@ The expensive (paranoid) way is to read back the MSR_GS_BASE value
        xorl %ebx,%ebx
 1:      ret
-and the whole paranoid non-paranoid macro complexity is about whether
-to suffer that RDMSR cost.
 If we are at an interrupt or user-trap/gate-alike boundary then we can
 use the faster check: the stack will be a reliable indicator of
 whether SWAPGS was already done: if we see that we are a secondary
@@ -93,6 +90,15 @@ which might have triggered right after a normal entry wrote CS to the
 stack but before we executed SWAPGS, then the only safe way to check
 for GS is the slower method: the RDMSR.
-So we try only to mark those entry methods 'paranoid' that absolutely
-need the more expensive check for the GS base - and we generate all
-'normal' entry points with the regular (faster) entry macros.
+Therefore, super-atomic entries (except NMI, which is handled separately)
+must use idtentry with paranoid=1 to handle gsbase correctly.  This
+triggers three main behavior changes:
+
+ - Interrupt entry will use the slower gsbase check.
+ - Interrupt entry from user mode will switch off the IST stack.
+ - Interrupt exit to kernel mode will not attempt to reschedule.
+
+We try to only use IST entries and the paranoid entry code for vectors
+that absolutely need the more expensive check for the GS base - and we
+generate all 'normal' entry points with the regular (faster) paranoid=0
+variant.
diff --git a/Documentation/x86/x86_64/kernel-stacks b/Documentation/x86/x86_64/kernel-stacks
index a01eec5d1d0b..e3c8a49d1a2f 100644
--- a/Documentation/x86/x86_64/kernel-stacks
+++ b/Documentation/x86/x86_64/kernel-stacks
@@ -40,9 +40,11 @@ An IST is selected by a non-zero value in the IST field of an
 interrupt-gate descriptor.  When an interrupt occurs and the hardware
 loads such a descriptor, the hardware automatically sets the new stack
 pointer based on the IST value, then invokes the interrupt handler.  If
-software wants to allow nested IST interrupts then the handler must
-adjust the IST values on entry to and exit from the interrupt handler.
-(This is occasionally done, e.g. for debug exceptions.)
+the interrupt came from user mode, then the interrupt handler prologue
+will switch back to the per-thread stack.  If software wants to allow
+nested IST interrupts then the handler must adjust the IST values on
+entry to and exit from the interrupt handler.  (This is occasionally
+done, e.g. for debug exceptions.)
 Events with different IST codes (i.e. with different stacks) can be
 nested.  For example, a debug interrupt can safely be interrupted by an
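
Illustration (not part of the patch): the documentation changes above describe three behaviors of a paranoid=1 entry. The sketch below is a plain userspace C model of that decision logic, not kernel code. The struct and function names (entry_state, entry_plan, plan_paranoid_entry) are hypothetical and exist only for this example. It assumes that the low two bits of the saved CS give the privilege level of the interrupted code, that a kernel GS base has its top bit set (which is what the sign test after RDMSR relies on), and that an entry from user mode is treated like a normal entry once it has left the IST stack.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of what an entry stub can observe (illustration only). */
struct entry_state {
    uint16_t saved_cs;  /* CS pushed by hardware; low 2 bits = privilege level */
    uint64_t gs_base;   /* value an RDMSR of MSR_GS_BASE would return */
};

/* Hypothetical summary of what the modeled entry/exit path decides to do. */
struct entry_plan {
    bool used_rdmsr;             /* slow gsbase check was needed */
    bool do_swapgs;              /* SWAPGS is executed on entry */
    bool switch_to_thread_stack; /* leave the IST stack for the per-thread stack */
    bool exit_may_reschedule;    /* exit path may attempt to reschedule */
};

/* Fast check: a non-zero privilege level in the saved CS means user mode. */
static bool from_user_mode(const struct entry_state *s)
{
    return (s->saved_cs & 3) != 0;
}

/* Slow check: a kernel GS base is a kernel address, so bit 63 is set. */
static bool gs_base_is_kernel(uint64_t gs_base)
{
    return (gs_base >> 63) != 0;
}

/* Model of a paranoid=1 entry, following the three bullets in entry_64.txt. */
static struct entry_plan plan_paranoid_entry(const struct entry_state *s)
{
    struct entry_plan p = { false, false, false, false };

    if (from_user_mode(s)) {
        /* Assumption: user-mode entries behave like normal entries. */
        p.do_swapgs = true;
        p.switch_to_thread_stack = true;
        p.exit_may_reschedule = true;
    } else {
        /* Kernel-mode entry: CS cannot tell us whether SWAPGS already ran. */
        p.used_rdmsr = true;
        p.do_swapgs = !gs_base_is_kernel(s->gs_base);
        /* Stay on the IST stack; exit to kernel mode never reschedules. */
    }
    return p;
}
```

In this model, an entry with a user-mode CS switches stacks without reading the MSR, while an entry with a kernel CS and a user-looking GS base (the window between writing CS to the stack and executing SWAPGS) is the one case where the slow check requests SWAPGS.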