author     Steven Rostedt <srostedt@redhat.com>    2011-12-08 12:36:23 -0500
committer  Steven Rostedt <rostedt@goodmis.org>    2011-12-21 15:38:54 -0500
commit     3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0 (patch)
tree       204e9e097fee7450c268c94c32a3338766a53401 /arch/x86
parent     1fd466efc88c48f50e5ee29f4dbb4e210a889172 (diff)
x86: Add workaround to NMI iret woes
In x86, when an NMI goes off, the CPU goes into an NMI context that prevents other NMIs from triggering on that CPU. If another NMI is supposed to trigger, it has to wait until the previous NMI leaves NMI context. At that time, the next NMI can trigger (note, only one more NMI will trigger, as only one can be latched at a time).

The way x86 gets out of NMI context is by calling iret. The problem is that this causes trouble if the NMI handler either triggers an exception or a breakpoint. Both the exception and the breakpoint handlers finish with an iret. If this happens while in NMI context, the CPU leaves NMI context and a new NMI may come in. As NMI handlers are not made to be re-entrant, this can cause havoc with the system; not to mention that the nested NMI will write all over the previous NMI's stack.

Linus Torvalds proposed the following workaround to this problem:

  https://lkml.org/lkml/2010/7/14/264

 "In fact, I wonder if we couldn't just do a software NMI disable
  instead? Have a per-cpu variable (in the _core_ percpu areas that
  get allocated statically) that points to the NMI stack frame, and
  just make the NMI code itself do something like

   NMI entry:
   - load percpu NMI stack frame pointer
   - if non-zero we know we're nested, and should ignore this NMI:
      - we're returning to kernel mode, so return immediately by using
        "popf/ret", which also keeps NMI's disabled in the hardware
        until the "real" NMI iret happens.
      - before the popf/iret, use the NMI stack pointer to make the
        NMI return stack be invalid and cause a fault
   - set the NMI stack pointer to the current stack pointer

   NMI exit (not the above "immediate exit because we nested"):
     clear the percpu NMI stack pointer
     Just do the iret.

  Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
  we'll take a fault, and that re-does our "delayed" NMI - and NMI's
  will stay masked.

  And if we didn't have a nested NMI, that iret will now unmask NMI's,
  and everything is happy."
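For illustration only, here is a rough user-space C sketch of the control flow quoted above. All names are invented, and this models the quoted proposal rather than the assembly this patch actually adds:

/*
 * Rough user-space rendering of the quoted proposal, NOT this patch.
 * nmi_frame_ptr stands in for the per-cpu variable Linus describes.
 */
#include <stdio.h>

static void *nmi_frame_ptr;		/* per-cpu in the real proposal */
static long outer_frame, nested_frame;

static void nmi_entry(void *this_frame);

static void handle_nmi(void)
{
	static int injected;

	printf("NMI handler body runs\n");
	if (!injected) {
		injected = 1;
		/* Pretend a breakpoint's iret re-enabled NMIs and a
		 * second NMI arrived while the first is still running. */
		nmi_entry(&nested_frame);
	}
}

static void nmi_entry(void *this_frame)
{
	if (nmi_frame_ptr) {
		/*
		 * Nested NMI: tag the outer NMI's return frame so its
		 * iret faults and re-runs the NMI, then leave via
		 * popf/ret so hardware keeps NMIs masked until the
		 * real iret.
		 */
		printf("nested NMI: flag frame %p, return early\n",
		       nmi_frame_ptr);
		return;
	}

	nmi_frame_ptr = this_frame;	/* NMI entry: set the software flag */
	handle_nmi();
	nmi_frame_ptr = NULL;		/* NMI exit: clear it, then one atomic iret */
}

int main(void)
{
	nmi_entry(&outer_frame);
	return 0;
}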
I first tried to follow this advice, but as I started implementing the code, a few gotchas showed up.

One is accessing per-cpu variables in the NMI handler. The problem is that per-cpu variables use the %gs register to get the variable for the given CPU. But as the NMI may happen in userspace, we must first perform a SWAPGS to get to it. The NMI handler already does this later in the code, but that's too late, as by then we have saved off all the registers, and we don't want to do that for a disabled NMI. Peter Zijlstra suggested keeping all variables on the stack. This simplifies things greatly and has the added benefit of cache locality.

Two, faulting on the iret. I really wanted to make this work, but it was becoming very hacky, and I never got it to be stable. The iret already had a fault handler for userspace faulting with bad segment registers, and getting the NMI to trigger a fault and detect it was very tricky. But for strange reasons, the system would usually take a double fault and crash. I never figured out why and decided to go with a simple "jmp" approach. The new approach I took also simplified things.

Finally, the last problem with Linus's approach was to have the nested NMI handler do a ret instead of an iret to give the first NMI NMI context again. The problem is that ret is much more limited than an iret. I couldn't figure out how to get the stack back to where it belonged. I could have copied the current stack and pushed the return onto it, but my fear here is that there may be some place that writes data below the stack pointer. I know that is not something code should depend on, but I don't want to chance it. I may add this feature later, but for now, an NMI handler that loses NMI context will not get it back.

Here's what is done:

When an NMI comes in, the HW pushes the interrupt stack frame onto the per-cpu NMI stack that is selected by the IST.

A special location on the NMI stack holds a variable that is set when the first NMI handler runs. If this variable is set, then we know that this is a nested NMI and we process the nested-NMI code.

There is still a race when this variable is cleared and an NMI comes in just before the first NMI does the return. For this case, if the variable is cleared, we also check whether the interrupted stack is the NMI stack. If it is, then we process the nested-NMI code.

Why the two tests and not just test the interrupted stack? Suppose the first NMI hits a breakpoint and loses NMI context, then hits another breakpoint, and while processing that breakpoint we get a nested NMI. When processing a breakpoint, the stack changes to the breakpoint stack, so if another NMI comes in here we can't rely on the interrupted stack being the NMI stack.

If the variable is not set and the interrupted task's stack is not the NMI stack, then we know this is the first NMI and we can process things normally. But in order to do so, we need to do a few things first:

 1) Set the stack variable that tells us that we are in an NMI handler.

 2) Make two copies of the interrupt stack frame.
    One copy is used to return on iret.
    The other is used to restore the first one if we have a nested NMI.

This is what the stack will look like:

	+-------------------------+
	| original SS             |
	| original Return RSP     |
	| original RFLAGS         |
	| original CS             |
	| original RIP            |
	+-------------------------+
	| temp storage for rdx    |
	+-------------------------+
	| NMI executing variable  |
	+-------------------------+
	| Saved SS                |
	| Saved Return RSP        |
	| Saved RFLAGS            |
	| Saved CS                |
	| Saved RIP               |
	+-------------------------+
	| copied SS               |
	| copied Return RSP       |
	| copied RFLAGS           |
	| copied CS               |
	| copied RIP              |
	+-------------------------+
	| pt_regs                 |
	+-------------------------+

The original stack frame contains what the HW put in when we entered the NMI. We store %rdx as a temp variable to use. Both the original HW stack frame and this %rdx storage will be clobbered by nested NMIs, so we cannot rely on them later in the first NMI handler.

The next item is the special stack variable that is set when we execute the rest of the NMI handler.

Then we have two copies of the interrupt stack. The second copy is modified by any nested NMIs to let the first NMI know that we triggered a second NMI (latched) and that we should repeat the NMI handler.

If the first NMI hits an exception or breakpoint that takes it out of NMI context, and a second NMI comes in before the first one finishes, the second NMI will update the copied interrupt stack to point to a fixup location that triggers another NMI. When the first NMI calls iret, it will instead jump to the fixup location. This fixup location will copy the saved interrupt stack back into the copy and execute the NMI handler again.

Note, the nested NMI knows enough to check whether it preempted a previous NMI handler while it is in the fixup location. If it has, it will not modify the copied interrupt stack and will just leave, as if nothing happened. As the NMI handler is about to execute again, there's no reason to latch now.
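A minimal user-space sketch of the two tests just described (the on-stack "NMI executing" variable plus the NMI-stack range check that the patch's test_in_nmi macro performs); the stack size, addresses, and function names are assumptions for illustration:

/*
 * Sketch of the nested-NMI decision described above.  EXCEPTION_STKSZ
 * and the sample addresses are illustrative assumptions only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EXCEPTION_STKSZ 4096	/* assumed NMI/exception stack size */

/* Is sp within [top - EXCEPTION_STKSZ, top]?  Mirrors test_in_nmi. */
static bool on_nmi_stack(uint64_t sp, uint64_t nmi_stack_top)
{
	return sp <= nmi_stack_top && sp >= nmi_stack_top - EXCEPTION_STKSZ;
}

/* The decision made at NMI entry, as described in the changelog. */
static bool is_nested_nmi(uint64_t nmi_executing, uint64_t interrupted_rsp,
			  uint64_t nmi_stack_top)
{
	if (nmi_executing)	/* the first NMI set this on its stack */
		return true;
	/*
	 * Covers the race where the first NMI already cleared the
	 * variable but we still interrupted it on the NMI stack.
	 */
	return on_nmi_stack(interrupted_rsp, nmi_stack_top);
}

int main(void)
{
	uint64_t top = 0xffff880000008000ULL;	/* made-up NMI stack top */

	printf("%d\n", is_nested_nmi(1, 0, top));		/* 1: variable set */
	printf("%d\n", is_nested_nmi(0, top - 64, top));	/* 1: on NMI stack */
	printf("%d\n", is_nested_nmi(0, 0x00007ffffffff000ULL, top)); /* 0: first NMI */
	return 0;
}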
To test all this, I forced the NMI handler to call iret and take itself out of NMI context. I also added assembly code to write to the serial port to make sure that it hits the nested path as well as the fixup path. Everything seems to be working fine.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Diffstat (limited to 'arch/x86')
-rw-r--r--   arch/x86/kernel/entry_64.S | 177
1 file changed, 177 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index d1d5434e7f6a..b62aa298df7f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1475,11 +1475,166 @@ ENTRY(error_exit)
 	CFI_ENDPROC
 END(error_exit)
 
+/*
+ * Test if a given stack is an NMI stack or not.
+ */
+	.macro test_in_nmi reg stack nmi_ret normal_ret
+	cmpq %\reg, \stack
+	ja \normal_ret
+	subq $EXCEPTION_STKSZ, %\reg
+	cmpq %\reg, \stack
+	jb \normal_ret
+	jmp \nmi_ret
+	.endm
 
 	/* runs on exception stack */
 ENTRY(nmi)
 	INTR_FRAME
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
+	/*
+	 * We allow breakpoints in NMIs. If a breakpoint occurs, then
+	 * the iretq it performs will take us out of NMI context.
+	 * This means that we can have nested NMIs where the next
+	 * NMI is using the top of the stack of the previous NMI. We
+	 * can't let it execute because the nested NMI will corrupt the
+	 * stack of the previous NMI. NMI handlers are not re-entrant
+	 * anyway.
+	 *
+	 * To handle this case we do the following:
+	 *  Check a special location on the stack that contains
+	 *  a variable that is set when NMIs are executing.
+	 *  The interrupted task's stack is also checked to see if it
+	 *  is an NMI stack.
+	 *  If the variable is not set and the stack is not the NMI
+	 *  stack then:
+	 *    o Set the special variable on the stack
+	 *    o Copy the interrupt frame into a "saved" location on the stack
+	 *    o Copy the interrupt frame into a "copy" location on the stack
+	 *    o Continue processing the NMI
+	 *  If the variable is set or the previous stack is the NMI stack:
+	 *    o Modify the "copy" location to jump to repeat_nmi
+	 *    o return back to the first NMI
+	 *
+	 * Now on exit of the first NMI, we first clear the stack variable.
+	 * The NMI stack will tell any nested NMIs at that point that it is
+	 * nested. Then we pop the stack normally with iret, and if there was
+	 * a nested NMI that updated the copy interrupt stack frame, a
+	 * jump will be made to the repeat_nmi code that will handle the second
+	 * NMI.
+	 */
+
+	/* Use %rdx as our temp variable throughout */
+	pushq_cfi %rdx
+
+	/*
+	 * Check the special variable on the stack to see if NMIs are
+	 * executing.
+	 */
+	cmp $1, -8(%rsp)
+	je nested_nmi
+
+	/*
+	 * Now test if the previous stack was an NMI stack.
+	 * We need the double check. We check the NMI stack to satisfy the
+	 * race when the first NMI clears the variable before returning.
+	 * We check the variable because the first NMI could be in a
+	 * breakpoint routine using a breakpoint stack.
+	 */
+	lea 6*8(%rsp), %rdx
+	test_in_nmi rdx, 4*8(%rsp), nested_nmi, first_nmi
+
+nested_nmi:
+	/*
+	 * Do nothing if we interrupted the fixup in repeat_nmi.
+	 * It's about to repeat the NMI handler, so we are fine
+	 * with ignoring this one.
+	 */
+	movq $repeat_nmi, %rdx
+	cmpq 8(%rsp), %rdx
+	ja 1f
+	movq $end_repeat_nmi, %rdx
+	cmpq 8(%rsp), %rdx
+	ja nested_nmi_out
+
+1:
+	/* Set up the interrupted NMI's stack to jump to repeat_nmi */
+	leaq -6*8(%rsp), %rdx
+	movq %rdx, %rsp
+	CFI_ADJUST_CFA_OFFSET 6*8
+	pushq_cfi $__KERNEL_DS
+	pushq_cfi %rdx
+	pushfq_cfi
+	pushq_cfi $__KERNEL_CS
+	pushq_cfi $repeat_nmi
+
+	/* Put stack back */
+	addq $(11*8), %rsp
+	CFI_ADJUST_CFA_OFFSET -11*8
+
+nested_nmi_out:
+	popq_cfi %rdx
+
+	/* No need to check faults here */
+	INTERRUPT_RETURN
+
+first_nmi:
+	/*
+	 * Because nested NMIs will use the pushed location that we
+	 * stored in rdx, we must keep that space available.
+	 * Here's what our stack frame will look like:
+	 * +-------------------------+
+	 * | original SS             |
+	 * | original Return RSP     |
+	 * | original RFLAGS         |
+	 * | original CS             |
+	 * | original RIP            |
+	 * +-------------------------+
+	 * | temp storage for rdx    |
+	 * +-------------------------+
+	 * | NMI executing variable  |
+	 * +-------------------------+
+	 * | Saved SS                |
+	 * | Saved Return RSP        |
+	 * | Saved RFLAGS            |
+	 * | Saved CS                |
+	 * | Saved RIP               |
+	 * +-------------------------+
+	 * | copied SS               |
+	 * | copied Return RSP       |
+	 * | copied RFLAGS           |
+	 * | copied CS               |
+	 * | copied RIP              |
+	 * +-------------------------+
+	 * | pt_regs                 |
+	 * +-------------------------+
+	 *
+	 * The saved RIP is used to fix up the copied RIP that a nested
+	 * NMI may zero out. The original stack frame and the temp storage
+	 * are also used by nested NMIs and cannot be trusted on exit.
+	 */
+	/* Set the NMI executing variable on the stack. */
+	pushq_cfi $1
+
+	/* Copy the stack frame to the Saved frame */
+	.rept 5
+	pushq_cfi 6*8(%rsp)
+	.endr
+
+	/* Make another copy, this one may be modified by nested NMIs */
+	.rept 5
+	pushq_cfi 4*8(%rsp)
+	.endr
+
+	/* Do not pop rdx, nested NMIs will corrupt it */
+	movq 11*8(%rsp), %rdx
+
+	/*
+	 * Everything below this point can be preempted by a nested
+	 * NMI if the first NMI took an exception. Repeated NMIs
+	 * caused by an exception and nested NMI will start here, and
+	 * can still be preempted by another NMI.
+	 */
+restart_nmi:
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
@@ -1502,10 +1657,32 @@ nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
 nmi_restore:
 	RESTORE_ALL 8
+	/* Clear the NMI executing stack variable */
+	movq $0, 10*8(%rsp)
 	jmp irq_return
 	CFI_ENDPROC
 END(nmi)
 
+	/*
+	 * If an NMI hit an iret because of an exception or breakpoint,
+	 * it can lose its NMI context, and a nested NMI may come in.
+	 * In that case, the nested NMI will change the preempted NMI's
+	 * stack to jump to here when it does the final iret.
+	 */
+repeat_nmi:
+	INTR_FRAME
+	/* Update the stack variable to say we are still in NMI */
+	movq $1, 5*8(%rsp)
+
+	/* copy the saved stack back to copy stack */
+	.rept 5
+	pushq_cfi 4*8(%rsp)
+	.endr
+
+	jmp restart_nmi
+	CFI_ENDPROC
+end_repeat_nmi:
+
 ENTRY(ignore_sysret)
 	CFI_STARTPROC
 	mov $-ENOSYS,%eax