author     Steven Rostedt <srostedt@redhat.com>    2011-12-08 12:36:23 -0500
committer  Steven Rostedt <rostedt@goodmis.org>    2011-12-21 15:38:54 -0500
commit     3f3c8b8c4b2a34776c3470142a7c8baafcda6eb0 (patch)
tree       204e9e097fee7450c268c94c32a3338766a53401 /arch/x86
parent     1fd466efc88c48f50e5ee29f4dbb4e210a889172 (diff)
x86: Add workaround to NMI iret woes
In x86, when an NMI goes off, the CPU goes into an NMI context that prevents other NMIs from triggering on that CPU. If another NMI is supposed to trigger, it has to wait until the previous NMI leaves NMI context. At that time, the next NMI can trigger (note, only one more NMI will trigger, as only one can be latched at a time).

The way x86 gets out of NMI context is by calling iret. The problem is that this causes trouble if the NMI handler either triggers an exception or a breakpoint. Both the exception and the breakpoint handlers finish with an iret. If this happens while in NMI context, the CPU leaves NMI context and a new NMI may come in. As NMI handlers are not made to be re-entrant, this can cause havoc with the system; not to mention that the nested NMI will write all over the previous NMI's stack.

Linus Torvalds proposed the following workaround to this problem:

  https://lkml.org/lkml/2010/7/14/264

 "In fact, I wonder if we couldn't just do a software NMI disable
  instead? Have a per-cpu variable (in the _core_ percpu areas that
  get allocated statically) that points to the NMI stack frame, and
  just make the NMI code itself do something like

   NMI entry:
   - load percpu NMI stack frame pointer
   - if non-zero we know we're nested, and should ignore this NMI:
      - we're returning to kernel mode, so return immediately by using
        "popf/ret", which also keeps NMI's disabled in the hardware
        until the "real" NMI iret happens.
      - before the popf/iret, use the NMI stack pointer to make the
        NMI return stack be invalid and cause a fault
   - set the NMI stack pointer to the current stack pointer

   NMI exit (not the above "immediate exit because we nested"):
     clear the percpu NMI stack pointer
     Just do the iret.

  Now, the thing is, now the "iret" is atomic. If we had a nested NMI,
  we'll take a fault, and that re-does our "delayed" NMI - and NMI's
  will stay masked.

  And if we didn't have a nested NMI, that iret will now unmask NMI's,
  and everything is happy."
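For illustration only, here is a rough user-space C sketch of the control flow quoted above. All names are invented, and this models the quoted proposal rather than the assembly this patch actually adds:

/*
 * Rough user-space rendering of the quoted proposal, NOT this patch.
 * nmi_frame_ptr stands in for the per-cpu variable Linus describes.
 */
#include <stdio.h>

static void *nmi_frame_ptr;		/* per-cpu in the real proposal */
static long outer_frame, nested_frame;

static void nmi_entry(void *this_frame);

static void handle_nmi(void)
{
	static int injected;

	printf("NMI handler body runs\n");
	if (!injected) {
		injected = 1;
		/* Pretend a breakpoint's iret re-enabled NMIs and a
		 * second NMI arrived while the first is still running. */
		nmi_entry(&nested_frame);
	}
}

static void nmi_entry(void *this_frame)
{
	if (nmi_frame_ptr) {
		/*
		 * Nested NMI: tag the outer NMI's return frame so its
		 * iret faults and re-runs the NMI, then leave via
		 * popf/ret so hardware keeps NMIs masked until the
		 * real iret.
		 */
		printf("nested NMI: flag frame %p, return early\n",
		       nmi_frame_ptr);
		return;
	}

	nmi_frame_ptr = this_frame;	/* NMI entry: set the software flag */
	handle_nmi();
	nmi_frame_ptr = NULL;		/* NMI exit: clear it, then one atomic iret */
}

int main(void)
{
	nmi_entry(&outer_frame);
	return 0;
}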
I first tried to follow this advice, but as I started implementing the code, a few gotchas showed up.

One is accessing per-cpu variables in the NMI handler. The problem is that per-cpu variables use the %gs register to get the variable for the given CPU. But as the NMI may happen in userspace, we must first perform a SWAPGS to get to it. The NMI handler already does this later in the code, but that's too late, as by then we have saved off all the registers, and we don't want to do that for a disabled NMI. Peter Zijlstra suggested keeping all variables on the stack. This simplifies things greatly and has the added benefit of cache locality.

Two, faulting on the iret. I really wanted to make this work, but it was becoming very hacky, and I never got it to be stable. The iret already had a fault handler for userspace faulting with bad segment registers, and getting the NMI to trigger a fault and detect it was very tricky. But for strange reasons, the system would usually take a double fault and crash. I never figured out why and decided to go with a simple "jmp" approach. The new approach I took also simplified things.

Finally, the last problem with Linus's approach was to have the nested NMI handler do a ret instead of an iret to give the first NMI NMI context again. The problem is that ret is much more limited than an iret. I couldn't figure out how to get the stack back to where it belonged. I could have copied the current stack and pushed the return onto it, but my fear here is that there may be some place that writes data below the stack pointer. I know that is not something code should depend on, but I don't want to chance it. I may add this feature later, but for now, an NMI handler that loses NMI context will not get it back.

Here's what is done:

When an NMI comes in, the HW pushes the interrupt stack frame onto the per-cpu NMI stack that is selected by the IST.

A special location on the NMI stack holds a variable that is set when the first NMI handler runs. If this variable is set, then we know that this is a nested NMI and we process the nested-NMI code.

There is still a race when this variable is cleared and an NMI comes in just before the first NMI does the return. For this case, if the variable is cleared, we also check whether the interrupted stack is the NMI stack. If it is, then we process the nested-NMI code.

Why the two tests and not just test the interrupted stack? Suppose the first NMI hits a breakpoint and loses NMI context, then hits another breakpoint, and while processing that breakpoint we get a nested NMI. When processing a breakpoint, the stack changes to the breakpoint stack, so if another NMI comes in here we can't rely on the interrupted stack being the NMI stack.

If the variable is not set and the interrupted task's stack is not the NMI stack, then we know this is the first NMI and we can process things normally. But in order to do so, we need to do a few things first:

 1) Set the stack variable that tells us that we are in an NMI handler.

 2) Make two copies of the interrupt stack frame.
    One copy is used to return on iret.
    The other is used to restore the first one if we have a nested NMI.

This is what the stack will look like:

	+-------------------------+
	| original SS             |
	| original Return RSP     |
	| original RFLAGS         |
	| original CS             |
	| original RIP            |
	+-------------------------+
	| temp storage for rdx    |
	+-------------------------+
	| NMI executing variable  |
	+-------------------------+
	| Saved SS                |
	| Saved Return RSP        |
	| Saved RFLAGS            |
	| Saved CS                |
	| Saved RIP               |
	+-------------------------+
	| copied SS               |
	| copied Return RSP       |
	| copied RFLAGS           |
	| copied CS               |
	| copied RIP              |
	+-------------------------+
	| pt_regs                 |
	+-------------------------+

The original stack frame contains what the HW put in when we entered the NMI. We store %rdx as a temp variable to use. Both the original HW stack frame and this %rdx storage will be clobbered by nested NMIs, so we cannot rely on them later in the first NMI handler.

The next item is the special stack variable that is set when we execute the rest of the NMI handler.

Then we have two copies of the interrupt stack. The second copy is modified by any nested NMIs to let the first NMI know that we triggered a second NMI (latched) and that we should repeat the NMI handler.

If the first NMI hits an exception or breakpoint that takes it out of NMI context, and a second NMI comes in before the first one finishes, the second NMI will update the copied interrupt stack to point to a fixup location that triggers another NMI. When the first NMI calls iret, it will instead jump to the fixup location. This fixup location will copy the saved interrupt stack back into the copy and execute the NMI handler again.

Note, the nested NMI knows enough to check whether it preempted a previous NMI handler while it is in the fixup location. If it has, it will not modify the copied interrupt stack and will just leave, as if nothing happened. As the NMI handler is about to execute again, there's no reason to latch now.
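A minimal user-space sketch of the two tests just described (the on-stack "NMI executing" variable plus the NMI-stack range check that the patch's test_in_nmi macro performs); the stack size, addresses, and function names are assumptions for illustration:

/*
 * Sketch of the nested-NMI decision described above.  EXCEPTION_STKSZ
 * and the sample addresses are illustrative assumptions only.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define EXCEPTION_STKSZ 4096	/* assumed NMI/exception stack size */

/* Is sp within [top - EXCEPTION_STKSZ, top]?  Mirrors test_in_nmi. */
static bool on_nmi_stack(uint64_t sp, uint64_t nmi_stack_top)
{
	return sp <= nmi_stack_top && sp >= nmi_stack_top - EXCEPTION_STKSZ;
}

/* The decision made at NMI entry, as described in the changelog. */
static bool is_nested_nmi(uint64_t nmi_executing, uint64_t interrupted_rsp,
			  uint64_t nmi_stack_top)
{
	if (nmi_executing)	/* the first NMI set this on its stack */
		return true;
	/*
	 * Covers the race where the first NMI already cleared the
	 * variable but we still interrupted it on the NMI stack.
	 */
	return on_nmi_stack(interrupted_rsp, nmi_stack_top);
}

int main(void)
{
	uint64_t top = 0xffff880000008000ULL;	/* made-up NMI stack top */

	printf("%d\n", is_nested_nmi(1, 0, top));		/* 1: variable set */
	printf("%d\n", is_nested_nmi(0, top - 64, top));	/* 1: on NMI stack */
	printf("%d\n", is_nested_nmi(0, 0x00007ffffffff000ULL, top)); /* 0: first NMI */
	return 0;
}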
To test all this, I forced the NMI handler to call iret and take itself out of NMI context. I also added assembly code to write to the serial port to make sure that it hits the nested path as well as the fixup path. Everything seems to be working fine.

Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: H. Peter Anvin <hpa@linux.intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul Turner <pjt@google.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Diffstat (limited to 'arch/x86')
-rw-r--r--   arch/x86/kernel/entry_64.S | 177
1 file changed, 177 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index d1d5434e7f6a..b62aa298df7f 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1475,11 +1475,166 @@ ENTRY(error_exit)
 	CFI_ENDPROC
 END(error_exit)
 
+/*
+ * Test if a given stack is an NMI stack or not.
+ */
+	.macro test_in_nmi reg stack nmi_ret normal_ret
+	cmpq %\reg, \stack
+	ja \normal_ret
+	subq $EXCEPTION_STKSZ, %\reg
+	cmpq %\reg, \stack
+	jb \normal_ret
+	jmp \nmi_ret
+	.endm
 
 	/* runs on exception stack */
 ENTRY(nmi)
 	INTR_FRAME
 	PARAVIRT_ADJUST_EXCEPTION_FRAME
+	/*
+	 * We allow breakpoints in NMIs. If a breakpoint occurs, then
+	 * the iretq it performs will take us out of NMI context.
+	 * This means that we can have nested NMIs where the next
+	 * NMI is using the top of the stack of the previous NMI. We
+	 * can't let it execute because the nested NMI will corrupt the
+	 * stack of the previous NMI. NMI handlers are not re-entrant
+	 * anyway.
+	 *
+	 * To handle this case we do the following:
+	 *  Check a special location on the stack that contains
+	 *  a variable that is set when NMIs are executing.
+	 *  The interrupted task's stack is also checked to see if it
+	 *  is an NMI stack.
+	 *  If the variable is not set and the stack is not the NMI
+	 *  stack then:
+	 *    o Set the special variable on the stack
+	 *    o Copy the interrupt frame into a "saved" location on the stack
+	 *    o Copy the interrupt frame into a "copy" location on the stack
+	 *    o Continue processing the NMI
+	 *  If the variable is set or the previous stack is the NMI stack:
+	 *    o Modify the "copy" location to jump to repeat_nmi
+	 *    o return back to the first NMI
+	 *
+	 * Now on exit of the first NMI, we first clear the stack variable.
+	 * The NMI stack will tell any nested NMIs at that point that it is
+	 * nested. Then we pop the stack normally with iret, and if there was
+	 * a nested NMI that updated the copy interrupt stack frame, a
+	 * jump will be made to the repeat_nmi code that will handle the second
+	 * NMI.
+	 */
+
+	/* Use %rdx as our temp variable throughout */
+	pushq_cfi %rdx
+
+	/*
+	 * Check the special variable on the stack to see if NMIs are
+	 * executing.
+	 */
+	cmp $1, -8(%rsp)
+	je nested_nmi
+
+	/*
+	 * Now test if the previous stack was an NMI stack.
+	 * We need the double check. We check the NMI stack to satisfy the
+	 * race when the first NMI clears the variable before returning.
+	 * We check the variable because the first NMI could be in a
+	 * breakpoint routine using a breakpoint stack.
+	 */
+	lea 6*8(%rsp), %rdx
+	test_in_nmi rdx, 4*8(%rsp), nested_nmi, first_nmi
+
+nested_nmi:
+	/*
+	 * Do nothing if we interrupted the fixup in repeat_nmi.
+	 * It's about to repeat the NMI handler, so we are fine
+	 * with ignoring this one.
+	 */
+	movq $repeat_nmi, %rdx
+	cmpq 8(%rsp), %rdx
+	ja 1f
+	movq $end_repeat_nmi, %rdx
+	cmpq 8(%rsp), %rdx
+	ja nested_nmi_out
+
+1:
+	/* Set up the interrupted NMI's stack to jump to repeat_nmi */
+	leaq -6*8(%rsp), %rdx
+	movq %rdx, %rsp
+	CFI_ADJUST_CFA_OFFSET 6*8
+	pushq_cfi $__KERNEL_DS
+	pushq_cfi %rdx
+	pushfq_cfi
+	pushq_cfi $__KERNEL_CS
+	pushq_cfi $repeat_nmi
+
+	/* Put stack back */
+	addq $(11*8), %rsp
+	CFI_ADJUST_CFA_OFFSET -11*8
+
+nested_nmi_out:
+	popq_cfi %rdx
+
+	/* No need to check faults here */
+	INTERRUPT_RETURN
+
+first_nmi:
+	/*
+	 * Because nested NMIs will use the pushed location that we
+	 * stored in rdx, we must keep that space available.
+	 * Here's what our stack frame will look like:
+	 * +-------------------------+
+	 * | original SS             |
+	 * | original Return RSP     |
+	 * | original RFLAGS         |
+	 * | original CS             |
+	 * | original RIP            |
+	 * +-------------------------+
+	 * | temp storage for rdx    |
+	 * +-------------------------+
+	 * | NMI executing variable  |
+	 * +-------------------------+
+	 * | Saved SS                |
+	 * | Saved Return RSP        |
+	 * | Saved RFLAGS            |
+	 * | Saved CS                |
+	 * | Saved RIP               |
+	 * +-------------------------+
+	 * | copied SS               |
+	 * | copied Return RSP       |
+	 * | copied RFLAGS           |
+	 * | copied CS               |
+	 * | copied RIP              |
+	 * +-------------------------+
+	 * | pt_regs                 |
+	 * +-------------------------+
+	 *
+	 * The saved RIP is used to fix up the copied RIP that a nested
+	 * NMI may zero out. The original stack frame and the temp storage
+	 * are also used by nested NMIs and cannot be trusted on exit.
+	 */
+	/* Set the NMI executing variable on the stack. */
+	pushq_cfi $1
+
+	/* Copy the stack frame to the Saved frame */
+	.rept 5
+	pushq_cfi 6*8(%rsp)
+	.endr
+
+	/* Make another copy, this one may be modified by nested NMIs */
+	.rept 5
+	pushq_cfi 4*8(%rsp)
+	.endr
+
+	/* Do not pop rdx, nested NMIs will corrupt it */
+	movq 11*8(%rsp), %rdx
+
+	/*
+	 * Everything below this point can be preempted by a nested
+	 * NMI if the first NMI took an exception. Repeated NMIs
+	 * caused by an exception and nested NMI will start here, and
+	 * can still be preempted by another NMI.
+	 */
+restart_nmi:
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
@@ -1502,10 +1657,32 @@ nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
 nmi_restore:
 	RESTORE_ALL 8
+	/* Clear the NMI executing stack variable */
+	movq $0, 10*8(%rsp)
 	jmp irq_return
 	CFI_ENDPROC
 END(nmi)
 
+	/*
+	 * If an NMI hit an iret because of an exception or breakpoint,
+	 * it can lose its NMI context, and a nested NMI may come in.
+	 * In that case, the nested NMI will change the preempted NMI's
+	 * stack to jump to here when it does the final iret.
+	 */
+repeat_nmi:
+	INTR_FRAME
+	/* Update the stack variable to say we are still in NMI */
+	movq $1, 5*8(%rsp)
+
+	/* copy the saved stack back to copy stack */
+	.rept 5
+	pushq_cfi 4*8(%rsp)
+	.endr
+
+	jmp restart_nmi
+	CFI_ENDPROC
+end_repeat_nmi:
+
 ENTRY(ignore_sysret)
 	CFI_STARTPROC
 	mov $-ENOSYS,%eax