Diffstat (limited to 'Documentation/nmi_watchdog.txt')
-rw-r--r-- | Documentation/nmi_watchdog.txt | 81 |
1 files changed, 81 insertions, 0 deletions
diff --git a/Documentation/nmi_watchdog.txt b/Documentation/nmi_watchdog.txt
new file mode 100644
index 000000000000..c025a4561c10
--- /dev/null
+++ b/Documentation/nmi_watchdog.txt
@@ -0,0 +1,81 @@
1 | |||
2 | [NMI watchdog is available for x86 and x86-64 architectures] | ||
3 | |||
4 | Is your system locking up unpredictably? No keyboard activity, just | ||
5 | a frustrating complete hard lockup? Do you want to help us debug | ||
6 | such lockups? If so, this document is definitely for you. | ||
7 | |||
8 | Many x86/x86-64 systems have a hardware feature that enables | ||
9 | us to generate 'watchdog NMI interrupts'. (NMI: Non Maskable Interrupt, | ||
10 | which gets executed even if the system is otherwise locked up hard.) | ||
11 | This can be used to debug hard kernel lockups. By executing periodic | ||
12 | NMI interrupts, the kernel can monitor whether any CPU has locked up, | ||
13 | and print out debugging messages if so. | ||
14 | |||
15 | In order to use the NMI watchdog, you need to have APIC support in your | ||
16 | kernel. For SMP kernels, APIC support gets compiled in automatically. For | ||
17 | UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local | ||
18 | APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and | ||
19 | features -> IO-APIC support on uniprocessors) in your kernel config. | ||
20 | CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC. | ||
21 | CONFIG_X86_UP_IOAPIC is for uniprocessor machines with an IO-APIC. [Note: certain | ||
22 | kernel debugging options, such as Kernel Stack Meter or Kernel Tracer, | ||
23 | may implicitly disable the NMI watchdog.] | ||
24 | |||
25 | For x86-64, the needed APIC is always compiled in, and the NMI watchdog is | ||
26 | always enabled with I/O-APIC mode (nmi_watchdog=1). Currently, local APIC | ||
27 | mode (nmi_watchdog=2) does not work on x86-64. | ||
28 | |||
29 | Using local APIC (nmi_watchdog=2) needs the first performance register, so | ||
30 | you can't use it for other purposes (such as high-precision performance | ||
31 | profiling). However, at least oprofile and the perfctr driver disable the | ||
32 | local APIC NMI watchdog automatically. | ||
33 | |||
34 | To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot | ||
35 | parameter. For example, the relevant lilo.conf entry is: | ||
36 | |||
37 | append="nmi_watchdog=1" | ||
38 | |||
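To confirm which mode the running kernel was actually booted with, one option is to look for the parameter on the kernel command line (readable from /proc/cmdline on Linux). The following is a minimal sketch, not part of the kernel; the helper name nmi_watchdog_mode is hypothetical, and returning the last occurrence when the parameter is repeated is a choice made by this sketch, not documented kernel behaviour.

```python
def nmi_watchdog_mode(cmdline):
    """Return the value given to nmi_watchdog= on a command line, or None.

    `cmdline` is the text of the kernel command line, for example the
    contents of /proc/cmdline.
    """
    mode = None
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "nmi_watchdog" and sep:
            # Keep scanning: this sketch reports the last occurrence.
            mode = value
    return mode
```

A typical use would be nmi_watchdog_mode(open("/proc/cmdline").read()); a result of None means the parameter was not passed at boot.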
39 | For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1. | ||
40 | For UP machines without an IO-APIC use nmi_watchdog=2; this only works | ||
41 | for some processor types. If in doubt, boot with nmi_watchdog=1 and | ||
42 | check the NMI count in /proc/interrupts; if the count is zero then | ||
43 | reboot with nmi_watchdog=2 and check the NMI count again. If it is still | ||
44 | zero then report the problem; you probably have a processor that needs to be | ||
45 | added to the NMI code. | ||
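The NMI count check described above can be scripted. The sketch below assumes the usual /proc/interrupts layout, in which a line labelled "NMI:" carries one counter per CPU, optionally followed by a textual description. The function names are hypothetical and the parsing is illustrative only.

```python
def nmi_counts(interrupts_text):
    """Return the per-CPU NMI counts from /proc/interrupts-style text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "NMI:":
            counts = []
            for field in fields[1:]:
                # Stop at the first non-numeric field; some kernels append
                # a description after the per-CPU counters.
                if not field.isdigit():
                    break
                counts.append(int(field))
            return counts
    return []  # no NMI line present


def watchdog_is_ticking(interrupts_text):
    """True if at least one CPU has received an NMI."""
    return any(count > 0 for count in nmi_counts(interrupts_text))
```

Passing open("/proc/interrupts").read() to watchdog_is_ticking() a short while after boot gives the same answer as the manual check: False suggests trying the other nmi_watchdog mode.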
46 | |||
47 | A 'lockup' is the following scenario: if any CPU in the system does not | ||
48 | execute the periodic local timer interrupt for more than 5 seconds, then | ||
49 | the NMI handler generates an oops and kills the process. This | ||
50 | 'controlled crash' (and the resulting kernel messages) can be used to | ||
51 | debug the lockup. Thus whenever the lockup happens, wait 5 seconds and | ||
52 | the oops will show up automatically. If the kernel produces no messages | ||
53 | then the system has crashed so hard (e.g. hardware-wise) that either it | ||
54 | cannot even accept NMI interrupts, or the crash has made the kernel | ||
55 | unable to print messages. | ||
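The detection rule in the paragraph above can be modelled in a few lines. This is a simplified sketch of the idea, not the kernel's NMI handler: the class name, the nmi_hz parameter (assumed number of watchdog NMIs per second) and the counter names are all hypothetical. On every NMI the watchdog compares the CPU's timer interrupt count with the value seen at the previous NMI, and declares a lockup once the count has stood still for more than 5 seconds' worth of NMIs.

```python
LOCKUP_SECONDS = 5  # threshold quoted in the text


class CpuWatch:
    """Tracks one CPU's timer interrupt count between watchdog NMIs."""

    def __init__(self, nmi_hz):
        self.nmi_hz = nmi_hz        # assumed watchdog NMIs per second
        self.last_timer_count = 0   # timer count seen at the previous NMI
        self.stalled_ticks = 0      # consecutive NMIs without timer progress

    def on_nmi(self, timer_count):
        """Call on every watchdog NMI; returns True when a lockup is declared."""
        if timer_count != self.last_timer_count:
            # The timer interrupt ran since the last NMI: the CPU is alive.
            self.last_timer_count = timer_count
            self.stalled_ticks = 0
            return False
        self.stalled_ticks += 1
        return self.stalled_ticks > LOCKUP_SECONDS * self.nmi_hz
```

With nmi_hz=1, a CPU whose timer count stops advancing is reported on the sixth consecutive NMI that sees no progress, that is, after more than 5 seconds.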
56 | |||
57 | Be aware that when using the local APIC, the frequency of the NMI interrupts | ||
58 | it generates depends on the system load. The local APIC NMI watchdog, | ||
59 | lacking a better source, uses the "cycles unhalted" event. As you may | ||
60 | guess it doesn't tick when the CPU is in the halted state (which happens | ||
61 | when the system is idle), but if your system locks up on anything but the | ||
62 | "hlt" processor instruction, the watchdog will trigger very soon as the | ||
63 | "cycles unhalted" event will happen every clock tick. If it locks up on | ||
64 | "hlt", then you are out of luck -- the event will not happen at all and the | ||
65 | watchdog won't trigger. This is a shortcoming of the local APIC watchdog | ||
66 | -- unfortunately there is no "clock ticks" event that would work all the | ||
67 | time. The I/O APIC watchdog is driven externally and has no such shortcoming. | ||
68 | But its NMI frequency is much higher, resulting in a more significant hit | ||
69 | to the overall system performance. | ||
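The difference between the two tick sources can be illustrated with a toy model. It is a deliberate simplification under stated assumptions: time is divided into equal steps, the CPU is either running ("run") or halted ("hlt") in each step, and the function counts steps in which the watchdog's event source advances, not actual NMIs. The function name and the state labels are hypothetical.

```python
def watchdog_source_ticks(cpu_states, source):
    """Count the steps in which the watchdog's event source advances.

    cpu_states: iterable of "run" or "hlt", one entry per time step.
    source: "lapic" (cycles-unhalted event) or "ioapic" (external timer).
    """
    if source not in ("lapic", "ioapic"):
        raise ValueError("unknown source: %r" % (source,))
    ticks = 0
    for state in cpu_states:
        # The external source always advances; the cycles-unhalted event
        # only advances while the CPU is not halted.
        if source == "ioapic" or state == "run":
            ticks += 1
    return ticks
```

For a CPU stuck in "hlt" the "lapic" source yields zero ticks, so the watchdog never fires, while the "ioapic" source keeps advancing regardless of CPU state.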
70 | |||
71 | NOTE: starting with 2.4.2-ac18 the NMI-oopser is disabled by default; | ||
72 | you have to enable it with a boot time parameter. Prior to 2.4.2-ac18 | ||
73 | the NMI-oopser was enabled unconditionally on x86 SMP boxes. | ||
74 | |||
75 | On x86-64 the NMI oopser is on by default. On 64-bit Intel CPUs | ||
76 | it uses IO-APIC by default and on AMD it uses local APIC. | ||
77 | |||
78 | [ feel free to send bug reports, suggestions and patches to | ||
79 | Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing | ||
80 | list at <linux-smp@vger.kernel.org> ] | ||
81 | |||