Diffstat (limited to 'Documentation/x86/x86_64/machinecheck')
-rw-r--r-- | Documentation/x86/x86_64/machinecheck | 77 |
1 files changed, 77 insertions, 0 deletions
diff --git a/Documentation/x86/x86_64/machinecheck b/Documentation/x86/x86_64/machinecheck
new file mode 100644
index 000000000000..a05e58e7b159
--- /dev/null
+++ b/Documentation/x86/x86_64/machinecheck
@@ -0,0 +1,77 @@ | |||
1 | |||
2 | Configurable sysfs parameters for the x86-64 machine check code. | ||
3 | |||
4 | Machine checks report internal hardware error conditions detected | ||
5 | by the CPU. Uncorrected errors typically cause a machine check | ||
6 | (often with panic), corrected ones cause a machine check log entry. | ||
7 | |||
8 | Machine checks are organized in banks (normally associated with | ||
9 | a hardware subsystem) and subevents in a bank. The exact meaning | ||
10 | of the banks and subevents is CPU specific. | ||
11 | |||
12 | mcelog knows how to decode them. | ||
13 | |||
14 | When you see the "Machine check errors logged" message in the system | ||
15 | log then mcelog should run to collect and decode machine check entries | ||
16 | from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. | ||
17 | |||
18 | Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN | ||
19 | (N = CPU number) | ||
20 | |||
21 | The directory contains some configurable entries: | ||
22 | |||
23 | Entries: | ||
24 | |||
25 | bankNctl | ||
26 | (N bank number) | ||
27 | 64-bit hex bitmask enabling/disabling specific subevents for bank N. | ||
28 | When a bit in the bitmask is zero, the respective | ||
29 | subevent will not be reported. | ||
30 | By default all events are enabled. | ||
31 | Note that the BIOS maintains another mask to disable specific events | ||
32 | per bank. That mask is not visible here. | ||
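
As a minimal sketch of how such a mask could be interpreted, the Python helpers below parse a bankNctl value as a 64-bit hex number, list the subevent bits that are cleared (not reported), and compute a mask with one more bit cleared. The function names are hypothetical and not part of the kernel or mcelog. Whether the sysfs file prints a "0x" prefix is not stated here, so the parser accepts both forms.

```python
BANK_CTL_BITS = 64  # the text describes bankNctl as a 64-bit mask


def parse_bank_ctl(text):
    """Parse the hex string read from a bankNctl file into an integer."""
    value = int(text.strip(), 16)  # base 16 accepts an optional "0x" prefix
    if not 0 <= value < (1 << BANK_CTL_BITS):
        raise ValueError("bankNctl value does not fit in 64 bits")
    return value


def disabled_subevents(mask):
    """Return the bit positions that are zero, i.e. subevents not reported."""
    return [bit for bit in range(BANK_CTL_BITS) if not (mask >> bit) & 1]


def clear_subevent(mask, bit):
    """Return a new mask with one subevent bit cleared (reporting disabled)."""
    return mask & ~(1 << bit) & ((1 << BANK_CTL_BITS) - 1)
```

The result of clear_subevent() could be formatted with format(mask, "x") before being written back. The BIOS-level mask mentioned above is not reflected in any of these values.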
33 | |||
34 | The following entries appear for each CPU, but they are truly shared | ||
35 | between all CPUs. | ||
36 | |||
37 | check_interval | ||
38 | How often to poll for corrected machine check errors, in seconds | ||
39 | (note that the output is hexadecimal). Default: 5 minutes. When the poller | ||
40 | finds MCEs it triggers an exponential speedup (poll more often) on | ||
41 | the polling interval. When the poller stops finding MCEs, it | ||
42 | triggers an exponential backoff (poll less often) on the polling | ||
43 | interval. The check_interval variable is both the initial and | ||
44 | maximum polling interval. | ||
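
The adaptive polling described above can be sketched as follows. This is an illustration only: the text says the speedup and backoff are exponential but does not give the factor, so halving and doubling are assumptions, as is the one-second floor (min_interval). The function names are hypothetical.

```python
def parse_check_interval(text):
    """Convert the hexadecimal check_interval output to seconds."""
    return int(text.strip(), 16)


def next_poll_interval(current, found_mce, check_interval, min_interval=1):
    """Return the next polling interval in seconds.

    Assumed policy: halve the interval when MCEs were found (poll more
    often), double it otherwise (poll less often), and never exceed
    check_interval, which is both the initial and the maximum interval.
    """
    if found_mce:
        return max(current // 2, min_interval)
    return min(current * 2, check_interval)


def simulate(check_interval, findings):
    """Yield the interval after each poll for a sequence of poll results."""
    interval = check_interval  # polling starts at the configured value
    for found_mce in findings:
        interval = next_poll_interval(interval, found_mce, check_interval)
        yield interval
```

With the default of 5 minutes, parse_check_interval("12c") gives 300 seconds. Two polls that find errors followed by three quiet polls would move the interval through 150, 75, 150, 300, 300 under these assumptions.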
45 | |||
46 | tolerant | ||
47 | Tolerance level. When a machine check exception occurs for an | ||
48 | uncorrected machine check the kernel can take different actions. | ||
49 | Since machine check exceptions can happen any time it is sometimes | ||
50 | risky for the kernel to kill a process because it defies | ||
51 | normal kernel locking rules. The tolerance level configures | ||
52 | how hard the kernel tries to recover even at some risk of | ||
53 | deadlock. Higher tolerant values trade potentially better uptime | ||
54 | against the risk of a crash or even corruption (for tolerant >= 3). | ||
55 | |||
56 | 0: always panic on uncorrected errors, log corrected errors | ||
57 | 1: panic or SIGBUS on uncorrected errors, log corrected errors | ||
58 | 2: SIGBUS or log uncorrected errors, log corrected errors | ||
59 | 3: never panic or SIGBUS, log all errors (for testing only) | ||
60 | |||
61 | Default: 1 | ||
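
For scripts that need to reason about the configured level, the table above could be encoded as a small lookup. This is a sketch that only restates the four levels listed in this file. The names (TOLERANT_ACTIONS, uncorrected_actions) are hypothetical, and the kernel's actual decision logic is not modelled.

```python
DEFAULT_TOLERANT = 1  # default stated in the text above

# Possible kernel reactions to an uncorrected error, per tolerant level.
# Corrected errors are logged at every level.
TOLERANT_ACTIONS = {
    0: {"panic"},
    1: {"panic", "SIGBUS"},
    2: {"SIGBUS", "log"},
    3: {"log"},  # never panic or SIGBUS; for testing only
}


def uncorrected_actions(level=DEFAULT_TOLERANT):
    """Return the set of documented reactions for a tolerant level."""
    if level not in TOLERANT_ACTIONS:
        raise ValueError("tolerant must be 0, 1, 2 or 3")
    return TOLERANT_ACTIONS[level]


def may_panic(level=DEFAULT_TOLERANT):
    """True if the documented actions for this level include a panic."""
    return "panic" in uncorrected_actions(level)
```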
62 | |||
63 | Note this only makes a difference if the CPU allows recovery | ||
64 | from a machine check exception. Current x86 CPUs generally do not. | ||
65 | |||
66 | trigger | ||
67 | Program to run when a machine check event is detected. | ||
68 | This is an alternative to running mcelog regularly from cron | ||
69 | and allows events to be detected faster. | ||
70 | |||
71 | TBD: document entries for AMD threshold interrupt configuration. | ||
72 | |||
73 | For more details about the x86 machine check architecture | ||
74 | see the Intel and AMD architecture manuals from their developer websites. | ||
75 | |||
76 | For more details about the architecture see | ||
77 | http://one.firstfloor.org/~andi/mce.pdf | ||