diff options
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/x86_64/machinecheck | 70 |
1 files changed, 70 insertions, 0 deletions
diff --git a/Documentation/x86_64/machinecheck b/Documentation/x86_64/machinecheck new file mode 100644 index 000000000000..068a6d9904b9 --- /dev/null +++ b/Documentation/x86_64/machinecheck | |||
@@ -0,0 +1,70 @@ | |||
1 | |||
2 | Configurable sysfs parameters for the x86-64 machine check code. | ||
3 | |||
4 | Machine checks report internal hardware error conditions detected | ||
5 | by the CPU. Uncorrected errors typically cause a machine check | ||
6 | (often with panic), corrected ones cause a machine check log entry. | ||
7 | |||
8 | Machine checks are organized in banks (normally associated with | ||
9 | a hardware subsystem) and subevents in a bank. The exact meaning | ||
10 | of the banks and subevent is CPU specific. | ||
11 | |||
12 | mcelog knows how to decode them. | ||
13 | |||
14 | When you see the "Machine check errors logged" message in the system | ||
15 | log then mcelog should run to collect and decode machine check entries | ||
16 | from /dev/mcelog. Normally mcelog should be run regularly from a cronjob. | ||
17 | |||
18 | Each CPU has a directory in /sys/devices/system/machinecheck/machinecheckN | ||
19 | (N = CPU number) | ||
20 | |||
21 | The directory contains some configurable entries: | ||
22 | |||
23 | Entries: | ||
24 | |||
25 | bankNctl | ||
26 | (N bank number) | ||
27 | 64bit Hex bitmask enabling/disabling specific subevents for bank N | ||
28 | When a bit in the bitmask is zero then the respective | ||
29 | subevent will not be reported. | ||
30 | By default all events are enabled. | ||
31 | Note that BIOS maintain another mask to disable specific events | ||
32 | per bank. This is not visible here | ||
33 | |||
34 | The following entries appear for each CPU, but they are truly shared | ||
35 | between all CPUs. | ||
36 | |||
37 | check_interval | ||
38 | How often to poll for corrected machine check errors, in seconds | ||
39 | (Note output is hexademical). Default 5 minutes. | ||
40 | |||
41 | tolerant | ||
42 | Tolerance level. When a machine check exception occurs for a non | ||
43 | corrected machine check the kernel can take different actions. | ||
44 | Since machine check exceptions can happen any time it is sometimes | ||
45 | risky for the kernel to kill a process because it defies | ||
46 | normal kernel locking rules. The tolerance level configures | ||
47 | how hard the kernel tries to recover even at some risk of deadlock. | ||
48 | |||
49 | 0: always panic, | ||
50 | 1: panic if deadlock possible, | ||
51 | 2: try to avoid panic, | ||
52 | 3: never panic or exit (for testing only) | ||
53 | |||
54 | Default: 1 | ||
55 | |||
56 | Note this only makes a difference if the CPU allows recovery | ||
57 | from a machine check exception. Current x86 CPUs generally do not. | ||
58 | |||
59 | trigger | ||
60 | Program to run when a machine check event is detected. | ||
61 | This is an alternative to running mcelog regularly from cron | ||
62 | and allows to detect events faster. | ||
63 | |||
64 | TBD document entries for AMD threshold interrupt configuration | ||
65 | |||
66 | For more details about the x86 machine check architecture | ||
67 | see the Intel and AMD architecture manuals from their developer websites. | ||
68 | |||
69 | For more details about the architecture see | ||
70 | see http://one.firstfloor.org/~andi/mce.pdf | ||