author | Mauro Carvalho Chehab <mchehab@s-opensource.com> | 2016-10-26 06:14:12 -0400 |
---|---|---|
committer | Mauro Carvalho Chehab <mchehab@s-opensource.com> | 2016-12-15 05:54:49 -0500 |
commit | b27a2d04feb6969e74942378d5012d84877d3544 (patch) | |
tree | 6fb726a47822b6cb7bcea749ea75309922870da2 | |
parent | 032d0ab743ff8ee340d5fc2a00c833dfe74c49e4 (diff) |
edac.txt: convert EDAC documentation to ReST

Converts the EDAC driver subsystem documentation to ReST:

- Put paragraph titles in lower case;
- Add code blocks where needed;
- Convert tables to ReST markup;
- Mark filesystem and module names as verbatim;
- Adjust document to be properly displayed in html.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
-rw-r--r-- | Documentation/edac.txt | 551 |
1 files changed, 295 insertions, 256 deletions
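
The documentation converted by this patch describes per-csrow error counters (``ce_count`` and ``ue_count``) under ``/sys/devices/system/edac/mc/mcX/csrowX``. As a rough illustration of how userspace might harvest them, here is a minimal Python sketch. It is not part of the patch or of any EDAC tooling; the function name `read_csrow_counts` and the `mc_root` parameter are hypothetical, and the sketch assumes each attribute file holds a single decimal integer.

```python
from pathlib import Path

# Location taken from the documentation; everything else here is illustrative.
EDAC_MC_ROOT = Path("/sys/devices/system/edac/mc")


def read_csrow_counts(mc_root=EDAC_MC_ROOT):
    """Return {(mc, csrow): {"ce_count": int, "ue_count": int}} for each csrow found."""
    counts = {}
    root = Path(mc_root)
    # csrowX directories exist only when CONFIG_EDAC_LEGACY_SYSFS is enabled,
    # so an empty result does not by itself mean EDAC is absent.
    for csrow_dir in sorted(root.glob("mc[0-9]*/csrow[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((csrow_dir / name).read_text().strip())
            except (OSError, ValueError):
                # Attribute missing, unreadable, or not a plain integer: skip it.
                continue
        counts[(csrow_dir.parent.name, csrow_dir.name)] = entry
    return counts


if __name__ == "__main__":
    for (mc, csrow), entry in read_csrow_counts().items():
        print(mc, csrow, entry)
```

Passing the root directory as a parameter keeps the sketch usable against a copied or mocked sysfs tree. On a system without the legacy csrow layout it simply returns an empty dictionary.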
diff --git a/Documentation/edac.txt b/Documentation/edac.txt index 502988524519..316456ba2e0a 100644 --- a/Documentation/edac.txt +++ b/Documentation/edac.txt | |||
@@ -1,29 +1,34 @@ | |||
1 | .. include:: <isonum.txt> | ||
2 | |||
3 | ===================================== | ||
1 | EDAC - Error Detection And Correction | 4 | EDAC - Error Detection And Correction |
2 | ===================================== | 5 | ===================================== |
3 | 6 | ||
4 | "bluesmoke" was the name for this device driver when it | 7 | .. note:: |
5 | was "out-of-tree" and maintained at sourceforge.net - | ||
6 | bluesmoke.sourceforge.net. That site is mostly archaic now and can be | ||
7 | used only for historical purposes. | ||
8 | 8 | ||
9 | When the subsystem was pushed into 2.6.16 for the first time, it was | 9 | "bluesmoke" was the name for this device driver when it |
10 | renamed to 'EDAC'. | 10 | was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net. |
11 | That site is mostly archaic now and can be used only for historical | ||
12 | purposes. | ||
11 | 13 | ||
12 | PURPOSE | 14 | When the subsystem was pushed into 2.6.16 for the first time, it was |
15 | renamed to ``EDAC``. | ||
16 | |||
17 | Purpose | ||
13 | ------- | 18 | ------- |
14 | 19 | ||
15 | The 'edac' kernel module's goal is to detect and report hardware errors | 20 | The ``edac`` kernel module's goal is to detect and report hardware errors |
16 | that occur within the computer system running under linux. | 21 | that occur within the computer system running under linux. |
17 | 22 | ||
18 | MEMORY | 23 | Memory |
19 | ------ | 24 | ------ |
20 | 25 | ||
21 | Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the | 26 | Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the |
22 | primary errors being harvested. These types of errors are harvested by | 27 | primary errors being harvested. These types of errors are harvested by |
23 | the 'edac_mc' device. | 28 | the ``edac_mc`` device. |
24 | 29 | ||
25 | Detecting CE events, then harvesting those events and reporting them, | 30 | Detecting CE events, then harvesting those events and reporting them, |
26 | *can* but must not necessarily be a predictor of future UE events. With | 31 | **can** but must not necessarily be a predictor of future UE events. With |
27 | CE events only, the system can and will continue to operate as no data | 32 | CE events only, the system can and will continue to operate as no data |
28 | has been damaged yet. | 33 | has been damaged yet. |
29 | 34 | ||
@@ -31,10 +36,10 @@ However, preventive maintenance and proactive part replacement of memory | |||
31 | DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events | 36 | DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events |
32 | and system panics. | 37 | and system panics. |
33 | 38 | ||
34 | OTHER HARDWARE ELEMENTS | 39 | Other hardware elements |
35 | ----------------------- | 40 | ----------------------- |
36 | 41 | ||
37 | A new feature for EDAC, the edac_device class of device, was added in | 42 | A new feature for EDAC, the ``edac_device`` class of device, was added in |
38 | the 2.6.23 version of the kernel. | 43 | the 2.6.23 version of the kernel. |
39 | 44 | ||
40 | This new device type allows for non-memory type of ECC hardware detectors | 45 | This new device type allows for non-memory type of ECC hardware detectors |
@@ -48,14 +53,14 @@ reports it, then a edac_device device probably can be constructed to | |||
48 | harvest and present that to userspace. | 53 | harvest and present that to userspace. |
49 | 54 | ||
50 | 55 | ||
51 | PCI BUS SCANNING | 56 | PCI bus scanning |
52 | ---------------- | 57 | ---------------- |
53 | 58 | ||
54 | In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors | 59 | In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors |
55 | in order to determine if errors are occurring during data transfers. | 60 | in order to determine if errors are occurring during data transfers. |
56 | 61 | ||
57 | The presence of PCI Parity errors must be examined with a grain of salt. | 62 | The presence of PCI Parity errors must be examined with a grain of salt. |
58 | There are several add-in adapters that do *not* follow the PCI specification | 63 | There are several add-in adapters that do **not** follow the PCI specification |
59 | with regards to Parity generation and reporting. The specification says | 64 | with regards to Parity generation and reporting. The specification says |
60 | the vendor should tie the parity status bits to 0 if they do not intend | 65 | the vendor should tie the parity status bits to 0 if they do not intend |
61 | to generate parity. Some vendors do not do this, and thus the parity bit | 66 | to generate parity. Some vendors do not do this, and thus the parity bit |
@@ -63,62 +68,64 @@ can "float" giving false positives. | |||
63 | 68 | ||
64 | There is a PCI device attribute located in sysfs that is checked by | 69 | There is a PCI device attribute located in sysfs that is checked by |
65 | the EDAC PCI scanning code. If that attribute is set, PCI parity/error | 70 | the EDAC PCI scanning code. If that attribute is set, PCI parity/error |
66 | scanning is skipped for that device. The attribute is: | 71 | scanning is skipped for that device. The attribute is:: |
67 | 72 | ||
68 | broken_parity_status | 73 | broken_parity_status |
69 | 74 | ||
70 | and is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for | 75 | and is located in ``/sys/devices/pci<XXX>/0000:XX:YY.Z`` directories for |
71 | PCI devices. | 76 | PCI devices. |
72 | 77 | ||
73 | 78 | ||
74 | VERSIONING | 79 | Versioning |
75 | ---------- | 80 | ---------- |
76 | 81 | ||
77 | EDAC is composed of a "core" module (edac_core.ko) and several Memory | 82 | EDAC is composed of a "core" module (``edac_core.ko``) and several Memory |
78 | Controller (MC) driver modules. On a given system, the CORE is loaded | 83 | Controller (MC) driver modules. On a given system, the CORE is loaded |
79 | and one MC driver will be loaded. Both the CORE and the MC driver (or | 84 | and one MC driver will be loaded. Both the CORE and the MC driver (or |
80 | edac_device driver) have individual versions that reflect current | 85 | ``edac_device`` driver) have individual versions that reflect current |
81 | release level of their respective modules. | 86 | release level of their respective modules. |
82 | 87 | ||
83 | Thus, to "report" on what version a system is running, one must report | 88 | Thus, to "report" on what version a system is running, one must report |
84 | both the CORE's and the MC driver's versions. | 89 | both the CORE's and the MC driver's versions. |
85 | 90 | ||
86 | 91 | ||
87 | LOADING | 92 | Loading |
88 | ------- | 93 | ------- |
89 | 94 | ||
90 | If 'edac' was statically linked with the kernel then no loading | 95 | If ``edac`` was statically linked with the kernel then no loading |
91 | is necessary. If 'edac' was built as modules then simply modprobe | 96 | is necessary. If ``edac`` was built as modules then simply modprobe |
92 | the 'edac' pieces that you need. You should be able to modprobe | 97 | the ``edac`` pieces that you need. You should be able to modprobe |
93 | hardware-specific modules and have the dependencies load the necessary | 98 | hardware-specific modules and have the dependencies load the necessary |
94 | core modules. | 99 | core modules. |
95 | 100 | ||
96 | Example: | 101 | Example:: |
97 | 102 | ||
98 | $> modprobe amd76x_edac | 103 | $ modprobe amd76x_edac |
99 | 104 | ||
100 | loads both the amd76x_edac.ko memory controller module and the edac_mc.ko | 105 | loads both the ``amd76x_edac.ko`` memory controller module and the |
101 | core module. | 106 | ``edac_mc.ko`` core module. |
102 | 107 | ||
103 | 108 | ||
104 | SYSFS INTERFACE | 109 | Sysfs interface |
105 | --------------- | 110 | --------------- |
106 | 111 | ||
107 | EDAC presents a 'sysfs' interface for control and reporting purposes. It | 112 | EDAC presents a ``sysfs`` interface for control and reporting purposes. It |
108 | lives in the /sys/devices/system/edac directory. | 113 | lives in the /sys/devices/system/edac directory. |
109 | 114 | ||
110 | Within this directory there currently reside 2 components: | 115 | Within this directory there currently reside 2 components: |
111 | 116 | ||
117 | ======= ============================== | ||
112 | mc memory controller(s) system | 118 | mc memory controller(s) system |
113 | pci PCI control and status system | 119 | pci PCI control and status system |
120 | ======= ============================== | ||
114 | 121 | ||
115 | 122 | ||
116 | 123 | ||
117 | Memory Controller (mc) Model | 124 | Memory Controller (mc) Model |
118 | ---------------------------- | 125 | ---------------------------- |
119 | 126 | ||
120 | Each 'mc' device controls a set of DIMM memory modules. These modules | 127 | Each ``mc`` device controls a set of DIMM memory modules. These modules |
121 | are laid out in a Chip-Select Row (csrowX) and Channel table (chX). | 128 | are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``). |
122 | There can be multiple csrows and multiple channels. | 129 | There can be multiple csrows and multiple channels. |
123 | 130 | ||
124 | Memory controllers allow for several csrows, with 8 csrows being a | 131 | Memory controllers allow for several csrows, with 8 csrows being a |
@@ -129,28 +136,28 @@ Dual channels allows for 128 bit data transfers to/from the CPU from/to | |||
129 | memory. Some newer chipsets allow for more than 2 channels, like Fully | 136 | memory. Some newer chipsets allow for more than 2 channels, like Fully |
130 | Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels: | 137 | Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels: |
131 | 138 | ||
132 | 139 | +--------+-----------+-----------+ | |
133 | Channel 0 Channel 1 | 140 | | | Channel 0 | Channel 1 | |
134 | =================================== | 141 | +========+===========+===========+ |
135 | csrow0 | DIMM_A0 | DIMM_B0 | | 142 | | csrow0 | DIMM_A0 | DIMM_B0 | |
136 | csrow1 | DIMM_A0 | DIMM_B0 | | 143 | +--------+ | | |
137 | =================================== | 144 | | csrow1 | | | |
138 | 145 | +--------+-----------+-----------+ | |
139 | =================================== | 146 | | csrow2 | DIMM_A1 | DIMM_B1 | |
140 | csrow2 | DIMM_A1 | DIMM_B1 | | 147 | +--------+ | | |
141 | csrow3 | DIMM_A1 | DIMM_B1 | | 148 | | csrow3 | | | |
142 | =================================== | 149 | +--------+-----------+-----------+ |
143 | 150 | ||
144 | In the above example table there are 4 physical slots on the motherboard | 151 | In the above example table there are 4 physical slots on the motherboard |
145 | for memory DIMMs: | 152 | for memory DIMMs: |
146 | 153 | ||
147 | DIMM_A0 | 154 | - DIMM_A0 |
148 | DIMM_B0 | 155 | - DIMM_B0 |
149 | DIMM_A1 | 156 | - DIMM_A1 |
150 | DIMM_B1 | 157 | - DIMM_B1 |
151 | 158 | ||
152 | Labels for these slots are usually silk-screened on the motherboard. | 159 | Labels for these slots are usually silk-screened on the motherboard. |
153 | Slots labeled 'A' are channel 0 in this example. Slots labeled 'B' are | 160 | Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are |
154 | channel 1. Notice that there are two csrows possible on a physical DIMM. | 161 | channel 1. Notice that there are two csrows possible on a physical DIMM. |
155 | These csrows are allocated their csrow assignment based on the slot into | 162 | These csrows are allocated their csrow assignment based on the slot into |
156 | which the memory DIMM is placed. Thus, when 1 DIMM is placed in each | 163 | which the memory DIMM is placed. Thus, when 1 DIMM is placed in each |
@@ -166,8 +173,7 @@ csrow3. | |||
166 | The representation of the above is reflected in the directory | 173 | The representation of the above is reflected in the directory |
167 | tree in EDAC's sysfs interface. Starting in directory | 174 | tree in EDAC's sysfs interface. Starting in directory |
168 | /sys/devices/system/edac/mc each memory controller will be represented | 175 | /sys/devices/system/edac/mc each memory controller will be represented |
169 | by its own 'mcX' directory, where 'X' is the index of the MC. | 176 | by its own ``mcX`` directory, where ``X`` is the index of the MC:: |
170 | |||
171 | 177 | ||
172 | ..../edac/mc/ | 178 | ..../edac/mc/ |
173 | | | 179 | | |
@@ -176,9 +182,8 @@ by its own 'mcX' directory, where 'X' is the index of the MC. | |||
176 | |->mc2 | 182 | |->mc2 |
177 | .... | 183 | .... |
178 | 184 | ||
179 | Under each 'mcX' directory each 'csrowX' is again represented by a | 185 | Under each ``mcX`` directory each ``csrowX`` is again represented by a |
180 | 'csrowX', where 'X' is the csrow index: | 186 | ``csrowX``, where ``X`` is the csrow index:: |
181 | |||
182 | 187 | ||
183 | .../mc/mc0/ | 188 | .../mc/mc0/ |
184 | | | 189 | | |
@@ -194,17 +199,18 @@ csrow3 are populated, this indicates a dual ranked set of DIMMs for | |||
194 | channels 0 and 1. | 199 | channels 0 and 1. |
195 | 200 | ||
196 | 201 | ||
197 | Within each of the 'mcX' and 'csrowX' directories are several EDAC | 202 | Within each of the ``mcX`` and ``csrowX`` directories are several EDAC |
198 | control and attribute files. | 203 | control and attribute files. |
199 | 204 | ||
200 | 205 | ||
201 | 'mcX' directories | 206 | ``mcX`` directories |
202 | ----------------- | 207 | ------------------- |
203 | 208 | ||
204 | In 'mcX' directories are EDAC control and attribute files for | 209 | In ``mcX`` directories are EDAC control and attribute files for |
205 | this 'X' instance of the memory controllers. | 210 | this ``X`` instance of the memory controllers. |
206 | 211 | ||
207 | For a description of the sysfs API, please see: | 212 | For a description of the sysfs API, please see: |
213 | |||
208 | Documentation/ABI/testing/sysfs-devices-edac | 214 | Documentation/ABI/testing/sysfs-devices-edac |
209 | 215 | ||
210 | 216 | ||
@@ -329,21 +335,19 @@ this ``X`` memory module: | |||
329 | symlinks inside the sysfs mapping that are automatically created by | 335 | symlinks inside the sysfs mapping that are automatically created by |
330 | the sysfs subsystem. Currently, they serve no purpose. | 336 | the sysfs subsystem. Currently, they serve no purpose. |
331 | 337 | ||
332 | 'csrowX' directories | 338 | ``csrowX`` directories |
333 | -------------------- | 339 | ---------------------- |
334 | 340 | ||
335 | When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX | 341 | When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX |
336 | directories. As this API doesn't work properly for Rambus, FB-DIMMs and | 342 | directories. As this API doesn't work properly for Rambus, FB-DIMMs and |
337 | modern Intel Memory Controllers, this is being deprecated in favor of | 343 | modern Intel Memory Controllers, this is being deprecated in favor of |
338 | dimmX directories. | 344 | dimmX directories. |
339 | 345 | ||
340 | In the 'csrowX' directories are EDAC control and attribute files for | 346 | In the ``csrowX`` directories are EDAC control and attribute files for |
341 | this 'X' instance of csrow: | 347 | this ``X`` instance of csrow: |
342 | 348 | ||
343 | 349 | ||
344 | Total Uncorrectable Errors count attribute file: | 350 | - ``ue_count`` - Total Uncorrectable Errors count attribute file |
345 | |||
346 | 'ue_count' | ||
347 | 351 | ||
348 | This attribute file displays the total count of uncorrectable | 352 | This attribute file displays the total count of uncorrectable |
349 | errors that have occurred on this csrow. If panic_on_ue is set | 353 | errors that have occurred on this csrow. If panic_on_ue is set |
@@ -351,9 +355,7 @@ Total Uncorrectable Errors count attribute file: | |||
351 | will panic the system. | 355 | will panic the system. |
352 | 356 | ||
353 | 357 | ||
354 | Total Correctable Errors count attribute file: | 358 | - ``ce_count`` - Total Correctable Errors count attribute file |
355 | |||
356 | 'ce_count' | ||
357 | 359 | ||
358 | This attribute file displays the total count of correctable | 360 | This attribute file displays the total count of correctable |
359 | errors that have occurred on this csrow. This count is very | 361 | errors that have occurred on this csrow. This count is very |
@@ -363,65 +365,54 @@ Total Correctable Errors count attribute file: | |||
363 | to the system administrator. | 365 | to the system administrator. |
364 | 366 | ||
365 | 367 | ||
366 | Total memory managed by this csrow attribute file: | 368 | - ``size_mb`` - Total memory managed by this csrow attribute file |
367 | |||
368 | 'size_mb' | ||
369 | 369 | ||
370 | This attribute file displays, in count of megabytes, the memory | 370 | This attribute file displays, in count of megabytes, the memory |
371 | that this csrow contains. | 371 | that this csrow contains. |
372 | 372 | ||
373 | 373 | ||
374 | Memory Type attribute file: | 374 | - ``mem_type`` - Memory Type attribute file |
375 | |||
376 | 'mem_type' | ||
377 | 375 | ||
378 | This attribute file will display what type of memory is currently | 376 | This attribute file will display what type of memory is currently |
379 | on this csrow. Normally, either buffered or unbuffered memory. | 377 | on this csrow. Normally, either buffered or unbuffered memory. |
380 | Examples: | 378 | Examples: |
381 | Registered-DDR | ||
382 | Unbuffered-DDR | ||
383 | 379 | ||
380 | - Registered-DDR | ||
381 | - Unbuffered-DDR | ||
384 | 382 | ||
385 | EDAC Mode of operation attribute file: | ||
386 | 383 | ||
387 | 'edac_mode' | 384 | - ``edac_mode`` - EDAC Mode of operation attribute file |
388 | 385 | ||
389 | This attribute file will display what type of Error detection | 386 | This attribute file will display what type of Error detection |
390 | and correction is being utilized. | 387 | and correction is being utilized. |
391 | 388 | ||
392 | 389 | ||
393 | Device type attribute file: | 390 | - ``dev_type`` - Device type attribute file |
394 | |||
395 | 'dev_type' | ||
396 | 391 | ||
397 | This attribute file will display what type of DRAM device is | 392 | This attribute file will display what type of DRAM device is |
398 | being utilized on this DIMM. | 393 | being utilized on this DIMM. |
399 | Examples: | 394 | Examples: |
400 | x1 | ||
401 | x2 | ||
402 | x4 | ||
403 | x8 | ||
404 | 395 | ||
396 | - x1 | ||
397 | - x2 | ||
398 | - x4 | ||
399 | - x8 | ||
405 | 400 | ||
406 | Channel 0 CE Count attribute file: | ||
407 | 401 | ||
408 | 'ch0_ce_count' | 402 | - ``ch0_ce_count`` - Channel 0 CE Count attribute file |
409 | 403 | ||
410 | This attribute file will display the count of CEs on this | 404 | This attribute file will display the count of CEs on this |
411 | DIMM located in channel 0. | 405 | DIMM located in channel 0. |
412 | 406 | ||
413 | 407 | ||
414 | Channel 0 UE Count attribute file: | 408 | - ``ch0_ue_count`` - Channel 0 UE Count attribute file |
415 | |||
416 | 'ch0_ue_count' | ||
417 | 409 | ||
418 | This attribute file will display the count of UEs on this | 410 | This attribute file will display the count of UEs on this |
419 | DIMM located in channel 0. | 411 | DIMM located in channel 0. |
420 | 412 | ||
421 | 413 | ||
422 | Channel 0 DIMM Label control file: | 414 | - ``ch0_dimm_label`` - Channel 0 DIMM Label control file |
423 | 415 | ||
424 | 'ch0_dimm_label' | ||
425 | 416 | ||
426 | This control file allows this DIMM to have a label assigned | 417 | This control file allows this DIMM to have a label assigned |
427 | to it. With this label in the module, when errors occur | 418 | to it. With this label in the module, when errors occur |
@@ -436,25 +427,21 @@ Channel 0 DIMM Label control file: | |||
436 | must occur in userland at this time. | 427 | must occur in userland at this time. |
437 | 428 | ||
438 | 429 | ||
439 | Channel 1 CE Count attribute file: | 430 | - ``ch1_ce_count`` - Channel 1 CE Count attribute file |
440 | 431 | ||
441 | 'ch1_ce_count' | ||
442 | 432 | ||
443 | This attribute file will display the count of CEs on this | 433 | This attribute file will display the count of CEs on this |
444 | DIMM located in channel 1. | 434 | DIMM located in channel 1. |
445 | 435 | ||
446 | 436 | ||
447 | Channel 1 UE Count attribute file: | 437 | - ``ch1_ue_count`` - Channel 1 UE Count attribute file |
448 | 438 | ||
449 | 'ch1_ue_count' | ||
450 | 439 | ||
451 | This attribute file will display the count of UEs on this | 440 | This attribute file will display the count of UEs on this |
452 | DIMM located in channel 0. | 441 | DIMM located in channel 0. |
453 | 442 | ||
454 | 443 | ||
455 | Channel 1 DIMM Label control file: | 444 | - ``ch1_dimm_label`` - Channel 1 DIMM Label control file |
456 | |||
457 | 'ch1_dimm_label' | ||
458 | 445 | ||
459 | This control file allows this DIMM to have a label assigned | 446 | This control file allows this DIMM to have a label assigned |
460 | to it. With this label in the module, when errors occur | 447 | to it. With this label in the module, when errors occur |
@@ -469,33 +456,44 @@ Channel 1 DIMM Label control file: | |||
469 | must occur in userland at this time. | 456 | must occur in userland at this time. |
470 | 457 | ||
471 | 458 | ||
472 | 459 | System Logging | |
473 | SYSTEM LOGGING | ||
474 | -------------- | 460 | -------------- |
475 | 461 | ||
476 | If logging for UEs and CEs is enabled, then system logs will contain | 462 | If logging for UEs and CEs is enabled, then system logs will contain |
477 | information indicating that errors have been detected: | 463 | information indicating that errors have been detected:: |
478 | 464 | ||
479 | EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, | 465 | EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76x_edac |
480 | channel 1 "DIMM_B1": amd76x_edac | 466 | EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76x_edac |
481 | |||
482 | EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, | ||
483 | channel 1 "DIMM_B1": amd76x_edac | ||
484 | 467 | ||
485 | 468 | ||
486 | The structure of the message is: | 469 | The structure of the message is: |
487 | the memory controller (MC0) | 470 | |
488 | Error type (CE) | 471 | +---------------------------------------+-------------+ |
489 | memory page (0x283) | 472 | | Content + Example | |
490 | offset in the page (0xce0) | 473 | +=======================================+=============+ |
491 | the byte granularity (grain 8) | 474 | | The memory controller | MC0 | |
492 | or resolution of the error | 475 | +---------------------------------------+-------------+ |
493 | the error syndrome (0xb741) | 476 | | Error type | CE | |
494 | memory row (row 0) | 477 | +---------------------------------------+-------------+ |
495 | memory channel (channel 1) | 478 | | Memory page | 0x283 | |
496 | DIMM label, if set prior (DIMM B1 | 479 | +---------------------------------------+-------------+ |
497 | and then an optional, driver-specific message that may | 480 | | Offset in the page | 0xce0 | |
498 | have additional information. | 481 | +---------------------------------------+-------------+ |
482 | | The byte granularity | grain 8 | | ||
483 | | or resolution of the error | | | ||
484 | +---------------------------------------+-------------+ | ||
485 | | The error syndrome | 0xb741 | | ||
486 | +---------------------------------------+-------------+ | ||
487 | | Memory row | row 0 + | ||
488 | +---------------------------------------+-------------+ | ||
489 | | Memory channel | channel 1 | | ||
490 | +---------------------------------------+-------------+ | ||
491 | | DIMM label, if set prior | DIMM B1 | | ||
492 | +---------------------------------------+-------------+ | ||
493 | | And then an optional, driver-specific | | | ||
494 | | message that may have additional | | | ||
495 | | information. | | | ||
496 | +---------------------------------------+-------------+ | ||
499 | 497 | ||
500 | Both UEs and CEs with no info will lack all but memory controller, error | 498 | Both UEs and CEs with no info will lack all but memory controller, error |
501 | type, a notice of "no info" and then an optional, driver-specific error | 499 | type, a notice of "no info" and then an optional, driver-specific error |
@@ -512,43 +510,38 @@ Type 01 bridges, the secondary status register is also looked at to see | |||
512 | if parity occurred on the bus on the other side of the bridge. | 510 | if parity occurred on the bus on the other side of the bridge. |
513 | 511 | ||
514 | 512 | ||
515 | SYSFS CONFIGURATION | 513 | Sysfs configuration |
516 | ------------------- | 514 | ------------------- |
517 | 515 | ||
518 | Under /sys/devices/system/edac/pci are control and attribute files as follows: | 516 | Under ``/sys/devices/system/edac/pci`` are control and attribute files as |
517 | follows: | ||
519 | 518 | ||
520 | 519 | ||
521 | Enable/Disable PCI Parity checking control file: | 520 | - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file |
522 | |||
523 | 'check_pci_parity' | ||
524 | |||
525 | 521 | ||
526 | This control file enables or disables the PCI Bus Parity scanning | 522 | This control file enables or disables the PCI Bus Parity scanning |
527 | operation. Writing a 1 to this file enables the scanning. Writing | 523 | operation. Writing a 1 to this file enables the scanning. Writing |
528 | a 0 to this file disables the scanning. | 524 | a 0 to this file disables the scanning. |
529 | 525 | ||
530 | Enable: | 526 | Enable:: |
531 | echo "1" >/sys/devices/system/edac/pci/check_pci_parity | 527 | |
528 | echo "1" >/sys/devices/system/edac/pci/check_pci_parity | ||
532 | 529 | ||
533 | Disable: | 530 | Disable:: |
534 | echo "0" >/sys/devices/system/edac/pci/check_pci_parity | ||
535 | 531 | ||
532 | echo "0" >/sys/devices/system/edac/pci/check_pci_parity | ||
536 | 533 | ||
537 | Parity Count: | ||
538 | 534 | ||
539 | 'pci_parity_count' | 535 | - ``pci_parity_count`` - Parity Count |
540 | 536 | ||
541 | This attribute file will display the number of parity errors that | 537 | This attribute file will display the number of parity errors that |
542 | have been detected. | 538 | have been detected. |
543 | 539 | ||
544 | 540 | ||
545 | 541 | Module parameters | |
546 | MODULE PARAMETERS | ||
547 | ----------------- | 542 | ----------------- |
548 | 543 | ||
549 | Panic on UE control file: | 544 | - ``edac_mc_panic_on_ue`` - Panic on UE control file |
550 | |||
551 | 'edac_mc_panic_on_ue' | ||
552 | 545 | ||
553 | An uncorrectable error will cause a machine panic. This is usually | 546 | An uncorrectable error will cause a machine panic. This is usually |
554 | desirable. It is a bad idea to continue when an uncorrectable error | 547 | desirable. It is a bad idea to continue when an uncorrectable error |
@@ -557,40 +550,49 @@ Panic on UE control file: | |||
557 | corruption. If the kernel has MCE configured, then EDAC will never | 550 | corruption. If the kernel has MCE configured, then EDAC will never |
558 | notice the UE. | 551 | notice the UE. |
559 | 552 | ||
560 | LOAD TIME: module/kernel parameter: edac_mc_panic_on_ue=[0|1] | 553 | LOAD TIME:: |
554 | |||
555 | module/kernel parameter: edac_mc_panic_on_ue=[0|1] | ||
556 | |||
557 | RUN TIME:: | ||
561 | 558 | ||
562 | RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue | 559 | echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue |
563 | 560 | ||
564 | 561 | ||
565 | Log UE control file: | 562 | - ``edac_mc_log_ue`` - Log UE control file |
566 | 563 | ||
567 | 'edac_mc_log_ue' | ||
568 | 564 | ||
569 | Generate kernel messages describing uncorrectable errors. These errors | 565 | Generate kernel messages describing uncorrectable errors. These errors |
570 | are reported through the system message log system. UE statistics | 566 | are reported through the system message log system. UE statistics |
571 | will be accumulated even when UE logging is disabled. | 567 | will be accumulated even when UE logging is disabled. |
572 | 568 | ||
573 | LOAD TIME: module/kernel parameter: edac_mc_log_ue=[0|1] | 569 | LOAD TIME:: |
574 | 570 | ||
575 | RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue | 571 | module/kernel parameter: edac_mc_log_ue=[0|1] |
576 | 572 | ||
573 | RUN TIME:: | ||
577 | 574 | ||
578 | Log CE control file: | 575 | echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue |
576 | |||
577 | |||
578 | - ``edac_mc_log_ce`` - Log CE control file | ||
579 | 579 | ||
580 | 'edac_mc_log_ce' | ||
581 | 580 | ||
582 | Generate kernel messages describing correctable errors. These | 581 | Generate kernel messages describing correctable errors. These |
583 | errors are reported through the system message log system. | 582 | errors are reported through the system message log system. |
584 | CE statistics will be accumulated even when CE logging is disabled. | 583 | CE statistics will be accumulated even when CE logging is disabled. |
585 | 584 | ||
586 | LOAD TIME: module/kernel parameter: edac_mc_log_ce=[0|1] | 585 | LOAD TIME:: |
586 | |||
587 | module/kernel parameter: edac_mc_log_ce=[0|1] | ||
588 | |||
589 | RUN TIME:: | ||
587 | 590 | ||
588 | RUN TIME: echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce | 591 | echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce |
589 | 592 | ||
590 | 593 | ||
591 | Polling period control file: | 594 | - ``edac_mc_poll_msec`` - Polling period control file |
592 | 595 | ||
593 | 'edac_mc_poll_msec' | ||
594 | 596 | ||
595 | The time period, in milliseconds, for polling for error information. | 597 | The time period, in milliseconds, for polling for error information. |
596 | Too small a value wastes resources. Too large a value might delay | 598 | Too small a value wastes resources. Too large a value might delay |
@@ -599,27 +601,33 @@ Polling period control file: | |||
599 | default. Systems which require all the bandwidth they can get, may | 601 | default. Systems which require all the bandwidth they can get, may |
600 | increase this. | 602 | increase this. |
601 | 603 | ||
602 | LOAD TIME: module/kernel parameter: edac_mc_poll_msec=[0|1] | 604 | LOAD TIME:: |
603 | 605 | ||
604 | RUN TIME: echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec | 606 | module/kernel parameter: edac_mc_poll_msec=[0|1] |
605 | 607 | ||
608 | RUN TIME:: | ||
606 | 609 | ||
607 | Panic on PCI PARITY Error: | 610 | echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec |
608 | 611 | ||
609 | 'panic_on_pci_parity' | 612 | |
613 | - ``panic_on_pci_parity`` - Panic on PCI PARITY Error | ||
610 | 614 | ||
611 | 615 | ||
612 | This control file enables or disables panicking when a parity | 616 | This control file enables or disables panicking when a parity |
613 | error has been detected. | 617 | error has been detected. |
614 | 618 | ||
615 | 619 | ||
616 | module/kernel parameter: edac_panic_on_pci_pe=[0|1] | 620 | module/kernel parameter:: |
621 | |||
622 | edac_panic_on_pci_pe=[0|1] | ||
623 | |||
624 | Enable:: | ||
625 | |||
626 | echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe | ||
617 | 627 | ||
618 | Enable: | 628 | Disable:: |
619 | echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe | ||
620 | 629 | ||
621 | Disable: | 630 | echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe |
622 | echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe | ||
623 | 631 | ||
624 | 632 | ||
625 | 633 | ||
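The run-time controls above are ordinary files, so they can also be read and written from a script. The following is a minimal sketch, assuming the parameters live under ``/sys/module/edac_core/parameters`` as shown in the examples; the helper names ``read_param`` and ``write_param`` are hypothetical and are not part of EDAC.

```python
from pathlib import Path

# Directory used by the echo examples above; pass another directory when testing.
PARAM_DIR = Path("/sys/module/edac_core/parameters")


def read_param(name, base=PARAM_DIR):
    """Return the current value of a parameter file as a stripped string."""
    return (Path(base) / name).read_text().strip()


def write_param(name, value, base=PARAM_DIR):
    """Write a value to a parameter file, like `echo value > file`."""
    (Path(base) / name).write_text(f"{value}\n")
```

For example, ``write_param("edac_mc_poll_msec", 1000)`` corresponds to the ``echo "1000"`` command above and needs the same root permission.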
@@ -631,28 +639,31 @@ and APIs for the EDAC_DEVICE. | |||
631 | 639 | ||
632 | User space access to an edac_device is through the sysfs interface. | 640 | User space access to an edac_device is through the sysfs interface. |
633 | 641 | ||
634 | At the location /sys/devices/system/edac (sysfs) new edac_device devices will | 642 | At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices |
635 | appear. | 643 | will appear. |
636 | 644 | ||
637 | There is a three level tree beneath the above 'edac' directory. For example, | 645 | There is a three level tree beneath the above ``edac`` directory. For example, |
638 | the 'test_device_edac' device (found at the bluesmoke.sourceforge.net website) | 646 | the ``test_device_edac`` device (found at the http://bluesmoke.sourceforge.net |
639 | installs itself as: | 647 | website) installs itself as:: |
640 | 648 | ||
641 | /sys/devices/systm/edac/test-instance | 649 | /sys/devices/system/edac/test-instance |
642 | 650 | ||
643 | in this directory are various controls, a symlink and one or more 'instance' | 651 | in this directory are various controls, a symlink and one or more ``instance`` |
644 | directories. | 652 | directories. |
645 | 653 | ||
646 | The standard default controls are: | 654 | The standard default controls are: |
647 | 655 | ||
656 | ============== ======================================================= | ||
648 | log_ce boolean to log CE events | 657 | log_ce boolean to log CE events |
649 | log_ue boolean to log UE events | 658 | log_ue boolean to log UE events |
650 | panic_on_ue boolean to 'panic' the system if an UE is encountered | 659 | panic_on_ue boolean to ``panic`` the system if an UE is encountered |
651 | (default off, can be set true via startup script) | 660 | (default off, can be set true via startup script) |
652 | poll_msec time period between POLL cycles for events | 661 | poll_msec time period between POLL cycles for events |
662 | ============== ======================================================= | ||
653 | 663 | ||
654 | The test_device_edac device adds at least one of its own custom control: | 664 | The test_device_edac device adds at least one of its own custom control: |
655 | 665 | ||
666 | ============== ================================================== | ||
656 | test_bits which in the current test driver does nothing but | 667 | test_bits which in the current test driver does nothing but |
657 | show how it is installed. A ported driver can | 668 | show how it is installed. A ported driver can |
658 | add one or more such controls and/or attributes | 669 | add one or more such controls and/or attributes |
@@ -660,42 +671,52 @@ The test_device_edac device adds at least one of its own custom control: | |||
660 | One out-of-tree driver uses controls here to allow | 671 | One out-of-tree driver uses controls here to allow |
661 | for ERROR INJECTION operations to hardware | 672 | for ERROR INJECTION operations to hardware |
662 | injection registers | 673 | injection registers |
674 | ============== ================================================== | ||
663 | 675 | ||
664 | The symlink points to the 'struct dev' that is registered for this edac_device. | 676 | The symlink points to the 'struct dev' that is registered for this edac_device. |
665 | 677 | ||
666 | INSTANCES | 678 | Instances |
667 | --------- | 679 | --------- |
668 | 680 | ||
669 | One or more instance directories are present. For the 'test_device_edac' case: | 681 | One or more instance directories are present. For the ``test_device_edac`` |
682 | case: | ||
670 | 683 | ||
671 | test-instance0 | 684 | +----------------+ |
685 | | test-instance0 | | ||
686 | +----------------+ | ||
672 | 687 | ||
673 | 688 | ||
674 | In this directory there are two default counter attributes, which are totals of | 689 | In this directory there are two default counter attributes, which are totals of |
675 | counters in deeper subdirectories. | 690 | counters in deeper subdirectories. |
676 | 691 | ||
692 | ============== ==================================== | ||
677 | ce_count total of CE events of subdirectories | 693 | ce_count total of CE events of subdirectories |
678 | ue_count total of UE events of subdirectories | 694 | ue_count total of UE events of subdirectories |
695 | ============== ==================================== | ||
679 | 696 | ||
680 | BLOCKS | 697 | Blocks |
681 | ------ | 698 | ------ |
682 | 699 | ||
683 | At the lowest directory level is the 'block' directory. There can be 0, 1 | 700 | At the lowest directory level is the ``block`` directory. There can be 0, 1 |
684 | or more blocks specified in each instance. | 701 | or more blocks specified in each instance: |
685 | |||
686 | test-block0 | ||
687 | 702 | ||
703 | +-------------+ | ||
704 | | test-block0 | | ||
705 | +-------------+ | ||
688 | 706 | ||
689 | In this directory the default attributes are: | 707 | In this directory the default attributes are: |
690 | 708 | ||
691 | ce_count which is counter of CE events for this 'block' | 709 | ============== ================================================ |
710 | ce_count which is counter of CE events for this ``block`` | ||
692 | of hardware being monitored | 711 | of hardware being monitored |
693 | ue_count which is counter of UE events for this 'block' | 712 | ue_count which is counter of UE events for this ``block`` |
694 | of hardware being monitored | 713 | of hardware being monitored |
714 | ============== ================================================ | ||
695 | 715 | ||
696 | 716 | ||
697 | The 'test_device_edac' device adds 4 attributes and 1 control: | 717 | The ``test_device_edac`` device adds 4 attributes and 1 control: |
698 | 718 | ||
719 | ================== ==================================================== | ||
699 | test-block-bits-0 for every POLL cycle this counter | 720 | test-block-bits-0 for every POLL cycle this counter |
700 | is incremented | 721 | is incremented |
701 | test-block-bits-1 every 10 cycles, this counter is bumped once, | 722 | test-block-bits-1 every 10 cycles, this counter is bumped once, |
@@ -704,20 +725,23 @@ The 'test_device_edac' device adds 4 attributes and 1 control: | |||
704 | and test-block-bits-1 is set to 0 | 725 | and test-block-bits-1 is set to 0 |
705 | test-block-bits-3 every 1000 cycles, this counter is bumped once, | 726 | test-block-bits-3 every 1000 cycles, this counter is bumped once, |
706 | and test-block-bits-2 is set to 0 | 727 | and test-block-bits-2 is set to 0 |
728 | ================== ==================================================== | ||
707 | 729 | ||
708 | 730 | ||
731 | ================== ==================================================== | ||
709 | reset-counters writing ANY thing to this control will | 732 | reset-counters writing ANY thing to this control will |
710 | reset all the above counters. | 733 | reset all the above counters. |
734 | ================== ==================================================== | ||
711 | 735 | ||
712 | 736 | ||
713 | Use of the 'test_device_edac' driver should enable any others to create their own | 737 | Use of the ``test_device_edac`` driver should enable any others to create their own |
714 | unique drivers for their hardware systems. | 738 | unique drivers for their hardware systems. |
715 | 739 | ||
716 | The 'test_device_edac' sample driver is located at the | 740 | The ``test_device_edac`` sample driver is located at the |
717 | bluesmoke.sourceforge.net project site for EDAC. | 741 | http://bluesmoke.sourceforge.net project site for EDAC. |
718 | 742 | ||
719 | 743 | ||
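The device / instance / block layout described above can be walked with a few lines of code. This is a sketch only: it assumes that every instance and block directory holds the ``ce_count`` and ``ue_count`` attributes as documented, and the function names are hypothetical.

```python
from pathlib import Path


def read_count(path):
    """Read one counter attribute and return it as an integer."""
    return int(Path(path).read_text().strip())


def collect_counts(device_dir):
    """Return {instance: {"ce_count", "ue_count", "blocks": {block: (ce, ue)}}}."""
    result = {}
    for inst in sorted(Path(device_dir).iterdir()):
        # Control files are skipped, as is any directory (for example the
        # symlink target) that does not carry the counter attributes.
        if not inst.is_dir() or not (inst / "ce_count").is_file():
            continue
        blocks = {}
        for blk in sorted(inst.iterdir()):
            if blk.is_dir() and (blk / "ce_count").is_file():
                blocks[blk.name] = (
                    read_count(blk / "ce_count"),
                    read_count(blk / "ue_count"),
                )
        result[inst.name] = {
            "ce_count": read_count(inst / "ce_count"),
            "ue_count": read_count(inst / "ue_count"),
            "blocks": blocks,
        }
    return result
```

With the sample driver loaded, the directory to pass would be the ``/sys/devices/system/edac/test-instance`` path shown earlier.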
720 | NEHALEM USAGE OF EDAC APIs | 744 | Nehalem Usage of EDAC APIs |
721 | -------------------------- | 745 | -------------------------- |
722 | 746 | ||
723 | This chapter documents some EXPERIMENTAL mappings for EDAC API to handle | 747 | This chapter documents some EXPERIMENTAL mappings for EDAC API to handle |
@@ -739,7 +763,8 @@ were done at i7core_edac driver. This chapter will cover those differences | |||
739 | As the minimum unit that the EDAC API maps is the csrow, the driver sequentially | 763 | As the minimum unit that the EDAC API maps is the csrow, the driver sequentially |
740 | maps channel/dimm into different csrows. | 764 | maps channel/dimm into different csrows. |
741 | 765 | ||
742 | For example, supposing the following layout: | 766 | For example, supposing the following layout:: |
767 | |||
743 | Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs | 768 | Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs |
744 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 | 769 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 |
745 | dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400 | 770 | dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400 |
@@ -747,14 +772,15 @@ were done at i7core_edac driver. This chapter will cover those differences | |||
747 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 | 772 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 |
748 | Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs | 773 | Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs |
749 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 | 774 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 |
750 | The driver will map it as: | 775 | |
776 | The driver will map it as:: | ||
777 | |||
751 | csrow0: channel 0, dimm0 | 778 | csrow0: channel 0, dimm0 |
752 | csrow1: channel 0, dimm1 | 779 | csrow1: channel 0, dimm1 |
753 | csrow2: channel 1, dimm0 | 780 | csrow2: channel 1, dimm0 |
754 | csrow3: channel 2, dimm0 | 781 | csrow3: channel 2, dimm0 |
755 | 782 | ||
756 | exports one | 783 | exports one DIMM per csrow. |
757 | DIMM per csrow. | ||
758 | 784 | ||
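The sequential channel/dimm to csrow mapping above can be illustrated with a short sketch. It is not taken from the i7core_edac source; the function name and the input format (a dict giving the number of populated DIMMs per channel) are assumptions made for the example.

```python
def map_csrows(dimms_per_channel):
    """Assign csrow numbers sequentially, one DIMM per csrow.

    dimms_per_channel maps a channel number to its count of populated DIMMs.
    Returns {csrow: (channel, dimm)}.
    """
    mapping = {}
    csrow = 0
    for channel in sorted(dimms_per_channel):
        for dimm in range(dimms_per_channel[channel]):
            mapping[csrow] = (channel, dimm)
            csrow += 1
    return mapping
```

For the layout in the example (two DIMMs on channel 0, one each on channels 1 and 2), ``map_csrows({0: 2, 1: 1, 2: 1})`` yields the same four csrows as listed above.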
759 | Each QPI is exported as a different memory controller. | 785 | Each QPI is exported as a different memory controller. |
760 | 786 | ||
@@ -762,47 +788,53 @@ exports one | |||
762 | functionality via some error injection nodes: | 788 | functionality via some error injection nodes: |
763 | 789 | ||
764 | For injecting a memory error, there are some sysfs nodes, under | 790 | For injecting a memory error, there are some sysfs nodes, under |
765 | /sys/devices/system/edac/mc/mc?/: | 791 | ``/sys/devices/system/edac/mc/mc?/``: |
766 | 792 | ||
767 | inject_addrmatch/*: | 793 | - ``inject_addrmatch/*``: |
768 | Controls the error injection mask register. It is possible to specify | 794 | Controls the error injection mask register. It is possible to specify |
769 | several characteristics of the address to match an error code: | 795 | several characteristics of the address to match an error code:: |
796 | |||
770 | dimm = the affected dimm. Numbers are relative to a channel; | 797 | dimm = the affected dimm. Numbers are relative to a channel; |
771 | rank = the memory rank; | 798 | rank = the memory rank; |
772 | channel = the channel that will generate an error; | 799 | channel = the channel that will generate an error; |
773 | bank = the affected bank; | 800 | bank = the affected bank; |
774 | page = the page address; | 801 | page = the page address; |
775 | column (or col) = the address column. | 802 | column (or col) = the address column. |
803 | |||
776 | each of the above values can be set to "any" to match any valid value. | 804 | each of the above values can be set to "any" to match any valid value. |
777 | 805 | ||
778 | At driver init, all values are set to any. | 806 | At driver init, all values are set to any. |
779 | 807 | ||
780 | For example, to generate an error at rank 1 of dimm 2, for any channel, | 808 | For example, to generate an error at rank 1 of dimm 2, for any channel, |
781 | any bank, any page, any column: | 809 | any bank, any page, any column:: |
810 | |||
782 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm | 811 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm |
783 | echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank | 812 | echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank |
784 | 813 | ||
785 | To return to the default behaviour of matching any, you can do: | 814 | To return to the default behaviour of matching any, you can do:: |
815 | |||
786 | echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm | 816 | echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm |
787 | echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank | 817 | echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank |
788 | 818 | ||
789 | inject_eccmask: | 819 | - ``inject_eccmask``: |
790 | specifies what bits will have troubles, | 820 | specifies what bits will have troubles, |
821 | |||
822 | - ``inject_section``: | ||
823 | specifies what ECC cache section will get the error:: | ||
791 | 824 | ||
792 | inject_section: | ||
793 | specifies what ECC cache section will get the error: | ||
794 | 3 for both | 825 | 3 for both |
795 | 2 for the highest | 826 | 2 for the highest |
796 | 1 for the lowest | 827 | 1 for the lowest |
797 | 828 | ||
798 | inject_type: | 829 | - ``inject_type``: |
799 | specifies the type of error, being a combination of the following bits: | 830 | specifies the type of error, being a combination of the following bits:: |
831 | |||
800 | bit 0 - repeat | 832 | bit 0 - repeat |
801 | bit 1 - ecc | 833 | bit 1 - ecc |
802 | bit 2 - parity | 834 | bit 2 - parity |
803 | 835 | ||
804 | inject_enable starts the error generation when something different | 836 | - ``inject_enable``: |
805 | than 0 is written. | 837 | starts the error generation when something different than 0 is written. |
806 | 838 | ||
807 | All inject vars can be read. root permission is needed for write. | 839 | All inject vars can be read. root permission is needed for write. |
808 | 840 | ||
@@ -811,21 +843,21 @@ exports one | |||
811 | also produce an error. | 843 | also produce an error. |
812 | 844 | ||
813 | For example, the following code will generate an error for any write access | 845 | For example, the following code will generate an error for any write access |
814 | at socket 0, on any DIMM/address on channel 2: | 846 | at socket 0, on any DIMM/address on channel 2:: |
815 | 847 | ||
816 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel | 848 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel |
817 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_type | 849 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_type |
818 | echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask | 850 | echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask |
819 | echo 3 >/sys/devices/system/edac/mc/mc0/inject_section | 851 | echo 3 >/sys/devices/system/edac/mc/mc0/inject_section |
820 | echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable | 852 | echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable |
821 | dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null | 853 | dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null |
822 | 854 | ||
823 | For socket 1, it is needed to replace "mc0" by "mc1" at the above | 855 | For socket 1, it is needed to replace "mc0" by "mc1" at the above |
824 | commands. | 856 | commands. |
825 | 857 | ||
826 | The generated error message will look like: | 858 | The generated error message will look like:: |
827 | 859 | ||
828 | EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error)) | 860 | EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error)) |
829 | 861 | ||
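The ``inject_type`` value is a bit mask, so it can help to build it from named flags. The sketch below encodes the three bits listed above and produces the sequence of sysfs writes used in the channel 2 example; it only returns the (path, value) pairs and does not touch the hardware. The function names are hypothetical.

```python
def inject_type(repeat=False, ecc=False, parity=False):
    """Combine the documented bits: bit 0 repeat, bit 1 ecc, bit 2 parity."""
    return (1 if repeat else 0) | (2 if ecc else 0) | (4 if parity else 0)


def injection_writes(mc=0, channel="any", eccmask=0, section=3, **type_flags):
    """Return the (path, value) writes for one injection, in order."""
    base = f"/sys/devices/system/edac/mc/mc{mc}"
    return [
        (f"{base}/inject_addrmatch/channel", str(channel)),
        (f"{base}/inject_type", str(inject_type(**type_flags))),
        (f"{base}/inject_eccmask", str(eccmask)),
        (f"{base}/inject_section", str(section)),
        # Writing a non-zero value starts the error generation.
        (f"{base}/inject_enable", "1"),
    ]
```

Calling ``injection_writes(mc=0, channel=2, eccmask=64, section=3, ecc=True)`` reproduces the values written by the ``echo`` commands in the example; for socket 1, pass ``mc=1``.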
830 | 3) Nehalem specific Corrected Error memory counters | 862 | 3) Nehalem specific Corrected Error memory counters |
831 | 863 | ||
@@ -837,9 +869,9 @@ exports one | |||
837 | granularity than the default ones), the driver exposes those registers for | 869 | granularity than the default ones), the driver exposes those registers for |
838 | UDIMM memories. | 870 | UDIMM memories. |
839 | 871 | ||
840 | They can be read by looking at the contents of all_channel_counts/ | 872 | They can be read by looking at the contents of ``all_channel_counts/``:: |
841 | 873 | ||
842 | $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done | 874 | $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done |
843 | /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0 | 875 | /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0 |
844 | 0 | 876 | 0 |
845 | /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1 | 877 | /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1 |
@@ -849,17 +881,21 @@ exports one | |||
849 | 881 | ||
850 | What happens here is that errors on different csrows, but at the same | 882 | What happens here is that errors on different csrows, but at the same |
851 | dimm number will increment the same counter. | 883 | dimm number will increment the same counter. |
852 | So, in this memory mapping: | 884 | So, in this memory mapping:: |
885 | |||
853 | csrow0: channel 0, dimm0 | 886 | csrow0: channel 0, dimm0 |
854 | csrow1: channel 0, dimm1 | 887 | csrow1: channel 0, dimm1 |
855 | csrow2: channel 1, dimm0 | 888 | csrow2: channel 1, dimm0 |
856 | csrow3: channel 2, dimm0 | 889 | csrow3: channel 2, dimm0 |
890 | |||
857 | The hardware will increment udimm0 for an error at the first dimm at either | 891 | The hardware will increment udimm0 for an error at the first dimm at either |
858 | csrow0, csrow2 or csrow3; | 892 | csrow0, csrow2 or csrow3; |
893 | |||
859 | The hardware will increment udimm1 for an error at the second dimm at either | 894 | The hardware will increment udimm1 for an error at the second dimm at either |
860 | csrow0, csrow2 or csrow3; | 895 | csrow0, csrow2 or csrow3; |
896 | |||
861 | The hardware will increment udimm2 for an error at the third dimm at either | 897 | The hardware will increment udimm2 for an error at the third dimm at either |
862 | csrow0, csrow2 or csrow3; | 898 | csrow0, csrow2 or csrow3; |
863 | 899 | ||
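The counter sharing described above follows from the counters being indexed by dimm number only. A minimal sketch, using the example csrow mapping from this section (the names ``CSROW_MAP`` and ``udimm_counter`` are hypothetical):

```python
# csrow -> (channel, dimm), as in the example memory mapping above.
CSROW_MAP = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (2, 0)}


def udimm_counter(csrow, csrow_map=CSROW_MAP):
    """Return the name of the udimm counter that an error on this csrow bumps."""
    _channel, dimm = csrow_map[csrow]
    # Only the dimm number selects the counter; the channel is ignored.
    return f"udimm{dimm}"
```

With this mapping, csrow0, csrow2 and csrow3 all resolve to ``udimm0``, while csrow1 resolves to ``udimm1``.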
864 | 4) Standard error counters | 900 | 4) Standard error counters |
865 | 901 | ||
@@ -868,65 +904,68 @@ exports one | |||
868 | possible that some errors could be lost. With rdimm's, they display the | 904 | possible that some errors could be lost. With rdimm's, they display the |
869 | contents of the registers | 905 | contents of the registers |
870 | 906 | ||
871 | AMD64_EDAC REFERENCE DOCUMENTS USED | 907 | Reference documents used on ``amd64_edac`` |
872 | ----------------------------------- | 908 | ------------------------------------------ |
873 | amd64_edac module is based on the following documents | 909 | |
910 | ``amd64_edac`` module is based on the following documents | ||
874 | (available from http://support.amd.com/en-us/search/tech-docs): | 911 | (available from http://support.amd.com/en-us/search/tech-docs): |
875 | 912 | ||
876 | 1. Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD | 913 | 1. :Title: BIOS and Kernel Developer's Guide for AMD Athlon 64 and AMD |
877 | Opteron Processors | 914 | Opteron Processors |
878 | AMD publication #: 26094 | 915 | :AMD publication #: 26094 |
879 | Revision: 3.26 | 916 | :Revision: 3.26 |
880 | Link: http://support.amd.com/TechDocs/26094.PDF | 917 | :Link: http://support.amd.com/TechDocs/26094.PDF |
881 | 918 | ||
882 | 2. Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh | 919 | 2. :Title: BIOS and Kernel Developer's Guide for AMD NPT Family 0Fh |
883 | Processors | 920 | Processors |
884 | AMD publication #: 32559 | 921 | :AMD publication #: 32559 |
885 | Revision: 3.00 | 922 | :Revision: 3.00 |
886 | Issue Date: May 2006 | 923 | :Issue Date: May 2006 |
887 | Link: http://support.amd.com/TechDocs/32559.pdf | 924 | :Link: http://support.amd.com/TechDocs/32559.pdf |
888 | 925 | ||
889 | 3. Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h | 926 | 3. :Title: BIOS and Kernel Developer's Guide (BKDG) For AMD Family 10h |
890 | Processors | 927 | Processors |
891 | AMD publication #: 31116 | 928 | :AMD publication #: 31116 |
892 | Revision: 3.00 | 929 | :Revision: 3.00 |
893 | Issue Date: September 07, 2007 | 930 | :Issue Date: September 07, 2007 |
894 | Link: http://support.amd.com/TechDocs/31116.pdf | 931 | :Link: http://support.amd.com/TechDocs/31116.pdf |
895 | 932 | ||
896 | 4. Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h | 933 | 4. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h |
897 | Models 30h-3Fh Processors | 934 | Models 30h-3Fh Processors |
898 | AMD publication #: 49125 | 935 | :AMD publication #: 49125 |
899 | Revision: 3.06 | 936 | :Revision: 3.06 |
900 | Issue Date: 2/12/2015 (latest release) | 937 | :Issue Date: 2/12/2015 (latest release) |
901 | Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf | 938 | :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf |
902 | 939 | ||
903 | 5. Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h | 940 | 5. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 15h |
904 | Models 60h-6Fh Processors | 941 | Models 60h-6Fh Processors |
905 | AMD publication #: 50742 | 942 | :AMD publication #: 50742 |
906 | Revision: 3.01 | 943 | :Revision: 3.01 |
907 | Issue Date: 7/23/2015 (latest release) | 944 | :Issue Date: 7/23/2015 (latest release) |
908 | Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf | 945 | :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf |
909 | 946 | ||
910 | 6. Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h | 947 | 6. :Title: BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h |
911 | Models 00h-0Fh Processors | 948 | Models 00h-0Fh Processors |
912 | AMD publication #: 48751 | 949 | :AMD publication #: 48751 |
913 | Revision: 3.03 | 950 | :Revision: 3.03 |
914 | Issue Date: 2/23/2015 (latest release) | 951 | :Issue Date: 2/23/2015 (latest release) |
915 | Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf | 952 | :Link: http://support.amd.com/TechDocs/48751_16h_bkdg.pdf |
953 | |||
954 | Credits | ||
955 | ======= | ||
956 | |||
957 | * Written by Doug Thompson <dougthompson@xmission.com> | ||
916 | 958 | ||
917 | CREDITS: | 959 | - 7 Dec 2005 |
918 | ======== | 960 | - 17 Jul 2007 Updated |
919 | 961 | ||
920 | Written by Doug Thompson <dougthompson@xmission.com> | 962 | * |copy| Mauro Carvalho Chehab |
921 | 7 Dec 2005 | ||
922 | 17 Jul 2007 Updated | ||
923 | 963 | ||
924 | (c) Mauro Carvalho Chehab | 964 | - 05 Aug 2009 Nehalem interface |
925 | 05 Aug 2009 Nehalem interface | ||
926 | 965 | ||
927 | EDAC authors/maintainers: | 966 | * EDAC authors/maintainers: |
928 | 967 | ||
929 | Doug Thompson, Dave Jiang, Dave Peterson et al, | 968 | - Doug Thompson, Dave Jiang, Dave Peterson et al, |
930 | Mauro Carvalho Chehab | 969 | - Mauro Carvalho Chehab |
931 | Borislav Petkov | 970 | - Borislav Petkov |
932 | original author: Thayne Harbaugh | 971 | - original author: Thayne Harbaugh |