author    Borislav Petkov <bp@suse.de>  2015-06-19 05:47:17 -0400
committer Borislav Petkov <bp@suse.de>  2015-06-24 12:17:40 -0400
commit    043b43180efee8dcc41dde5ca710827b26d17510
tree      3f79ff9aa658e142404a085083e48d20ef5b1242
parent    3aae9edd5a63e226baf3375bb8f7e8d05f5d9098
EDAC: Update Documentation/edac.txt
Do some initial cleanup, more probably will come.

- Move credits section to the end
- Update maintainers
- Drop sourceforge reference - project is long upstream now
- Reformat sections
- Reformat paragraphs
- Clarify text
- Bring it up-to-date
- Drop useless "future hardware scanning" section

Signed-off-by: Borislav Petkov <bp@suse.de>
-rw-r--r--  Documentation/edac.txt  273
1 files changed, 130 insertions, 143 deletions
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index 4df786e73e87..0cf27a3544a5 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -1,53 +1,34 @@
1
2
3EDAC - Error Detection And Correction 1EDAC - Error Detection And Correction
4 2=====================================
5Written by Doug Thompson <dougthompson@xmission.com>
67 Dec 2005
717 Jul 2007 Updated
8
9(c) Mauro Carvalho Chehab
1005 Aug 2009 Nehalem interface
11
12EDAC is maintained and written by:
13
14 Doug Thompson, Dave Jiang, Dave Peterson et al,
15 original author: Thayne Harbaugh,
16
17Contact:
18 website: bluesmoke.sourceforge.net
19 mailing list: bluesmoke-devel@lists.sourceforge.net
20 3
21"bluesmoke" was the name for this device driver when it was "out-of-tree" 4"bluesmoke" was the name for this device driver when it was "out-of-tree"
22and maintained at sourceforge.net. When it was pushed into 2.6.16 for the 5and maintained at sourceforge.net. When it was pushed into 2.6.16 for the
23first time, it was renamed to 'EDAC'. 6first time, it was renamed to 'EDAC'.
24 7
25The bluesmoke project at sourceforge.net is now utilized as a 'staging area' 8PURPOSE
26for EDAC development, before it is sent upstream to kernel.org 9-------
27
28At the bluesmoke/EDAC project site, there is a series of quilt patches against
29recent kernels, stored in a SVN repository. For easier downloading, there
30is also a tarball snapshot available.
31 10
32============================================================================ 11The 'edac' kernel module's goal is to detect and report hardware errors
33EDAC PURPOSE 12that occur within the computer system running under linux.
34
35The 'edac' kernel module goal is to detect and report errors that occur
36within the computer system running under linux.
37 13
38MEMORY 14MEMORY
15------
39 16
40In the initial release, memory Correctable Errors (CE) and Uncorrectable 17Memory Correctable Errors (CE) and Uncorrectable Errors (UE) are the
41Errors (UE) are the primary errors being harvested. These types of errors 18primary errors being harvested. These types of errors are harvested by
42are harvested by the 'edac_mc' class of device. 19the 'edac_mc' device.
43 20
44Detecting CE events, then harvesting those events and reporting them, 21Detecting CE events, then harvesting those events and reporting them,
45CAN be a predictor of future UE events. With CE events, the system can 22*can* but must not necessarily be a predictor of future UE events. With
46continue to operate, but with less safety. Preventive maintenance and 23CE events only, the system can and will continue to operate as no data
47proactive part replacement of memory DIMMs exhibiting CEs can reduce 24has been damaged yet.
48the likelihood of the dreaded UE events and system 'panics'. 25
26However, preventive maintenance and proactive part replacement of memory
27DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events
28and system panics.
49 29
50NON-MEMORY 30OTHER HARDWARE ELEMENTS
31-----------------------
51 32
52A new feature for EDAC, the edac_device class of device, was added in 33A new feature for EDAC, the edac_device class of device, was added in
53the 2.6.23 version of the kernel. 34the 2.6.23 version of the kernel.
@@ -56,70 +37,57 @@ This new device type allows for non-memory type of ECC hardware detectors
56to have their states harvested and presented to userspace via the sysfs 37to have their states harvested and presented to userspace via the sysfs
57interface. 38interface.
58 39
59Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA 40Some architectures have ECC detectors for L1, L2 and L3 caches,
60engines, fabric switches, main data path switches, interconnections, 41along with DMA engines, fabric switches, main data path switches,
61and various other hardware data paths. If the hardware reports it, then 42interconnections, and various other hardware data paths. If the hardware
62a edac_device device probably can be constructed to harvest and present 43reports it, then a edac_device device probably can be constructed to
63that to userspace. 44harvest and present that to userspace.
64 45
65 46
66PCI BUS SCANNING 47PCI BUS SCANNING
48----------------
67 49
68In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices 50In addition, PCI devices are scanned for PCI Bus Parity and SERR Errors
69in order to determine if errors are occurring on data transfers. 51in order to determine if errors are occurring during data transfers.
70 52
71The presence of PCI Parity errors must be examined with a grain of salt. 53The presence of PCI Parity errors must be examined with a grain of salt.
72There are several add-in adapters that do NOT follow the PCI specification 54There are several add-in adapters that do *not* follow the PCI specification
73with regards to Parity generation and reporting. The specification says 55with regards to Parity generation and reporting. The specification says
74the vendor should tie the parity status bits to 0 if they do not intend 56the vendor should tie the parity status bits to 0 if they do not intend
75to generate parity. Some vendors do not do this, and thus the parity bit 57to generate parity. Some vendors do not do this, and thus the parity bit
76can "float" giving false positives. 58can "float" giving false positives.
77 59
78In the kernel there is a PCI device attribute located in sysfs that is 60There is a PCI device attribute located in sysfs that is checked by
79checked by the EDAC PCI scanning code. If that attribute is set, 61the EDAC PCI scanning code. If that attribute is set, PCI parity/error
80PCI parity/error scanning is skipped for that device. The attribute 62scanning is skipped for that device. The attribute is:
81is:
82 63
83 broken_parity_status 64 broken_parity_status
84 65
85as is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for 66and is located in /sys/devices/pci<XXX>/0000:XX:YY.Z directories for
86PCI devices. 67PCI devices.
87 68
88FUTURE HARDWARE SCANNING
89 69
90EDAC will have future error detectors that will be integrated with 70VERSIONING
91EDAC or added to it, in the following list: 71----------
92
93 MCE Machine Check Exception
94 MCA Machine Check Architecture
95 NMI NMI notification of ECC errors
96 MSRs Machine Specific Register error cases
97 and other mechanisms.
98
99These errors are usually bus errors, ECC errors, thermal throttling
100and the like.
101
102
103============================================================================
104EDAC VERSIONING
105 72
106EDAC is composed of a "core" module (edac_core.ko) and several Memory 73EDAC is composed of a "core" module (edac_core.ko) and several Memory
107Controller (MC) driver modules. On a given system, the CORE 74Controller (MC) driver modules. On a given system, the CORE is loaded
108is loaded and one MC driver will be loaded. Both the CORE and 75and one MC driver will be loaded. Both the CORE and the MC driver (or
109the MC driver (or edac_device driver) have individual versions that reflect 76edac_device driver) have individual versions that reflect current
110current release level of their respective modules. 77release level of their respective modules.
111 78
112Thus, to "report" on what version a system is running, one must report both 79Thus, to "report" on what version a system is running, one must report
113the CORE's and the MC driver's versions. 80both the CORE's and the MC driver's versions.
114 81
115 82
116LOADING 83LOADING
84-------
117 85
118If 'edac' was statically linked with the kernel then no loading is 86If 'edac' was statically linked with the kernel then no loading
119necessary. If 'edac' was built as modules then simply modprobe the 87is necessary. If 'edac' was built as modules then simply modprobe
120'edac' pieces that you need. You should be able to modprobe 88the 'edac' pieces that you need. You should be able to modprobe
121hardware-specific modules and have the dependencies load the necessary core 89hardware-specific modules and have the dependencies load the necessary
122modules. 90core modules.
123 91
124Example: 92Example:
125 93
@@ -129,35 +97,33 @@ loads both the amd76x_edac.ko memory controller module and the edac_mc.ko
129core module. 97core module.
130 98
131 99
132============================================================================ 100SYSFS INTERFACE
133EDAC sysfs INTERFACE 101---------------
134
135EDAC presents a 'sysfs' interface for control, reporting and attribute
136reporting purposes.
137 102
138EDAC lives in the /sys/devices/system/edac directory. 103EDAC presents a 'sysfs' interface for control and reporting purposes. It
104lives in the /sys/devices/system/edac directory.
139 105
140Within this directory there currently reside 2 'edac' components: 106Within this directory there currently reside 2 components:
141 107
142 mc memory controller(s) system 108 mc memory controller(s) system
143 pci PCI control and status system 109 pci PCI control and status system
144 110
145 111
146============================================================================ 112
147Memory Controller (mc) Model 113Memory Controller (mc) Model
114----------------------------
148 115
149First a background on the memory controller's model abstracted in EDAC. 116Each 'mc' device controls a set of DIMM memory modules. These modules
150Each 'mc' device controls a set of DIMM memory modules. These modules are 117are laid out in a Chip-Select Row (csrowX) and Channel table (chX).
151laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can 118There can be multiple csrows and multiple channels.
152be multiple csrows and multiple channels.
153 119
154Memory controllers allow for several csrows, with 8 csrows being a typical value. 120Memory controllers allow for several csrows, with 8 csrows being a
155Yet, the actual number of csrows depends on the electrical "loading" 121typical value. Yet, the actual number of csrows depends on the layout of
156of a given motherboard, memory controller and DIMM characteristics. 122a given motherboard, memory controller and DIMM characteristics.
157 123
158Dual channels allows for 128 bit data transfers to the CPU from memory. 124Dual channels allows for 128 bit data transfers to/from the CPU from/to
159Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs 125memory. Some newer chipsets allow for more than 2 channels, like Fully
160(FB-DIMMs). The following example will assume 2 channels: 126Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels:
161 127
162 128
163 Channel 0 Channel 1 129 Channel 0 Channel 1
@@ -179,12 +145,12 @@ for memory DIMMs:
179 DIMM_A1 145 DIMM_A1
180 DIMM_B1 146 DIMM_B1
181 147
182Labels for these slots are usually silk screened on the motherboard. Slots 148Labels for these slots are usually silk-screened on the motherboard.
183labeled 'A' are channel 0 in this example. Slots labeled 'B' 149Slots labeled 'A' are channel 0 in this example. Slots labeled 'B' are
184are channel 1. Notice that there are two csrows possible on a 150channel 1. Notice that there are two csrows possible on a physical DIMM.
185physical DIMM. These csrows are allocated their csrow assignment 151These csrows are allocated their csrow assignment based on the slot into
186based on the slot into which the memory DIMM is placed. Thus, when 1 DIMM 152which the memory DIMM is placed. Thus, when 1 DIMM is placed in each
187is placed in each Channel, the csrows cross both DIMMs. 153Channel, the csrows cross both DIMMs.
188 154
189Memory DIMMs come single or dual "ranked". A rank is a populated csrow. 155Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
190Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above 156Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
@@ -193,8 +159,8 @@ when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
193csrow1 will be populated. The pattern repeats itself for csrow2 and 159csrow1 will be populated. The pattern repeats itself for csrow2 and
194csrow3. 160csrow3.
195 161
196The representation of the above is reflected in the directory tree 162The representation of the above is reflected in the directory
197in EDAC's sysfs interface. Starting in directory 163tree in EDAC's sysfs interface. Starting in directory
198/sys/devices/system/edac/mc each memory controller will be represented 164/sys/devices/system/edac/mc each memory controller will be represented
199by its own 'mcX' directory, where 'X' is the index of the MC. 165by its own 'mcX' directory, where 'X' is the index of the MC.
200 166
@@ -217,19 +183,19 @@ Under each 'mcX' directory each 'csrowX' is again represented by a
217 |->csrow3 183 |->csrow3
218 .... 184 ....
219 185
220Notice that there is no csrow1, which indicates that csrow0 is 186Notice that there is no csrow1, which indicates that csrow0 is composed
221composed of a single ranked DIMMs. This should also apply in both 187of a single ranked DIMMs. This should also apply in both Channels, in
222Channels, in order to have dual-channel mode be operational. Since 188order to have dual-channel mode be operational. Since both csrow2 and
223both csrow2 and csrow3 are populated, this indicates a dual ranked 189csrow3 are populated, this indicates a dual ranked set of DIMMs for
224set of DIMMs for channels 0 and 1. 190channels 0 and 1.
225 191
226 192
227Within each of the 'mcX' and 'csrowX' directories are several 193Within each of the 'mcX' and 'csrowX' directories are several EDAC
228EDAC control and attribute files. 194control and attribute files.
229 195
230============================================================================
231'mcX' DIRECTORIES
232 196
197'mcX' directories
198-----------------
233 199
234In 'mcX' directories are EDAC control and attribute files for 200In 'mcX' directories are EDAC control and attribute files for
235this 'X' instance of the memory controllers. 201this 'X' instance of the memory controllers.
@@ -238,13 +204,14 @@ For a description of the sysfs API, please see:
238 Documentation/ABI/testing/sysfs-devices-edac 204 Documentation/ABI/testing/sysfs-devices-edac
239 205
240 206
241============================================================================
242'csrowX' DIRECTORIES
243 207
244When CONFIG_EDAC_LEGACY_SYSFS is enabled, the sysfs will contain the 208'csrowX' directories
245csrowX directories. As this API doesn't work properly for Rambus, FB-DIMMs 209--------------------
246and modern Intel Memory Controllers, this is being deprecated in favor 210
247of dimmX directories. 211When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX
212directories. As this API doesn't work properly for Rambus, FB-DIMMs and
213modern Intel Memory Controllers, this is being deprecated in favor of
214dimmX directories.
248 215
249In the 'csrowX' directories are EDAC control and attribute files for 216In the 'csrowX' directories are EDAC control and attribute files for
250this 'X' instance of csrow: 217this 'X' instance of csrow:
@@ -265,11 +232,11 @@ Total Correctable Errors count attribute file:
265 'ce_count' 232 'ce_count'
266 233
267 This attribute file displays the total count of correctable 234 This attribute file displays the total count of correctable
268 errors that have occurred on this csrow. This 235 errors that have occurred on this csrow. This count is very
269 count is very important to examine. CEs provide early 236 important to examine. CEs provide early indications that a
270 indications that a DIMM is beginning to fail. This count 237 DIMM is beginning to fail. This count field should be
271 field should be monitored for non-zero values and report 238 monitored for non-zero values and report such information
272 such information to the system administrator. 239 to the system administrator.
273 240
274 241
275Total memory managed by this csrow attribute file: 242Total memory managed by this csrow attribute file:
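
Illustration (not part of the patch): the 'ce_count' text in this hunk says the
count should be monitored for non-zero values. The sketch below shows one way a
userspace script might do that. It assumes the mcX/csrowX/ce_count layout under
/sys/devices/system/edac/mc that the document describes, which is only present
when CONFIG_EDAC_LEGACY_SYSFS is enabled. The function name and its parameter
are hypothetical, chosen for this example only.

```python
import glob
import os


def nonzero_ce_counts(edac_mc_root='/sys/devices/system/edac/mc'):
    """Return {relative path: count} for every csrow ce_count that is non-zero.

    edac_mc_root is a parameter so the function can be pointed at a test
    directory; the default is the sysfs location named in the document.
    """
    results = {}
    pattern = os.path.join(edac_mc_root, 'mc*', 'csrow*', 'ce_count')
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                count = int(f.read().strip())
        except (OSError, ValueError):
            # Unreadable or malformed attribute file: skip it.
            continue
        if count:
            results[os.path.relpath(path, edac_mc_root)] = count
    return results


if __name__ == '__main__':
    for name, count in nonzero_ce_counts().items():
        print('%s: %d correctable error(s)' % (name, count))
```

On a system without EDAC loaded, or without the legacy csrow directories, the
function simply returns an empty dictionary.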
@@ -377,11 +344,13 @@ Channel 1 DIMM Label control file:
377 motherboard specific and determination of this information 344 motherboard specific and determination of this information
378 must occur in userland at this time. 345 must occur in userland at this time.
379 346
380============================================================================ 347
348
381SYSTEM LOGGING 349SYSTEM LOGGING
350--------------
382 351
383If logging for UEs and CEs are enabled then system logs will have 352If logging for UEs and CEs is enabled, then system logs will contain
384error notices indicating errors that have been detected: 353information indicating that errors have been detected:
385 354
386EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, 355EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
387channel 1 "DIMM_B1": amd76x_edac 356channel 1 "DIMM_B1": amd76x_edac
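
Illustration (not part of the patch): the CE message quoted in this hunk has a
regular layout, so a log-scanning tool can split it into fields. The sketch
below is written against that one sample line only. The regular expression,
the field names and the function name are assumptions made for this example,
and other drivers or kernel versions may format their messages differently.

```python
import re

# Matches the CE sample shown in the SYSTEM LOGGING section. Whitespace
# between fields is matched loosely so a wrapped line still parses.
EDAC_MSG = re.compile(
    r'EDAC MC(?P<mc>\d+):\s+(?P<kind>CE|UE)\s+'
    r'page (?P<page>0x[0-9a-fA-F]+),\s+'
    r'offset (?P<offset>0x[0-9a-fA-F]+),\s+'
    r'grain (?P<grain>\d+),\s+'
    r'syndrome (?P<syndrome>0x[0-9a-fA-F]+),\s+'
    r'row (?P<row>\d+),\s+'
    r'channel (?P<channel>\d+)\s+'
    r'"(?P<label>[^"]*)":\s+(?P<driver>\S+)'
)


def parse_edac_message(text):
    """Return a dict of fields from an EDAC message, or None if no match."""
    m = EDAC_MSG.search(text)
    if m is None:
        return None
    fields = m.groupdict()
    for key in ('mc', 'grain', 'row', 'channel'):
        fields[key] = int(fields[key])
    for key in ('page', 'offset', 'syndrome'):
        fields[key] = int(fields[key], 16)  # hexadecimal values in the log
    return fields


SAMPLE = ('EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, '
          'row 0, channel 1 "DIMM_B1": amd76x_edac')

if __name__ == '__main__':
    print(parse_edac_message(SAMPLE))
```

Messages that lack some of these fields, such as the "no info" case mentioned
in the following hunk, do not match this pattern and yield None.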
@@ -404,24 +373,23 @@ The structure of the message is:
404 and then an optional, driver-specific message that may 373 and then an optional, driver-specific message that may
405 have additional information. 374 have additional information.
406 375
407Both UEs and CEs with no info will lack all but memory controller, 376Both UEs and CEs with no info will lack all but memory controller, error
408error type, a notice of "no info" and then an optional, 377type, a notice of "no info" and then an optional, driver-specific error
409driver-specific error message. 378message.
410 379
411 380
412============================================================================
413PCI Bus Parity Detection 381PCI Bus Parity Detection
382------------------------
414 383
415 384On Header Type 00 devices, the primary status is looked at for any
416On Header Type 00 devices the primary status is looked at 385parity error regardless of whether parity is enabled on the device or
417for any parity error regardless of whether Parity is enabled on the 386not. (The spec indicates parity is generated in some cases). On Header
418device. (The spec indicates parity is generated in some cases). 387Type 01 bridges, the secondary status register is also looked at to see
419On Header Type 01 bridges, the secondary status register is also 388if parity occurred on the bus on the other side of the bridge.
420looked at to see if parity occurred on the bus on the other side of
421the bridge.
422 389
423 390
424SYSFS CONFIGURATION 391SYSFS CONFIGURATION
392-------------------
425 393
426Under /sys/devices/system/edac/pci are control and attribute files as follows: 394Under /sys/devices/system/edac/pci are control and attribute files as follows:
427 395
@@ -450,8 +418,9 @@ Parity Count:
450 have been detected. 418 have been detected.
451 419
452 420
453============================================================================ 421
454MODULE PARAMETERS 422MODULE PARAMETERS
423-----------------
455 424
456Panic on UE control file: 425Panic on UE control file:
457 426
@@ -530,10 +499,8 @@ Panic on PCI PARITY Error:
530 499
531 500
532 501
533======================================================================= 502EDAC device type
534 503----------------
535
536EDAC_DEVICE type of device
537 504
538In the header file, edac_core.h, there is a series of edac_device structures 505In the header file, edac_core.h, there is a series of edac_device structures
539and APIs for the EDAC_DEVICE. 506and APIs for the EDAC_DEVICE.
@@ -573,6 +540,7 @@ The test_device_edac device adds at least one of its own custom control:
573The symlink points to the 'struct dev' that is registered for this edac_device. 540The symlink points to the 'struct dev' that is registered for this edac_device.
574 541
575INSTANCES 542INSTANCES
543---------
576 544
577One or more instance directories are present. For the 'test_device_edac' case: 545One or more instance directories are present. For the 'test_device_edac' case:
578 546
@@ -586,6 +554,7 @@ counter in deeper subdirectories.
586 ue_count total of UE events of subdirectories 554 ue_count total of UE events of subdirectories
587 555
588BLOCKS 556BLOCKS
557------
589 558
590At the lowest directory level is the 'block' directory. There can be 0, 1 559At the lowest directory level is the 'block' directory. There can be 0, 1
591or more blocks specified in each instance. 560or more blocks specified in each instance.
@@ -623,8 +592,9 @@ unique drivers for their hardware systems.
623The 'test_device_edac' sample driver is located at the 592The 'test_device_edac' sample driver is located at the
624bluesmoke.sourceforge.net project site for EDAC. 593bluesmoke.sourceforge.net project site for EDAC.
625 594
626======================================================================= 595
627NEHALEM USAGE OF EDAC APIs 596NEHALEM USAGE OF EDAC APIs
597--------------------------
628 598
629This chapter documents some EXPERIMENTAL mappings for EDAC API to handle 599This chapter documents some EXPERIMENTAL mappings for EDAC API to handle
630Nehalem EDAC driver. They will likely be changed on future versions 600Nehalem EDAC driver. They will likely be changed on future versions
@@ -773,3 +743,20 @@ exports one
773 by the driver. Since, with udimm, this is counted by software, it is 743 by the driver. Since, with udimm, this is counted by software, it is
774 possible that some errors could be lost. With rdimm's, they display the 744 possible that some errors could be lost. With rdimm's, they display the
775 contents of the registers 745 contents of the registers
746
747CREDITS:
748========
749
750Written by Doug Thompson <dougthompson@xmission.com>
7517 Dec 2005
75217 Jul 2007 Updated
753
754(c) Mauro Carvalho Chehab
75505 Aug 2009 Nehalem interface
756
757EDAC authors/maintainers:
758
759 Doug Thompson, Dave Jiang, Dave Peterson et al,
760 Mauro Carvalho Chehab
761 Borislav Petkov
762 original author: Thayne Harbaugh