Diffstat (limited to 'Documentation/drivers/edac/edac.txt')
-rw-r--r-- Documentation/drivers/edac/edac.txt 152
1 files changed, 27 insertions, 125 deletions
diff --git a/Documentation/drivers/edac/edac.txt b/Documentation/drivers/edac/edac.txt
index 70d96a62e5e1..7b3d969d2964 100644
--- a/Documentation/drivers/edac/edac.txt
+++ b/Documentation/drivers/edac/edac.txt
@@ -35,15 +35,14 @@ the vendor should tie the parity status bits to 0 if they do not intend
 to generate parity. Some vendors do not do this, and thus the parity bit
 can "float" giving false positives.
 
-The PCI Parity EDAC device has the ability to "skip" known flaky
-cards during the parity scan. These are set by the parity "blacklist"
-interface in the sysfs for PCI Parity. (See the PCI section in the sysfs
-section below.) There is also a parity "whitelist" which is used as
-an explicit list of devices to scan, while the blacklist is a list
-of devices to skip.
+[There are patches in the kernel queue which will allow for storage of
+quirks of PCI devices reporting false parity positives. The 2.6.18
+kernel should have those patches included. When that becomes available,
+then EDAC will be patched to utilize that information to "skip" such
+devices.]
 
-EDAC will have future error detectors that will be added or integrated
-into EDAC in the following list:
+EDAC will have future error detectors that will be integrated with
+EDAC or added to it, in the following list:
 
 	MCE Machine Check Exception
 	MCA Machine Check Architecture
@@ -93,22 +92,24 @@ EDAC lives in the /sys/devices/system/edac directory. Within this directory
 there currently reside 2 'edac' components:
 
 	mc memory controller(s) system
-	pci PCI status system
+	pci PCI control and status system
 
 
 ============================================================================
 Memory Controller (mc) Model
 
 First a background on the memory controller's model abstracted in EDAC.
-Each mc device controls a set of DIMM memory modules. These modules are
+Each 'mc' device controls a set of DIMM memory modules. These modules are
 laid out in a Chip-Select Row (csrowX) and Channel table (chX). There can
-be multiple csrows and two channels.
+be multiple csrows and multiple channels.
 
 Memory controllers allow for several csrows, with 8 csrows being a typical value.
 Yet, the actual number of csrows depends on the electrical "loading"
 of a given motherboard, memory controller and DIMM characteristics.
 
 Dual channels allows for 128 bit data transfers to the CPU from memory.
+Some newer chipsets allow for more than 2 channels, like Fully Buffered DIMMs
+(FB-DIMMs). The following example will assume 2 channels:
 
 
 	Channel 0 Channel 1
@@ -234,23 +235,15 @@ Polling period control file:
 	The time period, in milliseconds, for polling for error information.
 	Too small a value wastes resources. Too large a value might delay
 	necessary handling of errors and might loose valuable information for
-	locating the error. 1000 milliseconds (once each second) is about
-	right for most uses.
+	locating the error. 1000 milliseconds (once each second) is the current
+	default. Systems which require all the bandwidth they can get, may
+	increase this.
 
 	LOAD TIME: module/kernel parameter: poll_msec=[0|1]
 
 	RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec
 
 
-Module Version read-only attribute file:
-
-	'mc_version'
-
-	The EDAC CORE module's version and compile date are shown here to
-	indicate what EDAC is running.
-
-
-
 ============================================================================
 'mcX' DIRECTORIES
 
@@ -284,35 +277,6 @@ Seconds since last counter reset control file:
 
 
 
-DIMM capability attribute file:
-
-	'edac_capability'
-
-	The EDAC (Error Detection and Correction) capabilities/modes of
-	the memory controller hardware.
-
-
-DIMM Current Capability attribute file:
-
-	'edac_current_capability'
-
-	The EDAC capabilities available with the hardware
-	configuration. This may not be the same as "EDAC capability"
-	if the correct memory is not used. If a memory controller is
-	capable of EDAC, but DIMMs without check bits are in use, then
-	Parity, SECDED, S4ECD4ED capabilities will not be available
-	even though the memory controller might be capable of those
-	modes with the proper memory loaded.
-
-
-Memory Type supported on this controller attribute file:
-
-	'supported_mem_type'
-
-	This attribute file displays the memory type, usually
-	buffered and unbuffered DIMMs.
-
-
 Memory Controller name attribute file:
 
 	'mc_name'
@@ -321,16 +285,6 @@ Memory Controller name attribute file:
 	that is being utilized.
 
 
-Memory Controller Module name attribute file:
-
-	'module_name'
-
-	This attribute file displays the memory controller module name,
-	version and date built. The name of the memory controller
-	hardware - some drivers work with multiple controllers and
-	this field shows which hardware is present.
-
-
 Total memory managed by this memory controller attribute file:
 
 	'size_mb'
@@ -432,6 +386,9 @@ Memory Type attribute file:
 
 	This attribute file will display what type of memory is currently
 	on this csrow. Normally, either buffered or unbuffered memory.
+	Examples:
+		Registered-DDR
+		Unbuffered-DDR
 
 
 EDAC Mode of operation attribute file:
@@ -446,8 +403,13 @@ Device type attribute file:
 
 	'dev_type'
 
-	This attribute file will display what type of DIMM device is
-	being utilized. Example: x4
+	This attribute file will display what type of DRAM device is
+	being utilized on this DIMM.
+	Examples:
+		x1
+		x2
+		x4
+		x8
 
 
 Channel 0 CE Count attribute file:
@@ -522,10 +484,10 @@ SYSTEM LOGGING
 If logging for UEs and CEs are enabled then system logs will have
 error notices indicating errors that have been detected:
 
-MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
+EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
 channel 1 "DIMM_B1": amd76x_edac
 
-MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
+EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
 channel 1 "DIMM_B1": amd76x_edac
 
 
@@ -610,64 +572,4 @@ Parity Count:
 
 
 
-PCI Device Whitelist:
-
-	'pci_parity_whitelist'
-
-	This control file allows for an explicit list of PCI devices to be
-	scanned for parity errors. Only devices found on this list will
-	be examined. The list is a line of hexadecimal VENDOR and DEVICE
-	ID tuples:
-
-	1022:7450,1434:16a6
-
-	One or more can be inserted, separated by a comma.
-
-	To write the above list doing the following as one command line:
-
-	echo "1022:7450,1434:16a6"
-		> /sys/devices/system/edac/pci/pci_parity_whitelist
-
-
-
-	To display what the whitelist is, simply 'cat' the same file.
-
-
-PCI Device Blacklist:
-
-	'pci_parity_blacklist'
-
-	This control file allows for a list of PCI devices to be
-	skipped for scanning.
-	The list is a line of hexadecimal VENDOR and DEVICE ID tuples:
-
-	1022:7450,1434:16a6
-
-	One or more can be inserted, separated by a comma.
-
-	To write the above list doing the following as one command line:
-
-	echo "1022:7450,1434:16a6"
-		> /sys/devices/system/edac/pci/pci_parity_blacklist
-
-
-	To display what the whitelist currently contains,
-	simply 'cat' the same file.
-
 =======================================================================
-
-PCI Vendor and Devices IDs can be obtained with the lspci command. Using
-the -n option lspci will display the vendor and device IDs. The system
-administrator will have to determine which devices should be scanned or
-skipped.
-
-
-
-The two lists (white and black) are prioritized. blacklist is the lower
-priority and will NOT be utilized when a whitelist has been set.
-Turn OFF a whitelist by an empty echo command:
-
-	echo > /sys/devices/system/edac/pci/pci_parity_whitelist
-
-and any previous blacklist will be utilized.
-