Diffstat (limited to 'Documentation/drivers/edac/edac.txt')
-rw-r--r-- Documentation/drivers/edac/edac.txt | 192
1 files changed, 165 insertions, 27 deletions
diff --git a/Documentation/drivers/edac/edac.txt b/Documentation/drivers/edac/edac.txt
index 3c5a9e4297b4..a5c36842ecef 100644
--- a/Documentation/drivers/edac/edac.txt
+++ b/Documentation/drivers/edac/edac.txt
@@ -2,22 +2,42 @@
2 2
3EDAC - Error Detection And Correction 3EDAC - Error Detection And Correction
4 4
5Written by Doug Thompson <norsk5@xmission.com> 5Written by Doug Thompson <dougthompson@xmission.com>
67 Dec 2005 67 Dec 2005
717 Jul 2007 Updated
7 8
8 9
9EDAC was written by: 10EDAC is maintained and written by:
10 Thayne Harbaugh,
11 modified by Dave Peterson, Doug Thompson, et al,
12 from the bluesmoke.sourceforge.net project.
13 11
12 Doug Thompson, Dave Jiang, Dave Peterson et al,
13 original author: Thayne Harbaugh,
14
15Contact:
16 website: bluesmoke.sourceforge.net
17 mailing list: bluesmoke-devel@lists.sourceforge.net
18
19"bluesmoke" was the name for this device driver when it was "out-of-tree"
20and maintained at sourceforge.net. When it was pushed into 2.6.16 for the
21first time, it was renamed to 'EDAC'.
22
23The bluesmoke project at sourceforge.net is now utilized as a 'staging area'
24for EDAC development, before it is sent upstream to kernel.org
25
26The bluesmoke/EDAC project site hosts a series of quilt patches against
27recent kernels, stored in an SVN repository. For easier downloading, there
28is also a tarball snapshot available.
14 29
15============================================================================ 30============================================================================
16EDAC PURPOSE 31EDAC PURPOSE
17 32
18The 'edac' kernel module goal is to detect and report errors that occur 33The 'edac' kernel module goal is to detect and report errors that occur
19within the computer system. In the initial release, memory Correctable Errors 34within the computer system running under Linux.
20(CE) and Uncorrectable Errors (UE) are the primary errors being harvested. 35
36MEMORY
37
38In the initial release, memory Correctable Errors (CE) and Uncorrectable
39Errors (UE) are the primary errors being harvested. These types of errors
40are harvested by the 'edac_mc' class of device.
21 41
22Detecting CE events, then harvesting those events and reporting them, 42Detecting CE events, then harvesting those events and reporting them,
23CAN be a predictor of future UE events. With CE events, the system can 43CAN be a predictor of future UE events. With CE events, the system can
@@ -25,9 +45,27 @@ continue to operate, but with less safety. Preventive maintenance and
25proactive part replacement of memory DIMMs exhibiting CEs can reduce 45proactive part replacement of memory DIMMs exhibiting CEs can reduce
26the likelihood of the dreaded UE events and system 'panics'. 46the likelihood of the dreaded UE events and system 'panics'.
27 47
48NON-MEMORY
49
50A new feature for EDAC, the edac_device class of device, was added in
51the 2.6.23 version of the kernel.
52
53This new device type allows non-memory ECC hardware detectors to have
54their states harvested and presented to userspace via the sysfs
55interface.
56
57Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA
58engines, fabric switches, main data path switches, interconnections,
59and various other hardware data paths. If the hardware reports it, then
60an edac_device device probably can be constructed to harvest and present
61that to userspace.
62
63
64PCI BUS SCANNING
28 65
29In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices 66In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices
30in order to determine if errors are occurring on data transfers. 67in order to determine if errors are occurring on data transfers.
68
31The presence of PCI Parity errors must be examined with a grain of salt. 69The presence of PCI Parity errors must be examined with a grain of salt.
32There are several add-in adapters that do NOT follow the PCI specification 70There are several add-in adapters that do NOT follow the PCI specification
33with regards to Parity generation and reporting. The specification says 71with regards to Parity generation and reporting. The specification says
@@ -35,11 +73,17 @@ the vendor should tie the parity status bits to 0 if they do not intend
35to generate parity. Some vendors do not do this, and thus the parity bit 73to generate parity. Some vendors do not do this, and thus the parity bit
36can "float" giving false positives. 74can "float" giving false positives.
37 75
38[There are patches in the kernel queue which will allow for storage of 76In the kernel there is a pci device attribute located in sysfs that is
39quirks of PCI devices reporting false parity positives. The 2.6.18 77checked by the EDAC PCI scanning code. If that attribute is set,
40kernel should have those patches included. When that becomes available, 78PCI parity/error scanning is skipped for that device. The attribute
41then EDAC will be patched to utilize that information to "skip" such 79is:
42devices.] 80
81 broken_parity_status
82
83and is located in the /sys/devices/pci<XXX>/0000:XX:YY.Z directories of
84PCI devices.
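
As a rough illustration of how userspace could inspect this attribute, the following Python sketch lists the PCI devices whose broken_parity_status reads 1. It assumes the per-device directories are also reachable through /sys/bus/pci/devices; the function name and its pci_root parameter are hypothetical and exist only for this example.

```python
import os


def devices_with_broken_parity(pci_root="/sys/bus/pci/devices"):
    """Return the names of PCI devices whose broken_parity_status reads 1.

    pci_root is an assumption for this sketch; point it at whichever
    directory holds the per-device sysfs directories on your system.
    """
    flagged = []
    if not os.path.isdir(pci_root):
        return flagged
    for name in sorted(os.listdir(pci_root)):
        attr = os.path.join(pci_root, name, "broken_parity_status")
        try:
            with open(attr) as f:
                value = f.read().strip()
        except OSError:
            # Attribute missing or unreadable: nothing to report for this device.
            continue
        if value == "1":
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    for dev in devices_with_broken_parity():
        print(dev)
```

Devices printed by this sketch are the ones for which, per the text above, the EDAC PCI scanning code skips parity/error scanning.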
85
86FUTURE HARDWARE SCANNING
43 87
44EDAC will have future error detectors that will be integrated with 88EDAC will have future error detectors that will be integrated with
45EDAC or added to it, in the following list: 89EDAC or added to it, in the following list:
@@ -57,13 +101,14 @@ and the like.
57============================================================================ 101============================================================================
58EDAC VERSIONING 102EDAC VERSIONING
59 103
60EDAC is composed of a "core" module (edac_mc.ko) and several Memory 104EDAC is composed of a "core" module (edac_core.ko) and several Memory
61Controller (MC) driver modules. On a given system, the CORE 105Controller (MC) driver modules. On a given system, the CORE
62is loaded and one MC driver will be loaded. Both the CORE and 106is loaded and one MC driver will be loaded. Both the CORE and
63the MC driver have individual versions that reflect current release 107the MC driver (or edac_device driver) have individual versions that reflect
64level of their respective modules. Thus, to "report" on what version 108current release level of their respective modules.
65a system is running, one must report both the CORE's and the 109
66MC driver's versions. 110Thus, to "report" on what version a system is running, one must report both
111the CORE's and the MC driver's versions.
67 112
68 113
69LOADING 114LOADING
@@ -88,8 +133,9 @@ EDAC sysfs INTERFACE
88EDAC presents a 'sysfs' interface for control, reporting and attribute 133EDAC presents a 'sysfs' interface for control, reporting and attribute
89reporting purposes. 134reporting purposes.
90 135
91EDAC lives in the /sys/devices/system/edac directory. Within this directory 136EDAC lives in the /sys/devices/system/edac directory.
92there currently reside 2 'edac' components: 137
138Within this directory there currently reside 2 'edac' components:
93 139
94 mc memory controller(s) system 140 mc memory controller(s) system
95 pci PCI control and status system 141 pci PCI control and status system
@@ -188,7 +234,7 @@ In directory 'mc' are EDAC system overall control and attribute files:
188 234
189Panic on UE control file: 235Panic on UE control file:
190 236
191 'panic_on_ue' 237 'edac_mc_panic_on_ue'
192 238
193 An uncorrectable error will cause a machine panic. This is usually 239 An uncorrectable error will cause a machine panic. This is usually
194 desirable. It is a bad idea to continue when an uncorrectable error 240 desirable. It is a bad idea to continue when an uncorrectable error
@@ -199,12 +245,12 @@ Panic on UE control file:
199 245
200 LOAD TIME: module/kernel parameter: panic_on_ue=[0|1] 246 LOAD TIME: module/kernel parameter: panic_on_ue=[0|1]
201 247
202 RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue 248 RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_panic_on_ue
203 249
204 250
205Log UE control file: 251Log UE control file:
206 252
207 'log_ue' 253 'edac_mc_log_ue'
208 254
209 Generate kernel messages describing uncorrectable errors. These errors 255 Generate kernel messages describing uncorrectable errors. These errors
210 are reported through the system message log system. UE statistics 256 are reported through the system message log system. UE statistics
@@ -212,12 +258,12 @@ Log UE control file:
212 258
213 LOAD TIME: module/kernel parameter: log_ue=[0|1] 259 LOAD TIME: module/kernel parameter: log_ue=[0|1]
214 260
215 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue 261 RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_log_ue
216 262
217 263
218Log CE control file: 264Log CE control file:
219 265
220 'log_ce' 266 'edac_mc_log_ce'
221 267
222 Generate kernel messages describing correctable errors. These 268 Generate kernel messages describing correctable errors. These
223 errors are reported through the system message log system. 269 errors are reported through the system message log system.
@@ -225,12 +271,12 @@ Log CE control file:
225 271
226 LOAD TIME: module/kernel parameter: log_ce=[0|1] 272 LOAD TIME: module/kernel parameter: log_ce=[0|1]
227 273
228 RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce 274 RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_log_ce
229 275
230 276
231Polling period control file: 277Polling period control file:
232 278
233 'poll_msec' 279 'edac_mc_poll_msec'
234 280
235 The time period, in milliseconds, for polling for error information. 281 The time period, in milliseconds, for polling for error information.
236 Too small a value wastes resources. Too large a value might delay 282 Too small a value wastes resources. Too large a value might delay
@@ -241,7 +287,7 @@ Polling period control file:
241 287
242 LOAD TIME: module/kernel parameter: poll_msec=[0|1] 288 LOAD TIME: module/kernel parameter: poll_msec=[0|1]
243 289
244 RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec 290 RUN TIME: echo "1000" >/sys/devices/system/edac/mc/edac_mc_poll_msec
245 291
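
The RUN TIME commands above can also be driven from a script. The following Python sketch is one way to read and set the renamed control files; the helper names read_control and write_control, and the base parameter, are hypothetical and not part of EDAC. Writing to sysfs normally requires root privileges.

```python
import os

# Directory named in the RUN TIME examples above.
EDAC_MC_DIR = "/sys/devices/system/edac/mc"

CONTROLS = (
    "edac_mc_panic_on_ue",
    "edac_mc_log_ue",
    "edac_mc_log_ce",
    "edac_mc_poll_msec",
)


def read_control(name, base=EDAC_MC_DIR):
    """Return the current value of a control file as a stripped string."""
    with open(os.path.join(base, name)) as f:
        return f.read().strip()


def write_control(name, value, base=EDAC_MC_DIR):
    """Write a new value to a control file, like the echo commands above."""
    with open(os.path.join(base, name), "w") as f:
        f.write(str(value))


if __name__ == "__main__":
    for ctl in CONTROLS:
        try:
            print(ctl, "=", read_control(ctl))
        except OSError as err:
            print(ctl, "unavailable:", err)
```

For example, write_control("edac_mc_poll_msec", 1000) would have the same effect as the echo "1000" command shown above.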
246 292
247============================================================================ 293============================================================================
@@ -587,3 +633,95 @@ Parity Count:
587 633
588 634
589======================================================================= 635=======================================================================
636
637
638EDAC_DEVICE type of device
639
640In the header file, edac_core.h, there is a series of edac_device structures
641and APIs for the EDAC_DEVICE.
642
643User space access to an edac_device is through the sysfs interface.
644
645At the location /sys/devices/system/edac (sysfs) new edac_device devices will
646appear.
647
648There is a three level tree beneath the above 'edac' directory. For example,
649the 'test_device_edac' device (found at the bluesmoke.sourceforge.net website)
650installs itself as:
651
652 /sys/devices/system/edac/test-instance
653
654In this directory are various controls, a symlink and one or more 'instance'
655directories.
656
657The standard default controls are:
658
659 log_ce boolean to log CE events
660 log_ue boolean to log UE events
661 panic_on_ue boolean to 'panic' the system if an UE is encountered
662 (default off, can be set true via startup script)
663 poll_msec time period between POLL cycles for events
664
665The test_device_edac device adds at least one custom control of its own:
666
667 test_bits which in the current test driver does nothing but
668 show how it is installed. A ported driver can
669 add one or more such controls and/or attributes
670 for specific uses.
671 One out-of-tree driver uses controls here to allow
672 for ERROR INJECTION operations to hardware
673 injection registers
674
675The symlink points to the 'struct dev' that is registered for this edac_device.
676
677INSTANCES
678
679One or more instance directories are present. For the 'test_device_edac' case:
680
681 test-instance0
682
683
684In this directory there are two default counter attributes, which are totals of
685the counters in deeper subdirectories.
686
687 ce_count total of CE events of subdirectories
688 ue_count total of UE events of subdirectories
689
690BLOCKS
691
692At the lowest directory level is the 'block' directory. There can be 0, 1
693or more blocks specified in each instance.
694
695 test-block0
696
697
698In this directory the default attributes are:
699
700 ce_count which is the counter of CE events for this 'block'
701 of hardware being monitored
702 ue_count which is the counter of UE events for this 'block'
703 of hardware being monitored
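
The three-level layout described above (device, instance, block) can be walked from userspace. The Python sketch below gathers the ce_count and ue_count values at the instance and block levels. The function names are hypothetical, and the sketch assumes that every non-symlink subdirectory is an instance or a block; a real sysfs tree may contain other directories, for which the counts would simply come back as None.

```python
import os


def read_count(path):
    """Return the integer stored in a counter file, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def real_subdirs(path):
    """List subdirectories of path, skipping symlinks such as the device link."""
    return sorted(
        name for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
        and not os.path.islink(os.path.join(path, name))
    )


def collect_counts(device_dir):
    """Walk device -> instance -> block and gather ce_count / ue_count values."""
    report = {}
    for inst in real_subdirs(device_dir):
        inst_dir = os.path.join(device_dir, inst)
        blocks = {}
        for blk in real_subdirs(inst_dir):
            blk_dir = os.path.join(inst_dir, blk)
            blocks[blk] = {
                "ce_count": read_count(os.path.join(blk_dir, "ce_count")),
                "ue_count": read_count(os.path.join(blk_dir, "ue_count")),
            }
        report[inst] = {
            "ce_count": read_count(os.path.join(inst_dir, "ce_count")),
            "ue_count": read_count(os.path.join(inst_dir, "ue_count")),
            "blocks": blocks,
        }
    return report


if __name__ == "__main__":
    import pprint
    pprint.pprint(collect_counts("/sys/devices/system/edac/test-instance"))
```

Since the instance-level counters are described as totals of the block-level ones, comparing the two levels in the returned dictionary is a quick sanity check on a driver under development.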
704
705
706The 'test_device_edac' device adds 4 attributes and 1 control:
707
708 test-block-bits-0 for every POLL cycle this counter
709 is incremented
710 test-block-bits-1 every 10 cycles, this counter is bumped once,
711 and test-block-bits-0 is set to 0
712 test-block-bits-2 every 100 cycles, this counter is bumped once,
713 and test-block-bits-1 is set to 0
714 test-block-bits-3 every 1000 cycles, this counter is bumped once,
715 and test-block-bits-2 is set to 0
716
717
718 reset-counters writing ANY thing to this control will
719 reset all the above counters.
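
The four test-block-bits counters behave like a cascading decimal counter. The Python sketch below simulates that behaviour purely from the description above; it is not taken from the test_device_edac source, and the function name simulate_polls is hypothetical.

```python
def simulate_polls(cycles):
    """Return [bits-0, bits-1, bits-2, bits-3] after the given POLL cycles.

    Models the description above: bits-0 is incremented on every cycle,
    and each higher counter is bumped once when the counter below it
    reaches 10, at which point the lower counter is set back to 0.
    """
    counters = [0, 0, 0, 0]
    for _ in range(cycles):
        counters[0] += 1
        # Carry into the next counter and reset the lower one.
        for i in range(3):
            if counters[i] == 10:
                counters[i] = 0
                counters[i + 1] += 1
    return counters


if __name__ == "__main__":
    for n in (9, 10, 100, 1234):
        print(n, simulate_polls(n))
```

Under this model, 1234 cycles leave the counters at 4, 3, 2 and 1 for test-block-bits-0 through test-block-bits-3; the real driver may differ in detail.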
720
721
722Use of the 'test_device_edac' driver should enable others to create their own
723unique drivers for their hardware systems.
724
725The 'test_device_edac' sample driver is located at the
726bluesmoke.sourceforge.net project site for EDAC.
727