Diffstat (limited to 'Documentation/drivers/edac/edac.txt')
-rw-r--r-- | Documentation/drivers/edac/edac.txt | 192 |
1 files changed, 165 insertions, 27 deletions
diff --git a/Documentation/drivers/edac/edac.txt b/Documentation/drivers/edac/edac.txt
index 3c5a9e4297b4..a5c36842ecef 100644
--- a/Documentation/drivers/edac/edac.txt
+++ b/Documentation/drivers/edac/edac.txt
@@ -2,22 +2,42 @@ | |||
2 | 2 | ||
3 | EDAC - Error Detection And Correction | 3 | EDAC - Error Detection And Correction |
4 | 4 | ||
5 | Written by Doug Thompson <norsk5@xmission.com> | 5 | Written by Doug Thompson <dougthompson@xmission.com> |
6 | 7 Dec 2005 | 6 | 7 Dec 2005 |
7 | 17 Jul 2007 Updated | ||
7 | 8 | ||
8 | 9 | ||
9 | EDAC was written by: | 10 | EDAC is maintained and written by: |
10 | Thayne Harbaugh, | ||
11 | modified by Dave Peterson, Doug Thompson, et al, | ||
12 | from the bluesmoke.sourceforge.net project. | ||
13 | 11 | ||
12 | Doug Thompson, Dave Jiang, Dave Peterson et al, | ||
13 | original author: Thayne Harbaugh, | ||
14 | |||
15 | Contact: | ||
16 | website: bluesmoke.sourceforge.net | ||
17 | mailing list: bluesmoke-devel@lists.sourceforge.net | ||
18 | |||
19 | "bluesmoke" was the name for this device driver when it was "out-of-tree" | ||
20 | and maintained at sourceforge.net. When it was pushed into 2.6.16 for the | ||
21 | first time, it was renamed to 'EDAC'. | ||
22 | |||
23 | The bluesmoke project at sourceforge.net is now utilized as a 'staging area' | ||
24 | for EDAC development, before it is sent upstream to kernel.org | ||
25 | |||
26 | At the bluesmoke/EDAC project site there is a series of quilt patches against | ||
27 | recent kernels, stored in an SVN repository. For easier downloading, there | ||
28 | is also a tarball snapshot available. | ||
14 | 29 | ||
15 | ============================================================================ | 30 | ============================================================================ |
16 | EDAC PURPOSE | 31 | EDAC PURPOSE |
17 | 32 | ||
18 | The 'edac' kernel module goal is to detect and report errors that occur | 33 | The 'edac' kernel module goal is to detect and report errors that occur |
19 | within the computer system. In the initial release, memory Correctable Errors | 34 | within the computer system running under linux. |
20 | (CE) and Uncorrectable Errors (UE) are the primary errors being harvested. | 35 | |
36 | MEMORY | ||
37 | |||
38 | In the initial release, memory Correctable Errors (CE) and Uncorrectable | ||
39 | Errors (UE) are the primary errors being harvested. These types of errors | ||
40 | are harvested by the 'edac_mc' class of device. | ||
21 | 41 | ||
22 | Detecting CE events, then harvesting those events and reporting them, | 42 | Detecting CE events, then harvesting those events and reporting them, |
23 | CAN be a predictor of future UE events. With CE events, the system can | 43 | CAN be a predictor of future UE events. With CE events, the system can |
@@ -25,9 +45,27 @@ continue to operate, but with less safety. Preventive maintenance and | |||
25 | proactive part replacement of memory DIMMs exhibiting CEs can reduce | 45 | proactive part replacement of memory DIMMs exhibiting CEs can reduce |
26 | the likelihood of the dreaded UE events and system 'panics'. | 46 | the likelihood of the dreaded UE events and system 'panics'. |
27 | 47 | ||
48 | NON-MEMORY | ||
49 | |||
50 | A new feature for EDAC, the edac_device class of device, was added in | ||
51 | the 2.6.23 version of the kernel. | ||
52 | |||
53 | This new device type allows for non-memory type of ECC hardware detectors | ||
54 | to have their states harvested and presented to userspace via the sysfs | ||
55 | interface. | ||
56 | |||
57 | Some architectures have ECC detectors for L1, L2 and L3 caches, along with DMA | ||
58 | engines, fabric switches, main data path switches, interconnections, | ||
59 | and various other hardware data paths. If the hardware reports it, then | ||
60 | an edac_device device can probably be constructed to harvest and present | ||
61 | that to userspace. | ||
62 | |||
63 | |||
64 | PCI BUS SCANNING | ||
28 | 65 | ||
29 | In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices | 66 | In addition, PCI Bus Parity and SERR Errors are scanned for on PCI devices |
30 | in order to determine if errors are occurring on data transfers. | 67 | in order to determine if errors are occurring on data transfers. |
68 | |||
31 | The presence of PCI Parity errors must be examined with a grain of salt. | 69 | The presence of PCI Parity errors must be examined with a grain of salt. |
32 | There are several add-in adapters that do NOT follow the PCI specification | 70 | There are several add-in adapters that do NOT follow the PCI specification |
33 | with regards to Parity generation and reporting. The specification says | 71 | with regards to Parity generation and reporting. The specification says |
@@ -35,11 +73,17 @@ the vendor should tie the parity status bits to 0 if they do not intend | |||
35 | to generate parity. Some vendors do not do this, and thus the parity bit | 73 | to generate parity. Some vendors do not do this, and thus the parity bit |
36 | can "float" giving false positives. | 74 | can "float" giving false positives. |
37 | 75 | ||
38 | [There are patches in the kernel queue which will allow for storage of | 76 | In the kernel there is a pci device attribute located in sysfs that is |
39 | quirks of PCI devices reporting false parity positives. The 2.6.18 | 77 | checked by the EDAC PCI scanning code. If that attribute is set, |
40 | kernel should have those patches included. When that becomes available, | 78 | PCI parity/error scanning is skipped for that device. The attribute |
41 | then EDAC will be patched to utilize that information to "skip" such | 79 | is: |
42 | devices.] | 80 | |
81 | broken_parity_status | ||
82 | |||
83 | and is located in the /sys/devices/pci<XXX>/0000:XX:YY.Z directories for | ||
84 | PCI devices. | ||
85 | |||
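As an illustration of the broken_parity_status check described above, here is a minimal userspace sketch in Python. It is not the EDAC scanning code itself. It only walks a sysfs tree and reports the device directories where the attribute reads non-zero. The function name find_broken_parity_devices and the default root /sys/devices are assumptions made for this example.

```python
import os


def find_broken_parity_devices(pci_root="/sys/devices"):
    """Return sorted device directories whose broken_parity_status is non-zero.

    pci_root is an assumed starting point; the attribute lives in the
    per-device directories beneath it.
    """
    flagged = []
    for dirpath, _dirnames, filenames in os.walk(pci_root):
        if "broken_parity_status" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "broken_parity_status")) as f:
                value = f.read().strip()
        except OSError:
            # Attribute vanished or is unreadable; skip this device.
            continue
        if value not in ("", "0"):
            flagged.append(dirpath)
    return sorted(flagged)
```

Calling find_broken_parity_devices() on a running system lists the devices that the PCI parity scan would be expected to skip, assuming the attribute is exposed as described.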
86 | FUTURE HARDWARE SCANNING | ||
43 | 87 | ||
44 | EDAC will have future error detectors that will be integrated with | 88 | EDAC will have future error detectors that will be integrated with |
45 | EDAC or added to it, in the following list: | 89 | EDAC or added to it, in the following list: |
@@ -57,13 +101,14 @@ and the like. | |||
57 | ============================================================================ | 101 | ============================================================================ |
58 | EDAC VERSIONING | 102 | EDAC VERSIONING |
59 | 103 | ||
60 | EDAC is composed of a "core" module (edac_mc.ko) and several Memory | 104 | EDAC is composed of a "core" module (edac_core.ko) and several Memory |
61 | Controller (MC) driver modules. On a given system, the CORE | 105 | Controller (MC) driver modules. On a given system, the CORE |
62 | is loaded and one MC driver will be loaded. Both the CORE and | 106 | is loaded and one MC driver will be loaded. Both the CORE and |
63 | the MC driver have individual versions that reflect current release | 107 | the MC driver (or edac_device driver) have individual versions that reflect |
64 | level of their respective modules. Thus, to "report" on what version | 108 | current release level of their respective modules. |
65 | a system is running, one must report both the CORE's and the | 109 | |
66 | MC driver's versions. | 110 | Thus, to "report" on what version a system is running, one must report both |
111 | the CORE's and the MC driver's versions. | ||
67 | 112 | ||
68 | 113 | ||
69 | LOADING | 114 | LOADING |
@@ -88,8 +133,9 @@ EDAC sysfs INTERFACE | |||
88 | EDAC presents a 'sysfs' interface for control, reporting and attribute | 133 | EDAC presents a 'sysfs' interface for control, reporting and attribute |
89 | reporting purposes. | 134 | reporting purposes. |
90 | 135 | ||
91 | EDAC lives in the /sys/devices/system/edac directory. Within this directory | 136 | EDAC lives in the /sys/devices/system/edac directory. |
92 | there currently reside 2 'edac' components: | 137 | |
138 | Within this directory there currently reside 2 'edac' components: | ||
93 | 139 | ||
94 | mc memory controller(s) system | 140 | mc memory controller(s) system |
95 | pci PCI control and status system | 141 | pci PCI control and status system |
@@ -188,7 +234,7 @@ In directory 'mc' are EDAC system overall control and attribute files: | |||
188 | 234 | ||
189 | Panic on UE control file: | 235 | Panic on UE control file: |
190 | 236 | ||
191 | 'panic_on_ue' | 237 | 'edac_mc_panic_on_ue' |
192 | 238 | ||
193 | An uncorrectable error will cause a machine panic. This is usually | 239 | An uncorrectable error will cause a machine panic. This is usually |
194 | desirable. It is a bad idea to continue when an uncorrectable error | 240 | desirable. It is a bad idea to continue when an uncorrectable error |
@@ -199,12 +245,12 @@ Panic on UE control file: | |||
199 | 245 | ||
200 | LOAD TIME: module/kernel parameter: panic_on_ue=[0|1] | 246 | LOAD TIME: module/kernel parameter: panic_on_ue=[0|1] |
201 | 247 | ||
202 | RUN TIME: echo "1" >/sys/devices/system/edac/mc/panic_on_ue | 248 | RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_panic_on_ue |
203 | 249 | ||
204 | 250 | ||
205 | Log UE control file: | 251 | Log UE control file: |
206 | 252 | ||
207 | 'log_ue' | 253 | 'edac_mc_log_ue' |
208 | 254 | ||
209 | Generate kernel messages describing uncorrectable errors. These errors | 255 | Generate kernel messages describing uncorrectable errors. These errors |
210 | are reported through the system message log system. UE statistics | 256 | are reported through the system message log system. UE statistics |
@@ -212,12 +258,12 @@ Log UE control file: | |||
212 | 258 | ||
213 | LOAD TIME: module/kernel parameter: log_ue=[0|1] | 259 | LOAD TIME: module/kernel parameter: log_ue=[0|1] |
214 | 260 | ||
215 | RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ue | 261 | RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_log_ue |
216 | 262 | ||
217 | 263 | ||
218 | Log CE control file: | 264 | Log CE control file: |
219 | 265 | ||
220 | 'log_ce' | 266 | 'edac_mc_log_ce' |
221 | 267 | ||
222 | Generate kernel messages describing correctable errors. These | 268 | Generate kernel messages describing correctable errors. These |
223 | errors are reported through the system message log system. | 269 | errors are reported through the system message log system. |
@@ -225,12 +271,12 @@ Log CE control file: | |||
225 | 271 | ||
226 | LOAD TIME: module/kernel parameter: log_ce=[0|1] | 272 | LOAD TIME: module/kernel parameter: log_ce=[0|1] |
227 | 273 | ||
228 | RUN TIME: echo "1" >/sys/devices/system/edac/mc/log_ce | 274 | RUN TIME: echo "1" >/sys/devices/system/edac/mc/edac_mc_log_ce |
229 | 275 | ||
230 | 276 | ||
231 | Polling period control file: | 277 | Polling period control file: |
232 | 278 | ||
233 | 'poll_msec' | 279 | 'edac_mc_poll_msec' |
234 | 280 | ||
235 | The time period, in milliseconds, for polling for error information. | 281 | The time period, in milliseconds, for polling for error information. |
236 | Too small a value wastes resources. Too large a value might delay | 282 | Too small a value wastes resources. Too large a value might delay |
@@ -241,7 +287,7 @@ Polling period control file: | |||
241 | 287 | ||
242 | LOAD TIME: module/kernel parameter: poll_msec=[0|1] | 288 | LOAD TIME: module/kernel parameter: poll_msec=[0|1] |
243 | 289 | ||
244 | RUN TIME: echo "1000" >/sys/devices/system/edac/mc/poll_msec | 290 | RUN TIME: echo "1000" >/sys/devices/system/edac/mc/edac_mc_poll_msec |
245 | 291 | ||
246 | 292 | ||
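The RUN TIME commands above write plain integers to the renamed control files. The following Python sketch shows the same reads and writes from a script. The helper names read_controls and write_control are hypothetical, the directory is the one named in the text, and writing normally requires root privileges.

```python
import os

# Directory and file names as given in the text above.
EDAC_MC_DIR = "/sys/devices/system/edac/mc"
CONTROLS = (
    "edac_mc_panic_on_ue",
    "edac_mc_log_ue",
    "edac_mc_log_ce",
    "edac_mc_poll_msec",
)


def read_controls(mc_dir=EDAC_MC_DIR):
    """Return {control name: integer value, or None if unreadable}."""
    values = {}
    for name in CONTROLS:
        try:
            with open(os.path.join(mc_dir, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


def write_control(name, value, mc_dir=EDAC_MC_DIR):
    """Write an integer to one control file, like `echo "1" > file`."""
    if name not in CONTROLS:
        raise ValueError("unknown EDAC control: %s" % name)
    with open(os.path.join(mc_dir, name), "w") as f:
        f.write("%d\n" % int(value))
```

For example, write_control("edac_mc_poll_msec", 1000) mirrors the last echo command shown above.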
247 | ============================================================================ | 293 | ============================================================================ |
@@ -587,3 +633,95 @@ Parity Count: | |||
587 | 633 | ||
588 | 634 | ||
589 | ======================================================================= | 635 | ======================================================================= |
636 | |||
637 | |||
638 | EDAC_DEVICE type of device | ||
639 | |||
640 | In the header file, edac_core.h, there is a series of edac_device structures | ||
641 | and APIs for the EDAC_DEVICE. | ||
642 | |||
643 | User space access to an edac_device is through the sysfs interface. | ||
644 | |||
645 | At the location /sys/devices/system/edac (sysfs) new edac_device devices will | ||
646 | appear. | ||
647 | |||
648 | There is a three level tree beneath the above 'edac' directory. For example, | ||
649 | the 'test_device_edac' device (found at the bluesmoke.sourceforge.net website) | ||
650 | installs itself as: | ||
651 | |||
652 | /sys/devices/system/edac/test-instance | ||
653 | |||
654 | In this directory are various controls, a symlink and one or more 'instance' | ||
655 | directories. | ||
656 | |||
657 | The standard default controls are: | ||
658 | |||
659 | log_ce boolean to log CE events | ||
660 | log_ue boolean to log UE events | ||
661 | panic_on_ue boolean to 'panic' the system if an UE is encountered | ||
662 | (default off, can be set true via startup script) | ||
663 | poll_msec time period between POLL cycles for events | ||
664 | |||
665 | The test_device_edac device adds at least one custom control of its own: | ||
666 | |||
667 | test_bits which in the current test driver does nothing but | ||
668 | show how it is installed. A ported driver can | ||
669 | add one or more such controls and/or attributes | ||
670 | for specific uses. | ||
671 | One out-of-tree driver uses controls here to allow | ||
672 | for ERROR INJECTION operations to hardware | ||
673 | injection registers | ||
674 | |||
675 | The symlink points to the 'struct dev' that is registered for this edac_device. | ||
676 | |||
677 | INSTANCES | ||
678 | |||
679 | One or more instance directories are present. For the 'test_device_edac' case: | ||
680 | |||
681 | test-instance0 | ||
682 | |||
683 | |||
684 | In this directory there are two default counter attributes, which are totals of | ||
685 | counters in deeper subdirectories. | ||
686 | |||
687 | ce_count total of CE events of subdirectories | ||
688 | ue_count total of UE events of subdirectories | ||
689 | |||
690 | BLOCKS | ||
691 | |||
692 | At the lowest directory level is the 'block' directory. There can be 0, 1 | ||
693 | or more blocks specified in each instance. | ||
694 | |||
695 | test-block0 | ||
696 | |||
697 | |||
698 | In this directory the default attributes are: | ||
699 | |||
700 | ce_count which is counter of CE events for this 'block' | ||
701 | of hardware being monitored | ||
702 | ue_count which is counter of UE events for this 'block' | ||
703 | of hardware being monitored | ||
704 | |||
705 | |||
706 | The 'test_device_edac' device adds 4 attributes and 1 control: | ||
707 | |||
708 | test-block-bits-0 for every POLL cycle this counter | ||
709 | is incremented | ||
710 | test-block-bits-1 every 10 cycles, this counter is bumped once, | ||
711 | and test-block-bits-0 is set to 0 | ||
712 | test-block-bits-2 every 100 cycles, this counter is bumped once, | ||
713 | and test-block-bits-1 is set to 0 | ||
714 | test-block-bits-3 every 1000 cycles, this counter is bumped once, | ||
715 | and test-block-bits-2 is set to 0 | ||
716 | |||
717 | |||
718 | reset-counters writing ANY thing to this control will | ||
719 | reset all the above counters. | ||
720 | |||
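The four test-block-bits counters described above behave like decimal odometer digits. The Python sketch below simulates that reading of the description. It is an interpretation for illustration, not the test driver's code, and the function name simulate_test_block_bits is hypothetical.

```python
def simulate_test_block_bits(poll_cycles):
    """Return [bits-0, bits-1, bits-2, bits-3] after the given number of polls.

    Assumed behavior: bits-0 is incremented on every poll; each time a
    counter reaches 10 it is set to 0 and the next counter is bumped once.
    bits-3 has no counter above it, so it is never reset here.
    """
    bits = [0, 0, 0, 0]
    for _ in range(poll_cycles):
        bits[0] += 1
        for i in range(3):
            if bits[i] == 10:
                bits[i] = 0
                bits[i + 1] += 1
    return bits
```

Under these assumptions, simulate_test_block_bits(1234) returns [4, 3, 2, 1]. Writing to reset-counters would correspond to starting again from [0, 0, 0, 0].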
721 | |||
722 | Use of the 'test_device_edac' driver should enable others to create their own | ||
723 | unique drivers for their hardware systems. | ||
724 | |||
725 | The 'test_device_edac' sample driver is located at the | ||
726 | bluesmoke.sourceforge.net project site for EDAC. | ||
727 | |||
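The three-level layout described above (device directory, instance directories, block directories) can be read from userspace with a short script. The Python sketch below assumes that instances and blocks are the subdirectories holding a ce_count file, and that the symlink in the device directory should be skipped. The function names are hypothetical.

```python
import os


def _read_int(path):
    """Read an integer attribute; return None if missing or malformed."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def _counted_subdirs(parent):
    """Yield (name, path) for real subdirectories that contain ce_count."""
    for name in sorted(os.listdir(parent)):
        path = os.path.join(parent, name)
        if os.path.islink(path) or not os.path.isdir(path):
            continue
        if os.path.exists(os.path.join(path, "ce_count")):
            yield name, path


def read_edac_device(device_dir):
    """Return {instance: {ce_count, ue_count, blocks: {block: {...}}}}."""
    result = {}
    for inst, inst_dir in _counted_subdirs(device_dir):
        blocks = {}
        for blk, blk_dir in _counted_subdirs(inst_dir):
            blocks[blk] = {
                "ce_count": _read_int(os.path.join(blk_dir, "ce_count")),
                "ue_count": _read_int(os.path.join(blk_dir, "ue_count")),
            }
        result[inst] = {
            "ce_count": _read_int(os.path.join(inst_dir, "ce_count")),
            "ue_count": _read_int(os.path.join(inst_dir, "ue_count")),
            "blocks": blocks,
        }
    return result
```

For the test device, read_edac_device("/sys/devices/system/edac/test-instance") would be expected to return one entry for test-instance0 containing a test-block0 block, provided the sample driver is loaded.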