authorMauro Carvalho Chehab <mchehab@s-opensource.com>2016-10-27 07:26:36 -0400
committerMauro Carvalho Chehab <mchehab@s-opensource.com>2016-12-15 05:54:50 -0500
commit9c058d24ccb36d91650a84d9cbc27409f769d9a9 (patch)
tree8d6fe1e1bad380475ae8c249756e9149c3d20c30
parente4b5301674c0d2d866de767f02a44bc322af8d7f (diff)
edac.txt: Improve documentation, adding RAS introduction
The edac.txt assumes that the reader has already deep knowledge on RAS features. However, this may not be the case. So, add an introduction chapter explaining the main concepts that are used by the EDAC subsystem and by other RAS drivers within the Kernel. Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
-rw-r--r--Documentation/edac.txt287
1 files changed, 248 insertions, 39 deletions
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index 0c9161c9ed7a..2f8706bae5a4 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -1,18 +1,218 @@
1.. include:: <isonum.txt> 1.. include:: <isonum.txt>
2 2
3===================================== 3============================================
4Reliability, Availability and Serviceability
5============================================
6
7RAS concepts
8************
9
10Reliability, Availability and Serviceability (RAS) is a concept used on
11servers meant to measure their robustness.
12
13Reliability
14 is the probability that a system will produce correct outputs.
15
16 * Generally measured as Mean Time Between Failures (MTBF)
17 * Enhanced by features that help to avoid, detect and repair hardware faults
18
19Availability
20 is the probability that a system is operational at a given time
21
22 * Generally measured as a percentage of downtime over a period of time
23 * Often uses mechanisms to detect and correct hardware faults in
24 runtime;
25
26Serviceability (or maintainability)
27 is the simplicity and speed with which a system can be repaired or
28 maintained
29
30 * Generally measured on Mean Time Between Repair (MTBR)
31
32Improving RAS
33-------------
34
35In order to reduce system downtime, a system should be capable of detecting
36hardware errors and, when possible, correcting them at runtime. It should
37also provide mechanisms to detect hardware degradation, in order to warn
38the system administrator to take the action of replacing a component before
39it causes data loss or system downtime.
40
41Among the monitoring measures, the most usual ones include:
42
43* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
44* Memory – add error correction logic (ECC) to detect and correct errors;
45* I/O – add CRC checksums for transferred data;
46* Storage – RAID, journal file systems, checksums,
47 Self-Monitoring, Analysis and Reporting Technology (SMART).
48
49By monitoring the number of detected errors, it is possible
50to identify whether the probability of hardware errors is increasing and, in that
51case, to do preventive maintenance, replacing a degraded component while
52those errors are still correctable.
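
The monitoring idea above can be sketched as a simple trend check. This is only an illustration: the function name, the sample format and the ``min_growth`` threshold are assumptions made for this example, not part of EDAC or of any existing tool.

```python
def ce_rate_increasing(samples, min_growth=2.0):
    """Tell whether correctable errors are arriving faster than before.

    samples: list of (timestamp_in_seconds, cumulative_ce_count) pairs,
    oldest first. Both the format and min_growth are hypothetical.
    Returns True when the CE rate over the newer half of the window is
    at least min_growth times the rate over the older half.
    """
    if len(samples) < 3:
        return False  # not enough data to compare two intervals

    mid = len(samples) // 2

    def rate(first, last):
        (t0, c0), (t1, c1) = first, last
        return (c1 - c0) / (t1 - t0) if t1 > t0 else 0.0

    early = rate(samples[0], samples[mid])
    late = rate(samples[mid], samples[-1])
    if early == 0.0:
        # No errors at all in the older half: any new error is a change.
        return late > 0.0
    return late >= min_growth * early
```

A real policy would likely also look at which component reported the errors, since a rising count only helps if it can be tied to a replaceable part.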
53
54Types of errors
55---------------
56
57Most mechanisms used on modern systems use technologies like Hamming
58codes, which allow error correction when the number of errors in a bit packet
59is below a threshold. If the number of errors is above it, those mechanisms
60can indicate with a high degree of confidence that an error happened, but
61they can't correct it.
62
63Also, sometimes an error occurs on a component that is not in use. For
64example, on a part of the memory that is not currently allocated.
65
66That defines some categories of errors:
67
68* **Correctable Error (CE)** - the error detection mechanism detected and
69 corrected the error. Such errors are usually not fatal, although some
70 Kernel mechanisms allow the system administrator to consider them as fatal.
71
72* **Uncorrected Error (UE)** - the number of errors exceeded the error
73 correction threshold, and the system was unable to auto-correct them.
74
75* **Fatal Error** - when a UE happens on a critical component of the
76 system (for example, a piece of the Kernel got corrupted by a UE), the
77 only reliable way to avoid data corruption is to hang or reboot the machine.
78
79* **Non-fatal Error** - when a UE happens on an unused component,
80 like a CPU in power down state or an unused memory bank, the system may
81 still run, eventually replacing the affected hardware by a hot spare,
82 if available.
83
84 Also, when an error happens on a userspace process, it is possible to
85 kill such process and let userspace restart it.
86
87The mechanism for handling non-fatal errors is usually complex and may
88require the help of some userspace application, in order to apply the
89policy desired by the system administrator.
90
91Identifying a bad hardware component
92------------------------------------
93
94Just detecting a hardware flaw is usually not enough, as the system needs
95to pinpoint the minimal replaceable unit (MRU) that should be exchanged
96to make the hardware reliable again.
97
98So, it requires not only error logging facilities, but also mechanisms that
99will translate the error message to the silkscreen or component label for
100the MRU.
101
102Typically, this is very complex for memory, as modern CPUs interleave memory
103from different memory modules, in order to provide better performance. The
104DMI BIOS usually has a list of memory module labels, which can be obtained
105using the ``dmidecode`` tool. For example, on a desktop machine, it shows::
106
107 Memory Device
108 Total Width: 64 bits
109 Data Width: 64 bits
110 Size: 16384 MB
111 Form Factor: SODIMM
112 Set: None
113 Locator: ChannelA-DIMM0
114 Bank Locator: BANK 0
115 Type: DDR4
116 Type Detail: Synchronous
117 Speed: 2133 MHz
118 Rank: 2
119 Configured Clock Speed: 2133 MHz
120
121In the above example, a DDR4 SO-DIMM memory module is located at the
122system's memory labeled as "BANK 0", as given by the *bank locator* field.
123Please notice that, on such system, the *total width* is equal to the
124*data width*. It means that such memory module doesn't have error
125detection/correction mechanisms.
126
127Unfortunately, not all systems use the same field to specify the memory
128bank. In this example, from an older server, ``dmidecode`` shows::
129
130 Memory Device
131 Array Handle: 0x1000
132 Error Information Handle: Not Provided
133 Total Width: 72 bits
134 Data Width: 64 bits
135 Size: 8192 MB
136 Form Factor: DIMM
137 Set: 1
138 Locator: DIMM_A1
139 Bank Locator: Not Specified
140 Type: DDR3
141 Type Detail: Synchronous Registered (Buffered)
142 Speed: 1600 MHz
143 Rank: 2
144 Configured Clock Speed: 1600 MHz
145
146There, the DDR3 RDIMM memory module is located at the system's memory labeled
147as "DIMM_A1", as given by the *locator* field. Please notice that this
148memory module has 64 bits of *data width* and 72 bits of *total width*. So,
149it has 8 extra bits to be used by error detection and correction mechanisms.
150This kind of memory is called Error-correcting code memory (ECC memory).
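
The width comparison described above can be automated. The sketch below assumes ``dmidecode`` output in the same textual layout as the two listings shown here (records separated by blank lines, ``Field: value`` pairs); the function name is made up for this example.

```python
import re


def ecc_bits_per_module(dmidecode_text):
    """Map each module's Locator to (Total Width - Data Width), in bits.

    A result of 0 means the module has no extra bits for error
    detection/correction. Records whose widths are missing or not
    numeric (for example "Unknown") are skipped.
    """
    result = {}
    # dmidecode separates records with blank lines
    for block in re.split(r"\n\s*\n", dmidecode_text):
        if "Memory Device" not in block:
            continue
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        try:
            total = int(fields["Total Width"].split()[0])
            data = int(fields["Data Width"].split()[0])
        except (KeyError, ValueError, IndexError):
            continue
        result[fields.get("Locator", "unknown")] = total - data
    return result
```

Fed with the two listings above, it would report 0 extra bits for ``ChannelA-DIMM0`` and 8 for ``DIMM_A1``. Keep in mind that, as this section explains, the labels reported by the BIOS are not always trustworthy.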
151
152To make things even worse, it is not uncommon for systems with different
153labels on their system boards to use exactly the same BIOS, meaning that
154the labels provided by the BIOS won't match the real ones.
155
156ECC memory
157----------
158
159As mentioned in the previous section, ECC memory has extra bits to be
160used for error correction. So, on 64 bit systems, a memory module
161has 64 bits of *data width*, and 72 bits of *total width*. So, there are
1628 extra bits to be used for the error detection and correction
163mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.
164
165So, when the CPU requests the memory controller to write a word with
166*data width*, the memory controller calculates the *syndrome* in real time,
167using Hamming code, or some other error correction code, like SECDED+,
168producing a code with *total width* size. Such code is then written
169to the memory modules.
170
171At read, the *total width* bits code is converted back, using the same
172ECC code used on write, producing a word with *data width* and a *syndrome*.
173The word with *data width* is sent to the CPU, even when errors happen.
174
175The memory controller also looks at the *syndrome* in order to check if
176there was an error, and if the ECC code was able to fix such error.
177If the error was corrected, a Corrected Error (CE) happened. If not, an
178Uncorrected Error (UE) happened.
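
The write/read flow above can be illustrated with a toy code. The sketch below is an extended Hamming code with 4 data bits and 4 check bits, chosen only because it is small; real memory controllers use wider codes (such as 64 data bits plus 8 check bits) implemented in hardware, and the function names are made up for this example. In this sketch the syndrome is the value recomputed at read time.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit codeword (list of 0/1 values).

    Index 0 holds an overall parity bit; indexes 1..7 hold a
    Hamming(7,4) code with check bits at positions 1, 2 and 4.
    """
    data = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    for p in (1, 2, 4):
        # Check bit p covers every position whose index has bit p set.
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status), where status is "ok", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for i in range(1, 8):
        if code[i]:
            syndrome ^= i  # ends up as the index of a single flipped bit
    odd_parity = sum(code) % 2  # 1 when an odd number of bits flipped
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Treated as a single-bit error; syndrome 0 means the overall
        # parity bit itself was hit.
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with even parity: two bits flipped.
        status = "UE"
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return nibble, status
```

As in the description above, ``decode()`` always returns a data word, even for a UE; it is the status that tells whether the word can be trusted.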
179
180The information about the CE/UE errors is stored on some special registers
181at the memory controller and can be accessed by reading such registers,
182either by the BIOS, by some special CPUs or by the Linux EDAC driver. On x86 64
183bit CPUs, such errors can also be retrieved via the Machine Check
184Architecture (MCA)\ [#f3]_.
185
186.. [#f1] Please notice that several memory controllers allow operation in a
187 mode called "Lock-Step", where they group two memory modules together,
188 doing 128-bit reads/writes. That gives 16 bits for error correction, which
189 significantly improves the error correction mechanism, at the expense
190 that, when an error happens, there's no way to know which memory module is
191 to blame. So, it has to blame both memory modules.
192
193.. [#f2] Some memory controllers also allow using memory in mirror mode.
194 On such mode, the same data is written to two memory modules. At read,
195 the system checks both memory modules, in order to check if both provide
196 identical data. On such configuration, when an error happens, there's no
197 way to know what memory module is to blame. So, it has to blame both
198 memory modules (or 4 memory modules, if the system is also on Lock-step
199 mode).
200
201.. [#f3] For more details about the Machine Check Architecture (MCA),
202 please read Documentation/x86/x86_64/machinecheck at the Kernel tree.
203
4EDAC - Error Detection And Correction 204EDAC - Error Detection And Correction
5===================================== 205*************************************
6 206
7.. note:: 207.. note::
8 208
9 "bluesmoke" was the name for this device driver when it 209 "bluesmoke" was the name for this device driver subsystem when it
10 was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net. 210 was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
11 That site is mostly archaic now and can be used only for historical 211 That site is mostly archaic now and can be used only for historical
12 purposes. 212 purposes.
13 213
14 When the subsystem was pushed into 2.6.16 for the first time, it was 214 When the subsystem was pushed upstream for the first time, on
15 renamed to ``EDAC``. 215 Kernel 2.6.16, it was renamed to ``EDAC``.
16 216
17Purpose 217Purpose
18------- 218-------
@@ -33,7 +233,7 @@ CE events only, the system can and will continue to operate as no data
33has been damaged yet. 233has been damaged yet.
34 234
35However, preventive maintenance and proactive part replacement of memory 235However, preventive maintenance and proactive part replacement of memory
36DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events 236modules exhibiting CEs can reduce the likelihood of the dreaded UE events
37and system panics. 237and system panics.
38 238
39Other hardware elements 239Other hardware elements
@@ -124,37 +324,47 @@ Within this directory there currently reside 2 components:
124Memory Controller (mc) Model 324Memory Controller (mc) Model
125---------------------------- 325----------------------------
126 326
127Each ``mc`` device controls a set of DIMM memory modules. These modules 327Each ``mc`` device controls a set of memory modules [#f4]_. These modules
128are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``). 328are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
129There can be multiple csrows and multiple channels. 329There can be multiple csrows and multiple channels.
130 330
331.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
332 used to refer to a memory module, although there are other memory
333 packaging alternatives, like SO-DIMM, SIMM, etc. Throughout this document,
334 and inside the EDAC system, the term "dimm" is used for all memory
335 modules, even when they use a different kind of packaging.
336
131Memory controllers allow for several csrows, with 8 csrows being a 337Memory controllers allow for several csrows, with 8 csrows being a
132typical value. Yet, the actual number of csrows depends on the layout of 338typical value. Yet, the actual number of csrows depends on the layout of
133a given motherboard, memory controller and DIMM characteristics. 339a given motherboard, memory controller and memory module characteristics.
134 340
135Dual channels allows for 128 bit data transfers to/from the CPU from/to 341Dual channels allow for dual data length (e. g. 128 bits, on 64 bit systems)
136memory. Some newer chipsets allow for more than 2 channels, like Fully 342data transfers to/from the CPU from/to memory. Some newer chipsets allow
137Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels: 343for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
138 344controllers. The following example will assume 2 channels:
139 +--------+-----------+-----------+ 345
140 | | Channel 0 | Channel 1 | 346 +------------+-----------------------+
141 +========+===========+===========+ 347 | Chip | Channels |
142 | csrow0 | DIMM_A0 | DIMM_B0 | 348 | Select +-----------+-----------+
143 +--------+ | | 349 | rows | ``ch0`` | ``ch1`` |
144 | csrow1 | | | 350 +============+===========+===========+
145 +--------+-----------+-----------+ 351 | ``csrow0`` | DIMM_A0 | DIMM_B0 |
146 | csrow2 | DIMM_A1 | DIMM_B1 | 352 +------------+ | |
147 +--------+ | | 353 | ``csrow1`` | | |
148 | csrow3 | | | 354 +------------+-----------+-----------+
149 +--------+-----------+-----------+ 355 | ``csrow2`` | DIMM_A1 | DIMM_B1 |
150 356 +------------+ | |
151In the above example table there are 4 physical slots on the motherboard 357 | ``csrow3`` | | |
358 +------------+-----------+-----------+
359
360In the above example, there are 4 physical slots on the motherboard
152for memory DIMMs: 361for memory DIMMs:
153 362
154 - DIMM_A0 363 +---------+---------+
155 - DIMM_B0 364 | DIMM_A0 | DIMM_B0 |
156 - DIMM_A1 365 +---------+---------+
157 - DIMM_B1 366 | DIMM_A1 | DIMM_B1 |
367 +---------+---------+
158 368
159Labels for these slots are usually silk-screened on the motherboard. 369Labels for these slots are usually silk-screened on the motherboard.
160Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are 370Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
@@ -165,15 +375,16 @@ Channel, the csrows cross both DIMMs.
165 375
166Memory DIMMs come single or dual "ranked". A rank is a populated csrow. 376Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
167Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above 377Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
168will have 1 csrow, csrow0. csrow1 will be empty. On the other hand, 378will have just one csrow (csrow0). csrow1 will be empty. On the other
169when 2 dual ranked DIMMs are similarly placed, then both csrow0 and 379hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0
170csrow1 will be populated. The pattern repeats itself for csrow2 and 380and csrow1 will be populated. The pattern repeats itself for csrow2 and
171csrow3. 381csrow3.
172 382
173The representation of the above is reflected in the directory 383The representation of the above is reflected in the directory
174tree in EDAC's sysfs interface. Starting in directory 384tree in EDAC's sysfs interface. Starting in directory
175/sys/devices/system/edac/mc each memory controller will be represented 385``/sys/devices/system/edac/mc``, each memory controller will be
176by its own ``mcX`` directory, where ``X`` is the index of the MC:: 386represented by its own ``mcX`` directory, where ``X`` is the
387index of the MC::
177 388
178 ..../edac/mc/ 389 ..../edac/mc/
179 | 390 |
@@ -198,11 +409,9 @@ order to have dual-channel mode be operational. Since both csrow2 and
198csrow3 are populated, this indicates a dual ranked set of DIMMs for 409csrow3 are populated, this indicates a dual ranked set of DIMMs for
199channels 0 and 1. 410channels 0 and 1.
200 411
201
202Within each of the ``mcX`` and ``csrowX`` directories are several EDAC 412Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
203control and attribute files. 413control and attribute files.
204 414
205
206``mcX`` directories 415``mcX`` directories
207------------------- 416-------------------
208 417
@@ -338,10 +547,10 @@ this ``X`` memory module:
338``csrowX`` directories 547``csrowX`` directories
339---------------------- 548----------------------
340 549
341When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX 550When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
342directories. As this API doesn't work properly for Rambus, FB-DIMMs and 551directories. As this API doesn't work properly for Rambus, FB-DIMMs and
343modern Intel Memory Controllers, this is being deprecated in favor of 552modern Intel Memory Controllers, this is being deprecated in favor of
344dimmX directories. 553``dimmX`` directories.
345 554
346In the ``csrowX`` directories are EDAC control and attribute files for 555In the ``csrowX`` directories are EDAC control and attribute files for
347this ``X`` instance of csrow: 556this ``X`` instance of csrow: