edac.txt: Improve documentation, adding RAS introduction

edac.txt assumes that the reader already has deep knowledge of RAS features. However, this may not be the case. So, add an introductory chapter explaining the main concepts that are used by the EDAC subsystem and by other RAS drivers within the Kernel.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
author: Mauro Carvalho Chehab <mchehab@s-opensource.com> 2016-10-27 07:26:36 -0400
committer: Mauro Carvalho Chehab <mchehab@s-opensource.com> 2016-12-15 05:54:50 -0500
commit: 9c058d24ccb36d91650a84d9cbc27409f769d9a9 (patch)
tree: 8d6fe1e1bad380475ae8c249756e9149c3d20c30
parent: e4b5301674c0d2d866de767f02a44bc322af8d7f (diff)
1 files changed, 248 insertions, 39 deletions
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index 0c9161c9ed7a..2f8706bae5a4 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -1,18 +1,218 @@
 .. include:: <isonum.txt>
-=====================================
+============================================
+Reliability, Availability and Serviceability
+============================================
+RAS concepts
+************
+Reliability, Availability and Serviceability (RAS) is a concept used on
+servers meant to measure their robustness.
+Reliability
+  is the probability that a system will produce correct outputs.
+  * Generally measured as Mean Time Between Failures (MTBF)
+  * Enhanced by features that help to avoid, detect and repair hardware faults
+Availability
+  is the probability that a system is operational at a given time
+  * Generally measured as a percentage of downtime per period of time
+  * Often uses mechanisms to detect and correct hardware faults at
+    runtime
+Serviceability (or maintainability)
+  is the simplicity and speed with which a system can be repaired or
+  maintained
+  * Generally measured as Mean Time Between Repair (MTBR)
+Improving RAS
+-------------
+In order to reduce system downtime, a system should be capable of detecting
+hardware errors and, when possible, correcting them at runtime. It should
+also provide mechanisms to detect hardware degradation, in order to warn
+the system administrator to take the action of replacing a component before
+it causes data loss or system downtime.
+Among the monitoring measures, the most usual ones include:
+* CPU – detect errors at instruction execution and at L1/L2/L3 caches;
+* Memory – add error correction logic (ECC) to detect and correct errors;
+* I/O – add CRC checksums for transferred data;
+* Storage – RAID, journaling file systems, checksums,
+  Self-Monitoring, Analysis and Reporting Technology (SMART).
+By monitoring the number of occurrences of error detections, it is possible
+to identify if the probability of hardware errors is increasing and, in such
+a case, do preventive maintenance to replace a degraded component while
+those errors are correctable.
+Types of errors
+---------------
+Most mechanisms used on modern systems use technologies like Hamming
+Codes that allow error correction when the number of errors on a bit packet
+is below a threshold. If the number of errors is above it, those mechanisms
+can indicate with a high degree of confidence that an error happened, but
+they can't correct it.
+Also, sometimes an error occurs on a component that is not in use. For
+example, a part of the memory that is not currently allocated.
+That defines some categories of errors:
+* **Correctable Error (CE)** - the error detection mechanism detected and
+  corrected the error. Such errors are usually not fatal, although some
+  Kernel mechanisms allow the system administrator to consider them as fatal.
+* **Uncorrected Error (UE)** - the number of errors exceeded the error
+  correction threshold, and the system was unable to auto-correct.
+* **Fatal Error** - when a UE happens on a critical component of the
+  system (for example, a piece of the Kernel got corrupted by a UE), the
+  only reliable way to avoid data corruption is to hang or reboot the machine.
+* **Non-fatal Error** - when a UE happens on an unused component,
+  like a CPU in power down state or an unused memory bank, the system may
+  still run, eventually replacing the affected hardware by a hot spare,
+  if available.
+  Also, when an error happens on a userspace process, it is possible to
+  kill such process and let userspace restart it.
+The mechanism for handling non-fatal errors is usually complex and may
+require the help of some userspace application, in order to apply the
+policy desired by the system administrator.
+Identifying a bad hardware component
+------------------------------------
+Just detecting a hardware flaw is usually not enough, as the system needs
+to pinpoint the minimal replaceable unit (MRU) that should be exchanged
+to make the hardware reliable again.
+So, it requires not only error logging facilities, but also mechanisms that
+will translate the error message to the silkscreen or component label for
+the MRU.
+Typically, it is very complex for memory, as modern CPUs interleave memory
+from different memory modules, in order to provide better performance. The
+DMI BIOS usually has a list of memory module labels, which can be obtained
+using the ``dmidecode`` tool. For example, on a desktop machine, it shows::
+        Memory Device
+                Total Width: 64 bits
+                Data Width: 64 bits
+                Size: 16384 MB
+                Form Factor: SODIMM
+                Set: None
+                Locator: ChannelA-DIMM0
+                Bank Locator: BANK 0
+                Type: DDR4
+                Type Detail: Synchronous
+                Speed: 2133 MHz
+                Rank: 2
+                Configured Clock Speed: 2133 MHz
+In the above example, a DDR4 SO-DIMM memory module is located at the
+system's memory labeled as "BANK 0", as given by the *bank locator* field.
+Please notice that, on such system, the *total width* is equal to the
+*data width*. It means that such memory module doesn't have error
+detection/correction mechanisms.
+Unfortunately, not all systems use the same field to specify the memory
+bank. In this example, from an older server, ``dmidecode`` shows::
+        Memory Device
+                Array Handle: 0x1000
+                Error Information Handle: Not Provided
+                Total Width: 72 bits
+                Data Width: 64 bits
+                Size: 8192 MB
+                Form Factor: DIMM
+                Set: 1
+                Locator: DIMM_A1
+                Bank Locator: Not Specified
+                Type: DDR3
+                Type Detail: Synchronous Registered (Buffered)
+                Speed: 1600 MHz
+                Rank: 2
+                Configured Clock Speed: 1600 MHz
+There, the DDR3 RDIMM memory module is located at the system's memory labeled
+as "DIMM_A1", as given by the *locator* field. Please notice that this
+memory module has 64 bits of *data width* and 72 bits of *total width*. So,
+it has 8 extra bits to be used by error detection and correction mechanisms.
+Such kind of memory is called Error-correcting code memory (ECC memory).
+To make things even worse, it is not uncommon for systems with different
+labels on their boards to use exactly the same BIOS, meaning that
+the labels provided by the BIOS won't match the real ones.
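The width comparison described above can be sketched in a few lines of Python. This is only an illustration: the function name ``ecc_bits_by_locator`` is hypothetical, and it assumes the text layout of the two ``dmidecode`` listings shown above (a ``Memory Device`` header followed by ``Key: Value`` lines, with a blank line ending each record).

```python
import re


def ecc_bits_by_locator(dmidecode_text):
    """Return {locator: total_width - data_width} for each Memory Device.

    Hypothetical helper: assumes "Key: Value" lines under a
    "Memory Device" header, with a blank line ending each record.
    """
    devices = []
    current = None
    for line in dmidecode_text.splitlines():
        stripped = line.strip()
        if stripped == "Memory Device":
            current = {}
            devices.append(current)
        elif not stripped:
            current = None  # a blank line closes the current record
        elif current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()

    result = {}
    for fields in devices:
        total = re.match(r"(\d+) bits$", fields.get("Total Width", ""))
        data = re.match(r"(\d+) bits$", fields.get("Data Width", ""))
        if total and data:  # skip records with "Unknown" widths
            locator = fields.get("Locator", "unknown")
            result[locator] = int(total.group(1)) - int(data.group(1))
    return result
```

A result of 0 extra bits suggests a non-ECC module, while 8 extra bits on a 64-bit data width matches the ECC server example. The output of ``dmidecode -t memory`` could be fed to this function, keeping in mind that the labels may be wrong for the reasons given above.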
+ECC memory
+----------
+As mentioned in the previous section, ECC memory has extra bits to be
+used for error correction. So, on 64 bit systems, a memory module
+has 64 bits of *data width*, and 72 bits of *total width*. So, there are
+8 extra bits to be used for the error detection and correction
+mechanisms. Those extra bits are called *syndrome*\ [#f1]_\ [#f2]_.
+So, when the CPU requests the memory controller to write a word with
+*data width*, the memory controller calculates the *syndrome* in real time,
+using Hamming code, or some other error correction code, like SECDED+,
+producing a code with *total width* size. Such code is then written
+on the memory modules.
+At read, the *total width* bits code is converted back, using the same
+ECC code used on write, producing a word with *data width* and a *syndrome*.
+The word with *data width* is sent to the CPU, even when errors happen.
+The memory controller also looks at the *syndrome* in order to check if
+there was an error, and if the ECC code was able to fix such error.
+If the error was corrected, a Corrected Error (CE) happened. If not, an
+Uncorrected Error (UE) happened.
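The write and read paths described above can be illustrated with a toy code. The sketch below is not what any real memory controller implements: it assumes a tiny extended Hamming (8,4) SECDED code (4 data bits, 4 check bits) instead of the 64/72-bit codes used by ECC memory, and the names ``encode`` and ``decode`` are hypothetical. Here ``syndrome`` is the value recomputed at read time, in the usual coding-theory sense.

```python
def encode(data):
    """Encode a 4-bit value into an 8-bit SECDED codeword (list of bits).

    Index 0 holds the overall parity bit; indexes 1..7 form a
    Hamming(7,4) code with check bits at positions 1, 2 and 4.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(data >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (data, status), where status is "OK", "CE" or "UE"."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos        # XOR of the positions of all set bits
    odd_parity = sum(code) % 2     # 1 when an odd number of bits flipped

    if syndrome == 0 and not odd_parity:
        status = "OK"
    elif odd_parity:
        # Single-bit error: the syndrome is the position of the bad bit
        # (0 means the overall parity bit itself was flipped).
        code[syndrome] ^= 1
        status = "CE"
    else:
        # Non-zero syndrome with consistent overall parity: two bits
        # flipped, which is detected but cannot be corrected.
        status = "UE"

    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return data, status
```

As in the description above, ``decode`` always returns a data word, but the word is only trustworthy when the status is not ``"UE"``.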
+The information about the CE/UE errors is stored on some special registers
+at the memory controller and can be accessed by reading such registers,
+either by BIOS, by some special CPUs or by the Linux EDAC driver. On x86 64
+bit CPUs, such errors can also be retrieved via the Machine Check
+Architecture (MCA)\ [#f3]_.
+.. [#f1] Please notice that several memory controllers allow operation on a
+  mode called "Lock-Step", where it groups two memory modules together,
+  doing 128-bit reads/writes. That gives 16 bits for error correction, which
+  significantly improves the error correction mechanism, at the expense
+  that, when an error happens, there's no way to know which memory module is
+  to blame. So, it has to blame both memory modules.
+.. [#f2] Some memory controllers also allow using memory in mirror mode.
+  On such mode, the same data is written to two memory modules. At read,
+  the system checks both memory modules, in order to check if both provide
+  identical data. On such configuration, when an error happens, there's no
+  way to know which memory module is to blame. So, it has to blame both
+  memory modules (or 4 memory modules, if the system is also on Lock-step
+  mode).
+.. [#f3] For more details about the Machine Check Architecture (MCA),
+  please read Documentation/x86/x86_64/machinecheck in the Kernel tree.
 EDAC - Error Detection And Correction
-=====================================
+*************************************
 .. note::
-   "bluesmoke" was the name for this device driver when it
+   "bluesmoke" was the name for this device driver subsystem when it
   was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
   That site is mostly archaic now and can be used only for historical
   purposes.
-   When the subsystem was pushed into 2.6.16 for the first time, it was
-   renamed to ``EDAC``.
+   When the subsystem was pushed upstream for the first time, on
+   Kernel 2.6.16, it was renamed to ``EDAC``.
 Purpose
 -------
@@ -33,7 +233,7 @@ CE events only, the system can and will continue to operate as no data
 has been damaged yet.
 However, preventive maintenance and proactive part replacement of memory
-DIMMs exhibiting CEs can reduce the likelihood of the dreaded UE events
+modules exhibiting CEs can reduce the likelihood of the dreaded UE events
 and system panics.
 Other hardware elements
@@ -124,37 +324,47 @@ Within this directory there currently reside 2 components:
 Memory Controller (mc) Model
 ----------------------------
-Each ``mc`` device controls a set of DIMM memory modules. These modules
+Each ``mc`` device controls a set of memory modules [#f4]_. These modules
 are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
 There can be multiple csrows and multiple channels.
+.. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
+  used to refer to a memory module, although there are other memory
+  packaging alternatives, like SO-DIMM, SIMM, etc. Throughout this document,
+  and inside the EDAC system, the term "dimm" is used for all memory
+  modules, even when they use a different kind of packaging.
 Memory controllers allow for several csrows, with 8 csrows being a
 typical value. Yet, the actual number of csrows depends on the layout of
-a given motherboard, memory controller and DIMM characteristics.
+a given motherboard, memory controller and memory module characteristics.
-Dual channels allows for 128 bit data transfers to/from the CPU from/to
-memory. Some newer chipsets allow for more than 2 channels, like Fully
-Buffered DIMMs (FB-DIMMs). The following example will assume 2 channels:
+Dual channels allow for dual data length (e. g. 128 bits, on 64 bit systems)
+data transfers to/from the CPU from/to memory. Some newer chipsets allow
+for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
+controllers. The following example will assume 2 channels:
-        +--------+-----------+-----------+
-        |        | Channel 0 | Channel 1 |
-        +========+===========+===========+
-        | csrow0 |  DIMM_A0  |  DIMM_B0  |
-        +--------+           |           |
-        | csrow1 |           |           |
-        +--------+-----------+-----------+
-        | csrow2 |  DIMM_A1  | DIMM_B1   |
-        +--------+           |           |
-        | csrow3 |           |           |
-        +--------+-----------+-----------+
+        +------------+-----------------------+
+        | Chip       |       Channels        |
+        | Select     +-----------+-----------+
+        | rows       |  ``ch0``  |  ``ch1``  |
+        +============+===========+===========+
+        | ``csrow0`` |  DIMM_A0  |  DIMM_B0  |
+        +------------+           |           |
+        | ``csrow1`` |           |           |
+        +------------+-----------+-----------+
+        | ``csrow2`` |  DIMM_A1  | DIMM_B1   |
+        +------------+           |           |
+        | ``csrow3`` |           |           |
+        +------------+-----------+-----------+
-In the above example table there are 4 physical slots on the motherboard
+In the above example, there are 4 physical slots on the motherboard
 for memory DIMMs:
-        - DIMM_A0
-        - DIMM_B0
-        - DIMM_A1
-        - DIMM_B1
+        +---------+---------+
+        | DIMM_A0 | DIMM_B0 |
+        +---------+---------+
+        | DIMM_A1 | DIMM_B1 |
+        +---------+---------+
 Labels for these slots are usually silk-screened on the motherboard.
 Slots labeled ``A`` are channel 0 in this example. Slots labeled ``B`` are
@@ -165,15 +375,16 @@ Channel, the csrows cross both DIMMs.
 Memory DIMMs come single or dual "ranked". A rank is a populated csrow.
 Thus, 2 single ranked DIMMs, placed in slots DIMM_A0 and DIMM_B0 above
-will have 1 csrow, csrow0. csrow1 will be empty. On the other hand,
-when 2 dual ranked DIMMs are similarly placed, then both csrow0 and
-csrow1 will be populated. The pattern repeats itself for csrow2 and
+will have just one csrow (csrow0). csrow1 will be empty. On the other
+hand, when 2 dual ranked DIMMs are similarly placed, then both csrow0
+and csrow1 will be populated. The pattern repeats itself for csrow2 and
 csrow3.
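The rank-to-csrow relationship above can be sketched as a small function. It is only a model of the dual-channel example table, under the assumption that each slot pair (DIMM_Ai / DIMM_Bi) owns two consecutive csrows; the name ``populated_csrows`` is hypothetical and not part of EDAC.

```python
def populated_csrows(ranks_per_slot_pair):
    """Map the rank count of each slot pair to populated csrow names.

    ranks_per_slot_pair[i] is the number of ranks (0, 1 or 2) of the
    identical DIMMs placed in slot pair i (DIMM_Ai and DIMM_Bi).
    Assumes the example layout: pair i owns csrow(2i) and csrow(2i+1).
    """
    csrows = []
    for pair_index, ranks in enumerate(ranks_per_slot_pair):
        if ranks not in (0, 1, 2):
            raise ValueError("this sketch only models 0, 1 or 2 ranks")
        first = 2 * pair_index
        # One csrow is populated per rank, starting at the pair's first csrow.
        csrows.extend(f"csrow{first + rank}" for rank in range(ranks))
    return csrows
```

For instance, single ranked DIMMs in the first slot pair and dual ranked DIMMs in the second would be modeled as ``populated_csrows([1, 2])``.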
 The representation of the above is reflected in the directory
 tree in EDAC's sysfs interface. Starting in directory
-/sys/devices/system/edac/mc each memory controller will be represented
-by its own ``mcX`` directory, where ``X`` is the index of the MC::
+``/sys/devices/system/edac/mc``, each memory controller will be
+represented by its own ``mcX`` directory, where ``X`` is the
+index of the MC::
        ..../edac/mc/
                   |
@@ -198,11 +409,9 @@ order to have dual-channel mode be operational. Since both csrow2 and
 csrow3 are populated, this indicates a dual ranked set of DIMMs for
 channels 0 and 1.
 Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
 control and attribute files.
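As an illustration of walking this directory tree, the following Python sketch collects per-controller error counters. It assumes the ``ce_count`` and ``ue_count`` attribute files of the ``mcX`` directories, which are described in parts of edac.txt not shown in this diff; the function name ``read_mc_counts`` is hypothetical, and the base path is a parameter so that the code can be tried on a machine without EDAC hardware.

```python
import os
import re


def read_mc_counts(base="/sys/devices/system/edac/mc"):
    """Return {"mcX": {"ce_count": int or None, "ue_count": int or None}}.

    Hypothetical helper: a counter is None when its file is missing
    or unreadable; an absent base directory yields an empty dict.
    """
    counts = {}
    if not os.path.isdir(base):
        return counts
    for name in sorted(os.listdir(base)):
        if not re.fullmatch(r"mc\d+", name):
            continue  # skip anything that is not an mcX directory
        entry = {}
        for attr in ("ce_count", "ue_count"):
            path = os.path.join(base, name, attr)
            try:
                with open(path) as attr_file:
                    entry[attr] = int(attr_file.read().strip())
            except (OSError, ValueError):
                entry[attr] = None
        counts[name] = entry
    return counts
```

Sampling these counters periodically is one way a userspace tool could notice a growing number of correctable errors before they turn into uncorrected ones.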
 ``mcX`` directories
 -------------------
@@ -338,10 +547,10 @@ this ``X`` memory module:
 ``csrowX`` directories
 ----------------------
-When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the csrowX
+When CONFIG_EDAC_LEGACY_SYSFS is enabled, sysfs will contain the ``csrowX``
 directories. As this API doesn't work properly for Rambus, FB-DIMMs and
 modern Intel Memory Controllers, this is being deprecated in favor of
-dimmX directories.
+``dimmX`` directories.
 In the ``csrowX`` directories are EDAC control and attribute files for
 this ``X`` instance of csrow: