author | Mauro Carvalho Chehab <mchehab@redhat.com> | 2009-08-05 20:16:56 -0400 |
---|---|---|
committer | Mauro Carvalho Chehab <mchehab@redhat.com> | 2010-05-10 10:44:55 -0400 |
commit | 31983a04d686f9f90b356072089d8d677e40e776 (patch) | |
tree | 7f5df1f504f35fe474785c4366643eff03788267 | |
parent | 4157d9f55435331deef01ba8a9a47f248c042fb2 (diff) |
Documentation/edac.txt: Add Nehalem specific EDAC characteristics
As Nehalem has a different binding to the EDAC API, and its own
error injection code, document it.
Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
-rw-r--r-- | Documentation/edac.txt | 110 |
1 files changed, 110 insertions, 0 deletions
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index 79c533223762..8bc320467c64 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -6,6 +6,8 @@ Written by Doug Thompson <dougthompson@xmission.com> | |||
6 | 7 Dec 2005 | 6 | 7 Dec 2005 |
7 | 17 Jul 2007 Updated | 7 | 17 Jul 2007 Updated |
8 | 8 | ||
9 | (c) Mauro Carvalho Chehab <mchehab@redhat.com> | ||
10 | 05 Aug 2009 Nehalem interface | ||
9 | 11 | ||
10 | EDAC is maintained and written by: | 12 | EDAC is maintained and written by: |
11 | 13 | ||
@@ -717,3 +719,111 @@ unique drivers for their hardware systems. | |||
717 | The 'test_device_edac' sample driver is located at the | 719 | The 'test_device_edac' sample driver is located at the |
718 | bluesmoke.sourceforge.net project site for EDAC. | 720 | bluesmoke.sourceforge.net project site for EDAC. |
719 | 721 | ||
722 | ======================================================================= | ||
723 | NEHALEM USAGE OF EDAC APIs | ||
724 | |||
725 | This chapter documents some EXPERIMENTAL mappings of the EDAC API used by | ||
726 | the Nehalem EDAC driver. They will likely change in future versions | ||
727 | of the driver. | ||
728 | |||
729 | Due to the way Nehalem exports Memory Controller data, some adjustments | ||
730 | were made in the i7core_edac driver. This chapter covers those differences. | ||
731 | |||
732 | 1) On Nehalem, there is one Memory Controller per QuickPath Interconnect | ||
733 | (QPI). In the driver, the term "socket" means one QPI. It should also be | ||
734 | associated with the physical CPU socket. | ||
735 | |||
736 | Each MC has 3 physical read channels, 3 physical write channels and | ||
737 | 3 logical channels. The driver currently sees them as just 3 channels. | ||
738 | Each channel can have up to 3 DIMMs. | ||
739 | |||
740 | The smallest known unit is the DIMM. There is no information about csrows. | ||
741 | As the EDAC API treats the csrow as the smallest unit, the driver exports | ||
742 | one DIMM per csrow. | ||
743 | |||
744 | Currently, it also exports the several memory controllers as just one. This | ||
745 | limit will be removed in future versions of the driver. | ||
746 | |||
747 | 2) The Nehalem MC has the ability to generate errors. The driver exposes | ||
748 | this functionality via some error injection nodes: | ||
749 | |||
750 | For injecting a memory error, there are some sysfs nodes, under | ||
751 | /sys/devices/system/edac/mc/mc0/: | ||
752 | |||
753 | inject_addrmatch: | ||
754 | Controls the error injection mask register. It is possible to specify | ||
755 | several characteristics of the address to match an error code: | ||
756 | dimm = the affected dimm. Numbers are relative to a channel; | ||
757 | rank = the memory rank; | ||
758 | channel = the channel that will generate an error; | ||
759 | bank = the affected bank; | ||
760 | page = the page address; | ||
761 | column (or col) = the address column. | ||
762 | Each of the above values can be set to "any" to match any valid value. | ||
763 | |||
764 | At driver init, all values are set to any. | ||
765 | |||
766 | For example, to generate an error at rank 1 of dimm 2, for any channel, | ||
767 | any bank, any page, any column: | ||
768 | echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch | ||
769 | |||
770 | To return to the default behaviour of matching any, you can do: | ||
771 | echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch | ||
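The "field:value" strings used in the two echo examples above can be assembled and checked with a small shell helper before they are written to inject_addrmatch. This is only a sketch: addrmatch_string is a hypothetical name, and accepting only decimal numbers or "any" as values is an assumption, since the text does not say which number formats the driver parses.

```shell
# Hypothetical helper: validate "field:value" pairs and join them with spaces.
# Field names are the ones listed for inject_addrmatch in the text.
# Assumption: values are either "any" or a plain decimal number.
addrmatch_string() {
    out=""
    for pair in "$@"; do
        key=${pair%%:*}
        val=${pair#*:}
        case "$key" in
            dimm|rank|channel|bank|page|column|col) ;;
            *) echo "unknown field: $key" >&2; return 1 ;;
        esac
        case "$val" in
            any) ;;
            ''|*[!0-9]*) echo "bad value for $key: $val" >&2; return 1 ;;
        esac
        out="${out:+$out }$key:$val"
    done
    echo "$out"
}
```

For instance, the output of `addrmatch_string dimm:2 rank:1` could be redirected to the inject_addrmatch node in place of the literal string.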
772 | |||
773 | inject_eccmask: | ||
774 | specifies which bits will be affected by the error. | ||
775 | |||
776 | inject_section: | ||
777 | specifies what ECC cache section will get the error: | ||
778 | 3 for both | ||
779 | 2 for the highest | ||
780 | 1 for the lowest | ||
781 | |||
782 | inject_socket: | ||
783 | specifies which QPI (or processor socket) will generate the error. | ||
784 | On Xeon 35xx, it should be 0. | ||
785 | On Xeon 55xx, it should be 0 or 1. | ||
786 | |||
787 | inject_type: | ||
788 | specifies the type of error, being a combination of the following bits: | ||
789 | bit 0 - repeat | ||
790 | bit 1 - ecc | ||
791 | bit 2 - parity | ||
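Since inject_type is a combination of bits, the number to write is the bitwise OR of the selected flags. A minimal sketch of that arithmetic follows; inject_type_mask is a hypothetical helper name, and only the bit positions are taken from the list above.

```shell
# Hypothetical helper: turn flag names into the numeric inject_type value.
# Bit positions as listed in the text: bit 0 = repeat, bit 1 = ecc, bit 2 = parity.
inject_type_mask() {
    mask=0
    for flag in "$@"; do
        case "$flag" in
            repeat) mask=$((mask | 1)) ;;   # bit 0
            ecc)    mask=$((mask | 2)) ;;   # bit 1
            parity) mask=$((mask | 4)) ;;   # bit 2
            *) echo "unknown flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$mask"
}
```

With this helper, `inject_type_mask ecc` yields 2 and `inject_type_mask repeat ecc` yields 3.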
792 | |||
793 | inject_enable starts the error generation when a non-zero value | ||
794 | is written. | ||
795 | |||
796 | All inject variables can be read. Root permission is needed to write them. | ||
797 | |||
798 | The datasheet states that the error will only be generated after a write to | ||
799 | an address that matches inject_addrmatch. It seems, however, that reading | ||
800 | will also produce an error. | ||
801 | |||
802 | For example, the following code will generate an error for any write access | ||
803 | at socket 0, on any DIMM/address on channel 2: | ||
804 | |||
805 | echo "channel:2" > /sys/devices/system/edac/mc/mc0/inject_addrmatch | ||
806 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_type | ||
807 | echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask | ||
808 | echo 3 >/sys/devices/system/edac/mc/mc0/inject_section | ||
809 | echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket | ||
810 | echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable | ||
811 | dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null | ||
812 | |||
813 | The generated error message will look like: | ||
814 | |||
815 | EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error)) | ||
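Each step of the injection sequence above is a plain write to a node under the mc0 directory. The sketch below wraps that write so it fails with a message when the node is missing or not writable, for example when the i7core_edac driver is not loaded or the caller is not root. Both edac_set and the EDAC_MC variable are hypothetical names introduced for illustration; only the default path comes from the text.

```shell
# Hypothetical wrapper around the sysfs writes shown in the example above.
# EDAC_MC is an assumed override variable; the default path is the one in the text.
EDAC_MC=${EDAC_MC:-/sys/devices/system/edac/mc/mc0}

edac_set() {
    node="$EDAC_MC/$1"
    if [ ! -w "$node" ]; then
        echo "cannot write $node (driver not loaded, or not root?)" >&2
        return 1
    fi
    echo "$2" > "$node"
}
```

A call such as `edac_set inject_addrmatch "channel:2"` would then stand in for the first echo line of the example.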
816 | |||
817 | 3) Nehalem specific Corrected Error memory counters | ||
818 | |||
819 | Nehalem has some registers to count memory errors, reporting them in a | ||
820 | way that is different from what the EDAC API allows. Because of that, a | ||
821 | separate sysfs node was created to handle such counters. | ||
822 | |||
823 | They can be read by looking at the contents of the "corrected_error_counts" | ||
824 | node: | ||
825 | |||
826 | $ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts | ||
827 | dimm0: 15866 | ||
828 | dimm1: 0 | ||
829 | dimm2: 27285 | ||
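The per-DIMM counters shown above can be totalled with a short awk filter. This sketch assumes the node always prints one "dimmN: count" pair per line, as in the sample output; sum_corrected_errors is a hypothetical name.

```shell
# Hypothetical helper: sum the per-DIMM corrected error counts read from stdin.
# Assumption: each line looks like "dimmN: count", possibly indented.
sum_corrected_errors() {
    awk -F': *' '/^[ \t]*dimm[0-9]+:/ { total += $2 } END { print total + 0 }'
}
```

It reads from standard input, so it could be fed with `sum_corrected_errors < /sys/devices/system/edac/mc/mc0/corrected_error_counts`. For the sample output above the total is 43151.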