author     Mauro Carvalho Chehab <mchehab@redhat.com>  2009-08-05 20:16:56 -0400
committer  Mauro Carvalho Chehab <mchehab@redhat.com>  2010-05-10 10:44:55 -0400
commit     31983a04d686f9f90b356072089d8d677e40e776
tree       7f5df1f504f35fe474785c4366643eff03788267 /Documentation
parent     4157d9f55435331deef01ba8a9a47f248c042fb2
Documentation/edac.txt: Add Nehalem specific EDAC characteristics
Nehalem has a different binding to the EDAC API and its own error injection code, so document it.

Signed-off-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Diffstat (limited to 'Documentation')
-rw-r--r-- Documentation/edac.txt 110
1 files changed, 110 insertions, 0 deletions
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index 79c533223762..8bc320467c64 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -6,6 +6,8 @@ Written by Doug Thompson <dougthompson@xmission.com>
7 Dec 2005
17 Jul 2007	Updated

(c) Mauro Carvalho Chehab <mchehab@redhat.com>
05 Aug 2009	Nehalem interface

EDAC is maintained and written by:

@@ -717,3 +719,111 @@ unique drivers for their hardware systems.
The 'test_device_edac' sample driver is located at the
bluesmoke.sourceforge.net project site for EDAC.

=======================================================================
NEHALEM USAGE OF EDAC APIs

This chapter documents some EXPERIMENTAL mappings of the EDAC API used by
the Nehalem EDAC driver. They are likely to change in future versions
of the driver.

Due to the way Nehalem exports Memory Controller data, some adjustments
were made in the i7core_edac driver. This chapter covers those differences.

1) On Nehalem, there is one Memory Controller per QuickPath Interconnect
   (QPI). In the driver, the term "socket" means one QPI. It should also be
   associated with the CPU physical socket.

   Each MC has 3 physical read channels, 3 physical write channels and
   3 logical channels. The driver currently sees them as just 3 channels.
   Each channel can have up to 3 DIMMs.

   The smallest known unit is the DIMM. There is no information about csrows.
   As the EDAC API treats the csrow as its smallest unit, the driver exports
   one DIMM per csrow.

   Currently, it also exports the several memory controllers as just one. This
   limit will be removed in future versions of the driver.

2) The Nehalem MC has the ability to generate errors. The driver implements
   this functionality via some error injection nodes:

   For injecting a memory error, there are some sysfs nodes under
   /sys/devices/system/edac/mc/mc0/:

   inject_addrmatch:
        Controls the error injection mask register. It is possible to specify
        several characteristics of the address at which to inject the error:
           dimm = the affected dimm. Numbers are relative to a channel;
           rank = the memory rank;
           channel = the channel that will generate an error;
           bank = the affected bank;
           page = the page address;
           column (or col) = the address column.
        Each of the above values can be set to "any" to match any valid value.

        At driver init, all values are set to "any".

        For example, to generate an error at rank 1 of dimm 2, for any channel,
        any bank, any page, any column:
                echo "dimm:2 rank:1" >/sys/devices/system/edac/mc/mc0/inject_addrmatch

        To return to the default behaviour of matching any, you can do:
                echo "dimm:any rank:any" >/sys/devices/system/edac/mc/mc0/inject_addrmatch

   inject_eccmask:
        Specifies which bits the error will affect.

   inject_section:
        Specifies which ECC cache section will get the error:
                3 for both
                2 for the highest
                1 for the lowest

   inject_socket:
        Specifies which QPI (or processor socket) will generate the error.
        On Xeon 35xx, it should be 0.
        On Xeon 55xx, it should be 0 or 1.

   inject_type:
        Specifies the type of error, as a combination of the following bits:
                bit 0 - repeat
                bit 1 - ecc
                bit 2 - parity

   inject_enable:
        Starts the error generation when a value other than 0 is written.

   All inject vars can be read. Root permission is needed to write them.
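
Since inject_type is a combination of bits, the value to write can be built from flag names. The following is only a sketch: the helper name inject_type_value is hypothetical and not part of the driver, and the bit positions are taken from the list above.

```shell
# Hypothetical helper (not part of i7core_edac): turn flag names into the
# integer expected by inject_type, using the bit layout documented above:
#   bit 0 = repeat, bit 1 = ecc, bit 2 = parity
inject_type_value() {
    value=0
    for flag in "$@"; do
        case "$flag" in
            repeat) value=$((value | 1)) ;;
            ecc)    value=$((value | 2)) ;;
            parity) value=$((value | 4)) ;;
            *)
                echo "unknown inject_type flag: $flag" >&2
                return 1
                ;;
        esac
    done
    echo "$value"
}
```

For instance, "inject_type_value repeat ecc" prints 3, which root could then redirect into /sys/devices/system/edac/mc/mc0/inject_type.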

   The datasheet states that the error will only be generated after a write
   to an address that matches inject_addrmatch. It seems, however, that
   reading will also produce an error.

   For example, the following code will generate an error for any write access
   at socket 0, on any DIMM/address on channel 2:

   echo "channel:2" > /sys/devices/system/edac/mc/mc0/inject_addrmatch
   echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
   echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
   echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
   echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
   echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
   dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null

   The generated error message will look like:

   EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))

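The sequence of writes in the example above could be wrapped in a small function. This is a minimal sketch under stated assumptions: the function name edac_inject and the EDAC_MC_DIR variable are hypothetical, and inject_enable is written last only because the example does so. Pointing EDAC_MC_DIR at a scratch directory allows a dry run without touching the hardware.

```shell
# Directory holding the inject_* nodes. EDAC_MC_DIR is a hypothetical
# override for dry runs; the default is the sysfs path used in this chapter.
EDAC_MC_DIR=${EDAC_MC_DIR:-/sys/devices/system/edac/mc/mc0}

# Hypothetical wrapper: edac_inject ADDRMATCH TYPE ECCMASK SECTION SOCKET
# Writes each value to its node and writes inject_enable last, in the same
# order as the example above. Stops at the first write that fails.
edac_inject() {
    if [ "$#" -ne 5 ]; then
        echo "usage: edac_inject ADDRMATCH TYPE ECCMASK SECTION SOCKET" >&2
        return 2
    fi
    echo "$1" > "$EDAC_MC_DIR/inject_addrmatch" &&
    echo "$2" > "$EDAC_MC_DIR/inject_type" &&
    echo "$3" > "$EDAC_MC_DIR/inject_eccmask" &&
    echo "$4" > "$EDAC_MC_DIR/inject_section" &&
    echo "$5" > "$EDAC_MC_DIR/inject_socket" &&
    echo 1 > "$EDAC_MC_DIR/inject_enable"
}
```

Run as root, 'edac_inject "channel:2" 2 64 3 0' performs the same six writes as the example; a memory access matching inject_addrmatch is still needed afterwards to trigger the error.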
3) Nehalem specific Corrected Error memory counters

   Nehalem has some registers to count memory errors, and reports them in a
   way that differs from what the EDAC API allows. Because of that, a
   separate sysfs node was created to handle such counters.

   They can be read by looking at the contents of the
   "corrected_error_counts" node:

   $ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
   dimm0: 15866
   dimm1: 0
   dimm2: 27285
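
The per-DIMM counters can be summarised with a short filter. The sketch below assumes the "dimmN: COUNT" line format shown in the sample output; the helper name summarize_ce_counts is hypothetical.

```shell
# Hypothetical helper: read "dimmN: COUNT" lines on stdin, report every DIMM
# with a non-zero corrected error count, then print the overall total.
summarize_ce_counts() {
    awk -F': *' '
        /^dimm[0-9]+:/ {
            total += $2
            if ($2 > 0)
                printf "%s has %d corrected errors\n", $1, $2
        }
        END { printf "total: %d\n", total }
    '
}
```

It can be fed directly from the node, for example "summarize_ce_counts < /sys/devices/system/edac/mc/mc0/corrected_error_counts". With the sample values above it reports dimm0 and dimm2 and a total of 43151.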