aboutsummaryrefslogtreecommitdiffstats
path: root/Documentation/edac.txt
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/edac.txt')
-rw-r--r--Documentation/edac.txt152
1 files changed, 152 insertions, 0 deletions
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index 79c533223762..0b875e8da969 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -6,6 +6,8 @@ Written by Doug Thompson <dougthompson@xmission.com>
67 Dec 2005 67 Dec 2005
717 Jul 2007 Updated 717 Jul 2007 Updated
8 8
9(c) Mauro Carvalho Chehab <mchehab@redhat.com>
1005 Aug 2009 Nehalem interface
9 11
10EDAC is maintained and written by: 12EDAC is maintained and written by:
11 13
@@ -717,3 +719,153 @@ unique drivers for their hardware systems.
717The 'test_device_edac' sample driver is located at the 719The 'test_device_edac' sample driver is located at the
718bluesmoke.sourceforge.net project site for EDAC. 720bluesmoke.sourceforge.net project site for EDAC.
719 721
722=======================================================================
723NEHALEM USAGE OF EDAC APIs
724
725This chapter documents some EXPERIMENTAL mappings for EDAC API to handle
726Nehalem EDAC driver. They will likely be changed on future versions
727of the driver.
728
729Due to the way Nehalem exports Memory Controller data, some adjustments
730were done at i7core_edac driver. This chapter will cover those differences
731
7321) On Nehalem, there are one Memory Controller per Quick Patch Interconnect
733 (QPI). At the driver, the term "socket" means one QPI. This is
734 associated with a physical CPU socket.
735
736 Each MC have 3 physical read channels, 3 physical write channels and
737 3 logic channels. The driver currenty sees it as just 3 channels.
738 Each channel can have up to 3 DIMMs.
739
740 The minimum known unity is DIMMs. There are no information about csrows.
741 As EDAC API maps the minimum unity is csrows, the driver sequencially
742 maps channel/dimm into different csrows.
743
744 For example, suposing the following layout:
745 Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
746 dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
747 dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
748 Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
749 dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
750 Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
751 dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
752 The driver will map it as:
753 csrow0: channel 0, dimm0
754 csrow1: channel 0, dimm1
755 csrow2: channel 1, dimm0
756 csrow3: channel 2, dimm0
757
758exports one
759 DIMM per csrow.
760
761 Each QPI is exported as a different memory controller.
762
7632) Nehalem MC has the hability to generate errors. The driver implements this
764 functionality via some error injection nodes:
765
766 For injecting a memory error, there are some sysfs nodes, under
767 /sys/devices/system/edac/mc/mc?/:
768
769 inject_addrmatch/*:
770 Controls the error injection mask register. It is possible to specify
771 several characteristics of the address to match an error code:
772 dimm = the affected dimm. Numbers are relative to a channel;
773 rank = the memory rank;
774 channel = the channel that will generate an error;
775 bank = the affected bank;
776 page = the page address;
777 column (or col) = the address column.
778 each of the above values can be set to "any" to match any valid value.
779
780 At driver init, all values are set to any.
781
782 For example, to generate an error at rank 1 of dimm 2, for any channel,
783 any bank, any page, any column:
784 echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
785 echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
786
787 To return to the default behaviour of matching any, you can do:
788 echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
789 echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
790
791 inject_eccmask:
792 specifies what bits will have troubles,
793
794 inject_section:
795 specifies what ECC cache section will get the error:
796 3 for both
797 2 for the highest
798 1 for the lowest
799
800 inject_type:
801 specifies the type of error, being a combination of the following bits:
802 bit 0 - repeat
803 bit 1 - ecc
804 bit 2 - parity
805
806 inject_enable starts the error generation when something different
807 than 0 is written.
808
809 All inject vars can be read. root permission is needed for write.
810
811 Datasheet states that the error will only be generated after a write on an
812 address that matches inject_addrmatch. It seems, however, that reading will
813 also produce an error.
814
815 For example, the following code will generate an error for any write access
816 at socket 0, on any DIMM/address on channel 2:
817
818 echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
819 echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
820 echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
821 echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
822 echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
823 dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
824
825 For socket 1, it is needed to replace "mc0" by "mc1" at the above
826 commands.
827
828 The generated error message will look like:
829
830 EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
831
8323) Nehalem specific Corrected Error memory counters
833
834 Nehalem have some registers to count memory errors. The driver uses those
835 registers to report Corrected Errors on devices with Registered Dimms.
836
837 However, those counters don't work with Unregistered Dimms. As the chipset
838 offers some counters that also work with UDIMMS (but with a worse level of
839 granularity than the default ones), the driver exposes those registers for
840 UDIMM memories.
841
842 They can be read by looking at the contents of all_channel_counts/
843
844 $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
845 /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
846 0
847 /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
848 0
849 /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
850 0
851
852 What happens here is that errors on different csrows, but at the same
853 dimm number will increment the same counter.
854 So, in this memory mapping:
855 csrow0: channel 0, dimm0
856 csrow1: channel 0, dimm1
857 csrow2: channel 1, dimm0
858 csrow3: channel 2, dimm0
859 The hardware will increment udimm0 for an error at the first dimm at either
860 csrow0, csrow2 or csrow3;
861 The hardware will increment udimm1 for an error at the second dimm at either
862 csrow0, csrow2 or csrow3;
863 The hardware will increment udimm2 for an error at the third dimm at either
864 csrow0, csrow2 or csrow3;
865
8664) Standard error counters
867
868 The standard error counters are generated when an mcelog error is received
869 by the driver. Since, with udimm, this is counted by software, it is
870 possible that some errors could be lost. With rdimm's, they displays the
871 contents of the registers