author: Mauro Carvalho Chehab <mchehab@redhat.com> 2009-09-05 04:10:15 -0400
committer: Mauro Carvalho Chehab <mchehab@redhat.com> 2010-05-10 10:44:59 -0400
commit: c344436319e898784febbeeea71d1b0f65ef53ae
tree: 84b0a35fa9efe50afa62aadbdbb00d1e0e5dd65b /Documentation/edac.txt
parent: d4c277957f4e8e6f2b626e2661cbbf9c76782e36
Documentation/edac.txt: Improve it to reflect the latest changes at the driver
Signed-off-by: Mauro Carvalho Chehab <mcheahb@redhat.com>
Diffstat (limited to 'Documentation/edac.txt')
-rw-r--r-- Documentation/edac.txt | 72
1 files changed, 56 insertions, 16 deletions
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index 8bc320467c64..bd3f8a3905af 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -730,25 +730,41 @@ Due to the way Nehalem exports Memory Controller data, some adjustments
 were done at i7core_edac driver. This chapter will cover those differences
 
 1) On Nehalem, there are one Memory Controller per Quick Patch Interconnect
-   (QPI). At the driver, the term "socket" means one QPI. It should also be
-   associated with the CPU physical socket.
+   (QPI). At the driver, the term "socket" means one QPI. This is
+   associated with a physical CPU socket.
 
    Each MC have 3 physical read channels, 3 physical write channels and
    3 logic channels. The driver currenty sees it as just 3 channels.
    Each channel can have up to 3 DIMMs.
 
    The minimum known unity is DIMMs. There are no information about csrows.
-   As EDAC API maps the minimum unity is csrows, the driver exports one
+   As EDAC API maps the minimum unity is csrows, the driver sequencially
+   maps channel/dimm into different csrows.
+
+   For example, suposing the following layout:
+       Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs
+         dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+         dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400
+       Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs
+         dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+       Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs
+         dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400
+   The driver will map it as:
+       csrow0: channel 0, dimm0
+       csrow1: channel 0, dimm1
+       csrow2: channel 1, dimm0
+       csrow3: channel 2, dimm0
+
+exports one
    DIMM per csrow.
 
-   Currently, it also exports the several memory controllers as just one. This
-   limit will be removed on future versions of the driver.
+   Each QPI is exported as a different memory controller.
 
 2) Nehalem MC has the hability to generate errors. The driver implements this
    functionality via some error injection nodes:
 
    For injecting a memory error, there are some sysfs nodes, under
-   /sys/devices/system/edac/mc/mc0/:
+   /sys/devices/system/edac/mc/mc?/:
 
    inject_addrmatch:
        Controls the error injection mask register. It is possible to specify
@@ -779,11 +795,6 @@ were done at i7core_edac driver. This chapter will cover those differences
               2 for the highest
               1 for the lowest
 
-   inject_socket:
-       specifies what QPI (or processor socket) will generate the error.
-       on Xeon 35xx, it should be 0.
-       on Xeon 55xx, it should be 0 or 1.
-
    inject_type:
        specifies the type of error, being a combination of the following bits:
        bit 0 - repeat
@@ -806,10 +817,12 @@ were done at i7core_edac driver. This chapter will cover those differences
        echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
        echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
        echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
-       echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket
        echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
        dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null
 
+   For socket 1, it is needed to replace "mc0" by "mc1" at the above
+   commands.
+
    The generated error message will look like:
 
        EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error))
@@ -821,9 +834,36 @@ were done at i7core_edac driver. This chapter will cover those differences
    separate sysfs note were created to handle such counters.
 
    They can be read by looking at the contents of "corrected_error_counts"
-   counter:
+   counter. Due to hardware limits, the output is different on machines
+   with unregistered memories and machines with registered ones.
+
+   With unregistered memories, it outputs:
 
    $ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
-   dimm0: 15866
-   dimm1: 0
-   dimm2: 27285
+   all channels UDIMM0: 0 UDIMM1: 0 UDIMM2: 0
+
+   What happens here is that errors on different csrows, but at the same
+   dimm number will increment the same counter.
+   So, in this memory mapping:
+       csrow0: channel 0, dimm0
+       csrow1: channel 0, dimm1
+       csrow2: channel 1, dimm0
+       csrow3: channel 2, dimm0
+   The hardware will increment UDIMM0 for an error at either csrow0, csrow2
+   or csrow3.
+
+   With registered memories, it outputs:
+
+   $cat /sys/devices/system/edac/mc/mc0/corrected_error_counts
+   channel 0 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0
+   channel 1 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0
+   channel 2 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0
+
+   So, with registered memories, there's a direct map between a csrow and a
+   counter.
+
+4) Standard error counters
+
+   The standard error counters are generated when an mcelog error is received
+   by the driver. Since it is counted by software, it is possible that some
+   errors could be lost.
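
Illustrative note (not part of the patch): the sequential channel/dimm to csrow mapping described in the first hunk can be sketched in shell. This is only a rough model of the numbering, not the driver's code. The function name map_csrows and its input format (one "<channel> <dimm>" pair per line, in channel order) are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: assign csrow numbers sequentially to populated
# channel/dimm slots, in the order they are listed on stdin.
# map_csrows and the "<channel> <dimm>" input format are hypothetical.
map_csrows() {
    csrow=0
    while read -r channel dimm; do
        # Skip empty lines in the input.
        [ -n "$channel" ] || continue
        echo "csrow${csrow}: channel ${channel}, dimm${dimm}"
        csrow=$((csrow + 1))
    done
}

# Populated slots taken from the example layout in the patch:
# channel 0 has dimm 0 and dimm 1, channels 1 and 2 have dimm 0 only.
printf '%s\n' '0 0' '0 1' '1 0' '2 0' | map_csrows
```

With the example layout this prints the same four csrow lines listed in the patch. On unregistered memory, every line ending in dimm0 would share the UDIMM0 counter, which is why csrow0, csrow2 and csrow3 cannot be told apart in corrected_error_counts.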