Diffstat (limited to 'Documentation/edac.txt')
-rw-r--r-- | Documentation/edac.txt | 72 |
1 files changed, 56 insertions, 16 deletions
diff --git a/Documentation/edac.txt b/Documentation/edac.txt
index 8bc320467c64..bd3f8a3905af 100644
--- a/Documentation/edac.txt
+++ b/Documentation/edac.txt
@@ -730,25 +730,41 @@ Due to the way Nehalem exports Memory Controller data, some adjustments | |||
730 | were done at i7core_edac driver. This chapter will cover those differences | 730 | were done at i7core_edac driver. This chapter will cover those differences |
731 | 731 | ||
732 | 1) On Nehalem, there are one Memory Controller per Quick Patch Interconnect | 732 | 1) On Nehalem, there are one Memory Controller per Quick Patch Interconnect |
733 | (QPI). At the driver, the term "socket" means one QPI. It should also be | 733 | (QPI). At the driver, the term "socket" means one QPI. This is |
734 | associated with the CPU physical socket. | 734 | associated with a physical CPU socket. |
735 | 735 | ||
736 | Each MC have 3 physical read channels, 3 physical write channels and | 736 | Each MC have 3 physical read channels, 3 physical write channels and |
737 | 3 logic channels. The driver currenty sees it as just 3 channels. | 737 | 3 logic channels. The driver currenty sees it as just 3 channels. |
738 | Each channel can have up to 3 DIMMs. | 738 | Each channel can have up to 3 DIMMs. |
739 | 739 | ||
740 | The minimum known unity is DIMMs. There are no information about csrows. | 740 | The minimum known unity is DIMMs. There are no information about csrows. |
741 | As EDAC API maps the minimum unity is csrows, the driver exports one | 741 | As EDAC API maps the minimum unity is csrows, the driver sequencially |
742 | maps channel/dimm into different csrows. | ||
743 | |||
744 | For example, suposing the following layout: | ||
745 | Ch0 phy rd0, wr0 (0x063f4031): 2 ranks, UDIMMs | ||
746 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 | ||
747 | dimm 1 1024 Mb offset: 4, bank: 8, rank: 1, row: 0x4000, col: 0x400 | ||
748 | Ch1 phy rd1, wr1 (0x063f4031): 2 ranks, UDIMMs | ||
749 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 | ||
750 | Ch2 phy rd3, wr3 (0x063f4031): 2 ranks, UDIMMs | ||
751 | dimm 0 1024 Mb offset: 0, bank: 8, rank: 1, row: 0x4000, col: 0x400 | ||
752 | The driver will map it as: | ||
753 | csrow0: channel 0, dimm0 | ||
754 | csrow1: channel 0, dimm1 | ||
755 | csrow2: channel 1, dimm0 | ||
756 | csrow3: channel 2, dimm0 | ||
757 | |||
758 | exports one | ||
742 | DIMM per csrow. | 759 | DIMM per csrow. |
743 | 760 | ||
744 | Currently, it also exports the several memory controllers as just one. This | 761 | Each QPI is exported as a different memory controller. |
745 | limit will be removed on future versions of the driver. | ||
746 | 762 | ||
747 | 2) Nehalem MC has the hability to generate errors. The driver implements this | 763 | 2) Nehalem MC has the hability to generate errors. The driver implements this |
748 | functionality via some error injection nodes: | 764 | functionality via some error injection nodes: |
749 | 765 | ||
750 | For injecting a memory error, there are some sysfs nodes, under | 766 | For injecting a memory error, there are some sysfs nodes, under |
751 | /sys/devices/system/edac/mc/mc0/: | 767 | /sys/devices/system/edac/mc/mc?/: |
752 | 768 | ||
753 | inject_addrmatch: | 769 | inject_addrmatch: |
754 | Controls the error injection mask register. It is possible to specify | 770 | Controls the error injection mask register. It is possible to specify |
@@ -779,11 +795,6 @@ were done at i7core_edac driver. This chapter will cover those differences | |||
779 | 2 for the highest | 795 | 2 for the highest |
780 | 1 for the lowest | 796 | 1 for the lowest |
781 | 797 | ||
782 | inject_socket: | ||
783 | specifies what QPI (or processor socket) will generate the error. | ||
784 | on Xeon 35xx, it should be 0. | ||
785 | on Xeon 55xx, it should be 0 or 1. | ||
786 | |||
787 | inject_type: | 798 | inject_type: |
788 | specifies the type of error, being a combination of the following bits: | 799 | specifies the type of error, being a combination of the following bits: |
789 | bit 0 - repeat | 800 | bit 0 - repeat |
@@ -806,10 +817,12 @@ were done at i7core_edac driver. This chapter will cover those differences | |||
806 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_type | 817 | echo 2 >/sys/devices/system/edac/mc/mc0/inject_type |
807 | echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask | 818 | echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask |
808 | echo 3 >/sys/devices/system/edac/mc/mc0/inject_section | 819 | echo 3 >/sys/devices/system/edac/mc/mc0/inject_section |
809 | echo 0 >/sys/devices/system/edac/mc/mc0/inject_socket | ||
810 | echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable | 820 | echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable |
811 | dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null | 821 | dd if=/dev/mem of=/dev/null seek=16k bs=4k count=1 >& /dev/null |
812 | 822 | ||
823 | For socket 1, it is needed to replace "mc0" by "mc1" at the above | ||
824 | commands. | ||
825 | |||
813 | The generated error message will look like: | 826 | The generated error message will look like: |
814 | 827 | ||
815 | EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error)) | 828 | EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, Dimm=0, Channel=2, syndrome=0x00000040, count=1, Err=8c0000400001009f:4000080482 (read error: read ECC error)) |
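Reviewer note: with inject_socket removed, the target socket is selected only by the mcN directory. A minimal Python sketch of that idea follows. The helper name `write_inject_nodes`, the `INJECT_SETTINGS` list and the `root` parameter are hypothetical and exist only for illustration. The values are the ones visible in the example commands of this hunk. Any settings made before the hunk starts, and the dd read that triggers the error, are not reproduced.

```python
from pathlib import Path

# Values copied from the example commands visible in this hunk, same order.
# Settings made before the hunk starts (not shown) are deliberately left out.
INJECT_SETTINGS = [
    ("inject_type", "2"),
    ("inject_eccmask", "64"),
    ("inject_section", "3"),
    ("inject_enable", "1"),
]


def write_inject_nodes(mc_index, root="/sys/devices/system/edac/mc"):
    """Write the injection settings under mc<mc_index>; return the paths written.

    Hypothetical helper: 'root' is a parameter only so the function can be
    pointed at a scratch directory instead of the real sysfs tree.
    """
    mc_dir = Path(root) / f"mc{mc_index}"
    written = []
    for node, value in INJECT_SETTINGS:
        path = mc_dir / node
        path.write_text(value + "\n")  # same effect as: echo VALUE > path
        written.append(str(path))
    return written
```

Under these assumptions, `write_inject_nodes(0)` mirrors the mc0 commands above and `write_inject_nodes(1)` is the socket 1 variant described in the added text. Writing to the real nodes needs root and the i7core_edac driver loaded.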
@@ -821,9 +834,36 @@ were done at i7core_edac driver. This chapter will cover those differences | |||
821 | separate sysfs note were created to handle such counters. | 834 | separate sysfs note were created to handle such counters. |
822 | 835 | ||
823 | They can be read by looking at the contents of "corrected_error_counts" | 836 | They can be read by looking at the contents of "corrected_error_counts" |
824 | counter: | 837 | counter. Due to hardware limits, the output is different on machines |
838 | with unregistered memories and machines with registered ones. | ||
839 | |||
840 | With unregistered memories, it outputs: | ||
825 | 841 | ||
826 | $ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts | 842 | $ cat /sys/devices/system/edac/mc/mc0/corrected_error_counts |
827 | dimm0: 15866 | 843 | all channels UDIMM0: 0 UDIMM1: 0 UDIMM2: 0 |
828 | dimm1: 0 | 844 | |
829 | dimm2: 27285 | 845 | What happens here is that errors on different csrows, but at the same |
846 | dimm number will increment the same counter. | ||
847 | So, in this memory mapping: | ||
848 | csrow0: channel 0, dimm0 | ||
849 | csrow1: channel 0, dimm1 | ||
850 | csrow2: channel 1, dimm0 | ||
851 | csrow3: channel 2, dimm0 | ||
852 | The hardware will increment UDIMM0 for an error at either csrow0, csrow2 | ||
853 | or csrow3. | ||
854 | |||
855 | With registered memories, it outputs: | ||
856 | |||
857 | $cat /sys/devices/system/edac/mc/mc0/corrected_error_counts | ||
858 | channel 0 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0 | ||
859 | channel 1 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0 | ||
860 | channel 2 RDIMM0: 0 RDIMM1: 0 RDIMM2: 0 | ||
861 | |||
862 | So, with registered memories, there's a direct map between a csrow and a | ||
863 | counter. | ||
864 | |||
865 | 4) Standard error counters | ||
866 | |||
867 | The standard error counters are generated when an mcelog error is received | ||
868 | by the driver. Since it is counted by software, it is possible that some | ||
869 | errors could be lost. | ||
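Reviewer note: the two corrected_error_counts layouts added by this patch can be read with one small parser. The sketch below is in Python and assumes the output looks exactly like the two samples in the new text. The function names and the `EXAMPLE_CSROWS` table are hypothetical. The table only restates the csrow mapping example from this chapter.

```python
import re

_COUNTER = re.compile(r"([UR])DIMM(\d+):\s*(\d+)")
_CHANNEL = re.compile(r"channel\s+(\d+)")


def parse_corrected_error_counts(text):
    """Return {(channel, dimm): count}.

    channel is None for the unregistered "all channels" line, where one
    counter per dimm number is shared by every channel.
    """
    counts = {}
    for line in text.splitlines():
        match = _CHANNEL.match(line.strip())
        channel = int(match.group(1)) if match else None
        for _kind, dimm, count in _COUNTER.findall(line):
            counts[(channel, int(dimm))] = int(count)
    return counts


def csrows_for_counter(csrow_map, channel, dimm):
    """List the csrows whose corrected errors land in the given counter."""
    return [
        csrow
        for csrow, (ch, d) in sorted(csrow_map.items())
        if d == dimm and (channel is None or ch == channel)
    ]


# csrow -> (channel, dimm), restating the mapping example in this chapter.
EXAMPLE_CSROWS = {0: (0, 0), 1: (0, 1), 2: (1, 0), 3: (2, 0)}
```

With the example mapping, `csrows_for_counter(EXAMPLE_CSROWS, None, 0)` lists csrow0, csrow2 and csrow3 for the shared UDIMM0 counter, while `csrows_for_counter(EXAMPLE_CSROWS, 1, 0)` lists only csrow2, which matches the direct csrow-to-counter map described for registered memories.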