MemTest86 ECC RAM error reporting status

  • MemTest86 ECC RAM error reporting status

    History
    Back in 2002 some support for ECC RAM (Error-Correcting Code RAM) was added to MemTest86 V3. A couple of years later, in V3.1, support for additional memory controllers was added. In the options menu of MemTest86 V3 there was an 'ECC mode' that could be activated.

    After 2004 the code to read ECC errors was no longer maintained. So as new memory controllers arrived and memory controllers got moved into the CPUs, the code worked on fewer and fewer machines. In 2011, for V4.0, Chris Brady (the original MemTest86 author) decided to drop all support for ECC, along with the related code to identify chipsets. Part of the reasoning behind this was that reporting incorrect information is worse than reporting no information.

    The lack of ECC support remained in place for all V4.x releases.

    Present
    During 2013 we have been working to bring back reporting of ECC errors in V5, at least for the popular current platforms, if not for some of the older platforms as well. It turns out that this is not a trivial exercise. Different code is required for different chipsets, and the mechanisms for reporting errors in UEFI BIOS are poorly documented, with some of the documents not even being available to the public. (Note that V5 of MemTest86 will only support UEFI-based hardware.)

    Testing of ECC RAM error reporting
    Even once code to detect ECC errors is written, there are great difficulties in testing it. These testing problems revolve around:
    • Getting a sufficient quantity and variety of ECC-capable hardware to test with. ECC hardware is typically far more expensive than normal consumer hardware, so it is prohibitively expensive to purchase all the CPUs, RAM and motherboards required.
    • Generating errors in a repeatable manner on demand. We tried using heat guns and strong electromagnetic interference to force ECC errors, but the process was too random. For example, heating the RAM stick to 120C would often result in the entire machine crashing before an error could be reported.
    • Generating a RAM error where exactly one bit in a byte is wrong. One-bit errors are important as this is the type of error that ECC RAM should detect and correct.


    Custom ECC test hardware
    We were lucky enough to be contacted by "Team Group Inc", a company that distributes ECC RAM. They offered to supply us with some customised ECC RAM that had a button affixed to the PCB that could generate 1-bit ECC errors on demand. This is what it looked like:


    It isn't a perfect solution: being a DDR3 module, it won't help with supporting DDR2 RAM, and it also won't help to check the behaviour of multi-bit errors. Nonetheless it is a big step up over having nothing at all.

    ECC Test results
    Using a combination of the customised RAM stick and the ability of some memory controllers to simulate ECC faults, we have been able to get MemTest86 V5 to report on ECC errors. Detected errors typically look like this:



    Conclusion
    ECC error reporting should be available again, for a significant number of current chipsets, in V5 of MemTest86.

  • #2
    Systems tested to date
    We (and other users) have been using the following systems for ECC testing.

    AMD FX-8150 CPU, with ASUS M5A97 Motherboard
    Intel Xeon E3-1220 V2 CPU, with Intel S1200BTL Motherboard
    Intel Xeon E3-1225 V2 CPU, with custom Intel Panther Point, C216 Motherboard
    Intel Xeon E5-2648L CPU, with Intel S2600CO4 Motherboard
    Intel Xeon E5-2687 CPU, with Tyan S7053 Motherboard

    We expect it to work with a much wider range of CPUs & motherboards, but for the moment we have only confirmed it working on the above list.



    • #3
      ECC Error detection on an Intel platform

      Here is what ECC errors look like on an Intel platform.





      Note how the address field looks different from the AMD address field. For the Intel machine it is possible to get a breakdown of the address into Column, Row, Rank & Bank. (e.g. 288, 5F82, 2, 6).

      The Syn: value represents the "ECC syndrome" value. When a 64-bit word is written to RAM, seven ECC bits are computed and stored with it. When the same word is re-read, the ECC bits are recalculated from the data that came back. An exclusive OR is then done between the stored and the recalculated ECC bits to determine if there has been an error. A zero result means no error. A non-zero result indicates an error, and the syndrome value can be used to determine the actual bit that was in error.
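
The syndrome calculation described above can be sketched with a textbook Hamming code using seven check bits. This is only an illustration under stated assumptions: the check-bit layout, the bit numbering and the function names (`ecc_bits`, `syndrome`, `correct`) are hypothetical and are not the code used by any particular memory controller or by MemTest86. Real ECC DIMMs typically store eight check bits per 64-bit word so that double-bit errors can also be detected; that extension is left out here.

```python
# Textbook Hamming layout (an assumption for illustration only):
# codeword positions 1..71, where the power-of-two positions hold the
# 7 check bits and the remaining 64 positions hold the data bits.
DATA_POSITIONS = [p for p in range(1, 72) if p & (p - 1) != 0]


def ecc_bits(word: int) -> int:
    """Return the 7 check bits for a 64-bit word.

    The check value is the XOR of the codeword positions of all data
    bits that are set to 1.
    """
    check = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (word >> i) & 1:
            check ^= pos
    return check


def syndrome(read_word: int, stored_ecc: int) -> int:
    """XOR the recalculated check bits with the stored ones."""
    return ecc_bits(read_word) ^ stored_ecc


def correct(read_word: int, stored_ecc: int) -> tuple[int, int]:
    """Return (possibly corrected word, syndrome)."""
    syn = syndrome(read_word, stored_ecc)
    if syn in DATA_POSITIONS:
        # The syndrome is the position of the flipped data bit.
        read_word ^= 1 << DATA_POSITIONS.index(syn)
    # syn == 0: no error. Any other value: a check bit was flipped or the
    # error is not correctable by this code, so the data is left alone.
    return read_word, syn


if __name__ == "__main__":
    word = 0x0123456789ABCDEF
    stored = ecc_bits(word)            # computed when the word is written
    damaged = word ^ (1 << 17)         # simulate a single flipped data bit
    fixed, syn = correct(damaged, stored)
    print(f"syndrome={syn:#04x} corrected={fixed == word}")
```

In this sketch a syndrome that is a power of two means one of the stored check bits was flipped rather than a data bit, so the data word itself needs no correction.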



      • #4
        David, thank you for working with Team Group. We hope we can continue to attack this industry-wide ECC issue together. Our goal is simple: make good-quality ECC modules and reduce the SBE/MBE rate for the server industry from the module maker's position.

        I will provide a DDR2 SBE module and a DDR3 2-3 bit error module to you as well.



        • #5
          V5 of MemTest86 will be for UEFI machines only, which basically means that V5 won't run on any machine that was made more than 5 years ago. (For these older machines we are continuing to maintain V4.x.)

          The consequence is that we don't really need to worry about supporting ECC DDR2 RAM in V5 of MemTest86.



          • #6
            Who was your contact at Team Group Inc? I could use an ECC error module.
            Thanks



            • #7
              David, have you had any chance to test on Supermicro server boards running Intel processors? If not, is there any way to join the testing pool of users?



              • #8
                We don't have any Supermicro server boards in house.
                So unless you wanted to send us a machine, the best solution is to test the software on your own machine and see if it works. The problem, of course, is forcing an ECC error.

                If you did want to send us a machine, we could send it back to you once we are done.



                • #9
                  Thank you, David. I currently have an open request with SuperMicro about this. We have found that the x9drd-7LNF4-JBOD will only run in Single Core mode, even after firmware updates. Hopefully they can help us correct this. We have had success in running a SuperMicro X10SLM-F with an Intel E3-1270v3 using all cores.



                  • #10
                    Hi David,

                    Not sure if this has been reported: with the current MemTest86 running on our Intel S2600CO4 motherboard, ECC error checking only works with CPU socket 1 and the first bank of DIMM slots. If we move the Team Group test DIMM to a higher slot or to CPU socket 2, ECC errors are not reported when we push the error-generation button on the DIMM. I can get the log if it is needed for debugging.

                    thanks.



                    • #11
                      Originally posted by victorz
                      Hi David,

                      Not sure if this has been reported: with the current MemTest86 running on our Intel S2600CO4 motherboard, ECC error checking only works with CPU socket 1 and the first bank of DIMM slots. If we move the Team Group test DIMM to a higher slot or to CPU socket 2, ECC errors are not reported when we push the error-generation button on the DIMM. I can get the log if it is needed for debugging.

                      thanks.

                      Can you send the log file to the e-mail address on our Support page, found here:
                      http://www.passmark.com/support/index.htm
