PassMark Software

NUMA Memory benchmarking on AMD ThreadRipper


  • NUMA Memory benchmarking on AMD ThreadRipper

    AMD's new ThreadRipper CPUs actually contain two CPU modules, which means two separate memory buses.

    As a result, there is the possibility of Non-Uniform Memory Access (NUMA): some of the time CPU 0 might use its own RAM, and some of the time CPU 0 might need to use the RAM connected to CPU 1. In theory, using the RAM directly connected to your own CPU should be faster, and using the more remote RAM should be slower.
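
    As a quick way to see what Windows exposes on a system like this, the standard Win32 NUMA APIs will report the number of nodes and the memory available on each. A minimal sketch (this is just an illustration, not PerformanceTest's own code):

        // numa_nodes.cpp - list NUMA nodes and their available memory (Win32).
        #include <windows.h>
        #include <cstdio>

        int main()
        {
            ULONG highestNode = 0;
            if (!GetNumaHighestNodeNumber(&highestNode)) {
                printf("GetNumaHighestNodeNumber failed (%lu)\n", GetLastError());
                return 1;
            }

            // On a ThreadRipper 1950X with NUMA enabled in the BIOS this should
            // report two nodes; with interleaving left on "Auto" only one.
            printf("Highest NUMA node number: %lu\n", highestNode);

            for (USHORT node = 0; node <= (USHORT)highestNode; ++node) {
                ULONGLONG availableBytes = 0;
                if (GetNumaAvailableMemoryNodeEx(node, &availableBytes))
                    printf("Node %u: %.1f GB available\n",
                           node, availableBytes / (1024.0 * 1024.0 * 1024.0));
            }
            return 0;
        }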

    We set out to test this today using PerformanceTest's advanced memory test on a ThreadRipper 1950X. Memory in use was Corsair CMK32GX4M2A2666C16 (DDR4, 2666MHz, 16-18-18-35, 4 x 16GB).

    Note that you need PerformanceTest version 9.0 build 1022 (20/Dec/2017) or higher to do this. In previous releases the advanced memory test was not NUMA aware (or had NUMA-related bugs).

    Graphs are below, but the summary is:

    1) For sequential reading of memory, there is no significant performance difference for small blocks of RAM, but for larger blocks that fall outside the CPU's cache the performance difference can be up to 60%. (This test corresponds to our "memory speed per block size" test in PerformanceTest.) So having NUMA-aware applications is important for a system like this.

    2) For non-sequential reads, where we skip forward by some step factor before reading the next value, there is a significant performance hit. (This corresponds to our "memory speed per step size" test.) We suspect this is because the cache is a lot less effective. Performance degradation was around 20%.

    3) Latency is higher for distant nodes compared to accessing memory on the local node: memory accesses are around 60% slower. Again this shows why NUMA-aware applications (and operating systems) are important. What we did notice, however, is that if we didn't explicitly select the NUMA node, most of the time the system itself seemed to select the best node anyway (using malloc() on Win10). We don't know if this was by design or just good luck. A small allocation sketch showing the difference follows this list.
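
    For anyone wanting to reproduce the explicit-node case outside of PerformanceTest, the sketch below shows roughly what is involved on Windows (simplified, and not our actual test code): VirtualAllocExNuma() lets you state a preferred node, while a plain malloc() leaves the placement decision to the OS, which is the behaviour described in point 3.

        // numa_alloc.cpp - allocate one buffer on an explicit NUMA node and one
        // via malloc(), leaving the node choice to Windows. Sketch only.
        #include <windows.h>
        #include <cstdio>
        #include <cstdlib>
        #include <cstring>

        int main()
        {
            const SIZE_T size = 256 * 1024 * 1024;   // 256 MB, well outside the CPU cache

            // Explicit placement: ask for the pages to come from NUMA node 1.
            void* explicitBuf = VirtualAllocExNuma(GetCurrentProcess(), nullptr, size,
                                                   MEM_RESERVE | MEM_COMMIT,
                                                   PAGE_READWRITE,
                                                   1 /* preferred node */);

            // Implicit placement: the OS decides, normally satisfying the page
            // faults from the faulting thread's local node.
            void* defaultBuf = malloc(size);

            if (!explicitBuf || !defaultBuf) {
                printf("allocation failed\n");
                return 1;
            }

            // Touch the memory so physical pages are actually committed; timing a
            // read loop over each buffer would then show the local/remote gap.
            memset(explicitBuf, 1, size);
            memset(defaultBuf, 1, size);

            VirtualFree(explicitBuf, 0, MEM_RELEASE);
            free(defaultBuf);
            return 0;
        }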

    Note: AMD's EPYC CPUs should behave the same way, but we only have a ThreadRipper system to play with.



    NUMA Memory Step Size Benchmark

    NUMA Graph AMD ThreadRipper Block Size

    Latency results - Same NUMA node

    NUMA Memory Latency Same Node

    Latency results - Remote NUMA node

    NUMA Memory Latency Remote Node






    It is also worth noting that the default BIOS setup didn't enable NUMA (ASUS motherboard); it had to be manually enabled.

    This was done in the BIOS's advanced mode, under "DF Common Options", by selecting the "Memory Interleaving" setting and changing it from "Auto" to "Channel".

    ASUS NUMA memory setting in UEFI BIOS

  • #2
    And just for comparison, here is a graph from an Intel i7-4770 with 16GB of G.Skill DDR3 (F3-17000CL11, 4 x 4GB).


    Intel i7-4770 memory benchmark graph (Intel-RAM-Benchmark.png)

    Latency:
    Linear: 5.4ns
    Random: 78ns
    Random range: 26ms



    • #3
      RE: NUMA slowdown of CPU performance

      Thanks for the update on this! I am seeing a similar impact from NUMA while benchmarking my system:

      VIDEOSTAR workstation - CPU: Threadripper 1950X, GPU: GTX 1080 Ti, MBd: ASRock Taichi,
      RAM: 64GB @ 3200, SSD: Samsung 960 Pro 1TB NVMe, PSU: DarkQuiet Pro 1200W,
      Case: Fractal XL R2, Cooler: Arctic 360 (6-fan) + 4x 140mm fans, Display: BenQ PV3200PT, OS: Win10 Pro

      If I understand correctly, for my Taichi board NUMA is 'on' when memory interleaving is set to 'Channel'
      and 'off' when it is set to 'Auto'. With that definition, here are the scores for exactly the same
      configuration with NUMA on or off, CPU @ 4.125 GHz:

      PassMark 9     NUMA and HPET OFF    NUMA and HPET ON

      System         7004                 7216
      CPU            28004                25125
      2D             994                  981
      3D             15982                15853
      Memory         2238                 2622
      Disk           20349                19938

      So NUMA increases the system score by about 3%, but reduces the CPU score by nearly 11%.
      The 17% increase in the memory score is what drives the increased system score.
      NUMA also slightly reduces the 2D, 3D, and disk scores.

      The pattern and magnitude are similar at other CPU speeds (3.4, 3.7, 3.975, and 4.075 GHz),
      and also when I was using a slower SSD, RAM at 2133, and stock settings for Win10 Pro.

      Is this a permanent problem associated with the design, or something that AMD's memory
      controller and BIOS design could improve?








      • #4
        Hi David, I've been running the CPU benchmark on an EPYC system (single-socket 7551, Gigabyte MZ31-AR0, 128GB ECC RAM @ 2666, 500GB Samsung 750 EVO SSD) and I'm getting a typical ~19000 score, consistent with what's listed on the big results page.

        Looking at the individual component scores, it looks like the overall CPU score gets really sandbagged by the Prime Number and Physics tests in particular, both of which have some memory-access dependency per the test descriptions. I suspect this is a NUMA-related issue, based also on the scores (overall and per component) I get from a Ryzen system (same CPU core, non-NUMA) and an Intel Xeon Gold system (same number of cores, different NUMA architecture) that I also have.

        As such, is there a way to run the CPU benchmark in a NUMA-aware mode, or are we entirely reliant on the OS's memory allocation for these tests? Setting the memory interleaving to Channel doesn't seem to have any impact on these CPU benchmark components, by the way.



        • #5
          "or are we entirely reliant on the OS's memory allocation for these tests?"
          Yes, the CPU tests are not NUMA aware. We hope the O/S will do a good job.

          From a programming point of view it is a bit painful to make every memory allocation NUMA aware.
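
          For what it's worth, pinning each worker thread to the processors of one node before it allocates and first touches its buffers gets most of the benefit without changing every allocation call, since Windows then tends to satisfy those allocations from that node. A rough sketch of the idea (standard Win32 calls, not the code we actually use in the benchmark):

              // pin_to_node.cpp - restrict the calling thread to one NUMA node so
              // that its subsequent allocations are (usually) served locally.
              #include <windows.h>
              #include <cstdio>

              // Returns true if the current thread was bound to the given node.
              bool PinCurrentThreadToNode(USHORT node)
              {
                  GROUP_AFFINITY affinity = {};
                  if (!GetNumaNodeProcessorMaskEx(node, &affinity))
                      return false;   // node doesn't exist on this system
                  // Restrict this thread to the node's processor group and mask.
                  return SetThreadGroupAffinity(GetCurrentThread(), &affinity, nullptr) != 0;
              }

              int main()
              {
                  if (!PinCurrentThreadToNode(0)) {
                      printf("Could not pin thread to node 0\n");
                      return 1;
                  }
                  // Memory allocated and first touched from here on should normally
                  // be placed on node 0, without NUMA-specific allocation calls.
                  printf("Thread pinned to NUMA node 0\n");
                  return 0;
              }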



          • #6
            Understood, thanks for confirming.

            More directly related to the thread topic, I have been running the Advanced Memory Tests on my EPYC system, and I've encountered an interesting quirk/anomaly. I don't know if this is something PassMark is doing, Windows is doing, some AMD driver is doing, or what, but I get unexpected latency and memory read speed scores for NUMA nodes 2 and 3 with respect to running the test against local-node or distant-node memory. NUMA 0 and NUMA 1 behave as expected, as follows:

            NUMA 0 Processor 0
            Average read speed (MB/s per step size) for local node 0: 6451 MB/s
            Average read speed for distant node 1/2/3: ~5150 MB/s
            Random Range Latency for local node 0: 68 ns
            Random Range Latency for remote node 1/2/3: 110 ns

            NUMA 1, Processor 16 shows similar numbers, except of course the local node is 1 and the distant nodes are 0/2/3. This is also as expected.

            For NUMA 2 Processor 32 however, I see the following:
            Average read speed for "local" node 2: 5207 MB/s
            Average read speed for "distant" node 0: 6469 MB/s
            Random Range Latency for "local" node 2: 110 ns
            Random Range Latency for "remote" node 0: 68ns

            So it's as if allocation node 0's memory is really the local memory for NUMA 2! Similarly, the results I get for NUMA 3, Processor 48 indicate that allocation node 1 is really the local memory for this processor, and not node 3.

            Any idea why I'm seeing this? Shouldn't the local node performance always be better? The GUI does identify the expected local allocation node prior to the test, e.g. NUMA allocation node 2 is labeled as the local node in the drop-down when Processor 32 is selected.
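
            In case it's useful, one way to sanity-check this is to ask Windows directly which node it associates with each of those logical processors, in case the OS's idea of the topology differs from what the GUI shows. A small sketch (assuming processors 0, 16, 32 and 48 as above; this is just an illustration, not anything from PerformanceTest):

                // cpu_node_map.cpp - print the NUMA node Windows associates with a
                // few logical processors (0, 16, 32, 48 assumed, as in the post above).
                #include <windows.h>
                #include <cstdio>

                int main()
                {
                    const BYTE cpus[] = { 0, 16, 32, 48 };

                    for (BYTE cpu : cpus) {
                        PROCESSOR_NUMBER pn = {};
                        pn.Group = 0;    // all 64 logical CPUs of a 1-socket EPYC fit in group 0
                        pn.Number = cpu;

                        USHORT node = 0;
                        if (GetNumaProcessorNodeEx(&pn, &node))
                            printf("Logical processor %u -> NUMA node %u\n", cpu, node);
                        else
                            printf("Lookup failed for processor %u (%lu)\n", cpu, GetLastError());
                    }
                    return 0;
                }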



            • #7
              Could you please try this updated build of PerformanceTest, https://www.passmark.com/ftp/temp/petst_debug.exe, and let us know if you still see the same behaviour.
