PassMark Software



NUMA Memory benchmarking on AMD ThreadRipper


  • NUMA Memory benchmarking on AMD ThreadRipper

    AMD's new ThreadRipper CPUs actually contain two CPU modules, which means two separate memory buses.

    As a result, there is the possibility of Non-Uniform Memory Access (NUMA): some of the time CPU 0 might use its own RAM, and some of the time CPU 0 might need to use the RAM connected to CPU 1. In theory, using the RAM directly connected to your own CPU should be faster, and using the more remote RAM should be slower.

    We set out to test this today using PerformanceTest's advanced memory test on a ThreadRipper 1950X. The memory in use was Corsair cmk32gx4m2a2666c16 (DDR4, 2666 MHz, 16-18-18-35, 4 x 16GB).

    Note that you need PerformanceTest version 9.0 build 1022 (20/Dec/2017) or higher to do this. In previous releases the advanced memory test was not NUMA aware (or had NUMA-related bugs).

    Graphs are below, but the summary is:

    1) For sequential reading of memory, there is no significant performance difference for small blocks of RAM, but for larger blocks that fall outside the CPU's cache, the performance difference can be up to 60%. (This test corresponds to our "memory speed per block size" test in PerformanceTest.) So having NUMA-aware applications is important for a system like this.

    2) For non-sequential reads, where we skip forward by some step factor before reading the next value, there is a significant performance hit. (This corresponds to our "memory speed per step size" test.) We suspect this is due to the cache being a lot less effective. Performance degradation was around 20%.

    3) Latency is higher when accessing memory on a distant node compared to the local node: memory accesses are around 60% slower. Again this shows why NUMA-aware applications (and operating systems) are important. What we did notice, however, is that if we didn't explicitly select the NUMA node, most of the time the system itself seemed to select the best node anyway (using malloc() on Win10). We don't know if this was by design or just good luck.

    Note: AMD's EPYC CPUs should behave the same, but we only have a ThreadRipper system to play with.

    NUMA Memory Step Size Benchmark

    NUMA Graph AMD ThreadRipper Block Size

    Latency results - Same NUMA node

    NUMA Memory Latency Same Node

    Latency results - Remote NUMA node

    NUMA Memory Latency Remote Node

    Also worth noting: the default BIOS setup (ASUS motherboard) didn't enable NUMA; it had to be enabled manually.

    This is done in the advanced mode, under "DF common options", by selecting the "memory interleaving" setting and changing it from "Auto" to "Channel".

    ASUS NUMA memory setting in UEFI BIOS
    Attached Files

  • #2
    And just for comparison, here is a graph from an Intel i7-4770 with 16GB of GSkill DDR3 (F3-17000CL11, 4 x 4GB RAM).

    Intel RAM Benchmark (Intel-RAM-Benchmark.png)

    Linear: 5.4 ns
    Random: 78 ns
    Random range: 26 ms


    • #3
      RE: NUMA slowdown of CPU performance

      Thanks for the update on this! I am seeing a similar impact from NUMA while benchmarking:

      VIDEOSTAR workstation - CPU: Threadripper 1950X, GPU: GTX 1080Ti, MBd: ASRock Tai Chi,
      RAM: 64GB @3200, SSD: NVMe Samsung 960 Pro 1TB, PSU: DarkQuiet Pro 1200W,
      Case: Fractal XL R2, Cooler: Arctic 360 (6-fan) + 4x 140mm fans, DISP: BENQ PV3200PT, OS: Win10 Pro

      If I understand correctly, for my Tai Chi board NUMA is 'on' when memory interleaving is set to 'channel'
      and 'off' when it is set to 'auto'. With that definition, here are scores for the exact same configuration
      with NUMA on or off, CPU @ 4.125 GHz:

      Passmark 9    NUMA and HPET OFF    NUMA and HPET ON

      System              7004                 7216
      CPU                28004                25125
      2D                   994                  981
      3D                 15982                15853
      Mem                 2238                 2622
      Disk               20349                19938

      So NUMA increases the system score by about 3%, but reduces the CPU score
      by nearly 11%. The 17% increase in the memory score is what drives the
      increased system score. NUMA slightly reduces the 2D, 3D, and disk scores.

      The pattern and magnitude are similar at other CPU speeds
      (3.4, 3.7, 3.975, and 4.075 GHz), and also when I was using a
      slower SSD, RAM at 2133, and stock settings for Win10 Pro.

      Is this a permanent problem inherent to the design, or something AMD's
      memory controller and BIOS design could improve?