Measuring NUMA effects with the STREAM benchmark

Lars Bergstrom. 18 May, 2012.
Communicated by John Reppy.
Supersedes: TR-2011-02 (updated 05/18/12)


Modern high-end machines feature multiple processor packages, each of which contains multiple independent cores and integrated memory controllers connected directly to dedicated physical RAM. These packages are connected to one another via an interconnect, creating a system with a heterogeneous memory hierarchy. This design breaks the illusion, common among programmers, of a uniform address space in which any access to global memory takes the same amount of time (cache effects aside). Access to memory that is directly attached to the processor executing the code not only has lower latency (owing to the shorter path) but also differs in bandwidth from access to memory attached to a remote package.

The impact of this heterogeneous memory architecture is not easily understood from vendor benchmarks. Even where such measurements are available, they report only best-case memory throughput and rely on hand-tuned assembly to maximize it. This work presents concrete measurements of NUMA effects on both a 48-core AMD Opteron machine and a 32-core Intel Xeon machine, obtained using modifications to the well-known STREAM benchmark compiled with GCC at full optimization. This version of the benchmark has already been used in both industrial and academic settings to measure existing and prototype hardware.

Original Document

The original document is available in PDF (uploaded 18 May, 2012 by John Reppy).