A Data Layout Transformation (DLT) Accelerator: Architectural support for data movement optimization in accelerated systems

Tung Hoang; Amirali Shambayati; Andrew Chien. 30 March, 2015.
Communicated by Andrew Chien.


Technology scaling and the growing use of accelerators make optimizing data movement increasingly important in all systems. Further, the growing diversity of memory structures makes embedding such optimization in software non-portable. We propose an architectural solution, the Data Layout Transformation (DLT) accelerator, which provides a simple set of instructions that enable software to describe the required data movement compactly and free the implementation to optimize the movement based on knowledge of the memory hierarchy and system structure.
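To make the idea of describing data movement compactly more concrete, the following is a minimal sketch of a strided-move descriptor and a naive software interpretation of it, used here to convert an array of structures into a structure of arrays. The `StridedMove` descriptor, its fields, and the functions `apply_move` and `aos_to_soa` are hypothetical names chosen for illustration; they are not the DLT instruction set, which this summary does not specify. The point is only that software states what should move, leaving an implementation free to choose how to move it.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class StridedMove:
    """Hypothetical movement descriptor (illustrative, not the DLT format)."""
    src_offset: int  # index of the first source element
    src_stride: int  # distance between consecutive source elements
    dst_offset: int  # index of the first destination element
    dst_stride: int  # distance between consecutive destination elements
    count: int       # number of elements to move


def apply_move(src: Sequence[int], dst: List[int], move: StridedMove) -> None:
    """Naive reference interpretation: copy one element at a time.

    A hardware implementation would be free to reorder or batch these
    accesses to suit the memory system; this loop only defines the result.
    """
    for i in range(move.count):
        src_index = move.src_offset + i * move.src_stride
        dst_index = move.dst_offset + i * move.dst_stride
        dst[dst_index] = src[src_index]


def aos_to_soa(aos: Sequence[int], num_fields: int) -> List[int]:
    """Convert array-of-structures to structure-of-arrays, one descriptor per field."""
    num_records = len(aos) // num_fields
    soa = [0] * (num_records * num_fields)
    for field in range(num_fields):
        move = StridedMove(
            src_offset=field,
            src_stride=num_fields,
            dst_offset=field * num_records,
            dst_stride=1,
            count=num_records,
        )
        apply_move(aos, soa, move)
    return soa


if __name__ == "__main__":
    # Three records of two fields each: (1, 10), (2, 20), (3, 30).
    print(aos_to_soa([1, 10, 2, 20, 3, 30], num_fields=2))
```

In this sketch each field of the transformation is captured by five integers, regardless of how many elements are moved.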

The DLT architecture ideas are applicable to both traditional and accelerator-based heterogeneous systems. Experiments show that DLT can use nearly the full bandwidth (>97%) of a wide range of memory systems, such as DDR3 and HMC, while incurring low software overhead (<2%), versus 11%-50% for advanced gather-scatter DMA engines. We evaluate DLT in an accelerated system, the 10x10 federated heterogeneous system, with DDR3 and HMC. Our results demonstrate that DLT improves system performance by 4.6x-99x (DDR3) and 4.4x-115x (HMC), as well as energy efficiency by 2.8x-48x (DDR3) and 1.4x-38x (HMC).

Original Document

The original document is available in PDF (uploaded 30 March, 2015 by Andrew Chien).