Chen Zou. 1 May, 2019.
Communicated by Andrew Chien.


Heterogeneous architectures based on accelerators are an important path to high performance. However, acceleration increases the performance demands on the memory hierarchy. In this study, we focus on tiled heterogeneous architectures, which pair accelerators with conventional cores and tile them across the CPU chip.

To understand the memory hierarchy requirements and challenges of such tiled heterogeneous architectures, we study a generic accelerator architecture and different memory system configurations using a trace-driven simulation framework with a set of high-performance computing benchmarks. The simulation results enable analysis of the performance and the area and power consumption of the memory hierarchy under different configurations.

According to our results, tiled acceleration increases the bandwidth requirements on the memory hierarchy, impacting every level (L1, L2, LLC, DRAM). Further, the asymmetry of tiles produces unbalanced requirements between the core-side and accelerator-side cache hierarchies. Our studies show that, without change, the memory hierarchy will reduce performance by as much as 6.4x compared to the best memory hierarchy design.

To match accelerator bandwidth requirements, L1 cache architectures should exploit high bank parallelism; for an 8x accelerator, 16 banks are appropriate. Further, to maximize support of the accelerator, an L2 cache that shares capacity between core and accelerator produces better hit rates and pools backside miss-processing bandwidth, mitigating LLC bandwidth bottlenecks. Finally, the bandwidth requirements that accelerators impose on the LLC and DRAM limit the effectiveness of scaling accelerators to higher performance, so system efficiency requires a balance between scaling up and scaling out.
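As a rough illustration of why bank count matters for L1 bandwidth, the sketch below counts how many cycles a banked cache needs to serve a group of concurrent accesses. It is not the simulator used in this study: the 64-byte line size, the low-order line interleaving across banks, the one-access-per-bank-per-cycle model, and the names `bank_of` and `cycles_to_serve` are all assumptions made for this example.

```python
from collections import Counter

LINE_BYTES = 64  # assumed cache-line size (not stated in the abstract)
NUM_BANKS = 16   # bank count the abstract suggests for an 8x accelerator


def bank_of(addr, num_banks=NUM_BANKS, line_bytes=LINE_BYTES):
    """Map a byte address to a bank using assumed low-order line interleaving."""
    return (addr // line_bytes) % num_banks


def cycles_to_serve(addrs, num_banks=NUM_BANKS, line_bytes=LINE_BYTES):
    """Cycles needed when each bank serves one access per cycle (assumed model).

    Accesses that map to the same bank are serialized, so the busiest bank
    determines the total.
    """
    per_bank = Counter(bank_of(a, num_banks, line_bytes) for a in addrs)
    return max(per_bank.values(), default=0)


# Eight concurrent accesses to consecutive cache lines (hypothetical pattern).
unit_stride = [i * LINE_BYTES for i in range(8)]

for banks in (1, 4, 16):
    print(banks, "banks:", cycles_to_serve(unit_stride, num_banks=banks), "cycles")
```

Under these assumptions, a single bank serializes all eight accesses, while 16 banks serve them together; less regular access patterns would see more conflicts than this unit-stride case.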

Furthermore, to show the potential benefits of a tiled heterogeneous architecture, we performed design optimization over a broad design space. Among a set of optimized designs on a Pareto front trading performance against energy efficiency, a tiled heterogeneous chip with 16 tiles, a 12x faster accelerator in each tile, and a carefully optimized memory hierarchy delivers 3.2x the performance of a homogeneous chip of 16 baseline tiles. Acceleration alone, with only the baseline memory hierarchy design, delivers 2.2x performance. The proposed high-bandwidth L1 caches improve this to 2.9x, and the other memory hierarchy designs, which mitigate bandwidth bottlenecks at the LLC and main memory, bring system performance to 3.2x.
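The cumulative speedups quoted above can be broken into per-step gains with simple arithmetic. The short sketch below does this; the speedup values come from the abstract, while the step labels and the helper name `incremental_gains` are chosen here for illustration.

```python
# Cumulative speedups over a homogeneous chip of 16 baseline tiles,
# as quoted in the abstract. Step labels are paraphrased for this example.
steps = [
    ("acceleration with baseline memory hierarchy", 2.2),
    ("plus high-bandwidth L1 caches", 2.9),
    ("plus LLC and main-memory bottleneck mitigation", 3.2),
]


def incremental_gains(steps, baseline=1.0):
    """Return (label, gain over the previous step) for each cumulative speedup."""
    gains = []
    prev = baseline
    for label, speedup in steps:
        gains.append((label, speedup / prev))
        prev = speedup
    return gains


gains = incremental_gains(steps)
for label, gain in gains:
    print(f"{label}: {gain:.2f}x")

# Combined contribution of all memory hierarchy optimizations.
memory_total = steps[-1][1] / steps[0][1]
print(f"memory hierarchy optimizations combined: {memory_total:.2f}x")
```

Read this way, the L1 redesign contributes roughly 1.32x on top of acceleration alone, the LLC and main-memory measures add roughly another 1.10x, and the memory hierarchy optimizations together account for about 1.45x.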

Original Document

The original document is available in PDF (uploaded 1 May, 2019 by Andrew Chien).