A Simple Cache Coherence Scheme for Integrated CPU-GPU Systems

Ardhi W. B. Yudha; Reza Pulungan; Henry Hoffmann. 30 July, 2019.
Communicated by Henry Hoffmann.


This paper presents a novel approach to accelerating applications running on integrated CPU-GPU systems. Many integrated CPU-GPU systems use cache-coherent shared memory to communicate efficiently and easily. In this pull-based approach, the CPU produces data for the GPU to consume; the data resides in a shared cache until the GPU accesses it, resulting in long load latency on the GPU's first access to each cache line. In this work, we propose a new, push-based coherence mechanism that explicitly exploits the CPU and GPU's producer-consumer relationship by automatically moving data from the CPU into the GPU's last-level cache. The proposed mechanism dramatically reduces the GPU's L2 cache miss rate and consequently increases overall performance. Our experiments show that the proposed scheme can improve performance by up to 37%, with typical improvements in the 5–7% range. We find that even when an application does not benefit from the proposed approach, performance is never worse under this model. While we demonstrate how the proposed scheme can co-exist with traditional cache-coherence mechanisms, we also argue that it could serve as a simpler replacement for existing protocols.
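The intuition behind the abstract's pull-vs-push contrast can be illustrated with a toy latency model. This is not the paper's implementation; the latency numbers, the flat set-based model of the GPU L2, and the function names are all illustrative assumptions. It only shows why eagerly pushing producer-written lines into the consumer's cache eliminates the first-access misses that the pull-based scheme incurs.

```python
# Toy model (illustrative assumptions, not the paper's mechanism):
# a CPU producer writes a set of cache lines that a GPU consumer then reads.
MISS_LATENCY = 200  # assumed cycles to service a GPU L2 miss from the shared cache
HIT_LATENCY = 20    # assumed cycles for a GPU L2 hit

def gpu_read_latency(lines, gpu_l2):
    """Total GPU load latency over `lines`, given which lines already sit in GPU L2."""
    total = 0
    for line in lines:
        if line in gpu_l2:
            total += HIT_LATENCY
        else:
            total += MISS_LATENCY
            gpu_l2.add(line)  # fill the line into GPU L2 on a miss
    return total

produced = list(range(1024))  # cache lines written by the CPU producer

# Pull-based: data waits in the shared cache, so every first GPU access misses.
pull_cycles = gpu_read_latency(produced, gpu_l2=set())

# Push-based: producer writes are forwarded into the GPU's L2 eagerly,
# so the consumer's first accesses all hit.
push_cycles = gpu_read_latency(produced, gpu_l2=set(produced))

print(pull_cycles, push_cycles)
```

Under these assumed latencies the pull-based run costs 1024 × 200 cycles while the push-based run costs 1024 × 20, which is the kind of first-access miss elimination the abstract attributes to the proposed mechanism.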

Original Document

The original document is available in PDF (uploaded 30 July, 2019 by Henry Hoffmann).