Communicated by Andrew Chien.


Efficient data movement has emerged as one of the key challenges in large-scale, data-intensive applications. Traditionally, CPU designs have focused on latency-reduction schemes to maximize thread performance. Power-hungry features such as branch predictors, out-of-order execution using reorder buffers, and speculative execution supported by deep cache hierarchies achieve commendable instructions per cycle (IPC), but these mechanisms do not scale efficiently for such applications. GPUs, on the other hand, have been repurposed quite successfully for many of these applications in which data representations are inherently dense and regular, such as fully connected deep learning networks. However, neither CPUs nor GPUs are effective for large-scale irregular applications such as graph processing, which are characterized by irregular data structures and irregular control and communication patterns. In such applications, the reactive, heuristic data movement policies of deep cache hierarchies fail to maintain a high data supply rate because of low spatial locality and poor data reuse.

We propose the UpDown accelerator, an event-driven programmable processor for efficient data movement. We present a novel instruction set architecture for the UpDown accelerator (the UpDown ISA) that uses events as first-class primitives. Coupled with efficient architectural mechanisms, the UpDown accelerator can respond quickly to software and hardware events while intelligently filtering and moving data with low latency and overhead. With parallelism, UpDown can scale to match high memory bandwidths. Additionally, the UpDown accelerator’s programmability enables exploitation of application data structures and algorithmic knowledge to optimize data movement and encoding. We evaluate the UpDown ideas using a collection of graph processing applications, chosen as representative irregular workloads, that exercise the programmability of UpDown. We use cycle-accurate simulations to demonstrate the effectiveness of the UpDown architectural mechanisms and instruction set architecture. Our evaluations show that UpDown can achieve speedups as high as 631x over a single x86 core (the baseline) and can exceed the performance of hardwired graph accelerators by 2.5x. Further, UpDown’s programmable processors incur only a 1.25% area increase in exchange for a major increase in graph performance.

Original Document

The original document is available in PDF (uploaded 27 March, 2023 by Andrew Chien).