Chen Zou. 15 January, 2023.
Communicated by Andrew Chien.


The dramatic growth in the importance of large-scale data analytics has driven the transformation of data center storage from hard disk drives to solid-state drives (SSDs). Central to this increased capability is the rapid growth in SSD/flash bandwidth and the associated compute requirements, resurrecting the question of how to distribute compute across CPUs and storage resources.

In this dissertation, we address three fundamental questions in architecting general-purpose computational storage. First, what are the key opportunities for computational storage acceleration? Second, what SSD architecture provides the most efficient support for computational storage? Third, what computational storage processor architecture enables performance that can match the continued rapid improvement in flash bandwidth? Based on a broad survey of computational storage proposals and research, we show that the function properties of ‘data size change’, ‘offload direction’ and ‘vectorizability’ determine the system efficiency and performance benefits of computational SSD offloads, and thus the priority of offloading. Common properties of the first-priority functions, including streaming access and variable-width values, call for architectural support. Existing computational SSD architectures suffer from poor cost scaling, which is exacerbated by the continued improvement in flash array bandwidth and creates an SSD DRAM bottleneck. We propose the ASSASIN SSD architecture, which places a unified set of compute engines between SSD DRAM and the flash array, eliminating the bottleneck by computing directly on flash data streams with streambuffers. ASSASIN thereby delivers 1.5x - 2.4x speedup on a range of computational SSD offloads, along with 2.0x higher power efficiency and 3.2x higher area efficiency.
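The stream-computing idea behind ASSASIN can be illustrated with a minimal sketch (hypothetical code, not the actual ASSASIN engines): a filter/aggregate offload consumes fixed-size chunks of a flash read stream as they arrive, so the full object never has to be staged in SSD DRAM. The chunk size and function names here are illustrative assumptions.

```python
CHUNK = 4096  # streambuffer-sized chunk; an illustrative value

def flash_read_stream(data: bytes, chunk: int = CHUNK):
    """Model the flash array delivering data chunk by chunk."""
    for off in range(0, len(data), chunk):
        yield data[off:off + chunk]

def offload_count_matches(stream, needle: bytes) -> int:
    """Streaming filter offload: count occurrences of `needle`
    without ever materializing the whole input in DRAM."""
    count = 0
    tail = b""  # carry-over so matches spanning a chunk boundary are found
    for chunk in stream:
        buf = tail + chunk
        count += buf.count(needle)
        # Keep only the last len(needle)-1 bytes: long enough to complete a
        # spanning match, too short to double-count a whole match.
        tail = buf[-(len(needle) - 1):] if len(needle) > 1 else b""
    return count

data = b"error ok error ok ok error" * 1000
print(offload_count_matches(flash_read_stream(data), b"error"))  # 3000
```

Only the running count, not the data, crosses back to the host, which is what makes the offload bandwidth-efficient.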

Existing processor architectures suffer from low datapath efficiency when computing on variable-width values, because each value must be padded to 32/64 bits. Variable-width values are central to coding and storage efficiency, and thus critical for computational storage. We propose VarVE, a vector instruction set architecture extension that provides native support for vectors of variable-width values, computing on them directly without padding. VarVE delivers 1.3x - 5.4x speedup over ARM’s current best, the Scalable Vector Extension (SVE), on popular file system and database computational SSD kernels by achieving higher datapath efficiency. VarVE builds on the vector-length agnostic (VLA) approach, which is gaining widespread adoption. As a result, VarVE has broad potential impact as a general SIMD extension for all processors, increasing datapath efficiency and mitigating the SIMD instruction count explosion.
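The padding cost can be sketched with a toy bit-packing example (hypothetical code, not the VarVE ISA): values stored at their native width, here assumed to be 10-bit dictionary-encoded column values, must each be widened to a full 32-bit lane before a conventional SIMD pipeline can touch them, consuming 3.2x the datapath bits of the native encoding.

```python
def pack(values, width):
    """Bit-pack values at their native width, as they would sit on flash."""
    acc = bits = 0
    out = bytearray()
    for v in values:
        acc |= v << bits
        bits += width
        while bits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            bits -= 8
    if bits:
        out.append(acc & 0xFF)
    return bytes(out)

def unpack(data, width, count):
    """Decode the packed stream; a padded SIMD pipeline would additionally
    expand each decoded value into a full 32-bit lane."""
    acc = bits = 0
    it = iter(data)
    mask = (1 << width) - 1
    vals = []
    for _ in range(count):
        while bits < width:
            acc |= next(it) << bits
            bits += 8
        vals.append(acc & mask)
        acc >>= width
        bits -= width
    return vals

values = [v & 0x3FF for v in range(1000)]       # 10-bit values
packed = pack(values, 10)
assert unpack(packed, 10, len(values)) == values
# Padding each value to a 32-bit lane consumes 3.2x the datapath bits:
print((len(values) * 32) / (len(values) * 10))  # 3.2
```

Computing at the native 10-bit width, as a variable-width vector extension would, is what recovers that 3.2x of wasted datapath.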

Original Document

The original document is available in PDF (uploaded 15 January, 2023 by Andrew Chien).