Yuanwei Fang. 14 December, 2015.
Communicated by Andrew Chien.


Finite automata models are a fundamental computing abstraction for algorithms and solutions to a growing range of problems (bioinformatics, text search, ranking, compression, and network monitoring). Consequently there have been two vibrant threads of invention - new finite automata models and novel implementations - but so far there is no accepted software-hardware interface, a computer architecture, to separate and thereby accelerate progress in both software and hardware. We propose the Unified Automata Processor (UAP), a new architecture that provides general and efficient support for finite automata (FA). We also propose a novel automata transition packing approach used by the UAP, "efficient coupled-linear packing" (EffCLiP), that optimizes both finite-automata size and performance. Together, EffCLiP and the UAP enable high-performance, flexible automata processing: new automata models and applications can be created and deployed as software on an FA-accelerated CPU, rather than requiring a new ASIC or FPGA, elevating automata innovation on future hardware platforms. EffCLiP employs a novel transition representation that enables a simple addressing operator (integer addition) while providing the flexibility needed for transition packing. With EffCLiP, the UAP supports a wide range of existing finite automata models (DFAs, NFAs, A-DFAs, JFAs, counting-DFAs, and counting-NFAs), as well as future novel FA models.
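To make the addressing idea concrete, the following is a minimal sketch (not the EffCLiP implementation; all names are hypothetical) of coupled-linear transition lookup: each state is assigned a base address, the transition on input symbol c lives at slot base + c, and a stored owner tag validates packed entries so different states' transitions can interleave in one array.

```python
# Sketch of coupled-linear transition packing (illustrative; names hypothetical).
# Addressing operator is plain integer addition: slot = state_base + symbol.

class PackedFA:
    def __init__(self, size):
        self.owner = [None] * size   # which state's base claims this slot
        self.dest = [0] * size       # base address of the destination state

    def add_transition(self, base, symbol, dest_base):
        slot = base + symbol                      # integer-addition addressing
        assert self.owner[slot] is None, "collision: choose a different base"
        self.owner[slot] = base
        self.dest[slot] = dest_base

    def step(self, base, symbol, fail_base=0):
        slot = base + symbol
        if self.owner[slot] == base:              # entry really belongs to us
            return self.dest[slot]
        return fail_base                          # default/failure transition

# Tiny two-state DFA over symbols {0, 1}: state at base 0, state at base 4.
fa = PackedFA(16)
fa.add_transition(0, 1, 4)   # on symbol 1, go to the state based at 4
fa.add_transition(4, 0, 0)   # on symbol 0, return to the state based at 0
```

A packer would choose the per-state bases so the claimed slots interleave without collisions, which is where the size/performance trade-off in the abstract comes from.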

The interface to the UAP adds a small number of instructions and registers to a traditional core with vector extensions. The UAP system has multiple parallel lanes, each paired with scratchpad memory, to allow parallel FA processing. A lane completion detection feature allows multi-step FA execution on each UAP lane, and software-controlled static mapping between input streams and lanes allows flexible pattern aggregation or aggressive throughput. Each UAP lane has three main parts: a sequence component, a combining queue, and a stream buffer. The sequence component processes primitives and actions, one per clock cycle. The combining queue holds multiple active states and ensures no concurrent identical states. The stream buffer stores the input stream, freeing the vector register file for other uses during UAP execution.
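The combining queue's duplicate elimination is what keeps NFA-style execution bounded by the number of distinct states rather than the number of transition paths. A minimal software model of that behavior (illustrative only; the hardware structure is not specified here) might look like:

```python
from collections import deque

class CombiningQueue:
    """Model of a combining queue: an active-state worklist that
    merges duplicates on insert, so each state is handled at most
    once per input symbol."""

    def __init__(self):
        self.q = deque()
        self.members = set()

    def push(self, state):
        if state not in self.members:   # combine: drop duplicate active states
            self.members.add(state)
            self.q.append(state)

    def drain(self):
        states = list(self.q)
        self.q, self.members = deque(), set()
        return states

def nfa_step(delta, active, symbol):
    """Consume one input symbol: expand every active state through the
    transition relation delta[(state, symbol)] -> list of successors."""
    cq = CombiningQueue()
    for s in active:
        for t in delta.get((s, symbol), ()):
            cq.push(t)
    return cq.drain()
```

For example, if states 1 and 2 both transition to state 3 on 'b', state 3 is enqueued once, not twice, so the per-symbol work never exceeds the number of distinct active states.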

Evaluation across a variety of widely-used finite automata workloads and models shows that EffCLiP achieves compression rates as high as 8.4x and packing densities of 77% for DFAs, 2.24x and >99% for A-DFAs, and 3.9x and >99% for NFAs, while supporting high speed and simple implementation. Evaluation on realistic workloads shows that the UAP implements all models efficiently, achieving line rates of 94.5% of ideal. A single UAP lane delivers line rates 50x greater than software approaches on CPUs and GPUs. Scaling the UAP to 64 lanes achieves FA transition throughputs as high as 295 Gbps, more than 100x higher than CPUs and GPUs, and even exceeding ASIC approaches such as IBM's RegX by 6x. With efficient support for multiple input streams, the UAP achieves throughputs that saturate even high-speed stacked DRAM memory systems. Our implementation of the UAP design with EffCLiP is efficient in instructions executed, memory references, and energy: it achieves a 1.2 GHz clock rate (32nm TSMC CMOS), and a 64-lane UAP requires only 2.2 mm² and 563 mW. At these levels of performance and energy efficiency, the UAP is a promising candidate for integration into general-purpose computing architectures.

Original Document

The original document is available in PDF (uploaded 14 December, 2015 by Andrew Chien).