Arjun Rawal. 3 June, 2020.
Communicated by Andrew Chien.


Data storage is a fundamental concern for high energy physics; the experiments and data analysis needed to discover new results require petabytes of measurements from particle collisions. Accordingly, data compression has been a central focus of data storage solutions, as it provides an effective way to reduce storage costs and improve analysis performance. Whereas interactive analysis workloads benefit from fast data availability for computation, archival storage benefits from compression that makes data as small as possible. For most high energy physics data, the standard approach to compression is "one size fits all" — data is stored for archive with the same compression used for interactive analysis. Because data analysis and long term storage are fundamentally different use cases, the tradeoffs made to provide performant data analysis result in relatively poor compression for long term data storage. We propose that high energy physics data could be stored much more compactly by using modern computational algorithms and compression approaches that take into account the fundamental characteristics of the data.

We study several modern compression algorithms and evaluate their performance on high energy physics data. We then evaluate several techniques for improving compression ratio: delta encoding, floating-point representation, data aggregation, and dictionary optimizations. These algorithms and techniques exist in a tradeoff space where compression ratio, throughput, and resource utilization can be exchanged to find the best compression for a specific use case.
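To illustrate one of these techniques, the sketch below shows delta encoding on a slowly varying integer series; this is a generic illustration using Python's zlib as a stand-in compressor, not the thesis implementation. Storing differences between consecutive values turns a smooth sequence into a highly repetitive byte stream that a general-purpose compressor handles far better.

```python
import struct
import zlib

# A smooth, monotonically increasing series, standing in for a
# slowly varying physics measurement column.
values = [1000 + i for i in range(10000)]
raw = b"".join(struct.pack("<i", v) for v in values)

# Delta encoding: keep the first value, then store only differences.
deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]
delta_bytes = b"".join(struct.pack("<i", d) for d in deltas)

plain = zlib.compress(raw, 9)
delta = zlib.compress(delta_bytes, 9)
print("raw:", len(plain), "delta-encoded:", len(delta))
```

Because every difference in this series is 1, the delta-encoded stream compresses to a fraction of the size of the raw stream; real detector data is noisier, which is why the tradeoff must be evaluated per dataset.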

Evaluation on real datasets from the ATLAS and CMS experiments shows that adopting algorithms designed for modern processors and larger memory sizes can provide compression ratio improvements of 7% while providing better compression and decompression throughput. Furthermore, applying techniques that take into account the underlying type of a block of data, not just the bytes of data, can increase compression ratio by an additional 5%. Overall, we find that an approach that prioritizes compression ratio can reduce the overall size of data files by more than 15%, providing a significant reduction in data storage requirements.
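A minimal sketch of a type-aware transform of the kind described above, again using zlib only as an illustrative stand-in: if a block is known to hold float32 values, byte-shuffling it groups the slowly varying sign/exponent bytes into contiguous runs, which general-purpose compressors typically exploit much better than the interleaved byte layout.

```python
import struct
import zlib

# A smooth float32 series standing in for a physics data column.
values = [0.1 * i for i in range(10000)]
raw = b"".join(struct.pack("<f", v) for v in values)

# Byte shuffle: emit the plane of byte 0 of every float, then the
# plane of byte 1, and so on. The high-order planes (exponent bits)
# vary slowly and become long compressible runs.
shuffled = b"".join(raw[i::4] for i in range(4))

plain_size = len(zlib.compress(raw, 9))
shuffled_size = len(zlib.compress(shuffled, 9))
print("interleaved:", plain_size, "byte-shuffled:", shuffled_size)
```

The transform is lossless and trivially reversible, so it can be layered under any of the compressors studied; the gain depends on how smoothly the underlying values vary.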

However, this solution is useful only if it is cost-effective. We analyze the cost of scaling up our compression strategies for the ATLAS experiment. We find that a production implementation of our approach would require fewer than 50 CPU cores to handle reading a petabyte of data per day. This approach could reduce data storage requirements by more than 8 petabytes, and save hundreds of thousands of dollars in hard drive and tape storage costs each year. Hence, our approach is cost-effective and feasible on a large scale.

Original Document

The original document is available in PDF (uploaded 3 June, 2020 by Andrew Chien).