TR-2012-07

An Evaluation of Difference and Threshold Techniques for Efficient Checkpoints

Sean Hogan; Jeff R. Hammond; Andrew A. Chien. 17 August, 2012.
Communicated by Andrew Chien.

Abstract

To ensure reliability, long-running and large-scale computations have long used checkpoint-and-restart techniques to preserve computational progress in case of soft or hard failures. These techniques can incur significant overhead, consuming as much as 15% of an application’s resources on the US DOE’s leadership-class systems, and this overhead is projected to grow in exascale systems, which are likely to have lower I/O-to-compute ratios and higher failure rates. We explore the use of differenced checkpoint and cutoff techniques to increase the effectiveness of Lempel-Ziv compression (gzip), and thereby reduce the size of checkpoints. We apply these techniques to several types of scientific checkpoint data from NWChem, a widely used computational chemistry code. Our results show that while standard compression techniques (and even those customized for floating-point data) yield modest compression ratios (1.2), differenced checkpoints and cutoffs are dramatically more successful, improving compression ratios by 50%, to a range of 1.55 to 3.15, for a variety of checkpoint data. If cutoffs are incorporated in the differenced checkpoints, these compression ratios can be increased further, with a cutoff of 10^7 yielding compression ratios greater than 100. These results suggest that further exploration of these approaches is a promising way to reduce checkpoint (and resilience) overhead.
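
The idea of differencing and cutoffs can be illustrated with a minimal sketch. It is not the report's implementation: the function names, the synthetic data, and the cutoff value below are all assumptions made for illustration, and Python's zlib module (the same DEFLATE algorithm that gzip uses) stands in for gzip. The sketch stores the element-wise difference between two consecutive checkpoints, zeroes differences whose magnitude falls below a cutoff (which makes the scheme lossy), and compares the resulting compression ratios.

```python
import math
import struct
import zlib


def pack(values):
    """Serialize a list of floats as little-endian IEEE 754 doubles."""
    return struct.pack(f"<{len(values)}d", *values)


def compression_ratio(values):
    """Uncompressed size divided by DEFLATE-compressed size."""
    raw = pack(values)
    return len(raw) / len(zlib.compress(raw, 9))


def differenced(prev, curr, cutoff=0.0):
    """Element-wise difference curr - prev.

    Differences with magnitude below `cutoff` are replaced by 0.0,
    so any cutoff > 0 makes the stored checkpoint lossy.
    """
    out = []
    for p, c in zip(prev, curr):
        d = c - p
        out.append(0.0 if abs(d) < cutoff else d)
    return out


def restore(prev, diff):
    """Rebuild the current checkpoint from the previous one plus the diff."""
    return [p + d for p, d in zip(prev, diff)]


def make_demo_checkpoints(n=20000):
    """Synthetic data (an assumption, not NWChem output): most entries change
    by a tiny amount between checkpoints, every 100th entry changes by 1e-3."""
    prev = [math.sin(0.001 * i) for i in range(n)]
    curr = [
        p + 1e-9 * math.cos(i) + (1e-3 if i % 100 == 0 else 0.0)
        for i, p in enumerate(prev)
    ]
    return prev, curr


if __name__ == "__main__":
    prev, curr = make_demo_checkpoints()
    cutoff = 1e-6  # illustrative value chosen for the synthetic data only
    print("full checkpoint:   ", compression_ratio(curr))
    print("difference only:   ", compression_ratio(differenced(prev, curr)))
    print("difference + cutoff:", compression_ratio(differenced(prev, curr, cutoff)))
```

On data like this, the cutoff turns most of the difference into runs of zero bytes, which DEFLATE encodes very compactly. The price is that each restored value may deviate from the true value by up to the cutoff, so the cutoff has to be chosen against the accuracy the application can tolerate.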

Original Document

The original document is available in PDF (uploaded 17 August, 2012 by Andrew Chien).