TR-2012-07
An Evaluation of Difference and Threshold Techniques for Efficient Checkpoints
Sean Hogan; Jeff R. Hammond; Andrew A. Chien. 17 August, 2012.
Communicated by Andrew Chien.
Abstract
To ensure reliability, long-running and large-scale
computations have long used checkpoint-and-restart techniques to
preserve computational progress in case of soft or hard failures.
These techniques can incur significant overhead, consuming as
much as 15% of an application’s resources for the US DOE’s
leadership-class systems, and these overheads are projected to
grow on exascale systems, which are likely to have lower I/O-to-
compute ratios and higher failure rates.
We explore the use of differenced checkpoints and cutoff
techniques to increase the effectiveness of Lempel-Ziv
compression (gzip), and thereby reduce the size of checkpoints.
techniques to several types of scientific checkpoint data from
NWChem, a widely-used computational chemistry code. Our
results show that while standard compression techniques (and
even those customized for floating-point data) yield modest
compression ratios (~1.2), differenced checkpoints and cutoffs
are dramatically more successful, improving compression ratios
by 50% or more, to 1.55-3.15, across a variety of checkpoint
data. If cutoffs are applied within the differenced checkpoints,
these compression ratios can be increased further, with a cutoff
of 10^-7 yielding dramatic compression ratios greater than 100.
These results suggest that further exploration of these
approaches is promising for reducing checkpoint (and resilience)
overhead.
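The pipeline described above, differencing successive checkpoints, zeroing entries below a cutoff, then applying Lempel-Ziv compression, can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's actual implementation; the state vector, its evolution, and the 10^-7 cutoff value here are stand-in assumptions.

```python
import struct
import zlib


def pack(values):
    # Serialize a list of doubles to bytes (8 bytes per value).
    return struct.pack(f"{len(values)}d", *values)


def differenced(prev, curr):
    # Element-wise difference between successive checkpoints.
    return [c - p for p, c in zip(prev, curr)]


def apply_cutoff(values, threshold):
    # Zero out entries whose magnitude falls below the cutoff; the
    # resulting runs of identical bytes are what Lempel-Ziv exploits.
    return [v if abs(v) >= threshold else 0.0 for v in values]


# Hypothetical checkpoint data: a slowly evolving state vector.
prev = [1.0 + 1e-9 * i for i in range(10000)]
curr = [v + 1e-8 for v in prev]

# Compress the raw checkpoint vs. the differenced-and-cutoff version
# (zlib implements the same Lempel-Ziv scheme gzip uses).
raw = zlib.compress(pack(curr), 9)
diff = zlib.compress(pack(apply_cutoff(differenced(prev, curr), 1e-7)), 9)

print(len(pack(curr)) / len(raw))   # plain compression ratio
print(len(pack(curr)) / len(diff))  # differenced + cutoff ratio
```

Because every element-wise change in this synthetic example is smaller than the cutoff, the differenced checkpoint collapses to all zeros and compresses far better than the raw floating-point data, mirroring the abstract's observation that cutoffs on differenced checkpoints yield the largest gains.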
Original Document
The original document is available in PDF (uploaded 17 August, 2012 by
Andrew Chien).