Error Checking and Snapshot-Based Recovery in a Preconditioned Conjugate Gradient Solver

Zachary Rubenstein; James Dinan; Hajime Fujita; Ziming Zheng; Andrew A Chien. 16 December, 2013.
Communicated by Andrew Chien.


Soft errors are a significant concern for high-performance computing systems in the exascale time frame. We apply our group's Global View Resilience (GVR) library to a preconditioned conjugate gradient solver, evaluating per-data-structure snapshots and varied error detection approaches to tolerate soft errors.

Using 14 real-world matrices from the University of Florida Sparse Matrix Collection, we use error injection to assess the viability of several detection and correction schemes. These studies show: 1) Though inexpensive, residual-based detection performs poorly: achieving acceptably low false negative rates requires much higher (20x) false positive rates. 2) Though more expensive, algorithm-based detection performs better overall, achieving much lower false negative rates at one fifth the false positive rate. Even this "expensive" error detection is inexpensive compared to a single iteration, and is therefore viable for linear solvers, particularly in high fault-rate systems.
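To make the two detection styles and the snapshot-based rollback concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation and does not use the GVR API. The abstract does not specify the exact checks, so both detectors are assumptions: the residual-based check flags a sudden jump in the residual norm, and the algorithm-based check is a column-checksum test on the matrix-vector product. The Jacobi preconditioner, the function names, the thresholds, and the snapshot interval `snap_every` are likewise hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def residual_ok(prev_norm, new_norm, factor=100.0):
    # Residual-based check (assumed form): flag a sudden jump in ||r||.
    # A looser factor misses errors; a tighter one raises false alarms.
    return new_norm <= factor * prev_norm

def checksum_ok(col_sums, p, q, tol=1e-8):
    # Algorithm-based check (assumed form): sum(A p) must equal (1^T A) . p.
    expected = dot(col_sums, p)
    return abs(sum(q) - expected) <= tol * max(1.0, abs(expected))

def pcg(A, b, max_iters=50, tol=1e-10, snap_every=2, inject_at=None):
    """Jacobi-preconditioned CG with detection and snapshot rollback."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]   # Jacobi preconditioner
    col_sums = [sum(A[i][j] for i in range(n)) for j in range(n)]
    x = [0.0] * n
    r = list(b)
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = z
    rz = dot(r, z)
    rnorm = math.sqrt(dot(r, r))
    # Solver state is only ever replaced by new lists, never mutated in
    # place, so a snapshot can safely hold references to the old lists.
    snapshot = (x, r, p, rz, rnorm)
    rollbacks = 0
    for k in range(max_iters):
        if rnorm < tol:
            break
        q = matvec(A, p)
        if k == inject_at:
            q[0] += 1.0                            # simulated transient soft error
        ok = checksum_ok(col_sums, p, q)
        if ok:
            alpha = rz / dot(p, q)
            x_new = [xi + alpha * pi for xi, pi in zip(x, p)]
            r_new = [ri - alpha * qi for ri, qi in zip(r, q)]
            new_norm = math.sqrt(dot(r_new, r_new))
            ok = residual_ok(rnorm, new_norm)
        if not ok:
            x, r, p, rz, rnorm = snapshot          # roll back to last snapshot
            rollbacks += 1
            continue
        z = [d * ri for d, ri in zip(inv_diag, r_new)]
        rz_new = dot(r_new, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        x, r, rz, rnorm = x_new, r_new, rz_new, new_norm
        if (k + 1) % snap_every == 0:
            snapshot = (x, r, p, rz, rnorm)        # periodic snapshot
    return x, rollbacks

if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [6.0, 10.0, 8.0]                           # exact solution is [1, 2, 3]
    x, rollbacks = pcg(A, b, inject_at=1)
    print(x, rollbacks)
```

In this sketch the checksum test costs one extra dot product per iteration, while the residual test reuses a norm the solver already computes. That mirrors, in a simplified way, the cost difference between the two detection styles described above.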

Original Document

The original document is available in PDF (uploaded 16 December, 2013 by Andrew Chien).