Fault Tolerance in an Inner-outer Solver: A GVR-enabled Case Study

Ziming Zheng; Andrew A Chien; Mark Hoemmen; Keita Teranishi. 9 January, 2014.
Communicated by Andrew Chien.


Resilience is a major challenge for large-scale systems. Inner-outer solvers such as the Flexible Generalized Minimal Residual method (FGMRES) are widely used in scientific applications, where they play a key role in both performance and resilience. Although FGMRES is robust to soft errors, we show that single bit-flip errors can lead to high computation overhead or even divergence (failure). Informed by these results, we design and evaluate several fault-tolerance strategies for both the inner and outer solvers, suited to a range of error rates. We implement them by extending the Trilinos FGMRES solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable, rich error checking and recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.
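
As a rough illustration of why a single bit flip can range from negligible to catastrophic, the sketch below inverts one bit in the IEEE 754 double-precision encoding of a value. The helper `flip_bit` is a hypothetical name introduced here for illustration; it is not part of Trilinos or GVR, and it is not claimed to be the fault-injection method used in the paper.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its IEEE 754 binary64 encoding inverted.

    Bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign bit. (Hypothetical helper.)
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 63]")
    # Reinterpret the double as a 64-bit unsigned integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Invert the chosen bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


x = 1.0
print(flip_bit(x, 0))   # lowest mantissa bit: differs from 1.0 by 2**-52
print(flip_bit(x, 52))  # lowest exponent bit: the value is halved
print(flip_bit(x, 62))  # highest exponent bit: 1.0 becomes infinity
print(flip_bit(x, 63))  # sign bit: 1.0 becomes -1.0
```

A flip in a low mantissa bit perturbs the value at the level of rounding error, which an iterative solver may absorb with little extra work, whereas a flip in a high exponent bit can produce a value many orders of magnitude off (or a non-finite one), the kind of corruption that could plausibly cause the extra iterations or divergence described above.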

Original Document

The original document is available in PDF (uploaded 9 January, 2014 by Andrew Chien).