TR-2014-06

How Applications Use GVR: Use Cases

Andrew A. Chien; The GVR Team. 28 April, 2014.
Communicated by Andrew Chien.

Abstract

As scientific computation moves from petascale to exascale, we must contend with a corresponding increase in hardware faults. To retain correct results and acceptable time to completion, some aspects of existing systems or software (or both) need to be hardened against an increasing fault rate. Generic fault tolerance, such as global checkpoint/restart or dual-modular redundancy, can handle faults, but only at a great cost in compute time or extra hardware. More efficient fault tolerance likely requires that developers make each application fault tolerant individually. We present the Global View Resilience (GVR) framework, which aims to ease the task of augmenting existing applications with fault-tolerance mechanisms tailored to the requirements of the application. We then discuss our experiences in providing fault tolerance for several existing applications (miniMD, ddcMD, miniFE, Trilinos, Preconditioned Conjugate Gradient, GMRES, and OpenMC) using GVR. We find that GVR is useful for adding resilience to diverse applications in application-specific ways.

Original Document

The original document is available in Color PDF (uploaded 28 April, 2014 by Andrew Chien).