TR-2014-13

Global View Resilience (GVR) Documentation, Release 1.0

Andrew Chien. 5 October, 2014.
Communicated by Andrew Chien.

Abstract

Describes the Global View Resilience (GVR) system and typical use cases, including application case studies such as miniMD, ddcMD, miniFE, Trilinos, Preconditioned Conjugate Gradient, GMRES, and OpenMC.

Describes the application programming interface for the Global View Resilience (GVR) system. GVR is a new approach that exploits a global view data model (global naming of data, consistency, and distributed layout), adding reliability to globally visible distributed arrays. Key novel features in GVR include: 1) multi-version arrays, with the versioning rate of each array controlled separately by the application (multi-stream), 2) flexible multi-version recovery, and 3) Open Resilience: unified error signalling and handling for flexible cross-layer error recovery. With a global versioned array as a portable abstraction, GVR enables application programmers to manage reliability (and its overhead) in a flexible, portable fashion, tapping their deep scientific and application code insights. We will research algorithms and a runtime that map and adapt the application/system's reliability deployment based on application-specified reliability priorities. The unified error handling framework enables application error detection (checking) and recovery routines that handle diverse classes of errors with a single application recovery. This architecture enables applications and systems to work in concert, exploiting semantics (algorithmic or even scientific domain) and key capabilities (e.g., fast error detection in hardware) to dramatically increase the range of errors that can be detected and corrected.
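
The multi-stream versioning and multi-version recovery ideas above can be illustrated with a minimal, single-process sketch. This is not the GVR API: the class `VersionedArray`, its methods `put`, `get`, `snapshot`, and `restore`, and the NaN-based validity check are hypothetical names chosen only for illustration, and the sketch assumes plain in-memory lists rather than distributed arrays.

```python
import math


class VersionedArray:
    """Toy in-memory stand-in for a multi-version array (hypothetical, not the GVR API)."""

    def __init__(self, size):
        self.data = [0.0] * size  # current working copy
        self.versions = []        # saved snapshots, oldest first

    def put(self, index, value):
        self.data[index] = value

    def get(self, index):
        return self.data[index]

    def snapshot(self):
        """Save a copy of the current contents; the caller decides how often."""
        self.versions.append(list(self.data))
        return len(self.versions) - 1

    def restore(self, is_valid):
        """Roll back to the newest snapshot accepted by is_valid.

        Returns the index of the restored snapshot, or None if none qualifies.
        """
        for v in range(len(self.versions) - 1, -1, -1):
            if is_valid(self.versions[v]):
                self.data = list(self.versions[v])
                return v
        return None


def no_nans(values):
    """Application-level check: a snapshot is usable if it contains no NaN."""
    return not any(math.isnan(x) for x in values)


# Two arrays versioned at different, application-chosen rates (multi-stream).
positions = VersionedArray(4)
energies = VersionedArray(4)
for step in range(1, 7):
    for i in range(4):
        positions.put(i, float(step * 10 + i))
        energies.put(i, float(step))
    if step % 2 == 0:  # positions: snapshot every 2 steps
        positions.snapshot()
    if step % 3 == 0:  # energies: snapshot every 3 steps
        energies.snapshot()

# Simulate an error that corrupted both the working copy and the latest snapshot.
positions.put(0, float("nan"))
positions.versions[-1][1] = float("nan")

# Multi-version recovery: skip the damaged snapshot and fall back to an older one.
restored = positions.restore(no_nans)
```

In this sketch the application chooses both the snapshot rate for each array and the validity test used during recovery, which is the kind of application-controlled policy the abstract describes; a real deployment would additionally involve distributed storage and the system's own error signalling.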

Original Document

The original document is available in PDF (uploaded 5 October, 2014 by Andrew Chien).