CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

Faisal Shahzad, Jonas Thies, Moritz Kreutzer, Thomas Zeiser, Georg Hager, Gerhard Wellein

Research output: Contribution to journal, Article, Scientific, peer-review

30 Citations (Scopus)

Abstract

The High Performance Computing (HPC) community anticipates fault tolerance and power consumption as two of the prime challenges in using future generations of supercomputers efficiently. Checkpoint/Restart (CR) has been, and still is, the most widely used technique for dealing with hard failures. Application-level CR is the most efficient CR technique in terms of overhead, but it takes considerable implementation effort.

Original language: English
Journal: IEEE Transactions on Parallel and Distributed Systems
Publication status: Published - 2018
Externally published: Yes

Keywords

  • Application-level checkpoint/restart
  • automatic fault tolerance
  • Checkpointing
  • Fault tolerance
  • Fault tolerant systems
  • Hardware
  • Kernel
  • Libraries
  • User-Level Failure Mitigation (ULFM)
