Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks

Bogdan Ghit, Dick Epema

Research output: Chapter in Book/Conference proceedings/Edited volumeConference contributionScientificpeer-review

10 Citations (Scopus)
140 Downloads (Pure)

Abstract

Providing fault-tolerance is of major importance for data analytics frameworks such as Hadoop and Spark, which are typically deployed in large clusters that are known to experience high failures rates. Unexpected events such as compute node failures are in particular an important challenge for in-memory data analytics frameworks, as the widely adopted approach to deal with them is to recompute work already done. Recomputing lost work, however, requires allocation of extra resource to re-execute tasks, thus increasing the job runtimes. To address this problem, we design a checkpointing system called panda that is tailored to the intrinsic characteristics of data analytics frameworks. In particular, panda employs fine-grained checkpointing at the level of task outputs and dynamically identifies tasks that are worthwhile to be checkpointed
rather than be recomputed. As has been abundantly shown, tasks of data analytics jobs may have very variable runtimes and output sizes. These properties form the basis of three checkpointing policies which we incorporate into panda. We first empirically evaluate panda on a multicluster system with single data analytics applications under space-correlated failures, and find that panda is close to the performance of a fail-free execution in unmodified Spark for a large range of concurrent failures. Then we perform simulations of complete workloads, mimicking the size and operation of a Google cluster, and show that panda provides significant improvements in the average job runtime for wide ranges of the failure rate and system load.
Original languageEnglish
Title of host publicationProceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, HPDC 2017
Place of PublicationNew York, NY
PublisherAssociation for Computing Machinery (ACM)
Pages105-116
Number of pages12
ISBN (Electronic)978-1-4503-4699-3
DOIs
Publication statusPublished - 2017
EventHPDC 2017: 26th International Symposium on High-Performance Parallel and Distributed Computing - Washington, DC, United States
Duration: 26 Jul 201730 Jul 2017
Conference number: 26
http://www.hpdc.org/2017/

Conference

ConferenceHPDC 2017
Country/TerritoryUnited States
CityWashington, DC
Period26/07/1730/07/17
Internet address

Fingerprint

Dive into the research topics of 'Better Safe than Sorry: Grappling with Failures of In-Memory Data Analytics Frameworks'. Together they form a unique fingerprint.

Cite this