Catching failures of failures at big-data clusters: A two-level neural network approach

Andrea Rosa, Lydia Y. Chen, Walter Binder

Research output: Chapter in Book/Conference proceedings/Edited volume > Conference contribution > Scientific > peer-review

5 Citations (Scopus)

Abstract

Big-data applications are becoming the core of today's business operations, featuring complex data structures and high task fan-out. According to the publicly available Google trace, more than 40% of big-data jobs do not reach successful completion. Interestingly, a significant portion of the tasks of such failed jobs undergo multiple types of repeated failed executions and consume a non-negligible amount of resources. To conserve resources in big-data clusters, it is imperative to capture such failed tasks of failed jobs, a very challenging problem because tasks are subject to multiple types of failures and are distributed highly unevenly. In this paper, we develop an on-line two-level Neural Network (NN) model that can accurately untangle the complex dependencies among tasks and jobs, and predict their execution classes in an extremely dynamic and heterogeneous system. Our proposed NN model first predicts the job class and then predicts three classes of failed tasks of failed jobs, based on a sliding learning window. Furthermore, we develop resource conservation policies that terminate failed tasks of failed jobs after a grace period derived from prediction confidences and task execution times. Overall, evaluating our approach on a Google cluster trace, we are able to accurately capture failures of failures at big-data clusters, reduce false-negative tasks to 1%, and efficiently save system resources, achieving reductions in CPU, memory and disk consumption of as much as 49%.
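The termination policy described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `TaskPrediction` and `grace_period_s`, the 0.5 thresholds, the use of a single failure probability per task (the paper predicts three classes of failed tasks), and the linear rule that shortens the grace period as confidence grows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskPrediction:
    """Hypothetical container for the outputs of a two-level predictor."""
    job_fail_prob: float       # level 1: predicted probability that the parent job fails
    task_fail_prob: float      # level 2: predicted probability that this task fails
    expected_runtime_s: float  # typical execution time of comparable tasks, in seconds


def grace_period_s(pred: TaskPrediction,
                   job_threshold: float = 0.5,
                   task_threshold: float = 0.5) -> Optional[float]:
    """Return how long to let a task run before terminating it, or None to keep it.

    Assumed rule (not the paper's): a task is a termination candidate only if
    both levels predict failure, and the grace period shrinks linearly as the
    task-level confidence increases.
    """
    # Level 1 gate: tasks of jobs predicted to succeed are never terminated.
    if pred.job_fail_prob < job_threshold:
        return None
    # Level 2 gate: within a failing job, only tasks predicted to fail qualify.
    if pred.task_fail_prob < task_threshold:
        return None
    # Higher confidence leaves less of the expected runtime as a grace period.
    return (1.0 - pred.task_fail_prob) * pred.expected_runtime_s


if __name__ == "__main__":
    confident = TaskPrediction(job_fail_prob=0.9, task_fail_prob=0.75,
                               expected_runtime_s=100.0)
    healthy_job = TaskPrediction(job_fail_prob=0.2, task_fail_prob=0.9,
                                 expected_runtime_s=100.0)
    print(grace_period_s(confident))    # grace period in seconds
    print(grace_period_s(healthy_job))  # None: the job is not predicted to fail
```

The two gates mirror the ordering in the abstract (job class first, task class second); the grace-period formula is only one plausible way to combine confidence with execution time.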

Original language: English
Title of host publication: 2015 IEEE 23rd International Symposium on Quality of Service, IWQoS 2015
Publisher: Institute of Electrical and Electronics Engineers (IEEE)
Pages: 231-236
Number of pages: 6
ISBN (Electronic): 9781467371131
Publication status: Published - 10 Feb 2016
Externally published: Yes
Event: 23rd IEEE International Symposium on Quality of Service, IWQoS 2015 - Portland, United States
Duration: 15 Jun 2015 to 16 Jun 2015

Conference

Conference: 23rd IEEE International Symposium on Quality of Service, IWQoS 2015
Country: United States
City: Portland
Period: 15/06/15 to 16/06/15
