Runtime interval optimization and dependable performance for application-level checkpointing

Apostolos Kokolis*, Alexandros Mavrogiannis, Dimitrios Rodopoulos, Christos Strydis, Dimitrios Soudris

*Corresponding author for this work

    Research output: Chapter in Book/Conference proceedings/Edited volume; Conference contribution; Scientific; peer-review

    2 Citations (Scopus)

    Abstract

    As aggressive integration paves the way for performance enhancement of many-core chips and technology nodes go below deca-nanometer dimensions, system-wide failure rates are becoming noticeable. Inevitably, system designers need to properly account for such failures. Checkpoint/Restart (C/R) can be deployed to prolong dependable operation of such systems. However, it introduces additional overheads that lead to performance variability. We present a versatile dependability manager (DepMan) that orchestrates a many-core application-level C/R scheme, while being able to follow time-varying error rates. DepMan also contains a dedicated module that ensures on-the-fly performance dependability for the executing application. We evaluate the performance of our scheme using an error injection module both on the experimental Intel Single-Chip Cloud Computer (SCC) and on a commercial Intel i7 general purpose computer. Runtime checkpoint interval optimization adapts to a variety of failure rates without extra performance or energy costs. The inevitable timing overhead of C/R is reclaimed systematically with Dynamic Voltage and Frequency Scaling (DVFS), so that dependable application performance is ensured.
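
The two mechanisms named in the abstract, adapting the checkpoint interval to a changing error rate and reclaiming the C/R overhead through DVFS, can be illustrated with a minimal sketch. It is not DepMan's algorithm, which the abstract does not specify. It assumes Young's classic first-order approximation for the interval (interval = sqrt(2 × checkpoint cost × MTBF)) and a CPU-bound workload whose runtime scales inversely with clock frequency. All function names and numbers are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    Assumption: this formula stands in for whatever optimizer DepMan uses.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def estimate_mtbf(window_s: float, errors_seen: int, previous_mtbf_s: float) -> float:
    """Re-estimate the mean time between failures from the last observation window.

    Hypothetical helper: keeps the previous estimate if no error was observed.
    """
    if errors_seen == 0:
        return previous_mtbf_s
    return window_s / errors_seen


def frequency_to_reclaim(base_freq_mhz: float, overhead_fraction: float) -> float:
    """Clock frequency that hides a given C/R time overhead.

    Assumption: runtime is inversely proportional to frequency (CPU-bound code).
    """
    if overhead_fraction < 0:
        raise ValueError("overhead fraction must be non-negative")
    return base_freq_mhz * (1.0 + overhead_fraction)


if __name__ == "__main__":
    mtbf = 3600.0                      # initial guess: one failure per hour
    print(young_interval(2.0, mtbf))   # interval for a 2 s checkpoint

    # Three errors in the last 10 minutes: the error rate rose, so the
    # re-computed interval shrinks.
    mtbf = estimate_mtbf(600.0, 3, mtbf)
    print(young_interval(2.0, mtbf))

    # A 25% timing overhead at 800 MHz calls for a higher operating point.
    print(frequency_to_reclaim(800.0, 0.25))
```

In this sketch a rising error rate lowers the estimated MTBF, which shortens the interval between checkpoints. The frequency helper shows only the first-order relation between overhead and clock speed, and ignores voltage limits and memory-bound phases.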

    Original language: English
    Title of host publication: Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Pages: 594-599
    Number of pages: 6
    ISBN (Electronic): 9783981537062
    Publication status: Published - 25 Apr 2016
    Event: 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016 - Dresden, Germany
    Duration: 14 Mar 2016 to 18 Mar 2016

    Publication series

    Name: Proceedings of the 2016 Design, Automation and Test in Europe Conference and Exhibition, DATE 2016

    Conference

    Conference: 19th Design, Automation and Test in Europe Conference and Exhibition, DATE 2016
    Country/Territory: Germany
    City: Dresden
    Period: 14/03/16 to 18/03/16
