    Compiler-Assisted Resilience Framework for Recovery from Transient Faults

    CHEN-DISSERTATION-2020.pdf (1023 KB)
    Date
    2020-12-06
    Author
    Chen, Chao
    Abstract
    Due to system scaling trends toward smaller transistor sizes, higher circuit density, and the use of near-threshold voltage (NTV) techniques, transient hardware faults introduced by external noise, e.g., heat fluxes and particle strikes, have become a growing concern for current and upcoming extreme-scale high-performance computing (HPC) systems. Applications running on these systems are projected to experience transient errors more frequently than ever before; such errors either cause applications to silently generate incorrect outputs or cause them to crash. Efficient resilience techniques against transient hardware faults are therefore required for modern HPC applications. This dissertation is concerned with the design, implementation, and evaluation of a lightweight resilience framework that mitigates the impact of transient hardware faults on large-scale scientific applications. The framework consists of three novel techniques: 1) LADR, a lightweight anomaly-based approach that protects scientific applications against transient-fault-induced silent data corruptions (SDCs); 2) CARE, a low-cost compiler-assisted technique that repairs a crashed process on the fly when a crash-causing transient error is detected, so that the application can continue executing instead of being terminated and restarted; and 3) IterPro, which targets recovery from corruption of induction variables by exploiting side effects of modern compiler optimizations. To limit runtime overhead during normal execution, these approaches exploit properties of scientific applications via compiler techniques. As a result of this design strategy, they incur negligible (<3%) or even zero runtime overhead during normal execution while still achieving a high level of fault coverage.
    URI
    http://hdl.handle.net/1853/64214
    Collections
    • College of Computing Theses and Dissertations [1191]
    • Georgia Tech Theses and Dissertations [23877]
    • School of Computer Science Theses and Dissertations [79]
