KIMA: Hybrid Checkpointing for Recovery from a Wide Range of Errors and Detection Latencies
Full-system reliability is a problem that spans multiple levels of the software/hardware stack. The normal execution of a program can be disrupted by many factors, ranging from transient processor errors and software bugs to permanent hardware failures and human mistakes. A common method for recovering from such errors is to create checkpoints during program execution, allowing the system to restore the program to a previous error-free state and resume execution. Different causes of errors, though, have different occurrence frequencies and detection latencies, so multiple checkpoints must be created at different frequencies in order to maximize the availability of the system. In this paper we present KIMA, a novel checkpoint creation and management technique that efficiently combines the existing undo-log and redo-log checkpointing approaches, reducing the overall bandwidth requirements on both memory and the hard disk. KIMA establishes DRAM-based undo-log checkpoints every 10 ms, then leverages the undo-log metadata and checkpointed data to establish redo-log checkpoints every 1 second in non-volatile memory (such as PCM). Our results show that KIMA incurs average overheads of less than 1% while enabling efficient recovery from both transient and hard errors that have a variety of detection latencies.
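The interplay between the two checkpoint levels can be illustrated with a toy software model. This is only a sketch under stated assumptions, not KIMA's actual mechanism: the class `HybridCheckpointer`, its method names, and the dictionary-based "memory" are all hypothetical, and the abstract does not specify how the undo-log metadata is organized. The sketch assumes the undo log records the old value of each address on its first write in an interval, and that the set of logged addresses is reused to decide what to copy into the non-volatile image. The default of 100 undo intervals per redo checkpoint follows from the 10 ms and 1 second periods given above.

```python
# Toy model of hybrid undo-log / redo-log checkpointing.
# All names here are hypothetical; this is not the paper's implementation.

_ABSENT = object()  # marks an address that had no value before its first write


class HybridCheckpointer:
    def __init__(self, undo_per_redo=100):
        # 100 = 1 s / 10 ms, the two periods stated in the abstract.
        self.undo_per_redo = undo_per_redo
        self.memory = {}      # working state (stands in for DRAM contents)
        self.undo_log = {}    # addr -> value before first write this interval
        self.dirty = set()    # addrs modified since the last redo checkpoint
        self.nvm_image = {}   # state as of the last redo checkpoint
        self.intervals = 0    # number of undo checkpoints taken so far

    def write(self, addr, value):
        # Log the old value only on the first write in the current interval.
        if addr not in self.undo_log:
            self.undo_log[addr] = self.memory.get(addr, _ABSENT)
        self.memory[addr] = value

    def undo_checkpoint(self):
        # Reuse the undo-log metadata: its addresses are exactly the ones
        # that changed, so remember them for the next redo checkpoint.
        self.dirty.update(self.undo_log)
        self.undo_log.clear()
        self.intervals += 1
        if self.intervals % self.undo_per_redo == 0:
            self.redo_checkpoint()

    def redo_checkpoint(self):
        # Copy only the addresses that changed into the non-volatile image.
        for addr in self.dirty:
            if addr in self.memory:
                self.nvm_image[addr] = self.memory[addr]
            else:
                self.nvm_image.pop(addr, None)
        self.dirty.clear()

    def rollback_to_undo_checkpoint(self):
        # Short detection latency: undo the writes of the current interval.
        for addr, old in self.undo_log.items():
            if old is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo_log.clear()

    def restore_from_redo_checkpoint(self):
        # Long detection latency or lost DRAM: reload the non-volatile image.
        self.memory = dict(self.nvm_image)
        self.undo_log.clear()
        self.dirty.clear()
```

In this model, an error detected within the current interval is handled by `rollback_to_undo_checkpoint()`, while an error that invalidates the working state entirely is handled by `restore_from_redo_checkpoint()`. Because the redo checkpoint copies only addresses already named in the undo logs, it needs no separate dirty tracking of its own, which is one plausible reading of how combining the two approaches could reduce bandwidth.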