Result: Application-level checkpointing for shared memory programs

Title:

Application-level checkpointing for shared memory programs

Authors:

BRONEVETSKY, Greg, MARQUES, Daniel, PINGALI, Keshav, SZWED, Peter, SCHULZ, Martin

Source:

Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-XI), October 9-13, 2004, Boston, Massachusetts. ACM SIGPLAN Notices. 39(11):235-247

Publisher Information:

Broadway, NY: ACM, 2004.

Publication Year:

2004

Physical Description:

print, 29 ref

Original Material:

INIST-CNRS

Subject Terms:

Computer science, Exact sciences and technology, Applied sciences, Computer science; control theory; systems, Theoretical computing, Programming theory, Software, Memory organisation. Data processing, Memory and file management (including protection and security), Compiler, Message passing, Shared memory, Parallel programming, Concurrent program, Programming technique, Fault tolerance, Checkpointing, Linux system

Document Type:

Conference Paper

File Description:

text

Language:

English

Author Affiliations:

Department of Computer Science, Cornell University, Ithaca, NY 14853, United States
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY 14853, United States
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, Livermore, CA 94551, United States

ISSN:

1523-2867

Access URL:

http://pascal-francis.inist.fr/vibad/index.php?action=search&terms=16359410

Rights:

Copyright 2005 INIST-CNRS
CC BY 4.0
Unless otherwise stated above, the content of this bibliographic record may be used under a CC BY 4.0 licence by Inist-CNRS

Notes:

Computer science; theoretical automation; systems

Accession Number:

edscal.16359410

Database:

PASCAL Archive

Further Information

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. This system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a run-time system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
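
The checkpoint-and-restart idea in the abstract can be illustrated with a minimal, single-threaded sketch. This is not the authors' pre-compiler or run-time protocol, and it does not attempt the thread coordination the paper describes; it only shows an application that saves its own state periodically and resumes from the last saved state. The names `run`, `demo`, `ckpt_every`, and `fail_at`, and the JSON checkpoint file, are assumptions made for illustration.

```python
import json
import os
import tempfile


def run(n_steps, ckpt_path, ckpt_every=10, fail_at=None):
    """Sum 0..n_steps-1, saving state to ckpt_path every ckpt_every steps.

    fail_at is a hypothetical knob that simulates a fault at a given step.
    """
    # Restart: resume from the last saved state if a checkpoint exists.
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated hardware fault")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % ckpt_every == 0:
            # Write to a temporary file, then rename it over the old
            # checkpoint, so a crash mid-write leaves the old one intact.
            tmp_path = ckpt_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump(state, f)
            os.replace(tmp_path, ckpt_path)
    return state["total"]


def demo():
    """Fail once at step 57, then restart from the checkpoint and finish."""
    ckpt_path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        run(100, ckpt_path, fail_at=57)
    except RuntimeError:
        pass  # work done after the last checkpoint (step 50) is lost
    return run(100, ckpt_path)
```

Calling `demo()` loses only the iterations performed after the most recent checkpoint; the second call to `run` picks up from the saved step rather than from the beginning. A shared-memory version would additionally need all threads to agree on when a checkpoint is taken, which is the role the abstract assigns to the run-time coordination protocol.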
