LEGaTO researcher Konstantinos Parasyris, from the Barcelona Supercomputing Center, gave a tutorial called "Multilevel Checkpointing for MPI Applications" at the EuroMPI 2019 conference in Zurich, Switzerland on September 10, 2019.
In the first part of the tutorial, participants were introduced to FTI, a checkpoint-restart library. The library provides a wide range of configuration options for efficiently guaranteeing application progress on large-scale systems that suffer from frequent failures. In the second part, participants ran a set of examples and then extended an application with fault tolerance features using FTI.
"Fault tolerance is becoming more important as we approach the era of exascale computing. The increasing number of cores as well as the increased complexity of modern heterogeneous systems result in a substantial decrease of the expected mean time between failures. Unfortunately though, developers are not aware of the importance of fault tolerance when executing applications on large-scale machines. This tutorial is very useful to researchers and developers who want to learn mature fault tolerance techniques to increase the robustness of their applications," said Parasyris.
View the presentation here.
More about EuroMPI:
The EuroMPI conference is the preeminent meeting for users, developers and researchers to interact and discuss new developments and applications of message-passing parallel computing, in particular in and related to the Message Passing Interface (MPI). This includes parallel programming interfaces, libraries and languages, architectures, networks, algorithms, tools, applications, and High Performance Computing with particular focus on quality, portability, performance and scalability. The annual meeting has a long, rich tradition, and has been held in European countries since 1994.
In 2019, EuroMPI took place at ETH Zurich. The university for science and technology dates back to 1855 and has hosted many graduates and researchers who have made a global impact with their groundbreaking work. Its most famous alumnus, Albert Einstein, developed the principles of his theory of relativity while at ETH Zurich. Today, ETH counts 20'600 students, including 4'100 doctoral students, from over 120 countries and ranks among the world’s leading universities of science and technology. Zurich is situated in the centre of Switzerland and at the heart of Europe, with excellent connections by plane, train and road.