LEGaTO is present at HiPEAC19. Leonardo Bautista and Osman Unsal organise the tutorial "MCESS: Easy and Efficient Multilevel Checkpointing for Extreme Scale Systems", attended by Konstantinos Parasyris, on 23 January from 14:00 to 17:30 in Multipurpose room 1.
The tutorial explores how to keep applications reliable on extreme scale supercomputers. "Extreme scale supercomputers offer thousands of computing nodes to their users to satisfy their computing needs. As the need for massively parallel computing increases in industry, computing centers are being forced to increase in size and to transition to new computing technologies. While the advantage for the users is clear, such evolution imposes significant challenges, such as energy consumption and reliability. In this tutorial, we focus on how to guarantee high reliability to high performance applications running in extreme scale supercomputers. In particular, we cover all the technical content necessary to implement scalable multilevel checkpointing for tightly coupled applications. This includes an overview of failure types and frequency in current HPC systems. The tutorial will also cover the theoretical analysis necessary to achieve optimal utilization of the computing resources. Moreover, we will present the internals of the FTI library tool, to demonstrate how multilevel checkpointing is implemented today. This includes code analysis and execution traces to help the audience grasp the fundamental parts of this technique. Finally, we will have hands-on examples that the audience can analyze in their own laptops, so that they learn how to use FTI in practice, and lately transfer that knowledge to their production runs."