Konstantinos Parasyris, a researcher in the Computer Architecture for Parallel Paradigms group at the Barcelona Supercomputing Center (BSC) and a member of the LEGaTO project, presented his poster at the 6th BSC Severo Ochoa Doctoral Symposium. The event was held at the BSC's Universitat Politècnica de Catalunya (UPC) Nord premises on May 7-9 and aimed to provide a forum in which PhD students and postdoctoral researchers could present the results of their research.
The poster, titled "C/R Support for Heterogeneous HPC Applications", addresses the importance of fault tolerance on large-scale heterogeneous systems. The topic is particularly crucial because exascale supercomputers, the next generation of computing systems, are more prone to failures due to their large size. There is therefore a need for fault tolerance techniques, which trade off some system efficiency for increased robustness to failures. Konstantinos's work addresses that issue.
"We presented an application-level checkpoint library, called the Fault Tolerance Interface (FTI), which transparently supports multi-node/multi-GPU checkpoints," he said.
Konstantinos added that there is still more to be done. "As a next step, we will start supporting other heterogeneous devices, such as FPGAs. Our vision is to provide a single library that efficiently checkpoints HPC applications that execute on systems consisting of multiple nodes with multiple GPUs and multiple FPGAs," he said.
The poster is available to view on SlideShare.