Souza, Samuel Xavier de
Santana, Carla dos Santos
2025-05-29
2025-05-29
2024-10-04
SANTANA, Carla dos Santos. A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities. Advisor: Dr. Samuel Xavier de Souza. 2024. 90 leaves. Thesis (Doctorate in Electrical and Computer Engineering) - Centro de Tecnologia, Universidade Federal do Rio Grande do Norte, Natal, 2024.
https://repositorio.ufrn.br/handle/123456789/63748
High-performance computing, a dynamic field within computer science, provides the processing power that algorithms across diverse domains require. Large-scale supercomputers are indispensable for tackling complex problems; however, their size and complexity make them susceptible to failure. Fault tolerance techniques are therefore critical for mitigating the impact of interruptions and failures, whether these stem from hardware malfunctions, software malfunctions, or preemption. To address this need, we present new methodologies for improving fault tolerance in bulk synchronous programs, delivered as the Dependability Library for Iterative Applications. The library combines application-level data conservation, fault detection, and failover capabilities, and it simplifies the integration of fault tolerance into applications while remaining highly configurable. This thesis presents data conservation methodologies, including application-level checkpointing and process data replication, which ensure reliability by allowing a backup unit to take over in case of failure.
This work also presents fault detection methods, namely termination signal detection and heartbeat monitoring with inexpensive communication, which trigger data conservation only when a failure is possible and thereby keep overhead low. The proposed library is compatible with User-Level Failure Mitigation, which provides failover capabilities: programs can continue operating after crashes, minimizing downtime. We applied the library to full-waveform inversion, a standard algorithm in geophysical processing for oil and gas exploration. This application serves as a practical high-performance scenario for analysis and demonstrates the real-world applicability of the library. All methods were validated, and the overhead for this problem was analyzed using more realistic examples. In our experiments, the application did not lose all of the data processed up to the moment of failure, and it could continue execution even in the presence of node failure, with minimal overhead. This work also presents other case studies at an initial stage of applying the library and discusses fault tolerance concepts and related work.
en
Open Access
Fault tolerance
Interruption detection
Data conservation
Failover
High performance computing
A configurable dependability library for high-performance computing iterative applications with interruption detection, data preservation and failover capabilities
doctoralThesis
ENGINEERING::ELECTRICAL ENGINEERING