The Impact of Asynchronous I/O in Checkpoint-Restart Workloads
2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2024), 2024
Abstract
As scientific checkpoint-restart (CR) workloads produce more data, data management libraries use asynchronous I/O operations to hide the cost of accessing global file systems in high-performance computing (HPC) systems. However, the characteristics of an asynchronous design, such as its software and hardware architecture and the features of the workload, dictate how asynchronous I/O affects both the resources available to the CR application and the performance of the asynchronous I/O itself. In this work, we study the impact of these design choices on CR application performance (CPU, memory, and checkpointing), while also considering the performance of the asynchronous I/O operations and the performance variability they introduce into the workload. We report three main findings. First, the architecture used for asynchronous I/O must manage its resources flexibly, so that it reduces interference with the application while maximizing asynchronous I/O performance. Second, modern MPI-aware schedulers apply an implicit core affinity to each workload, which helps compute-bound applications but can significantly reduce checkpointing performance. Finally, asynchronous I/O designs need to account for the cost of communication between the service and the application to reduce the overhead of checkpointing operations. Through this study, we aim to pave the way for future middleware libraries to adopt different designs based on their target workload and HPC system architecture.
Keywords
Asynchronous I/O, background threads, interference study, IBM burst buffer, checkpoint-restart, SCR