Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights
2021 IEEE International Conference on Cluster Computing (CLUSTER), 2021
Abstract
In recent years, the increasing complexity of scientific simulations and the emerging demand for training large artificial intelligence models have required massive, fast data access, pushing high-performance computing (HPC) platforms to adopt more advanced storage infrastructure such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges faced by the ...
Keywords
Training, Fault diagnosis, Solid modeling, Fault tolerance, File systems, Computational modeling, Fault tolerant systems