Coordinated Fault Tolerance for High-Performance Computing
[Abstract] Our work toward the goal of end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in software that is currently available and widely used throughout the HEC domain, and (2) using fault information exchange and coordination to achieve holistic, systemwide fault tolerance, and understanding how to design and implement interfaces that integrate fault tolerance features across multiple layers of the software stack, from the application, math libraries, and programming language runtime to other common system software such as job schedulers, resource managers, and monitoring tools.
[Publication date] 2013-04-08 [Publishing organization]
[Validity level] [Subject classification] Mathematics (General)
[Keywords] [Timeliness]