已收录 272628 条政策
 政策提纲
  • 暂无提纲
Improving Memory Error Handling Using Linux
[摘要] As supercomputers continue to get faster and more powerful in the future, they will also have more nodes. If nothing is done, then the amount of memory in supercomputer clusters will soon grow large enough that memory failures will be unmanageable to deal with by manually replacing memory DIMMs. "Improving Memory Error Handling Using Linux" is a process oriented method to solve this problem by using the Linux kernel to disable (offline) faulty memory pages containing bad addresses, preventing them from being used again by a process. The process of offlining memory pages simplifies error handling and results in reducing both hardware and manpower costs required to run Los Alamos National Laboratory (LANL) clusters. This process will be necessary for the future of supercomputing to allow the development of exascale computers. It will not be feasible without memory error handling to manually replace the number of DIMMs that will fail daily on a machine consisting of 32-128 petabytes of memory. Testing reveals the process of offlining memory pages works and is relatively simple to use. As more and more testing is conducted, the entire process will be automated within the high-performance computing (HPC) monitoring software, Zenoss, at LANL.
[发布日期] 2014-07-25 [发布机构] 
[效力级别]  [学科分类] 数学(综合)
[关键词] COMPUTER HARDWARE;COMPUTER SCIENCE [时效性] 
   浏览次数:8      统一登录查看全文      激活码登录查看全文