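[Paper Explained] Fault-tolerant linear solvers via selective reliability
This paper proposes FT-GMRES, a fault-tolerant iterative linear solver that uses selective reliability to keep converging under uncorrectable memory faults. By applying reliability only to critical data and phases (such as the outer iteration and preconditioner updates), the method converges in cases where standard solvers fail because of faults, and its convergence rate degrades gracefully as the fault rate rises.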
Energy increasingly constrains modern computer hardware, yet protecting computations and data against errors costs energy. This holds at all scales, but especially for the largest parallel computers being built and planned today. As processor counts continue to grow, the cost of ensuring reliability consistently throughout an application will become unbearable. However, many algorithms only need reliability for certain data and phases of computation. This suggests an algorithm and system codesign approach. We show that if the system lets applications apply reliability selectively, we can develop algorithms that compute the right answer despite faults. These "fault-tolerant" iterative methods either converge eventually, at a rate that degrades gracefully with increased fault rate, or return a clear failure indication in the rare case that they cannot converge. Furthermore, they store most of their data unreliably, and spend most of their time in unreliable mode. We demonstrate this for the specific case of detected but uncorrectable memory faults, which we argue are representative of all kinds of faults. We developed a cross-layer application / operating system framework that intercepts and reports uncorrectable memory faults to the application, rather than killing the application, as current operating systems do. The application in turn can mark memory allocations as subject to such faults. Using this framework, we wrote a fault-tolerant iterative linear solver using components from the Trilinos solvers library. Our solver exploits hybrid parallelism (MPI and threads). It performs just as well as other solvers if no faults occur, and converges where other solvers do not in the presence of faults. We show convergence results for representative test problems. Near-term future work will include performance tests.
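Motivation and Goals
- Address the growing hardware unreliability of extreme-scale computing, where energy constraints make system-wide fault tolerance infeasible.
- Avoid the impractical cost of end-to-end reliability in large parallel applications by applying reliability only to critical algorithmic components.
- Develop a cross-layer application/operating system framework that intercepts uncorrectable memory faults and reports them to the application instead of terminating it.
- Show that iterative solvers stay robust and convergent even when floating-point data suffers faults, provided that critical data structures are kept reliable.
- Pursue algorithm and system codesign to reduce the energy cost of fault tolerance while preserving correctness and convergence.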
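Proposed Method
- Design a fault-tolerant GMRES solver (FT-GMRES) that runs most of its computation in unreliable mode and uses reliable mode only for the outer iteration and preconditioner updates.
- Implement a cross-layer framework that intercepts uncorrectable memory faults at the operating system level and reports them to the application without terminating it.
- Use the application-level fault reports to refresh data from reliable storage only when a fault is detected, which keeps overhead low.
- Build the prototype from production-grade Trilinos components (GMRES, ILUT preconditioner) so that results reflect realistic performance and scalability.
- Use hybrid parallelism (MPI and threads) to support large distributed-memory systems.
- Inject pseudorandom faults at a rate of 1000 per hour per MB during testing to simulate realistic fault conditions.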
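The inner-outer structure described in this list can be illustrated with a small NumPy sketch. It is not the authors' Trilinos implementation. It assumes a restarted flexible GMRES outer iteration as one way to tolerate inner results that change from step to step, and it swaps in a few Jacobi sweeps for the unreliable inner solve. The function names, the `fault_every` schedule (a deterministic stand-in for pseudorandom injection), and the corrupted value `1.0e3` are all hypothetical choices made for illustration.

```python
import numpy as np

def tridiag_test_matrix(n):
    """Diagonally dominant test matrix tridiag(-1, 4, -1) (illustrative only)."""
    return 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def unreliable_inner_solve(A, v, sweeps, inject_fault, rng):
    """Approximate A z = v with Jacobi sweeps, standing in for the inner solver.
    If inject_fault is set, one entry of the result is overwritten to mimic
    a memory fault in unreliable storage (hypothetical fault model)."""
    d = np.diag(A)
    z = np.zeros_like(v)
    for _ in range(sweeps):
        z = z + (v - A @ z) / d
    if inject_fault:
        z[rng.integers(z.size)] = 1.0e3
    return z

def ft_fgmres(A, b, tol=1e-8, restart=4, max_cycles=20,
              sweeps=5, fault_every=3, seed=0):
    """Restarted flexible GMRES: the outer iteration is treated as reliable,
    the inner solves as unreliable. Returns (x, relative residual, faults)."""
    rng = np.random.default_rng(seed)
    n = b.size
    x = np.zeros(n)
    b_norm = np.linalg.norm(b)
    inner_calls = 0
    faults = 0
    for _ in range(max_cycles):
        r = b - A @ x                      # residual computed reliably
        beta = np.linalg.norm(r)
        if beta <= tol * b_norm:
            break
        V = np.zeros((n, restart + 1))     # orthonormal outer basis
        Z = np.zeros((n, restart))         # inner-solve results
        H = np.zeros((restart + 1, restart))
        V[:, 0] = r / beta
        k = 0
        for j in range(restart):
            # Corrupt every fault_every-th inner solve (0 disables faults).
            inject = (fault_every > 0
                      and inner_calls % fault_every == fault_every - 1)
            z = unreliable_inner_solve(A, V[:, j], sweeps, inject, rng)
            inner_calls += 1
            faults += int(inject)
            if not np.all(np.isfinite(z)):
                z = V[:, j].copy()         # fall back to an unpreconditioned step
            w = A @ z                      # reliable matrix-vector product
            for i in range(j + 1):         # modified Gram-Schmidt
                H[i, j] = w @ V[:, i]
                w = w - H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            Z[:, j] = z
            k = j + 1
            if H[j + 1, j] < 1e-14:        # breakdown: stop this cycle early
                break
            V[:, j + 1] = w / H[j + 1, j]
        g = np.zeros(k + 1)
        g[0] = beta
        # Small least-squares problem min ||beta*e1 - H y||, solved reliably.
        y, *_ = np.linalg.lstsq(H[:k + 1, :k], g, rcond=None)
        x = x + Z[:, :k] @ y
    rel = np.linalg.norm(b - A @ x) / b_norm
    return x, rel, faults

if __name__ == "__main__":
    A = tridiag_test_matrix(40)
    b = A @ np.linspace(1.0, 2.0, 40)
    _, rel, faults = ft_fgmres(A, b)
    print(f"relative residual {rel:.2e} after {faults} injected faults")
```

Because the outer iteration stores each inner result explicitly and recomputes the residual itself, a corrupted inner result acts like a poor preconditioner for one step rather than poisoning the solution. In this sketch a faulty step can slow convergence but cannot increase the residual within a cycle.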
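Experimental Results
Research Questions
- RQ1: If reliability is applied only to critical components, can an iterative linear solver still converge under uncorrectable memory faults?
- RQ2: How does the convergence rate of the fault-tolerant solver degrade as the fault rate rises, and is that degradation acceptable in practice?
- RQ3: Can a system-level fault interception framework enable application-level fault tolerance without terminating the process when an uncorrectable error occurs?
- RQ4: How much can algorithm-level fault tolerance reduce the energy cost of reliability compared with system-wide redundancy?
- RQ5: Does selective reliability generalize to iterative algorithms in scientific computing other than linear solvers?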
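Key Findings
- FT-GMRES converged to the solution in the presence of uncorrectable memory faults, while standard GMRES and restarted GMRES failed to converge under the same fault conditions.
- The convergence rate of FT-GMRES degrades gradually as the fault rate rises; even at high injection rates, the total iteration count increases only moderately.
- Refreshing the preconditioner and outer-iteration data only when a fault is detected keeps the solver robust while limiting the performance impact.
- The prototype shows that fault handling can be integrated into production-grade solver components, and the solver performs as well as other solvers when no faults occur.
- The cross-layer framework intercepted uncorrectable memory faults and reported them to the application, so the program continued running instead of being terminated.
- The results suggest that selective reliability can reduce the energy cost of fault tolerance while preserving algorithmic correctness and convergence.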
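The refresh-on-fault behavior mentioned in these findings can be pictured with a toy Python class. This is only a sketch of the idea under stated assumptions: the class name, the `on_fault` callback, and the `commit` method are hypothetical, and the actual framework works at the operating system level on marked memory allocations rather than on NumPy arrays.

```python
import numpy as np

class SelectivelyReliableArray:
    """Toy model: a reliable master copy plus a working copy that is
    assumed to live in memory marked as subject to faults."""

    def __init__(self, data):
        self._reliable = np.array(data, dtype=float)  # stand-in for reliable storage
        self.working = self._reliable.copy()          # stand-in for fault-prone memory
        self.refresh_count = 0

    def on_fault(self, start, stop):
        """Hypothetical callback invoked when a fault is reported for
        working[start:stop]; only the affected range is refreshed."""
        self.working[start:stop] = self._reliable[start:stop]
        self.refresh_count += 1

    def commit(self):
        """Reliable phase: promote the current working values to the master copy."""
        self._reliable[:] = self.working

if __name__ == "__main__":
    arr = SelectivelyReliableArray([1.0, 2.0, 3.0, 4.0])
    arr.working[2] = np.nan   # simulate a corrupted value
    arr.on_fault(2, 3)        # the reported fault triggers a partial refresh
    print(arr.working, arr.refresh_count)
```

Refreshing only the reported range, and only when a fault is reported, is what keeps the fault-free path free of extra copies in this model.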
This explanation was generated by AI and reviewed by a human editor.