QUICK REVIEW

[Paper Review] Precision-Aware application execution for Energy-optimization in HPC node system

Radim Vavřík, Antoni Portero|arXiv (Cornell University)|Jan 1, 2015

Distributed and Parallel Computing Systems|Cited by 3

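One-sentence summary

This paper proposes a precision-aware runtime resource management (RTRM) framework for high-performance computing (HPC) systems that optimizes energy consumption by adjusting computing resources dynamically to an application's quality-of-service (QoS) requirements, monitoring platform health and drawing on precomputed optimal trade-offs between precision, execution time, and energy. Raising the iteration count from 1e3 to 1e4 improves precision by 65% on average, and the RTRM adds less than 10% time overhead over native execution, so a system that must run 24/7 can save energy in exchange for a controlled loss of precision.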

ABSTRACT

Power consumption is a critical consideration in high performance computing systems and it is becoming the limiting factor to build and operate Petascale and Exascale systems. When studying the power consumption of existing systems running HPC workloads, we find that power, energy and performance are closely related, which leads to the possibility to optimize energy consumption without sacrificing (much or at all) the performance. In this paper, we propose an HPC system running with a GNU/Linux OS and a Real Time Resource Manager (RTRM) that is aware of and monitors the health of the platform. On the system, an application for disaster management runs. The application can run with different QoS depending on the situation. We defined two main situations. The first is normal execution, when there is no risk of a disaster, even though we still have to run the system to look ahead in the near future in case the situation changes suddenly. In the second scenario, the possibilities for a disaster are very high. Then the allocation of more resources for improving the precision and the human decision has to be taken into account. The paper shows that at design time, it is possible to describe different optimal points that are going to be used at runtime by the RTOS with the application. This environment helps a system that must run 24/7 to save energy, with the trade-off of losing precision. The paper shows a model execution which can improve the precision of results by 65% on average by increasing the number of iterations from 1e3 to 1e4. This also produces one order of magnitude longer execution time, which leads to the need to use a multi-node solution. The optimal trade-off between precision and execution time is computed by the RTOS with a time overhead of less than 10% against a native execution.

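Motivation and Objectives

To address the growing energy cost that is becoming the limiting factor for building and operating Petascale and Exascale HPC systems.
To optimize energy consumption with little or no loss of performance by exploiting run-time trade-offs between precision, execution time, and energy consumption.
To design a Real Time Resource Manager (RTRM) that monitors platform health and adjusts resource allocation dynamically according to application QoS requirements.
To keep critical HPC applications running 24/7 while saving energy through controlled precision degradation.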

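Proposed Method

The RTRM monitors system sensors (power, temperature, load) at run time to assess platform health and resource availability.
At design time, the system computes Pareto-optimal configurations that balance precision, execution time, and energy consumption.
The RTRM uses a dynamic power model, Pn = (Pmax − Pidle) × n/100 + Pidle, where n is the system load in percent, to estimate power and energy consumption.
Energy is computed as E = P × t, where P is the power estimated by the dynamic power model and t is the execution time.
The framework supports single-node (SMP) and multi-node (HPC cluster) execution, scaling the resource allocation to reach higher precision.
A disaster management application with an adjustable iteration count is used to model the precision trade-off, and the results are validated on both the SMP and cluster platforms.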
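
The power and energy formulas above can be sketched in a few lines of Python. This is only an illustration of the two equations, not the paper's implementation: the function names and the idle and peak power figures are assumptions chosen for the example, and the load n is taken to be a percentage between 0 and 100.

```python
def estimate_power(load_percent, p_idle, p_max):
    """Linear power model: Pn = (Pmax - Pidle) * n / 100 + Pidle."""
    if not 0 <= load_percent <= 100:
        raise ValueError("load_percent must be between 0 and 100")
    return (p_max - p_idle) * load_percent / 100 + p_idle


def estimate_energy(load_percent, p_idle, p_max, exec_time_s):
    """Energy E = P * t: joules when power is in watts and time in seconds."""
    return estimate_power(load_percent, p_idle, p_max) * exec_time_s


# Hypothetical node: 100 W idle, 300 W at full load (illustrative values only).
power_w = estimate_power(50, p_idle=100.0, p_max=300.0)
energy_j = estimate_energy(50, p_idle=100.0, p_max=300.0, exec_time_s=60.0)
print(power_w, energy_j)  # prints "200.0 12000.0"
```

Because the model is linear in the load, it returns Pidle at n = 0 and Pmax at n = 100, so only those two measurements per platform are needed to estimate power at any load.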

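Experimental Results

Research Questions

RQ1: Can a real-time resource manager dynamically optimize the energy consumption of an HPC system without degrading performance?
RQ2: What is the optimal trade-off between precision, execution time, and energy consumption in the disaster management simulation?
RQ3: How much time overhead does the precision-aware RTRM introduce compared with native execution?
RQ4: Can the RTRM achieve significant energy savings by lowering precision in low-risk scenarios while keeping precision acceptable?
RQ5: How does the system scale in a multi-node environment to support high-precision simulation?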

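Key Findings

The RTRM introduces less than 10% time overhead compared with native execution, a modest performance cost.
Increasing the iteration count from 1e3 to 1e4 improves simulation precision by 65% on average, at the cost of roughly one order of magnitude longer execution time.
On the SMP platform, the energy estimates show a clear trade-off between precision, core count, and frequency, which supports the selection of Pareto-optimal configurations.
On the HPC cluster, power consumption rises sharply as the node count grows (up to 16 nodes with 16 cores each) while execution time does not improve significantly, which underlines the need for intelligent resource management.
By switching dynamically between a low-precision (energy-saving) mode and a high-precision (high-accuracy) mode according to the risk situation, the framework keeps critical HPC applications running 24/7.
Optimal configurations are identified at design time and then applied by the RTRM at run time to balance energy efficiency against application QoS.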
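
The split between design-time identification of optimal points and run-time selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the RTRM's actual algorithm: the `OperatingPoint` fields, the error metric, the deadline parameter, and all numeric values are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    iterations: int   # more iterations means higher precision
    error: float      # lower is better (hypothetical precision metric)
    time_s: float     # execution time in seconds
    energy_j: float   # estimated energy in joules


def dominates(a, b):
    """True if a is no worse than b on every metric and better on at least one."""
    a_vals = (a.error, a.time_s, a.energy_j)
    b_vals = (b.error, b.time_s, b.energy_j)
    return all(x <= y for x, y in zip(a_vals, b_vals)) and a_vals != b_vals


def pareto_front(points):
    """Design-time step: keep only the non-dominated operating points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def select_point(front, high_risk, deadline_s):
    """Run-time step: pick a point from the precomputed front."""
    feasible = [p for p in front if p.time_s <= deadline_s]
    if not feasible:
        raise ValueError("no operating point meets the deadline")
    if high_risk:
        # High disaster risk: favour precision.
        return min(feasible, key=lambda p: p.error)
    # Normal operation: favour energy savings.
    return min(feasible, key=lambda p: p.energy_j)


# Illustrative candidates only; the numbers are made up for this sketch.
candidates = [
    OperatingPoint(1_000, 0.20, 10.0, 2_000.0),
    OperatingPoint(1_000, 0.20, 12.0, 2_600.0),     # dominated by the first
    OperatingPoint(10_000, 0.07, 100.0, 20_000.0),  # few resources, slow
    OperatingPoint(10_000, 0.07, 30.0, 24_000.0),   # more resources, faster
]
front = pareto_front(candidates)
print(select_point(front, high_risk=False, deadline_s=60.0).iterations)
print(select_point(front, high_risk=True, deadline_s=60.0).iterations)
```

Keeping the expensive search at design time means the run-time step reduces to a filter and a minimum over a short list, which is consistent with the low overhead reported above, although the paper's own selection policy may differ.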

This review was generated by AI and reviewed by a human editor.
