QUICK REVIEW

[Paper Review] Quantifying Performance Changes with Effect Size Confidence Intervals

Tomáš Kalibera, Richard Jones|arXiv (Cornell University)|Jul 21, 2020

Software System Performance and Reliability | References: 37 | Cited by: 23

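One-Sentence Summary

This paper proposes a statistical model, built on random effects and Fieller's theorem, that quantifies the uncertainty of performance-ratio estimates (such as speedups) while accounting for sources of non-determinism such as run-to-run variation and non-deterministic compilation. The method yields confidence intervals such as "5.5% ± 2.5%, with 95% confidence", offering a more rigorous and more easily understood alternative to current systems performance evaluation practice, which largely ignores uncertainty and non-determinism.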

ABSTRACT

Measuring performance & quantifying a performance change are core evaluation techniques in programming language and systems research. Of 122 recent scientific papers, as many as 65 included experimental evaluation that quantified a performance change using a ratio of execution times. Few of these papers evaluated their results with the level of rigour that has come to be expected in other experimental sciences. The uncertainty of measured results was largely ignored. Scarcely any of the papers mentioned uncertainty in the ratio of the mean execution times, and most did not even mention uncertainty in the two means themselves. Most of the papers failed to address the non-deterministic execution of computer programs (caused by factors such as memory placement, for example), and none addressed non-deterministic compilation. It turns out that the statistical methods presented in the computer systems performance evaluation literature for the design and summary of experiments do not readily allow this either. This poses a hazard to the repeatability, reproducibility and even validity of quantitative results. Inspired by statistical methods used in other fields of science, and building on results in statistics that did not make it to introductory textbooks, we present a statistical model that allows us both to quantify uncertainty in the ratio of (execution time) means and to design experiments with a rigorous treatment of those multiple sources of non-determinism that might impact measured performance. Better still, under our framework summaries can be as simple as "system A is faster than system B by 5.5% $\pm$ 2.5%, with 95% confidence", a more natural statement than those derived from typical current practice, which are often misinterpreted. November 2013

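Motivation and Goals

Address the widespread lack of uncertainty reporting in performance evaluation across programming language and systems research.
Account for performance variation caused by sources of non-determinism such as run-to-run variation and non-deterministic compilation.
Develop a statistical model that allows confidence intervals for performance ratios (e.g. speedups) to be computed accurately under realistic experimental conditions.
Provide an alternative that is more interpretable and more scientifically rigorous than current practice, which relies on significance tests or on the visual overlap of confidence intervals.
Improve experimental design and reporting to strengthen the reproducibility and validity of computer systems performance evaluation.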

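Proposed Method

Formalize performance measurement as a hierarchical random-effects model that captures several sources of non-determinism: variation within an execution, differences between executions, and non-deterministic compilation.
Apply Fieller's theorem, on top of this random-effects model, to compute a confidence interval for the ratio of two means (e.g. the execution times of systems A and B).
Use empirical Bayes estimation to jointly estimate the variance components from repeated measurements at the different levels of variation (e.g. multiple compilations, multiple runs).
Design experimental protocols that choose the number of repetitions at each level (e.g. compilation, execution) so as to balance precision against cost.
Validate the method by statistical simulation on real benchmark programs (e.g. FFT, Ping), compare it with existing practice, and evaluate coverage and type I error rate.
Carry the framework into actual reporting, supporting concise, easily understood statements such as "system A is faster than system B by 5.5% ± 2.5%, with 95% confidence".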
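
To make the Fieller step concrete, here is a minimal Python sketch of a confidence interval for the ratio of two independent sample means. It is an illustration under stated assumptions, not the paper's implementation: it treats each measurement as independent (a single level of variation, no random effects), it uses a normal quantile in place of a Student t quantile so that only the standard library is needed, and the function name `fieller_ratio_ci` and the sample timings are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def fieller_ratio_ci(new_times, old_times, confidence=0.95):
    """Return (ratio, lower, upper) for mean(new_times) / mean(old_times).

    Assumes the two samples are independent and that each sample mean is
    approximately normal. A normal quantile stands in for the t quantile.
    """
    a, b = mean(new_times), mean(old_times)
    # Estimated variances of the two sample means.
    var_a = variance(new_times) / len(new_times)
    var_b = variance(old_times) / len(old_times)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)

    # Fieller: the bounds are the roots in theta of
    #   (a - theta*b)^2 = z^2 * (var_a + theta^2 * var_b)
    denom = b * b - z * z * var_b
    if denom <= 0:
        # The denominator mean is too uncertain; the interval is unbounded.
        raise ValueError("denominator mean not distinguishable from zero")
    disc = (a * b) ** 2 - denom * (a * a - z * z * var_a)
    half_width = math.sqrt(disc)
    return a / b, (a * b - half_width) / denom, (a * b + half_width) / denom


# Hypothetical execution times in seconds, for illustration only.
new_system = [9.4, 9.6, 9.5, 9.3, 9.7]
old_system = [10.0, 10.2, 9.9, 10.1, 9.8]
ratio, lower, upper = fieller_ratio_ci(new_system, old_system)
print(f"time ratio {ratio:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

If the whole interval lies below 1, the new system is faster at the chosen confidence level, and the bounds translate directly into a statement of the form "faster by x% ± y%". With hierarchical data, the two variances of the means would instead come from the random-effects model.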

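Experimental Results

Research Questions

RQ1: When several sources of non-determinism are present (e.g. run-to-run variation, non-deterministic compilation), how can the uncertainty of a performance-ratio estimate be quantified rigorously?
RQ2: Can a statistical model that accounts for random effects at multiple experimental levels improve the accuracy and reliability of performance evaluation compared with current practice?
RQ3: How does non-deterministic compilation affect performance measurements, and how can its effect be modeled and mitigated systematically?
RQ4: How do different repetition strategies (e.g. multiple runs, multiple compilations) affect the precision and efficiency of performance evaluation?
RQ5: How much does the proposed method improve on existing approaches that rely on significance tests or on the visual overlap of confidence intervals?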

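Key Findings

The proposed method produces confidence intervals for performance ratios that are more accurate and easier to interpret than current practice, which often omits uncertainty estimates altogether.
Non-deterministic compilation significantly affects performance measurements: in Mono, for example, repeatedly compiling the same source code yields different execution times, so this effect must be modeled explicitly.
The method supports optimized experimental design: for some benchmarks (e.g. Ping) repeating executions is unnecessary, while for others (e.g. FFT) repeating compilation is essential to capture the variation.
The framework supports concise, natural reporting such as "system A is faster than system B by 5.5% ± 2.5%, with 95% confidence", which is more intuitive than significance-based statements and less prone to misinterpretation.
The method outperforms existing approaches based on the visual overlap of confidence intervals, which yield only a binary judgment and carry less information than an interval estimate of the actual performance ratio.
Statistical simulation shows that the method maintains appropriate coverage and type I error rates under realistic conditions, supporting its reliability.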
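
The design trade-off described above can be sketched with a two-level experiment in which each compiled binary is executed several times. The Python below uses a standard method-of-moments estimate for a balanced one-way random-effects layout. It is an illustrative simplification, not the estimator from the paper, and the function names and timings are hypothetical.

```python
from statistics import mean, variance


def variance_components(times_by_binary):
    """Estimate (compile_var, exec_var) from a balanced two-level design.

    times_by_binary[i] holds the execution times measured for binary i.
    Every binary must have the same number of runs (at least 2), and
    there must be at least 2 binaries.
    """
    runs_per_binary = len(times_by_binary[0])
    if any(len(runs) != runs_per_binary for runs in times_by_binary):
        raise ValueError("design must be balanced")
    # Execution-level variance: average of the within-binary variances.
    exec_var = mean(variance(runs) for runs in times_by_binary)
    # The variance of binary means estimates compile_var + exec_var / runs.
    var_of_means = variance([mean(runs) for runs in times_by_binary])
    compile_var = max(0.0, var_of_means - exec_var / runs_per_binary)
    return compile_var, exec_var


def variance_of_grand_mean(compile_var, exec_var, n_binaries, n_runs):
    """Predicted variance of the overall mean for a candidate design."""
    return compile_var / n_binaries + exec_var / (n_binaries * n_runs)


# Hypothetical timings: 3 compiled binaries, 2 executions each.
data = [[10.0, 10.2], [11.0, 11.2], [9.0, 9.2]]
compile_var, exec_var = variance_components(data)
for n_binaries, n_runs in [(3, 2), (3, 4), (6, 2)]:
    predicted = variance_of_grand_mean(compile_var, exec_var, n_binaries, n_runs)
    print(f"{n_binaries} binaries x {n_runs} runs: variance {predicted:.4f}")
```

When the compilation-level variance dominates, as in this made-up data, adding executions per binary barely narrows the interval while adding compilations does. When it is close to zero, extra compilations buy nothing beyond the extra executions they bring.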


This review was generated by AI and checked by a human editor.