QUICK REVIEW

[Paper Review] MLPerf Inference Benchmark

Vijay Janapa Reddi, Christine Cheng | arXiv (Cornell University) | Nov 6, 2019

Radiation Effects in Electronics | References: 47 | Cited by: 40

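One-Sentence Summary

MLPerf Inference introduces a standardized, industry-wide benchmark suite for evaluating machine-learning inference systems across diverse hardware and software stacks. It defines four realistic scenarios (single-stream, multistream, server, and offline) and sets strict accuracy targets and latency bounds, enabling fair, reproducible, architecture-neutral performance comparison across more than 30 systems from 14 organizations.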

ABSTRACT

Machine-learning (ML) hardware and software system demand is burgeoning. Driven by ML applications, the number of different ML inference systems has exploded. Over 100 organizations are building ML inference chips, and the systems that incorporate existing models span at least three orders of magnitude in power consumption and five orders of magnitude in performance; they range from embedded devices to data-center solutions. Fueling the hardware are a dozen or more software frameworks and libraries. The myriad combinations of ML hardware and ML software make assessing ML-system performance in an architecture-neutral, representative, and reproducible manner challenging. There is a clear need for industry-wide standard ML benchmarking and evaluation criteria. MLPerf Inference answers that call. In this paper, we present our benchmarking method for evaluating ML inference systems. Driven by more than 30 organizations as well as more than 200 ML engineers and practitioners, MLPerf prescribes a set of rules and best practices to ensure comparability across systems with wildly differing architectures. The first call for submissions garnered more than 600 reproducible inference-performance measurements from 14 organizations, representing over 30 systems that showcase a wide range of capabilities. The submissions attest to the benchmark's flexibility and adaptability.

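Motivation and Goals

Address the lack of standardized, representative, and reproducible benchmarks for ML inference systems.
Enable fair, direct performance comparison across diverse ML hardware and software stacks.
Build consensus on performance metrics, accuracy targets, and latency bounds that match real-world deployment constraints.
Support hardware and software optimization by allowing flexible implementations within the established rules.
Foster industry-wide collaboration through a community-driven benchmarking framework.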

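Proposed Method

Defines four distinct inference scenarios (single-stream, multistream, server, and offline), each with its own performance metric.
Sets mandatory model-quality targets and latency bounds, based on input from more than 200 ML engineers and practitioners.
Uses a standardized LoadGen tool to simulate real-world workloads and to enforce consistent handling of data input and output.
Supports two submission divisions, closed and open: the closed division requires strict adherence to the rules, while the open division allows broader hardware and software flexibility.
Provides reference implementations based on PyTorch and TensorFlow to keep results reproducible and accessible.
Uses automated checkers and submission-validation tools to ensure that results are accurate, compliant, and auditable.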
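
To make the scenario-specific metrics concrete, the following is a minimal Python sketch, not the actual LoadGen code. It assumes that single-stream reports a 90th-percentile latency, that offline reports throughput, and that server traffic arrives as a Poisson process whose tail latency must stay within a bound. All function names, the single-worker queue model, and the default percentile are illustrative assumptions; the multistream scenario is omitted for brevity.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile (pct in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def single_stream_metric(latencies_s):
    """Single-stream: one query at a time; report a tail latency.

    The 90th percentile is an assumption of this sketch.
    """
    return percentile(latencies_s, 90)


def offline_metric(num_samples, wall_time_s):
    """Offline: all samples are available up front; report throughput."""
    return num_samples / wall_time_s


def simulate_server(qps, service_time_s, num_queries, seed=0):
    """Poisson arrivals at rate `qps` into one FIFO worker (toy model).

    Returns the per-query latency: queueing delay plus service time.
    """
    rng = random.Random(seed)
    arrival = 0.0
    worker_free_at = 0.0
    latencies = []
    for _ in range(num_queries):
        arrival += rng.expovariate(qps)  # exponential inter-arrival gap
        start = max(arrival, worker_free_at)
        worker_free_at = start + service_time_s
        latencies.append(worker_free_at - arrival)
    return latencies


def server_meets_bound(latencies_s, bound_s, pct=99):
    """Server: an offered QPS counts only if tail latency stays in bound.

    The percentile `pct` is a parameter here, not a value taken from the rules.
    """
    return percentile(latencies_s, pct) <= bound_s
```

In this toy model, a worker that needs 1 ms per query comfortably meets a 15 ms bound at a light load, but fails it once the offered rate exceeds the worker's capacity of 1000 queries per second. That is the sense in which the server metric is "queries per second subject to a latency bound".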

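Experimental Results

Research Questions

RQ1: How can ML inference performance be measured fairly across vastly different hardware and software systems?
RQ2: Which performance metrics best reflect real-world deployment constraints in data-center, edge, and mobile systems?
RQ3: How can model accuracy be standardized to allow meaningful analysis of the trade-off between performance and quality?
RQ4: Which benchmarking rules and workflows ensure reproducibility and integrity across diverse submissions?
RQ5: How can a consensus-driven benchmarking framework effectively cover the full spectrum of ML inference workloads?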

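Key Findings

The first round of MLPerf Inference submissions gathered more than 600 reproducible performance measurements from 14 organizations, covering more than 30 different systems.
Performance varied significantly across the four defined scenarios, which underscores the importance of scenario-specific benchmarking.
Introducing accuracy targets and latency bounds allowed accuracy/performance trade-offs to be evaluated consistently across systems.
The LoadGen tool and automated checkers substantially reduced manual auditing effort and improved result integrity, so that about three engineers were enough to validate the submissions.
The benchmark successfully captured a range of optimization techniques across diverse platforms, such as batching, model quantization, and hardware/software co-design.
A community-driven development process involving more than 30 organizations and more than 200 practitioners ensured the benchmark's broad relevance and real-world applicability.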
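
The way accuracy targets and latency bounds make results comparable can be sketched as a simple validity gate: a measurement counts only if it clears both thresholds, and only valid measurements are ranked. The Python below is a hypothetical illustration, not the MLPerf submission checker. The `Submission` fields, the function names, and the default `quality_ratio` of 0.99 (a fraction of the reference model's accuracy) are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    name: str              # hypothetical system label
    accuracy: float        # measured model quality, e.g. top-1 accuracy
    tail_latency_s: float  # measured tail latency in seconds
    throughput: float      # samples or queries per second


def is_valid(sub, reference_accuracy, latency_bound_s, quality_ratio=0.99):
    """A result counts only if it clears both the quality and latency gates."""
    meets_quality = sub.accuracy >= quality_ratio * reference_accuracy
    meets_latency = sub.tail_latency_s <= latency_bound_s
    return meets_quality and meets_latency


def rank_valid(subs, reference_accuracy, latency_bound_s, quality_ratio=0.99):
    """Keep valid submissions and order them by throughput, best first."""
    valid = [
        s for s in subs
        if is_valid(s, reference_accuracy, latency_bound_s, quality_ratio)
    ]
    return sorted(valid, key=lambda s: s.throughput, reverse=True)
```

Under such a gate, a system that gains throughput by giving up too much accuracy, or by exceeding the latency bound, is excluded rather than ranked first, which keeps the remaining comparisons like-for-like.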


This review was generated by AI and reviewed by a human editor.