QUICK REVIEW

[Paper Review] AIBench: An Industry Standard Internet Service AI Benchmark Suite

Wanling Gao, Fei Tang|arXiv (Cornell University)|Aug 13, 2019

IoT and Edge/Fog Computing|References: 58|Cited by: 31

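One-Sentence Summary

AIBench is the first industry-standard benchmark suite for AI workloads in Internet services, developed jointly with 17 industry partners. It provides a flexible, extensible framework comprising component benchmarks for 16 key AI problem domains (such as learning to rank, object detection, and recommendation) and an end-to-end e-commerce search application benchmark. Built on real-world-scale data and workloads, it supports performance analysis from micro kernels up to full-stack workloads.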

ABSTRACT

Today's Internet Services are undergoing fundamental changes and shifting to an intelligent computing era where AI is widely employed to augment services. In this context, many innovative AI algorithms, systems, and architectures are proposed, and thus the importance of benchmarking and evaluating them rises. However, modern Internet services adopt a microservice-based architecture and consist of various modules. The diversity of these modules and complexity of execution paths, the massive scale and complex hierarchy of datacenter infrastructure, the confidential issues of data sets and workloads pose great challenges to benchmarking. In this paper, we present the first industry-standard Internet service AI benchmark suite---AIBench with seventeen industry partners, including several top Internet service providers. AIBench provides a highly extensible, configurable, and flexible benchmark framework that contains loosely coupled modules. We identify sixteen prominent AI problem domains like learning to rank, each of which forms an AI component benchmark, from three most important Internet service domains: search engine, social network, and e-commerce, which is by far the most comprehensive AI benchmarking effort. On the basis of the AIBench framework, abstracting the real-world data sets and workloads from one of the top e-commerce providers, we design and implement the first end-to-end Internet service AI benchmark, which contains the primary modules in the critical paths of an industry scale application and is scalable to deploy on different cluster scales. The specifications, source code, and performance numbers are publicly available from the benchmark council web site http://www.benchcouncil.org/AIBench/index.html.

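Research Motivation and Goals

Address the lack of publicly available, representative, and scalable benchmarks for evaluating industry-scale AI workloads in Internet services.
Overcome the challenges that benchmarking real-world AI applications faces: data confidentiality, system complexity, and architectural diversity.
Develop a comprehensive benchmark framework that supports both fine-grained component benchmarks and full-stack, end-to-end application evaluation.
Foster research across industry and academia by making the specifications, source code, and performance numbers publicly accessible.
Bridge the gap between academic research and industrial practice by modeling real AI workloads from a top e-commerce provider.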

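Proposed Method

Design a modular, loosely coupled benchmark framework with pluggable components for data input, AI problem domains, online inference, offline training, and deployment.
Identify and implement 16 prominent AI problem domains (such as image-to-text, speech-to-text, 3D object reconstruction, and learning to rank), all drawn from real-world search engine, social network, and e-commerce workloads.
Build an end-to-end e-commerce search benchmark from the real production data and workloads of a top e-commerce provider, reproducing the modules on the critical path at scale.
Implement low-level micro benchmarks (12 basic units of computation) within the component benchmarks to support kernel-level performance analysis.
Analyze GPU execution efficiency at the kernel and function level using a detailed stall breakdown (for example, memory dependency, execution dependency, and texture stalls).
Use profiling tools to identify hotspot functions and performance bottlenecks; for example, the convolution kernel maxwell_scudnn_128x32_stridedB_splitK_interior_nn reaches an SM efficiency of only 18.5%.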
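
The loosely coupled, pluggable design described in the first item above can be illustrated with a small registry. This is only a sketch under stated assumptions, not AIBench's actual code: `ComponentBenchmark`, `REGISTRY`, `register`, `run_component`, and the `toy_ranker` component are hypothetical names invented here, and the "model" is a single number so that the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ComponentBenchmark:
    """Hypothetical pluggable component: one offline-training hook, one online-inference hook."""
    name: str
    train: Callable[[List[float]], float]    # offline training: data -> model
    infer: Callable[[float, float], float]   # online inference: (model, query) -> result


# Hypothetical registry; components are looked up by name, so the runner
# never depends on any concrete component implementation.
REGISTRY: Dict[str, ComponentBenchmark] = {}


def register(bench: ComponentBenchmark) -> None:
    """Plug a component into the framework, rejecting duplicate names."""
    if bench.name in REGISTRY:
        raise ValueError(f"duplicate component: {bench.name}")
    REGISTRY[bench.name] = bench


def run_component(name: str, train_data: List[float], queries: List[float]) -> List[float]:
    """Run offline training once, then answer each query with online inference."""
    bench = REGISTRY[name]
    model = bench.train(train_data)
    return [bench.infer(model, q) for q in queries]


# Toy stand-in for a real component: the "model" is the mean of the training
# data, and inference returns each query's distance from that mean.
register(ComponentBenchmark(
    name="toy_ranker",
    train=lambda data: sum(data) / len(data),
    infer=lambda model, query: abs(query - model),
))

distances = run_component("toy_ranker", [1.0, 2.0, 3.0], [2.0, 5.0])
print(distances)
```

Because the runner only sees the two hooks, swapping in a different problem domain in this sketch means registering another `ComponentBenchmark`; nothing in `run_component` changes.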

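Experimental Results

Research Questions

RQ1: How can a comprehensive, extensible, and industry-validated benchmark suite be designed so that it accurately reflects real AI workloads in large-scale Internet services?
RQ2: Which AI problem domains are most representative and capture the key computational characteristics of modern Internet services?
RQ3: To what extent do AI components change the critical path and performance bottlenecks of end-to-end Internet service workloads?
RQ4: How can performance bottlenecks be identified and analyzed at the kernel and function level across different AI workloads?
RQ5: What are the key sources of performance degradation in GPU execution (such as stalls), and how do they vary across AI operations and hardware kernels?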

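Key Findings

The learning-to-rank component (learning_to_rank) has the lowest SM efficiency (29%), mainly because of high memory dependency stalls (61%) and poorly optimized kernels such as maxwell_scudnn_128x32_stridedB_splitK_interior_nn, which reaches only 18.5% SM efficiency.
In element-wise operations, memory dependency stalls account for up to 68% of all stalls, indicating that data locality and access patterns are the main performance bottleneck.
Execution dependency stalls are significant in many kernels, suggesting that better kernel scheduling or code generation could further raise instruction-level parallelism.
Function-level profiling shows that maxwell_scudnn_128x32_stridedB_splitK_interior_nn in convolution incurs 61% memory dependency stalls, whereas maxwell_sgemm_128x64_nn in GEMM incurs only 18%, so their optimization needs differ markedly.
The end-to-end benchmark captures the AI-driven workload changes on the critical path, supporting the need for full-stack application benchmarks over isolated micro benchmarks.
The suite provides detailed performance insights that were previously unavailable because real data sets, workloads, and user logs from industry-scale services were not publicly accessible.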
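
The stall comparison in the findings above can be made concrete with a short sketch. The two memory dependency shares are the figures quoted in this review. Everything else is an assumption made for illustration: the helper names `stall_shares` and `memory_bound_kernels`, the 50% threshold, and the raw sample counts, which are invented placeholders and not measurements from the paper.

```python
from typing import Dict, List

# Memory dependency stall shares quoted in the findings above.
MEMORY_DEPENDENCY_SHARE: Dict[str, float] = {
    "maxwell_scudnn_128x32_stridedB_splitK_interior_nn": 0.61,  # convolution
    "maxwell_sgemm_128x64_nn": 0.18,                            # GEMM
}


def stall_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize raw stall sample counts into fractions of all stalls."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no stall samples")
    return {reason: n / total for reason, n in counts.items()}


def memory_bound_kernels(shares: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return kernels whose memory dependency share meets the threshold.

    The 0.5 default is an arbitrary cut-off chosen for this sketch.
    """
    return sorted(k for k, s in shares.items() if s >= threshold)


# Placeholder counts for illustration only; these are NOT data from the paper.
example_counts = {"memory_dependency": 50, "execution_dependency": 30, "texture": 20}
example_shares = stall_shares(example_counts)

flagged = memory_bound_kernels(MEMORY_DEPENDENCY_SHARE)
print(example_shares)
print(flagged)
```

With the quoted figures, only the convolution kernel crosses the assumed threshold, which matches the review's point that the two kernels call for different optimization effort.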


This review was generated by AI and reviewed by a human editor.