QUICK REVIEW

[Paper Review] Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Yu Emma Wang, Gu-Yeon Wei|arXiv (Cornell University)|Jul 24, 2019

Parallel Computing and Optimization Techniques | References: 48 | Cited by: 230

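One-Sentence Summary

The paper proposes ParaDnn, a parameterized deep learning benchmark suite, and uses it to compare TPU v2/v3, the NVIDIA V100 GPU, and an Intel Skylake CPU on end-to-end FC, CNN, and RNN workloads, revealing each platform's particular strengths and bottlenecks.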

ABSTRACT

Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.

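Motivation and Goals

Motivate systematic, end-to-end benchmarking of deep learning hardware that goes beyond a small sample of models.
Propose ParaDnn to generate thousands of parameterized end-to-end models covering FC, CNN, and RNN architectures.
Use ParaDnn and real-world workloads to compare TPU, GPU, and CPU platforms comprehensively.
Identify architectural and software design insights to guide future specialized hardware and software stack optimization.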

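Proposed Method

Introduces ParaDnn, a parameterized benchmark suite that generates end-to-end FC, CNN, and RNN models.
Combines ParaDnn workloads with six real-world models to form a broad benchmark set.
Evaluates Google Cloud TPU v2/v3, the NVIDIA V100 GPU, and an Intel Skylake CPU platform.
Analyzes TPU architectural bottlenecks, including compute, memory bandwidth, multi-chip overhead, and host-device balance.
Characterizes performance across models using FLOPS utilization, roofline analysis, and operation breakdowns.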
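To make the idea of a parameterized sweep combined with roofline analysis concrete, here is a minimal Python sketch. It is not ParaDnn's code: the function names, the input and output dimensions, the sweep values, and the peak FLOPS and memory bandwidth figures are all placeholders assumed for illustration. It enumerates fully connected model configurations, estimates forward-pass FLOPs and bytes moved, and applies the standard roofline bound, min(peak FLOPS, memory bandwidth x arithmetic intensity).

```python
from itertools import product

# All numbers below are placeholders for illustration, not values from the paper.
INPUT_DIM = 2000
OUTPUT_DIM = 1000
BYTES_PER_VALUE = 4  # assumes float32 storage


def fc_layer_dims(hidden_layers, nodes):
    """Return the (inputs, outputs) shape of every weight matrix in an FC model."""
    dims = [INPUT_DIM] + [nodes] * hidden_layers + [OUTPUT_DIM]
    return list(zip(dims[:-1], dims[1:]))


def fc_cost(hidden_layers, nodes, batch):
    """Estimate forward-pass FLOPs and bytes moved for one batch."""
    flops = 0
    bytes_moved = 0
    for d_in, d_out in fc_layer_dims(hidden_layers, nodes):
        # One multiply-add per weight per example, counted as 2 FLOPs.
        flops += 2 * batch * d_in * d_out
        # Weights read once, input activations read, output activations written.
        values = d_in * d_out + batch * d_in + batch * d_out
        bytes_moved += BYTES_PER_VALUE * values
    return flops, bytes_moved


def roofline_bound(intensity, peak_flops, mem_bandwidth):
    """Attainable FLOPS under the roofline model."""
    return min(peak_flops, mem_bandwidth * intensity)


def flops_utilization(flops, seconds, peak_flops):
    """Fraction of peak FLOPS achieved by a measured run."""
    return flops / seconds / peak_flops


def sweep(layer_options, node_options, batch_options, peak_flops, mem_bandwidth):
    """Enumerate every configuration and report its roofline bound."""
    results = []
    for hidden_layers, nodes, batch in product(layer_options, node_options, batch_options):
        flops, bytes_moved = fc_cost(hidden_layers, nodes, batch)
        intensity = flops / bytes_moved  # FLOPs per byte
        results.append({
            "hidden_layers": hidden_layers,
            "nodes": nodes,
            "batch": batch,
            "intensity": intensity,
            "bound": roofline_bound(intensity, peak_flops, mem_bandwidth),
            "memory_bound": mem_bandwidth * intensity < peak_flops,
        })
    return results


if __name__ == "__main__":
    # Hypothetical device: 100 TFLOPS peak, 1 TB/s memory bandwidth.
    for row in sweep([4, 16], [256, 4096], [64, 16384],
                     peak_flops=100e12, mem_bandwidth=1e12):
        print(row)
```

In this simplified cost model, arithmetic intensity rises with batch size because the weights are reused across more examples, which is one way to see why small-batch FC models tend to sit on the memory-bound side of the roofline.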

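Experimental Results

Research Questions

RQ1: Across diverse end-to-end models, what are the main bottlenecks limiting TPU v2/v3 performance?
RQ2: How do the TPU, GPU, and CPU platforms compare on large-scale ParaDnn-generated and real-world DL workloads?
RQ3: How do model properties (e.g., batch size, width, embedding dimension) affect hardware utilization and performance bottlenecks?
RQ4: Which software and data-precision strategies improve performance on the TPU and GPU platforms?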

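Key Findings

Although performance scales well with batch size, TPU performance in many FC and CNN workloads is limited by memory bandwidth and inter-chip communication.
TPU v3 delivers a significant speedup over v2, driven by larger memory capacity and higher bandwidth, beyond the increase in raw FLOPS.
Memory bandwidth limits and data-input bottlenecks significantly affect both TPU and GPU performance, and optimizing data input yields notable gains.
Larger batches reduce multi-chip communication overhead, while model depth (number of layers) offers underused parallelism that could be exploited through model parallelism or pipelining.
Quantization and software stack improvements bring meaningful performance gains on the TPU and GPU platforms; further gains may come from compiler and kernel optimizations.
The largest fully connected models tend to favor the CPU because of memory constraints, while certain CNN/RNN workloads favor the TPU or GPU depending on architecture.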
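The finding that larger batches reduce multi-chip communication overhead can be illustrated with a toy cost model. The Python sketch below is an assumption-laden illustration, not the paper's methodology: it assumes synchronous data parallelism, a ring all-reduce of gradients, no overlap between compute and communication, and hypothetical hardware numbers. Under those assumptions, the per-step communication time depends only on the parameter count, so its share of the step shrinks as the batch grows.

```python
def step_times(batch, chips, params, flops_per_example,
               chip_flops, link_bandwidth, bytes_per_value=4):
    """Return (compute_seconds, communication_seconds) for one training step.

    Toy model: the batch is split evenly across chips, and gradients are
    combined with a ring all-reduce that moves about 2 * (n - 1) / n of the
    gradient bytes per chip.
    """
    compute = (batch / chips) * flops_per_example / chip_flops
    grad_bytes = params * bytes_per_value
    communicate = 2 * (chips - 1) / chips * grad_bytes / link_bandwidth
    return compute, communicate


def comm_fraction(batch, chips, params, flops_per_example,
                  chip_flops, link_bandwidth):
    """Fraction of step time spent communicating, assuming no overlap."""
    compute, communicate = step_times(
        batch, chips, params, flops_per_example, chip_flops, link_bandwidth)
    return communicate / (compute + communicate)


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to show the trend.
    config = dict(
        chips=8,
        params=1e8,
        flops_per_example=6e8,   # assumed: roughly 6 FLOPs per parameter
        chip_flops=1e13,
        link_bandwidth=1e10,     # bytes per second between chips
    )
    for batch in (64, 1024, 16384):
        print(batch, round(comm_fraction(batch=batch, **config), 3))
```

Real systems overlap communication with computation and use different interconnect topologies, so the absolute fractions here carry no weight; only the downward trend with batch size is the point.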


This review was generated by AI and reviewed by a human editor.