QUICK REVIEW

[Paper Review] Benchmarking TPU, GPU, and CPU Platforms for Deep Learning

Yu Emma Wang, Gu-Yeon Wei|arXiv (Cornell University)|Jul 24, 2019

Parallel Computing and Optimization Techniques|48 references|230 citations

TL;DR

The paper introduces ParaDnn, a parameterized deep learning benchmark, and compares TPU v2/v3, NVIDIA V100 GPU, and Intel Skylake CPU across end-to-end FC, CNN, and RNN workloads, revealing platform-specific strengths and bottlenecks.

ABSTRACT

Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.

Motivation & Objective

Motivate systematic, end-to-end benchmarking of deep learning hardware that goes beyond a small sample of models.
Propose ParaDnn to generate thousands of parameterized end-to-end models covering FC, CNN, and RNN architectures.
Provide a comprehensive comparison of TPU, GPU, and CPU platforms using ParaDnn and real-world workloads.
Identify architectural and software design insights to guide future specialized hardware and stack optimizations.
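
The idea of generating many parameterized models can be sketched as a plain hyperparameter sweep. The snippet below is a minimal illustration for the FC case only, not ParaDnn's actual code: the function names, the parameter names, and every range in GRID are hypothetical placeholders chosen for this example.

```python
from itertools import product


def fc_param_count(layers, nodes, input_dim, output_dim):
    """Weights plus biases of an MLP: input -> `layers` hidden layers -> output."""
    count = input_dim * nodes + nodes                 # input to first hidden layer
    count += (layers - 1) * (nodes * nodes + nodes)   # hidden-to-hidden layers
    count += nodes * output_dim + output_dim          # last hidden layer to output
    return count


def sweep_fc_configs(grid):
    """Yield one config dict per point in the Cartesian product of the grid."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        cfg["params"] = fc_param_count(
            cfg["layers"], cfg["nodes"], cfg["input_dim"], cfg["output_dim"]
        )
        yield cfg


# Hypothetical ranges for illustration only (not the ranges used in the paper).
GRID = {
    "layers": [4, 8, 16],
    "nodes": [32, 256, 2048],
    "input_dim": [2000],
    "output_dim": [200],
    "batch_size": [64, 1024],
}

configs = list(sweep_fc_configs(GRID))
print(len(configs), "FC configurations generated")
```

Each generated config would then be built into an end-to-end model and timed on every platform; widening the ranges grows the number of models multiplicatively, which is how a small grid can reach thousands of models.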

Proposed method

Introduce ParaDnn, a parameterized benchmark suite that generates end-to-end FC, CNN, and RNN models.
Combine ParaDnn workloads with six real-world models to create a broad benchmark set.
Evaluate Google Cloud TPU v2/v3, NVIDIA V100 GPU, and an Intel Skylake CPU platform.
Analyze TPU architecture bottlenecks, including computation, memory bandwidth, multi-chip overhead, and host-device balance.
Use FLOPS utilization, roofline analysis, and operation breakdowns to characterize performance across models.
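
The two metrics named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's tooling: the function names are hypothetical, and the peak throughput, memory bandwidth, and workload numbers are placeholder values rather than measured or vendor-specified figures.

```python
def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof.

    peak_flops    -- peak compute throughput in FLOP/s
    mem_bandwidth -- memory bandwidth in bytes/s
    intensity     -- arithmetic intensity in FLOP per byte moved
    """
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Arithmetic intensity at which a workload stops being memory-bound."""
    return peak_flops / mem_bandwidth


def flops_utilization(flops_per_step, step_time_s, peak_flops):
    """Fraction of peak throughput that a training step actually achieves."""
    return flops_per_step / step_time_s / peak_flops


# Placeholder hardware numbers for illustration only.
PEAK = 100e12       # 100 TFLOP/s (assumed)
BANDWIDTH = 1e12    # 1 TB/s (assumed)

for intensity in (10, 500):
    bound = attainable_flops(PEAK, BANDWIDTH, intensity)
    regime = "memory-bound" if intensity < ridge_point(PEAK, BANDWIDTH) else "compute-bound"
    print(f"intensity={intensity} FLOP/B -> {bound:.2e} FLOP/s ({regime})")

# A hypothetical step doing 2e12 FLOPs in 50 ms.
print(f"utilization = {flops_utilization(2e12, 0.05, PEAK):.2f}")
```

Plotting measured FLOP/s against arithmetic intensity for each model, with these two roofs as reference lines, shows whether a given model is limited by memory bandwidth or by compute on a given platform.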

Experimental results

Research questions

RQ1: What are the main bottlenecks limiting TPU v2/v3 performance across diverse end-to-end models?
RQ2: How do TPU, GPU, and CPU platforms compare on a broad set of ParaDnn-generated and real-world DL workloads?
RQ3: How do model attributes (e.g., batch size, width, embedding size) affect hardware utilization and performance bottlenecks?
RQ4: What software and data precision strategies can improve performance on TPU and GPU platforms?

Key findings

TPU performance is constrained by memory bandwidth and inter-chip communication for many FC and CNN workloads, despite good batch-size scaling.
TPU v3 delivers substantial speedups over v2, driven by larger memory capacity and higher bandwidth, beyond raw FLOPS increases.
Memory bandwidth limitations and data infeed bottlenecks significantly impact TPU and GPU performance, with data infeed optimization offering notable gains.
Large batch sizes can reduce multi-chip communication overhead, while model depth (layer count) offers parallelism that remains underused and could be exploited through model parallelism or pipelining.
Quantization and software stack improvements can yield meaningful performance gains on TPU and GPU platforms, with further gains possible through compiler and kernel optimizations.
The largest fully connected models tend to favor the CPU because of accelerator memory constraints, while the TPU or GPU holds the advantage on some CNN and RNN workloads, depending on the model architecture.

This review was created by AI and reviewed by human editors.