QUICK REVIEW

[Paper Review] In-Datacenter Performance Analysis of a Tensor Processing Unit

Norman P. Jouppi, Cliff Young|arXiv (Cornell University)|Apr 16, 2017

Parallel Computing and Optimization Techniques | Cited by 23

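One-Sentence Summary

This paper evaluates Google's custom Tensor Processing Unit (TPU), a domain-specific ASIC designed for neural network inference in datacenters. With a matrix multiply unit of 65,536 8-bit multiply-accumulate (MAC) units and a deterministic execution model, the TPU is on average about 15-30x faster than its contemporary CPU and GPU, with TOPS/Watt about 30-80x higher; using the GPU's GDDR5 memory in the TPU would raise TOPS/Watt to nearly 70x that of the GPU and 200x that of the CPU.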

ABSTRACT

Many architects believe that major improvements in cost-energy-performance must now come from domain-specific hardware. This paper evaluates a custom ASIC---called a Tensor Processing Unit (TPU)---deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN). The heart of the TPU is a 65,536 8-bit MAC matrix multiply unit that offers a peak throughput of 92 TeraOps/second (TOPS) and a large (28 MiB) software-managed on-chip memory. The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency. The lack of such features helps explain why, despite having myriad MACs and a big memory, the TPU is relatively small and low power. We compare the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters. Our workload, written in the high-level TensorFlow framework, uses production NN applications (MLPs, CNNs, and LSTMs) that represent 95% of our datacenters' NN inference demand. Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

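Motivation and Goals

Evaluate the performance, efficiency, and scalability of the Tensor Processing Unit (TPU), a custom ASIC deployed in production datacenters since 2015.
Respond to the growing demand for domain-specific hardware in machine learning inference workloads, with the aim of improving cost, energy, and performance.
Compare the TPU's performance and efficiency against a contemporary server-class CPU and GPU using real production workloads.
Show that deterministic execution and software-managed memory meet low-latency requirements better than the cache-based, time-varying optimizations of general-purpose processors.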

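Proposed Approach

Design a domain-specific ASIC whose matrix multiply unit contains 65,536 8-bit multiply-accumulate (MAC) units, with a peak throughput of 92 TOPS.
Provide a large (28 MiB) software-managed on-chip memory to reduce dependence on external memory bandwidth and latency.
Adopt a deterministic execution model that avoids time-varying optimizations such as caches, out-of-order execution, and multithreading.
Evaluate with real production workloads written in the TensorFlow framework, covering MLPs, CNNs, and LSTMs that represent 95% of the datacenters' NN inference demand.
Benchmark against an Intel Haswell CPU and an Nvidia K80 GPU, contemporaries deployed in the same datacenters, on the same workloads.
Analyze performance and energy efficiency using TOPS, TOPS/Watt, and 99th-percentile response time.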
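The 92 TOPS peak figure follows from simple arithmetic on the MAC array. The sketch below reproduces it under stated assumptions: the MAC count comes from the abstract, while the 256x256 arrangement and the 700 MHz clock are assumptions drawn from the paper body rather than from this summary, and `peak_tops` is a hypothetical helper name.

```python
# Peak-throughput arithmetic for a MAC array (illustrative sketch).
# MAC_UNITS matches the abstract; CLOCK_HZ is an assumption, not stated in this summary.
MAC_UNITS = 256 * 256   # 65,536 MACs, assumed to be arranged as a 256x256 array
OPS_PER_MAC = 2         # each MAC counts as one multiply plus one add per cycle
CLOCK_HZ = 700e6        # assumed clock rate of 700 MHz


def peak_tops(mac_units: int, ops_per_mac: int, clock_hz: float) -> float:
    """Return peak throughput in tera-operations per second."""
    return mac_units * ops_per_mac * clock_hz / 1e12


if __name__ == "__main__":
    # Roughly 91.75 TOPS, which rounds to the quoted 92 TOPS.
    print(f"{peak_tops(MAC_UNITS, OPS_PER_MAC, CLOCK_HZ):.2f} TOPS")
```

This is a peak number only; achieved throughput on real workloads depends on how well each application keeps the array fed.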

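Experimental Results

Research Questions

RQ1: How do the TPU's performance and energy efficiency on neural network inference workloads compare with those of its contemporary CPU and GPU?
RQ2: How much does the TPU's deterministic execution model improve 99th-percentile response time compared with CPUs and GPUs that rely on time-varying optimizations?
RQ3: If the TPU used the GPU's GDDR5 memory in place of its existing off-chip memory, how much would performance improve, and what would be the effect on TOPS and TOPS/Watt?
RQ4: Why does the TPU still deliver high throughput and efficiency despite relatively low utilization on some applications?
RQ5: How does the 28 MiB software-managed on-chip memory improve performance and energy efficiency within the TPU architecture?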
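RQ2 turns on the 99th-percentile response time rather than the mean. As a minimal sketch of why the two can disagree, the example below uses a nearest-rank percentile on two made-up latency samples; the numbers and the names `steady` and `bursty` are hypothetical and are not measurements from the paper.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def mean(samples):
    """Arithmetic mean of a non-empty list."""
    return sum(samples) / len(samples)


# Hypothetical latencies in milliseconds (not data from the paper).
steady = [7.0] * 100               # every request takes the same time
bursty = [5.0] * 98 + [40.0] * 2   # faster on average, but with a slow tail

if __name__ == "__main__":
    for name, data in (("steady", steady), ("bursty", bursty)):
        print(f"{name}: mean={mean(data):.1f} ms, p99={percentile(data, 99):.1f} ms")
```

In this toy data the bursty system has the better mean but the worse 99th percentile, which is the trade-off the abstract describes when it says time-varying optimizations help average throughput more than guaranteed latency.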

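Key Findings

On real production neural network inference workloads, the TPU is on average about 15-30x faster than its contemporary Intel Haswell CPU or Nvidia K80 GPU.
The TPU's TOPS/Watt is about 30-80x higher than that of the CPU and GPU.
Using the GPU's GDDR5 memory in the TPU would triple achieved TOPS (not peak TOPS) and raise TOPS/Watt to nearly 70x the GPU and 200x the CPU.
The TPU's deterministic execution model is a better match for 99th-percentile response-time requirements than the time-varying optimizations of CPUs and GPUs.
Despite low utilization on some workloads, the TPU's specialized architecture and memory hierarchy still deliver consistently high performance on inference workloads.
The 28 MiB on-chip memory reduces accesses to external memory, which lowers latency and improves energy efficiency.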
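The finding that faster memory would triple achieved TOPS suggests that some workloads are limited by memory bandwidth rather than by the MAC array. A roofline-style bound is one common way to reason about this; the sketch below is an illustration under assumptions, not the paper's own analysis. The 92 TOPS peak is from the abstract, while the bandwidth and operational-intensity values are hypothetical, as is the helper name `attainable_tops`.

```python
def attainable_tops(peak_tops: float, bandwidth_gb_s: float, ops_per_byte: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic, whichever is lower."""
    memory_bound_tops = bandwidth_gb_s * ops_per_byte / 1000.0  # GB/s * ops/byte = Gops/s
    return min(peak_tops, memory_bound_tops)


PEAK_TOPS = 92.0         # peak figure from the abstract
BASE_BANDWIDTH = 30.0    # hypothetical off-chip bandwidth in GB/s
FAST_BANDWIDTH = 90.0    # hypothetical 3x faster memory

if __name__ == "__main__":
    # Hypothetical operational intensities, in operations per byte fetched from memory.
    for label, intensity in (("memory-bound", 200.0), ("compute-bound", 5000.0)):
        before = attainable_tops(PEAK_TOPS, BASE_BANDWIDTH, intensity)
        after = attainable_tops(PEAK_TOPS, FAST_BANDWIDTH, intensity)
        print(f"{label}: {before:.1f} -> {after:.1f} TOPS")
```

With these made-up numbers the memory-bound case scales linearly with bandwidth, while the compute-bound case stays pinned at the peak, so any average gain across workloads depends on how many of them sit on the memory-bound side.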


This review was generated by AI and checked by a human editor.