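[Paper Digest] Full Stack Optimization of Transformer Inference: a Survey
This survey analyzes end-to-end approaches to efficient Transformer inference and presents a case study on Gemmini that achieves up to 88.7× speedup with minimal performance degradation.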
Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements, compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with a minimal performance degradation for Transformer inference.
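Motivation and Goals
- Analyze the runtime bottlenecks and workload characteristics of Transformer architectures.
- Examine how the non-linear and linear operations in Transformers affect hardware and inference efficiency.
- Survey optimization techniques for a fixed Transformer architecture (e.g., pruning and quantization).
- Explore the challenges of scheduling and mapping Transformer workloads onto different hardware.
- Study neural architecture search as a way to make Transformers more hardware-efficient.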
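Proposed Approach
- Review the runtime characteristics of Transformers and profile their bottlenecks (Sec. 2).
- Analyze the hardware implications of non-linear operations (LayerNorm, Softmax, GELU) and linear operations (matrix multiplication) for accelerators (Sec. 3).
- Review optimization techniques for a fixed architecture (pruning, quantization) (Sec. 4).
- Discuss the challenges of mapping and scheduling operations (Sec. 5).
- Describe neural architecture search methods for adapting Transformer architectures to hardware efficiency (Sec. 6).
- Present a case study that applies the surveyed optimizations to Gemmini and reports their performance impact (Sec. 3.4, Fig. 14, Sec. 5.5).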
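As a rough illustration of the quantization step listed above, here is a minimal sketch of symmetric per-tensor int8 quantization followed by an integer dot product. It is not the paper's or Gemmini's implementation; the function names (`quantize`, `dequantize`, `quantized_dot`) and the choice of a symmetric per-tensor scheme are assumptions made for this example.

```python
def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization (assumed scheme for illustration).

    Returns the integer codes and the floating-point scale that maps
    codes back to real values.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for int8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric range.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, scale


def dequantize(ints, scale):
    """Map integer codes back to approximate real values."""
    return [q * scale for q in ints]


def quantized_dot(a, b):
    """Dot product with integer accumulation and one final rescale."""
    qa, scale_a = quantize(a)
    qb, scale_b = quantize(b)
    acc = sum(x * y for x, y in zip(qa, qb))  # integer multiply-accumulate
    return acc * scale_a * scale_b            # single dequantization step


if __name__ == "__main__":
    a = [1.0, -0.5, 0.25]
    b = [0.5, 0.5, 1.0]
    print("approximate:", quantized_dot(a, b))
    print("exact:      ", sum(x * y for x, y in zip(a, b)))
```

The multiply-accumulate loop runs entirely on integers, and only the final rescale touches floating point. The quantization and dequantization steps around that loop are the kind of overhead the digest's findings mention as costly on a CNN-oriented accelerator.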
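Experimental Results
Research Questions
- RQ1: What are the runtime bottlenecks of Transformer encoders and decoders on hardware?
- RQ2: How do the non-linear operations in Transformers affect accelerator design and utilization?
- RQ3: Which optimization strategies maximize performance for a fixed Transformer architecture?
- RQ4: Which scheduling and mapping decisions have the largest effect on Transformer inference latency?
- RQ5: Can neural architecture search produce hardware-efficient Transformer variants, and what are the trade-offs?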
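Key Findings
- A full-stack co-design approach yields up to 88.7× speedup for Transformer inference on Gemmini with minimal performance degradation.
- Gemmini's CNN-optimized architecture is not well suited to Transformer inference: much of the runtime goes to floating-point non-linear operations and to quantization/dequantization, and if this is left unaddressed, hardware utilization can fall below 1%.
- For Transformer accelerators, a larger accumulator combined with a smaller scratchpad generally performs better than the CNN-optimized configuration (roughly a 36% latency reduction in the reported case).
- Scheduling matrix multiplications in Transformers is as challenging as in CNNs; the gap between the best and worst schedules can reach four orders of magnitude (Sec. 5.5.1).
- Fusing LayerNorm with the preceding matmul introduces tile-size constraints that can, in some cases, offset the gains from fusion (Sec. 5.5.2).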
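The last two findings can be made concrete with a toy tiling search. The sketch below is only an illustration under stated assumptions, not the paper's cost model or Gemmini's scheduler. It assumes an output-stationary loop order, a scratchpad that holds one tile of each input matrix, an accumulator that holds one output tile, and a fusion rule under which LayerNorm needs a complete output row on chip (so the column tile must equal `N`). All names, matrix sizes, and buffer capacities are hypothetical.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n, used as candidate tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]


def dram_traffic(M, K, N, tm, tk, tn):
    """Elements moved to or from DRAM for C[M,N] = A[M,K] @ B[K,N].

    Toy model: A is re-read once per column tile, B once per row tile,
    and C is written once. The reduction tile tk only affects on-chip
    capacity here, not traffic.
    """
    return M * K * (N // tn) + K * N * (M // tm) + M * N


def legal_tilings(M, K, N, scratchpad, accumulator, fused_layernorm=False):
    """Yield (tm, tk, tn) tiles that fit the assumed on-chip buffers."""
    for tm, tk, tn in product(divisors(M), divisors(K), divisors(N)):
        if tm * tk + tk * tn > scratchpad:    # input tiles must fit
            continue
        if tm * tn > accumulator:             # output tile must fit
            continue
        if fused_layernorm and tn != N:       # assumed: full row needed
            continue
        yield tm, tk, tn


def best_and_worst(M, K, N, scratchpad, accumulator, fused_layernorm=False):
    """Return (lowest, highest) traffic over all legal tilings."""
    costs = [
        dram_traffic(M, K, N, tm, tk, tn)
        for tm, tk, tn in legal_tilings(
            M, K, N, scratchpad, accumulator, fused_layernorm
        )
    ]
    if not costs:
        raise ValueError("no tiling fits the given buffer sizes")
    return min(costs), max(costs)


if __name__ == "__main__":
    # Hypothetical sizes: a 512x768 by 768x768 matmul, with buffer
    # capacities counted in elements.
    shape = dict(M=512, K=768, N=768, scratchpad=65536, accumulator=16384)
    print("unfused (best, worst):", best_and_worst(**shape))
    print("fused   (best, worst):",
          best_and_worst(**shape, fused_layernorm=True))
```

Because the fused search space is a subset of the unfused one, the best fused schedule in this model can never move less data than the best unfused one. Whether fusion pays off then depends on how the extra matmul traffic compares with the output write and re-read that fusion avoids, which this model does not count. The toy model is not expected to reproduce the paper's four-orders-of-magnitude spread.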
This digest was generated by AI and reviewed by a human editor.