QUICK REVIEW

[Paper Review] Full Stack Optimization of Transformer Inference: a Survey

Sehoon Kim, Coleman Hooper|arXiv (Cornell University)|2023. 02. 27.

Ferroelectric and Negative Capacitance Devices | Citations: 27

One-Line Summary

This survey analyzes full-stack approaches to efficient Transformer inference and, in a case study on Gemmini, demonstrates up to 88.7× speedup with minimal performance degradation.

ABSTRACT

Recent advances in state-of-the-art DNN architecture design have been moving toward Transformer models. These models achieve superior accuracy across a wide range of applications. This trend has been consistent over the past several years since Transformer models were originally introduced. However, the amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate, and this has made their deployment in latency-sensitive applications challenging. As such, there has been an increased focus on making Transformer models more efficient, with methods that range from changing the architecture design, all the way to developing dedicated domain-specific accelerators. In this work, we survey different approaches for efficient Transformer inference, including: (i) analysis and profiling of the bottlenecks in existing Transformer architectures and their similarities and differences with previous convolutional models; (ii) implications of Transformer architecture on hardware, including the impact of non-linear operations such as Layer Normalization, Softmax, and GELU, as well as linear operations, on hardware design; (iii) approaches for optimizing a fixed Transformer architecture; (iv) challenges in finding the right mapping and scheduling of operations for Transformer models; and (v) approaches for optimizing Transformer models by adapting the architecture using neural architecture search. Finally, we perform a case study by applying the surveyed optimizations on Gemmini, the open-source, full-stack DNN accelerator generator, and we show how each of these approaches can yield improvements, compared to previous benchmark results on Gemmini. Among other things, we find that a full-stack co-design approach with the aforementioned methods can result in up to 88.7x speedup with a minimal performance degradation for Transformer inference.

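Research Motivation and Goals

Analyze the runtime bottlenecks and workload characteristics of Transformer architectures.
Analyze how nonlinear and linear Transformer operations affect hardware and inference efficiency.
Survey optimization techniques for a fixed Transformer architecture (e.g., pruning, quantization).
Explore the scheduling and mapping problems of Transformer workloads across hardware.
Explore neural architecture search (NAS) for tailoring Transformers to hardware efficiency.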

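Proposed Method

Survey runtime characteristics and bottleneck profiling (Sec. 2).
Analyze the hardware implications of nonlinear operations (LayerNorm, Softmax, GELU) and linear operations (matmuls) for accelerators (Sec. 3).
Review optimization techniques for a fixed architecture (pruning, quantization) (Sec. 4).
Discuss the mapping and scheduling problems of Transformer workloads (Sec. 5).
Describe neural architecture search approaches for adapting the Transformer architecture to hardware efficiency (Sec. 6).
Present a case study that applies the surveyed optimizations to Gemmini and report the performance implications (Sec. 3.4, Fig. 14, Sec. 5.5).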
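
To make the bottleneck profiling step more concrete, here is a minimal sketch of an arithmetic-intensity estimate for a single matmul. The function name, the shapes (hidden size 768, sequence length 128), and the assumption that each operand is moved exactly once are illustrative choices for this sketch, not values taken from the review.

```python
def matmul_arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 1) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumption of this sketch: A and B are each read once and C is
    written once (ideal on-chip reuse), with 1 byte per element (int8).
    """
    flops = 2 * m * k * n  # one multiply and one add per MAC
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    # Hypothetical encoder-style projection: 128 tokens processed together.
    print(matmul_arithmetic_intensity(128, 768, 768))
    # Hypothetical decoder-style projection: one token per step.
    print(matmul_arithmetic_intensity(1, 768, 768))
```

Under these assumptions the single-token case performs only about two FLOPs per byte moved, so it is far more likely to be limited by memory bandwidth than the 128-token case.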

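Experimental Results

Research Questions

RQ1: What are the runtime bottlenecks of Transformer encoders and decoders on hardware?
RQ2: How do the nonlinear operations in Transformers affect accelerator design and utilization?
RQ3: Which optimization strategies maximize performance for a fixed Transformer architecture?
RQ4: Which scheduling and mapping decisions have the largest impact on Transformer inference latency?
RQ5: Can neural architecture search produce hardware-efficient Transformer variants, and what are the trade-offs?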

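Key Findings

A full-stack co-design approach achieves up to 88.7× speedup for Transformer inference on Gemmini with minimal performance degradation.
Gemmini's CNN-optimized architecture is poorly suited to Transformer inference because floating-point nonlinear operations and quantization/dequantization operations consume much of the runtime; left unaddressed, hardware utilization can fall below 1%.
For Transformer accelerators, a larger accumulator and a smaller scratchpad often outperform the CNN-optimized design (roughly a 36% latency improvement in the reported case).
Scheduling matmuls in Transformers is as challenging as in CNNs; the best and worst schedules can differ by up to four orders of magnitude (Sec. 5.5.1).
Fusing LayerNorm with the preceding matmul introduces tile-size constraints that can offset the benefit of fusion in some scenarios (Sec. 5.5.2).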
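
The last two findings can be illustrated with a toy cost model. This is a rough sketch, not the paper's scheduler or Gemmini's actual mapping. It assumes output-stationary tiling, counts only elements moved to and from off-chip memory, and models LayerNorm fusion as requiring each output tile to span a full row (since LayerNorm normalizes over the whole hidden dimension). All names, shapes, and the accumulator capacity are hypothetical.

```python
from itertools import product


def divisors(n: int) -> list[int]:
    """All positive divisors of n, used as candidate tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]


def traffic(M: int, K: int, N: int, tm: int, tn: int) -> int:
    """Elements moved for C[M, N] = A[M, K] @ B[K, N] with (tm x tn) output tiles.

    Each output tile streams a (tm x K) slice of A and a (K x tn) slice
    of B, and every output element is written once.
    """
    tiles = (M // tm) * (N // tn)
    return tiles * (tm * K + K * tn) + M * N


def schedules(M, K, N, acc_capacity, require_full_rows=False):
    """Enumerate tilings whose output tile fits in the accumulator.

    require_full_rows=True models the (assumed) fusion constraint:
    the tile must cover an entire output row, i.e. tn == N.
    """
    found = []
    for tm, tn in product(divisors(M), divisors(N)):
        if tm * tn > acc_capacity:
            continue
        if require_full_rows and tn != N:
            continue
        found.append(((tm, tn), traffic(M, K, N, tm, tn)))
    return found


def best_schedule(M, K, N, acc_capacity, require_full_rows=False):
    """Tiling with the least modeled traffic, as ((tm, tn), elements)."""
    return min(schedules(M, K, N, acc_capacity, require_full_rows),
               key=lambda s: s[1])


if __name__ == "__main__":
    M, K, N, ACC = 128, 768, 768, 16384  # hypothetical sizes
    print("unfused best:", best_schedule(M, K, N, ACC))
    print("fused best:  ", best_schedule(M, K, N, ACC, require_full_rows=True))
```

In this toy model, forcing full-row tiles shrinks the set of legal tilings and raises the matmul's modeled traffic. The model does not account for the traffic that fusion saves by skipping a separate LayerNorm pass, so it only shows why the net benefit depends on the scenario.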


This review was generated by AI and reviewed by a human editor.