QUICK REVIEW

[Paper Review] Efficiently Scaling Transformer Inference

Reiner Pope, Sholto Douglas|arXiv (Cornell University)|2022. 11. 09.

Generative Adversarial Networks and Image Synthesis | Citations: 57

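One-Line Summary

This paper presents an engineering framework for partitioning large Transformer models across many TPU chips to optimize inference latency and FLOPS utilization. It achieves a new Pareto frontier on models above 500B parameters and enables longer contexts through multiquery attention.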

ABSTRACT

We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e. multiple query heads share single key/value head) enables scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.
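The "simple analytical model" mentioned in the abstract can be illustrated with a back-of-the-envelope bound: at small batch sizes, each generation step has to stream every weight from HBM at least once, so weight-loading time sets a floor on per-token latency. The sketch below is an illustration of that idea, not the authors' code. The function name `weight_load_time_ms` is hypothetical, and the per-chip HBM bandwidth of 1.2 TB/s is an assumed figure for TPU v4 that does not appear in this review.

```python
def weight_load_time_ms(n_params, bytes_per_param, n_chips, hbm_bytes_per_s):
    """Lower bound on one decode step: time to read all weights once.

    Assumes weights are evenly sharded, so aggregate bandwidth scales with n_chips.
    Ignores KV cache reads, compute, and inter-chip communication.
    """
    total_bytes = n_params * bytes_per_param
    aggregate_bandwidth = n_chips * hbm_bytes_per_s
    return 1e3 * total_bytes / aggregate_bandwidth


PARAMS = 540e9   # PaLM 540B, from the abstract
CHIPS = 64       # chip count used in this review
HBM_BW = 1.2e12  # ASSUMPTION: bytes/s per chip, not stated in this review

int8_floor_ms = weight_load_time_ms(PARAMS, 1, CHIPS, HBM_BW)  # int8 weights
bf16_floor_ms = weight_load_time_ms(PARAMS, 2, CHIPS, HBM_BW)  # bfloat16 weights

print(f"int8 floor: {int8_floor_ms:.2f} ms, bf16 floor: {bf16_floor_ms:.2f} ms")
```

Under these assumptions the floor halves when weights go from two bytes to one, which is one way to see why int8 weight quantization helps low-batch latency. The reported 29 ms per token sits above the floor, as expected, because the bound leaves out KV cache traffic, compute, and communication.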

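Motivation and Objectives

Develop a simple analytical model for selecting multi-dimensional partitioning strategies for large Transformer inference based on application requirements.
Combine partitioning with memory and low-level scheduling optimizations to improve MFU and latency on 500B+ parameter models.
Show that multiquery attention reduces KV cache memory cost and enables longer context lengths.
Demonstrate a practical, configurable inference framework, validated with PaLM 540B on 64 TPU v4 chips.
Provide guidelines for choosing partitioning layouts in the prefill and generation phases under different workload requirements.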

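Proposed Method

Define the latency, throughput, and model FLOPS utilization (MFU) metrics.
Develop partitioning schemes for the feedforward layers, including 1D and 2D weight-stationary layouts and weight-gathered layouts.
Propose a multiquery attention partitioning that shards the KV cache over the batch dimension to reduce memory time.
Use a parallel attention/feedforward formulation to fuse operations and reduce communication.
Apply low-level optimizations such as Looped CollectiveEinsum and asynchronous collectives, together with int8 weight quantization, to improve performance.
Validate on PaLM 540B (64 TPU v4 chips) with a context length of up to 2048 tokens: 29 ms per-token latency during generation (int8) and 76% MFU at large batch sizes.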
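To make the MFU metric in the list above concrete: MFU is the ratio of achieved throughput to the throughput the hardware would deliver at 100% of peak FLOPS. The sketch below is a minimal illustration under two stated assumptions that are not taken from this review: a forward pass costs roughly 2N matmul FLOPs per token for an N-parameter decoder-only model (attention FLOPs ignored), and the per-chip peak is 275 TFLOPS. The function names are hypothetical.

```python
def mfu(tokens_per_s, n_params, n_chips, peak_flops_per_chip):
    """Model FLOPS utilization: achieved FLOPS divided by aggregate peak FLOPS."""
    flops_per_token = 2 * n_params  # ASSUMPTION: ~2N forward-pass FLOPs per token
    achieved_flops = tokens_per_s * flops_per_token
    return achieved_flops / (n_chips * peak_flops_per_chip)


def tokens_per_s_at(target_mfu, n_params, n_chips, peak_flops_per_chip):
    """Inverse of mfu(): throughput needed to hit a target utilization."""
    return target_mfu * n_chips * peak_flops_per_chip / (2 * n_params)


PARAMS = 540e9       # PaLM 540B
CHIPS = 64
PEAK_FLOPS = 275e12  # ASSUMPTION: per-chip peak, not stated in this review

needed = tokens_per_s_at(0.76, PARAMS, CHIPS, PEAK_FLOPS)
print(f"tokens/s for 76% MFU under these assumptions: {needed:,.0f}")
print(f"round trip MFU: {mfu(needed, PARAMS, CHIPS, PEAK_FLOPS):.2f}")
```

Because MFU normalizes by model FLOPs rather than hardware-executed FLOPs, it lets configurations with different partitioning layouts be compared on the same scale.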

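Experimental Results

Research Questions

RQ1: How do partitioning strategies affect latency, MFU, and memory traffic in large Transformer inference?
RQ2: What is the best combination of 1D, 2D, and weight-gathered feedforward layouts across batch sizes and context lengths?
RQ3: Does multiquery attention substantially reduce the KV cache memory burden and enable longer contexts without raising communication cost?
RQ4: How does the parallel attention/feedforward formulation affect latency and MFU compared with the serial implementation?
RQ5: Which quantization and low-level optimizations yield the most practical Pareto frontier for PaLM-scale models?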

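Key Results

A simple analytical partitioning framework identifies near-optimal multi-dimensional partitionings for a given model size, context length, and chip count.
The best feedforward layout shifts from 2D weight-stationary to weight-gathered as batch size grows. At large batch sizes the weight-gathered layouts win, reaching up to 76% MFU.
Multiquery attention reduces per-chip KV cache memory by up to a factor of the chip count (n_chips), enabling 32 to 64 times longer contexts than the multihead setting in the reported configurations.
The parallel attention/feedforward formulation reduces latency and communication and raises FLOPS utilization compared with the serial variant.
On 64 TPU v4 chips, PaLM 540B reaches a low-batch generation latency of 29 ms per token (int8) and 76% MFU during large-batch prefill, with a 2048-token context.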
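The n_chips reduction in KV cache memory can be checked with simple arithmetic. With a single shared key/value head there is nothing to split along the heads axis, so a heads-sharded layout replicates the cache on every chip. Sharding over the batch axis instead divides it across chips. The sketch below is an illustration under assumptions, not the authors' implementation: the function name is hypothetical, and the PaLM 540B shape values (118 layers, 48 heads, head dimension 256) and the batch size are assumed, not taken from this review.

```python
def kv_bytes_per_chip(batch, seq_len, n_layers, n_kv_heads, d_head,
                      n_chips, shard_axis, bytes_per_elem=2):
    """Per-chip KV cache size when the cache is sharded along one axis."""
    # Factor 2 covers keys and values.
    total = 2 * batch * seq_len * n_layers * n_kv_heads * d_head * bytes_per_elem
    if shard_axis == "heads":
        shards = min(n_kv_heads, n_chips)  # one KV head means full replication
    elif shard_axis == "batch":
        shards = min(batch, n_chips)
    else:
        raise ValueError(f"unknown shard_axis: {shard_axis}")
    return total / shards


# ASSUMPTIONS: model shape and batch size are illustrative, not from this review.
shape = dict(batch=512, seq_len=2048, n_layers=118, d_head=256, n_chips=64)

multihead = kv_bytes_per_chip(n_kv_heads=48, shard_axis="heads", **shape)
mq_heads = kv_bytes_per_chip(n_kv_heads=1, shard_axis="heads", **shape)
mq_batch = kv_bytes_per_chip(n_kv_heads=1, shard_axis="batch", **shape)

print(f"multihead, sharded over heads:  {multihead / 2**30:.2f} GiB per chip")
print(f"multiquery, sharded over heads: {mq_heads / 2**30:.2f} GiB per chip")
print(f"multiquery, sharded over batch: {mq_batch / 2**30:.2f} GiB per chip")
```

In this toy setup, multiquery attention sharded over heads saves nothing per chip relative to multihead attention, while sharding over batch cuts the per-chip cache by a factor equal to the chip count. That freed memory is what allows longer contexts.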


This review was generated by AI and reviewed by a human editor.