QUICK REVIEW

[Paper Review] Lite Transformer with Long-Short Range Attention

Zhanghao Wu, Zhijian Liu | arXiv (Cornell University) | 2020. 04. 24.

Topic Modeling | Citations: 130

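One-line summary

Lite Transformer introduces a two-branch Long-Short Range Attention (LSRA) that models local and global context separately, delivering mobile-friendly NLP with BLEU gains over the Transformer under constrained compute. It also enables substantial model size reduction and outperforms an AutoML-searched baseline without the heavy design cost.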

ABSTRACT

Transformer has become ubiquitous in natural language processing (e.g., machine translation, question answering); however, it requires enormous amount of computations to achieve high performance, which makes it not suitable for mobile applications that are tightly constrained by the hardware resources and battery. In this paper, we present an efficient mobile NLP architecture, Lite Transformer to facilitate deploying mobile NLP applications on edge devices. The key primitive is the Long-Short Range Attention (LSRA), where one group of heads specializes in the local context modeling (by convolution) while another group specializes in the long-distance relationship modeling (by attention). Such specialization brings consistent improvement over the vanilla transformer on three well-established language tasks: machine translation, abstractive summarization, and language modeling. Under constrained resources (500M/100M MACs), Lite Transformer outperforms transformer on WMT'14 English-French by 1.2/1.7 BLEU, respectively. Lite Transformer reduces the computation of transformer base model by 2.5x with 0.3 BLEU score degradation. Combining with pruning and quantization, we further compressed the model size of Lite Transformer by 18.2x. For language modeling, Lite Transformer achieves 1.8 lower perplexity than the transformer at around 500M MACs. Notably, Lite Transformer outperforms the AutoML-based Evolved Transformer by 0.5 higher BLEU for the mobile NLP setting without the costly architecture search that requires more than 250 GPU years. Code has been made available at https://github.com/mit-han-lab/lite-transformer.

Research motivation and goals

Motivate efficient NLP inference on edge devices under strict compute constraints.
Design a lightweight transformer architecture that preserves or improves performance under 500M Mult-Adds.
Introduce LSRA to replace bottleneck attention with specialized local and global branches.
Demonstrate that LSRA enables compression (pruning/quantization) with substantial size reductions.
Compare performance and costs against AutoML-based baselines (Evolved Transformer) under mobile settings.
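
To make the Mult-Adds budget above more concrete, here is a rough per-layer count for a standard Transformer block. This is a back-of-the-envelope sketch, not the paper's accounting: the function name `layer_mult_adds`, the 4x FFN expansion default, and the example sizes (30 tokens, width 512) are illustrative assumptions, and embeddings, softmax, and normalization are ignored.

```python
def layer_mult_adds(n_tokens, d_model, ffn_ratio=4):
    """Rough multiply-add count for one Transformer block (assumed formula).

    attention: Q, K, V and output projections (4 * N * d^2)
               plus score computation and weighted sum (2 * N^2 * d)
    ffn:       two linear layers with an ffn_ratio-times wider hidden size
    """
    projections = 4 * n_tokens * d_model ** 2
    score_and_mix = 2 * n_tokens ** 2 * d_model
    ffn = 2 * ffn_ratio * n_tokens * d_model ** 2
    return {"attention": projections + score_and_mix, "ffn": ffn}


counts = layer_mult_adds(n_tokens=30, d_model=512)
ffn_share = counts["ffn"] / (counts["attention"] + counts["ffn"])
print(counts)
print(round(ffn_share, 2))  # 0.66
```

Under these assumptions, about two thirds of the per-layer compute at short sequence lengths goes to the FFN rather than to attention, which gives a sense of why a fixed budget such as 500M Mult-Adds constrains the layer design so tightly.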

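Proposed method

Propose Long-Short Range Attention (LSRA) as two parallel branches: a global attention branch and a local convolution branch.
Split the input channels to feed the two branches, then fuse their outputs with the FFN, so each branch processes half of the channels and its computation is roughly halved.
Replace the conventional bottleneck of the Transformer block with a flattened channel dimension, which raises attention's share of the model capacity.
Use a lightweight convolution module (depthwise-like, parameter-efficient) in the local branch to capture local context.
Train and evaluate Lite Transformer on MT (IWSLT, WMT) and additional tasks (summarization, language modeling) under mobile compute budgets (≤500M Mult-Adds).
Compare against the Transformer baseline and the Evolved Transformer, and analyze compression via pruning and quantization.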
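
The two-branch idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under loud assumptions, not the authors' implementation: the attention branch is single-head with identity projections, the convolution branch shares one fixed kernel across channels instead of using learned lightweight convolutions, and the FFN that fuses the branches is omitted. All function names (`attention_branch`, `conv_branch`, `lsra`) are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_branch(x):
    """Global branch: scaled dot-product self-attention.

    Assumption: queries, keys, and values are the inputs themselves
    (no learned projections), and there is a single head.
    """
    d = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, x)) for c in range(d)])
    return out


def conv_branch(x, kernel):
    """Local branch: depthwise 1-D convolution along the sequence.

    Assumption: one odd-length kernel shared by every channel,
    with zero padding so the sequence length is preserved.
    """
    n, d, half = len(x), len(x[0]), len(kernel) // 2
    out = []
    for t in range(n):
        row = []
        for c in range(d):
            acc = 0.0
            for j, w in enumerate(kernel):
                s = t + j - half
                if 0 <= s < n:
                    acc += w * x[s][c]
            row.append(acc)
        out.append(row)
    return out


def lsra(x, kernel):
    """Split channels in half, run both branches, concatenate the results."""
    half = len(x[0]) // 2
    global_out = attention_branch([row[:half] for row in x])
    local_out = conv_branch([row[half:] for row in x], kernel)
    return [g + l for g, l in zip(global_out, local_out)]


tokens = [[1.0, 0.0, 0.5, 2.0], [0.0, 1.0, 1.5, -1.0], [1.0, 1.0, 0.0, 0.0]]
out = lsra(tokens, kernel=[0.25, 0.5, 0.25])
print(len(out), len(out[0]))  # 3 4
```

Because each branch only sees half of the channels, the attention branch can attend over the whole sequence while the convolution branch looks at a small neighborhood, and the output keeps the input shape so a following FFN could mix the two halves.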

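Experimental results

Research questions

RQ1: Can LSRA improve the efficiency of Transformer-based models under mobile resource constraints without hurting performance on machine translation (MT) and other language tasks?
RQ2: At the same compute budget, how does Lite Transformer perform on MT, summarization, and language modeling compared with the standard Transformer and AutoML-based baselines?
RQ3: What is the effect on model size and performance when Lite Transformer is combined with standard compression techniques (pruning, quantization)?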

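Key results

Lite Transformer improves BLEU over the Transformer across MT benchmarks in the mobile setting: on WMT En-De, +1.2 BLEU at 500M Mult-Adds and +1.7 BLEU at 100M Mult-Adds; on WMT En-Fr, +1.2 BLEU at 500M Mult-Adds and +1.7 BLEU at 100M Mult-Adds.
On IWSLT De-En, Lite Transformer exceeds the Transformer baseline by about 1.6 BLEU or more at roughly 100M Mult-Adds.
On CNN-DailyMail summarization, computation is reduced by up to about 2.4x; in language modeling, perplexity is about 1.8 lower than the Transformer baseline at roughly 500M Mult-Adds.
Combined with pruning and 8-bit quantization, model size can be reduced by up to 18.2x with only minor BLEU degradation on WMT En-Fr.
Compared with the AutoML-based Evolved Transformer, Lite Transformer scores 0.5 BLEU higher on WMT En-De in the mobile setting, without the large search cost (GPU years and CO2 emissions).
Overall, LSRA's specialization into global and local context improves the efficiency and scalability of mobile NLP while matching or exceeding baseline performance.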
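
As a rough illustration of how pruning and 8-bit quantization combine to shrink a model, the sketch below applies magnitude pruning and symmetric uniform quantization to a toy weight list and computes an idealized size ratio. It is not the compression pipeline used in the paper: the function names, the toy weights, the 32-bit floating-point baseline, and the omission of sparse-index storage overhead are all assumptions, and the review does not state which sparsity level produced the 18.2x figure.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def quantize_8bit(weights):
    """Symmetric uniform quantization to signed 8-bit integers.

    Returns the integer codes and the scale needed to dequantize them.
    """
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127
    return [round(w / scale) for w in weights], scale


def compression_ratio(sparsity, bits=8, baseline_bits=32):
    """Idealized size ratio, assuming only nonzero weights are stored
    and ignoring the cost of storing their positions."""
    return baseline_bits / (bits * (1 - sparsity))


weights = [0.9, -0.05, 0.3, 0.01, -0.6, 0.02, 0.15, -0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
codes, scale = quantize_8bit(pruned)
print(pruned)                  # [0.9, 0.0, 0.3, 0.0, -0.6, 0.0, 0.0, -0.4]
print(codes)                   # [127, 0, 42, 0, -85, 0, 0, -56]
print(compression_ratio(0.5))  # 8.0
```

In this idealized model, quantizing from 32 to 8 bits alone gives 4x, and pruning multiplies that by 1 / (1 - sparsity), which shows how the two techniques can compound into double-digit reductions.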

This review was generated by AI and reviewed by a human editor.