QUICK REVIEW

[Paper Review] Speculative Decoding with Big Little Decoder

Sehoon Kim, Karttikeya Mangalam|arXiv (Cornell University)|2023. 02. 15.

Topic Modeling · Citations: 8

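One-Line Summary

BiLD pairs a small model that generates text autoregressively with a large model that is invoked non-autoregressively, and uses fallback and rollback policies to speed up text generation while keeping quality degradation minimal.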

ABSTRACT

The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced.

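Motivation and Goals

Reduce the inference latency of autoregressive text generation with minimal quality loss.
Introduce a plug-and-play framework that coordinates two decoders of different sizes without retraining the base models.
Propose two simple policies (fallback and rollback) that decide when the large model should step in and when earlier predictions should be rolled back.
Demonstrate applicability on machine translation and summarization benchmarks, with an open-source implementation.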

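Proposed Method

Two-model decoding: the small model generates tokens autoregressively, and the large model refines its predictions non-autoregressively.
Fallback policy: when the small model's confidence (max p_S) drops below the threshold alpha_FB, large-model inference is triggered.
Rollback policy: when the large model's prediction at an earlier position deviates from the small model's by more than the threshold alpha_RB, the large model overrides that prediction and the tokens generated after it are rolled back.
Prediction alignment: calibration data is used to align the small model's outputs with the large model's, which reduces unnecessary rollbacks.
Algorithm 1 in the paper details the end-to-end BiLD procedure, including the policy checks at each decoding step.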
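
The fallback and rollback checks described above can be sketched as a small decoding loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `bild_decode`, `peaked`, `toy_small`, and `toy_large` are hypothetical names, the models are toy functions over a 4-token vocabulary, and the deviation is assumed to be the negative log-probability that the large model assigns to the small model's token. The per-position loop stands in for what would be a single parallel forward pass of the large model.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical interface: a model maps a token prefix to next-token probabilities.
Model = Callable[[Sequence[int]], List[float]]
VOCAB = 4


def argmax(probs: Sequence[float]) -> int:
    return max(range(len(probs)), key=probs.__getitem__)


def peaked(target: int, p: float = 0.97) -> List[float]:
    """Distribution with probability p on `target`, the rest spread evenly."""
    rest = (1.0 - p) / (VOCAB - 1)
    return [p if t == target else rest for t in range(VOCAB)]


def bild_decode(small: Model, large: Model, prompt: Sequence[int],
                max_new_tokens: int, alpha_fb: float,
                alpha_rb: float) -> Tuple[List[int], int]:
    tokens = list(prompt)
    checked = len(tokens)  # tokens before this index were already verified
    large_calls = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        p_small = small(tokens)
        if max(p_small) >= alpha_fb:
            # Small model is confident: accept its token without the large model.
            tokens.append(argmax(p_small))
            continue
        # Fallback: confidence fell below alpha_fb, so consult the large model.
        large_calls += 1
        rolled_back = False
        for i in range(checked, len(tokens)):
            p_large = large(tokens[:i])
            # Assumed deviation measure: -log p_L(token chosen by small model).
            deviation = -math.log(max(p_large[tokens[i]], 1e-12))
            if deviation > alpha_rb:
                # Rollback: drop tokens from position i on, use the large model's token.
                tokens = tokens[:i] + [argmax(p_large)]
                rolled_back = True
                break
        if not rolled_back:
            # No earlier token was rejected: the large model predicts the current one.
            tokens.append(argmax(large(tokens)))
        checked = len(tokens)
    return tokens, large_calls


def toy_large(prefix: Sequence[int]) -> List[float]:
    # Toy "large" model: confidently predicts (last + 1) mod VOCAB.
    return peaked((prefix[-1] + 1) % VOCAB)


def toy_small(prefix: Sequence[int]) -> List[float]:
    last = prefix[-1]
    if last == 3:
        return [1.0 / VOCAB] * VOCAB  # unsure: triggers fallback
    if last == 1:
        return peaked(3)              # confidently wrong: caught by rollback
    return peaked((last + 1) % VOCAB)  # otherwise agrees with the large model


tokens, large_calls = bild_decode(toy_small, toy_large, prompt=[0],
                                  max_new_tokens=5, alpha_fb=0.5, alpha_rb=2.0)
print(tokens, large_calls)  # prints [0, 1, 2, 3, 0, 1] 2
```

In this toy run the small model's confident but wrong token after `1` is only corrected once a later low-confidence step triggers the fallback, which is the case the rollback policy is meant to handle. Raising `alpha_fb` or lowering `alpha_rb` makes the large model intervene more often, trading speed for quality.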

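Experimental Results

Research Questions

RQ1: Does letting a small autoregressive model produce most of the output, with occasional refinement by the large model, reduce latency?
RQ2: Do the simple fallback and rollback policies deliver meaningful speedups across tasks at an acceptable loss in quality?
RQ3: Does aligning the independently trained small and large models improve BiLD's performance?
RQ4: Combined with an early-exit strategy, is BiLD competitive with specialized training pipelines?
RQ5: How does BiLD perform on translation and summarization benchmarks compared with fully autoregressive decoding?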

Key Results

Method	BLEU (IWSLT)	Speedup (IWSLT)	BLEU (WMT)	Speedup (WMT)	ROUGE-L (XSUM)	Speedup (XSUM)	ROUGE-L (CNN/DM)	Speedup (CNN/DM)
Vanilla Inference (large)	40.32	-	31.38	-	35.08	-	41.54	-
BiLD (Unaligned)	40.33	1.43x	31.28	1.34x	35.12	1.48x	41.44	1.71x
BiLD (Unaligned) Degraded	39.44	1.58x	30.47	1.43x	34.02	1.72x	40.57	2.05x
BiLD (Aligned)	40.24	1.62x	31.26	1.47x	35.05	1.50x	41.52	1.85x
BiLD (Aligned) Degraded	39.13	1.78x	30.33	1.70x	33.95	1.80x	40.96	2.12x

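BiLD achieves an end-to-end speedup of up to 2.12x, at the cost of roughly 1 point of BLEU/ROUGE-L on some tasks.
Unaligned BiLD averages roughly a 1.50x speedup across the benchmarks, with no quality loss on some tasks; aligned BiLD averages 1.61x, with a maximum of 1.85x.
Prediction alignment improves performance: aligned BiLD delivers higher speedups than unaligned BiLD at comparable quality.
Ablations show that the rollback and fallback policies are each important for preserving the quality and latency benefits.
BiLD can be extended to early-exit settings, where it achieves up to a 1.74x speedup on the MT benchmarks with minimal BLEU loss.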
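
The averages and quality gaps quoted above can be rechecked from the table with a few lines of arithmetic. The sketch below only copies the table values; the names `baseline_quality`, `rows`, and `summarize` are illustrative. Note that the unaligned speedups in the table average to about 1.49x, close to the quoted 1.50x.

```python
from statistics import mean

# Values copied from the results table.
# Column order: IWSLT (BLEU), WMT (BLEU), XSUM (ROUGE-L), CNN/DM (ROUGE-L).
baseline_quality = [40.32, 31.38, 35.08, 41.54]  # Vanilla Inference (large)

rows = {
    "BiLD (Unaligned)": {
        "quality": [40.33, 31.28, 35.12, 41.44],
        "speedup": [1.43, 1.34, 1.48, 1.71],
    },
    "BiLD (Unaligned) Degraded": {
        "quality": [39.44, 30.47, 34.02, 40.57],
        "speedup": [1.58, 1.43, 1.72, 2.05],
    },
    "BiLD (Aligned)": {
        "quality": [40.24, 31.26, 35.05, 41.52],
        "speedup": [1.62, 1.47, 1.50, 1.85],
    },
    "BiLD (Aligned) Degraded": {
        "quality": [39.13, 30.33, 33.95, 40.96],
        "speedup": [1.78, 1.70, 1.80, 2.12],
    },
}


def summarize(row):
    """Average and maximum speedup, plus quality drop versus the baseline."""
    drops = [round(b - q, 2) for b, q in zip(baseline_quality, row["quality"])]
    return {
        "avg_speedup": round(mean(row["speedup"]), 2),
        "max_speedup": max(row["speedup"]),
        "quality_drop": drops,  # positive means worse than vanilla inference
    }


summary = {name: summarize(row) for name, row in rows.items()}
for name, stats in summary.items():
    print(name, stats)
```

The degraded rows lose between 0.58 and 1.19 points relative to vanilla inference, which is where the "roughly 1 point" figure comes from.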

This review was generated by AI and reviewed by a human editor.