QUICK REVIEW

[Paper Review] The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Rui Wei, Rui Du | arXiv (Cornell University) | 2026-03-24

Topic Modeling | Citations: 0

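One-Line Summary

The paper introduces an early-exit adaptability score (EAS) and a benchmark to evaluate how well modern LLMs are suited to layer-wise early-exit decoding, finds that the benefits of early exit diminish in newer models, and analyzes the factors that affect early-exit potential across model families and workloads.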

ABSTRACT

In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model's intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.
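
The confidence-based stopping rule described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `early_exit_layer`, the 0.9 threshold, and the use of the top-1 softmax probability as the confidence measure are all assumptions made for this example. It also assumes that logits for every layer are already available as plain lists.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_layer(per_layer_logits, threshold=0.9):
    """Return (layer_index, token_id) for the first layer that is confident enough.

    Hypothetical helper: confidence is taken to be the top-1 softmax
    probability, and the final layer is always accepted as a fallback.
    """
    if not per_layer_logits:
        raise ValueError("need logits for at least one layer")
    last = len(per_layer_logits) - 1
    for layer, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        token = max(range(len(probs)), key=probs.__getitem__)
        if probs[token] >= threshold or layer == last:
            return layer, token


# Toy example: three layers over a three-token vocabulary.
example = [
    [0.1, 0.2, 0.0],  # layer 0: nearly uniform, not confident
    [0.0, 5.0, 0.0],  # layer 1: token 1 dominates
    [0.0, 9.0, 0.0],  # layer 2 (final)
]
print(early_exit_layer(example))  # (1, 1)
```

With a stricter threshold, the same input runs all the way to the final layer, which is the case where early exit saves nothing.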

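Motivation and Objectives

Assess whether modern LLMs still retain intrinsic suitability for layer-wise early-exit decoding.
Quantify how much speedup early exit can deliver without degrading output quality.
Identify the architectural, training, and workload factors that affect early-exit effectiveness.
Provide a framework for estimating the upper-bound speedup before an early-exit method is implemented.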

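Proposed Method

Defines an early-exit adaptability score (EAS) that combines the skip ratio with layer-to-final similarity.
Proposes a benchmark that includes an oracle early-exit evaluation for estimating the upper-bound speedup.
Computes layer-to-final similarity from hidden states, logits, and top-K token overlap between exit layers and the final layer.
Evaluates a range of open-weight LLMs across architectures (dense, Mixture-of-Experts (MoE), and State Space Models (SSM)) and model generations.
Analyzes how model scale, architecture, training, and workload affect early-exit potential.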
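
The similarity ingredients listed above can be illustrated with a short sketch. The review does not give the paper's exact formulas or how the signals are combined into EAS, so everything here is an assumption: cosine similarity is used as one plausible measure for hidden states (and could equally be applied to logits), top-K overlap is computed as the fraction of shared token ids, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_ids(logits, k):
    """Token ids of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def top_k_overlap(layer_logits, final_logits, k):
    """Fraction of the final layer's top-k tokens that the exit layer also ranks in its top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = top_k_ids(layer_logits, k) & top_k_ids(final_logits, k)
    return len(shared) / k


# Toy values: an intermediate layer compared against the final layer.
hidden_mid, hidden_final = [1.0, 0.0], [1.0, 0.0]
logits_mid, logits_final = [3.0, 2.0, 1.0, 0.0], [3.0, 0.0, 1.0, 2.0]

print(cosine_similarity(hidden_mid, hidden_final))   # 1.0
print(top_k_overlap(logits_mid, logits_final, k=2))  # 0.5
```

Values near 1 on both measures would indicate that the intermediate layer already resembles the final layer, which is the situation in which an early exit is unlikely to change the output.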

Figure 1: Layer-wise early-exit decoding in LLMs.

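Experimental Results

Research Questions

RQ1: Are modern decoder-only LLMs intrinsically suited to layer-wise early exit, and can inter-layer similarity predict end-to-end accuracy under early exit?
RQ2: Which factors (scale, architecture, training, workload) affect a model's ability to support early exit?
RQ3: What is the upper bound on the speedup achievable through early exit for current models and workloads?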
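
As a rough illustration of what an upper-bound estimate for RQ3 could look like, the sketch below implements a simple oracle. It is not the paper's benchmark. It assumes the oracle exits at the earliest layer whose top-1 token matches the final layer's top-1 token, that every layer costs the same, and that the exit decision itself is free. The function names are hypothetical.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def oracle_exit_depth(per_layer_logits):
    """Number of layers an idealized oracle must run for one token.

    Assumption: the oracle stops at the earliest layer whose top-1 token
    equals the final layer's top-1 token. The final layer always matches
    itself, so the loop always returns.
    """
    final_token = argmax(per_layer_logits[-1])
    for depth, logits in enumerate(per_layer_logits, start=1):
        if argmax(logits) == final_token:
            return depth


def skip_ratio_and_speedup(tokens):
    """Average fraction of layers skipped and the implied speedup bound.

    `tokens` is a list with one entry per generated token; each entry holds
    the logits of every layer for that token.
    """
    num_layers = len(tokens[0])
    depths = [oracle_exit_depth(t) for t in tokens]
    mean_depth = sum(depths) / len(depths)
    skip_ratio = 1.0 - mean_depth / num_layers
    speedup = num_layers / mean_depth
    return skip_ratio, speedup


# Toy example: two tokens, four layers, two-token vocabulary.
token_a = [[1, 0], [0, 1], [0, 1], [0, 1]]  # matches the final prediction at layer 2
token_b = [[1, 0], [1, 0], [1, 0], [0, 1]]  # matches only at the final layer
print(skip_ratio_and_speedup([token_a, token_b]))
```

In this toy case the oracle runs 2 and 4 of 4 layers, so a quarter of the layer computations are skipped. A stricter oracle could additionally require that the prediction stays unchanged in all later layers, which would lower the bound.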

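Key Findings

Early-exit effectiveness tends to decline in newer model generations, suggesting that layer redundancy has decreased in modern LLMs.
Dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models.
Larger models, particularly those with more than 20B parameters, tend to show greater early-exit potential.
Continued pretraining and post-training tuning tend to reduce early-exit suitability.
Early-exit patterns are largely model-specific, and the workload has only a weak effect.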

Figure 2: The trend of relative early-exit scores (§3.3) in recent LLMs and models specifically tuned for early-exit, compared to Llama2-7B. We explain the model selection details in Appendix B.

This review was generated by AI and reviewed by a human editor.