QUICK REVIEW

[Paper Review] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Christopher R. Snell, Jae‐Hoon Lee | arXiv (Cornell University) | 2024-08-06

Citations: 17

One-Line Summary

This paper analyzes how to optimally allocate test-time compute for LLMs, showing that a compute-optimal strategy can outperform best-of-N baselines and that, in a FLOPs-matched setting, effective use of test-time compute can beat a much larger model.

ABSTRACT

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.

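Research Motivation and Goals

Motivate the use of additional test-time computation to improve LLM outputs on challenging prompts.
Unify proposal-distribution refinement and verifier-based search as mechanisms for test-time computation.
Introduce a compute-optimal scaling strategy that allocates compute adaptively per prompt.
Evaluate how test-time compute compares with scaling pretraining in a FLOPs-matched setting.
Test whether a smaller model given test-time compute, with no additional pretraining, can outperform a larger model.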

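Proposed Method

A model-agnostic formulation that treats test-time compute as modifying the model's output distribution for a given prompt.
Two main mechanisms are compared: (i) modifying the proposal distribution through sequential or parallel generation, and (ii) search against a process-based verifier, i.e., a process reward model (PRM).
The PRM is trained without human labels, using per-step correctness estimates obtained from Monte Carlo rollouts of the base model.
Three search methods against the PRM are evaluated: best-of-N weighted, beam search, and lookahead search.
The compute-optimal strategy is defined as choosing the hyperparameters that maximize accuracy on a given prompt under a fixed compute budget.
Prompts are binned into five difficulty levels, using either model-predicted or oracle difficulty, and the bins guide how compute is allocated.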
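Two of the ideas in the list above, PRM-weighted best-of-N selection and difficulty-binned strategy choice, can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes that "weighted" means summing verifier scores over samples that share a final answer, and the function names, the quantile-binning rule, the "higher score means easier" convention, and the layout of `accuracy_table` are all hypothetical. PRM scores are taken as given inputs rather than computed.

```python
import bisect
from collections import defaultdict


def best_of_n_weighted(samples):
    """Return the final answer with the largest summed verifier score.

    samples: list of (final_answer, prm_score) pairs for one prompt.
    Assumption: scores of samples with the same final answer are added up.
    """
    if not samples:
        raise ValueError("need at least one sample")
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)


def difficulty_bin(score, sorted_scores, num_bins=5):
    """Map a per-prompt difficulty proxy to one of num_bins quantile bins.

    score: hypothetical proxy where higher means easier
        (for example, an estimated success rate).
    sorted_scores: reference proxy values, sorted ascending.
    Returns 0 for the hardest bin and num_bins - 1 for the easiest.
    """
    if not sorted_scores:
        raise ValueError("need reference scores to define quantiles")
    rank = bisect.bisect_left(sorted_scores, score)
    return min(num_bins - 1, rank * num_bins // len(sorted_scores))


def compute_optimal_strategy(accuracy_table, bin_index, budget):
    """Pick the strategy with the best held-out accuracy for a bin and budget.

    accuracy_table: hypothetical layout
        {(bin_index, budget): {strategy_name: accuracy}}
    """
    options = accuracy_table[(bin_index, budget)]
    return max(options, key=options.get)
```

For example, with samples `[("42", 0.9), ("41", 0.5), ("41", 0.5)]`, `best_of_n_weighted` returns `"41"`: its summed score of 1.0 exceeds 0.9, even though `"42"` has the single highest-scoring sample. Selecting the strategy per bin from held-out accuracies is one simple way to approximate the per-prompt choice that the compute-optimal definition calls for.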


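Experimental Results

Research Questions

RQ1: Under a given budget, can test-time compute be allocated optimally per prompt to maximize accuracy?
RQ2: How do different test-time strategies (revision-based vs. PRM-based search) scale with prompt difficulty and compute budget?
RQ3: Does compute-optimal test-time compute outperform the best-of-N baseline, and by how much?
RQ4: In a FLOPs-matched setting, can a smaller model with test-time compute outperform a substantially larger model?
RQ5: What are the practical benefits and limitations of difficulty-conditioned compute allocation for test-time strategies?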

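Key Results

Compute-optimal scaling can outperform best-of-N while using about 4× less test-time compute, for both revisions and PRM search.
The effectiveness of PRM-based search depends on difficulty: beam search tends to do better on harder prompts and at lower budgets, while best-of-N can do better on easier prompts at higher budgets.
On easy to medium prompts, and under certain conditions, test-time compute can outperform a 14× larger model in a FLOPs-matched comparison.
Revision-based proposals improve with longer revision chains, suggesting that the model learns from earlier mistakes in its context.
Difficulty estimation enables adaptive allocation that approaches or matches the best strategy across prompt types.
As the budget grows, returns tend to diminish as search over-optimizes against the verifier's signal.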
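The arithmetic behind a FLOPs-matched comparison like the one in the results above can be sketched as follows. The sketch assumes the common rule-of-thumb estimates of about 6·N·D FLOPs for pretraining and 2·N·D FLOPs for inference, where N is the parameter count and D the number of tokens. Those approximations, the function name, and the parameter names are assumptions for illustration and are not taken from this review.

```python
def matched_inference_multiplier(m, d_pretrain, d_inference):
    """Inference-compute multiplier that equalizes total FLOPs.

    Assumed cost model (rule of thumb, not from the review):
        pretraining FLOPs ~ 6 * N * d_pretrain
        inference FLOPs   ~ 2 * N * d_inference
    A model with m times the parameters costs m times as much on both terms.
    Solving
        6*N*d_pretrain + k * 2*N*d_inference
            = m * (6*N*d_pretrain + 2*N*d_inference)
    for k gives k = m + 3 * (d_pretrain / d_inference) * (m - 1).
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    if d_pretrain < 0 or d_inference <= 0:
        raise ValueError("token counts must be non-negative, inference > 0")
    return m + 3 * (d_pretrain / d_inference) * (m - 1)
```

Under these assumptions, with m = 14 and equal pretraining and inference token counts, the multiplier is 14 + 3 × 13 = 53. When pretraining tokens far outnumber inference tokens, the multiplier grows, so the smaller model can spend a larger test-time budget before the two totals match.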


This review was generated by AI and reviewed by a human editor.