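[Paper Review] Not All Steps are Informative: On the Linearity of LLMs' RLVR Training
The paper shows that weight updates and token log-probabilities evolve in a strongly linear way during RLVR training, and it proposes extrapolation-based acceleration methods that match or exceed standard RLVR with less compute. It introduces Weight Extrapolation, Logit Extrapolation, and RL-Extra to speed up training.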
Reinforcement learning with verifiable rewards (RLVR) has become a central component of large language model (LLM) post-training. Unlike supervised fine-tuning (SFT), RLVR lets an LLM generate multiple candidate solutions and reinforces those that lead to a verifiably correct final answer. However, in practice, RLVR often requires thousands of training steps to reach strong performance, incurring substantial computation largely attributed to prolonged exploration. In this work, we make a surprising observation: during RLVR, LLMs evolve in a strongly linear manner. Specifically, both model weights and model output log-probabilities exhibit strong linear correlations with RL training steps. This suggests that RLVR predominantly amplifies trends that emerge early in training, rather than continuously discovering new behaviors throughout the entire optimization trajectory. Motivated by this linearity, we investigate whether future model states can be predicted from intermediate checkpoints via extrapolation, avoiding continued expensive training. We show that Weight Extrapolation produces models with performance comparable to standard RL training while requiring significantly less computation. Moreover, Logits Extrapolation consistently outperforms continued RL training on mathematics and code benchmarks by extrapolating beyond the step range where RL training remains stable. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity
Research Motivation and Goals
- Identify and quantify linear trends in weight updates during RLVR training across diverse models and algorithms.
- Analyze linearity in model output log-probabilities and logits over RL steps.
- Theoretically explain the origin of observed linearity in weights and outputs.
- Develop fast extrapolation-based methods to accelerate RLVR without sacrificing performance.
- Propose an intermittent training scheme (RL-Extra) that maintains performance while reducing wall-clock time.
Proposed Method
- Perform linear regression of sampled weights across RLVR checkpoints to measure R^2 and assess weight linearity across models and algorithms.
- Analyze token log-probabilities and logits across checkpoints to assess output linearity.
- Theoretically explain how linear weight changes can lead to linear output changes in transformer layers.
- Develop Logit Extrapolation and Weight Extrapolation formulas to predict future states from two past checkpoints (Equations 1 and 2).
- Introduce RL-Extra as a cycle of m RL steps followed by n extrapolation steps to balance gradient updates and extrapolation (Equation 3).
- Evaluate on DeepScaleR-Preview with a 1.5B base model using AIME-24/25, MATH-500, and LiveCodeBench benchmarks.
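The weight-linearity measurement in the first bullet can be sketched as follows. This is a minimal illustration and not the authors' code: the function name `weight_r2`, the array layout, and the choice to report R^2 = 0 for weights that never change are all assumptions made for this example. It fits an ordinary least-squares line to each sampled weight as a function of the training step and returns the per-weight R^2.

```python
import numpy as np


def weight_r2(steps, weights):
    """Per-weight R^2 of a least-squares line fitted against training step.

    steps:   shape (T,), the training step of each checkpoint
    weights: shape (T, P), P sampled scalar weights, one row per checkpoint
    Returns an array of shape (P,).
    """
    steps = np.asarray(steps, dtype=np.float64)
    weights = np.asarray(weights, dtype=np.float64)

    s = steps - steps.mean()              # centered steps
    w = weights - weights.mean(axis=0)    # centered weights, per column

    slope = (s @ w) / (s @ s)             # closed-form OLS slope, shape (P,)
    resid = w - np.outer(s, slope)        # residuals of the fitted lines
    ss_res = (resid ** 2).sum(axis=0)
    ss_tot = (w ** 2).sum(axis=0)

    # Assumption: a weight that never changes has no variance to explain,
    # so its R^2 is reported as 0 instead of dividing by zero.
    safe_tot = np.where(ss_tot > 0, ss_tot, 1.0)
    return np.where(ss_tot > 0, 1.0 - ss_res / safe_tot, 0.0)


if __name__ == "__main__":
    # Toy checkpoints: one weight drifting linearly, one with no trend.
    demo_steps = [0, 100, 200, 300]
    demo_weights = np.array([
        [0.10, 1.0],
        [0.12, -1.0],
        [0.14, -1.0],
        [0.16, 1.0],
    ])
    print(weight_r2(demo_steps, demo_weights))
```

In practice the rows would come from the same sampled parameter indices read out of each saved checkpoint. The same function applies unchanged to token log-probabilities if each column holds one token's log-probability across checkpoints.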
Experimental Results
Research Questions
- RQ1: Do RLVR training steps produce strong linear trends in model weights across diverse base models and RL algorithms?
- RQ2: Do model outputs, including token log-probabilities and logits, exhibit linear evolution over RLVR training steps?
- RQ3: Can extrapolation of weights or logits preserve or improve performance compared with continued RLVR training?
- RQ4: Does interleaving RL steps with weight extrapolation (RL-Extra) achieve comparable performance with less compute?
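To make RQ3 concrete, here is a hedged sketch of two-point linear extrapolation applied to weights and to logits. The review does not reproduce the paper's Equations 1 and 2, so the formula used here (continue the straight line through two checkpoints) is an assumed form, and the names `extrapolate_linear`, `extrapolate_weights`, and `extrapolate_logits` are hypothetical.

```python
import numpy as np


def extrapolate_linear(x_a, x_b, step_a, step_b, step_target):
    """Continue the straight line through (step_a, x_a) and (step_b, x_b)."""
    if step_a == step_b:
        raise ValueError("the two checkpoints must come from different steps")
    alpha = (step_target - step_b) / (step_b - step_a)
    return x_b + alpha * (x_b - x_a)


def extrapolate_weights(ckpt_a, ckpt_b, step_a, step_b, step_target):
    """Apply the same linear rule to every tensor in a checkpoint dict."""
    return {
        name: extrapolate_linear(ckpt_a[name], ckpt_b[name],
                                 step_a, step_b, step_target)
        for name in ckpt_b
    }


def extrapolate_logits(logits_a, logits_b, step_a, step_b, step_target):
    """Extrapolate logits from two checkpoints, then normalize with softmax."""
    z = extrapolate_linear(np.asarray(logits_a, dtype=np.float64),
                           np.asarray(logits_b, dtype=np.float64),
                           step_a, step_b, step_target)
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)


if __name__ == "__main__":
    # Toy checkpoints at steps 100 and 200, extrapolated to step 400.
    a = {"layer.weight": np.array([1.0, 2.0])}
    b = {"layer.weight": np.array([2.0, 2.5])}
    print(extrapolate_weights(a, b, 100, 200, 400))
    print(extrapolate_logits([0.0, 0.0], [1.0, 0.0], 100, 200, 300))
```

The two variants differ in cost at inference time under this sketch: weight extrapolation yields a single new model, whereas logit extrapolation needs forward passes through both checkpoints for every decoded token.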
Key Results
- Most sampled weights tracked across RLVR checkpoints show R^2 > 0.7, with distributions concentrated around 0.9, indicating strong weight linearity.
- Token log-probabilities also show strong linear correlations with training steps, with R^2 around 0.9.
- Logits extrapolation consistently improves performance on math and code benchmarks over standard RL within the extrapolation horizon, avoiding late-stage instability.
- Weight extrapolation is reliable only up to a limited horizon, with the best gains at moderate extrapolation steps.
- RL-Extra matches standard RL performance while delivering up to 6.1× wall-clock speedup across settings.
- Direct extrapolation methods achieve up to 3% performance gains over baselines on select benchmarks.
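The RL-Extra cycle (m RL steps followed by n extrapolation steps) can be illustrated with a toy schedule. This is a sketch under stated assumptions, not the paper's Equation 3: the `rl_step` callback is a hypothetical stand-in for one real RL update, and the jump direction is assumed to be the average per-step change over the m RL steps of the current cycle.

```python
import numpy as np


def rl_extra(theta0, rl_step, m, n, total_steps):
    """Alternate m gradient-based steps with one n-step linear jump.

    rl_step(theta) -> theta after one RL update (hypothetical callback).
    Returns (final theta, number of rl_step calls actually made).
    """
    if m < 1 or n < 0:
        raise ValueError("need m >= 1 and n >= 0")
    theta = np.asarray(theta0, dtype=np.float64)
    step, rl_calls = 0, 0
    while step < total_steps:
        start = theta.copy()
        k = min(m, total_steps - step)        # RL steps in this cycle
        for _ in range(k):
            theta = rl_step(theta)
            rl_calls += 1
        step += k
        j = min(n, total_steps - step)        # steps covered by extrapolation
        if j > 0:
            # Assumed rule: reuse the mean per-step update of this cycle.
            theta = theta + (j / k) * (theta - start)
            step += j
    return theta, rl_calls


if __name__ == "__main__":
    # Toy dynamics: every RL step adds a constant drift, i.e. perfect linearity.
    drift = lambda t: t + 0.1
    theta, calls = rl_extra([0.0], drift, m=2, n=4, total_steps=12)
    print(theta, calls, 12 / calls)
```

With perfectly linear dynamics the schedule reaches the same point as 12 plain RL steps while running only 4 of them. The ratio `total_steps / rl_calls` is an idealized reduction in gradient steps that treats extrapolation as free. It is not the measured wall-clock speedup reported in the results.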
This review was generated by AI and checked by a human editor.