[Paper Review] Training Language Models to Self-Correct via Reinforcement Learning
The paper introduces SCoRe, a two-stage on-policy multi-turn reinforcement learning method that trains a single LLM to self-correct its own mistakes using self-generated data, achieving state-of-the-art intrinsic self-correction on math and code tasks.
Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.
Research Motivation and Goals
- Motivate and quantify the gap in intrinsic self-correction for current LLMs and show limitations of existing SFT and offline RL methods.
- Develop a self-correction framework that learns from the model’s own correction traces without external feedback or teacher signals.
- Propose SCoRe, a two-stage RL method with reward shaping to enable robust, test-time self-correction.
- Demonstrate that SCoRe improves self-correction performance on math (MATH) and coding (HumanEval, MBPP) benchmarks over strong baselines.
Proposed Method
- Analyze failure modes of supervised fine-tuning and naive RL for self-correction, including distribution shift and behavior collapse.
- Introduce SCoRe, whose Stage I RL produces an initialization that decouples the first and second attempts: the second attempt is optimized for reward while the first-turn distribution is constrained to stay close to the base model.
- Apply Stage II multi-turn RL with a reward shaping bonus to incentivize progress toward self-correction and avoid collapsing to non-correcting behavior.
- Use on-policy data generated by the model itself, with an oracle correctness reward as the training signal and a KL penalty to control distribution drift.
- Evaluate on MATH (MATH500) and code datasets (MBPP, HumanEval), comparing to Self-Refine, STaR, and Pair-SFT baselines.
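The Stage II reward shaping described above can be illustrated with a small sketch. The review does not spell out the exact form of the bonus, so this assumes one plausible form: the second-attempt reward plus a term proportional to the change in correctness between attempts. The function name, the binary rewards, and the weight `alpha` are all hypothetical choices made for illustration.

```python
def shaped_second_attempt_reward(r1: float, r2: float, alpha: float) -> float:
    """Return the second-attempt reward plus a progress bonus.

    r1, r2: correctness rewards of the first and second attempts (1.0 = correct).
    alpha:  hypothetical bonus weight; not a value taken from the review.
    """
    bonus = alpha * (r2 - r1)  # positive when the revision fixes an error
    return r2 + bonus


if __name__ == "__main__":
    alpha = 2.0  # illustrative value only
    cases = {
        "incorrect -> correct": (0.0, 1.0),
        "correct -> correct": (1.0, 1.0),
        "correct -> incorrect": (1.0, 0.0),
        "incorrect -> incorrect": (0.0, 0.0),
    }
    for name, (r1, r2) in cases.items():
        print(f"{name}: {shaped_second_attempt_reward(r1, r2, alpha)}")
```

Under this assumed form, fixing a mistake earns more than merely repeating a correct answer, and breaking a correct answer is penalized, which is the kind of incentive that discourages collapse to non-correcting behavior.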
Experimental Results
Research Questions
- RQ1: Can intrinsic self-correction be achieved by training a single LLM on its own self-generated traces without external feedback?
- RQ2: Do SFT or offline RL approaches suffer from distribution shift and behavior collapse when teaching self-correction?
- RQ3: Does a two-stage RL framework with reward shaping stabilize learning and produce meaningful self-correction strategies?
- RQ4: To what extent does SCoRe improve self-correction on mathematical reasoning and code generation benchmarks compared to prior methods?
Key Results
- On MATH, SCoRe improves accuracy by 4.4% absolute from the first attempt to the second, the first significantly positive Δ(t1, t2); relative to the base Gemini model, this is a 15.6% improvement in self-correction.
- SCoRe achieves 64.4% acc.@t2 on MATH with a 4.4% Δ(t1,t2) and 60.0% acc.@t1, surpassing Self-Refine, STaR, and Pair-SFT baselines on MATH.
- On HumanEval, SCoRe attains 52.4% acc.@t1 and 64.6% acc.@t2, a 12.2% Δ(t1,t2), outperforming several baselines in intrinsic self-correction.
- Compared to online multi-turn RL alone, Stage I initialization reduces behavior collapse and Stage II reward shaping promotes progress toward self-correction.
- Across both domains, SCoRe improves the base models' self-correction by 15.6% on MATH and 9.1% on HumanEval, and compares favorably to multiple prior approaches.
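The metrics quoted above follow from per-problem correctness of the two attempts: Δ(t1, t2) is acc.@t2 minus acc.@t1, which matches the MATH numbers (60.0% + 4.4% = 64.4%). The following is a minimal sketch of that bookkeeping. The function name and the toy correctness lists are hypothetical, and the breakdown into fixed and broken problems is an added illustration rather than something reported in this review.

```python
from typing import Dict, Sequence


def self_correction_metrics(
    first: Sequence[bool], second: Sequence[bool]
) -> Dict[str, float]:
    """Compute acc.@t1, acc.@t2 and their difference, all in percent.

    first[i] / second[i] say whether attempt 1 / attempt 2 solved problem i.
    """
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("need two non-empty lists of equal length")
    n = len(first)
    acc_t1 = 100.0 * sum(first) / n
    acc_t2 = 100.0 * sum(second) / n
    # Problems fixed by the revision and problems broken by it.
    fixed = 100.0 * sum((not a) and b for a, b in zip(first, second)) / n
    broken = 100.0 * sum(a and (not b) for a, b in zip(first, second)) / n
    return {
        "acc_t1": acc_t1,
        "acc_t2": acc_t2,
        "delta": acc_t2 - acc_t1,  # equals fixed - broken
        "incorrect_to_correct": fixed,
        "correct_to_incorrect": broken,
    }


if __name__ == "__main__":
    # Toy data: five problems, not taken from the paper.
    t1 = [True, False, False, True, False]
    t2 = [True, True, False, False, True]
    print(self_correction_metrics(t1, t2))
```

Because Δ equals the fraction fixed minus the fraction broken, a model can show a negative Δ even when it fixes some errors, as long as it breaks more correct answers than it repairs.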
This review was generated by AI and reviewed by a human editor.