QUICK REVIEW

[Paper Review] Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang | arXiv (Cornell University) | Sep 19, 2024

Speech and dialogue systems | Cited by 6

ONE-SENTENCE SUMMARY

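This paper proposes SCoRe, a two-stage, on-policy, multi-turn reinforcement learning method that trains a single LLM to correct its own mistakes using self-generated data, achieving state-of-the-art intrinsic self-correction on math and code tasks.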

ABSTRACT

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

RESEARCH MOTIVATION AND GOALS

Motivate and quantify the gap in intrinsic self-correction for current LLMs and show limitations of existing SFT and offline RL methods.
Develop a self-correction framework that learns from the model’s own correction traces without external feedback or teacher signals.
Propose SCoRe, a two-stage RL method with reward shaping to enable robust, test-time self-correction.
Demonstrate that SCoRe improves self-correction performance on math (MATH) and coding (HumanEval, MBPP) benchmarks over strong baselines.

PROPOSED METHOD

Analyze failure modes of supervised fine-tuning and naive RL for self-correction, including distribution shift and behavior collapse.
Introduce SCoRe, whose Stage I uses RL to produce an initialization in which the first and second attempts are decoupled, constraining the first-turn response to stay close to the base model's.
Apply Stage II multi-turn RL with a reward shaping bonus to incentivize progress toward self-correction and avoid collapsing to non-correcting behavior.
Use on-policy data generated by the model itself, with an oracle reward for evaluation and a KL penalty to control distribution drift.
Evaluate on MATH (MATH500) and code datasets (MBPP, HumanEval), comparing to Self-Refine, STaR, and Pair-SFT baselines.
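The reward shaping and KL penalty described in the list above can be illustrated with a small sketch. This summary does not state the exact formulas, so the code assumes a common form: the second attempt's reward gets a bonus proportional to the change in correctness between attempts, and a KL term is subtracted from the reward. The names `shaped_rewards`, `kl_penalized`, `alpha`, and `beta` are hypothetical and chosen only for illustration.

```python
def shaped_rewards(r1: float, r2: float, alpha: float = 2.0) -> tuple[float, float]:
    """Return (first_turn_reward, second_turn_reward) for one two-attempt rollout.

    r1, r2: correctness rewards of the first and second attempt (e.g. 0.0 or 1.0).
    alpha:  hypothetical bonus weight (an assumption, not a value from the source).

    The bonus alpha * (r2 - r1) is positive when the second attempt fixes a wrong
    first attempt, negative when it breaks a correct one, and zero when nothing changes.
    """
    bonus = alpha * (r2 - r1)
    return r1, r2 + bonus


def kl_penalized(reward: float, kl: float, beta: float = 0.1) -> float:
    """Subtract a KL penalty from a reward to discourage drift from a reference policy.

    kl:   an estimate of the KL divergence between the trained and reference policy.
    beta: hypothetical penalty weight (an assumption for illustration).
    """
    return reward - beta * kl


if __name__ == "__main__":
    # Enumerate the four possible correctness outcomes of a two-attempt rollout.
    for r1, r2 in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
        print((r1, r2), "->", shaped_rewards(r1, r2))
```

Under this assumed form, a policy that simply repeats its first answer earns no bonus, whereas a genuine incorrect-to-correct revision earns more than being right on the first try alone. This matches the stated intent of rewarding progress and discouraging collapse to non-correcting behavior.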

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can intrinsic self-correction be achieved by training a single LLM on its own self-generated traces without external feedback?
RQ2: Do SFT or offline RL approaches suffer from distribution shift and behavior collapse when teaching self-correction?
RQ3: Does a two-stage RL framework with reward shaping stabilize learning and produce meaningful self-correction strategies?
RQ4: To what extent does SCoRe improve self-correction on mathematical reasoning and code generation benchmarks compared to prior methods?

KEY FINDINGS

SCoRe yields a 4.4% absolute gain from the first to the second attempt on MATH, the first significantly positive Δ(t1, t2); the abstract reports this as a 15.6% improvement in self-correction over the base Gemini model.
SCoRe achieves 64.4% acc.@t2 on MATH with a 4.4% Δ(t1,t2) and 60.0% acc.@t1, surpassing Self-Refine, STaR, and Pair-SFT baselines on MATH.
On HumanEval, SCoRe attains a 12.2% Δ(t1,t2), rising from 52.4% acc.@t1 to 64.6% acc.@t2, outperforming several baselines in intrinsic self-correction.
Compared to online multi-turn RL alone, Stage I initialization reduces behavior collapse and Stage II reward shaping promotes progress toward self-correction.
SCoRe demonstrates strong self-correction gains on both math and coding tasks, with favorable comparisons to multiple prior approaches.
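The metrics quoted above (acc.@t1, acc.@t2, and Δ(t1,t2)) can be computed from per-problem correctness flags. The following is a minimal sketch, assuming Δ(t1,t2) is simply second-attempt accuracy minus first-attempt accuracy. The function name `self_correction_metrics` and the two extra fields (`fixed`, `broken`) are illustrative additions, not taken from the source.

```python
from typing import Sequence


def self_correction_metrics(
    first_correct: Sequence[bool], second_correct: Sequence[bool]
) -> dict[str, float]:
    """Compute two-attempt accuracy metrics from per-problem correctness flags.

    first_correct[i] / second_correct[i] say whether attempt 1 / attempt 2
    solved problem i. All returned values are fractions in [-1, 1].
    """
    if len(first_correct) != len(second_correct) or len(first_correct) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first_correct)
    acc_t1 = sum(first_correct) / n
    acc_t2 = sum(second_correct) / n
    # Problems that went from incorrect to correct, and from correct to incorrect.
    fixed = sum((not a) and b for a, b in zip(first_correct, second_correct)) / n
    broken = sum(a and (not b) for a, b in zip(first_correct, second_correct)) / n
    return {
        "acc_t1": acc_t1,
        "acc_t2": acc_t2,
        "delta": acc_t2 - acc_t1,  # equals fixed - broken
        "fixed": fixed,
        "broken": broken,
    }


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the paper.
    first = [True, False, False, True, False]
    second = [True, True, False, False, True]
    metrics = self_correction_metrics(first, second)
    print({name: round(value, 3) for name, value in metrics.items()})
```

Reporting `fixed` and `broken` alongside the net change shows whether a positive Δ comes from many real corrections or merely from few regressions, which the single number hides.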

This review was generated by AI and reviewed by human editors.