QUICK REVIEW

[Paper Review] Training Language Models to Self-Correct via Reinforcement Learning

Aviral Kumar, Vincent Zhuang|arXiv (Cornell University)|Sep 19, 2024

Speech and dialogue systems|Cited by 6

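One-line summary

This paper introduces SCoRe, a two-stage, on-policy, multi-turn reinforcement learning method. It trains a single LLM on self-generated data to correct its own mistakes, and achieves state-of-the-art intrinsic self-correction on math and code tasks.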

ABSTRACT

Self-correction is a highly desirable capability of large language models (LLMs), yet it has consistently been found to be largely ineffective in modern LLMs. Current methods for training self-correction typically depend on either multiple models, a more advanced model, or additional forms of supervision. To address these shortcomings, we develop a multi-turn online reinforcement learning (RL) approach, SCoRe, that significantly improves an LLM's self-correction ability using entirely self-generated data. To build SCoRe, we first show that variants of supervised fine-tuning (SFT) on offline model-generated correction traces are often insufficient for instilling self-correction behavior. In particular, we observe that training via SFT falls prey to either a distribution mismatch between mistakes made by the data-collection policy and the model's own responses, or to behavior collapse, where learning implicitly prefers only a certain mode of correction behavior that is often not effective at self-correction on test problems. SCoRe addresses these challenges by training under the model's own distribution of self-generated correction traces and using appropriate regularization to steer the learning process into learning a self-correction behavior that is effective at test time as opposed to fitting high-reward responses for a given prompt. This regularization process includes an initial phase of multi-turn RL on a base model to generate a policy initialization that is less susceptible to collapse, followed by using a reward bonus to amplify self-correction. With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

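Motivation and objectives

Motivate and quantify the intrinsic self-correction gap in current LLMs, and show the limitations of existing SFT and offline RL approaches.
Develop a self-correction framework that learns from the model's own correction traces, without external feedback or teacher signals.
Propose SCoRe, a two-stage RL method whose reward shaping makes self-correction work robustly at test time.
Show that SCoRe outperforms strong baselines in self-correction on math (MATH) and coding (HumanEval, MBPP) benchmarks.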

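Proposed method

Analyze the failure modes of supervised fine-tuning and naive RL for self-correction, identifying distribution shift and behavior collapse among the main challenges.
Introduce SCoRe, which uses Stage I RL to produce an initialization that decouples the first and second attempts while constraining the first turn to stay close to the base model.
Apply Stage II multi-turn RL with a reward-shaping bonus that encourages progress toward self-correction and avoids collapse into non-correcting behavior.
Train on on-policy data generated by the model itself, using an oracle reward to score responses and a KL penalty to limit distribution drift.
Evaluate on MATH (MATH500) and on code datasets (MBPP, HumanEval), comparing against the Self-Refine, STaR, and Pair-SFT baselines.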
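
The reward shaping described above can be illustrated with a small sketch. This review does not state the exact form of the bonus, so the version below assumes one natural choice: the second-attempt reward plus a bonus proportional to the change in correctness between attempts, minus a KL penalty. The names `TwoTurnRollout` and `shaped_reward`, the per-rollout `kl_to_base` value, and the default `alpha` and `beta` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TwoTurnRollout:
    """One on-policy rollout: a first attempt and a revised second attempt."""
    first_correct: bool   # oracle verdict on attempt 1
    second_correct: bool  # oracle verdict on attempt 2
    kl_to_base: float     # estimated KL to the base model (assumed to be given)


def shaped_reward(rollout: TwoTurnRollout, alpha: float = 2.0, beta: float = 0.1) -> float:
    """Second-attempt reward plus a progress bonus, minus a KL penalty.

    alpha and beta are illustrative values, not settings from the paper.
    """
    r1 = 1.0 if rollout.first_correct else 0.0
    r2 = 1.0 if rollout.second_correct else 0.0
    # Positive when a wrong answer is fixed, negative when a right answer is broken.
    bonus = alpha * (r2 - r1)
    return r2 + bonus - beta * rollout.kl_to_base


# The four possible outcomes, with the KL term set to zero for clarity.
for first, second in [(False, True), (True, True), (False, False), (True, False)]:
    print(first, second, shaped_reward(TwoTurnRollout(first, second, kl_to_base=0.0)))
```

Under these assumptions, fixing a wrong first answer earns more than simply repeating a correct one, and breaking a correct answer is penalized. That ordering is what discourages the policy from collapsing into leaving its first attempt unchanged.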

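Experimental results

Research questions

RQ1: Can a single LLM achieve intrinsic self-correction when trained only on self-generated traces, without external feedback?
RQ2: Do SFT and offline RL suffer from distribution shift and behavior collapse when used to teach self-correction?
RQ3: Does a two-stage RL framework with reward shaping stabilize training and produce meaningful self-correction strategies?
RQ4: How much does SCoRe improve self-correction in mathematical reasoning and code generation compared with prior methods?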

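Key findings

On MATH, SCoRe improves accuracy by an absolute 4.4% between the first and second attempts, the first significantly positive Δ(t1, t2) reported; the abstract puts the improvement over the base Gemini model's self-correction at 15.6%.
On MATH, SCoRe reaches acc.@t1 of 60.0%, acc.@t2 of 64.4%, and Δ(t1, t2) of 4.4%, outperforming the Self-Refine, STaR, and Pair-SFT baselines.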
On HumanEval, SCoRe achieves acc.@t1 of 52.4% and acc.@t2 of 64.6%, a Δ(t1, t2) of 12.2%, outperforming several intrinsic self-correction baselines.
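Compared with online multi-turn RL alone, the Stage I initialization reduces behavior collapse, and the Stage II reward shaping encourages progress toward self-correction.
SCoRe shows strong self-correction gains on both math and coding tasks and compares favorably with several prior approaches.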
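
The metrics quoted above are related by simple arithmetic: Δ(t1, t2) is the second-attempt accuracy minus the first-attempt accuracy, so the MATH figures give 64.4% − 60.0% = 4.4%. The sketch below computes all three from per-problem correctness flags. The function name `self_correction_metrics` and the toy data are hypothetical and are not results from the paper.

```python
def self_correction_metrics(first_correct, second_correct):
    """Return acc.@t1, acc.@t2, and their difference, all in percent.

    Both arguments are equal-length sequences of booleans, one entry per
    problem, giving the oracle verdict on the first and second attempts.
    """
    if len(first_correct) != len(second_correct) or len(first_correct) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first_correct)
    acc_t1 = 100.0 * sum(first_correct) / n
    acc_t2 = 100.0 * sum(second_correct) / n
    return {"acc_t1": acc_t1, "acc_t2": acc_t2, "delta": acc_t2 - acc_t1}


# Toy data: five problems, one of which is fixed on the second attempt.
first = [True, False, True, False, False]
second = [True, True, True, False, False]
print(self_correction_metrics(first, second))
```

A positive `delta` means the second attempt helps on balance; a negative value means revision breaks more answers than it fixes.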

This review was generated by AI and checked by a human editor.