QUICK REVIEW

[Paper Review] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Tianzhe Chu, Yuexiang Zhai|ArXiv.org|Jan 28, 2025

Multimodal Machine Learning Applications | Citations: 6

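One-Line Summary

This paper presents a comparative study showing that reinforcement learning (RL) as a post-training method generalizes better than supervised fine-tuning (SFT) on both rule-based text tasks and visual tasks, while SFT supports RL training by stabilizing the model's output format.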

ABSTRACT

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

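Research Motivation and Objectives

Investigate how SFT and RL post-training affect generalization and memorization in foundation models.
Evaluate generalization in both the rule-based textual domain and the visual domain, using text and image inputs.
Assess whether RL improves rule-based reasoning and visual recognition beyond in-distribution data.
Examine the role of SFT in RL training and the effect of the number of verification iterations on generalization.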

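Proposed Method

Use a multi-turn RL framework that incorporates a verifier to obtain outcome-based rewards.
Post-train the backbone model (Llama-3.2-Vision-11B) with SFT before applying RL.
Evaluate two tasks, GeneralPoints and V-IRL, each in a pure-language version and a vision-language version.
Introduce sequential revision, in which the model's previous outputs and the verifier's results are included in the input.
Analyze how rule variations (e.g., the numeric mapping of J/Q/K) and visual variants affect generalization.
Incorporate an outcome-based verifier that guides RL with textual feedback and rewards.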
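
To make the outcome-based verifier and the J/Q/K rule variants concrete, here is a minimal Python sketch of a GeneralPoints-style checker. It is an illustration under stated assumptions, not the authors' implementation: the rule names, the `verify` function, the reward values (1.0, 0.0, -1.0), and the feedback strings are all hypothetical. The point is that the same equation can be accepted under one face-card rule and rejected under another, which is what an out-of-distribution rule variant tests.

```python
import ast
from collections import Counter
from fractions import Fraction

# Hypothetical names for two face-card rule variants (assumed for illustration).
RULES = {
    "faces_as_10": {"J": 10, "Q": 10, "K": 10},
    "faces_as_11_12_13": {"J": 11, "Q": 12, "K": 13},
}


def card_values(cards, rule):
    """Map card symbols such as 'A', '7', 'J' to integers under a rule."""
    table = {"A": 1, **RULES[rule]}
    return [table[c] if c in table else int(c) for c in cards]


def _eval(node):
    """Evaluate a parsed arithmetic expression.

    Returns (exact value, list of integer literals used).
    Only +, -, *, / on integer literals are allowed.
    """
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return Fraction(node.value), [node.value]
    if isinstance(node, ast.BinOp):
        left, used_left = _eval(node.left)
        right, used_right = _eval(node.right)
        if isinstance(node.op, ast.Add):
            value = left + right
        elif isinstance(node.op, ast.Sub):
            value = left - right
        elif isinstance(node.op, ast.Mult):
            value = left * right
        elif isinstance(node.op, ast.Div):
            if right == 0:
                raise ValueError("division by zero")
            value = left / right
        else:
            raise ValueError("unsupported operator")
        return value, used_left + used_right
    raise ValueError("unsupported syntax")


def verify(equation, cards, rule, target=24):
    """Return (reward, feedback). Reward values are illustrative assumptions."""
    try:
        tree = ast.parse(equation, mode="eval")
        value, used = _eval(tree.body)
    except (SyntaxError, ValueError):
        return -1.0, "illegal equation"
    # Each card must be used exactly once, under the active rule.
    if Counter(used) != Counter(card_values(cards, rule)):
        return -1.0, "numbers do not match the cards"
    if value != target:
        return 0.0, f"result is {value}, not {target}"
    return 1.0, "correct"


cards = ["J", "2", "2", "A"]
print(verify("(10 + 2) * 2 * 1", cards, "faces_as_10"))
print(verify("(10 + 2) * 2 * 1", cards, "faces_as_11_12_13"))
print(verify("(11 + 2 - 1) * 2", cards, "faces_as_11_12_13"))
```

The reward depends only on the final equation (the outcome), and the feedback string is the kind of textual signal that can be fed back to the model in the next turn.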

Figure 1: A comparative study of RL and SFT on the visual navigation environment V-IRL (Yang et al., 2024a) for OOD generalization. OOD curves represent performance on the same task, using a different textual action space. See detailed descriptions of the task in Section 5.1.

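Experimental Results

Research Questions

RQ1: Does RL generalize better than SFT to unseen rule variants in text tasks and to visual variants in multimodal tasks?
RQ2: Compared with SFT, how does RL affect the visual recognition capabilities of vision-language models (VLMs)?
RQ3: What role does SFT play in enabling effective RL training of foundation models?
RQ4: How much does the number of verification iterations affect RL's generalization performance?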

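Key Findings

RL generalizes in both rule-based textual and visual environments, improving OOD performance on all tasks.
SFT memorizes the training rules and degrades OOD performance across all evaluated tasks and variants.
RL improves the visual recognition capabilities of VLMs, which contributes to better generalization in the visual domain.
SFT stabilizes the model's output format, enabling RL to achieve its performance gains.
Scaling up inference-time verification (more verification steps) improves RL's generalization.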
On visual OOD tasks, RL improves performance by +17.6% on GP-VL and +61.1% on V-IRL-VL, whereas SFT shows a decline.

Figure 2: An example of the sequential revision formulation with a verifier. The model generates the next answer $\mathbf{v}^{\text{out}}_{t+1}$ conditioned on all previous answers and information $(\mathbf{v}^{\text{out}}_{i},\mathbf{v}^{\text{ver}}_{i},0\leq i\leq t)$ from the verifier.
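
The sequential revision formulation in Figure 2 can be sketched as a simple loop: at each step the prompt contains the task plus every earlier answer and verifier message, and the model produces a new answer. The sketch below is an assumption-laden illustration, not the paper's code. `policy` and `verifier` are hypothetical callables standing in for the model and the environment's verifier, `max_steps` stands in for the number of verification iterations, and stopping at the first positive reward is an assumed design choice.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: a policy maps a prompt to an answer,
# a verifier maps an answer to (reward, textual feedback).
Policy = Callable[[str], str]
Verifier = Callable[[str], Tuple[float, str]]


def build_prompt(task: str, history: List[Tuple[str, str]]) -> str:
    """Concatenate the task with every earlier (answer, verifier message) pair."""
    lines = [task]
    for i, (answer, message) in enumerate(history):
        lines.append(f"[attempt {i}] answer: {answer}")
        lines.append(f"[attempt {i}] verifier: {message}")
    return "\n".join(lines)


def sequential_revision(task: str, policy: Policy, verifier: Verifier,
                        max_steps: int = 5):
    """Run up to max_steps revision turns; stop early on a positive reward."""
    history: List[Tuple[str, str]] = []
    rewards: List[float] = []
    for _ in range(max_steps):
        answer = policy(build_prompt(task, history))
        reward, message = verifier(answer)
        history.append((answer, message))
        rewards.append(reward)
        if reward > 0:  # assumed stopping rule
            break
    return history, rewards


# Toy stand-ins so the loop can be run end to end.
def toy_policy(prompt: str) -> str:
    # Revises its answer only after it has seen verifier feedback.
    return "3 * 8" if "verifier:" in prompt else "3 + 8"


def toy_verifier(answer: str) -> Tuple[float, str]:
    return (1.0, "correct") if answer == "3 * 8" else (0.0, "result is not 24")


history, rewards = sequential_revision("Make 24 from 3 and 8.", toy_policy, toy_verifier)
print(history)
print(rewards)
```

With the toy stand-ins, the first attempt fails, the verifier message is appended to the prompt, and the second attempt succeeds, so the loop ends after two turns.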


This review was generated by AI and checked by a human editor.