QUICK REVIEW

[Paper Review] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Aojun Lu, Tao Feng | arXiv (Cornell University) | Feb 11, 2026
Multimodal Machine Learning Applications | Citations: 0
One-Line Summary

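The paper attributes RL's better generalization over SFT to an implicit data-filtering effect that emphasizes medium-difficulty samples. It introduces DC-SFT, an explicit difficulty-based data-filtering method, and claims that DC-SFT surpasses RL in OOD generalization while improving training stability and efficiency.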

ABSTRACT

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.

Motivation and Objectives

  • Investigate why RL-based post-training generalizes better than SFT for Vision-Language Models (VLMs).
  • Test whether data difficulty influences ID and OOD performance under SFT.
  • Propose a simple data-curation method (DC-SFT) to improve SFT generalization.
  • Demonstrate DC-SFT’s effectiveness across multiple models and tasks, including reasoning benchmarks.

Proposed Method

  • Define a data difficulty taxonomy (easy, medium, hard) based on model-consensus correctness across multiple responses per prompt.
  • Evaluate SFT models trained on subsets of data filtered by difficulty (easy/medium/hard) to assess ID and OOD performance.
  • Propose DC-SFT variants: SFT-M (train on medium-difficulty only) and SFT-EM (train on easy and medium with hard removed).
  • Compare DC-SFT to standard SFT and RL-based GRPO under LoRA and full fine-tuning setups.
  • Assess training stability and efficiency, including training-time comparisons and gradient-dynamics analyses.
  • Extend evaluation to reasoning-focused test data (MMK12, MMMU, WeMath, MathVerse, MathVista, MathVision) for test-time scaling insights.
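
The taxonomy and the two DC-SFT variants listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Sample` class, the function names, and the thresholds (hard = no sampled response correct, easy = every sampled response correct) are assumptions made for this example. The actual criteria should be taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Sample:
    """Hypothetical container: one prompt plus per-response correctness flags."""
    prompt: str
    correct_flags: List[bool]  # one flag per response sampled from the model


def pass_rate(sample: Sample) -> float:
    """Fraction of sampled responses that were judged correct."""
    if not sample.correct_flags:
        raise ValueError("sample has no responses to score")
    return sum(sample.correct_flags) / len(sample.correct_flags)


def difficulty(sample: Sample, hard_max: float = 0.0, easy_min: float = 1.0) -> str:
    """Bucket a sample by pass rate. The default thresholds are assumptions."""
    rate = pass_rate(sample)
    if rate <= hard_max:
        return "hard"
    if rate >= easy_min:
        return "easy"
    return "medium"


def curate(samples: Iterable[Sample], keep: Set[str]) -> List[Sample]:
    """Keep only samples whose difficulty label is in `keep`."""
    return [s for s in samples if difficulty(s) in keep]


# Toy data: four sampled responses per prompt.
pool = [
    Sample("prompt-a", [True, True, True, True]),      # easy
    Sample("prompt-b", [True, False, True, False]),    # medium
    Sample("prompt-c", [False, False, False, False]),  # hard
]

sft_m_set = curate(pool, {"medium"})           # SFT-M: medium only
sft_em_set = curate(pool, {"easy", "medium"})  # SFT-EM: hard removed
```

The curated subsets would then be passed to an ordinary SFT pipeline unchanged; only the data selection differs from standard SFT.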
Figure 1: (a) RL implicitly focuses updates on medium-difficulty samples that yield high reward variance. (b) ID and OOD performance after SFT on data subsets of varying difficulty levels.

Experimental Results

Research Questions

  • RQ1: Does training on medium-difficulty data improve OOD generalization compared to easy or hard data under SFT?
  • RQ2: Can explicit filtering of hard data (DC-SFT) surpass RL-based generalization (GRPO) in OOD tasks?
  • RQ3: Is DC-SFT more stable and efficient than RL during post-training of VLMs?
  • RQ4: Do DC-SFT gains extend to reasoning-oriented tasks and test-time scaling scenarios?

Key Findings

  • RL’s generalization advantage can be attributed to implicit focus on medium-difficulty samples, which yield informative gradients.
  • Hard data improves ID performance but significantly harms OOD generalization when used in SFT.
  • Medium-difficulty data provides balanced gains for ID and preserves or slightly improves OOD performance.
  • DC-SFT (SFT-M or SFT-EM) consistently outperforms standard SFT and RL baselines on average OOD metrics across datasets and model sizes.
  • DC-SFT delivers substantial efficiency gains over RL (GRPO) and maintains or improves OOD/test-time reasoning performance on reasoning benchmarks.
  • Training with hard samples tends to produce larger gradient norms and instability during SFT, contributing to degraded OOD generalization.
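
One way to see why medium-difficulty prompts dominate RL updates is through GRPO's group-normalized advantage, which standardizes each response's reward against the other responses sampled for the same prompt. The sketch below assumes binary correctness rewards and uses hypothetical helper names; it illustrates the mechanism and is not code from the paper.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the sampled group
    return [(r - mean) / (std + eps) for r in rewards]


def bernoulli_reward_variance(p: float) -> float:
    """Variance of a 0/1 reward when a response is correct with probability p."""
    return p * (1.0 - p)


# Easy prompt: every sampled response is correct, so all advantages are zero.
easy_adv = group_advantages([1.0, 1.0, 1.0, 1.0])

# Hard prompt: every sampled response is wrong, so all advantages are zero too.
hard_adv = group_advantages([0.0, 0.0, 0.0, 0.0])

# Medium prompt: mixed outcomes give non-zero advantages of both signs.
medium_adv = group_advantages([1.0, 0.0, 1.0, 0.0])
```

When every response to a prompt receives the same reward, the advantages vanish and that prompt contributes no policy-gradient signal. A 0/1 reward with success rate p has variance p(1 - p), which is largest at p = 0.5, so prompts the model solves only some of the time carry the most weight in the update.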
Figure 2: (a) Illustrative examples of the data difficulty taxonomy. (b) Illustrative examples of generalization evaluation benchmarks for image classification (top) and visual grounding (bottom).

This review was generated by AI and checked by a human editor.