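[Paper Review] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training
The paper argues that RL generalizes better than SFT because of an implicit data-filtering effect that emphasizes medium-difficulty samples. It introduces DC-SFT, a data-filtering method that the authors claim surpasses RL in OOD generalization while improving training stability and efficiency.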
The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.
Research Motivation and Objectives
- Investigate why RL-based post-training generalizes better than SFT for Vision-Language Models (VLMs).
- Test whether data difficulty influences ID and OOD performance under SFT.
- Propose a simple data-curation method (DC-SFT) to improve SFT generalization.
- Demonstrate DC-SFT’s effectiveness across multiple models and tasks, including reasoning benchmarks.
Proposed Method
- Define a data difficulty taxonomy (easy, medium, hard) based on model-consensus correctness across multiple responses per prompt.
- Evaluate SFT models trained on subsets of data filtered by difficulty (easy/medium/hard) to assess ID and OOD performance.
- Propose DC-SFT variants: SFT-M (train on medium-difficulty only) and SFT-EM (train on easy and medium with hard removed).
- Compare DC-SFT to standard SFT and RL-based GRPO under LoRA and full fine-tuning setups.
- Assess training stability and efficiency, including training-time comparisons and gradient-dynamics analyses.
- Extend evaluation to reasoning-focused test data (MMK12, MMMU, WeMath, MathVerse, MathVista, MathVision) for test-time scaling insights.
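
The difficulty taxonomy and the two DC-SFT variants described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the review only states that difficulty is derived from correctness across multiple responses per prompt, so the thresholds (all responses correct = easy, none correct = hard, anything in between = medium), the `correct_flags` field, and the function names are all assumptions made for this example.

```python
from typing import Dict, List


def difficulty_label(
    correct_flags: List[bool],
    easy_min: float = 1.0,   # assumed threshold: every sampled response correct
    hard_max: float = 0.0,   # assumed threshold: no sampled response correct
) -> str:
    """Label a prompt by the fraction of sampled responses that are correct."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    pass_rate = sum(correct_flags) / len(correct_flags)
    if pass_rate >= easy_min:
        return "easy"
    if pass_rate <= hard_max:
        return "hard"
    return "medium"


def curate(samples: List[Dict], variant: str = "SFT-EM") -> List[Dict]:
    """Keep only the difficulty buckets used by the chosen DC-SFT variant."""
    keep = {
        "SFT-M": {"medium"},           # medium-difficulty only
        "SFT-EM": {"easy", "medium"},  # hard samples removed
    }[variant]
    return [s for s in samples if difficulty_label(s["correct_flags"]) in keep]


# Hypothetical toy data: four sampled responses per prompt.
samples = [
    {"id": "a", "correct_flags": [True, True, True, True]},
    {"id": "b", "correct_flags": [True, False, True, False]},
    {"id": "c", "correct_flags": [False, False, False, False]},
]
sft_m_ids = [s["id"] for s in curate(samples, "SFT-M")]
sft_em_ids = [s["id"] for s in curate(samples, "SFT-EM")]
```

With the toy data, `sft_m_ids` keeps only prompt `b` and `sft_em_ids` keeps `a` and `b`. In practice the thresholds would need tuning to the number of sampled responses and to the model used to generate them.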

Experimental Results
Research Questions
- RQ1: Does training on medium-difficulty data improve OOD generalization compared to easy or hard data under SFT?
- RQ2: Can explicit filtering of hard data (DC-SFT) surpass RL-based generalization (GRPO) in OOD tasks?
- RQ3: Is DC-SFT more stable and efficient than RL during post-training of VLMs?
- RQ4: Do DC-SFT gains extend to reasoning-oriented tasks and test-time scaling scenarios?
Key Findings
- RL’s generalization advantage can be attributed to implicit focus on medium-difficulty samples, which yield informative gradients.
- Hard data improves ID performance but significantly harms OOD generalization when used in SFT.
- Medium-difficulty data provides balanced gains for ID and preserves or slightly improves OOD performance.
- DC-SFT (SFT-M or SFT-EM) consistently outperforms standard SFT and RL baselines on average OOD metrics across datasets and model sizes.
- DC-SFT delivers substantial efficiency gains over RL (GRPO) and maintains or improves OOD/test-time reasoning performance on reasoning benchmarks.
- Training with hard samples tends to produce larger gradient norms and instability during SFT, contributing to degraded OOD generalization.
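
One way to see why RL would implicitly focus on medium-difficulty samples, as the first finding states, is through GRPO's group-normalized advantage: when every response in a group receives the same reward, the advantages vanish and the prompt contributes no policy-gradient signal. The sketch below illustrates this with binary rewards. It is a simplified illustration under stated assumptions (population standard deviation, a small `eps` for numerical safety, a hypothetical function name), and whether the paper frames its argument in exactly these terms is not confirmed by this review.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Easy prompt: all sampled responses correct, so every advantage is zero.
easy_adv = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
# Hard prompt: all sampled responses wrong, so every advantage is zero.
hard_adv = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
# Medium prompt: mixed outcomes, so the advantages are non-zero.
medium_adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Only the mixed-outcome group yields non-zero advantages, which is consistent with the claim that medium-difficulty samples carry the informative gradients.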

This review was generated by AI and checked by a human editor.