QUICK REVIEW

[Paper Walkthrough] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Aojun Lu, Tao Feng|arXiv (Cornell University)|Feb 11, 2026
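Multimodal Machine Learning Applications | Cited by 0
One-Sentence Summary

The paper argues that RL generalizes better than SFT because of an implicit data-filtering effect that emphasizes medium-difficulty samples. It introduces DC-SFT, a data-filtering method that outperforms RL on OOD generalization while improving training stability and efficiency.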

ABSTRACT

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.

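Motivation and Objectives

  • Investigate why RL-based post-training gives Vision-Language Models (VLMs) better generalization than SFT.
  • Test whether data difficulty affects ID and OOD performance under SFT.
  • Propose a simple data-curation method (DC-SFT) to improve SFT generalization.
  • Demonstrate the effectiveness of DC-SFT across multiple models and tasks, including reasoning benchmarks.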

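Proposed Method

  • Define a data difficulty taxonomy (easy, medium, hard) by evaluating each prompt on the consensus correctness of the model across multiple sampled responses.
  • Evaluate the ID and OOD performance of SFT models trained on subsets of different difficulty (easy/medium/hard).
  • Propose two DC-SFT variants: SFT-M (train on medium-difficulty data only) and SFT-EM (train on easy and medium data, with hard data removed).
  • Compare DC-SFT against standard SFT and the RL baseline GRPO under both LoRA and full fine-tuning settings.
  • Assess training stability and efficiency, including training-time comparisons and an analysis of gradient dynamics.
  • Extend the evaluation to reasoning-oriented benchmarks (MMK12, MMMU, WeMath, MathVerse, MathVista, MathVision) for insight into test-time scaling.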
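
The difficulty taxonomy and the two DC-SFT variants can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Sample` type, the function names, and the thresholds (easy means every sampled response is correct, hard means none is, medium is everything in between) are hypothetical choices made here. The summary does not state the actual cut-offs or the number of responses sampled per prompt.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Sample:
    """Hypothetical container: a prompt plus per-response correctness flags."""
    prompt: str
    correct: List[bool]  # one flag per sampled model response


def pass_rate(sample: Sample) -> float:
    """Fraction of sampled responses that were judged correct."""
    if not sample.correct:
        raise ValueError("need at least one sampled response")
    return sum(sample.correct) / len(sample.correct)


def difficulty(sample: Sample, easy_min: float = 1.0, hard_max: float = 0.0) -> str:
    """Bucket a sample by pass rate.

    ASSUMPTION: the default thresholds are illustrative, not taken from the paper.
    """
    rate = pass_rate(sample)
    if rate >= easy_min:
        return "easy"
    if rate <= hard_max:
        return "hard"
    return "medium"


def curate(samples: List[Sample], keep: Set[str]) -> List[Sample]:
    """Keep only the samples whose difficulty bucket is listed in `keep`."""
    return [s for s in samples if difficulty(s) in keep]


def dc_sft_subsets(samples: List[Sample]) -> Dict[str, List[Sample]]:
    """Build the training sets for the two variants named in the summary."""
    return {
        "SFT-M": curate(samples, {"medium"}),           # medium only
        "SFT-EM": curate(samples, {"easy", "medium"}),  # hard data removed
    }
```

Passing a list of `Sample` objects to `dc_sft_subsets` returns one filtered list per variant, and either list could then be handed to an ordinary SFT training loop. Keeping the thresholds as parameters makes it easy to test other bucket boundaries.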
Figure 1 : (a) RL implicitly focuses updates on medium-difficulty samples that yield high reward variance. (b) ID and OOD performance after SFT on data subsets of varying difficulty levels.

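Experimental Results

Research Questions

  • RQ1: Under SFT, does training on medium-difficulty data benefit OOD generalization more than training on easy or hard data?
  • RQ2: Can explicitly filtering out hard data (DC-SFT) surpass RL-based generalization (GRPO) on OOD tasks?
  • RQ3: Is DC-SFT more stable and efficient than RL during VLM post-training?
  • RQ4: Do the gains from DC-SFT extend to reasoning-oriented tasks and test-time scaling scenarios?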

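Key Findings

  • RL's generalization advantage can be attributed to an implicit focus on medium-difficulty samples, which provide more informative gradients.
  • Under SFT, hard data improves ID performance but significantly degrades OOD generalization.
  • Medium-difficulty data yields balanced ID gains while maintaining or slightly improving OOD performance.
  • DC-SFT (SFT-M or SFT-EM) consistently outperforms both standard SFT and the RL baseline on average OOD metrics across datasets and model scales.
  • DC-SFT is substantially more efficient than RL (GRPO) while maintaining or improving OOD and reasoning performance on reasoning benchmarks.
  • Training on hard samples tends to produce larger gradient norms during SFT, causing instability that in turn weakens OOD generalization.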
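
A toy calculation helps show why RL may concentrate its updates on medium-difficulty prompts. The sketch below assumes binary (0/1) rewards and the commonly used group-normalized advantage, (reward minus group mean) divided by group standard deviation. The function names and the `eps` constant are illustrative, and the paper's exact GRPO configuration is not given in this summary.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps).

    ASSUMPTION: this is the commonly used normalization, not necessarily
    the exact variant used in the paper.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def binary_reward_variance(p: float) -> float:
    """Variance of a 0/1 reward whose success probability is p."""
    return p * (1.0 - p)


# Every response correct (easy) or every response wrong (hard): the rewards
# are identical, so each advantage is zero and the prompt adds no signal.
easy_group = group_advantages([1.0, 1.0, 1.0, 1.0])
hard_group = group_advantages([0.0, 0.0, 0.0, 0.0])

# Mixed outcomes (medium): advantages are non-zero, so the prompt drives updates.
medium_group = group_advantages([1.0, 0.0, 1.0, 0.0])
```

For binary rewards the variance `p * (1 - p)` is zero at `p = 0` and `p = 1` and peaks at `p = 0.5`, which matches the description of RL focusing on medium-difficulty samples with high reward variance.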
Figure 2 : (a) Illustrative examples of the data difficulty taxonomy. (b) Illustrative examples of generalization evaluation benchmarks for image classification (top) and visual grounding (bottom).

This walkthrough was generated by AI and reviewed by a human editor.