QUICK REVIEW

[Paper Digest] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Aojun Lu, Tao Feng | arXiv (Cornell University) | Feb 11, 2026

Multimodal Machine Learning Applications | Cited by 0

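One-Sentence Summary

The paper argues that RL generalizes better than SFT because of an implicit data-filtering effect that emphasizes medium-difficulty samples. It introduces DC-SFT, a data-filtering method that outperforms RL on OOD generalization while improving training stability and efficiency.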

ABSTRACT

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.

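Motivation and Objectives

Investigate why RL-based post-training gives Vision-Language Models (VLMs) better generalization than SFT.
Verify whether data difficulty affects ID and OOD performance under SFT.
Propose a simple data-curation method (DC-SFT) to improve SFT generalization.
Demonstrate the effectiveness of DC-SFT across multiple models and tasks, including reasoning benchmarks.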

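Proposed Method

Define a data-difficulty taxonomy (easy, medium, hard) by scoring each prompt on the consensus correctness of multiple model responses.
Evaluate the ID and OOD performance of SFT models trained on each difficulty subset (easy/medium/hard).
Propose two DC-SFT variants: SFT-M (trained on medium-difficulty data only) and SFT-EM (trained on easy and medium data, with hard data removed).
Compare DC-SFT with standard SFT and with the RL baseline GRPO under both LoRA and full fine-tuning settings.
Assess training stability and efficiency, including a training-time comparison and an analysis of gradient dynamics.
Extend the evaluation to reasoning-oriented test data (MMK12, MMMU, WeMath, MathVerse, MathVista, MathVision) for insight into test-time scaling.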
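
The difficulty taxonomy and the DC-SFT filtering step can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` type, the function names, and in particular the two thresholds (easy = every sampled response correct, hard = none correct) are assumptions, since the digest does not state the exact cut-offs or the number of responses sampled per prompt.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Assumed thresholds; the digest does not give the actual cut-offs.
EASY_MIN_PASS_RATE = 1.0   # assumption: all sampled responses are correct
HARD_MAX_PASS_RATE = 0.0   # assumption: no sampled response is correct


@dataclass
class Sample:
    """Hypothetical container for one training example."""
    prompt: str
    answer: str


def pass_rate(responses: Sequence[str], answer: str,
              is_correct: Callable[[str, str], bool]) -> float:
    """Fraction of sampled model responses judged correct for one prompt."""
    if not responses:
        raise ValueError("need at least one sampled response")
    return sum(is_correct(r, answer) for r in responses) / len(responses)


def difficulty(rate: float) -> str:
    """Map a pass rate to 'easy', 'medium', or 'hard'."""
    if rate >= EASY_MIN_PASS_RATE:
        return "easy"
    if rate <= HARD_MAX_PASS_RATE:
        return "hard"
    return "medium"


def curate(samples: Sequence[Sample],
           sampled_responses: Sequence[Sequence[str]],
           is_correct: Callable[[str, str], bool],
           keep: Sequence[str] = ("easy", "medium")) -> List[Sample]:
    """Keep only samples whose difficulty label is in `keep`.

    keep=("easy", "medium") mirrors SFT-EM; keep=("medium",) mirrors SFT-M.
    """
    kept = []
    for sample, responses in zip(samples, sampled_responses):
        label = difficulty(pass_rate(responses, sample.answer, is_correct))
        if label in keep:
            kept.append(sample)
    return kept
```

In this sketch the curated list would then be passed to an ordinary SFT pipeline unchanged; the only difference from standard SFT is which samples survive the filter.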

Figure 1: (a) RL implicitly focuses updates on medium-difficulty samples that yield high reward variance. (b) ID and OOD performance after SFT on data subsets of varying difficulty levels.
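
One way to see why high reward variance matters is the group-normalized advantage commonly used in GRPO-style training. The sketch below assumes binary correctness rewards and the usual mean/standard-deviation normalization within a group of responses to the same prompt; the function name and the `eps` constant are illustrative, and the paper's exact RL configuration is not given in this digest.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    If every response gets the same reward (all correct or all wrong),
    each advantage is exactly zero, so the prompt contributes no
    policy-gradient signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


easy_prompt = group_advantages([1.0, 1.0, 1.0, 1.0])    # always solved
hard_prompt = group_advantages([0.0, 0.0, 0.0, 0.0])    # never solved
medium_prompt = group_advantages([1.0, 0.0, 1.0, 0.0])  # sometimes solved
```

Under these assumptions only the medium-difficulty prompt produces non-zero advantages, which is the implicit filtering effect described in panel (a).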

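Experimental Results

Research Questions

RQ1: Under SFT, does training on medium-difficulty data benefit OOD generalization more than training on easy or hard data?
RQ2: Can explicitly filtering out hard data (DC-SFT) surpass RL-based generalization (GRPO) on OOD tasks?
RQ3: Is DC-SFT more stable and efficient than RL during VLM post-training?
RQ4: Do the gains from DC-SFT extend to reasoning-oriented tasks and test-time scaling scenarios?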

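Key Findings

RL's generalization advantage can be attributed to its implicit focus on medium-difficulty samples, which provide more informative gradients.
Under SFT, hard data improves ID performance but significantly degrades OOD generalization.
Medium-difficulty data delivers balanced ID gains while maintaining or slightly improving OOD performance.
DC-SFT (SFT-M or SFT-EM) consistently outperforms standard SFT and the RL baseline on average OOD metrics across datasets and model scales.
DC-SFT is markedly more efficient than RL (GRPO) while matching or improving OOD and reasoning performance on the reasoning benchmarks.
Training on hard samples tends to produce larger gradient norms during SFT, causing instability that in turn weakens OOD generalization.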
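
The last finding has a simple single-token analogue. For softmax cross-entropy, the gradient with respect to the logits is the predicted distribution minus the one-hot target, so its norm grows as the model assigns less probability to the target. The toy sketch below illustrates only that identity; it is not the paper's gradient-dynamics analysis, which concerns parameter gradients of a full VLM, and the helper names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ce_grad_norm(logits: List[float], target: int) -> float:
    """L2 norm of d(cross-entropy)/d(logits), which equals softmax - one_hot."""
    probs = softmax(logits)
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad))


# Target token is index 0 in both cases.
confident = ce_grad_norm([4.0, 0.0, 0.0], target=0)  # model already favors the target
mistaken = ce_grad_norm([0.0, 4.0, 0.0], target=0)   # model favors a wrong token
```

Here `mistaken` is much larger than `confident`, matching the intuition that samples the model currently gets wrong push harder on the weights.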

Figure 2: (a) Illustrative examples of the data difficulty taxonomy. (b) Illustrative examples of generalization evaluation benchmarks for image classification (top) and visual grounding (bottom).


This digest was generated by AI and reviewed by human editors.