QUICK REVIEW

[Paper Review] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

Paul K. Chu, Yuexiang Zhai | arXiv | Jan 28, 2025

Multimodal Machine Learning Applications | Cited by 6

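One-Sentence Summary

This paper presents a comparative study showing that, on text-based rule benchmarks and visual tasks, reinforcement learning (RL) generalizes better than supervised fine-tuning (SFT) in the post-training stage, while SFT helps RL training and improves output stability.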

ABSTRACT

Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. However, their roles in enhancing model generalization capabilities remain unclear. This paper studies the difference between SFT and RL on generalization and memorization, focusing on text-based rule variants and visual variants. We introduce GeneralPoints, an arithmetic reasoning card game, and adopt V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both textual and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants. SFT, in contrast, tends to memorize training data and struggles to generalize to out-of-distribution scenarios. Further analysis reveals that RL improves the model's underlying visual recognition capabilities, contributing to its enhanced generalization in the visual domain. Despite RL's superior generalization, we show that SFT remains essential for effective RL training; SFT stabilizes the model's output format, enabling subsequent RL to achieve its performance gains. These findings demonstrate the capability of RL for acquiring generalizable knowledge in complex, multi-modal tasks.

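Motivation and Objectives

Study how SFT and RL post-training affect the generalization and memorization of foundation models.
Evaluate generalization on text-based rule benchmarks and in the visual domain, using both text and image inputs.
Assess whether RL improves rule reasoning and visual recognition beyond in-distribution data.
Examine the role of SFT in RL training and the effect of verification iterations on generalization.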

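Proposed Method

Use a multi-turn RL framework with a verifier to obtain outcome-based rewards.
Post-train the backbone model (Llama-3.2-Vision-11B) with SFT before applying RL.
Evaluate on two tasks, GeneralPoints and V-IRL, each with pure-language and vision-language variants.
Introduce sequential revision, in which the input includes the previous outputs and verification results.
Analyze how rule variants (such as the J/Q/K mapping) and visual variants affect generalization.
Pair the policy with an outcome-based verifier that guides RL through textual feedback and rewards.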
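To make the verifier and the J/Q/K rule variants concrete, here is a minimal sketch of an outcome-based verifier for a GeneralPoints-style card game. It is not the paper's implementation: the names `FACE_AS_TEN`, `FACE_AS_RANK`, `card_values`, and `verify`, the target of 24, the +1/-1 reward values, and the feedback strings are all assumptions chosen for illustration.

```python
import ast
from collections import Counter
from fractions import Fraction

# Two hypothetical face-card rules (names and values are illustrative).
FACE_AS_TEN = {"J": 10, "Q": 10, "K": 10}
FACE_AS_RANK = {"J": 11, "Q": 12, "K": 13}


def card_values(cards, face_rule):
    """Map card symbols such as 'A', '7', 'K' to integers under a face-card rule."""
    values = []
    for card in cards:
        if card == "A":
            values.append(1)
        elif card in face_rule:
            values.append(face_rule[card])
        else:
            values.append(int(card))
    return values


def _evaluate(node, used):
    """Evaluate a restricted arithmetic AST exactly; record integer literals in `used`."""
    if isinstance(node, ast.Expression):
        return _evaluate(node.body, used)
    if isinstance(node, ast.Constant) and type(node.value) is int:
        used.append(node.value)
        return Fraction(node.value)
    if isinstance(node, ast.BinOp):
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        if isinstance(node.op, ast.Add):
            return left + right
        if isinstance(node.op, ast.Sub):
            return left - right
        if isinstance(node.op, ast.Mult):
            return left * right
        if isinstance(node.op, ast.Div):
            return left / right  # ZeroDivisionError if right == 0
    raise ValueError("unsupported expression")


def verify(answer, cards, face_rule, target=24):
    """Return (reward, feedback) for a proposed equation.

    The reward scale (+1 / -1) and the feedback wording are assumptions.
    """
    try:
        used = []
        result = _evaluate(ast.parse(answer, mode="eval"), used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return -1.0, "illegal formula"
    if Counter(used) != Counter(card_values(cards, face_rule)):
        return -1.0, "numbers do not match the cards"
    if result != target:
        return -1.0, f"result is {result}, not {target}"
    return 1.0, "correct"


cards = ["K", "7", "4", "3"]
in_rule = verify("10 + 7 + 4 + 3", cards, FACE_AS_TEN)
out_rule = verify("10 + 7 + 4 + 3", cards, FACE_AS_RANK)
```

The same answer is accepted when K counts as 10 and rejected when K counts as 13, which is the kind of rule shift the out-of-distribution evaluation probes: a model that memorized the training mapping keeps producing the first answer.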

Figure 1: A comparative study of RL and SFT on the visual navigation environment V-IRL (Yang et al., 2024a) for OOD generalization. OOD curves represent performance on the same task, using a different textual action space. See detailed descriptions of the task in Section 5.1.

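Experimental Results

Research Questions

RQ1: In textual tasks, does RL generalize better than SFT to unseen rule variants, and does the same hold for visual variants in multimodal tasks?
RQ2: Compared with SFT, how does RL affect the visual recognition capabilities of vision-language models (VLMs)?
RQ3: What role does SFT play in enabling effective RL training of a foundation model?
RQ4: How does the number of verification iterations affect the generalization performance of RL?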

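Key Findings

RL generalizes on both the text-based rule benchmarks and the visual environments, improving OOD performance across tasks.
SFT memorizes the training rules and degrades OOD performance across all evaluated tasks and variants.
RL improves the visual recognition capabilities of VLMs, which contributes to better generalization in the visual domain.
SFT stabilizes the model's output format, enabling RL to achieve its performance gains.
Scaling up verification at inference time (more verification steps) improves the generalization of RL.
In GP-VL, RL achieves gains of +17.6% to +61.1% on visual OOD tasks, whereas SFT shows a decline.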

Figure 2: An example of the sequential revision formulation with a verifier. The model generates the next answer $\mathbf{v}^{\text{out}}_{t+1}$ conditioned on all previous answers and information $(\mathbf{v}^{\text{out}}_{i},\mathbf{v}^{\text{ver}}_{i},0\leq i\leq t)$ from the verifier.
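The sequential revision formulation in Figure 2 can be sketched as a short loop in which each new input is the system prompt concatenated with every earlier answer and its verifier feedback. This is an illustrative sketch under stated assumptions, not the paper's code: `build_input`, `sequential_revision`, the prompt layout, the early stop on a positive reward, and the toy policy and verifier are all hypothetical stand-ins (a real policy would be the fine-tuned model).

```python
from typing import Callable, List, Tuple

Policy = Callable[[str], str]                  # prompt -> answer
Verifier = Callable[[str], Tuple[float, str]]  # answer -> (reward, feedback)


def build_input(system_prompt: str, history: List[Tuple[str, str]]) -> str:
    """Concatenate the prompt with all previous answers and verifier messages."""
    parts = [system_prompt]
    for i, (answer, feedback) in enumerate(history):
        parts.append(f"[attempt {i}] answer: {answer}")
        parts.append(f"[attempt {i}] verifier: {feedback}")
    return "\n".join(parts)


def sequential_revision(
    policy: Policy,
    verifier: Verifier,
    system_prompt: str,
    max_steps: int,
) -> Tuple[List[Tuple[str, str]], List[float]]:
    """Run up to `max_steps` answer/verify rounds, feeding history back each round."""
    history: List[Tuple[str, str]] = []
    rewards: List[float] = []
    for _ in range(max_steps):
        answer = policy(build_input(system_prompt, history))
        reward, feedback = verifier(answer)
        history.append((answer, feedback))
        rewards.append(reward)
        if reward > 0:  # assumption: stop once the verifier accepts the answer
            break
    return history, rewards


# Toy stand-ins so the loop can be run without a model.
def toy_verifier(answer: str) -> Tuple[float, str]:
    return (1.0, "correct") if answer == "8 * 3" else (-1.0, "wrong result")


def toy_policy(prompt: str) -> str:
    # Revises its answer only after seeing negative verifier feedback.
    return "8 * 3" if "verifier: wrong result" in prompt else "8 + 3"


history, rewards = sequential_revision(toy_policy, toy_verifier, "Reach 24.", max_steps=5)
```

Here `max_steps` plays the role of the number of verification iterations: a larger budget gives the policy more chances to revise an answer using the accumulated feedback.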


This review was generated by AI and reviewed by human editors.