QUICK REVIEW

[Paper Review] Why Does RL Generalize Better Than SFT? A Data-Centric Perspective on VLM Post-Training

Aojun Lu, Tao Feng | arXiv (Cornell University) | 2026-02-11

Multimodal Machine Learning Applications | Citations: 0

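One-Line Summary

The paper argues that RL generalizes better than SFT because of an implicit data-filtering effect that emphasizes medium-difficulty samples. It introduces DC-SFT, a data-filtering approach that outperforms RL on OOD generalization while improving training stability and efficiency.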

ABSTRACT

The adaptation of large-scale Vision-Language Models (VLMs) through post-training reveals a pronounced generalization gap: models fine-tuned with Reinforcement Learning (RL) consistently achieve superior out-of-distribution (OOD) performance compared to those trained with Supervised Fine-Tuning (SFT). This paper posits a data-centric explanation for this phenomenon, contending that RL's generalization advantage arises from an implicit data filtering mechanism that inherently prioritizes medium-difficulty training samples. To test this hypothesis, we systematically evaluate the OOD generalization of SFT models across training datasets of varying difficulty levels. Our results confirm that data difficulty is a critical factor, revealing that training on hard samples significantly degrades OOD performance. Motivated by this finding, we introduce Difficulty-Curated SFT (DC-SFT), a straightforward method that explicitly filters the training set based on sample difficulty. Experiments show that DC-SFT not only substantially enhances OOD generalization over standard SFT, but also surpasses the performance of RL-based training, all while providing greater stability and computational efficiency. This work offers a data-centric account of the OOD generalization gap in VLMs and establishes a more efficient pathway to achieving robust generalization. Code is available at https://github.com/byyx666/DC-SFT.

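Motivation and Objectives

Investigate why RL-based post-training generalizes better than SFT in Vision-Language Models (VLMs).
Test whether data difficulty affects ID and OOD performance under SFT.
Propose a simple data-curation method (DC-SFT) to improve SFT generalization.
Demonstrate the effectiveness of DC-SFT across multiple models and tasks, including reasoning benchmarks.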

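Proposed Method

Define a data difficulty taxonomy (easy, medium, hard) based on model agreement (correctness) across multiple responses per prompt.
Evaluate SFT models trained on difficulty-filtered subsets (easy/medium/hard) to analyze ID and OOD performance.
Propose two DC-SFT variants: SFT-M (trains on medium-difficulty data only) and SFT-EM (trains on easy and medium data, discarding hard data).
Compare DC-SFT with standard SFT and RL-based GRPO under both LoRA and full fine-tuning settings.
Assess training stability and efficiency, including a training-time comparison and an analysis of gradient dynamics.
Extend the evaluation to reasoning-focused test sets (MMK12, MMMU, WeMath, MathVerse, MathVista, MathVision) to gain insight into test-time scaling.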
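
The difficulty taxonomy and the two DC-SFT variants can be illustrated with a minimal sketch. This is not the authors' implementation: the thresholds used here (all sampled responses correct means easy, none correct means hard, anything in between means medium), the `correct` field, and the function names `pass_rate`, `difficulty`, and `curate` are all assumptions made for illustration. The review only states that difficulty is derived from correctness across multiple responses per prompt.

```python
from typing import Dict, List, Set


def pass_rate(correct_flags: List[bool]) -> float:
    """Fraction of sampled responses that were judged correct."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(correct_flags) / len(correct_flags)


def difficulty(correct_flags: List[bool],
               easy_min: float = 1.0,
               hard_max: float = 0.0) -> str:
    """Label a prompt easy / medium / hard from its pass rate.

    ASSUMPTION: the default thresholds (1.0 and 0.0) are illustrative,
    not values taken from the paper.
    """
    rate = pass_rate(correct_flags)
    if rate >= easy_min:
        return "easy"
    if rate <= hard_max:
        return "hard"
    return "medium"


def curate(dataset: List[Dict], keep: Set[str]) -> List[Dict]:
    """Keep only the examples whose difficulty label is in `keep`."""
    return [ex for ex in dataset if difficulty(ex["correct"]) in keep]


# Hypothetical toy data: each example stores per-response correctness flags.
dataset = [
    {"id": "a", "correct": [True, True, True, True]},      # easy
    {"id": "b", "correct": [True, False, True, False]},    # medium
    {"id": "c", "correct": [False, False, False, False]},  # hard
]

sft_m = curate(dataset, {"medium"})           # SFT-M: medium only
sft_em = curate(dataset, {"easy", "medium"})  # SFT-EM: drop hard
```

Under these assumptions, SFT-M would train on example `b` only, while SFT-EM would train on `a` and `b`; in both cases the curated subset is passed to an ordinary SFT pipeline unchanged.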

Figure 1 : (a) RL implicitly focuses updates on medium-difficulty samples that yield high reward variance. (b) ID and OOD performance after SFT on data subsets of varying difficulty levels.

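Experimental Results

Research Questions

RQ1: Under SFT, does training on medium-difficulty data improve OOD generalization compared with easy or hard data?
RQ2: Can DC-SFT, which explicitly filters out hard data, surpass RL-based generalization (GRPO) on OOD tasks?
RQ3: Is DC-SFT more stable and efficient than RL during VLM post-training?
RQ4: Do the benefits of DC-SFT extend to reasoning-focused tasks and test-time scaling scenarios?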

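Key Findings

RL's generalization advantage stems from an implicit focus on medium-difficulty samples, which yield informative gradients.
Hard data improves ID performance but substantially harms OOD generalization when used in SFT.
Medium-difficulty data provides balanced ID gains while maintaining or slightly improving OOD performance.
DC-SFT (SFT-M or SFT-EM) consistently outperforms standard SFT and RL baselines on average OOD metrics across datasets and model sizes.
DC-SFT offers substantial efficiency gains over RL (GRPO) and maintains or improves OOD and test-time performance on reasoning benchmarks.
Training on hard samples tends to increase gradient norms and instability during SFT, contributing to degraded OOD generalization.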
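
The first finding, that RL implicitly concentrates updates on medium-difficulty samples, can be made concrete with a small sketch of group-normalized advantages in the style of GRPO. The assumptions here are binary (0/1) correctness rewards, the use of the population standard deviation, and the `eps` value; the function name `group_advantages` is hypothetical and this is not the paper's code. When every response in a group receives the same reward, as happens for prompts the model always or never solves, the advantages are all zero and the prompt contributes no policy-gradient signal.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-normalized advantages: (r - mean) / (std + eps).

    ASSUMPTION: population standard deviation and eps=1e-6 are
    illustrative choices, not details confirmed by the paper.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical reward groups, one list of sampled responses per prompt.
easy_group = [1.0, 1.0, 1.0, 1.0]    # model always correct
hard_group = [0.0, 0.0, 0.0, 0.0]    # model never correct
medium_group = [1.0, 0.0, 1.0, 0.0]  # mixed outcomes

easy_adv = group_advantages(easy_group)
hard_adv = group_advantages(hard_group)
medium_adv = group_advantages(medium_group)
```

In this sketch only `medium_adv` contains non-zero values, which matches the reward-variance picture described in the Figure 1 caption: groups with mixed outcomes have non-zero reward variance and therefore drive the update.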

Figure 2 : (a) Illustrative examples of the data difficulty taxonomy. (b) Illustrative examples of generalization evaluation benchmarks for image classification (top) and visual grounding (bottom).

This review was generated by AI and reviewed by a human editor.