QUICK REVIEW

[Paper Review] R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

Jingyi Zhang, Tianyi Lin | arXiv (Cornell University) | 2026. 02. 03.

Topic Modeling | Citations: 0

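One-line summary

This paper presents CADS (Collective Adversarial Data Synthesis), a framework that generates high-quality, diverse, and challenging multimodal data for MLLMs. The authors use it to build the MMSynthetic-20K dataset and train the model R1-SyntheticVL on it with GRPO.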

ABSTRACT

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

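Research motivation and goals

Address the data scarcity problem of Multimodal Large Language Models (MLLMs) by enabling automatic generation of multimodal training data.
Develop a general data synthesis framework that produces high-quality, diverse, and challenging samples to improve MLLM reasoning.
Build MMSynthetic-20K, a high-quality synthetic dataset for training MLLMs and evaluating them on real-world benchmarks.
Demonstrate that models trained on synthetic data can outperform baselines trained on real data, and that synthetic data can complement real data.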

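Proposed method

Propose CADS (Collective Adversarial Data Synthesis), which alternates between two cyclic phases: CAD-Generate (collective data generation) and CAD-Judge (collective data judgment).
Use Adversarial Context Optimization to refine the generation context based on high-value adversarial examples.
Use multiple MLLMs for both generation and judgment to ensure diversity and quality.
Construct MMSynthetic-20K from CADS-generated data and train R1-SyntheticVL with GRPO (reinforcement learning).
Evaluate on six benchmarks covering general, math, and chart understanding tasks, and compare against state-of-the-art open-source and closed-source models.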
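The generate/judge cycle described above can be illustrated with a toy loop. This is only a minimal sketch under stated assumptions, not the authors' implementation: the review does not say how judge votes are aggregated or how the context is updated, so the majority vote, the fixed-size context window, the `difficulty` proxy, and every name here (`Sample`, `make_generator`, `arithmetic_judge`, `toy_solver`, `cads`) are hypothetical. Simple arithmetic functions stand in for the MLLMs.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Sample:
    question: str
    answer: int
    difficulty: float  # toy proxy in [0, 1]; not a quantity from the paper


def make_generator(error_rate: float) -> Callable[[List[Sample], random.Random], Sample]:
    """Toy generator; a real one would be an MLLM or image generator."""
    def generate(context: List[Sample], rng: random.Random) -> Sample:
        # Condition on the context: aim slightly above its mean difficulty.
        base = sum(s.difficulty for s in context) / len(context) if context else 0.4
        difficulty = min(1.0, base + rng.uniform(0.0, 0.2))
        a, b = rng.randint(1, 99), rng.randint(1, 99)
        answer = a + b
        if rng.random() < error_rate:  # simulate a low-quality sample
            answer += 1
        return Sample(f"{a} + {b}", answer, difficulty)
    return generate


def arithmetic_judge(sample: Sample) -> bool:
    """Toy judge: accepts a sample only if its answer key is correct."""
    a, b = (int(t) for t in sample.question.split(" + "))
    return a + b == sample.answer


def toy_solver(sample: Sample) -> bool:
    """Toy target model: solves only samples below a fixed difficulty."""
    return sample.difficulty < 0.5


def cads(generators, judges, solver, rounds: int, context_size: int = 4, seed: int = 0):
    rng = random.Random(seed)
    context: List[Sample] = []
    dataset: List[Sample] = []
    for _ in range(rounds):
        # Generation phase: every generator proposes a sample from the shared context.
        candidates = [g(context, rng) for g in generators]
        # Judgment phase: keep candidates approved by a majority (an assumption).
        accepted = [c for c in candidates
                    if 2 * sum(j(c) for j in judges) > len(judges)]
        dataset.extend(accepted)
        # Context update (assumed form): accepted samples the solver fails on
        # become the examples that steer the next round of generation.
        hard = [s for s in accepted if not solver(s)]
        context = (context + hard)[-context_size:]
    return dataset


generators = [make_generator(0.0), make_generator(0.5)]
judges = [arithmetic_judge, arithmetic_judge, arithmetic_judge]
dataset = cads(generators, judges, toy_solver, rounds=20)
```

In this toy setup the three judges are identical, so the vote adds nothing; with heterogeneous judges the same aggregation line would filter out samples that only a minority of models accept.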

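Experimental results

Research questions

RQ1: Can synthetic multimodal data generated by a collective adversarial framework improve MLLM performance on complex reasoning tasks?
RQ2: Does CADS produce higher-quality, more diverse, and more challenging data than single-model generation approaches?
RQ3: How does Adversarial Context Optimization affect data quality and model performance?
RQ4: In what ways does synthetic data complement or replace real data for MLLMs?
RQ5: How does model performance scale with the size of the synthetic dataset?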

Key results

Model	MathVista	MathVerse	MathVision	MMMU	MMMU-Pro (Standard-10)	MMMU-Pro (Vision)	CharXiv (Reasoning)	CharXiv (Descriptive)	Avg.
R1-SyntheticVL (Ours)	75.6	51.2	29.1	56.3	42.0	38.7	47.8	75.5	52.0
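As a quick sanity check on how the row should be read, the snippet below assumes that the first eight numbers are per-benchmark scores (with MMMU-Pro split into Standard-10 and Vision, and CharXiv split into Reasoning and Descriptive) and that the final number is their unweighted mean. The column assignment is an inference from the flattened header, not something the review states explicitly; the arithmetic is consistent with it.

```python
# Scores copied from the R1-SyntheticVL row; the column labels are an assumed
# reading of the flattened table header.
scores = {
    "MathVista": 75.6,
    "MathVerse": 51.2,
    "MathVision": 29.1,
    "MMMU": 56.3,
    "MMMU-Pro (Standard-10)": 42.0,
    "MMMU-Pro (Vision)": 38.7,
    "CharXiv (Reasoning)": 47.8,
    "CharXiv (Descriptive)": 75.5,
}

# Unweighted mean over the eight columns.
mean_score = sum(scores.values()) / len(scores)
print(round(mean_score, 1))  # 52.0, matching the last value in the row
```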

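CADS yields higher-quality synthetic multimodal data than using Nano Banana Pro directly, as shown by improved benchmark scores.
R1-SyntheticVL, trained on MMSynthetic-20K, achieves top-tier performance on several reasoning benchmarks, most notably MMMU-Pro.
Ablation studies show that CAD-Generate and CAD-Judge substantially improve data quality, and that Adversarial Context Optimization provides further gains.
On MathVista, training on MMSynthetic-20K alone can outperform training on real data, and combining it with real data can improve results further.
Scaling experiments suggest that performance keeps improving as the synthetic dataset grows to 20K, with no sign of saturation.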


This review was generated by AI and checked by a human editor.