QUICK REVIEW

[Paper Review] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang | arXiv.org | April 14, 2025

Multimodal Machine Learning Applications | Citations: 4

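One-Line Summary

InternVL3 introduces native multimodal pre-training, variable visual position encoding, and post-training strategies, achieving state-of-the-art performance among open-source MLLMs while retaining strong multimodal and language capabilities.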

ABSTRACT

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

Research Motivation and Goals

Develop a native multimodal pre-training paradigm that learns linguistic and multimodal capabilities in a single stage without post-hoc alignment.
Improve scalability and context handling through Variable Visual Position Encoding (V2PE).
Enhance performance via post-training strategies (Supervised Fine-Tuning and Mixed Preference Optimization) and test-time scaling.
Demonstrate that an open-source model can reach state-of-the-art results on MMMU and other multimodal benchmarks.
Provide infrastructure and data release plans to support open science in next-generation MLLMs.

Proposed Method

Propose native multimodal pre-training that jointly optimizes over text and multimodal data, rather than following a two-stage pipeline of text-only pre-training followed by multimodal alignment.
Use a multimodal autoregressive objective that computes loss only on text tokens while conditioning on visual inputs.
Incorporate Variable Visual Position Encoding (V2PE) to allow longer multimodal contexts with modality-specific position increments.
Apply two-stage post-training: Supervised Fine-Tuning (SFT) and Mixed Preference Optimization (MPO) to boost multimodal conversation and reasoning.
Employ test-time scaling (Best-of-N with VisualPRM as critic) to enhance reasoning and mathematics tasks.
Extend the training infrastructure with an enhanced InternEVO framework to support scalable, balanced, multi-module training across ViT, MLP, and LLM components.
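
Two of the items above, the text-only loss and V2PE, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the first token sits at position 0, that each later text token advances the position index by 1, and that each later visual token advances it by a smaller fractional step `delta` (the value 0.25 is purely illustrative). It also assumes the loss is a plain mean of negative log-likelihoods over text tokens; the review does not specify the averaging scheme. The function names `v2pe_positions` and `text_only_nll` are hypothetical.

```python
import math


def v2pe_positions(is_visual, delta=0.25):
    """Assign a position index to every token in a mixed sequence.

    Assumption: the first token is at position 0; each later token adds
    1.0 if it is a text token and `delta` (< 1) if it is a visual token,
    so long runs of visual tokens consume less of the position range.
    """
    positions = []
    current = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            current += delta if visual else 1.0
        positions.append(current)
    return positions


def text_only_nll(token_log_probs, is_visual):
    """Mean negative log-likelihood computed over text tokens only.

    Visual tokens still appear in the sequence (the model conditions on
    them), but their log-probabilities are excluded from the loss.
    """
    text_terms = [-lp for lp, visual in zip(token_log_probs, is_visual)
                  if not visual]
    if not text_terms:
        raise ValueError("sequence contains no text tokens")
    return sum(text_terms) / len(text_terms)


# One text token, four visual tokens, one text token.
mask = [False, True, True, True, True, False]
positions = v2pe_positions(mask, delta=0.25)

# Log-probabilities for the visual slots are placeholders; they are ignored.
log_probs = [math.log(0.5), -9.0, -9.0, -9.0, -9.0, math.log(0.25)]
loss = text_only_nll(log_probs, mask)
```

In this toy sequence the four visual tokens together advance the position index by exactly one unit, which is the intuition behind fitting longer multimodal contexts into a fixed position range.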

Experimental Results

Research Questions

RQ1: Can a native multimodal pre-training approach surpass post-hoc alignment pipelines for open-source MLLMs in diverse multimodal tasks?
RQ2: How does variable visual position encoding affect handling of long multimodal contexts and downstream performance?
RQ3: What is the impact of SFT and MPO on multimodal reasoning, tool usage, GUI tasks, and domain-specific understanding?
RQ4: How effective is test-time scaling with a critic model in improving reasoning and mathematics benchmarks for open-source MLLMs?
RQ5: What infrastructure optimizations are required to train substantially large InternVL3 models efficiently?

Key Findings

InternVL3-78B achieves 72.2 on MMMU, setting a new state-of-the-art among open-source MLLMs.
InternVL3 variants substantially outperform prior InternVL iterations and are competitive with leading closed-source models on several benchmarks.
In multimodal reasoning and mathematics, InternVL3 variants perform strongly across MMMU, MathVista, MathVision, MathVerse, and other benchmarks, with gains growing as model size increases.
Test-time Best-of-N with VisualPRM as critic yields notable gains even for smaller models (e.g., 6–9 percentage-point improvements in MathVerse Vision-Only).
SFT with higher-quality data and MPO significantly improves multimodal reasoning and generation quality by aligning preferred and rejected responses.
The extended InternEVO-based infrastructure delivers 50%–200% training speedups for models of comparable size, enabling efficient scaling to hundreds of billions of parameters.
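
The Best-of-N test-time scaling mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: `sample` stands in for drawing one response from the model, and `critic` stands in for a scorer such as VisualPRM, reduced here to a single scalar per response. Both callables, along with `make_toy_sampler` and `toy_critic`, are hypothetical names introduced only for this example.

```python
from typing import Callable, List


def best_of_n(question: str,
              sample: Callable[[str], str],
              critic: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n candidate answers and return the one the critic scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [sample(question) for _ in range(n)]
    # max() keeps the first candidate among ties.
    return max(candidates, key=lambda answer: critic(question, answer))


def make_toy_sampler(answers):
    """Hypothetical sampler: returns the given answers one per call."""
    iterator = iter(answers)
    return lambda question: next(iterator)


def toy_critic(question: str, answer: str) -> float:
    """Hypothetical critic: rewards answers that end with the correct result."""
    return 1.0 if answer.strip().endswith("= 4") else 0.0


answers = ["2 + 2 = 5", "2 + 2 = 4", "2 + 2 = 22"]
chosen = best_of_n("What is 2 + 2?", make_toy_sampler(answers), toy_critic, n=3)
```

The extra cost is paid entirely at inference time (n generations plus n critic calls), which is why the approach can help smaller models without any retraining.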

This review was generated by AI and reviewed by a human editor.