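[Paper Review] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
InternVL3 introduces native multimodal pre-training, variable visual position encoding, and post-training strategies to achieve state-of-the-art performance among open-source MLLMs, with strong results on both multimodal and pure-language tasks.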
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
Research Motivation and Objectives
- Develop a native multimodal pre-training paradigm that learns linguistic and multimodal capabilities in a single stage without post-hoc alignment.
- Improve scalability and context handling through Variable Visual Position Encoding (V2PE).
- Enhance performance via post-training strategies (Supervised Fine-Tuning and Mixed Preference Optimization) and test-time scaling.
- Demonstrate that an open-source model can be competitive, with state-of-the-art results on MMMU and other multimodal benchmarks.
- Provide infrastructure and data release plans to support open science in next-generation MLLMs.
Proposed Method
- Propose native multimodal pre-training that jointly optimizes over text and multimodal data, rather than a two-stage pipeline of text-only pre-training followed by multimodal alignment.
- Use a multimodal autoregressive objective that computes loss only on text tokens while conditioning on visual inputs.
- Incorporate Variable Visual Position Encoding (V2PE) to allow longer multimodal contexts with modality-specific position increments.
- Apply two-stage post-training: Supervised Fine-Tuning (SFT) and Mixed Preference Optimization (MPO) to boost multimodal conversation and reasoning.
- Employ test-time scaling (Best-of-N with VisualPRM as critic) to enhance reasoning and mathematics tasks.
- Extend the training infrastructure with an enhanced InternEVO framework to support scalable, balanced, multi-module training across ViT, MLP, and LLM components.
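Two of the ideas in the list above, modality-specific position increments and a loss computed only on text tokens, can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `"text"`/`"image"` tags, and the visual increment of 1/4 are assumptions chosen for illustration, as is the convention that each token's increment depends on that token's own modality.

```python
from fractions import Fraction


def v2pe_position_ids(modalities, visual_step=Fraction(1, 4)):
    """Assign a position index to each token in a mixed sequence.

    The first token gets position 0. Each later token advances the
    position by 1 if it is a text token, or by the smaller `visual_step`
    (an assumed value) if it is a visual token, so long runs of image
    tokens use up less of the position range.
    """
    positions = []
    for i, modality in enumerate(modalities):
        if i == 0:
            positions.append(Fraction(0))
            continue
        step = Fraction(1) if modality == "text" else visual_step
        positions.append(positions[-1] + step)
    return positions


def text_only_loss(token_nll, modalities):
    """Average per-token negative log-likelihood over text tokens only.

    Visual tokens still condition the prediction, but their own loss
    values are ignored.
    """
    text_losses = [nll for nll, m in zip(token_nll, modalities) if m == "text"]
    return sum(text_losses) / len(text_losses) if text_losses else 0.0


# Hypothetical sequence: one text token, four image tokens, one text token.
modalities = ["text", "image", "image", "image", "image", "text"]
positions = v2pe_position_ids(modalities)
# positions: 0, 1/4, 1/2, 3/4, 1, 2

# Made-up per-token losses; the four image-token values are discarded.
loss = text_only_loss([2.0, 9.0, 9.0, 9.0, 9.0, 4.0], modalities)
# loss: 3.0
```

With a visual increment of 1/4, the four image tokens together advance the position by the same amount as a single text token, which is the sense in which a smaller increment leaves room for longer multimodal contexts.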
Experimental Results
Research Questions
- RQ1: Can a native multimodal pre-training approach surpass post-hoc alignment pipelines for open-source MLLMs in diverse multimodal tasks?
- RQ2: How does variable visual position encoding affect handling of long multimodal contexts and downstream performance?
- RQ3: What is the impact of SFT and MPO on multimodal reasoning, tool usage, GUI tasks, and domain-specific understanding?
- RQ4: How effective is test-time scaling with a critic model in improving reasoning and mathematics benchmarks for open-source MLLMs?
- RQ5: What infrastructure optimizations are required to train very large InternVL3 models efficiently?
Key Results
- InternVL3-78B achieves 72.2 on MMMU, setting a new state-of-the-art among open-source MLLMs.
- InternVL3 variants substantially outperform prior InternVL iterations and are competitive with leading closed-source models on several benchmarks.
- In multimodal reasoning and mathematics, InternVL3 variants show strong performance across MMMU, MathVista, MathVision, MathVerse, and other benchmarks, with gains increasing with model size.
- Test-time Best-of-N with VisualPRM as critic yields notable gains even for smaller models (e.g., 6–9 percentage-point improvements in MathVerse Vision-Only).
- SFT with higher-quality data and MPO significantly improve multimodal reasoning and generation quality, with MPO learning from pairs of preferred and rejected responses.
- The extended InternEVO-based infrastructure delivers 50%–200% training speedups for models of comparable size, enabling efficient scaling to hundreds of billions of parameters.
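The Best-of-N strategy mentioned in the results above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: `best_of_n` and `toy_critic` are hypothetical names, and the toy critic is a stand-in that averages a crude per-step score, whereas the paper uses VisualPRM as the critic.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], critic: Callable[[str], float]) -> str:
    """Return the candidate response that the critic scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=critic)


def toy_critic(response: str) -> float:
    """Hypothetical stand-in for a learned critic.

    Splits the response into steps on ';' and returns the fraction of
    steps that contain an equation. A real critic would be a trained
    reward model, not a string heuristic.
    """
    steps = response.split(";")
    return sum(1.0 for step in steps if "=" in step) / len(steps)


# Three made-up candidate answers sampled for the same question.
candidates = ["x = 2; so y = 4", "guess 5", "x = 2; maybe; y = 4"]
best = best_of_n(candidates, toy_critic)
# best: "x = 2; so y = 4"
```

The selection step is independent of how candidates are generated, so the same function works for any N; the quality of the result depends on how well the critic's scores track correctness.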
This review was generated by AI and reviewed by a human editor.