QUICK REVIEW

[Paper Review] InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Jinguo Zhu, Weiyun Wang | ArXiv.org | 2025. 04. 14.
Multimodal Machine Learning Applications | Citations: 4
One-Line Summary

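InternVL3 introduces native multimodal pre-training, variable visual position encoding, and post-training strategies, achieving state-of-the-art performance among open-source MLLMs while retaining strong multimodal and language capabilities.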

ABSTRACT

We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.

Research Motivation and Objectives

  • Develop a native multimodal pre-training paradigm that learns linguistic and multimodal capabilities in a single stage without post-hoc alignment.
  • Improve scalability and context handling through Variable Visual Position Encoding (V2PE).
  • Enhance performance via post-training strategies (Supervised Fine-Tuning and Mixed Preference Optimization) and test-time scaling.
  • Demonstrate that an open-source model can reach state-of-the-art results on MMMU and other multimodal benchmarks.
  • Provide infrastructure and data release plans to support open science in next-generation MLLMs.

Proposed Method

  • Propose native multimodal pre-training that jointly optimizes over text and multimodal data, rather than following a two-stage pipeline of text-only pre-training followed by multimodal alignment.
  • Use a multimodal autoregressive objective that computes loss only on text tokens while conditioning on visual inputs.
  • Incorporate Variable Visual Position Encoding (V2PE) to allow longer multimodal contexts with modality-specific position increments.
  • Apply two-stage post-training: Supervised Fine-Tuning (SFT) and Mixed Preference Optimization (MPO) to boost multimodal conversation and reasoning.
  • Employ test-time scaling (Best-of-N with VisualPRM as critic) to enhance reasoning and mathematics tasks.
  • Extend the training infrastructure with an enhanced InternEVO framework to support scalable, balanced, multi-module training across ViT, MLP, and LLM components.
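
The text-only loss in the second bullet above can be illustrated with a minimal sketch. It assumes the model has already produced a log-probability for each target token and that a boolean mask marks which targets are text. The function name `text_only_nll`, the sample values, and the plain mean over text tokens are assumptions made for illustration; this review does not state how the paper weights or averages the per-token losses.

```python
import math

def text_only_nll(token_log_probs, is_text):
    """Mean negative log-likelihood over text tokens only.

    token_log_probs: log-probability assigned to each target token.
    is_text: True for text targets, False for visual targets.
    Visual tokens still serve as context for the model, but they
    contribute nothing to the loss computed here.
    """
    if len(token_log_probs) != len(is_text):
        raise ValueError("inputs must have the same length")
    text_losses = [-lp for lp, keep in zip(token_log_probs, is_text) if keep]
    if not text_losses:
        return 0.0  # no text targets in this sequence
    return sum(text_losses) / len(text_losses)

# Hypothetical sequence: two visual tokens followed by three text tokens.
log_probs = [math.log(0.1), math.log(0.2),
             math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]
loss = text_only_nll(log_probs, mask)
```

In a deep learning framework the same effect is usually obtained by assigning visual positions a label that the cross-entropy loss ignores, so gradients flow only from text targets.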

Experimental Results

Research Questions

  • RQ1: Can a native multimodal pre-training approach surpass post-hoc alignment pipelines for open-source MLLMs in diverse multimodal tasks?
  • RQ2: How does variable visual position encoding affect handling of long multimodal contexts and downstream performance?
  • RQ3: What is the impact of SFT and MPO on multimodal reasoning, tool usage, GUI tasks, and domain-specific understanding?
  • RQ4: How effective is test-time scaling with a critic model in improving reasoning and mathematics benchmarks for open-source MLLMs?
  • RQ5: What infrastructure optimizations are required to train substantially large InternVL3 models efficiently?
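
To make the V2PE question (RQ2) concrete, here is a minimal sketch of modality-specific position increments: text tokens advance the position index by 1, while visual tokens advance it by a smaller step `delta`, so a long run of image tokens consumes less of the position range. The function name `v2pe_positions`, the value `delta=0.25`, and the choice to start the first token at position 0 are assumptions for illustration and are not taken from this review.

```python
def v2pe_positions(is_visual, delta=0.25):
    """Assign a position index to every token in a mixed sequence.

    is_visual: True for visual tokens, False for text tokens.
    delta: position increment used for visual tokens (assumed value);
           text tokens always use an increment of 1.
    """
    positions = []
    pos = 0.0
    for i, visual in enumerate(is_visual):
        if i > 0:
            # The increment depends on the modality of the current token.
            pos += delta if visual else 1.0
        positions.append(pos)
    return positions

# Hypothetical sequence: one text token, four visual tokens, one text token.
sequence = [False, True, True, True, True, False]
compressed = v2pe_positions(sequence, delta=0.25)
standard = v2pe_positions(sequence, delta=1.0)
```

With `delta=1.0` the scheme reduces to ordinary sequential positions; with a smaller `delta` the final text token lands at position 2.0 instead of 5.0 in this example, which is the sense in which longer multimodal contexts fit into the same position range.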

Key Results

  • InternVL3-78B achieves 72.2 on MMMU, setting a new state-of-the-art among open-source MLLMs.
  • InternVL3 variants substantially outperform prior InternVL iterations and are competitive with leading closed-source models on several benchmarks.
  • In multimodal reasoning and mathematics, InternVL3 variants show strong performance across MMMU, MathVista, MathVision, MathVerse, and other benchmarks, with gains growing as model size increases.
  • Test-time Best-of-N with VisualPRM as critic yields notable gains even for smaller models (e.g., 6–9 percentage-point improvements in MathVerse Vision-Only).
  • SFT with higher-quality data and MPO significantly improve multimodal reasoning and generation quality, with MPO learning from pairs of preferred and rejected responses.
  • The extended InternEVO-based infrastructure delivers 50%–200% training speedups for models of comparable size, enabling efficient scaling to hundreds of billions of parameters.
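
The Best-of-N procedure mentioned in the results above can be sketched in a few lines: sample N candidate responses, score each with a critic, and return the highest-scoring one. In this sketch the critic is a stand-in lookup table of hypothetical scores, not VisualPRM itself, and the names `best_of_n`, `candidates`, and `toy_scores` are assumptions for illustration.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str],
              critic: Callable[[str], float]) -> str:
    """Return the candidate response that the critic scores highest."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # max() returns the first candidate with the top score if there are ties.
    return max(candidates, key=critic)

# Hypothetical candidates and scores; a real critic would be a reward model.
candidates = ["answer A", "answer B", "answer C"]
toy_scores = {"answer A": 0.2, "answer B": 0.9, "answer C": 0.5}
best = best_of_n(candidates, toy_scores.__getitem__)
```

The cost of this strategy grows linearly with N, since every candidate must be both generated and scored, which is why it is described as a test-time scaling technique.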


This review was generated by AI and reviewed by a human editor.