QUICK REVIEW

[Paper Review] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian, Danni Yang|arXiv (Cornell University)|Mar 10, 2026

Multimodal Machine Learning Applications | Citations: 0

One-Line Summary

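InternVL-U is a lightweight unified multimodal model that integrates a state-of-the-art MLLM with an MMDiT-based visual generation head, handling understanding, reasoning, generation, and editing efficiently. Its generation and editing performance surpasses that of larger unified baselines while multimodal understanding is retained.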

ABSTRACT

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

Motivation and Objectives

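Democratize unified multimodal modeling with a compact architecture that balances understanding and generation.
Integrate a pretrained MLLM backbone with a dedicated MMDiT-based visual generation head.
Design a data synthesis pipeline focused on high-semantic-density tasks and reasoning.
Enable reasoning-centric generation with Chain-of-Thought so that visual output aligns with user intent.
Provide efficient training strategies and evaluation benchmarks for UMMs.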

Proposed Method

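Adopt unified contextual modeling with modality-adaptive generation targets, so that the shared context and each generation task stay aligned.
Use a hybrid generation objective that combines autoregressive modeling for text with Flow Matching for images.
Adopt a modality-specific modular design with a ViT-based encoder and a dedicated MMDiT generation head.
Decouple the visual representations: semantic features for understanding, the VAE latent space for generation.
Incorporate Unified MSRoPE and resolution interpolation to preserve spatial structure across resolutions.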
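The hybrid objective listed above (autoregressive loss for text, Flow Matching loss for images) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The linear noise-to-data path, the `image_weight` factor, and all function names are assumptions introduced here for illustration; the review does not state how the two terms are weighted or which Flow Matching path the model uses.

```python
import math
import random


def text_ar_loss(logits, targets):
    """Mean next-token cross-entropy.

    logits: one list of unnormalized scores per text position.
    targets: the index of the correct token at each position.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
    return total / len(targets)


def flow_matching_loss(velocity_fn, data, noise, t):
    """Mean squared error between predicted and target velocity.

    Assumes a straight path x_t = (1 - t) * noise + t * data, whose
    velocity is the constant (data - noise). This is one common
    convention, not necessarily the one used by InternVL-U.
    """
    x_t = [(1.0 - t) * n + t * d for n, d in zip(noise, data)]
    target = [d - n for n, d in zip(noise, data)]
    pred = velocity_fn(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(data)


def hybrid_loss(logits, targets, velocity_fn, data, noise, t, image_weight=1.0):
    """Text loss plus a weighted image loss (image_weight is hypothetical)."""
    text_term = text_ar_loss(logits, targets)
    image_term = flow_matching_loss(velocity_fn, data, noise, t)
    return text_term + image_weight * image_term


if __name__ == "__main__":
    rng = random.Random(0)
    latent = [0.5, -1.0, 2.0]  # stand-in for a flattened VAE latent
    noise = [rng.gauss(0.0, 1.0) for _ in latent]

    def zero_head(x_t, t):
        # Toy velocity head that always predicts zero velocity.
        return [0.0 for _ in x_t]

    toy_logits = [[2.0, 0.1, -1.0], [0.0, 0.0, 3.0]]
    print(hybrid_loss(toy_logits, [0, 2], zero_head, latent, noise, t=0.3))
```

In a real model the text term would be computed on the MLLM's token logits and the image term on the generation head's output over VAE latents; here both are plain Python lists so the arithmetic can be followed by hand.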

Figure 1: Showcases of InternVL-U for general text-to-image generation (top) and image editing (bottom). InternVL-U supports high-fidelity image generation and editing at any resolution.

Experimental Results

Research Questions

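RQ1: How can a compact 4B-parameter UMM achieve strong understanding, reasoning, generation, and editing?
RQ2: Which architectural choices (modality-specific encoders, decoupled representations, a dedicated generation head) best balance performance and efficiency?
RQ3: Does the reasoning-centric data synthesis pipeline improve text rendering, scientific reasoning, and knowledge-intensive generation and editing?
RQ4: Does CoT-based reasoning improve the alignment between abstract user intent and accurate visual output?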

Key Findings

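InternVL-U consistently outperforms larger unified baselines, such as BAGEL (14B), on generation and editing tasks.
The model retains strong multimodal understanding while delivering high-quality generation and editing.
Incorporating Chain-of-Thought improves performance on knowledge-rich generation and complex editing tasks.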

Figure 2: Showcases of InternVL-U for spatial-centric, perception, science-centric, humor-centric, and reasoning-centric text-to-image generation or editing tasks. InternVL-U demonstrates such core multimodal capabilities across various visual domains.


This review was generated by AI and checked by a human editor.