QUICK REVIEW

[Paper Review] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian, Danni Yang|arXiv (Cornell University)|Mar 10, 2026

Multimodal Machine Learning Applications | Cited by 0

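One-Sentence Summary

InternVL-U is a lightweight 4B-parameter unified multimodal model that integrates a state-of-the-art MLLM with an MMDiT-based visual generation head, offering efficiency and strong capability across understanding, reasoning, generation, and editing. It outperforms larger unified baselines on generation and editing while retaining its multimodal understanding capabilities.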

ABSTRACT

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

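Research Motivation and Goals

Democratize unified multimodal modeling by balancing understanding and generation within a compact architecture.
Integrate a specialized MMDiT-based visual generation head with a pretrained MLLM backbone.
Design a data synthesis pipeline focused on high-semantic-density tasks and reasoning.
Enable reasoning-driven generation centered on Chain-of-Thought to align user intent with visual output.
Provide efficient training strategies and evaluation benchmarks for unified multimodal models.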

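Proposed Method

Adopt unified contextual modeling with modality-adaptive generation objectives to keep context and generation tasks aligned.
Use a hybrid generation objective: autoregressive modeling for text and Flow Matching for images.
Adopt a modality-specific modular design with a ViT-based encoder and a dedicated MMDiT generation head.
Decouple visual representations by using semantic features for understanding and performing generation in the VAE latent space.
Combine Unified MSRoPE with resolution interpolation to preserve spatial structure across resolutions.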
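
The hybrid objective listed above (autoregressive text modeling plus Flow Matching on image latents) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the straight-line (rectified-flow style) interpolation path, the `image_weight` factor, and the `velocity_fn` placeholder standing in for the MMDiT head are all assumptions made for illustration, since this summary does not state the exact parameterization or loss weighting.

```python
import numpy as np

def text_ar_loss(logits, targets):
    """Mean next-token cross-entropy; logits: (T, V), targets: (T,) ints."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def flow_matching_loss(velocity_fn, x1, rng):
    """Flow Matching loss on clean latents x1 of shape (B, D).

    Assumes a straight path x_t = (1 - t) * x0 + t * x1 from noise x0 to data x1,
    whose target velocity is the constant x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)       # Gaussian noise sample
    t = rng.uniform(size=(x1.shape[0], 1))   # one timestep per example
    xt = (1.0 - t) * x0 + t * x1             # point on the interpolation path
    target_v = x1 - x0                       # velocity along the straight path
    pred_v = velocity_fn(xt, t)              # hypothetical generation head
    return float(np.mean((pred_v - target_v) ** 2))

def hybrid_loss(logits, targets, velocity_fn, latents, rng, image_weight=1.0):
    """Sum of the text loss and a weighted image loss (weighting is an assumption)."""
    text_term = text_ar_loss(logits, targets)
    image_term = flow_matching_loss(velocity_fn, latents, rng)
    return text_term + image_weight * image_term
```

In this sketch the latents would come from a VAE encoder and the text logits from the MLLM backbone, which mirrors the decoupled-representation idea in the list: understanding uses semantic features, while the generation loss is computed only in the VAE latent space.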

Figure 1: Showcases of InternVL-U for general text-to-image generation (top) and image editing (bottom). InternVL-U supports high-fidelity image generation and editing at any resolution.

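Experimental Results

Research Questions

RQ1: How can a compact 4B-parameter UMM achieve strong understanding, reasoning, generation, and editing?
RQ2: Which architectural choices (modality-specific encoders, decoupled representations, a dedicated generation head) strike the best balance between performance and efficiency?
RQ3: Does a reasoning-centric data synthesis pipeline improve text rendering, scientific reasoning, and knowledge-intensive generation and editing?
RQ4: Does CoT-based reasoning improve the alignment between abstract user intent and precise visual output?

Key Findings

InternVL-U consistently outperforms larger unified baselines on generation and editing tasks.
The model produces high-quality generations and edits while retaining solid multimodal understanding.
Introducing Chain-of-Thought improves performance on knowledge-intensive generation and complex editing tasks.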

Figure 2: Showcases of InternVL-U for spatial-centric, perception, science-centric, humor-centric, and reasoning-centric text-to-image generation or editing tasks. InternVL-U demonstrates such core multimodal capabilities across various visual domains.

This review was generated by AI and reviewed by human editors.