QUICK REVIEW

[Paper Review] InternVL-U: Democratizing Unified Multimodal Models for Understanding, Reasoning, Generation and Editing

Changyao Tian, Danni Yang|arXiv (Cornell University)|2026. 03. 10.

Multimodal Machine Learning Applications | Citations: 0

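One-Line Summary

InternVL-U is a lightweight 4B-parameter unified multimodal model that integrates a state-of-the-art MLLM with an MMDiT-based visual generation head, achieving strong understanding, reasoning, generation, and editing performance with high efficiency. It outperforms larger unified baselines on generation and editing while retaining its multimodal understanding.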

ABSTRACT

Unified multimodal models (UMMs) that integrate understanding, reasoning, generation, and editing face inherent trade-offs between maintaining strong semantic comprehension and acquiring powerful generation capabilities. In this report, we present InternVL-U, a lightweight 4B-parameter UMM that democratizes these capabilities within a unified framework. Guided by the principles of unified contextual modeling and modality-specific modular design with decoupled visual representations, InternVL-U integrates a state-of-the-art Multimodal Large Language Model (MLLM) with a specialized MMDiT-based visual generation head. To further bridge the gap between aesthetic generation and high-level intelligence, we construct a comprehensive data synthesis pipeline targeting high-semantic-density tasks, such as text rendering and scientific reasoning, under a reasoning-centric paradigm that leverages Chain-of-Thought (CoT) to better align abstract user intent with fine-grained visual generation details. Extensive experiments demonstrate that InternVL-U achieves a superior performance-efficiency balance. Despite using only 4B parameters, it consistently outperforms unified baseline models with over 3x larger scales such as BAGEL (14B) on various generation and editing tasks, while retaining strong multimodal understanding and reasoning capabilities.

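Research Motivation and Goals

Democratize unified multimodal modeling by balancing understanding and generation within a compact architecture.
Integrate a pretrained MLLM backbone with a specialized MMDiT-based visual generation head.
Design a data synthesis pipeline focused on high-semantic-density tasks and reasoning.
Enable reasoning-centric generation with Chain-of-Thought (CoT) so that visual outputs match user intent.
Provide efficient training strategies and evaluation benchmarks for UMMs.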

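Proposed Method

Adopt unified contextual modeling with modality-adaptive generation targets to align context with generation tasks.
Use a hybrid generation objective that combines autoregressive modeling for text with Flow Matching for images.
Adopt a modality-specific modular design with a ViT-based encoder and a dedicated MMDiT generation head.
Decouple visual representations by using semantic features for understanding and a VAE latent space for generation.
Introduce Unified MSRoPE with resolution interpolation to preserve spatial structure across resolutions.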
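
The hybrid objective in the list above (autoregressive modeling for text, Flow Matching for images) can be illustrated with a minimal pure-Python sketch. This is not the InternVL-U implementation: the review does not give the exact formulation, so the linear noise-to-data path, the `image_weight` parameter, and every function name below are assumptions chosen for illustration.

```python
import math
import random


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target token under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def flow_matching_pair(latent, noise, t):
    """Assumed linear path: x_t = (1 - t) * noise + t * latent.

    Returns the interpolated point x_t and the regression target,
    the constant velocity (latent - noise) along that path.
    """
    x_t = [(1.0 - t) * n + t * z for n, z in zip(noise, latent)]
    velocity = [z - n for n, z in zip(noise, latent)]
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def hybrid_loss(token_logits, token_targets, pred_velocity, target_velocity,
                image_weight=1.0):
    """Text next-token loss plus a weighted flow-matching loss (hypothetical weighting)."""
    text_loss = sum(
        cross_entropy(l, y) for l, y in zip(token_logits, token_targets)
    ) / len(token_targets)
    image_loss = mse(pred_velocity, target_velocity)
    return text_loss + image_weight * image_loss


# Toy data: an 8-dimensional vector stands in for a VAE image latent.
rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(8)]
noise = [rng.gauss(0.0, 1.0) for _ in range(8)]
x_t, target_v = flow_matching_pair(latent, noise, t=0.3)

# Toy text branch: two positions over a 3-token vocabulary.
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
targets = [0, 2]

# A generation head that predicted the velocity perfectly would leave
# only the text term in the total loss.
text_only = hybrid_loss(logits, targets, target_v, target_v)
```

In a real model the velocity prediction would come from the generation head conditioned on `x_t`, `t`, and the multimodal context; here it is passed in directly so the two loss terms can be inspected in isolation.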

Figure 1: Showcases of InternVL-U for general text-to-image generation (top) and image editing (bottom). InternVL-U supports high-fidelity image generation and editing at any resolution.

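Experimental Results

Research Questions

RQ1: How can a compact 4B-parameter UMM achieve strong understanding, reasoning, generation, and editing?
RQ2: Which architectural choices (modality-specific encoders, decoupled representations, a specialized generation head) best balance performance and efficiency?
RQ3: Does a reasoning-centric data synthesis pipeline improve text rendering, scientific reasoning, and knowledge-intensive generation and editing?
RQ4: Can CoT-based reasoning improve the alignment between abstract user intent and concrete visual outputs?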

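Key Results

InternVL-U consistently outperforms larger unified baselines on generation and editing tasks, including BAGEL (14B), a model more than 3x its size.
The model retains strong multimodal understanding while delivering high-quality generation and editing.
Introducing Chain-of-Thought improves performance on knowledge-centric generation and complex editing tasks.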

Figure 2: Showcases of InternVL-U for spatial-centric, perception, science-centric, humor-centric, and reasoning-centric text-to-image generation or editing tasks. InternVL-U demonstrates such core multimodal capabilities across various visual domains.

This review was generated by AI and reviewed by a human editor.