QUICK REVIEW

[Paper Review] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu|ArXiv.org|Jan 29, 2025
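Semantic Web and Ontologies | Citations: 10
One-Line Summary

Janus-Pro strengthens the decoupled visual encoding approach, scales both data and model size (1B and 7B), and achieves state-of-the-art results in multimodal understanding and text-to-image generation with improved stability.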

ABSTRACT

In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.

Motivation and Objectives

  • Motivate unified multimodal models that separate visual encoding for understanding and generation to reduce task conflict.
  • Investigate how data scaling (understanding and generation data) and synthetic aesthetic data affect performance and stability.
  • Demonstrate scalability by evaluating 1B and 7B variants on multimodal benchmarks and text-to-image instruction tasks.

Proposed Method

  • Decouple visual encoding for understanding (SigLIP-based encoder) and generation (VQ tokenizer-based) with independent adaptors.
  • Use a unified autoregressive transformer to process concatenated multimodal features.
  • Adopt an optimized three-stage training strategy with redesigned data utilization to improve efficiency (longer Stage I, focused Stage II, adjusted Stage III data ratios).
  • Scale data for multimodal understanding and visual generation by adding ~90M understanding samples and ~72M synthetic aesthetic samples (ratio 1:1 with real data during pretraining).
  • Explore model scaling from 1.5B to 7B LLMs and report hyperparameters and training setup across stages.
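
The decoupled design in the bullets above can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the random weights, the codebook, and every function name here (`embed_for_understanding`, `embed_for_generation`, `build_sequence`) are hypothetical stand-ins chosen for illustration. It only shows the routing idea: continuous encoder features and discrete VQ token ids pass through separate adaptors, land in the same embedding width, and are concatenated with text embeddings into one sequence for a shared autoregressive transformer.

```python
import random

# All sizes below are assumptions for illustration, not values from the paper.
EMBED_DIM = 8       # shared embedding width expected by the transformer
UND_FEAT_DIM = 6    # width of understanding-encoder patch features
GEN_FEAT_DIM = 4    # width of VQ codebook embeddings
CODEBOOK_SIZE = 16  # number of discrete image tokens

rng = random.Random(0)

def make_linear(in_dim, out_dim):
    """Random in_dim x out_dim matrix standing in for a learned adaptor."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]

def project(features, weight):
    """Multiply each feature vector by the weight matrix."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(vec, col)) for col in columns] for vec in features]

und_adaptor = make_linear(UND_FEAT_DIM, EMBED_DIM)  # understanding adaptor
gen_adaptor = make_linear(GEN_FEAT_DIM, EMBED_DIM)  # generation adaptor
codebook = [[rng.uniform(-1, 1) for _ in range(GEN_FEAT_DIM)] for _ in range(CODEBOOK_SIZE)]

def embed_for_understanding(patch_features):
    """Understanding path: continuous features -> understanding adaptor."""
    return project(patch_features, und_adaptor)

def embed_for_generation(token_ids):
    """Generation path: discrete ids -> codebook lookup -> generation adaptor."""
    return project([codebook[i] for i in token_ids], gen_adaptor)

def build_sequence(text_embeddings, image_embeddings):
    """Concatenate both modalities into one sequence for the shared transformer."""
    return text_embeddings + image_embeddings

text = [[0.0] * EMBED_DIM for _ in range(3)]
patches = [[rng.uniform(-1, 1) for _ in range(UND_FEAT_DIM)] for _ in range(5)]

und_seq = build_sequence(text, embed_for_understanding(patches))
gen_seq = build_sequence(text, embed_for_generation([3, 7, 7, 1]))
print(len(und_seq), len(und_seq[0]))  # 8 8
print(len(gen_seq), len(gen_seq[0]))  # 7 8
```

Both paths produce vectors of the same width, which is what lets a single transformer consume either kind of image input without sharing an encoder between the two tasks.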
(a) Average performance on four multimodal understanding benchmarks.

Experimental Results

Research Questions

  • RQ1: Can decoupling visual encoders for understanding and generation improve cross-task performance in a unified multimodal model?
  • RQ2: How do data scaling and synthetic data affect multimodal understanding and image-generation capabilities?
  • RQ3: Does increasing model size (1B vs 7B) accelerate convergence and improve benchmark performance in both understanding and generation tasks?

Key Findings

  • Janus-Pro-7B achieves 79.2 on MMBench, surpassing Janus (69.4), TokenFlow (68.9), and MetaMorph (75.2).
  • On GenEval, Janus-Pro-7B scores 0.80, higher than Janus (0.61), DALL-E 3 (0.67), and SD3-Medium (0.74).
  • Janus-Pro-7B attains 84.19 on DPG-Bench, outperforming all other methods.
  • Janus-Pro-1B and Janus-Pro-7B both demonstrate strong multimodal understanding and generation, with 7B yielding better overall performance and stronger instruction following in text-to-image generation.
  • Using ~72M synthetic aesthetic samples with a 1:1 real-to-synthetic data ratio improves generation stability and output quality without sacrificing understanding performance.
  • Training convergence speeds and stability improve with larger LLMs (7B vs 1.5B) when using the decoupled visual encoding strategy.
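
The 1:1 real-to-synthetic ratio mentioned above can be pictured with a small sampler. This is a minimal sketch under stated assumptions: the review does not describe how the mixing is implemented, so the function `mix_one_to_one`, the alternating draw, and the placeholder datasets are all hypothetical.

```python
import random

def mix_one_to_one(real_samples, synthetic_samples, n, seed=0):
    """Draw n samples, alternating sources so real and synthetic appear 1:1.

    Hypothetical helper: alternation is one simple way to hold the ratio,
    not necessarily how the original training pipeline does it.
    """
    rng = random.Random(seed)
    batch = []
    for i in range(n):
        source = real_samples if i % 2 == 0 else synthetic_samples
        batch.append(rng.choice(source))
    rng.shuffle(batch)  # avoid a fixed real/synthetic ordering within the batch
    return batch

# Placeholder datasets; each item records its origin for counting.
real = [("real", i) for i in range(100)]
synthetic = [("synthetic", i) for i in range(100)]

batch = mix_one_to_one(real, synthetic, 10)
counts = {"real": 0, "synthetic": 0}
for origin, _ in batch:
    counts[origin] += 1
print(counts)  # {'real': 5, 'synthetic': 5}
```

With an even `n` the split is exact; with an odd `n` this sketch draws one extra real sample.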
(b) Performance on instruction-following benchmarks for text-to-image generation.


This review was generated by AI and checked by a human editor.