QUICK REVIEW
[Paper Review] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
Xiaokang Chen, Zhiyu Wu|ArXiv.org|Jan 29, 2025
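Semantic Web and Ontologies | Citations: 10
One-Sentence Summary
Janus-Pro strengthens the decoupled visual encoding approach, scales up both data and model size (1B and 7B), and achieves state-of-the-art results in multimodal understanding and text-to-image generation with improved stability.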
ABSTRACT
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Research Motivation and Objectives
- Motivate unified multimodal models that separate visual encoding for understanding and generation to reduce task conflict.
- Investigate how data scaling (understanding and generation data) and synthetic aesthetic data affect performance and stability.
- Demonstrate scalability by evaluating 1B and 7B variants on multimodal benchmarks and text-to-image instruction tasks.
Proposed Method
- Decouple visual encoding for understanding (SigLIP-based encoder) and generation (VQ tokenizer-based) with independent adaptors.
- Use a unified autoregressive transformer to process concatenated multimodal features.
- Adopt an optimized three-stage training strategy with redesigned data utilization to improve efficiency (longer Stage I, focused Stage II, adjusted Stage III data ratios).
- Scale data for multimodal understanding and visual generation by adding ~90M understanding samples and ~72M synthetic aesthetic samples (ratio 1:1 with real data during pretraining).
- Explore model scaling from a 1.5B to a 7B LLM backbone (the Janus-Pro-1B and Janus-Pro-7B variants) and report hyperparameters and training setup across stages.
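The decoupled input pathway described above can be pictured with a toy sketch. This is not the authors' implementation: the dimensions, the random weights, and the names `und_adaptor`, `gen_adaptor`, `codebook`, and `build_sequence` are all hypothetical stand-ins. It only illustrates the idea that continuous features (understanding) and discrete token IDs (generation) pass through independent adaptors into one shared embedding space before reaching a single autoregressive model.

```python
import random

D_MODEL = 8        # toy hidden size of the shared LLM (hypothetical value)
UND_FEAT_DIM = 6   # toy size of continuous understanding features (hypothetical)
CODEBOOK_SIZE = 16 # toy number of discrete image codes (hypothetical)
CODE_DIM = 4       # toy size of each codebook vector (hypothetical)

rng = random.Random(0)


def make_linear(in_dim, out_dim, rng):
    """Return a toy linear map with fixed random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def apply(vec):
        return [sum(w * x for w, x in zip(row, vec)) for row in weights]

    return apply


# Understanding path: continuous patch features -> understanding adaptor.
und_adaptor = make_linear(UND_FEAT_DIM, D_MODEL, rng)

# Generation path: discrete token IDs -> codebook lookup -> generation adaptor.
codebook = [[rng.uniform(-1.0, 1.0) for _ in range(CODE_DIM)]
            for _ in range(CODEBOOK_SIZE)]
gen_adaptor = make_linear(CODE_DIM, D_MODEL, rng)


def embed_for_understanding(patch_features):
    """Map continuous features (stand-in for encoder output) to D_MODEL."""
    return [und_adaptor(f) for f in patch_features]


def embed_for_generation(token_ids):
    """Map discrete image token IDs (stand-in for VQ output) to D_MODEL."""
    return [gen_adaptor(codebook[i]) for i in token_ids]


def build_sequence(text_embeddings, image_embeddings):
    """Concatenate text and image embeddings into one input sequence."""
    return text_embeddings + image_embeddings


text = [[0.0] * D_MODEL for _ in range(2)]             # two toy text tokens
und = embed_for_understanding([[0.5] * UND_FEAT_DIM] * 3)
gen = embed_for_generation([0, 5, 15])
sequence = build_sequence(text, und)
```

Both paths end in vectors of the same width, which is why a single transformer can consume either one; the two adaptors share no weights, so each task keeps its own visual representation.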

Experimental Results
Research Questions
- RQ1: Can decoupling visual encoders for understanding and generation improve cross-task performance in a unified multimodal model?
- RQ2: How do data scaling and synthetic data affect multimodal understanding and image-generation capabilities?
- RQ3: Does increasing model size (1B vs 7B) accelerate convergence and improve benchmark performance in both understanding and generation tasks?
Key Findings
- Janus-Pro-7B achieves 79.2 on MMBench, surpassing Janus (69.4), TokenFlow (68.9), and MetaMorph (75.2).
- On GenEval, Janus-Pro-7B scores 0.80, higher than Janus (0.61), DALL-E 3 (0.67), and SD3-Medium (0.74).
- Janus-Pro-7B attains 84.19 on DPG-Bench, outperforming all other methods in the comparison.
- Both Janus-Pro-1B and Janus-Pro-7B perform strongly on multimodal understanding and generation; the 7B model gives better overall performance and follows text-to-image instructions more closely.
- Using ~72M synthetic aesthetic samples with a 1:1 real-to-synthetic data ratio improves generation stability and output quality without sacrificing understanding performance.
- Training convergence speeds and stability improve with larger LLMs (7B vs 1.5B) when using the decoupled visual encoding strategy.
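The 1:1 real-to-synthetic ratio mentioned in the findings can be illustrated with a small batching sketch. The review does not say how the mixing is actually implemented, so the per-batch half-and-half split, the function name `mixed_batches`, and the toy sample tuples below are assumptions made purely for illustration.

```python
import random


def mixed_batches(real, synthetic, batch_size, seed=0):
    """Yield batches holding equal numbers of real and synthetic samples."""
    if batch_size % 2 != 0:
        raise ValueError("batch_size must be even for an exact 1:1 split")
    rng = random.Random(seed)
    real, synthetic = list(real), list(synthetic)
    rng.shuffle(real)
    rng.shuffle(synthetic)
    half = batch_size // 2
    # Stop when either pool can no longer fill its half of a batch.
    n_batches = min(len(real), len(synthetic)) // half
    for b in range(n_batches):
        batch = (real[b * half:(b + 1) * half]
                 + synthetic[b * half:(b + 1) * half])
        rng.shuffle(batch)  # avoid a fixed real-then-synthetic order
        yield batch


# Toy pools: each sample is tagged with its source (hypothetical data).
real_pool = [("real", i) for i in range(6)]
synthetic_pool = [("synthetic", i) for i in range(6)]
batches = list(mixed_batches(real_pool, synthetic_pool, batch_size=4))
```

Enforcing the ratio inside every batch is only one option; sampling each example from either pool with probability 0.5 would give the same ratio in expectation, with more variation from batch to batch.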

This review was generated by AI and checked by a human editor.