QUICK REVIEW

[Paper Review] MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens

Kaizhi Zheng, Xuehai He | arXiv (Cornell University) | Oct 3, 2023

Multimodal Machine Learning Applications | Cited by 16

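One-Sentence Summary

MiniGPT-5 connects an LLM to Stable Diffusion through generative vokens to generate interleaved text and images, combining a two-stage, description-free training strategy with classifier-free guidance to improve multimodal output. It outperforms GILL on all reported CC3M metrics and also performs strongly on VIST and MMDialog.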

ABSTRACT

The effectiveness of Multimodal Large Language Models (MLLMs) demonstrates a profound capability in multimodal understanding. However, the simultaneous generation of images with coherent texts is still underdeveloped. Addressing this, we introduce a novel interleaved vision-and-language generation method, centered around the concept of "generative vokens". These vokens serve as pivotal elements contributing to coherent image-text outputs. Our method is marked by a unique two-stage training strategy for description-free multimodal generation, which does not necessitate extensive descriptions of images. We integrate classifier-free guidance to enhance the alignment of generated images and texts, ensuring more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvement over the baseline models on multimodal generation datasets, including MMDialog and VIST. The human evaluation shows MiniGPT-5 is better than the baseline model on more than 56% cases for multimodal generation, highlighting its efficacy across diverse benchmarks.

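Motivation and Objectives

Advance interleaved text-image generation by introducing generative vokens that bridge LLMs and text-to-image models.
Develop a two-stage, description-free training strategy that aligns multimodal features when only limited image descriptions are available.
Improve generation quality through classifier-free guidance and parameter-efficient fine-tuning.
Demonstrate strong multimodal generation performance on the CC3M, VIST, and MMDialog datasets.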

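Proposed Method

Add generative vokens to the LLM vocabulary as special tokens whose output features carry the visual information used for image generation.
Use a mapping module (an MLP and an encoder-decoder Transformer) to convert voken features into the conditioning feature space of the latent diffusion model.
Train in two stages: a unimodal alignment stage (UAS) on CC3M, followed by a multimodal learning stage (MLS) on VIST and MMDialog.
Apply classifier-free guidance (CFG) during diffusion-based image generation to strengthen adherence to the conditioning.
Fine-tune the LLM efficiently with PEFT (LoRA / prefix tuning) without overwriting the pretrained weights.
Train with a loss that combines a latent diffusion model (LDM) loss and a text-space loss across the two stages, plus an auxiliary CAP loss for caption alignment on CC3M.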
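
The classifier-free guidance step listed above can be illustrated with the standard guidance rule from the diffusion literature. This is a minimal sketch on toy vectors, not MiniGPT-5's implementation: the function name, the variable names, and the numeric values are all assumptions made for illustration, and this review does not say how the model constructs its unconditional branch or which guidance scale it uses.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Blend unconditional and conditional noise predictions.

    Standard classifier-free guidance rule:
        eps = eps_uncond + s * (eps_cond - eps_uncond)
    With s = 1 the conditional prediction is returned unchanged;
    s > 1 pushes the estimate further toward the condition.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("noise predictions must have the same length")
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values, not taken from the paper).
eps_uncond = [0.0, 1.0, -2.0]
eps_cond = [1.0, 1.0, 0.0]

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=2.0)
print(guided)
```

In a real pipeline the two predictions would come from the same denoising network, evaluated once with the conditioning features and once with a null condition, and the blended estimate would be passed to the sampler at every denoising step.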

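Experimental Results

Research Questions

RQ1: Can generative vokens enable coherent interleaved generation of text and images within a single unified multimodal model?
RQ2: Does the two-stage, description-free training strategy improve alignment between the visual and textual modalities compared with end-to-end training?
RQ3: How do classifier-free guidance and PEFT affect multimodal output quality on datasets such as VIST and MMDialog?
RQ4: How does MiniGPT-5 compare with GILL and Divter on CC3M, VIST, and MMDialog in single-turn and multi-turn settings?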

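Key Findings

On VIST single-step generation, MiniGPT-5 outperforms a fine-tuned Stable Diffusion 2 model across multiple prompt types.
On the VIST all-step evaluation, MiniGPT-5 with LoRA consistently achieves higher CLIP-I scores, along with competitive Inception Score (IS) and Fréchet Inception Distance (FID).
In human evaluation, MiniGPT-5 matches or outperforms the two-stage baseline in most cases on language coherence, image quality, and multimodal coherence.
On MMDialog, MiniGPT-5 outperforms Divter on text accuracy and MM-Relevance, with comparable image quality.
Ablation studies show that the CAP loss and CFG both contribute positively to image quality, with CFG improving diffusion denoising.
In unimodal alignment on CC3M, MiniGPT-5 surpasses GILL on all reported metrics, indicating that the generative vokens align well with Stable Diffusion.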
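
For readers unfamiliar with the CLIP-I score cited in the findings, the sketch below shows the underlying computation under one assumption: that CLIP-I denotes the cosine similarity between the CLIP image embeddings of a generated image and its reference image. The embeddings here are hypothetical two-dimensional toy vectors; real CLIP embeddings come from a pretrained image encoder and have hundreds of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical stand-ins for image embeddings (not real CLIP outputs).
generated_embedding = [3.0, 4.0]
reference_embedding = [4.0, 3.0]

score = cosine_similarity(generated_embedding, reference_embedding)
print(f"similarity: {score:.2f}")
```

A higher value means the generated image sits closer to the reference in embedding space. A dataset-level score would typically average this value over all generated and reference pairs.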

This review was generated by AI and checked by human editors.