QUICK REVIEW

[Paper Review] NExT-GPT: Any-to-Any Multimodal LLM

Shengqiong Wu, Fei Hao|arXiv (Cornell University)|Sep 11, 2023

Topic Modeling | Cited by 94

One-Sentence Summary

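NExT-GPT is an end-to-end any-to-any multimodal large language model. It connects an LLM to multimodal encoders and diffusion decoders through lightweight projection layers and adds modality-switching instruction tuning (MosIT), so that it can accept and produce text, images, video, and audio.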

ABSTRACT

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small amount of parameter (1%) of certain projection layers, which not only benefits low-cost training and also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/

Motivation and Goals

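Close the gap left by MM-LLMs that understand multimodal input but cannot produce output in multiple modalities.
Develop an end-to-end any-to-any MM-LLM that handles text, images, video, and audio on both the input and output sides.
Reuse off-the-shelf encoders and decoders to minimize training cost and make it easy to extend to further modalities.
Introduce modality-switching instruction tuning (MosIT), together with a high-quality dataset, to strengthen cross-modal reasoning and generation.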

Proposed Method

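Three-tier architecture: multimodal encoding, in which off-the-shelf encoders plus a projection layer map inputs into the language space; LLM-based understanding and reasoning; and multimodal decoding, in which diffusion decoders are conditioned on modality signals.
Inputs are mapped to language-like representations using ImageBind-based encoders (or other encoders); the encoders and decoders stay frozen, and only the input and output projection layers are trained (about 1% of the parameters).
The LLM (Vicuna 7B) emits text tokens together with modality signal tokens, which tell the decoders whether to generate content in each modality and what to generate.
Modality signals are defined as dedicated tokens (e.g. <IMG_i>, <AUD_i>, <VID_i>) whose representations are routed to the corresponding diffusion decoder to generate content.
Lightweight alignment: on the encoding side, LLM-centric multimodal alignment is trained with a captioning-style objective; on the decoding side, instruction-following alignment aligns the LLM's output with the condition encoders of the diffusion models.
MosIT data: a high-quality multimodal instruction-tuning dataset of 5K samples, built with templates and GPT-4 to cover complex cross-modal instructions and multi-turn dialogue.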
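
As a rough illustration of the signal-token routing described above, the following Python sketch works on plain strings. The token format follows the examples given in this review; the function names `split_llm_output` and `route` and the stand-in decoders are hypothetical. The real system does not parse text: it passes the representations of these tokens through trained projection layers to frozen diffusion decoders.

```python
import re
from collections import defaultdict

# Token format taken from the examples in this review: <IMG_i>, <AUD_i>, <VID_i>.
SIGNAL_RE = re.compile(r"<(IMG|AUD|VID)_(\d+)>")


def split_llm_output(text):
    """Separate plain text from modality signal tokens in an LLM output string."""
    signals = defaultdict(list)
    for modality, index in SIGNAL_RE.findall(text):
        signals[modality].append(int(index))
    plain = SIGNAL_RE.sub("", text)
    plain = " ".join(plain.split())  # collapse whitespace left by removed tokens
    return plain, dict(signals)


def route(signals, decoders):
    """Call the decoder registered for each modality that has signal tokens."""
    outputs = {}
    for modality, indices in signals.items():
        decoder = decoders.get(modality)
        if decoder is not None:
            outputs[modality] = decoder(indices)
    return outputs


# Hypothetical stand-ins; real decoders would be frozen diffusion models.
DECODERS = {
    "IMG": lambda idx: f"image conditioned on {len(idx)} signal tokens",
    "AUD": lambda idx: f"audio conditioned on {len(idx)} signal tokens",
    "VID": lambda idx: f"video conditioned on {len(idx)} signal tokens",
}

if __name__ == "__main__":
    text, signals = split_llm_output("Here is a sunset. <IMG_0> <IMG_1> <IMG_2>")
    print(text)
    print(route(signals, DECODERS))
```

If the output contains no signal tokens, `route` returns an empty dictionary and only text is produced, which mirrors the idea that the LLM itself decides whether a given modality is generated at all.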

Experimental Results

Research Questions

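RQ1: Can an LLM-centric end-to-end system understand and generate content in arbitrary combinations of text, images, video, and audio?
RQ2: Which training strategy achieves efficient cross-modal alignment with minimal parameter updates?
RQ3: Does modality-switching instruction tuning improve cross-modal reasoning and generation quality across diverse modality conversions?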

Key Findings

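NExT-GPT achieves generation quality competitive with or better than baselines on several text-to-X and X-to-text tasks. For text-to-image on COCO-caption, for example, NExT-GPT reaches an FID of 11.28, against 27.10 for CogView and 11.26 for CoDi.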
Text-to-audio on AudioCaps: NExT-GPT reaches FD 23.58 and IS 8.35, comparing well with several baselines.
Text-to-video on MSR-VTT: NExT-GPT reaches FID 13.04 and CLIPSIM 0.3085, a strong result among diffusion-based systems.
Image-to-text (captioning) on COCO-caption: NExT-GPT reaches B@4 44.3 and CIDEr 156.7, surpassing several baselines.
Audio-to-text on AudioCaps: NExT-GPT reaches CIDEr 0.802, outperforming many alternatives.
Video-to-text on MSR-VTT: NExT-GPT reaches B@4 58.4 and METEOR 38.5, indicating strong video captioning ability.
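
For readers unfamiliar with the generation metrics quoted above, FID (Fréchet Inception Distance) is commonly defined as below. This is the standard textbook definition, not something taken from the paper, and the symbols are introduced here only for illustration: $\mu_r, \Sigma_r$ are the mean and covariance of feature vectors extracted from real samples, and $\mu_g, \Sigma_g$ those of generated samples.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
```

Lower values mean the generated feature distribution is closer to the real one, which is why an FID of 11.28 counts as better than 27.10 and roughly on par with 11.26.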

This review was generated by AI and checked by a human editor.