QUICK REVIEW

[Paper Review] NExT-GPT: Any-to-Any Multimodal LLM

Shengqiong Wu, Fei Hao|arXiv (Cornell University)|Sep 11, 2023
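Topic Modeling | Citations: 94
One-Sentence Summary

NExT-GPT is an end-to-end any-to-any multimodal LLM that accepts and generates text, image, video, and audio content by combining an LLM with multimodal encoders and diffusion decoders, lightweight projection layers, and modality-switching instruction tuning (MosIT).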

ABSTRACT

While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small amount of parameter (1%) of certain projection layers, which not only benefits low-cost training and also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/

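Motivation and Objectives

  • Address the gap that existing MM-LLMs understand multimodal input but cannot produce multimodal output.
  • Develop an end-to-end any-to-any MM-LLM that handles text, images, video, and audio on both the input and output sides.
  • Leverage off-the-shelf pretrained encoders and decoders to minimize training cost and make extension to further modalities easy.
  • Introduce modality-switching instruction tuning (MosIT) together with a high-quality dataset to strengthen cross-modal reasoning and generation.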

Proposed Method

  • Three-tier architecture: multimodal encoding with off-the-shelf encoders and a projection layer to language space; LLM-based understanding and reasoning; multimodal decoding via diffusion decoders conditioned on modality signals.
  • Use ImageBind-based or other encoders to map inputs into language-like representations; keep encoders/decoders frozen and only train input/output projection layers (approximately 1% of parameters).
  • LLM (Vicuna 7B) outputs textual tokens and modality signal tokens that instruct decoders whether and what to generate in each modality.
  • Modality signals are defined as specific tokens (e.g., <IMG_i>, <AUD_i>, <VID_i>) that route representations to the corresponding diffusion decoders for content generation.
  • Lightweight alignment: encoding-side LLM-centric multimodal alignment trained with caption-like objectives; decoding-side instruction-following alignment aligns diffusion condition encoders with LLM outputs.
  • MosIT data: a 5K high-quality multimodal instruction-tuning dataset crafted with templates and GPT-4 to cover complex cross-modal instructions and multi-turn dialogues.
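
The routing role of the modality signal tokens described above can be illustrated with a minimal sketch. It is not the authors' implementation: in the real system the hidden representations of the signal tokens are projected into the conditioning space of each diffusion decoder, whereas this sketch only separates plain text from signal tokens at the string level. The token pattern follows the names quoted above (<IMG_i>, <AUD_i>, <VID_i>); the decoder labels and the function name `route_llm_output` are assumptions made for illustration.

```python
import re
from collections import defaultdict

# Assumed pattern, based only on the token names quoted in the review
# (<IMG_i>, <AUD_i>, <VID_i>); the actual vocabulary may differ.
SIGNAL_PATTERN = re.compile(r"<(IMG|AUD|VID)_(\d+)>")

# Hypothetical decoder labels, used purely as routing keys in this sketch.
DECODER_FOR = {
    "IMG": "image_diffusion",
    "AUD": "audio_diffusion",
    "VID": "video_diffusion",
}


def route_llm_output(llm_output: str):
    """Split LLM output into plain text and per-decoder signal-token indices."""
    routes = defaultdict(list)
    for modality, index in SIGNAL_PATTERN.findall(llm_output):
        routes[DECODER_FOR[modality]].append(int(index))
    # Remove the signal tokens and collapse the whitespace they leave behind.
    text = " ".join(SIGNAL_PATTERN.sub("", llm_output).split())
    return text, dict(routes)


if __name__ == "__main__":
    reply = "Here is a sunset over the sea. <IMG_0> <IMG_1> <AUD_0>"
    text, routes = route_llm_output(reply)
    print(text)
    print(routes)
```

A reply with no signal tokens yields an empty routing table, which corresponds to the text-only case where no decoder is invoked.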

Experimental Results

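Research Questions

  • RQ1: Can an LLM-centric end-to-end system understand and generate arbitrary combinations of text, images, video, and audio?
  • RQ2: Which training strategy achieves efficient cross-modal alignment with minimal parameter updates?
  • RQ3: Does modality-switching instruction tuning improve cross-modal reasoning and generation quality across diverse modality conversions?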

Key Findings

  • NExT-GPT achieves competitive or superior generation quality relative to baselines on several text-to-X and X-to-text tasks (e.g., text-to-image on COCO-caption: FID 11.28 for NExT-GPT vs. 27.10 for CogView and 11.26 for CoDi; lower is better).
  • Text-to-audio: NExT-GPT FD 23.58 and IS 8.35 on AudioCaps, comparing favorably with several baselines.
  • Text-to-video: NExT-GPT FID 13.04 and CLIPSIM 0.3085 on MSR-VTT, showing strong performance among diffusion-based systems.
  • Image-to-text (captioning) on COCO-caption: NExT-GPT B@4 44.3 and CIDEr 156.7, exceeding several baselines.
  • Audio-to-text (captioning) on AudioCaps: NExT-GPT CIDEr 0.802, outperforming many alternatives.
  • Video-to-text (captioning) on MSR-VTT: NExT-GPT B@4 58.4 and METEOR 38.5, indicating strong video captioning capability.
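
Because the metrics above point in different directions (FID and FD are lower-is-better, IS and CLIPSIM are higher-is-better), a small helper makes the text-to-image comparison easier to read. This is only an illustrative sketch using the FID values quoted above; the function name `relative_gap` and the direction table are assumptions introduced here, not part of the paper.

```python
# True means a lower value is better for that metric.
LOWER_IS_BETTER = {"FID": True, "FD": True, "IS": False, "CLIPSIM": False}


def relative_gap(metric: str, ours: float, other: float) -> float:
    """Signed gap in percent of `other`; positive means `ours` is better."""
    diff = (other - ours) if LOWER_IS_BETTER[metric] else (ours - other)
    return 100.0 * diff / other


if __name__ == "__main__":
    # Text-to-image FID on COCO-caption, values as quoted in the list above.
    print(round(relative_gap("FID", ours=11.28, other=11.26), 2))  # vs. CoDi
    print(round(relative_gap("FID", ours=11.28, other=27.10), 2))  # vs. the 27.10 baseline
```

Read this way, NExT-GPT trails CoDi by roughly 0.2% in FID, which is why the result is better described as competitive, while it improves on the 27.10 baseline by more than half.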

This review was generated by AI and checked by a human editor.