QUICK REVIEW

[Paper Review] Any-to-Any Generation via Composable Diffusion

Zineng Tang, Ziyi Yang|arXiv (Cornell University)|May 19, 2023

Multimodal Machine Learning Applications | Cited by 29

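One-Sentence Summary

CoDi introduces Composable Diffusion, a model that uses Bridging Alignment and Latent Alignment to generate any subset of output modalities from any subset of input modalities without requiring an exponential number of training objectives, enabling joint multimodal generation of text, image, video, and audio.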

ABSTRACT

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis. The project page with demonstrations and code is at https://codi-gen.github.io

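Motivation and Objectives

Motivate the need for a unified model that handles any combination of input and output modalities.
Propose a training strategy that aligns modalities in both input conditioning and diffusion-based generation.
Make any-to-any generation efficient, with a linear number of training objectives, through Bridging Alignment and Latent Alignment.
Demonstrate strong single-modality and multimodal generation quality across diverse datasets.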
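To make the scaling argument concrete, the short sketch below counts how many input-to-output pairings exist when every non-empty subset of modalities can serve as input and every non-empty subset as output. This way of counting is an illustrative assumption, not the paper's own accounting, and the function name is hypothetical.

```python
def count_any_to_any_pairs(num_modalities: int) -> int:
    """Count (input subset, output subset) pairs where both subsets are non-empty.

    Illustrative assumption: every non-empty subset of modalities may be an
    input condition, and every non-empty subset may be a generation target.
    """
    non_empty_subsets = 2 ** num_modalities - 1
    return non_empty_subsets ** 2


if __name__ == "__main__":
    # Compare the number of pairings with the number of modalities itself.
    for n in range(1, 6):
        print(f"modalities={n}  pairings={count_any_to_any_pairs(n)}")
```

With the four modalities covered here (text, image, video, audio), this counting gives 15 x 15 = 225 pairings, which is why training a dedicated objective per pairing does not scale and a number of objectives that grows linearly with the modalities is attractive.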

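Proposed Method

To secure high single-modality quality, train a separate latent diffusion model (LDM) for each modality (text, image, video, audio), in parallel.
Use Bridging Alignment to project inputs into a shared space, which aligns input conditioning across modalities and enables interpolation of modality representations.
Introduce cross-attention modules and an environment encoder V that project the latent variables of different modalities into a shared space for joint generation.
Use Latent Alignment to enable cross-attention between the modality diffusers: V(z^B_t) is fed into the cross-attention layers of modality A's UNet, and training needs only a linear number of objectives.
To preserve modularity, freeze the modality-specific LDMs and train only the cross-attention parameters and the environment encoder, which makes modality combinations unseen during training available at inference time.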
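The cross-attention step in the list above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the environment encoder V is modelled as a single linear projection, attention is single-head with no learned query, key, or value matrices, and all names, shapes, and random values are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def environment_encoder(latent_tokens, projection):
    """Hypothetical V: linearly project modality B's latent tokens into the shared space."""
    return [matvec(projection, token) for token in latent_tokens]


def cross_attention(queries, context):
    """Single-head scaled dot-product attention; context acts as both keys and values."""
    dim = len(context[0])
    outputs = []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in context]
        weights = softmax(scores)
        outputs.append([sum(w * value[i] for w, value in zip(weights, context)) for i in range(dim)])
    return outputs


rng = random.Random(0)
# Toy sizes (assumptions): 3 latent tokens of width 5 for modality B, shared width 4.
z_b_t = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(3)]
projection_v = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(4)]
# Two hidden-state tokens from modality A's UNet, already at the shared width.
hidden_a = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]

shared_context = environment_encoder(z_b_t, projection_v)  # plays the role of V(z^B_t)
attended = cross_attention(hidden_a, shared_context)
print(len(attended), len(attended[0]))
```

In this sketch only `projection_v` and the attention step would correspond to trainable parts; the code that produces `hidden_a` and `z_b_t` stands in for the frozen modality-specific LDMs.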

Figure 1: CoDi can generate various (joint) combinations of output modalities from diverse (joint) sets of inputs: video, image, audio, and text (example combinations depicted by the colored arrows).

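Experimental Results

Research Questions

RQ1: Can CoDi generate any combination of output modalities from any combination of input modalities without training on every pairing?
RQ2: Can Bridging Alignment and Latent Alignment reduce the training cost from exponential to linear in the number of modalities?
RQ3: What single-modality and multimodal generation quality does CoDi reach when producing joint outputs (e.g., video + audio) from mixed inputs?
RQ4: How well aligned are jointly generated modalities in terms of temporal and semantic consistency?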

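Key Findings

CoDi supports single-to-single, multi-condition, and joint multimodal generation across text, image, video, and audio.
Bridging Alignment uses text as the bridge to align prompt encoders across modalities, enabling zero-shot multi-conditioning.
Latent Alignment with cross-attention enables joint generation by interpolating aligned latent representations and reduces the training objectives to a number that is linear in the tasks.
CoDi matches or exceeds the single-modality state of the art in several settings and remains competitive in the others.
Jointly generated modalities show better consistency and cross-modal alignment than independently generated ones.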
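Once prompt encoders share one embedding space, conditioning on several inputs at once can be pictured as a weighted interpolation of their embeddings. The sketch below shows that idea only; the function name, the two-dimensional toy embeddings, and the weights are hypothetical, and a real system would use the output of the aligned prompt encoders instead.

```python
def interpolate_conditions(embeddings, weights):
    """Weighted average of condition embeddings that live in the same aligned space.

    embeddings: list of equal-length float lists, one per input modality.
    weights: one non-negative float per embedding; normalised inside the function.
    """
    if len(embeddings) != len(weights):
        raise ValueError("need exactly one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    return [
        sum(w * emb[i] for w, emb in zip(weights, embeddings)) / total
        for i in range(dim)
    ]


# Toy stand-ins for an aligned image embedding and an aligned audio embedding.
image_embedding = [1.0, 0.0]
audio_embedding = [0.0, 1.0]
mixed_condition = interpolate_conditions([image_embedding, audio_embedding], [1.0, 3.0])
print(mixed_condition)
```

Because the weights are normalised, raising one weight shifts the mixed condition toward that modality without changing its overall scale.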

Figure 2: CoDi model architecture: (a) We first train individual diffusion model with aligned prompt encoder by “Bridging Alignment”; (b) Diffusion models learn to attend with each other via “Latent Alignment”; (c) CoDi achieves any-to-any generation with a linear number of training objectives.

This review was generated by AI and checked by a human editor.