QUICK REVIEW

[Paper Review] Any-to-Any Generation via Composable Diffusion

Zineng Tang, Ziyi Yang|arXiv (Cornell University)|May 19, 2023

Multimodal Machine Learning Applications | Cited by 29

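One-Sentence Summary

CoDi introduces Composable Diffusion, a model that generates any subset of output modalities from any subset of input modalities. It uses bridging alignment and latent alignment to achieve joint multimodal generation without multiplying the number of training objectives.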

ABSTRACT

We present Composable Diffusion (CoDi), a novel generative model capable of generating any combination of output modalities, such as language, image, video, or audio, from any combination of input modalities. Unlike existing generative AI systems, CoDi can generate multiple modalities in parallel and its input is not limited to a subset of modalities like text or image. Despite the absence of training datasets for many combinations of modalities, we propose to align modalities in both the input and output space. This allows CoDi to freely condition on any input combination and generate any group of modalities, even if they are not present in the training data. CoDi employs a novel composable generation strategy which involves building a shared multimodal space by bridging alignment in the diffusion process, enabling the synchronized generation of intertwined modalities, such as temporally aligned video and audio. Highly customizable and flexible, CoDi achieves strong joint-modality generation quality, and outperforms or is on par with the unimodal state-of-the-art for single-modality synthesis. The project page with demonstrations and code is at https://codi-gen.github.io

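Research Motivation and Goals

Motivate the need for a unified model that handles arbitrary combinations of input and output modalities.
Propose a training strategy that aligns modalities both in the input conditioning and in the diffusion-based generation.
Keep the number of training objectives linear through bridging alignment and latent alignment, making any-to-any generation more efficient.
Demonstrate strong single-modality and multimodal generation quality across diverse datasets.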

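Proposed Method

Train a separate latent diffusion model (LDM) for each modality (text, image, video, audio) in parallel to ensure high-quality single-modality output.
Use bridging alignment to project input conditions into a shared space, so that conditions from different modalities are aligned and their representations can be interpolated.
Introduce cross-attention modules and an environment encoder V that projects latent variables from different modalities into a shared space for joint generation.
Use latent alignment for cross-modal attention between diffusers: V(z^B_t) is fed into the cross-attention layers of modality A's UNet, which keeps the number of training objectives linear.
Freeze the modality-specific LDMs and train only the cross-attention parameters and environment encoders, which preserves modularity and lets the model handle unseen modality combinations at inference time.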
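The latent-alignment step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not CoDi's implementation: `environment_encoder` is a hypothetical linear stand-in for V, and `cross_attend` is simplified single-head attention in which the projected modality-B tokens serve as both keys and values (a real UNet cross-attention layer has learned query, key, and value projections). All function names and shapes are invented for illustration.

```python
import math


def matvec(weight, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def environment_encoder(latent_tokens, weight):
    """Hypothetical stand-in for V: project modality-B latents into the shared space."""
    return [matvec(weight, token) for token in latent_tokens]


def cross_attend(queries_a, context_b):
    """Each modality-A query attends over shared-space tokens from modality B.

    Simplification (assumption): context tokens act as both keys and values.
    """
    dim = len(context_b[0])
    attended = []
    for query in queries_a:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in context_b
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, context_b))
             for j in range(dim)]
        )
    return attended


# Toy usage: two modality-A query tokens (dim 2), two modality-B latents (dim 3).
projection = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # maps dim 3 -> shared dim 2
z_b = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
z_a_queries = [[0.5, -0.5], [3.0, 1.0]]
shared_b = environment_encoder(z_b, projection)
print(cross_attend(z_a_queries, shared_b))
```

Because only the projection and the attention parameters sit between the two diffusers, freezing the per-modality LDMs and training just these pieces is what keeps the design modular in this sketch.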

Figure 1: CoDi can generate various (joint) combinations of output modalities from diverse (joint) sets of inputs: video, image, audio, and text (example combinations depicted by the colored arrows).

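Experimental Results

Research Questions

RQ1: Can CoDi generate any combination of output modalities from any combination of inputs without additional training on every pairwise combination?
RQ2: How do bridging alignment and latent alignment reduce training complexity from exponential to linear in the number of modalities?
RQ3: How good is single-modality and multimodal generation quality when generating joint outputs (e.g., video + audio) from mixed inputs?
RQ4: How well aligned are jointly generated modalities in terms of temporal consistency and semantic coherence?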
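To give a feel for the scaling behind RQ2, the sketch below counts how many input-output modality combinations exist versus how many pairings are needed when one modality acts as a bridge. The counting convention is an assumption made for illustration (every non-empty input subset paired with every non-empty output subset, versus one pairing per non-bridge modality); this review does not state exactly how the paper counts its objectives.

```python
def nonempty_subsets(n_modalities: int) -> int:
    """Number of non-empty subsets of a set of n modalities."""
    return 2 ** n_modalities - 1


def naive_io_combinations(n_modalities: int) -> int:
    """Every non-empty input subset paired with every non-empty output subset."""
    return nonempty_subsets(n_modalities) ** 2


def bridge_pairings(n_modalities: int) -> int:
    """Each non-bridge modality is paired once with the bridge modality."""
    return n_modalities - 1


for n in (2, 3, 4, 5):
    print(n, nonempty_subsets(n), naive_io_combinations(n), bridge_pairings(n))
```

For four modalities (text, image, video, audio), this convention gives 15 non-empty subsets, 225 input-output combinations, and 3 bridge pairings.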

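Key Findings

CoDi supports single-to-single, multi-condition, and joint multimodal generation across text, image, video, and audio.
Bridging alignment uses text as the bridge to align prompt encoders across modalities, which enables zero-shot multi-condition generation.
Latent alignment combined with cross-attention achieves joint generation by interpolating aligned latent representations, reducing the training objectives to a linear number of tasks.
CoDi matches or exceeds the single-modality state of the art in several settings and is competitive in the others.
Jointly generated modalities show better coherence and cross-modal consistency than modalities generated independently.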
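Once prompt encoders share one embedding space, conditioning on several inputs at once can be as simple as a weighted average of their embeddings. The sketch below shows that idea under stated assumptions: `interpolate_conditions` is a hypothetical helper, the embeddings are toy vectors rather than real encoder outputs, and the weight normalization is a choice made here for illustration.

```python
def interpolate_conditions(embeddings, weights):
    """Weighted average of condition embeddings that live in one shared space.

    embeddings: list of equal-length float vectors, one per input modality.
    weights: one non-negative weight per embedding; normalized to sum to 1.
    """
    if not embeddings or len(embeddings) != len(weights):
        raise ValueError("need one weight per embedding")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(embeddings[0])
    return [
        sum((w / total) * emb[j] for w, emb in zip(weights, embeddings))
        for j in range(dim)
    ]


# Toy usage: blend a made-up "text" embedding with a made-up "audio" embedding.
text_emb = [1.0, 0.0]
audio_emb = [0.0, 1.0]
print(interpolate_conditions([text_emb, audio_emb], [3.0, 1.0]))
```

The blended vector can then be passed to a diffuser exactly where a single-modality condition would go, which is why no extra training is needed for the combination in this sketch.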

Figure 2: CoDi model architecture: (a) We first train individual diffusion model with aligned prompt encoder by “Bridging Alignment”; (b) Diffusion models learn to attend with each other via “Latent Alignment”; (c) CoDi achieves any-to-any generation with a linear number of training objectives.
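For readers who want a concrete form for the "Bridging Alignment" step in panel (a), a standard InfoNCE-style contrastive objective between two prompt encoders is shown below. This is a sketch under assumptions: the review does not give the loss, and the symbols $h_A^i$, $h_B^i$ (embeddings of the $i$-th paired sample from the encoders for modalities A and B) and the temperature $\tau$ are notation introduced here, so the paper's exact formulation may differ.

```latex
\mathcal{L}_{A,B} = -\log
  \frac{\exp\!\left( {h_A^i}^{\top} h_B^i / \tau \right)}
       {\exp\!\left( {h_A^i}^{\top} h_B^i / \tau \right)
        + \sum_{j \neq i} \exp\!\left( {h_A^i}^{\top} h_B^j / \tau \right)}
```

In practice such a loss is usually applied symmetrically, as $\mathcal{L}_{A,B} + \mathcal{L}_{B,A}$, with text as one of the two modalities when it serves as the bridge.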

This review was generated by AI and reviewed by human editors.