QUICK REVIEW

[Paper Review] Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models

Rongjie Huang, Jiawei Huang|arXiv (Cornell University)|Jan 30, 2023

Music and Audio Processing | Cited by 46

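One-Sentence Summary

Make-An-Audio combines prompt-enhanced diffusion, a spectrogram autoencoder, and CLAP representations to achieve state-of-the-art text-to-audio generation, and extends to X-to-Audio generation from multiple input modalities.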

ABSTRACT

Large-scale multimodal generative modeling has created milestones in text-to-image and text-to-video generation. Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data. In this work, we propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps by 1) introducing pseudo prompt enhancement with a distill-then-reprogram approach, it alleviates data scarcity with orders of magnitude concept compositions by using language-free audios; 2) leveraging spectrogram autoencoder to predict the self-supervised audio representation instead of waveforms. Together with robust contrastive language-audio pretraining (CLAP) representations, Make-An-Audio achieves state-of-the-art results in both objective and subjective benchmark evaluation. Moreover, we present its controllability and generalization for X-to-Audio with "No Modality Left Behind", for the first time unlocking the ability to generate high-definition, high-fidelity audios given a user-defined modality input. Audio samples are available at https://Text-to-Audio.github.io

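Motivation and Objectives

Address the scarcity of text-audio data by introducing pseudo prompt enhancement with a distill-then-reprogram approach.
Model long continuous audio efficiently by using a spectrogram autoencoder to work in a latent space of self-supervised representations rather than on waveforms.
Use contrastive language-audio pretraining (CLAP) together with a latent diffusion model to achieve faithful text-audio alignment and high-fidelity generation.
Within the "No Modality Left Behind" framework, demonstrate X-to-Audio generalization to user-defined modalities (text, audio, image, video).
Evaluate objective metrics (FID, KL, CLAP) and subjective MOS scores on a standard benchmark and in a zero-shot setting.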
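The summary does not spell out how the CLAP metric is computed. A score of this kind is commonly the cosine similarity between a text embedding and an audio embedding from a shared contrastive space, and the sketch below assumes that form. The function name and the toy vectors are hypothetical; no real CLAP encoder is called here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for a text embedding and an audio embedding (assumed values).
text_embedding = [3.0, 4.0]
audio_embedding = [4.0, 3.0]
score = cosine_similarity(text_embedding, audio_embedding)
```

Under this assumption, a higher score means the generated audio sits closer to its prompt in the shared embedding space.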

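Proposed Method

Apply latent diffusion in a spectrogram latent space, conditioned on text representations.
Use a spectrogram autoencoder to compress audio into the latent space in which diffusion generation takes place.
Introduce pseudo prompt enhancement through expert distillation and dynamic reprogramming, creating diverse, language-aligned prompts from language-free audio data.
Adopt classifier-free guidance to balance fidelity against diversity.
Use CLAP-based text-audio alignment to improve faithfulness to the text prompt.
Convert the generated mel-spectrogram to a waveform with a neural vocoder (HiFi-GAN).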
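The classifier-free guidance step listed above can be sketched in a few lines. This follows one common formulation, in which the guided noise prediction extrapolates from the unconditional prediction toward the text-conditioned one. The function name, the `guidance_scale` parameter, and the toy values are assumptions for illustration, not taken from the paper's code.

```python
def classifier_free_guidance(eps_cond, eps_uncond, guidance_scale):
    """Blend conditional and unconditional noise predictions.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction,
    larger values push further toward the text condition.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]

# Toy values standing in for denoiser outputs on a flattened spectrogram latent.
eps_text = [1.0, 2.0]   # prediction given the text prompt (assumed)
eps_empty = [0.0, 1.0]  # prediction given an empty prompt (assumed)
guided = classifier_free_guidance(eps_text, eps_empty, guidance_scale=3.0)
```

A larger scale tends to favor prompt fidelity and a smaller one favors diversity, which is the trade-off the method list refers to.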

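Experimental Results

Research Questions

RQ1: Can pseudo prompt enhancement exploit large-scale, language-free data to improve text-to-audio generation?
RQ2: Does diffusion over a spectrogram latent representation outperform waveform-based approaches in fidelity and text-audio alignment?
RQ3: Can Make-An-Audio generalize beyond text-to-audio to X-to-Audio with user-defined modalities?
RQ4: How do different text representations (CLAP versus LM-based encoders) affect text-to-audio synthesis performance?
RQ5: How well do audio inpainting and personalized text-to-audio operations perform under diffusion-based generation?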

Key Findings

Model	Text condition	Params	FID	KL	CLAP	MOS-Q	MOS-F	FID-Z	KL-Z
Reference	/	/	/	/	0.526	74.7 ± 0.94	80.5 ± 1.84	/	/
Diffsound	CLIP	520M	7.17	3.57	0.420	67.1 ± 1.03	70.9 ± 1.05	24.97	6.53
Make-An-Audio	CLAP	332M	4.61	2.79	0.482	72.5 ± 0.90	78.6 ± 1.01	17.38	6.98
BERT	BERT	809M	5.15	2.89	0.480	70.5 ± 0.87	77.2 ± 0.98	18.75	7.01
T5-Large	T5-Large	563M	4.83	2.81	0.486	71.8 ± 0.91	77.2 ± 0.93	17.23	7.02
CLIP	CLIP	576M	6.45	2.91	0.444	72.1 ± 0.92	75.4 ± 0.96	17.55	7.09
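To put the table in perspective, the short calculation below computes the relative change from Diffsound to Make-An-Audio on the three objective metrics. The numbers are copied from the table; the dictionary layout and the helper name are illustrative assumptions.

```python
# Objective metrics copied from the results table.
diffsound = {"FID": 7.17, "KL": 3.57, "CLAP": 0.420}
make_an_audio = {"FID": 4.61, "KL": 2.79, "CLAP": 0.482}

def relative_change(baseline, value):
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0

changes = {
    metric: relative_change(diffsound[metric], make_an_audio[metric])
    for metric in diffsound
}

for metric, pct in changes.items():
    print(f"{metric}: {pct:+.1f}%")
```

FID and KL are lower-is-better, so negative changes there are improvements, while CLAP is higher-is-better.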

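Make-An-Audio achieves state-of-the-art text-to-audio results on AudioCaption, with FID 4.61, KL 2.79, and CLAP 0.482.
It outperforms the Diffsound baseline on the objective metrics (FID, KL, CLAP) and on subjective MOS, reaching MOS-Q 72.5 and MOS-F 78.6.
CLAP-based text conditioning performs strongly: it yields the highest MOS-Q and MOS-F among the evaluated models, and its CLAP score (0.482) is close to the best in the table (0.486, T5-Large).
The model generalizes zero-shot to Clotho and, within the "No Modality Left Behind" framework, extends to X-to-Audio generation (text, audio, image, video).
Pseudo prompt enhancement (distill-then-reprogram) substantially alleviates data scarcity and enables concept compositions that span diverse audio domains.
Diffusion over the spectrogram autoencoder's latent space enables efficient modeling of long audio while preserving high-level semantic fidelity.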


This review was generated by AI and checked by a human editor.