QUICK REVIEW

[Paper Digest] Simple and Controllable Music Generation

Jade Copet, Felix Kreuk|arXiv (Cornell University)|Jun 8, 2023

Music and Audio Processing | Cited by 64

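One-Sentence Summary

MusicGen is a single-stage autoregressive transformer that generates high-quality mono and stereo music from interleaved streams of EnCodec tokens, conditioned on text or melody, and outperforms the evaluated baselines on MusicCaps.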

ABSTRACT

We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchically or upsampling. Following this approach, we demonstrate how MusicGen can generate high-quality samples, both mono and stereo, while being conditioned on textual description or melodic features, allowing better controls over the generated output. We conduct extensive empirical evaluation, considering both automatic and human studies, showing the proposed approach is superior to the evaluated baselines on a standard text-to-music benchmark. Through ablation studies, we shed light over the importance of each of the components comprising MusicGen. Music samples, code, and models are available at https://github.com/facebookresearch/audiocraft

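Motivation and Objectives

Motivate conditional music generation and the need for controllable, high-fidelity output.
Propose a simple single-stage language model that operates over several streams of discrete audio tokens.
Introduce codebook interleaving patterns to model the parallel token streams efficiently.
Enable text and melody conditioning to make generation more controllable.
Demonstrate a stereo extension at no additional compute cost, and evaluate the approach extensively.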

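Proposed Method

Tokenize audio with EnCodec, which represents each time step as several discrete codebook entries.
Train a single autoregressive transformer over the interleaved codebook streams, using pattern-based parallelization.
Introduce codebook interleaving patterns (exact and inexact autoregressive decompositions) to control the autoregressive dependencies.
Condition generation on text encodings (T5, FLAN-T5, or CLAP) or on unsupervised melody features (a chromagram passed through an information bottleneck).
Extend to stereo by processing the left and right channels with adapted interleaving patterns.
Run ablations on codebook patterns, model size, and conditioning strategy.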
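
The interleaving idea above can be illustrated with a small sketch. It assumes K codebook streams of T time steps each and contrasts two patterns named in this digest: flattening (one token per decoding step, so K*T steps) and a delay pattern (codebook k shifted right by k steps, so T + K - 1 steps). The function names and the `PAD` placeholder are hypothetical and chosen for illustration; this is not the audiocraft implementation.

```python
PAD = None  # hypothetical placeholder for "no token at this position"

def delay_interleave(codes):
    """Shift codebook k right by k steps.

    codes: list of K lists, each of length T.
    Returns K lists of length T + K - 1; column s holds the tokens
    that would be predicted in parallel at decoding step s.
    """
    K, T = len(codes), len(codes[0])
    out = [[PAD] * (T + K - 1) for _ in range(K)]
    for k in range(K):
        out[k][k:k + T] = codes[k]
    return out

def delay_deinterleave(delayed, T):
    """Undo delay_interleave by removing the per-codebook shift."""
    return [row[k:k + T] for k, row in enumerate(delayed)]

def flatten_interleave(codes):
    """Emit all K codebooks of frame 0, then frame 1, and so on."""
    K, T = len(codes), len(codes[0])
    return [codes[k][t] for t in range(T) for k in range(K)]

# Toy example: K = 4 codebooks, T = 3 time steps.
codes = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
delayed = delay_interleave(codes)
flat = flatten_interleave(codes)

print("flatten steps:", len(flat))       # K * T = 12
print("delay steps:", len(delayed[0]))   # T + K - 1 = 6
```

In this toy setting the delay pattern needs half as many decoding steps as flattening, and the gap widens as T grows, which is the cost trade-off the ablations examine.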

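Experimental Results

Research Questions

RQ1: Can a single-stage transformer over interleaved audio tokens match or exceed multi-stage baselines for text-to-music generation?
RQ2: How do different codebook interleaving patterns affect generation quality and controllability?
RQ3: Does melody (chromagram) conditioning improve alignment with harmonic structure without sacrificing quality?
RQ4: Can the model be extended to stereo generation without increasing compute?
RQ5: Which text encoders and conditioning strategies best support high-quality, controllable music generation?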

Key Findings

Model	FAD_vgg↓	KL↓	CLAP_scr↑	Ovl↑	Rel↑
Riffusion	14.8	2.06	0.19	79.31 ± 1.37	74.20 ± 2.17
Mousai	7.5	1.59	0.23	76.11 ± 1.56	77.35 ± 1.72
MusicLM	4.0	-	-	80.51 ± 1.07	82.35 ± 1.36
Noise2Music	2.1	-	-	-	-
MusicGen w.o melody (300M)	3.1	1.28	0.31	78.43 ± 1.30	81.11 ± 1.31
MusicGen w.o melody (1.5B)	3.4	1.23	0.32	80.74 ± 1.17	83.70 ± 1.21
MusicGen w.o melody (3.3B)	3.8	1.22	0.31	84.81 ± 0.95	82.47 ± 1.25
MusicGen w. random melody (1.5B)	5.0	1.31	0.28	81.30 ± 1.29	81.98 ± 1.79
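
As a reading aid, the sketch below transcribes the table into Python and reports the best entry per column among the systems that report that metric. The values are copied from the table; treating "-" as "not reported" (`None`) and the helper name `best` are assumptions made for this illustration.

```python
# Columns: FAD_vgg (lower is better), KL (lower), CLAP_scr (higher),
# Ovl (higher), Rel (higher). None marks a "-" cell in the table.
COLUMNS = ["FAD", "KL", "CLAP", "Ovl", "Rel"]
LOWER_IS_BETTER = {"FAD", "KL"}

ROWS = {
    "Riffusion": (14.8, 2.06, 0.19, 79.31, 74.20),
    "Mousai": (7.5, 1.59, 0.23, 76.11, 77.35),
    "MusicLM": (4.0, None, None, 80.51, 82.35),
    "Noise2Music": (2.1, None, None, None, None),
    "MusicGen w.o melody (300M)": (3.1, 1.28, 0.31, 78.43, 81.11),
    "MusicGen w.o melody (1.5B)": (3.4, 1.23, 0.32, 80.74, 83.70),
    "MusicGen w.o melody (3.3B)": (3.8, 1.22, 0.31, 84.81, 82.47),
    "MusicGen w. random melody (1.5B)": (5.0, 1.31, 0.28, 81.30, 81.98),
}

def best(column):
    """Return the model with the best reported value in the given column."""
    idx = COLUMNS.index(column)
    reported = {m: v[idx] for m, v in ROWS.items() if v[idx] is not None}
    pick = min if column in LOWER_IS_BETTER else max
    return pick(reported, key=reported.get)

for col in COLUMNS:
    print(col, "->", best(col))
```

The comparison uses the mean ratings only; the ± intervals on Ovl and Rel are ignored here, so close scores should not be over-interpreted.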

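On MusicCaps, MusicGen outperforms the baselines that have human ratings (Riffusion, Mousai, MusicLM) on subjective quality and text relevance. Noise2Music reports only FAD (2.1), which is lower than any MusicGen variant in the table.
Melody conditioning via chromagrams improves melodic consistency: alignment with the reference melody increases when chroma conditioning is used at both training and test time.
The stereo extension produces high-quality stereo samples, with small differences between interleaving patterns; quality is retained when the output is down-mixed to mono.
The codebook interleaving pattern matters: the flattening pattern gives better objective metrics but at higher computational cost, while the delay-based pattern delivers strong performance at lower cost.
Model scale has mixed effects: larger models are reported to improve objective metrics and to capture the text prompt better, with subjective quality generally best at 1.5B. In the table above, KL falls from 1.28 to 1.22 as size grows while FAD rises from 3.1 to 3.8, and the highest Ovl (84.81) belongs to the 3.3B model while the highest Rel (83.70) belongs to the 1.5B model.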
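
The chromagram bottleneck used for melody conditioning can be sketched as follows. The digest only states that the chromagram is passed through a bottleneck; the sketch assumes this means keeping the single dominant pitch class per time step, which discards most of the timbre and dynamics information. The function name `chroma_bottleneck` and the toy input are hypothetical.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F",
                 "F#", "G", "G#", "A", "A#", "B"]

def chroma_bottleneck(chroma):
    """Keep only the dominant pitch class in each frame.

    chroma: list of frames, each a list of 12 non-negative energies.
    Returns (one_hot_frames, dominant_indices). Ties go to the
    lowest pitch-class index.
    """
    one_hot, dominant = [], []
    for frame in chroma:
        idx = max(range(len(frame)), key=frame.__getitem__)
        dominant.append(idx)
        one_hot.append([1.0 if i == idx else 0.0 for i in range(len(frame))])
    return one_hot, dominant

# Toy chromagram: frame 0 is dominated by C, frame 1 by G.
toy = [
    [0.9, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.4, 0.0, 0.0, 0.0, 0.0],
    [0.2, 0.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.3],
]
one_hot, dominant = chroma_bottleneck(toy)
print([PITCH_CLASSES[i] for i in dominant])
```

In practice the input frames would come from a chromagram extractor applied to the reference audio; the quantized sequence then serves as the melody condition alongside or instead of the text encoding.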

This digest was generated by AI and reviewed by a human editor.