QUICK REVIEW

[Paper Review] LakhNES: Improving multi-instrumental music generation with cross-domain pre-training

Chris Donahue, Huanru Henry Mao | arXiv (Cornell University) | Jul 10, 2019

Music and Audio Processing | Cited by 48

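One-sentence summary

LakhNES adapts Transformer-XL to multi-instrumental symbolic music generation. It improves performance by pre-training on the heterogeneous Lakh MIDI dataset, mapped onto an NES-style four-voice ensemble, and then fine-tuning on NES-MDB.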

ABSTRACT

We are interested in the task of generating multi-instrumental music scores. The Transformer architecture has recently shown great promise for the task of piano score generation; here we adapt it to the multi-instrumental setting. Transformers are complex, high-dimensional language models which are capable of capturing long-term structure in sequence data, but require large amounts of data to fit. Their success on piano score generation is partially explained by the large volumes of symbolic data readily available for that domain. We leverage the recently-introduced NES-MDB dataset of four-instrument scores from an early video game sound synthesis chip (the NES), which we find to be well-suited to training with the Transformer architecture. To further improve the performance of our model, we propose a pre-training technique to leverage the information in a large collection of heterogeneous music, namely the Lakh MIDI dataset. Despite differences between the two corpora, we find that this transfer learning procedure improves both quantitative and qualitative performance for our primary task.

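Motivation and goals

Extend Transformer-based symbolic music generation to a fixed four-instrument NES-style ensemble in which the voices combine into polyphonic music.
Introduce an event-based representation for NES-MDB that captures musically significant changes across instruments.
Explore cross-domain pre-training as a way to improve generation quality: map Lakh MIDI onto the NES-style ensemble, then fine-tune on NES-MDB.
Evaluate the benefits of pre-training and data augmentation both quantitatively (perplexity) and qualitatively (human studies).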

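Proposed method

Adopt Transformer-XL as the backbone network to model long-range dependencies in event-based NES-MDB sequences.
Use an event-based representation with 631 event types, including time-shift events and instrument-specific note events.
Map Lakh MIDI onto the NES ensemble to create a large cross-domain pre-training corpus, then fine-tune on NES-MDB.
Apply data augmentation (transposition, tempo variation, instrument dropout/shuffling) to improve generalization.
Evaluate with test-set perplexity, plus a Turing-test-style user study and a preference study to assess how human listeners judge the output.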
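
The event-based representation listed above can be illustrated with a small sketch. It uses a toy vocabulary, not the paper's actual 631 event types: the token names (`P1_NOTEON_60`, `WT_4`), the instrument labels, the note tuple layout, and the `max_shift` cap are all assumptions made here, chosen only to show how time-shift events and instrument-specific note events could interleave in one sequence.

```python
from typing import List, Tuple

# (start_tick, end_tick, instrument, pitch); layout and labels are illustrative assumptions.
Note = Tuple[int, int, str, int]


def encode_events(notes: List[Note], max_shift: int = 100) -> List[str]:
    """Turn timed notes into a flat token sequence of note and time-shift events."""
    # Collect (tick, order, token); note-offs sort before note-ons at the same tick.
    timed = []
    for start, end, inst, pitch in notes:
        timed.append((start, 1, f"{inst}_NOTEON_{pitch}"))
        timed.append((end, 0, f"{inst}_NOTEOFF"))
    timed.sort()

    tokens: List[str] = []
    now = 0
    for tick, _, token in timed:
        gap = tick - now
        # Emit time-shift tokens, splitting long gaps into chunks of at most max_shift.
        while gap > 0:
            step = min(gap, max_shift)
            tokens.append(f"WT_{step}")
            gap -= step
        tokens.append(token)
        now = tick
    return tokens


example = [
    (0, 4, "P1", 60),  # melody note
    (0, 8, "TR", 36),  # bass note held twice as long
    (4, 8, "P1", 62),  # second melody note
]
print(encode_events(example))
```

Because every instrument gets its own note-on and note-off tokens, a single flat sequence can describe all four voices, and a language model such as Transformer-XL can be trained on it with ordinary next-token prediction.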
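
The data augmentations named above (transposition, tempo variation, instrument dropout) can be sketched on a simple note list. This is a minimal illustration under assumed parameters, not the authors' pipeline: the note tuple layout, the function name `augment`, the default ranges, and the choice to leave an unpitched `NO` voice untransposed are all hypothetical. Instrument shuffling is omitted for brevity.

```python
import random
from typing import List, Tuple

# (start_tick, end_tick, instrument, pitch); an assumed layout for illustration.
Note = Tuple[int, int, str, int]


def augment(notes: List[Note], rng: random.Random,
            max_semitones: int = 6, tempo_range: float = 0.05,
            dropout_prob: float = 0.1,
            unpitched: Tuple[str, ...] = ("NO",)) -> List[Note]:
    """Return a randomly transposed, time-scaled copy with some instruments dropped."""
    shift = rng.randint(-max_semitones, max_semitones)    # one transposition per piece
    scale = 1.0 + rng.uniform(-tempo_range, tempo_range)  # one tempo factor per piece
    instruments = sorted({inst for _, _, inst, _ in notes})
    dropped = {inst for inst in instruments if rng.random() < dropout_prob}

    out: List[Note] = []
    for start, end, inst, pitch in notes:
        if inst in dropped:
            continue
        # Assumption: unpitched (noise) events are not transposed.
        new_pitch = pitch if inst in unpitched else pitch + shift
        new_start = round(start * scale)
        # Keep every note at least one tick long after rescaling.
        new_end = max(new_start + 1, round(end * scale))
        out.append((new_start, new_end, inst, new_pitch))
    return out


piece = [(0, 4, "P1", 60), (4, 8, "P1", 62), (0, 8, "NO", 3)]
print(augment(piece, random.Random(0)))
```

Drawing one shift and one tempo factor per piece keeps the intervals and relative rhythm intact, so each augmented copy is still a plausible version of the same music.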

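Experimental results

Research questions

RQ1: Can Transformer-XL effectively model long-range structure in multi-instrumental symbolic music for an NES-like ensemble?
RQ2: Does pre-training on a large, heterogeneous MIDI corpus mapped onto the NES ensemble improve generation quality on NES-MDB?
RQ3: How does data augmentation affect model performance and human perception of the generated music?
RQ4: How does LakhNES compare with n-gram and LSTM baselines on objective metrics and in human evaluation?
RQ5: Is the event-based representation suitable for cross-domain transfer learning in symbolic music?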

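Main findings

Transformer-XL reaches a much lower test perplexity (PPL 3.50) than the 5-gram (37.25) and LSTM (14.11) baselines.
Data augmentation improves the LSTM and Transformer-XL by roughly 10% and 22%, respectively.
Pre-training on Lakh MIDI mapped onto the NES ensemble and then fine-tuning on NES-MDB (LakhNES) improves perplexity by about a further 10% over data augmentation alone (PPL 2.46 after fine-tuning).
More Lakh MIDI pre-training epochs before fine-tuning lower perplexity further, but with diminishing returns (1, 2, and 4 epochs were explored).
In the Turing-test-style user study, LakhNES output tends to be judged human-composed more often than baseline output, and it also does better than Transformer-XL without pre-training, although real data still scores higher.
In controlled comparisons, LakhNES is preferred over competing methods, but human raters still prefer real data.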
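
To make the perplexity comparisons above concrete, the following sketch shows how perplexity relates to per-token log-likelihood and how a relative reduction between two reported values can be computed. The helper names are hypothetical and the toy probabilities are made up for illustration; only the PPL figures 37.25, 14.11, and 3.50 are taken from the findings above.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """Perplexity is the exponential of the mean negative log-likelihood per token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def relative_reduction(baseline_ppl: float, model_ppl: float) -> float:
    """Fraction by which model_ppl is lower than baseline_ppl."""
    return (baseline_ppl - model_ppl) / baseline_ppl


# Toy check: a model that gives every token probability 0.25 has perplexity of about 4.
print(perplexity([0.25] * 8))

# Relative reductions between the test perplexities quoted above.
print(f"Transformer-XL vs 5-gram: {relative_reduction(37.25, 3.50):.1%}")
print(f"Transformer-XL vs LSTM:   {relative_reduction(14.11, 3.50):.1%}")
```

As a point of reference, a model that guessed uniformly over all 631 event types would have a perplexity of 631, so a value such as 3.50 means the model is, on average, about as uncertain as a choice among three or four equally likely events.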

This review was generated by AI and reviewed by a human editor.