QUICK REVIEW

[Paper Review] Learning to Groove with Inverse Sequence Transformations

Jon Gillick, Adam P. Roberts|arXiv (Cornell University)|May 14, 2019

Teaching and Learning Programming | Cited by 40

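One-Sentence Summary

The paper develops Seq2Seq and Variational Information Bottleneck models that translate drum scores into expressive performances, and introduces the Groove MIDI Dataset along with the Humanization, Infilling, and Tap2Drum tasks.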

ABSTRACT

We explore models for translating abstract musical ideas (scores, rhythms) into expressive performances using Seq2Seq and recurrent Variational Information Bottleneck (VIB) models. Though Seq2Seq models usually require painstakingly aligned corpora, we show that it is possible to adapt an approach from the Generative Adversarial Network (GAN) literature (e.g. Pix2Pix (Isola et al., 2017) and Vid2Vid (Wang et al. 2018a)) to sequences, creating large volumes of paired data by performing simple transformations and training generative models to plausibly invert these transformations. Music, and drumming in particular, provides a strong test case for this approach because many common transformations (quantization, removing voices) have clear semantics, and models for learning to invert them have real-world applications. Focusing on the case of drum set players, we create and release a new dataset for this purpose, containing over 13 hours of recordings by professional drummers aligned with fine-grained timing and dynamics information. We also explore some of the creative potential of these models, including demonstrating improvements on state-of-the-art methods for Humanization (instantiating a performance from a musical score).
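The abstract's idea of creating paired data by applying a simple transformation and training a model to invert it can be illustrated with quantization. The sketch below is not the authors' code: the function names, the plain-list data layout, and the single-instrument simplification are assumptions made for illustration. It snaps each onset to the nearest sixteenth-note step and keeps the removed timing deviation as the training target.

```python
import math

def quantize_performance(onsets_sec, velocities, bpm, steps_per_beat=4):
    """Split a performance into grid steps plus the expressive detail removed.

    Returns (steps, offsets, velocities). Each offset is the deviation from
    the nearest grid step, measured in units of one step.
    Assumption: beats are quarter notes, so 4 steps per beat = sixteenth notes.
    """
    step_sec = 60.0 / bpm / steps_per_beat
    steps, offsets = [], []
    for t in onsets_sec:
        pos = t / step_sec                   # position in fractional steps
        step = int(math.floor(pos + 0.5))    # nearest grid step
        steps.append(step)
        offsets.append(pos - step)           # what quantization throws away
    return steps, offsets, list(velocities)

def make_training_pair(onsets_sec, velocities, bpm):
    """Build one (input, target) pair: quantized score in, performance out."""
    steps, offsets, vels = quantize_performance(onsets_sec, velocities, bpm)
    model_input = steps                          # timing and dynamics removed
    target = list(zip(steps, offsets, vels))     # full expressive detail
    return model_input, target
```

Because the transformation is deterministic, every recorded performance yields a perfectly aligned pair without manual annotation, which is the property the abstract relies on.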

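Research Motivation and Goals

Create a large, aligned dataset of drum performances with precise timing and dynamics information (the Groove MIDI Dataset) to support modeling of expressive performance.
Develop and evaluate models that translate simplified drum representations into realistic performances (Humanization).
Introduce and study two additional tasks, Drum Infilling and Tap2Drum, that give users more accessible control over drum performances.
Propose the GrooVAE family of models for expressive sequence generation and analyze their perceptual quality.
Advance methods for learning inverse sequence transformations on sequential data such as music.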
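The Infilling and Tap2Drum tasks listed above can each be read as a different input transformation for the model to invert. The following is a minimal sketch under stated assumptions: the step-by-instrument list layout, the function names, and the choice to average the offsets of simultaneous hits are all hypothetical and not taken from the paper.

```python
def drop_voice(hits, velocities, offsets, voice):
    """Infilling-style input: silence one instrument column in every matrix."""
    def clear(matrix):
        return [[0 if j == voice else x for j, x in enumerate(row)]
                for row in matrix]
    return clear(hits), clear(velocities), clear(offsets)

def collapse_to_taps(hits, offsets):
    """Tap2Drum-style input: one channel marking the steps where anything hit.

    Assumption: when several instruments hit on the same step, the tap offset
    is the mean of their offsets. Velocities are discarded.
    """
    taps, tap_offsets = [], []
    for hit_row, offset_row in zip(hits, offsets):
        played = [o for h, o in zip(hit_row, offset_row) if h]
        taps.append(1 if played else 0)
        tap_offsets.append(sum(played) / len(played) if played else 0.0)
    return taps, tap_offsets
```

In both cases the untransformed performance serves as the target, so the same recordings can supply training pairs for several tasks.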

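Proposed Method

Adapt the Seq2Seq architecture to map compressed musical representations to detailed drum performances (hits, offsets, velocities).
Represent timing with continuous Gaussian offsets and velocities, predicting H, V, and O (hits, velocities, and offsets) at sixteenth-note resolution.
Train with teacher forcing, using a multi-component loss made up of hit prediction, velocity error, and offset error (the loss L_t in the paper).
Introduce a Groove Transfer variant that disentangles groove (playing style) from score content to enable style transfer.
Apply a Variational Information Bottleneck (VIB) to the embedding to balance realism against control (ELBO with beta = 0.2).
Provide baselines (Quantized, Linear Regression, KNN) and several neural models (MLP, Seq2Seq, Groove Transfer) for comparison.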
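The multi-component loss and the beta-weighted bottleneck described above can be sketched as follows. This is an illustration rather than the paper's exact L_t: the equal weighting of the three terms, the masking of velocity and offset errors to steps where a hit occurs, and all function names are assumptions. Only beta = 0.2 comes from the summary.

```python
import math

def step_loss(hit_logits, pred_vel, pred_off, hits, vels, offs):
    """Loss for one time step, summed over instruments.

    Hit term: binary cross-entropy on the hit logits.
    Velocity and offset terms: squared error, counted only where a hit
    actually occurs (an assumption of this sketch).
    """
    total = 0.0
    for z, pv, po, h, v, o in zip(hit_logits, pred_vel, pred_off,
                                  hits, vels, offs):
        p = 1.0 / (1.0 + math.exp(-z))
        p = min(max(p, 1e-7), 1.0 - 1e-7)    # keep log() finite
        total += -(h * math.log(p) + (1 - h) * math.log(1.0 - p))
        total += h * ((pv - v) ** 2 + (po - o) ** 2)
    return total

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def vib_objective(reconstruction_loss, mu, log_var, beta=0.2):
    """Negative ELBO with the KL term scaled by beta."""
    return reconstruction_loss + beta * kl_to_standard_normal(mu, log_var)
```

A beta below 1 weakens the pull toward the prior, so the embedding can carry more information about the input at the cost of a less regular latent space, which is one way to read the trade-off between realism and control.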

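Experimental Results

Research Questions

RQ1: Can the inverse sequence transformation approach generate realistic drum performances from simplified representations?
RQ2: Do Seq2Seq and VIB-based models outperform traditional baselines on Humanization of drum scores?
RQ3: How do the models perform on the Infilling and Tap2Drum tasks, and are their outputs perceptually competitive with real data?
RQ4: Can Groove Transfer achieve effective style transfer for drum performances without sacrificing realism?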

Key Findings

Model	MAE (ms)	MSE (16th note)	Timing KL	Velocity KL
Baseline	22.6 [22.45–22.72]	0.041 [0.041–0.042]	N/A	N/A
Linear	19.77 [19.63–19.88]	0.033 [0.033–0.034]	4.79 [4.68–4.88]	1.70 [1.66–1.74]
KNN	22.34 [22.19–22.45]	0.043 [0.042–0.0438]	1.10 [1.07–1.12]	0.53 [0.51–0.56]
MLP	19.25 [19.13–19.40]	0.032 [0.031–0.032]	7.62 [7.44–7.80]	2.22 [2.16–2.29]
Seq2Seq	18.80 [18.67–18.90]	0.032 [0.031–0.032]	0.31 [0.31–0.33]	0.08 [0.08–0.09]
Seq2Seq + VIB	18.47 [18.37–18.60]	0.028 [0.028–0.029]	2.80 [2.72–2.86]	0.22 [0.21–0.23]
Groove Transfer	25.04 [24.82–25.28]	0.052 [0.051–0.053]	0.24 [0.23–0.25]	0.12 [0.12–0.13]
Groove Transfer + VIB	24.49 [24.25–24.72]	0.051 [0.049–0.052]	0.27 [0.26–0.28]	0.20 [0.19–0.20]
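The table reports one error in milliseconds and another in sixteenth-note units, and the two are linked by tempo. The sketch below shows that conversion as plain arithmetic. It assumes quarter-note beats and a single fixed tempo per sequence, and it is not the paper's evaluation code; the function names are hypothetical.

```python
def sixteenth_note_ms(bpm):
    """Duration of one sixteenth note in milliseconds (quarter-note beats)."""
    return 60_000.0 / bpm / 4

def timing_mae_ms(pred_offsets, true_offsets, bpm):
    """Mean absolute timing error in ms for offsets given in sixteenth-note units."""
    step_ms = sixteenth_note_ms(bpm)
    errors = [abs(p - t) * step_ms for p, t in zip(pred_offsets, true_offsets)]
    return sum(errors) / len(errors)
```

At 120 bpm one sixteenth note lasts 125 ms, so an offset error of 0.15 of a step corresponds to 18.75 ms.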

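Among the models tested, Seq2Seq with VIB is reported as the strongest on Humanization both perceptually and on the error metrics, though plain Seq2Seq and Groove Transfer have lower KL values in the table above.
In pairwise listening tests, listeners preferred Seq2Seq with VIB over the KNN baseline and judged its outputs competitive with real data.
Quantitatively, Seq2Seq + VIB reaches an MAE of 18.47 ms and an MSE of 0.028 (16th note), the lowest values of any model in the table.
Groove Transfer provides meaningful control over groove and style, but generally trails Seq2Seq in timing accuracy.
In some cases, Infilling produces outputs that listeners rate as more human-sounding than real data, suggesting potential use as a corrective tool.
Tap2Drum outputs are preferred slightly less often than real data, but remain usable for control-based improvisation.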

This review was generated by AI and reviewed by a human editor.