QUICK REVIEW

[Paper Review] A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music

Adam P. Roberts, Jesse Engel|arXiv (Cornell University)|Mar 13, 2018

Music and Audio Processing|References: 42|Cited by: 257

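One-Sentence Summary

This paper proposes MusicVAE, a hierarchical latent variable model whose hierarchical decoder effectively models the long-term structure of musical sequences, outperforming flat-decoder VAEs on reconstruction, interpolation, and attribute manipulation.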

ABSTRACT

The Variational Autoencoder (VAE) has proven to be an effective model for producing semantically meaningful latent representations for natural data. However, it has thus far seen limited application to sequential data, and, as we demonstrate, existing recurrent VAE models have difficulty modeling sequences with long-term structure. To address this issue, we propose the use of a hierarchical decoder, which first outputs embeddings for subsequences of the input and then uses these embeddings to generate each subsequence independently. This structure encourages the model to utilize its latent code, thereby avoiding the "posterior collapse" problem, which remains an issue for recurrent VAEs. We apply this architecture to modeling sequences of musical notes and find that it exhibits dramatically better sampling, interpolation, and reconstruction performance than a "flat" baseline model. An implementation of our "MusicVAE" is available online at http://g.co/magenta/musicvae-code.

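Motivation and Objectives

Explain why variational autoencoders perform poorly on long sequential data, and describe the posterior collapse problem in recurrent VAEs.
Propose a hierarchical decoder that encourages use of the latent variable and captures long-range structure in music.
Demonstrate improvements over a flat decoder in reconstruction, interpolation, and attribute manipulation of musical sequences.
Show the benefits of multi-stream (multi-instrument) modeling on music data.
Validate the approach with quantitative and qualitative evaluation on a large MIDI dataset.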

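Proposed Method

A bidirectional LSTM encoder maps the entire sequence to a single latent vector z.
A hierarchical decoder is introduced: a conductor RNN first outputs an embedding for each subsequence, and each embedding then initializes a bottom-level decoder RNN for that subsequence.
The input sequence is divided into U non-overlapping subsequences, and the decoder is constrained so that long-range context must flow through the conductor embeddings.
The model extends to multi-stream (trio) modeling, with separate per-instrument decoders driven by the same conductor embeddings.
Training uses the standard VAE objective; the hierarchical decoder mitigates posterior collapse, and scheduled sampling is applied for longer sequences.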
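The data flow described in this section can be sketched in a few lines of plain Python. This is only an illustration of the structure, not the paper's implementation: `toy_rnn_step` is a hypothetical stand-in for an LSTM cell, the output layer is an arbitrary deterministic mapping, and all function names, weights, and sizes are assumptions made for the example. What it does show is that each subsequence is generated from its conductor embedding alone, so anything shared across subsequences has to come from z through the conductor.

```python
import math


def split_into_subsequences(seq, num_segments):
    """Split seq into num_segments non-overlapping, equal-length subsequences."""
    if len(seq) % num_segments != 0:
        raise ValueError("sequence length must be divisible by num_segments")
    seg_len = len(seq) // num_segments
    return [seq[i * seg_len:(i + 1) * seg_len] for i in range(num_segments)]


def toy_rnn_step(state, inp, weight):
    """Hypothetical stand-in for an RNN cell (NOT an LSTM): elementwise tanh update."""
    return [math.tanh(weight * s + x) for s, x in zip(state, inp)]


def conductor(z, num_segments):
    """Conductor: state initialised from z, emits one embedding per subsequence."""
    state = list(z)
    zero_input = [0.0] * len(z)
    embeddings = []
    for _ in range(num_segments):
        state = toy_rnn_step(state, zero_input, weight=0.9)
        embeddings.append(list(state))
    return embeddings


def decode_subsequence(embedding, seg_len, vocab_size):
    """Bottom-level decoder: its state is initialised from one conductor embedding only."""
    state = list(embedding)
    prev = [0.0] * len(embedding)
    tokens = []
    for _ in range(seg_len):
        state = toy_rnn_step(state, prev, weight=0.5)
        # Toy output layer (assumption): deterministic mapping from state to a token id.
        token = int(abs(sum(state)) * 1000) % vocab_size
        tokens.append(token)
        # Feed the previous output back in, as an autoregressive decoder would.
        prev = [token / vocab_size] * len(state)
    return tokens


def hierarchical_decode(z, num_segments, seg_len, vocab_size):
    """z -> conductor embeddings -> independently decoded subsequences, concatenated."""
    output = []
    for embedding in conductor(z, num_segments):
        output.extend(decode_subsequence(embedding, seg_len, vocab_size))
    return output


if __name__ == "__main__":
    z = [0.1, -0.3, 0.7]  # toy latent vector
    print(hierarchical_decode(z, num_segments=4, seg_len=8, vocab_size=16))
```

Because `decode_subsequence` never sees the state left over from the previous subsequence, the bottom-level decoder cannot carry long-range context on its own, which is the constraint the method relies on.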

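Experimental Results

Research Questions

RQ1: Can a hierarchical decoder prevent posterior collapse and model long musical sequences better than a flat decoder?
RQ2: Does the hierarchical MusicVAE produce coherent long sequences (16 bars and beyond) more readily than the flat baseline when reconstructing, interpolating, and generating music?
RQ3: How does multi-stream modeling (melody, bass, drums) help in learning the structure of musical sequences?
RQ4: Are latent-space operations (interpolation and attribute vectors) meaningful and musically coherent for music data?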

Key Findings

Model	Flat (teacher forcing)	Hierarchical (teacher forcing)	Flat (sampling)	Hierarchical (sampling)
2-bar Drum	0.979	-	0.917	-
2-bar Melody	0.986	-	0.951	-
16-bar Melody	0.883	0.919	0.620	0.812
16-bar Drum	0.884	0.928	0.549	0.879
Trio (Melody)	0.796	0.848	0.579	0.753
Trio (Bass)	0.829	0.880	0.565	0.773
Trio (Drums)	0.903	0.912	0.641	0.863
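The gap between teacher-forced and sampled accuracy can be read directly off the table above. The short script below does that arithmetic for the rows where both decoders are reported; the numbers are copied from the table, and the variable and function names are illustrative only.

```python
# Accuracies copied from the table above, in column order:
# (flat teacher-forced, hierarchical teacher-forced, flat sampled, hierarchical sampled)
ACCURACY = {
    "16-bar Melody": (0.883, 0.919, 0.620, 0.812),
    "16-bar Drum": (0.884, 0.928, 0.549, 0.879),
    "Trio (Melody)": (0.796, 0.848, 0.579, 0.753),
    "Trio (Bass)": (0.829, 0.880, 0.565, 0.773),
    "Trio (Drums)": (0.903, 0.912, 0.641, 0.863),
}


def sampling_gaps(table):
    """Return {row: (flat gap, hierarchical gap)}, where gap = teacher-forced - sampled."""
    gaps = {}
    for name, (flat_tf, hier_tf, flat_s, hier_s) in table.items():
        gaps[name] = (round(flat_tf - flat_s, 3), round(hier_tf - hier_s, 3))
    return gaps


if __name__ == "__main__":
    for name, (flat_gap, hier_gap) in sampling_gaps(ACCURACY).items():
        print(f"{name:14s} flat gap = {flat_gap:.3f}  hierarchical gap = {hier_gap:.3f}")
```

For the 16-bar melody row, for example, the flat decoder loses 0.263 when switching from teacher forcing to sampling, while the hierarchical decoder loses 0.107.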

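The hierarchical MusicVAE reconstructs longer sequences (16-bar melodies and drum patterns, and multi-stream data) markedly more accurately than the flat decoder.
Latent-space interpolation with the hierarchical model yields smoother, more coherent melodic transitions than data-space interpolation or the flat model.
Attribute vector arithmetic in the latent space produces predictable musical changes (such as note density and syncopation) and allows controlled manipulation of examples.
A listening study shows that samples from the hierarchical model were rated as more musical than those from the flat baseline on the melody, trio, and drum tasks.
The hierarchical model narrows the gap between teacher-forced and sampled reconstruction, indicating better use of the latent code and reduced exposure bias.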
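The two latent-space operations mentioned in these findings can be illustrated on plain vectors. The sketch below rests on assumptions: it uses simple linear interpolation (the paper may use a different scheme), and it builds an attribute vector as the difference between the mean latent codes of examples with and without an attribute, which is a common construction rather than something this review specifies. In practice each resulting vector would be passed to a trained decoder; here only the vector arithmetic is shown.

```python
def lerp(z_a, z_b, alpha):
    """Linear interpolation between two latent vectors (alpha=0 -> z_a, alpha=1 -> z_b)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def interpolate(z_a, z_b, steps):
    """Return `steps` evenly spaced latent vectors from z_a to z_b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(z_a, z_b, i / (steps - 1)) for i in range(steps)]


def mean_vector(vectors):
    """Elementwise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attribute_vector(z_with, z_without):
    """Assumed construction: mean latent with the attribute minus mean latent without it."""
    m_with = mean_vector(z_with)
    m_without = mean_vector(z_without)
    return [a - b for a, b in zip(m_with, m_without)]


def apply_attribute(z, attr, strength=1.0):
    """Shift a latent vector along an attribute direction."""
    return [zi + strength * ai for zi, ai in zip(z, attr)]


if __name__ == "__main__":
    path = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
    print(path)
    dense = [[1.0, 1.0], [3.0, 3.0]]   # toy latents of "high note density" examples
    sparse = [[0.0, 0.0], [0.0, 2.0]]  # toy latents of "low note density" examples
    direction = attribute_vector(dense, sparse)
    print(apply_attribute([0.0, 0.0], direction, strength=0.5))
```

The `strength` parameter is a hypothetical knob for how far to move along the attribute direction; negative values would move away from the attribute.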


This review was generated by AI and reviewed by a human editor.