QUICK REVIEW

[Paper Review] The Jazz Transformer on the Front Line: Exploring the Shortcomings of AI-composed Music through Quantitative Measures

Shih-Lun Wu, Yi‐Hsuan Yang|arXiv (Cornell University)|Aug 4, 2020

Music and Audio Processing | References: 35 | Cited by: 38

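ONE-SENTENCE SUMMARY

TL;DR: The Jazz Transformer applies Transformer-XL to lead sheets from the Weimar Jazz Database (WJazzD), incorporates structural events to guide generation, and uses new objective metrics together with a subjective listening study to assess its shortcomings, revealing the gap between its output and human compositions.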

ABSTRACT

This paper presents the Jazz Transformer, a generative model that utilizes a neural sequence model called the Transformer-XL for modeling lead sheets of Jazz music. Moreover, the model endeavors to incorporate structural events present in the Weimar Jazz Database (WJazzD) for inducing structures in the generated music. While we are able to reduce the training loss to a low value, our listening test suggests however a clear gap between the average ratings of the generated and real compositions. We therefore go one step further and conduct a series of computational analysis of the generated compositions from different perspectives. This includes analyzing the statistics of the pitch class, grooving, and chord progression, assessing the structureness of the music with the help of the fitness scape plot, and evaluating the model's understanding of Jazz music through a MIREX-like continuation prediction task. Our work presents in an analytical manner why machine-generated music to date still falls short of the artwork of humanity, and sets some goals for future work on automatic composition to further pursue.

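RESEARCH MOTIVATION AND GOALS

Use WJazzD, a complex jazz-specific dataset, to study AI-based jazz composition beyond surface-level quality.
Model melody, harmony, and structural events jointly with a Transformer.
Evaluate the generated music with a subjective listening test combined with a set of objective metrics, in order to pinpoint failure modes.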

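PROPOSED METHOD

Build the Jazz Transformer on Transformer-XL to handle long-range context.
Represent music as a sequence of event tokens covering notes, chords, rhythm, and the WJazzD structural events (Phrase, mid-level unit (MLU), Part, Repetition).
Decompose each chord into Chord-Tone, Chord-Type, and Chord-Slash tokens to reduce token sparsity.
Train two variants: Model A (without structural events) and Model B (with the full set of structural events).
Quantize note durations to multiples of a 64th note to capture the short note values common in jazz.
Augment the training data by transposing the solos.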
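
The preprocessing steps listed above (64th-note quantization, chord decomposition, transposition) can be illustrated with a minimal Python sketch. It is not the authors' code: the token spellings, the function names, and the reading of Chord-Slash as the bass note are assumptions made only for illustration.

```python
from typing import List, Tuple

# Assumption: pitch classes are spelled with sharps; the paper's vocabulary may differ.
PITCH_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# A 64th note is 1/16 of a quarter note, so one quarter note spans 16 grid steps.
STEPS_PER_QUARTER = 16


def quantize_duration(quarter_lengths: float) -> int:
    """Round a duration (in quarter notes) to a whole number of 64th notes, at least 1."""
    return max(1, round(quarter_lengths * STEPS_PER_QUARTER))


def chord_tokens(root_pc: int, chord_type: str, bass_pc: int) -> List[str]:
    """Split one chord into three tokens instead of one token per full chord symbol.

    Hypothetical token format; Chord-Slash is assumed here to carry the bass note.
    """
    return [
        f"Chord-Tone_{PITCH_NAMES[root_pc % 12]}",
        f"Chord-Type_{chord_type}",
        f"Chord-Slash_{PITCH_NAMES[bass_pc % 12]}",
    ]


def transpose(notes: List[Tuple[int, float]], semitones: int) -> List[Tuple[int, float]]:
    """Shift every MIDI pitch by the same interval; durations are left unchanged."""
    return [(pitch + semitones, duration) for pitch, duration in notes]
```

Splitting a chord this way keeps the vocabulary additive (roots plus types plus bass notes) rather than multiplicative, which is the sparsity argument made in the list above.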

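EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a Transformer-based model learn to generate jazz lead sheets that are coherent in melody, harmony, and structure?
RQ2: Do structure-related events improve the quality and scalability of AI-generated jazz?
RQ3: Which objective metrics best expose the limitations of AI-composed jazz relative to human works?
RQ4: How does model performance evolve during training, and does overfitting appear once the loss falls below some threshold?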

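Key Findings

The subjective listening test shows that AI-generated jazz clearly trails human compositions in overall quality and sense of structure.
Model B (with structural events) is usually the closest to real data on several short-term metrics, but its performance degrades if the training loss is driven too low.
Objective metrics show that the AI compositions are unstable in their pitch usage and weak in repetition, especially over longer time scales.
Grooving pattern similarity indicates rhythmic inconsistency in the machine-generated pieces.
Structureness indicators show that the AI compositions lack the long-term repetitive structure found in real jazz, although structural events help with short-term coherence.
Accuracy on the MIREX-like continuation prediction task peaks at a loss of about 0.25, suggesting that the model learns best at that point, before overfitting sets in.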
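
To make the pitch and groove findings more concrete, here is a minimal Python sketch of two measures of this kind: the entropy of a pitch-class histogram and a bar-to-bar grooving similarity. The exact definitions, window sizes, and grid resolution used in the paper are not given in this summary, so the formulas below (Shannon entropy over 12 pitch classes, and one minus the fraction of mismatched onset positions) are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def pitch_class_entropy(midi_pitches: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the pitch-class histogram of a group of notes.

    Low entropy suggests a clear tonal focus; high entropy suggests scattered pitch usage.
    """
    if not midi_pitches:
        return 0.0
    counts = Counter(pitch % 12 for pitch in midi_pitches)
    total = len(midi_pitches)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def grooving_similarity(bar_a: Sequence[int], bar_b: Sequence[int]) -> float:
    """Similarity of two binary onset vectors, one entry per grid position in a bar.

    Returns 1.0 when onsets fall on exactly the same positions and 0.0 when
    every position disagrees.
    """
    if len(bar_a) != len(bar_b) or len(bar_a) == 0:
        raise ValueError("onset vectors must be non-empty and of equal length")
    mismatches = sum(a != b for a, b in zip(bar_a, bar_b))
    return 1.0 - mismatches / len(bar_a)
```

For example, two bars whose onset vectors are `[1, 0, 0, 0]` and `[1, 0, 1, 0]` differ at one of four positions, so their similarity under this sketch is 0.75. Averaging such scores over pairs of bars gives one number per piece that can be compared between generated and real music.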

This review was generated by AI and checked by a human editor.