QUICK REVIEW

[Paper Review] Automatic tagging using deep convolutional neural networks

Keunwoo Choi, George Fazekas|arXiv (Cornell University)|Jun 1, 2016

Multimodal Machine Learning Applications|References: 18|Cited by: 221

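One-sentence summary

This paper proposes fully convolutional networks (FCNs) built from 2D convolutions for content-based automatic music tagging, shows that mel-spectrogram input achieves state-of-the-art results, and finds that deeper models benefit from more training data.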

ABSTRACT

We present a content-based automatic music tagging algorithm using fully convolutional neural networks (FCNs). We evaluate different architectures consisting of 2D convolutional layers and subsampling layers only. In the experiments, we measure the AUC-ROC scores of the architectures with different complexities and input types using the MagnaTagATune dataset, where a 4-layer architecture shows state-of-the-art performance with mel-spectrogram input. Furthermore, we evaluated the performances of the architectures with varying the number of layers on a larger dataset (Million Song Dataset), and found that deeper models outperformed the 4-layer architecture. The experiments show that mel-spectrogram is an effective time-frequency representation for automatic tagging and that more complex models benefit from more training data.

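Research motivation and goals

Show that a fully convolutional network, with no fully connected layers, can perform multi-label music tagging.
Evaluate the input representations (mel-spectrogram, short-time Fourier transform (STFT), MFCC) on the tagging task.
Evaluate the effect of model depth (3–7 layers) across datasets.
Show that mel-spectrograms outperform the other representations for automatic tagging.
Investigate how training-data size affects the gains from deeper models.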

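Proposed method

Use fully convolutional networks of 3–7 convolutional layers with max-pooling to produce a 50-dimensional tag vector.
Input representations include mel-spectrogram, STFT, and MFCC; mel-spectrogram is preferred for tagging.
Train with sigmoid outputs and a binary cross-entropy loss to handle multi-label data.
Apply batch normalization and dropout to improve convergence and reduce overfitting.
Evaluate the architectures on MagnaTagATune (50 tags) and the Million Song Dataset (top 50 tags), with AUC-ROC as the metric.
Use 2D convolutions to capture local time-frequency structure, with nonlinear aggregation along time over the whole clip.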
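Two points of the method above can be made concrete with a minimal sketch: how pooling alone can collapse a spectrogram to a single vector without any fully connected layer, and how sigmoid outputs with binary cross-entropy treat each tag independently. The input size (96 x 1366), the four pool sizes, and the toy logits and targets are illustrative assumptions, not values stated in this review; convolutions are assumed to preserve the spatial shape (same padding).

```python
import math

def pooled_shape(shape, pool_sizes):
    """Shape of a (freq, time) map after a stack of non-overlapping max-pools.

    Convolutions are assumed to use 'same' padding, so only pooling
    changes the spatial size.
    """
    freq, time = shape
    for pool_f, pool_t in pool_sizes:
        freq, time = freq // pool_f, time // pool_t
    return freq, time

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binary_cross_entropy(logits, targets, eps=1e-7):
    """Mean binary cross-entropy over independent tag outputs (multi-label)."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = min(max(sigmoid(z), eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(logits)

# Hypothetical input and pooling sizes, chosen so the map collapses to 1 x 1.
ASSUMED_INPUT = (96, 1366)                         # (mel bins, time frames)
ASSUMED_POOLS = [(2, 4), (4, 5), (3, 8), (4, 8)]   # one pool per conv layer
final_shape = pooled_shape(ASSUMED_INPUT, ASSUMED_POOLS)

# Toy example with 3 tags instead of 50: several tags can be active at once.
toy_logits = [2.0, -1.0, 0.0]
toy_targets = [1, 0, 1]
loss = binary_cross_entropy(toy_logits, toy_targets)

print("final feature-map shape:", final_shape)
print("toy multi-label loss:", round(loss, 4))
```

Once the feature map is 1 x 1, its channel dimension can serve directly as the input to the 50 sigmoid tag outputs, which is one way to read "fully convolutional" here.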

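Experimental results

Research questions

RQ1: How do FCN-based architectures perform on automatic music tagging with different input representations?
RQ2: Does greater network depth improve tagging performance, and does this depend on the amount of training data?
RQ3: For FCN-based automatic tagging, does mel-spectrogram input outperform STFT or MFCC?
RQ4: How does model depth interact with dataset size (MagnaTagATune vs. MSD) in multi-label tagging performance?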

Key findings

Dataset	Architecture	Input	AUC
MagnaTagATune	FCN-3	mel-spectrogram	0.852
MagnaTagATune	FCN-4	mel-spectrogram	0.894
MagnaTagATune	FCN-5	mel-spectrogram	0.890
MagnaTagATune	FCN-4	STFT	0.846
MagnaTagATune	FCN-4	MFCC	0.862
MSD	FCN-3	mel-spectrogram	0.786
MSD	FCN-4	mel-spectrogram	0.808
MSD	FCN-5	mel-spectrogram	0.848
MSD	FCN-6	mel-spectrogram	0.851
MSD	FCN-7	mel-spectrogram	0.845
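As a reading aid, the table can be loaded into a small script that picks the best configuration per dataset. The AUC values are copied from the table; assigning the first five rows to MagnaTagATune and the last five to the Million Song Dataset is an inference from the findings listed in this review, and the helper name `best_per_dataset` is hypothetical.

```python
# (dataset, architecture, input, AUC) rows copied from the table above.
# Dataset labels are inferred from the accompanying findings.
RESULTS = [
    ("MagnaTagATune", "FCN-3", "mel-spectrogram", 0.852),
    ("MagnaTagATune", "FCN-4", "mel-spectrogram", 0.894),
    ("MagnaTagATune", "FCN-5", "mel-spectrogram", 0.890),
    ("MagnaTagATune", "FCN-4", "STFT", 0.846),
    ("MagnaTagATune", "FCN-4", "MFCC", 0.862),
    ("MSD", "FCN-3", "mel-spectrogram", 0.786),
    ("MSD", "FCN-4", "mel-spectrogram", 0.808),
    ("MSD", "FCN-5", "mel-spectrogram", 0.848),
    ("MSD", "FCN-6", "mel-spectrogram", 0.851),
    ("MSD", "FCN-7", "mel-spectrogram", 0.845),
]

def best_per_dataset(rows):
    """Return {dataset: (architecture, input, auc)} for the highest AUC."""
    best = {}
    for dataset, arch, inp, auc in rows:
        if dataset not in best or auc > best[dataset][2]:
            best[dataset] = (arch, inp, auc)
    return best

best = best_per_dataset(RESULTS)
for dataset, (arch, inp, auc) in best.items():
    print(f"{dataset}: {arch} with {inp} input, AUC {auc:.3f}")
```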

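On MagnaTagATune, FCN-4 with mel-spectrogram input reaches an AUC of 0.894, surpassing several prior methods.
Mel-spectrogram input outperforms both STFT and MFCC input on this task (0.894 vs. 0.846 and 0.862 with FCN-4).
On MagnaTagATune, the deeper FCN-5 does not improve on FCN-4 (0.890 vs. 0.894), suggesting that extra depth brings no gain when data is limited.
On MSD, the deeper models (FCN-5, FCN-6, FCN-7) clearly outperform FCN-4 (0.845–0.851 vs. 0.808), indicating that larger datasets favor deeper networks.
FCN-6 performs best on MSD with an AUC of 0.851; FCN-7 is slightly behind at 0.845.
Overall, deeper models benefit from more training data, and the mel-spectrogram is an effective time-frequency representation for automatic tagging.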
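The AUC figures above are areas under the ROC curve. For a single tag, this equals the probability that a randomly chosen positive clip is scored higher than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the scores and labels are made-up toy values, and how per-tag AUCs are aggregated into one number is not specified in this review.

```python
def auc_roc(scores, labels):
    """AUC-ROC for one tag via pairwise ranking (ties count as half a win)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy predictions for one tag over four clips (hypothetical values).
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_auc = auc_roc(toy_scores, toy_labels)
print("toy AUC:", toy_auc)
```

The pairwise form is quadratic in the number of clips, so it is suited to illustration rather than to evaluating a full test set.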

This review was generated by AI and reviewed by a human editor.