QUICK REVIEW

[Paper Review] Music Genre Classification using Machine Learning Techniques

Hareesh Bahuleyan|arXiv (Cornell University)|Apr 3, 2018

Music and Audio Processing | References: 23 | Cited by: 47

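One-Sentence Summary

This paper compares CNN-based spectrogram classification with traditional hand-crafted features for music genre classification on the AudioSet dataset, and shows that an ensemble of VGG-16 transfer learning and XGBoost achieves the best AUC of 0.894.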

ABSTRACT

Categorizing music files according to their genre is a challenging task in the area of music information retrieval (MIR). In this study, we compare the performance of two classes of models. The first is a deep learning approach wherein a CNN model is trained end-to-end, to predict the genre label of an audio signal, solely using its spectrogram. The second approach utilizes hand-crafted features, both from the time domain and the frequency domain. We train four traditional machine learning classifiers with these features and compare their performance. The features that contribute the most towards this multi-class classification task are identified. The experiments are conducted on the Audio set data set and we report an AUC value of 0.894 for an ensemble classifier which combines the two proposed approaches.

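Motivation and Objectives

Enable automatic music genre tagging for large music libraries and streaming services.
Compare an end-to-end CNN approach that uses spectrograms against traditional feature-based classifiers.
Identify which features contribute most to genre classification.
Evaluate performance on the AudioSet dataset and analyze feature importance.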

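Proposed Method

Convert audio to mel spectrograms and feed them to a VGG-16-based CNN, trained with either transfer learning or fine-tuning.
Extract hand-crafted time-domain and frequency-domain features with librosa and train traditional classifiers (LR, RF, SVM, XGB).
Train a baseline feed-forward neural network on the flattened spectrograms.
Regularize the neural networks with L2 regularization and dropout to mitigate overfitting.
Evaluate with accuracy, F-score, and AUC, using a 90/5/5 train/validation/test split.
Ensemble the best CNN (VGG-16 TL) with the best feature-based model (XGB) by averaging their predicted probabilities.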
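
The probability-averaging ensemble in the last step can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function names `average_probabilities` and `predict_labels` are hypothetical, the probability values are made-up placeholders, and equal weighting of the two models is an assumption based only on the word "averaging" in the summary.

```python
def average_probabilities(prob_a, prob_b):
    """Element-wise mean of two models' per-sample class probabilities."""
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same samples")
    averaged = []
    for row_a, row_b in zip(prob_a, prob_b):
        if len(row_a) != len(row_b):
            raise ValueError("both models must use the same class order")
        averaged.append([(a + b) / 2.0 for a, b in zip(row_a, row_b)])
    return averaged


def predict_labels(probabilities):
    """Index of the highest-probability class for each sample."""
    return [max(range(len(row)), key=row.__getitem__) for row in probabilities]


# Placeholder outputs for two samples and three classes (not real results).
cnn_probs = [[0.75, 0.25, 0.0], [0.25, 0.5, 0.25]]   # stands in for VGG-16 TL
xgb_probs = [[0.25, 0.5, 0.25], [0.0, 0.25, 0.75]]   # stands in for XGB

ensemble_probs = average_probabilities(cnn_probs, xgb_probs)
ensemble_labels = predict_labels(ensemble_probs)
```

In this toy input the CNN alone would label the second sample as class 1, while the averaged probabilities favor class 2, which shows how a confident second model can change the ensemble's decision.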

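Experimental Results

Research Questions

RQ1: Can a spectrogram-based CNN outperform traditional feature-based classifiers at genre classification?
RQ2: Which hand-crafted features contribute most to music genre classification performance?
RQ3: Does ensembling the CNN-based and feature-based models improve overall performance on AudioSet?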

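Key Findings

Model	Accuracy	F-score	AUC
VGG-16 CNN (transfer learning)	0.63	0.61	0.891
VGG-16 CNN (fine-tuning)	0.64	0.61	0.889
Feed-forward NN baseline	0.43	0.33	0.759
Logistic Regression (LR)	0.53	0.47	0.822
Random Forest (RF)	0.54	0.48	0.840
Support Vector Machine (SVM)	0.57	0.52	0.856
Extreme Gradient Boosting (XGB)	0.59	0.55	0.865
VGG-16 CNN + XGB ensemble	0.65	0.62	0.894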
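
For readers unfamiliar with the three metrics in the table, the sketch below shows one common way to compute them for a multi-class problem. The summary does not say how the F-score and AUC were averaged across genres, so the macro (unweighted per-class) averaging and one-vs-rest AUC used here are assumptions, and all function names and input values are illustrative placeholders.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    per_class = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / n_classes


def binary_auc(scores, labels):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_ovr_auc(y_true, probs, n_classes):
    """One-vs-rest AUC per class, averaged without weights (assumed scheme)."""
    aucs = []
    for c in range(n_classes):
        labels = [1 if t == c else 0 for t in y_true]
        aucs.append(binary_auc([row[c] for row in probs], labels))
    return sum(aucs) / n_classes


# Placeholder data: four samples, two classes (not taken from the paper).
y_true = [0, 0, 1, 1]
probs = [[0.75, 0.25], [0.25, 0.75], [0.25, 0.75], [0.125, 0.875]]
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probs]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred, n_classes=2)
auc = macro_ovr_auc(y_true, probs, n_classes=2)
```

Accuracy and F-score depend on the hard label chosen for each clip, whereas AUC is computed from the predicted probabilities, which is why a model can rank differently on AUC than on accuracy.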

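Using spectrograms alone, the VGG-16 CNN achieves the highest metrics among single models (transfer learning: accuracy 0.63, F-score 0.61, AUC 0.891; fine-tuning: 0.64, 0.61, 0.889).
Among the feature-based models, SVM (0.57/0.52/0.856) and XGB (0.59/0.55/0.865) outperform LR and RF.
The ensemble of the VGG-16 CNN and XGBoost achieves the best overall results: AUC 0.894, accuracy 0.65, F-score 0.62.
MFCCs are among the top features; the mean and standard deviation of spectral contrast and the tempo are also important.
In comparisons using the top 10, 20, and 30 features against all 97, the top 30 features alone (AUC 0.845, accuracy 0.55) come close to the full feature set (AUC 0.865, accuracy 0.59).
Frequency-domain features outperform time-domain features on this task, and combining the two gives the best results (AUC 0.865, accuracy 0.59).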
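
The top-N feature comparison above amounts to ranking features by an importance score and retraining on the highest-ranked subset. The sketch below illustrates only the ranking and column-selection step. The helper names, feature names, and importance scores are hypothetical placeholders, not values from the paper, and in practice the scores would come from a fitted model such as the XGB classifier.

```python
def top_k_features(importances, k):
    """Names of the k features with the highest importance scores."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def select_columns(rows, feature_names, keep):
    """Keep only the columns named in `keep`, in the order given by `keep`."""
    indices = [feature_names.index(name) for name in keep]
    return [[row[i] for i in indices] for row in rows]


# Hypothetical feature layout and importance scores, for illustration only.
feature_names = ["tempo", "zero_crossing_rate", "mfcc_mean_1",
                 "rms_energy", "spectral_contrast_mean"]
importances = {
    "tempo": 0.20,
    "zero_crossing_rate": 0.15,
    "mfcc_mean_1": 0.30,
    "rms_energy": 0.10,
    "spectral_contrast_mean": 0.25,
}

# One placeholder feature row, ordered as in feature_names.
rows = [[120.0, 0.1, -5.0, 0.3, 20.0]]

top3 = top_k_features(importances, k=3)
reduced_rows = select_columns(rows, feature_names, top3)
```

A classifier would then be retrained on `reduced_rows` for each value of k, and its metrics compared against the model trained on all features.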


This review was generated by AI and reviewed by a human editor.