QUICK REVIEW

[Paper Review] Instrument-Independent Dastgah Recognition of Iranian Classical Music Using AzarNet

Shahla RezezadehAzar, Ali Ahmadi|arXiv (Cornell University)|Jan 1, 2018

Music and Audio Processing|References 21|Cited by 2

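One-Sentence Summary

This paper proposes AzarNet, a deep convolutional neural network that recognizes the Dastgah of Iranian classical music without relying on instrument-specific features, using the Maryam Iranian classical music (MICM) dataset. The Short-Time Fourier Transform (STFT) converts the audio signals into a time-frequency representation, and AzarNet classifies seven Dastgahs with an overall F1 score of 86.21%, which the authors report as the highest result for this task to date.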

ABSTRACT

In this paper, AzarNet, a deep neural network (DNN), is proposed to recognize seven different Dastgahs of Iranian classical music in the Maryam Iranian classical music (MICM) dataset. In recent years, there has been remarkable interest in feature learning and DNNs, which reduce the required engineering effort. DNNs have shown better performance than shallow processing architectures in many classification tasks, such as audio signal classification. Unlike image data, audio data need some preprocessing steps to extract spectral and temporal features. Transformations such as the Short-Time Fourier Transform (STFT) have been used in state-of-the-art research to transform audio signals from the time domain to the time-frequency domain and extract both temporal and spectral features. In this research, the STFT outputs, which serve as the extracted features, are given to AzarNet for learning and classification. It is worth noting that the dataset contains music tracks composed with two instruments (violin and straw). The overall F1 score of AzarNet on the test set, averaged over all seven classes, was 86.21%, which is, to the best of our knowledge, the best result reported in Dastgah classification.
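
The time-domain to time-frequency step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' preprocessing code: the function name `stft_magnitude`, the Hann window, the frame length of 256, the hop of 128, and the synthetic 440 Hz test tone are all assumptions chosen for illustration, since this summary does not state the STFT parameters used for MICM.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram, shape (n_frames, frame_len // 2 + 1).

    frame_len and hop are illustrative values, not taken from the paper.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping frames and taper each with the window.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # A real-input FFT per frame gives one spectrum per time step.
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy input (not MICM data): one second of a 440 Hz sine sampled at 8 kHz.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec = stft_magnitude(tone)
# Frequency bin that carries the most energy on average across frames.
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256
```

Each row of `spec` is the spectrum of one short frame, so the array keeps both the temporal axis (rows) and the spectral axis (columns). A 2D array of this kind is what a convolutional network can consume as an image-like input.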

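Research Motivation and Objectives

Develop a Dastgah classification method for Iranian classical music that does not rely on instrument-specific features.
Improve classification accuracy over existing shallow-learning and single-layer neural network approaches.
Use a deep neural network to learn features of the raw audio automatically from STFT spectrograms.
Establish a new benchmark for Dastgah recognition on the newly introduced, diverse MICM dataset.
Demonstrate the effectiveness of residual connections, batch normalization, and gated recurrent units (GRUs) for modeling the spectral and temporal patterns of the music.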

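Proposed Method

Raw audio signals from the MICM dataset are converted into a time-frequency representation with the Short-Time Fourier Transform (STFT).
The resulting spectrograms are fed to AzarNet, a deep convolutional neural network that uses residual blocks, batch normalization, and Dropout layers for regularization.
The architecture contains five 2D convolutional layers with 3×3 kernels, followed by max pooling and batch normalization, with Leaky ReLU (α=0.1) as the activation function.
A GRU layer is applied after the last convolutional block to model sequential dependencies in the spectrogram features.
Both L1 and L2 regularization (combined as LAD+LSE, with a penalty coefficient of 0.01) are applied to the convolutional and GRU layers to prevent overfitting.
The final classifier uses two fully connected layers with a Softmax activation for multi-class Dastgah classification.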
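
Two of the components listed above, the Leaky ReLU activation with α=0.1 and the combined L1+L2 (LAD+LSE) weight penalty with coefficient 0.01, can be written out in a few lines of NumPy. This is a sketch under assumptions rather than the authors' implementation: the function names are hypothetical, the same 0.01 coefficient is assumed for both penalty terms, and the weight matrix is a made-up toy example.

```python
import numpy as np

def leaky_relu(x, alpha=0.1):
    """Pass positive inputs through unchanged; scale the rest by alpha."""
    return np.where(x > 0, x, alpha * x)

def l1_l2_penalty(weights, l1=0.01, l2=0.01):
    """Combined penalty: l1 * sum(|w|) + l2 * sum(w^2).

    Using 0.01 for both terms is an assumption; the summary gives a single
    coefficient without saying how it is split.
    """
    return l1 * np.abs(weights).sum() + l2 * np.square(weights).sum()

# Hypothetical toy values, not taken from the trained network.
w = np.array([[0.5, -1.0], [2.0, 0.0]])
activations = leaky_relu(np.array([-2.0, 0.0, 3.0]))
penalty = l1_l2_penalty(w)
```

In training, a penalty of this form is added to the classification loss for each regularized layer. The L1 term pushes weights toward exact zeros and the L2 term discourages large weights.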

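Experimental Results

Research Questions

RQ1 Can a deep neural network achieve high-accuracy Dastgah recognition without relying on instrument characteristics?
RQ2 Does combining STFT spectrograms with a deep convolutional neural network improve classification performance over raw-audio or FFT-based approaches?
RQ3 How effective are residual connections, batch normalization, and GRUs at modeling the spectral and temporal patterns of Persian classical music?
RQ4 How does a DNN-based approach perform on a newly introduced, instrument-diverse Dastgah classification dataset?
RQ5 Can the proposed method surpass existing state-of-the-art models built on simple architectures, such as single-layer neural networks?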

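Key Findings

AzarNet achieves an overall F1 score of 86.21% on the MICM test set, the highest result reported to date for seven-class Dastgah classification.
The model outperforms previous state-of-the-art methods, including a single-layer neural network on FFT features (F1 score of 83%) and another model (accuracy of 72%).
Shour (92.21%) and Nava (91.84%) achieve the highest per-class F1 scores.
Adding the GRU and bottleneck layers raises the F1 score from 84.80% (without the GRU) to 86.21%, a gain of 1.41 percentage points.
Progressively increasing Dropout rates (0.1 to 0.4), combined with L1/L2 regularization, reduce overfitting and improve generalization across all classes.
The model also performs well on Dastgahs with fewer samples, such as Segah (74 samples), where it reaches an F1 score of 84.26%.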
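
The abstract describes the overall F1 score as an average over the seven classes. Assuming this means an unweighted (macro) mean of the per-class F1 scores, the calculation can be sketched as below. The helper names are hypothetical and the labels are a made-up toy example with three classes, not the paper's predictions.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """F1 per label, computed as 2*TP / (2*TP + FP + FN)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gets a false positive
            fn[truth] += 1  # true label gets a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

# Toy labels for illustration only.
y_true = ["Shour", "Shour", "Nava", "Nava", "Segah", "Segah"]
y_pred = ["Shour", "Nava", "Nava", "Nava", "Segah", "Shour"]

scores = per_class_f1(y_true, y_pred)
overall = macro_f1(y_true, y_pred)
```

Because every class contributes equally to a macro average, a class with few samples such as Segah affects the overall score as much as a well-represented one.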


This review was generated by AI and checked by a human editor.