QUICK REVIEW

[Paper Review] Carnatic Raga Identification System using Rigorous Time-Delay Neural Network

Sanjay Natesan, Homayoon Beigi|arXiv (Cornell University)|Jan 1, 2024

Remote Sensing and Land Use | Cited by 2

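One-Sentence Summary

This paper proposes a deep learning system for Carnatic raga identification built on a hybrid Time-Delay Neural Network (TDNN) and Long Short-Term Memory (LSTM) architecture, with an attention mechanism to handle variations in shruti. The model reaches a validation accuracy of 95.31% on a dataset of 676 recordings spanning 172 ragas, substantially extending prior work in scale and complexity.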

ABSTRACT

Large scale machine learning-based Raga identification continues to be a nontrivial issue in the computational aspects behind Carnatic music. Each raga consists of many unique and intrinsic melodic patterns that can be used to easily identify them from others. These ragas can also then be used to cluster songs within the same raga, as well as identify songs in other closely related ragas. In this case, the input sound is analyzed using a combination of steps including using a Discrete Fourier transformation and using Triangular Filtering to create custom bins of possible notes, extracting features from the presence of particular notes or lack thereof. Using a combination of Neural Networks including 1D Convolutional Neural Networks (conventionally known as Time-Delay Neural Networks) and Long Short-Term Memory (LSTM), which are a form of Recurrent Neural Networks, the backbone of the classification strategy to build the model can be created. In addition, to help with variations in shruti, a long-time attention-based mechanism will be implemented to determine the relative changes in frequency rather than the absolute differences. This will provide a much more meaningful data point when training audio clips in different shrutis. To evaluate the accuracy of the classifier, a dataset of 676 recordings is used. The songs are distributed across the list of ragas. The goal of this program is to be able to effectively and efficiently label a much wider range of audio clips in more shrutis, ragas, and with more background noise.
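The feature-extraction step described in the abstract (a Discrete Fourier transform followed by triangular filters that form bins of possible notes) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the authors' implementation: the function names `dft_magnitudes`, `triangular_filter`, and `note_energies` are hypothetical, the bins are spaced one equal-tempered semitone apart starting from an assumed tonic `tonic_hz`, and a naive O(n^2) DFT is used only to keep the example dependency-free.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum of a real frame via a naive DFT (bins 0..n//2)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags

def triangular_filter(center_hz, lower_hz, upper_hz, n_bins, sample_rate, n_fft):
    """Weights rising linearly from lower_hz to center_hz, then falling to upper_hz."""
    weights = []
    for k in range(n_bins):
        f = k * sample_rate / n_fft  # center frequency of DFT bin k
        if lower_hz < f <= center_hz:
            weights.append((f - lower_hz) / (center_hz - lower_hz))
        elif center_hz < f < upper_hz:
            weights.append((upper_hz - f) / (upper_hz - center_hz))
        else:
            weights.append(0.0)
    return weights

def note_energies(frame, sample_rate, tonic_hz, n_notes=12):
    """Filtered spectral energy per note bin (assumed semitone spacing from the tonic)."""
    mags = dft_magnitudes(frame)
    n_fft = len(frame)
    energies = []
    for i in range(n_notes):
        center = tonic_hz * 2 ** (i / 12)
        lower = tonic_hz * 2 ** ((i - 1) / 12)
        upper = tonic_hz * 2 ** ((i + 1) / 12)
        w = triangular_filter(center, lower, upper, len(mags), sample_rate, n_fft)
        energies.append(sum(wi * m for wi, m in zip(w, mags)))
    return energies
```

Calling `note_energies(frame, sample_rate, tonic_hz)` on a short frame returns one value per note bin; the presence or absence of energy in each bin is the kind of feature the abstract refers to. A practical system would use an FFT and a bin layout suited to Carnatic intonation rather than equal temperament.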

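Research Motivation and Objectives

Develop a scalable, high-accuracy machine learning system that identifies a diverse range of Carnatic ragas across different shrutis and performance styles.
Address the challenge of shruti variation in raga identification by modeling relative frequency shifts rather than absolute values.
Extend existing raga identification systems beyond the traditional 72 melakarta ragas to include janya ragas and a broader range of musical samples.
Improve generalization and robustness under noisy or varied audio conditions through advanced feature extraction and an attention mechanism.
Build a computationally efficient and accurate model suited to large-scale music information retrieval tasks for Carnatic music.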
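To see why relative frequency shifts are a more useful signal than absolute values when the shruti changes, consider expressing each pitch in cents above the tonic (1200 cents per octave). The sketch below is only an illustration of that idea, not the paper's attention mechanism; `cents_relative_to_tonic` and `transpose` are hypothetical helper names, and the example frequencies are made up.

```python
import math

def cents_relative_to_tonic(freqs_hz, tonic_hz):
    """Pitch of each frequency in cents above the tonic (1200 cents per octave)."""
    return [1200 * math.log2(f / tonic_hz) for f in freqs_hz]

def transpose(freqs_hz, ratio):
    """Shift a phrase to a different shruti by scaling every frequency."""
    return [f * ratio for f in freqs_hz]

# The same phrase sung at two shrutis (tonic 220 Hz and tonic 264 Hz).
phrase = [220.0, 247.5, 275.0, 330.0]
shifted = transpose(phrase, 264.0 / 220.0)

relative_a = cents_relative_to_tonic(phrase, 220.0)
relative_b = cents_relative_to_tonic(shifted, 264.0)
```

The absolute frequencies in `phrase` and `shifted` differ, but `relative_a` and `relative_b` agree up to floating-point error, so a model fed the relative representation sees the same melodic shape regardless of the performer's chosen tonic.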

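Proposed Method

The system uses a one-dimensional convolutional neural network (TDNN) to extract local melodic patterns from the audio signal, operating on spectral features that have already been extracted.
Spectral features are extracted with a Discrete Fourier Transform (DFT) and a triangular filter bank, which models perceptually relevant frequency bands.
A Long Short-Term Memory (LSTM) network processes sequential patterns to model the temporal dependencies of melodic contours and gamakas.
An attention-based mechanism focuses on relative frequency shifts, improving robustness to shruti differences across performances.
Training uses a categorical cross-entropy loss and the Adam optimizer, with early stopping to prevent overfitting.
Data preprocessing includes normalization and data augmentation to improve generalization across varied audio conditions.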
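The TDNN component named in the method list is a 1D convolution over the time axis: each output frame is computed from a small window of neighbouring input frames. The following is a minimal pure-Python sketch of one such layer, written for illustration only. The function name `tdnn_layer`, the ReLU activation, the `dilation` parameter, and the "valid" (no padding) boundary handling are all assumptions; the summary does not state the paper's layer sizes or hyperparameters.

```python
def tdnn_layer(frames, kernels, biases, dilation=1):
    """Apply one time-delay (1D convolution over time) layer with ReLU.

    frames:  list of T feature vectors, each of length D
    kernels: list of C filters; each filter is a list of K weight vectors of length D
    biases:  list of C biases
    Returns T - (K - 1) * dilation output vectors of length C (no padding).
    """
    context = len(kernels[0])          # K: number of frames each filter sees
    span = (context - 1) * dilation    # distance between first and last tap
    outputs = []
    for t in range(len(frames) - span):
        out = []
        for kernel, bias in zip(kernels, biases):
            acc = bias
            for k, weights in enumerate(kernel):
                frame = frames[t + k * dilation]
                acc += sum(w * x for w, x in zip(weights, frame))
            out.append(max(0.0, acc))  # ReLU (assumed activation)
        outputs.append(out)
    return outputs

# One-dimensional features and a single filter that computes frame[t+1] - frame[t].
frames = [[1.0], [2.0], [4.0], [7.0]]
difference_filter = [[[-1.0], [1.0]]]
local_changes = tdnn_layer(frames, difference_filter, [0.0])
```

Stacking such layers with growing dilation widens the temporal context each output sees, and the resulting frame sequence is the kind of input an LSTM would then process. In practice a framework layer such as a 1D convolution would replace this loop.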

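Experimental Results

Research Questions

RQ1: Can an attention-based hybrid TDNN-LSTM model achieve high-accuracy raga identification across a broad set of Carnatic ragas that includes janya ragas?
RQ2: How does modeling relative frequency shifts, compared with absolute frequencies, improve robustness to shruti variation in raga identification?
RQ3: How well does a deep learning model generalize to ragas beyond the standard 72 melakarta ragas when training data per raga is limited?
RQ4: How does incorporating gamaka patterns affect the performance of an end-to-end raga classification model?
RQ5: How do dataset size and diversity affect model generalization and accuracy in raga identification?

Key Findings

The model achieved a validation accuracy of 95.31% on a dataset of 676 recordings covering 172 distinct ragas (both melakarta and janya).
Training converged efficiently: early stopping ended training after epoch 132 once the validation loss plateaued, suggesting that the regularization strategy was effective.
The final validation loss was 0.3544, far below the initial loss, indicating effective learning despite the model's complexity and the dataset's diversity.
Training accuracy reached 99.57%, only 4.26 percentage points above the validation accuracy, suggesting limited overfitting despite the large number of classes and the complexity of the patterns.
The system matches or outperforms prior state-of-the-art methods, even though its training dataset is more than 200 times the size of those used in many prior studies.
The attention mechanism effectively captured variations in relative pitch, markedly improving robustness to shruti differences across performances.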
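The early-stopping behaviour noted in the findings (training halts once the validation loss plateaus) can be sketched as a patience rule. This is a generic illustration, not the authors' training loop: the function name `early_stopping_epoch`, the `patience` and `min_delta` parameters, and the sample loss values are all assumptions, since the summary does not report the settings used.

```python
def early_stopping_epoch(val_losses, patience=10, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve the best
    validation loss by more than `min_delta`. If that never happens,
    stop_epoch is the final epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses that improve and then plateau.
losses = [1.0, 0.8, 0.6, 0.61, 0.62, 0.60, 0.65]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

With these values the loss stops improving after the third epoch, so a patience of three halts training at epoch six. Implementations typically also restore the weights saved at `best_epoch`.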


This review was generated by AI and reviewed by human editors.