QUICK REVIEW

[Paper Digest] Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Darvan Shvan Khairaldeen, Hossein Hassani|arXiv (Cornell University)|Feb 24, 2026

Music and Audio Processing | Cited by 0

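One-Sentence Summary

This paper proposes a two-headed CNN-BiLSTM with attention for detecting and classifying vocal errors in Kurdish Bayati-Kurd maqam singing, using an annotated 50-song corpus and log-mel spectrogram features. It reports a detection macro-F1 and per-class F1 scores, with stronger results on fine pitch and rhythm errors, while modal drift remains challenging because of limited data.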

ABSTRACT

Maqam, a singing type, is a significant component of Kurdish music. A maqam singer receives training in a traditional face-to-face setting or through self-training. Automatic Singing Assessment (ASA) uses machine learning (ML) to provide the accuracy of singing styles and can help learners to improve their performance through error detection. Currently, the available ASA tools follow Western music rules. The musical composition requires all notes to stay within their expected pitch range from start to finish. The system fails to detect micro-intervals and pitch bends, so it identifies Kurdish maqam singing as incorrect even though the singer performs according to traditional rules. Kurdish maqam requires recognizing performance errors within microtonal spaces, which is beyond Western equal temperament. This research is the first attempt to address the mentioned gap. While many error types happen during singing, our focus is on pitch, rhythm, and modal stability errors in the context of Bayati-Kurd. We collected 50 songs from 13 vocalists (2-3 hours) and annotated 221 error spans (150 fine pitch, 46 rhythm, 25 modal drift). The data was segmented into 15,199 overlapping windows and converted to log-mel spectrograms. We developed a two-headed CNN-BiLSTM with attention mode to decide whether a window contains an error and to classify it based on the chosen errors. Trained for 20 epochs with early stopping at epoch 10, the model reached a validation macro-F1 of 0.468. On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%. Within detected windows, type macro-F1 was 0.387, with F1 of 0.492 (fine pitch), 0.536 (rhythm), and 0.133 (modal drift); modal drift recall was 8.0%. The better performance on common error types shows that the method works, while the poor modal-drift recall shows that more data and balancing are needed.
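
The annotation counts in the abstract (150 fine pitch, 46 rhythm, 25 modal drift) make the class imbalance concrete. As a minimal sketch, the snippet below derives inverse-frequency class weights from those counts. The weighting formula and the names `SPAN_COUNTS` and `inverse_frequency_weights` are illustrative assumptions; the abstract does not say how the authors balanced the classes.

```python
# Annotated error spans per class, as reported in the abstract (221 in total).
SPAN_COUNTS = {"fine_pitch": 150, "rhythm": 46, "modal_drift": 25}


def inverse_frequency_weights(counts):
    """Return total / (n_classes * count) for each class.

    This is a common balancing heuristic, assumed here for illustration;
    it is not confirmed to be the weighting used in the paper.
    """
    total = sum(counts.values())
    n_classes = len(counts)
    return {name: total / (n_classes * count) for name, count in counts.items()}


weights = inverse_frequency_weights(SPAN_COUNTS)
for name, weight in weights.items():
    print(f"{name}: {weight:.3f}")
```

Under this heuristic the rarest class (modal drift) receives the largest weight, roughly six times that of fine pitch.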

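Motivation and Objectives

Advance automatic singing assessment (ASA) for Kurdish maqam by addressing microtonal pitch, rhythm, and modal drift errors.
Build a dataset of Bayati-Kurd vocal performances with expert-annotated error spans.
Propose a two-headed CNN-BiLSTM model with attention that detects errors and classifies their type.
Evaluate the model on the full song collection and analyze failure modes to guide future data collection and model improvements.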

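Proposed Method

Convert audio to log-mel spectrograms (FFT size 1024, hop length 512, 128 mel bands).
Design a CNN-BiLSTM backbone with attention to capture local spectro-temporal patterns and longer musical context.
Implement two output heads: a detection head (sigmoid) and a type-classification head (softmax over three classes).
Train with AdamW, using weighted cross-entropy and a class-weighted focal loss to address imbalance, and apply data augmentation and hard negative mining.
Segment the data into 10-second windows (1-second step) and 3-second windows (0.5-second step); label windows with a center-overlap rule; split by song to avoid data leakage.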
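
The windowing and labeling step can be sketched as follows. This is a minimal illustration, not the authors' code: the digest does not define the center-overlap rule, so the sketch assumes a window is positive when its center falls inside an annotated error span. `ErrorSpan`, `make_windows`, and `label_window` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorSpan:
    start: float  # seconds
    end: float    # seconds
    kind: str     # "fine_pitch", "rhythm", or "modal_drift"


def make_windows(duration, win_len, hop):
    """Return (start, end) pairs for every window that fits inside the recording."""
    windows = []
    n = 0
    while n * hop + win_len <= duration + 1e-9:
        start = n * hop
        windows.append((start, start + win_len))
        n += 1
    return windows


def label_window(window, spans):
    """Assumed center-overlap rule: positive if the window center lies in a span.

    Returns (has_error, error_type); error_type is None for clean windows.
    """
    start, end = window
    center = (start + end) / 2
    for span in spans:
        if span.start <= center <= span.end:
            return 1, span.kind
    return 0, None


# A hypothetical 30-second recording with one annotated rhythm error.
spans = [ErrorSpan(12.0, 13.5, "rhythm")]
long_windows = make_windows(30.0, win_len=10.0, hop=1.0)
short_windows = make_windows(30.0, win_len=3.0, hop=0.5)
labels = [label_window(w, spans) for w in short_windows]
```

Splitting by song would then mean assigning all windows from one recording to the same partition, so that overlapping windows never straddle the train and validation sets.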

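Experimental Results

Research Questions

RQ1: Can a deep learning model detect vocal errors in Kurdish microtonal maqam singing and classify them by type (fine pitch, rhythm, modal drift)?
RQ2: How does a CNN-BiLSTM architecture with attention perform on a small, highly imbalanced dataset of Kurdish maqam vocal errors?
RQ3: What challenges and limitations arise in detecting modal drift in Bayati-Kurd maqam, and how does the amount of data affect performance?
RQ4: Can the model's outputs provide feedback for teaching Kurdish maqam singing?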

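Key Findings

On all 50 songs, the detection head reached 39.4% recall and 25.8% precision (F1 0.311) at a threshold of 0.750.
Across detected windows, the type macro-F1 was 0.387, with per-class F1 of 0.492 (fine pitch), 0.536 (rhythm), and 0.133 (modal drift).
Fine pitch had the highest precision (89.5%), rhythm had the best F1 (0.536) with fairly balanced precision and recall, and modal drift remained difficult (8.0% recall).
The model was trained for 20 epochs, with the best validation macro-F1 of 0.468 at epoch 10; class imbalance and the small number of modal drift samples limited performance.
Extensive annotation with a custom Vocal Annotator tool made expert-labeled windows available for supervised learning.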
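
The reported scores can be cross-checked with a short calculation. The sketch below recomputes the detection F1 from the rounded precision and recall, and the type macro-F1 as the unweighted mean of the three per-class F1 values. The helper names are illustrative, and the small gap between the recomputed detection F1 and the reported 0.311 is presumably due to rounding of the published precision and recall.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1):
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Detection head at the 0.750 threshold (rounded figures from the findings).
detection_f1 = f1_score(precision=0.258, recall=0.394)

# Type head: fine pitch, rhythm, modal drift.
type_macro_f1 = macro_f1([0.492, 0.536, 0.133])

print(f"detection F1  ~ {detection_f1:.3f}")
print(f"type macro-F1 ~ {type_macro_f1:.3f}")
```

Because macro-F1 weights each class equally, the low modal drift score (0.133) pulls the average well below the fine pitch and rhythm scores.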


This digest was generated by AI and reviewed by a human editor.