QUICK REVIEW

[Paper Review] Voices of the Mountains: Deep Learning-Based Vocal Error Detection System for Kurdish Maqams

Darvan Shvan Khairaldeen, Hossein Hassani|arXiv (Cornell University)|2026. 02. 24.

Music and Audio Processing | Citations: 0

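One-Line Summary

This paper presents a two-headed CNN–BiLSTM with attention that detects and classifies vocal errors in Kurdish Bayati-Kurd maqam singing, using an annotated 50-song corpus and log-mel spectrogram features. It reports macro-F1 and per-class F1 scores, highlighting strengths on fine pitch and rhythm and the difficulty of modal drift under limited data.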

ABSTRACT

Maqam, a singing type, is a significant component of Kurdish music. A maqam singer receives training in a traditional face-to-face or through self-training. Automatic Singing Assessment (ASA) uses machine learning (ML) to provide the accuracy of singing styles and can help learners to improve their performance through error detection. Currently, the available ASA tools follow Western music rules. The musical composition requires all notes to stay within their expected pitch range from start to finish. The system fails to detect micro-intervals and pitch bends, so it identifies Kurdish maqam singing as incorrect even though the singer performs according to traditional rules. Kurdish maqam requires recognizing performance errors within microtonal spaces, which is beyond Western equal temperament. This research is the first attempt to address the mentioned gap. While many error types happen during singing, our focus is on pitch, rhythm, and modal stability errors in the context of Bayati-Kurd. We collected 50 songs from 13 vocalists (2-3 hours) and annotated 221 error spans (150 fine pitch, 46 rhythm, 25 modal drift). The data was segmented into 15,199 overlapping windows and converted to log-mel spectrograms. We developed a two-headed CNN-BiLSTM with attention mode to decide whether a window contains an error and to classify it based on the chosen errors. Trained for 20 epochs with early stopping at epoch 10, the model reached a validation macro-F1 of 0.468. On the full 50-song evaluation at a 0.750 threshold, recall was 39.4% and precision 25.8%. Within detected windows, type macro-F1 was 0.387, with F1 of 0.492 (fine pitch), 0.536 (rhythm), and 0.133 (modal drift); modal drift recall was 8.0%. The better performance on common error types shows that the method works, while the poor modal-drift recall shows that more data and balancing are needed.
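The abstract's point about Western equal temperament can be made concrete with a small sketch. It measures how far a pitch lies from the nearest 12-tone equal-tempered (12-TET) note, in cents, and shows that a hypothetical Western-style checker would flag a quarter-tone pitch even when it is sung exactly on target. The A4 = 440 Hz reference, the 25-cent tolerance, and the function names are illustrative assumptions, not details from the paper, and the quarter-tone pitch is a generic microtonal example rather than a specific Bayati-Kurd scale degree.

```python
import math

A4_HZ = 440.0  # reference tuning; an assumption for illustration


def cents_from_a4(freq_hz: float) -> float:
    """Interval between freq_hz and A4 in cents (100 cents = one 12-TET semitone)."""
    return 1200.0 * math.log2(freq_hz / A4_HZ)


def deviation_from_12tet(freq_hz: float) -> float:
    """Signed distance in cents to the nearest 12-TET pitch (between -50 and 50)."""
    cents = cents_from_a4(freq_hz)
    return cents - 100.0 * round(cents / 100.0)


def western_flag(freq_hz: float, tolerance_cents: float = 25.0) -> bool:
    """Hypothetical Western-style check: flag pitches too far from any semitone."""
    return abs(deviation_from_12tet(freq_hz)) > tolerance_cents


# A pitch a quarter tone (50 cents) above A4: a generic microtonal target.
quarter_tone_above_a4 = A4_HZ * 2 ** (50 / 1200)

print(western_flag(A4_HZ))                  # on a semitone: not flagged
print(western_flag(quarter_tone_above_a4))  # intended microtone: flagged anyway
```

A checker built this way cannot distinguish a deliberate micro-interval from an out-of-tune note, which is the gap the paper sets out to address.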

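Motivation and Objectives

Motivate automatic singing assessment (ASA) for Kurdish maqam by addressing microtonal pitch, rhythm, and modal drift errors.
Build a dataset of Bayati-Kurd vocal performances with expert-annotated error spans.
Propose a two-headed CNN–BiLSTM with attention that detects errors and classifies their type.
Evaluate the model on the full song set and analyze failure modes to guide future data collection and model improvements.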
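The detect-then-classify objective can be illustrated with a minimal sketch of how two head outputs might be combined for a single window. It assumes a sigmoid detection score compared against the 0.750 threshold used in the reported evaluation, followed by a softmax over three error types. The function names, the class order, and the rule of reporting a type only for detected windows are assumptions for illustration, not the authors' implementation.

```python
import math
from typing import List, Optional

# Class order is an assumption; the paper names these three error types.
ERROR_TYPES = ["fine_pitch", "rhythm", "modal_drift"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: List[float]) -> List[float]:
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_window(det_logit: float, type_logits: List[float],
                  threshold: float = 0.75) -> Optional[str]:
    """Return the predicted error type, or None if no error is detected."""
    if sigmoid(det_logit) < threshold:
        return None
    probs = softmax(type_logits)
    return ERROR_TYPES[probs.index(max(probs))]


print(decode_window(0.0, [1.0, 2.0, 3.0]))   # detection score 0.5: below threshold
print(decode_window(3.0, [0.1, 2.0, -1.0]))  # detected; highest type score is rhythm
```

Gating the type head on the detection head matches the way the type scores are reported "within detected windows" in the abstract.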

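Proposed Method

Convert audio to log-mel spectrograms (1024-point FFT, hop length 512, 128 mel bins).
Design a CNN–BiLSTM backbone with attention to capture local spectro-temporal patterns and longer musical context.
Implement two output heads: a detection head (sigmoid) and a type-classification head (softmax over three classes).
Train with AdamW, using weighted cross-entropy and focal loss to handle class imbalance, and apply data augmentation and hard negative mining.
Segment the audio into 10-second windows (1-second hop) and 3-second windows (0.5-second hop), label each window with a center-overlap rule, and split by song to avoid leakage.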
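The windowing, labeling, and song-level split in the last step can be sketched as follows. The review does not define the center-overlap rule precisely, so this sketch assumes one plausible reading: a window takes the type of an annotated span if the window's center time falls inside that span. The 20% validation fraction, the deterministic split, and all function names are likewise assumptions for illustration only.

```python
from typing import List, Optional, Tuple

Span = Tuple[float, float, str]  # (start_s, end_s, error_type)


def make_windows(duration_s: float, win_s: float, hop_s: float) -> List[Tuple[float, float]]:
    """Overlapping windows that fit entirely inside the recording."""
    windows = []
    i = 0
    while i * hop_s + win_s <= duration_s + 1e-9:
        start = i * hop_s
        windows.append((start, start + win_s))
        i += 1
    return windows


def label_window(window: Tuple[float, float], spans: List[Span]) -> Optional[str]:
    """Assumed center rule: use the type of the span containing the window center."""
    center = (window[0] + window[1]) / 2.0
    for start, end, error_type in spans:
        if start <= center <= end:
            return error_type
    return None  # no annotated error at the window center


def split_by_song(song_ids: List[str], val_fraction: float = 0.2):
    """Song-level split: windows from one song never appear in both sets.

    Deterministic (sorted) for reproducibility; a real split would likely shuffle.
    """
    ordered = sorted(set(song_ids))
    n_val = max(1, round(len(ordered) * val_fraction))
    return ordered[:-n_val], ordered[-n_val:]


spans = [(4.5, 5.5, "rhythm")]
long_windows = make_windows(12.0, win_s=10.0, hop_s=1.0)   # 10 s windows, 1 s hop
short_windows = make_windows(4.0, win_s=3.0, hop_s=0.5)    # 3 s windows, 0.5 s hop
labels = [label_window(w, spans) for w in long_windows]
train_songs, val_songs = split_by_song(["s1", "s2", "s3", "s4", "s5"])
```

Splitting at the song level matters because overlapping windows from the same recording share most of their audio; a window-level split would let near-duplicates of training windows leak into validation.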

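Experimental Results

Research Questions

RQ1: Can a deep learning model detect vocal errors in microtonal Kurdish maqam singing and classify the error type (fine pitch, rhythm, modal drift)?
RQ2: How well does a CNN–BiLSTM architecture with attention perform on a small, imbalanced dataset of Kurdish maqam vocal errors?
RQ3: What are the challenges and limitations of detecting modal drift in Bayati-Kurd maqam, and how does the amount of data affect performance?
RQ4: What feedback can be generated from the model outputs to support Kurdish maqam singing education?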

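Key Results

Across all 50 songs, the detection head achieved 39.4% recall and 25.8% precision (F1 0.311) at a threshold of 0.750.
Within detected windows, type macro-F1 was 0.387, with per-class F1 of 0.492 (fine pitch), 0.536 (rhythm), and 0.133 (modal drift).
Fine pitch showed the highest accuracy (89.5%), rhythm had the highest F1 (0.536) with balanced precision and recall, and modal drift remained difficult (8.0% recall).
Performance was limited by class imbalance and the small number of modal drift examples. The model was trained for 20 epochs, with the best validation macro-F1 of 0.468 at epoch 10.
Extensive annotation work with a custom Vocal Annotator tool produced the expert-labeled windows used for supervised training.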
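As a quick consistency check on the reported numbers, the sketch below recomputes the detection F1 from the stated precision and recall, and the type macro-F1 as the unweighted mean of the three per-class F1 scores. The function names are illustrative. Because the inputs are already rounded to three digits, the recomputed detection F1 comes out near 0.312, which agrees with the reported 0.311 up to rounding.

```python
from typing import List


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1: List[float]) -> float:
    """Unweighted mean of per-class F1 scores."""
    return sum(per_class_f1) / len(per_class_f1)


# Detection head at threshold 0.750: precision 25.8%, recall 39.4%.
detection_f1 = f1_score(0.258, 0.394)

# Type head: fine pitch, rhythm, modal drift.
type_macro_f1 = macro_f1([0.492, 0.536, 0.133])

print(f"detection F1  = {detection_f1:.3f}")
print(f"type macro-F1 = {type_macro_f1:.3f}")
```

Because macro-F1 weights every class equally, the low modal drift score (0.133) pulls the average well below the fine pitch and rhythm scores despite modal drift being the rarest class.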


This review was generated by AI and reviewed by a human editor.