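[Paper Review] musicnn: Pre-trained convolutional neural networks for music audio tagging
This paper presents musicnn, a set of pre-trained, musically motivated CNNs for music audio tagging, together with VGG-like baselines. The models are trained on MagnaTagATune and the Million Song Dataset and can be used for tagging, for feature extraction, and as a source of features for transfer learning.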
Pronounced as "musician", the musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging: https://github.com/jordipons/musicnn. This repository also includes some pre-trained vgg-like baselines. These models can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning. We also provide the code to train the aforementioned models: https://github.com/jordipons/musicnn-training. This framework also allows implementing novel models. For example, a musically motivated convolutional neural network with an attention-based output layer (instead of the temporal pooling layer) can achieve state-of-the-art results for music audio tagging: 90.77 ROC-AUC / 38.61 PR-AUC on the MagnaTagATune dataset --- and 88.81 ROC-AUC / 31.51 PR-AUC on the Million Song Dataset.
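The abstract contrasts a temporal pooling output layer with an attention-based one. The sketch below illustrates that general idea only: it compares plain mean pooling over time with a softmax-weighted (attention-style) pooling. It is a generic formulation, not the authors' layer. The function names, the toy frames, and the per-frame scores are all hypothetical, and in a real model the scores would be learned rather than supplied by hand.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Temporal mean pooling: every frame contributes equally."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def attention_pool(
    frames: Sequence[Sequence[float]],
    frame_scores: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Softmax-weighted pooling: frames with higher scores contribute more.

    frame_scores are hand-supplied here (an assumption for illustration);
    a trained network would produce them from the frames themselves.
    """
    if not frames or len(frames) != len(frame_scores):
        raise ValueError("need one score per frame and at least one frame")
    peak = max(frame_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in frame_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frames))
        for d in range(len(frames[0]))
    ]
    return pooled, weights


# Two hypothetical 2-dimensional frames.
frames = [[1.0, 0.0], [3.0, 2.0]]
uniform_pooled, uniform_weights = attention_pool(frames, [0.0, 0.0])
peaked_pooled, peaked_weights = attention_pool(frames, [0.0, 5.0])
```

With equal scores the attention weights are uniform and the result matches mean pooling; with a strongly peaked score the output moves toward the highest-scoring frame.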
Research Motivation and Goals
- Release pre-trained musically motivated CNNs for music tagging.
- Provide out-of-the-box tagging and feature-extraction capabilities.
- Enable transfer learning with pre-trained embeddings for downstream tasks.
- Offer VGG-like baselines for comparison and a training framework for reproducibility.
Proposed Method
- Train musically motivated CNNs (musicnn) on MagnaTagATune (MTT) and Million Song Dataset (MSD).
- Provide larger MSD-based model (MSD_musicnn_big) to leverage more data.
- Offer VGG-like baseline models for comparison (MTT_vgg, MSD_vgg).
- Expose top-tagging utility and feature extractors returning timbral, temporal, and CNN features.
- Demonstrate transfer learning using SVM classifiers on pre-extracted features with a PCA step.
- Publish training code and architecture details for reproducibility.
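The top-tagging utility mentioned in the list above can be pictured as a reduction over a taggram, that is, a matrix of per-frame tag scores. The sketch below is a minimal illustration under that assumption: it averages scores over time and returns the highest-scoring tag names. The function name, the tag vocabulary, and the scores are hypothetical, and the library's own utility may differ in its details.

```python
from typing import List, Sequence


def top_tags_from_taggram(
    taggram: Sequence[Sequence[float]],
    tags: Sequence[str],
    top_n: int = 3,
) -> List[str]:
    """Average per-frame tag scores over time and return the top_n tag names."""
    if not taggram:
        raise ValueError("taggram must contain at least one frame")
    n_frames = len(taggram)
    # Mean score per tag across all frames (one column of the taggram per tag).
    mean_scores = [
        sum(frame[i] for frame in taggram) / n_frames
        for i in range(len(tags))
    ]
    # Rank tag indices by descending mean score; ties keep the original order.
    ranked = sorted(range(len(tags)), key=lambda i: -mean_scores[i])
    return [tags[i] for i in ranked[:top_n]]


# Hypothetical 3-frame taggram over 4 made-up tags.
tags = ["guitar", "piano", "vocal", "techno"]
taggram = [
    [0.75, 0.25, 0.5, 0.0],
    [0.5, 0.25, 0.75, 0.0],
    [1.0, 0.0, 0.5, 0.25],
]
result = top_tags_from_taggram(taggram, tags, top_n=2)
```

Here `result` holds the two tags with the highest time-averaged scores. Averaging is only one possible reduction; a max over frames would favor tags that appear briefly but strongly.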
Experimental Results
Research Questions
- RQ1: Can pre-trained musicnn and VGG models achieve state-of-the-art tagging on the MagnaTagATune and MSD datasets?
- RQ2: How do musicnn-based embeddings perform as features for transfer learning compared to other audio representations?
- RQ3: How do the MTT-trained and MSD-trained models compare, and what is the impact of model size on MSD?
- RQ4: Can attention-based variants improve tagging performance over the standard musicnn/VGG architectures?
Key Results
| Model | Dataset | ROC-AUC | PR-AUC |
|---|---|---|---|
| MTT_musicnn | MagnaTagATune | 90.69 | 38.44 |
| MTT_vgg | MagnaTagATune | 90.26 | 38.19 |
| MSD_musicnn | Million Song Dataset | 88.01 | 28.90 |
| MSD_musicnn_big | Million Song Dataset | 88.41 | 30.02 |
| MSD_vgg | Million Song Dataset | 87.67 | 28.19 |
| MTT_musicnn_attention | MagnaTagATune (attention variant) | 90.77 | 38.61 |
| MSD_musicnn_attention | Million Song Dataset (attention variant) | 88.81 | 31.51 |
- MTT_musicnn achieves 90.69 ROC-AUC and 38.44 PR-AUC on MagnaTagATune.
- MTT_vgg achieves 90.26 ROC-AUC and 38.19 PR-AUC on MagnaTagATune.
- MSD_musicnn achieves 88.01 ROC-AUC and 28.90 PR-AUC on MSD.
- MSD_musicnn_big achieves 88.41 ROC-AUC and 30.02 PR-AUC on MSD.
- MSD_vgg achieves 87.67 ROC-AUC and 28.19 PR-AUC on MSD.
- An attention-based variant reportedly yields 90.77 ROC-AUC and 38.61 PR-AUC on MagnaTagATune and 88.81 ROC-AUC and 31.51 PR-AUC on MSD.
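To make the two reported metrics concrete, the sketch below computes ROC-AUC and a PR-AUC estimate per tag and then averages across tags. Several things here are assumptions rather than details from the paper: PR-AUC is approximated by average precision, the per-tag scores are averaged with equal weight, scores are assumed to have no ties for the precision computation, and the labels and scores are made-up toy values. The results are on a 0 to 1 scale, whereas the table values appear to be percentages.

```python
from typing import Callable, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # Count pairwise wins; a tie counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision values at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


def macro_average(
    metric: Callable[[Sequence[int], Sequence[float]], float],
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """Apply metric to each tag column (rows are clips) and average."""
    n_tags = len(label_matrix[0])
    per_tag: List[float] = []
    for t in range(n_tags):
        col_labels = [row[t] for row in label_matrix]
        col_scores = [row[t] for row in score_matrix]
        per_tag.append(metric(col_labels, col_scores))
    return sum(per_tag) / n_tags


# Toy data: 4 clips, 2 tags (values are invented for illustration).
Y = [[1, 0], [0, 1], [1, 0], [0, 1]]
S = [[0.9, 0.2], [0.8, 0.6], [0.7, 0.4], [0.1, 0.8]]
macro_roc = macro_average(roc_auc, Y, S)
macro_pr = macro_average(average_precision, Y, S)
```

ROC-AUC is insensitive to class balance, while average precision drops quickly for rare tags, which is one plausible reason the PR-AUC values in the table are so much lower than the ROC-AUC values.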
This review was generated by AI and reviewed by a human editor.