QUICK REVIEW

[Paper Review] musicnn: Pre-trained convolutional neural networks for music audio tagging

Jordi Pons, Xavier Serra | arXiv (Cornell University) | 2019-09-14

Music and Audio Processing | References: 5 | Citations: 41

One-Line Summary

This paper presents musicnn, a set of pre-trained, musically motivated CNNs with VGG-like baselines, trained on MagnaTagATune and the Million Song Dataset, for out-of-the-box tagging, feature extraction, and transfer learning.

ABSTRACT

Pronounced as "musician", the musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging: https://github.com/jordipons/musicnn. This repository also includes some pre-trained vgg-like baselines. These models can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning. We also provide the code to train the aforementioned models: https://github.com/jordipons/musicnn-training. This framework also allows implementing novel models. For example, a musically motivated convolutional neural network with an attention-based output layer (instead of the temporal pooling layer) can achieve state-of-the-art results for music audio tagging: 90.77 ROC-AUC / 38.61 PR-AUC on the MagnaTagATune dataset, and 88.81 ROC-AUC / 31.51 PR-AUC on the Million Song Dataset.
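
The abstract contrasts an attention-based output layer with a temporal pooling layer. As a rough illustration of that difference, the sketch below pools a sequence of frame-level feature vectors in two ways: a plain temporal mean, and a softmax-weighted sum. The review does not describe the actual attention layer, so the dot-product scoring and the `query` vector (standing in for learned parameters) are assumptions made for illustration, not the authors' architecture.

```python
import math


def mean_pool(frames):
    """Temporal mean pooling: average each feature dimension over all frames."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, query):
    """Weight each frame by softmax(frame . query), then sum the weighted frames.

    `query` is a hypothetical stand-in for parameters a real model would learn.
    Returns the pooled vector and the per-frame attention weights.
    """
    scores = [sum(f * q for f, q in zip(frame, query)) for frame in frames]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(len(query))
    ]
    return pooled, weights


# Made-up frame-level features: 3 time frames, 2 feature dimensions.
FRAMES = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

if __name__ == "__main__":
    print("mean pooling:     ", mean_pool(FRAMES))
    pooled, weights = attention_pool(FRAMES, query=[2.0, 0.0])
    print("attention weights:", weights)
    print("attention pooling:", pooled)
```

With an all-zero `query` every frame receives the same weight, so attention pooling reduces to the temporal mean; a non-zero `query` lets some frames contribute more than others.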

Motivation and Objectives

Release pre-trained musically motivated CNNs for music tagging.
Provide out-of-the-box tagging and feature-extraction capabilities.
Enable transfer learning with pre-trained embeddings for downstream tasks.
Offer VGG-like baselines for comparison and a training framework for reproducibility.

Proposed Method

Train musically motivated CNNs (musicnn) on MagnaTagATune (MTT) and Million Song Dataset (MSD).
Provide a larger MSD-based model (MSD_musicnn_big) to leverage more data.
Offer VGG-like baseline models for comparison (MTT_vgg, MSD_vgg).
Expose a top-tagging utility and feature extractors that return timbral, temporal, and CNN features.
Demonstrate transfer learning by training SVM classifiers on pre-extracted features after a PCA step.
Publish training code and architecture details for reproducibility.
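
To make the top-tagging idea concrete, the sketch below shows one way frame-level tag likelihoods (a "taggram") could be reduced to a ranked list of tags: average each tag over time and keep the highest means. It does not call the musicnn library; the function name `top_tags_from_taggram`, the tag vocabulary, and the numbers are hypothetical, and mean aggregation is an assumption about how such a utility might work.

```python
def top_tags_from_taggram(taggram, tags, top_n=3):
    """Rank tags by their mean likelihood across time frames.

    taggram: list of frames, each a list with one likelihood per tag.
    tags:    list of tag names, aligned with the columns of each frame.
    Returns a list of (tag, mean_likelihood) pairs, best first.
    """
    if not taggram:
        raise ValueError("taggram must contain at least one frame")
    if any(len(frame) != len(tags) for frame in taggram):
        raise ValueError("every frame needs one likelihood per tag")

    n_frames = len(taggram)
    # Temporal mean of each tag's likelihood.
    means = [sum(frame[i] for frame in taggram) / n_frames for i in range(len(tags))]
    # Indices of tags sorted by descending mean likelihood.
    ranked = sorted(range(len(tags)), key=lambda i: means[i], reverse=True)
    return [(tags[i], means[i]) for i in ranked[:top_n]]


# Hypothetical vocabulary and a made-up 3-frame taggram.
TAGS = ["guitar", "classical", "techno", "vocal"]
TAGGRAM = [
    [0.80, 0.10, 0.05, 0.30],
    [0.70, 0.20, 0.05, 0.50],
    [0.90, 0.30, 0.05, 0.40],
]

if __name__ == "__main__":
    for tag, likelihood in top_tags_from_taggram(TAGGRAM, TAGS, top_n=2):
        print(f"{tag}: {likelihood:.2f}")
```

The same per-frame matrix, or intermediate-layer activations, could serve as input features for a downstream classifier in the transfer-learning setting described above.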

Experimental Results

Research Questions

RQ1: Can the pre-trained musicnn and VGG-like models achieve state-of-the-art tagging on the MagnaTagATune and MSD datasets?
RQ2: How do musicnn-based embeddings perform as features for transfer learning compared to other audio representations?
RQ3: How do MTT-trained and MSD-trained models compare, and what is the impact of model size on MSD?
RQ4: Can attention-based variants improve tagging performance over the standard musicnn and VGG-like architectures?

Key Results

Model	Dataset	ROC-AUC (%)	PR-AUC (%)
MTT_musicnn	MagnaTagATune	90.69	38.44
MTT_vgg	MagnaTagATune	90.26	38.19
MSD_musicnn	Million Song Dataset	88.01	28.90
MSD_musicnn_big	Million Song Dataset	88.41	30.02
MSD_vgg	Million Song Dataset	87.67	28.19
MTT_musicnn_attention	MagnaTagATune (attention variant)	90.77	38.61
MSD_musicnn_attention	Million Song Dataset (attention variant)	88.81	31.51

On MagnaTagATune, MTT_musicnn (90.69 ROC-AUC / 38.44 PR-AUC) scores slightly above the MTT_vgg baseline (90.26 / 38.19).
On MSD, MSD_musicnn (88.01 / 28.90) scores above the MSD_vgg baseline (87.67 / 28.19), and the larger MSD_musicnn_big improves further to 88.41 / 30.02.
The attention-based variant is reported to score highest on both datasets: 90.77 / 38.61 on MagnaTagATune and 88.81 / 31.51 on MSD.
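
For readers unfamiliar with the two metrics, the sketch below computes ROC-AUC and a PR-AUC estimate for a single tag in plain Python, then macro-averages over tags. The review does not state how the paper computes or averages these metrics, so the use of average precision as the PR-AUC estimate and the macro-averaging across tags are assumptions; the labels and scores are made up.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if labels[index] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return precision_sum / hits


def macro_average(metric, labels_per_tag, scores_per_tag):
    """Compute `metric` independently for each tag, then average the results."""
    values = [metric(y, s) for y, s in zip(labels_per_tag, scores_per_tag)]
    return sum(values) / len(values)


# Made-up ground truth and predictions for two tags over four clips.
LABELS = [[1, 0, 1, 0], [0, 1, 0, 1]]
SCORES = [[0.9, 0.8, 0.7, 0.1], [0.9, 0.8, 0.7, 0.1]]

if __name__ == "__main__":
    print("macro ROC-AUC:", 100 * macro_average(roc_auc, LABELS, SCORES))
    print("macro PR-AUC: ", 100 * macro_average(average_precision, LABELS, SCORES))
```

The results are multiplied by 100 to match the percentage scale used in the table. PR-AUC is typically much lower than ROC-AUC when positive examples are rare, which is consistent with the gap between the two columns above.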

This review was generated by AI and reviewed by a human editor.