QUICK REVIEW

[Paper Review] musicnn: Pre-trained convolutional neural networks for music audio tagging

Jordi Pons, Xavier Serra | arXiv (Cornell University) | 2019-09-14
Music and Audio Processing | References: 5 | Citations: 41
One-Line Summary

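This paper presents musicnn, a set of pre-trained, musically motivated CNNs for music tagging, together with VGG-like baselines. The models are trained on MagnaTagATune and the Million Song Dataset and can serve as taggers, as feature extractors, and as a source of features for transfer learning.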

ABSTRACT

Pronounced as "musician", the musicnn library contains a set of pre-trained musically motivated convolutional neural networks for music audio tagging: https://github.com/jordipons/musicnn. This repository also includes some pre-trained vgg-like baselines. These models can be used as out-of-the-box music audio taggers, as music feature extractors, or as pre-trained models for transfer learning. We also provide the code to train the aforementioned models: https://github.com/jordipons/musicnn-training. This framework also allows implementing novel models. For example, a musically motivated convolutional neural network with an attention-based output layer (instead of the temporal pooling layer) can achieve state-of-the-art results for music audio tagging: 90.77 ROC-AUC / 38.61 PR-AUC on the MagnaTagATune dataset, and 88.81 ROC-AUC / 31.51 PR-AUC on the Million Song Dataset.

Motivation and Objectives

  • Release pre-trained musically motivated CNNs for music tagging.
  • Provide out-of-the-box tagging and feature-extraction capabilities.
  • Enable transfer learning with pre-trained embeddings for downstream tasks.
  • Offer VGG-like baselines for comparison and a training framework for reproducibility.

Proposed Method

  • Train musically motivated CNNs (musicnn) on MagnaTagATune (MTT) and Million Song Dataset (MSD).
  • Provide a larger MSD-based model (MSD_musicnn_big) to leverage more data.
  • Offer VGG-like baseline models for comparison (MTT_vgg, MSD_vgg).
  • Expose a top-tags utility and feature extractors that return timbral, temporal, and CNN features.
  • Demonstrate transfer learning using SVM classifiers on pre-extracted features with a PCA step.
  • Publish training code and architecture details for reproducibility.
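
The tagging and feature-extraction workflow listed above can be sketched roughly as follows. The helper names `top_n_from_taggram` and `tag_file` are hypothetical and exist only for this illustration. The `musicnn.extractor.extractor` call with `extract_features=True` follows the library's README and should be checked against the installed version. Averaging the taggram over time is assumed here as a simple way to summarize a clip.

```python
from typing import List, Sequence, Tuple


def top_n_from_taggram(
    taggram: Sequence[Sequence[float]], tags: Sequence[str], n: int = 3
) -> List[Tuple[str, float]]:
    """Average per-frame tag likelihoods over time and return the n highest tags."""
    if not taggram:
        raise ValueError("taggram must contain at least one time frame")
    n_frames = len(taggram)
    # Mean likelihood of each tag across all time frames.
    means = [sum(frame[j] for frame in taggram) / n_frames for j in range(len(tags))]
    ranked = sorted(zip(tags, means), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


def tag_file(path: str, model: str = "MTT_musicnn", n: int = 3):
    """Hypothetical wrapper: run a pre-trained model on one audio file.

    Requires the musicnn package (pip install musicnn) and a readable audio file.
    """
    # Imported lazily so the pure-Python helper above works without the library.
    from musicnn.extractor import extractor

    # With extract_features=True the call also returns a dict of intermediate features.
    taggram, tags, features = extractor(path, model=model, extract_features=True)
    return top_n_from_taggram(taggram.tolist(), tags, n), features


if __name__ == "__main__":
    # Toy taggram: 2 time frames x 3 tags (made-up values, not model output).
    demo = [[0.9, 0.1, 0.5], [0.7, 0.3, 0.5]]
    print(top_n_from_taggram(demo, ["guitar", "vocal", "rock"], n=2))
```

The returned feature dictionary is what a downstream classifier, such as the PCA plus SVM setup mentioned above, would consume.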

Experimental Results

Research Questions

  • RQ1: Can pre-trained musicnn and vgg models achieve state-of-the-art tagging on MagnaTagATune and MSD datasets?
  • RQ2: How do musicnn-based embeddings perform as features for transfer learning compared to other audio representations?
  • RQ3: What are the comparative performances of MTT vs MSD trained models and the impact of model size on MSD?
  • RQ4: Can attention-based variants improve tagging performance over the standard musicnn/VGG architectures?
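
To make the contrast behind the attention question concrete, here is a minimal sketch of temporal mean pooling versus attention-weighted pooling over time frames. It is not the authors' output layer: in a trained network the per-frame scores would come from learned parameters, whereas here they are passed in by hand. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Temporal mean pooling: every time frame contributes equally."""
    n = len(frames)
    return [sum(f[j] for f in frames) / n for j in range(len(frames[0]))]


def attention_pool(
    frames: Sequence[Sequence[float]], scores: Sequence[float]
) -> List[float]:
    """Weighted temporal pooling: frames with higher scores contribute more.

    `scores` holds one relevance score per frame (hand-set in this sketch).
    """
    if len(frames) != len(scores):
        raise ValueError("need exactly one score per frame")
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[j] for w, f in zip(weights, frames)) for j in range(dim)]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [0.0, 1.0]]  # 2 frames x 2 feature dimensions
    print(mean_pool(frames))                      # equal weighting
    print(attention_pool(frames, [0.0, 0.0]))     # equal scores reduce to the mean
    print(attention_pool(frames, [10.0, -10.0]))  # first frame dominates
```

With equal scores the attention weights are uniform and the result matches mean pooling, which shows that mean pooling is the special case where no frame is favored.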

Key Results

Model                 | Dataset                                   | ROC-AUC | PR-AUC
----------------------|-------------------------------------------|---------|-------
MTT_musicnn           | MagnaTagATune                             | 90.69   | 38.44
MTT_vgg               | MagnaTagATune                             | 90.26   | 38.19
MSD_musicnn           | Million Song Dataset                      | 88.01   | 28.90
MSD_musicnn_big       | Million Song Dataset                      | 88.41   | 30.02
MSD_vgg               | Million Song Dataset                      | 87.67   | 28.19
MTT_musicnn_attention | MagnaTagATune (attention variant)         | 90.77   | 38.61
MSD_musicnn_attention | Million Song Dataset (attention variant)  | 88.81   | 31.51

  • MTT_musicnn achieves 90.69 ROC-AUC and 38.44 PR-AUC on MagnaTagATune.
  • MTT_vgg achieves 90.26 ROC-AUC and 38.19 PR-AUC on MagnaTagATune.
  • MSD_musicnn achieves 88.01 ROC-AUC and 28.90 PR-AUC on MSD.
  • MSD_musicnn_big achieves 88.41 ROC-AUC and 30.02 PR-AUC on MSD.
  • MSD_vgg achieves 87.67 ROC-AUC and 28.19 PR-AUC on MSD.
  • An attention-based variant reportedly yields 90.77 ROC-AUC and 38.61 PR-AUC on MagnaTagATune and 88.81 ROC-AUC and 31.51 PR-AUC on MSD.
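
For readers less familiar with the two metrics, the sketch below computes ROC-AUC and PR-AUC (as average precision) in plain Python and averages them over tags. The reported figures are these quantities multiplied by 100. Macro-averaging over tags is an assumption here, since this review does not state the averaging scheme, and the function names are hypothetical. Tied scores are ordered arbitrarily in `average_precision`, so library implementations may differ slightly when ties occur.

```python
from typing import Callable, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive example")
    return sum(precisions) / len(precisions)


def macro_average(
    metric: Callable[[Sequence[int], Sequence[float]], float],
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """Compute a metric separately per tag (column), then average over tags."""
    n_tags = len(label_matrix[0])
    per_tag = [
        metric([row[j] for row in label_matrix], [row[j] for row in score_matrix])
        for j in range(n_tags)
    ]
    return sum(per_tag) / n_tags


if __name__ == "__main__":
    # Toy data: 4 clips x 2 tags (made-up labels and scores).
    y = [[1, 0], [0, 1], [1, 0], [0, 1]]
    s = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2], [0.1, 0.8]]
    print(100 * macro_average(roc_auc, y, s))
    print(100 * macro_average(average_precision, y, s))
```

PR-AUC is far lower than ROC-AUC in the results above because most tags are rare, and precision is sensitive to the ratio of positives to negatives while ROC-AUC is not.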

This review was generated by AI and reviewed by a human editor.