[Paper Review] Transfer learning for music classification and regression tasks
This paper trains a convnet for music tagging and transfers its multi-layer features to six target music and audio tasks, where they outperform MFCC baselines and are competitive with task-specific methods.
In this paper, we present a transfer learning approach for music classification and regression tasks. We propose to use a pre-trained convnet feature, a concatenated feature vector using the activations of feature maps of multiple layers in a trained convolutional network. We show how this convnet feature can serve as general-purpose music representation. In the experiments, a convnet is trained for music tagging and then transferred to other music-related classification and regression tasks. The convnet feature outperforms the baseline MFCC feature in all the considered tasks and several previous approaches that are aggregating MFCCs as well as low- and high-level music features.
Motivation & Objective
- Motivate transfer learning to address data sparsity in Music Information Retrieval (MIR).
- Propose a convnet feature extractor that concatenates activations from multiple layers for transfer.
- Evaluate the transferred features across six diverse music and audio tasks.
- Compare convnet features with MFCC baselines and random-weight convnets to assess knowledge transfer versus architecture.
Proposed method
- Train a convolutional neural network on a music tagging source task using mel-spectrogram inputs.
- Extract a concatenated convnet feature by aggregating activations from multiple layers (1st–5th) with average pooling where needed.
- Assess multiple layer-combination strategies (e.g., 123, 135, 12345) to find effective representations for each target task.
- Use SVMs for classification/regression on target tasks to focus on feature quality rather than classifier complexity.
- Compare convnet features against MFCC baselines and random convnet features, across six target tasks.
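The feature-extraction and layer-combination steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes each feature map is average-pooled to a single scalar before concatenation, and the names `global_average_pool`, `convnet_feature`, `layer_strategies`, and the toy `activations` values are all hypothetical.

```python
from itertools import combinations


def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (a list of rows) to its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def convnet_feature(activations, strategy):
    """Concatenate pooled activations of the layers named in `strategy`.

    `activations` maps a layer index (1-5) to that layer's feature maps;
    `strategy` is a string of layer digits such as "135" or "12345".
    """
    feature = []
    for digit in strategy:
        feature.extend(global_average_pool(activations[int(digit)]))
    return feature


def layer_strategies(n_layers=5):
    """List every non-empty combination of layers, e.g. "1", "13", "12345"."""
    layers = "".join(str(i) for i in range(1, n_layers + 1))
    return [
        "".join(combo)
        for k in range(1, n_layers + 1)
        for combo in combinations(layers, k)
    ]


# Toy activations (assumed values): two feature maps per layer,
# with spatial size shrinking in deeper layers.
activations = {
    1: [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 2.0], [2.0, 4.0]]],
    2: [[[2.0, 4.0]], [[6.0, 6.0]]],
    3: [[[1.0]], [[9.0]]],
    4: [[[0.5]], [[1.5]]],
    5: [[[2.5]], [[3.5]]],
}

feature_13 = convnet_feature(activations, "13")
feature_all = convnet_feature(activations, "12345")
strategies = layer_strategies()
```

Each vector produced this way would then be passed to an SVM for the target task, so differences in scores reflect the layer combination rather than the classifier.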
Experimental results
Research questions
- RQ1: Can a pre-trained convnet on music tagging serve as a general-purpose feature extractor for diverse MIR tasks?
- RQ2: Which layer-wise feature combinations provide the most effective representations for each target task?
- RQ3: Do convnet features outperform MFCC baselines and how do they compare to task-specific state-of-the-art methods?
- RQ4: Is concatenating MFCC features with convnet features beneficial or redundant for these tasks?
Key findings
- The convnet feature outperforms MFCC baselines in all six target tasks.
- Concatenating features from multiple layers (e.g., 12345) often yields the best performance, especially for complex tasks.
- In several tasks, the convnet features alone rival state-of-the-art approaches that rely on hand-crafted features or task-specific designs.
- Random convnet features underperform the trained convnet features, indicating gains come from learned transfer knowledge rather than network structure alone.
- For Task 6 (acoustic event detection), combining convnet features with MFCCs improves performance, suggesting complementary information, unlike the other tasks where MFCCs add little value.
This review was created by AI and reviewed by human editors.