[Paper Review] Multi-level Attention Model for Weakly Supervised Audio Classification
The paper extends a single-level attention model to a multi-level attention framework that applies attention modules at multiple intermediate layers, achieving higher mean average precision (mAP) on Audio Set than previous methods.
In this paper, we propose a multi-level attention model to solve the weakly labelled audio classification problem. The objective of audio classification is to predict the presence or absence of audio events in an audio clip. Recently, Google published a large-scale weakly labelled dataset called Audio Set, where each audio clip is labelled only with the presence or absence of audio events, without their onset and offset times. Our multi-level attention model is an extension of the previously proposed single-level attention model. It consists of several attention modules applied to intermediate neural network layers. The outputs of these attention modules are concatenated into a vector, followed by a multi-label classifier that makes the final prediction for each class. Experiments show that our model achieves a mean average precision (mAP) of 0.360, outperforming the state-of-the-art single-level attention model (0.327) and the Google baseline (0.314).
Motivation & Objective
- Address weakly labeled audio classification where only presence/absence of events is known per clip.
- Leverage multi-level representations from intermediate neural network layers to improve event detection.
- Demonstrate that concatenating multi-level attended features yields superior performance on Audio Set.
Proposed method
- Apply attention modules after multiple intermediate layers of a neural network.
- Compute predictions from each attention module as y^(l) and concatenate them into a single vector u.
- Use a final fully connected layer with sigmoid activation to produce class probabilities.
- Train with dropout and batch normalization, using the Adam optimizer.
- Compare nine variants including single-level and multi-level architectures.
- Evaluate using mAP, AUC, and d-prime on Audio Set.
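The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the softmax-over-time attention, the ReLU hidden layers, the weight shapes, and the names `attention_pool`, `multi_level_forward`, `hidden_weights`, and `heads` are all hypothetical choices made here for clarity. Dropout, batch normalization, and training are omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def attention_pool(h, W_cla, W_att):
    """Pool frame-level features h of shape (T, D) into a clip-level vector (K,).

    Assumption: attention is a softmax over the time axis, computed per class.
    """
    cla = sigmoid(h @ W_cla)                      # (T, K) per-frame class scores
    logits = h @ W_att                            # (T, K) attention logits
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    att = np.exp(logits)
    att = att / att.sum(axis=0, keepdims=True)    # weights sum to 1 over time
    return (att * cla).sum(axis=0)                # attention-weighted average


def multi_level_forward(x, hidden_weights, heads, W_out):
    """Run hidden layers, attach attention heads at chosen depths, and classify.

    heads maps a hidden-layer index to a (W_cla, W_att) pair (hypothetical layout).
    """
    h = x
    pooled = []
    for i, W in enumerate(hidden_weights):
        h = np.maximum(h @ W, 0.0)                # ReLU hidden layer (assumption)
        if i in heads:
            pooled.append(attention_pool(h, *heads[i]))  # y^(l) for this level
    u = np.concatenate(pooled)                    # concatenated vector u
    return sigmoid(u @ W_out)                     # final class probabilities


# Toy usage: 3 hidden layers with attention after the 2nd and 3rd,
# which is how the "2-A-1-A" name is read here.
rng = np.random.default_rng(0)
T, D, H, K = 10, 128, 64, 5
x = rng.normal(size=(T, D))
hidden_weights = [rng.normal(scale=0.1, size=s) for s in [(D, H), (H, H), (H, H)]]
heads = {i: (rng.normal(scale=0.1, size=(H, K)), rng.normal(scale=0.1, size=(H, K)))
         for i in (1, 2)}
W_out = rng.normal(scale=0.1, size=(len(heads) * K, K))
probs = multi_level_forward(x, hidden_weights, heads, W_out)
print(probs.shape)
```

Because each attention head pools over time independently, the final layer can weight shallow and deep representations differently for each class.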
Experimental results
Research questions
- RQ1: Does incorporating attention at multiple network levels improve weakly supervised audio classification performance on Audio Set?
- RQ2: Which configurations of multi-level attention yield the best trade-off between performance and complexity?
- RQ3: How do multi-level features compare with single-level attention and the Google baseline on key metrics (mAP, AUC, d-prime)?
Key findings
- Multi-level attention models outperform the Google baseline and the single-level attention model across mAP, AUC, and d-prime.
- The best architecture (2-A-1-A) achieves an mAP of 0.360, compared with 0.314 for the Google baseline and 0.327 for the single-level attention model.
- Concatenating multi-level features provides richer representations and allows each class to benefit from different layer representations.
- Performance gains are not uniform across all classes; some classes favor different architectures.
- Overall, multi-level feature concatenation improves performance on most classes.
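For readers unfamiliar with the three metrics, the sketch below shows one common way to compute them for a single class and then average AP over classes. It is an illustration, not the paper's evaluation code: the function names are hypothetical, and the d-prime formula, sqrt(2) times the inverse normal CDF of AUC, is the usual convention for this metric and is assumed here rather than taken from the review. The toy labels and scores are made up for the example.

```python
import math
from statistics import NormalDist

import numpy as np


def average_precision(y_true, y_score):
    """Mean of the precision values at the rank of each positive example."""
    order = np.argsort(-y_score, kind="stable")   # highest score first
    y = y_true[order]
    hits = np.cumsum(y)
    precision = hits / np.arange(1, len(y) + 1)
    return float((precision * y).sum() / y.sum())


def roc_auc(y_true, y_score):
    """Probability that a random positive outscores a random negative.

    Pairwise form: O(P * N) memory, fine for a sketch but not for large sets.
    """
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


def d_prime(auc):
    """Assumed convention: d' = sqrt(2) * Phi^{-1}(AUC); needs 0 < auc < 1."""
    return math.sqrt(2.0) * NormalDist().inv_cdf(auc)


def mean_average_precision(Y_true, Y_score):
    """Average the per-class AP over the columns of (clips, classes) arrays."""
    return float(np.mean([average_precision(Y_true[:, k], Y_score[:, k])
                          for k in range(Y_true.shape[1])]))


# Toy example for one class (made-up values).
y_true = np.array([1, 0, 1, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.1])
ap = average_precision(y_true, y_score)   # (1/1 + 2/3) / 2
auc = roc_auc(y_true, y_score)            # 3 of 4 positive/negative pairs ordered correctly
print(ap, auc, d_prime(auc))
```

Since mAP averages per-class AP, a gain in the headline number is compatible with individual classes getting worse, which matches the observation that gains are not uniform across classes.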
This review was created by AI and reviewed by human editors.