QUICK REVIEW

[Paper Review] Multi-level Attention Model for Weakly Supervised Audio Classification

Changsong Yu, Karim Said Barsim | arXiv (Cornell University) | 2018-03-06

Music and Audio Processing | References: 22 | Citations: 63

One-Line Summary

The paper extends a single-level attention model to a multi-level attention framework that applies attention modules at multiple intermediate layers, achieving higher mean average precision (mAP) on Audio Set than previous methods.

ABSTRACT

In this paper, we propose a multi-level attention model to solve the weakly labelled audio classification problem. The objective of audio classification is to predict the presence or absence of audio events in an audio clip. Recently, Google published a large-scale weakly labelled dataset called Audio Set, where each audio clip is annotated only with the presence or absence of the audio events, without the onset and offset times of the audio events. Our multi-level attention model is an extension to the previously proposed single-level attention model. It consists of several attention modules applied on intermediate neural network layers. The outputs of these attention modules are concatenated into a vector, followed by a multi-label classifier that makes the final prediction for each class. Experiments show that our model achieves a mean average precision (mAP) of 0.360, outperforming the state-of-the-art single-level attention model (0.327) and the Google baseline (0.314).

Research Motivation and Goals

Address weakly labeled audio classification where only presence/absence of events is known per clip.
Leverage multi-level representations from intermediate neural network layers to improve event detection.
Demonstrate that concatenating multi-level attended features yields superior performance on Audio Set.

Proposed Method

Apply attention modules after multiple intermediate layers of a neural network.
Compute a prediction y^(l) from the attention module at each level l and concatenate these predictions into a single vector u.
Use a final fully connected layer with sigmoid activation to produce class probabilities.
Train with dropout and batch normalization, using the Adam optimizer.
Compare nine variants including single-level and multi-level architectures.
Evaluate using mAP, AUC, and d-prime on Audio Set.
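
The forward pass described in the steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sigmoid attention nonlinearity, the ReLU hidden layers, the random weight initialization, the reading of a name such as 2-A-1-A as "two hidden layers, attention, one hidden layer, attention", and all function and parameter names are hypothetical. Dropout, batch normalization, and training with Adam are omitted.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def attention_pool(h, w_cla, w_att):
    """Pool frame-level features h (T, H) into clip-level scores y^(l) (K,)."""
    cla = sigmoid(h @ w_cla)                    # (T, K) per-frame class probabilities
    att = sigmoid(h @ w_att)                    # (T, K) positive attention (assumed sigmoid)
    att = att / att.sum(axis=0, keepdims=True)  # normalize over time, per class
    return (att * cla).sum(axis=0)              # attention-weighted average over frames


def multi_level_forward(x, params):
    """x: (T, D) frame features for one clip. Returns (K,) class probabilities."""
    h = x
    level_outputs = []
    for level in params["levels"]:
        for w in level["hidden"]:
            h = np.maximum(h @ w, 0.0)          # fully connected layer + ReLU (assumed)
        level_outputs.append(attention_pool(h, level["w_cla"], level["w_att"]))
    u = np.concatenate(level_outputs)           # concatenated vector u, shape (L * K,)
    return sigmoid(u @ params["w_out"] + params["b_out"])


def init_params(rng, in_dim, hidden_dim, n_classes, layers_per_level):
    """Random weights for illustration only; layers_per_level=(2, 1) gives two levels."""
    levels, d = [], in_dim
    for n_layers in layers_per_level:
        hidden = []
        for _ in range(n_layers):
            hidden.append(rng.normal(0.0, 0.1, size=(d, hidden_dim)))
            d = hidden_dim
        levels.append({
            "hidden": hidden,
            "w_cla": rng.normal(0.0, 0.1, size=(d, n_classes)),
            "w_att": rng.normal(0.0, 0.1, size=(d, n_classes)),
        })
    n_levels = len(layers_per_level)
    return {
        "levels": levels,
        "w_out": rng.normal(0.0, 0.1, size=(n_levels * n_classes, n_classes)),
        "b_out": np.zeros(n_classes),
    }


# Usage with made-up sizes: 10 frames, 128-d features, 5 classes.
rng = np.random.default_rng(0)
params = init_params(rng, in_dim=128, hidden_dim=32, n_classes=5, layers_per_level=(2, 1))
probs = multi_level_forward(rng.normal(size=(10, 128)), params)
print(probs.shape)  # (5,)
```

Because each attention module normalizes its weights over time, every y^(l) is a weighted average of per-frame probabilities, so a clip-level label is enough to train it without onset or offset annotations.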

Experimental Results

Research Questions

RQ1: Does incorporating attention at multiple network levels improve weakly supervised audio classification performance on Audio Set?
RQ2: Which configurations of multi-level attention yield the best trade-off between performance and complexity?
RQ3: How do multi-level features compare to single-level attention and the Google baseline on key metrics (mAP, AUC, d-prime)?

Key Results

Multi-level attention models outperform the Google baseline and the single-level attention model across mAP, AUC, and d-prime.
The best architecture (2-A-1-A) achieves an mAP of 0.360, compared with 0.314 for the Google baseline and 0.327 for the prior single-level attention model.
Concatenating multi-level features provides richer representations and allows each class to benefit from different layer representations.
Performance gains are not uniform across all classes; some classes favor different architectures.
Overall, multi-level feature concatenation improves performance on most classes.
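
For readers unfamiliar with the three metrics, the sketch below shows one common way to compute them for a single class using only the Python standard library. The review does not give formulas, so these are assumptions: average precision is taken as the mean of precision at each positive item in the ranked list, AUC is computed by pairwise comparison, d-prime is derived from AUC as sqrt(2) times the inverse standard normal CDF, and mAP is taken as the unweighted mean of per-class average precision. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def average_precision(labels, scores):
    """Mean of precision@rank over the positive items, ranked by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else float("nan")


def roc_auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def d_prime(auc):
    """d' = sqrt(2) * Phi^{-1}(AUC); inv_cdf requires 0 < auc < 1."""
    return sqrt(2.0) * NormalDist().inv_cdf(auc)


def mean_average_precision(per_class):
    """per_class: list of (labels, scores) pairs, one pair per class."""
    aps = [average_precision(labels, scores) for labels, scores in per_class]
    return sum(aps) / len(aps)


# Toy example for one class: four clips, two of which contain the event.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
auc = roc_auc(labels, scores)
print(average_precision(labels, scores), auc, d_prime(auc))
```

In this toy example the positives are ranked first and third, so average precision is (1/1 + 2/3) / 2 and AUC is 3/4, since three of the four positive-negative pairs are ordered correctly.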

This review was generated by AI and checked by a human editor.