QUICK REVIEW

[Paper Review] AST: Audio Spectrogram Transformer

Yuan Gong, Yu-An Chung | arXiv (Cornell University) | 2021. 04. 05.

Music and Audio Processing · References: 35 · Citations: 31

One-Line Summary

This paper introduces AST, a convolution-free, purely attention-based audio classification model. By transferring pretrained Vision Transformer weights, it achieves state-of-the-art results on AudioSet, ESC-50, and Speech Commands V2.

ABSTRACT

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.

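Motivation and Objectives

Assess whether a CNN is necessary for strong audio classification performance.
Develop a purely attention-based model that operates on spectrograms and captures long-range context.
Explore transfer learning from an ImageNet-pretrained ViT to AST.
Compare AST against CNN-based and CNN-attention hybrid models across several audio datasets.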

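Proposed Method

Split the log-Mel spectrogram into overlapping 16x16 patches and linearly project each patch to a 768-d patch embedding.
Add a learnable 768-d positional embedding to each patch embedding and prepend a [CLS] token to form the input to a 12-layer, 12-head Transformer encoder.
Use the [CLS] token output for classification through a linear layer with sigmoid activation.
Adapt ImageNet-pretrained ViT weights to AST by averaging the weights across the input channels and interpolating the positional embeddings to fit variable input lengths.
Improve performance with ImageNet pretraining, data augmentation (mixup, SpecAugment-style masking), and model averaging/ensembling.
Evaluate the effect of transfer learning on balanced and full AudioSet, ESC-50, and Speech Commands V2.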
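
The patch-splitting and channel-averaging steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the overlap of 6 (stride 10), the 128 x 1024 spectrogram size, the frequency-major patch order, and all function names are assumptions chosen for the example.

```python
def patch_grid(n_mels, n_frames, patch=16, overlap=6):
    """Return (frequency patches, time patches) for overlapping square patches."""
    stride = patch - overlap  # assumed overlap of 6 gives a stride of 10
    f_dim = (n_mels - patch) // stride + 1
    t_dim = (n_frames - patch) // stride + 1
    return f_dim, t_dim


def split_patches(spec, patch=16, overlap=6):
    """Split spec[mel][frame] into flattened patch vectors of length patch * patch.

    Patches are emitted frequency-major (all time steps of the lowest band first);
    that ordering is an assumption of this sketch.
    """
    stride = patch - overlap
    f_dim, t_dim = patch_grid(len(spec), len(spec[0]), patch, overlap)
    patches = []
    for fi in range(f_dim):
        for ti in range(t_dim):
            f0, t0 = fi * stride, ti * stride
            patches.append([spec[f][t]
                            for f in range(f0, f0 + patch)
                            for t in range(t0, t0 + patch)])
    return patches


def average_input_channels(channel_weights):
    """Collapse per-channel weights (e.g. R, G, B) into one channel by averaging."""
    n = len(channel_weights)
    return [sum(values) / n for values in zip(*channel_weights)]


# Token count for a hypothetical 128-bin, 1024-frame spectrogram (plus one [CLS] token).
f_dim, t_dim = patch_grid(128, 1024)
num_tokens = f_dim * t_dim + 1
```

Each flattened 256-value patch would then be multiplied by a 256 x 768 projection matrix to obtain the patch embedding; a single-channel spectrogram can reuse a three-channel ViT projection once its weights are averaged as in `average_input_channels`.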

Figure 1: The proposed audio spectrogram transformer (AST) architecture. The 2D audio spectrogram is split into a sequence of 16 $\times$ 16 patches with overlap, and then linearly projected to a sequence of 1-D patch embeddings. Each patch embedding is added with a learnable positional embedding. A

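Experimental Results

Research Questions

RQ1: Can a convolution-free, purely attention-based model match or exceed the performance of CNN-based and CNN-attention hybrid architectures?
RQ2: Does transfer learning from an ImageNet-pretrained ViT improve audio classification performance when applied to AST?
RQ3: How does AST perform on AudioSet, ESC-50, and Speech Commands V2, which have different input lengths?
RQ4: Which design choices, such as patch size and overlap, positional embedding adaptation, and patch shape, affect AST performance the most?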

Key Results

Model	Architecture	Balanced mAP	Full mAP
Baseline [15]	CNN+MLP	-	-
PANNs [7]	CNN+Attention	0.278	0.439
PSLA [8] (Single)	CNN+Attention	0.319	0.444
PSLA (Ensemble-S)	CNN+Attention	0.345	0.464
PSLA (Ensemble-M)	CNN+Attention	0.362	0.474
AST (Single)	Pure Attention	0.347 ± 0.001	0.459 ± 0.000
AST (Ensemble-S)	Pure Attention	0.363	0.475
AST (Ensemble-M)	Pure Attention	0.378	0.485
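
The mAP columns in the table are mean average precision over classes. A minimal sketch of that metric for multi-label predictions follows, assuming the common ranking-based definition of average precision with no special handling of tied scores; the function names are hypothetical, and the evaluation code behind the reported numbers may differ in such details.

```python
def average_precision(scores, labels):
    """Average of precision@k taken at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(score_rows, label_rows):
    """Mean of per-class AP; rows are clips, columns are classes.

    Classes with no positive example are skipped (an assumption of this sketch).
    """
    n_classes = len(score_rows[0])
    aps = []
    for c in range(n_classes):
        class_scores = [row[c] for row in score_rows]
        class_labels = [row[c] for row in label_rows]
        if any(class_labels):
            aps.append(average_precision(class_scores, class_labels))
    return sum(aps) / len(aps) if aps else 0.0
```

Because each class is ranked independently, this metric suits AudioSet-style multi-label tagging, where a clip can carry several labels and the sigmoid outputs are not forced to sum to one.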

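AST achieves state-of-the-art results on AudioSet: 0.485 mAP on the full set with an ensemble and 0.459 mAP with a single model.
AST outperforms prior CNN and CNN-attention hybrid models on AudioSet in both the balanced and full settings.
On ESC-50, AST-S (ImageNet pretraining only) reaches 88.7% accuracy and AST-P (ImageNet and AudioSet pretraining) reaches 95.6%, exceeding the previous SOTA in both settings.
On Speech Commands V2, AST-S achieves 98.11% accuracy and AST-P achieves 97.88%, so ImageNet+AudioSet pretraining is not always needed for the best performance on this task.
ImageNet pretraining improves performance substantially, especially when in-domain data is scarce; among the ViT weights evaluated, the DeiT-based weights gave the best results on AudioSet.
For positional embedding adaptation, cutting and bilinearly interpolating the ViT positional embeddings is important for carrying ViT priors over to AST.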
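
The cut-and-interpolate adaptation of positional embeddings can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes a 2-D grid of ViT positional embeddings, a centered crop along frequency, and endpoint-aligned interpolation along time (with the frequency axis only cropped, bilinear interpolation reduces to linear interpolation in time). The function names are hypothetical.

```python
def center_crop(rows, new_len):
    """Keep the central new_len rows (assumed cropping strategy)."""
    start = (len(rows) - new_len) // 2
    return rows[start:start + new_len]


def lerp_resize(seq, new_len):
    """Linearly resample a list of embedding vectors to new_len positions."""
    old_len = len(seq)
    if new_len == 1 or old_len == 1:
        return [list(seq[0]) for _ in range(new_len)]
    out = []
    for j in range(new_len):
        pos = j * (old_len - 1) / (new_len - 1)  # endpoints map onto endpoints
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(seq[lo], seq[hi])])
    return out


def adapt_pos_embed(grid, new_f, new_t):
    """grid[freq][time] holds one embedding vector per ViT patch position.

    The frequency axis is cropped and the time axis is interpolated.
    """
    cropped = center_crop(grid, new_f)
    return [lerp_resize(row, new_t) for row in cropped]


# Hypothetical sizes: a 24 x 24 ViT grid adapted to a 12 x 100 AST grid,
# using toy 2-d embeddings that simply encode their own grid coordinates.
vit_grid = [[[float(f), float(t)] for t in range(24)] for f in range(24)]
ast_grid = adapt_pos_embed(vit_grid, new_f=12, new_t=100)
```

Resampling rather than reinitializing keeps the spatial ordering learned during image pretraining, which is the prior the adapted model is meant to inherit.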

This review was generated by AI and reviewed by a human editor.