QUICK REVIEW

[Paper Review] Evaluating CNN with Stacked Feature Representations and Audio Spectrogram Transformer Models for Sound Classification

Parinaz Binandeh Dehaghani, Danilo Pena | arXiv (Cornell University) | 2026-02-10

Music and Audio Processing | Citations: 0

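One-line summary

This paper compares CNNs that use stacked feature representations with Audio Spectrogram Transformer (AST) models for environmental sound classification. It shows that feature-stacked CNNs are more data-efficient and computationally efficient when training data and pretraining are limited, whereas AST excels with large-scale pretraining.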

ABSTRACT

Environmental sound classification (ESC) has gained significant attention due to its diverse applications in smart city monitoring, fault detection, acoustic surveillance, and manufacturing quality control. To enhance CNN performance, feature stacking techniques have been explored to aggregate complementary acoustic descriptors into richer input representations. In this paper, we investigate CNN-based models employing various stacked feature combinations, including Log-Mel Spectrogram (LM), Spectral Contrast (SPC), Chroma (CH), Tonnetz (TZ), Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Cepstral Coefficients (GTCC). Experiments are conducted on the widely used ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer (AST) models pretrained on large-scale corpora such as AudioSet. This experimental design enables an analysis of how feature-stacked CNNs compare with transformer-based models under varying levels of training data and pretraining diversity. The results indicate that feature-stacked CNNs offer a more computationally and data-efficient alternative when large-scale pretraining or extensive training data are unavailable, making them particularly well suited for resource-constrained and edge-level sound classification scenarios.

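Motivation and objectives

Evaluate how stacked feature representations improve CNN-based environmental sound classification (ESC).
Compare feature-stacked CNNs with AST models under different data and pretraining regimes.
Assess the effect of transfer learning between ESC-50 and UrbanSound8K and of pretraining on a larger corpus (AudioSet).
Analyze the computational efficiency and inference latency of the proposed CNN architectures.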

Proposed method

Several acoustic features (LM, CH, TZ, SPC, MFCC, GTCC) are extracted using Librosa.
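To build the stacked input, each feature is resized to 128x128 and the results are concatenated along the channel axis (e.g., 128x128x3 or 128x128x4).
Two CNN architectures (CNN-1 and CNN-2) are either trained from scratch (All.L) or pretrained on ESC-50 and then fine-tuned on UrbanSound8K (Last.L).
The CNNs are compared with an Audio Spectrogram Transformer (AST) pretrained on AudioSet and fine-tuned on ESC-50/UrbanSound8K; the AST uses 128-bin Log-Mel spectrograms and ViT-like patch embeddings.
Models are evaluated with 5-fold cross-validation on ESC-50 and UrbanSound8K, reporting accuracy, precision, recall, and F1-score.
Training time and inference latency are measured to inform deployment decisions.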
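
The resize-and-stack step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the feature matrices are random placeholders (in a real pipeline they would come from Librosa feature functions such as `librosa.feature.mfcc`), the helper names `resize_bilinear` and `stack_features` are hypothetical, and both the bilinear interpolation and the per-channel min-max scaling are assumptions that the review does not specify.

```python
import numpy as np


def resize_bilinear(feature: np.ndarray, out_shape=(128, 128)) -> np.ndarray:
    """Resize a 2-D feature map (bins x frames) with bilinear interpolation."""
    in_rows, in_cols = feature.shape
    out_rows, out_cols = out_shape
    # Fractional source coordinates for every output row and column.
    row_pos = np.linspace(0, in_rows - 1, out_rows)
    col_pos = np.linspace(0, in_cols - 1, out_cols)
    r0 = np.floor(row_pos).astype(int)
    c0 = np.floor(col_pos).astype(int)
    r1 = np.minimum(r0 + 1, in_rows - 1)
    c1 = np.minimum(c0 + 1, in_cols - 1)
    # Interpolation weights, shaped for broadcasting over rows and columns.
    wr = (row_pos - r0)[:, None]
    wc = (col_pos - c0)[None, :]
    top = feature[r0][:, c0] * (1 - wc) + feature[r0][:, c1] * wc
    bottom = feature[r1][:, c0] * (1 - wc) + feature[r1][:, c1] * wc
    return top * (1 - wr) + bottom * wr


def stack_features(features, out_shape=(128, 128)) -> np.ndarray:
    """Resize each feature map, scale it to [0, 1], and stack along channels."""
    channels = []
    for feat in features:
        resized = resize_bilinear(np.asarray(feat, dtype=np.float32), out_shape)
        low, high = resized.min(), resized.max()
        # Assumption: per-channel min-max scaling so features share one range.
        if high > low:
            channels.append((resized - low) / (high - low))
        else:
            channels.append(np.zeros_like(resized))
    return np.stack(channels, axis=-1)


# Random placeholders with plausible shapes: log-mel (128 bins), MFCC (40), Tonnetz (6).
rng = np.random.default_rng(0)
log_mel = rng.normal(size=(128, 216))
mfcc = rng.normal(size=(40, 216))
tonnetz = rng.normal(size=(6, 216))

stacked = stack_features([log_mel, mfcc, tonnetz])
print(stacked.shape)  # (128, 128, 3)
```

Passing four feature maps instead of three yields a 128x128x4 input, matching the second example shape mentioned above.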

Figure 1: Block diagram of MFCC algorithm

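Experimental results

Research questions

RQ1: How do feature-stacked CNNs compare with single-feature CNNs on the ESC-50 and UrbanSound8K datasets?
RQ2: How does the performance of feature-stacked CNNs compare with AST models under different pretraining regimes?
RQ3: What effect does transfer learning (from ESC-50 to UrbanSound8K) have on performance and generalization?
RQ4: What are the computational efficiency characteristics (training time, inference time) of the proposed CNNs and AST?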

Key results

Model	Features	Training setup	Validation accuracy	Training accuracy	Precision	Recall	F1-score	Epochs
CNN-1	LM	ESC, All.L	0.68	1.00	0.68	0.68	0.66	150
CNN-1	LM+TZ	ESC, All.L	0.65	1.00	0.66	0.66	0.64	150
CNN-1	LM+MFCC	ESC, All.L	0.64	1.00	0.68	0.64	0.63	150
CNN-1	MFCC+TZ	ESC, All.L	0.62	1.00	0.65	0.62	0.61	150
CNN-1	LM+SPC+CH	ESC, All.L	0.62	1.00	0.65	0.62	0.62	150
CNN-1	MFCC+GTCC+CH+LM	ESC, All.L	0.67	1.00	0.70	0.67	0.67	150
CNN-2	LM	ESC, All.L	0.45	0.68	0.59	0.45	0.44	150
CNN-2	LM+TZ	ESC, All.L	0.66	0.98	0.71	0.67	0.65	150
CNN-2	LM+MFCC	ESC, All.L	0.59	0.99	0.64	0.59	0.59	150
CNN-2	MFCC+TZ	ESC, All.L	0.62	0.97	0.69	0.63	0.61	150
CNN-2	LM+SPC+CH	ESC, All.L	0.53	0.81	0.67	0.54	0.54	150
CNN-2	MFCC+GTCC+CH+LM	ESC, All.L	0.58	0.97	0.67	0.59	0.56	150
CNN-1	LM	ESC+US8K, Last.L	0.87	0.95	0.88	0.88	0.87	50
CNN-1	LM+TZ	ESC+US8K, Last.L	0.88	0.96	0.89	0.88	0.88	50
CNN-1	LM+MFCC	ESC+US8K, Last.L	0.91	0.98	0.92	0.92	0.92	50
CNN-1	MFCC+TZ	ESC+US8K, Last.L	0.91	0.99	0.92	0.91	0.92	50
CNN-1	LM+SPC+CH	ESC+US8K, Last.L	0.85	0.92	0.86	0.85	0.85	50
CNN-1	MFCC+GTCC+CH+LM	ESC+US8K, Last.L	0.92	1.00	0.92	0.92	0.92	50
CNN-2	LM	ESC+US8K, Last.L	0.85	0.91	0.86	0.85	0.85	50
CNN-2	LM+TZ	ESC+US8K, Last.L	0.85	0.89	0.85	0.85	0.85	50
CNN-2	LM+MFCC	ESC+US8K, Last.L	0.86	0.90	0.87	0.86	0.86	50
CNN-2	MFCC+TZ	ESC+US8K, Last.L	0.87	0.92	0.87	0.87	0.87	50
CNN-2	LM+SPC+CH	ESC+US8K, Last.L	0.85	0.89	0.86	0.85	0.85	50
CNN-2	MFCC+GTCC+CH+LM	ESC+US8K, Last.L	0.87	0.90	0.88	0.87	0.87	50
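
The accuracy, precision, recall, and F1-score columns above can be derived from per-sample predictions. The sketch below assumes macro averaging (an unweighted mean over classes); the review does not state which averaging scheme was used, and the function name `macro_metrics` and the toy labels are purely illustrative.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy example with three made-up classes.
y_true = ["dog", "dog", "rain", "rain", "siren", "siren"]
y_pred = ["dog", "rain", "rain", "rain", "siren", "dog"]
scores = macro_metrics(y_true, y_pred)
print({name: round(value, 2) for name, value in scores.items()})
```

In this toy example the macro F1 (about 0.66) is lower than both macro precision (about 0.72) and macro recall (about 0.67). Under the macro-averaging assumption, that would be consistent with rows such as CNN-1 with LM, where the F1-score (0.66) sits below precision and recall (both 0.68).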

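CNN-1 with MFCC+GTCC+CH+LM achieves the highest cross-validated validation accuracy, 92.46%. In the table above this is the ESC+US8K, Last.L row (0.92), that is, ESC-50 pretraining followed by UrbanSound8K fine-tuning; trained on ESC-50 alone (All.L), the same configuration reaches 0.67.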
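In the Last.L setting, CNN-1 matches or outperforms CNN-2 on every feature set, suggesting more transferable representations; in the ESC-only All.L setting the gap is less uniform (CNN-2 is slightly ahead with LM+TZ, 0.66 vs. 0.65).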
When trained on ESC-50 with limited data, the proposed feature-stacked CNNs outperform the AST baseline trained with limited data; with large-scale pretraining on AudioSet, AST reaches 99% in other settings.
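Transfer learning from ESC-50 to UrbanSound8K (Last.L) substantially improves performance over All.L (validation accuracy of 0.85 to 0.92 versus 0.45 to 0.68 in the table above), highlighting the value of cross-dataset diversity and of the fine-tuning strategy.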
CNN-1 has a shorter inference time (mean 21.92 ms) than CNN-2 (mean 30.95 ms), suggesting it is better suited to resource-constrained deployment.
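
Mean inference latencies like those reported above could be measured along the following lines. This is a sketch under assumptions: the review does not describe the timing protocol, so the warm-up count, the number of runs, and the helper `mean_latency_ms` are illustrative choices, and `dummy_predict` is a hypothetical stand-in for a real model's forward pass.

```python
import statistics
import time


def mean_latency_ms(predict, sample, warmup=5, runs=50):
    """Return the mean and standard deviation of per-call latency in milliseconds."""
    for _ in range(warmup):
        predict(sample)  # warm-up calls are excluded from the timing
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings), statistics.stdev(timings)


def dummy_predict(sample):
    """Hypothetical stand-in for model inference on one stacked input."""
    return sum(value * value for value in sample)


mean_ms, std_ms = mean_latency_ms(dummy_predict, [0.5] * 10_000)
print(f"mean latency: {mean_ms:.2f} ms (std {std_ms:.2f} ms)")
```

Reporting the spread alongside the mean helps judge whether a gap such as 21.92 ms versus 30.95 ms is larger than run-to-run variation on the target device.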

Figure 2: Block diagram of Log-Mel Scale Spectrogram algorithm


This review was generated by AI and reviewed by a human editor.