QUICK REVIEW

[Paper Review] Device-Robust Acoustic Scene Classification Based on Two-Stage Categorization and Data Augmentation

Hu Hu, Chao-Han Huck Yang|arXiv (Cornell University)|Jul 16, 2020

Music and Audio Processing | References: 11 | Citations: 46

One-line summary

This paper proposes a two-stage CNN-based ASC system that uses extensive data augmentation to address device mismatch, reaching development-set accuracies of 81.9% on Task 1a and 96.7% on Task 1b.

ABSTRACT

In this technical report, we present a joint effort of four groups, namely GT, USTC, Tencent, and UKE, to tackle Task 1 - Acoustic Scene Classification (ASC) in the DCASE 2020 Challenge. Task 1 comprises two different sub-tasks: (i) Task 1a focuses on ASC of audio signals recorded with multiple (real and simulated) devices into ten different fine-grained classes, and (ii) Task 1b concerns with classification of data into three higher-level classes using low-complexity solutions. For Task 1a, we propose a novel two-stage ASC system leveraging upon ad-hoc score combination of two convolutional neural networks (CNNs), classifying the acoustic input according to three classes, and then ten classes, respectively. Four different CNN-based architectures are explored to implement the two-stage classifiers, and several data augmentation techniques are also investigated. For Task 1b, we leverage upon a quantization method to reduce the complexity of two of our top-accuracy three-classes CNN-based architectures. On Task 1a development data set, an ASC accuracy of 76.9% is attained using our best single classifier and data augmentation. An accuracy of 81.9% is then attained by a final model fusion of our two-stage ASC classifiers. On Task 1b development data set, we achieve an accuracy of 96.7% with a model size smaller than 500KB. Code is available: https://github.com/MihawkHu/DCASE2020_task1.

Motivation and objectives

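Build a device-invariant two-stage classifier that combines a three-class CNN and a ten-class CNN to counter device mismatch.
Develop low-complexity ASC models for Task 1b (≤500 KB) using quantization and model compression, with almost no loss in accuracy.
Evaluate multiple CNN architectures and data augmentation strategies to improve robustness across devices.
Use model ensembling to show ASC performance beyond single-model results.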

Proposed method

Two-stage classification: first a three-class classifier (indoor, outdoor, transportation) and second a ten-class classifier; final class is chosen by score fusion of the two outputs.
Four CNN-based architecture families evaluated (matching the count in the abstract): FCNN, fsFCNN (including an fsFCNN-split variant), Resnet (17-layer with modifications, named Resnet-d when the number of filters is doubled), and Mobnet (MobileNet-v2).
Extensive data augmentation: mixup, random cropping, spectrum augmentation, spectrum correction, reverberation+DRC, pitch shift, speed change, random noise, and audio mixing; channel confusion is used for Task 1b only.
Task 1b: post-training quantization (dynamic range quantization to 8-bit) to reduce model size to about 1/8 while maintaining accuracy; ensemble of smaller models used to stay under 500 KB.
Feature extraction: log-mel filter banks (LMFB) with a 2048-point FFT, 2048-sample window, and 1024-sample frame shift; LMFBs scaled to [0,1] and augmented with LMFB deltas; input shapes 423x128x3 (Task 1a) and 461x128x6 (Task 1b).
Training: SGD with cosine-decay-restart learning rate schedule; use of official train-test split for Task 1a and Task 1b; Keras implementation; development data fully used for final submissions.
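
The two-stage decision in the first item above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the fusion multiplies each ten-class score by the score of its parent three-class group, and it assumes the scene-to-group mapping of the DCASE 2020 Task 1 label set. The names `two_stage_fusion`, `FINE`, and `PARENT` are hypothetical.

```python
import numpy as np

# Assumed grouping of the ten fine-grained scenes under the three coarse classes.
COARSE = ["indoor", "outdoor", "transportation"]
FINE_TO_COARSE = {
    "airport": "indoor",
    "shopping_mall": "indoor",
    "metro_station": "indoor",
    "street_pedestrian": "outdoor",
    "public_square": "outdoor",
    "street_traffic": "outdoor",
    "park": "outdoor",
    "tram": "transportation",
    "bus": "transportation",
    "metro": "transportation",
}
FINE = list(FINE_TO_COARSE)
# PARENT[k] is the coarse-class index of fine class k.
PARENT = np.array([COARSE.index(FINE_TO_COARSE[name]) for name in FINE])


def two_stage_fusion(p_coarse, p_fine):
    """Fuse three-class and ten-class scores (assumed multiplicative rule).

    p_coarse: array of shape (N, 3), scores from the three-class CNN.
    p_fine:   array of shape (N, 10), scores from the ten-class CNN.
    Returns the predicted fine-class indices (N,) and the fused scores (N, 10).
    """
    p_coarse = np.asarray(p_coarse, dtype=float)
    p_fine = np.asarray(p_fine, dtype=float)
    # Weight every fine-class score by the score of its parent coarse class.
    fused = p_fine * p_coarse[:, PARENT]
    return fused.argmax(axis=1), fused


# Example: the ten-class CNN slightly prefers "airport", but the
# three-class CNN is confident the clip is a transportation scene.
coarse_scores = [[0.1, 0.1, 0.8]]
fine_scores = [[0.40, 0.05, 0.05, 0.03, 0.03, 0.03, 0.03, 0.35, 0.02, 0.01]]
pred, _ = two_stage_fusion(coarse_scores, fine_scores)
print(FINE[int(pred[0])])  # -> tram
```

In this example the coarse prediction overturns a narrow ten-class lead, which is the kind of correction a two-stage scheme is meant to provide when a device shift makes the fine-grained scores less reliable.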

Experimental results

Research questions

RQ1: Can a two-stage CNN-based ASC system improve robustness to device mismatch in Task 1a by combining a coarse three-class prediction with a fine ten-class prediction?
RQ2: How do various CNN architectures (FCNN, fsFCNN, Resnet-d, Mobnet) interact with data augmentation to mitigate device-induced performance drops?
RQ3: What is the impact of data augmentation strategies, including spectrum augmentation, spectrum correction, reverberation+DRC, and mixup, on ASC accuracy across seen and unseen devices?
RQ4: Can post-training quantization enable a sub-500 KB ASC model for Task 1b with minimal accuracy loss, and does model ensembling further improve results?

Key findings

Two-stage fusion with multiple CNNs yields 81.9% ASC accuracy on Task 1a development data (best fusion of models).
The best single classifier, an FCNN-based model trained with data augmentation, reaches 76.9% accuracy; combining FCNN with fsFCNN variants in the two-stage approach reaches 81.9%.
Applying extensive data augmentation (including reverberation, DRC, spectrum augmentation, and mixup) significantly boosts robustness, with notable improvements on unseen devices (s4–s6).
Task 1b results show Mobnet and small-FCNN achieving 95.2% and 96.4% accuracy respectively before compression; dynamic range quantization reduces sizes to about 1/8 with minimal accuracy loss (Mobnet: 0.4% drop; small-FCNN: 0.1% drop).
Final submissions use ensembles of multiple models to exceed single-model performance, achieving 81.9% on Task 1a and 96.7% on Task 1b development data.
The four final submissions for Task 1a are ensembles of Resnet-d, FCNN-based networks, and fsFCNN variants that use attention and various data strategies.
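
Of the augmentations credited in the findings above, mixup is compact enough to sketch. The snippet below is an illustration under stated assumptions, not the authors' code: it draws one mixing weight per batch from a Beta(alpha, alpha) distribution and blends each example with a randomly chosen partner. The value `alpha=0.4` and the name `mixup_batch` are hypothetical choices, not taken from the paper.

```python
import numpy as np


def mixup_batch(x, y, alpha=0.4, rng=None):
    """Blend a batch of features and one-hot labels with a shuffled copy.

    x: array of shape (N, ...), e.g. scaled log-mel features.
    y: array of shape (N, C), one-hot (or soft) labels.
    alpha: Beta distribution parameter (assumed value, not from the paper).
    Returns the mixed features, the mixed labels, and the mixing weight.
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)    # mixing weight in [0, 1]
    perm = rng.permutation(len(x))  # partner index for every example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y + (1.0 - lam) * y[perm]
    return x_mix, y_mix, lam


# Example with small dummy tensors standing in for LMFB inputs.
rng = np.random.default_rng(0)
features = rng.random((4, 8, 16, 3))
labels = np.eye(10)[[0, 3, 3, 9]]
mixed_x, mixed_y, lam = mixup_batch(features, labels, rng=rng)
print(mixed_x.shape, mixed_y.shape)
```

Because the labels are blended with the same weight as the features, each mixed label row still sums to 1, so the mixed batch can be used directly with a categorical cross-entropy loss.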

This review was generated by AI and checked by a human editor.