[Paper Review] QTI Submission to DCASE 2021: residual normalization for device-imbalanced acoustic scene classification with efficient design
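This paper proposes an efficient ASC system for device-imbalanced data. It combines Residual Normalization, the BC-ResNet-Mod architecture, spectrogram-to-spectrogram device translation, and model compression via pruning, quantization, and knowledge distillation, and it achieves strong cross-device generalization with a low parameter count.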
This technical report describes the details of our TASK1A submission of the DCASE2021 challenge. The goal of the task is to design an audio scene classification system for device-imbalanced datasets under the constraints of model complexity. This report introduces four methods to achieve the goal. First, we propose Residual Normalization, a novel feature normalization method that uses instance normalization with a shortcut path to discard unnecessary device-specific information without losing useful information for classification. Second, we design an efficient architecture, BC-ResNet-Mod, a modified version of the baseline architecture with a limited receptive field. Third, we exploit spectrogram-to-spectrogram translation from one to multiple devices to augment training data. Finally, we utilize three model compression schemes: pruning, quantization, and knowledge distillation to reduce model complexity. The proposed system achieves an average test accuracy of 76.3% in TAU Urban Acoustic Scenes 2020 Mobile, development dataset with 315k parameters, and average test accuracy of 75.3% after compression to 61.0KB of non-zero parameters. We extend this work to [1].
Motivation and Objectives
- Address device imbalance and low model complexity in ASC for multi-device data.
- Develop an efficient CNN architecture suited to audio scene classification with limited receptive field.
- Introduce Residual Normalization to improve device generalization while preserving discriminative information.
- Augment training with spectrogram-to-spectrogram device translation to mitigate domain gaps.
- Compress models using pruning, quantization, and knowledge distillation without large performance loss.
Proposed Methods
- Propose BC-ResNet-Mod, a modified Broadcasting residual network with a limited receptive field and max-pooling to control temporal resolution.
- Introduce Residual Normalization (ResNorm), a frequency-wise instance normalization with a residual shortcut to retain useful domain information.
- Develop a device translator based on a U-Net with Subspectral Normalization to translate spectrograms between devices for data augmentation.
- Apply three compression techniques—one-shot magnitude pruning, quantization-aware training (QAT), and knowledge distillation from a teacher network—to reduce model size while maintaining accuracy.
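The ResNorm idea listed above (frequency-wise instance normalization plus a shortcut) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: pooling the statistics over the channel and time axes, the scalar `lam` that weights the shortcut, the simulated per-frequency "device" gain and offset, and all function names are assumptions of this sketch.

```python
import numpy as np

def freq_instance_norm(x, eps=1e-5):
    """Normalize every frequency bin of every example.

    x has shape (batch, channels, freq, time). Statistics are pooled over
    the channel and time axes (an assumption of this sketch), so each
    (example, frequency) slice gets roughly zero mean and unit variance.
    """
    mean = x.mean(axis=(1, 3), keepdims=True)
    var = x.var(axis=(1, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_norm(x, lam=0.1, eps=1e-5):
    """Shortcut path (lam * x) added to the frequency-wise normalized input.

    lam is a hypothetical scalar; the review does not say how it is set.
    """
    return lam * x + freq_instance_norm(x, eps)

# Toy input: 2 examples, 1 channel, 40 frequency bins, 100 frames.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 1, 40, 100))

# Simulate a different recording device as a per-frequency gain and offset.
gain = rng.uniform(0.5, 2.0, size=(1, 1, 40, 1))
offset = rng.normal(size=(1, 1, 40, 1))
x_dev = x * gain + offset

fin_a, fin_b = freq_instance_norm(x), freq_instance_norm(x_dev)
res_a, res_b = residual_norm(x), residual_norm(x_dev)

# Normalization alone removes the simulated device response almost entirely,
# while the shortcut keeps a scaled copy of the raw difference.
print("FreqIN difference :", np.abs(fin_a - fin_b).max())
print("ResNorm difference:", np.abs(res_a - res_b).max())
```

In this toy setting the normalized outputs for the two simulated devices are nearly identical, and the shortcut reintroduces a `lam`-scaled portion of the raw input, which is the trade-off between discarding and retaining information that the method description points to.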
Experimental Results
Research Questions
- RQ1: How can Residual Normalization improve generalization to unseen devices in device-imbalanced ASC datasets?
- RQ2: What is the impact of a restricted receptive field and max-pooling on accuracy and efficiency in BC-ResNet variants for ASC?
- RQ3: Can device translation and data augmentation reduce domain gaps across multiple devices?
- RQ4: How do pruning, quantization, and knowledge distillation affect performance and compression for low-parameter ASC models?
Key Findings
| Method | #Param | A | B | C | S1 | S2 | S3 | S4 | S5 | S6 | Overall | Std. Dev |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BC-ResNet-Mod-1 | 8.1k | 73.1 | 61.2 | 65.3 | 58.2 | 57.3 | 66.2 | 51.5 | 51.5 | 46.3 | 58.9 | 0.8 |
| BC-ResNet-Mod-1 + Global FreqNorm | 8.1k | 73.9 | 60.9 | 65.5 | 60.2 | 57.9 | 67.9 | 50.2 | 54.3 | 49.4 | 60.0 | 0.9 |
| BC-ResNet-Mod-1 + FreqIN | 8.1k | 69.9 | 63.5 | 60.0 | 65.3 | 66.7 | 67.6 | 65.9 | 64.9 | 62.0 | 65.1 | 0.6 |
| BC-ResNet-Mod-1 + Pre-ResNorm | 8.1k | 75.1 | 68.9 | 67.0 | 66.0 | 63.9 | 69.3 | 63.4 | 66.9 | 63.6 | 67.1 | 0.8 |
| BC-ResNet-Mod-1 + ResNorm | 8.1k | 76.4 | 65.1 | 68.3 | 66.0 | 62.2 | 69.7 | 63.0 | 63.0 | 58.3 | 65.8 | 0.7 |
| CP-ResNet, c=64 | 899k | 77.0 | 69.3 | 69.6 | 70.3 | 68.2 | 70.9 | 62.7 | 63.9 | 58.1 | 67.8 | - |
| BC-ResNet-8, num SSN group=4 | 317k | 77.9 | 70.4 | 72.4 | 69.5 | 68.3 | 69.8 | 66.3 | 64.1 | 58.6 | 68.6 | 0.4 |
| BC-ResNet-Mod-8 | 315k | 80.7 | 72.8 | 74.4 | 71.4 | 68.7 | 71.0 | 62.2 | 65.3 | 59.4 | 69.5 | 0.3 |
| BC-ResNet-Mod-8 + Pre-ResNorm | 315k | 80.8 | 73.7 | 73.0 | 74.0 | 72.9 | 77.8 | 73.3 | 72.1 | 71.0 | 74.3 | 0.3 |
| BC-ResNet-Mod-8 + ResNorm | 315k | 81.3 | 74.4 | 74.2 | 75.6 | 73.1 | 78.6 | 73.0 | 74.0 | 72.7 | 75.2 | 0.4 |
| BC-ResNet-Mod-8 + ResNorm, Device Translator | 315k | 80.5 | 74.4 | 73.9 | 76.0 | 73.2 | 78.5 | 74.1 | 74.1 | 73.6 | 75.4 | 0.3 |
| BC-ResNet-Mod-8 + ResNorm, 300epoch, KD | 315k | 82.6 | 75.6 | 74.7 | 77.0 | 74.2 | 78.7 | 75.1 | 74.8 | 73.4 | 76.3 | 0.8 |
| + model compress | - | 82.0 | 73.8 | 74.3 | 76.2 | 73.2 | 78.8 | 73.8 | 72.8 | 73.3 | 75.3 | 0.8 |
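
The per-device columns in the table can be summarized as a seen-versus-unseen gap. The sketch below copies four rows from the table and averages them by device group. It assumes the TAU Urban Acoustic Scenes 2020 Mobile development split, in which devices S4 to S6 appear only in the test set; the grouping, the equal weighting of devices, and the helper name `split_means` are choices of this example, not of the report.

```python
from statistics import mean

DEVICES = ["A", "B", "C", "S1", "S2", "S3", "S4", "S5", "S6"]
UNSEEN = {"S4", "S5", "S6"}  # assumption: devices absent from the training split

# Per-device accuracies (%) copied from the results table.
rows = {
    "BC-ResNet-Mod-1 + FreqIN": [69.9, 63.5, 60.0, 65.3, 66.7, 67.6, 65.9, 64.9, 62.0],
    "BC-ResNet-Mod-1 + ResNorm": [76.4, 65.1, 68.3, 66.0, 62.2, 69.7, 63.0, 63.0, 58.3],
    "BC-ResNet-Mod-8": [80.7, 72.8, 74.4, 71.4, 68.7, 71.0, 62.2, 65.3, 59.4],
    "BC-ResNet-Mod-8 + ResNorm": [81.3, 74.4, 74.2, 75.6, 73.1, 78.6, 73.0, 74.0, 72.7],
}

def split_means(accs):
    """Return (mean over seen devices, mean over unseen devices)."""
    seen = [a for d, a in zip(DEVICES, accs) if d not in UNSEEN]
    unseen = [a for d, a in zip(DEVICES, accs) if d in UNSEEN]
    return mean(seen), mean(unseen)

summary = {name: split_means(accs) for name, accs in rows.items()}
for name, (seen, unseen) in summary.items():
    print(f"{name}: seen {seen:.1f}, unseen {unseen:.1f}, gap {seen - unseen:.1f}")
```

Under this grouping, adding ResNorm to BC-ResNet-Mod-8 lifts the unseen-device mean from about 62.3 to about 73.2 and shrinks the seen-to-unseen gap from roughly 10.9 to roughly 3.0 points. On BC-ResNet-Mod-1, FreqIN has the higher unseen-device mean (about 64.3 versus 61.4), while ResNorm has the higher seen-device mean (about 68.0 versus 65.5).
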
- BC-ResNet-Mod-8 with ResNorm achieves 75.2% average test accuracy on TAU 2020 Mobile development data with about one-third the parameters of a strong baseline (315k versus 899k for CP-ResNet, c=64).
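- On BC-ResNet-Mod-1, ResNorm raises overall accuracy from 58.9% to 65.8%, ahead of Global FreqNorm (60.0%) and FreqIN (65.1%). It keeps seen-device accuracy high (76.4% on device A versus 69.9% for FreqIN), although FreqIN alone scores higher on the unseen devices S4 to S6.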
- Training with spectrogram-to-spectrogram device translation gives a small overall gain (75.2% to 75.4%), concentrated on devices S4 and S6 (73.0% to 74.1% and 72.7% to 73.6%), which narrows the gap between devices.
- Training for 300 epochs with knowledge distillation from a teacher network yields 76.3% average accuracy on the official split before compression (315k parameters). The compression stage then applies an 89% pruning rate and 8-bit convolution weights, for a stated total size of about 122 KB.
- The proposed BC-ResNet-Mod-8 with ResNorm (75.2%) outperforms the baseline CP-ResNet (67.8%) and BC-ResNet-8 (68.6%) in overall accuracy on the development set.
- The final compressed model (KD + pruning + quantization) achieves 75.3% overall accuracy on the TAU 2020 Mobile development setup, with a stated size of 121.9 KB; the abstract reports 61.0 KB of non-zero parameters.
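The pruning and 8-bit quantization steps mentioned above can be illustrated on a single weight matrix. This is a rough sketch under stated assumptions, not the submission's pipeline: it uses a random 64x64 matrix in place of a real layer, symmetric per-tensor int8 rounding as a stand-in for what quantization-aware training simulates in the forward pass, and hypothetical helper names `magnitude_prune` and `quantize_int8`. Only the 89% pruning rate is taken from the review.

```python
import numpy as np

def magnitude_prune(w, rate):
    """One-shot magnitude pruning: zero the `rate` fraction of smallest |w|.

    Returns the pruned weights and the boolean keep-mask. Entries tied with
    the threshold magnitude are also pruned.
    """
    k = int(round(rate * w.size))
    if k == 0:
        return w.copy(), np.ones(w.shape, dtype=bool)
    # The k-th smallest magnitude serves as the pruning threshold.
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask, mask

def quantize_int8(w):
    """Symmetric per-tensor quantization to int8; returns (codes, scale)."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Stand-in for one layer's weights (hypothetical shape).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

pruned, mask = magnitude_prune(w, rate=0.89)
q, scale = quantize_int8(pruned)
dequant = q.astype(np.float32) * scale  # values the layer would actually use

nonzero = int(np.count_nonzero(q))
print("kept fraction   :", mask.mean())
print("non-zero weights:", nonzero)
print("max quant error :", float(np.abs(dequant - pruned).max()))
```

Pruned entries stay exactly zero after quantization, so counting non-zero codes gives a rough one-byte-per-weight size estimate for this layer. A real size figure would also depend on how the remaining parameters and the sparsity pattern are stored, which this sketch does not model.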
This review was generated by AI and checked by a human editor.