QUICK REVIEW

[Paper Review] Rethinking CNN Models for Audio Classification

Kamalesh Palanisamy, Dipika Singhania|arXiv (Cornell University)|Jul 22, 2020

Music and Audio Processing | References: 62 | Cited by: 108

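One-Sentence Summary

This paper shows that ImageNet-pretrained CNNs (DenseNet, ResNet, Inception) fine-tuned on mel-spectrograms achieve state-of-the-art accuracy on ESC-50 and UrbanSound8K and competitive accuracy on GTZAN, and that ensembling further improves robustness.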

ABSTRACT

In this paper, we show that ImageNet-Pretrained standard deep CNN models can be used as strong baseline networks for audio classification. Even though there is a significant difference between audio Spectrogram and standard ImageNet image samples, transfer learning assumptions still hold firmly. To understand what enables the ImageNet pretrained models to learn useful audio representations, we systematically study how much of pretrained weights is useful for learning spectrograms. We show (1) that for a given standard model using pretrained weights is better than using randomly initialized weights (2) qualitative results of what the CNNs learn from the spectrograms by visualizing the gradients. Besides, we show that even though we use the pretrained model weights for initialization, there is variance in performance in various output runs of the same model. This variance in performance is due to the random initialization of linear classification layer and random mini-batch orderings in multiple runs. This brings significant diversity to build stronger ensemble models with an overall improvement in accuracy. An ensemble of ImageNet pretrained DenseNet achieves 92.89% validation accuracy on the ESC-50 dataset and 87.42% validation accuracy on the UrbanSound8K dataset which is the current state-of-the-art on both of these datasets.

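Research Motivation and Goals

Demonstrate that ImageNet-pretrained CNNs with mel-spectrogram inputs can serve as strong baselines for audio classification.
Quantify the benefit of pretrained weights over random initialization across multiple datasets.
Analyze how the pretrained weights change during fine-tuning, and identify the parts of the network that matter most for audio tasks.
Provide qualitative insight, through gradient-based visualizations, into what the CNNs learn from spectrograms.
Show that deep ensembles improve accuracy and robustness across datasets.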

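Proposed Method

Fine-tune ImageNet-pretrained DenseNet-201, ResNet, and Inception models on mel-spectrogram inputs computed from audio datasets.
Convert each mel-spectrogram into a three-channel input (either by replicating a single spectrogram or by using a multi-window channel approach), and apply standard augmentations (time stretching, pitch shifting).
Train the models on ESC-50, UrbanSound8K, and GTZAN with tuned hyperparameters (Adam, lr=1e-4, weight decay 1e-3).
Evaluate the gains of single models and of ensembles (M=5), where an ensemble averages the softmax outputs of its members.
Run a transfer-learning analysis covering weight change, partial weight transfer or freezing, and model truncation, to identify where pretrained knowledge helps most.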
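
The ensembling step described above (averaging the softmax outputs of M=5 runs) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' code: the function names and the logits are hypothetical, and a real pipeline would take the logits from the fine-tuned CNNs.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_predict(per_model_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average the softmax outputs of several models for a single clip."""
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_models for c in range(n_classes)]


# Hypothetical logits from M=5 fine-tuned runs on one clip with 3 classes.
logits_from_runs = [
    [2.0, 1.0, 0.1],
    [1.5, 1.7, 0.2],
    [2.2, 0.9, 0.0],
    [1.0, 1.2, 0.3],
    [2.5, 0.5, 0.1],
]
avg_probs = ensemble_predict(logits_from_runs)
# The ensemble prediction is the class with the highest averaged probability.
predicted_class = max(range(len(avg_probs)), key=avg_probs.__getitem__)
```

In this made-up example two of the five runs prefer class 1, but the averaged probabilities still favor class 0. Disagreement between runs of this kind is the diversity that the abstract attributes to the random classifier initialization and mini-batch ordering.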

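Experimental Results

Research Questions

RQ1: Do fine-tuned ImageNet-pretrained CNNs outperform training from scratch on common audio classification datasets?
RQ2: Which parts of the pretrained network retain useful audio representations after fine-tuning, and how does freezing or partially transferring weights affect performance?
RQ3: With transfer learning from ImageNet, can simple mel-spectrogram inputs and standard CNN backbones reach state-of-the-art results on ESC-50 and UrbanSound8K?
RQ4: Does ensembling multiple fine-tuned pretrained models bring robust gains across datasets?
RQ5: What do gradient-based visualizations reveal about how the CNNs interpret spectrogram inputs?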

Main Findings

Model	GTZAN (pretrained)	GTZAN (random init)	ESC-50 (pretrained)	ESC-50 (ensemble)	UrbanSound8K (pretrained)	UrbanSound8K (ensemble)
DenseNet	91.39% ±0.37	88.50%	91.16% ±0.36	92.89%	85.14% ±0.17	87.42%
ResNet	91.09% ±0.86	87.90%	90.65% ±0.28	92.64%	84.76% ±0.33	87.35%
Inception	90.00% ±0.70	86.30%	87.34% ±0.74	89.70%	84.37% ±0.50	86.34%
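
As a quick arithmetic check on the DenseNet row, the sketch below subtracts the fine-tuned single-model accuracies (91.16% on ESC-50, 85.14% on UrbanSound8K) from the DenseNet ensemble accuracies quoted in the abstract (92.89% and 87.42%). It assumes the ± entries are means over repeated runs; the variable names are illustrative only.

```python
# Figures as quoted in this review for ImageNet-pretrained DenseNet.
single_run_mean = {"ESC-50": 91.16, "UrbanSound8K": 85.14}
ensemble = {"ESC-50": 92.89, "UrbanSound8K": 87.42}

# Absolute gain of the ensemble over a single fine-tuned model, in percentage points.
gains = {
    name: round(ensemble[name] - single_run_mean[name], 2)
    for name in ensemble
}
```

Under that assumption, the gains come out at 1.73 points on ESC-50 and 2.28 points on UrbanSound8K.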

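Pretrained weights consistently outperform random initialization on ESC-50, UrbanSound8K, and GTZAN (for example, gains of about 20% on ESC-50, about 10% on UrbanSound8K, and more than 3% on GTZAN).
An ensemble of ImageNet-pretrained DenseNet models reaches 92.89% on ESC-50 and 87.42% on UrbanSound8K (state of the art at the time).
Block3 of the network (the middle stage) is critical for transferring knowledge from ImageNet to audio; freezing or removing it sharply degrades performance.
Integrated Gradients visualizations show that the models focus on high-energy regions of the spectrogram, suggesting that they learn edge-like boundaries around sound events.
Weight-change analysis (SVCCA) indicates that, after fine-tuning, the early layers retain most of their pretrained features, while the middle layers undergo more task-specific adaptation.
Ensembling five independently trained models yields an absolute gain of roughly 2% on ESC-50 and UrbanSound8K (GTZAN changes only slightly).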
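
For readers unfamiliar with Integrated Gradients, the sketch below shows the attribution rule on a toy function: each input's attribution is its distance from a baseline multiplied by the gradient averaged along the straight path from baseline to input. Everything here is an assumption made for illustration. `toy_score` stands in for a class score, the gradient is estimated numerically, and the paper applies the method to CNNs on spectrograms rather than to a three-value input.

```python
from typing import Callable, List, Sequence


def numerical_gradient(
    f: Callable[[Sequence[float]], float], x: Sequence[float], eps: float = 1e-5
) -> List[float]:
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(
    f: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 200,
) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the path-averaged gradient by the input's distance from the baseline.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_score(v: Sequence[float]) -> float:
    """Hypothetical stand-in for a model's class score."""
    return v[0] ** 2 + 3.0 * v[1] + v[0] * v[2]


x = [1.0, 2.0, 0.5]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_score, x, baseline)
```

A useful sanity check is the completeness property: the attributions should sum to `toy_score(x) - toy_score(baseline)`. On a spectrogram, the same per-input attributions are what get rendered as a heat map over time-frequency bins.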

This review was generated by AI and checked by a human editor.