QUICK REVIEW

[Paper digest] Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks

Muhammad Huzaifah|arXiv (Cornell University)|Jun 22, 2017

Music and Audio Processing|References: 18|Cited by: 122

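One-sentence summary

This paper compares STFT (linear and Mel-scaled), CQT, CWT, and MFCC inputs for CNN-based environmental sound classification on ESC-50 and UrbanSound8K, and finds that Mel-STFT is generally the strongest input and MFCC the weakest; 2D convolution usually beats 1D, and the effect of window size depends on the class of signal.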

ABSTRACT

Recent successful applications of convolutional neural networks (CNNs) to audio classification and speech recognition have motivated the search for better input representations for more efficient training. Visual displays of an audio signal, through various time-frequency representations such as spectrograms offer a rich representation of the temporal and spectral structure of the original signal. In this letter, we compare various popular signal processing methods to obtain this representation, such as short-time Fourier transform (STFT) with linear and Mel scales, constant-Q transform (CQT) and continuous Wavelet transform (CWT), and assess their impact on the classification performance of two environmental sound datasets using CNNs. This study supports the hypothesis that time-frequency representations are valuable in learning useful features for sound classification. Moreover, the actual transformation used is shown to impact the classification accuracy, with Mel-scaled STFT outperforming the other discussed methods slightly and baseline MFCC features to a large degree. Additionally, we observe that the optimal window size during transformation is dependent on the characteristics of the audio signal and architecturally, 2D convolution yielded better results in most cases compared to 1D.

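Motivation and objectives

Motivate the search for effective time-frequency input representations for CNN-based environmental sound classification.
Assess how different time-frequency representations affect CNN performance on two public datasets.
Assess how the CNN architecture (2D vs. 1D convolution) and the input window size affect classification accuracy.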

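Proposed method

Compute several time-frequency representations (linear-STFT, Mel-STFT, CQT, CWT) and MFCC feature vectors from 4-second clips resampled to 22.05 kHz.
Format the inputs as 2D spectrogram-like images and downsample them to a standardized size.
Train CNN variants (Conv-5 and Conv-3, with 3×3 and M×3 filters) using ReLU, dropout, L2 regularization, and the Adam optimizer.
Evaluate with 5-fold cross-validation (ESC-50) and 10-fold cross-validation (UrbanSound8K); report median accuracy and MAD.
Compare 2D and 1D convolution and analyze the effect of window size (wideband vs. narrowband) on the results.
Use ANOVA with Tukey post-hoc tests to identify significant differences between representations.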
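
The reporting step above (median accuracy and MAD across folds) can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes MAD stands for the median absolute deviation, and the fold accuracies are hypothetical numbers chosen only to show the calculation.

```python
from statistics import median


def median_and_mad(fold_accuracies):
    """Return (median, median absolute deviation) of per-fold accuracies.

    Assumption: "MAD" means median absolute deviation around the median.
    """
    center = median(fold_accuracies)
    mad = median(abs(a - center) for a in fold_accuracies)
    return center, mad


# Hypothetical per-fold accuracies (%) for a 5-fold run; not values from the paper.
folds = [52.0, 55.5, 54.0, 56.5, 53.0]
center, mad = median_and_mad(folds)
print(f"{center:.2f} ± {mad:.2f}")  # prints "54.00 ± 1.50"
```

Both statistics are robust to a single unusually good or bad fold, which matters when there are only 5 or 10 folds.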

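Experimental results

Research questions

RQ1: Which time-frequency representation gives the best CNN-based environmental sound classification performance on ESC-50 and UrbanSound8K?
RQ2: How do wideband and narrowband windows affect accuracy across representations?
RQ3: For spectrogram-based inputs, does 2D convolution generally outperform 1D convolution?
RQ4: When used with a CNN, are MFCC inputs still competitive with modern spectrogram representations?
RQ5: What is the relative effect of network depth (Conv-3 vs. Conv-5) on performance for the different inputs?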

Key findings

Representation/Model	Linear-STFT wideband	Linear-STFT narrowband	Mel-STFT wideband	Mel-STFT narrowband	CQT wideband	CQT narrowband	CWT wideband	MFCC
ESC-50 Conv-5: M×3	44.50 ± 2.00	46.62 ± 2.25	46.25 ± 2.00	48.00 ± 1.63	42.00 ± 2.37	42.62 ± 1.50	38.25 ± 1.50	30.50 ± 1.50
ESC-50 Conv-5: 3×3	49.25 ± 0.75	50.00 ± 1.88	50.87 ± 2.50	53.75 ± 1.75	46.87 ± 1.13	48.62 ± 2.00	40.50 ± 2.13	36.62 ± 2.13
ESC-50 Conv-3: M×3	52.12 ± 1.12	55.12 ± 1.88	56.37 ± 1.63	56.25 ± 1.75	54.37 ± 2.25	53.50 ± 1.87	46.50 ± 1.63	35.25 ± 2.75
ESC-50 Conv-3: 3×3	55.00 ± 1.37	53.00 ± 1.62	54.00 ± 1.25	55.00 ± 1.63	51.75 ± 1.25	51.62 ± 2.25	46.62 ± 1.87	35.00 ± 0.75
UrbanSound8K Conv-5: M×3	61.19 ± 4.81	63.44 ± 3.39	62.22 ± 5.19	64.97 ± 3.69	62.87 ± 3.25	63.12 ± 3.25	56.90 ± 2.10	59.23 ± 3.24
UrbanSound8K Conv-5: 3×3	67.94 ± 4.22	62.83 ± 4.73	69.59 ± 4.19	65.31 ± 2.19	69.25 ± 4.69	64.33 ± 3.60	61.56 ± 1.80	57.15 ± 1.81
UrbanSound8K Conv-3: M×3	68.81 ± 4.50	66.72 ± 2.72	70.69 ± 4.06	68.29 ± 3.00	70.94 ± 4.06	67.06 ± 3.12	64.00 ± 2.17	64.87 ± 2.17
UrbanSound8K Conv-3: 3×3	70.94 ± 2.94	68.19 ± 3.25	74.66 ± 3.39	71.25 ± 1.85	73.03 ± 3.56	68.31 ± 2.35	64.75 ± 1.44	62.81 ± 4.03
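
As a rough way to read the table above, the sketch below counts, for each dataset and depth, how many of the eight inputs score higher with 3×3 filters than with M×3 filters. It rests on two assumptions that the digest does not state outright: that each entry is median accuracy (%) ± MAD, and that the M×3 filters (spanning the whole frequency axis) correspond to the 1D convolution while the 3×3 filters correspond to the 2D convolution. Only the medians are copied in, so the comparison ignores the spread and any significance testing.

```python
# Column order of the table above.
COLUMNS = [
    "Linear-STFT wideband", "Linear-STFT narrowband",
    "Mel-STFT wideband", "Mel-STFT narrowband",
    "CQT wideband", "CQT narrowband", "CWT wideband", "MFCC",
]

# Median accuracies (%) copied from the table; the ± values are omitted.
MEDIANS = {
    ("ESC-50", "Conv-5", "Mx3"): [44.50, 46.62, 46.25, 48.00, 42.00, 42.62, 38.25, 30.50],
    ("ESC-50", "Conv-5", "3x3"): [49.25, 50.00, 50.87, 53.75, 46.87, 48.62, 40.50, 36.62],
    ("ESC-50", "Conv-3", "Mx3"): [52.12, 55.12, 56.37, 56.25, 54.37, 53.50, 46.50, 35.25],
    ("ESC-50", "Conv-3", "3x3"): [55.00, 53.00, 54.00, 55.00, 51.75, 51.62, 46.62, 35.00],
    ("UrbanSound8K", "Conv-5", "Mx3"): [61.19, 63.44, 62.22, 64.97, 62.87, 63.12, 56.90, 59.23],
    ("UrbanSound8K", "Conv-5", "3x3"): [67.94, 62.83, 69.59, 65.31, 69.25, 64.33, 61.56, 57.15],
    ("UrbanSound8K", "Conv-3", "Mx3"): [68.81, 66.72, 70.69, 68.29, 70.94, 67.06, 64.00, 64.87],
    ("UrbanSound8K", "Conv-3", "3x3"): [70.94, 68.19, 74.66, 71.25, 73.03, 68.31, 64.75, 62.81],
}


def wins_2d_over_1d(dataset, depth):
    """Count inputs where the 3x3 median exceeds the Mx3 median."""
    one_d = MEDIANS[(dataset, depth, "Mx3")]
    two_d = MEDIANS[(dataset, depth, "3x3")]
    return sum(b > a for a, b in zip(one_d, two_d))


def best_cell():
    """Return (row key, column name, median) for the highest median in the table."""
    cells = (
        (key, COLUMNS[i], value)
        for key, row in MEDIANS.items()
        for i, value in enumerate(row)
    )
    return max(cells, key=lambda cell: cell[2])


for dataset in ("ESC-50", "UrbanSound8K"):
    for depth in ("Conv-5", "Conv-3"):
        print(dataset, depth, wins_2d_over_1d(dataset, depth), "of", len(COLUMNS))
print(best_cell())
```

On these medians the 3×3 filters win for 8 of 8 inputs with Conv-5 on ESC-50, 2 of 8 with Conv-3 on ESC-50, 6 of 8 with Conv-5 on UrbanSound8K, and 7 of 8 with Conv-3 on UrbanSound8K.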

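Mel-STFT spectrogram inputs perform consistently well across models and datasets.
Most spectrogram representations improve on the MFCC baseline, and in many cases MFCC trails by a wide margin.
2D convolution usually outperforms 1D convolution; the exception is the shallower model on ESC-50.
The effect of wideband versus narrowband windows varies by dataset and class, which suggests that the best window size is class-dependent.
Conv-3 (3×3) usually outperforms Conv-5, which suggests that the deeper network may overfit given the limited amount of data.
The highest median accuracy in the table is 74.66%, reached on UrbanSound8K by Conv-3 with 3×3 filters and wideband Mel-STFT input.
CWT tends to perform close to MFCC, especially on UrbanSound8K, and in the table it stays below Mel-STFT and CQT in every configuration.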
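
The wideband versus narrowband finding comes down to the usual STFT trade-off: a short window localizes events well in time but gives coarse frequency bins, and a long window does the opposite. The sketch below makes that concrete at the 22.05 kHz sample rate given in the method. The window lengths of 512 and 2048 samples are illustrative assumptions, not values taken from this digest, and `stft_resolution` is a hypothetical helper name.

```python
SAMPLE_RATE = 22050  # Hz, the resampling rate stated in the method


def stft_resolution(window_length, sample_rate=SAMPLE_RATE):
    """Return (window duration in ms, frequency bin spacing in Hz)."""
    duration_ms = 1000.0 * window_length / sample_rate
    bin_spacing_hz = sample_rate / window_length
    return duration_ms, bin_spacing_hz


# Assumed window lengths, for illustration only.
for label, n in (("wideband (short window)", 512), ("narrowband (long window)", 2048)):
    ms, hz = stft_resolution(n)
    print(f"{label}: {n} samples, {ms:.1f} ms, {hz:.1f} Hz per bin")
```

Because the product of window duration and bin spacing is fixed, improving one resolution always costs the other. That fits the observation that short, transient sounds and sustained, tonal sounds can favor different window sizes.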

This digest was generated by AI and reviewed by a human editor.