QUICK REVIEW

[Paper Review] Learning from Between-class Examples for Deep Sound Recognition

Yuji Tokozume, Yoshitaka Ushiku|arXiv (Cornell University)|Nov 28, 2017

Music and Audio Processing | References: 14 | Citations: 153

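One-line summary

BC learning mixes two sounds from different classes and trains the model to predict the mixing ratio. It improves accuracy across networks and datasets, and with EnvNet-v2 it surpasses human-level performance on ESC-50.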

ABSTRACT

Deep learning methods have achieved high performance in sound recognition tasks. Deciding how to feed the training data is important for further performance improvement. We propose a novel learning method for deep sound recognition: Between-Class learning (BC learning). Our strategy is to learn a discriminative feature space by recognizing the between-class sounds as between-class sounds. We generate between-class sounds by mixing two sounds belonging to different classes with a random ratio. We then input the mixed sound to the model and train the model to output the mixing ratio. The advantages of BC learning are not limited only to the increase in variation of the training data; BC learning leads to an enlargement of Fisher's criterion in the feature space and a regularization of the positional relationship among the feature distributions of the classes. The experimental results show that BC learning improves the performance on various sound recognition networks, datasets, and data augmentation schemes, in which BC learning proves to be always beneficial. Furthermore, we construct a new deep sound recognition network (EnvNet-v2) and train it with BC learning. As a result, we achieved a performance surpasses the human level.

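Motivation and objectives

Improve how training data is used in deep sound recognition.
Introduce Between-Class (BC) learning, which mixes sounds from different classes.
Train the model to predict the mixing ratio, which enlarges Fisher's criterion in the feature space.
Demonstrate BC learning across multiple architectures and datasets.
Show that BC learning with a deeper network surpasses human performance on ESC-50.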

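Proposed method

Create each training sample by mixing two sounds from different classes at a random ratio r.
Use a mixing formula that accounts for sound pressure level, computing the coefficient p that preserves the perceived ratio (Eq. (2)).
Represent the mixed label as t = r t1 + (1 - r) t2 and optimize with a KL-divergence loss.
Train with mini-batch SGD; BC learning may need more epochs than standard learning.
Visualize the feature space to argue that Fisher's criterion is enlarged and that the relationships among classes are regularized.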
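The mixing, labeling, and loss steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' code. The form of the coefficient p is our reading of Eq. (2) and should be checked against the paper. `sound_level_db` uses a plain RMS level where the paper uses A-weighting, and all function names are hypothetical.

```python
import math
import random


def sound_level_db(wave):
    """RMS level in dB. Assumption: a simplified stand-in for the A-weighted level."""
    rms = math.sqrt(sum(s * s for s in wave) / len(wave))
    return 20.0 * math.log10(max(rms, 1e-12))


def mixing_coefficient(r, g1, g2):
    """Amplitude weight p for sound 1, given target ratio r and levels g1, g2 in dB.

    Assumed form of Eq. (2): the louder sound gets a smaller weight so that
    the perceived ratio stays close to r : (1 - r).
    """
    return 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0) * (1.0 - r) / r)


def bc_mix(x1, x2, t1, t2, r):
    """Mix two waveforms and their one-hot labels with ratio r in (0, 1)."""
    p = mixing_coefficient(r, sound_level_db(x1), sound_level_db(x2))
    norm = math.sqrt(p * p + (1.0 - p) ** 2)  # keeps the mixed energy comparable
    mixed = [(p * a + (1.0 - p) * b) / norm for a, b in zip(x1, x2)]
    label = [r * a + (1.0 - r) * b for a, b in zip(t1, t2)]  # t = r t1 + (1 - r) t2
    return mixed, label


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log(t / max(q, eps))
        for t, q in zip(target, predicted)
        if t > 0.0
    )


if __name__ == "__main__":
    # Toy waveforms standing in for two sounds from different classes.
    x1 = [math.sin(0.10 * i) for i in range(100)]
    x2 = [0.5 * math.sin(0.37 * i) for i in range(100)]
    # Ratio kept away from 0 and 1 to avoid division by zero in this sketch.
    r = random.uniform(0.01, 0.99)
    mixed, label = bc_mix(x1, x2, [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], r)
    uniform_prediction = [1.0 / 3.0] * 3
    print(label, kl_divergence(label, uniform_prediction))
```

When both sounds have the same level, p reduces to r, so the level term only matters when one sound is louder than the other.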

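Experimental results

Research questions

RQ1: Does BC learning improve recognition performance across architectures, datasets, and data augmentation schemes?
RQ2: How should sounds be mixed and labels assigned to get the most out of BC learning?
RQ3: How does BC learning affect Fisher's criterion and the regularization of class relationships in the feature space?
RQ4: Can BC learning surpass human performance on a challenging environmental sound dataset?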

Key findings

Model	Learning	ESC-50 error (%)	ESC-10 error (%)	UrbanSound8K error (%)
EnvNet (Tokozume & Harada, 2017)	Standard	29.2±0.1	12.8±0.4	33.7
EnvNet (Tokozume & Harada, 2017)	BC (ours)	24.1±0.2	11.3±0.6	28.9
SoundNet5 (Aytar et al., 2016)	Standard	33.8±0.2	16.4±0.8	33.3
SoundNet5 (Aytar et al., 2016)	BC (ours)	27.4±0.3	13.9±0.4	30.2
M18 (Dai et al., 2017)	Standard	31.5±0.5	18.2±0.5	28.8
M18 (Dai et al., 2017)	BC (ours)	26.7±0.1	14.2±0.9	26.5
Logmel-CNN (Piczak, 2015a) + BN	Standard	27.6±0.2	13.2±0.4	25.3
Logmel-CNN (Piczak, 2015a) + BN	BC (ours)	23.1±0.3	9.4±0.4	23.5
EnvNet-v2 (ours)	Standard	25.6±0.3	14.2±0.8	30.9
EnvNet-v2 (ours)	BC (ours)	18.2±0.2	10.6±0.6	23.4
EnvNet-v2 (ours) + strong augmentation	Standard	21.2±0.3	10.9±0.6	24.9
EnvNet-v2 (ours) + strong augmentation	BC (ours)	15.1±0.2	8.6±0.1	21.7
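To make the size of the gains concrete, the short script below computes the absolute and relative error reductions for the ESC-50 column of the table. The mean error rates are copied from the table; the dictionary layout and the helper name `error_reduction` are illustrative choices.

```python
# Mean ESC-50 error rates (%) from the table: (standard learning, BC learning).
ESC50_ERRORS = {
    "EnvNet": (29.2, 24.1),
    "SoundNet5": (33.8, 27.4),
    "M18": (31.5, 26.7),
    "Logmel-CNN + BN": (27.6, 23.1),
    "EnvNet-v2": (25.6, 18.2),
    "EnvNet-v2 + strong augmentation": (21.2, 15.1),
}


def error_reduction(standard, bc):
    """Return (absolute reduction in points, relative reduction in percent)."""
    absolute = standard - bc
    relative = 100.0 * absolute / standard
    return absolute, relative


if __name__ == "__main__":
    for model, (standard, bc) in ESC50_ERRORS.items():
        absolute, relative = error_reduction(standard, bc)
        print(f"{model}: -{absolute:.1f} points ({relative:.1f}% relative)")
```

The comparison uses only the reported means and ignores the ± spread given in the table.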

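BC learning improved performance for every evaluated network (EnvNet, SoundNet5, M18, Logmel-CNN+BN, and EnvNet-v2) on ESC-50, ESC-10, and UrbanSound8K.
With EnvNet-v2 on ESC-50, BC learning reached an error rate of 18.2% (25.6% with standard learning); adding strong augmentation improved this further to 15.1%.
BC learning enlarges Fisher's criterion and regularizes the class distributions, which reduces misclassification of mixed-class sounds.
EnvNet-v2 with BC learning surpasses human performance on ESC-50 (18.2% error versus the 18.7% human error reported in prior work).
In the ablation, the proposed mixing method (Eq. (2) with A-weighting) combined with ratio labels performed best (24.1% error on ESC-50).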
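Fisher's criterion is mentioned in the findings without a formula. For intuition, here is a minimal two-class, one-dimensional sketch that divides the squared distance between class means by the sum of the within-class variances. This is the textbook form, assumed here for illustration; the paper's exact definition in the feature space may differ, and the function name and toy data are hypothetical.

```python
from statistics import mean, pvariance


def fisher_criterion(class_a, class_b):
    """Between-class separation divided by within-class spread (1-D, two classes)."""
    between = (mean(class_a) - mean(class_b)) ** 2
    within = pvariance(class_a) + pvariance(class_b)
    return between / within


if __name__ == "__main__":
    # Same class means in both cases; only the within-class spread changes.
    loose_a, loose_b = [0.0, 1.0, 2.0], [4.0, 5.0, 6.0]
    tight_a, tight_b = [0.5, 1.0, 1.5], [4.5, 5.0, 5.5]
    print("loose clusters:", fisher_criterion(loose_a, loose_b))
    print("tight clusters:", fisher_criterion(tight_a, tight_b))
```

Tighter clusters with the same means give a larger value, which is the sense in which a training method can "enlarge" the criterion.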

This review was generated by AI and checked by a human editor.