QUICK REVIEW

[Paper Review] Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Humam Alwassel, Dhruv Mahajan|arXiv (Cornell University)|Nov 28, 2019

Music and Audio Processing | References: 77 | Citations: 251

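One-sentence summary

This paper introduces Cross-Modal Deep Clustering (XDC) for self-supervised learning. Given unlabeled video, XDC uses cluster assignments from one modality (audio or video) as the supervisory signal for the other, achieving state-of-the-art results among self-supervised methods and in some cases outperforming large-scale fully supervised pretraining.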

ABSTRACT

Visual and audio modalities are highly correlated, yet they contain different information. Their strong correlation makes it possible to predict the semantics of one from the other with good accuracy. Their intrinsic differences make cross-modal prediction a potentially more rewarding pretext task for self-supervised learning of video and audio representations compared to within-modality learning. Based on this intuition, we propose Cross-Modal Deep Clustering (XDC), a novel self-supervised method that leverages unsupervised clustering in one modality (e.g., audio) as a supervisory signal for the other modality (e.g., video). This cross-modal supervision helps XDC utilize the semantic correlation and the differences between the two modalities. Our experiments show that XDC outperforms single-modality clustering and other multi-modal variants. XDC achieves state-of-the-art accuracy among self-supervised methods on multiple video and audio benchmarks. Most importantly, our video model pretrained on large-scale unlabeled data significantly outperforms the same model pretrained with full-supervision on ImageNet and Kinetics for action recognition on HMDB51 and UCF101. To the best of our knowledge, XDC is the first self-supervised learning method that outperforms large-scale fully-supervised pretraining for action recognition on the same architecture.

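Research motivation and objectives

Reduce reliance on manually labeled video data.
Exploit the strong correlation and complementary information between the audio and visual modalities.
Propose a cross-modal clustering framework that trains one modality using pseudo-labels derived from the other.
Show that cross-modal self-supervised learning improves downstream action recognition and audio classification.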

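Proposed method

Applies DeepCluster-style self-supervision to a multi-modal setting with two encoders: a visual encoder E_v and an audio encoder E_a.
Proposes three models: Multi-Head Deep Clustering (MDC), Concatenation Deep Clustering (CDC), and Cross-Modal Deep Clustering (XDC).
MDC adds a second head to each encoder, supervised by the cluster assignments of the other modality.
CDC clusters the concatenated visual and audio features and uses the resulting clusters as pseudo-labels for both encoders.
XDC supervises each encoder exclusively with the clusters of the other modality, making the self-supervision purely cross-modal.
Each encoder produces modality-specific features, which are clustered with k-means to generate pseudo-labels; clustering and training alternate to refine the representations iteratively.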
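
The difference between the three variants comes down to which cluster assignments each encoder receives as targets. The following is a minimal sketch of that routing step only, not the authors' implementation: it assumes features have already been extracted into NumPy arrays, uses a plain Lloyd's k-means written inline, and omits the encoder training that would follow each clustering round. The names `kmeans_labels` and `pseudo_labels` are hypothetical helpers introduced for illustration.

```python
import numpy as np


def kmeans_labels(features, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per row of `features`."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct, randomly chosen samples.
    centroids = features[rng.choice(len(features), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every sample to every centroid: (n, k).
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels


def pseudo_labels(video_feats, audio_feats, k, variant="xdc"):
    """Return (targets for the visual encoder, targets for the audio encoder)."""
    if variant == "cdc":
        # CDC: cluster the concatenated features; both encoders share the labels.
        joint = kmeans_labels(np.concatenate([video_feats, audio_feats], axis=1), k)
        return joint, joint
    video_clusters = kmeans_labels(video_feats, k)
    audio_clusters = kmeans_labels(audio_feats, k)
    if variant == "xdc":
        # XDC: each encoder is supervised only by the other modality's clusters.
        return audio_clusters, video_clusters
    if variant == "mdc":
        # MDC: two heads per encoder, (own-modality targets, cross-modal targets).
        return (video_clusters, audio_clusters), (audio_clusters, video_clusters)
    raise ValueError(f"unknown variant: {variant}")
```

In a full training loop, the returned targets would be used as classification labels for one training round, after which features are re-extracted and re-clustered; that alternation is what the last item above describes.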

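Experimental results

Research questions

RQ1: How do the multi-modal self-supervised clustering frameworks (MDC, CDC, XDC) compare with single-modality baselines?
RQ2: How does the number of k-means clusters k affect XDC's performance across datasets?
RQ3: How do the type (curated vs. uncurated) and scale of the pretraining data affect XDC's transfer to downstream tasks?
RQ4: Can XDC outperform fully supervised pretraining on standard action recognition and audio classification benchmarks?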

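Key findings

All three multi-modal models outperform the single-modality DeepCluster baselines on downstream tasks.
XDC consistently performs best among the proposed models across the evaluation datasets.
XDC pretrained on large-scale unlabeled data outperforms fully supervised pretraining on Kinetics/ImageNet for action recognition on HMDB51 and UCF101; according to the authors, it is the first self-supervised method to do so on the same architecture.
XDC pretrained on AudioSet or IG-Random/IG-Kinetics transfers well, and performance improves as the pretraining dataset grows.
Used as a fixed feature extractor, XDC often outperforms several fully supervised models, and full fine-tuning from XDC remains competitive.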


This review was generated by AI and checked by a human editor.