QUICK REVIEW

[Paper Review] Active Contrastive Learning of Audio-Visual Video Representations

Shuang Ma, Zhaoyang Zeng|arXiv (Cornell University)|Aug 31, 2020

Speech and Audio Processing|References: 80|Citations: 46

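One-line summary

CM-ACC introduces a cross-modal contrastive learning framework for audio-visual video representations that actively samples its negatives, yielding strong improvements on standard benchmarks and reducing redundancy in large dictionaries.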

ABSTRACT

Contrastive learning has been shown to produce generalizable representations of audio and visual data by maximizing the lower bound on the mutual information (MI) between different views of an instance. However, obtaining a tight lower bound requires a sample size exponential in MI and thus a large set of negative samples. We can incorporate more samples by building a large queue-based dictionary, but there are theoretical limits to performance improvements even with a large number of negative samples. We hypothesize that random negative sampling leads to a highly redundant dictionary that results in suboptimal representations for downstream tasks. In this paper, we propose an active contrastive learning approach that builds an actively sampled dictionary with diverse and informative items, which improves the quality of negative samples and improves performances on tasks where there is high mutual information in the data, e.g., video classification. Our model achieves state-of-the-art performance on challenging audio and visual downstream benchmarks including UCF101, HMDB51 and ESC50. Code is available at: https://github.com/yunyikristy/CM-ACC
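
The sample-size requirement mentioned in the abstract can be read off the standard InfoNCE bound. This is a well-known result from the contrastive learning literature and is not spelled out in this review; the symbols below (a similarity score s, one positive y^+ and K negatives y_j^- per query x) are notation assumed here for illustration.

```latex
I(x; y) \;\ge\; \log(K + 1) - \mathcal{L}_{\mathrm{NCE}},
\qquad
\mathcal{L}_{\mathrm{NCE}}
  = -\,\mathbb{E}\left[
      \log \frac{\exp\big(s(x, y^{+})\big)}
                {\exp\big(s(x, y^{+})\big) + \sum_{j=1}^{K} \exp\big(s(x, y_j^{-})\big)}
    \right]
```

Because the loss is non-negative, the bound can never exceed log(K + 1). Certifying I nats of mutual information therefore needs roughly K + 1 >= e^I samples, which is why the number of negatives has to grow exponentially with MI.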

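Motivation and objectives

Learn robust audio-visual video representations through self-supervised contrastive learning.
Address redundancy and uninformative negatives in large dictionaries by proposing active sampling.
Develop a cross-modal contrastive framework that exploits audio-visual correspondence.
Demonstrate improved downstream performance on action and audio classification tasks.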

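Proposed method

Extend MoCo-style contrastive learning cross-modally to audio-visual video data.
Build an actively sampled dictionary of negative samples, using gradient-based uncertainty and k-means++ initialization to ensure diversity.
Compute a cross-modal contrastive loss between modalities using a query encoder and a momentum-updated key encoder.
Use pseudo-labels to guide uncertainty estimation and select diverse, informative negatives.
Add an extra FC layer that enables cross-modal gradient flow and improves stability.
Pretrain on the AudioSet and Kinetics datasets, then evaluate on downstream tasks.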
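
The loss and key-encoder update listed above can be sketched as follows. This is a minimal NumPy illustration of a generic MoCo-style cross-modal InfoNCE loss, not the authors' implementation: the function names, the temperature of 0.07 and the momentum of 0.999 are assumptions borrowed from common MoCo settings, and the encoders are replaced by precomputed embedding arrays.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cross_modal_info_nce(query, pos_key, neg_dict, temperature=0.07):
    """InfoNCE loss with queries from one modality and keys from the other.

    query:    (B, D) embeddings from the query encoder (e.g. video).
    pos_key:  (B, D) embeddings of the matching clips from the key encoder (e.g. audio).
    neg_dict: (K, D) dictionary of negative keys from the key modality.
    The temperature default is an assumed value, not taken from the paper.
    """
    q = l2_normalize(query)
    k_pos = l2_normalize(pos_key)
    k_neg = l2_normalize(neg_dict)

    pos_logit = np.sum(q * k_pos, axis=1, keepdims=True)  # (B, 1)
    neg_logits = q @ k_neg.T                               # (B, K)
    logits = np.concatenate([pos_logit, neg_logits], axis=1) / temperature

    # Numerically stable log-softmax; the positive sits in column 0.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[:, 0].mean())


def momentum_update(key_params, query_params, m=0.999):
    """Exponential moving average of query-encoder weights into the key encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]
```

In a full cross-modal setup the loss would typically be applied in both directions (video queries against audio keys and the reverse), with the negative dictionary filled by the active sampling step rather than by a random queue.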

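Experimental results

Research questions

RQ1: For audio-visual video data with high MI, can uncertainty-guided active negative sampling improve contrastive learning compared with random negatives?
RQ2: Does cross-modal contrastive learning with an actively sampled dictionary improve representations for downstream video action and audio classification tasks?
RQ3: How does gradient-based negative sampling differ from feature-based sampling in diversity and downstream performance?
RQ4: How much does cross-modal gradient flow affect training stability and representation quality?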

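Key findings

CM-ACC achieves state-of-the-art classification performance on UCF101, HMDB51, and ESC50 when pretrained on AudioSet and Kinetics data.
Active sampling produces more diverse negatives across semantic categories than random sampling, with higher category coverage.
Gradient-based embeddings for negative sampling yield larger downstream gains than using feature embeddings alone.
CM-ACC outperforms the random-sampling MoCo baseline by a large margin on multiple benchmarks (for example, up to +6.2 on ESC50 in the reported comparison).
Training with cross-modal gradient flow through the auxiliary FC layer improves stability and performance compared with the variant without cross-modal flow.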
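
To make the diversity finding concrete, here is a small sketch of k-means++ seeding used as a selection rule. It shows only the generic D^2-sampling step: the function name `kmeanspp_select` is hypothetical, and the `embeddings` array is a stand-in for whatever the method clusters (gradient embeddings, according to this review; how they are computed is not described here).

```python
import numpy as np


def kmeanspp_select(embeddings, num_select, rng=None):
    """Return indices of num_select rows chosen by k-means++ seeding.

    Each new index is drawn with probability proportional to its squared
    distance from the nearest already-chosen row, so far-away (diverse)
    candidates are favoured over near-duplicates.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = embeddings.shape[0]
    if not 0 < num_select <= n:
        raise ValueError("num_select must be between 1 and the number of rows")

    first = int(rng.integers(n))
    chosen = [first]
    # Squared distance from every row to its nearest chosen row.
    min_sq_dist = np.sum((embeddings - embeddings[first]) ** 2, axis=1)

    while len(chosen) < num_select:
        total = min_sq_dist.sum()
        if total <= 0.0:
            # Every remaining row duplicates a chosen one: pick uniformly.
            remaining = np.setdiff1d(np.arange(n), chosen)
            idx = int(rng.choice(remaining))
        else:
            idx = int(rng.choice(n, p=min_sq_dist / total))
        chosen.append(idx)
        new_sq_dist = np.sum((embeddings - embeddings[idx]) ** 2, axis=1)
        min_sq_dist = np.minimum(min_sq_dist, new_sq_dist)
    return chosen
```

On data with a few tight, well-separated clusters, this rule tends to pick one item per cluster before revisiting any of them, whereas uniform random sampling often draws several items from the same cluster. That is the intuition behind the higher category coverage reported for active sampling.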

This review was generated by AI and checked by a human editor.