QUICK REVIEW

[Paper Review] Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adrià Recasens|arXiv (Cornell University)|Jun 29, 2020

Multimodal Machine Learning Applications|References: 92|Citations: 195

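One-sentence summary

This paper proposes self-supervised multimodal versatile (MMV) networks that jointly learn visual, audio, and language representations from unlabelled video. It also introduces a deflation mechanism that lets the same network be applied to static images, and it reports strong performance in both zero-shot and supervised transfer settings.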

ABSTRACT

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.

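Motivation and objectives

Learn general-purpose, versatile multimodal representations from unlabelled video data.
Develop a network that ingests vision, audio, and text and allows comparison across modalities.
Respect the granularity specific to each modality, enabling fine-grained similarity for vision and audio and coarse-grained alignment with text.
Use a deflation mechanism so that the network can be applied efficiently to both video streams and static images.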

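Proposed method

Each modality is embedded into a shared or hierarchical space using a modality-specific backbone and projection head.
Three modality embedding graphs (Shared, Disjoint, and Fine-and-Coarse, FAC) are examined for aligning the modalities in a joint space.
Training uses a multimodal contrastive loss that treats pairs from the same video as positives and pairs from different videos as negatives.
Text alignment uses MIL-NCE to cope with misalignment between narration and visual content.
A deflation procedure converts a video-trained network into an image network without labels.
Missing modalities are handled by omitting the corresponding loss term and reweighting the remaining losses.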
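
As a rough illustration of the training objective described above, the following is a minimal sketch in plain Python, not the authors' implementation. The function names (`nce_loss`, `mil_nce_loss`, `combined_loss`), the temperature value, the one-directional form of the loss, and the renormalisation applied when a modality is missing are all assumptions made for illustration.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def nce_loss(xs, ys, temperature=0.07):
    """Contrastive loss over a batch: (xs[i], ys[i]) come from the same
    video (positive); (xs[i], ys[j]) with j != i are negatives.
    The temperature default is an assumed value."""
    xs = [normalize(x) for x in xs]
    ys = [normalize(y) for y in ys]
    total = 0.0
    for i, x in enumerate(xs):
        scores = [math.exp(dot(x, y) / temperature) for y in ys]
        total += -math.log(scores[i] / sum(scores))
    return total / len(xs)


def mil_nce_loss(videos, text_bags, temperature=0.07):
    """MIL-NCE-style loss: text_bags[i] holds several candidate narration
    embeddings for videos[i]; their scores are summed so that any one of
    them may explain the clip, which tolerates loose alignment."""
    videos = [normalize(v) for v in videos]
    text_bags = [[normalize(t) for t in bag] for bag in text_bags]
    total = 0.0
    for i, v in enumerate(videos):
        bag_scores = [
            sum(math.exp(dot(v, t) / temperature) for t in bag)
            for bag in text_bags
        ]
        total += -math.log(bag_scores[i] / sum(bag_scores))
    return total / len(videos)


def combined_loss(videos, audio=None, text_bags=None, w_audio=1.0, w_text=1.0):
    """Weighted sum of the available loss terms. A missing modality drops
    its term, and the remaining weights are renormalised (an assumed
    reweighting scheme, chosen here for simplicity)."""
    terms = []
    if audio is not None:
        terms.append((w_audio, nce_loss(videos, audio)))
    if text_bags is not None:
        terms.append((w_text, mil_nce_loss(videos, text_bags)))
    if not terms:
        raise ValueError("at least one of audio or text_bags is required")
    weight_sum = sum(w for w, _ in terms)
    return sum(w * loss for w, loss in terms) / weight_sum
```

Calling `combined_loss(videos, audio=audio)` on a batch without narrations reduces to the video-audio term alone, while passing both `audio` and `text_bags` averages the two terms by their weights.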

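Experimental results

Research questions

RQ1: Can a single multimodal network effectively integrate visual, audio, and text information learned from unlabelled video?
RQ2: Which modality embedding graph offers the best trade-off among cross-modal alignment, within-modality granularity, and navigability between modalities?
RQ3: Does deflating a video-trained network yield competitive image representations without additional supervision?
RQ4: How does the three-modality model compare with two-modality baselines on standard video, audio, and image benchmarks?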

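Key findings

The FAC (Fine-and-Coarse) embedding strategy performs strongly across UCF101, HMDB51, MSRVTT, and ESC-50, and outperforms two-modality configurations.
Training with three modalities generally improves the visual representation and helps cross-modal retrieval tasks.
Combining HowTo100M and AudioSet improves results on HMDB51, UCF101, and ESC-50, and makes better use of audio data when text is absent.
The network deflated from video to static images achieves competitive performance on image tasks without requiring new annotations.
The method achieves state-of-the-art results among self-supervised methods on several benchmarks and approaches supervised performance on large-scale tasks such as Kinetics600.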
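
The deflation idea mentioned above can be illustrated with a toy example: if a video consists of the same frame repeated over time, a 3D convolution over it gives the same result as a 2D convolution whose kernel is the 3D kernel summed along the time axis. The sketch below shows only this single-filter identity under "valid" (no padding) convolution. The function names are hypothetical, and the paper's full procedure may involve further steps that are not covered here.

```python
def conv2d_valid(image, kernel):
    """2D cross-correlation without padding on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                kernel[a][b] * image[i + a][j + b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def conv3d_valid(video, kernel):
    """3D cross-correlation without padding.
    video is [T][H][W]; kernel is [kt][kh][kw]."""
    kt = len(kernel)
    out = []
    for t in range(len(video) - kt + 1):
        # Apply each temporal slice of the kernel to its matching frame.
        slices = [conv2d_valid(video[t + dt], kernel[dt]) for dt in range(kt)]
        rows, cols = len(slices[0]), len(slices[0][0])
        out.append(
            [[sum(s[i][j] for s in slices) for j in range(cols)] for i in range(rows)]
        )
    return out


def deflate_kernel(kernel3d):
    """Collapse a [kt][kh][kw] kernel to [kh][kw] by summing over time."""
    kt, kh, kw = len(kernel3d), len(kernel3d[0]), len(kernel3d[0][0])
    return [
        [sum(kernel3d[dt][a][b] for dt in range(kt)) for b in range(kw)]
        for a in range(kh)
    ]


# Toy data: one 3x3 image and one 2x2x2 spatio-temporal filter.
image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel3d = [[[1, 0], [0, 1]], [[2, 1], [0, -1]]]

static_video = [image, image, image]          # the image repeated in time
video_output = conv3d_valid(static_video, kernel3d)
image_output = conv2d_valid(image, deflate_kernel(kernel3d))
```

Every temporal slice of `video_output` matches `image_output`, so on static inputs the cheaper 2D filter reproduces what the 3D filter computes.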

This review was generated by AI and checked by a human editor.