QUICK REVIEW

[Paper Review] Support-set bottlenecks for video-text representation learning

Mandela Patrick, Po-Yao Huang|arXiv (Cornell University)|Oct 6, 2020

Multimodal Machine Learning Applications|References: 96|Citations: 37

One-Sentence Summary

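This study introduces cross-instance captioning with a support-set bottleneck as a complement to contrastive video-text learning, improving semantic sharing between samples and retrieval performance across multiple datasets.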

ABSTRACT

The dominant paradigm for learning video-text representations -- noise contrastive learning -- increases the similarity of the representations of pairs of samples that are known to be related, such as text and video from the same sample, and pushes away the representations of all other pairs. We posit that this last behaviour is too strict, enforcing dissimilar representations even for samples that are semantically-related -- for example, visually similar videos or ones that share the same depicted action. In this paper, we propose a novel method that alleviates this by leveraging a generative model to naturally push these related samples together: each sample's caption must be reconstructed as a weighted combination of other support samples' visual representations. This simple idea ensures that representations are not overly-specialized to individual samples, are reusable across the dataset, and results in representations that explicitly encode semantics shared between samples, unlike noise contrastive learning. Our proposed method outperforms others by a large margin on MSR-VTT, VATEX and ActivityNet, and MSVD for video-to-text and text-to-video retrieval.

Motivation and Objectives

Improve video-text representations beyond strict instance discrimination, which forces apart even semantically related samples.

Proposed Method

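Combine cross-modal contrastive learning with a generative cross-captioning objective.
Introduce a cross-instance attention mechanism that reconstructs each caption from a weighted mixture of the other videos in the batch.
Define batch-level attention that selects the support set and forms the representation from which the text is reconstructed.
Use a hinge-based triplet contrastive loss on video-text pairs, together with a cross-captioning loss weighted by a tunable coefficient lambda.
Experiment with cross-captioning attention variants (Identity, Full, Hybrid, Cross) and examine the effect of support-set size.
Train with Adam, keeping the video encoder frozen while fine-tuning the remaining components.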
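The method steps above can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. Several things here are assumptions made for illustration: the function names (`support_mask`, `attend`, `hinge_triplet_loss`, `total_loss`), the use of caption-video dot products as attention scores, the margin value of 0.2, and the mapping of the variant names to masks (Identity attends only to the caption's own video, Full to the whole batch, Cross to every video except its own). Hybrid is omitted, and the captioning loss is passed in as a plain number because the caption decoder is not reproduced here.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def support_mask(n, variant):
    """mask[i][j] is True when caption i may attend to video j (assumed mapping)."""
    if variant == "identity":  # only the caption's own video
        return [[i == j for j in range(n)] for i in range(n)]
    if variant == "full":      # every video in the batch
        return [[True] * n for _ in range(n)]
    if variant == "cross":     # every video except the caption's own
        return [[i != j for j in range(n)] for i in range(n)]
    raise ValueError(f"unknown variant: {variant}")


def attend(text_emb, video_emb, variant="cross"):
    """Return (weights, mixed); mixed[i] is the attention-weighted combination
    of support videos from which caption i would be reconstructed."""
    n, dim = len(video_emb), len(video_emb[0])
    mask = support_mask(n, variant)
    weights, mixed = [], []
    for i, t in enumerate(text_emb):
        allowed = [j for j in range(n) if mask[i][j]]
        if not allowed:
            raise ValueError("empty support set; use a larger batch")
        scores = [dot(t, video_emb[j]) for j in allowed]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        norm = sum(exps)
        row = [0.0] * n
        for j, e in zip(allowed, exps):
            row[j] = e / norm
        weights.append(row)
        mixed.append([sum(row[j] * video_emb[j][d] for j in range(n))
                      for d in range(dim)])
    return weights, mixed


def hinge_triplet_loss(sim, margin=0.2):
    """Bidirectional max-margin loss; sim[i][j] scores caption i against video j.
    The margin default is an illustrative assumption."""
    n = len(sim)
    if n < 2:
        raise ValueError("need at least two pairs to form negatives")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            total += max(0.0, margin + sim[i][j] - sim[i][i])  # wrong video
            total += max(0.0, margin + sim[j][i] - sim[i][i])  # wrong caption
    return total / (2 * n * (n - 1))


def total_loss(contrastive, captioning, lam):
    """Contrastive loss plus lambda-weighted cross-captioning loss."""
    return contrastive + lam * captioning
```

With the assumed Cross mask, a caption can never copy from its own video, so the reconstruction has to rely on what it shares with other samples in the batch; this is the bottleneck idea the review describes.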

Experimental Results

Research Questions

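RQ1: Can a generative cross-captioning objective improve multimodal representations learned with a contrastive loss?
RQ2: Does reconstructing captions from a batch-based support set promote semantic sharing between samples?
RQ3: Which cross-captioning variant gives the best retrieval performance across datasets?
RQ4: How does support-set size affect retrieval performance?
RQ5: What effect does pretraining on HowTo100M have on the final results?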

Key Findings

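The Cross variant of cross-captioning achieves the best text-to-video retrieval on MSR-VTT (27.2% R@1 and 55.2% R@5, along with related metrics).
In ablation experiments, combining temporal attention, stronger text encoding/decoding, and a triplet-based contrastive loss improves performance over the baseline.
Pretraining on HowTo100M further improves performance on MSR-VTT, VATEX, ActivityNet, and MSVD.
The cross-captioning objective acts as a bottleneck that encourages concepts to be shared across samples, improving semantic retrieval.
Support sets that are too small or too large degrade performance, suggesting an optimal intermediate size.
Qualitative attention analysis shows that the model focuses on semantically related samples rather than memorizing isolated video-caption pairs.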
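The R@1 and R@5 figures cited in the findings are Recall@K scores: the percentage of queries whose ground-truth match appears among the top K retrieved items. The sketch below shows one common way to compute this from a similarity matrix. The function name `recall_at_k`, the convention that query i matches candidate i, and the optimistic handling of ties are assumptions for illustration, not details taken from the paper's evaluation code.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match ranks in the top k.

    sim[i][j] scores query i against candidate j (higher is better);
    the ground-truth candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank = number of candidates scoring strictly higher than the match,
        # so ties are resolved in favor of the true match.
        rank = sum(1 for s in row if s > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)
```

For text-to-video retrieval the rows would be captions and the columns videos; transposing the matrix gives the video-to-text direction.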

This review was generated by AI and checked by a human editor.