QUICK REVIEW

[Paper Review] Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Peng Jin, Jinfa Huang|arXiv (Cornell University)|Nov 21, 2022

Multimodal Machine Learning Applications|Citations: 35

One-sentence summary

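This paper proposes EMCL, a contrastive learning framework based on Expectation-Maximization (EM) that learns compact, semantically aligned video-and-language representations. It achieves state-of-the-art results on MSR-VTT, ActivityNet, and LSMDC, and can be plugged into existing methods either at training time or at inference time.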

ABSTRACT

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project the video and text features into a common latent space according to the semantic similarities of text-video pairs. However, such learned shared latent spaces are not often optimal, and the modality gap between visual and textual representation can not be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, where the features could be concisely represented as the linear combinations of these bases. Such feature decomposition of video-and-language representations reduces the rank of the latent space, resulting in increased representing power for the semantics. Extensive experiments on three benchmark text-video retrieval datasets prove that our EMCL can learn more discriminative video-and-language representations than previous methods, and significantly outperform previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can be applied to boost the performance of existing approaches either as a jointly training layer or an out-of-the-box inference module with no extra training, making it easy to be incorporated into any existing methods.

Motivation and objectives

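Identify the limitations of standard contrastive learning for text-video retrieval, which stem from the modality gap and redundant semantic dimensions.
Propose EMCL to learn a low-rank, semantically relevant subspace for video and text representations.
Develop EMCL-Net and introduce an initial value maintenance strategy that stabilizes the EM iterations and enables both joint training and plug-and-play use.
Demonstrate state-of-the-art retrieval performance on MSR-VTT, ActivityNet, and LSMDC, and show compatibility as an add-on to existing baselines.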

Proposed method

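Formulate cross-modal contrastive learning as an EM process that finds K latent subspaces jointly representing video and text features.
In a capped EM setting with a Gaussian kernel, the E-step computes soft assignments of feature components to subspaces, and the M-step updates the subspace bases.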
Represent features by their reconstruction over K low-dimensional subspaces, reducing intra-class variance and increasing inter-class variance.
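Introduce maximum probability projection and feature reconstruction steps to obtain a compact subspace representation shared by video and text.
Integrate EMCL into EMCL-Net, adding Initial Value Maintenance (M), which transfers cross-batch information, and reconstruction fusion scaled by beta.
Train with an InfoNCE loss on the cosine similarity of the reconstructed video-text embeddings.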
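
The steps listed above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `em_subspace_reconstruct` and `info_nce`, the random initialisation, the kernel bandwidth `sigma`, the additive fusion `features + beta * reconstruction`, the symmetric form of the loss, and the temperature value are all hypothetical choices made for this example. The cross-batch Initial Value Maintenance is omitted.

```python
import numpy as np


def em_subspace_reconstruct(features, num_bases=4, num_iters=3,
                            sigma=1.0, beta=0.5, seed=0):
    """Illustrative EM over K bases followed by feature reconstruction.

    All defaults (K, iterations, sigma, beta) are assumptions for this sketch.
    """
    rng = np.random.default_rng(seed)
    _, dim = features.shape
    # Assumption: random unit-norm initial bases (no cross-batch maintenance).
    bases = rng.standard_normal((num_bases, dim))
    bases /= np.linalg.norm(bases, axis=1, keepdims=True)

    for _ in range(num_iters):
        # E-step: Gaussian-kernel soft assignment of each feature to each basis.
        sq_dist = ((features[:, None, :] - bases[None, :, :]) ** 2).sum(axis=-1)
        logits = -sq_dist / (2.0 * sigma ** 2)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)      # each row sums to 1

        # M-step: each basis becomes the responsibility-weighted mean feature.
        weights = resp / (resp.sum(axis=0, keepdims=True) + 1e-8)
        bases = weights.T @ features

    # Reconstruct each feature as a combination of the bases, using the
    # responsibilities from the last E-step.
    reconstruction = resp @ bases
    # Assumption: fuse original and reconstructed features additively.
    fused = features + beta * reconstruction
    return fused, bases, resp


def info_nce(video, text, temperature=0.05):
    """Symmetric InfoNCE on cosine similarity (symmetry is an assumption)."""
    v = video / np.linalg.norm(video, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    sim = (v @ t.T) / temperature  # matched pairs lie on the diagonal

    def diagonal_cross_entropy(logits):
        logits = logits - logits.max(axis=1, keepdims=True)
        log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(sim) + diagonal_cross_entropy(sim.T))


# Toy data: 8 paired video/text features of dimension 16 (synthetic).
rng = np.random.default_rng(0)
video = rng.standard_normal((8, 16))
text = video + 0.1 * rng.standard_normal((8, 16))

# Run EM on the stacked features so both modalities share one set of bases.
joint = np.concatenate([video, text], axis=0)
fused, bases, resp = em_subspace_reconstruct(joint, num_bases=4)
video_hat, text_hat = fused[:8], fused[8:]
loss = info_nce(video_hat, text_hat)
```

Stacking the two modalities before running EM is one simple way to make the bases shared. Because the function has no trainable parameters in this form, the same call could in principle be applied to frozen features at inference time, which is the kind of plug-in use the review describes.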

Experimental results

Research questions

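RQ1: Can a low-rank shared subspace decomposition bridge the visual-textual modality gap more effectively than conventional contrastive learning?
RQ2: Does integrating EM-based subspace projection improve the semantic clustering of same-class cross-modal pairs while separating different classes?
RQ3: Is EMCL broadly compatible with existing text-video retrieval models as a plug-in or inference-only module?
RQ4: How do the initialization strategy, the number of subspaces K, and the number of EM iterations affect performance and stability?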

Key findings

Method	Pre-trained	MSR-VTT R@1	MSR-VTT R@5	MSR-VTT R@10	MSR-VTT MdR	ActivityNet R@1	ActivityNet R@5	ActivityNet R@50	ActivityNet MdR	LSMDC R@1	LSMDC R@5	LSMDC R@10	LSMDC MdR
JSFusion	-	10.2	31.2	43.2	13.0	-	-	-	-	9.1	21.2	34.1	36.0
CE (Liu et al., 2019)	GPT-1	20.9	48.8	62.4	6.0	18.2	47.7	91.4	6.0	11.2	26.9	34.8	25.3
MMT (Gabeur et al., 2020)	BERT-Base	24.6	54.0	67.1	4.0	22.7	54.2	93.2	5.0	13.2	29.2	38.8	21.0
CLIP4Clip (Luo et al., 2021)	CLIP (ViT-B/32)	44.5	71.4	81.6	2.0	40.5	72.4	98.1	2.0	22.6	41.0	49.1	11.0
EMCL-Net (Ours)	CLIP (ViT-B/32)	46.8	73.1	83.1	2.0	41.2	72.7	98.1	2.0	23.9	42.4	50.9	10.0
EMCL-Net (Ours) ††	CLIP (ViT-B/32)	51.6	78.1	85.3	1.0	50.6	78.7	98.1	1.0	25.9	46.4	53.7	8.0
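
For readers unfamiliar with the columns above, R@K is the percentage of queries whose correct item appears in the top K results (higher is better), and MdR is the median rank of the correct item (lower is better). The sketch below shows one common way to compute them from a query-by-candidate similarity matrix. The function name `retrieval_metrics` and the toy matrix are hypothetical, and it assumes the correct candidate for query i sits at index i.

```python
import numpy as np


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and median rank from a similarity matrix.

    sim[i, j] is the similarity between query i and candidate j.
    Assumption: the ground-truth match for query i is candidate i.
    """
    num_queries = sim.shape[0]
    # Candidate indices sorted from most to least similar, per query.
    order = np.argsort(-sim, axis=1)
    # 1-based position of the correct candidate in each sorted row.
    hits = order == np.arange(num_queries)[:, None]
    ranks = np.argmax(hits, axis=1) + 1
    metrics = {f"R@{k}": float(np.mean(ranks <= k) * 100.0) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    return metrics


# Toy example: 3 text queries against 3 videos (made-up similarities).
sim = np.array([
    [0.9, 0.1, 0.2],   # correct video 0 is ranked 1st
    [0.8, 0.3, 0.1],   # correct video 1 is ranked 2nd
    [0.2, 0.4, 0.7],   # correct video 2 is ranked 1st
])
metrics = retrieval_metrics(sim, ks=(1, 2))
```

In this toy case two of three queries rank their match first, so R@1 is about 66.7, R@2 is 100, and the median rank is 1.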

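EMCL yields more discriminative video-and-language representations than standard contrastive learning baselines, with smaller intra-class variance and larger inter-class variance.
With appropriate parameter initialization, EMCL-Net consistently improves over the baselines on text-to-video and video-to-text retrieval on MSR-VTT, ActivityNet, and LSMDC.
In ablations, EMCL outperforms PCA, a Transformer, a fully connected layer, and a sparse autoencoder at similar computational cost, highlighting the benefit of semantically aligned subspace representations.
Plugging EMCL into strong baselines (MMT, CLIP4Clip, DCR) yields notable gains, including an absolute improvement of up to 3.5 percentage points in text-to-video R@1, and also significantly improves video-to-text retrieval.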

This review was generated by AI and checked by a human editor.