QUICK REVIEW

[Paper Review] CLIP2Video: Mastering Video-Text Retrieval via Image CLIP

Han Fang, Pengfei Xiong|arXiv (Cornell University)|Jun 21, 2021

Multimodal Machine Learning Applications | References: 38 | Cited by: 130

One-Sentence Summary

CLIP2Video transfers CLIP's image-language pre-training to video-text retrieval using two modules, a Temporal Difference Block and a Temporal Alignment Block, and achieves state-of-the-art results on MSR-VTT, MSVD, and VATEX.

ABSTRACT

We present CLIP2Video network to transfer the image-language pre-training model to video-text retrieval in an end-to-end manner. Leading approaches in the domain of video-and-language learning try to distill the spatio-temporal video features and multi-modal interaction between videos and languages from a large-scale video-text dataset. Different from them, we leverage pretrained image-language model, simplify it as a two-stage framework with co-learning of image-text and enhancing temporal relations between video frames and video-text respectively, make it able to train on comparatively small datasets. Specifically, based on the spatial semantics captured by Contrastive Language-Image Pretraining (CLIP) model, our model involves a Temporal Difference Block to capture motions at fine temporal video frames, and a Temporal Alignment Block to re-align the tokens of video clips and phrases and enhance the multi-modal correlation. We conduct thorough ablation studies, and achieve state-of-the-art performance on major text-to-video and video-to-text retrieval benchmarks, including new records of retrieval accuracy on MSR-VTT, MSVD and VATEX.

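Motivation and Objectives

Reformulate video-text retrieval as two independent problems: image-text multimodal learning, and the temporal relations between video frames and text.
Leverage a pretrained image-language model (CLIP) to enable end-to-end training on comparatively small datasets.
Introduce two temporal modules that capture motion and align contextual words with video clips, improving cross-modal retrieval.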

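Proposed Method

Uses CLIP-based initialization for the image-text embeddings and separate temporal modeling of the video frames.
The Temporal Difference Block (TDB) inserts motion-aware tokens between adjacent frame embeddings to strengthen the motion representation.
The Temporal Alignment Block (TAB) learns K shared centers, aligns frame embeddings and word embeddings in a joint space, and re-weights them according to motion relevance.
The global representation f^g and the aligned representation f^a are combined to compute a symmetric contrastive loss.
The model is trained on video-text pairs with a symmetric cross-entropy loss, and the final similarity is computed as the average of the g-embedding and a-embedding similarities.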
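
The symmetric cross-entropy loss mentioned above can be illustrated with a minimal sketch. It assumes a square batch similarity matrix whose diagonal holds the matched video-text pairs, that any temperature scaling has already been applied, and that the two directional losses are simply added; the function names are hypothetical and this is not the authors' implementation.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_cross_entropy(sim):
    """sim[i][j] is the similarity of video i and text j.

    Assumption: matched pairs lie on the diagonal of a square matrix.
    """
    n = len(sim)
    # Video-to-text direction: softmax over each row.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video direction: softmax over each column.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    # Assumption: the two directions are summed with equal weight.
    return v2t + t2v
```

A batch whose diagonal entries dominate their rows and columns gives a lower loss than one where mismatched pairs score highest, which is the behavior the contrastive objective relies on.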

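Experimental Results

Research Questions

RQ1: How can image-language pre-training be transferred effectively to video-text retrieval?
RQ2: Can explicitly modeling temporal information improve video-text alignment without requiring large-scale video-language pre-training?
RQ3: Do the Temporal Difference Block and the Temporal Alignment Block provide measurable gains on standard benchmarks?
RQ4: How does the number of alignment centers affect retrieval performance?
RQ5: How should the global and aligned representations be combined at inference time?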

Key Findings

Method	Text→Video R@1	Text→Video R@5	Text→Video R@10	Text→Video MdR	Text→Video MnR	Video→Text R@1	Video→Text R@5	Video→Text R@10	Video→Text MdR	Video→Text MnR
ours	45.6	72.6	81.7	2.0	14.6	43.5	72.3	82.1	2.0	10.2
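
The column abbreviations in the table are standard retrieval metrics: R@K is the percentage of queries whose correct item ranks in the top K, MdR is the median rank, and MnR is the mean rank. A minimal sketch of how they could be computed from a similarity matrix follows. It assumes exactly one ground-truth candidate per query, located at the matching index; the function name `retrieval_metrics` is hypothetical, and the paper's evaluation code may handle ties or multiple captions differently.

```python
import statistics


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """sim[i][j] is the similarity of query i and candidate j.

    Assumption: the correct candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["MdR"] = statistics.median(ranks)  # median rank, lower is better
    metrics["MnR"] = statistics.mean(ranks)    # mean rank, lower is better
    return metrics
```

Higher is better for R@K, while lower is better for MdR and MnR.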

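Achieves state-of-the-art results for text-to-video and video-to-text retrieval on MSR-VTT, MSVD, and VATEX.
The Temporal Difference Block substantially improves performance by injecting motion-aware tokens before temporal processing.
The Temporal Alignment Block, with its shared centers, improves cross-modal alignment between video frames and contextual words and yields further gains.
A balanced combination of the global and aligned representations (w = 0.5) gives the best retrieval performance.
On MSR-VTT (1k-A protocol), the method achieves a Text→Video R@1 of 45.6 and a Video→Text R@1 of 43.5 (values from Table 3).
On MSR-VTT (1k-A protocol), the method achieves a Text→Video MdR of 2.0 and a Video→Text MdR of 2.0 (values from Table 3).
On VATEX, it achieves strong retrieval performance that surpasses several baselines (Tables 4-5).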
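
The weighted combination of global and aligned representations reported above (w = 0.5) can be sketched as a convex mix of two cosine similarities. This is an illustration under stated assumptions: cosine similarity is used for both terms, w multiplies the global term, and the names `cosine` and `fused_similarity` are hypothetical rather than taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def fused_similarity(video_g, text_g, video_a, text_a, w=0.5):
    """Mix the global (g) and aligned (a) similarities.

    Assumption: w weights the global term and (1 - w) the aligned term.
    """
    return w * cosine(video_g, text_g) + (1.0 - w) * cosine(video_a, text_a)
```

With w = 0.5 this reduces to the plain average of the two similarities; w = 1.0 would use the global representation alone.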

This review was generated by AI and checked by a human editor.