QUICK REVIEW

[Paper Review] CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment

Xue Hong-wei, Yuchong Sun|arXiv (Cornell University)|Sep 14, 2022

Multimodal Machine Learning Applications | Citations: 53

One-line summary

CLIP-ViP adapts a pre-trained image-text model (CLIP) to video-language alignment, using a Video Proxy mechanism and Omnisource Cross-modal Learning to achieve strong video-text retrieval results on MSR-VTT, DiDeMo, LSMDC, and ActivityNet.

ABSTRACT

The pre-trained image-text models, like CLIP, have demonstrated the strong power of vision-language representation learned from a large scale of web-collected image-text data. In light of the well-learned visual features, some existing works transfer image representation to video domain and achieve good results. However, how to utilize image-language pre-trained model (e.g., CLIP) for video-language pre-training (post-pretraining) is still under explored. In this paper, we investigate two questions: 1) what are the factors hindering post-pretraining CLIP to further improve the performance on video-language tasks? and 2) how to mitigate the impact of these factors? Through a series of comparative experiments and analyses, we find that the data scale and domain gap between language sources have great impacts. Motivated by these, we propose a Omnisource Cross-modal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP. Extensive results show that our approach improves the performance of CLIP on video-text retrieval by a large margin. Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet. We will release our code and pre-trained CLIP-ViP models at https://github.com/microsoft/XPretrain/tree/main/CLIP-ViP.

Motivation and objectives

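Investigate the factors that hinder post-pretraining an image-text model for video-text tasks.
Identify data scale and the language domain gap as the main challenges in video post-pretraining.
Propose a method that bridges these gaps and makes image-text pre-training usable for video-text tasks.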

Proposed method

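Introduce in-domain auxiliary captions, generated by an image captioning model, to reduce the language domain gap.
Design Video Proxy (ViP) tokens and a proxy-guided attention mechanism so that ViT can process both images and videos with minimal modification.
Propose Omnisource Cross-modal Learning (OCL), which jointly trains on video-subtitle and image-caption data with an InfoNCE loss.
Explore several OCL loss variants to find an effective way of integrating cross-modal signals from multiple sources.
Provide training details and extensive ablations to validate each component.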
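
To make the InfoNCE-based joint objective more concrete, here is a minimal sketch in plain Python. It assumes a symmetric (video-to-text plus text-to-video) InfoNCE over cosine similarities, a fixed temperature of 0.05, and a simple sum of a video-subtitle term and a frame-caption term. The function names (`cosine_sim_matrix`, `info_nce`, `ocl_loss`), the temperature value, and the additive combination are illustrative assumptions, not the authors' implementation; the paper explores several loss variants that this summary does not spell out.

```python
import math


def cosine_sim_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in a]
    b = [normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE; sim[i][j] scores visual item i against text j.

    Matched pairs are assumed to lie on the diagonal of the batch.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Visual-to-text direction: each row is a softmax over texts.
    v2t = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Text-to-visual direction: each column is a softmax over visual items.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    t2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return v2t + t2v


def ocl_loss(video_emb, subtitle_emb, frame_emb, caption_emb, temperature=0.05):
    """Hypothetical additive combination of two cross-modal sources."""
    l_vs = info_nce(cosine_sim_matrix(video_emb, subtitle_emb), temperature)
    l_fc = info_nce(cosine_sim_matrix(frame_emb, caption_emb), temperature)
    return l_vs + l_fc
```

In this sketch a batch whose matched pairs are already aligned yields a loss near zero, while a batch with swapped pairs is penalized heavily, which is the behavior a contrastive retrieval objective is expected to have.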

Experimental results

Research questions

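RQ1: What factors prevent CLIP-like models from benefiting from video post-pretraining?
RQ2: How effective are auxiliary data and architectural adaptation at mitigating the data-scale and language-domain gaps during video post-pretraining?
RQ3: When both auxiliary captions and video subtitles are used, can an omnisource cross-modal learning strategy improve video-text retrieval?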

Key findings

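Post-pretraining on small-scale data causes overfitting and degrades performance, whereas large-scale data (HD-VILA-100M) is beneficial.
A large language domain gap between pre-training subtitles and downstream descriptive text can hinder transfer, which motivates the use of auxiliary captions.
Video Proxy tokens with proxy-guided attention improve video-text retrieval over the MeanPool, SeqTransformer, and Full Attention baselines.
Omnisource Cross-modal Learning with auxiliary captions outperforms uni-source approaches, with large gains on MSR-VTT and DiDeMo.
Combining large-scale video-subtitle data with auxiliary captions and cross-modal losses achieves state-of-the-art text-to-video retrieval results on MSR-VTT, DiDeMo, LSMDC, and ActivityNet.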
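
To illustrate how proxy-guided attention differs from the Full Attention baseline, the sketch below builds a boolean attention mask. It assumes one plausible rule: proxy tokens attend to every token, while patch tokens attend only to patches in their own frame plus the proxy tokens. That rule, the token ordering (proxy tokens first, then patches frame by frame), and the function name `proxy_guided_mask` are assumptions for illustration; this summary does not state the exact masking scheme.

```python
def proxy_guided_mask(num_proxy, num_frames, patches_per_frame):
    """Return mask[q][k] == True when query token q may attend to key token k.

    Assumed token order: proxy tokens first, then patches frame by frame.
    """
    total = num_proxy + num_frames * patches_per_frame

    def frame_of(idx):
        # Proxy tokens belong to no frame; patches map to their frame index.
        if idx < num_proxy:
            return None
        return (idx - num_proxy) // patches_per_frame

    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            fq, fk = frame_of(q), frame_of(k)
            # Allowed if either side is a proxy token or both share a frame.
            row.append(fq is None or fk is None or fq == fk)
        mask.append(row)
    return mask


# Usage: 2 proxy tokens, 2 frames, 3 patches per frame (8 tokens in total).
mask = proxy_guided_mask(num_proxy=2, num_frames=2, patches_per_frame=3)
allowed = sum(sum(row) for row in mask)
```

Under these assumptions the mask permits fewer query-key pairs than full attention, and with a single frame it reduces to full attention, which is consistent with the claim that the same ViT can handle both images and videos.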

This review was generated by AI and checked by a human editor.