QUICK REVIEW

[Paper Review] MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

Xiaojie Jin, Bowen Zhang|arXiv (Cornell University)|Jan 19, 2023

Multimodal Machine Learning Applications | Citations: 8

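One-Line Summary

MV-Adapter freezes the CLIP backbone and adds two modules, temporal adaptation in the video branch and cross-modal interaction, achieving performance comparable to or better than full fine-tuning on five VTR benchmarks with negligible parameter overhead.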

ABSTRACT

State-of-the-art video-text retrieval (VTR) methods typically involve fully fine-tuning a pre-trained model (e.g. CLIP) on specific datasets. However, this can result in significant storage costs in practical applications as a separate model per task must be stored. To address this issue, we present our pioneering work that enables parameter-efficient VTR using a pre-trained model, with only a small number of tunable parameters during training. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter utilizes bottleneck structures in both video and text branches, along with two novel components. The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts. We also train weights calibrations to adjust to dynamic variations across frames. The second is Cross Modality Tying that generates weights for video/text branches through sharing cross modality factors, for better aligning between modalities. Thanks to above innovations, MV-Adapter can achieve comparable or better performance than standard full fine-tuning with negligible parameters overhead. Notably, MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks with large margins on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet).

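Research Motivation and Objectives

Motivate and formalize parameter-efficient video-text retrieval (PE-VTR) to reduce storage and training costs.
Adapt pre-trained image-text models such as CLIP to video-text tasks without full fine-tuning.
Design lightweight modules that capture temporal context and cross-modal alignment.
Demonstrate strong performance on standard VTR benchmarks with minimal parameter overhead.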

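Proposed Method

Uses CLIP as the backbone and adds a bottleneck Downsample-Transformer-Upsample structure to both the video and text branches.
Introduces Temporal Adaptation, which injects global and local temporal context and frame-specific calibration weights.
Develops a Cross-Modal Interaction module (called Cross Modality Tying in the abstract) that generates the downsampling weights via Kronecker products over a shared parameter space, for cross-modal alignment.
Shares cross-modal weights between the video and text branches to encourage semantically aligned features and reduce the parameter count.
Provides an efficiency analysis showing the benefits in parameter overhead, deployment/storage, and training memory.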
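
The bottleneck structure listed above can be illustrated with a generic adapter: project a token down to a small width, transform it, project it back up, and add the result to the input. The following is only a minimal sketch under stated assumptions, not the authors' implementation. A plain ReLU stands in for the small transformer that the review places between the downsample and upsample steps, the zero-initialized up-projection is a common adapter convention assumed here, and the class name BottleneckAdapter and all dimensions are hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


class BottleneckAdapter:
    """Generic bottleneck adapter: down-project, nonlinearity, up-project, residual.

    Assumption: the ReLU is a stand-in for the transformer block that the
    review describes between the downsample and upsample steps.
    """

    def __init__(self, d_model, d_bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection (d_model -> d_bottleneck), small random init.
        self.W_down = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                       for _ in range(d_bottleneck)]
        # Up-projection (d_bottleneck -> d_model), zero init (assumption),
        # so the adapter is an identity map before any training.
        self.W_up = [[0.0] * d_bottleneck for _ in range(d_model)]

    def __call__(self, x):
        hidden = relu(matvec(self.W_down, x))
        delta = matvec(self.W_up, hidden)
        # Residual connection: the frozen backbone feature passes through.
        return [xi + di for xi, di in zip(x, delta)]

    def num_params(self):
        """Count trainable weights in the two projections."""
        return (len(self.W_down) * len(self.W_down[0])
                + len(self.W_up) * len(self.W_up[0]))


adapter = BottleneckAdapter(d_model=8, d_bottleneck=2)
token = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 0.25, 3.0]
output = adapter(token)
```

Only the two small projections would be trained in such a setup, while the backbone stays frozen. With the toy sizes used here, the adapter holds 2 x 8 + 8 x 2 = 32 weights.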

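Experimental Results

Research Questions

RQ1: How can video-text retrieval be performed efficiently without fully fine-tuning a large backbone?
RQ2: Can temporal modeling be effectively integrated into a pre-trained image-text model for video data?
RQ3: Can a Cross-Modal Interaction mechanism with a shared parameter space improve modality alignment at minimal parameter cost?
RQ4: What is the trade-off between performance and parameter efficiency on standard VTR benchmarks?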

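Key Findings

MV-Adapter achieves performance comparable to or better than full fine-tuning on five VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, ActivityNet).
Parameter overhead is very small (about 2.56% additional parameters), which substantially improves deployment and training efficiency.
Temporal Adaptation substantially improves VTR results, enabling per-frame integration of temporal context and dynamic calibration across frames.
Cross-Modal Interaction, with a shared weight space built via Kronecker products, improves modality alignment and reduces the parameter count.
MV-Adapter outperforms several PETL baselines (AdaptFormer, Convpass, ST-Adapter) and matches or surpasses full fine-tuning methods on some tasks.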
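
To make the Kronecker-product idea concrete, the sketch below builds the weight matrix of each branch as the Kronecker product of one shared factor and one small modality-specific factor, then compares the number of stored parameters with two independent dense matrices. This is an illustrative guess at the general mechanism, not the paper's exact formulation. The review does not say which factor is shared or how many Kronecker terms are used, and the matrix values, shapes, and variable names are all hypothetical.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]


def count(M):
    """Number of entries in a matrix given as a list of rows."""
    return sum(len(row) for row in M)


# Factor shared by both modalities (hypothetical 2x2 values).
shared = [[1, 2],
          [3, 4]]

# Small modality-specific factors (hypothetical 1x3 values).
video_factor = [[1, 0, 1]]
text_factor = [[0, 1, 1]]

# Each branch's 2x6 weight matrix is generated, not stored directly.
W_video = kron(shared, video_factor)
W_text = kron(shared, text_factor)

# Stored parameters: one shared factor plus two small specific factors.
tied_params = count(shared) + count(video_factor) + count(text_factor)

# Baseline: two independent dense matrices of the same generated shape.
dense_params = count(W_video) + count(W_text)
```

In this toy case the tied scheme stores 10 values instead of 24, and both generated matrices depend on the same shared factor, which is the sense in which the two branches are coupled.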


This review was generated by AI and checked by a human editor.