QUICK REVIEW

[Paper Review] Self-Chained Image-Language Model for Video Localization and Question Answering

Shoubin Yu, Jaemin Cho | arXiv (Cornell University) | May 11, 2023

Multimodal Machine Learning Applications | Citations: 25

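One-Line Summary

SeViLA uses a single image-language model (BLIP-2) both to localize language-aware keyframes in a video and to answer questions about it. By chaining forward localization with backward self-refinement, it reaches state-of-the-art results on several video QA benchmarks.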

ABSTRACT

Recent studies have shown promising results on utilizing large pre-trained image-language models for video question answering. While these image-language models can efficiently bootstrap the representation learning of video-language models, they typically concatenate uniformly sampled video frames as visual inputs without explicit language-aware, temporal modeling. When only a portion of a video input is relevant to the language query, such uniform frame sampling can often lead to missing important visual cues. Although humans often find a video moment to focus on and rewind the moment to answer questions, training a query-aware video moment localizer often requires expensive annotations and high computational costs. To address this issue, we propose Self-Chained Video Localization-Answering (SeViLA), a novel framework that leverages a single image-language model (BLIP-2) to tackle both temporal keyframe localization and QA on videos. SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2. We propose two ways of chaining these modules for cascaded inference and self-refinement. First, in the forward chain, the Localizer finds multiple language-aware keyframes in a video, which the Answerer uses to predict the answer. Second, in the reverse chain, the Answerer generates keyframe pseudo-labels to refine the Localizer, alleviating the need for expensive video moment localization annotations. Our SeViLA framework outperforms several strong baselines on 5 challenging video QA and event prediction benchmarks, and achieves the state-of-the-art in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA, STAR, How2QA, VLEP) settings. We also analyze the impact of Localizer, comparisons of Localizer with other temporal localization models, pre-training/self-refinement of Localizer, and varying the number of keyframes.

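Motivation and Objectives

Make video-language learning efficient by combining a pre-trained image-language model with temporal localization.
Introduce a language-aware keyframe Localizer and a question-answering Answerer, both fine-tuned from BLIP-2.
Enable self-refinement through a forward chain (Localizer -> Answerer) and a reverse chain (refining the Localizer with pseudo-labels).
Demonstrate strong performance on multiple video QA and event prediction benchmarks in both fine-tuning and zero-shot settings.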

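Proposed Method

BLIP-2 serves as the backbone: the image encoder and the LLM are frozen, and only a Q-Former and a linear layer are fine-tuned for each module.
The Localizer selects the top-K language-aware keyframes from uniformly sampled frames, using a language-informed prompt and the LLM to score how relevant each frame is for answering the question.
The Answerer concatenates the features of the selected keyframes and feeds them to the LLM to produce a video-level answer.
The forward chain trains the Answerer on the keyframes chosen by the Localizer, which improves QA performance.
The reverse chain generates frame-level pseudo-labels from the Answerer's outputs and uses them to refine the Localizer without explicit frame-level annotations.
The Localizer is pre-trained on moment retrieval data (QVHighlights), which gives it a prior for frame-level localization.
Together, the two chains (forward inference and backward refinement) improve both temporal localization and QA accuracy.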
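The two chains can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy scores, and the `toy_answerer` stand-in are all hypothetical. It also makes two assumptions that this review does not state: that selected keyframes are passed on in temporal order, and that a frame receives a positive pseudo-label when the Answerer gets the question right from that frame alone.

```python
from typing import Callable, List, Sequence


def select_keyframes(scores: Sequence[float], k: int) -> List[int]:
    """Forward chain: return indices of the k highest-scoring frames."""
    if k <= 0:
        return []
    # Rank frame indices by score, highest first; ties go to the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    # Assumption: restore temporal order before handing frames to the Answerer.
    return sorted(ranked[:k])


def pseudo_label_frames(
    frames: Sequence[str],
    question: str,
    gold_answer: str,
    answer_from_frame: Callable[[str, str], str],
) -> List[int]:
    """Reverse chain: label a frame 1 if the answerer is correct from it alone."""
    return [
        int(answer_from_frame(frame, question) == gold_answer)
        for frame in frames
    ]


# Hypothetical Localizer relevance scores for 8 uniformly sampled frames.
scores = [0.10, 0.80, 0.30, 0.95, 0.20, 0.70, 0.05, 0.60]
keyframes = select_keyframes(scores, k=4)


# Toy stand-in for the Answerer: only frames f1 and f3 show the answer.
def toy_answerer(frame: str, question: str) -> str:
    return "dog" if frame in {"f1", "f3"} else "cat"


frames = [f"f{i}" for i in range(8)]
labels = pseudo_label_frames(frames, "What animal appears?", "dog", toy_answerer)

print(keyframes)  # [1, 3, 5, 7]
print(labels)     # [0, 1, 0, 1, 0, 0, 0, 0]
```

In a real system the scores would come from the Localizer's LLM and the pseudo-labels would serve as training targets for refining it. Here both are plain Python values so the control flow is easy to follow.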

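Experimental Results

Research Questions

RQ1: Can a single image-language model be repurposed to perform both temporal localization and QA on videos?
RQ2: Does language-aware keyframe selection improve video QA and event prediction over uniform frame sampling?
RQ3: Can pseudo-labels derived from QA outputs effectively refine the language-aware Localizer without frame-level annotations?
RQ4: How much does pre-training the Localizer on video moment retrieval data (QVHighlights) affect downstream QA performance?
RQ5: How does SeViLA perform across multiple benchmarks in both fine-tuning and zero-shot settings?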

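Key Findings

SeViLA outperforms several strong baselines on five video QA and event prediction benchmarks.
The zero-shot Localizer + Answerer sets a new state of the art in the zero-shot setting on NExT-QA, STAR, How2QA, and VLEP.
Self-refinement with pseudo-labels consistently improves the Localizer across tasks (the ablation reports the average gain).
Temporal localization with language-aware keyframes substantially improves QA accuracy over uniform frame sampling, especially on temporally demanding tasks.
The Localizer also works as a strong standalone moment retrieval model, with competitive results despite lacking explicit temporal modeling in pre-training.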

This review was generated by AI and checked by a human editor.