QUICK REVIEW

[Paper Review] Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Teng Wang, Jinrui Zhang|arXiv (Cornell University)|Mar 11, 2023

Multimodal Machine Learning Applications|Cited by 7

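One-Sentence Summary

The paper proposes a grounded vision-language framework for untrimmed videos that combines bidirectional text-to-event grounding and event-to-text generation with a semantic-aware label assignment for robust event-sentence matching. It achieves state-of-the-art results on dense video captioning and competitive results on vision-language (VL) understanding and generation tasks.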

ABSTRACT

Joint video-language learning has received increasing attention in recent years. However, existing works mainly focus on single or multiple trimmed video clips (events), which makes human-annotated event boundaries necessary during inference. To break away from the ties, we propose a grounded vision-language learning framework for untrimmed videos, which automatically detects informative events and effectively excavates the alignments between multi-sentence descriptions and corresponding event segments. Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments, i.e., text-to-event grounding (TEG) and event-to-text generation (ETG). TEG learns to adaptively ground the possible event proposals given a set of sentences by estimating the cross-modal distance in a joint semantic space. Meanwhile, ETG aims to reconstruct (generate) the matched texts given event proposals, encouraging the event representation to retain meaningful semantic information. To encourage accurate label assignment between the event set and the text set, we propose a novel semantic-aware cost to mitigate the sub-optimal matching results caused by ambiguous boundary annotations. Our framework is easily extensible to tasks covering visually-grounded language understanding and generation. We achieve state-of-the-art dense video captioning performance on ActivityNet Captions, YouCook2 and YouMakeup, and competitive performance on several other language generation and understanding tasks. Our method also achieved 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge. Our code is publicly available at https://github.com/zjr2000/GVL.

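Motivation and Objectives

Learn discriminative, temporally sensitive event representations from untrimmed videos using natural-language supervision.
Develop two dual pretext tasks, text-to-event grounding (TEG) and event-to-text generation (ETG), to encourage fine-grained alignment between video and language.
Introduce a semantic-aware label assignment strategy that yields robust one-to-one matching in the presence of boundary annotation noise.
Demonstrate extensibility to visually-grounded language understanding and generation tasks across multiple datasets.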

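Proposed Method

Encode an untrimmed video into a set of event proposals using a transformer-based event detector with learnable queries.
Compute cross-modal similarity in a joint vision-language space and apply a contrastive loss so that text-to-event grounding (TEG) links events to sentences.
Generate sentences from event proposals through the event-to-text generation (ETG) module, while also predicting temporal boundaries and confidence scores.
Use semantic-aware label assignment, which combines a semantic similarity cost (based on cross-modal similarity) with a localization cost, to obtain robust one-to-one matching.
Train with a joint objective that co-optimizes the ETG and TEG losses, guided by the semantic-aware matching.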
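
As a rough illustration of the matching and grounding steps listed above, the following Python sketch pairs sentences with event proposals using a combined localization and semantic cost, then computes a contrastive grounding loss over the matched pairs. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `1 - IoU` localization cost, the cosine-based semantic cost, the weights `alpha` and `beta`, the temperature `tau`, and the toy data are all hypothetical. It also brute-forces the assignment over permutations, which is only practical for tiny sets; a real system would more likely use the Hungarian algorithm.

```python
import math
from itertools import permutations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def temporal_iou(seg_a, seg_b):
    """Intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def match(events, sentences, alpha=1.0, beta=1.0):
    """One-to-one assignment of sentences to events (assumed form of the cost).

    events:    list of (predicted_segment, event_embedding)
    sentences: list of (annotated_segment, sentence_embedding)
    Returns a list `assign` where assign[j] is the event index for sentence j.
    Requires len(events) >= len(sentences).
    """
    cost = [[alpha * (1.0 - temporal_iou(e_seg, s_seg))    # localization cost
             + beta * (1.0 - cosine(e_emb, s_emb))         # semantic cost
             for e_seg, e_emb in events]
            for s_seg, s_emb in sentences]
    n = len(sentences)
    # Brute force: try every injective sentence -> event mapping.
    best = min(permutations(range(len(events)), n),
               key=lambda p: sum(cost[j][p[j]] for j in range(n)))
    return list(best)

def teg_contrastive_loss(events, sentences, assign, tau=0.1):
    """InfoNCE-style loss: each sentence should score its matched event highest."""
    total = 0.0
    for j, (_, s_emb) in enumerate(sentences):
        logits = [cosine(s_emb, e_emb) / tau for _, e_emb in events]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[assign[j]]
    return total / len(sentences)

# Toy data (hypothetical): three proposals, two annotated sentences.
events = [((0.0, 10.0), [1.0, 0.0]),
          ((8.0, 20.0), [0.0, 1.0]),
          ((30.0, 40.0), [1.0, 1.0])]
sentences = [((4.0, 14.0), [0.0, 1.0]),   # boundary overlaps events 0 and 1
             ((25.0, 35.0), [1.0, 1.0])]

iou_only = match(events, sentences, alpha=1.0, beta=0.0)
semantic_aware = match(events, sentences, alpha=1.0, beta=1.0)
print(iou_only, semantic_aware)
print(teg_contrastive_loss(events, sentences, semantic_aware))
```

In this toy setup the first sentence's annotated boundary overlaps two proposals almost equally, so a localization-only cost picks the proposal whose embedding disagrees with the sentence, while adding the semantic term switches the match to the semantically closer proposal.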

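Experimental Results

Research Questions

RQ1: Do bidirectional supervision signals (TEG and ETG) produce more discriminative and semantically richer event representations from untrimmed videos?
RQ2: Does semantic-aware label assignment improve robustness to noisy boundaries compared with conventional IoU-based matching?
RQ3: How does the proposed framework perform on dense video captioning and diverse VL understanding/generation tasks across multiple untrimmed-video datasets?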

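Key Findings

Achieves state-of-the-art dense video captioning results on ActivityNet Captions, YouCook2, and YouMakeup.
Shows competitive performance on other language generation and understanding tasks, including video paragraph captioning and single- and multi-sentence video grounding.
Semantic-aware matching is robust to boundary noise and outperforms the baseline.
The bidirectional pretext tasks improve both captioning quality and temporal action localization.
Reports 1st place in both the MTVG and MDVC tasks of the PIC 4th Challenge, supporting the method's practical effectiveness.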


This review was generated by AI and checked by a human editor.