QUICK REVIEW

[Paper Review] Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video

Zhenfang Chen, Lin Ma|arXiv (Cornell University)|Jan 25, 2020

Multimodal Machine Learning Applications|References: 26|Citations: 48

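One-Sentence Summary

This paper proposes a two-stage, weakly-supervised temporal grounding method: it first selects a coarse video segment using multi-scale sliding windows and multiple instance learning (MIL), then refines that segment to precise frame boundaries using fine-grained frame-sentence interaction and watershed-based grouping.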

ABSTRACT

In this paper, we study the problem of weakly-supervised temporal grounding of sentence in video. Specifically, given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence, with no reliance on any temporal annotation during training. We propose a two-stage model to tackle this problem in a coarse-to-fine manner. In the coarse stage, we first generate a set of fixed-length temporal proposals using multi-scale sliding windows, and match their visual features against the sentence features to identify the best-matched proposal as a coarse grounding result. In the fine stage, we perform a fine-grained matching between the visual features of the frames in the best-matched proposal and the sentence features to locate the precise frame boundary of the fine grounding result. Comprehensive experiments on the ActivityNet Captions dataset and the Charades-STA dataset demonstrate that our two-stage model achieves compelling performance.

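Motivation and Objectives

Reduce the reliance on costly temporal annotations when grounding sentences in video.
Localize the video segment that semantically matches the query without using any temporal annotation during training.
Develop a coarse-to-fine framework that produces accurate start and end timestamps.
Use MIL to learn from video-sentence pairs and sliding-window proposals.
Evaluate on ActivityNet Captions and Charades-STA to demonstrate effectiveness.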

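Proposed Method

Encode the sentence with GloVe embeddings followed by a Bi-LSTM.
Encode the video from frame features, using a Bi-LSTM to add contextual information.
Generate fixed-length temporal proposals with multi-scale sliding windows that overlap by 80%.
Coarse stage: a two-stream grounder (classification + selection) computes and fuses multimodal scores, and is trained with MIL.
Fine stage: expand the coarse segment, model frame-level interaction with the sentence to predict a score per frame, then apply watershed-based grouping to obtain precise boundaries.
Train in two stages: first the coarse stage with an MIL loss, then the fine stage with a ranking-based loss that separates matched video-sentence pairs from mismatched ones.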
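
The proposal-generation step can be illustrated with a minimal sketch. It assumes windows are measured in frames, that 80% overlap means a stride of 20% of the window length, and that trailing frames not covered by a full window are dropped. The function name `sliding_window_proposals` and the window sizes are hypothetical and are not taken from the paper.

```python
def sliding_window_proposals(num_frames, window_sizes, overlap=0.8):
    """Return (start, end) frame-index pairs, with end exclusive.

    Assumption: the stride is (1 - overlap) * window size, rounded to
    the nearest frame and never smaller than one frame.
    """
    proposals = []
    for size in window_sizes:
        if size > num_frames:
            continue  # this scale does not fit into the video
        stride = max(1, round(size * (1.0 - overlap)))
        start = 0
        while start + size <= num_frames:
            proposals.append((start, start + size))
            start += stride
    return proposals


# Hypothetical example: a 100-frame video and two window scales.
proposals = sliding_window_proposals(num_frames=100, window_sizes=[20, 40])
```

In the coarse stage, each of these proposals would be scored against the sentence features, and the best-matched one becomes the coarse grounding result.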

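Experimental Results

Research Questions

RQ1: Can competitive temporal grounding performance be achieved under weak supervision (no temporal annotations)?
RQ2: Does the coarse-to-fine strategy improve boundary accuracy over a single-stage approach?
RQ3: How does fine-grained frame-level interaction affect grounding accuracy compared with proposal-level (coarse) inference?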

Key Findings

Method	R@1 IoU=0.1	R@1 IoU=0.3	R@1 IoU=0.5	mIoU
ActivityNet Captions - CTRL (fully-supervised)	49.1	28.7	14.0	20.5
ActivityNet Captions - Yuan et al. (fully-supervised)	73.3	55.7	36.8	37.0
ActivityNet Captions - Xu et al. (fully-supervised)	-	45.3	27.7	-
ActivityNet Captions - He et al. (fully-supervised)	-	-	36.9	-
ActivityNet Captions - Mithun et al. (weakly-supervised)	62.7	42.0	23.3	28.2
ActivityNet Captions - Gao et al. (GRU, weakly-supervised)	74.0	42.3	22.5	31.8
ActivityNet Captions - Gao et al. (BERT, weakly-supervised)	75.4	42.8	22.7	32.2
ActivityNet Captions - Ours (weakly-supervised)	74.2	44.3	23.6	32.2
Charades-STA - CTRL (fully-supervised)	-	23.6	8.9	-
Charades-STA - Xu et al. (fully-supervised)	54.7	35.6	15.8	-
Charades-STA - He et al. (fully-supervised)	-	36.7	-	-
Charades-STA - Mithun et al. (weakly-supervised)	32.1	19.9	8.8	-
Charades-STA - Ours (weakly-supervised)	39.8	27.3	12.9	27.3
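
For readers unfamiliar with the column headers, the sketch below shows one common way to compute temporal IoU, R@1 at an IoU threshold, and mIoU for top-1 predictions. It assumes segments are (start, end) pairs in seconds and that a prediction counts as a hit when its IoU is at least the threshold; the paper's evaluation code may differ in such details. All function names and the toy data are hypothetical.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(preds, gts, threshold):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumption: a hit is IoU >= threshold.
    """
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)


def mean_iou(preds, gts):
    """Average IoU of the top-1 predictions, as a percentage."""
    return 100.0 * sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)


# Toy data for illustration only, not results from the paper.
preds = [(0.0, 10.0), (5.0, 15.0), (20.0, 30.0)]
gts = [(0.0, 10.0), (10.0, 20.0), (0.0, 10.0)]
r1_at_03 = recall_at_1(preds, gts, 0.3)
r1_at_05 = recall_at_1(preds, gts, 0.5)
miou = mean_iou(preds, gts)
```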

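The proposed two-stage model achieves competitive results on ActivityNet Captions and Charades-STA under weak supervision.
The coarse stage alone substantially outperforms random proposal selection, providing a robust basis for grounding.
The fine stage's frame-level interaction and watershed-based grouping improve temporal boundary accuracy over the coarse result.
The full coarse-to-fine model outperforms the baselines and several weakly-supervised methods, and approaches or exceeds some fully-supervised methods on key metrics.
Ablations show that the two-stream coarse grounder benefits the coarse stage, while the FC-based fine-stage grounder yields finer-grained localization.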
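
To give a rough intuition for how per-frame scores can be turned into a segment, here is a deliberately simplified sketch: frames scoring at or above a threshold are grouped into contiguous runs, and the run with the largest total score is kept. This is an assumption-laden stand-in, not the watershed-based grouping procedure used in the paper, and the function names, threshold, and scores are hypothetical.

```python
def group_frames(scores, threshold):
    """Group consecutive frames with score >= threshold into (start, end) runs.

    End indices are exclusive. Simplified single-threshold grouping only.
    """
    segments = []
    start = None
    for i, score in enumerate(scores):
        if score >= threshold and start is None:
            start = i  # a new run begins
        elif score < threshold and start is not None:
            segments.append((start, i))  # the current run ends
            start = None
    if start is not None:
        segments.append((start, len(scores)))
    return segments


def best_segment(scores, threshold):
    """Return the run with the highest summed score, or None if no run exists."""
    segments = group_frames(scores, threshold)
    if not segments:
        return None
    return max(segments, key=lambda seg: sum(scores[seg[0]:seg[1]]))


# Hypothetical per-frame matching scores inside an expanded coarse segment.
frame_scores = [0.1, 0.2, 0.7, 0.8, 0.9, 0.3, 0.6, 0.2]
runs = group_frames(frame_scores, threshold=0.5)
refined = best_segment(frame_scores, threshold=0.5)
```

The refined run gives frame-level boundaries that can be tighter than the fixed-length coarse proposal it was derived from.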

This review was generated by AI and checked by a human editor.