QUICK REVIEW

[Paper Review] Towards Long-Form Spatio-Temporal Video Grounding

Xin Gu, Bing Fan|arXiv (Cornell University)|Feb 26, 2026

Multimodal Machine Learning Applications | Citations: 0

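One-Line Summary

ART-STVG introduces a memory-augmented autoregressive transformer for long-form spatio-temporal video grounding. It processes frames sequentially to handle long videos and outperforms existing short-form STVG (SF-STVG) approaches on LF-STVG benchmarks.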

ABSTRACT

In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.

Motivation and Objectives

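Motivates spatio-temporal video grounding (STVG) on long-form videos, well beyond the tens of seconds covered by existing work.
Proposes ART-STVG, a memory-augmented autoregressive transformer that processes video frames sequentially.
Develops memory selection strategies that extract relevant spatio-temporal context.
Introduces a cascaded spatio-temporal decoder in which fine-grained spatial cues assist temporal localization.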

Proposed Method

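Treats the video as streaming input and processes frames sequentially with an autoregressive transformer.
Uses two memory banks: a spatial memory for spatial grounding and a temporal memory for temporal grounding.
Implements a memory selection strategy that retains only task-relevant memories from past frames.
Implements a cascaded design so that spatial grounding guides temporal grounding during decoding.
Adopts a cross-attention-based, memory-augmented decoder for temporal decoding that uses fine-grained RoI-pooled features.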
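
The streaming and memory-selection ideas above can be illustrated with a minimal sketch. This is not the authors' implementation: the review does not describe how ART-STVG selects memories, so the cosine-similarity top-k rule, the bounded FIFO bank, and all names (`MemoryBank`, `process_stream`, `capacity`, `top_k`) are assumptions made purely for illustration. Frame features are plain lists of floats, and the decoder that would consume the selected context is left out.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class MemoryBank:
    """Bounded FIFO store of per-frame features (hypothetical helper)."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.items = deque(maxlen=capacity)

    def write(self, feature):
        self.items.append(feature)

    def select(self, query, top_k):
        """Return up to top_k stored features most similar to the query.

        Assumption: similarity-based selection stands in for the paper's
        memory selection strategies, which the review does not specify.
        """
        ranked = sorted(self.items, key=lambda m: cosine(query, m), reverse=True)
        return ranked[:top_k]


def process_stream(frames, capacity=8, top_k=2):
    """Consume frame features one at a time, never holding the whole video.

    Yields (frame_index, selected_memories); a decoder would attend only
    to the selected memories when predicting for the current frame.
    """
    bank = MemoryBank(capacity)
    for index, feature in enumerate(frames):
        context = bank.select(feature, top_k)  # read past context first
        yield index, context
        bank.write(feature)  # then store the current frame
```

Because the bank is bounded, memory use stays constant regardless of video length, which is the property that makes sequential processing attractive for long videos. In a cascaded variant, one such bank would serve the spatial decoder and a second would serve the temporal decoder.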

Experimental Results

Research Questions

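RQ1 How can STVG be extended effectively to long-form videos (LF-STVG) without processing all frames at once?
RQ2 Can memory-augmented autoregressive decoding with selective memories improve grounding in long-form videos?
RQ3 Does cascading the spatial decoder into the temporal decoder yield better temporal localization by exploiting fine-grained spatial cues?
RQ4 How much does memory selection affect grounding performance in LF-STVG?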

Key Findings

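ART-STVG outperforms existing STVG methods on all metrics and at all video lengths of the LF-STVG benchmark (LF-STVG-1min/3min/5min).
Compared with TA-STVG, ART-STVG improves m_tIoU/m_vIoU by 0.7/0.9, 9.1/6.8, and 7.3/5.5 points, respectively.
Memory selection in both the spatial and temporal decoders yields notable gains over unselected memories (the ablations show improvements in m_tIoU and m_vIoU).
The cascaded spatio-temporal design outperforms parallel decoders, gaining 1.5% in m_tIoU and 1.4% in m_vIoU over the parallel design on LF-STVG-3min.
On the LF-STVG HCSTVG-v2 validation set, ART-STVG achieves m_tIoU 28.3, m_vIoU 18.8, vIoU@0.3 27.0, and vIoU@0.5 11.9.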
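
The review reports m_tIoU, m_vIoU, and vIoU@R without defining them. The sketch below uses the definitions commonly found in STVG work, which is an assumption here rather than something the review states: tIoU is the ratio of shared to combined frames between the predicted and ground-truth tubes, vIoU sums per-frame box IoU over the shared frames and divides by the number of combined frames, and vIoU@R is the fraction of samples whose vIoU exceeds R. The function names and the `{frame_index: (x1, y1, x2, y2)}` tube format are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def t_iou(pred, gt):
    """Temporal IoU: shared frames divided by combined frames."""
    combined = set(pred) | set(gt)
    return len(set(pred) & set(gt)) / len(combined) if combined else 0.0


def v_iou(pred, gt):
    """Spatio-temporal IoU: box IoU summed over shared frames,
    normalised by the number of combined frames."""
    combined = set(pred) | set(gt)
    if not combined:
        return 0.0
    shared = set(pred) & set(gt)
    return sum(box_iou(pred[t], gt[t]) for t in shared) / len(combined)


def summarize(samples, thresholds=(0.3, 0.5)):
    """Average the metrics over (pred, gt) pairs; values are fractions in [0, 1]."""
    t_scores = [t_iou(p, g) for p, g in samples]
    v_scores = [v_iou(p, g) for p, g in samples]
    n = len(samples)
    report = {"m_tIoU": sum(t_scores) / n, "m_vIoU": sum(v_scores) / n}
    for r in thresholds:
        report[f"vIoU@{r}"] = sum(v > r for v in v_scores) / n
    return report
```

Under these definitions a prediction that nails the boxes but overshoots the time span is still penalised in vIoU, since every extra frame enlarges the denominator. The review's figures appear to be these fractions expressed as percentages.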

This review was generated by AI and checked by a human editor.