QUICK REVIEW

[Paper Review] Towards Long-Form Spatio-Temporal Video Grounding

Xin Gu, Bing Fan|arXiv (Cornell University)|Feb 26, 2026

Multimodal Machine Learning Applications|Cited by 0

ONE-SENTENCE SUMMARY

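ART-STVG is a memory-augmented autoregressive transformer for long-form spatio-temporal video grounding that processes video frame by frame, uses selective memory to cope with long inputs, and outperforms existing methods on the LF-STVG benchmarks.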

ABSTRACT

In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.

MOTIVATION AND OBJECTIVES

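Motivate the study of spatio-temporal video grounding (STVG) on long-form videos that run well beyond tens of seconds.
Propose ART-STVG, a memory-augmented autoregressive transformer that processes video frame by frame.
Develop memory selection strategies that keep only the relevant spatio-temporal context.
Introduce a cascaded spatio-temporal decoder that uses fine-grained spatial cues for temporal localization during decoding.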

PROPOSED METHOD

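Treat the video as streaming input and process it frame by frame with an autoregressive transformer.
Use two memory banks: a spatial memory for spatial localization and a temporal memory for temporal localization.
Apply memory selection strategies so that only task-relevant memories from past frames are retained.
Adopt a cascaded design in which spatial localization informs temporal localization during decoding.
Use cross-attention-based, memory-augmented decoders, with RoI-pooled fine-grained features for temporal decoding.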
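
The streaming loop described above can be illustrated with a small, dependency-free sketch. It is a toy under stated assumptions, not the authors' implementation: `MemoryBank`, `attend`, and `run_stream` are hypothetical names, the cosine top-k rule stands in for the paper's memory selection strategies (which this review does not specify), the attention has no learned weights, and RoI pooling is omitted entirely. What it does show is the control flow: frames arrive one at a time, each decoder reads only a selected subset of its own memory, and the temporal step is cascaded after the spatial step.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class MemoryBank:
    """Fixed-capacity FIFO store of per-frame feature vectors (hypothetical helper)."""

    def __init__(self, capacity):
        self.items = deque(maxlen=capacity)  # oldest entries are dropped first

    def write(self, feature):
        self.items.append(list(feature))

    def select(self, query, top_k):
        """ASSUMPTION: rank memories by cosine similarity and keep the top_k."""
        ranked = sorted(self.items, key=lambda m: cosine(query, m), reverse=True)
        return ranked[:top_k]


def attend(query, memories):
    """Softmax-weighted average of memories; returns the query if memory is empty."""
    if not memories:
        return list(query)
    scale = math.sqrt(len(query))
    scores = [sum(q * m for q, m in zip(query, mem)) / scale for mem in memories]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * mem[i] for w, mem in zip(weights, memories)) / total
        for i in range(len(query))
    ]


def run_stream(frames, capacity=8, top_k=2):
    """Process frame features sequentially with separate spatial and temporal memory."""
    spatial_bank = MemoryBank(capacity)
    temporal_bank = MemoryBank(capacity)
    outputs = []
    for feat in frames:
        # Spatial step: read selected spatial memories for the current frame.
        spatial = attend(feat, spatial_bank.select(feat, top_k))
        # Cascade: the temporal step is queried with the spatial output.
        temporal = attend(spatial, temporal_bank.select(spatial, top_k))
        spatial_bank.write(spatial)
        temporal_bank.write(temporal)
        outputs.append((spatial, temporal))
    return outputs
```

Because each step touches at most `top_k` memories from a bank of bounded `capacity`, the per-frame cost in this sketch does not grow with video length, which is the property that makes sequential processing attractive for long videos.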

EXPERIMENTAL RESULTS

Research Questions

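RQ1: How can STVG be extended effectively to long-form videos (LF-STVG) without processing all frames at once?
RQ2: Does memory-augmented autoregressive decoding with selective memory improve grounding in long videos?
RQ3: Do cascaded spatial and temporal decoders help exploit fine-grained spatial cues for better temporal localization?
RQ4: How does memory selection affect grounding performance on LF-STVG?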

Key Findings

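ART-STVG outperforms existing STVG methods on the LF-STVG benchmarks across all metrics and all video lengths (LF-STVG-1min/3min/5min).
Compared with TA-STVG, ART-STVG improves m_tIoU/m_vIoU by 0.7/0.9, 9.1/6.8, and 7.3/5.5 across the three video lengths, respectively.
Memory selection in the spatial and temporal decoders yields clear gains over non-selective memory (seen as higher m_tIoU and m_vIoU in the ablation studies).
The cascaded spatio-temporal design outperforms parallel decoders: on LF-STVG-3min, m_tIoU and m_vIoU are 1.5% and 1.4% higher than with the parallel design.
On the HCSTVG-v2 validation set of LF-STVG, ART-STVG reaches an m_tIoU of 28.3, an m_vIoU of 18.8, a vIoU@0.3 of 27.0, and a vIoU@0.5 of 11.9.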
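
For readers unfamiliar with the metrics quoted above, the sketch below computes them as they are commonly defined in the STVG literature: tIoU is the overlap of predicted and ground-truth frame sets, vIoU sums per-frame box IoU over the temporal intersection and divides by the size of the temporal union, and vIoU@R is the share of samples whose vIoU exceeds R. The review does not spell these definitions out, so treating them as the paper's exact protocol is an assumption. The function names and the dict-of-boxes input format are hypothetical, and the sketch returns fractions in [0, 1] rather than the percentages reported above.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def tube_scores(pred, gt):
    """Return (tIoU, vIoU) for one sample.

    pred and gt map frame index -> box; the keys are the frames that the
    prediction or the ground truth places inside the temporal segment.
    """
    union = set(pred) | set(gt)
    inter = set(pred) & set(gt)
    if not union:
        return 0.0, 0.0
    t_iou = len(inter) / len(union)
    # Frames outside the temporal intersection contribute zero to vIoU.
    v_iou = sum(box_iou(pred[t], gt[t]) for t in inter) / len(union)
    return t_iou, v_iou


def summarize(pairs, thresholds=(0.3, 0.5)):
    """Average the per-sample scores over a list of (pred, gt) pairs."""
    if not pairs:
        raise ValueError("need at least one (pred, gt) pair")
    scores = [tube_scores(p, g) for p, g in pairs]
    n = len(scores)
    report = {
        "m_tIoU": sum(t for t, _ in scores) / n,
        "m_vIoU": sum(v for _, v in scores) / n,
    }
    for r in thresholds:
        report[f"vIoU@{r}"] = sum(v > r for _, v in scores) / n
    return report
```

Note that vIoU can never exceed tIoU under these definitions, since each per-frame IoU is at most 1. This is one reason precise temporal localization matters so much on long videos: a poorly placed segment caps the achievable vIoU regardless of box quality.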

This review was generated by AI and checked by human editors.