QUICK REVIEW

[Paper Review] Towards Long-Form Spatio-Temporal Video Grounding

Xin Gu, Bing Fan | arXiv (Cornell University) | 2026. 02. 26.

Multimodal Machine Learning Applications | Citations: 0

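One-Line Summary

ART-STVG introduces a memory-augmented autoregressive transformer for long-form spatio-temporal video grounding: it processes frames sequentially with selective memory to handle long videos, and it outperforms existing short-form (SF-STVG) approaches on the LF-STVG benchmarks.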

ABSTRACT

In real scenarios, videos can span several minutes or even hours. However, existing research on spatio-temporal video grounding (STVG), given a textual query, mainly focuses on localizing targets in short videos of tens of seconds, typically less than one minute, which limits real-world applications. In this paper, we explore Long-Form STVG (LF-STVG), which aims to locate targets in long-term videos. Compared with short videos, long-term videos contain much longer temporal spans and more irrelevant information, making it difficult for existing STVG methods that process all frames at once. To address this challenge, we propose an AutoRegressive Transformer architecture for LF-STVG, termed ART-STVG. Unlike conventional STVG methods that require the entire video sequence to make predictions at once, ART-STVG treats the video as streaming input and processes frames sequentially, enabling efficient handling of long videos. To model spatio-temporal context, we design spatial and temporal memory banks and apply them to the decoders. Since memories from different moments are not always relevant to the current frame, we introduce simple yet effective memory selection strategies to provide more relevant information to the decoders, significantly improving performance. Furthermore, instead of parallel spatial and temporal localization, we propose a cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder, allowing fine-grained spatial cues to assist complex temporal localization in long videos. Experiments on newly extended LF-STVG datasets show that ART-STVG significantly outperforms state-of-the-art methods, while achieving competitive performance on conventional short-form STVG.

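Motivation and Objectives

Extend spatio-temporal video grounding (STVG) to long-form videos that run well beyond tens of seconds.
Propose ART-STVG, a memory-augmented autoregressive transformer that processes video frames sequentially.
Develop memory selection strategies that filter for relevant spatio-temporal context.
Introduce a cascaded spatio-temporal decoder that uses fine-grained spatial cues for temporal localization.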

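Proposed Method

Treats the video as streaming input and processes frames sequentially with an autoregressive transformer.
Uses two memory banks: a spatial memory for spatial grounding and a temporal memory for temporal grounding.
Applies memory selection strategies so that only task-relevant memories from past frames are kept.
Adopts a cascaded design in which spatial grounding guides temporal grounding during decoding.
Uses cross-attention-based, memory-augmented decoders, with RoI-pooled fine-grained features fed into temporal decoding.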
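
The streaming-plus-memory idea above can be illustrated with a small sketch. It is not the authors' implementation. The review does not state how ART-STVG scores memory relevance, so top-k cosine similarity is used here purely as a stand-in. The names `select_memory`, `fuse`, and `stream` are hypothetical, and `fuse` is a trivial placeholder for a memory-augmented decoder step, not a cross-attention layer.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def select_memory(bank: List[Vector], query: Vector, k: int) -> List[Vector]:
    """Keep the k stored entries most similar to the current frame feature.

    ASSUMPTION: top-k cosine similarity is an illustrative stand-in; the
    review does not give the paper's actual selection criterion.
    """
    ranked = sorted(bank, key=lambda m: cosine(m, query), reverse=True)
    return ranked[:k]


def fuse(query: Vector, memories: List[Vector]) -> Vector:
    """Hypothetical placeholder for a decoder step: averages the current
    frame feature with the selected memories, dimension by dimension."""
    vectors = [query] + memories
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(len(query))]


def stream(frames: List[Vector], k: int = 2) -> List[Vector]:
    """Process frames one at a time, reading from and writing to a memory bank."""
    bank: List[Vector] = []
    outputs: List[Vector] = []
    for feat in frames:
        selected = select_memory(bank, feat, k)  # read only relevant history
        outputs.append(fuse(feat, selected))     # decode the current frame
        bank.append(feat)                        # store it for later frames
    return outputs
```

Because each step touches only the current frame and at most `k` selected memories, the per-frame decoding cost in this sketch does not grow with video length. That is the property the streaming formulation is meant to provide.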

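Experimental Results

Research Questions

RQ1: How can STVG be extended effectively to long-form videos (LF-STVG) without processing all frames at once?
RQ2: Can memory-augmented autoregressive decoding with selective memory improve grounding in long videos?
RQ3: Does cascading the spatial and temporal decoders let finer-grained spatial cues improve temporal localization?
RQ4: How does memory selection affect grounding performance in LF-STVG?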

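Key Results

ART-STVG outperforms existing STVG methods and baselines on the LF-STVG benchmarks across all metrics and video lengths (LF-STVG-1min/3min/5min).
Compared with TA-STVG, ART-STVG improves m_tIoU/m_vIoU by 0.7/0.9, 9.1/6.8, and 7.3/5.5 at the three video lengths, respectively.
In both the spatial and temporal decoders, memory selection yields substantial gains over unselected memory (for example, higher m_tIoU and m_vIoU in the ablation study).
The cascaded spatio-temporal design outperforms parallel decoders, with gains of 1.5% in m_tIoU and 1.4% in m_vIoU over the parallel design on LF-STVG-3min.
On the HCSTVG-v2 validation set for LF-STVG, ART-STVG achieves m_tIoU 28.3, m_vIoU 18.8, vIoU@0.3 27.0, and vIoU@0.5 11.9.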
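
The review does not define the metrics reported above. The sketch below follows the definitions commonly used in STVG work, which is an assumption about how this paper computes them. tIoU is the ratio of intersecting to union frames between the predicted and annotated segments. vIoU sums per-frame box IoU over the intersecting frames and divides by the number of union frames. vIoU@R is the fraction of samples whose vIoU exceeds R. m_tIoU and m_vIoU are the means over all samples. The function names are illustrative.

```python
from typing import Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tiou_viou(pred: Dict[int, Box], gt: Dict[int, Box]) -> Tuple[float, float]:
    """pred and gt map frame index -> box for the frames inside each tube."""
    inter_frames = pred.keys() & gt.keys()
    union_frames = pred.keys() | gt.keys()
    if not union_frames:
        return 0.0, 0.0
    tiou = len(inter_frames) / len(union_frames)
    viou = sum(box_iou(pred[t], gt[t]) for t in inter_frames) / len(union_frames)
    return tiou, viou


def viou_at(vious: Sequence[float], threshold: float) -> float:
    """Fraction of samples whose vIoU is strictly above the threshold."""
    return sum(v > threshold for v in vious) / len(vious)
```

Since vIoU is normalized by the union of frames, a prediction with accurate boxes but a poorly placed temporal segment still scores low. This is why temporal localization weighs so heavily in long videos.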

This review was generated by AI and reviewed by a human editor.