QUICK REVIEW

[Paper Review] Synthetic Visual Genome 2: Extracting Large-scale Spatio-Temporal Scene Graphs from Videos

Ziqi Gao, Jieyu Zhang|arXiv (Cornell University)|2026. 02. 26.

Multimodal Machine Learning Applications | Citations: 0

One-line summary

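The paper presents SVG2, a large-scale synthetic panoptic video scene graph dataset, and introduces TRaSER, a model that converts raw videos and panoptic trajectories into spatio-temporal scene graphs in a single forward pass and shows substantial gains over baselines.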

ABSTRACT

We introduce Synthetic Visual Genome 2 (SVG2), a large-scale panoptic video scene graph dataset. SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, providing an order-of-magnitude increase in scale and diversity over prior spatio-temporal scene graph datasets. To create SVG2, we design a fully automated pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with automatic new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference. Human verification of SVG2 annotation accuracy confirms its reliability (objects: 93.8%, attributes: 88.3%, relations: 85.4%). Building on this resource, we train TRaSER, a video scene graph generation model. TRaSER augments VLMs with a trajectory-aligned token arrangement mechanism and new modules: an object-trajectory resampler and a temporal-window resampler to convert raw videos and panoptic trajectories into compact spatio-temporal scene graphs in a single forward pass. The temporal-window resampler binds visual tokens to short trajectory segments to preserve local motion and temporal semantics, while the object-trajectory resampler aggregates entire trajectories to maintain global context for objects. On the PVSG, VIPSeg, VidOR and SVG2 test datasets, TRaSER improves relation detection by +15 to 20%, object prediction by +30 to 40% over the strongest open-source baselines and by +13% over GPT-5, and attribute prediction by +15%. When TRaSER's generated scene graphs are sent to a VLM for video question answering, it delivers a +1.5 to 4.6% absolute accuracy gain over using video only or video augmented with Qwen2.5-VL's generated scene graphs, demonstrating the utility of explicit spatio-temporal scene graphs as an intermediate representation.
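The headline counts in the abstract can be turned into rough annotation densities. The snippet below is only a back-of-the-envelope sketch: it treats "over 636K videos" as exactly 636,000, so the per-video averages are approximate, and the helper name `density` is introduced here for illustration.

```python
# Counts as reported in the abstract. "Over 636K" is treated as exactly
# 636,000 here, which is an assumption made for this estimate.
videos = 636_000
objects = 6_600_000
attributes = 52_000_000
relations = 6_700_000


def density(total: int, units: int) -> float:
    """Average number of annotations per unit, rounded to one decimal."""
    return round(total / units, 1)


objects_per_video = density(objects, videos)
attributes_per_object = density(attributes, objects)
relations_per_video = density(relations, videos)

print(objects_per_video, attributes_per_object, relations_per_video)
```

Under that assumption, the dataset averages roughly 10.4 objects and 10.5 relations per video, and about 7.9 attributes per object.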

Motivation and objectives

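Argue for the need for dense, temporally grounded video scene graphs that cover open-vocabulary objects and relations.
Build a scalable, automated pipeline to synthesize SVG2 at scale, with panoptic trajectories, attributes, and relations.
Develop TRaSER, which parses a video into a structured spatio-temporal scene graph in a single forward pass.
Demonstrate TRaSER's effectiveness on open benchmarks and show its usefulness for video QA when scene graphs serve as an intermediate representation.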

Proposed method

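Develop a fully automated SVG2 synthesis pipeline that combines multi-scale panoptic segmentation, online-offline trajectory tracking with new-object discovery, per-trajectory semantic parsing, and GPT-5-based spatio-temporal relation inference.
Introduce a trajectory-aligned token arrangement mechanism that binds ViT tokens to object trajectories and preserves object identity over time.
Propose a dual resampler: an object-trajectory resampler for global object context and a temporal-window resampler for local motion and temporal semantics.
Train TRaSER on SVG2 and external video datasets with task-specific prompts so that it outputs a structured scene graph in a single pass.
Evaluate TRaSER against open-source baselines and GPT-5, and assess its impact on video QA when scene graphs are used as the intermediate representation.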
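The two resamplers described above can be illustrated with a toy grouping routine. This is a minimal sketch and not the paper's implementation: it assumes each visual token already carries a frame index and a trajectory id, and it uses plain mean pooling as a stand-in for the learned resampler modules. The names `arrange_by_trajectory`, `resample`, and the `window` parameter are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical token layout: (frame_index, trajectory_id, feature_vector).
Token = Tuple[int, int, List[float]]


def mean_pool(features: List[List[float]]) -> List[float]:
    """Average a non-empty list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def arrange_by_trajectory(tokens: List[Token]) -> Dict[int, List[Token]]:
    """Group tokens by trajectory id, ordered by frame within each group."""
    grouped: Dict[int, List[Token]] = defaultdict(list)
    for tok in tokens:
        grouped[tok[1]].append(tok)
    return {tid: sorted(toks, key=lambda t: t[0])
            for tid, toks in sorted(grouped.items())}


def resample(tokens: List[Token], window: int):
    """Return (global_ctx, local_ctx).

    global_ctx: one pooled vector per trajectory (whole-trajectory context).
    local_ctx: one pooled vector per (trajectory, window index) segment.
    Mean pooling is an assumption standing in for learned resamplers.
    """
    global_ctx: Dict[int, List[float]] = {}
    local_ctx: Dict[Tuple[int, int], List[float]] = {}
    for tid, toks in arrange_by_trajectory(tokens).items():
        global_ctx[tid] = mean_pool([feat for _, _, feat in toks])
        segments: Dict[int, List[List[float]]] = defaultdict(list)
        for frame, _, feat in toks:
            segments[frame // window].append(feat)
        for w, feats in sorted(segments.items()):
            local_ctx[(tid, w)] = mean_pool(feats)
    return global_ctx, local_ctx


tokens = [
    (0, 1, [1.0, 0.0]), (1, 1, [3.0, 0.0]),
    (2, 1, [5.0, 0.0]), (3, 1, [7.0, 0.0]),
    (0, 2, [0.0, 2.0]), (3, 2, [0.0, 4.0]),
]
global_ctx, local_ctx = resample(tokens, window=2)
print(global_ctx)
print(local_ctx)
```

With `window=2`, trajectory 1 yields one global vector and two window-level vectors, so short-range motion (the change between windows) remains visible while the global vector summarizes the whole track.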

Figure 1: Synthetic Visual Genome 2 (SVG2), a large-scale synthetic panoptic video scene graph dataset. SVG2 provides dense panoptic trajectories, fine-grained object categories and attributes, and temporally grounded spatio-temporal relations across over 636K videos, which is an order-of-magnit…

Experimental results

Research questions

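RQ1: Can a fully automated pipeline generate dense, temporally grounded video scene graphs at scale?
RQ2: How do trajectory alignment and the dual-resampler design affect object grounding and relation inference in video scene graphs?
RQ3: Do SVG2-derived graphs improve downstream tasks such as video QA compared with baselines and existing scene graphs?
RQ4: What does combining synthetic SVG2 data with real-world video annotations contribute to video scene graph (VSG) performance?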

Key results

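SVG2 contains over 636K videos with 6.6M objects, 52.0M attributes, and 6.7M relations, an order-of-magnitude increase in scale over prior spatio-temporal scene graph datasets.
TRaSER improves relation detection by +15 to 20% and object prediction by +30 to 40% over the strongest open-source baselines; on object prediction it also leads GPT-5 by +13%.
TRaSER improves attribute prediction by +15% over the open-source state of the art and performs strongly on the SVG2 test set.
When TRaSER-generated graphs are passed to a VLM for video QA, absolute accuracy rises by +1.5 to +4.6% over using the video alone or the video augmented with Qwen2.5-VL's generated scene graphs.
The proposed LLM-based judge shows substantial agreement with human annotators on object and relation evaluation, supporting the validity of automated semantic evaluation.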
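To make the "scene graph as intermediate representation" idea concrete, the sketch below shows one possible way to hold a spatio-temporal scene graph in memory and flatten it into text for a VLM prompt. Everything here is an assumption for illustration: the review does not describe TRaSER's output schema or how graphs are fed to the QA model, so the field names, the text layout, and the prompt wording are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ObjectNode:
    track_id: int                      # identity of the panoptic trajectory
    category: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Relation:
    subject_id: int
    predicate: str
    object_id: int
    start_frame: int                   # temporal grounding of the relation
    end_frame: int


def serialize(objects: List[ObjectNode], relations: List[Relation]) -> str:
    """Flatten the graph into plain text (hypothetical format)."""
    names = {o.track_id: f"{o.category}#{o.track_id}" for o in objects}
    lines = ["Objects:"]
    for o in objects:
        attrs = ", ".join(o.attributes) if o.attributes else "none"
        lines.append(f"- {names[o.track_id]} ({attrs})")
    lines.append("Relations:")
    for r in sorted(relations, key=lambda rel: rel.start_frame):
        lines.append(
            f"- [{r.start_frame}-{r.end_frame}] "
            f"{names[r.subject_id]} {r.predicate} {names[r.object_id]}"
        )
    return "\n".join(lines)


def build_qa_prompt(graph_text: str, question: str) -> str:
    """Prepend the serialized graph to the question (hypothetical wording)."""
    return f"Scene graph for the video:\n{graph_text}\n\nQuestion: {question}"


objects = [ObjectNode(1, "person", ["standing"]), ObjectNode(2, "cup")]
relations = [Relation(1, "holding", 2, start_frame=10, end_frame=40)]
graph_text = serialize(objects, relations)
prompt = build_qa_prompt(graph_text, "What is the person holding?")
print(prompt)
```

Keeping the frame span on each relation is what distinguishes this from a static image scene graph: the QA model can see when a relation holds, not only that it holds.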

Figure 2: Overview of the SVG2 synthesis pipeline. Phase 1: panoptic trajectory generation with an online-offline object tracking mechanism that discovers new objects and preserves identity consistency. Phase 2: per-trajectory description and semantic parsing. Phase 3: GPT-5-based spatio-temporal relatio…

This review was generated by AI and reviewed by a human editor.