QUICK REVIEW

[Paper Review] UDVideoQA: A Traffic Video Question Answering Dataset for Multi-Object Spatio-Temporal Reasoning in Urban Dynamics

Joseph Raj Vishal, Nagasiri Poluri|arXiv (Cornell University)|Feb 24, 2026

Multimodal Machine Learning Applications | Citations: 0

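ONE-LINE SUMMARY

UDVideoQA introduces a large-scale, privacy-preserving traffic video QA dataset (16 hours of footage, 28,800 QA pairs) together with the VideoQGen benchmark. It reveals a persistent perception-reasoning gap in VideoLMs and shows that fine-tuned open models can approach the performance of proprietary systems.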

ABSTRACT

Understanding the complex, multi-agent dynamics of urban traffic remains a fundamental challenge for video language models. This paper introduces Urban Dynamics VideoQA, a benchmark dataset that captures the unscripted real-world behavior of dynamic urban scenes. UDVideoQA is curated from 16 hours of traffic footage recorded at multiple city intersections under diverse traffic, weather, and lighting conditions. It employs an event-driven dynamic blur technique to ensure privacy preservation without compromising scene fidelity. Using a unified annotation pipeline, the dataset contains 28K question-answer pairs generated across 8 hours of densely annotated video, averaging one question per second. Its taxonomy follows a hierarchical reasoning level, spanning basic understanding and attribution to event reasoning, reverse reasoning, and counterfactual inference, enabling systematic evaluation of both visual grounding and causal reasoning. Comprehensive experiments benchmark 10 SOTA VideoLMs on UDVideoQA and 8 models on a complementary video question generation benchmark. Results reveal a persistent perception-reasoning gap, showing models that excel in abstract inference often fail with fundamental visual grounding. While models like Gemini Pro achieve the highest zero-shot accuracy, fine-tuning the smaller Qwen2.5-VL 7B model on UDVideoQA bridges this gap, achieving performance comparable to proprietary systems. In VideoQGen, Gemini 2.5 Pro, and Qwen3 Max generate the most relevant and complex questions, though all models exhibit limited linguistic diversity, underscoring the need for human-centric evaluation. The UDVideoQA suite, including the dataset, annotation tools, and benchmarks for both VideoQA and VideoQGen, provides a foundation for advancing robust, privacy-aware, and real-world multimodal reasoning. UDVideoQA is available at https://ud-videoqa.github.io/UD-VideoQA/UD-VideoQA/.

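Motivation and Objectives

Capture real-world, multi-agent urban traffic dynamics under varied lighting and weather conditions, with dense question-answer annotations.
Preserve privacy through event-driven motion blurring while maintaining scene fidelity.
Create VideoQA and VideoQGen benchmarks that evaluate grounding, temporal reasoning, and causal reasoning in traffic scenes.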

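Proposed Method

Curate 16 hours of surveillance footage (1.7M frames) at 30 fps from urban intersections under diverse conditions.
Split the footage into 10-second clips and apply event-driven, motion-based blurring to preserve privacy.
Generate QA pairs automatically with a VideoQGen-based annotation pipeline combined with human-in-the-loop verification.
Define a hierarchical QA taxonomy covering basic understanding, attribution, event reasoning, reverse reasoning, and counterfactual inference.
Evaluate 10 state-of-the-art VideoLMs on VideoQA and 8 models on VideoQGen, in zero-shot and fine-tuned settings.
Adopt an LLM-judge-based semantic evaluation approach, together with a weighted complexity score for measuring reasoning correctness.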
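
The scale figures quoted in this review can be cross-checked with a few lines of arithmetic. The sketch below is not the authors' pipeline. It assumes consecutive, non-overlapping 10-second clips (the review does not say whether clips overlap), and the function name `clip_boundaries` is hypothetical. It shows that 16 hours at 30 fps gives about 1.7M frames, and that one question per second over 8 annotated hours gives 28,800 QA pairs.

```python
# Sanity check of the dataset scale quoted in this review.
# Assumption: clips are consecutive and non-overlapping.
FPS = 30
TOTAL_HOURS = 16       # total recorded footage
ANNOTATED_HOURS = 8    # densely annotated subset (from the abstract)
CLIP_SECONDS = 10


def clip_boundaries(total_seconds, clip_seconds=CLIP_SECONDS, fps=FPS):
    """Yield (start_frame, end_frame) per clip; end_frame is exclusive."""
    frames_per_clip = clip_seconds * fps
    total_frames = total_seconds * fps
    for start in range(0, total_frames, frames_per_clip):
        # The last clip may be shorter if the footage length is not a multiple.
        yield start, min(start + frames_per_clip, total_frames)


total_seconds = TOTAL_HOURS * 3600
total_frames = total_seconds * FPS                       # 1,728,000 (about 1.7M)
num_clips = sum(1 for _ in clip_boundaries(total_seconds))
qa_pairs_at_one_per_second = ANNOTATED_HOURS * 3600      # 28,800

print(total_frames, num_clips, qa_pairs_at_one_per_second)
```

Under these assumptions the 16 hours split into 5,760 clips of 300 frames each. The 28,800 figure matches the QA count in the summary, and the abstract rounds it to 28K.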

Figure 2 : Illustrates the pipeline for creating the UDVideoQA dataset. The process begins with traffic video recording, which is segmented and temporally clipped into $10$ s clips. These clips undergo dynamic anonymity blurring. The QA taxonomy and generation module then uses model based on VideoQG

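Experimental Results

Research Questions

RQ1: To what extent can current VideoLMs ground and reason over real-world, multi-agent urban traffic scenes?
RQ2: How large is the gap between perceptual grounding and higher-order reasoning in these models?
RQ3: Can domain adaptation through fine-tuning close the gap between open-source models and proprietary systems?
RQ4: How diverse and contextually grounded can automatically generated questions be for traffic scenarios in the VideoQGen setting?
RQ5: Which privacy-preserving techniques maintain scene fidelity while keeping surveillance data anonymous?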

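Key Findings

UDVideoQA comprises 16 hours of footage (1.7M frames) with 28,800 QA pairs, spanning varied weather, lighting, and traffic-density conditions.
The event-driven motion blur preserves privacy while retaining temporal and contextual integrity better than detector-segmenter baselines.
The 10 SOTA VideoLMs show a persistent perception-reasoning gap: performance on high-level reasoning often exceeds performance on low-level visual grounding.
Gemini 2.5 Pro achieves the best zero-shot and overall performance, but its attribute grounding is weak under morning conditions. Smaller open models can match or approach proprietary systems with appropriate fine-tuning.
In VideoQGen, Gemini 2.5 Pro and the Qwen3 series generate the most relevant and complex questions, but linguistic diversity is limited across all models.
Fine-tuning the open-source Qwen-2.5-VL 7B on UDVideoQA narrows the gap to proprietary systems, with notable improvements in attribution and cross-domain generalization.
The dataset supports cross-dataset generalization: models fine-tuned on UDVideoQA improve on the RoadSocial and SUTDTrafficQA benchmarks.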
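
One way to make the perception-reasoning gap concrete is to compare macro-averaged accuracy on the lower taxonomy levels against the higher ones. The sketch below is illustrative only and is not the paper's evaluation code. Grouping basic understanding and attribution as "perception" and the other three levels as "reasoning" is an assumption, the function names are hypothetical, and the records are toy values rather than results from the paper.

```python
from collections import defaultdict

# Assumed grouping of the five taxonomy levels (not confirmed by the source).
PERCEPTION = {"basic_understanding", "attribution"}
REASONING = {"event_reasoning", "reverse_reasoning", "counterfactual_inference"}


def category_accuracy(records):
    """records: iterable of (category, is_correct). Returns {category: accuracy}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in records:
        total[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / total[c] for c in total}


def perception_reasoning_gap(records):
    """Macro-averaged reasoning accuracy minus macro-averaged perception accuracy."""
    acc = category_accuracy(records)

    def mean(names):
        # Average only over categories that actually appear in the records.
        values = [acc[n] for n in names if n in acc]
        return sum(values) / len(values)

    return mean(REASONING) - mean(PERCEPTION)


# Toy judgments for illustration only; NOT results from the paper.
toy_records = [
    ("attribution", False), ("attribution", True),
    ("basic_understanding", True), ("basic_understanding", True),
    ("event_reasoning", True), ("event_reasoning", True),
    ("counterfactual_inference", True), ("counterfactual_inference", True),
]

print(category_accuracy(toy_records))
print(perception_reasoning_gap(toy_records))
```

A positive gap means the model scores higher on the reasoning levels than on the grounding levels, which is the pattern the review describes. The function raises `ZeroDivisionError` if either group is absent from the records.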

Figure 3 : UDVideoQA dataset statistics. (a) Word frequency distribution by question type, (b) Distribution of question categories across semantic domains, including pedestrians, vehicles, and environmental signage. (c) Plot illustrating the spread of question sets across six contextual dimensions:

This review was generated by AI and checked by a human editor.