[Paper Review] DeepSVU: Towards In-depth Security-oriented Video Understanding via Unified Physical-world Regularized MoE
The paper introduces the DeepSVU task and the Unified Physical-world Regularized MoE (UPRM) approach for in-depth security-oriented video understanding: identifying and localizing threats and attributing their causes. UPRM models coarse-to-fine physical-world information with a unified MoE and balances that information with a trade-off regularizer.
Prior research on Security-oriented Video Understanding (SVU) has predominantly focused on detecting and localizing threats (e.g., shootings, robberies) in videos, while largely lacking the capability to generate and evaluate the causes of those threats. Motivated by this gap, the paper introduces a new chat-paradigm SVU task, In-depth Security-oriented Video Understanding (DeepSVU), which aims not only to identify and locate threats but also to attribute and evaluate the causes of the threatening segments. The paper further identifies two key challenges in the proposed task: 1) how to effectively model coarse-to-fine physical-world information (e.g., human behavior, object interactions, and background context) to boost the DeepSVU task; and 2) how to adaptively trade off these factors. To tackle these challenges, the paper proposes a new Unified Physical-world Regularized MoE (UPRM) approach. Specifically, UPRM incorporates two key components, the Unified Physical-world Enhanced MoE (UPE) Block and the Physical-world Trade-off Regularizer (PTR), which address the two challenges respectively. Extensive experiments conducted on the DeepSVU instruction datasets (i.e., UCF-C instructions and CUVA instructions) show that UPRM outperforms several advanced Video-LLMs as well as non-VLM approaches. These results support the importance of coarse-to-fine physical-world information in the DeepSVU task and demonstrate the effectiveness of UPRM in capturing such information.
Research Motivation and Objectives
- Motivate a new task, In-depth Security-oriented Video Understanding (DeepSVU), that not only detects threats but also attributes their causes in detail.
- Address two challenges: modeling coarse-to-fine physical-world information and adaptively trading off contributions from different information levels.
- Develop a unified MoE-based architecture (UPRM) to capture multi-granularity cues and balance their influence.
- Evaluate on two DeepSVU instruction datasets (UCF-C instructions and CUVA instructions) built from UCF-Crime and CUVA datasets.
Proposed Method
- Introduce the Unified Physical-world Enhanced MoE (UPE) Block with three fine-grained experts (Human-Pose Expert, Object-Relation Expert, Visual-Background Expert) and one coarse-grained Video Expert to model physical-world information at multiple granularities.
- Incorporate Human-Pose aware Attention and cross-attention mechanisms to fuse pose cues with video tokens.
- Use a Graph Transformer Network for Masked Object Interactions to refine object-relation representations.
- Leverage a Visual-Background Expert that uses SAM-based background tokens and a fine-tuned visual FFN for refinement.
- Adopt a Coarse-grained Video Expert based on the LanguageBind encoder within Video-LLaVA (CLIP-based ViT-L/14) to extract coarse visual semantics.
- Propose the Physical-world Trade-off Regularizer (PTR), consisting of a Trade-off Aware Expert Router and a Gated Physical-world Trade-off Loss, to balance the contributions of the four experts.
- Train in two stages: pre-tuning for physical-world understanding on RefCOCO, HumanML3D, and RSI-CB, followed by DeepSVU instruction tuning on the UCF-C and CUVA instruction datasets.
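The idea of routing over four experts and penalizing unbalanced usage can be illustrated with a minimal sketch. This is not the paper's Trade-off Aware Expert Router or its Gated Physical-world Trade-off Loss, whose exact formulations are not given in this review. It assumes plain softmax gating and a generic squared-deviation-from-uniform balance penalty; the names `EXPERTS`, `route`, and `balance_penalty` are hypothetical.

```python
import math

# Hypothetical labels for the four experts named in the review.
EXPERTS = ["human_pose", "object_relation", "visual_background", "video"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, expert_outputs):
    """Fuse equal-length expert output vectors using softmax gate weights.

    Assumption: dense (all-expert) gating; the paper's router may differ.
    """
    gates = softmax(router_logits)
    dim = len(expert_outputs[0])
    fused = [
        sum(g * out[i] for g, out in zip(gates, expert_outputs))
        for i in range(dim)
    ]
    return gates, fused


def balance_penalty(gates_per_token):
    """Generic balance term (an assumption, not the paper's loss).

    Returns the squared deviation of the mean gate weight per expert from
    the uniform value 1/num_experts; it is 0 when usage is perfectly even.
    """
    n_tokens = len(gates_per_token)
    n_experts = len(gates_per_token[0])
    uniform = 1.0 / n_experts
    mean_gate = [
        sum(g[k] for g in gates_per_token) / n_tokens for k in range(n_experts)
    ]
    return sum((m - uniform) ** 2 for m in mean_gate)
```

In a training loop, a term like `balance_penalty` would typically be added to the task loss with a small weight so that no single expert (for example the coarse-grained video expert) dominates the fused representation.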

Experimental Results
Research Questions
- RQ1: How can coarse-to-fine physical-world information be effectively modeled for DeepSVU?
- RQ2: How can the contributions of coarse-grained and fine-grained physical-world cues be balanced to avoid bias?
- RQ3: Can a unified MoE with specialized experts outperform existing Video-LLMs and non-LLM approaches in identifying, locating, and attributing threats?
- RQ4: Does the proposed approach generalize across two instruction datasets derived from UCF-Crime and CUVA for threat understanding?
Key Findings
- UPRM outperforms several advanced Video-LLMs and non-LLM approaches on CUVA and UCF-C instruction datasets in identifying, locating, and attributing threats.
- On CUVA, UPRM substantially reduces FNRs and improves F2-score and mAP@tIoU compared to both the best non-LLM baseline and Hawkeye (the best Video-LLM).
- On UCF-C, UPRM shows strong gains over baselines in identifying and locating threats, demonstrating the effectiveness of the PTR-enabled multi-expert fusion.
- Ablation results indicate the importance of each component (HPE, ORE, VBE, UPE, PTR) for performance gains.
- Two-stage training (physical-world pre-tuning and DeepSVU instruction tuning) enables the model to reason about causes, timestamps, and contextual factors in threats.
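The review names the metrics FNR, F2-score, and mAP@tIoU without defining them. The sketch below gives the standard textbook definitions of the false negative rate, the F-beta score (with beta = 2, which weights recall more heavily than precision), and temporal IoU between two segments. The paper's exact evaluation protocol (thresholds, averaging, and how mAP is aggregated over tIoU values) is not stated in this review, so the function names and interfaces here are assumptions for illustration only.

```python
def false_negative_rate(tp, fn):
    """FNR = FN / (TP + FN): the fraction of real threats that were missed."""
    total = tp + fn
    return fn / total if total else 0.0


def fbeta_score(tp, fp, fn, beta=2.0):
    """F-beta from counts; beta=2 gives the F2-score, which favors recall."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) time segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0
```

For example, with 8 true positives, 2 false positives, and 2 false negatives, `false_negative_rate(8, 2)` is 0.2 and `fbeta_score(8, 2, 2)` is 0.8. A predicted segment `(0, 10)` against a ground-truth segment `(5, 15)` overlaps for 5 seconds out of a 15-second union, giving a tIoU of 1/3.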

This review was generated by AI and checked by a human editor.