QUICK REVIEW

[Paper Review] Instance-Aligned Captions for Explainable Video Anomaly Detection

Inpyo Song, Minjun Joo|arXiv (Cornell University)|Jan 13, 2026

Anomaly Detection Techniques and Applications | Citations: 0

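One-Line Summary

This paper introduces instance-aligned captions that tie each textual claim in video anomaly detection to a specific object instance, extends eight benchmarks with grounded, per-instance explanations, and expands VIEW360 into VIEW360+. It shows that explanations from current LLMs and VLMs struggle with grounding and consistency, underscoring the need for verifiable visual-text alignment.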

ABSTRACT

Explainable video anomaly detection (VAD) is crucial for safety-critical applications, yet even with recent progress, much of the research still lacks spatial grounding, making the explanations unverifiable. This limitation is especially pronounced in multi-entity interactions, where existing explainable VAD methods often produce incomplete or visually misaligned descriptions, reducing their trustworthiness. To address these challenges, we introduce instance-aligned captions that link each textual claim to specific object instances with appearance and motion attributes. Our framework captures who caused the anomaly, what each entity was doing, whom it affected, and where the explanation is grounded, enabling verifiable and actionable reasoning. We annotate eight widely used VAD benchmarks and extend the 360-degree egocentric dataset, VIEW360, with 868 additional videos, eight locations, and four new anomaly types, creating VIEW360+, a comprehensive testbed for explainable VAD. Experiments show that our instance-level spatially grounded captions reveal significant limitations in current LLM- and VLM-based methods while providing a robust benchmark for future research in trustworthy and interpretable anomaly detection.

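Motivation and Objectives

Address the lack of spatial grounding in explainable VAD.
Provide instance-level, object-grounded captions tied to segmentation masks.
Extend the VIEW360 dataset to VIEW360+ to cover a wider range of scenarios and anomalies.
Annotate eight VAD benchmarks with role-aware, instance-aligned captions to enable unified evaluation.
Expose the limitations of existing LLM- and VLM-based explanations and establish a robust benchmark.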

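Proposed Method

Annotate videos with instance masks that account for perpetrator and victim/target roles.
Use SAM2 to generate per-frame segmentation masks from prompts.
Crop object sequences and generate object-specific captions consistent with the reference context.
Ground each explanation by linking every caption to its corresponding instance segmentation.
Evaluate caption quality and spatial grounding with the joint Cap-IoU F_SC metric and the false-positive entity count (FPE).
Compare caption-only, segmentation-only, and multi-stage VLM+SAM2 pipelines across datasets.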
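
To make the grounding step concrete, here is a minimal Python sketch of how a caption might be tied to an instance mask and how spatial grounding could be scored. Everything in it is an assumption for illustration: the `InstanceCaption` schema, the helper names, the mean per-frame IoU, and the 0.5 matching threshold are hypothetical, and this review does not give the paper's exact definitions of F_SC or FPE. The masks are toy arrays, not data from the paper.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class InstanceCaption:
    """One grounded claim: a caption tied to one object instance.

    Hypothetical schema for illustration; not the paper's annotation format.
    """
    instance_id: int
    role: str          # e.g. "perpetrator" or "victim/target"
    caption: str       # appearance and motion description
    masks: np.ndarray  # boolean array of shape (frames, height, width)


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-frame IoU of two boolean mask sequences.

    Frames where both masks are empty are skipped; returns 0.0 if every
    frame is empty in both sequences.
    """
    inter = np.logical_and(pred, gt).sum(axis=(1, 2))
    union = np.logical_or(pred, gt).sum(axis=(1, 2))
    valid = union > 0
    if not valid.any():
        return 0.0
    return float((inter[valid] / union[valid]).mean())


def count_false_positive_entities(preds, gts, iou_thresh=0.5):
    """Count predicted instances that overlap no annotated instance.

    Assumed matching rule: a prediction is a false positive when its IoU
    with every annotated instance is below iou_thresh.
    """
    return sum(
        1
        for p in preds
        if all(mask_iou(p.masks, g.masks) < iou_thresh for g in gts)
    )


def box_mask(frames, size, rows, cols):
    """Toy helper: a rectangular mask repeated over all frames."""
    m = np.zeros((frames, size, size), dtype=bool)
    m[:, rows[0]:rows[1], cols[0]:cols[1]] = True
    return m


# Toy example with made-up masks and captions (not data from the paper).
gt = [
    InstanceCaption(0, "perpetrator", "man in red jacket pushing",
                    box_mask(2, 4, (0, 2), (0, 2))),
]
preds = [
    InstanceCaption(0, "perpetrator", "man in red jacket pushing",
                    box_mask(2, 4, (0, 2), (0, 1))),
    InstanceCaption(1, "victim/target", "person near the wall",
                    box_mask(2, 4, (3, 4), (3, 4))),
]

print(mask_iou(preds[0].masks, gt[0].masks))     # 0.5
print(count_false_positive_entities(preds, gt))  # 1
```

In this toy case the first prediction covers half of the annotated region and still counts as matched, while the second overlaps nothing and is counted as a phantom entity. A caption-quality term would have to be combined with the IoU to approximate a joint score such as F_SC; that combination is left out here because the review does not specify it.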

Figure 1 : Comparison of anomaly understanding paradigms. (a) Traditional score-only detection raises an alert but provides no explainability. (b) LLM/VLM-based systems generate textual explanations but lack spatial grounding—when multiple people match the description or the model attends to wrong o

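Experimental Results

Research Questions

RQ1: Can instance-aligned captions provide verifiable grounding for VAD explanations?
RQ2: How do current LLM- and VLM-based methods perform when explanations must be strictly grounded in instance-level visual evidence?
RQ3: What are the main failure modes of existing grounded/explanatory VAD approaches in multi-entity interactions?
RQ4: How does VIEW360+ differ from existing datasets in anomaly types and spatial grounding requirements?
RQ5: Does a unified evaluation protocol reveal gaps between caption quality and spatial grounding in explainable VAD?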

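Key Findings

Instance-aligned captions enable verifiable who–what–whom–where reasoning by grounding each claim in an object instance.
Grounded explanations expose important limitations of current LLM- and VLM-based methods, including mis-grounding and phantom entities.
Multi-stage VLM+SAM2 pipelines deliver more reliable grounded explanations than single-stage models across datasets.
Perpetrator grounding is consistently stronger than victim/target grounding across models and datasets.
VIEW360+ expands the anomaly types to better reflect urban safety scenarios and strengthens spatial grounding evaluation in egocentric 360° video.
The unified instance-aligned annotation framework facilitates robust evaluation of explainable VAD and highlights clear failure modes of current approaches.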

Figure 2 : Comparison of anomaly‐understanding paradigms. (a) Traditional VAD predicts only anomaly scores without explanations. (b) VLM‐based VAD generates textual descriptions but lacks object‐level grounding. (c) Grounding VLMs provide spatial localization but do not produce object‐specific expla

This review was generated by AI and checked by a human editor.