QUICK REVIEW

[Paper Review] VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection

Peng Wu, Xuerong Zhou|arXiv (Cornell University)|Aug 22, 2023

Cited by 8

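One-Sentence Summary

VadCLIP uses a frozen CLIP model in a dual-branch design that performs both coarse-grained binary classification and fine-grained vision-language alignment for weakly supervised video anomaly detection, achieving state-of-the-art results on XD-Violence and UCF-Crime.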

ABSTRACT

The recent contrastive language-image pre-training (CLIP) model has shown great success in a wide range of image-level tasks, revealing remarkable ability for learning powerful visual representations with rich semantics. An open and worthwhile problem is efficiently adapting such a strong model to the video domain and designing a robust video anomaly detector. In this work, we propose VadCLIP, a new paradigm for weakly supervised video anomaly detection (WSVAD) by leveraging the frozen CLIP model directly without any pre-training and fine-tuning process. Unlike current works that directly feed extracted features into the weakly supervised classifier for frame-level binary classification, VadCLIP makes full use of fine-grained associations between vision and language on the strength of CLIP and involves dual branch. One branch simply utilizes visual features for coarse-grained binary classification, while the other fully leverages the fine-grained language-image alignment. With the benefit of dual branch, VadCLIP achieves both coarse-grained and fine-grained video anomaly detection by transferring pre-trained knowledge from CLIP to WSVAD task. We conduct extensive experiments on two commonly-used benchmarks, demonstrating that VadCLIP achieves the best performance on both coarse-grained and fine-grained WSVAD, surpassing the state-of-the-art methods by a large margin. Specifically, VadCLIP achieves 84.51% AP and 88.02% AUC on XD-Violence and UCF-Crime, respectively. Code and features are released at https://github.com/nwpu-zxr/VadCLIP.

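Motivation and Objectives

Investigate how large-scale vision-language pre-training (CLIP) can be applied to weakly supervised video anomaly detection (WSVAD) without fine-tuning.
Use cross-modal cues to capture coarse-grained and fine-grained temporal and semantic information.
Exploit vision-language associations under weak supervision while preserving CLIP's pre-trained capabilities.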

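Proposed Method

Introduce the Local-Global Temporal Adapter (LGT-Adapter) to model local and global temporal dependencies efficiently.
Deploy a dual-branch architecture: the C-Branch performs coarse-grained binary anomaly detection, while the A-Branch uses CLIP's text encoder for fine-grained vision-language alignment.
Use learnable prompts and anomaly-focused visual prompts to adapt the text labels and visual context within CLIP.
Apply MIL-Align to optimize frame-level alignment under weak supervision by selecting the top-K frame-text matches for each label.
Keep CLIP's image and text encoders frozen; gradients are back-propagated only to the adapter and prompt modules.
Combine three losses: a binary cross-entropy loss on video-level predictions, a MIL-based alignment loss, and a contrastive loss between normal and abnormal class embeddings.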
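The top-K selection in MIL-Align and the three-term objective can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the softmax temperature of 0.07, the hinge-on-cosine form of the contrastive term, and the weight `lam` are all hypothetical choices made for this example. The released code at the repository linked in the abstract should be treated as the reference.

```python
import math


def top_k_mean(values, k):
    """Average of the k largest entries (k is clamped to [1, len(values)])."""
    k = max(1, min(k, len(values)))
    return sum(sorted(values, reverse=True)[:k]) / k


def mil_align(alignment_map, k, temperature=0.07):
    """Pool a frames x labels similarity map into video-level label probabilities.

    alignment_map[t][c] is the similarity between frame t and label c.
    The temperature value is an assumption for this sketch.
    """
    num_labels = len(alignment_map[0])
    # One video-level score per label: mean of its top-k frame-text matches.
    video_scores = [
        top_k_mean([frame[c] for frame in alignment_map], k)
        for c in range(num_labels)
    ]
    # Temperature-scaled softmax over labels (max-subtracted for stability).
    scaled = [s / temperature for s in video_scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def alignment_loss(alignment_map, video_label, k, temperature=0.07):
    """Cross-entropy between the pooled prediction and the video-level label index."""
    return -math.log(mil_align(alignment_map, k, temperature)[video_label])


def video_bce_loss(frame_scores, is_abnormal, k):
    """Binary cross-entropy on a video-level score pooled from frame scores."""
    p = min(max(top_k_mean(frame_scores, k), 1e-7), 1.0 - 1e-7)
    return -math.log(p) if is_abnormal else -math.log(1.0 - p)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrast_loss(normal_emb, abnormal_embs):
    """Assumed form: penalize positive cosine similarity between the normal
    class embedding and each abnormal class embedding."""
    return sum(max(0.0, cosine(normal_emb, a)) for a in abnormal_embs)


def total_loss(bce, nce, cts, lam=0.1):
    """Weighted sum of the three terms; lam is a hypothetical weight."""
    return bce + nce + lam * cts


# Toy map: 4 frames x 3 labels (label 0 = normal; labels 1 and 2 are
# hypothetical anomaly classes). Frames 1 and 2 match label 1 strongly.
demo_map = [
    [0.30, 0.10, 0.05],
    [0.25, 0.70, 0.10],
    [0.20, 0.80, 0.05],
    [0.30, 0.15, 0.10],
]
print(mil_align(demo_map, k=2))
```

Pooling only the top-K matches per label lets a video-level label supervise the few frames most likely to contain the event, which is what makes frame-level alignment trainable without frame-level annotations.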

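Experimental Results

Research Questions

RQ1: How can CLIP be adapted into an effective weakly supervised video anomaly detector (WSVAD) without retraining the backbone?
RQ2: Can a dual-branch architecture exploit both coarse-grained classification and fine-grained vision-language alignment within a single model to improve WSVAD?
RQ3: Under weak supervision, which mechanisms (prompts, prompts plus visual prompts, LGT-Adapter) best transfer CLIP knowledge to the WSVAD task?
RQ4: How can vision-language alignment be optimized under weak supervision so that anomalies are distinguished while pre-trained knowledge is preserved?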

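Key Findings

VadCLIP achieves 84.51% AP on XD-Violence and 88.02% AUC on UCF-Crime, setting a new state of the art on both benchmarks.
The dual-branch design enables both coarse-grained and fine-grained WSVAD within a single model.
Learnable prompts transfer CLIP knowledge to WSVAD more effectively than hand-crafted prompts.
Anomaly-focused visual prompts and the LGT-Adapter substantially improve temporal modeling and alignment performance.
With MIL-Align and cross-modal alignment, both fine-grained and coarse-grained performance improve while CLIP stays frozen.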
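For readers unfamiliar with the two headline metrics, the sketch below shows how frame-level AUC and AP can be computed from per-frame anomaly scores and binary frame labels. It is a small standard-library illustration rather than the evaluation script used in the paper. The function names and the toy scores are made up for this example, and the quadratic-time AUC is written for clarity, not speed.

```python
def frame_auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (abnormal, normal) frame pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one abnormal and one normal frame")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def frame_ap(scores, labels):
    """Average precision: mean of the precision values measured at the rank
    of each abnormal frame, with frames sorted by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precision_sum = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one abnormal frame")
    return precision_sum / hits


# Toy example: five frames, three of them abnormal (label 1).
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.1]
toy_labels = [1, 0, 1, 1, 0]
print(frame_auc(toy_scores, toy_labels))
print(frame_ap(toy_scores, toy_labels))
```

AUC measures how well abnormal frames are ranked above normal ones across all thresholds, while AP puts more weight on precision among the highest-scoring frames, so the two numbers are not directly comparable across the two benchmarks.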

This review was generated by AI and reviewed by a human editor.