QUICK REVIEW

[Paper Review] CLIP-AD: A Language-Guided Staged Dual-Path Model for Zero-shot Anomaly Detection

Xuhai Chen, Jiangning Zhang|arXiv (Cornell University)|Nov 1, 2023

Anomaly Detection Techniques and Applications | Cited by 7

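One-Sentence Summary

CLIP-AD introduces a language-guided zero-shot anomaly detection framework built on a Staged Dual-Path (SDP) model and its fine-tuned extension SDP+, reaching state-of-the-art performance on MVTec-AD and VisA without complex prompts or multi-scale encoding.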

ABSTRACT

This paper considers zero-shot Anomaly Detection (AD), performing AD without reference images of the test objects. We propose a framework called CLIP-AD to leverage the zero-shot capabilities of the large vision-language model CLIP. Firstly, we reinterpret the text prompts design from a distributional perspective and propose a Representative Vector Selection (RVS) paradigm to obtain improved text features. Secondly, we note opposite predictions and irrelevant highlights in the direct computation of the anomaly maps. To address these issues, we introduce a Staged Dual-Path model (SDP) that leverages features from various levels and applies architecture and feature surgery. Lastly, delving deeply into the two phenomena, we point out that the image and text features are not aligned in the joint embedding space. Thus, we introduce a fine-tuning strategy by adding linear layers and construct an extended model SDP+, further enhancing the performance. Abundant experiments demonstrate the effectiveness of our approach, e.g., on MVTec-AD, SDP outperforms the SOTA WinCLIP by +4.2/+10.7 in segmentation metrics F1-max/PRO, while SDP+ achieves +8.3/+20.5 improvements.

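Motivation and Objectives

Leverage CLIP's zero-shot classification capability to detect anomalies without reference images of the test objects.
Address the failure modes of naive text-image feature similarity in CLIP-based anomaly segmentation (opposite predictions and noisy, irrelevant highlights).
Use multi-level features and feature surgery to produce accurate anomaly maps without fine-tuning.
Improve the alignment between image features and text embeddings in CLIP through targeted fine-tuning (SDP+).
Demonstrate significant gains over existing zero-/few-shot AD methods on standard benchmarks.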
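
The naive text-image similarity baseline named above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each image patch and each text prompt ("normal", "abnormal") has already been encoded into a feature vector, and it scores a patch with a two-way softmax over cosine similarities. The function names and the temperature value 0.07 are hypothetical choices for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def naive_anomaly_scores(patch_feats, text_normal, text_abnormal, temperature=0.07):
    """Per-patch probability assigned to the 'abnormal' prompt.

    temperature is an illustrative assumption, not a value from the paper.
    """
    scores = []
    for feat in patch_feats:
        s_n = cosine(feat, text_normal) / temperature
        s_a = cosine(feat, text_abnormal) / temperature
        m = max(s_n, s_a)  # subtract the max for numerical stability
        e_n, e_a = math.exp(s_n - m), math.exp(s_a - m)
        scores.append(e_a / (e_n + e_a))
    return scores

# Toy 2-D features: the first patch is close to "normal", the second to "abnormal".
toy_scores = naive_anomaly_scores(
    patch_feats=[[1.0, 0.1], [0.1, 1.0]],
    text_normal=[1.0, 0.0],
    text_abnormal=[0.0, 1.0],
)
```

When patch features are poorly aligned with the text features, a map built this way can mark normal regions as anomalous, which is the "opposite predictions" problem that SDP targets.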

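Proposed Method

Propose CLIP-AD, a language-guided framework that reduces the need for multi-scale encoding and post-processing.
Introduce the Staged Dual-Path (SDP) model, which combines multi-level ViT features with architecture surgery and feature surgery to produce clean anomaly maps.
Apply architecture surgery by modifying the attention/FFN paths to mitigate opposite predictions.
Apply feature surgery, removing redundant features through a text-guided subtraction mechanism.
Extend SDP to SDP+ by adding lightweight linear layers that map image features into CLIP's joint embedding space for better cross-modal alignment.
Fine-tune only this small number of linear layers (CLIP weights stay frozen), using focal and dice losses to improve segmentation while preserving the zero-shot setting.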
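
The focal and dice losses used for fine-tuning can be illustrated with their standard binary definitions. This is a sketch under stated assumptions: it operates on a flattened list of per-pixel anomaly probabilities and binary ground-truth labels, and it combines the two terms as an unweighted sum. The review does not give the weighting or hyperparameters, so `gamma`, `smooth`, and the function names are hypothetical.

```python
import math

def focal_loss(probs, targets, gamma=2.0, eps=1e-7):
    """Mean binary focal loss; gamma=2.0 is a common default, assumed here."""
    total = 0.0
    for p, t in zip(probs, targets):
        p_t = p if t == 1 else 1.0 - p   # probability given to the true class
        p_t = min(max(p_t, eps), 1.0)    # clamp to avoid log(0)
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft dice loss: 1 - (2*|P.T| + smooth) / (|P| + |T| + smooth)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)

def segmentation_loss(probs, targets):
    """Unweighted sum of both terms (the weighting is an assumption)."""
    return focal_loss(probs, targets) + dice_loss(probs, targets)

mask = [1, 0, 1, 0]
loss_perfect = segmentation_loss([1.0, 0.0, 1.0, 0.0], mask)
loss_inverted = segmentation_loss([0.0, 1.0, 0.0, 1.0], mask)
```

Focal loss down-weights easy pixels, which helps when anomalous pixels are rare, while dice loss directly rewards overlap between the predicted and true anomaly regions.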

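Experimental Results

Research Questions

RQ1: Can a CLIP-based zero-shot method achieve competitive anomaly classification and segmentation without reference images of the test objects?
RQ2: Why do naive text-image similarity maps in CLIP misbehave in anomaly segmentation, and can architecture- and feature-level interventions fix this?
RQ3: Does staged, multi-level feature fusion (SDP) improve anomaly detection on both simple textures and complex object defects?
RQ4: Does the lightweight fine-tuning extension (SDP+) significantly improve performance by aligning image features with text embeddings?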

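Key Findings

SDP outperforms prior zero-shot methods on segmentation metrics on MVTec-AD and VisA, with notable gains over WinCLIP (+4.2 F1-max and +10.7 PRO on MVTec-AD).
SDP+ further improves segmentation and classification metrics, with substantial gains over the SOTA on several benchmarks (+8.3 F1-max and +20.5 PRO over WinCLIP on MVTec-AD).
Direct text-image similarity maps tend to produce opposite predictions and noisy highlights, which motivates the SDP design.
Multi-stage feature fusion with architecture surgery yields more accurate and stable anomaly maps without complex prompts or post-processing.
Fine-tuning a small number of linear layers to align image features with CLIP's embedding space (SDP+) brings large gains in segmentation PRO and pixel-level metrics.
In the ablations, larger CLIP backbones and different pretrained weights affect performance, and OpenAI- and LAION-based models benefit differently; SDP/SDP+ nevertheless consistently outperform the baselines.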
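
For readers unfamiliar with the F1-max metric reported here, it is the best F1 score obtainable over all decision thresholds on the anomaly scores. The sketch below is a brute-force illustration on a flat list of scores and binary labels. The function name is hypothetical, and the paper's exact evaluation protocol is not described in this review; PRO is not sketched.

```python
def f1_max(scores, labels):
    """Best F1 over all thresholds, predicting 'anomalous' when score >= threshold."""
    best = 0.0
    for thr in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
        if tp == 0:
            continue  # F1 is 0 at this threshold, so it cannot improve the best
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# One normal pixel (score 0.4) outranks an anomalous one (score 0.35),
# so no threshold separates the classes perfectly.
example = f1_max([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

Because the threshold is chosen after the fact, F1-max measures how separable the scores are and does not reflect the difficulty of picking a threshold in deployment.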

This review was generated by AI and reviewed by a human editor.