QUICK REVIEW

[Paper Review] Random Word Data Augmentation with CLIP for Zero-Shot Anomaly Detection

Masato Tamura|arXiv (Cornell University)|Aug 22, 2023

Anomaly Detection Techniques and Applications | Cited by 7

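One-Sentence Summary

The paper proposes a zero-shot, category-agnostic anomaly detector trained on diverse CLIP text embeddings generated through random word augmentation. It needs no target-object prompts at inference and matches or exceeds prompt-ensembling baselines on several benchmarks.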

ABSTRACT

This paper presents a novel method that leverages a visual-language model, CLIP, as a data source for zero-shot anomaly detection. Tremendous efforts have been put towards developing anomaly detectors due to their potential industrial applications. Considering the difficulty in acquiring various anomalous samples for training, most existing methods train models with only normal samples and measure discrepancies from the distribution of normal samples during inference, which requires training a model for each object category. The problem of this inefficient training requirement has been tackled by designing a CLIP-based anomaly detector that applies prompt-guided classification to each part of an image in a sliding window manner. However, the method still suffers from the labor of careful prompt ensembling with known object categories. To overcome the issues above, we propose leveraging CLIP as a data source for training. Our method generates text embeddings with the text encoder in CLIP with typical prompts that include words of normal and anomaly. In addition to these words, we insert several randomly generated words into prompts, which enables the encoder to generate a diverse set of normal and anomalous samples. Using the generated embeddings as training data, a feed-forward neural network learns to extract features of normal and anomaly from CLIP's embeddings, and as a result, a category-agnostic anomaly detector can be obtained without any training images. Experimental results demonstrate that our method achieves state-of-the-art performance without laborious prompt ensembling in zero-shot setups.

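Research Motivation and Goals

Enable category-agnostic anomaly detection that needs no object-category prompts at inference.
Use CLIP as a data source to generate diverse training embeddings for both normal and anomalous samples.
Build a robust detector through random word data augmentation, removing the need for laborious prompt ensembling.
Demonstrate competitive zero-shot performance on standard AD benchmarks (MVTec-AD, VisA) and on a real-world dataset with diverse anomalies (SewerML).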

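Proposed Method

Use two classes of prompt templates, built from normal words and anomaly words, to guide CLIP-based anomaly scores.
Apply random word data augmentation by inserting randomly generated words into the prompts, creating diverse embedding pairs for normal and anomalous samples.
Train a four-layer feed-forward network (FNN) on CLIP text embeddings to classify normal versus anomalous without any object-specific training images.
Compute anomaly scores by applying the trained FNN to image embeddings from CLIP's image encoder, optionally combining them with CLIP-based prompt scores.
Evaluate zero-shot performance in both unknown-object and known-object settings, and explore combinations with other CLIP-based scores (s_pr, s_img).
Keep prompt ensembling out of the training pipeline, avoiding laborious prompt engineering while still achieving strong zero-shot results.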
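
The prompt-generation and score-combination steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the template wording, the use of random lowercase strings as "random words", the choice to share the same random words within a pair, and the weighted-average fusion in `combine_scores` are all assumptions introduced here, since this summary does not specify them.

```python
import random
import string

# Assumed templates; the exact prompt wording is not given in this summary.
NORMAL_TEMPLATE = "a photo of a normal {words}"
ANOMALY_TEMPLATE = "a photo of an anomalous {words}"


def random_word(rng, min_len=3, max_len=8):
    """Return a random lowercase string standing in for a randomly generated word."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def make_prompt_pairs(n_pairs, words_per_prompt=3, seed=0):
    """Build (normal_prompt, anomalous_prompt) pairs.

    Assumption: both prompts in a pair share the same random words, so the
    pair differs only in the normal/anomaly word.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        words = " ".join(random_word(rng) for _ in range(words_per_prompt))
        pairs.append((NORMAL_TEMPLATE.format(words=words),
                      ANOMALY_TEMPLATE.format(words=words)))
    return pairs


def combine_scores(s_fnn, s_clip=None, weight=0.5):
    """Hypothetical fusion: weighted average of the FNN score and a CLIP prompt score."""
    if s_clip is None:
        return s_fnn
    return weight * s_fnn + (1.0 - weight) * s_clip


if __name__ == "__main__":
    for normal, anomalous in make_prompt_pairs(3):
        print(normal, "|", anomalous)
```

In a full pipeline, each prompt would be passed through CLIP's text encoder and the resulting embeddings labeled normal or anomalous to train the FNN; that step is omitted here because it depends on the CLIP implementation in use.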

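Experimental Results

Research Questions

RQ1: Can CLIP serve as a training data source for building a category-agnostic anomaly detector that does not rely on target-object information at inference?
RQ2: Does random word data augmentation provide enough diversity in the embeddings to separate normal from anomalous samples across unknown object categories?
RQ3: In the zero-shot setting, how does the method compare with prompt-guided anomaly detection methods and prompt-ensembling baselines on standard AD benchmarks?
RQ4: How do the number of random prompt pairs (N_p) and the number of selected words affect zero-shot performance?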

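Key Findings

The proposed method achieves competitive zero-shot performance on MVTec-AD and VisA, and in the unknown-object setting it generally outperforms CLIP-based prompt-guided anomaly detection and WinCLIP.
In the unknown-object setting, CLIP + ours consistently improves on CLIP alone under zero-shot evaluation, with the most notable gains when the object category is not specified.
Random word data augmentation yields diverse embeddings, allowing the category-agnostic FNN to detect anomalies without object-specific training data.
On SewerML, the method performs best of the three methods compared, highlighting its robustness to highly diverse defects.
Performance peaks at around N_p = 10,000 training prompt pairs; too few or too many pairs degrade performance through underfitting or overfitting, respectively.
The method achieves strong AUROC, AUPR, and F1-max across datasets, and CLIP + ours frequently scores highest in the zero-shot evaluation.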
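
For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute AUROC and F1-max from per-image anomaly scores. It is an illustrative pure-Python version, not the paper's evaluation code; the function names and the thresholding convention (predict anomalous when the score is at or above the threshold) are assumptions made here.

```python
def auroc(scores, labels):
    """AUROC as the probability that a random anomaly (label 1) outscores
    a random normal sample (label 0); ties count as 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_max(scores, labels):
    """Best F1 over thresholds taken from the observed scores
    (a sample is predicted anomalous when score >= threshold)."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # F1 is zero when nothing is correctly flagged
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    example_scores = [0.1, 0.4, 0.35, 0.8]
    example_labels = [0, 0, 1, 1]
    print(auroc(example_scores, example_labels))
    print(f1_max(example_scores, example_labels))
```

The pairwise AUROC loop is quadratic in the number of samples, which is fine for illustration; a rank-based implementation from a standard metrics library would be the usual choice on full benchmarks.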


This review was generated by AI and reviewed by a human editor.