QUICK REVIEW

[Paper Review] Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Jiangning Zhang, Xuhai Chen|arXiv (Cornell University)|Dec 12, 2023

Anomaly Detection Techniques and Applications | Cited by 9

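One-Sentence Summary

Within the Meta-AD framework, the paper proposes ViTAD, a Vision Transformer-based model for multi-class unsupervised anomaly detection. The design is simple and efficient to train, and it achieves state-of-the-art results on the MVTec AD and VisA datasets.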

ABSTRACT

This work studies a challenging and practical issue known as multi-class unsupervised anomaly detection (MUAD). This problem requires only normal images for training while simultaneously testing both normal and anomaly images across multiple classes. Existing reconstruction-based methods typically adopt pyramidal networks as encoders and decoders to obtain multi-resolution features, often involving complex sub-modules with extensive handcrafted engineering. In contrast, a plain Vision Transformer (ViT) showcasing a more straightforward architecture has proven effective in multiple domains, including detection and segmentation tasks. It is simpler, more effective, and elegant. Following this spirit, we explore the use of only plain ViT features for MUAD. We first abstract a Meta-AD concept by synthesizing current reconstruction-based methods. Subsequently, we instantiate a novel ViT-based ViTAD structure, designed incrementally from both global and local perspectives. This model provides a strong baseline to facilitate future research. Additionally, this paper uncovers several intriguing findings for further investigation. Finally, we comprehensively and fairly benchmark various approaches using eight metrics. Utilizing a basic training regimen with only an MSE loss, ViTAD achieves state-of-the-art results and efficiency on MVTec AD, VisA, and Uni-Medical datasets. For example, it achieves 85.4 mAD, surpassing UniAD by +3.0 on the MVTec AD dataset, and it requires only 1.1 hours and 2.3G of GPU memory to complete model training on a single V100. Full code is available at https://zhangzjn.github.io/projects/ViTAD/.

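Research Motivation and Goals

Motivate MUAD as a practical setting in which training uses normal images from multiple classes.
Abstract a Meta-AD framework that unifies reconstruction-based anomaly detection.
Instantiate a simple, symmetric ViT-based model, ViTAD, and study its macro- and micro-level design choices.
Demonstrate strong performance and efficiency on standard anomaly detection benchmarks while analyzing the design factors.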

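Proposed Method

Formalize Meta-AD as a reconstruction-based anomaly detection framework consisting of a feature encoder, a Fuser, and a decoder.
Build ViTAD as a plain columnar ViT, with four stages serving as the encoder and the decoder and a simple linear Fuser between them.
Train on multi-stage features with only a single pixel-level loss to produce anomaly maps.
Study macro-level design factors (skip connections, pre-training, stage usage) and micro-level details (normalization, linear fusion, positional encoding, CLS token).
Use the cosine similarity between encoder and decoder features at each stage to form anomaly maps, and combine the loss across stages.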
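The per-stage comparison described above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function names, the toy features, the sum-over-stages aggregation, and the max-based image score are all assumptions made for the example. It relies on the fact that a plain ViT keeps the same number of patch tokens at every stage, so the per-stage maps can be combined patch by patch without resizing.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cos(u, v) for two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def stage_anomaly_map(enc_feats, dec_feats):
    """Per-patch cosine distance between encoder and decoder features of one stage."""
    return [cosine_distance(e, d) for e, d in zip(enc_feats, dec_feats)]


def combined_anomaly_map(enc_stages, dec_stages):
    """Sum the per-stage maps patch by patch (the aggregation rule is an assumption)."""
    maps = [stage_anomaly_map(e, d) for e, d in zip(enc_stages, dec_stages)]
    return [sum(values) for values in zip(*maps)]


# Toy data (hypothetical): 2 stages, 3 patches, 2-dimensional features.
enc_stages = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
]
dec_stages = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]],    # third patch is orthogonal
    [[2.0, 0.0], [0.0, 3.0], [-1.0, -1.0]],   # third patch points the opposite way
]

anomaly_map = combined_anomaly_map(enc_stages, dec_stages)
image_score = max(anomaly_map)  # assumed image-level score: the most anomalous patch
```

In this toy case the first two patches are reconstructed up to scale, so their distance stays near zero, while the third patch is reconstructed poorly at both stages and dominates the image-level score.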

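Experimental Results

Research Questions

RQ1: Can a plain (non-pyramidal) ViT architecture match pyramid-based methods on MUAD?
RQ2: How do ViTAD's macro- and micro-level design choices affect anomaly detection accuracy and localization under MUAD?
RQ3: How do the pre-training scheme and the choice of features affect MUAD results?
RQ4: With plain ViT features, is a lightweight Fuser sufficient for strong MUAD performance?
RQ5: Which benchmarks and metrics best reflect MUAD performance and efficiency?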

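Key Findings

A plain ViT with a simple Fuser (ViTAD) reaches SoTA on MUAD for MVTec AD and VisA without a complex pyramidal structure.
Feeding the Fuser last-stage features improves image-level metrics, while multi-stage features supply multi-scale information that is critical for localization.
DINO-based self-supervised pre-training outperforms other pre-training schemes on MUAD, and smaller patch sizes and higher resolutions improve pixel-level metrics.
A lightweight linear Fuser is sufficient for strong performance, contradicting the earlier view that heavy fusion modules are needed.
Keeping positional embeddings while dropping the CLS token can slightly improve or at least maintain performance, whereas pre-normalization and other micro-level details have only minor effects.
On MUAD, ViTAD trains in about 1.1 hours on a single V100 GPU and reaches an overall mAD of 85.4, an image-level mAU-ROC of 98.3, and a pixel-level mAU-ROC of 97.7, along with the other metrics reported in the paper.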
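For readers unfamiliar with the image-level mAU-ROC quoted above, the sketch below shows one way such a number could be computed from anomaly scores. It uses the standard rank-based definition of AU-ROC (the probability that a randomly chosen anomalous sample scores higher than a randomly chosen normal one, with ties counted as one half). Reading the "m" prefix as a mean over object classes is an assumption of this example, and the class names and scores are made-up toy values, not results from the paper.

```python
def au_roc(scores, labels):
    """AU-ROC as the fraction of (anomalous, normal) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AU-ROC needs at least one normal and one anomalous sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_over_classes(per_class):
    """Average a {class_name: value} mapping (assumed meaning of the 'm' prefix)."""
    return sum(per_class.values()) / len(per_class)


# Hypothetical image-level anomaly scores and labels (1 = anomalous, 0 = normal).
toy_data = {
    "class_a": ([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]),
    "class_b": ([0.2, 0.9, 0.3], [0, 1, 0]),
}

per_class_auroc = {name: au_roc(s, y) for name, (s, y) in toy_data.items()}
m_au_roc = mean_over_classes(per_class_auroc)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a sorting-based implementation would be preferable for pixel-level evaluation, where every pixel is a sample.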


This review was generated by AI and reviewed by human editors.