QUICK REVIEW

[Paper Review] Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Jiangning Zhang, Xuhai Chen|arXiv (Cornell University)|2023. 12. 12.

Anomaly Detection Techniques and Applications | Citations: 9

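One-line summary

This paper introduces ViTAD, a plain Vision Transformer model built within the Meta-AD framework, which reaches SOTA results on multi-class unsupervised anomaly detection (MUAD) with a simple design and efficient training.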

ABSTRACT

This work studies a challenging and practical issue known as multi-class unsupervised anomaly detection (MUAD). This problem requires only normal images for training while simultaneously testing both normal and anomaly images across multiple classes. Existing reconstruction-based methods typically adopt pyramidal networks as encoders and decoders to obtain multi-resolution features, often involving complex sub-modules with extensive handcraft engineering. In contrast, a plain Vision Transformer (ViT) showcasing a more straightforward architecture has proven effective in multiple domains, including detection and segmentation tasks. It is simpler, more effective, and elegant. Following this spirit, we explore the use of only plain ViT features for MUAD. We first abstract a Meta-AD concept by synthesizing current reconstruction-based methods. Subsequently, we instantiate a novel ViT-based ViTAD structure, designed incrementally from both global and local perspectives. This model provides a strong baseline to facilitate future research. Additionally, this paper uncovers several intriguing findings for further investigation. Finally, we comprehensively and fairly benchmark various approaches using eight metrics. Utilizing a basic training regimen with only an MSE loss, ViTAD achieves state-of-the-art results and efficiency on MVTec AD, VisA, and Uni-Medical datasets. E.g., ViTAD achieves 85.4 mAD, surpassing UniAD by +3.0 on the MVTec AD dataset, and it requires only 1.1 hours and 2.3G GPU memory to complete model training on a single V100, so it can serve as a strong baseline to facilitate the development of future research. Full code is available at https://zhangzjn.github.io/projects/ViTAD/.

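Research motivation and goals

Presents MUAD as a practical setting in which a model is trained only on normal images spanning multiple classes.
Abstracts a Meta-AD framework that unifies reconstruction-based anomaly detection methods.
Implements ViTAD, a symmetric model built on a plain ViT, and studies its macro- and micro-level design choices.
Demonstrates strong performance and efficiency on standard AD benchmarks while analyzing the design factors.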

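Proposed method

Formalizes Meta-AD for reconstruction-based AD as a feature encoder, a fuser, and a decoder.
Implements ViTAD as a plain columnar ViT with a four-stage encoder and decoder, connected by a simple linear Fuser.
Trains with a single pixel-level loss applied across multi-stage features to produce anomaly maps.
Investigates macro-level design factors (skip connections, pre-training, stage usage) and micro-level details (normalization, linear fusion, positional encoding, CLS token).
Forms the anomaly map from the cosine similarity between encoder and decoder features at each stage, and uses a loss combined across stages.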
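The cosine-similarity anomaly map and the multi-stage reconstruction loss described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names, the `(C, H, W)` feature layout, averaging the per-stage maps, and the toy data are all assumptions made for this example, and upsampling the map to the input resolution is left out.

```python
import numpy as np


def cosine_anomaly_map(enc_feat, dec_feat, eps=1e-8):
    """Per-location score 1 - cos(enc, dec) for two (C, H, W) feature maps."""
    dot = (enc_feat * dec_feat).sum(axis=0)
    norms = np.linalg.norm(enc_feat, axis=0) * np.linalg.norm(dec_feat, axis=0)
    return 1.0 - dot / (norms + eps)


def multi_stage_anomaly_map(enc_feats, dec_feats):
    """Average the per-stage maps (averaging is an assumption of this sketch)."""
    maps = [cosine_anomaly_map(e, d) for e, d in zip(enc_feats, dec_feats)]
    return np.mean(maps, axis=0)


def multi_stage_mse(enc_feats, dec_feats):
    """Sum of per-stage mean squared errors between encoder and decoder features."""
    return float(sum(np.mean((e - d) ** 2) for e, d in zip(enc_feats, dec_feats)))


# Toy data: 4 stages of (C=8, H=4, W=4) features. All stages share one
# resolution, as they would in a plain (columnar) ViT.
rng = np.random.default_rng(0)
enc_feats = [rng.normal(size=(8, 4, 4)) for _ in range(4)]
dec_feats = [e.copy() for e in enc_feats]  # start from a perfect reconstruction
for d in dec_feats:
    d[:, 1, 2] *= -1.0                     # flip one location to mimic a failed reconstruction

amap = multi_stage_anomaly_map(enc_feats, dec_feats)
loss = multi_stage_mse(enc_feats, dec_feats)
```

In this toy setup the map is close to zero wherever the decoder reproduces the encoder features and peaks at the single corrupted location, which is the behavior a reconstruction-based detector relies on: a model trained only on normal images is expected to reconstruct anomalous regions poorly.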

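Experimental results

Research questions

RQ1: Can a plain (non-pyramidal) ViT architecture achieve MUAD performance competitive with pyramid-based methods?
RQ2: How do ViTAD's macro- and micro-level design choices affect anomaly detection accuracy and localization under MUAD?
RQ3: How do the pre-training scheme and the choice of features affect MUAD results?
RQ4: Is a lightweight Fuser sufficient for strong MUAD performance when using plain ViT features?
RQ5: Which evaluation benchmarks and metrics best reflect MUAD performance and efficiency?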

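Key findings

A plain ViT with a simple Fuser (ViTAD) achieves state-of-the-art (SOTA) MUAD performance on MVTec AD and VisA without complex pyramidal structures.
Feeding last-stage features to the Fuser improves image-level metrics, while multi-stage features supply the multi-scale information needed for localization.
DINO-based self-supervised pre-training outperforms other pre-training methods on MUAD, and smaller patch sizes and higher resolutions improve pixel-level metrics.
A lightweight linear Fuser is enough for strong performance, which counters the earlier claim that heavy fusion modules are necessary.
Keeping positional embeddings and omitting the CLS token slightly improves or maintains performance, while pre-normalization and other micro-level details have only subtle effects.
On MUAD, ViTAD reaches an overall 85.4 mAD on MVTec AD with 1.1 hours of training on a single V100 GPU, alongside other metrics cited in the paper such as an image-level mAU-ROC of 98.3 and a pixel-level mAU-ROC of 97.7.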
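One plausible reading of the patch-size and resolution finding is that both change how many patch tokens cover the image, and therefore how fine the resulting anomaly map can be. The short sketch below only illustrates that arithmetic; the helper `token_grid` and the image and patch sizes are hypothetical values chosen for the example, not configurations taken from the paper.

```python
def token_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (tokens per side, total patch tokens) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size
    return side, side * side


# Illustrative settings only: (image size, patch size) pairs.
grids = {(s, p): token_grid(s, p) for s, p in [(256, 16), (256, 8), (512, 16)]}
```

Halving the patch size or doubling the resolution both quadruple the number of tokens, so each token covers a smaller image region, at the cost of more computation in self-attention.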


This review was generated by AI and reviewed by a human editor.