QUICK REVIEW

[Paper Review] Exploring Plain ViT Reconstruction for Multi-class Unsupervised Anomaly Detection

Jiangning Zhang, Xuhai Chen | arXiv (Cornell University) | Dec 12, 2023

Anomaly Detection Techniques and Applications | Cited by 9

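One-Sentence Summary

This paper proposes ViTAD, a plain Vision Transformer model instantiated within the Meta-AD framework, for multi-class unsupervised anomaly detection (MUAD). With a simple design and efficient training, it achieves state-of-the-art results on MVTec AD and VisA.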

ABSTRACT

This work studies a challenging and practical issue known as multi-class unsupervised anomaly detection (MUAD). This problem requires only normal images for training while simultaneously testing both normal and anomaly images across multiple classes. Existing reconstruction-based methods typically adopt pyramidal networks as encoders and decoders to obtain multi-resolution features, often involving complex sub-modules with extensive handcraft engineering. In contrast, a plain Vision Transformer (ViT) showcasing a more straightforward architecture has proven effective in multiple domains, including detection and segmentation tasks. It is simpler, more effective, and elegant. Following this spirit, we explore the use of only plain ViT features for MUAD. We first abstract a Meta-AD concept by synthesizing current reconstruction-based methods. Subsequently, we instantiate a novel ViT-based ViTAD structure, designed incrementally from both global and local perspectives. This model provides a strong baseline to facilitate future research. Additionally, this paper uncovers several intriguing findings for further investigation. Finally, we comprehensively and fairly benchmark various approaches using eight metrics. Utilizing a basic training regimen with only an MSE loss, ViTAD achieves state-of-the-art results and efficiency on MVTec AD, VisA, and Uni-Medical datasets. E.g., achieving 85.4 mAD that surpasses UniAD by +3.0 for the MVTec AD dataset, and it requires only 1.1 hours and 2.3G GPU memory to complete model training on a single V100 that can serve as a strong baseline to facilitate the development of future research. Full code is available at https://zhangzjn.github.io/projects/ViTAD/.

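Research Motivation and Objectives

Motivate MUAD as a practical setting in which a model is trained only on normal images spanning multiple classes.
Abstract a Meta-AD framework that unifies reconstruction-based anomaly detection methods.
Instantiate a symmetric ViTAD model built on plain ViTs and examine macro- and micro-level design choices.
Demonstrate robust performance and efficiency on standard AD benchmarks while analyzing the design factors.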

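Proposed Method

Formalize Meta-AD, which comprises a feature encoder, a fuser, and a decoder for reconstruction-based AD.
Instantiate ViTAD with the encoder and decoder as four-stage plain columnar ViTs, joined by a simple linear fuser.
Train with a single pixel-level loss over multi-stage features and generate anomaly maps.
Investigate macro-level design factors (skip connections, pre-training, stage usage) and micro-level details (normalization, linear fusion, positional encoding, CLS token).
At each stage, form an anomaly map from the cosine similarity between encoder and decoder features, and apply a combined loss across stages.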
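
The last step above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the use of 1 minus cosine similarity as the per-patch score, the averaging of per-stage maps, and the toy feature values are all assumptions made for illustration. Upsampling the patch-level map to the input image resolution is omitted.

```python
import math

def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cos(u, v) for two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)

def stage_anomaly_map(enc_feats, dec_feats):
    """Per-patch cosine distance for one stage.

    enc_feats, dec_feats: lists of patch vectors (num_patches x channels).
    """
    return [cosine_distance(e, d) for e, d in zip(enc_feats, dec_feats)]

def fused_anomaly_map(enc_stages, dec_stages):
    """Average the per-stage maps (averaging is an assumption of this sketch)."""
    maps = [stage_anomaly_map(e, d) for e, d in zip(enc_stages, dec_stages)]
    return [sum(scores) / len(maps) for scores in zip(*maps)]

def mse_loss(enc_stages, dec_stages):
    """Mean squared error over all stages, patches, and channels."""
    total, count = 0.0, 0
    for enc, dec in zip(enc_stages, dec_stages):
        for e, d in zip(enc, dec):
            for a, b in zip(e, d):
                total += (a - b) ** 2
                count += 1
    return total / count

# Toy data (hypothetical): 2 stages, 3 patches, 2 channels.
# The third patch is reconstructed poorly in both stages.
enc = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
       [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]]
dec = [[[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]],
       [[2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]]

scores = fused_anomaly_map(enc, dec)  # highest score at the third patch
loss = mse_loss(enc, dec)
```

Because a plain ViT keeps the same token resolution at every stage, the per-stage maps in this sketch can be averaged directly without resizing. A pyramidal encoder would need its maps resampled to a common size first.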

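Experimental Results

Research Questions

RQ1: Can a plain (non-pyramidal) ViT architecture achieve MUAD performance competitive with pyramid-based methods?
RQ2: How do ViTAD's macro- and micro-level design choices affect anomaly detection accuracy and localization in MUAD?
RQ3: How do the pre-training regime and the choice of features affect MUAD results?
RQ4: With plain ViT features, can a lightweight fuser deliver strong MUAD performance?
RQ5: Which evaluation benchmarks and metrics best reflect MUAD performance and efficiency?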

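Key Findings

With a plain ViT (ViTAD) and a simple fuser, state-of-the-art MUAD results on MVTec AD and VisA are achievable without complex pyramidal structures.
Feeding the fuser with last-stage features improves image-level metrics, while multi-stage features supply multi-scale information for localization.
DINO-based self-supervised pre-training yields higher MUAD performance than the other pre-training methods, and smaller patch sizes and higher resolution improve pixel-level metrics.
A lightweight linear fuser performs strongly, countering earlier claims that heavy fusion modules are necessary.
Keeping positional embeddings and omitting the CLS token slightly improves or maintains performance, while pre-normalization and other micro-level details have only subtle effects.
On the MUAD task, ViTAD reaches 85.4 mAD on MVTec AD with about 1.1 hours of training on a single V100 GPU, along with an image-level mAU-ROC of 98.3, a pixel-level mAU-ROC of 97.7, and the other metrics reported in the paper.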


This review was generated by AI and checked by a human editor.