QUICK REVIEW

[Paper Review] PromptMAD: Cross-Modal Prompting for Multi-Class Visual Anomaly Localization

Duncan McCain, Hossein Kashiani|arXiv (Cornell University)|Jan 30, 2026

Anomaly Detection Techniques and Applications|Citations: 0

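One-Sentence Summary

PromptMAD combines CLIP-based cross-modal prompts with a diffusion-refined segmentor and achieves state-of-the-art pixel-level anomaly localization across multiple classes.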

ABSTRACT

Visual anomaly detection in multi-class settings poses significant challenges due to the diversity of object categories, the scarcity of anomalous examples, and the presence of camouflaged defects. In this paper, we propose PromptMAD, a cross-modal prompting framework for unsupervised visual anomaly detection and localization that integrates semantic guidance through vision-language alignment. By leveraging CLIP-encoded text prompts describing both normal and anomalous class-specific characteristics, our method enriches visual reconstruction with semantic context, improving the detection of subtle and textural anomalies. To further address the challenge of class imbalance at the pixel level, we incorporate Focal loss function, which emphasizes hard-to-detect anomalous regions during training. Our architecture also includes a supervised segmentor that fuses multi-scale convolutional features with Transformer-based spatial attention and diffusion iterative refinement, yielding precise and high-resolution anomaly maps. Extensive experiments on the MVTec-AD dataset demonstrate that our method achieves state-of-the-art pixel-level performance, improving mean AUC to 98.35% and AP to 66.54%, while maintaining efficiency across diverse categories.

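Motivation and Objectives

Address multi-class visual anomaly detection and localization in industrial settings, including camouflaged and textural defects.
Improve reconstruction quality with semantic guidance from vision-language alignment.
Mitigate pixel-level class imbalance with Focal loss, and refine anomaly maps with diffusion and multi-stage attention.
Provide a unified model that works well across diverse objects and textures without an excessive increase in computation.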
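
To make the class-imbalance point concrete, the following is a minimal sketch of a per-pixel binary focal loss in plain Python. The review does not state the hyperparameters PromptMAD uses, so `gamma=2.0` and `alpha=0.25` are assumed values commonly seen in the focal loss literature, and the function names are hypothetical.

```python
import math


def focal_loss(prob, target, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for a single pixel.

    prob   : predicted probability that the pixel is anomalous (0..1)
    target : 1 for an anomalous pixel, 0 for a normal pixel
    gamma, alpha : assumed defaults, not values taken from the paper
    """
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    p_t = prob if target == 1 else 1.0 - prob  # probability of the true class
    alpha_t = alpha if target == 1 else 1.0 - alpha  # class weighting term
    # (1 - p_t) ** gamma shrinks the loss of well-classified pixels
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, targets, **kwargs):
    """Average focal loss over a flattened anomaly map."""
    losses = [focal_loss(p, t, **kwargs) for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)
```

With `gamma=0` the modulating factor disappears and the expression reduces to an alpha-weighted cross-entropy, which is a quick way to sanity-check an implementation. Larger `gamma` values push the training signal toward pixels the model still gets wrong, such as small or camouflaged defects.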

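Proposed Method

Uses a prompt-based reconstruction framework inspired by OneNIP.
Incorporates CLIP-based cross-modal prompts that describe normal and anomalous characteristics to guide reconstruction.
Introduces a text-guided segmentor that combines multi-scale CNN features, Transformer spatial attention, and diffusion-based refinement.
Applies Focal loss during training to emphasize hard-to-detect anomalous pixels.
Fuses CLIPText embeddings into the visual prompts and the bidirectional decoder for semantic guidance.
Uses a diffusion refinement process conditioned on the text prompts (10 steps, linear beta schedule) to sharpen anomaly boundaries.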
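
The 10-step linear beta schedule mentioned for the diffusion refinement can be sketched as below. Only the step count and the linear shape come from the review; the endpoints `beta_start=1e-4` and `beta_end=0.02` are assumed placeholder values, and the function names are hypothetical.

```python
def linear_beta_schedule(num_steps=10, beta_start=1e-4, beta_end=0.02):
    """Return num_steps noise variances spaced linearly between the endpoints.

    beta_start and beta_end are assumptions, not values reported for PromptMAD.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta_t): the fraction of signal kept after t steps."""
    alpha_bars = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_beta_schedule()
alpha_bars = cumulative_alphas(betas)
```

The cumulative products decrease monotonically, so each later step corresponds to a noisier version of the anomaly map. A short schedule like this keeps the refinement cheap at inference time, which fits the efficiency goal stated in the review.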

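Experimental Results

Research Questions

RQ1 Can cross-modal (text + image) prompts improve pixel-level localization across multiple object and texture categories in unsupervised anomaly detection?
RQ2 Does integrating a text-guided, diffusion-refined segmentor yield sharper and more accurate anomaly maps than a visual-prompt-only baseline?
RQ3 How does Focal loss affect the detection of rare and camouflaged anomalies in the multi-class setting?
RQ4 How does inference efficiency on MVTec-AD compare with strong baselines?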
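
These questions are answered with pixel-level AUC and AP. As an illustration of what the two metrics measure, here is a small plain-Python sketch that scores a flattened anomaly map against a ground-truth mask. It is not the paper's evaluation code: the function names are hypothetical, the AUC uses a simple pairwise comparison that is only practical for small inputs, and the AP does not treat tied scores specially.

```python
def pixel_auroc(scores, labels):
    """Probability that a random anomalous pixel outscores a random normal one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal pixel")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


def pixel_average_precision(scores, labels):
    """Mean of the precision values measured at each anomalous pixel's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one anomalous pixel")
    return precision_sum / hits


# Toy anomaly map: five pixels, two of them anomalous.
toy_scores = [0.9, 0.8, 0.3, 0.2, 0.1]
toy_labels = [1, 0, 1, 0, 0]
toy_auroc = pixel_auroc(toy_scores, toy_labels)
toy_ap = pixel_average_precision(toy_scores, toy_labels)
```

AP depends on how precise the top-ranked pixels are, so it tends to be much lower than AUC when anomalous pixels are rare. That is consistent with the gap between the AUC and AP columns reported for MVTec-AD.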

Key Findings

Class	AUC (OneNIP)	AP (OneNIP)	P-AUC (OneNIP)	P-AP (OneNIP)	AUC (PromptMAD)	AP (PromptMAD)	P-AUC (PromptMAD)	P-AP (PromptMAD)
Bottle	99.84	82.62	98.61	82.62	100.00	100.00	99.01	85.13
Cable	97.53	63.48	97.94	63.48	98.02	65.01	98.66	65.01
Capsule	85.08	50.06	88.59	50.06	98.50	?	98. -	?
Carpet	99.68	68.39	99.20	68.39	98.88	70.31	99.77	70.31
Grid	98.83	45.43	99.16	45.43	98.34	42.31	99.74	42.31
Hazelnut	100.00	72.92	100.00	72.92	99.46	82.45	99.77	82.45
Leather	100.00	71.07	100.00	71.07	99.67	71.62	100.00	71.62
Metal Nut	99.51	77.66	99.90	77.66	98.58	88.04	99.97	88.04
Pill	96.34	43.60	93.12	43.60	96.47	56.13	98.72	56.13
Screw	91.10	37.79	90.61	37.79	98.64	34.39	99.97	34.39
Tile	99.96	77.44	99.60	77.44	98.13	86.74	99.84	86.74
Toothbrush	93.33	52.94	97.50	52.94	98.88	57.07	99.06	57.07
Transistor	99.75	82.69	99.75	82.69	98.34	77.06	99.63	77.06
Wood	97.98	67.71	99.04	67.71	96.73	73.09	99.68	73.09
Zipper	99.03	59.04	99.87	59.04	97.61	59.89	?	?
Mean	97.20	63.52	97.62	63.52	98.35	66.54	98. ?	66.54

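Mean pixel-level AUC improves from 97.81% for the OneNIP baseline to 98.35% for PromptMAD.
Mean pixel-level AP improves from 63.52% for the OneNIP baseline to 66.54% for PromptMAD.
At the image level, the maximum AUC rises to 97.62% and the maximum AP to 99.23% with PromptMAD.
Semantic cross-modal guidance gives PromptMAD notable gains on texture classes (e.g., Hazelnut, Tile, Metal Nut, Pill).
Inference remains real-time: 193 FPS on four A100 GPUs, compared with 219 FPS for the OneNIP baseline.
In the ablation study, integrating the segmentor, cross-modal prompts, and Focal loss together gives the best pixel-level AP and AUC for most classes.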
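
The size of these differences is easier to judge with a little arithmetic. The sketch below uses only the figures quoted in the findings above; the helper names are hypothetical.

```python
def absolute_gain(baseline, proposed):
    """Difference in percentage points (or in FPS) between two reported values."""
    return proposed - baseline


def relative_change(baseline, proposed):
    """Change relative to the baseline, as a fraction."""
    return (proposed - baseline) / baseline


# Values as quoted in the findings for OneNIP (baseline) and PromptMAD.
auc_gain = absolute_gain(97.81, 98.35)      # pixel-level mean AUC, in points
ap_gain = absolute_gain(63.52, 66.54)       # pixel-level mean AP, in points
fps_change = relative_change(219.0, 193.0)  # throughput change vs. baseline

print(f"AUC gain: {auc_gain:.2f} points")   # 0.54
print(f"AP gain: {ap_gain:.2f} points")     # 3.02
print(f"FPS change: {fps_change:.1%}")      # -11.9%
```

On these numbers, PromptMAD gains about 3 AP points at the cost of roughly 12% lower throughput than the OneNIP baseline.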


This review was generated by AI and checked by a human editor.