[Paper Digest] AG-VAS: Anchor-Guided Zero-Shot Visual Anomaly Segmentation with Large Multimodal Models
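AG-VAS introduces an absolute semantic anchor, [SEG], and two relative anchors, [NOR] and [ANO], to enable zero-shot visual anomaly segmentation with large multimodal models. Combined with SPAM alignment and AGMD mask decoding, it reports state-of-the-art results on industrial and medical benchmarks.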
Large multimodal models (LMMs) exhibit strong task generalization capabilities, offering new opportunities for zero-shot visual anomaly segmentation (ZSAS). However, existing LMM-based segmentation approaches still face fundamental limitations: anomaly concepts are inherently abstract and context-dependent, lacking stable visual prototypes, and the weak alignment between high-level semantic embeddings and pixel-level spatial features hinders precise anomaly localization. To address these challenges, we present AG-VAS (Anchor-Guided Visual Anomaly Segmentation), a new framework that expands the LMM vocabulary with three learnable semantic anchor tokens ([SEG], [NOR], and [ANO]), establishing a unified anchor-guided segmentation paradigm. Specifically, [SEG] serves as an absolute semantic anchor that translates abstract anomaly semantics into explicit, spatially grounded visual entities (e.g., holes or scratches), while [NOR] and [ANO] act as relative anchors that model the contextual contrast between normal and abnormal patterns across categories. To further enhance cross-modal alignment, we introduce a Semantic-Pixel Alignment Module (SPAM) that aligns language-level semantic embeddings with high-resolution visual features, along with an Anchor-Guided Mask Decoder (AGMD) that performs anchor-conditioned mask prediction for precise anomaly localization. In addition, we curate Anomaly-Instruct20K, a large-scale instruction dataset that organizes anomaly knowledge into structured descriptions of appearance, shape, and spatial attributes, facilitating effective learning and integration of the proposed semantic anchors. Extensive experiments on six industrial and medical benchmarks demonstrate that AG-VAS achieves consistent state-of-the-art performance in the zero-shot setting.
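To make the vocabulary-expansion idea concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the toy vocabulary, the embedding table, and the helpers `expand_vocab` and `anchor_positions` are all hypothetical names introduced for illustration. It only shows the general pattern of appending three new tokens, growing the embedding table by three learnable rows, and locating the anchors in a token sequence. That downstream modules would read the hidden states at those positions is an assumption.

```python
import random

# The three anchor tokens named in the abstract.
ANCHOR_TOKENS = ["[SEG]", "[NOR]", "[ANO]"]


def expand_vocab(vocab, embeddings, new_tokens, dim, seed=0):
    """Append new tokens to a toy vocabulary and grow the embedding table to match.

    Hypothetical helper: returns copies, leaving the inputs untouched.
    """
    rng = random.Random(seed)
    vocab = dict(vocab)
    embeddings = [row[:] for row in embeddings]
    for tok in new_tokens:
        if tok in vocab:
            continue  # already present, nothing to add
        vocab[tok] = len(vocab)
        # New rows start from small random values and would be learned in fine-tuning.
        embeddings.append([rng.gauss(0.0, 0.02) for _ in range(dim)])
    return vocab, embeddings


def anchor_positions(token_ids, vocab, anchors=ANCHOR_TOKENS):
    """Return {anchor: index of its first occurrence} for anchors found in a sequence."""
    id_to_anchor = {vocab[a]: a for a in anchors if a in vocab}
    found = {}
    for i, tid in enumerate(token_ids):
        name = id_to_anchor.get(tid)
        if name is not None and name not in found:
            found[name] = i
    return found


# Toy usage: a four-word vocabulary with 4-dimensional embeddings.
base_vocab = {"<pad>": 0, "the": 1, "scratch": 2, "is": 3}
base_emb = [[0.0] * 4 for _ in base_vocab]
vocab, emb = expand_vocab(base_vocab, base_emb, ANCHOR_TOKENS, dim=4)
ids = [vocab[t] for t in ["the", "scratch", "is", "[SEG]", "[NOR]", "[ANO]"]]
positions = anchor_positions(ids, vocab)
```

In a real LMM stack the same two steps would normally go through the tokenizer and model APIs of the framework in use rather than hand-written dictionaries.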
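Research Motivation and Goals
- Address the lack of stable visual prototypes and the weak cross-modal alignment in zero-shot visual anomaly segmentation.
- Introduce learnable semantic anchors to bridge LMM embeddings and pixel-level segmentation.
- Develop cross-modal alignment and anchor-conditioned mask decoding modules that produce binary anomaly masks.
- Curate Anomaly-Instruct20K to inject anomaly-related world knowledge into the LMM through instruction tuning.
- Demonstrate state-of-the-art ZSAS performance on industrial and medical datasets without category-specific retraining.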
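Proposed Method
- Introduce an absolute semantic anchor, [SEG], that encodes the appearance, structure, and location of anomalies.
- Include the relative anchors [NOR] and [ANO] to model the contextual contrast between normal and abnormal patterns.
- Add a Semantic-Pixel Alignment Module (SPAM) that aligns LMM semantic embeddings with high-resolution pixel features.
- Develop an Anchor-Guided Mask Decoder (AGMD) that takes the refined anchor embeddings and generates binary anomaly masks through bidirectional cross-attention.
- Train with a multi-task objective that combines a text autoregressive loss with segmentation losses (binary cross-entropy plus Dice) applied across the anchors.
- Create Anomaly-Instruct20K to inject structured anomaly knowledge into instruction tuning for segmentation.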
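The multi-task objective in the list above can be sketched as follows. This is a minimal illustration in plain Python, not the paper's training code. The weights `lambda_bce` and `lambda_dice`, the function names, and the way per-anchor predictions are passed in are all assumptions, since the digest does not give the weighting or the per-anchor supervision targets. Masks are represented as flattened lists of per-pixel probabilities and 0/1 labels.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flattened per-pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2|P.T| + s) / (|P| + |T| + s)."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)


def total_loss(text_loss, anchor_preds, lambda_bce=1.0, lambda_dice=1.0):
    """Combine a text autoregressive loss with BCE + Dice summed over anchors.

    anchor_preds maps an anchor name to (probs, targets); which target each
    anchor is supervised with is an assumption left to the caller.
    """
    seg = 0.0
    for probs, targets in anchor_preds.values():
        seg += lambda_bce * bce_loss(probs, targets)
        seg += lambda_dice * dice_loss(probs, targets)
    return text_loss + seg
```

A caller would pass the language-model loss as `text_loss` and one `(probs, targets)` pair per anchor that produces a mask. Dice is commonly paired with BCE because it is less sensitive to the heavy imbalance between the few anomalous pixels and the many normal ones.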
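Experimental Results
Research Questions
- RQ1: Can learnable semantic anchors connect high-level LMM semantics with pixel-level segmentation in ZSAS?
- RQ2: How does SPAM improve cross-modal alignment between semantic embeddings and pixel features?
- RQ3: In the zero-shot setting, can anchor-guided decoding reliably produce binary anomaly masks across industrial and medical domains?
- RQ4: How does the instruction-tuning data (Anomaly-Instruct20K) affect zero-shot generalization?
- RQ5: Can AG-VAS localize anomalies accurately while rejecting normal samples?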
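Key Findings
- AG-VAS achieves state-of-the-art zero-shot anomaly segmentation on six industrial and medical benchmarks.
- In the ablations, removing [SEG] degrades all metrics, whereas removing [NOR]/[ANO] mainly hurts the normal-versus-abnormal contrast metrics.
- SPAM improves alignment and mask quality; removing it lowers performance.
- Anomaly-Instruct20K and Anomaly-Seg20K contribute substantially to anomaly segmentation performance, beyond what general segmentation data provides.
- Direct segmentation usually outperforms the describe-then-segment mode, although Describe-then-Segment-Plus can improve contextual understanding.
- The model rejects normal samples strongly (IoU_nor reaches up to 87.7% in the reported results) while maintaining robust anomaly localization (IoU_ano of roughly 45%).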
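The digest does not define IoU_nor and IoU_ano. One plausible reading, assumed here and not confirmed by the source, is a per-class intersection-over-union on binary masks, with the anomalous class labeled 1 and the normal class labeled 0. The sketch below illustrates that reading only. The function name `iou` and the convention of returning 1.0 when a class is absent from both masks are choices made for this example.

```python
def iou(pred, target, positive=1):
    """Intersection-over-union for one class of a flattened binary mask.

    Returns 1.0 when the class is absent from both prediction and ground
    truth (a convention chosen for this sketch).
    """
    inter = sum(1 for p, t in zip(pred, target) if p == positive and t == positive)
    union = sum(1 for p, t in zip(pred, target) if p == positive or t == positive)
    return 1.0 if union == 0 else inter / union


# Toy 4-pixel example: one true anomalous pixel, two predicted.
pred = [1, 1, 0, 0]
gt = [1, 0, 0, 0]
iou_ano = iou(pred, gt, positive=1)  # anomalous class
iou_nor = iou(pred, gt, positive=0)  # normal class
```

Under this reading, a model that over-predicts anomalies on clean images is penalized in IoU_nor, which is why the two numbers are reported side by side.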
This digest was generated by AI and reviewed by human editors.