QUICK REVIEW

[Paper Review] VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection

Yuxin Jiang, Yunkang Cao | arXiv (Cornell University) | Jan 23, 2026

Anomaly Detection Techniques and Applications | Cited by 0

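One-Sentence Summary

VTFusion introduces adaptive image and text feature extractors and a dedicated multimodal fusion module to address the domain gap and semantic misalignment in few-shot anomaly detection, achieving strong image-level AUROC on MVTec AD and VisA and a strong AUPRO on a real-world industrial automotive plastic parts dataset.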

ABSTRACT

Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features pre-trained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both image and text modalities are introduced to learn task-specific representations, bridging the domain gap between pre-trained models and industrial data; this is further augmented by generating diverse synthetic anomalies to enhance feature discriminability. Second, a dedicated multimodal prediction fusion module is developed, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps under multimodal guidance. VTFusion significantly advances FSAD performance, achieving image-level AUROCs of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. Furthermore, VTFusion achieves an AUPRO of 93.5% on a real-world dataset of industrial automotive plastic parts introduced in this paper, further demonstrating its practical applicability in demanding industrial scenarios.

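Research Motivation and Objectives

Motivate few-shot anomaly detection (FSAD) for industrial settings, where domain-specific semantics matter beyond what standard natural-scene features capture.
Propose adaptive, task-specific visual and text feature extractors to bridge the domain gap between pre-trained models and industrial data.
Develop a dedicated multimodal fusion module that enables robust cross-modal information exchange and refines pixel-level anomaly maps under multimodal guidance.
Improve the discriminability of feature representations through synthetic anomaly generation.
Demonstrate VTFusion's effectiveness on industrial datasets with demanding accuracy requirements.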

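Proposed Method

Adaptive image and text feature extractors that learn task-specific representations to bridge the domain gap with industrial data.
Generation of diverse synthetic anomalies to improve feature discriminability.
A multimodal prediction fusion module with a fusion block for cross-modal information exchange.
A segmentation network that produces refined pixel-level anomaly maps under multimodal guidance.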
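
This summary does not say how the synthetic anomalies are produced. As a rough illustration only, the sketch below uses a CutPaste-style augmentation (copy a patch and paste it elsewhere in the same image), which is a stand-in technique and not necessarily the paper's procedure. The function name `paste_synthetic_anomaly`, its parameters, and the toy 8x8 image are all hypothetical.

```python
import random


def paste_synthetic_anomaly(image, patch_h, patch_w, rng):
    """CutPaste-style stand-in: copy a random patch to a random location.

    image: 2D list of grayscale values (left unmodified).
    Returns (augmented_image, mask); mask is 1 on pasted pixels, 0 elsewhere.
    Source and destination may coincide, so pasted pixels are not
    guaranteed to differ from the original values.
    """
    h, w = len(image), len(image[0])
    if patch_h > h or patch_w > w:
        raise ValueError("patch must fit inside the image")
    # Choose source and destination top-left corners independently.
    src_r = rng.randrange(h - patch_h + 1)
    src_c = rng.randrange(w - patch_w + 1)
    dst_r = rng.randrange(h - patch_h + 1)
    dst_c = rng.randrange(w - patch_w + 1)
    # Copy the source patch before writing, so overlap cannot corrupt it.
    patch = [row[src_c:src_c + patch_w] for row in image[src_r:src_r + patch_h]]
    augmented = [row[:] for row in image]
    mask = [[0] * w for _ in range(h)]
    for i in range(patch_h):
        for j in range(patch_w):
            augmented[dst_r + i][dst_c + j] = patch[i][j]
            mask[dst_r + i][dst_c + j] = 1
    return augmented, mask


# Toy usage: an 8x8 gradient image with a 3x2 pasted patch.
rng = random.Random(0)
image = [[r * 8 + c for c in range(8)] for r in range(8)]
augmented, mask = paste_synthetic_anomaly(image, 3, 2, rng)
```

The returned mask can serve as pixel-level supervision for a segmentation network, since the location of the pasted region is known by construction.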

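Experimental Results

Research Questions

RQ1: How do adaptive visual and text feature extractors bridge the domain gap between pre-trained models and industrial inspection data in FSAD?
RQ2: Does a dedicated multimodal fusion module improve robustness to vision-text cross-modal misalignment?
RQ3: Does synthetic anomaly generation improve feature discriminability and downstream anomaly localization performance?
RQ4: What improvements do multimodal guidance and segmentation bring to pixel-level anomaly maps on industrial datasets?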

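Key Findings

In the 2-shot setting, achieves image-level AUROCs of 96.8% (MVTec AD) and 86.2% (VisA).
Shows strong anomaly localization performance through pixel-level segmentation maps guided by multimodal information.
Outperforms baseline methods by combining adaptive feature extractors with a robust multimodal fusion and prediction framework.
Achieves a competitive AUPRO of 93.5% on a real-world dataset of industrial automotive plastic parts.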
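
For readers unfamiliar with the image-level AUROC reported above, here is a minimal sketch of how such a score can be computed. It uses the pairwise-ranking definition: AUROC is the probability that a randomly chosen anomalous image gets a higher score than a randomly chosen normal one, with ties counted as half. Taking the maximum of a pixel-level anomaly map as the image score is a common convention assumed here, not something this summary confirms for VTFusion. The helper names and toy numbers are hypothetical.

```python
def image_score_from_map(anomaly_map):
    """Assumed convention: image-level score = max pixel anomaly score."""
    return max(max(row) for row in anomaly_map)


def auroc(labels, scores):
    """AUROC via pairwise comparisons (label 1 = anomalous, 0 = normal)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # anomalous sample correctly ranked higher
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy usage with made-up scores for two normal and two anomalous images.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auroc(labels, scores))
```

The quadratic loop is kept for clarity; on large test sets a sort-based or library implementation would be the practical choice.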

This review was generated by AI and reviewed by a human editor.