QUICK REVIEW

[Paper Review] VTFusion: A Vision-Text Multimodal Fusion Network for Few-Shot Anomaly Detection

Yuxin Jiang, Yunkang Cao|arXiv (Cornell University)|2026. 01. 23.

Anomaly Detection Techniques and Applications | Citations: 0

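One-Line Summary

VTFusion addresses the domain gap and semantic misalignment in few-shot anomaly detection by introducing adaptive image and text feature extractors and a dedicated multimodal fusion module, achieving strong image-level AUROC and AUPRO scores on industrial datasets.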

ABSTRACT

Few-Shot Anomaly Detection (FSAD) has emerged as a critical paradigm for identifying irregularities using scarce normal references. While recent methods have integrated textual semantics to complement visual data, they predominantly rely on features pre-trained on natural scenes, thereby neglecting the granular, domain-specific semantics essential for industrial inspection. Furthermore, prevalent fusion strategies often resort to superficial concatenation, failing to address the inherent semantic misalignment between visual and textual modalities, which compromises robustness against cross-modal interference. To bridge these gaps, this study proposes VTFusion, a vision-text multimodal fusion framework tailored for FSAD. The framework rests on two core designs. First, adaptive feature extractors for both image and text modalities are introduced to learn task-specific representations, bridging the domain gap between pre-trained models and industrial data; this is further augmented by generating diverse synthetic anomalies to enhance feature discriminability. Second, a dedicated multimodal prediction fusion module is developed, comprising a fusion block that facilitates rich cross-modal information exchange and a segmentation network that generates refined pixel-level anomaly maps under multimodal guidance. VTFusion significantly advances FSAD performance, achieving image-level AUROCs of 96.8% and 86.2% in the 2-shot scenario on the MVTec AD and VisA datasets, respectively. Furthermore, VTFusion achieves an AUPRO of 93.5% on a real-world dataset of industrial automotive plastic parts introduced in this paper, further demonstrating its practical applicability in demanding industrial scenarios.
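The abstract mentions generating diverse synthetic anomalies from normal images but does not say how. As a minimal sketch of the general idea, the snippet below uses a CutPaste-style augmentation (a different, well-known technique swapped in for illustration, not necessarily what VTFusion does): a patch is copied from one location of a normal image to another, and a pixel mask records where the pasted region sits. The function name `cut_paste`, the patch size, and the toy 8x8 "image" are all assumptions.

```python
import random

def cut_paste(image, patch_h, patch_w, rng):
    """CutPaste-style synthetic anomaly (illustrative, not the paper's method).

    image: 2D list of pixel values from a normal sample.
    Returns (augmented_image, mask), where mask is 1 on the pasted region.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]          # copy so the input stays untouched
    mask = [[0] * w for _ in range(h)]
    # Random source and target top-left corners that keep the patch in bounds.
    sy, sx = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    ty, tx = rng.randrange(h - patch_h + 1), rng.randrange(w - patch_w + 1)
    for dy in range(patch_h):
        for dx in range(patch_w):
            out[ty + dy][tx + dx] = image[sy + dy][sx + dx]
            mask[ty + dy][tx + dx] = 1
    return out, mask

# Toy normal "image": an 8x8 grid with a unique value per pixel.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
augmented, mask = cut_paste(image, patch_h=3, patch_w=3, rng=random.Random(0))
```

The mask gives a pixel-level training target even though no real defect was ever observed, which is what makes synthetic anomalies useful when only a few normal references are available.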

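Motivation and Objectives

Enable few-shot anomaly detection (FSAD) in industrial settings by exploiting domain-specific semantics rather than relying only on features pre-trained on natural scenes.
Propose adaptive, task-specific image and text feature extractors that bridge the domain gap between pre-trained models and industrial data.
Develop a dedicated multimodal fusion module that enables robust cross-modal exchange and refines pixel-level anomaly maps under multimodal guidance.
Generate synthetic anomalies to improve feature representations and strengthen discriminability.
Demonstrate VTFusion's effectiveness on industrial datasets with strict accuracy requirements.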

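Proposed Method

Adaptive image and text feature extractors that learn task-specific representations to bridge the domain gap with industrial data.
Generation of diverse synthetic anomalies to improve feature discriminability.
A multimodal prediction fusion module containing a fusion block for cross-modal information exchange.
A segmentation network that produces refined pixel-level anomaly maps under multimodal guidance.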
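To make "cross-modal information exchange" more concrete, the sketch below lets each visual patch feature attend over a small set of text embeddings with plain scaled dot-product cross-attention and adds the result back as a residual. This is not VTFusion's actual fusion block, whose internals this review does not describe; the function names (`cross_attend`, `softmax`, `dot`), the residual connection, and the toy vectors are assumptions for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cross_attend(patches, texts):
    """Hypothetical fusion step: visual patches query the text embeddings.

    patches: list of d-dim visual patch features (queries)
    texts:   list of d-dim text embeddings (keys and values)
    Returns patch features with a text-weighted context added residually.
    """
    d = len(patches[0])
    scale = 1.0 / math.sqrt(d)
    fused = []
    for p in patches:
        # Attention weights of this patch over all text embeddings.
        weights = softmax([dot(p, t) * scale for t in texts])
        # Weighted sum of text embeddings (the cross-modal context).
        context = [sum(w * t[i] for w, t in zip(weights, texts)) for i in range(d)]
        fused.append([pi + ci for pi, ci in zip(p, context)])
    return fused

# Toy example: two patches and two text embeddings
# (standing in for, e.g., "normal" and "anomalous" prompts).
patches = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attend(patches, texts)
```

In this toy case each patch is pulled toward the text embedding it already resembles most. A learned fusion block would add trainable projections and typically exchange information in both directions, which plain concatenation of the two feature sets does not do.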

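Experimental Results

Research Questions

RQ1: How can adaptive vision and text feature extractors bridge the domain gap between pre-trained models and industrial inspection data in FSAD?
RQ2: Can a dedicated multimodal fusion module improve robustness to cross-modal misalignment in vision-text FSAD?
RQ3: Does synthetic anomaly generation improve feature discriminability and downstream anomaly localization?
RQ4: What improvements do multimodal guidance and segmentation bring to pixel-level anomaly maps on industrial datasets?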

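Key Results

Achieves image-level AUROCs of 96.8% (MVTec AD) and 86.2% (VisA) in the 2-shot setting.
Demonstrates strong anomaly localization through pixel-level segmentation maps informed by multimodal information.
Outperforms baselines by combining adaptive feature extractors with a robust multimodal fusion and prediction framework.
Reaches an AUPRO of 93.5% on the real-world industrial automotive plastic parts dataset introduced in the paper.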
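For readers unfamiliar with the headline metric, image-level AUROC is the probability that a randomly chosen anomalous image receives a higher anomaly score than a randomly chosen normal one, with ties counted as half. The sketch below computes it by direct pairwise comparison; the function name `image_auroc` and the example labels and scores are made up for illustration and are unrelated to the paper's data.

```python
def image_auroc(labels, scores):
    """Image-level AUROC via pairwise comparison.

    labels: 1 for anomalous images, 0 for normal images.
    scores: one anomaly score per image (higher means more anomalous).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one normal and one anomalous image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # anomalous image ranked above normal image
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical scores for three normal and two anomalous images.
labels = [0, 0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.30]
auroc = image_auroc(labels, scores)  # 4 of 6 pairs ranked correctly
```

The pairwise loop is quadratic and only meant to expose the definition; in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used. AUPRO, the localization metric quoted for the automotive dataset, is a separate region-overlap measure and is not covered by this sketch.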


This review was generated by AI and reviewed by a human editor.