QUICK REVIEW

[Paper Review] A Closer Look at the Explainability of Contrastive Language-Image Pre-training

Yi Li, Hualiang Wang|arXiv (Cornell University)|2023. 04. 12.

Multimodal Machine Learning Applications | Citations: 42

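One-Line Summary

The paper introduces CLIP Surgery, which modifies CLIP's architecture and features at inference time to improve its explainability and yields substantial gains on open-vocabulary tasks without retraining.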

ABSTRACT

Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions on the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy activations are owing to redundant features among categories. Building on these insights, we propose the CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features, without further fine-tuning as classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. Besides, it enables multimodal visualization and extends the capacity of raw CLIP on open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.

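Motivation and Objectives

Identify why CLIP's similarity maps show counterintuitive (opposite) visualizations and noisy activations.
Develop a surgery-style inference technique that corrects the visualization and suppresses noise without retraining.
Demonstrate the improvements of the explainability framework across open-vocabulary segmentation, multi-label recognition, and multimodal visualization.
Show robustness across backbones (CNN and ViT) and datasets.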

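Proposed Method

Proposes CLIP Architecture Surgery, which adds a dual-path structure to merge multi-layer outputs at inference time and replaces q-k self-attention with v-v self-attention.
Introduces CLIP Feature Surgery, which removes redundant features by estimating the activations common across categories, using an empty text prompt and category weights, and subtracting them.
Analyzes the contributions of self-attention and the FFN to explain why opposite visualizations occur and why noisy activations arise.
Applies all modifications at inference time, with no fine-tuning and no label-based backpropagation.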
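
The swap from q-k to v-v self-attention described in the list above can be illustrated with a minimal single-head sketch. It is not the authors' implementation: it uses NumPy with random matrices in place of CLIP's pretrained weights, the function names are hypothetical, and the dual-path merging of multi-layer outputs is left out.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def qk_self_attention(tokens, w_q, w_k, w_v):
    """Standard single-head attention: softmax(Q K^T * scale) V."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scale = q.shape[-1] ** -0.5
    attn = softmax((q @ k.T) * scale)
    return attn @ v, attn


def vv_self_attention(tokens, w_v):
    """Illustrative v-v variant: values also act as queries and keys."""
    v = tokens @ w_v
    scale = v.shape[-1] ** -0.5
    attn = softmax((v @ v.T) * scale)
    return attn @ v, attn


# Random stand-ins for patch tokens and projection weights (assumptions).
rng = np.random.default_rng(0)
num_tokens, dim = 5, 8
tokens = rng.normal(size=(num_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out_qk, attn_qk = qk_self_attention(tokens, w_q, w_k, w_v)
out_vv, attn_vv = vv_self_attention(tokens, w_v)
```

In the v-v variant the attention logits are inner products between value vectors, so the logit matrix is symmetric and no query or key weights are involved.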

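Experimental Results

Research Questions

RQ1: Why does CLIP, across backbones, produce visualizations that are opposite to the actual foreground?
RQ2: What causes the noisy activations in CLIP similarity maps, and can they be mitigated without retraining?
RQ3: Can inference-time interventions at the architecture and feature level improve explainability and open-vocabulary tasks across multiple datasets and backbones?
RQ4: How does CLIP Surgery affect performance on open-vocabulary semantic segmentation and multi-label recognition?
RQ5: Is the method applicable to multimodal visualization and interactive segmentation tools?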

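Key Findings

The opposite visualization is tied to the query-key (q-k) parameters of self-attention; replacing them with v-v self-attention at inference time aligns attention with regions of the same semantics.
Noisy activations stem from redundant CLIP features; removing them with CLIP Feature Surgery greatly reduces the spurious activations.
CLIP Surgery yields large explainability gains across backbones (CNNs and ViTs) and datasets, with improvements of up to 38.42% mIoU and 72.48% mSC on explainability metrics.
Without additional training, open-vocabulary multi-label recognition on NUS-Wide improves by 4.41% mAP.
mIoU improves by 8.74% on Cityscapes, and by 4.56% and 4.44% on COCO Stuff and PASCAL Context, respectively (relative to the baseline).
The approach also benefits multimodal visualization and interactive segmentation tools (such as SAM).
The method operates at inference time, requires no fine-tuning, and is broadly applicable across backbones and tasks.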
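
The redundant-feature removal behind the noise reduction above can be sketched as follows. This is a simplified illustration, not the paper's code: the features are random stand-ins for CLIP outputs, both function names are hypothetical, and the category weights default to uniform because this review does not say how they are computed.

```python
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def remove_common_component(image_feats, text_feats, class_weights=None):
    """Subtract the feature component shared by all categories.

    image_feats: (N_i, C) patch features, text_feats: (N_t, C) category features.
    Returns an (N_i, N_t) similarity map.
    """
    num_classes = text_feats.shape[0]
    if class_weights is None:
        class_weights = np.ones(num_classes)  # assumption: uniform weights
    # Element-wise products; summing over C would give the plain similarity.
    feats = image_feats[:, None, :] * text_feats[None, :, :]  # (N_i, N_t, C)
    feats = feats * class_weights[None, :, None]
    # The mean over categories is treated as the redundant (common) part.
    redundant = feats.mean(axis=1, keepdims=True)  # (N_i, 1, C)
    return (feats - redundant).sum(axis=-1)


def subtract_empty_prompt(image_feats, text_feats, empty_feat):
    """Alternative: use the feature of an empty text prompt as the common part."""
    return image_feats @ (text_feats - empty_feat).T


# Random stand-ins for CLIP patch, category, and empty-prompt features.
rng = np.random.default_rng(0)
image_feats = l2_normalize(rng.normal(size=(6, 16)))
text_feats = l2_normalize(rng.normal(size=(3, 16)))
empty_feat = l2_normalize(rng.normal(size=(16,)))

sim_raw = image_feats @ text_feats.T
sim_weighted = remove_common_component(image_feats, text_feats)
sim_empty = subtract_empty_prompt(image_feats, text_feats, empty_feat)
```

With uniform weights the first function reduces to subtracting each patch's mean similarity over categories, so an activation that is shared by every category drops out of the map.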

This review was generated by AI and reviewed by a human editor.