QUICK REVIEW

[Paper Review] A Closer Look at the Explainability of Contrastive Language-Image Pre-training

Yi Li, Hualiang Wang|arXiv (Cornell University)|Apr 12, 2023

Multimodal Machine Learning Applications | Cited by 42

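One-Sentence Summary

This paper introduces CLIP Surgery, which improves the explainability of CLIP by modifying the architecture and features at inference time and delivers substantial gains on open-vocabulary tasks without retraining.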

ABSTRACT

Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions on the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy activations are owing to redundant features among categories. Building on these insights, we propose the CLIP Surgery for reliable CAM, a method that allows surgery-like modifications to the inference architecture and features, without further fine-tuning as classical CAM methods. This approach significantly improves the explainability of CLIP, surpassing existing methods by large margins. Besides, it enables multimodal visualization and extends the capacity of raw CLIP on open-vocabulary tasks without extra alignment. The code is available at https://github.com/xmed-lab/CLIP_Surgery.

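Research Motivation and Objectives

Identify why CLIP produces counterintuitive visualizations and noisy activations in its similarity maps.
Develop surgery-style inference techniques that correct the visualizations and suppress the noise without retraining.
Demonstrate improvements from the explainability framework across open-vocabulary segmentation, multi-label recognition, and multimodal visualization.
Show robustness across backbones (CNN and ViT) and datasets.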

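Proposed Method

Proposes CLIP Architecture Surgery, which replaces q-k self-attention with v-v self-attention at inference time and introduces a dual path that merges the outputs of multiple layers.
Introduces CLIP Feature Surgery, which removes redundant features by estimating and subtracting the common activations using an empty text prompt and category weights.
Analyzes the contributions of self-attention and the FFN to explain why opposite visualizations occur and where the noisy activations come from.
Provides inference-time modifications that require neither fine-tuning on labels nor backpropagation.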
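
As a rough illustration of the q-k to v-v replacement described above, the following NumPy sketch compares standard scaled dot-product attention with a variant in which the value projection also serves as query and key. It is a toy single-head example under stated assumptions: the random weights, the `1/sqrt(d)` scaling, and the function names are illustrative, and the dual path and multi-layer merging are not modeled. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qk_attention(q, k, v):
    # Standard self-attention: softmax(q k^T / sqrt(d)) v.
    scale = q.shape[-1] ** -0.5
    return softmax(q @ k.T * scale) @ v

def vv_attention(v):
    # v-v self-attention: the value features act as query and key too,
    # so the attention logits are a symmetric token-to-token similarity.
    return qk_attention(v, v, v)

# Toy inputs (assumed shapes: 5 tokens, 8 channels, random projections).
rng = np.random.default_rng(0)
tokens, dim = 5, 8
x = rng.normal(size=(tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
q, k, v = x @ w_q, x @ w_k, x @ w_v

out_qk = qk_attention(q, k, v)   # original inference path
out_vv = vv_attention(v)         # modified inference path
attn_vv = softmax(v @ v.T * dim ** -0.5)
```

Because the v-v logits are a symmetric similarity between value features, each token is weighted toward tokens whose value features resemble its own, which matches the review's claim that attention aligns with regions of the same semantics. No weights are changed, so the swap can be applied to a pretrained model at inference time.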

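Experimental Results

Research Questions

RQ1: Across backbones, what causes CLIP to produce visualizations that are opposite to the ground-truth foreground?
RQ2: What causes the noisy activations in CLIP's similarity maps, and can they be mitigated without retraining?
RQ3: Can inference-time interventions at the architecture and feature level improve explainability and open-vocabulary tasks across multiple datasets and backbones?
RQ4: How does CLIP Surgery affect performance on open-vocabulary semantic segmentation and multi-label recognition?
RQ5: Is the method applicable to multimodal visualization and interactive segmentation tools?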

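Key Findings

Opposite visualization is linked to the query-key (q-k) parameters of self-attention; replacing q-k self-attention with v-v self-attention at inference time aligns attention with regions of the same semantics.
Noisy activations stem from redundant CLIP features; removing them with CLIP Feature Surgery substantially reduces false-positive activations.
CLIP Surgery yields large explainability improvements across backbones (CNN and ViT) and datasets, with gains of up to 38.42% mIoU and 72.48% mSC on explainability metrics.
Open-vocabulary multi-label recognition improves by 4.41% mAP on NUS-Wide without additional training.
Open-vocabulary semantic segmentation improves by 8.74% mIoU on Cityscapes, and by 4.56% and 4.44% on COCO Stuff and PASCAL Context respectively (compared with the baseline).
The approach also benefits multimodal visualization and interactive segmentation tools (e.g., SAM).
The method operates at inference time, avoids fine-tuning, and applies broadly across backbones and tasks.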
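
The redundant-feature removal mentioned in the findings can be sketched roughly as follows. This is a simplified NumPy illustration, not the released code: it assumes L2-normalized patch and text features, treats the per-channel mean over categories of the element-wise image-text product as the "common activation", and defaults to uniform category weights. The function name, the synthetic features, and the weighting are hypothetical, and the empty-text-prompt variant is not modeled.

```python
import numpy as np

def normalize(x):
    # L2-normalize along the channel axis.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def similarity_without_redundancy(image_feats, text_feats, weights=None):
    # image_feats: (n_patches, dim), text_feats: (n_classes, dim).
    n_classes = text_feats.shape[0]
    if weights is None:
        weights = np.ones(n_classes)  # assumed uniform category weights
    # Element-wise products; summing over channels would give the plain
    # dot-product similarity between each patch and each category.
    prod = image_feats[:, None, :] * text_feats[None, :, :]
    prod = prod * weights[None, :, None]
    # Activation shared by all categories: mean over the category axis.
    redundant = prod.mean(axis=1, keepdims=True)
    # Subtract the shared part, then sum channels into a similarity map.
    return (prod - redundant).sum(axis=-1)

# Synthetic features with a strong component shared by every category.
rng = np.random.default_rng(0)
dim, n_patches, n_classes = 16, 4, 3
shared = rng.normal(size=dim)
text = normalize(shared + 0.5 * rng.normal(size=(n_classes, dim)))
image = normalize(shared + 0.5 * rng.normal(size=(n_patches, dim)))

raw = image @ text.T                               # plain similarity
clean = similarity_without_redundancy(image, text)
```

With uniform weights this reduces to subtracting each patch's mean similarity over categories, so only category-specific evidence remains in `clean`. Non-uniform weights would change how much each category contributes to the estimated common activation.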

This review was generated by AI and checked by a human editor.