QUICK REVIEW

[Paper Review] Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

Chuan Qin, Constantin Venhoff|arXiv (Cornell University)|Jan 27, 2026

Multimodal Machine Learning Applications|Citations: 0

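One-Sentence Summary

Sparse CLIP trains CLIP with sparsity enforced in the forward pass and shows that this yields interpretable multimodal features without sacrificing downstream performance; it also demonstrates applications such as vision-based steering in vision-language models.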

ABSTRACT

Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP's dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP's inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant. Compared to SAEs, our Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability, and retain multimodal capabilities. We show that multimodal sparse features enable straightforward semantic concept alignment and reveal training dynamics of how cross-modal knowledge emerges. Finally, as a proof of concept, we train a vision-language model on sparse CLIP representations that enables interpretable, vision-based steering capabilities. Our findings challenge conventional wisdom that interpretability requires sacrificing accuracy and demonstrate that interpretability and performance can be co-optimized, offering a promising design principle for future models.

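Motivation and Objectives

Motivate and address the interpretability challenges posed by CLIP's dense latent space.
Investigate whether sparsity can be built into CLIP training without sacrificing accuracy.
Develop and evaluate a sparsity-enabled CLIP model that retains both multimodal capabilities and interpretability.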

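Proposed Method

Apply a ReLU after the final projection to impose a non-negativity constraint, and greatly expand the embedding dimension during CLIP training to induce sparsity.
Frame sparsity through a dictionary-learning view that connects non-negative contrastive learning to non-negative matrix factorization (NMF), which favors sparse representations.
Run small-scale ablations over the embedding dimension, the sparsity-inducing method, and the logit-scale lower bound, measuring their effect on sparsity and zero-shot performance.
Scale up to ViT-L/14 on the 2.2B MetaCLIP dataset with a 55,296-dimensional sparse representation (an embedding expansion factor of 72, since 72 × 768 = 55,296).
Evaluate interpretability with Clarity and multimodality measurements, comparing against Sparse Autoencoders (SAEs) and dense baselines.
Build a vision-language model (VLM) on Sparse CLIP features and explore vision-based steering by adjusting feature activations.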
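
The first and third method steps can be illustrated with a minimal pure-Python sketch: a ReLU after the final projection, an expanded embedding dimension, a symmetric contrastive loss, and a logit-scale floor. This is not the authors' implementation. The dimensions, the random weights, the floor value `SCALE_FLOOR`, and all function names are assumptions chosen only for illustration.

```python
import math
import random


def relu_project(features, weight):
    """Linear projection followed by ReLU, so every output coordinate is >= 0."""
    return [max(0.0, sum(w * x for w, x in zip(row, features))) for row in weight]


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit length (eps guards against an all-zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / (norm + eps) for v in vec]


def clip_loss(img_emb, txt_emb, logit_scale):
    """Symmetric InfoNCE loss over a batch of matched (image, text) pairs."""
    logits = [[logit_scale * sum(a * b for a, b in zip(i, t)) for t in txt_emb]
              for i in img_emb]

    def diag_cross_entropy(rows):
        # The correct match for row k is column k.
        total = 0.0
        for k, row in enumerate(rows):
            m = max(row)
            total += m + math.log(sum(math.exp(v - m) for v in row)) - row[k]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(columns))


def active_fraction(embeddings):
    """Fraction of embedding coordinates that are strictly positive."""
    total = sum(len(e) for e in embeddings)
    return sum(1 for e in embeddings for v in e if v > 0.0) / total


# Toy setup (all sizes are hypothetical, far smaller than the paper's).
rng = random.Random(0)
BACKBONE_DIM, EXPANSION, BATCH = 8, 4, 4
EMBED_DIM = BACKBONE_DIM * EXPANSION  # expanded embedding dimension


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


w_img = random_matrix(EMBED_DIM, BACKBONE_DIM)
w_txt = random_matrix(EMBED_DIM, BACKBONE_DIM)
img_feats = random_matrix(BATCH, BACKBONE_DIM)
txt_feats = random_matrix(BATCH, BACKBONE_DIM)

img_emb = [l2_normalize(relu_project(f, w_img)) for f in img_feats]
txt_emb = [l2_normalize(relu_project(f, w_txt)) for f in txt_feats]

# Logit-scale floor: clamp the learned scale from below (value is a placeholder).
SCALE_FLOOR = 10.0
log_scale = math.log(1.0 / 0.07)
logit_scale = max(math.exp(log_scale), SCALE_FLOOR)

loss = clip_loss(img_emb, txt_emb, logit_scale)
print("loss:", loss)
print("active fraction:", active_fraction(img_emb + txt_emb))
```

With untrained random weights, roughly half of the coordinates stay active after the ReLU. The sub-1% activation sparsity reported in the review is an outcome of training at scale, which this sketch does not reproduce.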

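Experimental Results

Research Questions

RQ1 Can sparsity be incorporated natively into CLIP training while maintaining or improving downstream performance?
RQ2 Do CLIP representations trained with sparsity offer better interpretability and multimodality than post-hoc sparse methods such as SAEs?
RQ3 How do sparse CLIP features align with human-interpretable concepts across modalities?
RQ4 Do sparse CLIP representations enable practical VLM applications such as interpretable steering?
RQ5 How do concepts emerge and evolve during sparse CLIP training?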

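Key Findings

Sparse CLIP models maintain competitive zero-shot and fine-grained classification performance even at extreme sparsity: activation sparsity of 0.66% and 0.47% for ViT-L/14 Sparse and Sparse+, respectively.
Sparse CLIP features are predominantly multimodal, tending to activate on both image and text inputs, unlike those of many SAE-based methods.
Sparse CLIP enables concept labeling by linking each feature to its top-activating words from a large vocabulary, and multimodal concepts show a high correlation between text and visual activations.
Training-time sparsity yields interpretable representations with higher Clarity than open-weight SAEs.
A vision-language model built on Sparse CLIP features matches the baseline on image QA benchmarks and also exhibits vision-based steering.
The concept-emergence study finds that multimodal features appear early and evolve during training, and that some features change meaning substantially over time.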
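
The analyses behind these findings can be pictured with a small sketch: measuring the fraction of active coordinates, checking which modalities activate each feature, labeling a feature by its top-activating vocabulary words, and steering by rescaling one feature. The toy activations, the threshold, and every function name here are hypothetical. The paper's actual Clarity and multimodality metrics may be defined differently.

```python
def active_fraction(embeddings, threshold=0.0):
    """Fraction of coordinates above the threshold across all embeddings."""
    total = sum(len(e) for e in embeddings)
    return sum(1 for e in embeddings for v in e if v > threshold) / total


def feature_modality(image_acts, text_acts, threshold=0.0):
    """Label each feature by which modalities ever activate it."""
    labels = []
    for j in range(len(image_acts[0])):
        fires_img = any(row[j] > threshold for row in image_acts)
        fires_txt = any(row[j] > threshold for row in text_acts)
        if fires_img and fires_txt:
            labels.append("multimodal")
        elif fires_img:
            labels.append("image-only")
        elif fires_txt:
            labels.append("text-only")
        else:
            labels.append("dead")
    return labels


def label_feature(j, vocab, word_acts, k=2):
    """Return up to k vocabulary words whose text embeddings activate feature j most."""
    ranked = sorted(zip(vocab, word_acts), key=lambda pair: pair[1][j], reverse=True)
    return [word for word, acts in ranked[:k] if acts[j] > 0.0]


def steer(embedding, feature, scale):
    """Return a copy of the embedding with one feature's activation rescaled."""
    steered = list(embedding)
    steered[feature] *= scale
    return steered


# Toy non-negative activations over 4 features (made-up numbers).
vocab = ["dog", "car", "red"]
word_acts = [
    [0.9, 0.0, 0.0, 0.0],  # "dog"
    [0.0, 0.8, 0.0, 0.0],  # "car"
    [0.1, 0.2, 0.0, 0.0],  # "red"
]
image_acts = [
    [0.7, 0.0, 0.5, 0.0],
    [0.0, 0.6, 0.4, 0.0],
]

print(active_fraction(image_acts))
print(feature_modality(image_acts, word_acts))
print(label_feature(0, vocab, word_acts))
print(steer(image_acts[0], feature=0, scale=0.0))
```

In this toy data, features 0 and 1 fire for both images and words, so they are the ones that can be given a word label and then suppressed or amplified in an image embedding before it is passed to a downstream model.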

This review was generated by AI and checked by a human editor.