QUICK REVIEW

[Paper Review] Sparse CLIP: Co-Optimizing Interpretability and Performance in Contrastive Learning

Chuan Qin, Constantin Venhoff | arXiv (Cornell University) | 2026. 01. 27.

Multimodal Machine Learning Applications | Citations: 0

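One-Line Summary

Sparse CLIP builds sparsity directly into CLIP training, yielding interpretable multimodal features without degrading downstream performance, and demonstrates applications such as vision-based steering in vision-language models.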

ABSTRACT

Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in vision-language representation learning, powering diverse downstream tasks and serving as the default vision backbone in multimodal large language models (MLLMs). Despite its success, CLIP's dense and opaque latent representations pose significant interpretability challenges. A common assumption is that interpretability and performance are in tension: enforcing sparsity during training degrades accuracy, motivating recent post-hoc approaches such as Sparse Autoencoders (SAEs). However, these post-hoc approaches often suffer from degraded downstream performance and loss of CLIP's inherent multimodal capabilities, with most learned features remaining unimodal. We propose a simple yet effective approach that integrates sparsity directly into CLIP training, yielding representations that are both interpretable and performant. Compared to SAEs, our Sparse CLIP representations preserve strong downstream task performance, achieve superior interpretability, and retain multimodal capabilities. We show that multimodal sparse features enable straightforward semantic concept alignment and reveal training dynamics of how cross-modal knowledge emerges. Finally, as a proof of concept, we train a vision-language model on sparse CLIP representations that enables interpretable, vision-based steering capabilities. Our findings challenge conventional wisdom that interpretability requires sacrificing accuracy and demonstrate that interpretability and performance can be co-optimized, offering a promising design principle for future models.

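Motivation and Objectives

Motivate and address the interpretability problem posed by CLIP's dense latent space.
Investigate whether sparsity can be integrated into CLIP training without sacrificing accuracy.
Develop and evaluate a sparsely activated CLIP model that preserves both multimodal capability and interpretability.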

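Proposed Method

Induce sparsity during CLIP training by adding a non-negativity constraint (a ReLU after the final projection) and greatly expanding the embedding dimension.
Frame sparsity from a dictionary-learning perspective that links non-negative contrastive learning to NMF, which supports sparse representations.
Run small-scale ablations to study how dimensionality, the sparsity-inducing method, and the log-scale cap affect sparsity and zero-shot performance.
Scale up to ViT-L/14 on the 2.2B MetaCLIP dataset, using 55,296-dimensional sparse representations (an embedding expansion factor of 72 over the standard 768-dimensional ViT-L/14 CLIP embedding).
Evaluate interpretability with Clarity and multimodality measures, comparing against Sparse Autoencoders (SAEs) and dense baselines.
Demonstrate a vision-language model (VLM) built on Sparse CLIP features and explore vision-based steering by modulating feature activations.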
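
The first step in the list above (a ReLU after the final projection, combined with a much wider embedding) can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the sizes `BASE_DIM`, `EXPANSION`, and `BATCH`, the random weights, the L2 normalization, the fixed `logit_scale`, and all function names are hypothetical, and the loss is the standard symmetric CLIP contrastive loss rather than anything confirmed by the paper.

```python
import math
import random


def relu_project(x, weight):
    """Linear projection followed by ReLU, so every output is non-negative."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weight]


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length (assumed here, as in standard CLIP)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / (norm + eps) for a in v]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(a - m) for a in row))
    return [a - lse for a in row]


def clip_loss(img_emb, txt_emb, logit_scale):
    """Symmetric contrastive loss over a batch of matched image/text pairs."""
    n = len(img_emb)
    logits = [
        [logit_scale * sum(a * b for a, b in zip(i, t)) for t in txt_emb]
        for i in img_emb
    ]
    # Image-to-text: the matching caption sits on the diagonal.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image: same computation on the transposed logits.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def random_matrix(rows, cols, rng):
    """Gaussian weights standing in for a learned projection (hypothetical)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
BASE_DIM, EXPANSION, BATCH = 8, 4, 3  # toy sizes, not the paper's
SPARSE_DIM = BASE_DIM * EXPANSION     # widened embedding dimension

w_img = random_matrix(SPARSE_DIM, BASE_DIM, rng)
w_txt = random_matrix(SPARSE_DIM, BASE_DIM, rng)
img_feats = [[rng.gauss(0, 1) for _ in range(BASE_DIM)] for _ in range(BATCH)]
txt_feats = [[rng.gauss(0, 1) for _ in range(BASE_DIM)] for _ in range(BATCH)]

img_emb = [l2_normalize(relu_project(x, w_img)) for x in img_feats]
txt_emb = [l2_normalize(relu_project(x, w_txt)) for x in txt_feats]
loss = clip_loss(img_emb, txt_emb, logit_scale=10.0)
```

Because both embeddings are non-negative, every pairwise similarity is also non-negative, which is the property the dictionary-learning view in the list relies on. With random weights roughly half of the units are zero; the far lower activation rates reported in this review come from training, which this sketch does not reproduce.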

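Experimental Results

Research Questions

RQ1: Can sparsity be introduced natively into CLIP training while maintaining or improving downstream performance?
RQ2: Are sparsely trained CLIP representations more interpretable and more multimodal than post-hoc sparsification methods such as SAEs?
RQ3: How do Sparse CLIP features align with human-interpretable concepts across modalities?
RQ4: Do Sparse CLIP representations enable practical VLM applications such as interpretable steering?
RQ5: How do concepts emerge and evolve during Sparse CLIP training?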

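Key Findings

Sparse CLIP models maintain competitive zero-shot and fine-grained performance despite extreme sparsity: activation sparsity reaches 0.66% for ViT-L/14 Sparse and 0.47% for ViT-L/14 Sparse+.
Unlike most SAE-based approaches, Sparse CLIP features are predominantly multimodal, activating on both image and text inputs.
Sparse CLIP enables concept labeling by linking each feature to its top-activating words from a large vocabulary, and shows high correlation between text and visual activations for multimodal concepts.
Training-time sparsity yields interpretable representations, with higher Clarity than open-weight SAEs on the evaluated datasets.
A VLM built on Sparse CLIP features achieves performance comparable to the baseline on image QA benchmarks and demonstrates vision-based steering.
The concept-emergence study shows that multimodal features appear early in training and evolve throughout, with some features changing significantly over time.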
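
To make two of the quantities in the findings above concrete, the following Python sketch computes an activation-sparsity fraction and sorts features by which modalities they fire on. Both definitions are assumptions made for illustration: the review does not state how the paper computes activation sparsity or its multimodality measure, and the function names, the `threshold` parameter, and the toy embeddings are hypothetical.

```python
def activation_sparsity(embeddings, threshold=0.0):
    """Fraction of embedding entries that are active (above threshold)."""
    total = sum(len(e) for e in embeddings)
    active = sum(1 for e in embeddings for a in e if a > threshold)
    return active / total


def classify_features(img_embs, txt_embs, threshold=0.0):
    """Label each feature by the modalities on which it ever activates."""
    labels = []
    for j in range(len(img_embs[0])):
        on_img = any(e[j] > threshold for e in img_embs)
        on_txt = any(e[j] > threshold for e in txt_embs)
        if on_img and on_txt:
            labels.append("multimodal")
        elif on_img:
            labels.append("image-only")
        elif on_txt:
            labels.append("text-only")
        else:
            labels.append("dead")
    return labels


# Toy non-negative embeddings: 2 images and 2 captions, 6 features each.
img_embs = [
    [0.9, 0.0, 0.0, 0.2, 0.0, 0.0],
    [0.7, 0.0, 0.0, 0.0, 0.0, 0.0],
]
txt_embs = [
    [0.8, 0.0, 0.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.4, 0.0, 0.0, 0.0],
]

img_sparsity = activation_sparsity(img_embs)    # 3 of 12 entries are active
labels = classify_features(img_embs, txt_embs)  # feature 0 fires on both
```

Under this reading, an activation sparsity of 0.66% on a 55,296-dimensional embedding would correspond to a few hundred active features per input.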


This review was generated by AI and reviewed by a human editor.