QUICK REVIEW

[Paper Review] Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Jason Wu, Tianchen Zhao | arXiv (Cornell University) | 2026. 02. 23.

Multimodal Machine Learning Applications | Citations: 0

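One-Line Summary

CRAFT fine-tunes only the discrete vision encoder against a shared codebook to adapt LVLMs to domain-specific tasks, enabling cross-LLM transfer without re-alignment and improving domain accuracy while preserving language capabilities.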

ABSTRACT

Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.

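Motivation and Objectives

Establish the need for domain adaptation, given that vision encoders underperform in tail domains.
Propose a decoupled domain adaptation framework that uses a discrete codebook to anchor visual representations.
Train a discrete vision encoder that can plug into any LVLM sharing the same codebook, enabling cross-LLM transfer.
Achieve domain-specific gains through lightweight training and test-time token pruning, without retraining the language model.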

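Proposed Method

Quantize continuous visual features against a fixed codebook to obtain discrete tokens.
Train the vision encoder with a composite loss that combines a surrogate alignment loss, a commitment loss, and a contrastive loss: L_CRAFT = lambda_con * L_con + lambda_commit * L_commit + L_SAL.
Use a surrogate language model to guide token selection during training (L_SAL).
Keep the codebook frozen and apply a straight-through estimator to the quantization step during backpropagation.
Apply test-time token pruning, using a sparsity-based token budget and internal token selection, so that only informative tokens remain.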
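
The quantization step and the composite loss listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the nearest-neighbour lookup, the VQ-VAE-style squared-distance form of the commitment loss, the helper names (`quantize`, `commitment_loss`, `craft_loss`), and the weight values are all hypothetical. The review does not give the definitions of L_con and L_SAL, so they enter as plain numbers. Gradient flow through the quantizer (the straight-through estimator) is not modelled here.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: List[Vector], codebook: List[Vector]) -> Tuple[List[int], List[Vector]]:
    """Map each continuous feature to its nearest entry in a fixed codebook.

    Returns the discrete token ids and the corresponding codebook vectors.
    Nearest-neighbour assignment is an assumption made for this sketch.
    """
    token_ids = [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]
    quantized = [codebook[i] for i in token_ids]
    return token_ids, quantized


def commitment_loss(features: List[Vector], quantized: List[Vector]) -> float:
    """Mean squared distance between features and their assigned codes.

    Assumed VQ-VAE-style form; the codebook itself is never updated.
    """
    return sum(squared_distance(f, q) for f, q in zip(features, quantized)) / len(features)


def craft_loss(l_con: float, l_commit: float, l_sal: float,
               lambda_con: float = 1.0, lambda_commit: float = 0.25) -> float:
    """Weighted sum from the review: lambda_con*L_con + lambda_commit*L_commit + L_SAL.

    The default weights are placeholders, not values from the paper.
    """
    return lambda_con * l_con + lambda_commit * l_commit + l_sal


# Toy example: three 2-D "visual features" and a three-entry frozen codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
features = [[0.1, -0.1], [0.9, 1.2], [1.8, 0.1]]

token_ids, quantized = quantize(features, codebook)
l_commit = commitment_loss(features, quantized)
# L_con and L_SAL are stand-in numbers here, since their forms are not given.
total = craft_loss(l_con=0.5, l_commit=l_commit, l_sal=1.0)
print(token_ids, round(l_commit, 4), round(total, 4))
```

Because the codebook stays fixed, only the encoder that produces `features` would change during training, which is what lets any language model that shares the codebook consume the resulting token ids.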

Figure 1: Continuous vs. Discrete Adaptation. (a) In conventional continuous-space adaptation, fine-tuning the vision encoder shifts its feature distribution, requiring costly re-alignment with each language model. (b) CRAFT introduces a discrete interface that anchors visual features to a shared c

Experimental Results

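Key Results

CRAFT achieves an average gain of 13.51% across ten domain-specific benchmarks.
The discrete token interface enables cross-LLM transfer without re-alignment while preserving instruction following and explanation generation.
Compared with continuous fine-tuning and PEFT baselines, CRAFT delivers stronger domain-specific understanding with balanced reasoning quality.
Token pruning reduces inference FLOPs and latency while maintaining performance, and is reliable at a keep ratio of around 0.8.
Training with a small surrogate model still yields substantial gains and can reduce memory and time costs.
Ablation studies show that each loss component, notably L_SAL and L_con, contributes to performance.
Decoupled vision encoder adaptation requires no LLM retraining across backbones.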
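
To make the keep ratio in the pruning result concrete, here is a minimal sketch of ratio-based test-time token pruning. It is an assumption-laden illustration rather than the paper's procedure: the per-token `scores` are hypothetical informativeness values, and a single fixed `keep_ratio` stands in for the sparsity-based token budget that the review mentions but does not define.

```python
import math
from typing import List, Sequence


def prune_tokens(token_ids: Sequence[int], scores: Sequence[float],
                 keep_ratio: float = 0.8) -> List[int]:
    """Keep the highest-scoring fraction of tokens, preserving their order.

    `scores` and the fixed `keep_ratio` budget are assumptions of this sketch.
    """
    if len(token_ids) != len(scores):
        raise ValueError("token_ids and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    # Token budget: number of tokens allowed to reach the language model.
    budget = max(1, math.ceil(keep_ratio * len(token_ids)))
    # Positions of the `budget` most informative tokens.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    # Restore the original spatial order of the surviving tokens.
    return [token_ids[i] for i in sorted(top)]


# Toy example: five discrete visual tokens with made-up scores.
token_ids = [17, 4, 4, 93, 8]
scores = [0.9, 0.1, 0.6, 0.8, 0.3]
kept = prune_tokens(token_ids, scores, keep_ratio=0.8)
print(kept)  # the lowest-scoring token (position 1) is dropped
```

With five tokens and a keep ratio of 0.8, four tokens survive, so the language model processes a shorter visual sequence, which is where FLOP and latency savings of this kind would come from.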

Figure 2: Examples from plant pathology [37], medical imaging [19], and abstract diagram understanding [34] are shown using a general continuous LVLM [25], its PEFT-tuned variant, and our CRAFT model built on the discrete LVLM [51]. General LVLM often lacks visual grounding or domain-

This review was generated by AI and reviewed by a human editor.