QUICK REVIEW

[Paper Review] Decoupling Vision and Language: Codebook Anchored Visual Adaptation

Jason Wu, Tianchen Zhao|arXiv (Cornell University)|Feb 23, 2026

Multimodal Machine Learning Applications | Cited by 0

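ONE-SENTENCE SUMMARY

CRAFT fine-tunes a discrete vision encoder against a single fixed discrete codebook, so the adapted encoder transfers across LLMs without re-alignment and improves domain accuracy while preserving language capabilities.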

ABSTRACT

Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.

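MOTIVATION AND OBJECTIVES

Motivate the domain-adaptation problem for large vision-language models whose vision encoders underperform in long-tail domains.
Propose a decoupled adaptation framework that uses a discrete codebook to anchor visual representations.
Enable cross-LLM transfer by training a discrete vision encoder that can be plugged into any model sharing the same codebook.
Achieve domain-specific gains through lightweight training and inference-time token pruning, without retraining the language model.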

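PROPOSED METHOD

Quantize continuous visual features against a fixed codebook to obtain discrete tokens.
Train the vision encoder with a combined loss made up of a surrogate alignment loss, a commitment loss, and a contrastive loss: L_CRAFT = lambda_con L_con + lambda_commit L_commit + L_SAL.
Use a surrogate language model during training to guide token selection (L_SAL).
Keep the codebook fixed and apply the straight-through estimator through the quantization step during backpropagation.
At inference, prune tokens using a scarcity-based token quota and intra-token filtering, keeping only the more informative tokens.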
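
The quantization, commitment, and loss-combination steps listed above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: the function names, the toy 2-D codebook, and the default weights `lambda_con=1.0` and `lambda_commit=0.25` are all assumptions made for this example, and the values of L_con and L_SAL are passed in as plain numbers because this review does not spell out their formulas.

```python
# Hypothetical sketch: all names, shapes, and default weights are assumptions.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def quantize(features, codebook):
    """Map each continuous feature to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]

def commitment_loss(features, codebook, token_ids):
    """Mean squared distance between each feature and its assigned (fixed) code."""
    total = sum(squared_distance(f, codebook[t]) for f, t in zip(features, token_ids))
    return total / len(features)

def straight_through_backward(grad_wrt_quantized):
    """Straight-through estimator: pass the gradient that arrives at the
    quantized output to the encoder features unchanged, as if quantization
    were the identity. The codebook itself receives no update."""
    return [list(g) for g in grad_wrt_quantized]

def total_loss(l_con, l_commit, l_sal, lambda_con=1.0, lambda_commit=0.25):
    """L_CRAFT = lambda_con * L_con + lambda_commit * L_commit + L_SAL."""
    return lambda_con * l_con + lambda_commit * l_commit + l_sal

# Toy usage with a 3-entry codebook of 2-D codes (illustrative values only).
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.1], [0.9, 1.2], [0.2, 1.7]]
token_ids = quantize(features, codebook)
l_commit = commitment_loss(features, codebook, token_ids)
print(token_ids, l_commit)
```

Because the codebook never changes, only the encoder moves during training. That is what lets any language model already aligned to the same codebook consume the adapted encoder's tokens without re-alignment.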

Figure 1 : Continuous vs. Discrete Adaptation. (a) In conventional continuous-space adaptation, fine-tuning the vision encoder shifts its feature distribution, requiring costly re-alignment with each language model. (b) CRAFT introduces a discrete interface that anchors visual features to a shared c

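EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a discrete codebook interface adapt an LVLM to a new domain without changing the frozen language model?
RQ2: Do discrete visual tokens combined with surrogate supervision outperform continuous-feature fine-tuning or PEFT methods on domain-specific reasoning?
RQ3: Is cross-LLM transfer feasible when the adapted encoders share the same discrete codebook?
RQ4: How does inference-time token pruning affect efficiency and accuracy across domains?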

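Key Findings

CRAFT achieves an average gain of 13.51% across ten domain-specific benchmarks.
The discrete token interface enables cross-LLM transfer without re-alignment while preserving instruction-following and explanation abilities.
Compared with continuous fine-tuning and PEFT baselines, CRAFT shows stronger domain-specific understanding and more balanced reasoning quality.
Token pruning reduces inference FLOPs and latency while maintaining performance (results are stable at a retention ratio of about 0.8).
Training with a small surrogate model already yields significant gains and lowers memory and time costs.
Ablations show that each loss component contributes to performance, especially L_SAL and L_con.
Decoupled vision-encoder adaptation removes the need to retrain the LLM for different backbones.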
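
To make the token-pruning finding more concrete, here is a hedged sketch of pruning discrete visual tokens to a retention ratio of about 0.8. This review does not describe the paper's actual scarcity-based quota or intra-token filtering rule, so the sketch substitutes a simple stand-in: it treats a token as "scarce" when its codebook id occurs rarely within the same image and keeps the rarest tokens first. The function name `prune_tokens` and this scoring rule are assumptions made for illustration.

```python
import math
from collections import Counter

def prune_tokens(token_ids, keep_ratio=0.8):
    """Keep roughly keep_ratio of the tokens, preferring ids that are rare
    within this image (an assumed proxy for scarcity). The original token
    order is preserved among the kept tokens."""
    n_keep = max(1, math.ceil(keep_ratio * len(token_ids)))
    counts = Counter(token_ids)
    # Rank positions by how rare their id is; break ties by position.
    ranked = sorted(range(len(token_ids)), key=lambda i: (counts[token_ids[i]], i))
    kept_positions = sorted(ranked[:n_keep])
    return [token_ids[i] for i in kept_positions]

# Toy usage: id 7 is redundant background, ids 3, 9, and 5 are scarce.
tokens = [7, 7, 7, 7, 7, 7, 3, 3, 9, 5]
pruned = prune_tokens(tokens, keep_ratio=0.8)
print(pruned)
```

In this toy case the two dropped tokens are both copies of the most frequent id, which matches the intuition that redundant tokens can be removed at little cost to accuracy while shortening the sequence the language model has to process.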

Figure 2 : Examples from plant pathology [ 37 ] , medical imaging [ 19 ] , and abstract diagram understanding [ 34 ] are shown using a general continuous LVLM [ 25 ] , its PEFT-tuned variant, and our CRAFT model built on the discrete LVLM [ 51 ] . General LVLM often lacks visual grounding or domain-

This review was generated by AI and reviewed by a human editor.