QUICK REVIEW

[Paper Review] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

Marc-Antoine Lavoie, Anas Mahmoud|arXiv (Cornell University)|Feb 25, 2026

Multimodal Machine Learning Applications | Cited by 0

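One-Sentence Summary

DeBias-CLIP removes the opening summary sentence and applies sentence sub-sampling and token padding to mitigate early-token bias in CLIP-style models. It achieves state-of-the-art long-text retrieval without sacrificing short-text performance, and it is more robust to sentence permutations.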

ABSTRACT

CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.

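Motivation and Goals

Show that CLIP and Long-CLIP exhibit an early-token bias and a summary-sentence bias when trained on long captions.
Propose a simple, parameter-free augmentation (drop the summary sentence, sub-sample sentences, add padding) that distributes supervision across caption tokens.
Demonstrate that the proposed method achieves state-of-the-art long-text retrieval across multiple datasets and encoders without degrading short-text performance.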

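Proposed Method

Empirically analyze the bias of CLIP text encoders using long-caption datasets such as DOCCI.
Identify the early-token bias in long captions and the sensitivity to the opening summary sentence.
Introduce DeBias-CLIP: remove the opening summary sentence, randomly sample from the remaining sentences, and pad tokens so that attention extends to later positions.
Train with a multi-caption objective that combines long-caption and short-caption losses without adding trainable parameters.
Provide a drop-in replacement for Long-CLIP that uses the same positional extension scheme plus a simple caption augmentation pipeline.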
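The caption augmentation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regex sentence splitter, the function names, the `max_pad` range, the choice to keep sampled sentences in their original order, and the use of a literal `<pad>` string placed before the text are all assumptions made for the example. A real pipeline would most likely pad at the token-id level inside the tokenizer.

```python
import random
import re


def split_sentences(caption):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def debias_augment(caption, rng, max_pad=8, pad_token="<pad>"):
    """Return an augmented long caption for one training example.

    Steps (hypothetical rendering of the summary's description):
      1. drop the opening summary sentence,
      2. keep a random subset of the remaining sentences,
      3. prepend a random number of padding placeholders so the kept
         text starts at a later position.
    """
    sentences = split_sentences(caption)
    if len(sentences) < 2:
        # Nothing to drop or sample from; leave the caption untouched.
        return caption

    rest = sentences[1:]  # step 1: summary sentence removed

    # Step 2: sample k sentences, preserving their original order (assumption).
    k = rng.randint(1, len(rest))
    kept_idx = sorted(rng.sample(range(len(rest)), k))
    sampled = [rest[i] for i in kept_idx]

    # Step 3: placeholder padding; its position and symbol are assumptions.
    n_pad = rng.randint(0, max_pad)
    return " ".join([pad_token] * n_pad + sampled)


if __name__ == "__main__":
    rng = random.Random(0)
    caption = (
        "A dog on a beach. The dog is brown. "
        "Waves roll in. The sky is grey."
    )
    print(debias_augment(caption, rng))
```

Because the random choices change per call, each epoch sees different sentences at different positions, which is one plausible way to spread supervision over the whole context window.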

Figure 2: Top-1 text-to-image retrieval on DOCCI as a function of the number of added padding sentences. One to five padding sentences ('This is a photo.') are added before the truncated original DOCCI caption (only the first two sentences are kept). All models use the ViT-B/16 scale.
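The probe captions behind Figure 2 could be built along these lines. This is a sketch under assumptions: the helper names and the regex sentence splitter are hypothetical, the example caption is invented, and the zero-padding case is included only as an unpadded baseline. The padding sentence and the two-sentence truncation come from the figure caption.

```python
import re


def first_sentences(caption, keep=2):
    """Keep the first `keep` sentences, using a naive regex splitter (assumption)."""
    parts = [p for p in re.split(r"(?<=[.!?])\s+", caption.strip()) if p]
    return parts[:keep]


def padded_probe(caption, n_pad, pad_sentence="This is a photo.", keep=2):
    """Prepend `n_pad` copies of the padding sentence to the truncated caption."""
    return " ".join([pad_sentence] * n_pad + first_sentences(caption, keep))


# Invented example caption; real probes would come from DOCCI.
caption = (
    "A red bicycle leans on a wall. The wall is painted blue. "
    "A cat sits nearby."
)

# n_pad = 0 is the unpadded baseline; 1..5 match the figure's x-axis.
probes = {n: padded_probe(caption, n) for n in range(0, 6)}

if __name__ == "__main__":
    for n, text in probes.items():
        print(n, text)
```

Each probe pushes the same informative content to a later position, so a drop in retrieval accuracy as `n_pad` grows would point to an early-token bias in the text encoder.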

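Experimental Results

Research Questions

RQ1: Do pretrained CLIP and Long-CLIP models exhibit early-token and summary-sentence biases in long-text retrieval?
RQ2: Can caption-level augmentation (omitting the summary sentence, sub-sampling sentences, and padding tokens) improve long-caption retrieval without hurting short-text performance?
RQ3: How robust is DeBias-CLIP to sentence permutation and summary-sentence removal across encoders and datasets?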

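Key Findings

CLIP and Long-CLIP models show a systematic bias toward early tokens and the opening summary sentence in long captions.
Removing the summary sentence and applying sentence sub-sampling and token padding yields state-of-the-art long-text retrieval on multiple benchmark datasets.
The method also improves short-text retrieval and increases robustness to sentence reordering and to removal of the summary sentence.
The method is a drop-in replacement for Long-CLIP and introduces no additional trainable parameters.
DeBias-CLIP consistently outperforms Long-CLIP across several CLIP-style encoders and data distributions.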

Figure 3: Top-1 image-to-text retrieval on DOCCI with the first two sentences permuted. We analyze three setups: the first two sentences in the correct order (First 2), the same two sentences swapped (Swap 2), and the first sentence only (First only). Results are reported for four models: OpenAI
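The three text variants compared in Figure 3 can be generated with a few lines. As before, this is only an illustrative sketch: the function name, the regex splitter, and the example caption are assumptions. Only the setup labels are taken from the caption.

```python
import re


def permutation_probes(caption):
    """Build the 'First 2', 'Swap 2' and 'First only' variants of a caption."""
    parts = [p for p in re.split(r"(?<=[.!?])\s+", caption.strip()) if p]
    if len(parts) < 2:
        raise ValueError("caption needs at least two sentences")
    first, second = parts[0], parts[1]
    return {
        "First 2": f"{first} {second}",   # original order
        "Swap 2": f"{second} {first}",    # same content, order swapped
        "First only": first,              # opening sentence alone
    }


if __name__ == "__main__":
    # Invented example caption; real probes would come from DOCCI.
    variants = permutation_probes(
        "A red bicycle leans on a wall. The wall is painted blue. A cat sits nearby."
    )
    for name, text in variants.items():
        print(f"{name}: {text}")
```

"First 2" and "Swap 2" carry identical information, so a large accuracy gap between them would indicate sensitivity to sentence position rather than to content.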

This review was generated by AI and checked by human editors.