QUICK REVIEW

[Paper Review] CLIP Is Shortsighted: Paying Attention Beyond the First Sentence

Marc-Antoine Lavoie, Anas Mahmoud | arXiv (Cornell University) | Feb 25, 2026

Multimodal Machine Learning Applications | Citations: 0

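One-line summary

DeBias-CLIP mitigates the early-token bias of CLIP-style models by removing the opening summary sentence, sub-sampling the remaining sentences, and padding text tokens, achieving state-of-the-art long-text retrieval while improving short-text retrieval and robustness to sentence reordering.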

ABSTRACT

CLIP models learn transferable multi-modal features via image-text contrastive learning on internet-scale data. They are widely used in zero-shot classification, multi-modal retrieval, text-to-image diffusion, and as image encoders in large vision-language models. However, CLIP's pretraining is dominated by images paired with short captions, biasing the model toward encoding simple descriptions of salient objects and leading to coarse alignment on complex scenes and dense descriptions. While recent work mitigates this by fine-tuning on small-scale long-caption datasets, we identify an important common bias: both human- and LLM-generated long captions typically begin with a one-sentence summary followed by a detailed description. We show that this acts as a shortcut during training, concentrating attention on the opening sentence and early tokens and weakening alignment over the rest of the caption. To resolve this, we introduce DeBias-CLIP, which removes the summary sentence during training and applies sentence sub-sampling and text token padding to distribute supervision across all token positions. DeBias-CLIP achieves state-of-the-art long-text retrieval, improves short-text retrieval, and is less sensitive to sentence order permutations. It is a drop-in replacement for Long-CLIP with no additional trainable parameters.

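Motivation and objectives

Demonstrate that CLIP and Long-CLIP exhibit an early-token bias and a summary-sentence bias on long captions.
Propose a simple, parameter-free augmentation that removes the summary sentence, randomly samples sentences, and adds padding, so that supervision is spread across all token positions of the caption.
Show that the proposed method achieves state-of-the-art long-text retrieval across multiple datasets and encoders without sacrificing short-text performance.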

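Proposed method

Empirically analyze the biases of CLIP text encoders using long-caption datasets (e.g., DOCCI).
Identify an early-token bias and a sensitivity to the opening summary sentence in long captions.
Introduce DeBias-CLIP: remove the opening summary sentence, randomly sample from the remaining sentences, and pad tokens so that attention extends to later positions.
Train with a multi-caption objective that combines a long-caption loss and a short-caption loss, without adding any parameters.
Serve as a drop-in replacement for Long-CLIP by applying the same positional extension scheme together with a simple caption augmentation pipeline.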
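The three augmentation steps listed above (drop the opening summary sentence, sample from the remaining sentences, pad tokens) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the regex sentence splitter, the whitespace tokenization, the `<pad>` placeholder, the context length, and the choice to place padding before the text at a random offset are all assumptions made for this example.

```python
import random
import re


def split_sentences(caption):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [p for p in re.split(r"(?<=[.!?])\s+", caption.strip()) if p]


def debias_augment(caption, rng, context_len=32, pad_token="<pad>"):
    """Hypothetical caption augmentation returning exactly `context_len` tokens."""
    sentences = split_sentences(caption)
    # Step 1: drop the opening summary sentence (kept if it is the only one).
    body = sentences[1:] if len(sentences) > 1 else sentences
    if not body:
        return [pad_token] * context_len
    # Step 2: sample a random non-empty subset of sentences, keeping their order.
    k = rng.randint(1, len(body))
    keep = sorted(rng.sample(range(len(body)), k))
    # Whitespace tokenization stands in for a real CLIP tokenizer (assumption).
    tokens = " ".join(body[i] for i in keep).split()[:context_len]
    # Step 3: shift the text to a random offset so later positions also
    # receive supervision; prefix padding is an assumption of this sketch.
    free = context_len - len(tokens)
    offset = rng.randint(0, free)
    return [pad_token] * offset + tokens + [pad_token] * (free - offset)
```

A usage example: `debias_augment("A dog in a park. The dog is brown. It holds a red ball.", random.Random(0))` returns a 32-token list in which the summary sentence "A dog in a park." never appears and the surviving sentences start at a random position.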

Figure 2: Top-1 text-to-image retrieval on DOCCI as a function of the number of added padding sentences. One to five padding sentences ("This is a photo.") are added before the truncated original DOCCI caption (we keep the first two sentences only). We use the ViT-B/16 scale for all models.
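The probe described in the Figure 2 caption can be reproduced in outline as follows. This is a hedged sketch: the helper names, the regex sentence splitter, and the assumption that text i matches image i in the similarity matrix are introduced here for illustration, and the similarity scores themselves would come from whichever CLIP-style model is being evaluated.

```python
import re

PAD_SENTENCE = "This is a photo."  # padding sentence quoted in the Figure 2 caption


def first_sentences(caption, n=2):
    """Naive splitter (assumption): keep the first `n` sentences."""
    parts = [p for p in re.split(r"(?<=[.!?])\s+", caption.strip()) if p]
    return parts[:n]


def padded_probe(caption, num_pad):
    """Prepend `num_pad` padding sentences to the first two caption sentences."""
    return " ".join([PAD_SENTENCE] * num_pad + first_sentences(caption, 2))


def top1_accuracy(similarity):
    """Top-1 text-to-image accuracy.

    similarity[i][j] is the score of text i against image j; this sketch
    assumes image i is the ground-truth match for text i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(similarity)
```

Sweeping `num_pad` from 1 to 5, encoding each probe, and calling `top1_accuracy` on the resulting text-image similarity matrix gives one point per padding count, which is the curve the figure plots.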

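Experimental results

Research questions

RQ1: Do pretrained CLIP and Long-CLIP models exhibit early-token and summary-sentence biases that hinder long-text retrieval?
RQ2: Can a caption-level augmentation that omits the summary sentence, samples sentences, and pads tokens improve long-caption retrieval without harming short-caption performance?
RQ3: Is the proposed DeBias-CLIP robust to sentence reordering and to removal of the summary sentence across multiple encoders and datasets?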

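Key findings

CLIP and Long-CLIP models show a systematic bias toward early tokens and the opening summary sentence in long captions.
Removing the summary sentence and applying sentence sampling and token padding yields state-of-the-art long-text retrieval on multiple benchmarks.
The approach also improves short-text retrieval and increases robustness to sentence-order permutations and to removal of the summary sentence.
The method is a drop-in replacement for Long-CLIP and introduces no additional trainable parameters.
DeBias-CLIP consistently outperforms Long-CLIP across multiple CLIP-style encoders and data distributions.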

Figure 3: Top-1 image-to-text retrieval on DOCCI with first two sentences permuted. We analyze three setups: the first two sentences in the correct order (First 2), the same two sentences swapped (Swap 2), and the first sentence only (First only). Results are reported for four models: OpenAI
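The three caption setups named in the Figure 3 caption can be constructed as below. This is only an illustrative sketch: the function name and the regex sentence splitter are assumptions, and the retrieval evaluation itself is left to the model under test.

```python
import re


def permutation_probes(caption):
    """Build the First 2 / Swap 2 / First only variants of a caption."""
    # Naive splitter (assumption): break after '.', '!' or '?' plus whitespace.
    parts = [p for p in re.split(r"(?<=[.!?])\s+", caption.strip()) if p]
    if len(parts) < 2:
        raise ValueError("caption needs at least two sentences")
    first, second = parts[0], parts[1]
    return {
        "First 2": f"{first} {second}",     # original order
        "Swap 2": f"{second} {first}",      # same sentences, swapped
        "First only": first,                # opening sentence alone
    }
```

A model that relies mainly on the opening sentence would be expected to score similarly on First 2 and First only but drop on Swap 2, which is the kind of gap this probe is designed to expose.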


This review was generated by AI and checked by a human editor.