QUICK REVIEW

[Paper Review] Visual Word Sense Disambiguation with CLIP through Dual-Channel Text Prompting and Image Augmentations

Shamik Bhattacharya, Daniel Perkins|arXiv (Cornell University)|Feb 6, 2026

Multimodal Machine Learning Applications|Citations: 0

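One-Line Summary

This paper presents a CLIP-based VWSD framework that enriches text embeddings with dual-channel prompts and WordNet and stabilizes image embeddings with test-time augmentation, improving MRR and Hit Rate on SemEval-2023 VWSD.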

ABSTRACT

Ambiguity poses persistent challenges in natural language understanding for large language models (LLMs). To better understand how lexical ambiguity can be resolved through the visual domain, we develop an interpretable Visual Word Sense Disambiguation (VWSD) framework. The model leverages CLIP to project ambiguous language and candidate images into a shared multimodal space. We enrich textual embeddings using a dual-channel ensemble of semantic and photo-based prompts with WordNet synonyms, while image embeddings are refined through robust test-time augmentations. We then use cosine similarity to determine the image that best aligns with the ambiguous text. When evaluated on the SemEval-2023 VWSD dataset, enriching the embeddings raises the MRR from 0.7227 to 0.7590 and the Hit Rate from 0.5810 to 0.6220. Ablation studies reveal that dual-channel prompting provides strong, low-latency performance, whereas aggressive image augmentation yields only marginal gains. Additional experiments with WordNet definitions and multilingual prompt ensembles further suggest that noisy external signals tend to dilute semantic specificity, reinforcing the effectiveness of precise, CLIP-aligned prompts for visual word sense disambiguation.

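Motivation and Objectives

Address lexical ambiguity in visual contexts by resolving the sense of a target word through multimodal embedding alignment.
Develop an interpretable CLIP-based VWSD framework with low-latency prompting and robust image augmentation.
Run systematic ablations to understand the contributions of prompts, augmentations, and external knowledge signals.
Quantify the gains over vanilla CLIP on SemEval-2023 VWSD.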

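Proposed Method

Embed the ambiguous text and the candidate images into a shared multimodal space with CLIP.
Combine a dual-channel prompt ensemble (semantic prompts and photo prompts) with WordNet synonyms; mean-pool the embeddings within each channel, then merge the two channels with a weighted sum.
Optionally add WordNet definitions as lexical anchors, balancing them against the context embedding with an α-weighted average.
Strengthen image embeddings with a test-time augmentation pipeline (multiple views, crops, geometric and photometric transformations), averaging the views with temperature scaling.
Compute cosine similarity between the enriched text and image embeddings and select the best-matching image.
Compare against vanilla CLIP and BERT+BLIP baselines to test whether a shared CLIP space is necessary.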
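The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy 2-D vectors stand in for real CLIP embeddings, all function names (`mean_pool`, `blend`, `text_embedding`, `rank_candidates`) are hypothetical, the channel weight of 0.5 is an arbitrary placeholder, and the temperature scaling mentioned for view averaging is replaced by a plain mean because the review does not specify it. The default `alpha=0.15` mirrors the 15% WordNet weight reported in the findings.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def mean_pool(vectors: Sequence[Sequence[float]]) -> Vector:
    """Average unit-normalized vectors and re-normalize the mean."""
    unit = [l2_normalize(v) for v in vectors]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)


def blend(a: Sequence[float], b: Sequence[float], weight_a: float) -> Vector:
    """Return normalize(weight_a * a + (1 - weight_a) * b)."""
    return l2_normalize([weight_a * x + (1.0 - weight_a) * y for x, y in zip(a, b)])


def text_embedding(
    semantic: Sequence[Sequence[float]],
    photo: Sequence[Sequence[float]],
    semantic_weight: float = 0.5,          # assumed placeholder, not from the paper
    definition: Optional[Sequence[float]] = None,
    alpha: float = 0.15,                   # WordNet definition weight
) -> Vector:
    """Mean-pool each prompt channel, merge the channels, optionally add an anchor."""
    fused = blend(mean_pool(semantic), mean_pool(photo), semantic_weight)
    if definition is not None:
        fused = blend(l2_normalize(definition), fused, alpha)
    return fused


def rank_candidates(
    text_vec: Sequence[float],
    candidate_views: Sequence[Sequence[Sequence[float]]],
) -> Tuple[List[int], List[float]]:
    """Score each candidate by cosine similarity; return indices sorted best-first."""
    scores = []
    for views in candidate_views:
        image_vec = mean_pool(views)  # plain mean over augmented views (assumption)
        scores.append(sum(t * i for t, i in zip(text_vec, image_vec)))
    order = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return order, scores


# Toy 2-D vectors standing in for CLIP embeddings.
semantic_prompts = [[1.0, 0.0], [0.9, 0.1]]
photo_prompts = [[0.8, 0.2]]
candidates = [
    [[0.0, 1.0], [0.1, 0.9]],  # candidate 0: two augmented views
    [[1.0, 0.1], [0.9, 0.0]],  # candidate 1: two augmented views
]
text_vec = text_embedding(semantic_prompts, photo_prompts)
order, scores = rank_candidates(text_vec, candidates)
print(order[0])  # index of the best-matching candidate: 1
```

In a real setup the embedding lists would come from a CLIP text and image encoder; normalizing before and after pooling keeps every dot product a cosine similarity regardless of how many prompts or views are averaged.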

Figure 1: Two images illustrating the ambiguity of the word “bank”: one shows riverbank erosion, the other a piggy bank.

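Experimental Results

Research Questions

RQ1: How does dual-channel text prompting (semantic and photo prompts) improve CLIP's cross-modal alignment in VWSD?
RQ2: How does test-time image augmentation affect VWSD performance and latency?
RQ3: Do WordNet-based definitions and multilingual prompts provide reliable gains, or do they introduce noise?
RQ4: How do accuracy and efficiency differ between prompting alone and prompting plus augmentation?
RQ5: How does multilingual translation affect the performance of CLIP-based VWSD?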

Key Findings

Model	Final MRR	Final Hit Rate
CLIP-ViT-B/32	0.7392	0.5940
CLIP-ViT-B/16	0.7522	0.6177
CLIP-ViT-B/32 (LAION)	0.7590	0.6220

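Enriching the embeddings with dual-channel prompting improved MRR from 0.7227 to 0.7590 and Hit Rate from 0.5810 to 0.6220 on SemEval-2023 VWSD.
Prompting delivers strong, low-latency gains, whereas aggressive image augmentation is computationally expensive and yields only limited gains.
WordNet definitions and multilingual prompts can introduce noise and degrade performance. Among the definition-based variants, weighting the WordNet definition at 15% and the context embedding at 85% performed best.
Vanilla CLIP reaches an MRR of 0.7227 and a Hit Rate of 0.5810 on the test set. The BERT+BLIP baseline performs worse because its embeddings are not aligned in a shared space.
Combining prompting with augmentation increases latency, and the benefit of visual diversity is limited once semantic guidance is already present.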
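For readers unfamiliar with the two reported metrics, the following sketch shows how they are typically computed from ranked candidate lists. It assumes that Hit Rate means Hit@1 (the share of instances whose gold image is ranked first), which is the usual reading for this task but is not spelled out in the review; the function names and the toy image identifiers are hypothetical.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], gold: Hashable) -> float:
    """1 / (1-based position of the gold item in the ranking)."""
    return 1.0 / (list(ranked).index(gold) + 1)


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]], golds: Sequence[Hashable]
) -> float:
    """Average the reciprocal rank over all instances."""
    return sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds)) / len(golds)


def hit_rate_at_1(
    rankings: Sequence[Sequence[Hashable]], golds: Sequence[Hashable]
) -> float:
    """Fraction of instances whose top-ranked candidate is the gold image."""
    return sum(1 for r, g in zip(rankings, golds) if r[0] == g) / len(golds)


# Three toy instances, each ranking three candidate images best-first.
rankings = [
    ["img_a", "img_b", "img_c"],  # gold ranked first,  RR = 1
    ["img_b", "img_a", "img_c"],  # gold ranked second, RR = 1/2
    ["img_c", "img_b", "img_a"],  # gold ranked third,  RR = 1/3
]
golds = ["img_a", "img_a", "img_a"]

mrr = mean_reciprocal_rank(rankings, golds)
hit = hit_rate_at_1(rankings, golds)
print(round(mrr, 4), round(hit, 4))  # 0.6111 0.3333
```

MRR gives partial credit when the gold image is near the top but not first, which is why it sits above the Hit Rate in every row of the results table.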

Figure 2: Normalization of the textual and visual input before they are passed into the vision language models. The sentence “Internet Router” (with the underlined target word “router”) is normalized and tokenized. Additionally, the images are resized and normalized.


This review was generated by AI and checked by a human editor.