QUICK REVIEW

[Paper Review] Evaluating the encoding competence of visual language models using uncommon actions

Chen Ling, Nai Ding|arXiv (Cornell University)|Jan 12, 2026

Multimodal Machine Learning Applications | Citations: 0

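ONE-SENTENCE SUMMARY

This paper introduces the UAIT benchmark, which tests models on uncommon-sense action scenes generated with LLMs and diffusion models. It shows that current models struggle with semantic reasoning beyond common patterns and that, although fine-tuning helps, they still fall short of humans.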

ABSTRACT

We propose UAIT (Uncommon-sense Action Image-Text) dataset, a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes. Unlike previous datasets that focus on common visual scenes with statistical frequency advantages, UAIT challenges models with grammatically reasonable but semantically counter-common sense image-text pairs. Such tasks require models to go beyond superficial pattern recognition and demonstrate a deep understanding of agent-patient relationships and physical feasibility. To build UAIT, we designed a semi-automated process to synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. Each sample is accompanied by a carefully designed multiple-choice question to test the model's competence in fine-grained reasoning. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning. Experiments show that all models perform significantly worse than humans in semantic judgment, especially in distinguishing grammatical correctness from semantic rationality. Further experiments show that even the lightweight model can improve its accuracy after fine-tuning, demonstrating the great potential of directional adaptation. This study not only reveals the key weaknesses of VLMs, but also provides diagnostic tools and research directions for the development of robust models with real visual semantic reasoning capabilities.

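MOTIVATION AND OBJECTIVES

Evaluate whether state-of-the-art visual language models can encode and reason about uncommon-sense actions.
Build an uncommon-action image-text dataset (UAIT) using LLMs and diffusion models, to test semantic understanding beyond common patterns.
Analyze the performance gap on counter-common-sense tasks between off-the-shelf models and models fine-tuned with LoRA.
Provide diagnostic tools and directions for improving robust visual-semantic reasoning in VLMs.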

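PROPOSED METHOD

Construct a verb-centered uncommon-sense action dataset (UAIT) using 53 classes and 318 verbs from VerbNet.
Generate uncommon text descriptions with LLMs using few-shot prompting, producing counter-common-sense sentence pairs.
Synthesize the corresponding images by guiding Stable Diffusion with detailed scene descriptions.
Create a VQA-style dataset of two-choice questions that contrast, for each image, a common text description with an uncommon one.
Evaluate several VLMs (Qwen2-VL-Instruct, LLaVA-1.5, LLaMA3.2-Vision) and contrastive learning models (CLIP, RWKV-CLIP).
Apply LoRA-based fine-tuning to examine transferability and performance gains.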
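
The two-choice evaluation described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `TwoChoiceItem`, `make_item`, `pick_by_similarity`, and `accuracy` are hypothetical names, the example captions are invented for illustration and are not taken from UAIT, and the sketch assumes that each image depicts the uncommon description, so the uncommon caption is the correct answer. The contrastive-style decision rule (pick the caption whose embedding has the highest cosine similarity to the image embedding) is the usual way CLIP-like models are applied to caption selection; the embeddings here are toy vectors standing in for real encoder outputs.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TwoChoiceItem:
    """Hypothetical UAIT-style item: one image and two candidate captions."""
    image_id: str
    options: List[str]   # the two captions, in presentation order
    answer_index: int    # index of the caption assumed to match the image


def make_item(image_id: str, uncommon_text: str, common_text: str,
              rng: random.Random) -> TwoChoiceItem:
    """Pair the uncommon caption with its common counterpart and shuffle
    the order so that option position carries no signal."""
    options = [uncommon_text, common_text]
    rng.shuffle(options)
    # Assumption: the image shows the uncommon scene, so that caption is correct.
    return TwoChoiceItem(image_id, options, options.index(uncommon_text))


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_by_similarity(image_vec: Sequence[float],
                       text_vecs: Sequence[Sequence[float]]) -> int:
    """Contrastive-style choice: index of the caption embedding that is
    most similar to the image embedding."""
    scores = [cosine(image_vec, t) for t in text_vecs]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(items: Sequence[TwoChoiceItem],
             predict: Callable[[TwoChoiceItem], int]) -> float:
    """Fraction of items on which the predicted index equals the answer."""
    correct = sum(predict(item) == item.answer_index for item in items)
    return correct / len(items)


if __name__ == "__main__":
    rng = random.Random(0)
    # Invented agent-patient swap, for illustration only.
    item = make_item("img_001", "A mouse chases a cat.",
                     "A cat chases a mouse.", rng)
    print(item.options, item.answer_index)
```

Because every item has two options, random guessing yields an expected accuracy of 0.5, which is the natural baseline against which model scores on such a task would be read.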

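EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can current VLMs identify scenes that are grammatically correct but not semantically supported (uncommon-sense reasoning) across action scenarios?
RQ2: Do models that rely on contrastive learning and models that rely on instruction tuning show different weaknesses on the uncommon-action benchmark?
RQ3: Does fine-tuning (LoRA) improve performance on UAIT tasks, and how close do models get to human-level semantic judgment?
RQ4: Which diagnostic patterns reveal fundamental limits in the deep visual-semantic encoding of actions and agent-patient relationships?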

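KEY FINDINGS

State-of-the-art VLMs lag behind humans in semantic judgment on uncommon-action scenes.
Models struggle to distinguish syntactic correctness from semantic plausibility in action images.
Fine-tuning a lightweight model with LoRA can improve its accuracy on the benchmark.
A substantial gap between model and human performance remains, suggesting fundamental limits in current multimodal understanding.
The evaluation highlights a reliance on superficial patterns and points to a lack of deep visual-semantic encoding.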
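
For readers unfamiliar with the LoRA fine-tuning referred to in the findings, the standard formulation of a LoRA-adapted linear layer is given below. This is the general definition of the technique, not a detail taken from the paper: the rank, scaling factor, and adapted layers used for UAIT are not stated in this review, so the symbols r, \alpha, d, and k are generic placeholders.

```latex
h = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad
W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

The pretrained weight W_0 stays frozen and only A and B are trained, so the number of trainable parameters per adapted layer is r(d + k) rather than dk. This is why the approach suits lightweight adaptation of the kind the findings describe.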

This review was generated by AI and checked by a human editor.