QUICK REVIEW

[Paper Review] OODBench: Out-of-Distribution Benchmark for Large Vision-Language Models

Ling Lin, Yang Bai|arXiv (Cornell University)|Feb 20, 2026

Multimodal Machine Learning Applications | Citations: 0

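One-Line Summary

OODBench presents a covariate-shift out-of-distribution benchmark for large vision-language models, together with automated OOD data partitioning and a Basic-to-Advanced Progression metric, and shows marked performance degradation on OOD data across leading models.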

ABSTRACT

Existing Visual-Language Models (VLMs) have achieved significant progress by being trained on massive-scale datasets, typically under the assumption that data are independent and identically distributed (IID). However, in real-world scenarios, it is often impractical to expect that all data processed by an AI system satisfy this assumption. Furthermore, failure to appropriately handle out-of-distribution (OOD) objects may introduce safety risks in real-world applications (e.g., autonomous driving or medical assistance). Unfortunately, current research has not yet provided valid benchmarks that can comprehensively assess the performance of VLMs in response to OOD data. Therefore, we propose OODBench, a predominantly automated method with minimal human verification, for constructing new benchmarks and evaluating the ability of VLMs to process OOD data. OODBench contains 40K instance-level OOD instance-category pairs, and we show that current VLMs still exhibit notable performance degradation on OODBench, even when the underlying image categories are common. In addition, we propose a reliable automated assessment metric that employs a Basic-to-Advanced Progression of prompted questions to assess the impact of OOD data on questions of varying difficulty more fully. Lastly, we summarize substantial findings and insights to facilitate future research in the acquisition and evaluation of OOD data.

Motivation and Objectives

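Motivates the need for reliable OOD evaluation of Vision-Language Models (VLMs) in real-world settings.
Proposes an automated, verification-assisted pipeline for constructing covariate-shift OOD data for VLMs.
Introduces the Basic-to-Advanced Progression (BAP) metric, which evaluates image understanding, counting, and reasoning under OOD conditions.
Shows that state-of-the-art VLMs suffer marked performance degradation on OOD data compared with in-distribution data.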

Proposed Method

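Defines covariate-shift OOD data for VLMs as images whose data distribution shifts while their labels remain within the training label space, focusing on objects that fall outside the main subject of the image and on variants of such objects.
Automatically partitions OOD samples using two generalized detectors (CLIP and BLIP2), and labels the data subject to a purification step that avoids label-interaction effects, to build OODBench.
Establishes OOD-Hard (intersection) and OOD-Simple (symmetric difference) partitions to capture robust OOD signals and detector-specific OOD signals, respectively.
Adopts a two-question prompting scheme that balances labels and data to collect instance-level OOD data from sources based on COCO, LVIS, nuScenes, and Cityscapes.
Introduces the Basic-to-Advanced Progression (BAP) metric, consisting of Existential (E-Acc), Counting (C-Acc), and Logical (L-Acc) accuracy, to evaluate recognition, counting, and reasoning under ID and OOD conditions.
Evaluates eight state-of-the-art VLMs (including open-source, closed-source, and GPT-family models) on OODBench across ID, OOD-S, and OOD-H data, reporting standard metrics (Accuracy, F1, Precision, Recall, MCC) and BAP-specific scores.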
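
The OOD-Hard / OOD-Simple split described above can be illustrated with plain set operations. This is a minimal sketch, not the authors' pipeline: it assumes each detector (CLIP, BLIP2) has already produced a set of instance IDs it flags as OOD, and it assumes that instances flagged by neither detector are treated as ID. The function name `partition_ood` and the sample IDs are hypothetical.

```python
def partition_ood(clip_flagged, blip2_flagged, all_ids):
    """Split instance IDs into ID, OOD-Simple, and OOD-Hard sets.

    clip_flagged / blip2_flagged: IDs each detector marked as OOD.
    How a detector decides "OOD" is outside the scope of this sketch.
    """
    clip_flagged = set(clip_flagged)
    blip2_flagged = set(blip2_flagged)
    all_ids = set(all_ids)

    # Intersection: both detectors agree the instance is OOD.
    ood_hard = clip_flagged & blip2_flagged
    # Symmetric difference: exactly one detector flags the instance.
    ood_simple = clip_flagged ^ blip2_flagged
    # Assumption: anything flagged by neither detector counts as ID.
    in_dist = all_ids - (clip_flagged | blip2_flagged)
    return in_dist, ood_simple, ood_hard


# Hypothetical instance IDs, for illustration only.
ids = ["a", "b", "c", "d", "e"]
in_dist, ood_simple, ood_hard = partition_ood({"a", "b", "c"}, {"b", "c", "d"}, ids)
print(sorted(in_dist), sorted(ood_simple), sorted(ood_hard))
```

By construction the three sets are disjoint and together cover every instance, so each instance lands in exactly one partition.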

Figure 1 : Comparison of differences in ID data, covariate shift OOD data, and semantic shift data.
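
The standard metrics listed in the evaluation (Accuracy, Precision, Recall, F1, MCC) can all be derived from the confusion counts of a binary task. The sketch below assumes the benchmark questions reduce to yes/no answers, which this review does not state explicitly. `binary_metrics` is a hypothetical helper and the counts are placeholders, not results from the paper.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute standard binary-classification metrics from confusion counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Matthews correlation coefficient; defined as 0 when any margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Placeholder counts, for illustration only.
metrics = binary_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC is useful next to accuracy because it uses all four cells of the confusion matrix, so a model that answers "yes" to everything scores near zero even when the labels are imbalanced.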

Experimental Results

Research Questions

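RQ1: How do modern large vision-language models perform on covariate-shift OOD data compared with in-distribution data?
RQ2: Can an automated OOD data partitioning pipeline with cross-validation across detectors produce OOD data that is representative of the real-world challenges VLMs face?
RQ3: How does the impact of OOD data on image understanding in recognition, counting, and reasoning tasks, under ID and OOD conditions, show up in the Basic-to-Advanced Progression metric?
RQ4: Does Chain-of-Thought prompting improve or degrade VLM performance on OOD data?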

Key Findings

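Leading VLMs show a substantial accuracy drop on OOD-H data (a relative drop of roughly 20–30% versus ID); the models evaluated include LLaVA-NeXT, DeepSeek-VL, InternVL2/2.5, Qwen2-VL, Llama-3.2-Vision, Gemini, and GPT-4o.
CoT prompting gives mixed results: some models gain about 10% accuracy on OOD-H, while others lose accuracy on ID or OOD-S.
Even GPT-4o still shows an accuracy gap of about 26% on OOD-H data relative to ID, so vulnerability to OOD data remains even in top-tier models.
OOD-S data, which is flagged by only one detector, is harder than ID but not as hard as OOD-H, which highlights detector-dependent bias.
The BAP evaluation shows that logical reasoning (L-Acc) degrades faster than recognition or counting as the data moves from ID to OOD-S to OOD-H.
Error analysis reveals two main OOD failure modes: (i) objects outside the main semantic subject and (ii) semantic variants, highlighting an image-text alignment gap beyond the main semantic object.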
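
To make the BAP breakdown and the relative-drop figures concrete, the sketch below aggregates per-question correctness into an accuracy per split and per BAP level, then compares OOD-H against ID. It assumes "relative drop" means (ID accuracy - OOD accuracy) / ID accuracy, which this review does not define. The helper names `bap_accuracy` and `relative_drop` are hypothetical, and the records are placeholders, not results from the paper.

```python
from collections import defaultdict


def bap_accuracy(records):
    """Aggregate per-question correctness into {split: {level: accuracy}}.

    records: iterable of (split, level, correct) tuples, where split is
    "ID", "OOD-S", or "OOD-H" and level is "E", "C", or "L".
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for split, level, correct in records:
        totals[(split, level)] += 1
        hits[(split, level)] += int(bool(correct))
    table = defaultdict(dict)
    for (split, level), n in totals.items():
        table[split][level] = hits[(split, level)] / n
    return dict(table)


def relative_drop(id_acc, ood_acc):
    """Fraction of ID accuracy that is lost on the OOD split."""
    if id_acc == 0:
        raise ValueError("ID accuracy must be non-zero")
    return (id_acc - ood_acc) / id_acc


# Placeholder records (not from the paper): four questions per cell.
records = (
    [("ID", "E", True)] * 4
    + [("OOD-H", "E", True)] * 3 + [("OOD-H", "E", False)]
    + [("ID", "L", True)] * 4
    + [("OOD-H", "L", True)] * 2 + [("OOD-H", "L", False)] * 2
)
table = bap_accuracy(records)
for level in ("E", "L"):
    drop = relative_drop(table["ID"][level], table["OOD-H"][level])
    print(f"{level}-Acc: ID={table['ID'][level]:.2f} "
          f"OOD-H={table['OOD-H'][level]:.2f} drop={drop:.0%}")
```

Reporting the drop per level rather than as one pooled number is what lets a pattern such as "L-Acc degrades faster than E-Acc" become visible.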

Figure 2 : Distribution of categories and fields in OODBench.


This review was generated by AI and checked by a human editor.