QUICK REVIEW

[Paper Review] Vision-Language Models vs Human: Perceptual Image Quality Assessment

Imran Mehmood, Imad Ali Shah|arXiv (Cornell University)|Mar 25, 2026

Image and Video Quality Assessment | Citations: 0

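One-sentence summary

This paper benchmarks six vision-language models against human psychophysical data on three perceptual image quality scales (contrast, colorfulness, and overall preference), and analyzes intra-model consistency, inter-model agreement, and alignment with human judgments.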

ABSTRACT

Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness compared to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement is increased with perceptual separability, indicating VLMs are more reliable when stimulus differences are clearly expressed.

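Motivation and objectives

Evaluate whether Vision-Language Models (VLMs) can approximate human perceptual judgments in IQA on three scales: contrast, colorfulness, and overall preference.
Provide a systematic benchmark that compares six VLMs against psychophysical image quality data.
Identify the strengths, limitations, and conditions under which VLMs agree with human judgments.
Examine how model reliability, inter-model agreement, and perceptual separability affect VLM-based IQA.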

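Proposed method

Run identical pairwise image comparisons for humans and VLMs, using forced-choice prompts for the three image quality attributes.
Evaluate six VLMs (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2, Grok-4.1, InternVL-3.5-38B, Qwen3-VL-32B-Instruct) via API or local prompting.
Process responses with repeatability filtering, validation, and z-score standardization to enable model-human comparison.
Compute intra-model variability (VR%) from three repeats per pair.
Compute cross-model variability (VR%) for the agreement between models on each pair.
Assess alignment with human psychophysical data using Spearman rank correlation and bootstrapping.
Analyze the attribute weighting of overall preference as a linear combination of contrast and colorfulness.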
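
The variability step in the list above can be illustrated with a small sketch. The review does not define VR%, so this assumes it denotes the variation ratio: the percentage of responses that differ from the most common (modal) response. The function names and the "A"/"B" response coding are hypothetical and are not taken from the paper.

```python
from collections import Counter


def variation_ratio_pct(responses):
    """Percentage of responses that differ from the modal response.

    ASSUMPTION: VR% is read here as the variation ratio; the paper's
    exact definition may differ.
    """
    if not responses:
        raise ValueError("responses must not be empty")
    modal_count = Counter(responses).most_common(1)[0][1]
    return 100.0 * (1.0 - modal_count / len(responses))


def mean_vr_pct(responses_by_pair):
    """Average VR% over all image pairs."""
    values = [variation_ratio_pct(r) for r in responses_by_pair.values()]
    return sum(values) / len(values)


# Hypothetical forced-choice answers from three repeats of one model.
repeats = {
    ("img1", "img2"): ["A", "A", "A"],  # fully consistent: VR% = 0
    ("img1", "img3"): ["A", "B", "A"],  # one of three deviates
}
intra_vr = mean_vr_pct(repeats)
```

Under the same assumption, cross-model variability would reuse `variation_ratio_pct` with one response per model for each pair instead of one response per repeat.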

Figure 1 : Workflow for comparing perceptual IQA between human observers and VLMs. (a) Evaluation acquisition: Human psychophysical data are obtained through pairwise comparisons, while VLM assessments are collected via prompt-based image comparisons using an identical query. (b) Data processing: Re

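Experimental results

Research questions

RQ1: Can VLMs reproduce human rankings on perceptual image quality attributes (contrast, colorfulness, overall preference)?
RQ2: Which VLM aligns most closely with human judgments for each attribute?
RQ3: How stable are judgments within a model (intra-model variability) and across models (inter-model variability)?
RQ4: How does the perceptual separability of a scene affect human-VLM agreement?
RQ5: What attribute weights do VLMs assign to contrast and colorfulness when forming overall preference?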

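Key findings

Colorfulness predictions align closely with human judgments for several models (e.g., Claude and Qwen reach rho = 0.93 on colorfulness).
For contrast, Qwen and Gemini align best with human judgments (rho = 0.86 and 0.79, respectively).
Overall-preference alignment is highest for GPT (rho = 0.86) and moderate for Claude, Grok, and Gemini.
Claude shows high intra-model consistency across attributes, but this does not guarantee alignment with humans; GPT is more variable yet shows strong overall alignment.
Inter-model agreement depends on the attribute: contrast shows the largest disagreement between models, while colorfulness shows relatively high inter-model agreement for some pairs.
Human-VLM agreement increases as the perceptual separability of a scene increases, so VLM judgments are more reliable when differences are clearly expressed.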
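
The rho values above are Spearman rank correlations between human and model scale values. A minimal sketch of such a comparison with a percentile bootstrap interval follows. The resampling scheme, the number of resamples, and the example scores are assumptions for illustration only; the review does not specify them.

```python
import random


def average_ranks(values):
    """Rank values 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / (sxx * syy) ** 0.5


def spearman_rho(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for rho, resampling stimuli with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        rho = spearman_rho([x[i] for i in idx], [y[i] for i in idx])
        if rho == rho:  # skip degenerate resamples that yield NaN
            stats.append(rho)
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi


# Hypothetical z-scores for five stimuli (not data from the paper).
human = [-1.2, -0.4, 0.1, 0.6, 1.3]
model = [-0.9, -0.5, 0.3, 0.2, 1.1]
rho = spearman_rho(human, model)
ci_low, ci_high = bootstrap_ci(human, model)
```

With only a handful of stimuli the bootstrap interval is wide, which is one reason to report it alongside the point estimate.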

Figure 2 : Attribute weighting for overall preference. The x-axis represents the contrast weight ( $\alpha$ ) and the y-axis represents the colorfulness weight ( $\beta$ ).
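
One way to obtain weights like the $\alpha$ and $\beta$ in Figure 2 is an ordinary least-squares fit of overall preference on the contrast and colorfulness scores. The sketch below assumes a model without an intercept, overall ≈ α·contrast + β·colorfulness, on standardized scores. The fitting procedure and the function name are assumptions, since the review states only that a linear combination is used.

```python
def fit_attribute_weights(contrast, colorfulness, overall):
    """Least-squares alpha, beta for overall ~ alpha*contrast + beta*colorfulness.

    ASSUMPTION: no intercept term, which suits z-scored inputs; the
    paper's actual fitting procedure is not given in the review.
    """
    scc = sum(c * c for c in contrast)
    skk = sum(k * k for k in colorfulness)
    sck = sum(c * k for c, k in zip(contrast, colorfulness))
    sco = sum(c * o for c, o in zip(contrast, overall))
    sko = sum(k * o for k, o in zip(colorfulness, overall))
    # Solve the 2x2 normal equations with Cramer's rule.
    det = scc * skk - sck * sck
    if abs(det) < 1e-12:
        raise ValueError("contrast and colorfulness scores are collinear")
    alpha = (sco * skk - sko * sck) / det
    beta = (sko * scc - sco * sck) / det
    return alpha, beta


# Hypothetical scores; overall is built with known weights 0.3 and 0.7.
contrast = [-1.0, -0.5, 0.0, 0.5, 1.0]
colorfulness = [0.8, -1.1, 0.3, -0.6, 0.6]
overall = [0.3 * c + 0.7 * k for c, k in zip(contrast, colorfulness)]
alpha, beta = fit_attribute_weights(contrast, colorfulness, overall)
```

A fitted β larger than α would correspond to the reported tendency to weight colorfulness above contrast.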

This review was generated by AI and checked by a human editor.