QUICK REVIEW

[Paper Review] Vision-Language Models vs Human: Perceptual Image Quality Assessment

Imran Mehmood, Imad Ali Shah|arXiv (Cornell University)|2026. 03. 25.

Image and Video Quality Assessment | Citations: 0

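One-Line Summary

This paper benchmarks six vision-language models against human psychophysical data on three perceptual image quality (IQ) scales (contrast, colorfulness, and overall preference), and analyzes intra-model consistency, inter-model agreement, and alignment with human judgments.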

ABSTRACT

Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability encourage automated approaches. We investigate whether Vision Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are benchmarked against psychophysical data. This work presents a systematic benchmark of VLMs for perceptual IQA through comparison with human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (ρ up to 0.93) underperform on contrast and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness compared to contrast when evaluating overall preference, similar to the psychophysical data. Intra-model consistency analysis reveals a counterintuitive tradeoff: the most self-consistent models are not necessarily the most human-aligned, suggesting response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement is increased with perceptual separability, indicating VLMs are more reliable when stimulus differences are clearly expressed.

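Motivation and Objectives

Assess whether Vision-Language Models (VLMs) can approximate human perceptual judgments in IQA on three scales: contrast, colorfulness, and overall preference.
Provide a systematic benchmark comparing six VLMs against psychophysical IQ data.
Identify the strengths, limitations, and conditions under which VLMs agree with human judgments.
Explore how model reliability, inter-model agreement, and perceptual separability affect VLM-based IQA.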

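Proposed Method

Humans and VLMs perform the same pairwise image comparisons, using forced-choice prompts for each of the three IQ attributes.
Six VLMs (Claude Opus 4.6, Gemini 3.1 Pro, GPT-5.2, Grok-4.1, InternVL-3.5-38B, Qwen3-VL-32B-Instruct) are evaluated via API or local prompting.
Responses are processed with reproducibility filtering, validation, and z-score standardization to enable model-human comparison.
Intra-model variability (VR%) is computed across three repetitions per pair.
Cross-model variability (VR%) is computed to measure pairwise agreement between models.
Alignment with human psychophysical data is evaluated using scene-wise bootstrapping and Spearman rank correlation.
Attribute weighting for overall preference is analyzed as a linear combination of contrast and colorfulness.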
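
The standardization and alignment steps can be illustrated with a minimal sketch. It assumes per-scene quality scores are already available for humans and for one model. The function names (`z_scores`, `average_ranks`, `spearman`, `bootstrap_spearman`), the resampling scheme (scenes drawn with replacement, scores pooled across the drawn scenes), and the toy numbers are illustrative assumptions, not the paper's actual pipeline.

```python
import random
from statistics import mean, pstdev


def z_scores(values):
    """Standardize scores to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def average_ranks(values):
    """Rank values from 1..n; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; NaN if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_spearman(human_by_scene, model_by_scene, n_boot=1000, seed=0):
    """Resample scenes with replacement (assumed scheme).

    Returns the mean rho and a 95% percentile interval.
    """
    rng = random.Random(seed)
    scenes = list(human_by_scene)
    rhos = []
    for _ in range(n_boot):
        sample = [rng.choice(scenes) for _ in scenes]
        h = [s for sc in sample for s in human_by_scene[sc]]
        m = [s for sc in sample for s in model_by_scene[sc]]
        rho = spearman(h, m)
        if rho == rho:  # drop NaN from degenerate resamples
            rhos.append(rho)
    if not rhos:
        raise ValueError("no valid bootstrap replicate")
    rhos.sort()
    lo = rhos[int(0.025 * len(rhos))]
    hi = rhos[min(len(rhos) - 1, int(0.975 * len(rhos)))]
    return mean(rhos), (lo, hi)


# Made-up scores for two hypothetical scenes, three image versions each.
human = {"scene_a": z_scores([0.2, 1.5, 0.9]), "scene_b": z_scores([1.1, 0.3, 2.0])}
model = {"scene_a": z_scores([0.1, 1.2, 1.4]), "scene_b": z_scores([0.8, 0.5, 1.9])}
print(bootstrap_spearman(human, model, n_boot=200))
```

Standardizing within each scene before pooling puts human and model scores on a comparable scale, and Spearman's ρ then depends only on rank order.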

Figure 1: Workflow for comparing perceptual IQA between human observers and VLMs. (a) Evaluation acquisition: Human psychophysical data are obtained through pairwise comparisons, while VLM assessments are collected via prompt-based image comparisons using an identical query. (b) Data processing: Re

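Experimental Results

Research Questions

RQ1: Can VLMs reproduce human rankings for the perceptual IQ attributes (contrast, colorfulness, and overall preference)?
RQ2: Which VLM aligns most closely with human judgments for each attribute?
RQ3: How stable are judgments within a model (intra-model variability), and how much do judgments vary across models?
RQ4: How does the perceptual separability of a scene affect human-VLM agreement?
RQ5: What attribute weights do VLMs assign to contrast versus colorfulness when forming overall preference?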
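
This review does not define VR%. As one plausible reading, the sketch below treats it as the variation ratio, i.e. the share of responses that differ from the most common response, expressed as a percentage. That interpretation, the function names, and the example responses are all assumptions made for illustration. The same measure can be applied across repeated runs of one model (intra-model) or across the majority answers of several models (cross-model).

```python
from collections import Counter


def variation_ratio_pct(responses):
    """Percentage of responses that differ from the modal response (assumed VR%)."""
    counts = Counter(responses)
    modal_count = max(counts.values())
    return 100.0 * (1 - modal_count / len(responses))


def mean_vr_pct(responses_per_pair):
    """Average the variation ratio over all image pairs."""
    ratios = [variation_ratio_pct(r) for r in responses_per_pair.values()]
    return sum(ratios) / len(ratios)


def majority(responses):
    """Most common response; ties resolve to the first one encountered."""
    return Counter(responses).most_common(1)[0][0]


# Hypothetical forced-choice answers ("A" or "B") from three repeated runs.
runs_model_x = {"pair_1": ["A", "A", "A"], "pair_2": ["A", "B", "A"]}
runs_model_y = {"pair_1": ["B", "B", "A"], "pair_2": ["A", "A", "A"]}

# Intra-model variability: disagreement across a model's own repetitions.
print("intra-model VR% (model_x):", mean_vr_pct(runs_model_x))

# Cross-model variability: disagreement between the models' majority answers.
cross = {
    pair: [majority(runs_model_x[pair]), majority(runs_model_y[pair])]
    for pair in runs_model_x
}
print("cross-model VR%:", mean_vr_pct(cross))
```

Under this reading, a VR% of 0 means every repetition (or every model) gave the same answer for a pair, and larger values indicate less stable judgments.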

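Key Findings

Colorfulness predictions align strongly with human judgments for several models (e.g., Claude and Qwen reach ρ = 0.93 on colorfulness).
Contrast predictions align best for Qwen and Gemini (ρ = 0.86 and 0.79, respectively).
Overall preference alignment is highest for GPT (ρ = 0.86) and moderate for Claude, Grok, and Gemini.
Claude's intra-model consistency is high across attributes but does not guarantee human alignment; GPT is more variable, yet its overall alignment is stronger.
Inter-model agreement depends on the attribute: contrast shows the largest disagreement between models, while colorfulness shows relatively higher inter-model agreement for some pairs.
Human-VLM agreement increases as the perceptual separability of a scene increases; VLMs are more reliable when differences are clearly expressed.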

Figure 2: Attribute weighting for overall preference. The x-axis represents the contrast weight ($\alpha$) and the y-axis represents the colorfulness weight ($\beta$).
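
The weights α and β can be illustrated with a minimal sketch that models overall preference as α · contrast + β · colorfulness and fits both weights by ordinary least squares. It assumes z-scored inputs, so no intercept is included. The function name `fit_attribute_weights`, the absence of any constraint such as α + β = 1, and the example numbers are assumptions; the review does not state how the paper estimates the weights.

```python
def fit_attribute_weights(contrast, colorfulness, overall):
    """Least-squares fit of overall ~ alpha * contrast + beta * colorfulness.

    Solves the 2x2 normal equations directly (no intercept, assuming
    all three score lists are already standardized).
    """
    s_cc = sum(c * c for c in contrast)
    s_kk = sum(k * k for k in colorfulness)
    s_ck = sum(c * k for c, k in zip(contrast, colorfulness))
    s_co = sum(c * o for c, o in zip(contrast, overall))
    s_ko = sum(k * o for k, o in zip(colorfulness, overall))

    det = s_cc * s_kk - s_ck ** 2
    if abs(det) < 1e-12:
        raise ValueError("contrast and colorfulness scores are collinear")

    alpha = (s_co * s_kk - s_ck * s_ko) / det  # contrast weight
    beta = (s_cc * s_ko - s_ck * s_co) / det   # colorfulness weight
    return alpha, beta


# Made-up standardized scores for four hypothetical image versions.
contrast = [1.0, -1.0, 0.5, -0.5]
colorfulness = [0.2, 0.4, -1.0, 0.4]
overall = [0.3 * c + 0.7 * k for c, k in zip(contrast, colorfulness)]

alpha, beta = fit_attribute_weights(contrast, colorfulness, overall)
print(f"alpha (contrast) = {alpha:.3f}, beta (colorfulness) = {beta:.3f}")
```

A fitted β larger than α would correspond to the pattern reported in the abstract, where most VLMs weight colorfulness more heavily than contrast.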


This review was generated by AI and reviewed by a human editor.