QUICK REVIEW

[Paper Review] Diagnosing Causal Reasoning in Vision-Language Models via Structured Relevance Graphs

Dhita Putri Pratama, Soyeon Caren Han | arXiv (Cornell University) | 2026. 02. 24.

Multimodal Machine Learning Applications | Citations: 0

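One-Line Summary

The paper proposes Vision-Language Causal Graphs (VLCGs) and ViLCaR to diagnose causal attribution and causal inference in LVLMs, and shows that structured relevance guidance improves attribution and inference consistency even though it does not necessarily raise final answer accuracy.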

ABSTRACT

Large Vision-Language Models (LVLMs) achieve strong performance on visual question answering benchmarks, yet often rely on spurious correlations rather than genuine causal reasoning. Existing evaluations primarily assess the correctness of the answers, making it unclear whether failures arise from limited reasoning capability or from misidentifying causally relevant information. We introduce Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation that explicitly encodes causally relevant objects, attributes, relations, and scene-grounded assumptions. Building on this representation, we present ViLCaR, a diagnostic benchmark comprising tasks for Causal Attribution, Causal Inference, and Question Answering, along with graph-aligned evaluation metrics that assess relevance identification beyond final answer accuracy. Experiments in state-of-the-art LVLMs show that injecting structured relevance information significantly improves attribution and inference consistency compared to zero-shot and standard in-context learning. These findings suggest that current limitations in LVLM causal reasoning stem primarily from insufficient structural guidance rather than a lack of reasoning capacity.

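Motivation and Objectives

Motivates the need to diagnose causal reasoning in LVLMs beyond final answer accuracy.
Introduces Vision-Language Causal Graphs (VLCGs), a structured, query-conditioned representation of the causally relevant elements for a given image-question pair.
Builds ViLCaR, a diagnostic benchmark comprising Causal Attribution (CA), Causal Inference (CI), and Question Answering (QA) tasks.
Develops graph-aligned evaluation metrics that separate relevance identification from final answer accuracy.
Demonstrates that structured causal guidance improves the attribution and inference consistency of LVLMs.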

Proposed Method

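Defines a VLCG as a directed graph G=(V,E,A) that links objects, attributes, and relations (V), causal dependencies (E), and explicit scene-grounded assumptions (A).
Constructs ViLCaR from VQA/VCR data through causal filtering, VLCG generation via LVLM prompting, grounding with an independent detector, minimal causal pruning, and human quality control.
Evaluates LVLMs on the three diagnostic tasks (CA, CI, QA) under zero-shot, standard in-context learning (ICL), and VLCG-augmented prompting settings.
Beyond QA accuracy, uses graph-aligned metrics to measure CA (identification of causally relevant attributes) and CI (consistency of the reasoning with the VLCG).
Provides an LLM-based evaluator protocol that compares generated reasoning against the gold causal assumptions in the VLCG.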
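
The graph definition G=(V,E,A) can be sketched as a small data structure. This is a minimal illustration and not the authors' implementation: the names `Node`, `Edge`, `VLCG`, and `overlap_f1` are hypothetical, the edges and the assumption sentence in the example are invented around the wedding question from Figure 1, and the set-overlap F1 is only an assumed stand-in for a graph-aligned CA score, since this review does not give the paper's metric formulas.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """An element of V: an object, attribute, or relation (labels are assumed)."""
    name: str
    kind: str  # "object", "attribute", or "relation"


@dataclass(frozen=True)
class Edge:
    """An element of E: a directed causal dependency between two node names."""
    source: str
    target: str


@dataclass
class VLCG:
    """Hypothetical container for a query-conditioned graph G = (V, E, A)."""
    question: str
    nodes: set[Node] = field(default_factory=set)
    edges: set[Edge] = field(default_factory=set)
    assumptions: list[str] = field(default_factory=list)

    def attribute_names(self) -> set[str]:
        """Names of all nodes tagged as attributes."""
        return {n.name for n in self.nodes if n.kind == "attribute"}


def overlap_f1(predicted: set[str], gold: set[str]) -> float:
    """Set-overlap F1 between predicted and gold names (assumed stand-in metric)."""
    hits = len(predicted & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative graph; the edges and the assumption text are made up for this sketch.
graph = VLCG(
    question="Have these people just married?",
    nodes={
        Node("person", "object"),
        Node("cake", "object"),
        Node("wedding dress", "attribute"),
        Node("suit", "attribute"),
        Node("wear", "relation"),
    },
    edges={Edge("person", "wear"), Edge("wear", "wedding dress"), Edge("wear", "suit")},
    assumptions=["People wearing a wedding dress and a suit near a cake are likely at a wedding."],
)

# A model that names one gold attribute and one irrelevant one.
predicted_attributes = {"wedding dress", "flowers"}
ca_score = overlap_f1(predicted_attributes, graph.attribute_names())
print(f"CA (illustrative F1): {ca_score:.3f}")
```

Scoring against the graph's attribute set rather than the final answer is what lets a metric of this kind separate relevance identification from answer correctness: a model can answer "yes" correctly while still citing the wrong evidence.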

Figure 1. Example of a VLCG. Given an image-question pair (“Have these people just married?”), the graph encodes causally relevant objects (e.g., persons, cake), attributes (wedding dress, suit), relations (wear), and scene-grounded assumptions linking visual evidence to the conclusion. Unlike scen

Experimental Results

Research Questions

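RQ1: Can LVLMs correctly identify the causally relevant attributes for a given image-question pair (CA)?
RQ2: Is the reasoning grounded in the VLCG-identified attributes and assumptions consistent and coherent (CI)?
RQ3: Does injecting VLCG-structured relevance improve final answer accuracy or reasoning quality (QA) over conventional prompting?
RQ4: Do structured causal graphs act as a useful prior that constrains reasoning and reduces reliance on spurious cues?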

Main Results

Setting	CA	CI	QA accuracy	BLEU (reasoning)	ROUGE (reasoning)
Zero-shot	0.458	0.652	0.763	0.164	0.266
Standard ICL	0.455	0.654	0.763	0.163	0.264
VLCG (best)	0.488	0.690	0.768	0.177	0.273

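VLCG-augmented prompting raises CA from 0.458 to 0.488, a relative improvement of +6.6%.
VLCG-augmented prompting raises CI from 0.652 to 0.690, a relative improvement of +5.8%.
QA accuracy is nearly unchanged under VLCG prompting (0.763 to 0.768).
Standard ICL yields essentially no gain over zero-shot on CA/CI (0.455 vs. 0.458 and 0.654 vs. 0.652), whereas VLCG guidance leads to more stable causal reasoning.
BLEU and ROUGE improve only marginally, suggesting that the gains are driven by structured relevance rather than lexical overlap.
Structured causal guidance acts as a relevance prior, helping the model ground its reasoning in causally meaningful variables.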
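
The relative improvements quoted above follow from the table values by simple arithmetic. The short check below recomputes them; `relative_gain` is only an illustrative helper name, and the numbers are the zero-shot and VLCG (best) rows of the results table.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Relative change of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# (zero-shot, VLCG best) pairs copied from the results table.
results = {
    "CA": (0.458, 0.488),
    "CI": (0.652, 0.690),
    "QA accuracy": (0.763, 0.768),
}

gains = {name: relative_gain(base, best) for name, (base, best) in results.items()}
for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}%")
```

This reproduces the +6.6% (CA) and +5.8% (CI) figures and gives roughly +0.7% for QA accuracy, which is why the review describes answer accuracy as nearly unchanged.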

Figure 2. Three diagnostic tasks in ViLCaR derived from the verified and pruned VLCGs: CA, CI, and QA.

This review was generated by AI and reviewed by a human editor.