QUICK REVIEW

[Paper Review] VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

Athanasios Efthymiou, Stevan Rudinac|arXiv (Cornell University)|2026. 03. 02.

Advanced Graph Neural Networks | Citations: 0

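One-Line Summary

VL-KGE combines pre-trained vision-language representations with relational KG backbones to handle modality asymmetry and improve link prediction on multimodal knowledge graphs. It shows consistent gains on WN9-IMG and on the newly introduced WikiArt-MKGs.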

ABSTRACT

Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.

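Motivation and Objectives

Targets multimodal KGE under modality asymmetry, without assuming that every entity has the same modalities available.
Proposes VL-KGE, which combines vision-language representations with structured relational modeling.
Enables inductive inference on unseen entities from pre-trained VLM features alone, even when no structural embedding exists.
Builds and releases large-scale fine art MKGs (WikiArt-MKG-v1, WikiArt-MKG-v2) for studying modality asymmetry.
Shows improved link prediction performance across benchmarks, particularly under modality asymmetry.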

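Proposed Method

Represents each entity as a single unified embedding by applying a fusion operator to its available modalities (structural, visual, textual).
Uses a pre-trained vision-language encoder (BLIP or CLIP) together with a KGE backbone (TransE, DistMult, ComplEx, RotatE), optionally fine-tuning or freezing them.
Handles modality asymmetry by building the entity representation r_e from whichever modalities are available, using averaging, concatenation, or weighted fusion.
Supports inductive inference on unseen entities from pre-trained features alone, even without a structural embedding.
For complex-valued backbones, extends the framework with mechanisms (a projection P, gating) that generate the imaginary part from a domain-adapted pre-trained VLM, preserving inductive compatibility.
Trains with a logistic loss so that positive triples score higher than negative ones: L = Σ log(1 + exp(-y · f(h, r, t))), where the sum runs over training triples, f is the backbone's scoring function, and y is the triple label (+1 for a positive triple, -1 for a negative one under the usual convention).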
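The method bullets above can be made concrete with a small sketch. It is not the authors' implementation: it assumes every modality vector has already been projected to a common dimension, uses plain averaging as the fusion operator, picks DistMult as the backbone, and takes labels y in {+1, -1}. The function names (`fuse_mean`, `distmult_score`, `logistic_loss`) and all toy vectors are hypothetical.

```python
import math

def fuse_mean(modalities):
    """Average whichever modality vectors are present (None marks a missing one)."""
    present = [v for v in modalities if v is not None]
    if not present:
        raise ValueError("entity has no available modality")
    dim = len(present[0])
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]

def distmult_score(h, r, t):
    """DistMult scoring function: sum_i h_i * r_i * t_i."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def logistic_loss(triples):
    """Sum of log(1 + exp(-y * f(h, r, t))) over (h, r, t, y), with y in {+1, -1}."""
    return sum(
        math.log1p(math.exp(-y * distmult_score(h, r, t)))
        for h, r, t, y in triples
    )

# Toy entities given as (structural, visual, textual) vectors; values are made up.
artwork = fuse_mean([[0.2, 0.4], [0.6, 0.0], None])   # no textual modality
artist = fuse_mean([None, None, [1.0, 0.5]])          # unseen entity: text only
distractor = fuse_mean([[-1.0, 0.0], None, None])     # structural embedding only

relation = [1.0, 1.0]  # hypothetical relation embedding

positive_score = distmult_score(artwork, relation, artist)
negative_score = distmult_score(artwork, relation, distractor)
loss = logistic_loss([
    (artwork, relation, artist, +1),      # observed triple
    (artwork, relation, distractor, -1),  # corrupted triple
])
```

The `artist` entity has no structural embedding, so its representation comes from the textual vector alone. This is the case the inductive bullet describes: an unseen entity can be scored without retraining, provided a pre-trained encoder can supply at least one modality vector.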

Figure 3. Qualitative comparison of zero-shot CLIP and VL-ComplEx (base: CLIP) on WikiArt-MKG-v2. Given an artwork (top rows) or an artist (bottom rows) as a query, we show the top-5 predicted entities for selected relations. For artist queries, we use only textual input representations. Correctly r

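Experimental Results

Research Questions

RQ1: Can pre-trained vision-language representations improve KG embeddings under modality asymmetry?
RQ2: How does VL-KGE perform in inductive settings that include unseen entities?
RQ3: Which fusion strategy for combining modalities (averaging, concatenation, weighted) works best on KGE tasks?
RQ4: Do the benefits of VL-KGE hold against unimodal and other multimodal baselines on both the standard and the fine art MKG benchmarks?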

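Key Results

VL-KGE consistently improves over unimodal and other multimodal KGE baselines on WN9-IMG across all backbones.
CLIP-based VL-KGE variants achieve strong performance overall, with VL-DistMult and VL-ComplEx (CLIP) scoring particularly high on WN9-IMG.
VL-KGE shows substantial gains on WikiArt-MKG-v1 and WikiArt-MKG-v2, where modality asymmetry is inherent, demonstrating robustness to missing modalities.
Using a domain-adapted pre-trained VLM (e.g., CLIP with ImageNet-aligned visuals) strengthens relational reasoning over the KG.
Supports inductive inference on unseen entities by deriving representations from the available modalities, with no retraining per new entity.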

Figure 4. Per-relation mean reciprocal rank (MRR) on the WikiArt-MKG-v2 validation set for zero-shot CLIP and VL-KGEs.
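For readers unfamiliar with the metric in Figure 4, the sketch below shows how a per-relation mean reciprocal rank is typically computed from link prediction ranks. It is an illustration only: the relation names and ranks are invented, and the paper's exact evaluation protocol (for example, filtered versus raw ranking) is not specified here.

```python
from collections import defaultdict

def per_relation_mrr(results):
    """Compute MRR per relation.

    results: iterable of (relation, rank) pairs, where rank >= 1 is the
    position of the correct entity in the model's ranked candidate list.
    """
    reciprocal_sums = defaultdict(float)
    counts = defaultdict(int)
    for relation, rank in results:
        if rank < 1:
            raise ValueError("rank must be >= 1")
        reciprocal_sums[relation] += 1.0 / rank
        counts[relation] += 1
    return {rel: reciprocal_sums[rel] / counts[rel] for rel in reciprocal_sums}

# Hypothetical relation names and ranks, for illustration only.
ranks = [("createdBy", 1), ("createdBy", 2), ("hasStyle", 4)]
mrr = per_relation_mrr(ranks)
```

Breaking MRR down by relation, rather than reporting one global number, makes it visible which relation types benefit from the added structure and which are already handled well by the zero-shot encoder.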


This review was generated by AI and checked by a human editor.