QUICK REVIEW

[Paper Review] VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

Athanasios Efthymiou, Stevan Rudinac|arXiv (Cornell University)|Mar 2, 2026

Advanced Graph Neural Networks | Cited by 0

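One-Sentence Summary

VL-KGE combines pretrained vision-language representations with relational knowledge graph embedding (KGE) backbones to handle modality asymmetry and improve link prediction on multimodal knowledge graphs. It shows consistent gains on WN9-IMG and on the newly introduced WikiArt-MKGs.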

ABSTRACT

Real-world multimodal knowledge graphs (MKGs) are inherently heterogeneous, modeling entities that are associated with diverse modalities. Traditional knowledge graph embedding (KGE) methods excel at learning continuous representations of entities and relations, yet they are typically designed for unimodal settings. Recent approaches extend KGE to multimodal settings but remain constrained, often processing modalities in isolation, resulting in weak cross-modal alignment, and relying on simplistic assumptions such as uniform modality availability across entities. Vision-Language Models (VLMs) offer a powerful way to align diverse modalities within a shared embedding space. We propose Vision-Language Knowledge Graph Embeddings (VL-KGE), a framework that integrates cross-modal alignment from VLMs with structured relational modeling to learn unified multimodal representations of knowledge graphs. Experiments on WN9-IMG and two novel fine art MKGs, WikiArt-MKG-v1 and WikiArt-MKG-v2, demonstrate that VL-KGE consistently improves over traditional unimodal and multimodal KGE methods in link prediction tasks. Our results highlight the value of VLMs for multimodal KGE, enabling more robust and structured reasoning over large-scale heterogeneous knowledge graphs.

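Motivation and Objectives

Advance multimodal KGE for settings where the available modalities differ across entities, rather than assuming every entity has every modality.
Propose VL-KGE, which fuses vision-language representations with structured relational modeling.
Use pretrained VLM features to enable inductive inference over unseen entities.
Create and release large-scale fine art MKGs (WikiArt-MKG-v1, WikiArt-MKG-v2) to study modality asymmetry in knowledge graphs.
Demonstrate improvements on benchmarks, especially under modality asymmetry.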

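Proposed Method

Obtains a unified entity embedding by applying a fusion operator to the available modalities (structural, visual, textual).
Pairs a pretrained vision-language encoder (BLIP or CLIP) with a KGE backbone (TransE, DistMult, ComplEx, RotatE); the encoder can be fine-tuned or kept frozen.
Builds the entity representation r_e from whichever modalities are available, using averaging, concatenation, or weighted fusion, which is how modality asymmetry is handled.
When no structural embedding is available, derives the representation from pretrained features alone, enabling inductive inference over unseen entities.
For complex-valued backbones, adds a mechanism that generates the imaginary part (a projection P, gating) so that these backbones stay compatible with inductive inference.
Trains with a logistic loss that pushes positive triples to score higher than negative samples: L = sum log(1 + exp(-y * f(h, r, t))), where f is the backbone's scoring function and y is +1 for a positive triple and -1 for a negative one.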
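
The fusion and training objective described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes all modality vectors have already been projected to a common dimension, it shows only the averaging variant of fusion, it uses DistMult as the example backbone, and every vector and name (`fuse_mean`, `artwork`, `artist`, `unseen`) is made up for illustration.

```python
import math

def fuse_mean(modalities):
    """Average the modality vectors that are present (None marks a missing one)."""
    present = [v for v in modalities if v is not None]
    if not present:
        raise ValueError("entity has no available modality")
    dim = len(present[0])
    return [sum(v[i] for v in present) / len(present) for i in range(dim)]

def distmult_score(h, r, t):
    """DistMult score: sum over dimensions of h_i * r_i * t_i."""
    return sum(a * b * c for a, b, c in zip(h, r, t))

def logistic_loss(scores, labels):
    """Sum of log(1 + exp(-y * f)) with y = +1 (positive) or -1 (negative)."""
    return sum(math.log1p(math.exp(-y * f)) for f, y in zip(scores, labels))

# Toy entities given as (structural, visual, textual) vectors in a shared 3-d space.
# All numbers are invented for the example.
artwork = fuse_mean([[0.2, 0.4, 0.6], [0.4, 0.2, 0.0], [0.6, 0.6, 0.6]])
artist = fuse_mean([[1.0, 0.0, 1.0], None, [0.0, 1.0, 1.0]])  # no image available
unseen = fuse_mean([None, None, [0.5, -0.5, 0.0]])            # text features only
relation = [1.0, 1.0, 1.0]

pos = distmult_score(artwork, relation, artist)  # treated as a true triple
neg = distmult_score(artwork, relation, unseen)  # treated as a corrupted triple
loss = logistic_loss([pos, neg], [+1, -1])
print(pos, neg, loss)
```

Because `fuse_mean` simply skips missing modalities, the same function serves entities with all three modalities, entities lacking an image, and entities with no trained structural embedding. Concatenation or weighted fusion would need a different rule for filling the missing slots.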

Figure 3. Qualitative comparison of zero-shot CLIP and VL-ComplEx (base: CLIP) on WikiArt-MKG-v2. Given an artwork (top rows) or an artist (bottom rows) as a query, we show the top-5 predicted entities for selected relations. For artist queries, we use only textual input representations. Correctly r

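Experimental Results

Research Questions

RQ1: Do pretrained vision-language representations improve knowledge graph embeddings under modality asymmetry?
RQ2: How does VL-KGE perform in inductive settings that include unseen entities?
RQ3: Which modality fusion strategy (averaging, concatenation, weighted) works best for KGE tasks?
RQ4: Do VL-KGE's gains over unimodal and other multimodal baselines hold across both the standard and the fine art MKG benchmarks?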

Key Findings

Method	MRR	Hits@1	Hits@3	Hits@10
MMKRL	0.913	0.905	0.917	0.932
OTKGE	0.923	0.911	0.930	0.947
TransE	0.904	0.894	0.909	0.922
VB-TransE	0.910	0.890	0.923	0.944
VL-TransE (BLIP)	0.910	0.894	0.921	0.940
VL-TransE (CLIP)	0.913	0.890	0.928	0.950
DistMult	0.904	0.902	0.904	0.907
VB-DistMult	0.923	0.914	0.927	0.938
VL-DistMult (BLIP)	0.909	0.907	0.908	0.914
VL-DistMult (CLIP)	0.935	0.925	0.940	0.957
ComplEx	0.900	0.899	0.901	0.902
VB-ComplEx	0.916	0.910	0.918	0.924
VL-ComplEx (BLIP)	0.903	0.900	0.904	0.907
VL-ComplEx (CLIP)	0.927	0.920	0.929	0.941
RotatE	0.910	0.907	0.911	0.917
VB-RotatE	0.910	0.903	0.914	0.925
VL-RotatE (BLIP)	0.911	0.898	0.918	0.931
VL-RotatE (CLIP)	0.914	0.904	0.918	0.934
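
For readers unfamiliar with the column headers, the following sketch shows the standard definitions of MRR and Hits@k computed from the rank of the correct entity in each test query. The ranks are invented, and whether the paper uses filtered or raw ranks is not stated in this summary.

```python
def mrr(ranks):
    """Mean reciprocal rank: average of 1 / rank of the correct entity."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Invented ranks of the correct entity for five link prediction queries.
ranks = [1, 2, 1, 10, 4]

print("MRR    ", mrr(ranks))
print("Hits@1 ", hits_at_k(ranks, 1))
print("Hits@3 ", hits_at_k(ranks, 3))
print("Hits@10", hits_at_k(ranks, 10))
```

Both metrics lie in [0, 1] and higher is better. MRR rewards placing the correct entity near the top, while Hits@k only checks whether it falls within the first k positions.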

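On WN9-IMG, VL-KGE improves MRR over the corresponding unimodal backbone for all four backbones (for example, DistMult 0.904 vs. VL-DistMult (CLIP) 0.935), although the margins for TransE and RotatE are small.
CLIP-based VL-KGE variants perform best overall; VL-DistMult (CLIP, MRR 0.935) and VL-ComplEx (CLIP, MRR 0.927) stand out on WN9-IMG and are the only variants in the table to exceed OTKGE (0.923).
VL-KGE shows significant gains on WikiArt-MKG-v1 and WikiArt-MKG-v2, where modality asymmetry is an inherent characteristic, and it remains robust when modalities are missing.
Using pretrained VLMs that are aligned with the target domain (e.g., ImageNet-aligned vision) improves relational reasoning.
The framework supports inductive inference by deriving representations of unseen entities from their available modalities alone, with no need to retrain for each new entity.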
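
The inductive setting in the last finding can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the unseen entity is represented by the mean of hypothetical visual and textual feature vectors, TransE with an L2 distance is used as the backbone, and the relation `created_by`, the candidate artists, and all numbers are invented.

```python
import math

def transe_score(h, r, t):
    """TransE score: negative L2 distance between h + r and t (higher is better)."""
    return -math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(h, r, t)))

def rank_tails(head, relation, candidates):
    """Return candidate names sorted from best to worst TransE score."""
    return sorted(
        candidates,
        key=lambda name: transe_score(head, relation, candidates[name]),
        reverse=True,
    )

# Hypothetical unseen artwork: it has no trained structural embedding, so its
# representation is the mean of its (made-up) visual and textual features.
visual = [0.9, 0.1]
textual = [0.7, 0.3]
unseen_artwork = [(v + t) / 2 for v, t in zip(visual, textual)]

# Relation and candidate tail embeddings are assumed to be already trained.
created_by = [0.2, 0.8]
artists = {
    "artist_a": [1.0, 1.0],
    "artist_b": [0.0, 0.0],
    "artist_c": [1.0, 0.0],
}

ranking = rank_tails(unseen_artwork, created_by, artists)
print(ranking)
```

No parameter is updated for the new entity: only the pretrained feature extractor and the already trained relation and candidate embeddings are used, which is what makes the prediction inductive.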

Figure 4. Per-relation mean reciprocal rank (MRR) on the WikiArt-MKG-v2 validation set for zero-shot CLIP and VL-KGEs.

This review was generated by AI and checked by human editors.