QUICK REVIEW

[Paper Review] Vision, Deduction and Alignment: An Empirical Study on Multi-modal Knowledge Graph Alignment

Yangning Li, Jiaoyan Chen|arXiv (Cornell University)|Feb 17, 2023

Advanced Graph Neural Networks|Cited by 7

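ONE-SENTENCE SUMMARY

The paper constructs Multi-OpenEA, a set of eight large-scale entity alignment benchmarks equipped with images, analyzes how visual signals affect embedding-based models, and proposes LODEME, a self-supervised multi-modal alignment method that achieves state-of-the-art results over existing baselines.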

ABSTRACT

Entity alignment (EA) for knowledge graphs (KGs) plays a critical role in knowledge engineering. Existing EA methods mostly focus on utilizing the graph structures and entity attributes (including literals), but ignore images that are common in modern multi-modal KGs. In this study we first constructed Multi-OpenEA -- eight large-scale, image-equipped EA benchmarks, and then evaluated some existing embedding-based methods for utilizing images. In view of the complementary nature of visual modal information and logical deduction, we further developed a new multi-modal EA method named LODEME using logical deduction and multi-modal KG embedding, with state-of-the-art performance achieved on Multi-OpenEA and other existing multi-modal EA benchmarks.

MOTIVATION AND OBJECTIVES

Motivate the use of visual modality to supplement structure and literals in entity alignment (EA).
Create large-scale, image-equipped EA benchmarks to reflect real-world multi-modal KGs.
Evaluate existing embedding-based EA models with visual extensions on Multi-OpenEA.
Propose a self-supervised, multi-modal EA method (LODEME) that combines logical deduction with multi-modal embeddings.

PROPOSED METHOD

Extend four embedding-based EA models (BootEA, MultiKE, RDGCN, IMUSE) with image modalities (suffix -V).
Construct the Multi-OpenEA benchmarks by adding multiple images per entity to eight OpenEA datasets.
Develop LODEME with a probabilistic reasoning (PR) module inspired by PARIS and a multi-modal semantic embedding (SE) module with structure-aware image attention.
In SE, encode structure (GCN), relations/attributes, name embeddings (M-BERT), and image embeddings (CLIP) with a weighted fusion of modalities.
Train with a margin-based alignment loss and hard negative sampling; run inference with greedy search and CSLS (cross-domain similarity local scaling).
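
The fusion and inference steps listed above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the softmax-weighted concatenation in `fuse_modalities`, the neighbourhood size `k`, the toy data, and all function names are assumptions made for the example.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def fuse_modalities(embeddings, logits):
    """Assumed fusion: concatenate normalized modality embeddings,
    each scaled by a softmax weight derived from `logits`."""
    logits = np.asarray(logits, dtype=float)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    parts = [w * l2_normalize(e) for w, e in zip(weights, embeddings)]
    return np.concatenate(parts, axis=1)


def csls_scores(src, tgt, k=10):
    """CSLS: 2*cos(x, y) minus the mean similarity of x and of y
    to their k nearest neighbours in the other graph."""
    sim = l2_normalize(src) @ l2_normalize(tgt).T
    k_s = min(k, sim.shape[1])
    k_t = min(k, sim.shape[0])
    r_src = np.sort(sim, axis=1)[:, -k_s:].mean(axis=1)
    r_tgt = np.sort(sim, axis=0)[-k_t:, :].mean(axis=0)
    return 2.0 * sim - r_src[:, None] - r_tgt[None, :]


def greedy_align(scores):
    """Greedy search: each source entity picks its best-scoring target."""
    return scores.argmax(axis=1)


# Toy data (hypothetical): two modalities for 20 entities.
rng = np.random.default_rng(0)
n = 20
structure = rng.normal(size=(n, 32))
images = rng.normal(size=(n, 16))
src = fuse_modalities([structure, images], logits=[0.0, 0.0])

# The "target KG" holds the same entities, shuffled and slightly perturbed.
perm = rng.permutation(n)
tgt = src[perm] + 0.01 * rng.normal(size=src.shape)

pred = greedy_align(csls_scores(src, tgt))
gold = np.argsort(perm)  # inverse permutation: source i -> its target index
accuracy = float(np.mean(pred == gold))
```

Subtracting the neighbourhood terms penalizes "hub" entities that are similar to many candidates, which is why CSLS is often preferred to raw cosine similarity for nearest-neighbour alignment.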

EXPERIMENTAL RESULTS

Research Questions

RQ1: Do visual modalities improve embedding-based EA methods on large-scale, image-equipped KGs?
RQ2: How does integrating logical deduction with multi-modal embeddings affect EA performance?
RQ3: What is the impact of different modalities (structure, literals, names, images) and the number of images on EA accuracy?
RQ4: How does LODEME compare to existing multi-modal EA methods on diverse benchmarks?
RQ5: Can a structure-aware attention mechanism effectively utilize multiple images per entity?

Key Findings

Model	Hit@1 (15K-V1)	Hit@5 (15K-V1)	MRR (15K-V1)	Hit@1 (15K-V2)	Hit@5 (15K-V2)	MRR (15K-V2)	Hit@1 (100K-V1)	Hit@5 (100K-V1)	MRR (100K-V1)	Hit@1 (100K-V2)	Hit@5 (100K-V2)	MRR (100K-V2)
BootEA	0.618	0.795	0.697	0.488	0.704	0.584	0.516	0.685	0.594	0.766	0.892	0.822
BootEA-V	0.730	0.901	0.805	0.728	0.926	0.814	0.643	0.837	0.730	0.830	0.937	0.866
MultiKE	0.426	0.513	0.471	0.561	0.723	0.636	0.291	0.352	0.324	0.327	0.410	0.371
MultiKE-V	0.737	0.771	0.754	0.727	0.765	0.746	0.743	0.766	0.755	0.687	0.727	0.707
RDGCN	0.561	0.714	0.722	0.640	0.777	0.702	0.362	0.485	0.420	0.421	0.528	0.473
RDGCN-V	0.683	0.800	0.736	0.686	0.817	0.744	0.537	0.656	0.592	0.489	0.704	0.584
IMUSE	0.327	0.523	0.419	0.581	0.778	0.671	0.276	0.437	0.355	0.431	0.631	0.525
IMUSE-V	0.404	0.593	0.492	0.606	0.806	0.696	0.351	0.521	0.432	0.494	0.701	0.590
PARIS	0.734	-	-	0.840	-	-	0.667	-	-	0.795	-	-
MSNEA	0.962	0.988	0.973	0.971	0.974	0.989	0.946	0.957	0.952	0.982	0.988	0.989
EVA	0.971	0.989	0.978	0.990	0.998	0.994	0.968	0.989	0.976	0.991	0.998	0.994
LODEME (EN-FR)	0.989	0.997	0.992	0.997	1.000	0.998	0.966	0.983	0.972	0.978	0.996	0.985
LODEME (D-W)	0.991	0.998	0.994	0.996	1.000	0.998	0.973	0.992	0.973	0.994	0.999	0.996
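
For readers unfamiliar with the metrics in the table, the following sketch shows how Hit@k and MRR are typically computed from a score matrix. The function names and the toy scores are illustrative assumptions; ties are resolved optimistically here, which may differ from the paper's evaluation code.

```python
import numpy as np


def ranks_of_gold(scores, gold):
    """1-based rank of the gold target for each source entity.
    A higher score is better; ties count in favour of the gold target."""
    gold_scores = scores[np.arange(len(gold)), gold]
    return 1 + (scores > gold_scores[:, None]).sum(axis=1)


def hits_at_k(ranks, k):
    """Fraction of source entities whose gold target is ranked in the top k."""
    return float(np.mean(ranks <= k))


def mrr(ranks):
    """Mean reciprocal rank of the gold targets."""
    return float(np.mean(1.0 / ranks))


# Toy example: 3 source entities scored against 3 candidate targets.
scores = np.array([
    [0.9, 0.1, 0.0],   # gold target 0 is ranked 1st
    [0.2, 0.3, 0.8],   # gold target 1 is ranked 2nd
    [0.5, 0.4, 0.6],   # gold target 2 is ranked 1st
])
gold = np.array([0, 1, 2])

ranks = ranks_of_gold(scores, gold)
print(hits_at_k(ranks, 1))  # 2 of 3 gold targets ranked first, about 0.667
print(mrr(ranks))           # (1 + 1/2 + 1) / 3, about 0.833
```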

Adding the visual modality improves embedding-based EA models, with an average Hit@1 gain of around 12%.
LODEME achieves state-of-the-art results on Multi-OpenEA benchmarks with Hit@1 over 95% (D-W and EN-FR variants shown).
Image-augmented embedding-based models (suffix -V) show notable performance gains across datasets; the highest average improvements are observed for BootEA-V, MultiKE-V, and IMUSE-V.
Ablation shows structure information remains the most important modality; visual data provides stronger gains for sparser KGs, and removing all images degrades performance more than removing names or relations/attributes.
The structure-aware attention over multiple images outperforms mean-pooling and single best-image strategies, highlighting effective multi-image utilization.
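
The contrast between attention-based pooling and mean-pooling over an entity's images can be illustrated with a small sketch. It assumes, purely for illustration, that the structure embedding and the image embeddings already share one dimension and that attention is a scaled dot product; the paper's actual attention design may differ, and all names below are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def mean_pool(images):
    """Baseline: average all image embeddings of one entity."""
    return images.mean(axis=0)


def attention_pool(query, images):
    """Weight each image by its scaled dot product with the entity's
    structure embedding (`query`), then take the weighted sum."""
    scores = images @ query / np.sqrt(query.shape[0])
    weights = softmax(scores)
    return weights @ images, weights


# Toy entity with three images in a 2-D embedding space (hypothetical values).
images = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [0.0, 1.0]])
query = np.array([0.0, 4.0])  # structure embedding pointing along axis 1

pooled, weights = attention_pool(query, images)
# Images 2 and 3 agree with the query, so they receive larger weights
# than image 1, which mean-pooling would treat identically.
```

With an all-zero query every image receives the same weight, so attention pooling reduces to mean-pooling; the structure signal is what lets the model down-weight unrepresentative images.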

This review was generated by AI and reviewed by human editors.