QUICK REVIEW

[Paper Review] Vision, Deduction and Alignment: An Empirical Study on Multi-modal Knowledge Graph Alignment

Yangning Li, Jiaoyan Chen|arXiv (Cornell University)|Feb 17, 2023

Advanced Graph Neural Networks | Citations: 7

One-sentence summary

The paper builds Multi-OpenEA, a set of eight large-scale entity alignment benchmarks equipped with images, analyzes how visual signals affect embedding-based models, and introduces LODEME, a self-supervised multi-modal alignment method that achieves state-of-the-art results.

ABSTRACT

Entity alignment (EA) for knowledge graphs (KGs) plays a critical role in knowledge engineering. Existing EA methods mostly focus on utilizing the graph structures and entity attributes (including literals), but ignore images that are common in modern multi-modal KGs. In this study we first constructed Multi-OpenEA -- eight large-scale, image-equipped EA benchmarks, and then evaluated some existing embedding-based methods for utilizing images. In view of the complementary nature of visual modal information and logical deduction, we further developed a new multi-modal EA method named LODEME using logical deduction and multi-modal KG embedding, with state-of-the-art performance achieved on Multi-OpenEA and other existing multi-modal EA benchmarks.

Motivation and objectives

Motivate using the visual modality to supplement graph structure and literals in entity alignment (EA).
Create large-scale, image-equipped EA benchmarks to reflect real-world multi-modal KGs.
Evaluate existing embedding-based EA models with visual extensions on Multi-OpenEA.
Propose a self-supervised, multi-modal EA method (LODEME) that combines logical deduction with multi-modal embeddings.

Proposed method

Extend four embedding-based EA models (BootEA, MultiKE, RDGCN, IMUSE) with image modalities (suffix -V).
Construct the Multi-OpenEA benchmarks by adding multiple images per entity to eight OpenEA datasets.
Develop LODEME with a probabilistic reasoning (PR) module inspired by PARIS and a multi-modal semantic embedding (SE) module with structure-aware image attention.
In SE, encode structure (GCN), relations/attributes, name embeddings (M-BERT), and image embeddings (CLIP) with a weighted fusion of modalities.
Train with a margin-based alignment loss and hard negative sampling; perform inference using greedy search and CSLS (cross-domain similarity local scaling).
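
The training and inference steps in the last item can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, the hinge form of the loss, and the default values of `margin` and `k` are all assumptions made for illustration, and only the Python standard library is used.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hardest_negatives(sim_row, gold, n=1):
    """Indices of the n most similar candidates that are not the gold match."""
    others = [j for j in range(len(sim_row)) if j != gold]
    return sorted(others, key=lambda j: sim_row[j], reverse=True)[:n]

def margin_loss(pos_sim, neg_sims, margin=1.0):
    """Hinge loss: every negative should score at least `margin` below the positive."""
    return sum(max(0.0, margin - pos_sim + s) for s in neg_sims)

def csls_matrix(src, tgt, k=1):
    """CSLS(x, y) = 2*cos(x, y) - r(x) - r(y), where r is the mean similarity
    of an entity to its k nearest neighbours in the other graph."""
    sim = [[cosine(x, y) for y in tgt] for x in src]

    def mean_top_k(values):
        top = sorted(values, reverse=True)[:k]
        return sum(top) / len(top)

    r_src = [mean_top_k(row) for row in sim]
    r_tgt = [mean_top_k([sim[i][j] for i in range(len(src))])
             for j in range(len(tgt))]
    return [[2 * sim[i][j] - r_src[i] - r_tgt[j] for j in range(len(tgt))]
            for i in range(len(src))]

def greedy_align(scores):
    """Greedy search: each source entity takes its highest-scoring target."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Toy example with hypothetical 2-d embeddings for two entities per graph.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.9, 0.1], [0.1, 0.9]]
alignment = greedy_align(csls_matrix(src, tgt))  # [0, 1]
```

CSLS subtracts each entity's average similarity to its nearest neighbours, which penalizes "hub" entities that are close to many candidates; that is the usual reason for preferring it over raw cosine similarity at inference time.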

Experimental results

Research questions

RQ1: Do visual modalities improve embedding-based EA methods on large-scale, image-equipped KGs?
RQ2: How does integrating logical deduction with multi-modal embeddings affect EA performance?
RQ3: What is the impact of different modalities (structure, literals, names, images) and the number of images on EA accuracy?
RQ4: How does LODEME compare to existing multi-modal EA methods on diverse benchmarks?
RQ5: Can a structure-aware attention mechanism effectively utilize multiple images per entity?

Key findings

Model	Hit@1 (15K-V1)	Hit@5 (15K-V1)	MRR (15K-V1)	Hit@1 (15K-V2)	Hit@5 (15K-V2)	MRR (15K-V2)	Hit@1 (100K-V1)	Hit@5 (100K-V1)	MRR (100K-V1)	Hit@1 (100K-V2)	Hit@5 (100K-V2)	MRR (100K-V2)
BootEA	0.618	0.795	0.697	0.488	0.704	0.584	0.516	0.685	0.594	0.766	0.892	0.822
BootEA-V	0.730	0.901	0.805	0.728	0.926	0.814	0.643	0.837	0.730	0.830	0.937	0.866
MultiKE	0.426	0.513	0.471	0.561	0.723	0.636	0.291	0.352	0.324	0.327	0.410	0.371
MultiKE-V	0.737	0.771	0.754	0.727	0.765	0.746	0.743	0.766	0.755	0.687	0.727	0.707
RDGCN	0.561	0.714	0.722	0.640	0.777	0.702	0.362	0.485	0.420	0.421	0.528	0.473
RDGCN-V	0.683	0.800	0.736	0.686	0.817	0.744	0.537	0.656	0.592	0.489	0.704	0.584
IMUSE	0.327	0.523	0.419	0.581	0.778	0.671	0.276	0.437	0.355	0.431	0.631	0.525
IMUSE-V	0.404	0.593	0.492	0.606	0.806	0.696	0.351	0.521	0.432	0.494	0.701	0.590
PARIS	0.734	-	-	0.840	-	-	0.667	-	-	0.795	-	-
MSNEA	0.962	0.988	0.973	0.971	0.974	0.989	0.946	0.957	0.952	0.982	0.988	0.989
EVA	0.971	0.989	0.978	0.990	0.998	0.994	0.968	0.989	0.976	0.991	0.998	0.994
LODEME (EN-FR)	0.989	0.997	0.992	0.997	1.000	0.998	0.966	0.983	0.972	0.978	0.996	0.985
LODEME (D-W)	0.991	0.998	0.994	0.996	1.000	0.998	0.973	0.992	0.973	0.994	0.999	0.996
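
For readers unfamiliar with the columns above, Hit@k and MRR are both derived from the rank of the correct counterpart among all candidates. The sketch below uses the standard definitions; the helper names and the toy ranks are hypothetical and are not taken from the paper.

```python
def rank_of_gold(scores, gold):
    """1-based rank of the gold candidate: one plus the number of
    candidates with a strictly higher score."""
    return 1 + sum(1 for s in scores if s > scores[gold])

def hits_at_k(ranks, k):
    """Fraction of test entities whose gold counterpart is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test entities."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Toy example: gold ranks for four hypothetical test entities.
ranks = [1, 1, 2, 7]
hit1 = hits_at_k(ranks, 1)          # 0.5
hit5 = hits_at_k(ranks, 5)          # 0.75
mrr = mean_reciprocal_rank(ranks)   # (1 + 1 + 1/2 + 1/7) / 4
```

By construction Hit@1 is never larger than MRR or Hit@5 on the same dataset, which is a quick sanity check when reading rows of the table.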

Adding the visual modality improves the embedding-based EA models, with an average Hit@1 gain of around 12%.
LODEME achieves state-of-the-art results on Multi-OpenEA benchmarks with Hit@1 over 95% (D-W and EN-FR variants shown).
Modified embedding-based models with images (suffix -V) show notable performance gains across datasets; highest average improvement observed for BootEA-V, MultiKE-V, and IMUSE-V.
Ablation shows structure information remains the most important modality; visual data provides stronger gains for sparser KGs, and removing all images degrades performance more than removing names or relations/attributes.
The structure-aware attention over multiple images outperforms mean-pooling and single best-image strategies, highlighting effective multi-image utilization.
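
The contrast between attention over multiple images and mean-pooling in the last finding can be sketched as follows. The review does not give the exact form of LODEME's structure-aware attention, so the dot-product scoring, the softmax weighting, and the assumption that the structure embedding and image embeddings share one dimensionality are all illustrative choices, as are the function names.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(images):
    """Baseline: average all image embeddings of an entity."""
    n = len(images)
    return [sum(col) / n for col in zip(*images)]

def attention_pool(query, images):
    """Weight each image by its dot-product agreement with the entity's
    structure embedding (`query`), then take the weighted sum."""
    scores = [sum(q * x for q, x in zip(query, img)) for img in images]
    weights = softmax(scores)
    pooled = [sum(w * img[d] for w, img in zip(weights, images))
              for d in range(len(query))]
    return pooled, weights

# Toy example: one image agrees with the structure embedding, one does not.
structure = [1.0, 0.0]
images = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool(structure, images)
baseline = mean_pool(images)  # [0.5, 0.5]
```

In this toy case the attention weights favour the first image, so the pooled vector stays closer to the structure embedding than the plain average does; mean-pooling treats an off-topic image the same as a relevant one.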

This review was generated by AI and checked by a human editor.