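[Paper Digest] Vision, Deduction and Alignment: An Empirical Study on Multi-modal Knowledge Graph Alignment
The paper builds Multi-OpenEA, a set of eight large-scale entity alignment benchmarks equipped with images, analyzes how visual signals affect embedding-based models, and proposes LODEME, a self-supervised multi-modal alignment method that achieves state-of-the-art results over existing baselines.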
Entity alignment (EA) for knowledge graphs (KGs) plays a critical role in knowledge engineering. Existing EA methods mostly focus on utilizing the graph structures and entity attributes (including literals), but ignore images that are common in modern multi-modal KGs. In this study we first constructed Multi-OpenEA -- eight large-scale, image-equipped EA benchmarks, and then evaluated some existing embedding-based methods for utilizing images. In view of the complementary nature of visual modal information and logical deduction, we further developed a new multi-modal EA method named LODEME using logical deduction and multi-modal KG embedding, with state-of-the-art performance achieved on Multi-OpenEA and other existing multi-modal EA benchmarks.
Research Motivation and Objectives
- Motivate the use of visual modality to supplement structure and literals in entity alignment (EA).
- Create large-scale, image-equipped EA benchmarks to reflect real-world multi-modal KGs.
- Evaluate existing embedding-based EA models with visual extensions on Multi-OpenEA.
- Propose a self-supervised, multi-modal EA method (LODEME) that combines logical deduction with multi-modal embeddings.
Proposed Method
- Extend four embedding-based EA models (BootEA, MultiKE, RDGCN, IMUSE) with image modalities (suffix -V).
- Construct the Multi-OpenEA benchmarks by adding multiple images per entity to eight OpenEA datasets.
- Develop LODEME with a probabilistic reasoning (PR) module inspired by PARIS and a multi-modal semantic embedding (SE) module with structure-aware image attention.
- In SE, encode structure (GCN), relations/attributes, name embeddings (M-BERT), and image embeddings (CLIP) with a weighted fusion of modalities.
- Train with a margin-based alignment loss and use hard negative sampling; inference via greedy search and CSLS.
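The training and inference steps listed above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the softmax-weighted concatenation used for fusion, the cosine distance, the neighborhood size `k`, and every function name here are assumptions chosen for illustration. Only the general ingredients (weighted fusion, a margin-based loss, hard negatives, CSLS with greedy search) come from the summary.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def fuse_modalities(modal_vecs, weights):
    """Assumed fusion: concatenate L2-normalized modality vectors,
    each scaled by a softmax-normalized weight."""
    max_w = max(weights)
    exps = [math.exp(w - max_w) for w in weights]
    total = sum(exps)
    fused = []
    for vec, e in zip(modal_vecs, exps):
        fused.extend((e / total) * x for x in l2_normalize(vec))
    return fused

def hardest_negatives(anchor, candidates, n):
    """Hard negative sampling: keep the n candidates most similar to the anchor."""
    return sorted(candidates, key=lambda c: cosine(anchor, c), reverse=True)[:n]

def margin_loss(anchor, positive, negatives, margin=1.0):
    """Hinge loss: the aligned pair should be closer than each negative by `margin`."""
    pos_dist = 1.0 - cosine(anchor, positive)
    return sum(max(0.0, margin + pos_dist - (1.0 - cosine(anchor, neg)))
               for neg in negatives)

def csls_matrix(sim, k=1):
    """CSLS(s, t) = 2 * sim(s, t) - r(s) - r(t), where r is the mean
    similarity of an entity to its k nearest neighbors on the other side."""
    n_src, n_tgt = len(sim), len(sim[0])
    r_src = [sum(sorted(row, reverse=True)[:k]) / k for row in sim]
    r_tgt = [sum(sorted((sim[i][j] for i in range(n_src)), reverse=True)[:k]) / k
             for j in range(n_tgt)]
    return [[2 * sim[i][j] - r_src[i] - r_tgt[j] for j in range(n_tgt)]
            for i in range(n_src)]

def greedy_align(scores):
    """Greedy search: each source entity picks its best-scoring target."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Toy similarity matrix (rows: source entities, columns: target entities).
# Target 0 is a "hub" that is fairly similar to every source entity.
sim = [
    [0.90, 0.60, 0.10],
    [0.80, 0.75, 0.10],
    [0.70, 0.10, 0.65],
]
raw_alignment = greedy_align(sim)                      # [0, 0, 0]
csls_alignment = greedy_align(csls_matrix(sim, k=2))   # [0, 1, 2]
```

In this toy case, greedy search on raw similarities maps every source entity to the hub target, while the CSLS rescaling penalizes the hub and recovers a one-to-one alignment. That is the usual motivation for pairing CSLS with greedy search.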
Experimental Results
Research Questions
- RQ1: Do visual modalities improve embedding-based EA methods on large-scale, image-equipped KGs?
- RQ2: How does integrating logical deduction with multi-modal embeddings affect EA performance?
- RQ3: What is the impact of different modalities (structure, literals, names, images) and the number of images on EA accuracy?
- RQ4: How does LODEME compare to existing multi-modal EA methods on diverse benchmarks?
- RQ5: Can a structure-aware attention mechanism effectively utilize multiple images per entity?
Key Findings
| Model | Hit@1 (15K-V1) | Hit@5 (15K-V1) | MRR (15K-V1) | Hit@1 (15K-V2) | Hit@5 (15K-V2) | MRR (15K-V2) | Hit@1 (100K-V1) | Hit@5 (100K-V1) | MRR (100K-V1) | Hit@1 (100K-V2) | Hit@5 (100K-V2) | MRR (100K-V2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BootEA | 0.618 | 0.795 | 0.697 | 0.488 | 0.704 | 0.584 | 0.516 | 0.685 | 0.594 | 0.766 | 0.892 | 0.822 |
| BootEA-V | 0.730 | 0.901 | 0.805 | 0.728 | 0.926 | 0.814 | 0.643 | 0.837 | 0.730 | 0.830 | 0.937 | 0.866 |
| MultiKE | 0.426 | 0.513 | 0.471 | 0.561 | 0.723 | 0.636 | 0.291 | 0.352 | 0.324 | 0.327 | 0.410 | 0.371 |
| MultiKE-V | 0.737 | 0.771 | 0.754 | 0.727 | 0.765 | 0.746 | 0.743 | 0.766 | 0.755 | 0.687 | 0.727 | 0.707 |
| RDGCN | 0.561 | 0.714 | 0.722 | 0.640 | 0.777 | 0.702 | 0.362 | 0.485 | 0.420 | 0.421 | 0.528 | 0.473 |
| RDGCN-V | 0.683 | 0.800 | 0.736 | 0.686 | 0.817 | 0.744 | 0.537 | 0.656 | 0.592 | 0.489 | 0.704 | 0.584 |
| IMUSE | 0.327 | 0.523 | 0.419 | 0.581 | 0.778 | 0.671 | 0.276 | 0.437 | 0.355 | 0.431 | 0.631 | 0.525 |
| IMUSE-V | 0.404 | 0.593 | 0.492 | 0.606 | 0.806 | 0.696 | 0.351 | 0.521 | 0.432 | 0.494 | 0.701 | 0.590 |
| PARIS | 0.734 | - | - | 0.840 | - | - | 0.667 | - | - | 0.795 | - | - |
| MSNEA | 0.962 | 0.988 | 0.973 | 0.971 | 0.974 | 0.989 | 0.946 | 0.957 | 0.952 | 0.982 | 0.988 | 0.989 |
| EVA | 0.971 | 0.989 | 0.978 | 0.990 | 0.998 | 0.994 | 0.968 | 0.989 | 0.976 | 0.991 | 0.998 | 0.994 |
| LODEME (EN-FR) | 0.989 | 0.997 | 0.992 | 0.997 | 1.000 | 0.998 | 0.966 | 0.983 | 0.972 | 0.978 | 0.996 | 0.985 |
| LODEME (D-W) | 0.991 | 0.998 | 0.994 | 0.996 | 1.000 | 0.998 | 0.973 | 0.992 | 0.973 | 0.994 | 0.999 | 0.996 |
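For readers unfamiliar with the metrics in the table, the following sketch shows how Hit@k and MRR are conventionally computed from the rank of the correct target entity for each test pair. The function names and the toy ranks are illustrative assumptions, not taken from the paper's evaluation code.

```python
def rank_of_gold(scores, gold_index):
    """1-based rank of the gold target among all candidates (higher score is better)."""
    gold = scores[gold_index]
    return 1 + sum(1 for s in scores if s > gold)

def hits_at_k(ranks, k):
    """Fraction of test pairs whose gold target is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all test pairs."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Toy example: gold ranks for four hypothetical test pairs.
ranks = [1, 1, 2, 7]
hit1 = hits_at_k(ranks, 1)          # 0.5
hit5 = hits_at_k(ranks, 5)          # 0.75
mrr = mean_reciprocal_rank(ranks)   # roughly 0.661
```

Because every pair ranked first contributes exactly 1 to the reciprocal-rank sum, MRR can never fall below Hit@1 on the same test set, which is a handy sanity check when reading result tables.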
- Adding the visual modality improves embedding-based EA models, with an average Hit@1 gain of around 12%.
- LODEME achieves state-of-the-art results on Multi-OpenEA benchmarks with Hit@1 over 95% (D-W and EN-FR variants shown).
- The image-extended embedding-based models (suffix -V) show notable performance gains across datasets; the highest average improvements are observed for BootEA-V, MultiKE-V, and IMUSE-V.
- Ablation shows structure information remains the most important modality; visual data provides stronger gains for sparser KGs, and removing all images degrades performance more than removing names or relations/attributes.
- The structure-aware attention over multiple images outperforms mean-pooling and single best-image strategies, highlighting effective multi-image utilization.
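To make the contrast between attention and mean-pooling concrete, here is a minimal sketch of one way a structure embedding could weight several image embeddings for the same entity. The summary does not give the actual formulation, so the scaled dot-product scoring, the shared embedding dimension, and all names below are assumptions for illustration only.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_pool(image_vecs):
    """Baseline: every image of the entity gets the same weight."""
    n = len(image_vecs)
    return [sum(col) / n for col in zip(*image_vecs)]

def attention_pool(structure_vec, image_vecs):
    """Assumed scheme: the structure embedding acts as the query, and each
    image embedding is weighted by its scaled dot-product score."""
    scale = math.sqrt(len(structure_vec))
    weights = softmax([dot(structure_vec, v) / scale for v in image_vecs])
    pooled = [sum(w * v[d] for w, v in zip(weights, image_vecs))
              for d in range(len(structure_vec))]
    return pooled, weights

# Toy entity with two images; the first agrees with the structure embedding.
structure_vec = [2.0, 0.0]
image_vecs = [[1.0, 0.0], [0.0, 1.0]]
pooled, weights = attention_pool(structure_vec, image_vecs)
baseline = mean_pool(image_vecs)  # [0.5, 0.5]
```

In this toy case the image that agrees with the structure embedding receives the larger weight, so the pooled vector leans toward it instead of averaging both images equally.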
This digest was generated by AI and reviewed by human editors.