QUICK REVIEW

[Paper Review] Vision, Deduction and Alignment: An Empirical Study on Multi-modal Knowledge Graph Alignment

Yangning Li, Jiaoyan Chen|arXiv (Cornell University)|Feb 17, 2023
Advanced Graph Neural Networks | Cited by 7
One-Sentence Summary

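The paper constructs Multi-OpenEA, eight large-scale entity alignment benchmarks that include image information, analyzes how visual signals affect embedding-based models, and proposes LODEME, a self-supervised, multi-modal alignment method that achieves state-of-the-art results over existing baselines.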

ABSTRACT

Entity alignment (EA) for knowledge graphs (KGs) plays a critical role in knowledge engineering. Existing EA methods mostly focus on utilizing the graph structures and entity attributes (including literals), but ignore images that are common in modern multi-modal KGs. In this study we first constructed Multi-OpenEA -- eight large-scale, image-equipped EA benchmarks, and then evaluated some existing embedding-based methods for utilizing images. In view of the complementary nature of visual modal information and logical deduction, we further developed a new multi-modal EA method named LODEME using logical deduction and multi-modal KG embedding, with state-of-the-art performance achieved on Multi-OpenEA and other existing multi-modal EA benchmarks.

Motivation and Objectives

  • Motivate the use of visual modality to supplement structure and literals in entity alignment (EA).
  • Create large-scale, image-equipped EA benchmarks to reflect real-world multi-modal KGs.
  • Evaluate existing embedding-based EA models with visual extensions on Multi-OpenEA.
  • Propose a self-supervised, multi-modal EA method (LODEME) that combines logical deduction with multi-modal embeddings.

Proposed Method

  • Extend four embedding-based EA models (BootEA, MultiKE, RDGCN, IMUSE) with image modalities (suffix -V).
  • Construct the Multi-OpenEA benchmarks by adding multiple images per entity to the eight OpenEA benchmark datasets.
  • Develop LODEME with a probabilistic reasoning (PR) module inspired by PARIS and a multi-modal semantic embedding (SE) module with structure-aware image attention.
  • In SE, encode structure (GCN), relations/attributes, name embeddings (M-BERT), and image embeddings (CLIP) with a weighted fusion of modalities.
  • Train with a margin-based alignment loss and use hard negative sampling; inference via greedy search and CSLS.
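The inference step named in the last bullet (greedy search with CSLS) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`csls_matrix`, `greedy_align`), the neighbourhood size `k=2`, and the toy embeddings are all assumptions made for the example. CSLS rescales cosine similarity by subtracting each entity's average similarity to its nearest cross-graph neighbours, which penalizes "hub" entities that are close to everything.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_top_k(values, k):
    """Average of the k largest values (the CSLS neighbourhood term)."""
    top = sorted(values, reverse=True)[:k]
    return sum(top) / len(top)

def csls_matrix(source, target, k=2):
    """CSLS(s, t) = 2 * cos(s, t) - r_T(s) - r_S(t)."""
    sim = [[cosine(s, t) for t in target] for s in source]
    # r_T(s): mean similarity of each source entity to its k nearest targets
    r_src = [mean_top_k(row, k) for row in sim]
    # r_S(t): mean similarity of each target entity to its k nearest sources
    r_tgt = [mean_top_k([sim[i][j] for i in range(len(source))], k)
             for j in range(len(target))]
    return [[2 * sim[i][j] - r_src[i] - r_tgt[j]
             for j in range(len(target))]
            for i in range(len(source))]

def greedy_align(scores):
    """Greedy search: each source entity takes its highest-scoring target."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]

# Toy 2-d entity embeddings, made up for illustration only.
source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
target = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.8]]

alignment = greedy_align(csls_matrix(source, target, k=2))
print(alignment)  # index of the predicted target for each source entity
```

In practice the embeddings would be the fused multi-modal entity vectors, and the similarity matrix would typically be computed with a vectorized library rather than nested lists.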

Experimental Results

Research Questions

  • RQ1: Do visual modalities improve embedding-based EA methods on large-scale, image-equipped KGs?
  • RQ2: How does integrating logical deduction with multi-modal embeddings affect EA performance?
  • RQ3: What is the impact of different modalities (structure, literals, names, images) and the number of images on EA accuracy?
  • RQ4: How does LODEME compare to existing multi-modal EA methods on diverse benchmarks?
  • RQ5: Can a structure-aware attention mechanism effectively utilize multiple images per entity?

Key Findings

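| Model          | Hit@1 (15K-V1) | Hit@5 (15K-V1) | MRR (15K-V1) | Hit@1 (15K-V2) | Hit@5 (15K-V2) | MRR (15K-V2) | Hit@1 (100K-V1) | Hit@5 (100K-V1) | MRR (100K-V1) | Hit@1 (100K-V2) | Hit@5 (100K-V2) | MRR (100K-V2) |
|----------------|----------------|----------------|--------------|----------------|----------------|--------------|-----------------|-----------------|---------------|-----------------|-----------------|---------------|
| BootEA         | 0.618          | 0.795          | 0.697        | 0.488          | 0.704          | 0.584        | 0.516           | 0.685           | 0.594         | 0.766           | 0.892           | 0.822         |
| BootEA-V       | 0.730          | 0.901          | 0.805        | 0.728          | 0.926          | 0.814        | 0.643           | 0.837           | 0.730         | 0.830           | 0.937           | 0.866         |
| MultiKE        | 0.426          | 0.513          | 0.471        | 0.561          | 0.723          | 0.636        | 0.291           | 0.352           | 0.324         | 0.327           | 0.410           | 0.371         |
| MultiKE-V      | 0.737          | 0.771          | 0.754        | 0.727          | 0.765          | 0.746        | 0.743           | 0.766           | 0.755         | 0.687           | 0.727           | 0.707         |
| RDGCN          | 0.561          | 0.714          | 0.722        | 0.640          | 0.777          | 0.702        | 0.362           | 0.485           | 0.420         | 0.421           | 0.528           | 0.473         |
| RDGCN-V        | 0.683          | 0.800          | 0.736        | 0.686          | 0.817          | 0.744        | 0.537           | 0.656           | 0.592         | 0.489           | 0.704           | 0.584         |
| IMUSE          | 0.327          | 0.523          | 0.419        | 0.581          | 0.778          | 0.671        | 0.276           | 0.437           | 0.355         | 0.431           | 0.631           | 0.525         |
| IMUSE-V        | 0.404          | 0.593          | 0.492        | 0.606          | 0.806          | 0.696        | 0.351           | 0.521           | 0.432         | 0.494           | 0.701           | 0.590         |
| PARIS          | 0.734          | -              | -            | 0.840          | -              | -            | 0.667           | -               | -             | 0.795           | -               | -             |
| MSNEA          | 0.962          | 0.988          | 0.973        | 0.971          | 0.974          | 0.989        | 0.946           | 0.957           | 0.952         | 0.982           | 0.988           | 0.989         |
| EVA            | 0.971          | 0.989          | 0.978        | 0.990          | 0.998          | 0.994        | 0.968           | 0.989           | 0.976         | 0.991           | 0.998           | 0.994         |
| LODEME (EN-FR) | 0.989          | 0.997          | 0.992        | 0.997          | 1.000          | 0.998        | 0.966           | 0.983           | 0.972         | 0.978           | 0.996           | 0.985         |
| LODEME (D-W)   | 0.991          | 0.998          | 0.994        | 0.996          | 1.000          | 0.998        | 0.973           | 0.992           | 0.973         | 0.994           | 0.999           | 0.996         |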
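
For readers unfamiliar with the column headers, the sketch below shows how Hit@k and MRR are conventionally computed from a matrix of alignment scores. It is an illustrative example in plain Python with made-up scores; the helper names (`ranks_of_gold`, `hit_at_k`, `mrr`) are hypothetical, and the paper's own evaluation code may differ in details such as tie handling.

```python
def ranks_of_gold(score_rows, gold):
    """1-based rank of the gold target for each source entity.

    A higher score means a better match; ties are resolved in favour
    of the gold target (an assumption made for this sketch).
    """
    ranks = []
    for row, g in zip(score_rows, gold):
        better = sum(1 for s in row if s > row[g])
        ranks.append(better + 1)
    return ranks

def hit_at_k(ranks, k):
    """Fraction of source entities whose gold target is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mrr(ranks):
    """Mean reciprocal rank of the gold targets."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Made-up similarity scores: row i holds the scores of source entity i
# against four candidate targets; gold[i] is the index of the true match.
scores = [
    [0.9, 0.2, 0.1, 0.0],  # gold target ranked 1st
    [0.3, 0.5, 0.8, 0.1],  # gold target ranked 2nd
    [0.6, 0.7, 0.2, 0.9],  # gold target ranked 4th
    [0.1, 0.2, 0.3, 0.4],  # gold target ranked 1st
]
gold = [0, 1, 2, 3]

ranks = ranks_of_gold(scores, gold)
print(ranks)               # [1, 2, 4, 1]
print(hit_at_k(ranks, 1))  # 0.5
print(hit_at_k(ranks, 2))  # 0.75
print(mrr(ranks))          # 0.6875
```

Hit@1 is the strictest of the three metrics, since it counts only exact top-ranked matches, while MRR also gives partial credit to near misses.
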
  • Adding the visual modality improves embedding-based EA models, with an average Hit@1 gain of around 12%.
  • LODEME achieves state-of-the-art results on Multi-OpenEA benchmarks with Hit@1 over 95% (D-W and EN-FR variants shown).
  • Modified embedding-based models with images (suffix -V) show notable performance gains across datasets; highest average improvement observed for BootEA-V, MultiKE-V, and IMUSE-V.
  • Ablation shows structure information remains the most important modality; visual data provides stronger gains for sparser KGs, and removing all images degrades performance more than removing names or relations/attributes.
  • The structure-aware attention over multiple images outperforms mean-pooling and single best-image strategies, highlighting effective multi-image utilization.
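
The contrast in the last finding, attention over multiple images versus mean-pooling, can be illustrated with a small sketch. The summary does not spell out the paper's attention formulation, so the version below is an assumption: it scores each image embedding by its dot product with the entity's structure embedding, normalizes the scores with a softmax, and takes the weighted sum. It also assumes that image and structure embeddings have already been projected into the same dimension. All names and vectors are hypothetical.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool(images):
    """Baseline: every image contributes equally."""
    n = len(images)
    return [sum(col) / n for col in zip(*images)]

def attention_pool(query, images):
    """Weight each image by its softmax-normalized similarity to the query.

    Here `query` stands in for the entity's structure embedding
    (an assumption made for this sketch, not the paper's exact design).
    """
    weights = softmax([dot(query, img) for img in images])
    pooled = [sum(w * img[d] for w, img in zip(weights, images))
              for d in range(len(query))]
    return pooled, weights

# Toy example: one image agrees with the structure embedding, two do not.
structure_vec = [1.0, 0.0]
image_vecs = [[2.0, 0.0], [0.0, 2.0], [0.0, 2.0]]

pooled, weights = attention_pool(structure_vec, image_vecs)
print(weights)                # the first image receives the largest weight
print(pooled)                 # pulled toward the first image
print(mean_pool(image_vecs))  # dominated by the two off-topic images
```

In this toy case mean-pooling lets the two unrelated images outvote the relevant one, whereas the attention weights let the structure embedding select the image that agrees with it.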


This review was generated by AI and reviewed by human editors.