QUICK REVIEW

[Paper Review] Survey on English Entity Linking on Wikidata

Cedric Möller, Jens Lehmann | arXiv (Cornell University) | Dec 3, 2021
Topic Modeling | 3 citations
TL;DR

This survey analyzes English Entity Linking (EL) on Wikidata, evaluating existing datasets, approaches, and Wikidata-specific characteristics. It finds that most EL methods treat Wikidata like any other knowledge graph and underuse its multilingualism, frequent updates, and hyper-relational structure, leaving room for improvement through hyper-relational graph embeddings and type information.

ABSTRACT

Wikidata is a frequently updated, community-driven, and multilingual knowledge graph. Hence, Wikidata is an attractive basis for Entity Linking, which is evident by the recent increase in published papers. This survey focuses on four subjects: (1) Which Wikidata Entity Linking datasets exist, how widely used are they and how are they constructed? (2) Do the characteristics of Wikidata matter for the design of Entity Linking datasets and if so, how? (3) How do current Entity Linking approaches exploit the specific characteristics of Wikidata? (4) Which Wikidata characteristics are unexploited by existing Entity Linking approaches? This survey reveals that current Wikidata-specific Entity Linking datasets do not differ in their annotation scheme from schemes for other knowledge graphs like DBpedia. Thus, the potential for multilingual and time-dependent datasets, naturally suited for Wikidata, is not lifted. Furthermore, we show that most Entity Linking approaches use Wikidata in the same way as any other knowledge graph missing the chance to leverage Wikidata-specific characteristics to increase quality. Almost all approaches employ specific properties like labels and sometimes descriptions but ignore characteristics such as the hyper-relational structure. Hence, there is still room for improvement, for example, by including hyper-relational graph embeddings or type information. Many approaches also include information from Wikipedia, which is easily combinable with Wikidata and provides valuable textual information, which Wikidata lacks.
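
To make the abstract's notion of a hyper-relational structure concrete: a Wikidata statement is a subject-property-value triple that can carry qualifiers, such as a start and an end time. The following is a minimal sketch, not taken from the survey, of how such qualifiers could narrow entity candidates by document date. The `Statement` class and both helper functions are hypothetical names, years are simplified to integers, and the identifiers (Q76, Q207, Q11696, P39, P580, P582) serve only as illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """A triple plus qualifiers, mimicking one hyper-relational statement."""
    subject: str
    prop: str
    value: str
    qualifiers: dict = field(default_factory=dict)


# Illustrative data (assumption): "position held" (P39) with the value
# "President of the United States" (Q11696), qualified by start time (P580)
# and end time (P582), both simplified to plain years.
STATEMENTS = [
    Statement("Q207", "P39", "Q11696", {"P580": 2001, "P582": 2009}),
    Statement("Q76", "P39", "Q11696", {"P580": 2009, "P582": 2017}),
]


def holds_in_year(stmt, year):
    """True if the year falls inside the statement's qualifier interval."""
    start = stmt.qualifiers.get("P580")
    end = stmt.qualifiers.get("P582")
    return (start is None or start <= year) and (end is None or year <= end)


def candidates_for(statements, prop, value, year):
    """Subjects whose matching statement is valid in the given year."""
    return [
        s.subject
        for s in statements
        if s.prop == prop and s.value == value and holds_in_year(s, year)
    ]


# Candidates for a mention like "the president" in a document dated 2012.
print(candidates_for(STATEMENTS, "P39", "Q11696", 2012))
```

A view that keeps only the plain triples would return both entities here; the qualifiers are what reduce the candidates for a 2012 document to one.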

Motivation & Objective

  • To analyze the landscape of Wikidata-specific Entity Linking datasets and their construction.
  • To assess how Wikidata's unique characteristics influence EL dataset design.
  • To investigate how current EL approaches exploit Wikidata-specific features such as multilingualism and hyper-relational structure.
  • To identify underutilized characteristics of Wikidata in existing EL approaches.
  • To guide future research by exposing gaps in dataset design and model utilization of Wikidata's full potential.

Proposed method

  • Systematic survey of 42 Wikidata-based EL papers from 2011 to 2020.
  • Categorization of datasets by annotation scheme, construction method, and language support.
  • Analysis of 12 EL approaches, focusing on their use of Wikidata properties like labels, descriptions, types, and graph structure.
  • Comparison of approaches using metrics such as F1, accuracy, and recall on benchmark datasets.
  • Review of the techniques the approaches rely on, including graph algorithms (HITS, PageRank), Word2Vec embeddings, and transformer-based models like RoBERTa.
  • Identification of underused features such as hyper-relational structure and time-dependent updates in current EL pipelines.
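
The metrics named in the list above are usually computed over sets of annotations. As a minimal sketch, assuming micro-averaging and exact matching of span and entity ID (the survey's own evaluation protocol may differ), precision, recall, and F1 could be computed as follows. The function name `micro_prf` and the tuple layout `(doc_id, start, end, qid)` are assumptions made for this example, and the sample annotations are invented.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over annotation sets.

    Each annotation is a (doc_id, start, end, qid) tuple; a prediction
    counts as correct only if all four fields match a gold annotation.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented sample: four gold annotations, two predictions, both correct.
gold = {
    ("doc1", 0, 5, "Q90"),
    ("doc1", 10, 16, "Q142"),
    ("doc2", 3, 8, "Q76"),
    ("doc2", 20, 27, "Q11696"),
}
predicted = {
    ("doc1", 0, 5, "Q90"),
    ("doc2", 3, 8, "Q76"),
}

precision, recall, f1 = micro_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Reported scores are only comparable across approaches when the matching rule (exact or partial spans) and the averaging scheme are the same.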

Experimental results

Research questions

  • RQ1: What Wikidata-specific Entity Linking datasets exist, and how are they constructed?
  • RQ2: How do the unique characteristics of Wikidata, such as multilingualism and temporal updates, affect EL dataset design?
  • RQ3: To what extent do current EL approaches exploit Wikidata-specific features like hyper-relational structure and type information?
  • RQ4: Which Wikidata characteristics remain underutilized in existing EL approaches?
  • RQ5: How do EL models that combine Wikidata with Wikipedia data improve performance?

Key findings

  • Most Wikidata-based EL datasets use the same annotation scheme as datasets for other knowledge graphs such as DBpedia, failing to leverage multilingual or time-dependent features.
  • Only 30% of EL approaches utilize Wikidata’s hyper-relational structure, despite its potential to improve disambiguation.
  • Approaches using PageRank or HITS for candidate ranking show improved performance, but few exploit graph structure beyond basic connectivity.
  • Multilingual models like Botha et al. [15] achieve F1 of 0.91, demonstrating strong performance when Wikidata’s multilingual nature is leveraged.
  • Models combining Wikidata with Wikipedia text (e.g., DoSeR) achieve higher accuracy by enriching entity descriptions.
  • Despite the widespread use of Wikidata, only 15% of approaches use type information, and hyper-relational graph embeddings remain largely unexplored in EL.
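
To illustrate the graph-based candidate ranking mentioned in the findings above, here is a minimal sketch of plain PageRank by power iteration over a small entity graph. It is not the method of any surveyed approach. The graph, the node names (`Paris_city`, `Paris_Hilton`, and so on), and the helper `undirected` are all invented for the example, and the damping factor 0.85 is the conventional default rather than a value from the survey.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain PageRank by power iteration over directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


def undirected(pairs):
    """Expand each undirected pair into two directed edges."""
    return [(a, b) for a, b in pairs] + [(b, a) for a, b in pairs]


# Invented toy graph: two candidates for the mention "Paris" and the
# entities they are connected to.
EDGES = undirected([
    ("Paris_city", "France"),
    ("Paris_city", "Eiffel_Tower"),
    ("Paris_city", "Seine"),
    ("Seine", "France"),
    ("Paris_Hilton", "United_States"),
])

ranks = pagerank(EDGES)
candidates = ["Paris_city", "Paris_Hilton"]
best = max(candidates, key=lambda c: ranks[c])
print(best, round(ranks[best], 3))
```

In this toy graph the better-connected candidate wins. That is exactly the "basic connectivity" signal the finding refers to: the sketch ignores edge types, qualifiers, and the mention's textual context.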

This review was created by AI and reviewed by human editors.