QUICK REVIEW

[Paper Review] Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia

Ikuya Yamada, Akari Asai | arXiv (Cornell University) | December 15, 2018
Topic Modeling | References: 37 | Citations: 41
One-Line Summary

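Wikipedia2Vec is an open-source Python tool that jointly learns word and entity embeddings from Wikipedia; it achieves state-of-the-art results on entity relatedness, performs competitively on standard word embedding benchmarks, and ships with an interactive web demo and pretrained multilingual embeddings.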

ABSTRACT

The embeddings of entities in a large knowledge base (e.g., Wikipedia) are highly beneficial for solving various natural language tasks that involve real world knowledge. In this paper, we present Wikipedia2Vec, a Python-based open-source tool for learning the embeddings of words and entities from Wikipedia. The proposed tool enables users to learn the embeddings efficiently by issuing a single command with a Wikipedia dump file as an argument. We also introduce a web-based demonstration of our tool that allows users to visualize and explore the learned embeddings. In our experiments, our tool achieved a state-of-the-art result on the KORE entity relatedness dataset, and competitive results on various standard benchmark datasets. Furthermore, our tool has been used as a key component in various recent studies. We publicize the source code, demonstration, and the pretrained embeddings for 12 languages at https://wikipedia2vec.github.io.

Research Motivation and Goals

  • Provide a scalable method to learn joint word and entity embeddings from Wikipedia.
  • Improve embedding quality by combining word, anchor context, and link-graph signals.
  • Offer a fast, easy-to-use training workflow with single-command operation.
  • Deliver visual and interactive tools for exploring learned embeddings.
  • Release pretrained multilingual embeddings and open-source code for community use.

Proposed Method

  • Jointly optimize three skip-gram-based sub-models: word-based skip-gram, anchor context, and link graph models.
  • Represent words and entities in a shared d-dimensional vector space using two embedding matrices V and U.
  • Use negative sampling to approximate the softmax in the objective and train via stochastic gradient descent.
  • Automatically generate hyperlinks with a mention-entity dictionary to enrich the anchor context.
  • Efficient data structures: CSR sparse matrix for the link graph and Aho–Corasick for mention detection.
  • Provide a web-based demonstration using dimensionality reduction (t-SNE, UMAP, PCA) to visualize embeddings.
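
The negative-sampling objective mentioned above can be illustrated with a minimal sketch. It is not the toolkit's implementation: it assumes a toy vocabulary in which words and entities share one index space, two small matrices `V` (target vectors) and `U` (context vectors), a hand-picked negative sample, and a hypothetical `ENTITY/` naming prefix. The single training pair (an entity predicting a linked entity) loosely mirrors the link graph sub-model; the other two sub-models would feed different pairs into the same update.

```python
import math
import random


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def neg_sampling_loss(v_target, u_context, u_negatives):
    """-log s(u_c . v_t) - sum_n log s(-u_n . v_t), with s the sigmoid."""
    loss = -math.log(sigmoid(dot(u_context, v_target)))
    for u_neg in u_negatives:
        loss -= math.log(sigmoid(-dot(u_neg, v_target)))
    return loss


def sgd_step(V, U, target, context, negatives, lr=0.1):
    """One in-place SGD update for a (target, context) pair plus negatives."""
    v_t = V[target]
    grad_v = [0.0] * len(v_t)
    # Label 1 for the observed context item, 0 for each sampled negative.
    for item, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
        u = U[item]
        g = sigmoid(dot(u, v_t)) - label  # gradient w.r.t. the score u . v_t
        for k in range(len(v_t)):
            grad_v[k] += g * u[k]      # accumulate before u is modified
            u[k] -= lr * g * v_t[k]    # update the context-side vector
    for k in range(len(v_t)):
        v_t[k] -= lr * grad_v[k]       # update the target-side vector last


# Toy shared vocabulary of words and entities (names are illustrative only).
random.seed(0)
vocab = ["tokyo", "capital", "ENTITY/Tokyo", "ENTITY/Japan", "banana"]
dim = 8
V = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
U = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]

# Positive pair: ENTITY/Tokyo -> ENTITY/Japan; negative sample: banana.
target, context, negatives = 2, 3, [4]

loss_before = neg_sampling_loss(V[target], U[context], [U[n] for n in negatives])
for _ in range(50):
    sgd_step(V, U, target, context, negatives)
loss_after = neg_sampling_loss(V[target], U[context], [U[n] for n in negatives])
```

In a real training run the negatives would be drawn from a noise distribution over the vocabulary rather than fixed by hand, and the learning rate and dimensionality here are arbitrary choices for the sketch.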

Experimental Results

Research Questions

  • RQ1: Can jointly learned word and entity embeddings from Wikipedia outperform baselines on entity-relatedness and word-embedding benchmarks?
  • RQ2: How does incorporating anchor context and link graph signals affect embedding quality compared to word-only models?
  • RQ3: Is the training process efficient enough to compete with established word-embedding tools like gensim and fastText?
  • RQ4: Do automatically generated hyperlinks contribute to embedding quality in practice?
  • RQ5: Can the embeddings be effectively visualized and explored via an interactive web demo?

Key Results

  • Achieved state-of-the-art results on the KORE entity relatedness dataset (Table 1).
  • Outperformed RDF2Vec and Wiki2vec baselines on entity embeddings and showed competitive word embedding performance (Table 2).
  • Link graph and anchor context signals improve KORE performance, while hyperlink generation provides mixed or limited benefits for word tasks.
  • Training the word-based skip-gram model alone is faster than gensim and fastText, and training the full model takes time comparable to those baselines.
  • Provided pretrained embeddings for 12 languages and released open-source code and demonstration tools.
  • Web demo enables 2D/3D visualization and similarity querying of words and entities.
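
The similarity querying listed above can be pictured as a nearest-neighbor search over the shared vector space. The sketch below assumes cosine similarity as the ranking measure and uses hand-made three-dimensional toy vectors; the `most_similar` helper, the `ENTITY/` prefix, and all values are hypothetical and are not taken from the pretrained embeddings or the demo's code.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, top_k=3):
    """Return the top_k (name, similarity) pairs closest to `query`."""
    q = embeddings[query]
    scored = [
        (name, cosine(q, vec))
        for name, vec in embeddings.items()
        if name != query  # skip the query item itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Words and entities live in the same space, so one query ranks both kinds.
toy_embeddings = {
    "tokyo": [0.9, 0.1, 0.0],
    "ENTITY/Tokyo": [0.8, 0.2, 0.1],
    "ENTITY/Japan": [0.6, 0.5, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

neighbors = most_similar("tokyo", toy_embeddings, top_k=2)
for name, score in neighbors:
    print(f"{name}\t{score:.3f}")
```

Because words and entities share one space, a word query can return entity neighbors and vice versa, which is what makes a joint demo of this kind possible.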

This review was generated by AI and reviewed by a human editor.