[Paper Review] Meta-Path Guided Embedding for Similarity Search in Large-Scale Heterogeneous Information Networks
ESim learns vertex embeddings guided by user-specified meta-paths to enable efficient similarity search in large heterogeneous information networks (HINs), using a scalable sampling-based optimization framework. In experiments on real-world HINs (DBLP and Yelp), it outperforms several state-of-the-art methods and scales to large networks.
Most real-world data can be modeled as heterogeneous information networks (HINs) consisting of vertices of multiple types and their relationships. Searching for similar vertices of the same type in large HINs, such as bibliographic networks and business-review networks, is a fundamental problem with broad applications. Although similarity search in HINs has been studied previously, most existing approaches neither explore the rich semantic information embedded in the network structures nor take user preferences as guidance. In this paper, we re-examine similarity search in HINs and propose a novel embedding-based framework. It models vertices as low-dimensional vectors to explore network structure-embedded similarity. To accommodate user preferences in defining similarity semantics, our proposed framework, ESim, accepts user-defined meta-paths as guidance to learn vertex vectors in a user-preferred embedding space. Moreover, an efficient and parallel sampling-based optimization algorithm has been developed to learn embeddings in large-scale HINs. Extensive experiments on real-world large-scale HINs demonstrate a significant improvement in the effectiveness of ESim over several state-of-the-art algorithms, as well as its scalability.
Motivation & Objective
- Address similarity search in heterogeneous information networks (HINs) by capturing rich semantics via user-guided meta-paths.
- Propose an embedding-based framework that represents vertices as low-dimensional vectors aligned with meta-path semantics.
- Develop a scalable, sampling-based optimization algorithm to train embeddings on large-scale HINs.
- Enable online similarity queries using cosine similarity on learned embeddings.
- Compare ESim against state-of-the-art methods and demonstrate scalability and effectiveness on real-world HINs.
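The online-query objective above can be illustrated with a small sketch in plain Python. The function names `normalize` and `top_k_similar` and the toy vectors are assumptions made for illustration and are not taken from the paper. The sketch scans every vertex exhaustively; for a large HIN an approximate nearest neighbor index would likely take the place of this scan.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k_similar(query, emb, k):
    """Return the k vertices most similar to `query` by cosine similarity.

    `emb` maps vertex id -> embedding vector (hypothetical toy format).
    """
    unit = {v: normalize(x) for v, x in emb.items()}
    q = unit[query]
    # On unit vectors, the dot product equals the cosine similarity.
    sims = (
        (sum(a * b for a, b in zip(q, x)), v)
        for v, x in unit.items()
        if v != query
    )
    return [(v, s) for s, v in heapq.nlargest(k, sims)]


# Toy embeddings, invented for this example only.
toy_emb = {"a": [1.0, 0.0], "b": [2.0, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
for vertex, sim in top_k_similar("a", toy_emb, 2):
    print(vertex, round(sim, 4))
```

Normalizing once up front means each query costs only dot products, which is the property that makes cosine similarity convenient for online search.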
Proposed method
- Introduce a probabilistic embedding model that preserves HIN structure by maximizing co-occurrence in path instances following a user-specified meta-path M.
- Use a scoring function f(u,v,M) = μ_M + p_M^T x_u + q_M^T x_v + x_u^T x_v to encode meta-path semantics and compute Pr(v|u,M) via a softmax over f(u,v,M).
- Adopt Noise-Contrastive Estimation (NCE) to train embeddings efficiently by distinguishing observed path instances from noise samples.
- Explore two path-definition options: sequential (seq) and pairwise (pair), with pairwise found to be more effective.
- Perform online training with stochastic gradient descent and parallelization (Hogwild) for scalability; use cosine similarity on normalized embeddings for online queries.
- Develop a dynamic programming based pre-computation of C(u,i|M) to enable constant-time online sampling of path instances following M.
- Optionally support a weighted combination of multiple meta-paths by summing their respective loss functions with weights.
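The scoring function and the two probability views listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`score`, `softmax_prob`, `nce_loss`) are hypothetical, the softmax is taken over every vertex in a toy dictionary, and the loss uses a simplified negative-sampling form of NCE (observed pair labeled 1, noise pairs labeled 0) that may differ in detail from the paper's exact objective.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def score(x_u, x_v, p_m, q_m, mu_m):
    """f(u, v, M) = mu_M + p_M^T x_u + q_M^T x_v + x_u^T x_v."""
    return mu_m + dot(p_m, x_u) + dot(q_m, x_v) + dot(x_u, x_v)


def softmax_prob(u, v, emb, p_m, q_m, mu_m):
    """Pr(v | u, M) as a softmax of f(u, ., M) over all vertices in `emb`."""
    scores = {w: score(emb[u], emb[w], p_m, q_m, mu_m) for w in emb}
    top = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[v] - top) / z


def log_sigmoid(t):
    """Numerically stable log(1 / (1 + exp(-t)))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def nce_loss(pos_score, neg_scores):
    """Simplified NCE-style loss (assumption): push the observed pair's
    score up and the noise pairs' scores down."""
    return -log_sigmoid(pos_score) - sum(log_sigmoid(-s) for s in neg_scores)


# Toy parameters, invented for this example only.
emb = {"u": [1.0, 0.0], "v": [0.0, 1.0], "w": [0.5, 0.5]}
p_m, q_m, mu_m = [0.5, 0.0], [0.0, 0.25], 0.1
pos = score(emb["u"], emb["v"], p_m, q_m, mu_m)
neg = [score(emb["u"], emb["w"], p_m, q_m, mu_m)]
print(softmax_prob("u", "v", emb, p_m, q_m, mu_m), nce_loss(pos, neg))
```

The full softmax touches every vertex for each update, while the contrastive loss only needs the observed pair plus a handful of noise samples, which is why the latter is attractive for large networks.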
Experimental results
Research questions
- RQ1: How can user-guided meta-paths be incorporated into an embedding framework to define semantic similarity in HINs?
- RQ2: Can a sampling-based, embedding-driven approach outperform existing meta-path based similarity measures (e.g., PathSim) and homogeneous-network embeddings on large HINs?
- RQ3: What algorithms and data structures enable scalable training and fast online similarity queries on very large HINs?
- RQ4: Does incorporating meta-path guidance improve similarity search quality across diverse real-world datasets like DBLP and Yelp?
Key findings
- The proposed ESim framework achieves significant improvements in effectiveness over several state-of-the-art methods.
- ESim scales to large-scale HINs through a novel sampling-based optimization and parallel training framework.
- Efficient pre-computation and online sampling enable constant-time path-instance sampling within each iteration.
- Cosine similarity on learned embeddings supports fast online top-k similarity queries via approximate nearest neighbor search.
- Experiments on real-world HINs (DBLP and Yelp) validate the approach and demonstrate scalability.
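The pre-computation and online sampling mentioned in the findings can be sketched with a small dynamic program. Everything here is an illustrative assumption: the indexing convention (`counts[i][u]` is the number of path instances that follow the remainder of the meta-path starting from vertex `u` at position `i`) may not match the paper's definition of C(u,i|M), and the toy author-paper graph is invented. Sampling below uses `random.choices`, which is linear in the number of candidates; constant-time draws would need an extra structure such as alias tables, which is not shown.

```python
import random


def path_counts(adj, vtype, meta_path):
    """counts[i][u]: number of path instances following meta_path[i:]
    that start at vertex u (u must have type meta_path[i])."""
    length = len(meta_path)
    counts = [dict() for _ in range(length)]
    for u, t in vtype.items():
        if t == meta_path[-1]:
            counts[length - 1][u] = 1
    # Fill positions from the end of the meta-path back to the start.
    for i in range(length - 2, -1, -1):
        for u, t in vtype.items():
            if t != meta_path[i]:
                continue
            total = sum(counts[i + 1].get(v, 0) for v in adj.get(u, ()))
            if total > 0:
                counts[i][u] = total
    return counts


def sample_path(adj, counts, rng=random):
    """Draw one path instance uniformly at random among all instances."""
    starts = list(counts[0])
    if not starts:
        raise ValueError("no path instance follows this meta-path")
    path = rng.choices(starts, weights=[counts[0][s] for s in starts])[0:1]
    for i in range(1, len(counts)):
        cands = [v for v in adj[path[-1]] if v in counts[i]]
        path += rng.choices(cands, weights=[counts[i][v] for v in cands])
    return path


# Toy bibliographic HIN (hypothetical): authors a*, papers p*.
vtype = {"a1": "A", "a2": "A", "a3": "A", "p1": "P", "p2": "P"}
adj = {
    "a1": ["p1"], "a2": ["p1", "p2"], "a3": ["p2"],
    "p1": ["a1", "a2"], "p2": ["a2", "a3"],
}
counts = path_counts(adj, vtype, ["A", "P", "A"])
print(counts[0])
print(sample_path(adj, counts, random.Random(0)))
```

Weighting each step by the remaining path count makes every complete path instance equally likely, so training samples are not biased toward low-degree start vertices.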
This review was created by AI and reviewed by human editors.