[Paper Review] graph2vec: Learning Distributed Representations of Graphs
graph2vec learns unsupervised, data-driven embeddings of entire graphs by treating rooted subgraphs as the words of a document, enabling graph classification and clustering with accuracy competitive with graph kernels.
Recent works on representation learning for graph-structured data predominantly focus on learning distributed representations of graph substructures such as nodes and subgraphs. However, many graph analytics tasks such as graph classification and clustering require representing entire graphs as fixed-length feature vectors. While the aforementioned approaches are naturally unequipped to learn such representations, graph kernels remain the most effective way of obtaining them. However, these graph kernels use handcrafted features (e.g., shortest paths, graphlets, etc.) and hence are hampered by problems such as poor generalization. To address this limitation, in this work, we propose a neural embedding framework named graph2vec to learn data-driven distributed representations of arbitrary sized graphs. graph2vec's embeddings are learnt in an unsupervised manner and are task agnostic. Hence, they could be used for any downstream task such as graph classification, clustering and even seeding supervised representation learning approaches. Our experiments on several benchmark and large real-world datasets show that graph2vec achieves significant improvements in classification and clustering accuracies over substructure representation learning approaches and is competitive with state-of-the-art graph kernels.
Motivation & Objective
- Motivate learning fixed-length embeddings for entire graphs to enable downstream ML tasks like classification and clustering.
- Address limitations of handcrafted graph kernels and substructure embeddings by proposing a data-driven, unsupervised, task-agnostic approach.
- Leverage ideas from document embeddings to model graphs as documents of rooted subgraphs.
- Demonstrate effectiveness on benchmark datasets and large real-world data (malware graphs) across classification and clustering tasks.
Proposed method
- Represent each graph as a document whose words are the rooted subgraphs around its nodes (up to a maximum degree D, i.e., neighborhoods of up to D hops).
- Use WL relabeling to generate and label rooted subgraphs as vocabulary items.
- Train a skipgram model with negative sampling to learn graph embeddings, maximizing the probability Pr(sg|G) of each rooted subgraph sg given the graph G that contains it.
- Iteratively update graph embeddings across epochs via stochastic gradient descent.
- Compare graph2vec to node2vec, sub2vec, WL kernel, and Deep WL kernel on multiple datasets.
- Embed graphs with a fixed dimension δ using an unsupervised, task-agnostic objective.
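The steps above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation: the function names (`wl_document`, `train_graph_embeddings`), the string encoding of WL labels, the uniform negative sampling, and all hyperparameter defaults are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def wl_document(adj, labels, max_degree):
    """Return the multiset of rooted-subgraph labels (the graph's "document").

    adj: dict mapping node -> list of neighbours.
    labels: dict mapping node -> initial node label.
    A degree-d label combines the node's degree-(d-1) label with the sorted
    degree-(d-1) labels of its neighbours (WL relabeling).
    """
    current = {n: str(labels[n]) for n in adj}
    document = list(current.values())  # degree-0 rooted subgraphs
    for _ in range(max_degree):
        current = {
            n: current[n] + "(" + ",".join(sorted(current[m] for m in adj[n])) + ")"
            for n in adj
        }
        document.extend(current.values())
    return Counter(document)


def sigmoid(x):
    x = max(-30.0, min(30.0, x))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def train_graph_embeddings(documents, dim=8, epochs=50, lr=0.05, negatives=3, seed=0):
    """Skipgram with negative sampling: one vector per graph, one per subgraph label.

    Assumption: negatives are drawn uniformly from the vocabulary, a
    simplification of the frequency-based sampling used in word2vec.
    """
    rng = random.Random(seed)
    vocab = sorted(set().union(*documents))
    graph_vecs = [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)] for _ in documents]
    sub_vecs = {w: [0.0] * dim for w in vocab}
    for _ in range(epochs):
        order = list(range(len(documents)))
        rng.shuffle(order)
        for g in order:
            for word in documents[g].elements():
                targets = [(word, 1.0)]
                for _ in range(negatives):
                    neg = rng.choice(vocab)
                    if neg != word:  # never use the positive as a negative
                        targets.append((neg, 0.0))
                grad_g = [0.0] * dim
                for w, y in targets:
                    score = sum(a * b for a, b in zip(graph_vecs[g], sub_vecs[w]))
                    err = lr * (y - sigmoid(score))  # SGD step on the log-likelihood
                    for i in range(dim):
                        grad_g[i] += err * sub_vecs[w][i]
                        sub_vecs[w][i] += err * graph_vecs[g][i]
                for i in range(dim):
                    graph_vecs[g][i] += grad_g[i]
    return graph_vecs


# Usage: two toy graphs whose nodes all carry the label "C".
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
docs = [wl_document(g, {n: "C" for n in g}, max_degree=2) for g in (triangle, path)]
embeddings = train_graph_embeddings(docs, dim=4, epochs=5)
```

Because the labels depend only on neighbourhood structure and node labels, two isomorphic graphs produce identical documents in this sketch. A practical implementation would typically hash or compress the label strings, which grow quickly with the degree.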
Experimental results
Research questions
- RQ1: How does graph2vec compare to state-of-the-art substructure representation learning approaches and graph kernels for graph classification in accuracy and efficiency on benchmark datasets?
- RQ2: How does graph2vec perform on large-scale real-world graph classification tasks (e.g., malware detection) compared to existing methods?
- RQ3: How does graph2vec perform on graph clustering tasks (e.g., malware familial clustering) relative to competing approaches?
Key findings
| Dataset | node2vec | sub2vec | WL kernel | Deep WL kernel | graph2vec |
|---|---|---|---|---|---|
| MUTAG | 72.63 ± 10.20 | 61.05 ± 15.79 | 80.63 ± 3.07 | 82.95 ± 1.96 | 83.15 ± 9.25 |
| PTC | 58.85 ± 8.00 | 59.99 ± 6.38 | 56.91 ± 2.79 | 59.04 ± 1.09 | 60.17 ± 6.86 |
| PROTEINS | 57.49 ± 3.57 | 53.03 ± 5.55 | 72.92 ± 0.56 | 73.30 ± 0.82 | 73.30 ± 2.05 |
| NCI1 | 54.89 ± 1.61 | 52.84 ± 1.47 | 80.01 ± 0.50 | 80.31 ± 0.46 | 73.22 ± 1.81 |
| NCI109 | 52.68 ± 1.56 | 50.67 ± 1.50 | 80.12 ± 0.34 | 80.32 ± 0.33 | 74.26 ± 1.47 |
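- On benchmark datasets, graph2vec achieves the highest accuracy on MUTAG and PTC and ties the Deep WL kernel on PROTEINS (73.30%). The review describes its accuracy on NCI1 and NCI109 as comparable, although the table shows it trailing the WL and Deep WL kernels by roughly 6 to 7 percentage points there.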
- In large real-world malware classification, graph2vec achieves 99.03% accuracy, outperforming node2vec, sub2vec, WL kernel, and Deep WL kernel.
- sub2vec generally underperforms across datasets due to sampling limitations; node2vec struggles on larger graphs; the WL and Deep WL kernels remain strong baselines that stay close to graph2vec on MUTAG, PTC and PROTEINS.
- graph2vec provides a data-driven, structure-preserving representation that captures local and global graph similarities.
- Embeddings can be used with general classifiers (RF, NN, SVM) for graph classification and clustering, unlike some kernel-based methods.
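To make the comparison in the accuracy table easier to check, the following short Python sketch computes, for each dataset, the gap between graph2vec and the best competing method. The mean accuracies are copied from the table above; the helper name `gap_to_best_baseline` is a hypothetical name chosen for illustration.

```python
# Mean accuracies (%) copied from the benchmark table above.
results = {
    "MUTAG":    {"node2vec": 72.63, "sub2vec": 61.05, "WL kernel": 80.63, "Deep WL kernel": 82.95, "graph2vec": 83.15},
    "PTC":      {"node2vec": 58.85, "sub2vec": 59.99, "WL kernel": 56.91, "Deep WL kernel": 59.04, "graph2vec": 60.17},
    "PROTEINS": {"node2vec": 57.49, "sub2vec": 53.03, "WL kernel": 72.92, "Deep WL kernel": 73.30, "graph2vec": 73.30},
    "NCI1":     {"node2vec": 54.89, "sub2vec": 52.84, "WL kernel": 80.01, "Deep WL kernel": 80.31, "graph2vec": 73.22},
    "NCI109":   {"node2vec": 52.68, "sub2vec": 50.67, "WL kernel": 80.12, "Deep WL kernel": 80.32, "graph2vec": 74.26},
}


def gap_to_best_baseline(row, method="graph2vec"):
    """Return (best competing method, accuracy of `method` minus that method's accuracy)."""
    best_name, best_acc = max(
        ((name, acc) for name, acc in row.items() if name != method),
        key=lambda pair: pair[1],
    )
    return best_name, round(row[method] - best_acc, 2)


for dataset, row in results.items():
    baseline, gap = gap_to_best_baseline(row)
    print(f"{dataset}: graph2vec vs {baseline}: {gap:+.2f} points")
```

A positive gap means graph2vec is ahead of every baseline on that dataset. The standard deviations in the table are not used here, so small gaps should be read with that spread in mind.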
This review was created by AI and reviewed by human editors.