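[Paper Review] Watch Your Step: Learning Node Embeddings via Graph Attention
This paper introduces Graph Attention to learn a trainable context distribution for random-walk-based graph embeddings. By training the attention parameters and the embeddings end to end, it achieves state-of-the-art link prediction results.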
Graph embedding methods represent nodes in a continuous vector space, preserving information from the graph (e.g. by sampling random walks). There are many hyper-parameters to these methods (such as random walk length) which have to be manually tuned for every graph. In this paper, we replace random walk hyper-parameters with trainable parameters that we automatically learn via backpropagation. In particular, we learn a novel attention model on the power series of the transition matrix, which guides the random walk to optimize an upstream objective. Unlike previous approaches to attention models, the method that we propose utilizes attention parameters exclusively on the data (e.g. on the random walk); they are not used by the model for inference. We experiment on link prediction tasks, as we aim to produce embeddings that best preserve the graph structure, generalizing to unseen information. We improve on the state of the art on a comprehensive suite of real-world datasets including social, collaboration, and biological networks. Adding attention to random walks can reduce the error by 20% to 45% on datasets we attempted. Further, our learned attention parameters are different for every graph, and our automatically-found values agree with the optimal choice of hyper-parameter if we manually tune existing methods.
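Motivation and Objectives
- Replace the fixed hyper-parameters of graph embedding methods with trainable parameters learned via backpropagation.
- Apply an attention mechanism over the power series of the graph transition matrix to guide the random walk.
- Derive a closed-form expectation of the co-occurrence statistics so that training can be done end to end.
- Demonstrate improved link prediction performance and robustness on a variety of real-world graphs.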
Proposed Method
- Represent embeddings as g(Y) = L × R^T with Y = [L|R].
- Set f(A) to the expectation of the co-occurrence matrix D produced by random walks, i.e., E[D].
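- Introduce a context distribution Q = (Q_1, ..., Q_C) over walk lengths and express E[D] as E[D; Q] = P^(0) ∑_{k=1}^{C} Q_k T^k, where T is the transition matrix of the graph, C is the context window size, and P^(0) is the diagonal matrix of initial walk positions.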
- Parameterize Q via a Graph Attention Model as Q = softmax(q) and learn q jointly with embeddings.
- Extend to an infinite power-series attention by taking the softmax over an unbounded set of powers, i.e., E[D^{softmax[∞]}; q] = P^(0) lim_{C→∞} ∑_{k=1}^{C} softmax(q_1, ..., q_C)_k T^k.
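- Train by minimizing the Negative Log Graph Likelihood (NLGL) objective jointly over the embeddings and the attention parameters q, with a regularizer on q weighted by β. The attention parameters shape only the training data (the expected co-occurrence matrix) and are not needed at inference time.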
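The steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `walks_per_node` value used to build P^(0), the default `beta`, and the exact form of the loss (an L2 penalty on q plus an NLGL term with 1[A=0] as the negative mask) are choices made here for illustration. It only evaluates the objective; gradient-based training would need an autodiff framework.

```python
import numpy as np

def softmax(q):
    """Numerically stable softmax over the attention logits q."""
    e = np.exp(q - np.max(q))
    return e / e.sum()

def transition_matrix(A):
    """Row-normalize the adjacency matrix A into a transition matrix T."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
    return A / deg

def expected_cooccurrence(A, q, walks_per_node=80):
    """E[D; q] = P0 @ sum_{k=1..C} softmax(q)_k * T^k, with C = len(q).

    walks_per_node is an assumed constant used to build P0 = c * I.
    """
    n = A.shape[0]
    T = transition_matrix(A)
    Q = softmax(q)
    P0 = walks_per_node * np.eye(n)
    T_power = np.eye(n)
    acc = np.zeros((n, n))
    for Q_k in Q:
        T_power = T_power @ T  # now holds T^k for k = 1, 2, ..., C
        acc += Q_k * T_power
    return P0 @ acc

def nlgl_loss(A, L, R, q, beta=0.5):
    """Assumed NLGL form: beta*||q||^2 + || -E[D;q]*log s - 1[A=0]*log(1-s) ||_1,
    where s = sigmoid(L @ R.T) and * is elementwise."""
    ED = expected_cooccurrence(A, q)
    logits = L @ R.T
    log_sig = -np.logaddexp(0.0, -logits)           # log sigmoid(x)
    log_one_minus_sig = -np.logaddexp(0.0, logits)  # log(1 - sigmoid(x))
    non_edges = (A == 0).astype(float)
    per_entry = -ED * log_sig - non_edges * log_one_minus_sig
    return beta * np.sum(q ** 2) + np.abs(per_entry).sum()

# Toy usage: a 4-node graph, context size C = 3, 2-dimensional embeddings.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
q = np.zeros(3)  # uniform attention over walk lengths 1..3
rng = np.random.default_rng(0)
L = rng.normal(size=(4, 2))
R = rng.normal(size=(4, 2))
print(expected_cooccurrence(A, q).round(2))
print(nlgl_loss(A, L, R, q))
```

Because T is row-stochastic and softmax(q) sums to one, each row of the expected co-occurrence matrix sums to `walks_per_node` when the graph has no isolated nodes, which is a quick sanity check on the sketch. Only L and R are needed to score a node pair after training, consistent with q not being used at inference.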
Experimental Results
Research Questions
- RQ1: Can attention parameters learn graph-specific context distributions for random-walk-based embeddings?
- RQ2: How does the learned context distribution compare to a hand-tuned C and fixed context schemes in terms of link prediction performance?
- RQ3: Does the proposed graph attention mechanism generalize across diverse graph types (social, collaboration, biological) and remain robust to hyper-parameter choices?
Key Findings
| Dataset | Dim | Eigen Maps (uses A) | SVD (uses A) | DNGR (uses A) | node2vec C=2 (simulated D) | node2vec C=5 (simulated D) | Asymmetric Projection (simulated D) | Graph Attention (ours, NLGL) | Error Reduction |
|---|---|---|---|---|---|---|---|---|---|
| wiki-vote | 64 | 61.3 | 86.0 | 59.8 | 64.4 | 63.6 | 91.7 | 93.8±0.13 | 25.2% |
| ego-Facebook | 64 | 96.4 | 96.7 | 98.1 | 99.1 | 99.0 | 97.4 | 99.4±0.10 | 33.3% |
| ego-Facebook | 128 | 95.4 | 94.5 | 98.4 | 99.3 | 99.2 | 97.3 | 99.5±0.03 | 28.6% |
| ca-AstroPh | 64 | 82.4 | 91.1 | 93.9 | 97.4 | 96.9 | 95.7 | 97.9±0.21 | 19.2% |
| ca-AstroPh | 128 | 82.9 | 92.4 | 96.8 | 97.7 | 97.5 | 95.7 | 98.1±0.49 | 24.0% |
| ca-HepTh | 64 | 80.2 | 79.3 | 86.8 | 90.6 | 91.8 | 90.3 | 93.6±0.06 | 22.0% |
| ca-HepTh | 128 | 81.2 | 78.0 | 89.7 | 90.1 | 92.0 | 90.3 | 93.9±0.05 | 23.8% |
| PPI | 64 | 70.7 | 75.4 | 76.7 | 79.7 | 70.6 | 82.4 | 89.8±1.05 | 43.5% |
| PPI | 128 | 73.7 | 71.2 | 76.9 | 81.8 | 74.4 | 83.9 | 91.0±0.28 | 44.2% |
- The Graph Attention model improves link prediction on every dataset in the table compared to fixed-context baselines, with reported error reductions ranging from about 19% to 44%.
- Learned attention weights Q vary by dataset and often align with grid-search results over fixed context windows, indicating the model discovers appropriate short- vs. long-range dependencies per graph.
- The method remains robust to hyper-parameter choices (C and regularization β), maintaining performance across a wide range of settings.
- On node classification tasks (Cora, Citeseer) the unsupervised embeddings yield better separation than competitive baselines, even without node features during training.
- The attention parameters are learned only during training (not used for inference), enabling end-to-end optimization without inference-time complexity increases.
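The summary does not say how the error-reduction percentages in the table are computed. One plausible reading, assumed here, is the relative drop in error (100 minus the score) from the best baseline in a row to the Graph Attention score. The helper below is a hypothetical illustration of that arithmetic; it reproduces the reported values for the ca-HepTh and ego-Facebook rows at dimension 64, while some other rows differ slightly, presumably because the displayed scores are rounded.

```python
def error_reduction(baseline_score, new_score, max_score=100.0):
    """Relative reduction in error (max_score - score), as a percentage."""
    baseline_error = max_score - baseline_score
    new_error = max_score - new_score
    return 100.0 * (baseline_error - new_error) / baseline_error

# ca-HepTh, dim 64: best baseline is node2vec C=5 (91.8), ours is 93.6.
print(round(error_reduction(91.8, 93.6), 1))  # 22.0
# ego-Facebook, dim 64: best baseline is node2vec C=2 (99.1), ours is 99.4.
print(round(error_reduction(99.1, 99.4), 1))  # 33.3
```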
This review was generated by AI and checked by a human editor.