QUICK REVIEW

[Paper Review] When Do Graph Neural Networks Help with Node Classification? Investigating the Impact of Homophily Principle on Node Distinguishability

Sitao Luan, Chenqing Hua|arXiv (Cornell University)|Apr 25, 2023

Advanced Graph Neural Networks | Cited by 12

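One-Sentence Summary

This paper proposes CSBM-H to jointly study intra-class and inter-class node distinguishability (ND) under different homophily levels, introduces two ND metrics (Probabilistic Bayes Error and negative generalized Jeffreys divergence), analyzes how graph filters and degree distributions affect ND, and proposes a Classifier-based Performance Metric (CPM) that predicts when GNNs outperform graph-agnostic models, going beyond traditional homophily metrics.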

ABSTRACT

Homophily principle, i.e., nodes with the same labels are more likely to be connected, has been believed to be the main reason for the performance superiority of Graph Neural Networks (GNNs) over Neural Networks on node classification tasks. Recent research suggests that, even in the absence of homophily, the advantage of GNNs still exists as long as nodes from the same class share similar neighborhood patterns. However, this argument only considers intra-class Node Distinguishability (ND) but neglects inter-class ND, which provides incomplete understanding of homophily on GNNs. In this paper, we first demonstrate such deficiency with examples and argue that an ideal situation for ND is to have smaller intra-class ND than inter-class ND. To formulate this idea and study ND deeply, we propose Contextual Stochastic Block Model for Homophily (CSBM-H) and define two metrics, Probabilistic Bayes Error (PBE) and negative generalized Jeffreys divergence, to quantify ND. With the metrics, we visualize and analyze how graph filters, node degree distributions and class variances influence ND, and investigate the combined effect of intra- and inter-class ND. Besides, we discovered the mid-homophily pitfall, which occurs widely in graph datasets. Furthermore, we verified that, in real-world tasks, the superiority of GNNs is indeed closely related to both intra- and inter-class ND regardless of homophily levels. Grounded in this observation, we propose a new hypothesis-testing based performance metric beyond homophily, which is non-linear, feature-based and can provide a statistical threshold value for the superiority of GNNs. Experiments indicate that it is significantly more effective than the existing homophily metrics on revealing the advantage and disadvantage of graph-aware models on both synthetic and benchmark real-world datasets.

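Research Motivation and Objectives

Advance a comprehensive understanding of how homophily affects node distinguishability by considering both intra-class and inter-class distances.
Propose CSBM-H, a graph generative model that explicitly includes homophily, class variances, and node degrees, to study ND.
Define and compute ND metrics (Probabilistic Bayes Error and negative generalized Jeffreys divergence) to quantify ND under CSBM-H.
Analyze how low-pass (LP), full-pass (FP), and high-pass (HP) graph filters and degree distributions affect ND, and identify the mid-homophily pitfall.
Propose and evaluate CPM, a non-linear, feature-based performance metric, to predict when graph-aware models outperform graph-agnostic models.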

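Proposed Method

Introduce CSBM-H as a two-class contextual stochastic block model with an explicit homophily parameter h and class variances σ0^2 and σ1^2.
Derive the Bayes classifier for CSBM-H and express its decision boundary through Q(x) with parameters a, b, and c.
Use the generalized chi-squared distribution of Q(x) to define the Probabilistic Bayes Error (PBE), which quantifies ND.
Define the negative generalized Jeffreys divergence D_NGJ(CSBM-H), which decomposes ND into ENND and NVR terms.
Show, through analytical expressions and ablation studies, how LP (A_rw), full-pass, and HP (I - A_rw) filtered features affect ND.
Propose the Classifier-based Performance Metric (CPM), which uses a hypothesis-testing threshold to predict when graph-aware models outperform graph-agnostic models.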
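
The generative model and the three filters can be made concrete with a small simulation. What follows is a minimal sketch in plain Python, not the paper's implementation. It assumes 1-D features, balanced classes, a fixed number of neighbors per node drawn with replacement, and directed neighbor lists, so that A_rw reduces to averaging each node's sampled neighbors. The function names (sample_csbm_h, apply_filters, class_means) and all default parameter values are hypothetical.

```python
import random
import statistics


def sample_csbm_h(n_per_class, h, degree, mu=(-1.0, 1.0), sigma=(1.0, 2.0), seed=0):
    """Sample labels, 1-D features, and neighbor lists for a two-class toy graph.

    Assumption (not from the paper): each node draws `degree` neighbors with
    replacement, and each neighbor comes from the node's own class with
    probability h. Self-loops can occur by chance and are left in.
    """
    rng = random.Random(seed)
    labels = [0] * n_per_class + [1] * n_per_class
    by_class = {0: list(range(n_per_class)),
                1: list(range(n_per_class, 2 * n_per_class))}
    # Class-conditional Gaussian features with class-specific standard deviations.
    feats = [rng.gauss(mu[y], sigma[y]) for y in labels]
    neighbors = []
    for y in labels:
        nbrs = []
        for _ in range(degree):
            nbr_class = y if rng.random() < h else 1 - y
            nbrs.append(rng.choice(by_class[nbr_class]))
        neighbors.append(nbrs)
    return labels, feats, neighbors


def apply_filters(feats, neighbors):
    """Return LP (A_rw x), FP (x), and HP ((I - A_rw) x) filtered features."""
    lp = [statistics.fmean(feats[j] for j in nbrs) for nbrs in neighbors]
    fp = list(feats)
    hp = [x - m for x, m in zip(feats, lp)]
    return lp, fp, hp


def class_means(values, labels):
    """Mean of `values` within class 0 and within class 1."""
    return tuple(
        statistics.fmean(v for v, y in zip(values, labels) if y == c)
        for c in (0, 1)
    )


labels, feats, neighbors = sample_csbm_h(n_per_class=2000, h=0.2, degree=5)
lp, fp, hp = apply_filters(feats, neighbors)
print("FP class means:", class_means(fp, labels))
print("LP class means:", class_means(lp, labels))
print("HP class means:", class_means(hp, labels))
```

Under these assumptions the LP class means should land near h·μ0 + (1 - h)·μ1 and h·μ1 + (1 - h)·μ0, which for h = 0.2 is roughly 0.6 and -0.6. The two classes swap sides relative to the raw features but stay separated, which is one way to see why low homophily alone does not make aggregated features useless.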

Figure 1: Example of intra- and inter-class node distinguishability.

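Experimental Results

Research Questions

RQ1 How do intra-class and inter-class ND interact to determine the effectiveness of GNNs in node classification?
RQ2 How do homophily level, class variance, and node degree shape ND under graph filtering?
RQ3 Is there a threshold on a non-linear, feature-based metric (CPM) that reveals the advantage of graph-aware models better than traditional homophily metrics do?
RQ4 How do LP, FP, and HP graph filters affect ND in different homophily regimes?
RQ5 Do real-world datasets exhibit the mid-homophily pitfall, in which medium homophily degrades ND or model performance?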
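
RQ2, RQ4, and RQ5 can be probed numerically in a deliberately simplified setting. The sketch below is not the paper's PBE, which is built on a generalized chi-squared distribution. It uses the textbook Bayes error for two equal-variance 1-D Gaussians with equal priors, Φ(-gap / (2·std)). It assumes both classes share the same standard deviation sigma and degree, that each node has exactly a fraction h of same-class neighbors with independent features, and that there are no self-loops. The function name bayes_errors and the default values are hypothetical.

```python
import math
from statistics import NormalDist


def bayes_errors(h, delta=2.0, sigma=1.0, degree=5):
    """Bayes error of LP, FP, and HP features in an equal-variance toy model.

    delta is |mu0 - mu1|. Under the stated assumptions:
      LP (neighbor mean):      class-mean gap |2h - 1| * delta, std sigma / sqrt(degree)
      FP (raw feature):        class-mean gap delta,            std sigma
      HP (raw minus LP):       class-mean gap 2 (1 - h) * delta, std sigma * sqrt(1 + 1/degree)
    """
    phi = NormalDist().cdf
    gap = {
        "LP": abs(2 * h - 1) * delta,
        "FP": delta,
        "HP": 2 * (1 - h) * delta,
    }
    std = {
        "LP": sigma / math.sqrt(degree),
        "FP": sigma,
        "HP": sigma * math.sqrt(1 + 1 / degree),
    }
    # Equal priors and equal variances: error = Phi(-gap / (2 * std)).
    return {name: phi(-gap[name] / (2 * std[name])) for name in gap}


for h in [i / 10 for i in range(11)]:
    errs = bayes_errors(h)
    best = min(errs, key=errs.get)
    row = "  ".join(f"{name}={err:.3f}" for name, err in errs.items())
    print(f"h={h:.1f}  {row}  best={best}")
```

With the defaults chosen here, the LP error rises to 0.5 at h = 0.5, where the two aggregated class means coincide, and falls again toward both ends. LP gives the lowest error at very low and very high h, HP in the low-to-medium range, and FP in the medium-to-high range. The exact crossover points depend entirely on the assumed delta, sigma, and degree.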

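Key Findings

ND depends on both intra-class and inter-class distances, not on intra-class ND alone; the ideal case for node classification is intra-class ND smaller than inter-class ND.
Under CSBM-H, PBE and D_NGJ(CSBM-H) of LP-filtered features follow a bell-shaped curve with respect to homophily, which indicates a mid-homophily pitfall.
HP filtering improves ND in the heterophilic regime, LP filtering helps at both low and high homophily, and FP remains favorable at medium to high homophily.
Ablation studies suggest that a higher node degree for the high-variance class may shrink the LP and HP regimes and widen the FP regime, whereas the effect of increasing the degree of the low-variance class is more nuanced.
CPM, a hypothesis-testing-based metric, outperforms traditional homophily metrics at predicting when graph-aware methods beat graph-agnostic methods, on both synthetic and real-world datasets.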
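
The idea behind a classifier-based, hypothesis-testing metric can be illustrated with a toy comparison of raw and aggregated features. This is a minimal sketch under loud assumptions and should not be read as the paper's CPM procedure. The nearest-centroid classifier, the 60/40 split, the Welch-style statistic with a normal approximation, and the convention that a small p-value favors the graph-aware side are all choices made here for illustration. The function names and default values are hypothetical, and neighbor features are drawn directly from the class Gaussians rather than from an explicit graph.

```python
import random
from statistics import NormalDist, fmean, variance


def sample_nodes(n, h, degree, rng, mu=(-1.0, 1.0), sigma=1.0):
    """Return labels, raw features X, and mean-aggregated neighbor features H."""
    labels, raw, agg = [], [], []
    for _ in range(n):
        y = rng.randrange(2)
        # Each neighbor shares the node's class with probability h (assumption).
        nbr_classes = [y if rng.random() < h else 1 - y for _ in range(degree)]
        labels.append(y)
        raw.append(rng.gauss(mu[y], sigma))
        agg.append(fmean(rng.gauss(mu[c], sigma) for c in nbr_classes))
    return labels, raw, agg


def centroid_accuracy(feats, labels, n_train):
    """Fit class centroids on the first n_train nodes, score on the rest."""
    centroids = [
        fmean(x for x, y in zip(feats[:n_train], labels[:n_train]) if y == c)
        for c in (0, 1)
    ]
    hits = 0
    for x, y in zip(feats[n_train:], labels[n_train:]):
        pred = 0 if abs(x - centroids[0]) <= abs(x - centroids[1]) else 1
        hits += pred == y
    return hits / (len(feats) - n_train)


def graph_aware_p_value(h, degree=5, repeats=50, n=200, seed=0):
    """One-sided test of 'aggregated-feature accuracy > raw-feature accuracy'.

    A small p-value suggests the graph-aware features classify better.
    """
    rng = random.Random(seed)
    n_train = int(0.6 * n)
    acc_raw, acc_agg = [], []
    for _ in range(repeats):
        labels, raw, agg = sample_nodes(n, h, degree, rng)
        acc_raw.append(centroid_accuracy(raw, labels, n_train))
        acc_agg.append(centroid_accuracy(agg, labels, n_train))
    # Welch-style standard error; the t distribution is approximated by a normal.
    se = (variance(acc_agg) / repeats + variance(acc_raw) / repeats) ** 0.5
    z = (fmean(acc_agg) - fmean(acc_raw)) / se
    return 1.0 - NormalDist().cdf(z)


for h in (0.1, 0.5, 0.9):
    print(f"h={h}: p-value = {graph_aware_p_value(h):.3g}")
```

In this toy setup the test should favor aggregated features at both h = 0.1 and h = 0.9 and favor raw features at h = 0.5. A purely homophily-based score would rank h = 0.1 as the least graph-friendly case, which is the kind of mismatch a classifier-based metric is meant to avoid.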

This review was generated by AI and reviewed by a human editor.