[Paper Explained] Implications of Topological Imbalance for Representation Learning on Biomedical Knowledge Graphs
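This paper studies how topological imbalance in biomedical knowledge graphs, driven by highly connected "super-hub" entities, biases knowledge graph embedding (KGE) models so that they overestimate these entities in link prediction tasks. Across datasets, models, and tasks, KGE models consistently rank high-degree entities higher regardless of their biological relevance, which underscores the need for care in graph construction and model interpretation in drug discovery applications.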
Adoption of recently developed methods from machine learning has given rise to the creation of drug-discovery knowledge graphs (KGs) that utilize the interconnected nature of the domain. Graph-based modelling of the data, combined with KG embedding (KGE) methods, is promising, as it provides a more intuitive representation and is suitable for inference tasks such as predicting missing links. One common application is to produce ranked lists of genes for a given disease, where the rank is based on the perceived likelihood of association between the gene and the disease. It is thus critical that these predictions are not only pertinent but also biologically meaningful. However, KGs can be biased either directly, due to the underlying data sources that are integrated, or due to modeling choices in the construction of the graph, one consequence of which is that certain entities can become topologically overrepresented. We demonstrate the effect of these inherent structural imbalances, which result in densely connected entities being highly ranked no matter the context. We provide support for this observation across different datasets, models, and predictive tasks. Further, we present various graph perturbation experiments which lend more support to the observation that KGE models can be influenced more by the frequency of entities than by any biological information encoded within the relations. Our results highlight the importance of data modeling choices and emphasize the need for practitioners to be mindful of these issues when interpreting model outputs and during KG composition.
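Research Motivation and Objectives
- Investigate how topological imbalance in biomedical knowledge graphs affects the performance of KGE models.
- Determine whether highly connected entities are systematically overestimated in link prediction tasks.
- Assess the robustness of KGE models under graph perturbations that change entity connectivity.
- Provide actionable recommendations for mitigating topological bias in KG construction and KGE application.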
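Proposed Method
- Evaluate multiple KGE models (e.g., ComplEx) on public biomedical KGs, including Hetionet.
- Apply graph perturbations that rewire the edges of high-degree entities to assess ranking stability.
- Measure how predicted entity rankings change after connectivity is altered, while the rest of the graph structure is held fixed.
- Analyze entity degree distributions and their correlation with prediction scores across diseases and tasks.
- Compare predictions against topological features through a case study on target discovery.
- Offer recommendations on graph projection, edge-confidence filtering, and performance evaluation stratified by connectivity level.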
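The rewiring perturbation described above can be illustrated with a minimal sketch. It assumes the KG is a plain list of `(head, relation, tail)` triples; the function names `entity_degrees` and `rewire_hub`, the uniform choice of replacement entities, and the toy triples are hypothetical and are not taken from the paper, whose exact perturbation procedure is not given here.

```python
import random
from collections import Counter


def entity_degrees(triples):
    """Count how many triples each entity takes part in (as head or tail)."""
    deg = Counter()
    for h, _, t in triples:
        deg[h] += 1
        deg[t] += 1
    return deg


def rewire_hub(triples, hub, fraction, seed=0):
    """Move a fraction of the hub's edges onto randomly chosen other entities.

    Assumption: replacements are drawn uniformly from all non-hub entities.
    The number of triples and their relation types stay the same; only the
    hub endpoint of each selected triple is swapped out. Duplicate triples
    may result and are not removed in this sketch.
    """
    rng = random.Random(seed)
    entities = sorted({e for h, _, t in triples for e in (h, t)} - {hub})
    hub_idx = [i for i, (h, _, t) in enumerate(triples) if hub in (h, t)]
    chosen = set(rng.sample(hub_idx, int(len(hub_idx) * fraction)))

    rewired = []
    for i, (h, r, t) in enumerate(triples):
        if i in chosen:
            other = t if h == hub else h
            # Avoid creating a self-loop on the remaining endpoint.
            candidates = [e for e in entities if e != other]
            replacement = rng.choice(candidates)
            if h == hub:
                h = replacement
            else:
                t = replacement
        rewired.append((h, r, t))
    return rewired


# Toy graph (hypothetical): one hub linked to ten genes, plus one extra edge.
toy = [("UBC", "interacts", f"G{i}") for i in range(10)]
toy.append(("G0", "interacts", "G1"))
perturbed = rewire_hub(toy, "UBC", fraction=0.5)
print(entity_degrees(toy)["UBC"], entity_degrees(perturbed)["UBC"])
```

Retraining a KGE model on `perturbed` and comparing the hub's rank before and after would then show how strongly its ranking depends on degree.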
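Experimental Results
Research Questions
- RQ1: Does topological imbalance in biomedical KGs lead to systematic overestimation of highly connected entities in KGE-based link prediction?
- RQ2: Are KGE model predictions robust to structural perturbations that reduce the degree of super-hub entities?
- RQ3: To what extent does entity degree, rather than the semantics of biological relations, dominate prediction scores?
- RQ4: Why do standard evaluation metrics such as Hits@k and MRR fail to reflect the bias toward high-degree entities?
- RQ5: What practical strategies can mitigate topological imbalance in KG construction and KGE application?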
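One simple way to probe a question like RQ3 is to compute a rank correlation between entity degree and the score a model assigns to each candidate entity. The sketch below implements Spearman correlation with the standard library only. The choice of Spearman, the helper names, and the toy numbers are assumptions made for illustration; they are not the paper's reported analysis or data.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical candidate genes: training-graph degree and model score.
degrees = [3, 12, 40, 950]
scores = [0.10, 0.35, 0.60, 0.97]
print(spearman(degrees, scores))
```

A correlation near 1 for a given disease query would suggest that the ranked list largely mirrors degree, which is the pattern the research questions ask about.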
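Key Findings
- KGE models consistently overestimate highly connected entities across multiple datasets, models, and prediction tasks, independent of biological relevance.
- Graph perturbation experiments show that rewiring the edges of high-degree genes such as UBC causes their ranks to drop markedly, indicating that their rankings depend strongly on connectivity.
- Graph topology, particularly entity degree, influences prediction scores more strongly than the semantics of biological relations.
- Standard evaluation metrics such as Hits@k and MRR are affected by entity frequency and may not faithfully reflect model performance on low-degree entities.
- High-degree entities are often ranked near the top even when their biological association is weak or nonspecific, pointing to a fundamental bias in current KGE inference.
- The study shows that data modeling choices, especially edge creation from NLP pipelines, can exacerbate topological imbalance and should be evaluated carefully.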
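Because aggregate Hits@k and MRR can be dominated by frequent entities, reporting them separately per connectivity level makes weak performance on low-degree entities visible. The sketch below assumes evaluation results are available as `(true_entity, rank)` pairs with 1-based ranks and that degrees come from the training graph. The function name `stratified_metrics`, the bin boundaries, and the toy values are hypothetical.

```python
from collections import defaultdict


def stratified_metrics(results, degree, bins=(10, 100), k=10):
    """Compute MRR and Hits@k per degree bucket.

    results: iterable of (true_entity, rank), rank being the 1-based position
             of the true entity in the model's ranked list.
    degree:  dict mapping entity -> degree in the training graph.
    bins:    ascending inclusive upper bounds; anything above the last bound
             goes into a final ">last" bucket (boundaries are an assumption).
    """
    def bucket(d):
        for bound in bins:
            if d <= bound:
                return f"<={bound}"
        return f">{bins[-1]}"

    ranks_by_bucket = defaultdict(list)
    for entity, rank in results:
        ranks_by_bucket[bucket(degree.get(entity, 0))].append(rank)

    return {
        name: {
            "n": len(ranks),
            "mrr": sum(1 / r for r in ranks) / len(ranks),
            f"hits@{k}": sum(r <= k for r in ranks) / len(ranks),
        }
        for name, ranks in ranks_by_bucket.items()
    }


# Toy evaluation output (hypothetical): hubs rank well, sparse entities do not.
results = [("A", 1), ("B", 2), ("C", 50), ("D", 200)]
degree = {"A": 500, "B": 500, "C": 3, "D": 3}
for name, metrics in stratified_metrics(results, degree).items():
    print(name, metrics)
```

In this toy case the pooled Hits@10 would be 0.5, while the per-bucket view shows 1.0 for the high-degree bucket and 0.0 for the low-degree one.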
This explainer was generated by AI and reviewed by a human editor.