[Paper Digest] Can x2vec Save Lives? Integrating Graph and Language Embeddings for Automatic Mental Health Classification
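This paper proposes combining graph embeddings (metapath2vec) with language embeddings (doc2vec) to improve automatic mental health classification of rare psychological events, such as suicidal ideation, in online support groups. By fusing relational network structure with linguistic content, the integrated model predicts suicidality with 90% accuracy, well above either modality alone (69% for graph data and 76% for language data), with 10% false positives and 12% false negatives.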
Graph and language embedding models are becoming commonplace in large scale analyses given their ability to represent complex sparse data densely in low-dimensional space. Integrating these models' complementary relational and communicative data may be especially helpful if predicting rare events or classifying members of hidden populations - tasks requiring huge and sparse datasets for generalizable analyses. For example, due to social stigma and comorbidities, mental health support groups often form in amorphous online groups. Predicting suicidality among individuals in these settings using standard network analyses is prohibitive due to resource limits (e.g., memory), and adding auxiliary data like text to such models exacerbates complexity- and sparsity-related issues. Here, I show how merging graph and language embedding models (metapath2vec and doc2vec) avoids these limits and extracts unsupervised clustering data without domain expertise or feature engineering. Graph and language distances to a suicide support group have little correlation (ρ < 0.23), implying the two models are not embedding redundant information. When used separately to predict suicidality among individuals, graph and language data generate relatively accurate results (69% and 76%, respectively); however, when integrated, both data produce highly accurate predictions (90%, with 10% false-positives and 12% false-negatives). Visualizing graph embeddings annotated with predictions of potentially suicidal individuals shows the integrated model could classify such individuals even if they are positioned far from the support group. These results extend research on the importance of simultaneously analyzing behavior and language in massive networks and efforts to integrate embedding models for different kinds of data when predicting and classifying, particularly when they involve rare events.
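Research Motivation and Objectives
- To address the challenge of predicting rare mental health events, such as suicidal ideation, in sparse and hidden online communities.
- To overcome the limits that data sparsity and high dimensionality impose on standard network analysis and natural language processing.
- To assess whether graph and language embeddings capture complementary information rather than redundant patterns.
- To develop an integrated embedding model that improves predictive accuracy without domain expertise or feature engineering.
- To evaluate the model's ability to identify high-risk individuals who are structurally distant from known support groups in the social network.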
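Proposed Method
- Use metapath2vec to generate low-dimensional graph embeddings from the heterogeneous network structure, capturing relational and structural similarity.
- Apply doc2vec to text submitted by users in Reddit's r/SuicideWatch to generate dense document-level embeddings.
- Combine the graph and language embeddings into a joint representation space to improve classification performance.
- Use cosine similarity and the Pearson correlation coefficient (ρ) to measure how redundant the graph-embedding and language-embedding distances are.
- Train and evaluate a binary classifier on the integrated embeddings to predict suicidal ideation from posting behavior.
- Visualize the embedding space to assess how well the model identifies high-risk individuals, even those far from known support groups in the social network.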
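The redundancy check and the integration step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each node already has one graph vector and one language vector, that "distance" means cosine distance to a single reference node standing in for the support group, and that integration means simple concatenation. The function names and the random stand-in data are hypothetical.

```python
import numpy as np

def cosine_distance(vectors, reference):
    """Cosine distance (1 - cosine similarity) from each row to `reference`."""
    vectors = np.asarray(vectors, dtype=float)
    reference = np.asarray(reference, dtype=float)
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(reference)
    return 1.0 - (vectors @ reference) / norms

def correlate_distances(graph_emb, text_emb, group_idx):
    """Pearson correlation between graph and language distances to one node."""
    d_graph = cosine_distance(graph_emb, graph_emb[group_idx])
    d_text = cosine_distance(text_emb, text_emb[group_idx])
    # Drop the reference node itself: its distance is 0 in both spaces.
    mask = np.arange(len(graph_emb)) != group_idx
    return float(np.corrcoef(d_graph[mask], d_text[mask])[0, 1])

def integrate(graph_emb, text_emb):
    """Assumed integration: concatenate the two vectors for each node."""
    return np.hstack([graph_emb, text_emb])

# Hypothetical stand-in data: random vectors, NOT real embeddings.
rng = np.random.default_rng(0)
graph_emb = rng.normal(size=(200, 16))  # e.g. metapath2vec output
text_emb = rng.normal(size=(200, 32))   # e.g. doc2vec output

rho = correlate_distances(graph_emb, text_emb, group_idx=0)
features = integrate(graph_emb, text_emb)
print(features.shape, round(rho, 3))
```

A correlation near zero suggests the two distance measures carry different information, which is the situation in which concatenating them is most likely to help a downstream classifier.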
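Experimental Results
Research Questions
- RQ1: When predicting suicidal ideation, do graph and language embeddings capture redundant or complementary information?
- RQ2: Does integrating graph and language embeddings substantially improve prediction accuracy for rare mental health events compared with either modality alone?
- RQ3: Does the integrated model reduce false-positive and false-negative rates when identifying individuals at high risk of suicidal ideation?
- RQ4: Can the model still detect potentially suicidal individuals who are structurally distant from known support groups in the social network?
- RQ5: How well does the model perform at unsupervised clustering without domain-specific feature engineering?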
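Key Findings
- The integrated model predicts suicidal ideation with 90% accuracy, with 10% false positives and 12% false negatives.
- Used alone, graph embeddings reach 69% accuracy and language embeddings 76%, so integration yields a substantial performance gain.
- The correlation between graph and language distances to the suicide support group is low (ρ < 0.23), implying that the two embeddings capture non-redundant, complementary information.
- Visualization shows that the integrated model can identify high-risk individuals even when they are far from the suicide support group in the social network, suggesting robustness to structural isolation.
- The model identifies high-risk individuals without domain expertise or manual feature engineering, which suggests good scalability and generalizability.
- The results support the clinical value of combining behavioral (network) and linguistic (text) indicators, consistent with clinical diagnosis drawing on both kinds of evidence.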
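For readers who want to relate the accuracy, false-positive, and false-negative figures, the sketch below computes all three from binary labels. It assumes the conventional definitions (false positives as a share of actual negatives, false negatives as a share of actual positives); the summary does not state which denominators the paper used. The function name and the toy labels are hypothetical.

```python
def classification_rates(y_true, y_pred):
    """Accuracy, false-positive rate, and false-negative rate for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    nan = float("nan")  # returned when a rate has an empty denominator
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else nan,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else nan,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else nan,
    }

# Toy labels for illustration only (1 = flagged as at risk).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
rates = classification_rates(y_true, y_pred)
print(rates)
```

Because the two error rates use different denominators, they need not sum to one minus the accuracy, which is why all three figures are reported together.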
This digest was generated by AI and reviewed by a human editor.