QUICK REVIEW

[Paper Review] A Study into patient similarity through representation learning from medical records

Hoda Memarzadeh, Nasser Ghadiri | arXiv (Cornell University) | Apr 29, 2021

Machine Learning in Healthcare | References: 54 | Cited by: 7

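Summary

This paper proposes two new patient representation models, UTTree and UTTree-H, which integrate unstructured clinical notes with structured electronic medical record (EMR) data through a temporal tree structure built over UMLS-annotated entities. By applying relabeling strategies that capture past and current medical events, the method produces sequence-based embeddings that significantly outperform baseline models on patient similarity and mortality prediction tasks, with lower MSE and higher precision and NDCG.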

ABSTRACT

Patient similarity assessment, which identifies patients similar to a given patient, can help improve medical care. The assessment can be performed using Electronic Medical Records (EMRs). Patient similarity measurement requires converting heterogeneous EMRs into comparable formats to calculate their distance. While versatile document representation learning methods have been developed in recent years, it is still unclear how complex EMR data should be processed to create the most useful patient representations. This study presents a new data representation method for EMRs that takes the information in clinical narratives into account. To address the limitations of previous approaches in handling complex parts of EMR data, an unsupervised method is proposed for building a patient representation, which integrates unstructured data with structured data extracted from patients' EMRs. In order to model the extracted data, we employed a tree structure that captures the temporal relations of multiple medical events from EMR. We processed clinical notes to extract symptoms, signs, and diseases using different tools such as medspaCy, MetaMap, and scispaCy and mapped entities to the Unified Medical Language System (UMLS). After creating a tree data structure, we utilized two novel relabeling methods for the non-leaf nodes of the tree to capture two temporal aspects of the extracted events. By traversing the tree, we generated a sequence that could create an embedding vector for each patient. The comprehensive evaluation of the proposed method for patient similarity and mortality prediction tasks demonstrated that our proposed model leads to lower mean squared error (MSE), higher precision, and normalized discounted cumulative gain (NDCG) relative to baselines.

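Motivation and Objectives

Develop a unified patient representation model that effectively integrates unstructured clinical notes with structured EMR data.
Use a tree-based data structure to model the temporal relations between medical events, in particular the relation between past and current diseases.
Improve patient similarity assessment and mortality prediction by generating context-aware, low-dimensional embedding vectors from EMR sequences.
Evaluate the proposed method against existing baselines on real-world datasets to verify its improvement on key downstream metrics.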

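Proposed Method

The method builds a tree data structure from EMR events, in which nodes represent medical entities (e.g., diseases, symptoms) extracted with NLP tools such as medspaCy, MetaMap, and scispaCy.
Extracted entities are mapped to the Unified Medical Language System (UMLS) to ensure semantic consistency and standardization.
Two novel relabeling strategies are applied to the non-leaf nodes to encode temporal information: the co-occurrence of medical events and their temporal order.
Traversing the tree yields a sequence representation that preserves temporal dependencies, which then serves as input to a representation learning model (e.g., PV-DM).
The enhanced variant, UTTree-H, explicitly incorporates the patient's medical history by adjusting node labels according to the presence of past diseases.
Principal component analysis (PCA) and downstream classifiers (XGBoost, SVM, random forest) are applied to evaluate embedding quality on the mortality prediction task.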
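
The tree-to-sequence idea described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a three-level tree (patient, time window, entity), a relabeling rule that joins sorted child labels to encode co-occurrence within a window, and a root label that joins window labels in chronological order. The names `Node`, `build_tree`, `relabel`, and `to_sequence`, and the `CUI_*` strings, are hypothetical placeholders rather than real UMLS identifiers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def build_tree(patient_id: str, windows: List[List[str]]) -> Node:
    """Build patient -> time window -> entity levels.

    `windows` is a chronologically ordered list; each item holds the
    concept labels extracted for that time window (assumed input shape).
    """
    root = Node(patient_id)
    for index, concepts in enumerate(windows):
        window = Node(f"window_{index}")
        window.children = [Node(concept) for concept in concepts]
        root.children.append(window)
    return root


def relabel(root: Node) -> None:
    """Relabel non-leaf nodes (assumed rules, for illustration only)."""
    for window in root.children:
        # Co-occurrence: sorting makes the label independent of extraction order.
        window.label = "+".join(sorted(child.label for child in window.children))
    # Temporal order: window labels are joined in chronological order.
    root.label = ">".join(window.label for window in root.children)


def to_sequence(root: Node) -> List[str]:
    """Traverse windows in time order, emitting each window label then its leaves."""
    tokens: List[str] = []
    for window in root.children:
        tokens.append(window.label)
        tokens.extend(child.label for child in window.children)
    return tokens


# Hypothetical patient with two time windows.
tree = build_tree("patient_1", [["CUI_FEVER", "CUI_COUGH"], ["CUI_PNEUMONIA"]])
relabel(tree)
sequence = to_sequence(tree)
```

A token sequence like `sequence` could then be handed to a document embedding model such as PV-DM, with one sequence per patient treated as one document.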

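Experimental Results

Research Questions

RQ1: How can unstructured and structured EMR data be effectively integrated into a single, coherent patient representation?
RQ2: How does modeling the temporal relations between medical events affect patient similarity and predictive performance?
RQ3: Does incorporating past medical history into the representation improve downstream task accuracy compared with models that ignore historical context?
RQ4: How do the proposed relabeling strategies applied to the tree nodes affect the quality of the generated embedding sequences?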

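Key Findings

The UTTree-H model achieved a lower mean squared error (MSE) than all baseline methods on the patient similarity task, and the difference was statistically significant (p < 0.01).
The model showed higher precision and normalized discounted cumulative gain (NDCG) on the patient similarity ranking task, indicating better retrieval quality.
When more than eight biomedical concepts were extracted from a patient's medical history, the UTTree-H model consistently outperformed the other methods in error reduction.
XGBoost classifiers trained on the proposed embeddings achieved the highest median accuracy across all datasets; the box plots show a tighter interquartile range, indicating more robust performance.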
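A Mann-Whitney U test (also known as the Wilcoxon rank-sum test) confirmed that the performance differences between UTTree and the baseline models are statistically significant (p < 0.01), except for one comparison, which is marked with an asterisk.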
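
For readers unfamiliar with the ranking metrics mentioned above, the following sketch shows one common way to compute precision@k and NDCG@k for a ranked list of retrieved patients. The summary does not state the cutoff k, how relevance was graded, or which DCG variant the authors used, so the linear-gain formula and the function names here are assumptions for illustration.

```python
import math
from typing import Sequence


def precision_at_k(relevance: Sequence[float], k: int) -> float:
    """Fraction of the top-k retrieved items with relevance greater than zero."""
    return sum(1 for rel in relevance[:k] if rel > 0) / k


def dcg_at_k(relevance: Sequence[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevance[:k]))


def ndcg_at_k(relevance: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevance, reverse=True), k)
    return dcg_at_k(relevance, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of the patients returned for one query,
# listed in the order the model ranked them.
ranked_relevance = [3, 0, 2, 1]
p_at_3 = precision_at_k(ranked_relevance, 3)
ndcg_at_3 = ndcg_at_k(ranked_relevance, 3)
```

An NDCG of 1.0 means the model returned patients in the ideal order; lower values penalize relevant patients that appear further down the list.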

This review was generated by AI and reviewed by a human editor.