QUICK REVIEW

[Paper Review] CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks

Chao Pang, Xinzhuo Jiang|arXiv (Cornell University)|Nov 10, 2021

Machine Learning in Healthcare | References: 13 | Cited by: 31

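One-Sentence Summary

CEHR-BERT incorporates temporal information from structured EHR data into the BERT framework through artificial time tokens, temporal concept embeddings, and a Visit Type Prediction objective, improving several disease prediction tasks.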

ABSTRACT

Embedding algorithms are increasingly used to represent clinical concepts in healthcare for improving machine learning tasks such as clinical phenotyping and disease prediction. Recent studies have adapted state-of-the-art bidirectional encoder representations from transformers (BERT) architecture to structured electronic health records (EHR) data for the generation of contextualized concept embeddings, yet do not fully incorporate temporal data across multiple clinical domains. Therefore, we developed a new BERT adaptation, CEHR-BERT, to incorporate temporal information using a hybrid approach by augmenting the input to BERT using artificial time tokens, incorporating time, age, and concept embeddings, and introducing a new second learning objective for visit type. CEHR-BERT was trained on a subset of Columbia University Irving Medical Center-New York Presbyterian Hospital's clinical data, which includes 2.4M patients, spanning over three decades, and tested using 4-fold cross-validation on the following prediction tasks: hospitalization, death, new heart failure (HF) diagnosis, and HF readmission. Our experiments show that CEHR-BERT outperformed existing state-of-the-art clinical BERT adaptations and baseline models across all 4 prediction tasks in both ROC-AUC and PR-AUC. CEHR-BERT also demonstrated strong transfer learning capability, as our model trained on only 5% of data outperformed comparison models trained on the entire data set. Ablation studies to better understand the contribution of each time component showed incremental gains with every element, suggesting that CEHR-BERT's incorporation of artificial time tokens, time and age embeddings with concept embeddings, and the addition of the second learning objective represents a promising approach for future BERT-based clinical embeddings.

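Motivation and Objectives

Exploit the temporal structure of structured EHR data to improve downstream prediction.
Develop CEHR-BERT, a BERT-based model that encodes time through artificial time tokens and temporal embeddings.
Introduce a second pretraining objective, Visit Type Prediction, to improve predictive performance.
Evaluate CEHR-BERT on multiple clinical prediction tasks using the large CUIMC-NYP OMOP dataset.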

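Proposed Method

Represent each patient's history as a sequence of visits delimited by VS/VE tokens, with artificial time tokens (ATT) inserted between visits.
Combine concept embeddings with time and age embeddings through a fully connected (FC) layer to form temporal concept embeddings.
Pretrain with Masked Language Modeling (MLM) and an auxiliary Visit Type Prediction (VTP) objective.
Compare CEHR-BERT against BEHRT, MedBERT, and baseline methods on 4 prediction tasks using 4-fold evaluation.
Run ablation studies to assess the contributions of the time tokens, the time/age embeddings, and VTP.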
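
To make the input layout concrete, here is a minimal sketch of how a patient history could be flattened into a token sequence with VS/VE markers and artificial time tokens between visits. The function names (`time_token`, `build_sequence`), the bucketing thresholds (weekly tokens under 28 days, monthly tokens under a year, one long-term token beyond that), and the concept codes are illustrative assumptions; this summary does not specify them, and the sketch is not the authors' implementation.

```python
from datetime import date


def time_token(gap_days: int) -> str:
    """Map the gap between two visits (in days) to an artificial time token.

    The bucket boundaries are assumptions made for illustration only.
    """
    if gap_days < 0:
        raise ValueError("visits must be in chronological order")
    if gap_days < 28:
        return f"W{gap_days // 7}"           # W0..W3
    if gap_days < 365:
        return f"M{max(1, gap_days // 30)}"  # M1..M12
    return "LT"                              # long-term gap


def build_sequence(visits):
    """Flatten visits into tokens: [VS, concepts..., VE, ATT, VS, ...].

    `visits` is a list of (visit_date, [concept codes]) sorted by date.
    """
    tokens = []
    previous_date = None
    for visit_date, concepts in visits:
        if previous_date is not None:
            # Insert one time token between consecutive visits.
            tokens.append(time_token((visit_date - previous_date).days))
        tokens.append("VS")
        tokens.extend(concepts)
        tokens.append("VE")
        previous_date = visit_date
    return tokens


# Hypothetical patient with three visits (codes are placeholders).
visits = [
    (date(2020, 1, 1), ["E11.9", "metformin"]),
    (date(2020, 1, 10), ["I10"]),
    (date(2020, 6, 1), ["I50.9", "furosemide"]),
]
sequence = build_sequence(visits)
print(sequence)
```

In this sketch the time tokens sit in the same sequence as the clinical concepts, so a BERT-style model can attend to them like any other token; the time and age embeddings described above would be added on top of this sequence rather than replacing it.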

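Experimental Results

Research Questions

RQ1: Does introducing temporal information through artificial time tokens and temporal embeddings improve BERT-based representations of structured EHR data?
RQ2: Does the Visit Type Prediction objective further improve downstream disease prediction performance?
RQ3: How does CEHR-BERT compare with existing EHR adaptations of BERT (BEHRT, MedBERT) and traditional baselines across multiple prediction tasks?
RQ4: Is CEHR-BERT effective in few-shot settings where labeled data is limited?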

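Key Findings

CEHR-BERT outperforms BEHRT, MedBERT, and the baseline methods on all four prediction tasks (T2DM HF, HF readmission, death after discharge home, and hospitalization).
On T2DM HF, CEHR-BERT achieves an AUC of 80.7% and a PR-AUC of 32.3%.
On HF readmission, CEHR-BERT achieves an AUC of 66.3% and a PR-AUC of 38.6%.
On death after discharge home, CEHR-BERT achieves an AUC of 94.6% and a PR-AUC of 52.7%.
On hospitalization, CEHR-BERT achieves an AUC of 75.9% and a PR-AUC of 31.1%.
In the few-shot setting, CEHR-BERT trained on 5% of the training data outperforms competitors trained on the full data set (e.g., on T2DM HF, AUC ~0.78 and PR-AUC ~0.29).
Ablation studies show incremental gains from the time tokens, the time/age embeddings, and the VTP objective, indicating that their contributions are additive.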
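
Because every result above is reported as both ROC-AUC and PR-AUC, a small sketch of how the two metrics differ may help. ROC-AUC is computed here as the pairwise ranking probability, and PR-AUC is approximated by average precision, which is one common estimator; this summary does not say which estimator the authors used. The labels and scores are made-up toy values, not results from the paper.

```python
def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Tied scores count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example.

    Assumes scores are distinct; ties are not handled specially.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), reverse=True)
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos


# Toy, imbalanced example: 2 positives among 8 patients.
labels = [0, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC={auc:.3f}  PR-AUC(AP)={ap:.3f}")
```

With rare outcomes, precision is penalized heavily by each false positive ranked above a true positive, which is why PR-AUC values for the same model can sit well below its ROC-AUC, as in the figures reported above.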


This review was generated by AI and reviewed by a human editor.