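[Paper Summary] A Study of Social and Behavioral Determinants of Health in Lung Cancer Patients Using Transformers-based Natural Language Processing Models
This paper compares BERT- and RoBERTa-based transformer NLP models for extracting SBDoH concepts from clinical narratives, and shows that in a lung cancer cohort the narratives provide more detail than structured EHRs.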
Social and behavioral determinants of health (SBDoH) have important roles in shaping people's health. In clinical research studies, especially comparative effectiveness studies, failure to adjust for SBDoH factors will potentially cause confounding issues and misclassification errors in both statistical analyses and machine learning-based models. However, there are limited studies to examine SBDoH factors in clinical outcomes due to the lack of structured SBDoH information in current electronic health record (EHR) systems, while much of the SBDoH information is documented in clinical narratives. Natural language processing (NLP) is thus the key technology to extract such information from unstructured clinical text. However, there is not a mature clinical NLP system focusing on SBDoH. In this study, we examined two state-of-the-art transformer-based NLP models, including BERT and RoBERTa, to extract SBDoH concepts from clinical narratives, applied the best performing model to extract SBDoH concepts on a lung cancer screening patient cohort, and examined the difference of SBDoH information between NLP extracted results and structured EHRs (SBDoH information captured in standard vocabularies such as the International Classification of Diseases codes). The experimental results show that the BERT-based NLP model achieved the best strict/lenient F1-score of 0.8791 and 0.8999, respectively. The comparison between NLP extracted SBDoH information and structured EHRs in the lung cancer patient cohort of 864 patients with 161,933 various types of clinical notes showed that much more detailed information about smoking, education, and employment was only captured in clinical narratives and that it is necessary to use both clinical narratives and structured EHRs to construct a more complete picture of patients' SBDoH factors.
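Research Motivation and Objectives
- Emphasize the importance of social and behavioral determinants of health (SBDoH) for clinical outcomes, and reduce the risk of confounding and misclassification in analyses.
- Evaluate how well state-of-the-art transformer NLP models extract SBDoH concepts from clinical narratives.
- Compare NLP-extracted SBDoH information with structured EHR data to assess the completeness of SBDoH capture.
- Apply the best-performing model to a lung cancer screening cohort to characterize SBDoH factors.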
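Proposed Method
- Evaluate two transformer NLP models, BERT and RoBERTa, for extracting SBDoH concepts from clinical narratives.
- Measure SBDoH extraction performance with strict and lenient F1-scores.
- In a cohort of 864 patients with 161,933 clinical notes, compare NLP-derived SBDoH data with SBDoH data from structured EHRs.
- Analyze how narrative and structured records differ in capturing information such as smoking, education, and employment.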
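The summary does not say how the strict and lenient F1-scores were computed. The sketch below follows a common convention in clinical concept extraction and is an assumption, not the authors' evaluation script: a strict match requires identical boundaries and concept type, while a lenient match only requires overlapping boundaries with the same type. The `Span` layout, the function names, and the example spans are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical span layout: (start offset, end offset exclusive, concept type).
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """True when two spans share a concept type and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def f1_score(gold: List[Span], pred: List[Span], lenient: bool = False) -> float:
    """F1 over spans; strict needs exact matches, lenient accepts overlaps."""
    if lenient:
        matched_pred = sum(any(_overlaps(p, g) for g in gold) for p in pred)
        matched_gold = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    else:
        gold_set, pred_set = set(gold), set(pred)
        matched_pred = sum(p in gold_set for p in pred)
        matched_gold = sum(g in pred_set for g in gold)

    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up annotations for illustration only (not data from the paper).
gold = [(0, 14, "Smoking"), (20, 30, "Education"), (40, 50, "Employment")]
pred = [(0, 14, "Smoking"), (22, 30, "Education"), (60, 70, "Employment")]

strict = f1_score(gold, pred)                 # only the Smoking span matches exactly
lenient = f1_score(gold, pred, lenient=True)  # the Education span also counts
print(f"strict F1 = {strict:.4f}, lenient F1 = {lenient:.4f}")
```

With these toy spans, the prediction with a shifted left boundary is rejected under the strict criterion and accepted under the lenient one, which is why lenient scores are never lower than strict scores.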
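Experimental Results
Research Questions
- RQ1: Can transformer-based NLP models accurately extract SBDoH concepts from unstructured clinical narratives?
- RQ2: Which model, BERT or RoBERTa, extracts SBDoH concepts from clinical text more accurately?
- RQ3: In terms of completeness, how does NLP-extracted SBDoH information compare with SBDoH data in structured EHRs?
- RQ4: In the lung cancer cohort, which SBDoH factors (e.g., smoking, education, employment) are better captured by narratives?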
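Key Findings
- The BERT-based NLP model achieved the best strict/lenient F1-scores of 0.8791 and 0.8999, respectively.
- NLP-extracted SBDoH information captured far more detail on smoking, education, and employment than structured EHR vocabularies.
- In the cohort of 864 lung cancer patients with 161,933 clinical notes, narratives complemented structured EHRs to form a more complete SBDoH picture.
- Building a comprehensive SBDoH profile for a patient requires both clinical narratives and structured EHR data.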
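One way to picture the narrative-versus-structured comparison is as a per-patient set comparison. The sketch below is a minimal illustration under assumed inputs: each source is a mapping from a patient ID to the set of SBDoH categories found for that patient. The function name `compare_sources`, the category labels, and the three toy patients are hypothetical and do not reproduce the paper's data or pipeline.

```python
from collections import Counter
from typing import Dict, Set


def compare_sources(
    nlp: Dict[str, Set[str]], structured: Dict[str, Set[str]]
) -> Dict[str, Counter]:
    """Count, per SBDoH category, patients covered by one source or by both."""
    counts: Dict[str, Counter] = {}
    for patient_id in set(nlp) | set(structured):
        from_notes = nlp.get(patient_id, set())
        from_codes = structured.get(patient_id, set())
        for category in from_notes | from_codes:
            tally = counts.setdefault(category, Counter())
            if category in from_notes and category in from_codes:
                tally["both"] += 1
            elif category in from_notes:
                tally["narrative_only"] += 1
            else:
                tally["structured_only"] += 1
    return counts


# Toy inputs for illustration only (not results from the study).
nlp_extracted = {
    "p1": {"smoking", "education"},
    "p2": {"smoking", "employment"},
    "p3": {"education"},
}
structured_ehr = {
    "p1": {"smoking"},
    "p3": {"smoking"},
}

result = compare_sources(nlp_extracted, structured_ehr)
for category in sorted(result):
    print(category, dict(result[category]))
```

Categories whose counts fall mostly under `narrative_only` are the ones a structured-only analysis would miss, which is the kind of gap the findings describe for smoking, education, and employment.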
This summary was generated by AI and reviewed by human editors.