QUICK REVIEW

[Paper Review] End-To-End Clinical Trial Matching with Large Language Models

Dyke Ferber, Lars Hilgers|arXiv (Cornell University)|Jul 18, 2024

Statistical Methods in Clinical Trials|Cited by 13

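One-Sentence Summary

The paper presents an end-to-end pipeline that uses GPT-4o to search oncology clinical trials worldwide and to check eligibility criterion by criterion against patient electronic health records, reaching high accuracy and, on some tasks, outperforming qualified medical doctors.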

ABSTRACT

Matching cancer patients to clinical trials is essential for advancing treatment and patient care. However, the inconsistent format of medical free text documents and complex trial eligibility criteria make this process extremely challenging and time-consuming for physicians. We investigated whether the entire trial matching process - from identifying relevant trials among 105,600 oncology-related clinical trials on clinicaltrials.gov to generating criterion-level eligibility matches - could be automated using Large Language Models (LLMs). Using GPT-4o and a set of 51 synthetic Electronic Health Records (EHRs), we demonstrate that our approach identifies relevant candidate trials in 93.3% of cases and achieves a preliminary accuracy of 88.0% when matching patient-level information at the criterion level against a baseline defined by human experts. Utilizing LLM feedback reveals that 39.3% criteria that were initially considered incorrect are either ambiguous or inaccurately annotated, leading to a total model accuracy of 92.7% after refining our human baseline. In summary, we present an end-to-end pipeline for clinical trial matching using LLMs, demonstrating high precision in screening and matching trials to individual patients, even outperforming the performance of qualified medical doctors. Our fully end-to-end pipeline can operate autonomously or with human supervision and is not restricted to oncology, offering a scalable solution for enhancing patient-trial matching in real-world settings.

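Motivation and Objectives

Demonstrate an end-to-end pipeline that maps a patient's EHR to suitable oncology clinical trials on ClinicalTrials.gov.
Combine No-SQL queries with vector-similarity search to retrieve relevant trials efficiently.
Use a large language model to check eligibility criterion by criterion and to return structured, verifiable results.
Provide a per-criterion explanation and enable human-AI collaboration to refine the ground truth.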
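
The hybrid retrieval idea can be sketched in plain Python: an exact-match filter over structured fields narrows the pool, and cosine similarity ranks what remains. This is a minimal in-memory sketch, not the authors' implementation. The trial records, the field names (`status`, `condition`), the toy 3-dimensional vectors, and the `hybrid_search` helper are all hypothetical stand-ins for database queries and real text embeddings.

```python
import math

# Hypothetical toy records; real trials would come from ClinicalTrials.gov.
TRIALS = [
    {"id": "T1", "status": "RECRUITING", "condition": "lung cancer", "vec": [0.9, 0.1, 0.0]},
    {"id": "T2", "status": "COMPLETED", "condition": "lung cancer", "vec": [0.8, 0.2, 0.0]},
    {"id": "T3", "status": "RECRUITING", "condition": "breast cancer", "vec": [0.1, 0.9, 0.0]},
    {"id": "T4", "status": "RECRUITING", "condition": "lung cancer", "vec": [0.5, 0.5, 0.0]},
]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_search(trials, filters, query_vec, top_k=2):
    """Filter on exact field values, then rank the survivors by similarity."""
    # Step 1: structured filter (stand-in for a No-SQL query).
    pool = [t for t in trials if all(t.get(k) == v for k, v in filters.items())]
    # Step 2: vector ranking (stand-in for a vector-store lookup).
    ranked = sorted(pool, key=lambda t: cosine(t["vec"], query_vec), reverse=True)
    return [t["id"] for t in ranked[:top_k]]


result = hybrid_search(
    TRIALS,
    {"status": "RECRUITING", "condition": "lung cancer"},
    query_vec=[1.0, 0.0, 0.0],
)
print(result)  # ['T1', 'T4']
```

Filtering first keeps the similarity ranking from surfacing trials that are semantically close but fail a hard constraint such as recruitment status.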

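Proposed Method

Build a hybrid database (MongoDB + ChromaDB) that supports both exact-condition queries and vector-based retrieval.
Embed trial text with BAAI/bge-large-en-v1.5 (1,024-dimensional vectors), splitting the text into chunks with a 50-token overlap for vector retrieval.
Use GPT-4o to generate No-SQL queries and to run multi-step, programmatic trial filtering.
Represent eligibility criteria as structured, nested programmatic objects and evaluate them against patient data.
Output a per-criterion eligibility result of True/False/Unknown, with an explanation produced through chain-of-thought style reasoning.
Evaluate the AI against a human baseline set by five oncologists, and use AI feedback to iteratively improve the ground truth.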
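
To make the structured, nested criteria and the True/False/Unknown output concrete, here is a minimal sketch using three-valued logic, with `None` standing for Unknown. It is an illustration under stated assumptions, not the paper's schema: the `Criterion` class, the `all`/`any` operators, the helper functions, and the patient fields are hypothetical, and deterministic checks stand in for the judgments that GPT-4o makes in the described pipeline.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

Verdict = Optional[bool]  # True, False, or None meaning "Unknown"
LABEL = {True: "True", False: "False", None: "Unknown"}


@dataclass
class Criterion:
    text: str
    op: str = "leaf"  # "leaf", "all" (AND), or "any" (OR)
    check: Optional[Callable[[dict], Verdict]] = None  # used by leaves only
    children: list = field(default_factory=list)


def evaluate(c: Criterion, patient: dict) -> Verdict:
    """Evaluate a nested criterion; missing data propagates as Unknown."""
    if c.op == "leaf":
        return c.check(patient)
    results = [evaluate(child, patient) for child in c.children]
    if c.op == "all":
        if False in results:  # one definite failure decides an AND
            return False
        return None if None in results else True
    if c.op == "any":
        if True in results:  # one definite success decides an OR
            return True
        return None if None in results else False
    raise ValueError(f"unknown operator: {c.op}")


def at_least(key, n):
    # Returns None when the patient record lacks the field.
    return lambda p: None if p.get(key) is None else p[key] >= n


def at_most(key, n):
    return lambda p: None if p.get(key) is None else p[key] <= n


inclusion = Criterion(
    text="Adult with good performance status",
    op="all",
    children=[
        Criterion("Age >= 18 years", check=at_least("age", 18)),
        Criterion("ECOG performance status <= 1", check=at_most("ecog", 1)),
    ],
)

patient = {"age": 64}  # hypothetical record with no ECOG value
for child in inclusion.children:
    print(child.text, "->", LABEL[evaluate(child, patient)])
print("Overall ->", LABEL[evaluate(inclusion, patient)])  # Overall -> Unknown
```

Keeping Unknown distinct from False matters for screening: a missing lab value calls for follow-up with the patient rather than exclusion from the trial.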

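Experimental Results

Research Questions

RQ1: Can GPT-4o correctly identify a set of candidate trials for a given patient EHR from more than 100k oncology trials?
RQ2: Can the system accurately assess eligibility at the level of individual criteria and explain each decision?
RQ3: Does the end-to-end pipeline match or exceed human experts in trial matching and criterion assessment?
RQ4: Is the programmatic, structured-output approach more reliable and transferable for eligibility assessment than free-text prompting?
RQ5: How does iterative human-AI refinement affect the accuracy of the ground truth?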

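Key Findings

The pipeline retrieved the relevant, human-preselected trial in 93.3% of test cases (15 benchmark cases).
Initial criterion-level matching reached 88.0% accuracy against the human assessment (1,398 of 1,589 criteria).
Refining the human ground truth with AI feedback raised overall accuracy to 92.7%.
On review with LLM feedback, 39.3% of the criteria initially scored as incorrect turned out to be ambiguous or inaccurately annotated in the human baseline, which points to real potential for AI-assisted error correction.
In the final candidate set, the target trial appeared in the Top-5 for 10/15 cases and in the Top-10 for 14/15 cases.
The approach shows high precision in end-to-end trial matching and is not inherently limited to oncology.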
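
The reported accuracy figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that every criterion reclassified during review counts as a correct model answer once the baseline is refined, and it rounds the reclassified share to a whole number of criteria; the summary states neither step explicitly.

```python
total = 1589            # criteria assessed (from the findings above)
correct_initial = 1398  # criteria matching the original human baseline

initially_incorrect = total - correct_initial
# Assumption: 39.3% of the initially incorrect criteria, rounded to a whole count.
reclassified = round(0.393 * initially_incorrect)

initial_accuracy = correct_initial / total
refined_accuracy = (correct_initial + reclassified) / total

print(initially_incorrect)               # 191
print(reclassified)                      # 75
print(round(100 * initial_accuracy, 1))  # 88.0
print(round(100 * refined_accuracy, 1))  # 92.7
print(round(100 * 14 / 15, 1))           # 93.3, the Top-10 hit rate
```

Under these assumptions the three headline percentages are mutually consistent: 88.0% before refinement, 92.7% after, and 93.3% for retrieval.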


This review was generated by AI and checked by a human editor.