QUICK REVIEW

[Paper Review] Leveraging Large Language Model as Simulated Patients for Clinical Education

Yaneng Li, Cheng Zeng | arXiv (Cornell University) | Apr 13, 2024

Topic Modeling | Cited by 14

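ONE-SENTENCE SUMMARY

CureFun is a model-agnostic framework that uses LLMs as virtual simulated patients with graph-driven memory and automated evaluation, and that also evaluates LLMs as virtual doctors in clinical education.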

ABSTRACT

Simulated Patients (SPs) play a crucial role in clinical medical education by providing realistic scenarios for student practice. However, the high cost of training and hiring qualified SPs, along with the heavy workload and potential risks they face in consistently portraying actual patients, limit students' access to this type of clinical training. Consequently, the integration of computer program-based simulated patients has emerged as a valuable educational tool in recent years. With the rapid development of Large Language Models (LLMs), their exceptional capabilities in conversational artificial intelligence and role-playing have been demonstrated, making them a feasible option for implementing Virtual Simulated Patient (VSP). In this paper, we present an integrated model-agnostic framework called CureFun that harnesses the potential of LLMs in clinical medical education. This framework facilitates natural conversations between students and simulated patients, evaluates their dialogue, and provides suggestions to enhance students' clinical inquiry skills. Through comprehensive evaluations, our approach demonstrates more authentic and professional SP-scenario dialogue flows compared to other LLM-based chatbots, thus proving its proficiency in simulating patients. Additionally, leveraging CureFun's evaluation ability, we assess several medical LLMs and discuss the possibilities and limitations of using LLMs as virtual doctors from the perspective of their diagnostic abilities.

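RESEARCH MOTIVATION AND OBJECTIVES

Address the high cost and high risk of traditional simulated patients in clinical education.
Develop a model-agnostic VSP framework that uses LLMs to produce authentic dialogue flows.
Automate the assessment of student–patient dialogues so that evaluation can scale.
Evaluate multiple LLMs and discuss their potential as virtual doctors from a diagnostic perspective.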

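PROPOSED METHOD

Build a case graph from SP scripts using named entity recognition (NER) and relation extraction, forming the backbone for retrieval-augmented generation (RAG).
Implement a graph-driven, context-adaptive SP chatbot (ERRG: Extract–Retrieve–Rewrite–Generate) to control the dialogue flow.
Convert SP checklists into automated evaluation procedures that LLMs can execute, and apply ensemble voting across multiple LLMs.
Evaluate LLMs as virtual doctors by running predefined diagnostic scenarios and analyzing non-scoring metrics (information density, sentiment polarity, etc.).
Deploy auxiliary modules (text-to-speech/speech-to-text, a graph database with RDF/SPARQL, a dedicated LLM server) to improve realism and scalability.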
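
To make the graph-driven retrieval idea concrete, here is a minimal Python sketch of the Retrieve and Generate-prompt steps only. It is not CureFun's implementation: the triples, the function names (build_case_graph, retrieve_facts, build_prompt), the substring matching, and the prompt wording are all assumptions made for illustration. The Extract and Rewrite steps, which would need an NER model or an LLM, are left out.

```python
from collections import defaultdict

# Hypothetical (subject, relation, object) triples, of the kind NER plus
# relation extraction might produce from an SP script. Invented for illustration.
TRIPLES = [
    ("patient", "has_symptom", "chest pain"),
    ("chest pain", "duration", "two days"),
    ("chest pain", "aggravated_by", "exertion"),
    ("patient", "has_history", "hypertension"),
]


def build_case_graph(triples):
    """Index each triple under both its subject and its object."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((subj, rel, obj))
        graph[obj].append((subj, rel, obj))
    return graph


def retrieve_facts(graph, question):
    """Return triples attached to any entity whose name appears in the question."""
    q = question.lower()
    facts = []
    for entity, edges in graph.items():
        if entity in q:
            for edge in edges:
                if edge not in facts:  # keep each triple once
                    facts.append(edge)
    return facts


def build_prompt(question, facts):
    """Assemble a grounded prompt (wording is an assumption, not the paper's)."""
    fact_lines = "\n".join(f"- {s} {r.replace('_', ' ')} {o}" for s, r, o in facts)
    return (
        "You are a simulated patient. Answer only from these facts:\n"
        f"{fact_lines}\n"
        f"Student: {question}\nPatient:"
    )


graph = build_case_graph(TRIPLES)
question = "How long have you had the chest pain?"
facts = retrieve_facts(graph, question)
print(build_prompt(question, facts))
```

Restricting the prompt to retrieved facts is one plausible way a case graph could keep a simulated patient from volunteering or inventing details; a production system would presumably query the graph database (for example via SPARQL) rather than match substrings.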

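EXPERIMENTAL RESULTS

Research Questions

RQ1: How can LLMs be used to simulate patient roles in clinical education and achieve realistic dialogue flows?
RQ2: Can a graph-augmented, instruction-tuned framework improve SP dialogue quality and evaluation reliability?
RQ3: How well does automated LLM-based evaluation agree with human evaluators in SP exams?
RQ4: What are the relative capabilities of various LLMs as virtual doctors in diagnostic interviews?
RQ5: What are the strengths and limitations of using LLMs as VSPs and VDs in large-scale medical education?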

Key Findings

Model	B-ELO (baseline)	B-ELO with our method (gain)
Mixtral-8x7B	1462.40	1510.60 (+48.20)
Qwen72B	1523.93	1575.20 (+51.27)
PaLM	1570.91	1639.07 (+68.16)
GPT-3.5-Turbo	1403.54	1653.72 (+250.18)
ERNIE-Bot 4	1780.88	1880.15 (+99.27)
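
As a rough aid for reading the rating gaps in the table, the sketch below recomputes each gain and converts it to an expected head-to-head score. It assumes B-ELO behaves like a standard Elo rating (logistic curve, scale factor 400), which this review does not confirm; the function name elo_expected_score is hypothetical.

```python
def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (assumed, not confirmed for B-ELO)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# model: (baseline B-ELO, B-ELO with the framework), copied from the table
rows = {
    "Mixtral-8x7B": (1462.40, 1510.60),
    "Qwen72B": (1523.93, 1575.20),
    "PaLM": (1570.91, 1639.07),
    "GPT-3.5-Turbo": (1403.54, 1653.72),
    "ERNIE-Bot 4": (1780.88, 1880.15),
}

for model, (base, with_fw) in rows.items():
    gain = with_fw - base
    p = elo_expected_score(with_fw, base)
    print(f"{model}: gain={gain:.2f}, expected score vs. baseline={p:.3f}")
```

Under that assumption, equal ratings give an expected score of 0.5, and larger gains push the framework-backed variant further above its own baseline.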

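In SP scenarios, the CureFun framework produces dialogues that are more authentic and professional than those of other LLM-based chatbots.
Automated evaluation scores correlate strongly with human evaluators (mean Spearman 0.81, Pearson 0.85, p<0.05).
Ensemble evaluation that combines multiple LLMs with the automated scoring procedure provides reliable student assessment and scales to large cohorts.
ERNIE-Bot 4 combined with the framework achieved the best SP performance among the tested backbones; GPT-3.5-Turbo showed a marked gain when using the framework (+250.18 B-ELO).
In the evaluation of LLMs as virtual doctors, ChatGPT achieved the highest total score among all LLMs, with DISC-MedLLM second; human evaluators (experts) outperformed all LLMs in diagnostic ability.
The framework reveals a gap between SPs and VDs in practice, highlighting the need for an integrated SP–VD training pipeline for medical education.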
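
The ensemble voting and the agreement check against human raters can be illustrated with a small Python sketch. Everything here is an assumption for illustration: the checklist items, the boolean judge verdicts, and the helper names are hypothetical, and the paper's actual scoring procedure may differ. Spearman correlation is computed as the Pearson correlation of average ranks, using only the standard library.

```python
from collections import Counter


def majority_vote(votes):
    """Most common verdict among several LLM judges (use an odd number of judges)."""
    return Counter(votes).most_common(1)[0][0]


def checklist_score(judgments):
    """judgments: {checklist item: [bool per judge]}; count items passed by majority."""
    return sum(1 for votes in judgments.values() if majority_vote(votes))


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical verdicts from three LLM judges on two checklist items.
judgments = {
    "asked about pain duration": [True, True, False],
    "asked about medication history": [False, False, True],
}
print(checklist_score(judgments))

# Hypothetical automated vs. human scores for five students.
auto_scores = [12, 15, 9, 18, 14]
human_scores = [11, 16, 10, 17, 13]
print(spearman(auto_scores, human_scores), pearson(auto_scores, human_scores))
```

With an even number of judges, majority_vote falls back on whichever verdict was seen first in a tie, so a real pipeline would need an explicit tie-breaking rule.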

This review was generated by AI and reviewed by a human editor.