QUICK REVIEW

[Paper Review] Leveraging Large Language Model as Simulated Patients for Clinical Education

Yaneng Li, Cheng Zeng | arXiv (Cornell University) | 2024.04.13

Topic Modeling | Citations: 14

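One-Line Summary

CureFun is a model-agnostic framework that uses LLMs as Virtual Simulated Patients (VSPs), equipped with graph-based memory and automatic evaluation. It also evaluates LLMs as virtual doctors in clinical education.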

ABSTRACT

Simulated Patients (SPs) play a crucial role in clinical medical education by providing realistic scenarios for student practice. However, the high cost of training and hiring qualified SPs, along with the heavy workload and potential risks they face in consistently portraying actual patients, limit students' access to this type of clinical training. Consequently, the integration of computer program-based simulated patients has emerged as a valuable educational tool in recent years. With the rapid development of Large Language Models (LLMs), their exceptional capabilities in conversational artificial intelligence and role-playing have been demonstrated, making them a feasible option for implementing Virtual Simulated Patient (VSP). In this paper, we present an integrated model-agnostic framework called CureFun that harnesses the potential of LLMs in clinical medical education. This framework facilitates natural conversations between students and simulated patients, evaluates their dialogue, and provides suggestions to enhance students' clinical inquiry skills. Through comprehensive evaluations, our approach demonstrates more authentic and professional SP-scenario dialogue flows compared to other LLM-based chatbots, thus proving its proficiency in simulating patients. Additionally, leveraging CureFun's evaluation ability, we assess several medical LLMs and discuss the possibilities and limitations of using LLMs as virtual doctors from the perspective of their diagnostic abilities.

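Motivation and Objectives

Reduce the high cost and risks of traditional simulated patients in clinical education.
Develop a model-agnostic VSP framework that uses LLMs to produce realistic dialogue flows.
Automate the evaluation of student–patient dialogues so that assessment can scale.
Evaluate several LLMs and discuss their potential as virtual doctors from a diagnostic perspective.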

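Proposed Method

Build a case graph from SP scripts using NER and relation extraction, forming the backbone for retrieval-augmented generation (RAG).
Implement a graph-based, context-adaptive SP chatbot (ERRG: Extract–Retrieve–Rewrite–Generate) to control the dialogue flow.
Convert SP checklists into LLM-executable automatic evaluation programs, with ensemble voting across multiple LLMs.
Evaluate LLMs as virtual doctors by running predefined diagnostic scenarios and analyzing non-scoring metrics (information density, emotional tendency, etc.).
Deploy auxiliary modules (TTS/STT, an RDF/SPARQL graph database, a dedicated LLM server) to improve realism and scalability.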
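To make the ERRG idea concrete, here is a minimal sketch of an Extract, Retrieve, Rewrite, Generate loop over a toy case graph. Everything in it is an assumption made for illustration: the triples, the function names, the keyword matching that stands in for NER, and the `echo_llm` placeholder that stands in for a real LLM call. The summary above names the four stages but does not describe what each one does, so the behavior of each stage here is a guess, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical case graph: (subject, relation, object) triples of the kind
# an NER + relation-extraction step might produce from an SP script.
CASE_GRAPH: List[Triple] = [
    ("patient", "has_symptom", "chest pain"),
    ("chest pain", "onset", "two days ago"),
    ("chest pain", "aggravated_by", "climbing stairs"),
    ("patient", "has_history", "hypertension"),
]


def extract(question: str, graph: List[Triple]) -> List[str]:
    """Extract: find graph entities mentioned in the question.
    Plain substring matching stands in for a real NER model."""
    q = question.lower()
    entities = {t[0] for t in graph} | {t[2] for t in graph}
    return sorted(e for e in entities if e in q)


def retrieve(entities: List[str], graph: List[Triple]) -> List[Triple]:
    """Retrieve: keep triples that touch any extracted entity."""
    return [t for t in graph if t[0] in entities or t[2] in entities]


def rewrite(triples: List[Triple]) -> List[str]:
    """Rewrite: turn triples into plain-language facts for the prompt."""
    return [f"{s} {r.replace('_', ' ')} {o}" for s, r, o in triples]


def generate(question: str, facts: List[str], llm: Callable[[str], str]) -> str:
    """Generate: ask the LLM to answer using only the retrieved facts."""
    if not facts:
        return "I'm not sure, doctor."
    prompt = (
        "You are a simulated patient. Answer only from these facts:\n"
        + "\n".join(f"- {f}" for f in facts)
        + f"\nDoctor: {question}\nPatient:"
    )
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Placeholder for a real LLM: returns the facts listed in the prompt."""
    facts = [line[2:] for line in prompt.splitlines() if line.startswith("- ")]
    return "; ".join(facts)


def errg_reply(question: str,
               graph: List[Triple] = CASE_GRAPH,
               llm: Callable[[str], str] = echo_llm) -> str:
    entities = extract(question, graph)
    triples = retrieve(entities, graph)
    facts = rewrite(triples)
    return generate(question, facts, llm)


if __name__ == "__main__":
    print(errg_reply("When did the chest pain start?"))
```

Restricting the prompt to retrieved facts is one plausible way a graph could keep the simulated patient from volunteering details the student has not asked about, such as the hypertension history in this toy case.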

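Experimental Results

Research Questions

RQ1: How can LLMs be used to simulate patients with authentic dialogue flows in clinical education?
RQ2: Can a graph-augmented, instruction-tuned framework improve SP dialogue quality and evaluation reliability?
RQ3: How well does automated LLM-based evaluation agree with human evaluators in SP exams?
RQ4: How much do different LLMs differ in their performance as virtual doctors in diagnostic interviews?
RQ5: What are the strengths and limitations of using LLMs as VSPs and VDs in large-scale medical education?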

Key Results

Model	B-ELO (without framework)	B-ELO (with framework)
Mixtral-8x7B	1462.40	1510.60 (+48.20)
Qwen72B	1523.93	1575.20 (+51.27)
PaLM	1570.91	1639.07 (+68.16)
GPT-3.5-Turbo	1403.54	1653.72 (+250.18)
ERNIE-Bot 4	1780.88	1880.15 (+99.27)
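The summary does not define B-ELO. Assuming it is an Elo-style rating derived from pairwise comparisons of chatbot dialogues, the sketch below shows the standard Elo update for a single comparison, to give a sense of scale for the rating gaps in the table. The function names and the K-factor of 32 are illustrative choices, not values taken from the paper.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k (assumed here to be 32) controls how far one result moves a rating.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b


if __name__ == "__main__":
    # Two equally rated systems; A wins one pairwise comparison.
    print(elo_update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

Under this formula a 400-point gap corresponds to roughly 10:1 odds that the higher-rated system is preferred, so rating differences of a few hundred points indicate a strong preference.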

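The CureFun framework produces more authentic and professional SP dialogue in SP scenarios than other LLM-based chatbots.
Automatic evaluation scores correlate strongly with human evaluators (mean Spearman 0.81, Pearson 0.85, p<0.05).
The ensemble of multiple LLMs and automatic evaluation programs provides reliable student assessment and scales to large cohorts.
Combined with the framework, ERNIE-Bot 4 achieved the best SP performance among the tested backbones, and GPT-3.5-Turbo showed a notable improvement when using the framework (+250.18 B-ELO).
When evaluating LLMs as virtual doctors, ChatGPT achieved the highest overall score, followed by DISC-MedLLM, while human experts outperformed all LLMs in diagnostic ability.
The findings suggest that SP and VD differ in practice, highlighting the need for an integrated SP–VD training pipeline for medical education.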
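The ensemble evaluation and the reported agreement with human raters can be sketched as follows. This is an illustration under stated assumptions: each LLM judge is assumed to return a yes/no verdict per checklist item, the ensemble is assumed to use a strict majority vote (the summary says only "ensemble voting"), and the function names are invented here. Pearson and Spearman are implemented in plain Python so the example has no dependencies.

```python
from math import sqrt
from typing import List, Sequence


def majority_vote(votes: Sequence[bool]) -> bool:
    """True only if strictly more than half of the judges say 'satisfied'."""
    return sum(votes) * 2 > len(votes)


def ensemble_score(item_votes: List[Sequence[bool]]) -> int:
    """Number of checklist items the ensemble marks as satisfied."""
    return sum(majority_vote(v) for v in item_votes)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical votes from three LLM judges on three checklist items.
    votes = [[True, True, False], [False, False, True], [True, True, True]]
    print(ensemble_score(votes))  # 2
```

With per-student ensemble scores in one list and human scores in another, `spearman` and `pearson` would yield agreement figures of the kind reported above. The data in the usage example is invented and does not reproduce the paper's numbers.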


This review was generated by AI and reviewed by a human editor.