QUICK REVIEW

[Paper Review] Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?

Nafiz Imtiaz Khan, Kylie Cleland | arXiv (Cornell University) | 2026. 01. 19.

Artificial Intelligence in Healthcare and Education | Citations: 0

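One-Line Summary

This study tests whether large language models can replace manual case logging by automatically extracting procedural radiology cases from reports, comparing local and commercial models and evaluating prompting strategies.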

ABSTRACT

Procedural case logs are a core requirement in radiology training, yet they are time-consuming to complete and prone to inconsistency when authored manually. This study investigates whether large language models (LLMs) can automate procedural case log documentation directly from free-text radiology reports. We evaluate multiple local and commercial LLMs under instruction-based and chain-of-thought prompting to extract structured procedural information from 414 curated interventional radiology reports authored by nine residents between 2018 and 2024. Model performance is assessed using sensitivity, specificity, and F1-score, alongside inference latency and token efficiency to estimate operational cost. Results show that both local and commercial models achieve strong extraction performance, with best F1-scores approaching 0.87, while exhibiting different trade-offs between speed and cost. Automation using LLMs has the potential to substantially reduce clerical burden for trainees and improve consistency in case logging. These findings demonstrate the feasibility of AI-assisted documentation in medical education and highlight the need for further validation across institutions and clinical workflows.

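Motivation and Objectives

Assess the feasibility of using LLMs to automate radiology procedure logs from narrative reports.
Identify procedure categories that are difficult for AI-based extraction.
Evaluate integration considerations for deployment in clinical workflows, such as latency and cost.
Provide guidance on prompting strategy and model selection for scalable documentation automation.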

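Proposed Method

A retrospective dataset of 414 radiology reports authored by nine IR residents between 2018 and 2024, covering 39 predefined procedures.
Annotators established ground truth with Cohen's Kappa = 0.896.
Five open/local models and one commercial model are evaluated zero-shot, using Instruction Prompting (IP) and Chain-of-Thought (CoT) prompting.
Metrics: sensitivity, specificity, F1-score, inference time, token usage, and estimated cost.
A Crosswalk benchmark is used for metadata-based comparison.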
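The three accuracy metrics listed above follow directly from confusion-matrix counts. The sketch below uses the standard textbook definitions; the function name `confusion_metrics` is hypothetical and not taken from the paper. It is checked against the Cross-Walk "All" row of the results table (TP 451, TN 15364, FP 93, FN 238).

```python
def confusion_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Return sensitivity, specificity, and F1-score as percentages.

    Standard definitions (assumed to match the paper's usage):
      sensitivity = TP / (TP + FN)
      specificity = TN / (TN + FP)
      F1          = 2*TP / (2*TP + FP + FN)
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "sensitivity": round(100 * sensitivity, 2),
        "specificity": round(100 * specificity, 2),
        "f1": round(100 * f1, 2),
    }


# Cross-Walk benchmark, "All" row of the results table.
crosswalk_all = confusion_metrics(tp=451, tn=15364, fp=93, fn=238)
print(crosswalk_all)
```

Because true negatives dominate the counts (most of the 39 procedures are absent from any given report), specificity stays near 99% for every configuration, which is why F1-score is the more informative figure for comparing models.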

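Experimental Results

Research Questions

RQ1: Can LLMs extract structured procedural data from radiology reports with high accuracy?
RQ2: Does model performance vary by procedure category (vascular diagnosis, vascular intervention, non-vascular intervention)?
RQ3: How do local and commercial models differ in speed, cost, and accuracy across prompting strategies?
RQ4: What are the practical deployment considerations (latency, token usage, cost) for integration into real-world workflows?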

Key Results

Model Type	Model Name	Prompt	Modality	TP	TN	FP	FN	Sensitivity (%)	Specificity (%)	F1-Score (%)
Benchmark	Cross-Walk	NA	All	451	15364	93	238	65.46	99.40	73.15
Benchmark	Cross-Walk	NA	VascularDiagnosis	143	3065	23	81	63.84	99.26	73.33
Benchmark	Cross-Walk	NA	VascularIntervention	157	5906	38	109	59.02	99.36	68.11
Benchmark	Cross-Walk	NA	NonVascularIntervention	151	6393	32	48	75.88	99.50	79.06
Local	Qwen-2.5:72B	IP	All	649	15174	283	40	94.19	98.17	80.08
Local	Qwen-2.5:72B	CoT	All	627	15326	131	62	91.00	99.15	86.66
Local	Qwen-2.5:72B	IP	VascularDiagnosis	219	3068	20	5	97.77	99.35	94.60
Local	Qwen-2.5:72B	IP	VascularIntervention	247	5803	141	19	92.86	97.63	75.54
Local	Qwen-2.5:72B	IP	NonVascularIntervention	183	6303	122	16	91.96	98.10	72.62
Local	Qwen-2.5:72B	CoT	VascularDiagnosis	214	3071	17	10	95.54	99.45	94.07
Local	Qwen-2.5:72B	CoT	VascularIntervention	242	5868	76	24	90.98	98.72	82.88
Local	Qwen-2.5:72B	CoT	NonVascularIntervention	171	6387	38	28	85.93	99.41	83.82
Commercial	Claude-3.5-Haiku	IP	All	633	14961	496	56	91.87	96.79	69.64
Commercial	Claude-3.5-Haiku	IP	VascularDiagnosis	215	3067	21	9	95.98	99.32	93.48
Commercial	Claude-3.5-Haiku	IP	VascularIntervention	230	5737	207	36	86.47	96.52	65.43
Commercial	Claude-3.5-Haiku	IP	NonVascularIntervention	188	6157	268	11	94.47	95.83	57.41
Commercial	Claude-3.5-Haiku	CoT	All	613	15348	109	76	88.97	99.29	86.89
Commercial	Claude-3.5-Haiku	CoT	VascularDiagnosis	210	3069	19	14	93.75	99.38	92.71
Commercial	Claude-3.5-Haiku	CoT	VascularIntervention	228	5905	39	38	85.71	99.34	85.55
Commercial	Claude-3.5-Haiku	CoT	NonVascularIntervention	175	6374	51	24	87.94	99.21	82.35

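The commercial model Claude-3.5-Haiku with Chain-of-Thought prompting achieves the highest F1-score of all configurations (86.89%).
The local model Qwen-2.5:72B with Chain-of-Thought prompting achieves an F1-score of 86.66%, with high sensitivity and specificity.
The Crosswalk metadata benchmark shows high specificity (99.40%) but comparatively low sensitivity (65.46%), highlighting the limits of metadata-based logging and the value of extracting from free text.
Chain-of-Thought prompting generally improves F1-score and reduces false positives, particularly in complex categories such as VascularIntervention.
In terms of inference time, Claude-3.5-Haiku with IP is the fastest at about 1.97 seconds per procedure, while Qwen-2.5:72B with CoT is slower at about 13.47 seconds per procedure but accurate.
Replacing manual entry with automated logging is estimated to save more than 35 hours per resident per year.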
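The effect of Chain-of-Thought prompting on false positives can be read straight from the "All" rows of the table. The sketch below is an illustration, not the authors' analysis: it derives precision (TP / (TP + FP)), a quantity the table does not report but which, together with sensitivity, determines F1-score. The names `ALL_ROWS`, `precision`, and `compare_prompts` are hypothetical.

```python
# (TP, FP) counts copied from the "All" rows of the results table.
ALL_ROWS = {
    ("Qwen-2.5:72B", "IP"): (649, 283),
    ("Qwen-2.5:72B", "CoT"): (627, 131),
    ("Claude-3.5-Haiku", "IP"): (633, 496),
    ("Claude-3.5-Haiku", "CoT"): (613, 109),
}


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted procedures that are correct."""
    return tp / (tp + fp)


def compare_prompts(model: str) -> dict[str, float]:
    """Contrast IP and CoT prompting for one model on the 'All' rows."""
    ip_tp, ip_fp = ALL_ROWS[(model, "IP")]
    cot_tp, cot_fp = ALL_ROWS[(model, "CoT")]
    return {
        "fp_removed": ip_fp - cot_fp,  # false positives eliminated by CoT
        "tp_lost": ip_tp - cot_tp,     # true positives given up by CoT
        "precision_ip": precision(ip_tp, ip_fp),
        "precision_cot": precision(cot_tp, cot_fp),
    }


for model in ("Qwen-2.5:72B", "Claude-3.5-Haiku"):
    print(model, compare_prompts(model))
```

For both models, CoT removes far more false positives than it loses true positives, which is consistent with the slightly lower sensitivity but higher F1-score reported for CoT in the table.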

This review was generated by AI and reviewed by a human editor.