QUICK REVIEW

[Paper Review] Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?

Nafiz Imtiaz Khan, Kylie Cleland|arXiv (Cornell University)|Jan 19, 2026

Artificial Intelligence in Healthcare and Education | Cited by 0

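One-Sentence Summary

This study tests whether large language models can automatically extract procedural radiology cases from reports in place of manual logging, comparing local and commercial models as well as prompting strategies.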

ABSTRACT

Procedural case logs are a core requirement in radiology training, yet they are time-consuming to complete and prone to inconsistency when authored manually. This study investigates whether large language models (LLMs) can automate procedural case log documentation directly from free-text radiology reports. We evaluate multiple local and commercial LLMs under instruction-based and chain-of-thought prompting to extract structured procedural information from 414 curated interventional radiology reports authored by nine residents between 2018 and 2024. Model performance is assessed using sensitivity, specificity, and F1-score, alongside inference latency and token efficiency to estimate operational cost. Results show that both local and commercial models achieve strong extraction performance, with best F1-scores approaching 0.87, while exhibiting different trade-offs between speed and cost. Automation using LLMs has the potential to substantially reduce clerical burden for trainees and improve consistency in case logging. These findings demonstrate the feasibility of AI-assisted documentation in medical education and highlight the need for further validation across institutions and clinical workflows.

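Motivation and Objectives

Assess the feasibility of using LLMs to automate procedural case logging in radiology from narrative reports.
Identify the procedure categories that are difficult for AI to extract.
Assess integration factors for clinical workflow deployment, such as latency and cost.
Provide guidance on prompting strategy and model selection for scalable documentation automation.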

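Proposed Method

A retrospective dataset of 414 radiology reports from nine IR residents (2018–2024), covering 39 predefined procedures.
Annotators established the ground truth, with inter-annotator agreement of Cohen's kappa = 0.896.
Six models (five open/local, one commercial) are evaluated zero-shot under Instruction Prompting (IP) and Chain-of-Thought (CoT) prompting.
Metrics: sensitivity, specificity, F1-score, inference time, token usage, and estimated cost.
The Crosswalk benchmark serves as a metadata-based comparison baseline.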
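
As background for the agreement statistic cited above, here is a minimal sketch of how Cohen's kappa is computed for two annotators. The function name `cohens_kappa` and the toy labels are illustrative assumptions; this summary does not describe the paper's annotation data or tooling.

```python
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    own label frequencies.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item list")

    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of marginals.
    categories = set(labels_a) | set(labels_b)
    p_e = sum(
        (list(labels_a).count(c) / n) * (list(labels_b).count(c) / n)
        for c in categories
    )

    if p_e == 1.0:
        # Both annotators used a single identical label for every item.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy data (hypothetical): 1 = procedure present, 0 = absent.
    annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
    annotator_2 = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]
    print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")
```

Kappa discounts agreement that would occur by chance, which matters here because most report-procedure pairs are negative and raw percent agreement would look high even for careless annotation.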

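Experimental Results

Research Questions

RQ1: Can LLMs extract structured procedural data from radiology reports with high accuracy?
RQ2: Does model performance differ across procedure categories (vascular diagnosis, vascular intervention, non-vascular intervention)?
RQ3: How do local and commercial models compare in speed, cost, and accuracy under different prompting strategies?
RQ4: What practical deployment factors (latency, token usage, cost) must be considered for integration into real-world workflows?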

Key Findings

Model Type	Model Name	Prompting Strategy	Modality	TP	TN	FP	FN	Sensitivity (%)	Specificity (%)	F1-Score (%)
Baseline	Cross-Walk	NA	All	451	15364	93	238	65.46	99.40	73.15
Baseline	Cross-Walk	NA	VascularDiagnosis	143	3065	23	81	63.84	99.26	73.33
Baseline	Cross-Walk	NA	VascularIntervention	157	5906	38	109	59.02	99.36	68.11
Baseline	Cross-Walk	NA	NonVascularIntervention	151	6393	32	48	75.88	99.50	79.06
Local	Qwen-2.5:72B	IP	All	649	15174	283	40	94.19	98.17	80.08
Local	Qwen-2.5:72B	CoT	All	627	15326	131	62	91.00	99.15	86.66
Local	Qwen-2.5:72B	IP	VascularDiagnosis	219	3068	20	5	97.77	99.35	94.60
Local	Qwen-2.5:72B	IP	VascularIntervention	247	5803	141	19	92.86	97.63	75.54
Local	Qwen-2.5:72B	IP	NonVascularIntervention	183	6303	122	16	91.96	98.10	72.62
Local	Qwen-2.5:72B	CoT	VascularDiagnosis	214	3071	17	10	95.54	99.45	94.07
Local	Qwen-2.5:72B	CoT	VascularIntervention	242	5868	76	24	90.98	98.72	82.88
Local	Qwen-2.5:72B	CoT	NonVascularIntervention	171	6387	38	28	85.93	99.41	83.82
Commercial	Claude-3.5-Haiku	IP	All	633	14961	496	56	91.87	96.79	69.64
Commercial	Claude-3.5-Haiku	IP	VascularDiagnosis	215	3067	21	9	95.98	99.32	93.48
Commercial	Claude-3.5-Haiku	IP	VascularIntervention	230	5737	207	36	86.47	96.52	65.43
Commercial	Claude-3.5-Haiku	IP	NonVascularIntervention	188	6157	268	11	94.47	95.83	57.41
Commercial	Claude-3.5-Haiku	CoT	All	613	15348	109	76	88.97	99.29	86.89
Commercial	Claude-3.5-Haiku	CoT	VascularDiagnosis	210	3069	19	14	93.75	99.38	92.71
Commercial	Claude-3.5-Haiku	CoT	VascularIntervention	228	5905	39	38	85.71	99.34	85.55
Commercial	Claude-3.5-Haiku	CoT	NonVascularIntervention	175	6374	51	24	87.94	99.21	82.35
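
The percentage columns in the table follow directly from the TP/TN/FP/FN counts. The sketch below shows one plausible way such counts arise from multi-label extraction and then recomputes the metrics for the Claude-3.5-Haiku, CoT, All row. The names `Counts` and `count_cells` and the three-label toy example are hypothetical. The assumption that every report-procedure pair is scored as one binary decision is inferred from the table (each All row sums to 16,146 = 414 × 39), not stated in this summary.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Set


@dataclass(frozen=True)
class Counts:
    tp: int
    tn: int
    fp: int
    fn: int


def count_cells(
    gold: Sequence[Set[str]], pred: Sequence[Set[str]], label_space: Iterable[str]
) -> Counts:
    """Score every (report, procedure) pair as one binary decision (assumed scheme)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must cover the same reports")
    labels = list(label_space)
    tp = tn = fp = fn = 0
    for gold_set, pred_set in zip(gold, pred):
        for label in labels:
            in_gold, in_pred = label in gold_set, label in pred_set
            if in_gold and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_gold:
                fn += 1
            else:
                tn += 1
    return Counts(tp, tn, fp, fn)


def sensitivity(c: Counts) -> float:
    return c.tp / (c.tp + c.fn)


def specificity(c: Counts) -> float:
    return c.tn / (c.tn + c.fp)


def f1_score(c: Counts) -> float:
    return 2 * c.tp / (2 * c.tp + c.fp + c.fn)


if __name__ == "__main__":
    # Toy example with a hypothetical three-procedure label space.
    toy = count_cells(
        gold=[{"A"}, {"B", "C"}],
        pred=[{"A", "B"}, {"C"}],
        label_space=["A", "B", "C"],
    )
    print(toy)  # Counts(tp=2, tn=2, fp=1, fn=1)

    # Counts copied from the Claude-3.5-Haiku, CoT, All row of the table.
    row = Counts(tp=613, tn=15348, fp=109, fn=76)
    print(f"{100 * sensitivity(row):.2f}")  # 88.97
    print(f"{100 * specificity(row):.2f}")  # 99.29
    print(f"{100 * f1_score(row):.2f}")     # 86.89
```

Because negatives dominate (over 15,000 of the 16,146 cells), specificity stays above 95% for every configuration, while F1 moves sharply with the false-positive count. That is why the IP and CoT rows differ far more in F1 than in specificity.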

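The commercial model Claude-3.5-Haiku with Chain-of-Thought prompting achieves the highest overall (All-modality) F1-score among the configurations, at 86.89%.
The local model Qwen-2.5:72B with Chain-of-Thought prompting reaches an F1-score of 86.66%, with high sensitivity (91.00%) and specificity (99.15%).
The Crosswalk metadata baseline shows high specificity (99.40%) but low sensitivity (65.46%), highlighting its limitations for free-text extraction.
Chain-of-Thought prompting generally raises F1-scores and reduces false positives, most visibly in complex categories such as vascular intervention.
Inference time varies by model: Claude-3.5-Haiku with IP takes ~1.97 s per procedure (fastest), while Qwen-2.5:72B with CoT takes ~13.47 s per procedure (slower but accurate).
Replacing manual entry with automated logging is projected to save each resident more than 35 hours per year.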

This review was generated by AI and reviewed by human editors.