QUICK REVIEW

[Paper Review] Intelligent Documentation in Medical Education: Can AI Replace Manual Case Logging?

Nafiz Imtiaz Khan, Kylie Cleland|arXiv (Cornell University)|Jan 19, 2026

Artificial Intelligence in Healthcare and Education|Citations: 0

One-Line Summary

This study tests whether procedural radiology cases can be extracted automatically from reports, comparing local and commercial models and prompting strategies.

ABSTRACT

Procedural case logs are a core requirement in radiology training, yet they are time-consuming to complete and prone to inconsistency when authored manually. This study investigates whether large language models (LLMs) can automate procedural case log documentation directly from free-text radiology reports. We evaluate multiple local and commercial LLMs under instruction-based and chain-of-thought prompting to extract structured procedural information from 414 curated interventional radiology reports authored by nine residents between 2018 and 2024. Model performance is assessed using sensitivity, specificity, and F1-score, alongside inference latency and token efficiency to estimate operational cost. Results show that both local and commercial models achieve strong extraction performance, with best F1-scores approaching 0.87, while exhibiting different trade-offs between speed and cost. Automation using LLMs has the potential to substantially reduce clerical burden for trainees and improve consistency in case logging. These findings demonstrate the feasibility of AI-assisted documentation in medical education and highlight the need for further validation across institutions and clinical workflows.

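Motivation and Objectives

Assess the feasibility of using LLMs to generate radiology procedural case logs automatically from narrative reports.
Identify the procedure categories that are difficult for AI-based extraction.
Evaluate integration considerations for clinical workflow deployment, including latency and cost.
Provide guidance on prompting strategies and model selection for scalable documentation automation.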

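Proposed Method

A retrospective dataset of 414 radiology reports authored by nine IR residents between 2018 and 2024, with 39 predefined procedures.
Annotators established the ground truth with Cohen's Kappa = 0.896.
Six models (five open/local, one commercial) were evaluated under zero-shot, Instruction Prompting (IP), and Chain-of-Thought (CoT) prompting.
Metrics: sensitivity, specificity, F1-score, inference time, token usage, and estimated cost.
A metadata-based Crosswalk benchmark serves as the comparison baseline.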
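
The three accuracy metrics listed above follow directly from confusion-matrix counts. The sketch below is not the authors' evaluation code; the `Confusion` class and its method names are hypothetical, and the counts are copied from the Cross-Walk "All" row of the results table to show how the reported percentages can be reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts (hypothetical helper, not from the paper)."""
    tp: int
    tn: int
    fp: int
    fn: int

    def sensitivity(self) -> float:
        # Share of performed procedures that were detected.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of non-performed procedures correctly left out.
        return self.tn / (self.tn + self.fp)

    def f1(self) -> float:
        # Harmonic mean of precision and sensitivity.
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# Counts from the Cross-Walk benchmark, "All" row of the results table.
crosswalk = Confusion(tp=451, tn=15364, fp=93, fn=238)

print(f"sensitivity = {crosswalk.sensitivity():.2%}")  # 65.46%
print(f"specificity = {crosswalk.specificity():.2%}")  # 99.40%
print(f"F1-score    = {crosswalk.f1():.2%}")           # 73.15%
```

Because true negatives dominate the counts, specificity stays close to 99% for every configuration, so F1 is the more informative column when comparing models.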

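Experimental Results

Research Questions

RQ1: Can structured procedural data be extracted from radiology reports with high accuracy?
RQ2: Does model performance differ across procedure categories (vascular diagnosis, vascular intervention, non-vascular intervention)?
RQ3: How do local and commercial models differ in speed, cost, and accuracy under different prompting strategies?
RQ4: What are the practical deployment considerations (latency, token usage, cost) for real-world workflow integration?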

Key Findings

Model Type	Model-Name	Prompting	Modality	TP	TN	FP	FN	Sensitivity (%)	Specificity (%)	F1-Score (%)
Benchmark	Cross-Walk	NA	All	451	15364	93	238	65.46	99.40	73.15
Benchmark	Cross-Walk	NA	VascularDiagnosis	143	3065	23	81	63.84	99.26	73.33
Benchmark	Cross-Walk	NA	VascularIntervention	157	5906	38	109	59.02	99.36	68.11
Benchmark	Cross-Walk	NA	NonVascularIntervention	151	6393	32	48	75.88	99.50	79.06
Local	Qwen-2.5:72B	IP	All	649	15174	283	40	94.19	98.17	80.08
Local	Qwen-2.5:72B	CoT	All	627	15326	131	62	91.00	99.15	86.66
Local	Qwen-2.5:72B	IP	VascularDiagnosis	219	3068	20	5	97.77	99.35	94.60
Local	Qwen-2.5:72B	IP	VascularIntervention	247	5803	141	19	92.86	97.63	75.54
Local	Qwen-2.5:72B	IP	NonVascularIntervention	183	6303	122	16	91.96	98.10	72.62
Local	Qwen-2.5:72B	CoT	VascularDiagnosis	214	3071	17	10	95.54	99.45	94.07
Local	Qwen-2.5:72B	CoT	VascularIntervention	242	5868	76	24	90.98	98.72	82.88
Local	Qwen-2.5:72B	CoT	NonVascularIntervention	171	6387	38	28	85.93	99.41	83.82
Commercial	Claude-3.5-Haiku	IP	All	633	14961	496	56	91.87	96.79	69.64
Commercial	Claude-3.5-Haiku	IP	VascularDiagnosis	215	3067	21	9	95.98	99.32	93.48
Commercial	Claude-3.5-Haiku	IP	VascularIntervention	230	5737	207	36	86.47	96.52	65.43
Commercial	Claude-3.5-Haiku	IP	NonVascularIntervention	188	6157	268	11	94.47	95.83	57.41
Commercial	Claude-3.5-Haiku	CoT	All	613	15348	109	76	88.97	99.29	86.89
Commercial	Claude-3.5-Haiku	CoT	VascularDiagnosis	210	3069	19	14	93.75	99.38	92.71
Commercial	Claude-3.5-Haiku	CoT	VascularIntervention	228	5905	39	38	85.71	99.34	85.55
Commercial	Claude-3.5-Haiku	CoT	NonVascularIntervention	175	6374	51	24	87.94	99.21	82.35
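
The counts in the table appear to be consistent with one binary decision per report-procedure pair (414 reports × 39 procedures). That reading is an inference from the numbers, not something the review states. A minimal sketch that checks it against the Claude-3.5-Haiku CoT rows, with hypothetical variable names:

```python
REPORTS = 414      # reports in the dataset
PROCEDURES = 39    # predefined procedures

# (TP, TN, FP, FN) copied from the Claude-3.5-Haiku CoT rows of the table.
rows = {
    "All": (613, 15348, 109, 76),
    "VascularDiagnosis": (210, 3069, 19, 14),
    "VascularIntervention": (228, 5905, 39, 38),
    "NonVascularIntervention": (175, 6374, 51, 24),
}

# Total number of decisions scored in each row.
totals = {name: sum(counts) for name, counts in rows.items()}

# Assumption being checked: every report is scored against every procedure.
assert totals["All"] == REPORTS * PROCEDURES

# The three categories should partition the "All" row.
category_total = sum(v for name, v in totals.items() if name != "All")
assert category_total == totals["All"]

# Implied number of procedures per category under the same assumption.
procedures_per_category = {
    name: total // REPORTS for name, total in totals.items() if name != "All"
}
print(procedures_per_category)
# {'VascularDiagnosis': 8, 'VascularIntervention': 15, 'NonVascularIntervention': 16}
```

Under this reading the categories cover 8, 15, and 16 procedures, which sum to the 39 predefined procedures.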

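The commercial model Claude-3.5-Haiku with Chain-of-Thought (CoT) prompting achieved the highest F1-score of all configurations (86.89%).
The local model Qwen-2.5:72B with CoT prompting reached an F1-score of 86.66%, with high sensitivity (91.00%) and specificity (99.15%).
The Crosswalk metadata baseline has high specificity (99.40%) but low sensitivity (65.46%), which shows the limits of metadata matching for capturing procedures described in free text.
CoT prompting tends to raise F1-scores and reduce false positives, particularly in complex categories such as vascular intervention.
Inference time varies by model: Claude-3.5-Haiku with IP takes about 1.97 s per procedure (fastest), while Qwen-2.5:72B with CoT takes about 13.47 s per procedure (slower but more accurate).
Replacing manual entry with automated logging is projected to save more than 35 hours per resident per year.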
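
The effect of CoT prompting on false positives can be quantified from the "All" rows of the table. The sketch below is illustrative only; the function names are hypothetical, and precision is not a column in the table but is derived here from TP and FP.

```python
# (TP, FP) copied from the "All" rows of the results table.
ALL_ROWS = {
    ("Qwen-2.5:72B", "IP"): (649, 283),
    ("Qwen-2.5:72B", "CoT"): (627, 131),
    ("Claude-3.5-Haiku", "IP"): (633, 496),
    ("Claude-3.5-Haiku", "CoT"): (613, 109),
}


def precision(tp: int, fp: int) -> float:
    """Share of predicted procedures that were actually performed."""
    return tp / (tp + fp)


def ip_to_cot(model: str) -> dict:
    """Summarize what changes when a model switches from IP to CoT."""
    ip_tp, ip_fp = ALL_ROWS[(model, "IP")]
    cot_tp, cot_fp = ALL_ROWS[(model, "CoT")]
    return {
        "fp_reduction": 1 - cot_fp / ip_fp,
        "precision_ip": precision(ip_tp, ip_fp),
        "precision_cot": precision(cot_tp, cot_fp),
        "true_positives_lost": ip_tp - cot_tp,
    }


for model in ("Qwen-2.5:72B", "Claude-3.5-Haiku"):
    s = ip_to_cot(model)
    print(
        f"{model}: false positives -{s['fp_reduction']:.0%}, "
        f"precision {s['precision_ip']:.1%} -> {s['precision_cot']:.1%}, "
        f"true positives lost: {s['true_positives_lost']}"
    )
```

On these counts, CoT removes roughly 54% of Qwen-2.5:72B's false positives and 78% of Claude-3.5-Haiku's, at the price of 22 and 20 true positives respectively, which is why F1 improves even though sensitivity drops slightly.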


This review was generated by AI and checked by a human editor.