QUICK REVIEW

[Paper Review] Leveraging Large Language Models to Extract and Translate Medical Information in Doctors' Notes for Health Records and Diagnostic Billing Codes

Peter Hartnett, Chung-Chi Huang|arXiv (Cornell University)|Jan 14, 2026

Machine Learning in Healthcare | Citations: 0

ONE-SENTENCE SUMMARY

This paper presents an on-device, privacy-preserving LLM/RAG pipeline to extract clinical information from doctors’ notes and translate it into ICD-10-CM codes, assessing open-weight models and prompting strategies, and advocating a human-in-the-loop approach.

ABSTRACT

Physician burnout in the United States has reached critical levels, driven in part by the administrative burden of Electronic Health Record (EHR) documentation and complex diagnostic codes. To relieve this strain and maintain strict patient privacy, this thesis explores an on-device, offline automatic medical coding system. The work focuses on using open-weight Large Language Models (LLMs) to extract clinical information from physician notes and translate it into ICD-10-CM diagnostic codes without reliance on cloud-based services. A privacy-focused pipeline was developed using Ollama, LangChain, and containerized environments to evaluate multiple open-weight models, including Llama 3.2, Mistral, Phi, and DeepSeek, on consumer-grade hardware. Model performance was assessed for zero-shot, few-shot, and retrieval-augmented generation (RAG) prompting strategies using a novel benchmark of synthetic medical notes. Results show that strict JSON schema enforcement achieved near 100% formatting compliance, but accurate generation of specific diagnostic codes remains challenging for smaller local models (7B-20B parameters). Contrary to common prompt-engineering guidance, few-shot prompting degraded performance through overfitting and hallucinations. While RAG enabled limited discovery of unseen codes, it frequently saturated context windows, reducing overall accuracy. The findings suggest that fully automated unsupervised coding with local open-source models is not yet reliable; instead, a human-in-the-loop assisted coding approach is currently the most practical path forward. This work contributes a reproducible local LLM architecture and benchmark dataset for privacy-preserving medical information extraction and coding.

MOTIVATION AND OBJECTIVES

Reduce physician administrative burden by automating extraction of medical notes into structured data and billing codes while preserving patient privacy.
Evaluate open-weight LLMs for clinical coding tasks using zero-shot, few-shot, and RAG prompting on a local device.
Provide a reproducible benchmark with fictional notes to support testing, evaluation, and future research in private medical coding automation.
Enable flexible integration of evolving medical code databases for local, customizable coding workflows.

PROPOSED METHOD

Build a privacy-focused local LLM pipeline using Ollama, LangChain, and containerized environments for on-device inference.
Assess open-weight LLMs (7B–20B) with zero-shot, few-shot, and RAG prompting for medical coding tasks on synthetic notes.
Use a strict JSON output schema to enforce structured results and ease interoperability with EHRs.
Incorporate a Retrieval-Augmented Generation (RAG) layer to retrieve ICD-10-CM context from local documents and code databases.
Evaluate prompts and models on a private, feature-checked benchmark to analyze accuracy, consistency, and efficiency.
Propose a modular multi-agent extension to improve task specialization in future work.
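
The JSON schema step above can be illustrated with a minimal sketch. It is not the authors' implementation: the field names `diagnoses`, `description`, and `icd10cm_code` and the helper `validate_coding_output` are hypothetical, and the regular expression only checks the general ICD-10-CM shape (letter, digit, alphanumeric, optional decimal part). In the described pipeline the raw text would come from a local model; here it is a plain string.

```python
import json
import re

# Assumed ICD-10-CM shape: letter, digit, alphanumeric, then an optional
# decimal point followed by 1-4 alphanumerics (e.g. "I10", "E11.9").
ICD10CM_PATTERN = re.compile(r"[A-Z][0-9][0-9A-Z](\.[0-9A-Z]{1,4})?")


def validate_coding_output(raw_text):
    """Return (parsed_object_or_None, list_of_error_strings)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]

    if not isinstance(data, dict):
        return None, ["top-level value must be an object"]

    diagnoses = data.get("diagnoses")  # hypothetical field name
    if not isinstance(diagnoses, list):
        return data, ["'diagnoses' must be a list"]

    errors = []
    for i, item in enumerate(diagnoses):
        if not isinstance(item, dict):
            errors.append(f"diagnoses[{i}] must be an object")
            continue
        if not isinstance(item.get("description"), str):
            errors.append(f"diagnoses[{i}].description must be a string")
        code = item.get("icd10cm_code")
        # fullmatch checks the whole string, not just a prefix.
        if not isinstance(code, str) or not ICD10CM_PATTERN.fullmatch(code):
            errors.append(f"diagnoses[{i}].icd10cm_code is malformed: {code!r}")
    return data, errors


if __name__ == "__main__":
    sample = '{"diagnoses": [{"description": "Essential hypertension", "icd10cm_code": "I10"}]}'
    print(validate_coding_output(sample))
```

A check like this only confirms that the output is well formed. It cannot tell whether a well-formed code is the right one for the note, which matches the gap the paper reports between formatting compliance and coding accuracy.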

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can open-weight LLMs on consumer hardware accurately extract diagnoses and map them to ICD-10-CM codes from physician notes?
RQ2: How do zero-shot, few-shot, and RAG prompting strategies affect coding accuracy, output consistency, and context-window usage in small-to-mid-sized models?
RQ3: Does a privacy-preserving, on-device architecture with a flexible code database meet practical needs for automated medical coding, or is a human-in-the-loop approach preferable?
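
One way to make "coding accuracy" in these questions concrete is set-based scoring of predicted codes against the reference codes for each note. The sketch below is an assumption for illustration, not the metric reported in the paper; the function name `score_codes` and the choice of precision, recall, F1, and exact match are hypothetical.

```python
def score_codes(predicted, gold):
    """Compare predicted ICD-10-CM codes with reference codes for one note."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "exact_match": pred_set == gold_set,
    }


if __name__ == "__main__":
    # One predicted code is correct, one is wrong, one reference code is missed.
    print(score_codes(["I10", "E11.9"], ["I10", "J45.909"]))
```

Scoring by exact code match is strict: a prediction that lands in the right category but picks the wrong specific code counts as a miss, which is the kind of error the review describes for smaller models.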

KEY FINDINGS

Model	Size	Notes
Deepseek-r1	8b
Llama3.2	latest (8b)
Gpt-oss	20b
Gemma3.2	270m
Medllama2	7b
Meditron	7b
Mistral	7b
Phi4	14b

Strict JSON formatting can be achieved with near 100% compliance, but accurate code generation remains challenging for small models (7B–20B).
Few-shot prompting often degrades performance and can cause overfitting and hallucinations in this setup.
RAG improves identification of unseen codes but can saturate context windows, reducing accuracy in smaller architectures.
On-device, open-source pipelines show promise but are not yet reliable for full automation, suggesting a human-in-the-loop approach as a practical solution.
An open-source, reproducible local LLM architecture and benchmark dataset for extracting and translating medical information into diagnostic codes is provided.
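
The RAG finding above (retrieval helps surface codes but can fill the context window) can be illustrated with a toy retriever that stops adding snippets once a budget is reached. Everything here is an assumption for illustration: the four-entry `CODE_TABLE`, the keyword-overlap ranking, and the use of a word count as a crude stand-in for model tokens. The paper's pipeline uses LangChain with local documents, which this sketch does not reproduce.

```python
import re

# Toy local code table (hypothetical excerpt, not the paper's database).
CODE_TABLE = [
    ("I10", "Essential (primary) hypertension"),
    ("E11.9", "Type 2 diabetes mellitus without complications"),
    ("J45.909", "Unspecified asthma, uncomplicated"),
    ("J06.9", "Acute upper respiratory infection, unspecified"),
]


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_within_budget(note, table, max_words):
    """Rank entries by word overlap with the note, then keep the top
    entries until the next one would exceed max_words."""
    note_tokens = tokenize(note)
    scored = []
    for code, description in table:
        overlap = len(note_tokens & tokenize(description))
        if overlap:
            scored.append((overlap, code, description))
    # Highest overlap first; ties broken by code for a stable order.
    scored.sort(key=lambda entry: (-entry[0], entry[1]))

    selected, used = [], 0
    for _, code, description in scored:
        line = f"{code}: {description}"
        cost = len(line.split())  # crude proxy for token count
        if used + cost > max_words:
            break  # stop at the first snippet that does not fit
        selected.append(line)
        used += cost
    return selected


if __name__ == "__main__":
    note = "Patient with type 2 diabetes and hypertension, no complications."
    for line in retrieve_within_budget(note, CODE_TABLE, max_words=20):
        print(line)
```

Capping retrieved context this way trades recall for headroom: a tight budget leaves room for the note and the instructions but may drop the snippet that contained the correct code.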

This review was generated by AI and checked by a human editor.