QUICK REVIEW

[Paper Review] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Pietro Ferrazzi, Mattia Franzin|arXiv (Cornell University)|2026. 02. 19.

Topic Modeling | Citations: 0

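One-Line Summary

This paper systematically analyzes how small LLMs of around 1B parameters perform on Italian medical NLP tasks under few-shot prompting, constraint decoding, supervised fine-tuning, and continual pre-training. It shows that fine-tuning generally delivers the best performance, while inference-time methods offer a strong option in low-resource settings.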

ABSTRACT

Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

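Research Motivation and Objectives

Assess whether small LLMs of around 1B parameters can match or surpass larger baselines on medical NLP in Italian.
Systematically compare inference-time strategies (few-shot prompting, constraint decoding) with training-time strategies (supervised fine-tuning, continual pre-training).
Identify the most effective combination of adaptation methods across a range of clinical NLP tasks in Italian.
Release the Italian medical NLP datasets and the top-performing models to support reproducibility and further research.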

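Proposed Method

Evaluate small LLM families (Llama-3, Gemma-3, Qwen3) across 20 clinical NLP tasks (NER, RE, CRF, QA, Argument Mining).
Evaluate zero-shot and few-shot prompting with structured prompts, using constraint decoding to enforce the output format.
Fine-tune the models on the training data via supervised instruction tuning with LoRA.
Apply continual pre-training on a large Italian medical corpus (scientific and clinical text) before fine-tuning.
Compare against larger baselines (Qwen3-32B, MedGemma-27B) to quantify how the small models fare.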
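
For readers unfamiliar with the LoRA step in the list above, the standard LoRA parameterization can be written as follows. This is the general formulation, not something taken from this review: the symbols r (rank), alpha (scaling), and the matrix shapes are assumptions for illustration, and the review does not state which values the authors used.

```latex
% Standard LoRA update for a frozen weight matrix W_0 (general formulation;
% r, \alpha, d, k are illustrative symbols, not values reported in this review).
h = W_0 x + \Delta W\, x = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
% Only A and B are trained: r(d + k) parameters instead of d k.
```

Because only A and B are updated, the number of trainable parameters per adapted matrix is r(d + k) rather than dk, which is what makes this kind of fine-tuning cheap for models of this size.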

Figure 1: Average performances of 1B LLMs on the 14 medical sub-tasks when different methods are applied at both inference (left) and training (right) time. Exposing models to Fine-Tuning (FT) turns out to be the most effective approach overall, consistently outperforming the baseline (Qwen3-3

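Experimental Results

Research Questions

RQ1: Can small LLMs of around 1B parameters achieve competitive performance on Italian medical NLP tasks compared with larger models?
RQ2: What is the relative impact of few-shot prompting, constraint decoding, fine-tuning, and continual pre-training on the performance of small LLMs?
RQ3: Which combination of inference-time and training-time adaptation works best for overall performance and for generalization to out-of-distribution data?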

Main Results

Model	Method	NER	CRF	RE	QA	ARG	Avg	Δ vs. baseline	p-value
medgemma-27b	4-shot + CD	52.4	62.2	8.9	83.2	62.6	53.8	-	-
Qwen3-32B	4-shot	49.7	63.3	11.5	82.8	66.3	54.7	-	-
gemma-3-1b-pt 1.00B	CPT, FT	48.0	42.5	5.6	19.2	57.5	34.6	-20.1	*
gemma-3-1b-it 1.00B	CPT, FT	53.6	65.8	47.2	61.8	68.7	59.4	+4.7	*
Llama-3.2-1B	CPT, FT	47.8	34.6	20.8	37.6	74.8	43.1	-11.6	*
Llama-3.2-1B-Instruct	CPT, FT	59.7	64.7	27.5	43.2	73.8	53.8	-0.9	***
Qwen3-1.7B-Base	CPT, FT	59.4	70.3	18.3	52.8	74.0	55.0	+0.3	*
Qwen3-1.7B	FT	61.5	67.4	25.3	57.6	77.6	57.9	+3.2	**
Qwen3-1.7B	CPT, FT	61.5	67.4	25.3	57.6	77.6	57.9	+3.2	**
Llama-3.2-1B-Instruct	FT	62.2	73.7	33.2	60.8	79.4	61.9	+7.2	***
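
The Avg and delta columns of the table above can be rechecked with a short script. This is a minimal sketch, assuming that Avg is the unweighted mean of the five task scores and that the delta is taken against the Qwen3-32B 4-shot row; the review does not state either, but both are consistent with the printed numbers (for example 59.4 - 54.7 = 4.7). The function name `summarize` is hypothetical.

```python
# Rows copied verbatim from the results table:
# (model, method, NER, CRF, RE, QA, ARG)
ROWS = [
    ("medgemma-27b", "4-shot + CD", 52.4, 62.2, 8.9, 83.2, 62.6),
    ("Qwen3-32B", "4-shot", 49.7, 63.3, 11.5, 82.8, 66.3),
    ("gemma-3-1b-pt 1.00B", "CPT, FT", 48.0, 42.5, 5.6, 19.2, 57.5),
    ("gemma-3-1b-it 1.00B", "CPT, FT", 53.6, 65.8, 47.2, 61.8, 68.7),
    ("Llama-3.2-1B", "CPT, FT", 47.8, 34.6, 20.8, 37.6, 74.8),
    ("Llama-3.2-1B-Instruct", "CPT, FT", 59.7, 64.7, 27.5, 43.2, 73.8),
    ("Qwen3-1.7B-Base", "CPT, FT", 59.4, 70.3, 18.3, 52.8, 74.0),
    ("Qwen3-1.7B", "FT", 61.5, 67.4, 25.3, 57.6, 77.6),
    ("Qwen3-1.7B", "CPT, FT", 61.5, 67.4, 25.3, 57.6, 77.6),
    ("Llama-3.2-1B-Instruct", "FT", 62.2, 73.7, 33.2, 60.8, 79.4),
]

# Assumption: the delta column is measured against this row.
BASELINE = ("Qwen3-32B", "4-shot")


def summarize(rows, baseline=BASELINE):
    """Return {(model, method): (avg, delta)} with one-decimal rounding.

    The delta is computed from the rounded averages, which is what
    reproduces the values printed in the table.
    """
    averages = {
        (model, method): round(sum(scores) / len(scores), 1)
        for model, method, *scores in rows
    }
    base_avg = averages[baseline]
    return {
        key: (avg, round(avg - base_avg, 1))
        for key, avg in averages.items()
    }


if __name__ == "__main__":
    for (model, method), (avg, delta) in summarize(ROWS).items():
        print(f"{model:24s} {method:12s} avg={avg:5.1f} delta={delta:+5.1f}")
```

Under these assumptions the recomputed averages agree with the table to within 0.1 (the medgemma-27b row comes out at 53.9 rather than 53.8, presumably because the published averages were computed from unrounded task scores), and the deltas for the small models match the printed column.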

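Fine-tuning (FT) consistently delivers the largest performance gains across models and tasks.
Continual pre-training (CPT) brings limited, case-dependent benefits, with a notable improvement for gemma-3-1b-it.
Few-shot prompting (4-shot) improves average performance but increases inference time; constraint decoding (CD) alone has little effect, but combining 4-shot with CD is beneficial.
Among training-time strategies, FT alone generally outperforms CPT+FT, although CPT+FT can match the larger baselines in some cases.
The best small model (Qwen3-1.7B + FT) exceeds Qwen3-32B 4-shot by +9.2 points on average; 5 of the 6 small LLMs outperform the larger baselines on in-distribution data.
The gains from training are smaller on out-of-distribution data than on in-distribution data, suggesting that task- or dataset-specific tuning is still needed.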

Figure 2: Impact on inference time of using 4-shot and Constraint Decoding (CD) settings. While 4-shot significantly increases the time required to run the inference, CD does not. The average is calculated among 5 models and 14 subtasks, using the vLLM and outlines libraries for model serving.
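
The two inference-time settings compared in Figure 2 can be illustrated with a toy sketch. It is not the authors' pipeline and does not use vLLM or outlines: the prompt template, the example texts, and the token scores below are all hypothetical, and a real system would mask logits over a tokenizer vocabulary according to a schema or grammar rather than over a small dictionary.

```python
import math


def build_few_shot_prompt(instruction, examples, query):
    """Prepend k solved examples (k = len(examples)) to the query."""
    parts = [instruction]
    for text, label in examples:
        parts.append(f"Input: {text}\nOutput: {label}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def constrained_greedy_pick(logits, allowed):
    """One greedy decoding step restricted to the allowed tokens.

    Tokens outside `allowed` get a score of -inf, so they can never win
    the argmax. Nothing is added to the prompt.
    """
    if not allowed & logits.keys():
        raise ValueError("no allowed token is present in the logits")
    masked = {
        token: (score if token in allowed else -math.inf)
        for token, score in logits.items()
    }
    return max(masked, key=masked.get)


# Hypothetical data, for illustration only.
INSTRUCTION = "Answer with yes or no."
EXAMPLES = [
    ("example note 1", "yes"),
    ("example note 2", "no"),
    ("example note 3", "no"),
    ("example note 4", "yes"),
]
QUERY = "new note"
SCORES = {"maybe": 2.1, "yes": 1.4, "no": 0.3}  # made-up next-token scores

zero_shot = build_few_shot_prompt(INSTRUCTION, [], QUERY)
four_shot = build_few_shot_prompt(INSTRUCTION, EXAMPLES, QUERY)

unconstrained = max(SCORES, key=SCORES.get)                 # "maybe"
constrained = constrained_greedy_pick(SCORES, {"yes", "no"})  # "yes"
```

In this sketch the 4-shot prompt is strictly longer than the zero-shot one, so the model has more input to process, whereas the constraint only filters candidates at each step. That is one plausible reading of why the caption reports a large slowdown for 4-shot but not for CD; the review itself does not give the explanation.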

This review was generated by AI and reviewed by a human editor.