QUICK REVIEW

[Paper Review] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Pietro Ferrazzi, Mattia Franzin|arXiv (Cornell University)|Feb 19, 2026

Topic Modeling | Citations: 0

One-Sentence Summary

This paper systematically analyzes how small LLMs (~1B params) can tackle Italian medical NLP tasks using few-shot prompting, constraint decoding, supervised fine-tuning, and continual pre-training, showing fine-tuning generally yields the best results while inference-time methods offer strong low-resource options.

ABSTRACT

Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

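Motivation and Objectives

Evaluate whether small LLMs (around 1B parameters) can match or surpass larger baselines on Italian medical NLP tasks.
Systematically compare inference-time strategies (few-shot prompting, constraint decoding) with training-time strategies (supervised fine-tuning, continual pre-training).
Identify the most effective combination of adaptation methods across a diverse set of Italian clinical NLP tasks.
Release the Italian medical NLP datasets and the top-performing models to support reproducibility and future research.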

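Proposed Method

Evaluate small LLM families (Llama-3, Gemma-3, Qwen3) across 20 clinical NLP tasks (NER, RE, CRF, QA, Argument Mining).
Evaluate zero-shot and few-shot prompting with structured prompts and constraint decoding to enforce the output format.
Fine-tune the models on the training data with supervised fine-tuning (LoRA) so that they follow the task instructions.
Apply continual pre-training on a large Italian medical corpus (scientific and clinical) before fine-tuning.
Compare against larger baselines (Qwen3-32B, MedGemma-27B) to quantify the benefits of the small models.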
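
To make the idea of constraint decoding concrete, here is a minimal, library-free sketch. It is not the paper's implementation (the Figure 2 caption states that the authors serve models with vLLM and outlines); it only illustrates the general principle of masking out continuations that would break the required output format. The label set, the character-level "vocabulary", and the `toy_score` function are all hypothetical stand-ins for a real tokenizer and model.

```python
import math

# Hypothetical label set; not taken from the paper. No label is a prefix
# of another, which keeps the stopping rule below simple.
ALLOWED_OUTPUTS = ["DISEASE", "DRUG", "SYMPTOM"]

# Toy character-level vocabulary: every character used by the labels,
# plus a few characters that never appear in a valid output.
VOCAB = sorted(set("".join(ALLOWED_OUTPUTS)) | set("xyz"))


def allowed_next_chars(prefix, allowed_outputs):
    """Characters that keep `prefix` a valid prefix of some allowed output."""
    return {
        s[len(prefix)]
        for s in allowed_outputs
        if s.startswith(prefix) and len(s) > len(prefix)
    }


def toy_score(prefix, char):
    """Stand-in for a model's next-token logit (assumption, not a real model).

    This toy "model" prefers 'x', which would produce an invalid output.
    """
    preferences = {"x": 5.0, "D": 3.0, "R": 2.0, "I": 1.0}
    return preferences.get(char, 0.0)


def constrained_greedy_decode(score_fn, vocab, allowed_outputs):
    """Greedy decoding where disallowed continuations are masked to -inf."""
    prefix = ""
    while prefix not in allowed_outputs:
        allowed = allowed_next_chars(prefix, allowed_outputs)
        masked = {
            tok: (score_fn(prefix, tok) if tok in allowed else -math.inf)
            for tok in vocab
        }
        prefix += max(masked, key=masked.get)
    return prefix


unconstrained_first = max(VOCAB, key=lambda tok: toy_score("", tok))
constrained_output = constrained_greedy_decode(toy_score, VOCAB, ALLOWED_OUTPUTS)
print(unconstrained_first, constrained_output)
```

Without the mask, the toy model's first greedy choice is an invalid character; with the mask, the decoder can only ever emit one of the allowed labels. Real systems apply the same masking idea over subword tokens and richer grammars such as JSON schemas.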

Figure 1: Average performances of 1B LLMs on the 14 medical sub-tasks when different methods are applied at both inference (left) and training (right) time. Exposing models to Fine-Tuning (FT) turns out to be the most effective approach overall, consistently outperforming the baseline (Qwen3-3

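Experimental Results

Research Questions

RQ1: Can small LLMs (around 1B parameters) compete with larger models on Italian medical NLP tasks?
RQ2: What is the relative impact of few-shot prompting, constraint decoding, fine-tuning, and continual pre-training on small-LLM performance?
RQ3: Which combination of inference-time and training-time adaptations achieves the best overall performance and generalization to out-of-distribution data?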

Key Findings

Model	Method	NER	CRF	RE	QA	ARG	AVG	Δ vs. Qwen3-32B (4-shot)	p-value
medgemma-27b	4-shot + CD	52.4	62.2	8.9	83.2	62.6	53.8	-	-
Qwen3-32B	4-shot	49.7	63.3	11.5	82.8	66.3	54.7	-	-
gemma-3-1b-pt 1.00B	CPT, FT	48.0	42.5	5.6	19.2	57.5	34.6	-20.1	*
gemma-3-1b-it 1.00B	CPT, FT	53.6	65.8	47.2	61.8	68.7	59.4	+4.7	*
Llama-3.2-1B	CPT, FT	47.8	34.6	20.8	37.6	74.8	43.1	-11.6	*
Llama-3.2-1B-Instruct	CPT, FT	59.7	64.7	27.5	43.2	73.8	53.8	-0.9	***
Qwen3-1.7B-Base	CPT, FT	59.4	70.3	18.3	52.8	74.0	55.0	+0.3	*
Qwen3-1.7B	FT	61.5	67.4	25.3	57.6	77.6	57.9	+3.2	**
Qwen3-1.7B	CPT, FT	61.5	67.4	25.3	57.6	77.6	57.9	+3.2	**
Llama-3.2-1B-Instruct	FT	62.2	73.7	33.2	60.8	79.4	61.9	+7.2	***
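
The AVG and delta columns in the table above can be checked with a short script. This sketch assumes that AVG is the unweighted mean of the five task-category scores and that the delta is measured against the Qwen3-32B 4-shot row; both are inferences from the numbers, not statements from the paper. Only a few rows are copied here, and the paper may compute its averages from unrounded scores, so small rounding differences on other rows are possible.

```python
# Per-category scores copied from the table above: (NER, CRF, RE, QA, ARG).
ROWS = {
    ("Qwen3-32B", "4-shot"): (49.7, 63.3, 11.5, 82.8, 66.3),
    ("gemma-3-1b-it", "CPT, FT"): (53.6, 65.8, 47.2, 61.8, 68.7),
    ("Qwen3-1.7B", "FT"): (61.5, 67.4, 25.3, 57.6, 77.6),
    ("Llama-3.2-1B-Instruct", "FT"): (62.2, 73.7, 33.2, 60.8, 79.4),
}
BASELINE = ("Qwen3-32B", "4-shot")


def macro_average(scores):
    """Unweighted mean over task categories, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


def delta_vs_baseline(rows, baseline_key):
    """Difference between each row's average and the baseline's average."""
    base = macro_average(rows[baseline_key])
    return {
        key: round(macro_average(scores) - base, 1)
        for key, scores in rows.items()
        if key != baseline_key
    }


deltas = delta_vs_baseline(ROWS, BASELINE)
for (model, method), delta in deltas.items():
    print(f"{model} ({method}): AVG={macro_average(ROWS[(model, method)])}, delta={delta:+.1f}")
```

For the rows included, the computed averages and deltas agree with the values listed in the table.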

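Fine-tuning (FT) consistently yields the largest performance gains across models and tasks.
Continual pre-training (CPT) provides limited or case-specific benefits, with a notable improvement seen mainly for gemma-3-1b-it.
Few-shot prompting (4-shot) improves average performance but increases inference time. Constraint decoding (CD) alone has little effect, but combining 4-shot with CD is beneficial.
Among training-time strategies, FT alone usually outperforms CPT+FT, although CPT+FT matches the larger baselines in some cases.
The best small model (Qwen3-1.7B + FT) outperforms Qwen3-32B (4-shot) by +9.2 points on average. Five out of five small LLMs surpass the larger baseline on in-distribution data.
Gains from training are smaller on out-of-distribution data than on in-distribution data, indicating a continued need for task- or dataset-specific adaptation.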

Figure 2: Impact on inference time of using 4-shot and Constraint Decoding (CD) settings. While 4-shot significantly increases the time required to run the inference, CD does not. The average is calculated among 5 models and 14 subtasks, using the vLLM and outlines libraries for model serving.
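
One plausible reason 4-shot prompting costs more at inference time is simply that the prompt gets longer; this is an assumption, since the caption reports the slowdown without naming its cause. The sketch below builds a k-shot prompt and compares a crude length proxy for 0-shot and 4-shot. The instruction text, the prompt template, and the Italian demonstration sentences are invented for illustration and do not come from the paper's datasets.

```python
# Hypothetical instruction and demonstrations; invented for illustration.
INSTRUCTION = "Extract the symptoms mentioned in the clinical note."
DEMONSTRATIONS = [
    ("Il paziente riferisce febbre e tosse.", "febbre; tosse"),
    ("Lamenta dolore toracico da due giorni.", "dolore toracico"),
    ("Presenta nausea e vomito dopo i pasti.", "nausea; vomito"),
    ("Riferisce cefalea persistente.", "cefalea"),
]
QUERY = "La paziente riporta vertigini e astenia."


def build_prompt(instruction, demonstrations, query):
    """Concatenate an instruction, k worked examples, and the final query."""
    parts = [instruction]
    for text, answer in demonstrations:
        parts.append(f"Input: {text}\nOutput: {answer}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def approx_token_count(prompt):
    """Whitespace split; a crude proxy for a real tokenizer's token count."""
    return len(prompt.split())


zero_shot = build_prompt(INSTRUCTION, [], QUERY)
four_shot = build_prompt(INSTRUCTION, DEMONSTRATIONS, QUERY)
print(approx_token_count(zero_shot), approx_token_count(four_shot))
```

Every demonstration is re-processed for every query unless the serving stack caches the shared prefix, so prompt length grows with the number of shots while the constrained output format does not change the input length at all.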

This review was generated by AI and checked by a human editor.