QUICK REVIEW

[Paper Review] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Pietro Ferrazzi, Mattia Franzin|arXiv (Cornell University)|Feb 19, 2026
Topic Modeling | Cited by 0
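One-Sentence Summary

This paper systematically analyzes how small LLMs (~1B parameters) handle Italian medical NLP tasks through few-shot prompting, constraint decoding, supervised fine-tuning, and continual pre-training. Fine-tuning generally yields the best performance, while inference-time methods offer a strong option for low-resource settings.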

ABSTRACT

Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score +9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

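Motivation and Objectives

  • Assess whether small LLMs (~1B parameters) can match or surpass larger baseline models on Italian medical NLP tasks.
  • Systematically compare the relative impact of inference-time strategies (few-shot prompting, constraint decoding) and training-time strategies (supervised fine-tuning, continual pre-training).
  • Identify the most effective combination of adaptation strategies across diverse Italian clinical NLP tasks.
  • Release Italian medical NLP datasets and the top-performing models to support reproducibility and further research.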

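Proposed Method

  • Evaluate small LLM families (Llama-3, Gemma-3, Qwen3) on 20 clinical NLP tasks spanning Named Entity Recognition (NER), Relation Extraction (RE), Case Report Form Filling (CRF), Question Answering (QA), and Argument Mining (ARG).
  • Evaluate zero-shot and few-shot prompting, using structured prompts and constraint decoding to enforce the output format.
  • Apply supervised instruction-following fine-tuning with LoRA on the training data.
  • Apply continual pre-training on a large Italian medical corpus (scientific and clinical text) before fine-tuning.
  • Compare against larger baselines (Qwen3-32B, MedGemma-27B) to quantify the gains of small models.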
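The LoRA step listed above can be illustrated with a minimal pure-Python sketch. It assumes the standard LoRA formulation, in which a frozen weight matrix W is combined with a scaled low-rank product B·A and only A and B are trained. The matrix sizes, the rank, the scaling value, and the helper names `lora_effective_weight` and `lora_param_count` are hypothetical; this review does not report the paper's actual LoRA settings.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank.

    W is d_out x d_in (frozen); B is d_out x r and A is r x d_in (trainable).
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


def lora_param_count(d_out, d_in, r):
    """Trainable parameters of one LoRA adapter versus the full matrix."""
    return {"full": d_out * d_in, "lora": r * (d_out + d_in)}


# Toy example: a 2x3 frozen weight with a rank-1 adapter (made-up values).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]   # r x d_in
B = [[0.5], [0.0]]      # d_out x r
W_eff = lora_effective_weight(W, A, B, alpha=2.0)

# Hypothetical layer size, chosen only to show the scale of the saving.
counts = lora_param_count(d_out=2048, d_in=2048, r=8)
```

For the hypothetical 2048 x 2048 layer, the adapter trains 32,768 parameters instead of 4,194,304, which is the kind of saving that makes adapter-based fine-tuning attractive for small-compute settings.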
Figure 1: Average performances of 1B LLMs on the 14 medical sub-tasks when different methods are applied at both inference ( left ) and training ( right ) time. Exposing models to Fine-Tuning ( FT ) turns out to be the most effective approach overall, consistently outperforming the baseline (Qwen3-3

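Experimental Results

Research Questions

  • RQ1: Can small LLMs (~1B parameters) achieve performance competitive with larger models on Italian medical NLP tasks?
  • RQ2: What is the relative impact of few-shot prompting, constraint decoding, fine-tuning, and continual pre-training on small-LLM performance?
  • RQ3: Which combination of inference-time and training-time adaptation performs best in overall performance and in generalization to out-of-distribution data?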

Key Findings

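| Model | Method | NER | CRF | RE | QA | ARG | AVG | Δ vs. baseline | p-val |
|---|---|---|---|---|---|---|---|---|---|
| medgemma-27b | 4-shot + CD | 52.4 | 62.2 | 8.9 | 83.2 | 62.6 | 53.8 | - | - |
| Qwen3-32B | 4-shot | 49.7 | 63.3 | 11.5 | 82.8 | 66.3 | 54.7 | - | - |
| gemma-3-1b-pt (1.00B) | CPT, FT | 48.0 | 42.5 | 5.6 | 19.2 | 57.5 | 34.6 | -20.1 | * |
| gemma-3-1b-it (1.00B) | CPT, FT | 53.6 | 65.8 | 47.2 | 61.8 | 68.7 | 59.4 | +4.7 | * |
| Llama-3.2-1B | CPT, FT | 47.8 | 34.6 | 20.8 | 37.6 | 74.8 | 43.1 | -11.6 | * |
| Llama-3.2-1B-Instruct | CPT, FT | 59.7 | 64.7 | 27.5 | 43.2 | 73.8 | 53.8 | -0.9 | *** |
| Qwen3-1.7B-Base | CPT, FT | 59.4 | 70.3 | 18.3 | 52.8 | 74.0 | 55.0 | +0.3 | * |
| Qwen3-1.7B | FT | 61.5 | 67.4 | 25.3 | 57.6 | 77.6 | 57.9 | +3.2 | ** |
| Qwen3-1.7B | CPT, FT | 61.5 | 67.4 | 25.3 | 57.6 | 77.6 | 57.9 | +3.2 | ** |
| Llama-3.2-1B-Instruct | FT | 62.2 | 73.7 | 33.2 | 60.8 | 79.4 | 61.9 | +7.2 | *** |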
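The baseline-delta column in the table above is consistent with subtracting the Qwen3-32B 4-shot average (54.7) from each small model's average. The short check below is a sketch under that assumption, since this review does not name the reference row explicitly. The AVG and delta values are copied from the table; the names `BASELINE_AVG`, `reported`, and `delta_vs_baseline` are introduced here for illustration.

```python
# Assumption: the reference is the Qwen3-32B (4-shot) average from the table.
BASELINE_AVG = 54.7

# (model, method) -> (reported AVG, reported delta), copied from the table.
reported = {
    ("gemma-3-1b-pt", "CPT, FT"): (34.6, -20.1),
    ("gemma-3-1b-it", "CPT, FT"): (59.4, 4.7),
    ("Llama-3.2-1B", "CPT, FT"): (43.1, -11.6),
    ("Llama-3.2-1B-Instruct", "CPT, FT"): (53.8, -0.9),
    ("Qwen3-1.7B-Base", "CPT, FT"): (55.0, 0.3),
    ("Qwen3-1.7B", "FT"): (57.9, 3.2),
    ("Qwen3-1.7B", "CPT, FT"): (57.9, 3.2),
    ("Llama-3.2-1B-Instruct", "FT"): (61.9, 7.2),
}


def delta_vs_baseline(avg, baseline=BASELINE_AVG):
    """Difference between a model's average score and the baseline average."""
    return round(avg - baseline, 1)


# Rows whose reported delta is not reproduced under the assumption above.
mismatches = [
    key for key, (avg, delta) in reported.items()
    if delta_vs_baseline(avg) != delta
]
```

Under this assumption, `mismatches` comes out empty, meaning every reported delta in the table is reproduced from its AVG value.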
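  • Fine-tuning (FT) consistently yields the strongest performance gains across models and tasks.
  • Continual pre-training (CPT) provides limited gains, with a notable improvement only in isolated cases, chiefly gemma-3-1b-it.
  • Few-shot prompting (4-shot) raises average performance but increases inference time; constraint decoding (CD) alone has little effect, but combining 4-shot with CD is beneficial.
  • Among training-time strategies, FT often outperforms CPT+FT, while CPT+FT in some cases reaches the level of the larger baselines.
  • The best small model (Qwen3-1.7B + FT) scores +9.2 points higher on average than Qwen3-32B 4-shot; five of the six small LLMs surpass the larger baseline on in-distribution data.
  • Training gains on out-of-distribution data are smaller than on in-distribution data, indicating that task- or dataset-specific adaptation is still needed.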
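The combination of 4-shot prompting and constraint decoding noted above can be illustrated with a toy sketch. This is not the authors' pipeline: the label set, the example sentences, the scoring function `toy_scores`, and all helper names are hypothetical. It only shows the core idea of constraint decoding, which is to mask every continuation that would leave the allowed output format before the next token is picked. Characters stand in for tokens to keep the example small.

```python
ALLOWED_OUTPUTS = ["DRUG", "DISEASE", "NONE"]  # hypothetical label set


def build_few_shot_prompt(examples, query):
    """Concatenate k solved examples before the query (k = 4 gives 4-shot)."""
    shots = "\n".join(f"Text: {text}\nLabel: {label}" for text, label in examples)
    return f"{shots}\nText: {query}\nLabel:"


def allowed_next_chars(prefix, allowed=ALLOWED_OUTPUTS):
    """Characters that keep `prefix` a prefix of at least one allowed output."""
    return {s[len(prefix)] for s in allowed
            if s.startswith(prefix) and len(s) > len(prefix)}


def constrained_greedy_decode(score_fn, allowed=ALLOWED_OUTPUTS):
    """Greedy decoding with disallowed characters masked at every step.

    Assumes no allowed output is a prefix of another allowed output.
    """
    out = ""
    while out not in allowed:
        candidates = sorted(allowed_next_chars(out, allowed))
        scores = score_fn(out)
        out += max(candidates, key=lambda c: scores.get(c, float("-inf")))
    return out


def toy_scores(prefix):
    """Stand-in for model scores: prefers free text 'diabetes' over 'DISEASE'."""
    scores = {}
    for text, score in (("DISEASE", 1.0), ("diabetes", 2.0)):
        if len(prefix) < len(text):
            scores[text[len(prefix)]] = score
    return scores


# Made-up Italian examples, used only to show the prompt layout.
examples = [
    ("Paziente in terapia con metformina.", "DRUG"),
    ("Diagnosi di polmonite.", "DISEASE"),
    ("Il paziente riferisce di stare bene.", "NONE"),
    ("Prescritto paracetamolo.", "DRUG"),
]
prompt = build_few_shot_prompt(examples, "Storia di diabete mellito.")

first_scores = toy_scores("")
unconstrained_first = max(first_scores, key=first_scores.get)  # off-format choice
label = constrained_greedy_decode(toy_scores)
```

In this toy setup the unconstrained first choice is the lowercase character "d", which already violates the label format, while the constrained decoder is forced onto the valid label "DISEASE". Production libraries apply the same masking over real token vocabularies and richer grammars such as JSON schemas.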
Figure 2: Impact on inference time of using 4-shot and Constraint Decoding ( CD ) settings. While 4-shot significantly increases the time required to run the inference, CD does not. The average is calculated among 5 models and 14 subtasks, using the vLLM and outlines libraries for model serving.


This review was generated by AI and reviewed by a human editor.