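[Paper Summary] Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks
This paper introduces Meerkat-7B, an open-source 7B medical language model trained on chain-of-thought (CoT) data derived from textbooks. It reaches the USMLE passing level and outperforms several open 7B models.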
While recent advancements in commercial large language models (LM) have shown promising results in medical tasks, their closed-source nature poses significant privacy and security concerns, hindering their widespread use in the medical field. Despite efforts to create open-source models, their limited parameters often result in insufficient multi-step reasoning capabilities required for solving complex medical problems. To address this, we introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters. The models were trained using our new synthetic dataset consisting of high-quality chain-of-thought reasoning paths sourced from 18 medical textbooks, along with diverse instruction-following datasets. Our systems achieved remarkable accuracy across six medical benchmarks, surpassing the previous best models such as MediTron and BioMistral, and GPT-3.5 by a large margin. Notably, Meerkat-7B surpassed the passing threshold of the United States Medical Licensing Examination (USMLE) for the first time for a 7B-parameter model, while Meerkat-70B outperformed GPT-4 by an average of 1.3%. Additionally, Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming humans' 13.8 and closely matching GPT-4's 21.8. Our systems offered more detailed free-form responses to clinical queries compared to existing small models, approaching the performance level of large commercial models. This significantly narrows the performance gap with large LMs, showcasing its effectiveness in addressing complex medical challenges.
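Motivation and Goals
- Encourage safer, privacy-preserving medical AI that can be used without exposing data to closed-source systems
- Develop an open-source 7B model with enhanced multi-step medical reasoning
- Show that CoT fine-tuning and textbook-based augmentation improve medical QA performance
- Demonstrate that reasoning learned on USMLE-style tasks transfers to real-world clinical questions
Proposed Method
- Fine-tune the Mistral-7B backbone on instruction-following data
- Use GPT-4 to generate 9.3K CoT examples from MedQA and 78K CoT examples from 18 medical textbooks
- Build MedBooks-CoT-18, which contains textbook-derived QA pairs together with their CoT paths
- Augment training with instruction-following datasets covering diverse medical use cases
- Train with next-token prediction for three epochs on eight 80 GB A100 GPUs
- Evaluate on multiple medical benchmarks and run ablations over the CoT data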
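As a rough illustration of the data preparation step above, the sketch below shows one way a multiple-choice question with a CoT rationale could be rendered into a single training string for next-token prediction. The `CoTExample` schema, the prompt wording, and the function names are all hypothetical; the summary does not specify the template the authors used, and the sample item is a toy example rather than an entry from MedQA or MedBooks-CoT-18.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CoTExample:
    """One multiple-choice question with a step-by-step rationale (hypothetical schema)."""
    question: str
    options: dict[str, str]  # option key -> option text, e.g. {"A": "...", "B": "..."}
    rationale: str           # chain-of-thought text, e.g. produced by a teacher model
    answer: str              # key of the correct option


def build_prompt(ex: CoTExample) -> str:
    """Render the question and its options: the part the model conditions on."""
    option_lines = "\n".join(f"({key}) {text}" for key, text in sorted(ex.options.items()))
    return f"Question: {ex.question}\n{option_lines}\nAnswer with step-by-step reasoning:\n"


def build_target(ex: CoTExample) -> str:
    """Render the rationale followed by the final answer: the part the model learns to generate."""
    return f"{ex.rationale}\nTherefore, the answer is ({ex.answer})."


def to_training_text(ex: CoTExample) -> tuple[str, int]:
    """Return the full training string and the character offset where the target begins.

    The offset lets a training pipeline restrict the loss to the target span if desired
    (an assumption here; the summary does not say whether prompt tokens were masked).
    """
    prompt = build_prompt(ex)
    return prompt + build_target(ex), len(prompt)


# Toy example, not taken from any dataset mentioned in the paper.
example = CoTExample(
    question="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin A", "B": "Vitamin C"},
    rationale="Scurvy results from impaired collagen synthesis, which requires vitamin C.",
    answer="B",
)
training_text, target_start = to_training_text(example)
```

Keeping the prompt and target as separate functions makes it easy to reuse `build_prompt` at inference time, so the model sees the same layout it was trained on.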
![Figure 1: Overview of recent advances in language models (LM) based on their performance on the MedQA benchmark [ 28 ] . Large closed-source models have surpassed the USMLE passing threshold, reaching a state-of-the-art performance with 90% accuracy [ 8 ] . On the other hand, the previous best open-](https://ar5iv.labs.arxiv.org/html/2404.00376/assets/figures/overview_final.png)
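Experimental Results
Research Questions
- RQ1: How does CoT fine-tuning affect performance on medical QA benchmarks compared with training on QA data alone?
- RQ2: Does augmenting the training data with textbook-derived CoT paths improve performance beyond CoT fine-tuning itself?
- RQ3: Can an open-source 7B model exceed the USMLE passing threshold and outperform larger open-source models on standard medical benchmarks?
- RQ4: How do Meerkat-7B's explanations (measured by ROUGE-L, BERTScore, and GPT-4 scores) compare with human explanations and with those of larger language models?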
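For readers unfamiliar with ROUGE-L, mentioned in RQ4, the following is a minimal sketch of the metric: it scores a candidate explanation against a reference by the length of their longest common subsequence (LCS) of tokens. Lowercasing, whitespace tokenization, and the equally weighted F1 are simplifying assumptions made here; the summary does not state which ROUGE configuration the authors used.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 between two strings, using lowercased whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens that lie on the LCS
    recall = lcs / len(ref)      # share of reference tokens that lie on the LCS
    return 2 * precision * recall / (precision + recall)
```

Because ROUGE-L only rewards token overlap in order, it says little about factual correctness, which is presumably why the evaluation also reports BERTScore and GPT-4 scores.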
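Key Findings
- Meerkat-7B reaches an average accuracy of 64.2% across seven medical benchmarks, surpassing GPT-3.5 by 13.1%, MediTron-7B by 13.4%, and BioMistral-7B by 9.8%.
- Meerkat-7B scores 74.3% on MedQA and 71.4% on the USMLE sample test, exceeding the USMLE passing threshold for the first time for a 7B model.
- Meerkat-7B outperforms MediTron-7B and BioMistral-7B on USMLE-style tasks and is comparable to GPT-3.5 on free-text clinical responses.
- Ablations show that CoT fine-tuning raises MedQA performance by 7.5% on average across all models, and adding the MedBooks-CoT-18 data improves accuracy by a further 5.4%.
- Meerkat-7B's explanations correlate with answer correctness. ROUGE-L and BERTScore favor Meerkat-7B, while GPT-4 receives the highest overall rating.
- Meerkat-7B gives more detailed free-text answers to clinical queries while maintaining factuality comparable to GPT-3.5.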
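The margins reported above can be turned into implied baseline averages with simple arithmetic. The sketch below assumes the gaps are absolute percentage-point differences in the seven-benchmark average, which the summary does not state explicitly; the derived values are therefore an inference, not figures quoted from the paper. The `macro_average` helper shows how an unweighted average over benchmarks would be computed.

```python
# Reported in the findings above: average accuracy (%) and margins over each baseline.
MEERKAT_7B_AVG = 64.2
REPORTED_MARGINS = {"GPT-3.5": 13.1, "MediTron-7B": 13.4, "BioMistral-7B": 9.8}


def implied_baseline_averages(model_avg: float, margins: dict) -> dict:
    """Subtract each margin from the model average (assumes absolute percentage points)."""
    return {name: round(model_avg - gap, 1) for name, gap in margins.items()}


def macro_average(per_benchmark: dict) -> float:
    """Unweighted mean over benchmarks: every benchmark counts equally, whatever its size."""
    if not per_benchmark:
        raise ValueError("need at least one benchmark score")
    return sum(per_benchmark.values()) / len(per_benchmark)


baselines = implied_baseline_averages(MEERKAT_7B_AVG, REPORTED_MARGINS)
# baselines == {"GPT-3.5": 51.1, "MediTron-7B": 50.8, "BioMistral-7B": 54.4}
```

If the margins were instead relative improvements, the implied baselines would differ, so these numbers should be checked against the paper's tables before being cited.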
![Figure 2: Performance of models on seven multiple-choice QA benchmark datasets. Our Meerkat-7B models generally outperformed existing 7B models and GPT-3.5 and even outperformed MediTron-70B on MedQA. The scores of GPT-3.5, GPT-4 and MediTron-70B are obtained from the papers of Nori et al. [ 6 ] , T](https://ar5iv.labs.arxiv.org/html/2404.00376/assets/figures/main_results_final2.png)
This summary was generated by AI and reviewed by a human editor.