QUICK REVIEW

[Paper Review] HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs

Junying Chen, Xidong Wang|arXiv (Cornell University)|Nov 16, 2023

Topic Modeling | Cited by 24

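One-sentence summary

HuatuoGPT-II introduces a unified one-stage domain adaptation protocol that converts domain data into instruction-output format and uses it to train a Chinese medical large language model, achieving state-of-the-art results on Chinese medical benchmarks.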

ABSTRACT

Adapting a language model into a specific domain, a.k.a. 'domain adaption', is a common practice when specialized knowledge, e.g. medicine, is not encapsulated in a general language model like Llama2. The challenge lies in the heterogeneity of data across the two training stages, as it varies in languages, genres, or formats. To tackle this and simplify the learning protocol, we propose to transform heterogeneous data, from both the pre-training and supervised stages, into a unified, simple input-output pair format. We validate the new protocol in the domains where proprietary LLMs like ChatGPT perform relatively poorly, such as Traditional Chinese Medicine. The developed model, HuatuoGPT-II, has shown state-of-the-art performance in the Chinese medicine domain on a number of benchmarks, e.g. medical licensing exams. It even outperforms proprietary models like ChatGPT and GPT-4 in some aspects, especially in Traditional Chinese Medicine. Expert manual evaluations further validate HuatuoGPT-II's advantages over existing LLMs. Notably, HuatuoGPT-II was benchmarked in a fresh Chinese National Medical Licensing Examination where it achieved the best performance, showcasing not only its effectiveness but also its generalization capabilities.

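Research Motivation and Objectives

Motivate domain adaptation of LLMs for medicine and reduce the complexity of the training pipeline.
Propose a unified one-stage protocol that replaces the conventional two-stage pipeline of continued pre-training followed by supervised fine-tuning (SFT).
Develop and evaluate a Chinese medical LLM that performs well on Traditional Chinese Medicine and general Chinese medical tasks.
Demonstrate data unification and priority-based sampling as an effective way to inject domain knowledge.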

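Proposed Method

Collect diverse domain-specific corpora (Chinese and English) spanning encyclopedias, books, literature, and web resources.
Unify the domain data into instruction-output format, aligned with the SFT data, by using an LLM to generate questions and synthesize answers.
Perform one-stage training on the unified domain data merged with the fine-tuning data, guided by a priority sampling strategy.
Standardize data encoding into fixed-length sequences and optimize the loss only on the outputs of the instruction-style data.
Evaluate on open benchmarks and with expert assessment, including a fresh medical licensing examination.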
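The encoding step in the list above (fixed-length sequences, loss on outputs only) can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `build_example`, the token ids, and the padding id are hypothetical, and the label value -100 is simply the default `ignore_index` of PyTorch's `CrossEntropyLoss`, assumed here as the masking convention.

```python
IGNORE_INDEX = -100  # assumed masking value (default ignore_index in torch.nn.CrossEntropyLoss)


def build_example(instruction_ids, output_ids, max_len, pad_id=0):
    """Encode one instruction-output pair as a fixed-length training example.

    Labels are masked on instruction and padding positions, so a loss that
    skips IGNORE_INDEX is computed on the output tokens only.
    """
    # Concatenate instruction and output, then truncate to the fixed length.
    input_ids = (instruction_ids + output_ids)[:max_len]
    labels = ([IGNORE_INDEX] * len(instruction_ids) + output_ids)[:max_len]

    # Pad both sequences up to max_len; padding never contributes to the loss.
    pad = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * pad
    labels = labels + [IGNORE_INDEX] * pad
    return input_ids, labels


# Hypothetical token ids: a 3-token instruction followed by a 2-token output.
ids, labels = build_example([5, 6, 7], [8, 9], max_len=8)
print(ids)
print(labels)
```

Keeping the mask in the labels rather than in the inputs means the model still attends to the instruction tokens while only the answer tokens are trained on.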

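Experimental Results

Research Questions

RQ1: What performance gains does one-stage domain adaptation bring to medical LLMs relative to the conventional two-stage pipeline?
RQ2: How effective is data unification via LLM-generated questions and answers at aligning heterogeneous domain data with SFT data?
RQ3: Can a Chinese medical LLM trained with one-stage adaptation outperform open-source and proprietary models on Chinese medical benchmarks and licensing examinations?

Key Findings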

Model	MedQA	MedMCQA	CMB	CMExam	MMLU	CMMLU	C-Eval
HuatuoGPT-II (7B)	25.77	31.20	28.81	31.07	34.91	33.23	36.53
HuatuoGPT-II (13B)	45.68	47.41	63.34	68.98	54.00	61.45	64.00
DISC-MedLLM	28.67	-	32.47	36.62	-	-	-
ChatGPT (API)	52.24	53.60	43.26	46.51	69.96	50.37	48.80
GPT-4 (API)	47.3	48.2	53.5	50.3	53.7	54.2	58.6

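HuatuoGPT-II achieves state-of-the-art performance among open-source models on medical benchmarks such as MedQA, MedMCQA, CMB, and CMExam, with the 13B variant standing out.
On the Chinese National Medical Licensing Examination, the 13B model approaches or matches leading proprietary models in several subjects and significantly outperforms many open-source baselines.
Expert and automated evaluations indicate response quality that is competitive with or better than mainstream LLMs, most notably in Traditional Chinese Medicine.
Compared with the two-stage approach, one-stage domain adaptation with data unification and priority sampling transfers domain knowledge effectively while simplifying the training pipeline.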
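This summary does not give the exact priority sampling formula, so the following is only a rough sketch of the general idea: data sources are merged into a single training stream, and higher-priority sources tend to be drawn earlier. The weighting rule (priority times remaining items), the function name `priority_sample`, and the example priorities are all assumptions made for illustration, not the paper's definition.

```python
import random
from collections import deque


def priority_sample(sources, priorities, rng):
    """Yield (source_name, item) pairs until every source is exhausted.

    Assumed rule: at each step a source is chosen with probability
    proportional to priority * number_of_remaining_items, so high-priority
    sources are consumed earlier without strictly preceding the others.
    """
    remaining = {name: deque(items) for name, items in sources.items()}
    while any(remaining.values()):
        names = [name for name, items in remaining.items() if items]
        weights = [priorities[name] * len(remaining[name]) for name in names]
        name = rng.choices(names, weights=weights, k=1)[0]
        yield name, remaining[name].popleft()


# Hypothetical data: unified domain pairs get a higher priority than SFT pairs.
sources = {"domain": ["d1", "d2", "d3"], "sft": ["s1", "s2"]}
priorities = {"domain": 4.0, "sft": 1.0}
for name, item in priority_sample(sources, priorities, random.Random(0)):
    print(name, item)
```

Because every item is drawn exactly once, the result is a single pass over the merged data, which is what lets one training stage stand in for the separate pre-training and fine-tuning stages.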


This review was generated by AI and reviewed by a human editor.