[Paper Walkthrough] Zhongjing: Enhancing the Chinese Medical Capabilities of Large Language Model through Expert Feedback and Real-world Multi-turn Dialogue
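Zhongjing is the first LLaMA-based large language model for the Chinese medical domain to use a complete training pipeline (pre-training, SFT, RLHF), together with a large-scale multi-turn doctor-patient dataset (CMtMedQA). With significantly fewer parameters, it outperforms open-source Chinese medical LLMs and in some areas even approaches ChatGPT.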
Recent advances in Large Language Models (LLMs) have achieved remarkable breakthroughs in understanding and responding to user intents. However, their performance lags behind general use cases in some expertise domains, such as Chinese medicine. Existing efforts to incorporate Chinese medicine into LLMs rely on Supervised Fine-Tuning (SFT) with single-turn and distilled dialogue data. These models lack the ability for doctor-like proactive inquiry and multi-turn comprehension and cannot align responses with experts' intentions. In this work, we introduce Zhongjing, the first Chinese medical LLaMA-based LLM that implements an entire training pipeline from continuous pre-training, SFT, to Reinforcement Learning from Human Feedback (RLHF). Additionally, we construct a Chinese multi-turn medical dialogue dataset of 70,000 authentic doctor-patient dialogues, CMtMedQA, which significantly enhances the model's capability for complex dialogue and proactive inquiry initiation. We also define a refined annotation rule and evaluation criteria given the unique characteristics of the biomedical domain. Extensive experimental results show that Zhongjing outperforms baselines in various capacities and matches the performance of ChatGPT in some abilities, despite ChatGPT having roughly 100x as many parameters. Ablation studies also demonstrate the contributions of each component: pre-training enhances medical knowledge, and RLHF further improves instruction-following ability and safety. Our code, datasets, and models are available at https://github.com/SupritYoung/Zhongjing.
Research Motivation and Objectives
- Bridge the gap in Chinese medical LLMs by integrating continuous pre-training, supervised fine-tuning, and reinforcement learning from human feedback.
- Create a large-scale multi-turn Chinese medical dialogue dataset to enable proactive inquiry and complex consultations.
- Define domain-specific annotation and evaluation criteria to better assess medical dialogue capabilities, safety, and professionalism.
- Demonstrate the impact of pre-training and RLHF on medical knowledge, instruction-following, and safety.
Proposed Method
- Continuous pre-training on diverse real-world medical corpora using Ziya-LLaMA as the base to instill medical knowledge.
- Construction of CMtMedQA, a Chinese multi-turn medical dialogue dataset of 70,000 dialogues with proactive inquiries, derived from real doctor–patient interactions and checked against CMeKG.
- Four SFT data types: single-turn medical dialogues, CMtMedQA multi-turn dialogues, medical NLP task instructions, and general medical-related dialogues to mitigate catastrophic forgetting.
- RLHF: use a refined medical annotation rule with six medical experts ranking 20,000 model outputs to train a reward model, then apply PPO to align with expert intent.
- Evaluation employs a three-dimension, nine-ability framework and uses GPT-4/human experts for scoring safety, professionalism, and fluency.
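The RLHF step above turns expert rankings into a training signal for a reward model. The summary does not state which loss is used, so the following is only a minimal sketch under the common assumption of a pairwise (Bradley-Terry style) ranking loss: each ranking is expanded into (preferred, rejected) pairs, and the reward model is penalized when it scores the rejected response higher. The function names, the toy responses, and the toy scores are all hypothetical and are not taken from the Zhongjing code base.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand one expert ranking (best response first) into
    (preferred, rejected) pairs. combinations() keeps input order,
    so the first element of every pair is the higher-ranked one."""
    return list(combinations(ranked_responses, 2))


def pairwise_loss(score_preferred, score_rejected):
    """Pairwise ranking loss: -log(sigmoid(s_preferred - s_rejected)),
    written as a numerically stable softplus of the negative margin."""
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranking_loss(ranked_responses, reward_fn):
    """Average pairwise loss over all pairs implied by one ranking.
    reward_fn is a stand-in (assumption) for the reward model's scalar output."""
    pairs = ranking_to_pairs(ranked_responses)
    total = sum(pairwise_loss(reward_fn(a), reward_fn(b)) for a, b in pairs)
    return total / len(pairs)


# Toy example: hypothetical reward scores for three responses to one prompt.
toy_scores = {
    "asks a follow-up question, then gives safe advice": 2.0,
    "gives generic advice": 0.5,
    "recommends an unsafe dosage": -1.5,
}
expert_ranking = list(toy_scores)  # best first, as an expert might rank them

print("loss when scores agree with the ranking:",
      ranking_loss(expert_ranking, toy_scores.get))
print("loss when scores contradict the ranking:",
      ranking_loss(expert_ranking[::-1], toy_scores.get))
```

In this sketch the loss is small when the reward scores agree with the expert ordering and grows when they contradict it. A reward model trained this way would then supply the scalar reward that PPO optimizes against.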
Experimental Results
Research Questions
- RQ1: How does end-to-end training (pre-training + SFT + RLHF) affect Chinese medical capabilities of an open-source LLM?
- RQ2: Can a large, real-world, multi-turn medical dialogue dataset improve proactive inquiry and diagnostic reasoning in a Chinese medical LLM?
- RQ3: What is the contribution of continuous pre-training and RLHF to safety, professionalism, and fluency in medical dialogues?
- RQ4: How does Zhongjing compare to existing open-source Chinese medical LLMs and to ChatGPT across multiple capabilities?
- RQ5: What evaluation criteria best capture the unique demands of medical dialogues in LLMs?
Key Findings
- Zhongjing outperforms open-source Chinese medical LLM baselines across multiple capability dimensions.
- The model matches ChatGPT in some abilities despite having only 1% of ChatGPT’s parameters.
- CMtMedQA significantly enhances the model’s multi-turn dialogue and proactive inquiry capabilities.
- Pre-training improves medical knowledge while RLHF improves instruction-following and safety; ablation shows both are important.
- The scaling of instructions and domain-specific data drives performance; excessive distilled data can hurt real medical accuracy.
This walkthrough was generated by AI and reviewed by a human editor.