QUICK REVIEW

[Paper Digest] TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Wenjing Yue, Xiaoling Wang|arXiv (Cornell University)|Jun 3, 2024

Traditional Chinese Medicine Studies | Cited by 9

One-Sentence Summary

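TCMBench introduces a dedicated benchmark and supporting resources (TCM-ED, TMNLI, TCM-Deberta, and TCMScore) for evaluating and analyzing LLM performance in Traditional Chinese Medicine, revealing substantial room for improvement as well as the influence of domain knowledge and prompting strategies.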

ABSTRACT

Large language models (LLMs) have performed remarkably well in various natural language processing tasks by benchmarking, including in the Western medical domain. However, the professional evaluation benchmarks for LLMs have yet to be covered in the traditional Chinese medicine (TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. It covers the core components of TCMLE, including TCM basis and clinical practice. To evaluate LLMs beyond accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions. It comprehensively considers the consistency of TCM semantics and knowledge. After conducting comprehensive experimental analyses from diverse perspectives, we can obtain the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models like ZhongJing-TCM, the quality of generated analysis text has decreased, and we hypothesize that their fine-tuning process affects the basic LLM capabilities. (3) Traditional metrics for text generation quality like Rouge and BertScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to provide more profound assistance to medical research.

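Research Motivation and Objectives

Motivate the need for a TCM-oriented benchmark that evaluates LLMs beyond datasets oriented toward Western medicine.
Build a large, representative TCM evaluation dataset (TCM-ED) from the TCM Licensing Exam (TCMLE).
Develop a domain-aligned evaluation metric (TCMScore) to assess semantic and knowledge consistency in TCM text generation.
Study how model scale, domain knowledge, and prompting strategies affect LLM performance in TCM.
Provide insights to guide the future development of LLMs for TCM applications.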

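Proposed Method

Construct TCM-ED from TCMLE with 5,473 question-answer pairs, ensuring coverage of all branches and question types; 1,300 of the pairs come with standard analyses.
Create TMNLI, a TCM-specific NLI dataset (9,788 questions with analyses), to assess semantic consistency between generated analyses and the standard analyses.
Develop TCM-Deberta, a fine-tuned NLI model for inferring TCM semantic consistency.
Define and compute TCMScore by combining term-level matching (Term F1*) with semantic consistency (the TCM-Deberta score) and a length penalty.
Evaluate LLMs using accuracy on the multiple-choice questions and analysis-based evaluation on the 1,300 questions with analyses, combining traditional and domain-specific metrics (Rouge, BertScore, BartScore, TCMScore).
Use prompt engineering (task descriptions, CoT, few-shot) and multi-turn dialogue to assess reasoning and stability across branches.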
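
To make the structure of TCMScore more concrete, the following is a minimal Python sketch of a score that combines the three ingredients named above: term-level F1, a semantic-consistency score, and a length penalty. This digest does not give the paper's exact formula, so the substring-based term matching, the `min(r, 1/r)` length penalty, and the multiplicative combination are all assumptions made for illustration. The `semantic_score` argument is a stand-in for the output of an NLI model such as TCM-Deberta, and all function names are hypothetical.

```python
def extract_terms(text, vocabulary):
    """Return the vocabulary terms that occur in the text.

    Assumption: plain substring matching against a given term list.
    """
    return {term for term in vocabulary if term in text}


def term_f1(generated_terms, reference_terms):
    """F1 between the term sets of the generated and reference analyses."""
    gen, ref = set(generated_terms), set(reference_terms)
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_penalty(generated_len, reference_len):
    """Assumed penalty: 1.0 at equal length, smaller as lengths diverge."""
    if generated_len == 0 or reference_len == 0:
        return 0.0
    ratio = generated_len / reference_len
    return min(ratio, 1.0 / ratio)


def tcm_score_sketch(generated, reference, semantic_score, vocabulary):
    """Illustrative combination (assumed product), not the paper's formula.

    semantic_score: value in [0, 1], a placeholder for an NLI-based
    consistency score such as the one TCM-Deberta would produce.
    """
    f1 = term_f1(extract_terms(generated, vocabulary),
                 extract_terms(reference, vocabulary))
    penalty = length_penalty(len(generated), len(reference))
    return f1 * semantic_score * penalty
```

A product was chosen here so that a generated analysis scores well only when it uses the right terms, is semantically consistent, and has a comparable length; a weighted sum would be an equally plausible assumption.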

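Experimental Results

Research Questions

RQ1: What is the baseline performance of large language models on real-world TCM knowledge and clinical reasoning questions?
RQ2: Does adding domain-specific knowledge or targeted fine-tuning improve LLM performance in TCM, and how does it affect core reasoning abilities?
RQ3: How do traditional generation metrics (Rouge, BertScore, BartScore) compare with the domain-specific metric (TCMScore) in reflecting the accuracy and coherence of TCM knowledge?
RQ4: What role do prompting strategies (CoT, few-shot, multi-turn dialogue) play in improving TCM understanding and reasoning?
RQ5: How does model performance differ across TCM branches such as TCM Basis, Clinical Medicine, and Western Medicine?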
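
As an illustration of the prompting strategies compared in RQ4, the sketch below assembles a multiple-choice prompt from an optional task description, optional few-shot examples, and an optional chain-of-thought cue. The prompt wording, the dictionary layout of the examples, and the function names are assumptions for illustration; the paper's actual templates are not reproduced in this digest.

```python
def format_item(question, options):
    """Render one multiple-choice question with lettered options."""
    lines = [f"Question: {question}"]
    for letter in sorted(options):
        lines.append(f"{letter}. {options[letter]}")
    return "\n".join(lines)


def build_prompt(question, options, task_description="",
                 examples=(), use_cot=False):
    """Assemble a prompt; every part except the question is optional.

    examples: iterable of dicts with "question", "options", "answer"
    keys (a hypothetical layout for few-shot demonstrations).
    """
    parts = []
    if task_description:
        parts.append(task_description)
    for example in examples:
        shot = format_item(example["question"], example["options"])
        parts.append(f"{shot}\nAnswer: {example['answer']}")
    if use_cot:
        cue = "Let's think step by step, then give the final option letter."
    else:
        cue = "Answer with the option letter only."
    parts.append(f"{format_item(question, options)}\n{cue}")
    return "\n\n".join(parts)
```

Calling `build_prompt` with no examples and `use_cot=False` gives a zero-shot prompt; adding `examples` or setting `use_cot=True` yields the few-shot and CoT variants, which makes it easy to see how prompt length grows with each strategy.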

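Key Findings

None of the evaluated LLMs passed the 60% threshold on TCMLE, indicating considerable room for improvement in TCM AI.
Models with domain knowledge or specialized fine-tuning can achieve higher performance, but fine-tuning may weaken core reasoning and language abilities.
The domain-specific metric (TCMScore) offers complementary insights beyond Rouge/BertScore/SARI, especially in capturing TCM terminology use and semantic consistency.
Prompts with examples (few-shot) generally improve complex reasoning; however, overly long prompts can hurt the performance of some models.
GPT-4 achieved the highest overall accuracy among the tested models but still fell short of the passing line, highlighting the domain gap; cross-domain models (such as ChatGLM) perform well in some branches when suitable Chinese corpora are used.
The evaluation shows that text length and surface similarity affect traditional metrics, whereas TCMScore better reflects knowledge accuracy and consistency.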
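
The accuracy-based findings above can be read against a simple computation: per-branch and overall multiple-choice accuracy, compared with the 60% passing line. The sketch below is a minimal illustration; the record layout `(branch, predicted, gold)` and the function names are assumptions, and the sample records in the usage note are made up to show the call shape, not taken from the paper.

```python
from collections import defaultdict

PASS_LINE = 0.60  # passing threshold mentioned in the findings


def accuracy_by_branch(records):
    """Accuracy per branch from (branch, predicted, gold) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for branch, predicted, gold in records:
        total[branch] += 1
        if predicted == gold:
            correct[branch] += 1
    return {branch: correct[branch] / total[branch] for branch in total}


def overall_accuracy(records):
    """Fraction of all records whose prediction matches the gold answer."""
    records = list(records)
    if not records:
        return 0.0
    hits = sum(1 for _, predicted, gold in records if predicted == gold)
    return hits / len(records)


def passes(accuracy, pass_line=PASS_LINE):
    """True when the accuracy reaches the passing line."""
    return accuracy >= pass_line
```

For example, with the hypothetical records `[("TCM Basis", "A", "A"), ("TCM Basis", "B", "C"), ("Clinical Medicine", "D", "D")]`, `accuracy_by_branch` reports each branch separately, which is how branch-level differences such as those asked about in RQ5 would surface.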

This digest was generated by AI and reviewed by a human editor.