QUICK REVIEW

[Paper Review] TCMBench: A Comprehensive Benchmark for Evaluating Large Language Models in Traditional Chinese Medicine

Wenjing Yue, Xiaoling Wang|arXiv (Cornell University)|Jun 3, 2024

Traditional Chinese Medicine Studies | Cited by 9

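One-line summary

TCMBench introduces a dedicated benchmark and metrics (TCM-ED, TMNLI, TCM-Deberta, TCMScore) to evaluate and analyze LLM performance in traditional Chinese medicine. It shows that domain knowledge and prompting strategy have a marked effect on results and that substantial room for improvement remains.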

ABSTRACT

Large language models (LLMs) have performed remarkably well in various natural language processing tasks by benchmarking, including in the Western medical domain. However, the professional evaluation benchmarks for LLMs have yet to be covered in the traditional Chinese medicine (TCM) domain, which has a profound history and vast influence. To address this research gap, we introduce TCM-Bench, a comprehensive benchmark for evaluating LLM performance in TCM. It comprises the TCM-ED dataset, consisting of 5,473 questions sourced from the TCM Licensing Exam (TCMLE), including 1,300 questions with authoritative analysis. It covers the core components of TCMLE, including TCM basis and clinical practice. To evaluate LLMs beyond accuracy of question answering, we propose TCMScore, a metric tailored for evaluating the quality of answers generated by LLMs for TCM-related questions. It comprehensively considers the consistency of TCM semantics and knowledge. After conducting comprehensive experimental analyses from diverse perspectives, we can obtain the following findings: (1) The unsatisfactory performance of LLMs on this benchmark underscores their significant room for improvement in TCM. (2) Introducing domain knowledge can enhance LLMs' performance. However, for in-domain models like ZhongJing-TCM, the quality of generated analysis text has decreased, and we hypothesize that their fine-tuning process affects the basic LLM capabilities. (3) Traditional metrics for text generation quality like Rouge and BertScore are susceptible to text length and surface semantic ambiguity, while domain-specific metrics such as TCMScore can further supplement and explain their evaluation results. These findings highlight the capabilities and limitations of LLMs in TCM and aim to provide a more profound assistance to medical research.

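Motivation and objectives

Argue for a TCM-specific benchmark so that LLMs can be evaluated beyond Western-medicine-oriented datasets.
Build a large, representative TCM evaluation dataset (TCM-ED) from the TCMLE.
Develop a domain-aligned metric (TCMScore) that assesses semantic alignment and knowledge consistency in generated TCM text.
Investigate how model size, domain knowledge, and prompting strategy affect LLM performance in TCM.
Provide findings that can guide the development of LLMs for future TCM applications.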

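Proposed method

Build TCM-ED from 5,473 Q&A pairs drawn from the TCMLE, ensuring coverage across subject areas and question types; 1,300 of them come with standard analyses.
Create TMNLI, a TCM-specific NLI dataset (9,788 questions and analyses), to assess semantic consistency between generated and standard analyses.
Develop TCM-Deberta, a fine-tuned NLI model for inferring semantic consistency in TCM text.
Define and compute TCMScore by combining term-level lexical matching (Term F1*), semantic consistency (the TCM-Deberta score), and a length penalty.
Evaluate accuracy on the multiple-choice questions, and score the 1,300 analysis-based items with both traditional and domain-specific metrics (Rouge, BertScore, BartScore, TCMScore).
Use prompt designs such as task descriptions, CoT, few-shot examples, and multi-turn dialogue to assess reasoning and stability for each branch.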
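The review names three ingredients of TCMScore (term-level F1, an NLI-based semantic consistency score, and a length penalty) but does not give the formula. The following is a minimal sketch of how such a combination could look, not the paper's definition. The multiplicative combination, the BLEU-style brevity penalty, and all function names are assumptions made for illustration, and `entailment_prob` stands in for whatever score an NLI model such as TCM-Deberta would return.

```python
import math
from collections import Counter


def term_f1(generated_terms, reference_terms):
    """F1 over the multisets of domain terms found in each text."""
    gen, ref = Counter(generated_terms), Counter(reference_terms)
    overlap = sum((gen & ref).values())  # terms shared by both texts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def length_penalty(generated_len, reference_len):
    """Assumed BLEU-style brevity penalty; the paper's form may differ."""
    if generated_len == 0:
        return 0.0
    if generated_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / generated_len)


def tcm_score_sketch(generated_terms, reference_terms, entailment_prob,
                     generated_len, reference_len):
    """Hypothetical combination: product of the three components."""
    return (term_f1(generated_terms, reference_terms)
            * entailment_prob
            * length_penalty(generated_len, reference_len))


if __name__ == "__main__":
    # Toy inputs: term lists would come from a TCM term extractor (not shown).
    score = tcm_score_sketch(["气虚", "血瘀"], ["气虚", "痰湿"], 0.8, 40, 40)
    print(f"sketch score: {score:.3f}")
```

A product is used here so that a low value in any one component pulls the whole score down; a weighted sum would be an equally plausible reading of the description.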

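Experimental results

Research questions

RQ1: What is the baseline performance of large language models on authentic TCM knowledge and clinical reasoning questions?
RQ2: Does adding domain knowledge or specialized fine-tuning improve LLM performance in TCM, and how does it affect core reasoning ability?
RQ3: How do traditional generation metrics (Rouge, BertScore, BartScore) and the domain-specific metric (TCMScore) differ in reflecting the accuracy and consistency of TCM knowledge?
RQ4: What role do prompting strategies such as CoT, few-shot, and multi-turn dialogue play in improving TCM understanding and reasoning?
RQ5: How does model performance vary across areas such as TCM Basis, Clinical Medicine, and Western Medicine?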

Key findings

Model	A1/A2	A3 (zero-shot)	A3 (few-shot)	B1	Total
Chinese LlaMa	0.0969	0.1075	0.1620	0.1151	0.1089
HuaTuo	0.1944	0.1981	0.1402	0.1876	0.1840
ZhongJing-TCM	0.3537	0.3364	0.3178	0.2182	0.2695
ChatGLM	0.3613	0.4595	0.6168	0.4568	0.4477
ChatGPT	0.4510	0.4657	0.4782	0.4444	0.4398
GPT-4	0.5819	0.6231	0.6277	0.6011	0.5986

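None of the evaluated LLMs reaches the 60% mark in total score (GPT-4 comes closest at 0.5986), which points to substantial room for improvement in TCM AI.
Models with domain knowledge or specialized tuning can perform better, but fine-tuning can degrade core reasoning and language abilities.
The domain-specific metric (TCMScore) offers complementary insight beyond Rouge/BertScore/SARI, and is especially useful for capturing TCM terminology usage and semantic alignment.
Prompting with examples (few-shot) generally improves complex reasoning, although overly long prompts can hurt the performance of some models.
GPT-4 has the highest total accuracy among the models tested but still falls short of passing, which highlights the gap between domains. Cross-domain models (e.g., ChatGLM) do well in certain areas thanks to suitable Chinese corpora.
The evaluation shows that text length and surface similarity influence traditional metrics, whereas TCMScore better reflects knowledge accuracy and consistency.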
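The 60% claim can be checked directly against the last column of the results table. The short sketch below assumes that column is total accuracy on a 0 to 1 scale and that 0.60 is the pass mark; the variable names are illustrative only.

```python
PASS_MARK = 0.60  # assumed pass threshold, taken from the "60%" in the findings

# Values copied from the last column of the results table.
total_accuracy = {
    "Chinese LlaMa": 0.1089,
    "HuaTuo": 0.1840,
    "ZhongJing-TCM": 0.2695,
    "ChatGLM": 0.4477,
    "ChatGPT": 0.4398,
    "GPT-4": 0.5986,
}

# Models at or above the pass mark, the best model, and its gap to passing.
passing = [m for m, acc in total_accuracy.items() if acc >= PASS_MARK]
best = max(total_accuracy, key=total_accuracy.get)
gap = PASS_MARK - total_accuracy[best]

if __name__ == "__main__":
    print("passing models:", passing)
    print(f"best: {best}, short of the pass mark by {gap:.4f}")
```

Note that GPT-4 exceeds 0.60 on some individual question types in the table (for example A3 few-shot at 0.6277), so the claim holds for the total score rather than for every column.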


This review was generated by AI and checked by a human editor.