QUICK REVIEW

[Paper Review] Me LLaMA: Foundation Large Language Models for Medical Applications

Qianqian Xie, Qingyu Chen|arXiv (Cornell University)|Feb 20, 2024

Machine Learning in Healthcare | Cited by 7

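One-Sentence Summary

Me-LLaMA is a family of medical LLMs built on the open-source LLaMA models that improves medical text analysis and diagnosis through domain-specific pre-training and instruction tuning; it performs robustly against open models in zero-shot, supervised, and complex-case settings, and is competitive with ChatGPT/GPT-4 in several settings.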

ABSTRACT

Recent advancements in large language models (LLMs) like ChatGPT and LLaMA show promise in medical applications, yet challenges remain in medical language comprehension. This study presents Me-LLaMA, a new medical LLM family based on open-source LLaMA models, optimized for medical text analysis and diagnosis by leveraging large-scale, domain-specific datasets. The Me-LLaMA family, including foundation models Me-LLaMA 13/70B and their chat-enhanced versions, was developed through continued pre-training and instruction tuning with 129B tokens and 214K samples from biomedical and clinical sources. Training the 70B models required over 100,000 A100 GPU hours. Me-LLaMA's performance was evaluated across six medical text analysis tasks using 12 benchmark datasets and complex clinical case diagnosis, with automatic and human evaluations. Results indicate Me-LLaMA outperforms LLaMA and other open-source medical LLMs in zero-shot and supervised settings. Task-specific tuning further boosts performance, surpassing ChatGPT on 7 of 8 datasets and GPT-4 on 5 of 8. For complex clinical cases, Me-LLaMA achieves performance comparable to ChatGPT and GPT-4. This work underscores the importance of domain-specific data in developing medical LLMs and addresses the high computational costs involved in training, highlighting a balance between pre-training and fine-tuning strategies. Me-LLaMA models are now accessible under user agreements, providing a valuable resource for advancing medical AI.

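Research Motivation and Objectives

Motivate the need for domain-specific LLMs that improve language comprehension and diagnostic support in medicine.
Develop a medical LLM family (Me-LLaMA 13B/70B) through continued pre-training and instruction tuning on biomedical and clinical data.
Evaluate performance on multiple medical text analysis tasks and on complex clinical case diagnosis, using both automatic and human evaluation.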

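Proposed Method

Continue pre-training the Me-LLaMA foundation models (13B and 70B) on domain-specific data totaling 129B tokens.
Create the chat-enhanced versions through instruction tuning on 214K biomedical and clinical samples.
Allocate substantial compute to training the 70B models (over 100,000 A100 GPU hours).
Evaluate on six medical text analysis tasks covering 12 benchmark datasets, plus complex clinical case diagnosis.
Compare zero-shot and supervised performance against LLaMA and other open-source medical LLMs and, after task-specific tuning, against ChatGPT and GPT-4.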
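
To give a feel for the compute figure above, here is a minimal sketch that converts the reported lower bound of 100,000 A100 GPU hours into idealized wall-clock time. The cluster sizes (64, 128, 256 GPUs) and the assumption of perfect parallel efficiency are hypothetical and chosen only for illustration; the abstract does not state the hardware layout, and the names `TrainingBudget` and `wall_clock_days` are illustrative as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingBudget:
    """Figures quoted in the abstract (the GPU-hour value is a lower bound)."""
    pretraining_tokens: float   # 129B tokens of continued pre-training data
    instruction_samples: int    # 214K instruction-tuning samples
    gpu_hours_70b: float        # "over 100,000" A100 GPU hours for the 70B models


def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Idealized wall-clock days, assuming perfect scaling across num_gpus."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus / 24.0


budget = TrainingBudget(
    pretraining_tokens=129e9,
    instruction_samples=214_000,
    gpu_hours_70b=100_000,
)

# Hypothetical cluster sizes, not taken from the paper.
for num_gpus in (64, 128, 256):
    days = wall_clock_days(budget.gpu_hours_70b, num_gpus)
    print(f"{num_gpus:>3} GPUs: at least {days:.1f} days")
```

Because the reported figure is a lower bound and real training rarely scales perfectly, the printed values should be read as optimistic minimums rather than estimates of the actual training time.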

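Experimental Results

Research Questions

RQ1: Can domain-adapted LLMs trained on biomedical and clinical data outperform general-purpose open-source LLMs and other open-source medical LLMs on core medical text analysis tasks?
RQ2: How does task-specific instruction tuning affect performance on medical benchmarks relative to the zero-shot setting?
RQ3: Can Me-LLaMA models match or exceed state-of-the-art closed models (ChatGPT, GPT-4) across multiple medical datasets and complex clinical scenarios?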

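Key Findings

Me-LLaMA outperforms LLaMA and other open-source medical LLMs on the six medical text analysis tasks in both zero-shot and supervised settings.
Task-specific tuning further improves performance, surpassing ChatGPT on 7 of 8 datasets.
After tuning, Me-LLaMA surpasses GPT-4 on 5 of 8 datasets.
On complex clinical cases, Me-LLaMA performs comparably to ChatGPT and GPT-4.
The study underscores the value of domain-specific data and discusses the trade-off between pre-training scale and fine-tuning cost.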
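
The head-to-head counts above can be expressed as win rates with a few lines of arithmetic. This is a minimal sketch using only the dataset counts quoted in the abstract; the helper name `win_rate` is illustrative, and no per-dataset scores are assumed.

```python
def win_rate(wins: int, total: int) -> float:
    """Percentage of datasets on which the tuned Me-LLaMA model scores higher."""
    if total <= 0 or not 0 <= wins <= total:
        raise ValueError("expected 0 <= wins <= total and total > 0")
    return 100.0 * wins / total


# (wins, datasets compared) after task-specific tuning, as reported in the abstract.
head_to_head = {"ChatGPT": (7, 8), "GPT-4": (5, 8)}

for baseline, (wins, total) in head_to_head.items():
    print(f"vs {baseline}: {wins}/{total} datasets ({win_rate(wins, total):.1f}%)")
```

With only eight datasets in the comparison, a single dataset shifts the rate by 12.5 percentage points, so these percentages are best read as coarse indicators.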


This review was generated by AI and checked by a human editor.