QUICK REVIEW

[Paper Review] AlpaCare: Instruction-tuned Large Language Models for Medical Application

Xinlu Zhang, Chenxin Tian | arXiv (Cornell University) | Oct 23, 2023

Topic Modeling | Citations: 15

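One-Sentence Summary

AlpaCare builds a diverse 52k medical instruction-response dataset with a semi-automated pipeline and fine-tunes LLaMA models on it, improving medical capability and generalization in both the medical and general domains.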

ABSTRACT

Instruction-finetuning (IFT) has become crucial in aligning Large Language Models (LLMs) with diverse human needs and has shown great potential in medical applications. However, previous studies mainly fine-tune LLMs on biomedical datasets with limited diversity, which often rely on benchmarks or narrow task scopes, and hence significantly limit the effectiveness on their medical instruction-following ability and generalizability. To bridge this gap, we propose creating a diverse, machine-generated medical IFT dataset, MedInstruct-52k, using GPT-4 and ChatGPT with a high-quality expert-curated seed set. We then fine-tune LLaMA-series models on the dataset to develop AlpaCare. Despite using a smaller domain-specific dataset than previous medical LLMs, AlpaCare not only demonstrates superior performance on medical applications, with up to 38.1% absolute gain over best baselines in medical free-form instruction evaluations, but also achieves 6.7% absolute gains averaged over multiple general domain benchmarks. Human evaluation further shows that AlpaCare consistently outperforms best baselines in terms of both correctness and helpfulness. We offer public access to our data, model, and codebase in https://github.com/XZhang97666/AlpaCare.

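Motivation and Objectives

Show that task-diverse medical instruction tuning improves medical capability without sacrificing generalization.
Show that models trained on a 52k medical-focused self-instruct dataset outperform models trained on larger medical instruction datasets.
Evaluate how domain-specific instruction data affects performance on medical and general-domain tasks.
Provide public resources to support future open-source medical LLM research.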

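Proposed Method

Build a clinician-crafted seed set covering topic, viewpoint, task type, and difficulty level to guide data generation.
Use GPT-4 to automatically generate a broad range of medical tasks from the seed set, applying Rouge-L diversity filtering.
Have ChatGPT generate responses for the tasks that pass filtering, producing the MedInstruct-52k dataset for instruction tuning.
Fine-tune LLaMA models on MedInstruct-52k to obtain the AlpaCare models.
Evaluate medical capability and generalization with free-form instruction tests in the medical and general domains, using dual-sided scoring by multiple judges (e.g., gpt-3.5-turbo, Claude-2).
Compare AlpaCare with other baselines across 7B and 13B backbones and multiple reference models.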
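
The Rouge-L diversity filtering step can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, tokenization is plain whitespace splitting, and the 0.7 threshold is an assumption borrowed from the Self-Instruct recipe rather than a value stated in this review. A candidate task is kept only if its Rouge-L F-measure against every seed task and every previously kept task stays below the threshold.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Rouge-L F-measure on lowercased, whitespace-split tokens (a simplification)."""
    c, r = candidate.lower().split(), reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def filter_diverse(candidates, pool, threshold=0.7):
    """Keep candidates that are not too similar to the pool or to earlier keeps.

    threshold=0.7 is an assumed value, not one taken from the review.
    """
    kept = []
    for task in candidates:
        existing = list(pool) + kept
        if all(rouge_l_f1(task, other) < threshold for other in existing):
            kept.append(task)
    return kept


# Hypothetical example data, for illustration only.
seed_tasks = ["Summarize the discharge note for the patient."]
generated = [
    "Summarize the discharge note for this patient.",  # near-duplicate of the seed
    "List common side effects of metformin.",          # novel task
]
diverse_tasks = filter_diverse(generated, seed_tasks)
```

In this toy run the near-duplicate is dropped and only the novel task survives. A production pipeline would more likely rely on an established Rouge implementation with proper tokenization.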

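Experimental Results

Research Questions

RQ1: Does training on a diverse, domain-specific self-instruct dataset improve an LLM's medical proficiency?
RQ2: Does domain-specific instruction tuning improve generalization to general-domain instructions?
RQ3: How does AlpaCare compare with other medical and instruction-tuned models across different backbones and judges?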

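Key Findings

AlpaCare consistently outperforms instruction-tuned baselines in both the medical and general domains across multiple reference models.
Training on MedInstruct-52k improves medical capability while maintaining or improving generalization.
AlpaCare-13B outperforms other 13B instruction-tuned models in both medical and general-domain evaluations.
AlpaCare's advantage holds under different judges (gpt-3.5-turbo, Claude-2), which indicates robust performance and reduces the risk of judge bias.
Backbone ablations show that AlpaCare outperforms its counterparts, including Alpaca, on every LLM backbone tested, suggesting that data diversity offers benefits beyond model size.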
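
The dual-sided scoring used in the evaluation can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: `judge` is a hypothetical callable standing in for an LLM judge, its return labels ("first", "second", "tie") are invented here, and averaging the two presentation orders with a tie worth 0.5 is one plausible aggregation, not necessarily the one the authors used. The point is that scoring each pair in both orders cancels out a judge's preference for a particular position.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (instruction, first_output, second_output)
# returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]

_POINTS = {"first": 1.0, "tie": 0.5, "second": 0.0}


def dual_sided_score(judge: Judge, instruction: str,
                     model_out: str, ref_out: str) -> float:
    """Score the model against a reference in both presentation orders."""
    # Order 1: the model output is shown first.
    forward = _POINTS[judge(instruction, model_out, ref_out)]
    # Order 2: the model output is shown second, so the points are flipped.
    backward = 1.0 - _POINTS[judge(instruction, ref_out, model_out)]
    return (forward + backward) / 2


def win_rate(judge: Judge, examples: Iterable[Tuple[str, str, str]]) -> float:
    """Average dual-sided score over (instruction, model_out, ref_out) triples."""
    scores = [dual_sided_score(judge, *example) for example in examples]
    if not scores:
        raise ValueError("examples must not be empty")
    return sum(scores) / len(scores)


# Toy judges for illustration only; a real judge would call an LLM API.
def always_first(instruction: str, first: str, second: str) -> str:
    return "first"  # purely position-biased


def prefer_longer(instruction: str, first: str, second: str) -> str:
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"
```

With the purely position-biased `always_first` judge, every comparison scores 0.5, so positional preference alone cannot produce an apparent win.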

This review was generated by AI and checked by a human editor.