QUICK REVIEW

[Paper Review] A Survey of Large Language Models in Medicine: Progress, Application, and Challenge

Hongjian Zhou, Fenglin Liu|arXiv (Cornell University)|Nov 9, 2023

Topic Modeling | Citations: 35

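One-Line Summary

A comprehensive review of how medical large language models (LLMs) are built, evaluated, and applied in clinical practice, highlighting open challenges and future directions.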

ABSTRACT

Large language models (LLMs), such as ChatGPT, have received substantial attention due to their capabilities for understanding and generating human language. While there has been a burgeoning trend in research focusing on the employment of LLMs in supporting different medical tasks (e.g., enhancing clinical diagnostics and providing medical education), a review of these efforts, particularly their development, practical applications, and outcomes in medicine, remains scarce. Therefore, this review aims to provide a detailed overview of the development and deployment of LLMs in medicine, including the challenges and opportunities they face. In terms of development, we provide a detailed introduction to the principles of existing medical LLMs, including their basic model structures, number of parameters, and sources and scales of data used for model development. It serves as a guide for practitioners in developing medical LLMs tailored to their specific needs. In terms of deployment, we offer a comparison of the performance of different LLMs across various medical tasks, and further compare them with state-of-the-art lightweight models, aiming to provide an understanding of the advantages and limitations of LLMs in medicine. Overall, in this review, we address the following questions: 1) What are the practices for developing medical LLMs? 2) How to measure the medical task performance of LLMs in a medical setting? 3) How have medical LLMs been employed in real-world practice? 4) What challenges arise from the use of medical LLMs? and 5) How to more effectively develop and deploy medical LLMs? By answering these questions, this review aims to provide insights into the opportunities for LLMs in medicine and serve as a practical resource. We also maintain a regularly updated list of practical guides on medical LLMs at https://github.com/AI-in-Health/MedLLMsPracticalGuide

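Motivation and Objectives

Explain how medical LLMs are built (pre-training, fine-tuning, and prompting) and which data sources they use.
Summarize the evaluation metrics and benchmark tasks used across medical NLP to assess the performance of medical LLMs.
Describe practical clinical applications and guidelines for deploying medical LLMs in real-world clinical settings.
Identify key challenges such as hallucination, data limitations, and ethics and safety, and propose directions for future progress.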

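Proposed Method

Survey existing medical LLM architectures and the data sources used for pre-training and fine-tuning.
Compare fine-tuning methods (supervised fine-tuning (SFT), instruction fine-tuning (IFT), and parameter-efficient tuning) with prompting methods (zero-/few-shot, chain-of-thought (CoT), self-consistency, and prompt tuning).
Summarize downstream biomedical NLP tasks (discriminative and generative) and their standard evaluation datasets.
Provide guidelines for seven clinical application scenarios and discuss deployment considerations.
Examine challenges including hallucination, evaluation benchmarks, data limitations, adaptation to new knowledge, alignment, and ethics/safety, and outline future directions.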
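
To make two of the prompting methods listed above more concrete, here is a minimal sketch of few-shot chain-of-thought prompting combined with self-consistency (sampling several answers and taking a majority vote). It is an illustration under stated assumptions, not the survey's own implementation: the prompt layout, the function names, and the `make_stub_sampler` helper are all hypothetical, and the stub stands in for a real stochastic LLM call.

```python
from collections import Counter
from typing import Callable, List, Tuple


def build_few_shot_cot_prompt(
    examples: List[Tuple[str, str, str]], question: str
) -> str:
    """Format (question, reasoning, answer) demonstrations, then the new question.

    The "Q:/Reasoning:/Answer:" layout is an assumed convention, not one
    prescribed by the survey.
    """
    parts = []
    for q, reasoning, answer in examples:
        parts.append(f"Q: {q}\nReasoning: {reasoning}\nAnswer: {answer}")
    # The trailing cue asks the model to reason step by step before answering.
    parts.append(f"Q: {question}\nReasoning: Let's think step by step.")
    return "\n\n".join(parts)


def self_consistency(
    sample_answer: Callable[[str], str], prompt: str, n_samples: int = 5
) -> str:
    """Sample several answers for one prompt and return the most frequent one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers = [sample_answer(prompt).strip() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


def make_stub_sampler(canned: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a stochastic LLM call; cycles through canned answers."""
    state = {"i": 0}

    def sample(prompt: str) -> str:
        answer = canned[state["i"] % len(canned)]
        state["i"] += 1
        return answer

    return sample


if __name__ == "__main__":
    demo = [("Example question", "Example reasoning", "A")]
    prompt = build_few_shot_cot_prompt(demo, "New question")
    sampler = make_stub_sampler(["B", "A", "B", "C", "B"])
    print(self_consistency(sampler, prompt, n_samples=5))
```

In practice `sample_answer` would wrap a model call with a non-zero sampling temperature and extract only the final answer from the generated reasoning; that extraction step is omitted here.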

Figure 2: We demonstrate the development of model sizes for medical large language models in different model architectures, i.e., BERT-like, Baichuan/ChatGLM/LLaMA-like, and GPT/PaLM-like.

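Experimental Results

Research Questions

RQ1: How should medical LLMs be built, and which data sources are most effective?
RQ2: Which benchmarks and metrics are used to evaluate medical LLMs, and how do they perform across tasks?
RQ3: How should medical LLMs be applied in clinical settings?
RQ4: What challenges do medical LLMs face in deployment and maintenance?
RQ5: What directions would improve the construction, evaluation, and deployment of medical LLMs?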

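Key Findings

Medical LLMs have emerged through pre-training on domain data, or through fine-tuning or prompting general-purpose LLMs, and achieve strong performance on biomedical NLP tasks.
A broad range of discriminative and generative downstream tasks is used to evaluate medical LLMs against benchmarks and human experts.
The paper provides practical guidelines for deploying medical LLMs across seven clinical scenarios and highlights performance comparisons with GPT-3.5-turbo, GPT-4, and human experts.
Hallucination, evaluation gaps, domain data limitations, and ethics/safety concerns are identified as the main challenges for medical LLMs, with a call for broader trustworthiness evaluation and safe deployment.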
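
This summary does not name the specific metrics used in the comparison. As a hedged illustration of how predictions on a discriminative task might be scored against gold labels, the sketch below assumes accuracy and macro-averaged F1, two common classification metrics. The function names and the example labels are hypothetical and carry no results from the paper.

```python
from typing import List, Sequence


def _check_inputs(gold: Sequence[str], pred: Sequence[str]) -> None:
    """Validate that both label sequences are non-empty and aligned."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("gold and pred must not be empty")


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    _check_inputs(gold, pred)
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    _check_inputs(gold, pred)
    labels = sorted(set(gold) | set(pred))
    f1_scores: List[float] = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); fall back to 0.0 if the label never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


if __name__ == "__main__":
    # Made-up labels for illustration only.
    gold = ["yes", "no", "yes", "maybe"]
    pred = ["yes", "no", "no", "maybe"]
    print(f"accuracy = {accuracy(gold, pred):.3f}")
    print(f"macro-F1 = {macro_f1(gold, pred):.3f}")
```

Generative tasks such as summarization are usually scored with overlap-based or human-judged metrics instead, which this sketch does not cover.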

Figure 3: Performance comparison between the GPT-3.5 turbo, GPT-4, state-of-the-art task-specific fine-tuned models, and human experts, on seven downstream biomedical NLP tasks across eleven datasets. Please refer to Appendix B for details.


This review was generated by AI and checked by a human editor.