QUICK REVIEW

[Paper Review] Me LLaMA: Foundation Large Language Models for Medical Applications

Qianqian Xie, Qingyu Chen | arXiv (Cornell University) | 2024-02-20

Machine Learning in Healthcare | Citations: 7

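One-line summary

Me-LLaMA is a family of medical-domain LLMs built on the open-source LLaMA models and optimized for medical text analysis and diagnosis through domain-specific continued pre-training and instruction tuning; across several settings it achieves stronger zero-shot, supervised, and complex-case performance than other open models and is competitive with ChatGPT and GPT-4.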

ABSTRACT

Recent advancements in large language models (LLMs) like ChatGPT and LLaMA show promise in medical applications, yet challenges remain in medical language comprehension. This study presents Me-LLaMA, a new medical LLM family based on open-source LLaMA models, optimized for medical text analysis and diagnosis by leveraging large-scale, domain-specific datasets. The Me-LLaMA family, including foundation models Me-LLaMA 13/70B and their chat-enhanced versions, was developed through continued pre-training and instruction tuning with 129B tokens and 214K samples from biomedical and clinical sources. Training the 70B models required over 100,000 A100 GPU hours. Me-LLaMA's performance was evaluated across six medical text analysis tasks using 12 benchmark datasets and complex clinical case diagnosis, with automatic and human evaluations. Results indicate Me-LLaMA outperforms LLaMA and other open-source medical LLMs in zero-shot and supervised settings. Task-specific tuning further boosts performance, surpassing ChatGPT on 7 of 8 datasets and GPT-4 on 5 of 8. For complex clinical cases, Me-LLaMA achieves performance comparable to ChatGPT and GPT-4. This work underscores the importance of domain-specific data in developing medical LLMs and addresses the high computational costs involved in training, highlighting a balance between pre-training and fine-tuning strategies. Me-LLaMA models are now accessible under user agreements, providing a valuable resource for advancing medical AI.

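Motivation and objectives

Establish the need for domain-specific LLMs in medicine to improve language understanding and diagnostic support.
Develop a family of medical LLMs (Me-LLaMA 13B/70B) through continued pre-training and instruction tuning on biomedical and clinical data.
Evaluate performance on multiple medical text analysis tasks and on complex clinical case diagnosis, using both automatic and human evaluation.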

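Proposed method

Continue pre-training the LLaMA models on 129B tokens of domain-specific data to obtain the Me-LLaMA foundation models (13B and 70B).
Build the chat-enhanced versions by instruction tuning on 214K biomedical and clinical samples.
Allocate substantial compute to training the 70B models (over 100,000 A100 GPU hours).
Evaluate on six medical text analysis tasks using 12 benchmark datasets, plus complex clinical case diagnosis.
Compare zero-shot and supervised performance against LLaMA and other open-source medical LLMs, and compare against ChatGPT and GPT-4 after task-specific tuning.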
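
To put the compute figure above in perspective, the following minimal sketch converts a GPU-hour budget into wall-clock time. Only the 100,000 A100 GPU-hour lower bound comes from the abstract; the cluster sizes and the helper name `wall_clock_days` are hypothetical, chosen for illustration, and the calculation assumes perfect parallel scaling, which real training runs do not achieve.

```python
def wall_clock_days(gpu_hours: float, num_gpus: int) -> float:
    """Convert a GPU-hour budget to wall-clock days, assuming ideal scaling."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return gpu_hours / num_gpus / 24.0


# Lower bound reported in the abstract for training the 70B models.
GPU_HOURS_70B = 100_000

# Hypothetical cluster sizes; the source does not state how many GPUs were used.
for n in (8, 64, 256):
    days = wall_clock_days(GPU_HOURS_70B, n)
    print(f"{n:>4} GPUs: at least {days:.1f} days")
```

Because the reported figure is a lower bound and scaling is never perfect, the printed values should be read as optimistic estimates.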

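Experimental results

Research questions

RQ1: Can domain-adapted LLMs trained on biomedical and clinical data outperform general-purpose LLMs and other open-source medical LLMs on core medical text analysis tasks?
RQ2: How does task-specific instruction tuning affect performance on medical benchmarks compared with the zero-shot setting?
RQ3: Can the Me-LLaMA models match or surpass state-of-the-art closed models (ChatGPT, GPT-4) across multiple medical datasets and complex clinical scenarios?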

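Key findings

Me-LLaMA outperforms LLaMA and other open-source medical LLMs on the six medical text analysis tasks in both zero-shot and supervised settings.
Task-specific tuning further improves performance, surpassing ChatGPT on 7 of 8 datasets.
After tuning, it surpasses GPT-4 on 5 of 8 datasets.
On complex clinical cases, Me-LLaMA achieves performance comparable to ChatGPT and GPT-4.
The work highlights the value of domain-specific data and the trade-off between pre-training scale and fine-tuning cost.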

This review was generated by AI and reviewed by a human editor.