QUICK REVIEW

[Paper Review] Capabilities of GPT-4 on Medical Challenge Problems

Harsha Nori, Nicholas King|arXiv (Cornell University)|Mar 20, 2023

Artificial Intelligence in Healthcare and Education | Cited by 497

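ONE-SENTENCE SUMMARY

This paper evaluates GPT-4 (text only) on USMLE-style exams and the MultiMedQA benchmarks, and reports strong out-of-the-box medical reasoning, better calibration than GPT-3.5, notable qualitative capabilities, and scores that exceed GPT-3.5 and are competitive with published baselines.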

ABSTRACT

Large language models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation across various domains, including medicine. We present a comprehensive evaluation of GPT-4, a state-of-the-art LLM, on medical competency examinations and benchmark datasets. GPT-4 is a general-purpose model that is not specialized for medical problems through training or engineered to solve clinical tasks. Our analysis covers two sets of official practice materials for the USMLE, a three-step examination program used to assess clinical competency and grant licensure in the United States. We also evaluate performance on the MultiMedQA suite of benchmark datasets. Beyond measuring model performance, experiments were conducted to investigate the influence of test questions containing both text and images on model performance, probe for memorization of content during training, and study probability calibration, which is of critical importance in high-stakes applications like medicine. Our results show that GPT-4, without any specialized prompt crafting, exceeds the passing score on USMLE by over 20 points and outperforms earlier general-purpose models (GPT-3.5) as well as models specifically fine-tuned on medical knowledge (Med-PaLM, a prompt-tuned version of Flan-PaLM 540B). In addition, GPT-4 is significantly better calibrated than GPT-3.5, demonstrating a much-improved ability to predict the likelihood that its answers are correct. We also explore the behavior of the model qualitatively through a case study that shows the ability of GPT-4 to explain medical reasoning, personalize explanations to students, and interactively craft new counterfactual scenarios around a medical case. Implications of the findings are discussed for potential uses of GPT-4 in medical education, assessment, and clinical practice, with appropriate attention to challenges of accuracy and safety.

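MOTIVATION AND OBJECTIVES

Evaluate GPT-4's performance on official USMLE practice materials (Steps 1-3) and on the MultiMedQA benchmark suite.
Compare GPT-4 with GPT-3.5 and published baselines (e.g., ChatGPT, Flan-PaLM 540B, Med-PaLM) under zero-shot and few-shot prompting.
Analyze questions that reference media, the calibration of predicted probabilities, and possible memorization of training data.
Examine qualitative capabilities such as explaining medical reasoning and generating counterfactual scenarios.
Discuss implications for medical education, assessment, and clinical practice, with attention to safety and accuracy.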

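PROPOSED METHOD

Use the text-only GPT-4 model with zero-shot prompts and with 5-shot prompts whose exemplars are randomly selected, following established templates.
Evaluate on six medical datasets: USMLE Sample Exam, USMLE Self Assessments, MedQA, PubMedQA, MedMCQA, and the medical components of MMLU.
Where available, compare GPT-4 with GPT-3.5 and with published results for Flan-PaLM 540B and Med-PaLM.
Evaluate performance on questions both with and without images (using text-only prompts), and analyze calibration through probability estimates over the multiple-choice options.
Investigate memorization with the black-box MELD (Memorization Effects Levenshtein Detector) heuristic and discuss the potential for data leakage.
Explore the potential benefits and limitations of prompting strategies such as chain-of-thought and curated exemplars, as well as the effect of model alignment and safety tuning.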
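
The review names MELD only as a black-box, Levenshtein-based memorization check, so the following is a minimal sketch of how such a check could work, not the paper's implementation. It assumes the question is split in half, the model is asked to continue the first half, and the continuation is compared with the held-out half. The `complete` callable, the midpoint split, and the 0.95 threshold are all illustrative assumptions.

```python
from typing import Callable


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def looks_memorized(question: str,
                    complete: Callable[[str], str],
                    threshold: float = 0.95) -> bool:
    """Flag a question whose second half the model reproduces almost verbatim.

    `complete` is a hypothetical stand-in for a model call that returns a
    continuation of the given prefix. The midpoint split and the default
    threshold are assumptions made for this sketch.
    """
    half = len(question) // 2
    prefix, held_out = question[:half], question[half:]
    generated = complete(prefix)[:len(held_out)]
    return similarity(generated, held_out) >= threshold
```

A check like this can only produce positive evidence: a near-verbatim continuation suggests the text was seen in training, while a low similarity score does not rule out exposure.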

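EXPERIMENTAL RESULTS

Research Questions

RQ1: How does GPT-4 perform on official USMLE practice materials (Steps 1-3) compared with GPT-3.5 and other medical LLM baselines?
RQ2: How does GPT-4 perform across the MultiMedQA benchmark suite, including MedQA, PubMedQA, MedMCQA, and MMLU?
RQ3: How does it handle text-only questions versus questions that reference images, and how well calibrated are its predicted probabilities?
RQ4: Is there evidence that GPT-4's outputs reflect memorized exam content, and what would that imply for the benchmarks?
RQ5: What qualitative capabilities emerge when GPT-4 explains its reasoning or engages in interactive, counterfactual medical case scenarios?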

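Key Findings

GPT-4 exceeds the USMLE passing threshold by more than 20 points on the exam samples and outperforms GPT-3.5 by more than 30 points on the USMLE materials.
On both the USMLE Self Assessment and the Sample Exam, GPT-4 is far more accurate than GPT-3.5 in the zero-shot and 5-shot settings alike (e.g., a Self Assessment average of 86.65% for GPT-4 versus 53.61% for GPT-3.5).
On most MultiMedQA tasks, GPT-4 outperforms GPT-3.5 and Flan-PaLM 540B. PubMedQA is the exception, where GPT-4 does not always exceed some baselines.
GPT-4 (text only) performs well even on questions that reference media not passed to the model, reaching 70-80% accuracy from the text alone.
On multiple-choice questions, GPT-4 is markedly better calibrated than GPT-3.5: its predicted probabilities track the actual rate of correct answers (e.g., at one data point, a predicted probability of 0.96 corresponds to an actual accuracy of 93%).
The base model (GPT-4-base) scores 3-5 points higher than the tuned release version on several datasets, suggesting that alignment-focused safety tuning may affect raw performance.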
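
The calibration finding above compares predicted probabilities with observed accuracy. One common way to make that comparison concrete is to bin answers by confidence and measure the gap in each bin, summarized as an expected calibration error (ECE). The sketch below shows that general technique; the review does not say which binning or metric the paper used, so the function names, the equal-width bins, and the ECE summary are all assumptions.

```python
from typing import List, Sequence, Tuple


def calibration_bins(confidences: Sequence[float],
                     correct: Sequence[bool],
                     n_bins: int = 10) -> List[Tuple[float, float, int]]:
    """Group answers into equal-width confidence bins.

    Returns one (mean confidence, observed accuracy, count) row per
    non-empty bin. Confidences are expected to lie in [0, 1].
    """
    buckets: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        buckets[idx].append((p, ok))
    rows = []
    for bucket in buckets:
        if bucket:
            mean_conf = sum(p for p, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            rows.append((mean_conf, accuracy, len(bucket)))
    return rows


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Count-weighted average of |confidence - accuracy| over the bins."""
    rows = calibration_bins(confidences, correct, n_bins)
    total = sum(count for _, _, count in rows)
    if total == 0:
        return 0.0
    return sum(count / total * abs(conf - acc) for conf, acc, count in rows)


if __name__ == "__main__":
    # Toy values for illustration only; these are not results from the paper.
    toy_conf = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]
    toy_correct = [True, True, True, False, True, False, False, False]
    print(calibration_bins(toy_conf, toy_correct))
    print(expected_calibration_error(toy_conf, toy_correct))
```

A well-calibrated model yields rows where mean confidence and observed accuracy are close, which is the pattern the 0.96 versus 93% data point illustrates.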


This review was generated by AI and checked by a human editor.