QUICK REVIEW

[Paper Review] Evaluation of GPT-3.5 and GPT-4 for supporting real-world information needs in healthcare delivery

Debadutta Dash, Rahul Thapa|arXiv (Cornell University)|Apr 26, 2023

Artificial Intelligence in Healthcare and Education|Cited by 21

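One-line summary

This study evaluates GPT-3.5 and GPT-4 as support for answering physicians' questions submitted to an informatics consultation service. Concordance with the expert reports was limited, no response was judged harmful by a majority of physicians, and the authors stress the need for further work on prompt engineering and model tailoring.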

ABSTRACT

Despite growing interest in using large language models (LLMs) in healthcare, current explorations do not assess the real-world utility and safety of LLMs in clinical settings. Our objective was to determine whether two LLMs can serve information needs submitted by physicians as questions to an informatics consultation service in a safe and concordant manner. Sixty six questions from an informatics consult service were submitted to GPT-3.5 and GPT-4 via simple prompts. 12 physicians assessed the LLM responses' possibility of patient harm and concordance with existing reports from an informatics consultation service. Physician assessments were summarized based on majority vote. For no questions did a majority of physicians deem either LLM response as harmful. For GPT-3.5, responses to 8 questions were concordant with the informatics consult report, 20 discordant, and 9 were unable to be assessed. There were 29 responses with no majority on "Agree", "Disagree", and "Unable to assess". For GPT-4, responses to 13 questions were concordant, 15 discordant, and 3 were unable to be assessed. There were 35 responses with no majority. Responses from both LLMs were largely devoid of overt harm, but less than 20% of the responses agreed with an answer from an informatics consultation service, responses contained hallucinated references, and physicians were divided on what constitutes harm. These results suggest that while general purpose LLMs are able to provide safe and credible responses, they often do not meet the specific information need of a given question. A definitive evaluation of the usefulness of LLMs in healthcare settings will likely require additional research on prompt engineering, calibration, and custom-tailoring of general purpose models.

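Motivation and objectives

Evaluate whether two large language models (GPT-3.5 and GPT-4) can safely answer informatics questions submitted by physicians.
Evaluate the concordance between LLM responses and established informatics consultation reports.
Identify safety concerns in real-world clinical questions, including potential patient harm and hallucinations.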

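Proposed method

Submit 66 physician questions from an informatics consultation service to GPT-3.5 and GPT-4 using simple prompts.
Have 12 physicians assess whether each LLM response could harm patients and whether it is concordant with the informatics consultation report.
Summarize physician assessments by majority vote to judge safety and concordance.
Report counts of concordant, discordant, and unable-to-assess responses for each model.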
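
The majority-vote summary step can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the sample ratings, and the rule that a "majority" means strictly more than half of the raters on a question are all assumptions, since the abstract does not spell out the threshold.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by more than half of the raters, else None.

    Assumption: "majority" means a strict majority of the votes cast.
    """
    if not votes:
        return None
    label, top = Counter(votes).most_common(1)[0]
    return label if top > len(votes) / 2 else None


def summarize(ratings_by_question):
    """Count how many questions fall under each majority label."""
    summary = Counter()
    for votes in ratings_by_question.values():
        summary[majority_label(votes) or "No majority"] += 1
    return dict(summary)


# Made-up ratings from 12 hypothetical raters; not data from the paper.
example = {
    "q1": ["Agree"] * 7 + ["Disagree"] * 5,
    "q2": ["Agree"] * 6 + ["Disagree"] * 6,
    "q3": ["Disagree"] * 8 + ["Unable to assess"] * 4,
}
print(summarize(example))
```

Under this rule a 6-6 split among 12 raters counts as "No majority", which is one plausible reason many responses could end up without a majority label.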

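Experimental results

Research questions

RQ1: Can GPT-3.5 and GPT-4 provide safe answers to real-world physician information needs in healthcare delivery?
RQ2: To what extent do LLM responses agree with established informatics consultation reports?
RQ3: What patterns of harm, hallucination, or discordance appear in LLM output for clinical queries?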

Key findings

For no question did a majority of physicians judge either LLM's response to be harmful.
GPT-3.5: 8 concordant, 20 discordant, 9 unable to assess; 29 with no majority across Agree/Disagree/Unable.
GPT-4: 13 concordant, 15 discordant, 3 unable to assess; 35 with no majority across Agree/Disagree/Unable.
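Responses from both LLMs were largely free of overt harm, but they sometimes contained hallucinated references and often did not agree with the informatics consult reports.
Fewer than 20% of responses agreed with the answer from the informatics consultation service.
General-purpose LLMs can be safe, but the results indicate that they do not deliver reliable utility without further prompt engineering and customization.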
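
The "under 20%" figure can be checked from the counts reported in the abstract. The short sketch below only recomputes the arithmetic. The variable and function names are illustrative, and it assumes the concordance rate is taken over all 66 questions per model.

```python
TOTAL_QUESTIONS = 66

# Counts as reported in the abstract, per model.
counts = {
    "GPT-3.5": {"concordant": 8, "discordant": 20,
                "unable to assess": 9, "no majority": 29},
    "GPT-4": {"concordant": 13, "discordant": 15,
              "unable to assess": 3, "no majority": 35},
}


def concordance_rate(model):
    """Share of all questions whose response was rated concordant."""
    tally = counts[model]
    # Sanity check: the four categories should cover every question.
    assert sum(tally.values()) == TOTAL_QUESTIONS
    return tally["concordant"] / TOTAL_QUESTIONS


for model in counts:
    print(f"{model}: {concordance_rate(model):.1%}")
# prints:
# GPT-3.5: 12.1%
# GPT-4: 19.7%
```

Both rates sit below 20%, and for each model the four categories sum to 66, so the reported counts are internally consistent.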

This review was generated by AI and checked by a human editor.