QUICK REVIEW

[Paper Review] People over trust AI-generated medical responses and view them to be as valid as doctors, despite low accuracy

Shruthi Shekar, Pat Pataranutaporn|arXiv (Cornell University)|Aug 11, 2024

Artificial Intelligence in Healthcare and Education|Cited by 11

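One-Sentence Summary

This study shows that non-experts cannot tell AI-generated medical responses apart from doctors' responses and tend to regard the AI responses as valid and trustworthy, which heightens safety concerns, particularly when an AI response is presented as highly accurate.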

ABSTRACT

This paper presents a comprehensive analysis of how AI-generated medical responses are perceived and evaluated by non-experts. A total of 300 participants gave evaluations for medical responses that were either written by a medical doctor on an online healthcare platform, or generated by a large language model and labeled by physicians as having high or low accuracy. Results showed that participants could not effectively distinguish between AI-generated and Doctors' responses and demonstrated a preference for AI-generated responses, rating High Accuracy AI-generated responses as significantly more valid, trustworthy, and complete/satisfactory. Low Accuracy AI-generated responses on average performed very similar to Doctors' responses, if not more. Participants not only found these low-accuracy AI-generated responses to be valid, trustworthy, and complete/satisfactory but also indicated a high tendency to follow the potentially harmful medical advice and incorrectly seek unnecessary medical attention as a result of the response provided. This problematic reaction was comparable if not more to the reaction they displayed towards doctors' responses. This increased trust placed on inaccurate or inappropriate AI-generated medical advice can lead to misdiagnosis and harmful consequences for individuals seeking help. Further, participants were more trusting of High Accuracy AI-generated responses when told they were given by a doctor and experts rated AI-generated responses significantly higher when the source of the response was unknown. Both experts and non-experts exhibited bias, finding AI-generated responses to be more thorough and accurate than Doctors' responses but still valuing the involvement of a Doctor in the delivery of their medical advice. Ensuring AI systems are implemented with medical professionals should be the future of using AI for the delivery of medical advice.

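Motivation and Objectives

Assess whether laypeople can distinguish AI-generated medical responses from responses provided by doctors.
Evaluate how AI and doctors' responses are perceived in terms of validity, trustworthiness, completeness, and users' intent to act on them.
Examine how knowing the source of a response affects perception and trust.
Explore whether expert evaluators show bias depending on whether the source is disclosed or unknown.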

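Proposed Method

Collect 150 AI-generated responses to HealthTap questions and have 4 physicians rate their accuracy (yes, maybe, no) to classify the outputs as High Accuracy or Low Accuracy AI responses.
Build an evaluation dataset of 30 doctors' responses, 30 High Accuracy AI responses, and 30 Low Accuracy AI responses, to be shown to 100 online participants.
Experiment 1: Participants rate their understanding of each response and judge its source, and their confidence in the source judgment is measured. AI and doctors' responses are compared.
Experiment 2: Participants evaluate AI and doctors' responses without being told the source; validity, trust, completeness, and behavioral intent are measured.
Experiment 3: A random-label experiment to detect evaluation bias caused by the stated source (doctor, AI, or doctor assisted by AI).
Additional physician evaluations under blinded and unblinded conditions to assess expert bias when the source is disclosed.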
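
The labeling and sampling steps above can be sketched in Python. This is a minimal illustration, not the authors' procedure: the review does not say how the four physicians' yes/maybe/no ratings were combined, so the `min_agree` threshold, the `unclear` bucket, and the function names `classify_response` and `sample_per_category` are all assumptions, and the ratings are made up.

```python
import random
from collections import Counter

VALID_RATINGS = {"yes", "maybe", "no"}


def classify_response(ratings, min_agree=3):
    """Map a list of physician ratings to 'high', 'low', or 'unclear'.

    ASSUMPTION: this threshold rule is hypothetical. With four raters and
    min_agree=3, the 'high' and 'low' conditions cannot both hold.
    """
    invalid = set(ratings) - VALID_RATINGS
    if invalid:
        raise ValueError(f"unexpected ratings: {sorted(invalid)}")
    counts = Counter(ratings)
    if counts["yes"] >= min_agree:
        return "high"
    if counts["no"] >= min_agree:
        return "low"
    return "unclear"


def sample_per_category(labeled, per_category=30, seed=0):
    """Draw up to per_category response IDs for 'high' and for 'low'.

    labeled maps a response ID to its category string.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    dataset = {}
    for category in ("high", "low"):
        pool = sorted(rid for rid, cat in labeled.items() if cat == category)
        dataset[category] = rng.sample(pool, min(per_category, len(pool)))
    return dataset


# Made-up ratings for three hypothetical AI responses.
ratings_by_id = {
    "r1": ["yes", "yes", "yes", "maybe"],
    "r2": ["no", "no", "no", "no"],
    "r3": ["yes", "maybe", "no", "yes"],
}
labeled = {rid: classify_response(r) for rid, r in ratings_by_id.items()}
print(labeled)
print(sample_per_category(labeled, per_category=1))
```

Keeping an explicit `unclear` bucket is a design choice of this sketch: responses on which raters disagree are left out of both AI categories rather than forced into one.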

Figure 1: Visual summary of the dataset construction and pipeline of experiments discussed in this paper.

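Experimental Results

Research Questions

RQ1: Can participants distinguish AI-generated responses from responses provided by doctors?
RQ2: How do participants rate AI and doctors' responses on validity, trustworthiness, completeness, and behavioral intent?
RQ3: Does knowing the source of a response (doctor or AI) affect perception and trust?
RQ4: Do experts show bias when evaluating AI outputs depending on whether the source is disclosed?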

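Key Findings

Participants could not distinguish AI-generated responses from doctors' responses; source-identification accuracy was roughly 50% across response types.
In Experiment 2, High Accuracy AI-generated responses were rated significantly higher than doctors' responses on validity, trustworthiness, and completeness/satisfaction.
Low Accuracy AI-generated responses were rated on par with, and in some cases above, doctors' responses on validity, trust, and completeness.
When the source was unknown, participants generally tended to trust AI-generated responses; trust rose further when High Accuracy AI responses were labeled as coming from a doctor.
Experts rated AI-generated responses highly when the source was unknown, but their ratings dropped substantially once they knew the source was AI.
The study highlights a misinformation risk: without clinician oversight, lay users may follow harmful AI advice.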
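
To make the "roughly 50%" identification finding concrete, one way to check whether an accuracy differs from coin-flip guessing is an exact two-sided binomial test. The sketch below uses only the Python standard library. The counts are made-up placeholders, not figures from the paper, the function names are hypothetical, and the review does not state which statistical test the authors actually used.

```python
from math import comb


def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binom_p(k, n, p=0.5):
    """Exact two-sided p-value for observing k successes out of n.

    Sums the probabilities of all outcomes that are no more likely
    than the observed one under the null success rate p.
    """
    observed = binom_pmf(k, n, p)
    tolerance = 1 + 1e-12  # guards against floating-point ties
    total = sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= observed * tolerance
    )
    return min(1.0, total)


# Hypothetical example: 52 correct source judgments out of 100.
correct, trials = 52, 100
p_value = two_sided_binom_p(correct, trials)
print(f"accuracy={correct / trials:.2f}, p={p_value:.3f}")
```

A large p-value here would mean the observed accuracy is consistent with guessing, which is the pattern the finding describes.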

Figure 2: Example Medical Questions by Category: Comparing Doctors’ and AI-Generated Responses

This review was generated by AI and checked by a human editor.