QUICK REVIEW

[Paper Review] People over trust AI-generated medical responses and view them to be as valid as doctors, despite low accuracy

Shruthi Shekar, Pat Pataranutaporn|arXiv (Cornell University)|2024. 08. 11.

Artificial Intelligence in Healthcare and Education|Citations: 11

One-Line Summary

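Non-experts cannot distinguish AI-generated medical responses from doctors' responses and often rate the AI responses as equally valid or more trustworthy. Because this trust extends to low-accuracy AI responses, the study raises safety concerns.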

ABSTRACT

This paper presents a comprehensive analysis of how AI-generated medical responses are perceived and evaluated by non-experts. A total of 300 participants gave evaluations for medical responses that were either written by a medical doctor on an online healthcare platform, or generated by a large language model and labeled by physicians as having high or low accuracy. Results showed that participants could not effectively distinguish between AI-generated and Doctors' responses and demonstrated a preference for AI-generated responses, rating High Accuracy AI-generated responses as significantly more valid, trustworthy, and complete/satisfactory. Low Accuracy AI-generated responses on average performed very similar to Doctors' responses, if not more. Participants not only found these low-accuracy AI-generated responses to be valid, trustworthy, and complete/satisfactory but also indicated a high tendency to follow the potentially harmful medical advice and incorrectly seek unnecessary medical attention as a result of the response provided. This problematic reaction was comparable if not more to the reaction they displayed towards doctors' responses. This increased trust placed on inaccurate or inappropriate AI-generated medical advice can lead to misdiagnosis and harmful consequences for individuals seeking help. Further, participants were more trusting of High Accuracy AI-generated responses when told they were given by a doctor and experts rated AI-generated responses significantly higher when the source of the response was unknown. Both experts and non-experts exhibited bias, finding AI-generated responses to be more thorough and accurate than Doctors' responses but still valuing the involvement of a Doctor in the delivery of their medical advice. Ensuring AI systems are implemented with medical professionals should be the future of using AI for the delivery of medical advice.

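Research Motivation and Goals

Assess whether lay people can distinguish AI-generated medical responses from responses given by doctors.
Evaluate perceived validity, trustworthiness, and completeness of AI versus doctors' responses, along with participants' intent to act on them.
Examine how knowledge of a response's source affects perception and trust.
Investigate whether expert evaluators show bias depending on whether the source is disclosed or undisclosed.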

Proposed Method

Collect 150 AI-generated responses to HealthTap questions; four physicians rate the accuracy of each (Yes, Maybe, No), and the responses are classified as High or Low Accuracy AI outputs.
Build a dataset of 30 doctors' responses, 30 High Accuracy AI responses, and 30 Low Accuracy AI responses, to be evaluated by 100 online participants.
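Experiment 1: participants rate their understanding of each response, judge its source, and rate their confidence in that source judgment, comparing AI and doctors' responses.
Experiment 2: without knowing the true source, participants evaluate AI and doctors' responses on validity, trustworthiness, completeness, and behavioral intent.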
Experiment 3: a randomized-label experiment that assesses bias arising from the stated source (doctor, AI, or doctor assisted by AI).
Additional physician evaluations under Blind and Non-Blind conditions, to assess expert bias against AI when the source is disclosed.
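
The labeling and randomization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the review does not say how the four physicians' ratings were combined, so the mean-score rule, the numeric scores, and the 0.75 threshold are hypothetical, as are the function names and the displayed label strings.

```python
import random

# Hypothetical numeric scores for the three rating options named in the review.
RATING_SCORE = {"Yes": 1.0, "Maybe": 0.5, "No": 0.0}

# Hypothetical display strings for the three source labels in Experiment 3.
SOURCE_LABELS = ["Doctor", "AI", "Doctor assisted by AI"]


def classify_accuracy(ratings, threshold=0.75):
    """Label one AI response High or Low Accuracy from physician ratings.

    The mean-score rule and the 0.75 threshold are assumptions made for
    illustration; the review does not state the actual aggregation rule.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    mean_score = sum(RATING_SCORE[r] for r in ratings) / len(ratings)
    return "High Accuracy" if mean_score >= threshold else "Low Accuracy"


def assign_random_labels(response_ids, seed=0):
    """Give each response a displayed source label chosen uniformly at random."""
    rng = random.Random(seed)  # seeded so the assignment is reproducible
    return {rid: rng.choice(SOURCE_LABELS) for rid in response_ids}


# Toy inputs, not data from the study.
high = classify_accuracy(["Yes", "Yes", "Yes", "Maybe"])  # mean 0.875
low = classify_accuracy(["No", "Maybe", "No", "Yes"])     # mean 0.375
labels = assign_random_labels(["r1", "r2", "r3"])
```

Because the displayed label is drawn independently of the true source, any difference in ratings across labels can be attributed to the label rather than to the content of the response.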

Figure 1: Visual summary of the dataset construction and pipeline of experiments discussed in this paper.

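Experimental Results

Research Questions

RQ1: Can participants distinguish AI-generated responses from responses given by doctors?
RQ2: How do participants rate AI versus doctors' responses on validity, trustworthiness, completeness, and behavioral intent?
RQ3: Does knowledge of a response's source (doctor versus AI) affect perception and trust?
RQ4: Do experts show bias toward AI outputs depending on whether the source is disclosed?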

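Key Results

Participants could not reliably tell AI-generated responses from doctors' responses, with source identification accuracy of roughly 50% for each response type.
In Experiment 2, High Accuracy AI-generated responses were rated significantly more valid, trustworthy, and complete/satisfactory than doctors' responses.
Low Accuracy AI responses performed similarly to doctors' responses on validity, trustworthiness, and completeness, and in some cases outperformed them.
Participants generally trusted AI-generated responses when the source was unknown, but trust in High Accuracy AI responses rose further when they were labeled as coming from a doctor.
Experts rated AI-generated responses higher overall when the source was unknown, but their ratings dropped markedly once they learned the responses came from AI.
The study highlights a misinformation risk: lay users may follow harmful AI advice without clinician oversight.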
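
For readers who want to see how a per-type source identification accuracy like the roughly 50% figure above is computed, here is a minimal Python sketch. The record layout, the function name, and the toy records are all assumptions for illustration; none of the values below come from the study.

```python
from collections import defaultdict


def identification_accuracy(records):
    """Return the fraction of correct source guesses per response type.

    Each record is a (response_type, true_source, guessed_source) tuple.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for response_type, true_source, guessed_source in records:
        total[response_type] += 1
        if guessed_source == true_source:
            correct[response_type] += 1
    return {t: correct[t] / total[t] for t in total}


# Made-up toy records, not data from the study.
toy_records = [
    ("Doctor", "Doctor", "Doctor"),
    ("Doctor", "Doctor", "AI"),
    ("High Accuracy AI", "AI", "AI"),
    ("High Accuracy AI", "AI", "Doctor"),
    ("Low Accuracy AI", "AI", "Doctor"),
    ("Low Accuracy AI", "AI", "AI"),
]
accuracy_by_type = identification_accuracy(toy_records)
```

With two possible sources, an accuracy near 0.5 for every type is what guessing at random would produce, which is why the result is read as an inability to distinguish the sources.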

Figure 2: Example Medical Questions by Category: Comparing Doctors’ and AI-Generated Responses


This review was generated by AI and reviewed by a human editor.