QUICK REVIEW

[Paper Review] A comparative study of artificial intelligence and human doctors for the purpose of triage and diagnosis

Salman Razzaki, Adam Baker|arXiv (Cornell University)|2018. 06. 27.

Clinical Reasoning and Diagnostic Skills|References 1|Citations 44

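One-line summary

This study prospectively validates an AI triage and diagnostic system against human doctors using realistic vignettes, finding that the AI's diagnostic accuracy is comparable to that of doctors and that its triage recommendations are, on average, safer.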

ABSTRACT

Online symptom checkers have significant potential to improve patient care, however their reliability and accuracy remain variable. We hypothesised that an artificial intelligence (AI) powered triage and diagnostic system would compare favourably with human doctors with respect to triage and diagnostic accuracy. We performed a prospective validation study of the accuracy and safety of an AI powered triage and diagnostic system. Identical cases were evaluated by both an AI system and human doctors. Differential diagnoses and triage outcomes were evaluated by an independent judge, who was blinded from knowing the source (AI system or human doctor) of the outcomes. Independently of these cases, vignettes from publicly available resources were also assessed to provide a benchmark to previous studies and the diagnostic component of the MRCGP exam. Overall we found that the Babylon AI powered Triage and Diagnostic System was able to identify the condition modelled by a clinical vignette with accuracy comparable to human doctors (in terms of precision and recall). In addition, we found that the triage advice recommended by the AI System was, on average, safer than that of human doctors, when compared to the ranges of acceptable triage provided by independent expert judges, with only a minimal reduction in appropriateness.

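Motivation and objectives

Evaluate the diagnostic accuracy of an AI-powered triage and diagnostic system (Babylon) in comparison with human doctors.
Assess the safety and appropriateness of the AI's triage recommendations.
Examine information-gathering and history-taking ability through a semi-naturalistic, OSCE-style design.
Benchmark AI performance against publicly available case vignettes and established exam material.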

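Proposed method

Use semi-naturalistic role-play in the form of OSCE-style mock consultations.
Compare the AI system's outputs with those of multiple doctors, as assessed by an independent judge blinded to the source.
Evaluate differential diagnoses and triage decisions using recall, precision, and F1.
Include experts' qualitative assessment of differential diagnosis quality and triage safety.
Test how sensitive the AI's results are to its internal thresholds by adjusting them to mimic doctor-like behaviour.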
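
The recall, precision, and F1 metrics and the threshold adjustment listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the data layout, the function names (`differential`, `score`, `sweep`), the toy probabilities, and the micro-averaging over vignettes. The review does not describe how the Babylon system actually builds or thresholds its differentials.

```python
from typing import Dict, List, Set, Tuple


def differential(posterior: Dict[str, float], threshold: float) -> Set[str]:
    """Keep every condition whose (hypothetical) probability meets the threshold."""
    return {cond for cond, prob in posterior.items() if prob >= threshold}


def score(cases: List[Tuple[Set[str], Set[str]]]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over (relevant, listed) pairs."""
    hits = sum(len(relevant & listed) for relevant, listed in cases)
    n_listed = sum(len(listed) for _, listed in cases)
    n_relevant = sum(len(relevant) for relevant, _ in cases)
    precision = hits / n_listed if n_listed else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy vignettes (invented): the modelled condition and made-up model probabilities.
VIGNETTES = [
    ({"migraine"}, {"migraine": 0.60, "tension headache": 0.30, "sinusitis": 0.05}),
    ({"appendicitis"}, {"gastroenteritis": 0.50, "appendicitis": 0.20, "UTI": 0.15}),
]


def sweep(threshold: float) -> Tuple[float, float, float]:
    """Score the toy vignettes with differentials cut at the given threshold."""
    cases = [(relevant, differential(post, threshold)) for relevant, post in VIGNETTES]
    return score(cases)


strict = sweep(0.5)   # short differentials
lenient = sweep(0.1)  # longer differentials
```

On this toy data, lowering the threshold lengthens each differential, which raises recall and lowers precision. That is the kind of trade-off the threshold adjustment is meant to probe.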

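Experimental results

Research questions

RQ1: Can the AI-powered triage and diagnostic system identify the condition modelled by a vignette with accuracy (precision and recall) comparable to human doctors?
RQ2: Measured against the ranges of acceptable triage set by independent judges, are the AI's triage recommendations as safe as, or safer than, those of human doctors?
RQ3: How does the AI's performance compare in expert-rated differential quality and against established exam benchmarks?
RQ4: How does adjusting internal thresholds affect the AI system's recall versus precision relative to doctors?
RQ5: Do the AI's outputs generalise to publicly available vignette benchmarks (Semigran 2015, MRCGP AKT/CSA)?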

Key results

Metric	Doctor A	Doctor B	Doctor C	Doctor D	Doctor E	Doctor F	Doctor G	Babylon AI	Average doctor
Recall	80.9%	64.1%	93.8%	84.3%	90.0%	90.2%	84.3%	80.0%	83.9%
Precision	42.9%	36.8%	53.5%	38.1%	33.9%	43.3%	56.5%	44.4%	43.6%
F1 score	56.1%	46.7%	68.1%	52.5%	49.2%	58.5%	67.7%	57.1%	57.0%
Number of vignettes	47	78	48	51	70	51	51	100	56.6
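
As a consistency check on the table above, F1 can be recomputed as the harmonic mean of precision and recall. The sketch below is illustrative only: the names `f1`, `DOCTORS`, `BABYLON_AI`, and `max_abs_gap` are introduced here, and the inputs are the rounded percentages from the table, so small rounding gaps are expected.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (percent in, percent out)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %, reported F1 %) copied from the table.
DOCTORS = {
    "A": (80.9, 42.9, 56.1),
    "B": (64.1, 36.8, 46.7),
    "C": (93.8, 53.5, 68.1),
    "D": (84.3, 38.1, 52.5),
    "E": (90.0, 33.9, 49.2),
    "F": (90.2, 43.3, 58.5),
    "G": (84.3, 56.5, 67.7),
}
BABYLON_AI = (80.0, 44.4, 57.1)


def max_abs_gap(rows) -> float:
    """Largest gap between recomputed and reported F1 across the given rows."""
    return max(abs(f1(p, r) - reported) for r, p, reported in rows)


# Two ways to summarise the seven doctors in a single F1 figure.
doctor_f1 = [f1(p, r) for r, p, _ in DOCTORS.values()]
mean_of_f1 = sum(doctor_f1) / len(doctor_f1)  # average the per-doctor F1 scores
f1_of_means = f1(43.6, 83.9)                  # F1 of the averaged precision and recall
```

The recomputed values agree with the reported F1 column to within 0.1 percentage points. The 57.0% in the average-doctor column matches the mean of the seven per-doctor F1 scores rather than the F1 of the averaged precision and recall (about 57.4%).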

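The AI system achieves recall and precision similar to those of doctors on the vignettes (Babylon AI: recall 80.0%, precision 44.4%, F1 57.1%).
Mean across all 7 doctors: recall 83.9%, precision 43.6%, F1 57.0%.
The AI's triage safety (97.0%) exceeds that of doctors (mean 93.1%), while its appropriateness is similar or slightly lower (AI 90.0% vs. doctors 90.5%).
Expert judges rated the quality of the AI's differentials as comparable to that of doctors (83.0% and 83.0%–83.0%–? across different panels); in the GP panel, the AI sometimes received lower scores, depending on the rater.
On the Semigran 2015 vignettes, the AI's top-1 recall is 70.0% and its top-3 recall is 96.7%, compared with 75.3% and 90.3% for doctors: lower at top-1 but higher at top-3.
On the AKT/CSA benchmarks, the AI's top-3 inclusion rate was 86.7% (AKT) and 75.0% (CSA).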
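
The distinction between triage safety and appropriateness can be sketched in a few lines of Python. The definitions are assumptions made for illustration: a recommendation counts as "safe" if it is at least as urgent as the least urgent option the judges accept, and as "appropriate" if it falls inside the judges' acceptable range. The urgency levels, names, and example cases are hypothetical and do not come from the study.

```python
from enum import IntEnum
from typing import List, Tuple


class Triage(IntEnum):
    """Hypothetical urgency scale, ordered from least to most urgent."""
    SELF_CARE = 0
    PHARMACY = 1
    GP = 2
    URGENT = 3
    EMERGENCY = 4


def is_safe(rec: Triage, lowest_ok: Triage) -> bool:
    """Assumed definition: never less urgent than the judges' minimum."""
    return rec >= lowest_ok


def is_appropriate(rec: Triage, lowest_ok: Triage, highest_ok: Triage) -> bool:
    """Assumed definition: inside the judges' acceptable range."""
    return lowest_ok <= rec <= highest_ok


def rates(cases: List[Tuple[Triage, Triage, Triage]]) -> Tuple[float, float]:
    """Return (safety rate, appropriateness rate) over (rec, low, high) cases."""
    n = len(cases)
    safe = sum(is_safe(rec, lo) for rec, lo, _ in cases)
    appropriate = sum(is_appropriate(rec, lo, hi) for rec, lo, hi in cases)
    return safe / n, appropriate / n


# Invented example: (recommendation, lowest acceptable, highest acceptable).
CASES = [
    (Triage.GP, Triage.SELF_CARE, Triage.GP),        # safe and appropriate
    (Triage.EMERGENCY, Triage.GP, Triage.URGENT),    # over-triage: safe, not appropriate
    (Triage.PHARMACY, Triage.GP, Triage.URGENT),     # under-triage: neither
    (Triage.URGENT, Triage.URGENT, Triage.EMERGENCY) # safe and appropriate
]

safety, appropriateness = rates(CASES)
```

Under these assumed definitions, over-triage counts as safe but not appropriate, which is one way a system could score higher on safety while scoring slightly lower on appropriateness.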

This review was generated by AI and reviewed by a human editor.