QUICK REVIEW

[Paper Review] A comparative study of artificial intelligence and human doctors for the purpose of triage and diagnosis

Salman Razzaki, Adam Baker|arXiv (Cornell University)|Jun 27, 2018

Clinical Reasoning and Diagnostic Skills | References: 1 | Citations: 44

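One-line summary

This study prospectively validates an AI triage and diagnostic system against human doctors using realistic clinical vignettes. It reports that the AI's diagnostic performance is comparable to the doctors' and that its triage recommendations are, on average, safer.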

ABSTRACT

Online symptom checkers have significant potential to improve patient care, however their reliability and accuracy remain variable. We hypothesised that an artificial intelligence (AI) powered triage and diagnostic system would compare favourably with human doctors with respect to triage and diagnostic accuracy. We performed a prospective validation study of the accuracy and safety of an AI powered triage and diagnostic system. Identical cases were evaluated by both an AI system and human doctors. Differential diagnoses and triage outcomes were evaluated by an independent judge, who was blinded from knowing the source (AI system or human doctor) of the outcomes. Independently of these cases, vignettes from publicly available resources were also assessed to provide a benchmark to previous studies and the diagnostic component of the MRCGP exam. Overall we found that the Babylon AI powered Triage and Diagnostic System was able to identify the condition modelled by a clinical vignette with accuracy comparable to human doctors (in terms of precision and recall). In addition, we found that the triage advice recommended by the AI System was, on average, safer than that of human doctors, when compared to the ranges of acceptable triage provided by independent expert judges, with only a minimal reduction in appropriateness.

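Motivation and objectives

Compare an AI-powered triage and diagnostic system (Babylon) against human doctors.
Assess the safety and appropriateness of AI-driven triage recommendations.
Test information gathering and history-taking ability using a semi-naturalistic OSCE design.
Benchmark the AI's performance against publicly available vignettes and existing exam materials.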

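Proposed method

Run semi-naturalistic role-play consultations in an OSCE-style format.
Have independent, blinded judges compare the AI system's output with that of multiple doctors.
Evaluate differential diagnoses and triage decisions using recall, precision, and the F1 score.
Incorporate expert qualitative assessments of differential quality and triage safety.
Test the AI's sensitivity by varying an internal threshold to mimic the behaviour of different types of doctor.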
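The metric and threshold steps above can be sketched in a few lines. This is a minimal illustration, not the paper's scoring code: it assumes the standard set-based definitions of precision and recall, and it assumes the AI exposes a per-condition score that a single cutoff turns into a differential. The names `precision_recall_f1`, `differential_at_threshold`, the example conditions, and the scores are all hypothetical.

```python
from __future__ import annotations


def precision_recall_f1(predicted: list[str], relevant: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall and F1 for one differential diagnosis."""
    pred = set(predicted)
    hits = len(pred & relevant)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def differential_at_threshold(scores: dict[str, float], threshold: float) -> list[str]:
    """Keep conditions scoring at or above the cutoff, most likely first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [condition for condition, score in ranked if score >= threshold]


# Hypothetical scores for one vignette and the conditions a judge deemed relevant.
scores = {"migraine": 0.60, "tension headache": 0.30, "sinusitis": 0.05}
relevant = {"migraine", "tension headache"}

for threshold in (0.5, 0.2, 0.01):
    differential = differential_at_threshold(scores, threshold)
    p, r, f1 = precision_recall_f1(differential, relevant)
    print(f"threshold={threshold}: {differential} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Lowering the cutoff lengthens the differential, which tends to raise recall and lower precision. That trade-off is what a threshold sweep of this kind would be expected to expose.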

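Experimental results

Research questions

RQ1: Can an AI-powered triage and diagnostic system identify the condition modelled by a vignette with accuracy (precision and recall) comparable to human doctors?
RQ2: Judged against independent criteria, are AI-generated triage recommendations as safe as, or safer than, those of human doctors?
RQ3: How does the AI perform against expert-rated differential quality and existing exam benchmarks?
RQ4: How does the AI's recall compare with the doctors' recall as its internal threshold is adjusted?
RQ5: Do the AI's outputs generalise to public benchmarks such as Semigran 2015 and the MRCGP AKT/CSA?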

Main findings

Metric	Doctor A	Doctor B	Doctor C	Doctor D	Doctor E	Doctor F	Doctor G	Babylon AI	Average Doctor
Recall	80.9%	64.1%	93.8%	84.3%	90.0%	90.2%	84.3%	80.0%	83.9%
Precision	42.9%	36.8%	53.5%	38.1%	33.9%	43.3%	56.5%	44.4%	43.6%
F1-score	56.1%	46.7%	68.1%	52.5%	49.2%	58.5%	67.7%	57.1%	57.0%
Number of Vignettes	47	78	48	51	70	51	51	100	56.6
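The F1 row can be cross-checked from the precision and recall rows using the usual harmonic-mean formula, F1 = 2PR / (P + R). The sketch below copies the values from the table. The 0.1-point tolerance is an assumption that allows for the inputs already being rounded to one decimal place.

```python
# (precision %, recall %, reported F1 %) copied from the table above.
table = {
    "Doctor A": (42.9, 80.9, 56.1),
    "Doctor B": (36.8, 64.1, 46.7),
    "Doctor C": (53.5, 93.8, 68.1),
    "Doctor D": (38.1, 84.3, 52.5),
    "Doctor E": (33.9, 90.0, 49.2),
    "Doctor F": (43.3, 90.2, 58.5),
    "Doctor G": (56.5, 84.3, 67.7),
    "Babylon AI": (44.4, 80.0, 57.1),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


for name, (precision, recall, reported) in table.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed F1 = {computed:.1f}, reported F1 = {reported}")

# Two ways to form an "average doctor" F1 from the seven doctors.
doctor_f1 = [row[2] for name, row in table.items() if name.startswith("Doctor")]
mean_of_f1 = sum(doctor_f1) / len(doctor_f1)
f1_of_means = f1_score(43.6, 83.9)
print(f"mean of per-doctor F1 = {mean_of_f1:.1f}, F1 of mean P and R = {f1_of_means:.1f}")
```

The Average Doctor F1 of 57.0% matches the mean of the seven per-doctor F1 scores. Computing F1 from the averaged precision and recall gives a slightly different value, so the two should not be used interchangeably.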

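The AI system achieved recall and precision comparable to the doctors across the vignettes (Babylon AI: recall 80.0%, precision 44.4%, F1 57.1%).
Average across the seven doctors: recall 83.9%, precision 43.6%, F1 57.0%.
AI triage safety (97.0%) exceeded that of the doctors (93.1% on average), while appropriateness was roughly equal or slightly lower (AI 90.0% vs. doctors 90.5%).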
Expert judges rated the quality of the AI's differentials as comparable to the doctors' (about 83.0% across different panels). In the GP panel, some raters scored the AI lower.
On the Semigran 2015 vignettes, the AI's top-1 recall was 70.0% and its top-3 recall 96.7%, versus 75.3% and 90.3% for doctors.
On the AKT/CSA benchmarks, the AI included the modelled disease in its top 3 for 86.7% (AKT) and 75.0% (CSA) of cases.
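The gap between triage safety and appropriateness is easier to see with a small worked example. This is a sketch under stated assumptions, not the paper's scoring procedure: it assumes a recommendation counts as "appropriate" when it falls inside the judges' acceptable range and as "safe" when it is at least as urgent as the least urgent acceptable option. The urgency levels, the `TriageCase` type, and the example cases are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical urgency scale, ordered from least to most urgent.
LEVELS = ["self-care", "pharmacy", "gp", "urgent-gp", "emergency"]
RANK = {name: i for i, name in enumerate(LEVELS)}


@dataclass
class TriageCase:
    recommended: str       # triage advice given by the AI or a doctor
    least_urgent_ok: str   # lower bound of the judges' acceptable range
    most_urgent_ok: str    # upper bound of the judges' acceptable range


def is_safe(case: TriageCase) -> bool:
    """Safe here means not less urgent than the acceptable range (over-triage is allowed)."""
    return RANK[case.recommended] >= RANK[case.least_urgent_ok]


def is_appropriate(case: TriageCase) -> bool:
    """Appropriate here means inside the acceptable range."""
    return RANK[case.least_urgent_ok] <= RANK[case.recommended] <= RANK[case.most_urgent_ok]


def triage_rates(cases: list[TriageCase]) -> tuple[float, float]:
    """Return (safety rate, appropriateness rate) over a list of cases."""
    n = len(cases)
    safety = sum(is_safe(c) for c in cases) / n
    appropriateness = sum(is_appropriate(c) for c in cases) / n
    return safety, appropriateness


cases = [
    TriageCase("gp", "pharmacy", "gp"),                  # inside the range
    TriageCase("emergency", "gp", "urgent-gp"),          # over-triage: safe, not appropriate
    TriageCase("self-care", "gp", "urgent-gp"),          # under-triage: neither
    TriageCase("urgent-gp", "urgent-gp", "emergency"),   # inside the range
]
safety, appropriateness = triage_rates(cases)
print(f"safety = {safety:.2f}, appropriateness = {appropriateness:.2f}")
```

Under these definitions, a cautious system that over-triages can score higher on safety while losing a little appropriateness, which is the pattern reported for the AI above.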


This review was generated by AI and checked by a human editor.