QUICK REVIEW

[Paper Review] Can Artificial Intelligence Reliably Report Chest X-Rays?: Radiologist Validation of an Algorithm trained on 2.3 Million X-Rays

Preetham Putha, Manoj Tadepalli|arXiv (Cornell University)|Jul 19, 2018

Radiomics and Machine Learning in Medical Imaging | References: 37 | Cited by: 32

One-Sentence Summary

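This study developed and validated a deep learning algorithm, trained on 2.3 million labelled chest X-rays, that detects nine specific abnormalities and separates normal from abnormal scans. The system showed high accuracy, with an AUC of 0.92 for abnormal-versus-normal detection and AUCs between 0.89 and 0.98 for the individual abnormalities, indicating near-radiologist performance in a radiologist-validated setting.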

ABSTRACT

Background: Chest X-rays are the most commonly performed, cost-effective diagnostic imaging tests ordered by physicians. A clinically validated AI system that can reliably separate normals from abnormals can be invaluable, particularly in low-resource settings. The aim of this study was to develop and validate a deep learning system to detect various abnormalities seen on a chest X-ray.
Methods: A deep learning system was trained on 2.3 million chest X-rays and their corresponding radiology reports to identify various abnormalities seen on a chest X-ray. The system was tested against (1) a three-radiologist majority on an independent, retrospectively collected set of 2000 X-rays (CQ2000) and (2) radiologist reports on a separate validation set of 100,000 scans (CQ100k). The primary accuracy measure was area under the ROC curve (AUC), estimated separately for each abnormality and for normal versus abnormal scans.
Results: On the CQ2000 dataset, the deep learning system demonstrated an AUC of 0.92 (CI 0.91-0.94) for detection of abnormal scans, and AUC (CI) of 0.96 (0.94-0.98), 0.96 (0.94-0.98), 0.95 (0.87-1), 0.95 (0.92-0.98), 0.93 (0.90-0.96), 0.89 (0.83-0.94), 0.91 (0.87-0.96), 0.94 (0.93-0.96), and 0.98 (0.97-1) for the detection of blunted costophrenic angle, cardiomegaly, cavity, consolidation, fibrosis, hilar enlargement, nodule, opacity, and pleural effusion, respectively. The AUCs were similar on the larger CQ100k dataset except for detecting normals, where the AUC was 0.86 (0.85-0.86).
Interpretation: Our study demonstrates that a deep learning algorithm trained on a large, well-labelled dataset can accurately detect multiple abnormalities on chest X-rays. As these systems improve in accuracy, applying deep learning to widen the reach of chest X-ray interpretation and improve reporting efficiency will add tremendous value in radiology workflows and public health screenings globally.

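Motivation and Objectives

Develop a deep learning system that uses large-scale real-world data to reliably detect multiple abnormalities on chest X-rays.
Validate the algorithm on two independent datasets, against a radiologist consensus and against individual radiologist reports.
Assess the feasibility of deploying AI for automated preliminary reporting in resource-limited settings, to reduce reporting backlogs and improve access.
Assess how accurately the system detects specific imaging findings without relying on clinical history, so that it can be applied globally.
Determine whether labels extracted from radiology reports by natural language processing (NLP) can serve as a reliable substitute for expert annotation when training AI models at scale.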

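Proposed Method

The algorithm was trained on 2.3 million anonymized, retrospectively collected chest X-rays from 45 centers worldwide, covering anteroposterior, anteroposterior oblique, supine, and lateral views.
An NLP pipeline extracted abnormality labels from the radiology reports, producing training labels for nine specific findings: blunted costophrenic angle, cardiomegaly, cavity, consolidation, fibrosis, hilar enlargement, nodule, opacity, and pleural effusion.
A separate deep learning model was trained for each abnormality, so that detection could be tuned to that lesion's characteristics and spatial patterns.
The system was validated on two datasets: CQ2000 (2,000 X-rays, with the majority vote of three radiologists as the gold standard) and CQ100k (100,000 X-rays, with the radiology reports as the gold standard).
Performance was measured by the area under the receiver operating characteristic curve (AUC), reported with 95% confidence intervals for each abnormality and for the overall normal-versus-abnormal classification.
The models produce heatmaps and bounding boxes to localize lesions, but this study did not formally validate localization accuracy.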
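
The two labelling steps described above, NLP extraction of training labels from reports and the three-radiologist majority vote used as the CQ2000 gold standard, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the keyword lists, the negation cues, and the names `FINDING_KEYWORDS`, `extract_labels`, and `majority_vote` are all assumptions made for illustration, and only three of the nine findings are shown.

```python
import re

# Hypothetical keyword lists; the paper's actual NLP rules are not described here.
FINDING_KEYWORDS = {
    "pleural_effusion": ["pleural effusion"],
    "cardiomegaly": ["cardiomegaly", "enlarged cardiac silhouette"],
    "consolidation": ["consolidation"],
}

# Naive negation cues (an assumption): a cue earlier in the same sentence
# cancels the keyword that follows it.
NEGATION = re.compile(r"\b(no|without|negative for)\b")


def extract_labels(report):
    """Return {finding: bool} from free-text report via keyword matching."""
    labels = {finding: False for finding in FINDING_KEYWORDS}
    # Split into rough sentences so a negation only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", report.lower()):
        for finding, keywords in FINDING_KEYWORDS.items():
            for keyword in keywords:
                position = sentence.find(keyword)
                if position == -1:
                    continue
                if not NEGATION.search(sentence[:position]):
                    labels[finding] = True
    return labels


def majority_vote(reads):
    """Return True when more than half of the readers marked the finding."""
    return sum(bool(r) for r in reads) * 2 > len(reads)


# Usage: one report and one set of three independent reads.
report_labels = extract_labels("No pleural effusion. Cardiomegaly is present.")
consensus = majority_vote([True, True, False])
```

A rule this simple mislabels sentences such as "no consolidation, but cardiomegaly is present", because the negation cue is applied to everything after it in the sentence. That kind of label noise is exactly what the comparison between report-derived labels and the radiologist consensus is meant to expose.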

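Experimental Results

Research Questions

RQ1: Can a deep learning model trained on 2.3 million real-world labelled chest X-rays reach radiologist-level accuracy in detecting multiple common abnormalities?
RQ2: On an independent validation set of 2,000 X-rays, how does the AI system perform against the majority consensus of three radiologists?
RQ3: To what extent does NLP-based label generation yield reliable, generalizable training data suitable for real clinical settings?
RQ4: Despite label noise and subtle findings, does the system maintain high performance on a larger, more diverse dataset (CQ100k)?
RQ5: Can such an AI system effectively support radiology workflows by automating preliminary reporting in settings with limited resources or severe reporting backlogs?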

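Key Findings

On the CQ2000 dataset, the deep learning system achieved an AUC of 0.92 (95% CI: 0.91-0.94) for separating normal from abnormal chest X-rays.
For the individual abnormalities, AUCs ranged from 0.89 (hilar enlargement) to 0.98 (pleural effusion), with strong detection performance across all findings, including cardiomegaly (0.96) and consolidation (0.95).
On the larger CQ100k dataset, the AUC for normal-versus-abnormal detection was 0.86 (95% CI: 0.85-0.86), a modest drop that may reflect the inclusion of subtle or clinically insignificant findings.
For most abnormalities, AUCs were consistent between CQ2000 and CQ100k, suggesting that the model generalizes well and that the NLP labelling process did not introduce significant bias.
The system showed high sensitivity and specificity across the abnormalities, although sensitivity was lower on CQ100k, possibly because radiologists report subtle abnormalities more conservatively.
The study indicates that large datasets labelled through NLP can train AI models that approach expert diagnostic accuracy, supporting their use in screening and workflow-support applications.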
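
To make the reported metrics concrete, the following sketch computes an AUC, a 95% confidence interval, and sensitivity and specificity at a threshold for one abnormality. It is a minimal illustration in standard-library Python, not the authors' evaluation code: the percentile bootstrap is an assumed way of obtaining the interval (this review does not state which method the paper used), and the function names and toy scores are hypothetical.

```python
import random


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as 0.5 (O(n^2), fine for a small sketch)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Point AUC plus a percentile-bootstrap interval (assumed method)."""
    point = auc(labels, scores)
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # skip resamples that lost one of the two classes
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


def sensitivity_specificity(labels, scores, threshold):
    """Sensitivity and specificity when scores >= threshold count as positive."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): 1 = finding present per the reference standard.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(toy_labels, toy_scores)  # 0.75
toy_sens, toy_spec = sensitivity_specificity(toy_labels, toy_scores, 0.5)
```

AUC is independent of any threshold, whereas sensitivity and specificity depend on the chosen operating point. This is why a model can keep a similar AUC across CQ2000 and CQ100k while its sensitivity at a fixed threshold differs between them.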

This review was generated by AI and reviewed by a human editor.