QUICK REVIEW

[Paper Review] MURA: Large Dataset for Abnormality Detection in Musculoskeletal Radiographs

Pranav Rajpurkar, Jeremy Irvin|arXiv (Cornell University)|Dec 11, 2017

Radiomics and Machine Learning in Medical Imaging|References: 17|Cited by: 248

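ONE-SENTENCE SUMMARY

MURA introduces a large dataset of upper-extremity musculoskeletal radiographs (40,561 images from 14,863 studies) in which each study is labeled normal or abnormal, and trains a DenseNet-169 baseline to detect abnormalities; the model reaches an AUROC of 0.929, falls short of the best radiologist overall, and is comparable to the best radiologist on some study types (finger and wrist).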

ABSTRACT

We introduce MURA, a large dataset of musculoskeletal radiographs containing 40,561 images from 14,863 studies, where each study is manually labeled by radiologists as either normal or abnormal. To evaluate models robustly and to get an estimate of radiologist performance, we collect additional labels from six board-certified Stanford radiologists on the test set, consisting of 207 musculoskeletal studies. On this test set, the majority vote of a group of three radiologists serves as gold standard. We train a 169-layer DenseNet baseline model to detect and localize abnormalities. Our model achieves an AUROC of 0.929, with an operating point of 0.815 sensitivity and 0.887 specificity. We compare our model and radiologists on the Cohen's kappa statistic, which expresses the agreement of our model and of each radiologist with the gold standard. Model performance is comparable to the best radiologist performance in detecting abnormalities on finger and wrist studies. However, model performance is lower than best radiologist performance in detecting abnormalities on elbow, forearm, hand, humerus, and shoulder studies. We believe that the task is a good challenge for future research. To encourage advances, we have made our dataset freely available at https://stanfordmlgroup.github.io/competitions/mura .
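
The gold-standard rule described in the abstract (majority vote of three radiologists) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name majority_vote and the example labels are hypothetical, with 1 meaning abnormal and 0 meaning normal.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label (0 = normal, 1 = abnormal) chosen by most readers."""
    if len(labels) % 2 == 0:
        # With binary labels, an odd number of readers rules out ties.
        raise ValueError("use an odd number of readers to avoid ties")
    return Counter(labels).most_common(1)[0][0]


# Hypothetical labels from three radiologists for four studies.
reader_labels = [
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
gold = [majority_vote(votes) for votes in reader_labels]
print(gold)  # [1, 0, 1, 0]
```

With three readers and two possible labels, a majority always exists, so every study receives exactly one gold-standard label.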

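MOTIVATION AND OBJECTIVES

Provide a large, publicly available dataset of upper-extremity musculoskeletal radiographs, with each study labeled normal or abnormal.
Develop and evaluate a deep learning baseline model that detects abnormalities across multiple study types.
Compare model performance with radiologist performance using robust metrics and inter-rater agreement.
Provide localization and interpretation insight through class activation maps (CAMs), and release the data to encourage further research.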

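PROPOSED METHOD

Use a 169-layer DenseNet to predict the probability of abnormality for each image in a study.
Obtain the study-level probability of abnormality by averaging the per-image probabilities.
Train with a binary cross-entropy loss weighted per study type to address class imbalance.
Normalize inputs to the ImageNet mean and standard deviation, resize them to 320x320, and apply data augmentation (random flips and rotations).
Form the final prediction by ensembling the 5 models with the lowest validation loss; evaluate on the radiologist-labeled test set against the gold standard.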
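
The loss weighting, study-level averaging, and ensembling steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the summary does not give the exact class weights, so the sketch assumes inverse-frequency weights computed per study type, and it assumes the ensemble combines models by a simple mean. All function names and numbers are hypothetical.

```python
import math


def class_weights(n_abnormal, n_normal):
    """Assumed weighting: each class is weighted by the other class's share."""
    total = n_abnormal + n_normal
    w_abnormal = n_normal / total
    w_normal = n_abnormal / total
    return w_abnormal, w_normal


def weighted_bce(p, y, w_abnormal, w_normal):
    """Weighted binary cross-entropy for one image (y = 1 means abnormal)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp to keep log() finite
    return -(w_abnormal * y * math.log(p) + w_normal * (1 - y) * math.log(1 - p))


def study_probability(image_probs):
    """Study-level probability: mean of the per-image probabilities."""
    return sum(image_probs) / len(image_probs)


def ensemble_study_probability(per_model_image_probs):
    """Assumed ensembling rule: mean of each model's study-level probability."""
    study_probs = [study_probability(p) for p in per_model_image_probs]
    return sum(study_probs) / len(study_probs)


# Hypothetical study type with 300 abnormal and 700 normal training images.
w_abn, w_norm = class_weights(300, 700)
missed_abnormal = weighted_bce(0.1, 1, w_abn, w_norm)
false_alarm = weighted_bce(0.9, 0, w_abn, w_norm)
print(missed_abnormal > false_alarm)  # True

# Hypothetical outputs of two models on a three-image study.
model_outputs = [[0.9, 0.6, 0.3], [0.8, 0.7, 0.6]]
print(f"{ensemble_study_probability(model_outputs):.2f}")  # 0.65
```

Under this weighting, an equally confident mistake costs more on the rarer class, which is the effect the weighted loss is meant to have when one class dominates a study type.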

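EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a CNN accurately detect abnormalities across diverse upper-extremity radiograph views?
RQ2: How does model performance compare with that of board-certified radiologists across study types (elbow, finger, forearm, hand, humerus, shoulder, wrist)?
RQ3: What are the common error patterns for each study type, and how large is the gap between the model and human readers?
RQ4: Do the model's interpretations (CAMs) highlight clinically relevant regions confirmed by radiologists?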

Key Findings

The model achieves an AUROC of 0.929 on the test set.
At a threshold of 0.5, the model has a sensitivity of 0.815 and a specificity of 0.887.
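On finger studies, the model's agreement with the gold standard is comparable to the best radiologist's (Cohen's kappa 0.389 vs. 0.410), and the same holds on wrist studies (0.931 vs. 0.931).
Although the model's overall AUROC is 0.929, the best radiologist's operating point lies above the model's ROC curve, indicating that the best radiologist outperforms the model overall.
The model performs below the best radiologist on elbow, forearm, hand, humerus, and shoulder studies; on some study types, such as finger, it is comparable to the worst radiologist.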
CAM visualizations are also generated to localize the regions that contribute most to an abnormal prediction.
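
The metrics reported above (sensitivity and specificity at a fixed threshold, and Cohen's kappa against the gold standard) can be computed as in the sketch below. It is a minimal illustration with made-up labels and probabilities, not the paper's evaluation code; the function names are hypothetical, and 1 means abnormal.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = abnormal."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def cohen_kappa(a, b):
    """Cohen's kappa for two binary raters: agreement corrected for chance."""
    n = len(a)
    observed = sum(1 for x, y in zip(a, b) if x == y) / n
    p_a1 = sum(a) / n
    p_b1 = sum(b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    # Undefined (division by zero) if both raters always give the same single label.
    return (observed - expected) / (1 - expected)


# Hypothetical gold-standard labels and study-level model probabilities.
gold = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.7, 0.4, 0.2, 0.6, 0.1, 0.3, 0.8]
pred = [1 if p > 0.5 else 0 for p in probs]  # threshold of 0.5

print(sensitivity_specificity(gold, pred))  # (0.75, 0.75)
print(cohen_kappa(gold, pred))              # 0.5
```

Kappa is 1 for perfect agreement and 0 for agreement no better than chance, which is why a raw agreement of 0.75 in this toy example corresponds to a kappa of only 0.5.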


This review was generated by AI and reviewed by a human editor.