QUICK REVIEW

[Paper Review] Deep Neural Networks Improve Radiologists' Performance in Breast Cancer Screening

Nan Wu, Jason Phang | arXiv (Cornell University) | Mar 20, 2019

AI in cancer detection | Cited by 69

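One-sentence summary

A two-stage deep convolutional neural network, trained with breast-level and pixel-level labels, reaches radiologist-level performance on screening mammography and improves radiologists' performance when used as a second reader; the hybrid model outperforms either one alone.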

ABSTRACT

We present a deep convolutional neural network for breast cancer screening exam classification, trained and evaluated on over 200,000 exams (over 1,000,000 images). Our network achieves an AUC of 0.895 in predicting whether there is a cancer in the breast, when tested on the screening population. We attribute the high accuracy of our model to a two-stage training procedure, which allows us to use a very high-capacity patch-level network to learn from pixel-level labels alongside a network learning from macroscopic breast-level labels. To validate our model, we conducted a reader study with 14 readers, each reading 720 screening mammogram exams, and find our model to be as accurate as experienced radiologists when presented with the same data. Finally, we show that a hybrid model, averaging probability of malignancy predicted by a radiologist with a prediction of our neural network, is more accurate than either of the two separately. To better understand our results, we conduct a thorough analysis of our network's performance on different subpopulations of the screening population, model design, training procedure, errors, and properties of its internal representations.

Motivation and objectives

Improve breast cancer screening accuracy while reducing false positives.
Exploit large-scale pixel-level and breast-level labels to train high-capacity networks.
Develop a two-stage training framework that combines patch-level heatmaps with a breast-level classifier.
Evaluate model performance against radiologists and in radiologist–model hybrids.

Proposed method

Use four view-specific ResNet-22-based columns, one per standard view (left and right CC, left and right MLO).
Train an auxiliary patch-level network on 256x256 patches with malignant/benign labels from pixel-level segmentations.
Generate heatmaps from patch-level predictions and feed them as extra channels to the breast-level model.
Apply two-stage training to learn from pixel-level as well as breast-level labels (not end-to-end).
Ensemble five models with different initializations to improve robustness.
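The two-stage data flow listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: `toy_patch_classifier` is a hypothetical stand-in for the auxiliary patch-level network, the 2x2 patch size and stride are toy values (the summary describes 256x256 patches), and producing one malignant and one benign heatmap is an assumption about how the "extra channels" are laid out.

```python
from statistics import mean

PATCH = 2   # toy patch size; the summary describes 256x256 patches
STRIDE = 2  # hypothetical stride, chosen only for this toy example


def toy_patch_classifier(patch):
    """Hypothetical stand-in for the auxiliary patch-level network.

    Returns (p_malignant, p_benign) for one patch. The scores are just the
    mean intensity and its complement, purely for illustration.
    """
    flat = [value for row in patch for value in row]
    m = mean(flat)
    return m, 1.0 - m


def make_heatmaps(image, classify=toy_patch_classifier):
    """Stage 1: slide over the image and paint patch scores into two heatmaps."""
    h, w = len(image), len(image[0])
    malignant = [[0.0] * w for _ in range(h)]
    benign = [[0.0] * w for _ in range(h)]
    for top in range(0, h - PATCH + 1, STRIDE):
        for left in range(0, w - PATCH + 1, STRIDE):
            patch = [row[left:left + PATCH] for row in image[top:top + PATCH]]
            p_mal, p_ben = classify(patch)
            for r in range(top, top + PATCH):
                for c in range(left, left + PATCH):
                    malignant[r][c] = p_mal
                    benign[r][c] = p_ben
    return malignant, benign


def stack_channels(image, malignant, benign):
    """Stage 2 input: the image plus the two heatmaps as extra channels."""
    return [image, malignant, benign]


def ensemble_average(predictions):
    """Average the outputs of several independently initialized models."""
    return mean(predictions)


# Toy 4x4 "image" with intensities in [0, 1].
image = [
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.5, 0.5, 0.0, 1.0],
    [0.5, 0.5, 0.0, 1.0],
]
mal, ben = make_heatmaps(image)
x = stack_channels(image, mal, ben)  # three channels for the breast-level model
```

Because the heatmaps are computed first and then treated as fixed inputs, the patch-level and breast-level networks are trained separately, which is what "not end-to-end" refers to. `ensemble_average` would be applied to the five breast-level model outputs for the same exam.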

Experimental results

Research questions

RQ1: Can a deep CNN trained with breast-level and pixel-level labels achieve radiologist-level accuracy on screening mammograms?
RQ2: Does adding patch-level heatmaps improve malignant/benign predictions compared to image-only models?
RQ3: Do hybrids of radiologists and the CNN outperform either alone?
RQ4: How does model performance vary across populations (screening vs biopsied) and patient subgroups (age, breast density)?

Main findings

Population	Model	Malignant AUC (single)	Benign AUC (single)	Malignant AUC (ensemble)	Benign AUC (ensemble)
Screening population	Image-only	0.827 ± 0.008	0.731 ± 0.004	0.840	0.743
Screening population	Image-and-heatmaps	0.886 ± 0.003	0.747 ± 0.002	0.895	0.756
Biopsied subpopulation	Image-only	0.781 ± 0.006	0.673 ± 0.003	0.791	0.682
Biopsied subpopulation	Image-and-heatmaps	0.843 ± 0.004	0.690 ± 0.002	0.850	0.696
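The gain from adding heatmaps can be read off the table above by subtracting the image-only ensemble AUC from the image-and-heatmaps ensemble AUC. The short sketch below does that arithmetic; the dictionary keys and the `heatmap_gain` helper are names invented here for illustration, while the numbers are copied from the table.

```python
# Ensemble AUCs copied from the table above, as (malignant, benign).
ensemble_auc = {
    ("screening", "image-only"): (0.840, 0.743),
    ("screening", "image-and-heatmaps"): (0.895, 0.756),
    ("biopsied", "image-only"): (0.791, 0.682),
    ("biopsied", "image-and-heatmaps"): (0.850, 0.696),
}


def heatmap_gain(population):
    """Return (malignant, benign) AUC gain from adding heatmaps, rounded to 3 places."""
    base = ensemble_auc[(population, "image-only")]
    with_maps = ensemble_auc[(population, "image-and-heatmaps")]
    return tuple(round(w - b, 3) for w, b in zip(with_maps, base))


for population in ("screening", "biopsied"):
    mal_gain, ben_gain = heatmap_gain(population)
    print(f"{population}: malignant +{mal_gain:.3f}, benign +{ben_gain:.3f}")
```

In both populations the malignant AUC gains more from the heatmaps than the benign AUC does.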

On the screening population, image-only single models reached AUCs of 0.827 (malignant) and 0.731 (benign); ensembling raised these to 0.840 and 0.743.
On the same population, image-and-heatmaps models reached 0.886 (malignant) and 0.747 (benign) as single models, and 0.895 and 0.756 with ensembling.
On the biopsied subpopulation, image-only models reached 0.781 (malignant) and 0.673 (benign) as single models, and 0.791 and 0.682 with ensembling; image-and-heatmaps models reached 0.843 and 0.690 as single models, and 0.850 and 0.696 with ensembling.
Radiologist readers showed AUCs spanning 0.705–0.860 (mean 0.778) with PRAUC 0.244–0.453 (mean 0.364).
A hybrid model (radiologist and CNN averaged predictions) achieved higher AUC/PRAUC than either alone (e.g., average hybrid AUC 0.891, PRAUC 0.431).
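To make the hybrid idea concrete, the sketch below averages a reader's scores with a model's scores and compares AUCs. It is a toy illustration, not a reproduction of the reader study: the labels and scores are invented so that the reader and the model make different mistakes, the `weight` parameter is an illustrative generalization (0.5 corresponds to the plain averaging described above), and `auc` is a small pairwise implementation written here rather than a library call.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def hybrid(reader_scores, model_scores, weight=0.5):
    """Weighted average of reader and model malignancy probabilities."""
    return [
        weight * r + (1 - weight) * m
        for r, m in zip(reader_scores, model_scores)
    ]


# Hypothetical toy data: 1 = malignant, 0 = not malignant.
labels = [1, 1, 1, 0, 0, 0]
reader = [0.9, 0.2, 0.6, 0.3, 0.1, 0.5]  # misses the second cancer
model = [0.4, 0.8, 0.7, 0.2, 0.6, 0.1]   # over-calls the fifth exam

combined = hybrid(reader, model)
print(f"reader AUC: {auc(labels, reader):.3f}")
print(f"model AUC:  {auc(labels, model):.3f}")
print(f"hybrid AUC: {auc(labels, combined):.3f}")
```

In this constructed example the hybrid ranks every malignant case above every non-malignant one because the two sets of errors do not overlap. Averaging helps only to the extent that reader and model errors are not strongly correlated.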

This review was generated by AI and reviewed by a human editor.