QUICK REVIEW

[Paper Review] Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

Noel Codella, Veronica Rotemberg|arXiv (Cornell University)|Feb 9, 2019

Cutaneous Melanoma Detection and Management|References 7|Cited by 984

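ONE-SENTENCE SUMMARY

This paper summarizes the ISIC 2018 skin lesion analysis challenge on melanoma detection, detailing the dataset, tasks, evaluation protocols, and results, along with their implications for generalization and regulation.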

ABSTRACT

This work summarizes the results of the largest skin image analysis challenge in the world, hosted by the International Skin Imaging Collaboration (ISIC), a global partnership that has organized the world's largest public repository of dermoscopic images of skin. The challenge was hosted in 2018 at the Medical Image Computing and Computer Assisted Intervention (MICCAI) conference in Granada, Spain. The dataset included over 12,500 images across 3 tasks. 900 users registered for data download, 115 submitted to the lesion segmentation task, 25 submitted to the lesion attribute detection task, and 159 submitted to the disease classification task. Novel evaluation protocols were established, including a new test for segmentation algorithm performance, and a test for algorithm ability to generalize. Results show that top segmentation algorithms still fail on over 10% of images on average, and algorithms with equal performance on test data can have different abilities to generalize. This is an important consideration for agencies regulating the growing set of machine learning tools in the healthcare domain, and sets a new standard for future public challenges in healthcare.

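RESEARCH MOTIVATION AND OBJECTIVES

Present the design and participation metrics of the ISIC 2018 challenge.
Introduce new evaluation protocols, including Thresholded Jaccard and balanced accuracy.
Assess generalization using internal and external test partitions.
Analyze the results of the segmentation, attribute detection, and disease classification tasks.
Offer recommendations for future public machine learning challenges in healthcare.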

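PROPOSED METHOD

Divide the challenge into three tasks: segmentation, attribute detection, and disease classification.
Use Thresholded Jaccard to account for inter-observer variability in segmentation.
Use balanced accuracy to mitigate class-prevalence bias in classification.
Include internal and external held-out test partitions to assess generalization.
Require participants to provide a 4-page manuscript describing their method and disclosing any in-domain or out-of-domain data used.
Evaluate segmentation, attribute detection, and classification with task-specific metrics.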
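
The Thresholded Jaccard idea listed above can be illustrated with a minimal Python sketch. This is not the challenge's official scoring code: the function names are hypothetical, and the 0.65 cutoff is an assumption that this summary does not state. The principle is that a per-image Jaccard index below the cutoff is counted as a failure and scored as zero, so small disagreements within the range of inter-observer variability are not penalized further, while clearly wrong masks contribute nothing.

```python
import numpy as np

def jaccard(pred: np.ndarray, truth: np.ndarray) -> float:
    """Intersection over union of two binary masks."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        # Both masks empty: treated as perfect agreement (an assumption).
        return 1.0
    return float(np.logical_and(pred, truth).sum() / union)

def thresholded_jaccard(pred, truth, threshold=0.65):
    """Per-image score: the Jaccard index, or 0 if it falls below the cutoff.

    The default threshold of 0.65 is an assumed value for illustration.
    """
    score = jaccard(pred, truth)
    return score if score >= threshold else 0.0

def mean_thresholded_jaccard(preds, truths, threshold=0.65):
    """Average the per-image thresholded scores over a dataset."""
    scores = [thresholded_jaccard(p, t, threshold) for p, t in zip(preds, truths)]
    return float(np.mean(scores))

# Toy 4x4 masks (hypothetical data): the lesion covers the top two rows.
truth = np.zeros((4, 4), dtype=int)
truth[0:2, :] = 1
good = np.zeros((4, 4), dtype=int)
good[0:2, 0:3] = 1   # overlaps 6 of 8 lesion pixels, Jaccard 0.75
bad = np.zeros((4, 4), dtype=int)
bad[0:2, 0:2] = 1    # overlaps 4 of 8 lesion pixels, Jaccard 0.50

print(thresholded_jaccard(good, truth))                        # 0.75
print(thresholded_jaccard(bad, truth))                         # 0.0
print(mean_thresholded_jaccard([good, bad], [truth, truth]))   # 0.375
```

With a plain mean Jaccard the two toy masks would average 0.625; under the thresholded variant the failed mask drags the mean to 0.375, which makes failures visible in the aggregate score.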

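EXPERIMENTAL RESULTS

Research Questions

RQ1: How do segmentation, attribute detection, and disease classification perform under the new evaluation protocols?
RQ2: In segmentation, does Thresholded Jaccard reflect clinical utility better than Jaccard?
RQ3: Compared with other metrics, how does balanced accuracy affect rankings and generalization?
RQ4: Can algorithms generalize from the internal data partition to the external data partition in melanoma detection?
RQ5: What are the implications of limited attribute detection performance for clinical practice and future challenges?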

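Key Findings

Top segmentation submissions reach roughly 0.80 on Thresholded Jaccard, yet still fail on more than 10% of images.
Attribute detection performance is low; the best average Jaccard (per attribute) is about 0.473.
The highest balanced accuracy for disease classification is 0.885, with a notable gap between internal and external generalization.
Algorithms often overfit to internal data; the degree of generalization varies across methods.
Compared with accuracy or AUC, balanced accuracy has a significant effect on participant rankings.
External test data reveal performance differences not captured by internal test data.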
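
To see why balanced accuracy can reorder rankings relative to plain accuracy, consider the following small Python sketch. The labels and both "models" are hypothetical toy data, not challenge results, and the helper names are illustrative. Balanced accuracy is computed here as the mean of per-class recall, so a rare class such as melanoma counts as much as a common one.

```python
import numpy as np

def accuracy(y_true, y_pred) -> float:
    """Fraction of samples classified correctly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true == y_pred))

def balanced_accuracy(y_true, y_pred) -> float:
    """Mean of per-class recall, so every class has equal weight."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Hypothetical imbalanced test set: 8 benign lesions (0), 2 melanomas (1).
y_true = [0] * 8 + [1] * 2

# Model A always predicts the majority class.
pred_a = [0] * 10
# Model B misses two benign cases but finds both melanomas.
pred_b = [0] * 6 + [1] * 2 + [1] * 2

print(accuracy(y_true, pred_a), balanced_accuracy(y_true, pred_a))  # 0.8 0.5
print(accuracy(y_true, pred_b), balanced_accuracy(y_true, pred_b))  # 0.8 0.875
```

Both toy models tie on accuracy at 0.8, but balanced accuracy separates them (0.5 versus 0.875) because the majority-class predictor has zero recall on the rare class.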


This review was generated by AI and reviewed by human editors.