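[Paper Summary] The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction
This paper presents the MAMA-MIA Challenge, a large-scale benchmark that evaluates breast DCE-MRI tumor segmentation and pCR prediction for cross-institutional generalization and subgroup fairness. It reports the final leaderboard together with insights into the accuracy-fairness trade-off.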
Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.
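Research Motivation and Objectives
- Advance robust and fair AI for breast cancer imaging by addressing the limited generalizability of single-center studies.
- Jointly evaluate primary tumor segmentation and pre-treatment pCR prediction within one unified framework.
- Assess model fairness across subgroups defined by age, menopausal status, and breast density.
- Provide standardized datasets, protocols, and baseline resources to support reproducible and fair AI research.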
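Proposed Method
- Define a two-task benchmark: Task 1 is automatic primary tumor segmentation, and Task 2 is pCR prediction from pre-treatment MRI.
- Train on a multi-institutional U.S. cohort (n=1506) and test on a private external set from three European centers (n=574) to assess cross-domain generalization.
- Use a unified scoring framework that combines accuracy and fairness, with λ=0.5 so that both receive equal weight.
- Evaluate fairness across subgroups defined by age, menopausal status, and breast density.
- Provide standardized preprocessing and a containerized evaluation workflow to support reproducibility on CodaBench.
- Compare a diverse field of entrants (26 teams from 14 countries) and analyze design trends and accuracy-fairness trade-offs.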
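The unified score can be sketched as a weighted average. This summary does not state the exact formula, so the form below is an assumption: a convex combination of a fairness score and a performance score with weight λ. At λ=0.5 it reduces to their mean, which matches the Combined Score column of the leaderboard table in this summary. The function and argument names are illustrative only.

```python
def combined_score(fairness: float, performance: float, lam: float = 0.5) -> float:
    """Blend fairness and performance into one score (assumed form)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Assumption: lam weights fairness and (1 - lam) weights performance.
    # With lam = 0.5 the assignment does not matter; both get equal weight.
    return lam * fairness + (1.0 - lam) * performance


# Top-ranked leaderboard row (MIC): fairness 0.9531, performance 0.8185.
# The leaderboard reports a Combined Score of 0.8858 for this row.
print(round(combined_score(0.9531, 0.8185), 4))
```

Changing `lam` in such a sketch shows how a ranking could shift if fairness were weighted more or less heavily than raw performance.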
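Experimental Results
Research Questions
- RQ1: How well do models generalize across institutions and continents for breast MRI tumor segmentation and pCR prediction?
- RQ2: How do demographic factors (age, menopausal status, breast density) affect model performance and fairness?
- RQ3: What trade-offs do state-of-the-art methods show between predictive accuracy and subgroup fairness?
- RQ4: Which architectures and training strategies achieve robust and fair performance under cross-site evaluation?
Key Findings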
| Rank | Team | Combined Score | Fairness Score | Performance Score | DSC | NormHD |
|---|---|---|---|---|---|---|
| 1 | MIC | 0.8858 | 0.9531 | 0.8185 | 0.7360 | 0.0990 |
| 2 | FME | 0.8820 | 0.9574 | 0.8066 | 0.7125 | 0.0993 |
| 3 | ViCOROB | 0.8782 | 0.9482 | 0.8083 | 0.7182 | 0.1017 |
| 4 | Martel Lab | 0.8735 | 0.9449 | 0.8021 | 0.7121 | 0.1078 |
| 5 | AIH-Mama | 0.8677 | 0.9532 | 0.7823 | 0.6914 | *0.1268* |
| 6 | HWT@YCH | 0.8655 | 0.9339 | 0.7971 | 0.7080 | 0.1138 |
| 7 | Flamingo | 0.8640 | 0.9434 | 0.7847 | 0.7033 | *0.1338* |
| 8 | CALADAN | 0.8631 | 0.9621 | 0.7640 | 0.7022 | *0.1742* |
| 9 | bigAI | 0.8517 | 0.9464 | 0.7570 | 0.6872 | *0.1732* |
| 10 | Shangqi,Gao@CAM | 0.8485 | 0.9621 | 0.7349 | 0.6101 | 0.1404 |
| 11 | GK_KI | 0.8451 | 0.9581 | 0.7321 | 0.6330 | 0.1688 |
| 12 | Jeff | 0.8439 | 0.9519 | 0.7360 | 0.7025 | *0.2305* |
| 13 | Baseline | 0.8290 | 0.9373 | 0.7208 | 0.6871 | 0.2455 |
| 14 | Dynamo | 0.8290 | 0.9373 | 0.7208 | 0.6871 | *0.2455* |
| 15 | PM | 0.8290 | 0.9373 | 0.7208 | 0.6871 | *0.2455* |
| 16 | AEHRC-MIA | 0.8256 | 0.9261 | 0.7251 | 0.6781 | *0.2280* |
| 17 | AI Strollers | 0.8030 | 0.9156 | 0.6904 | 0.6296 | *0.2489* |
| 18 | MedImgLab_Unipa | 0.7270 | 0.9084 | 0.5456 | 0.4717 | 0.3805 |
| 19 | FPixel | 0.7270 | 0.9084 | 0.5456 | 0.4717 | 0.3805 |
| 20 | BWS-KNU | 0.7257 | 0.9382 | 0.5132 | 0.4556 | 0.4291 |
| 21 | CIG@Illinois | 0.6593 | 0.8931 | 0.4256 | 0.5195 | 0.6683 |
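- Twelve teams ranked above the baseline on Task 1 on the combined fairness and performance score, including all of the top-ranked entries.
- For Task 1, the top methods clearly improve DSC and reduce NormHD relative to the baseline; the top-ranked team (MIC) reaches a DSC of 0.7360 versus 0.6871 and a NormHD of 0.0990 versus 0.2455.
- For Task 2, only three teams outperformed the baseline; all three improved fairness, and two of them also achieved higher performance than the baseline.
- The competition revealed substantial performance variability under external testing, as well as trade-offs between overall accuracy and subgroup fairness.
- The benchmark provides standardized datasets, evaluation code, and reporting guidelines to promote robust and fair AI for breast cancer imaging.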
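The leaderboard columns appear to be linked by simple arithmetic, which makes the table easy to sanity-check. The sketch below assumes that the Performance Score is the mean of DSC and (1 - NormHD), and that the Combined Score is the mean of the Fairness and Performance scores. Both relations are inferred from the reported numbers, not from a formula stated in this summary. The names `ROWS`, `performance_score`, and `is_consistent` are illustrative.

```python
# (team, combined, fairness, performance, dsc, norm_hd) copied from the leaderboard.
ROWS = [
    ("MIC", 0.8858, 0.9531, 0.8185, 0.7360, 0.0990),
    ("FME", 0.8820, 0.9574, 0.8066, 0.7125, 0.0993),
    ("ViCOROB", 0.8782, 0.9482, 0.8083, 0.7182, 0.1017),
    ("Baseline", 0.8290, 0.9373, 0.7208, 0.6871, 0.2455),
    ("CIG@Illinois", 0.6593, 0.8931, 0.4256, 0.5195, 0.6683),
]

TOL = 2e-4  # table values are rounded to four decimals


def performance_score(dsc: float, norm_hd: float) -> float:
    # Assumed form: higher DSC is better, lower NormHD is better.
    return 0.5 * (dsc + (1.0 - norm_hd))


def is_consistent(row) -> bool:
    """Check one row against the two assumed relations, within rounding."""
    _, combined, fairness, performance, dsc, norm_hd = row
    perf_ok = abs(performance_score(dsc, norm_hd) - performance) <= TOL
    comb_ok = abs(0.5 * (fairness + performance) - combined) <= TOL
    return perf_ok and comb_ok


for row in ROWS:
    print(row[0], is_consistent(row))
```

If the assumed relations hold, every listed row passes within rounding tolerance. A failing row would point to either a transcription error in the table or a different scoring formula.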
This summary was generated by AI and reviewed by a human editor.