QUICK REVIEW

[Paper Review] The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

Lidia Garrucho, Smriti Joshi|arXiv (Cornell University)|2026. 03. 01.

MRI in cancer diagnosis | Citations: 0

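One-Line Summary

This paper introduces the MAMA-MIA Challenge, a large-scale benchmark that evaluates breast DCE-MRI tumor segmentation and pCR prediction together with cross-institutional and subgroup fairness assessment. It reports the final leaderboard and insights into the accuracy-fairness trade-off.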

ABSTRACT

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are often developed using single-center data and evaluated using aggregate performance metrics, limiting their generalizability and obscuring potential performance disparities across demographic subgroups. The MAMA-MIA Challenge was designed to address these limitations by introducing a large-scale benchmark that jointly evaluates primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under external testing and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.

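Motivation and Objectives

Promote robust and fair AI by addressing the limited generalizability of single-center studies.
Jointly evaluate primary tumor segmentation and pre-treatment pCR prediction within a single framework.
Assess model fairness across subgroups defined by age, menopausal status, and breast density.
Provide standardized datasets, protocols, and benchmark resources to promote reproducible AI.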

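Proposed Method

Define a two-task benchmark: Task 1 is automatic primary tumor segmentation, and Task 2 is pCR prediction using pre-treatment MRI only.
To evaluate cross-domain generalization, train on a multi-institutional US cohort (n=1,506) and test on a private set from three European centers (n=574).
Use a unified scoring framework that combines accuracy and fairness, weighting the two equally with λ=0.5.
Evaluate fairness across subgroups defined by age, menopausal status, and breast density.
Provide standardized preprocessing and a containerized evaluation workflow on CodaBench for reproducibility.
Compare a diverse set of teams (26 teams from 14 countries) and analyze design trends and accuracy-fairness trade-offs.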
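The equal weighting in the unified score can be sketched as a convex combination. This is a minimal illustration, assuming the overall score is a λ-weighted average of a performance score and a fairness score. The function name `unified_score` and the choice of which term λ multiplies are assumptions, not the challenge's published code; with λ=0.5 that choice makes no difference.

```python
def unified_score(performance: float, fairness: float, lam: float = 0.5) -> float:
    """Hypothetical helper: lambda-weighted mix of performance and fairness."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Assumption: lam weights fairness and (1 - lam) weights performance.
    return (1.0 - lam) * performance + lam * fairness


# Performance and fairness values from the first leaderboard row of this review (team MIC).
score = unified_score(performance=0.8185, fairness=0.9531)
print(f"{score:.4f}")
```

Under this assumption the result agrees with the overall score listed for that row, which suggests, but does not prove, that the leaderboard uses a plain average of the two components.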

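Experimental Results

Research Questions

RQ1: How well do models generalize across institutions and continents for breast MRI tumor segmentation and pCR prediction?
RQ2: How do demographic factors (age, menopausal status, breast density) affect model performance and fairness?
RQ3: What is the trade-off between predictive accuracy and subgroup fairness in state-of-the-art methods?
RQ4: Which architectures and training strategies deliver robust and fair performance under cross-site evaluation?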
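RQ2 and RQ3 both rest on breaking a metric down by subgroup. The review does not give the challenge's fairness formula, so the following is only a generic sketch: it groups per-case scores by a subgroup label and reports each group's mean plus the largest gap between groups. The case values, the age bands, and the helper `subgroup_means` are illustrative assumptions, not challenge data or challenge code.

```python
from collections import defaultdict
from statistics import mean


def subgroup_means(cases: list[tuple[str, float]]) -> dict[str, float]:
    """Average a per-case score within each subgroup label (hypothetical helper)."""
    grouped: dict[str, list[float]] = defaultdict(list)
    for label, score in cases:
        grouped[label].append(score)
    return {label: mean(scores) for label, scores in grouped.items()}


# Toy per-case scores tagged with an invented age band (not real results).
toy_cases = [
    ("<40", 0.70), ("<40", 0.80),
    ("40-60", 0.75), ("40-60", 0.65),
    (">60", 0.60), (">60", 0.70),
]

means = subgroup_means(toy_cases)
gap = max(means.values()) - min(means.values())
for label, value in means.items():
    print(f"{label}: mean score {value:.2f}")
print(f"largest subgroup gap: {gap:.2f}")
```

A small gap would indicate consistent behavior across groups; how the challenge converts such disparities into its fairness score is not described in this review.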

Key Results

Rank	Team	Overall Score	Fairness Score	Performance Score	DSC	NormHD
1	MIC	0.8858	0.9531	0.8185	0.7360	0.0990
2	FME	0.8820	0.9574	0.8066	0.7125	0.0993
3	ViCOROB	0.8782	0.9482	0.8083	0.7182	0.1017
4	Martel Lab	0.8735	0.9449	0.8021	0.7121	0.1078
5	AIH-Mama	0.8677	0.9532	0.7823	0.6914	0.1268
6	HWT@YCH	0.8655	0.9339	0.7971	0.7080	0.1138
7	Flamingo	0.8640	0.9434	0.7847	0.7033	0.1338
8	CALADAN	0.8631	0.9621	0.7640	0.7022	0.1742
9	bigAI	0.8517	0.9464	0.7570	0.6872	0.1732
10	Shangqi,Gao@CAM	0.8485	0.9621	0.7349	0.6101	0.1404
11	GK_KI	0.8451	0.9581	0.7321	0.6330	0.1688
12	Jeff	0.8439	0.9519	0.7360	0.7025	0.2305
13	Baseline	0.8290	0.9373	0.7208	0.6871	0.2455
14	Dynamo	0.8290	0.9373	0.7208	0.6871	0.2455
15	PM	0.8290	0.9373	0.7208	0.6871	0.2455
16	AEHRC-MIA	0.8256	0.9261	0.7251	0.6781	0.2280
17	AI Strollers	0.8030	0.9156	0.6904	0.6296	0.2489
18	MedImgLab_Unipa	0.7270	0.9084	0.5456	0.4717	0.3805
19	FPixel	0.7270	0.9084	0.5456	0.4717	0.3805
20	BWS-KNU	0.7257	0.9382	0.5132	0.4556	0.4291
21	CIG@Illinois	0.6593	0.8931	0.4256	0.5195	0.6683
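The columns of the leaderboard appear to be arithmetically related. As an observation drawn only from the listed values (the review does not state the formula), each performance score matches the mean of DSC and 1 − NormHD to within rounding. The sketch below checks that relation for a few rows; `performance_from_metrics` is a hypothetical name, and the relation itself is an inference rather than a documented definition.

```python
def performance_from_metrics(dsc: float, norm_hd: float) -> float:
    """Assumed relation: average of DSC and (1 - NormHD), both in [0, 1]."""
    return 0.5 * (dsc + (1.0 - norm_hd))


# (team, reported performance score, DSC, NormHD) copied from the leaderboard.
rows = [
    ("MIC", 0.8185, 0.7360, 0.0990),
    ("FME", 0.8066, 0.7125, 0.0993),
    ("Baseline", 0.7208, 0.6871, 0.2455),
    ("CIG@Illinois", 0.4256, 0.5195, 0.6683),
]

# Allow for four-decimal rounding in the reported values.
TOLERANCE = 1.5e-4
for team, reported, dsc, norm_hd in rows:
    estimate = performance_from_metrics(dsc, norm_hd)
    status = "consistent" if abs(estimate - reported) <= TOLERANCE else "differs"
    print(f"{team}: estimated {estimate:.4f}, reported {reported:.4f} ({status})")
```

Because NormHD is a distance (lower is better), subtracting it from 1 puts both metrics on a higher-is-better scale before averaging, which would explain why teams with similar DSC can differ noticeably in performance score.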

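In Task 1, 12 teams exceeded the baseline in both fairness and performance, with improvements spread across the top ranks.
In Task 1, the top methods achieved large gains in DSC and reductions in NormHD (for example, MIC reached a DSC of 0.7360 and a NormHD of 0.0990, versus 0.6871 and 0.2455 for the baseline).
In Task 2, only three teams surpassed the baseline; all three improved fairness, and two achieved higher performance than the baseline.
External testing revealed notable performance variability and trade-offs between overall accuracy and subgroup fairness.
The benchmark provides standardized datasets, evaluation code, and reporting guidelines to promote robust and fair AI in breast cancer imaging.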
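DSC, the overlap metric reported for Task 1, is the Dice similarity coefficient: 2|A∩B| / (|A| + |B|) for a predicted mask A and a reference mask B. A minimal sketch on flattened binary masks follows. The function name `dice_coefficient` and the convention of returning 1.0 when both masks are empty are assumptions, and the challenge's own evaluation code may handle these details differently.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], ref: Sequence[int]) -> float:
    """Dice overlap between two flattened binary masks of equal length."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    if total == 0:
        return 1.0  # Assumed convention: two empty masks agree perfectly.
    return 2.0 * intersection / total


# Toy 2x3 masks flattened row by row (illustrative values only).
prediction = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 0, 1, 1]
print(f"{dice_coefficient(prediction, reference):.3f}")
```

Here two of the three foreground voxels in each mask overlap, so the score is 2·2 / (3 + 3).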

This review was generated by AI and reviewed by a human editor.