QUICK REVIEW

[Paper Review] Pathologist-Level Grading of Prostate Biopsies with Artificial Intelligence

Peter Ström, Kimmo Kartasalo|arXiv (Cornell University)|2019. 07. 02.

Prostate Cancer Diagnosis and Treatment | References: 33 | Citations: 12

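One-Line Summary

This study develops a deep learning-based AI system that grades prostate biopsies with pathologist-level accuracy, using whole-slide images from the population-based STHLM3 study. Trained on 6,682 biopsies and tested on 1,631 independent biopsies, the AI achieved an AUC of 0.997 for cancer detection, an AUC of 0.999 for patient-level cancer prediction, and an average pairwise Cohen's kappa of 0.62 for Gleason grading. This is comparable to expert pathologists and demonstrates the potential to reduce variability and workload in prostate cancer pathology.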

ABSTRACT

Background: An increasing volume of prostate biopsies and a world-wide shortage of uro-pathologists puts a strain on pathology departments. Additionally, the high intra- and inter-observer variability in grading can result in over- and undertreatment of prostate cancer. Artificial intelligence (AI) methods may alleviate these problems by assisting pathologists to reduce workload and harmonize grading. Methods: We digitized 6,682 needle biopsies from 976 participants in the population based STHLM3 diagnostic study to train deep neural networks for assessing prostate biopsies. The networks were evaluated by predicting the presence, extent, and Gleason grade of malignant tissue for an independent test set comprising 1,631 biopsies from 245 men. We additionally evaluated grading performance on 87 biopsies individually graded by 23 experienced urological pathologists from the International Society of Urological Pathology. We assessed discriminatory performance by receiver operating characteristics (ROC) and tumor extent predictions by correlating predicted millimeter cancer length against measurements by the reporting pathologist. We quantified the concordance between grades assigned by the AI and the expert urological pathologists using Cohen's kappa. Results: The performance of the AI to detect and grade cancer in prostate needle biopsy samples was comparable to that of international experts in prostate pathology. The AI achieved an area under the ROC curve of 0.997 for distinguishing between benign and malignant biopsy cores, and 0.999 for distinguishing between men with or without prostate cancer. The correlation between millimeter cancer predicted by the AI and assigned by the reporting pathologist was 0.96. For assigning Gleason grades, the AI achieved an average pairwise kappa of 0.62. This was within the range of the corresponding values for the expert pathologists (0.60 to 0.73).

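Research Motivation and Objectives

To address the growing workload in prostate cancer diagnosis and the shortage of urological pathologists.
To reduce the high intra- and inter-observer variability in Gleason grading of prostate biopsies.
To develop an AI system that can detect, localize, and grade prostate cancer with clinical-grade accuracy.
To evaluate the AI's performance against expert pathologists using standardized metrics.
To demonstrate the clinical validity of AI in population-based prostate cancer screening.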

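Proposed Method

A total of 8,313 prostate biopsy whole-slide images from the STHLM3 study were digitized, with 6,682 used for training and 1,631 held out as an independent test set.
Deep neural networks (DNNs) were trained using an ensemble model based on the Inception V3, ResNet-50, and Xception architectures.
Model performance was optimized through hyperparameter tuning with cross-validation on the training data.
Transfer learning from ImageNet pre-training was applied to improve generalization to prostate histopathology.
XGBoost regression was used to predict cancer length in millimeters from the DNN features.
Performance was validated using receiver operating characteristic (ROC) curves, correlation analysis, and Cohen's kappa for grading agreement.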
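
As a rough illustration of the agreement metric named in the last item, the following Python sketch computes Cohen's kappa between two raters and averages it over the pairings of one rater (for example, the AI) with each of the others. The review does not say whether the paper's kappa is weighted, so the unweighted form used here is an assumption, and the function names `cohen_kappa` and `mean_pairwise_kappa` and the toy grade labels are hypothetical, not the authors' implementation.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(target, others):
    """Average kappa between one rater and each of the other raters."""
    if not others:
        raise ValueError("need at least one other rater")
    return sum(cohen_kappa(target, o) for o in others) / len(others)


# Toy grade labels for six biopsies (made-up values, not study data).
ai_grades = [1, 2, 3, 1, 2, 3]
pathologist_1 = [1, 2, 3, 1, 2, 3]
pathologist_2 = [1, 2, 3, 1, 3, 2]

print(mean_pairwise_kappa(ai_grades, [pathologist_1, pathologist_2]))
```

Applying the same averaging to each pathologist in turn would give one value per rater, which is how a range such as the one reported for the expert panel could be formed under these assumptions.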

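Experimental Results

Research Questions

RQ1: Can the AI system detect prostate cancer in biopsy samples with pathologist-level accuracy?
RQ2: How does the AI's Gleason grading performance compare with that of expert urological pathologists?
RQ3: To what extent can the AI reduce inter-observer variability in prostate cancer grading?
RQ4: How well does the AI predict cancer extent in millimeters compared with pathologist measurements?
RQ5: Can the AI be applied reliably in a real-world, population-based screening setting?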

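Key Findings

The AI achieved an area under the receiver operating characteristic curve (AUC) of 0.997 for distinguishing benign from malignant biopsy cores.
For classifying whether a patient has prostate cancer, the AUC was 0.999.
The correlation between the cancer length predicted by the AI and the measurements recorded by the pathologist was 0.96.
The AI's average pairwise Cohen's kappa for Gleason grading was 0.62, within the range observed for the expert pathologists (0.60–0.73).
The AI also performed strongly across diverse histological patterns and complex cases, such as atypia and prostatic intraepithelial neoplasia.
It maintained high performance across different biopsy cores and institutions, demonstrating strong generalization.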
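
To make the reported metrics concrete, the sketch below shows one way core-level AUC, patient-level AUC, and a Pearson correlation could be computed in plain Python. It is only a sketch under stated assumptions: taking the maximum core score as the patient score is a hypothetical aggregation rule, the review does not specify which correlation coefficient was used, and the function names and toy numbers are made up for illustration.

```python
import math


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def patient_level(patient_ids, labels, scores):
    """Aggregate cores per patient. Assumption: max score, positive if any core is."""
    agg = {}
    for pid, y, s in zip(patient_ids, labels, scores):
        prev_y, prev_s = agg.get(pid, (0, float("-inf")))
        agg[pid] = (max(prev_y, y), max(prev_s, s))
    ids = sorted(agg)
    return [agg[i][0] for i in ids], [agg[i][1] for i in ids]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy data (made-up values, not study data): four cores from two patients.
core_labels = [0, 0, 1, 1]
core_scores = [0.10, 0.40, 0.35, 0.80]
patients = ["a", "a", "b", "b"]

print("core-level AUC:", roc_auc(core_labels, core_scores))
p_labels, p_scores = patient_level(patients, core_labels, core_scores)
print("patient-level AUC:", roc_auc(p_labels, p_scores))
print("length correlation:", pearson([0.0, 2.5, 7.0], [0.0, 3.0, 6.5]))
```

The pairwise definition of AUC used here is quadratic in the number of cores, which is fine for a small example; a rank-based implementation would be the usual choice at the scale of the test set described above.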

This review was generated by AI and reviewed by a human editor.