QUICK REVIEW

[Paper Review] Development and Validation of Deep Learning Algorithms for Detection of Critical Findings in Head CT Scans

Sasank Chilamkurthy, Rohit Ghosh | arXiv (Cornell University) | 2018. 03. 13.

Medical Imaging Techniques and Applications | References: 38 | Citations: 82

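One-line summary

This paper develops and validates deep learning models that automatically detect critical findings on non-contrast head CT, using large multicenter datasets (Qure25k and CQ500), and reports AUCs and operating-point performance.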

ABSTRACT

Importance: Non-contrast head CT scan is the current standard for initial imaging of patients with head trauma or stroke symptoms. Objective: To develop and validate a set of deep learning algorithms for automated detection of following key findings from non-contrast head CT scans: intracranial hemorrhage (ICH) and its types, intraparenchymal (IPH), intraventricular (IVH), subdural (SDH), extradural (EDH) and subarachnoid (SAH) hemorrhages, calvarial fractures, midline shift and mass effect. Design and Settings: We retrospectively collected a dataset containing 313,318 head CT scans along with their clinical reports from various centers. A part of this dataset (Qure25k dataset) was used to validate and the rest to develop algorithms. Additionally, a dataset (CQ500 dataset) was collected from different centers in two batches B1 & B2 to clinically validate the algorithms. Main Outcomes and Measures: Original clinical radiology report and consensus of three independent radiologists were considered as gold standard for Qure25k and CQ500 datasets respectively. Area under receiver operating characteristics curve (AUC) for each finding was primarily used to evaluate the algorithms. Results: Qure25k dataset contained 21,095 scans (mean age 43.31; 42.87% female) while batches B1 and B2 of CQ500 dataset consisted of 214 (mean age 43.40; 43.92% female) and 277 (mean age 51.70; 30.31% female) scans respectively. On Qure25k dataset, the algorithms achieved AUCs of 0.9194, 0.8977, 0.9559, 0.9161, 0.9288 and 0.9044 for detecting ICH, IPH, IVH, SDH, EDH and SAH respectively. AUCs for the same on CQ500 dataset were 0.9419, 0.9544, 0.9310, 0.9521, 0.9731 and 0.9574 respectively. For detecting calvarial fractures, midline shift and mass effect, AUCs on Qure25k dataset were 0.9244, 0.9276 and 0.8583 respectively, while AUCs on CQ500 dataset were 0.9624, 0.9697 and 0.9216 respectively.
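Since AUC is the primary measure in the abstract, a small worked example may help in reading the numbers. The sketch below computes AUC in its rank-based form: the probability that a randomly chosen positive scan receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `auc` and the six scores are hypothetical illustrations, not data or code from the paper.

```python
def auc(scores, labels):
    """Rank-based AUC: fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied pairs count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical scan-level confidences and ground-truth labels (1 = finding present).
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(f"AUC = {auc(scores, labels):.4f}")
```

The pairwise loop is quadratic in the number of scans, which is fine for illustration; for datasets the size of Qure25k a sort-based implementation would be the usual choice.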

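Motivation and objectives

Enable automated triage and rapid identification of urgent head CT findings in order to reduce treatment delays.
Build large multicenter datasets (Qure25k and CQ500) that use radiology reports and expert consensus as the gold standard.
Train separate deep learning models for intracranial hemorrhage, calvarial fractures, and mass effect/midline shift.
Report per-finding performance metrics to support clinical adoption and benchmarking.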

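Proposed method

A slice-level hemorrhage classifier is trained using ResNet18 with five parallel fully connected layers, one per hemorrhage type, and slice confidences are combined with a random forest to obtain scan-level predictions.
Dense (segmentation) models based on UNet are trained for IPH, SDH, and EDH, and calvarial fracture detection uses a DeepLab-based approach with hard negative mining to address the sparsity of fractures.
For midline shift and mass effect, a modified ResNet18 with two parallel fully connected branches is used, and a random forest aggregates slice outputs into a scan-level confidence.
For each CT scan, the axial non-contrast series is selected, resampled to 5 mm, and resized to 224x224, and the brain, bone, and subdural windows are stacked as input channels.
Evaluation uses ROC curves with AUC as the primary metric, and sensitivity and specificity are reported at a high-sensitivity and a high-specificity operating point.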
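The windowing step in the preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: each Hounsfield-unit (HU) value is clipped to a window defined by a level and a width, scaled to [0, 1], and the three windowed copies of a slice are stacked as channels. The (level, width) presets and the helper names `apply_window` and `slice_to_channels` are assumptions for illustration; this review does not state the values used. Resampling and resizing to 224x224 are omitted.

```python
def apply_window(hu, level, width):
    """Clip one HU value to [level - width/2, level + width/2], scale to [0, 1]."""
    low = level - width / 2.0
    high = level + width / 2.0
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


# Assumed (level, width) presets in HU; illustrative values only.
WINDOWS = {
    "brain": (40, 80),
    "bone": (500, 3000),
    "subdural": (175, 50),
}


def slice_to_channels(hu_slice):
    """Convert a 2D list of HU values into three windowed channels
    ordered brain, bone, subdural (dicts keep insertion order)."""
    return [
        [[apply_window(v, level, width) for v in row] for row in hu_slice]
        for (level, width) in WINDOWS.values()
    ]


# Tiny hypothetical 2x2 "slice" in HU.
channels = slice_to_channels([[0, 40], [80, 1000]])
print(len(channels), "channels;", "brain channel:", channels[0])
```

Stacking several windows gives the network views tuned to soft tissue and to bone at the same time, which a single narrow window would not provide.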

Experimental results

Research questions

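RQ1: Can deep learning models accurately detect the five types of intracranial hemorrhage on non-contrast head CT across different centers?
RQ2: Can calvarial fractures, midline shift, and mass effect be detected reliably, and how does the performance compare with radiologist consensus?
RQ3: How well does model performance generalize from the Qure25k validation set (held out from the development data) to the independent CQ500 clinical validation set?
RQ4: How does using a multi-radiologist consensus rather than a single reader as the gold standard affect measured performance?
RQ5: Can an automated triage system help shorten time to treatment by providing reliable scan triage in busy or remote settings?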

Key results

On Qure25k, AUCs were 0.9194 for ICH, 0.9559 for IVH, 0.9276 for midline shift, 0.9244 for calvarial fracture, and 0.8583 for mass effect.
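On CQ500 (B1+B2), AUCs were 0.9419 for ICH, 0.9544 for IPH, 0.9310 for IVH, 0.9521 for SDH, 0.9731 for EDH, 0.9574 for SAH, 0.9624 for calvarial fracture, 0.9697 for midline shift, and 0.9216 for mass effect.
At the high-sensitivity operating point on CQ500, sensitivities were 0.9463 (ICH), 0.9487 (calvarial fracture), and 0.9385 (midline shift), with corresponding specificities of 0.7098, 0.8606, and 0.8944.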
The algorithms achieved higher AUCs on CQ500 than on Qure25k for every finding except IVH (0.9310 vs 0.9559), with the largest gain for mass effect (0.9216 vs 0.8583).
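On CQ500, inter-reader agreement was higher for ICH (Fleiss' kappa 0.7827) and IPH (0.7746) and lower for calvarial fracture (0.4507) and SDH (0.5418).
The study releases the CQ500 dataset publicly for benchmarking and demonstrates per-finding deep learning performance on head CT.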
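The cross-dataset comparison can be checked directly from the AUCs quoted in the abstract. The short sketch below tabulates the per-finding difference (CQ500 minus Qure25k); the variable names are illustrative, and the numbers are copied from the abstract rather than recomputed from data.

```python
# Per-finding AUCs as quoted in the abstract.
QURE25K = {
    "ICH": 0.9194, "IPH": 0.8977, "IVH": 0.9559, "SDH": 0.9161,
    "EDH": 0.9288, "SAH": 0.9044, "Calvarial fracture": 0.9244,
    "Midline shift": 0.9276, "Mass effect": 0.8583,
}
CQ500 = {
    "ICH": 0.9419, "IPH": 0.9544, "IVH": 0.9310, "SDH": 0.9521,
    "EDH": 0.9731, "SAH": 0.9574, "Calvarial fracture": 0.9624,
    "Midline shift": 0.9697, "Mass effect": 0.9216,
}

# Positive delta means the AUC is higher on CQ500.
deltas = {k: round(CQ500[k] - QURE25K[k], 4) for k in QURE25K}
largest_gain = max(deltas, key=deltas.get)
lower_on_cq500 = [k for k, d in deltas.items() if d < 0]

for finding, delta in sorted(deltas.items(), key=lambda kv: -kv[1]):
    print(f"{finding:<20}{delta:+.4f}")
print("Largest gain:", largest_gain)
print("Lower on CQ500:", lower_on_cq500)
```

Note that the two datasets use different gold standards (original clinical reports for Qure25k, three-radiologist consensus for CQ500), so these differences reflect the reference standard and case mix as well as the models.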

This review was generated by AI and reviewed by a human editor.