QUICK REVIEW

[Paper Review] A comparative study between vision transformers and CNNs in digital pathology

Luca Deininger, Bernhard Stimpel|arXiv (Cornell University)|2022. 06. 01.

AI in cancer detection | Citations: 31

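One-line summary

Vision transformers (DeiT-Tiny and DINO) perform on par with ResNet18 for tumor detection and tissue type identification in digital pathology; slide-level predictions are similar, but training costs are higher. DINO shows broader transferability than PathNet.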

ABSTRACT

Recently, vision transformers were shown to be capable of outperforming convolutional neural networks when pretrained on sufficient amounts of data. In comparison to convolutional neural networks, vision transformers have a weaker inductive bias and therefore allow a more flexible feature detection. Due to their promising feature detection, this work explores vision transformers for tumor detection in digital pathology whole slide images in four tissue types, and for tissue type identification. We compared the patch-wise classification performance of the vision transformer DeiT-Tiny to the state-of-the-art convolutional neural network ResNet18. Due to the sparse availability of annotated whole slide images, we further compared both models pretrained on large amounts of unlabeled whole-slide images using state-of-the-art self-supervised approaches. The results show that the vision transformer performed slightly better than the ResNet18 for three of four tissue types for tumor detection while the ResNet18 performed slightly better for the remaining tasks. The aggregated predictions of both models on slide level were correlated, indicating that the models captured similar imaging features. All together, the vision transformer models performed on par with the ResNet18 while requiring more effort to train. In order to surpass the performance of convolutional neural networks, vision transformers might require more challenging tasks to benefit from their weak inductive bias.

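Motivation and objectives

Evaluate how ViTs perform at tumor detection in WSIs across four tissue types and at tissue type identification.
Compare a fully supervised ViT and a self-supervised ViT (DINO) against ResNet18 and PathNet baselines.
Analyze the correlation of slide-level predictions and qualitative differences in attention maps.
Assess the training efficiency and practical considerations of ViTs in digital pathology.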

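Proposed method

Use DeiT-Tiny (ViT) pretrained on ImageNet as the fully supervised baseline.
Use a DINO self-supervised ViT with a DeiT-Tiny backbone pretrained on TCGA-based data (TCGA 100).
Compare against ResNet18 pretrained on ImageNet and PathNet pretrained with self-supervised BYOL.
Evaluate patch-wise tumor detection with PR AUC and tissue type identification with macro PR AUC.
Train with sharpness-aware minimization (SAM) to improve ViT generalization, and apply balanced sampling and Albumentations augmentations.
Compute slide-level predictions and the Pearson correlation between models, and generate Grad-CAM heatmaps for localization.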
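
The slide-level comparison in the last step can be illustrated with a minimal sketch. It assumes that patch probabilities are aggregated per slide by a simple mean, which the review does not specify; the slide IDs, probabilities, and the helper names `slide_scores` and `pearson` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import sqrt


def slide_scores(patch_probs):
    """Average patch-level tumor probabilities per slide.

    patch_probs: iterable of (slide_id, probability) pairs.
    Mean aggregation is an assumption made for this sketch.
    """
    grouped = defaultdict(list)
    for slide_id, prob in patch_probs:
        grouped[slide_id].append(prob)
    return {sid: sum(ps) / len(ps) for sid, ps in grouped.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / sqrt(vx * vy)


# Hypothetical patch predictions from two models on the same three slides.
cnn_patches = [("s1", 0.9), ("s1", 0.7), ("s2", 0.2), ("s2", 0.4), ("s3", 0.5), ("s3", 0.7)]
vit_patches = [("s1", 0.8), ("s1", 0.8), ("s2", 0.1), ("s2", 0.3), ("s3", 0.6), ("s3", 0.6)]

cnn = slide_scores(cnn_patches)
vit = slide_scores(vit_patches)
slides = sorted(cnn.keys() & vit.keys())  # compare only slides scored by both models
r = pearson([cnn[s] for s in slides], [vit[s] for s in slides])
print(f"Pearson r across {len(slides)} slides: {r:.3f}")
```

A coefficient close to 1 on real data would indicate that both models rank slides similarly, which is the kind of evidence the review cites for the models capturing similar imaging features.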

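Experimental results

Research questions

RQ1: Do ViTs match or outperform CNNs at patch-wise tumor detection across multiple tissue types?
RQ2: Does a self-supervised ViT (DINO) offer advantages over PathNet or a supervised ViT on digital pathology tasks?
RQ3: How correlated are the slide-level predictions of ViTs and CNNs, and what does this imply about the learned features?
RQ4: What are the practical implications, in training time and resources, of using ViTs in this domain?
RQ5: Are ViTs more effective in digital pathology when broader context or more challenging downstream tasks are required?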

Main results

Model	FW	PR AUC CRC9	PR AUC SLN	PR AUC DLBCL	PR AUC LUAD	PR AUC Breast	ACC CRC9	ACC SLN	ACC DLBCL	ACC LUAD	ACC Breast
ResNet18	×	0.999	0.885	0.976	0.913	0.809	0.995	0.981	0.880	0.858	0.915
DeiT-Tiny	×	0.998	0.917	0.970	0.940	0.817	0.982	0.988	0.874	0.880	0.913
PathNet	×	0.999	0.908	0.970	0.920	0.818	0.995	0.943	0.866	0.885	0.920
DINO	×	0.999	0.912	0.958	0.933	0.828	0.991	0.984	0.874	0.871	0.924
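
As a quick cross-check of the table, the following sketch lists the datasets on which one model has a strictly higher PR AUC than another. The values are copied from the table above; the helper name `wins` is hypothetical, and small differences in the third decimal may not be statistically meaningful.

```python
# PR AUC values copied from the results table (patch-level evaluation).
pr_auc = {
    "ResNet18":  {"CRC9": 0.999, "SLN": 0.885, "DLBCL": 0.976, "LUAD": 0.913, "Breast": 0.809},
    "DeiT-Tiny": {"CRC9": 0.998, "SLN": 0.917, "DLBCL": 0.970, "LUAD": 0.940, "Breast": 0.817},
    "PathNet":   {"CRC9": 0.999, "SLN": 0.908, "DLBCL": 0.970, "LUAD": 0.920, "Breast": 0.818},
    "DINO":      {"CRC9": 0.999, "SLN": 0.912, "DLBCL": 0.958, "LUAD": 0.933, "Breast": 0.828},
}


def wins(model, baseline, scores=pr_auc):
    """Datasets on which `model` has strictly higher PR AUC than `baseline`."""
    return [d for d in scores[model] if scores[model][d] > scores[baseline][d]]


print(wins("DeiT-Tiny", "ResNet18"))  # ['SLN', 'LUAD', 'Breast']
print(wins("DINO", "PathNet"))        # ['SLN', 'LUAD', 'Breast']
```

Both comparisons give the same three datasets, with ties or small deficits on CRC9 and DLBCL.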

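ResNet18 and the ViTs (DeiT-Tiny and DINO) achieve very similar PR AUC and accuracy across datasets.
In PR AUC, the ViT (DeiT-Tiny) outperforms ResNet18 on three of the five tissue type/dataset tasks (SLN, LUAD, Breast) and trails slightly on DLBCL and CRC9.
DINO generally outperforms PathNet, suggesting broader transferability from its more diverse pretraining.
Slide-level predictions of ResNet18 and the ViTs are correlated, suggesting that the models capture similar imaging features.
ViT training with SAM is slower and more compute-intensive than CNN training, while throughput is similar.
Grad-CAM heatmaps suggest that, in some samples, the ViTs attend to more localized regions than the CNN.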
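
One reason SAM-based training is slower is that each update evaluates the gradient twice: once at the current weights and once at a nearby perturbed point. The sketch below shows this on a toy quadratic loss in plain Python. It is not the paper's training code; the loss, learning rate `lr`, neighborhood radius `rho`, and all function names are assumptions chosen for illustration.

```python
from math import sqrt

calls = {"grad": 0}  # counts gradient evaluations to make the cost visible


def loss_grad(w):
    """Gradient of the toy loss L(w) = w0**2 + 10 * w1**2 (illustrative only)."""
    calls["grad"] += 1
    return [2.0 * w[0], 20.0 * w[1]]


def sam_step(w, lr=0.04, rho=0.05):
    """One sharpness-aware minimization update on the weight list w."""
    g = loss_grad(w)                                    # 1st gradient, at w
    norm = sqrt(sum(gi * gi for gi in g)) or 1.0        # guard against a zero gradient
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]  # step toward higher loss
    g_adv = loss_grad(w_adv)                            # 2nd gradient, at perturbed weights
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]   # descend from the original w


w = [1.0, 1.0]
for _ in range(50):
    w = sam_step(w)
print(w, calls["grad"])
```

After 50 updates the counter reads 100, twice what plain gradient descent would need for the same number of steps.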


This review was generated by AI and reviewed by a human editor.