QUICK REVIEW

[Paper Review] A comparative study between vision transformers and CNNs in digital pathology

Luca Deininger, Bernhard Stimpel|arXiv (Cornell University)|Jun 1, 2022

AI in cancer detection | Cited by 31

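One-sentence summary

Vision transformers (DeiT-Tiny and DINO) perform on par with ResNet18 for tumor detection and tissue type identification in digital pathology; their slide-level predictions are similar, but training is more costly, and DINO shows broader transferability than PathNet.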

ABSTRACT

Recently, vision transformers were shown to be capable of outperforming convolutional neural networks when pretrained on sufficient amounts of data. In comparison to convolutional neural networks, vision transformers have a weaker inductive bias and therefore allow a more flexible feature detection. Due to their promising feature detection, this work explores vision transformers for tumor detection in digital pathology whole slide images in four tissue types, and for tissue type identification. We compared the patch-wise classification performance of the vision transformer DeiT-Tiny to the state-of-the-art convolutional neural network ResNet18. Due to the sparse availability of annotated whole slide images, we further compared both models pretrained on large amounts of unlabeled whole-slide images using state-of-the-art self-supervised approaches. The results show that the vision transformer performed slightly better than the ResNet18 for three of four tissue types for tumor detection while the ResNet18 performed slightly better for the remaining tasks. The aggregated predictions of both models on slide level were correlated, indicating that the models captured similar imaging features. All together, the vision transformer models performed on par with the ResNet18 while requiring more effort to train. In order to surpass the performance of convolutional neural networks, vision transformers might require more challenging tasks to benefit from their weak inductive bias.

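Motivation and objectives

Evaluate vision transformers for tumor detection in whole slide images (WSIs) across four tissue types, and for tissue type identification.
Compare a fully supervised ViT and a self-supervised ViT (DINO) against ResNet18 and PathNet baselines.
Analyze the correlation of slide-level predictions and qualitative differences in attention maps.
Assess the training efficiency and practical considerations of ViTs in digital pathology.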

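Proposed method

Use DeiT-Tiny (ViT) pretrained on ImageNet as the fully supervised baseline.
Use a DINO self-supervised ViT with a DeiT-Tiny backbone pretrained on TCGA-based data (TCGA 100).
Compare against ResNet18 pretrained on ImageNet and PathNet (pretrained with self-supervised BYOL).
Evaluate patch-wise tumor detection with PR AUC and tissue type identification with macro PR AUC.
Train the ViTs with SAM (sharpness-aware minimization) to improve generalization, and apply balanced sampling and data augmentation with Albumentations.
Compute slide-level predictions and the Pearson correlation between models, and generate Grad-CAM heatmaps for localization.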
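
As a rough illustration of the evaluation metrics listed above, the sketch below approximates PR AUC by average precision and obtains macro PR AUC by averaging one-vs-rest scores over classes. The function names and the choice of average precision as the estimator are assumptions made for illustration; the review does not state how the authors computed the area.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for a positive patch (e.g. tumor), 0 for a negative patch.
    scores: predicted probability of the positive class for each patch.
    Tied scores keep their input order; this sketch does not group ties.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank patches from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    true_pos = 0
    ap = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            true_pos += 1
            # Precision at this rank, weighted by the recall step 1 / n_pos.
            ap += (true_pos / rank) / n_pos
    return ap


def macro_average_precision(
    labels: Sequence[int],
    class_scores: Sequence[Sequence[float]],
    n_classes: int,
) -> float:
    """Unweighted mean of one-vs-rest average precision over all classes.

    labels: integer class index per patch.
    class_scores: one row of per-class probabilities per patch.
    """
    per_class = []
    for c in range(n_classes):
        binary = [1 if y == c else 0 for y in labels]
        scores = [row[c] for row in class_scores]
        per_class.append(average_precision(binary, scores))
    return sum(per_class) / n_classes


if __name__ == "__main__":
    # Hypothetical patch labels and tumor probabilities, not data from the paper.
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Macro averaging gives every tissue class the same weight regardless of how many patches it contains, which matters when class frequencies are imbalanced.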

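Experimental results

Research questions

RQ1: Can ViTs match or exceed CNNs in patch-wise tumor detection across multiple tissue types?
RQ2: Does a self-supervised ViT (DINO) have an advantage over PathNet and the supervised ViT on digital pathology tasks?
RQ3: How strongly are slide-level predictions correlated between ViTs and CNNs, and what does this imply about the learned features?
RQ4: What are the practical training-time and resource implications of using ViTs versus CNNs in this domain?
RQ5: Are ViTs more effective on digital pathology tasks that require broader context or on harder downstream tasks?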

Key findings

Model	FW	PR AUC CRC9	PR AUC SLN	PR AUC DLBCL	PR AUC LUAD	PR AUC Breast	ACC CRC9	ACC SLN	ACC DLBCL	ACC LUAD	ACC Breast
ResNet18	×	0.999	0.885	0.976	0.913	0.809	0.995	0.981	0.880	0.858	0.915
DeiT-Tiny	×	0.998	0.917	0.970	0.940	0.817	0.982	0.988	0.874	0.880	0.913
PathNet	×	0.999	0.908	0.970	0.920	0.818	0.995	0.943	0.866	0.885	0.920
DINO	×	0.999	0.912	0.958	0.933	0.828	0.991	0.984	0.874	0.871	0.924
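
The PR AUC columns of the table can be compared programmatically. The following sketch copies the PR AUC values from the table and lists the datasets on which one model scores strictly higher than another; the helper name `wins` is hypothetical and the comparison ignores statistical uncertainty, which the table does not report.

```python
from typing import List

DATASETS = ["CRC9", "SLN", "DLBCL", "LUAD", "Breast"]

# PR AUC values copied from the table, in the order of DATASETS.
PR_AUC = {
    "ResNet18": [0.999, 0.885, 0.976, 0.913, 0.809],
    "DeiT-Tiny": [0.998, 0.917, 0.970, 0.940, 0.817],
    "PathNet": [0.999, 0.908, 0.970, 0.920, 0.818],
    "DINO": [0.999, 0.912, 0.958, 0.933, 0.828],
}


def wins(model: str, baseline: str) -> List[str]:
    """Datasets where `model` has a strictly higher PR AUC than `baseline`."""
    return [
        name
        for name, m, b in zip(DATASETS, PR_AUC[model], PR_AUC[baseline])
        if m > b
    ]


if __name__ == "__main__":
    print(wins("DeiT-Tiny", "ResNet18"))  # ['SLN', 'LUAD', 'Breast']
    print(wins("DINO", "PathNet"))        # ['SLN', 'LUAD', 'Breast']
```

Ties, such as DINO and PathNet on CRC9, are not counted as wins in this comparison.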

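ResNet18 and the ViTs (DeiT-Tiny and DINO) achieve very similar PR AUC and accuracy across datasets.
The ViTs outperform ResNet18 in PR AUC on three of the five tissue type/dataset tasks (SLN, LUAD, and Breast) and trail slightly on DLBCL.
DINO generally scores higher than PathNet, suggesting that its pretraining transfers more broadly.
Slide-level predictions are correlated between ResNet18 and the ViTs, indicating that the models capture similar image features.
ViT training with SAM is slower and more compute-intensive than CNN training, although throughput is roughly comparable.
Grad-CAM heatmaps show that, in some samples, the ViTs focus on more localized regions than the CNN.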
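
The slide-level correlation analysis can be sketched as follows. This is a minimal illustration under two assumptions that the review does not confirm: patch probabilities are aggregated per slide by a simple mean, and agreement is measured on slides present in both models' outputs. All function names and the sample values are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from typing import Dict, Iterable, Sequence, Tuple

Patch = Tuple[str, float]  # (slide_id, predicted tumor probability)


def slide_level_scores(patches: Iterable[Patch]) -> Dict[str, float]:
    """Mean patch probability per slide (assumed aggregation rule)."""
    grouped = defaultdict(list)
    for slide_id, prob in patches:
        grouped[slide_id].append(prob)
    return {sid: sum(probs) / len(probs) for sid, probs in grouped.items()}


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def model_agreement(patches_a: Iterable[Patch], patches_b: Iterable[Patch]) -> float:
    """Pearson correlation of slide-level scores from two models."""
    scores_a = slide_level_scores(patches_a)
    scores_b = slide_level_scores(patches_b)
    # Compare only slides scored by both models, in a fixed order.
    shared = sorted(set(scores_a) & set(scores_b))
    return pearson([scores_a[s] for s in shared], [scores_b[s] for s in shared])


if __name__ == "__main__":
    # Hypothetical patch predictions from two models, not data from the paper.
    cnn = [("s1", 0.2), ("s1", 0.4), ("s2", 0.9), ("s3", 0.5)]
    vit = [("s1", 0.1), ("s1", 0.3), ("s2", 0.8), ("s3", 0.6)]
    print(model_agreement(cnn, vit))
```

A coefficient close to 1 means the two models rank slides similarly; it does not by itself show that they rely on the same image features.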


This review was generated by AI and checked by a human editor.