QUICK REVIEW

[Paper Review] Vision Transformer for Efficient Chest X-ray and Gastrointestinal Image Classification

Smriti Regmi, Aliza Subedi|arXiv (Cornell University)|Apr 23, 2023

COVID-19 diagnosis using AI|Cited by 10

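One-sentence summary

This paper shows that the Vision Transformer (ViT) and DeiT architectures outperform conventional CNN baselines on multiple metrics across three medical image datasets (Chest X-ray, Kvasir, and Kvasir-Capsule), and it positions ViT as a strong benchmark for medical image classification tasks.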

ABSTRACT

Medical image analysis is a hot research topic because of its usefulness in different clinical applications, such as early disease diagnosis and treatment. Convolutional neural networks (CNNs) have become the de-facto standard in medical image analysis tasks because of their ability to learn complex features from the available datasets, which makes them surpass humans in many image-understanding tasks. In addition to CNNs, transformer architectures also have gained popularity for medical image analysis tasks. However, despite progress in the field, there are still potential areas for improvement. This study uses different CNNs and transformer-based methods with a wide range of data augmentation techniques. We evaluated their performance on three medical image datasets from different modalities. We evaluated and compared the performance of the vision transformer model with other state-of-the-art (SOTA) pre-trained CNN networks. For Chest X-ray, our vision transformer model achieved the highest F1 score of 0.9532, recall of 0.9533, Matthews correlation coefficient (MCC) of 0.9259, and ROC-AUC score of 0.97. Similarly, for the Kvasir dataset, we achieved an F1 score of 0.9436, recall of 0.9437, MCC of 0.9360, and ROC-AUC score of 0.97. For the Kvasir-Capsule (a large-scale VCE dataset), our ViT model achieved a weighted F1-score of 0.7156, recall of 0.7182, MCC of 0.3705, and ROC-AUC score of 0.57. We found that our transformer-based models were better or more effective than various CNN models for classifying different anatomical structures, findings, and abnormalities. Our model showed improvement over the CNN-based approaches and suggests that it could be used as a new benchmarking algorithm for algorithm development.

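Motivation and objectives

Motivate the need for efficient alternatives to CNNs for medical image classification that model long-range dependencies well.
Evaluate ViT and DeiT models against CNN baselines on medical datasets from multiple modalities.
Examine data augmentation and training strategies that strengthen transformer-based medical image classification.
Assess, using appropriate metrics, whether ViT's improvements across the datasets are statistically significant.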

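Proposed method

Fine-tune pre-trained ViT variants (ViT-B/16, ViT-L/16, ViT-L/32) on the three medical datasets.
Compare ViT/DeiT with CNN baselines and ensemble models, using transfer learning from ImageNet-21k.
Apply dataset-specific data augmentation and loss functions (cross-entropy vs. focal loss) to handle class imbalance.
Evaluate with metrics including MCC, ROC-AUC, precision, recall, F1, and accuracy, together with ROC curves; run paired t-tests on the MCC comparisons.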
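
The cross-entropy vs. focal loss choice mentioned above can be illustrated with a minimal per-sample sketch. It uses the standard focal loss form, -alpha * (1 - p_t)^gamma * log(p_t); the gamma and alpha values and the example probabilities are assumptions for illustration, not settings reported in the paper.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy for one sample: -log(p_t), where p_t is the
    predicted probability of the true class."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)^gamma * log(p_t).

    gamma and alpha are illustrative defaults (assumptions), not values
    taken from the paper. With gamma = 0 and alpha = 1 this reduces to
    plain cross-entropy.
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Hypothetical softmax outputs for a 3-class problem; class 0 is the true class.
easy_sample = [0.9, 0.05, 0.05]   # confidently correct
hard_sample = [0.2, 0.5, 0.3]     # misclassified

for name, probs in (("easy", easy_sample), ("hard", hard_sample)):
    ce = cross_entropy(probs, 0)
    fl = focal_loss(probs, 0)
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f} ratio={fl / ce:.2f}")
```

The modulating factor (1 - p_t)^gamma shrinks the loss of well-classified samples far more than that of misclassified ones, which is why focal loss is a common choice when a few majority classes dominate the training set.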

Figure 1: An original ViT [7] structure for the classification task. The image is first converted into flattened patches through Patch Embedding and Position Embedding, then processed by the Transformer encoder [22]. The prediction result is obtained after the MLP Head.
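
The patch step in Figure 1 can be sketched in a few lines of plain Python. The 224 x 224 input resolution is an assumption (a common ViT setting, not stated in this review); the patch sizes 16 and 32 come from the variant names ViT-B/16 and ViT-L/32. The helper names are hypothetical, and the learned linear projection and position embedding are left out.

```python
def num_patches(image_size, patch_size):
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side


def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened patches,
    ordered left to right, top to bottom."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ])
    return patches


# Assumed 224 x 224 input: sequence length is patches + 1 class token.
for patch_size in (16, 32):
    n = num_patches(224, patch_size)
    print(f"patch size {patch_size}: {n} patches, sequence length {n + 1}")

# Tiny 4 x 4 single-channel example split into 2 x 2 patches.
toy_image = [[r * 4 + c for c in range(4)] for r in range(4)]
toy_patches = patchify(toy_image, 2)
print(toy_patches[0])  # first patch, flattened row by row
```

Under that assumption a /16 model processes roughly four times as many tokens as a /32 model, which is the main cost difference between the two patch sizes.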

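Experimental results

Research questions

RQ1: Does ViT outperform CNN-based models in MCC and ROC-AUC on the chest X-ray, endoscopy, and capsule endoscopy datasets?
RQ2: How do ViT variants compare with DeiT and CNN ensembles across diverse medical imaging modalities?
RQ3: What role do data augmentation and the loss function play in the performance of transformer-based medical image classification?
RQ4: Are the observed transformer-based improvements statistically significant across the datasets?
RQ5: Can ViT-based models serve as a robust benchmark for future medical image classification research?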
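
Several of these questions are framed in terms of MCC, so a minimal sketch of the multi-class Matthews correlation coefficient may help. It uses the standard confusion-matrix generalization (the same definition that `sklearn.metrics.matthews_corrcoef` documents); the function name and the confusion matrices are made up for illustration and are not results from the paper.

```python
import math


def multiclass_mcc(confusion):
    """Multi-class MCC from a square confusion matrix.

    confusion[i][j] = number of samples with true class i predicted as j.
    Returns 0.0 when the denominator is zero (for example, when every
    sample is assigned to a single class).
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))
    true_counts = [sum(confusion[k]) for k in range(n)]
    pred_counts = [sum(confusion[r][k] for r in range(n)) for k in range(n)]

    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts)
    )
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical imbalanced case: 95 samples of class 0, 5 of class 1,
# and a model that always predicts class 0.
majority_only = [[95, 0], [5, 0]]
accuracy = (95 + 0) / 100
print(f"accuracy={accuracy:.2f} mcc={multiclass_mcc(majority_only):.2f}")
```

A classifier that only ever predicts the majority class reaches high accuracy but an MCC of zero, which is one reason MCC is a useful headline metric on heavily imbalanced datasets.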

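Key findings

ViT-L/16 achieves the highest MCC among the models evaluated on Chest X-ray and performs strongly across the metrics.
ViT variants generally outperform the CNN baselines and DeiT on multiple metrics on the Chest X-ray and Kvasir datasets.
On the Kvasir-Capsule dataset, ViT-B/16 attains the top MCC, and the transformer models lead in weighted precision and F1 score.
The ROC curves show competitive or superior performance for the ViT models on all three datasets.
Paired t-tests indicate that ViT's MCC improvements over many SOTA baselines are statistically significant on the Chest X-ray and Kvasir-Capsule datasets; some of the Kvasir comparisons are not significant.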
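
The paired t-test behind the last finding can be sketched as follows. This review does not say how the paired MCC samples were obtained, so the per-run MCC values below are invented purely for illustration, and the function name is hypothetical. The sketch computes only the t statistic and degrees of freedom; in practice `scipy.stats.ttest_rel` returns the p-value as well.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two equally long lists of matched scores.

    Returns (t, degrees_of_freedom), where
    t = mean(d) / (stdev(d) / sqrt(n)) and d are the pairwise differences.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Made-up MCC values for five matched runs (NOT results from the paper).
vit_mcc = [0.92, 0.93, 0.91, 0.94, 0.92]
cnn_mcc = [0.90, 0.90, 0.89, 0.91, 0.91]

t_value, dof = paired_t_statistic(vit_mcc, cnn_mcc)
# Two-sided critical value for alpha = 0.05 with 4 degrees of freedom.
CRITICAL_T_DF4 = 2.776
print(f"t={t_value:.3f} dof={dof} significant={abs(t_value) > CRITICAL_T_DF4}")
```

Pairing matters here: the test looks at per-run differences, so a small but consistent gain can be significant even when the two models' score ranges overlap.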

Figure 2: Example samples from Chest X-ray, Kvasir, and Kvasir-Capsule datasets.

This review was generated by AI and checked by a human editor.