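[Paper Review] Vision Transformer for Efficient Chest X-ray and Gastrointestinal Image Classification
The paper shows that Vision Transformer (ViT) and DeiT architectures outperform several CNN baselines on multiple metrics across three medical image datasets (Chest X-ray, Kvasir, and Kvasir-Capsule), positioning ViT as a strong baseline for medical image classification.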
Medical image analysis is a hot research topic because of its usefulness in different clinical applications, such as early disease diagnosis and treatment. Convolutional neural networks (CNNs) have become the de-facto standard in medical image analysis tasks because of their ability to learn complex features from the available datasets, which makes them surpass humans in many image-understanding tasks. In addition to CNNs, transformer architectures also have gained popularity for medical image analysis tasks. However, despite progress in the field, there are still potential areas for improvement. This study uses different CNNs and transformer-based methods with a wide range of data augmentation techniques. We evaluated their performance on three medical image datasets from different modalities. We evaluated and compared the performance of the vision transformer model with other state-of-the-art (SOTA) pre-trained CNN networks. For Chest X-ray, our vision transformer model achieved the highest F1 score of 0.9532, recall of 0.9533, Matthews correlation coefficient (MCC) of 0.9259, and ROC-AUC score of 0.97. Similarly, for the Kvasir dataset, we achieved an F1 score of 0.9436, recall of 0.9437, MCC of 0.9360, and ROC-AUC score of 0.97. For the Kvasir-Capsule (a large-scale VCE dataset), our ViT model achieved a weighted F1-score of 0.7156, recall of 0.7182, MCC of 0.3705, and ROC-AUC score of 0.57. We found that our transformer-based models were better or more effective than various CNN models for classifying different anatomical structures, findings, and abnormalities. Our model showed improvement over the CNN-based approaches and suggests that it could be used as a new benchmarking algorithm for algorithm development.
Research Motivation and Objectives
- Motivate the need for efficient alternatives to CNNs for medical image classification that model long-range dependencies well.
- Evaluate ViT and DeiT models against CNN baselines on multi-modal medical datasets.
- Investigate data augmentation and training strategies to enhance transformer-based medical image classification.
- Assess statistical significance of ViT improvements across datasets using appropriate metrics.
Proposed Method
- Fine-tune pre-trained ViT variants (ViT-B/16, ViT-L/16, ViT-L/32) on three medical datasets.
- Compare ViT/DeiT with CNN baselines and ensemble models using transfer learning from ImageNet-21k.
- Apply dataset-specific data augmentation and loss functions (cross-entropy vs focal loss) to handle class imbalance.
- Evaluate using metrics including MCC, ROC-AUC, precision, recall, F1, accuracy, and ROC curves; perform paired t-tests for MCC comparisons.
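The choice between cross-entropy and focal loss mentioned above can be illustrated with a minimal sketch. It assumes the standard focal loss form, FL(p_t) = -(1 - p_t)^gamma * log(p_t), with gamma = 2.0 as a commonly used default; the paper's actual gamma and any class-weighting term are not stated in this summary, so both are assumptions here. The function names and the two toy probability vectors are hypothetical.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy for a single sample: -log(p_t)."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for a single sample: -(1 - p_t)^gamma * log(p_t).

    gamma=2.0 is an assumed default, not a value taken from the paper.
    With gamma=0 this reduces to plain cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Two hypothetical samples whose true class is 1.
easy = [0.05, 0.95]  # confidently and correctly classified
hard = [0.70, 0.30]  # poorly classified, e.g. a rare class

for name, probs in (("easy", easy), ("hard", hard)):
    ce = cross_entropy(probs, 1)
    fl = focal_loss(probs, 1)
    # The ratio equals (1 - p_t)^gamma, the down-weighting factor.
    print(f"{name}: CE={ce:.4f} focal={fl:.4f} ratio={fl / ce:.4f}")
```

The down-weighting factor is much smaller for the easy sample than for the hard one, which is why focal loss shifts the training signal toward poorly classified examples, often those from minority classes.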
![Figure 1 : An original ViT [ 7 ] structure for the classification task. The image is first converted into flattened patches through Patch Embedding and Position Embedding, then processed by the Transformer encoder [ 22 ] . The prediction result is obtained after the MLP Head.](https://ar5iv.labs.arxiv.org/html/2304.11529/assets/Figures/21.jpg)
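The patch step in Figure 1 can be sketched in a few lines. This is a simplified illustration, not the authors' code: it works on a single-channel image stored as a list of rows, and it omits the learned linear projection and position embeddings. The helper names `patchify` and `sequence_length` are hypothetical, and the 224-pixel input size used in the example is an assumption (a common ViT input resolution), not a value taken from this summary.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch in row-major order.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


def sequence_length(image_size, patch_size, class_token=True):
    """Number of tokens the encoder sees for a square image."""
    n = (image_size // patch_size) ** 2
    return n + 1 if class_token else n


# Toy 4 x 4 image with pixel values 0..15, split into 2 x 2 patches.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(toy, 2)
print(len(patches), patches[0])

# Token counts for the /16 and /32 variants at an assumed 224 x 224 input.
print(sequence_length(224, 16), sequence_length(224, 32))
```

A smaller patch size gives a longer token sequence, which is one reason the /16 variants cost more compute than the /32 variant at the same input resolution.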
Experimental Results
Research Questions
- RQ1: Do Vision Transformers exceed CNN-based models on chest X-ray, endoscopy, and capsule endoscopy datasets in terms of MCC and ROC-AUC?
- RQ2: How do ViT variants compare to DeiT and CNN ensembles across diverse medical imaging modalities?
- RQ3: What role do data augmentation and loss functions play in Transformer-based medical image classification performance?
- RQ4: Are the observed Transformer-based improvements statistically significant across datasets?
- RQ5: Can ViT-based models serve as robust benchmarks for future medical image classification research?
Key Findings
- ViT-L/16 achieves the highest MCC on Chest X-ray among evaluated models, indicating strong performance across metrics.
- ViT variants generally outperform CNN baselines and DeiT on the Chest X-ray and Kvasir datasets across multiple metrics.
- On the Kvasir-Capsule dataset, ViT-B/16 attains the top MCC, with Transformer models showing advantages in weighted precision and F1-score.
- ROC curves illustrate competitive or superior performance of ViT models across all three datasets.
- Paired t-tests indicate statistically significant MCC improvements for ViT on the Chest X-ray and Kvasir-Capsule datasets relative to many SOTA baselines; on Kvasir, some comparisons are not significant.
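The two quantities behind these findings, MCC and the paired t statistic, can be computed as in the sketch below. It uses the standard multiclass MCC formula based on per-class counts and the textbook paired t statistic; how the paper formed its paired samples is not stated in this summary, so pairing per-run scores is an assumption. The score lists are made-up illustrative values, not results from the paper, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from two label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    t_k = [y_true.count(k) for k in labels]           # true count per class
    p_k = [y_pred.count(k) for k in labels]           # predicted count per class
    numerator = c * s - sum(t * p for t, p in zip(t_k, p_k))
    denominator = math.sqrt((s * s - sum(p * p for p in p_k)) *
                            (s * s - sum(t * t for t in t_k)))
    # Convention: return 0.0 when the denominator vanishes.
    return numerator / denominator if denominator else 0.0


def paired_t_statistic(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Toy labels: one of four predictions is wrong.
print(round(mcc([0, 0, 1, 1], [0, 1, 1, 1]), 4))

# Hypothetical per-run MCC scores for two models (illustrative only).
vit_scores = [0.90, 0.92, 0.91, 0.93, 0.94]
cnn_scores = [0.88, 0.89, 0.90, 0.90, 0.91]
t_stat, dof = paired_t_statistic(vit_scores, cnn_scores)
print(round(t_stat, 3), dof)
```

The t statistic is compared against the critical value of the t distribution for the given degrees of freedom; for 4 degrees of freedom at a two-sided 5% level that value is about 2.776.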

This review was generated by AI and reviewed by a human editor.