QUICK REVIEW

[Paper Review] VM-UNET-V2: Rethinking Vision Mamba UNet for Medical Image Segmentation

Mingya Zhang, Yue Yu | arXiv (Cornell University) | March 14, 2024

Medical Image Segmentation Techniques | Citations: 7

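One-Line Summary

VM-UNetV2 integrates Visual State Space (VSS) blocks into a UNet-like architecture combined with SDI to model long-range dependencies efficiently in medical image segmentation, achieving competitive performance on multiple datasets with favorable FLOPs, parameter count, and FPS.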

ABSTRACT

In the field of medical image segmentation, models based on both CNN and Transformer have been thoroughly investigated. However, CNNs have limited modeling capabilities for long-range dependencies, making it challenging to exploit the semantic information within images fully. On the other hand, the quadratic computational complexity poses a challenge for Transformers. Recently, State Space Models (SSMs), such as Mamba, have been recognized as a promising method. They not only demonstrate superior performance in modeling long-range interactions, but also preserve a linear computational complexity. Inspired by the Mamba architecture, we propose Vision Mamba-UNetV2 (VM-UNetV2): the Visual State Space (VSS) block is introduced to capture extensive contextual information, and the Semantics and Detail Infusion (SDI) module is introduced to augment the infusion of low-level and high-level features. We conduct comprehensive experiments on the ISIC17, ISIC18, CVC-300, CVC-ClinicDB, Kvasir, CVC-ColonDB and ETIS-LaribPolypDB public datasets. The results indicate that VM-UNetV2 exhibits competitive performance in medical image segmentation tasks. Our code is available at https://github.com/nobodyplayer1/VM-UNetV2.

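Motivation and Objectives

Motivate a medical image segmentation model that combines long-range modeling with linear complexity.
Propose VM-UNetV2, which introduces VSS blocks and SDI to fuse low-level and high-level features.
Demonstrate competitive performance on dermatology (skin lesion) and gastrointestinal polyp datasets.
Analyze model complexity (FLOPs, Params, FPS) and evaluate the effect of encoder depth and deep supervision.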
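
To make the linear-complexity motivation concrete, here is a minimal sketch of a scalar discrete state-space recurrence. It is a toy, non-selective SSM and not the paper's VSS block (which scans 2D feature maps with input-dependent parameters); the function names and the values of `a`, `b`, and `c` are assumptions chosen for illustration. The point is that a single pass of n constant-cost updates lets the first input influence every later output, whereas full self-attention scores n × n token pairs.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar discrete state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t

    a, b, c are fixed illustrative constants (hypothetical values).
    """
    h = 0.0
    ys = []
    for x in xs:  # one constant-cost state update per element: O(n) overall
        h = a * h + b * x
        ys.append(c * h)
    return ys


def attention_pairs(n):
    """Number of token pairs that full self-attention scores for length n."""
    return n * n


if __name__ == "__main__":
    # An impulse at position 0 still contributes to the last output.
    impulse = [1.0] + [0.0] * 7
    print(ssm_scan(impulse))
    for n in (64, 256, 1024):
        print(f"n={n}: scan steps={n}, attention pairs={attention_pairs(n)}")
```

Because the state `h` carries a decaying summary of everything seen so far, the cost grows with sequence length rather than with its square.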

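Proposed Method

Adopts a three-module architecture: encoder, Semantics and Detail Infusion (SDI), and decoder.
Uses Vision Mamba Visual State Space (VSS) blocks as the encoder backbone to capture long-range context with linear complexity.
Fuses multi-scale features in the SDI module with CBAM-based attention guidance.
Applies cross-stage deep supervision during training.
Trains with a combined binary cross-entropy and Dice loss (L = L_BCE + L_Dice) for two-class segmentation.
Initializes the encoder weights from VMamba pre-trained on ImageNet-1k.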
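
The loss L = L_BCE + L_Dice and its use under deep supervision can be sketched as follows. This is a dependency-free illustration over flat lists of per-pixel foreground probabilities, not the authors' implementation. The clamping constant `eps`, the Dice smoothing term `smooth`, the per-stage `weights`, and all function names are assumptions, and the deep-supervision helper assumes each stage output has already been resized to the target resolution.

```python
import math


def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over per-pixel foreground probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2|P.T| + s) / (|P| + |T| + s); s is an assumed smoothing term."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)


def bce_dice_loss(probs, targets):
    """L = L_BCE + L_Dice, as stated in the method summary."""
    return bce_loss(probs, targets) + dice_loss(probs, targets)


def deep_supervision_loss(stage_probs, targets, weights):
    """Weighted sum of the combined loss over several decoder-stage outputs.

    The weights are hypothetical; the review does not state how stages are weighted.
    """
    if len(stage_probs) != len(weights):
        raise ValueError("need exactly one weight per stage output")
    return sum(w * bce_dice_loss(p, targets) for p, w in zip(stage_probs, weights))


if __name__ == "__main__":
    target = [1, 0, 1, 0]
    good = [0.9, 0.1, 0.8, 0.2]
    bad = [0.2, 0.7, 0.3, 0.9]
    print("good prediction:", bce_dice_loss(good, target))
    print("bad prediction:", bce_dice_loss(bad, target))
    print("two stages:", deep_supervision_loss([good, bad], target, [1.0, 0.5]))
```

The cross-entropy term penalizes each pixel independently, while the Dice term scores region overlap, which is one common reason the two are combined when the foreground is small.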

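Experimental Results

Research Questions

RQ1: Can visual State Space Models (SSMs) provide competitive long-range context modeling for medical image segmentation at linear complexity?
RQ2: Does introducing SDI (semantics and detail infusion) improve the preservation of fine detail while still exploiting high-level semantics?
RQ3: How do encoder depth and deep supervision affect segmentation performance across different medical datasets?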

Key Results

VM-UNetV2 achieves competitive mIoU, DSC, and accuracy against strong baselines on ISIC17/18 and multiple polyp datasets.
On ISIC17, VM-UNetV2 achieves mIoU 82.34, DSC 90.31, Acc 96.70, Spe 97.67, and Sen 91.89; on ISIC18, it achieves mIoU 81.37, DSC 89.73, Acc 95.06, Spe 97.13, and Sen 88.64.
On Kvasir-SEG, ClinicDB, ColonDB, ETIS, and CVC-300, VM-UNetV2 improves on VM-UNet in mIoU and DSC and scores competitively against the UNetV2-based baseline.
According to Table 3 of the paper, VM-UNetV2 compares favorably with several baselines in FLOPs, Params, and FPS.
The ablation study suggests that setting the encoder depth near [2,2,9,2] generally improves performance, while the benefit of deep supervision varies by dataset.
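
For readers less familiar with the reported metrics, the sketch below computes them from binary confusion-matrix counts using the standard definitions. The function names and example counts are hypothetical, and the paper's evaluation code may aggregate differently. When IoU and DSC come from the same confusion matrix they satisfy DSC = 2·IoU / (1 + IoU), and the reported ISIC17 (82.34, 90.31) and ISIC18 (81.37, 89.73) pairs are consistent with that identity to two decimals.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Standard binary segmentation metrics from pixel counts."""
    return {
        "IoU": _ratio(tp, tp + fp + fn),
        "DSC": _ratio(2 * tp, 2 * tp + fp + fn),
        "Acc": _ratio(tp + tn, tp + fp + tn + fn),
        "Spe": _ratio(tn, tn + fp),  # specificity: true negative rate
        "Sen": _ratio(tp, tp + fn),  # sensitivity: true positive rate
    }


def dice_from_iou(iou):
    """Convert IoU to Dice when both derive from the same confusion matrix."""
    return 2.0 * iou / (1.0 + iou)


if __name__ == "__main__":
    # Hypothetical counts for a single 10x10 mask.
    for name, value in binary_metrics(tp=8, fp=2, tn=88, fn=2).items():
        print(f"{name}: {value:.4f}")
    # Cross-check the reported IoU/DSC pairs against the identity.
    for dataset, miou in (("ISIC17", 82.34), ("ISIC18", 81.37)):
        print(dataset, round(100 * dice_from_iou(miou / 100), 2))
```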

This review was generated by AI and reviewed by a human editor.