QUICK REVIEW

[Paper Review] CATFA-Net: A Trans-Convolutional Approach for Accurate Medical Image Segmentation

Siddhartha Mallick, Aayushman Ghosh|arXiv (Cornell University)|2026. 03. 15.

Advanced Neural Network Applications | Citations: 0

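One-Line Summary

CATFA-Net pairs a hierarchical hybrid encoder with a lightweight convolutional decoder and introduces Context Addition Attention and Cross-Channel Trans-Convolutional Fusion, reaching state-of-the-art Dice scores on GLaS and ISIC 2018 with efficient computation.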

ABSTRACT

Convolutional blocks have played a crucial role in advancing medical image segmentation by excelling in dense prediction tasks. However, their inability to effectively capture long-range dependencies has limited their performance. Transformer-based architectures, leveraging attention mechanisms, address this limitation by modeling global context and creating expressive feature representations. Recent research has explored this potential by introducing hybrid frameworks that combine transformer encoders with convolutional decoders. Despite their advantages, these approaches face challenges such as limited inductive bias, high computational cost, and reduced robustness to data variability. To overcome these issues, this study introduces CATFA-Net, a novel and efficient segmentation framework designed to produce high-quality segmentation masks while reducing computational costs and increasing inference speed. CATFA-Net employs a hierarchical hybrid encoder architecture with a lightweight convolutional decoder backbone. Its transformer-based encoder uses a new Context Addition Attention mechanism that captures inter-image dependencies without the quadratic complexity of standard attention mechanisms. Features from the transformer branch are fused with those from the convolutional branch through a proposed Cross-Channel Attention mechanism, which helps retain spatial and channel information during downsampling. Additionally, a Spatial Fusion Attention mechanism in the decoder refines features while reducing background noise ambiguity. Extensive evaluations on five publicly available datasets show that CATFA-Net outperforms existing methods in accuracy and efficiency. The framework sets new state-of-the-art Dice scores on GLaS (94.48%) and ISIC 2018 (91.55%). Robustness tests and external validation further demonstrate its strong ability to generalize in binary segmentation tasks.

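Research Motivation and Objectives

Motivate the need to capture long-range dependencies in medical image segmentation, beyond what conventional ConvNets provide.
Develop an efficient hybrid encoder that combines ConvNeXt with the transformer-based H-CAT while reducing the quadratic complexity of standard attention.
Introduce context-aware attention mechanisms that fuse features across the channel and spatial dimensions.
Use a Spatial Attention Fusion Gate and a lightweight Conv-G-NeXt decoder to mitigate background noise during decoding.
Demonstrate strong generalization and robustness across diverse public datasets.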

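Proposed Method

Proposes a two-branch trans-convolutional encoder: a ConvNeXt encoder branch and a Hierarchical Context Addition Transformer (H-CAT) encoder branch.
Replaces standard self-attention with Context Addition Self-Attention (CAP), which uses spatial reduction blocks to lower complexity while modeling inter-image dependencies.
Introduces a depth-wise fully convolutional network (d-FCN) to encode positional information without fixed positional encodings.
Fuses the encoder outputs through Cross Channel Trans-Convolution Fusion Attention (CCTFA), which combines cross-channel attention with spatial attention.
Uses a Conv-G-NeXt decoder equipped with a Spatial Attention Fusion Gate (SAFG) to refine upsampling and suppress background noise.
Uses BN-based Conv-G-NeXt blocks with GELU activation for improved decoding performance.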
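
To make the complexity argument behind spatial reduction concrete, the following is a minimal arithmetic sketch, not the authors' CAP implementation. It assumes the generic spatial-reduction pattern in which keys and values are downsampled by a factor along each spatial axis before attention. The helper name `attention_score_entries`, the 56 x 56 feature map, and the reduction factor of 8 are all illustrative assumptions that do not come from the paper.

```python
def attention_score_entries(height, width, reduction=1):
    """Count query-key score entries per head for an H x W feature map.

    Hypothetical helper: keys/values are assumed to be downsampled by
    `reduction` along each spatial axis before attention is computed.
    """
    num_queries = height * width
    num_keys = (height // reduction) * (width // reduction)
    return num_queries * num_keys


# Illustrative sizes only (not taken from the paper).
full = attention_score_entries(56, 56)                   # standard self-attention
reduced = attention_score_entries(56, 56, reduction=8)   # keys/values reduced 8x per axis

print(full)             # 9834496
print(reduced)          # 153664
print(full // reduced)  # 64
```

Under these assumptions the score matrix shrinks by the square of the reduction factor, which is where the savings relative to full self-attention would come from.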

Figure 1: Demonstrating the importance of modeling long-range dependencies. Examples from various medical imaging benchmarks (GLaS, DS Bowl 2018, REFUGE, CVC Clinic DB, ISIC 2018) are shown. Blue outlines represent the ground truth (gt), red outlines indicate U-Net predictions, and green outlines sh

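Experimental Results

Research Questions

RQ1: Can a hybrid encoder combining ConvNeXt with the transformer-like H-CAT efficiently capture global context in medical image segmentation?
RQ2: Does the Context Addition Self-Attention mechanism reduce computational complexity while preserving inter-image dependencies?
RQ3: Does Cross Channel Trans-Convolution Fusion Attention effectively integrate multi-scale features from the two encoder branches?
RQ4: Does the Spatial Attention Fusion Gate make the decoder more robust to background noise and misclassification?
RQ5: Do the proposed design choices yield state-of-the-art Dice scores and robust performance across multiple public datasets?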

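Key Results

Achieves state-of-the-art Dice scores on GLaS (94.48%) and ISIC 2018 (91.55%).
Shows strong performance on five public datasets (GLaS, DS Bowl 2018, REFUGE, CVC Clinic DB, ISIC 2018).
Demonstrates robust generalization on binary segmentation tasks through robustness analysis and external validation.
Reduces the computational burden relative to full self-attention through CAP and spatial reduction in the attention path.
Decoding accuracy improves when the decoder uses Conv-G-NeXt blocks with BN-based normalization rather than LN.
Provides a public PyTorch implementation for reproducibility.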
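
For readers unfamiliar with the metric, the Dice score of two binary masks is twice their overlap divided by the sum of their sizes. The sketch below uses that standard definition on small hand-made masks. The review does not state how the paper aggregates Dice (per image or per dataset) or how predictions are thresholded, so the function name `dice_score`, the `eps` smoothing term, and the toy masks are assumptions for illustration only.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient for two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    # Overlap: pixels that are foreground in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # eps keeps the score defined (and equal to 1.0) when both masks are empty.
    return (2 * intersection + eps) / (total + eps)


# Toy 2 x 3 masks: 3 predicted pixels, 3 ground-truth pixels, 2 in common.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

print(round(dice_score(pred, target), 4))  # 0.6667
```

A reported value such as 94.48% corresponds to a score of 0.9448 on this 0-to-1 scale.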

Figure 2: Overview of the proposed CATFA-Net model for efficient medical image segmentation. This architecture integrates ConvNeXt and H-CAT encoder branches for efficient feature extraction, utilizing advanced attention mechanisms such as context-addition attention, which captures inter-image rese

This review was generated by AI and reviewed by a human editor.