QUICK REVIEW

[Paper Review] How to Efficiently Adapt Large Segmentation Model (SAM) to Medical Images

Xinrong Hu, Xiaowei Xu | arXiv (Cornell University) | 2023. 06. 23.

Advanced Neural Network Applications | Citations: 29

One-Line Summary

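This paper adapts SAM to medical image segmentation by freezing the SAM encoder and fine-tuning only a lightweight, prompt-free prediction head (the ViT-based AutoSAM, a CNN, or a linear layer), which enables label-efficient few-shot learning and multi-class masks without prompts.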

ABSTRACT

The emerging scale segmentation model, Segment Anything (SAM), exhibits impressive capabilities in zero-shot segmentation for natural images. However, when applied to medical images, SAM suffers from noticeable performance drop. To make SAM a real "foundation model" for the computer vision community, it is critical to find an efficient way to customize SAM for medical image dataset. In this work, we propose to freeze SAM encoder and finetune a lightweight task-specific prediction head, as most of weights in SAM are contributed by the encoder. In addition, SAM is a promptable model, while prompt is not necessarily available in all application cases, and precise prompts for multiple class segmentation are also time-consuming. Therefore, we explore three types of prompt-free prediction heads in this work, include ViT, CNN, and linear layers. For ViT head, we remove the prompt tokens in the mask decoder of SAM, which is named AutoSAM. AutoSAM can also generate masks for different classes with one single inference after modification. To evaluate the label-efficiency of our finetuning method, we compare the results of these three prediction heads on a public medical image segmentation dataset with limited labeled data. Experiments demonstrate that finetuning SAM significantly improves its performance on medical image dataset, even with just one labeled volume. Moreover, AutoSAM and CNN prediction head also has better segmentation accuracy than training from scratch and self-supervised learning approaches when there is a shortage of annotations.

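Motivation and Objectives

Raises the need to adapt SAM, a model built on natural images, to the medical imaging domain.
Proposes a lightweight fine-tuning strategy that freezes the SAM encoder and adds a prediction head for prompt-free, multi-class segmentation.
Evaluates three head architectures (ViT-based AutoSAM, a CNN-based head, and a Linear head) under limited labeled data.
Demonstrates label-efficient gains over training from scratch and self-supervised baselines on a public medical image dataset.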
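
The core idea of a frozen encoder with a small trainable head can be illustrated with a toy sketch. This is not the authors' implementation: `W_enc` is a hypothetical stand-in for a pretrained encoder (the real SAM encoder is a ViT), the data is synthetic, and the head is a plain softmax layer trained by gradient descent. The point is only that the encoder weights never receive an update while the head alone learns.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pretrained encoder: a fixed projection.
W_enc = rng.normal(size=(16, 8))


def encode(x):
    """Frozen feature extractor: W_enc is read here but never updated."""
    return np.tanh(x @ W_enc)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def cross_entropy(probs, labels):
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.log(picked + 1e-12).mean())


# Synthetic data: 200 samples, 16 raw features, 3 classes.
x = rng.normal(size=(200, 16))
feats = encode(x)
labels = np.argmax(feats @ rng.normal(size=(8, 3)), axis=1)

W_head = np.zeros((8, 3))      # the only trainable parameters
W_enc_before = W_enc.copy()    # kept to confirm the encoder stays untouched

loss_start = cross_entropy(softmax(feats @ W_head), labels)
for _ in range(200):
    probs = softmax(feats @ W_head)
    onehot = np.eye(3)[labels]
    grad = feats.T @ (probs - onehot) / len(labels)  # gradient w.r.t. the head only
    W_head -= 0.1 * grad
loss_end = cross_entropy(softmax(feats @ W_head), labels)

print(f"loss before: {loss_start:.3f}, after: {loss_end:.3f}")
```

Because only `W_head` is updated, the number of trained parameters is set by the head's size. This matches the abstract's rationale that most of SAM's weights sit in the encoder.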

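Proposed Method

Freeze the SAM encoder weights and fine-tune an attached lightweight, task-specific head.
Replace the SAM mask decoder with a prompt-free head; in AutoSAM, duplicating the embedding for each class enables multi-class masks.
Evaluate three head architectures: ViT-based AutoSAM, a CNN-based head (UNet-like decoder), and a Linear head.
Train on a small number of labeled volumes (1 or 5) with a combined loss that mixes Cross-Entropy and Dice losses.
Compare against zero-shot SAM with box prompts, a UNet trained from scratch, and SimCLR-based self-supervised pretraining.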
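
The combined loss mentioned above can be sketched as a weighted sum of cross-entropy and soft Dice loss. This is a minimal NumPy sketch under stated assumptions: the review does not give the mixing weight, so `ce_weight` is hypothetical, as are the function names and the smoothing constant `eps`.

```python
import numpy as np


def softmax(logits, axis=0):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def combined_loss(logits, target, ce_weight=0.5, eps=1e-6):
    """Weighted sum of cross-entropy and soft Dice loss.

    logits: (K, H, W) raw class scores; target: (H, W) integer labels in [0, K).
    ce_weight is an assumed mixing weight, not a value from the paper.
    """
    num_classes = logits.shape[0]
    probs = softmax(logits, axis=0)
    onehot = np.eye(num_classes)[target].transpose(2, 0, 1)  # (K, H, W)

    # Pixel-wise cross-entropy, averaged over all pixels.
    ce = -np.mean(np.sum(onehot * np.log(probs + eps), axis=0))

    # Soft Dice per class, averaged over classes.
    intersection = np.sum(probs * onehot, axis=(1, 2))
    denom = np.sum(probs, axis=(1, 2)) + np.sum(onehot, axis=(1, 2))
    dice_per_class = (2.0 * intersection + eps) / (denom + eps)
    dice_loss = 1.0 - dice_per_class.mean()

    return ce_weight * ce + (1.0 - ce_weight) * dice_loss


# Tiny 2x2 example with 3 classes: confident correct vs. confident wrong logits.
target = np.array([[0, 1], [2, 1]])
good_logits = 10.0 * np.eye(3)[target].transpose(2, 0, 1)
bad_logits = np.roll(good_logits, 1, axis=0)  # shifts every prediction to a wrong class

loss_good = combined_loss(good_logits, target)
loss_bad = combined_loss(bad_logits, target)
print(loss_good, loss_bad)
```

Cross-entropy penalizes each pixel independently, while the Dice term scores region overlap per class. Mixing them is a common choice when foreground structures occupy few pixels.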

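Experimental Results

Research Questions

RQ1: Can freezing the SAM encoder and adding a lightweight prompt-free head achieve competitive segmentation on medical images with limited annotations?
RQ2: Which head architecture (AutoSAM ViT, CNN, Linear) performs best in the few-shot setting?
RQ3: Does AutoSAM enable efficient multi-class segmentation on a medical dataset without prompts?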

Key Results

Method	RV Dice (%)	Myo Dice (%)	LV Dice (%)	Avg Dice (%)	ASSD
UNET	13.45 ± 1.89	16.24 ± 4.14	22.95 ± 0.47	17.55 ± 2.05	51.55 ± 6.42
UNET + SimCLR	14.25 ± 6.52	19.40 ± 6.36	27.54 ± 9.80	20.40 ± 3.95	33.14 ± 4.39
Encoder + LN	0.00 ± 0.00	20.42 ± 13.20	48.40 ± 22.50	22.94 ± 12.32	49.38 ± 12.32
Encoder + CNN	30.66 ± 14.28	39.96 ± 8.14	50.55 ± 13.56	40.39 ± 11.90	38.13 ± 16.42
AutoSAM (ft all)	17.10 ± 9.76	30.05 ± 7.77	43.82 ± 13.91	30.32 ± 10.05	25.93 ± 1.94
AutoSAM	31.66 ± 13.26	33.49 ± 9.23	52.83 ± 16.49	39.32 ± 12.82	23.59 ± 2.07
sup w/ UNET	40.36 ± 2.36	52.23 ± 3.80	62.91 ± 5.58	51.83 ± 3.41	32.28 ± 1.40
5 volumes / UNET + SimCLR	45.48 ± 4.65	58.20 ± 6.12	68.95 ± 3.88	57.18 ± 3.20	28.98 ± 7.13
5 volumes / Encoder + LN	22.07 ± 11.2	37.38 ± 11.56	33.69 ± 27.63	31.05 ± 16.14	-
5 volumes / Encoder + CNN	59.87 ± 1.86	62.81 ± 2.82	78.96 ± 2.79	67.21 ± 1.32	25.46 ± 11.14
5 volumes / AutoSAM (ft all)	22.43 ± 18.03	37.08 ± 13.49	53.75 ± 15.08	37.76 ± 15.22	24.44 ± 9.92
5 volumes / AutoSAM	58.48 ± 3.90	62.18 ± 2.97	80.58 ± 1.42	67.08 ± 2.56	17.54 ± 3.65
5 volumes / unsup SAM (box)	53.57 ± 0.86	39.60 ± 0.65	0.00 ± 0.00	31.06 ± 0.41	7.83 ± 0.67
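
In nearly every row above, the fourth value is the arithmetic mean of the first three, which is consistent with three per-class Dice scores followed by their average. The check below uses only the mean values copied from the table (the ± terms are ignored). The tolerance is an assumption chosen to absorb two-decimal rounding.

```python
# Mean values copied from the table rows above, columns 1 to 4 in order.
rows = {
    "UNET": (13.45, 16.24, 22.95, 17.55),
    "UNET + SimCLR": (14.25, 19.40, 27.54, 20.40),
    "Encoder + LN": (0.00, 20.42, 48.40, 22.94),
    "Encoder + CNN": (30.66, 39.96, 50.55, 40.39),
    "AutoSAM (ft all)": (17.10, 30.05, 43.82, 30.32),
    "AutoSAM": (31.66, 33.49, 52.83, 39.32),
    "sup w/ UNET": (40.36, 52.23, 62.91, 51.83),
    "5 volumes / UNET + SimCLR": (45.48, 58.20, 68.95, 57.18),
    "5 volumes / Encoder + LN": (22.07, 37.38, 33.69, 31.05),
    "5 volumes / Encoder + CNN": (59.87, 62.81, 78.96, 67.21),
    "5 volumes / AutoSAM (ft all)": (22.43, 37.08, 53.75, 37.76),
    "5 volumes / AutoSAM": (58.48, 62.18, 80.58, 67.08),
    "5 volumes / unsup SAM (box)": (53.57, 39.60, 0.00, 31.06),
}

TOLERANCE = 0.01  # assumed slack for values rounded to two decimals


def deviation(values):
    """Absolute gap between the mean of columns 1-3 and column 4."""
    first, second, third, fourth = values
    return abs((first + second + third) / 3 - fourth)


matching = [name for name, v in rows.items() if deviation(v) <= TOLERANCE]
outliers = [name for name, v in rows.items() if deviation(v) > TOLERANCE]

print(f"{len(matching)} of {len(rows)} rows match")
print("rows that do not match:", outliers)
```

Under this reading, 12 of the 13 rows agree. The "5 volumes / UNET + SimCLR" row is off by about 0.36, which may indicate a typo in one of its values.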

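Fine-tuning a lightweight head on the frozen SAM encoder substantially improves medical segmentation, even with a single labeled volume.
In the low-data regime, AutoSAM and the CNN head outperform training from scratch and SimCLR, while the Linear head lags behind because of overfitting.
AutoSAM (the ViT-based head) and the CNN head achieve higher Dice scores than the other baselines, and AutoSAM generally performs better on ASSD.
Larger SAM encoders (vit-h) generally tend to improve results, although AutoSAM is less sensitive to encoder size than Encoder + CNN.
With more labeled data (5 volumes), AutoSAM and the CNN head keep a clear lead, particularly in Dice, and their margin over the Linear head and the fully fine-tuned AutoSAM (ft all) widens.
By duplicating the embedding for each class, AutoSAM can generate multi-class masks in a single inference without prompts.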
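
To make the idea of prompt-free multi-class output concrete, here is a minimal sketch of how one forward pass can yield a label for every pixel. It is not AutoSAM's actual decoder. `pixel_features` and `class_vectors` are hypothetical stand-ins filled with random values, and the class count of four (a background class plus RV, Myo, and LV) is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 4 classes (background + RV, Myo, LV), 8 channels, 6x6 image.
num_classes, channels, height, width = 4, 8, 6, 6

# Hypothetical stand-ins for decoder outputs: per-pixel features and
# one vector per class (random here, learned in a real model).
pixel_features = rng.normal(size=(channels, height, width))
class_vectors = rng.normal(size=(num_classes, channels))

# A single pass produces one logit map per class: (K, C) x (C, H, W) -> (K, H, W).
logits = np.einsum("kc,chw->khw", class_vectors, pixel_features)

# Each pixel takes the class with the highest logit, so no per-class prompt is required.
label_map = logits.argmax(axis=0)

print(label_map)
```

A prompt-driven model would instead need a separate box or point for each structure. Here all classes come out of the same pass.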

This review was generated by AI and reviewed by a human editor.