QUICK REVIEW

[Paper Review] Fast Segment Anything

Xu Zhao, Wenchao Ding|arXiv (Cornell University)|2023. 06. 21.

Multimodal Machine Learning Applications | Citations: 33

One-Line Summary

FastSAM performs all-instance segmentation with YOLOv8-seg and then applies prompt-guided selection, offering a real-time CNN-based alternative to SAM for the segment-anything task that reaches comparable performance at roughly 50x the speed.

ABSTRACT

The recently proposed segment anything model (SAM) has made a significant influence in many computer vision tasks. It is becoming a foundation step for many high-level tasks, like image segmentation, image caption, and image editing. However, its huge computation costs prevent it from wider applications in industry scenarios. The computation mainly comes from the Transformer architecture at high-resolution inputs. In this paper, we propose a speed-up alternative method for this fundamental task with comparable performance. By reformulating the task as segments-generation and prompting, we find that a regular CNN detector with an instance segmentation branch can also accomplish this task well. Specifically, we convert this task to the well-studied instance segmentation task and directly train the existing instance segmentation method using only 1/50 of the SA-1B dataset published by SAM authors. With our method, we achieve a comparable performance with the SAM method at 50 times higher run-time speed. We give sufficient experimental results to demonstrate its effectiveness. The codes and demos will be released at https://github.com/CASIA-IVA-Lab/FastSAM.

Motivation and Objectives

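Reduce computational requirements so that real-time segment-anything applications become practical in industry.
Determine whether a CNN-based detector can match SAM's performance on the segment-anything task.
Demonstrate the two-stage FastSAM framework (all-instance segmentation, AIS, and prompt-guided selection, PGS) with much faster inference.
Evaluate FastSAM on zero-shot tasks such as edge detection, object proposal generation, and text-guided segmentation to check how well it generalizes.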

Proposed Method

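Reformulates segment-anything as a two-stage process: all-instance segmentation (AIS) followed by prompt-guided selection (PGS).
For AIS, uses YOLOv8-seg, a detector with an instance segmentation branch (YOLACT-style prototypes), to segment every object in the image.
Trains the CNN detector on 2% (1/50) of the SA-1B dataset so that it learns robust masks.
Applies prompt-guided selection, using point prompts, box prompts, and text prompts (via CLIP), to identify the target object among the AIS masks.
Uses a simple prompt encoder/decoder to map prompts to a mask selection, without end-to-end Transformer-based segmentation.
Provides speed comparisons showing inference about 50 times faster than SAM on an RTX 3090 across prompt settings.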
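
The prompt-guided selection step described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the AIS stage has already produced a list of binary masks, and the function names (`select_by_point`, `select_by_box`, `select_by_text`), the end-exclusive `(r0, c0, r1, c1)` box convention, and the precomputed embedding vectors standing in for CLIP outputs are all hypothetical choices made for illustration.

```python
from math import sqrt


def select_by_point(masks, point):
    """Return indices of all masks that cover the (row, col) point."""
    r, c = point
    return [i for i, m in enumerate(masks) if m[r][c] == 1]


def mask_bbox(mask):
    """Tight box (r0, c0, r1, c1), end-exclusive, or None for an empty mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return (min(rows), min(cols), max(rows) + 1, max(cols) + 1)


def box_iou(a, b):
    """Intersection over union of two end-exclusive (r0, c0, r1, c1) boxes."""
    ih = min(a[2], b[2]) - max(a[0], b[0])
    iw = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(ih, 0) * max(iw, 0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_by_box(masks, box):
    """Return the index of the mask whose bounding box best overlaps the prompt box."""
    best, best_iou = None, 0.0
    for i, m in enumerate(masks):
        bb = mask_bbox(m)
        if bb is None:
            continue
        iou = box_iou(bb, box)
        if iou > best_iou:
            best, best_iou = i, iou
    return best


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = sqrt(sum(x * x for x in u))
    nv = sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_by_text(mask_embeddings, text_embedding):
    """Return the index of the mask embedding most similar to the text embedding.

    Assumption: the embeddings were computed elsewhere (e.g. by CLIP).
    """
    scores = [cosine(e, text_embedding) for e in mask_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy 4x4 image with two candidate masks from a hypothetical AIS stage.
masks = [
    [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]],
]
```

In this sketch every prompt type reduces to a lookup over masks that already exist, which is why the selection stage adds little cost on top of the single CNN forward pass; only the text path needs an extra embedding model.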

Experimental Results

Research Questions

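RQ1: Can a CNN-based detector with an instance segmentation branch achieve segmentation performance similar to SAM while running at real-time speed?
RQ2: How does FastSAM compare with SAM on zero-shot tasks such as edge detection, object proposal generation, and text-guided segmentation?
RQ3: What are the strengths and limitations of decoupling segment-anything into AIS and PGS, compared with an end-to-end Transformer approach?
RQ4: Can competitive results in practical applications be obtained using only a fraction of SA-1B?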

Key Results

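FastSAM runs about 50 times faster than SAM (32×32 prompt mode) on an RTX 3090 while maintaining comparable performance.
In the zero-shot setting, FastSAM achieves competitive edge detection results on BSDS500, with a higher R50 and an AP similar to SAM's.
For object proposals on COCO, FastSAM reaches an AR@1000 of 63.7, slightly better than SAM with 32×32 prompts, while being much faster.
On LVIS v1, FastSAM shows strong bbox AR@1000 and a mask AR@1000 that is competitive with SAM, and it is particularly strong in the zero-shot setting.
FastSAM performs solidly in zero-shot instance segmentation when prompted with boxes from ViTDet, but its AP on COCO/LVIS is in some cases lower than that of fully supervised methods and SAM.
Text-prompted segmentation with CLIP is feasible but slower because of CLIP embedding throughput, which highlights the trade-off between flexibility and speed.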
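
For readers unfamiliar with the AR@1000 figures quoted above, the following is a simplified sketch of average recall at k proposals, averaged over IoU thresholds from 0.50 to 0.95 in steps of 0.05. It is an illustration only and not the official COCO or LVIS evaluation code: it assumes `(x0, y0, x1, y1)` boxes, assumes proposals are already sorted by descending score, and lets several ground-truth boxes match the same proposal, which the official tooling does not do. The function names are hypothetical.

```python
def iou_xyxy(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(gt_boxes, proposals, k, iou_thr):
    """Fraction of ground-truth boxes covered by any of the top-k proposals."""
    if not gt_boxes:
        return 0.0
    top = proposals[:k]
    hits = sum(1 for g in gt_boxes if any(iou_xyxy(g, p) >= iou_thr for p in top))
    return hits / len(gt_boxes)


def average_recall(gt_boxes, proposals, k):
    """Mean of recall_at_k over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(recall_at_k(gt_boxes, proposals, k, t) for t in thresholds) / len(thresholds)


# Toy example: two ground-truth boxes, two proposals sorted by score.
gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
props = [(0, 0, 10, 10), (20, 20, 30, 28)]
print(average_recall(gt, props, k=2))
```

A larger k gives the model more chances to cover each ground-truth object, which is why proposal quality is usually reported at several budgets such as 10, 100, and 1000.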


This review was generated by AI and reviewed by a human editor.