QUICK REVIEW

[Paper Review] Fast Segment Anything

Xu Zhao, Wenchao Ding|arXiv (Cornell University)|Jun 21, 2023

Multimodal Machine Learning Applications | Cited by 33

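One-sentence summary

FastSAM proposes a real-time, CNN-based alternative to SAM for the segment-anything task: it runs all-instance segmentation with YOLOv8-seg, then applies prompt-guided selection, reaching comparable performance at roughly 50 times the speed.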

ABSTRACT

The recently proposed segment anything model (SAM) has made a significant influence in many computer vision tasks. It is becoming a foundation step for many high-level tasks, like image segmentation, image caption, and image editing. However, its huge computation costs prevent it from wider applications in industry scenarios. The computation mainly comes from the Transformer architecture at high-resolution inputs. In this paper, we propose a speed-up alternative method for this fundamental task with comparable performance. By reformulating the task as segments-generation and prompting, we find that a regular CNN detector with an instance segmentation branch can also accomplish this task well. Specifically, we convert this task to the well-studied instance segmentation task and directly train the existing instance segmentation method using only 1/50 of the SA-1B dataset published by SAM authors. With our method, we achieve a comparable performance with the SAM method at 50 times higher run-time speed. We give sufficient experimental results to demonstrate its effectiveness. The codes and demos will be released at https://github.com/CASIA-IVA-Lab/FastSAM.

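Motivation and objectives

Enable real-time segment-anything applications in industry by reducing computational cost.
Investigate whether a CNN-based detector can match SAM's performance on the segment-anything task.
Demonstrate the two-stage FastSAM framework (all-instance segmentation followed by prompt-guided selection) with markedly faster inference.
Evaluate FastSAM on zero-shot tasks, including edge detection, object proposal generation, and text-guided segmentation, to test its generalization.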

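Proposed method

Reformulate segment-anything as a two-stage process: all-instance segmentation (AIS) followed by prompt-guided selection (PGS).
For AIS, use YOLOv8-seg, a detector with an instance segmentation branch (YOLACT-style prototypes), to segment every object in the image.
Train on 2% (1/50) of the SA-1B dataset so that the CNN detector learns robust masks.
For PGS, use point prompts, box prompts, and text prompts (via CLIP) to identify the target object among the AIS masks.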
Avoid end-to-end Transformer-based segmentation; instead, map each prompt to a mask selection with lightweight prompt handling.
Provide a speed comparison on an RTX 3090 across a range of prompt settings, showing inference up to about 50x faster than SAM.
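
The prompt-guided selection stage can be pictured as plain post-processing over the masks that the first stage returns. The sketch below is a minimal illustration in Python with NumPy, not the authors' implementation: the function names `select_by_point` and `select_by_box`, the boolean mask layout (N, H, W), and the exclusive box convention are all assumptions made for the example. A point prompt keeps every mask that contains the point; a box prompt keeps the mask whose bounding box has the highest IoU with the prompt box.

```python
import numpy as np


def select_by_point(masks: np.ndarray, point: tuple[int, int]) -> list[int]:
    """Return the indices of all masks that contain the (x, y) point.

    masks: boolean array of shape (N, H, W), one mask per instance (assumed layout).
    """
    x, y = point
    return [i for i, mask in enumerate(masks) if mask[y, x]]


def select_by_box(masks: np.ndarray, box: tuple[int, int, int, int]) -> int:
    """Return the index of the mask whose bounding box best matches `box` by IoU.

    box: (x1, y1, x2, y2) with x2 and y2 exclusive (assumed convention).
    Returns -1 when no mask overlaps the box.
    """
    bx1, by1, bx2, by2 = box
    best_idx, best_iou = -1, 0.0
    for i, mask in enumerate(masks):
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            continue  # skip empty masks
        # Tight bounding box around the mask, exclusive on the far edges.
        mx1, my1 = int(xs.min()), int(ys.min())
        mx2, my2 = int(xs.max()) + 1, int(ys.max()) + 1
        inter_w = max(0, min(bx2, mx2) - max(bx1, mx1))
        inter_h = max(0, min(by2, my2) - max(by1, my1))
        inter = inter_w * inter_h
        union = (bx2 - bx1) * (by2 - by1) + (mx2 - mx1) * (my2 - my1) - inter
        iou = inter / union
        if iou > best_iou:
            best_idx, best_iou = i, iou
    return best_idx


if __name__ == "__main__":
    # Two toy masks standing in for stage-one output.
    demo = np.zeros((2, 6, 6), dtype=bool)
    demo[0, 0:3, 0:3] = True
    demo[1, 2:6, 2:6] = True
    print(select_by_point(demo, (1, 1)))
    print(select_by_box(demo, (2, 2, 6, 6)))
```

Because both functions only scan masks that already exist, their cost is small next to the segmentation pass itself, which is consistent with the near-constant runtime across prompt types that the review describes.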

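Experimental results

Research questions

RQ1: Can a CNN-based detector with an instance segmentation branch match SAM's segmentation performance on the segment-anything task while running at real-time speed?
RQ2: How does FastSAM compare with SAM on zero-shot tasks such as edge detection, object proposal generation, and text-guided segmentation?
RQ3: What are the strengths and limitations of decoupling AIS and PGS compared with an end-to-end Transformer approach?
RQ4: Can training on only a fraction of SA-1B yield competitive results in real-world applications?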

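Key findings

FastSAM runs about 50 times faster than SAM (32×32 prompt mode) on a single RTX 3090 while delivering nearly equivalent performance.
In the zero-shot setting, edge detection on BSDS500 is competitive: R50 is high and AP is similar to SAM's.
For object proposals on COCO, FastSAM reaches an AR@1000 of 63.7, slightly above SAM with 32×32 prompts, with much faster inference.
On LVIS v1, FastSAM shows strong bbox AR@1000 and is competitive with SAM on mask AR@1000 in the zero-shot setting.
FastSAM performs zero-shot instance segmentation using ViTDet-provided boxes as prompts, but its AP on COCO/LVIS is lower than that of fully supervised methods and SAM.
Text-prompted segmentation with CLIP works but is slow because of the throughput of CLIP embedding computation, which highlights the trade-off between flexibility and speed.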
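
The text-prompt cost noted in the last finding is easier to see once the matching step is written out: every candidate mask region needs an image embedding before it can be compared with the text embedding. The sketch below shows only the comparison step, assuming the embeddings have already been produced by some CLIP-style encoder. No CLIP API is called here, and `select_by_text` and the array shapes are hypothetical names and conventions chosen for illustration, not the authors' code.

```python
import numpy as np


def select_by_text(region_embeddings: np.ndarray, text_embedding: np.ndarray) -> int:
    """Return the index of the region most similar to the text prompt.

    region_embeddings: shape (N, D), one image embedding per mask region (assumed).
    text_embedding: shape (D,), embedding of the text prompt (assumed).
    Similarity is cosine similarity, computed after L2 normalization.
    """
    regions = region_embeddings / np.linalg.norm(region_embeddings, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    similarities = regions @ text
    return int(np.argmax(similarities))


if __name__ == "__main__":
    # Toy 3-dimensional embeddings standing in for real encoder output.
    regions = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]])
    prompt = np.array([0.0, 2.0, 0.0])
    print(select_by_text(regions, prompt))
```

The comparison itself is a single matrix-vector product; in a setup like this, the expensive part is producing one embedding per mask region, so the cost grows with the number of masks returned by the first stage.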

This review was generated by AI and checked by a human editor.