QUICK REVIEW

[Paper Review] ISTR: End-to-End Instance Segmentation with Transformers

Jie Hu, Liujuan Cao|arXiv (Cornell University)|May 3, 2021

Advanced Neural Network Applications|References: 54|Citations: 54

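One-line summary

ISTR is a Transformer-based end-to-end instance segmentation framework that regresses low-dimensional mask embeddings, trains with a set loss based on bipartite matching, and iteratively refines its predictions. It achieves competitive results on COCO without NMS.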

ABSTRACT

End-to-end paradigms significantly improve the accuracy of various deep-learning-based computer vision models. To this end, tasks like object detection have been upgraded by replacing non-end-to-end components, such as removing non-maximum suppression by training with a set loss based on bipartite matching. However, such an upgrade is not applicable to instance segmentation, due to its significantly higher output dimensions compared to object detection. In this paper, we propose an instance segmentation Transformer, termed ISTR, which is the first end-to-end framework of its kind. ISTR predicts low-dimensional mask embeddings, and matches them with ground truth mask embeddings for the set loss. Besides, ISTR concurrently conducts detection and segmentation with a recurrent refinement strategy, which provides a new way to achieve instance segmentation compared to the existing top-down and bottom-up frameworks. Benefiting from the proposed end-to-end mechanism, ISTR demonstrates state-of-the-art performance even with approximation-based suboptimal embeddings. Specifically, ISTR obtains a 46.8/38.6 box/mask AP using ResNet50-FPN, and a 48.1/39.9 box/mask AP using ResNet101-FPN, on the MS COCO dataset. Quantitative and qualitative results reveal the promising potential of ISTR as a solid baseline for instance-level recognition. Code has been made available at: https://github.com/hujiecpp/ISTR.

Motivation and objectives

Motivate end-to-end training for instance segmentation beyond traditional NMS-dependent pipelines.
Develop a framework that predicts low-dimensional mask embeddings alongside boxes and class labels.
Enable end-to-end optimization through a set-based bipartite matching loss.
Introduce a recurrent refinement strategy to jointly improve detection and segmentation across stages.

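Proposed method

Learn a mask embedding encoder/decoder so that each mask is represented by a low-dimensional embedding.
Define a bipartite matching cost that combines box, class, and mask-embedding similarity.
Apply a set loss to the matched predictions to supervise bounding boxes, classes, and mask embeddings.
Fuse image features and RoI features with a Transformer encoder that uses dynamic attention, and feed the result to the prediction heads.
Skip NMS at inference; update the query boxes and predictions through N stages of recurrent refinement.
Train with a multi-scale backbone and standard COCO-style losses (L1, GIoU, focal loss, and Dice loss for masks).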
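
The matching step in the list above can be illustrated with a small sketch. It is not the authors' implementation: the cost form (negative class probability, L1 box distance, negative cosine similarity of mask embeddings), the weights `w_cls`, `w_box`, `w_mask`, and the dictionary layout are assumptions made for illustration, and a brute-force search over permutations stands in for the Hungarian algorithm that is normally used for bipartite matching.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def pair_cost(pred, gt, w_cls=1.0, w_box=1.0, w_mask=1.0):
    """Hypothetical matching cost; lower means a better prediction/target pair."""
    cls_cost = -pred["probs"][gt["label"]]            # reward a confident correct class
    box_cost = sum(abs(p - g) for p, g in zip(pred["box"], gt["box"]))  # L1 on boxes
    mask_cost = -cosine(pred["emb"], gt["emb"])       # reward similar mask embeddings
    return w_cls * cls_cost + w_box * box_cost + w_mask * mask_cost


def match(preds, gts):
    """One-to-one assignment by exhaustive search (only viable for tiny inputs).

    Entry j of the result is the index of the prediction matched to gts[j].
    Requires len(preds) >= len(gts).
    """
    best_cost, best = math.inf, None
    for perm in permutations(range(len(preds)), len(gts)):
        total = sum(pair_cost(preds[i], gts[j]) for j, i in enumerate(perm))
        if total < best_cost:
            best_cost, best = total, list(perm)
    return best


# Toy data: three predictions, two ground-truth instances, 2-D embeddings.
preds = [
    {"probs": [0.1, 0.9], "box": [0.5, 0.5, 0.9, 0.9], "emb": [0.0, 1.0]},
    {"probs": [0.8, 0.2], "box": [0.1, 0.1, 0.4, 0.4], "emb": [1.0, 0.1]},
    {"probs": [0.5, 0.5], "box": [0.0, 0.0, 1.0, 1.0], "emb": [0.5, 0.5]},
]
gts = [
    {"label": 0, "box": [0.1, 0.1, 0.4, 0.5], "emb": [1.0, 0.0]},
    {"label": 1, "box": [0.5, 0.5, 1.0, 0.9], "emb": [0.0, 1.0]},
]
assignment = match(preds, gts)
print(assignment)  # [1, 0]
```

Each ground-truth instance receives exactly one prediction, and the unmatched prediction (index 2 here) would be supervised as background. This one-to-one assignment is what lets a set loss replace NMS; with realistic numbers of queries, a polynomial-time solver would be needed instead of the exhaustive search.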

Experimental results

Research questions

RQ1: Can end-to-end instance segmentation be achieved with Transformers by predicting mask embeddings instead of full masks?
RQ2: Does a set-based bipartite matching loss combining boxes, classes, and mask embeddings enable NMS-free inference?
RQ3: How does recurrent refinement affect joint detection and segmentation performance?
RQ4: What architectural choices (dynamic attention, pooling, loss terms) maximize end-to-end performance for COCO?
RQ5: How does ISTR perform compared to state-of-the-art methods on COCO, especially for small objects?

Key findings

Method	Backbone	Epochs	APm	APm_S	APm_M	APm_L	APb	APb_S	APb_M	APb_L	FPS	Time (ms)	GPU
ISTR, ours	ResNet50-FPN	36	38.6	22.1	40.4	50.6	46.8	27.8	48.7	59.9	13.8	72.5	1080Ti
ISTR, ours	ResNet101-FPN	36	39.9	22.8	41.9	52.3	48.1	28.7	50.4	61.5	11.0	91.3	1080Ti
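
The FPS and Time columns can be cross-checked with a little arithmetic. The sketch below assumes that Time is the per-image inference time in milliseconds, which the table does not state explicitly; under that assumption FPS should equal 1000 divided by Time. The helper name `fps_from_ms` is made up for this example.

```python
def fps_from_ms(ms_per_image):
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / ms_per_image


# (reported FPS, reported Time) copied from the two table rows.
rows = {
    "ResNet50-FPN": (13.8, 72.5),
    "ResNet101-FPN": (11.0, 91.3),
}

for backbone, (reported_fps, time_ms) in rows.items():
    derived = round(fps_from_ms(time_ms), 1)
    print(backbone, "reported:", reported_fps, "derived:", derived)
```

Both derived values round to the reported FPS, which is consistent with the millisecond reading of the Time column.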

ISTR achieves competitive COCO metrics, e.g., 46.8 box AP / 38.6 mask AP (ResNet50-FPN) and 48.1 box AP / 39.9 mask AP (ResNet101-FPN) on test-dev.
Mask embeddings outperform direct mask predictions, with optimal embedding dimension around 60–80.
Cosine similarity on normalized mask embeddings improves mask matching cost and overall performance.
Dynamic attention for fusing RoI and image features yields gains over multi-head attention.
Global average pooling with position embeddings improves both box and mask APs.
Stage-wise recurrent refinement allows detection and segmentation to improve jointly across stages, saturating after a few iterations.
ISTR shows strong performance on small objects and competitive box AP compared to end-to-end DETR-based methods.
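
One property of the cosine-based mask matching cost mentioned in the findings can be shown in a few lines. This is an illustrative sketch, not the paper's code: the function names and the 3-D toy embeddings are assumptions, and the example only demonstrates that cosine similarity on L2-normalized embeddings ignores the magnitude of an embedding, whereas a plain L2 distance does not. It does not by itself explain the reported accuracy gain.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def cosine_similarity(u, v):
    """Dot product of the L2-normalized vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


def l2_distance(u, v):
    """Euclidean distance between the raw (unnormalized) vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


gt = [0.6, 0.8, 0.0]             # toy ground-truth embedding
pred = [0.6, 0.8, 0.0]           # prediction with the same direction and scale
scaled = [3.0 * x for x in pred]  # same direction, three times the magnitude

print(cosine_similarity(gt, pred), cosine_similarity(gt, scaled))
print(l2_distance(gt, pred), l2_distance(gt, scaled))
```

The two cosine similarities are both (numerically) 1, while the L2 distance grows from 0 to about 2 for the rescaled prediction, so a cosine-based cost ranks the two predictions as equally good matches.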

This review was generated by AI and checked by a human editor.