QUICK REVIEW

[Paper Review] ISTR: End-to-End Instance Segmentation with Transformers

Jie Hu, Liujuan Cao|arXiv (Cornell University)|May 3, 2021

Advanced Neural Network Applications | References: 54 | Cited by: 54

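ONE-SENTENCE SUMMARY

ISTR is a Transformer-based end-to-end instance segmentation framework that regresses low-dimensional mask embeddings, trains with a bipartite-matching set loss, and iteratively refines its predictions, achieving competitive results on COCO without NMS.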

ABSTRACT

End-to-end paradigms significantly improve the accuracy of various deep-learning-based computer vision models. To this end, tasks like object detection have been upgraded by replacing non-end-to-end components, such as removing non-maximum suppression by training with a set loss based on bipartite matching. However, such an upgrade is not applicable to instance segmentation, due to its significantly higher output dimensions compared to object detection. In this paper, we propose an instance segmentation Transformer, termed ISTR, which is the first end-to-end framework of its kind. ISTR predicts low-dimensional mask embeddings, and matches them with ground truth mask embeddings for the set loss. Besides, ISTR concurrently conducts detection and segmentation with a recurrent refinement strategy, which provides a new way to achieve instance segmentation compared to the existing top-down and bottom-up frameworks. Benefiting from the proposed end-to-end mechanism, ISTR demonstrates state-of-the-art performance even with approximation-based suboptimal embeddings. Specifically, ISTR obtains a 46.8/38.6 box/mask AP using ResNet50-FPN, and a 48.1/39.9 box/mask AP using ResNet101-FPN, on the MS COCO dataset. Quantitative and qualitative results reveal the promising potential of ISTR as a solid baseline for instance-level recognition. Code has been made available at: https://github.com/hujiecpp/ISTR.

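RESEARCH MOTIVATION AND GOALS

Advance end-to-end training for instance segmentation, moving beyond traditional pipelines that depend on NMS.
Develop a framework that jointly predicts bounding boxes, class labels, and low-dimensional mask embeddings.
Enable end-to-end optimization through a set loss based on bipartite matching.
Introduce a recurrent refinement strategy that improves detection and segmentation together across stages.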

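PROPOSED METHOD

Learn a mask embedding encoder/decoder that represents masks as low-dimensional embeddings.
Define a bipartite matching cost that combines bounding-box, class, and mask-embedding similarity.
Apply a set loss to the matched predictions to supervise bounding boxes, classes, and mask embeddings.
Fuse image features with RoI features through a Transformer encoder with dynamic attention, and feed the result to the prediction heads.
Apply N stages of recurrent refinement to update the query boxes and predictions; inference uses no NMS.
Train with a multi-scale backbone and standard losses (L1, GIoU, focal loss, and Dice for masks).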
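
The matching step listed above can be illustrated with a small sketch. It is not the authors' implementation: the dictionary layout, the cost weights, and the exact form of each cost term are assumptions made for illustration, and a brute-force search over permutations stands in for the Hungarian algorithm, so it is only practical for a handful of instances.

```python
from itertools import permutations
from math import sqrt


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def l1_distance(box_a, box_b):
    """L1 distance between two boxes given as (x1, y1, x2, y2)."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def pair_cost(pred, gt, w_cls=1.0, w_box=1.0, w_mask=1.0):
    """Cost of assigning one prediction to one ground-truth instance.

    The weights and the form of each term are illustrative assumptions.
    """
    cls_cost = -pred["probs"][gt["label"]]  # high class probability -> low cost
    box_cost = l1_distance(pred["box"], gt["box"])
    mask_cost = 1.0 - cosine_similarity(pred["emb"], gt["emb"])
    return w_cls * cls_cost + w_box * box_cost + w_mask * mask_cost


def match(preds, gts):
    """Return (pred_index, gt_index) pairs that minimize the total cost.

    Brute force over permutations; assumes len(preds) >= len(gts).
    """
    best_cost, best_assignment = float("inf"), ()
    for pred_ids in permutations(range(len(preds)), len(gts)):
        total = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(pred_ids))
        if total < best_cost:
            best_cost, best_assignment = total, pred_ids
    return [(p, g) for g, p in enumerate(best_assignment)]


# Toy data: three predictions, two ground-truth instances.
preds = [
    {"probs": [0.1, 0.9], "box": (0.5, 0.5, 0.9, 0.9), "emb": [0.0, 1.0]},
    {"probs": [0.8, 0.2], "box": (0.1, 0.1, 0.4, 0.4), "emb": [1.0, 0.0]},
    {"probs": [0.5, 0.5], "box": (0.0, 0.0, 1.0, 1.0), "emb": [1.0, 1.0]},
]
gts = [
    {"label": 0, "box": (0.1, 0.1, 0.4, 0.4), "emb": [1.0, 0.1]},
    {"label": 1, "box": (0.5, 0.5, 0.9, 0.9), "emb": [0.1, 1.0]},
]
print(match(preds, gts))  # [(1, 0), (0, 1)]
```

Each ground-truth instance receives exactly one prediction, and the unmatched prediction (index 2 here) would be supervised as background. Because the assignment is one-to-one, duplicate predictions are penalized during training, which is what makes NMS unnecessary at inference.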

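EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a Transformer achieve end-to-end instance segmentation by predicting mask embeddings instead of full masks?
RQ2: Does a set-based bipartite matching loss that combines bounding boxes, classes, and mask embeddings enable NMS-free inference?
RQ3: How does recurrent refinement affect joint detection and segmentation performance?
RQ4: Which architectural choices (dynamic attention, pooling, loss terms) maximize end-to-end performance on COCO?
RQ5: How does ISTR perform on COCO relative to recent methods, especially on small objects?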

Key Findings

Method	Backbone	Epochs	Mask AP	Mask AP_S	Mask AP_M	Mask AP_L	Box AP	Box AP_S	Box AP_M	Box AP_L	FPS	Time (ms)	GPU
ISTR, ours	ResNet50-FPN	36	38.6	22.1	40.4	50.6	46.8	27.8	48.7	59.9	13.8	72.5	1080Ti
ISTR, ours	ResNet101-FPN	36	39.9	22.8	41.9	52.3	48.1	28.7	50.4	61.5	11.0	91.3	1080Ti
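
As a quick consistency check on the two speed columns, assuming the Time column is per-image latency in milliseconds, FPS should be roughly 1000 divided by Time, and both rows agree after rounding:

```latex
\mathrm{FPS} \approx \frac{1000}{t_{\mathrm{ms}}}, \qquad
\frac{1000}{72.5} \approx 13.8, \qquad
\frac{1000}{91.3} \approx 11.0
```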

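ISTR is competitive on COCO metrics: on test-dev it reaches 46.8 box AP / 38.6 mask AP with ResNet50-FPN and 48.1 box AP / 39.9 mask AP with ResNet101-FPN.
Mask embeddings outperform direct mask prediction, with the best embedding dimension at roughly 60–80.
Computing cosine similarity on normalized mask embeddings improves the mask matching cost and overall performance.
Dynamic attention for fusing RoI and image features improves over multi-head attention.
Global average pooling with positional embeddings improves both box and mask AP.
Stage-wise recurrent refinement improves detection and segmentation together at each stage and saturates after a few iterations.
ISTR performs well on small objects, and its box AP is competitive with end-to-end DETR-based methods.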
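
The stage-wise refinement described in these findings can be sketched as a simple loop. This is an illustration under stated assumptions, not the paper's code: `stage_head` is a hypothetical callable, the default of six stages is an assumed value, and the (dx, dy, dw, dh) box update is a parameterization common in detection heads that may differ from what ISTR uses.

```python
from math import exp, log


def apply_deltas(box, deltas):
    """Update a (cx, cy, w, h) box with predicted (dx, dy, dw, dh) offsets.

    Center offsets are scaled by the box size and size offsets are
    log-scale; this parameterization is an assumption.
    """
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * exp(dw), h * exp(dh))


def recurrent_refinement(init_boxes, stage_head, num_stages=6):
    """Run the stage head repeatedly, feeding each stage's boxes to the next.

    stage_head(boxes) is a hypothetical callable that returns one dict per
    box with at least a "deltas" entry; a real head would also return class
    scores and a mask embedding. The per-stage history is returned so that
    every stage can be supervised or inspected.
    """
    boxes = list(init_boxes)
    history = []
    for _ in range(num_stages):
        outputs = stage_head(boxes)
        boxes = [apply_deltas(b, o["deltas"]) for b, o in zip(boxes, outputs)]
        history.append({"boxes": boxes, "outputs": outputs})
    return history


def make_toy_head(target):
    """Toy stand-in for a learned head: moves each box halfway to the target."""
    tx, ty, tw, th = target

    def head(boxes):
        outputs = []
        for cx, cy, w, h in boxes:
            deltas = (
                0.5 * (tx - cx) / w,
                0.5 * (ty - cy) / h,
                0.5 * log(tw / w),
                0.5 * log(th / h),
            )
            outputs.append({"deltas": deltas})
        return outputs

    return head


# One query box covering the whole image, refined toward a small target.
target = (0.3, 0.4, 0.2, 0.3)
history = recurrent_refinement([(0.5, 0.5, 1.0, 1.0)], make_toy_head(target))
center_errors = [abs(stage["boxes"][0][0] - target[0]) for stage in history]
print(center_errors)  # shrinks at every stage
```

In this toy setup the per-stage gain halves each time, which mirrors the reported behavior of improvements saturating after a few iterations.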


This review was generated by AI and reviewed by a human editor.