QUICK REVIEW

[Paper Digest] SipMask: Spatial Information Preservation for Fast Image and Video Instance Segmentation

Jiale Cao, Rao Muhammad Anwer|arXiv (Cornell University)|Jul 29, 2020

Advanced Neural Network Applications|References: 56|Cited by: 28

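One-Sentence Summary

SipMask is a fast single-stage instance segmentation method whose lightweight spatial preservation (SP) module generates a separate set of spatial coefficients for each sub-region of a bounding box, preserving spatial information inside the box and improving mask accuracy for adjacent objects. It achieves state-of-the-art performance among single-stage methods, with an absolute gain of 1.0% mask AP over TensorMask at a four-fold speedup and an absolute gain of 3.0% mask AP over YOLACT, while running in real time on a Titan Xp.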

ABSTRACT

Single-stage instance segmentation approaches have recently gained popularity due to their speed and simplicity, but are still lagging behind in accuracy, compared to two-stage methods. We propose a fast single-stage instance segmentation method, called SipMask, that preserves instance-specific spatial information by separating mask prediction of an instance to different sub-regions of a detected bounding-box. Our main contribution is a novel light-weight spatial preservation (SP) module that generates a separate set of spatial coefficients for each sub-region within a bounding-box, leading to improved mask predictions. It also enables accurate delineation of spatially adjacent instances. Further, we introduce a mask alignment weighting loss and a feature alignment scheme to better correlate mask prediction with object detection. On COCO test-dev, our SipMask outperforms the existing single-stage methods. Compared to the state-of-the-art single-stage TensorMask, SipMask obtains an absolute gain of 1.0% (mask AP), while providing a four-fold speedup. In terms of real-time capabilities, SipMask outperforms YOLACT with an absolute gain of 3.0% (mask AP) under similar settings, while operating at comparable speed on a Titan Xp. We also evaluate our SipMask for real-time video instance segmentation, achieving promising results on YouTube-VIS dataset. The source code is available at https://github.com/JialeCao001/SipMask.

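Motivation and Objectives

Address the loss of spatial information inside the bounding box, which causes single-stage instance segmentation models to delineate spatially adjacent objects poorly.
Improve mask prediction accuracy without sacrificing inference speed, particularly for real-time applications.
Develop a lightweight module that provides a fine-grained spatial representation within each detected bounding box.
Extend the method to real-time video instance segmentation with consistent performance.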

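Proposed Method

A novel lightweight spatial preservation (SP) module divides each object's bounding box into sub-regions and generates a separate set of spatial coefficients for each sub-region, preserving spatial detail.
A mask alignment weighting loss reweights the pixel-wise BCE loss according to classification confidence and IoU with the ground-truth box, giving priority to accurately predicted boxes.
A feature alignment scheme strengthens the feature representations of the detection and segmentation heads, improving the correlation between detection and segmentation.
A fully convolutional tracking branch adapts the single-stage framework to video instance segmentation and associates instances across frames.
A ResNet101-FPN backbone is used on COCO and a ResNet50-FPN backbone on YouTube-VIS, with single-scale inference, to reach real-time performance.
Ablation experiments show that a $2\times2$ sub-region division gives the best trade-off between accuracy and speed.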
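The sub-region idea behind the SP module can be illustrated with a small sketch. It assumes a YOLACT-style setup in which a set of shared basis maps is linearly combined per instance and passed through a sigmoid; this digest does not spell out that detail, so the basis maps, the data layout, and the function name `assemble_mask` are all hypothetical and not taken from the authors' code. The only point shown is that each pixel inside the box uses the coefficient set of the sub-region it falls in.

```python
import math

def assemble_mask(basis, coeffs, box, grid=(2, 2)):
    """Combine basis maps into one instance mask, one coefficient set per sub-region.

    Illustrative assumptions (not from the SipMask source code):
    basis:  list of K maps, each an H x W list of lists of floats
    coeffs: coeffs[r][c] is a list of K floats for sub-region (r, c) of the box
    box:    (x0, y0, x1, y1) in pixel coordinates, x1 and y1 exclusive
    """
    x0, y0, x1, y1 = box
    rows, cols = grid
    height, width = len(basis[0]), len(basis[0][0])
    mask = [[0.0] * width for _ in range(height)]  # stays zero outside the box
    for y in range(y0, y1):
        # Row index of the sub-region that contains this pixel.
        r = min((y - y0) * rows // (y1 - y0), rows - 1)
        for x in range(x0, x1):
            # Column index of the sub-region that contains this pixel.
            c = min((x - x0) * cols // (x1 - x0), cols - 1)
            # Linear combination of the basis maps with this sub-region's coefficients.
            logit = sum(w * b[y][x] for w, b in zip(coeffs[r][c], basis))
            mask[y][x] = 1.0 / (1.0 + math.exp(-logit))
    return mask

# Usage: one constant basis map, opposite coefficients in neighbouring sub-regions.
basis = [[[1.0] * 4 for _ in range(4)]]
coeffs = [[[10.0], [-10.0]],
          [[-10.0], [10.0]]]
mask = assemble_mask(basis, coeffs, box=(0, 0, 4, 4))
```

With a single coefficient set per box, a constant basis map could only produce a constant mask inside the box. With one set per sub-region, the same basis yields a checkerboard here, which is a toy version of the extra spatial selectivity the sub-region coefficients provide.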

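Experimental Results

Research Questions

RQ1: Does preserving spatial information inside the bounding box improve mask quality for single-stage instance segmentation?
RQ2: Do sub-region-specific spatial coefficients improve boundary delineation between spatially adjacent instances?
RQ3: Can a lightweight spatial preservation module improve accuracy without sacrificing inference speed?
RQ4: How does the proposed mask alignment weighting loss affect mask prediction performance?
RQ5: Can the single-stage SipMask framework be extended effectively to real-time video instance segmentation?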

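Key Findings

On COCO test-dev, SipMask reaches 32.8 mask AP with a single-scale $544\times544$ input while running in real time at 30 fps on a Titan Xp.
On COCO test-dev, SipMask obtains an absolute gain of 1.0% mask AP over the state-of-the-art single-stage TensorMask, with a four-fold speedup.
Compared with the real-time YOLACT, SipMask obtains an absolute gain of 3.0% mask AP at comparable inference speed on a Titan Xp.
Ablation experiments show that a $2\times2$ sub-region division offers the best trade-off at 32.9 AP; $3\times3$ and finer divisions bring only marginal gains.
The mask alignment weighting loss, which reweights using classification and localization scores, improves performance by 0.8 AP (from 31.2 AP to 32.0 AP).
On YouTube-VIS, SipMask reaches 32.5 AP, a 2.2% gain in mask accuracy over MaskTrack R-CNN, while maintaining 30 fps.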
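The reweighting behind the mask alignment weighting loss can be sketched as follows. This digest says only that the pixel-wise BCE loss is reweighted by classification confidence and IoU with the ground-truth box; the exact form is not given here, so the product `cls_score * IoU` used below is an assumption, and the function names are hypothetical.

```python
import math

def box_iou(a, b):
    """IoU of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def weighted_mask_bce(pred, target, cls_score, pred_box, gt_box, eps=1e-7):
    """Mean pixel-wise BCE scaled by cls_score * IoU.

    The multiplicative weight is an assumed form, chosen only to illustrate
    that well-classified, well-localized boxes contribute more to the loss.
    """
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    weight = cls_score * box_iou(pred_box, gt_box)
    return weight * total / count

# Usage: the same mask prediction, attached to a well-placed and a shifted box.
pred = [[0.9, 0.1], [0.8, 0.2]]
target = [[1, 0], [1, 0]]
gt_box = (0, 0, 10, 10)
loss_good = weighted_mask_bce(pred, target, 0.9, (0, 0, 10, 10), gt_box)
loss_poor = weighted_mask_bce(pred, target, 0.9, (5, 0, 15, 10), gt_box)
```

In this toy case the shifted box overlaps the ground truth with IoU 1/3, so its mask loss is scaled to one third of the well-placed box's loss. Under the assumed weighting, mask supervision therefore concentrates on boxes that are both confidently classified and well localized.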

This digest was generated by AI and reviewed by a human editor.