QUICK REVIEW

[Paper Review] InsPro: Propagating Instance Query and Proposal for Online Video Instance Segmentation

Fei He, Haoyang Zhang|arXiv (Cornell University)|Jan 5, 2023

Video Analysis and Summarization | Cited by 8

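One-Sentence Summary

InsPro introduces a query-based framework that propagates instance query-proposal pairs across frames, so that object association in online video instance segmentation happens implicitly. Without an explicit tracking head, it reaches state-of-the-art results among online methods on YouTube-VIS 2019 and 2021.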

ABSTRACT

Video instance segmentation (VIS) aims at segmenting and tracking objects in videos. Prior methods typically generate frame-level or clip-level object instances first and then associate them by either additional tracking heads or complex instance matching algorithms. This explicit instance association approach increases system complexity and fails to fully exploit temporal cues in videos. In this paper, we design a simple, fast and yet effective query-based framework for online VIS. Relying on an instance query and proposal propagation mechanism with several specially developed components, this framework can perform accurate instance association implicitly. Specifically, we generate frame-level object instances based on a set of instance query-proposal pairs propagated from previous frames. This instance query-proposal pair is learned to bind with one specific object across frames through conscientiously developed strategies. When using such a pair to predict an object instance on the current frame, not only the generated instance is automatically associated with its precursors on previous frames, but the model gets a good prior for predicting the same object. In this way, we naturally achieve implicit instance association in parallel with segmentation and elegantly take advantage of temporal clues in videos. To show the effectiveness of our method InsPro, we evaluate it on two popular VIS benchmarks, i.e., YouTube-VIS 2019 and YouTube-VIS 2021. Without bells-and-whistles, our InsPro with ResNet-50 backbone achieves 43.2 AP and 37.6 AP on these two benchmarks respectively, outperforming all other online VIS methods.
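The propagation idea in the abstract can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `refine` is a hypothetical stand-in for the learned head (it simply snaps a proposal to the best-overlapping box in the new frame), and only the proposal half of each query-proposal pair is modelled. The point it shows is that the slot index acts as the track identity, so no separate matching step runs at inference time.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine(proposal: Box, frame_boxes: List[Box]) -> Box:
    """Hypothetical stand-in for the learned head: pick the best-overlapping box."""
    return max(frame_boxes, key=lambda b: iou(proposal, b))


def run_online(initial: List[Box], video: List[List[Box]]) -> List[List[Box]]:
    """Propagate proposals frame by frame; slot i on every frame is track i."""
    proposals = list(initial)
    tracks: List[List[Box]] = [[] for _ in proposals]
    for frame_boxes in video:
        # Each slot starts from its own output on the previous frame.
        proposals = [refine(p, frame_boxes) for p in proposals]
        for i, p in enumerate(proposals):
            tracks[i].append(p)
    return tracks


# Two objects drifting right; the per-frame box order is shuffled on purpose.
video = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(52, 50, 62, 60), (2, 0, 12, 10)],
    [(4, 0, 14, 10), (54, 50, 64, 60)],
]
tracks = run_online(initial=[(0, 0, 10, 10), (50, 50, 60, 60)], video=video)
```

Here `tracks[0]` follows the first object and `tracks[1]` the second, even though their order inside each frame changes, because each slot carries its own prior forward.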

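Research Motivation and Goals

Advance online VIS by replacing explicit tracking and association with a simpler, faster alternative.
Develop a query-based propagation mechanism that implicitly links object instances across frames.
Strengthen the query representation to cope with occlusion, motion blur, and the appearance of new objects.
Introduce training strategies that ensure a one-to-one query-object correspondence across frames.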

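Proposed Method

Use a fixed set of learnable instance queries and proposals, propagated frame by frame, to predict the instances in each frame.
Introduce intra-query attention, which enhances each query with long-range temporal cues drawn from a feature bank.
In the SegHead, perform dynamic instance interaction over multiple stages and use a mask head based on conditional convolution.
Apply temporally consistent matching during training to enforce a one-to-one query-object correspondence across frames.
Propose a box deduplication loss to reduce duplicate proposals for the same object across frames.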
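One plausible reading of temporally consistent matching is sketched below: queries are assigned to ground-truth objects once, on the first frame of a training clip, and that assignment is reused as the target on every later frame. The sketch rests on assumptions. The function names and the cost matrix are hypothetical, and the brute-force search stands in for a proper bipartite (Hungarian) solver; it is only practical at toy sizes.

```python
from itertools import permutations
from typing import Dict, List, Sequence


def match_first_frame(cost: Sequence[Sequence[float]]) -> Dict[int, int]:
    """Minimum-cost one-to-one assignment, returned as {query index: object index}.

    cost[q][o] is the cost of assigning query q to object o.
    Assumes there are at least as many queries as objects.
    """
    num_queries, num_objects = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    # Try every ordered choice of distinct queries, one per object.
    for queries in permutations(range(num_queries), num_objects):
        total = sum(cost[q][o] for o, q in enumerate(queries))
        if total < best_total:
            best, best_total = queries, total
    return {q: o for o, q in enumerate(best)}


def targets_for_clip(first_frame_cost: Sequence[Sequence[float]],
                     num_frames: int) -> List[Dict[int, int]]:
    """Reuse the first-frame assignment as the training target on every frame."""
    assignment = match_first_frame(first_frame_cost)
    return [dict(assignment) for _ in range(num_frames)]


# Three queries, two objects: query 2 stays unmatched (background).
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.6],
]
clip_targets = targets_for_clip(cost, num_frames=3)
```

Because every frame in the clip shares one assignment, a query is never supervised toward a different object partway through the clip, which is what binds it to a single instance.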

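Experimental Results

Research Questions

RQ1: In the online setting, can implicit instance association via query-proposal propagation match or exceed VIS methods based on explicit tracking?
RQ2: How do temporal propagation, intra-query attention, and the deduplication loss affect VIS accuracy and stability across frames?
RQ3: How does the bank length T affect the quality of InsPro's instance query representation?
RQ4: How does InsPro perform on YouTube-VIS 2019 and 2021 with and without external COCO training data?
RQ5: Can a VIS system based purely on query propagation be competitive in FPS with tracking-based methods?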

Key Findings

Method	AP	AP50	AP75	AR1	AR10	FPS
InsPro (YouTube-VIS 2019; COCO)	43.2	65.3	48.0	38.8	49.0	26.3
InsPro (YouTube-VIS 2021; COCO)	37.6	58.7	40.9	32.7	41.4	26.3
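
For readers who think in latency rather than throughput, the FPS column converts to an average time per frame. The helper below is purely illustrative arithmetic; `latency_ms` is a hypothetical name, not something from the paper.

```python
def latency_ms(fps: float) -> float:
    """Average per-frame latency in milliseconds for a given throughput."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


full_model_ms = latency_ms(26.3)  # FPS value taken from the table
print(f"{full_model_ms:.1f} ms per frame")
```

At 26.3 FPS this works out to roughly 38 ms per frame.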

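InsPro achieves state-of-the-art online VIS performance on YouTube-VIS 2019 and 2021 with a ResNet-50 backbone (with COCO data: 43.2 AP on 2019 and 37.6 AP on 2021).
Without COCO data, InsPro reaches 40.2 AP on 2019 and 36.1 AP on 2021, still ahead of many online baselines.
Temporal propagation combined with temporally consistent matching raises AP from the baseline's 24.0 to 37.4, evidence that the implicit association is effective.
The box deduplication loss reduces duplicate proposals and adds about 1 AP (38.4 vs. 37.4).
Intra-query attention with a feature bank (T up to 18) further raises AP to 40.2 with little impact on speed.
InsPro-lite reaches 45.7 FPS at some cost in accuracy, while the full InsPro model (with COCO) runs at 26.3 FPS on an RTX 2080Ti.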
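
The feature bank with a capped length T can be pictured as a fixed-size queue of a query's own past versions, over which the current query attends. The sketch below assumes plain scaled dot-product attention with a residual connection and no learned projections, so it illustrates the mechanism rather than reproducing the paper's layer. `QueryBank` and its methods are hypothetical names.

```python
import math
from collections import deque
from typing import Deque, List

Vector = List[float]


class QueryBank:
    """Keeps at most `max_len` past versions of one instance query."""

    def __init__(self, max_len: int = 18) -> None:
        # deque(maxlen=...) silently drops the oldest entry once full.
        self.history: Deque[Vector] = deque(maxlen=max_len)

    def push(self, query: Vector) -> None:
        self.history.append(list(query))

    def enhance(self, query: Vector) -> Vector:
        """Add a softmax-weighted average of the stored history to `query`."""
        if not self.history:
            return list(query)
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, past)) / scale
                  for past in self.history]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        context = [
            sum(w * past[d] for w, past in zip(weights, self.history)) / total
            for d in range(len(query))
        ]
        return [q + c for q, c in zip(query, context)]


bank = QueryBank(max_len=18)
for frame_index in range(25):
    bank.push([float(frame_index), 1.0])
enhanced = bank.enhance([1.0, 0.0])
```

After 25 frames the bank holds only the 18 most recent entries, so memory and attention cost stay bounded however long the video runs.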


This review was generated by AI and checked by a human editor.