QUICK REVIEW

[Paper Review] Mask2Former for Video Instance Segmentation

Bowen Cheng, Anwesa Choudhuri|arXiv (Cornell University)|Dec 20, 2021

Generative Adversarial Networks and Image Synthesis | Cited by 64

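One-Sentence Summary

Mask2Former extends image segmentation to video: by predicting 3D segmentation volumes with spatio-temporal masked attention, it achieves state-of-the-art results on YouTubeVIS without changing the architecture or the training pipeline.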

ABSTRACT

We find Mask2Former also achieves state-of-the-art performance on video instance segmentation without modifying the architecture, the loss or even the training pipeline. In this report, we show universal image segmentation architectures trivially generalize to video segmentation by directly predicting 3D segmentation volumes. Specifically, Mask2Former sets a new state-of-the-art of 60.4 AP on YouTubeVIS-2019 and 52.6 AP on YouTubeVIS-2021. We believe Mask2Former is also capable of handling video semantic and panoptic segmentation, given its versatility in image segmentation. We hope this will make state-of-the-art video segmentation research more accessible and bring more attention to designing universal image and video segmentation architectures.

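Motivation and Objectives

Show that a universal image segmentation model (Mask2Former) can perform video instance segmentation without architectural changes.
Extend Mask2Former to operate on 3D spatio-temporal data and predict 3D instance masks across time.
Evaluate on YouTubeVIS-2019 and YouTubeVIS-2021 to establish state-of-the-art results.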

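Proposed Method

Treat a video sequence as a 3D volume of size T x H x W and apply masked attention over that volume.
Add temporal positional encodings and a 3D mask prediction mechanism to produce a mask for each instance across time.
Use joint spatio-temporal masked attention, with the 3D attention mask derived from the previous layer's mask prediction.
Use sinusoidal temporal and spatial positional encodings, which are non-parametric and adapt to any sequence length.
Predict 3D instance masks as $R_{n,t,h,w} = \mathrm{sigmoid}\big(E_{\text{mask}}(:,n)^{\top} \cdot E_{\text{pixel}}(:,t,h,w)\big)$, where $E_{\text{mask}}(:,n)$ is the mask embedding of query n and $E_{\text{pixel}}(:,t,h,w)$ is the per-pixel embedding at frame t and position (h, w).
Train and evaluate with Detectron2, AdamW, and standard VIS training settings, without COCO augmentation.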
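
The mask prediction formula and the derived 3D attention mask can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the tensor sizes, the 0.5 threshold, and the fallback for empty masks are all assumptions made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict_video_masks(mask_embed, pixel_embed):
    """Compute R[n, t, h, w] = sigmoid(E_mask[:, n]^T . E_pixel[:, t, h, w]).

    mask_embed:  (C, N)        one C-dimensional embedding per query
    pixel_embed: (C, T, H, W)  per-pixel embeddings for the whole clip
    returns:     (N, T, H, W)  mask probabilities, one 3D volume per query
    """
    logits = np.einsum("cn,cthw->nthw", mask_embed, pixel_embed)
    return sigmoid(logits)


def attention_mask_from_prediction(mask_probs, threshold=0.5):
    """Turn a layer's mask prediction into a boolean 3D attention mask.

    Returns an (N, T*H*W) array that is True where query n may attend.
    The threshold value is an assumption for this sketch.
    """
    n = mask_probs.shape[0]
    allowed = mask_probs.reshape(n, -1) >= threshold
    # Assumed safeguard: a query with an empty mask attends everywhere,
    # so that no attention row is fully masked out.
    empty = ~allowed.any(axis=1)
    allowed[empty] = True
    return allowed


# Hypothetical sizes: 8 channels, 3 queries, 2 frames of 4 x 5 pixels.
rng = np.random.default_rng(0)
C, N, T, H, W = 8, 3, 2, 4, 5
E_mask = rng.standard_normal((C, N))
E_pixel = rng.standard_normal((C, T, H, W))

R = predict_video_masks(E_mask, E_pixel)
attn = attention_mask_from_prediction(R)
print(R.shape, attn.shape)
```

Because the time axis is just one more dimension that gets flattened together with height and width, the same code path handles a single image (T = 1) and a clip of any length.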

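Experimental Results

Research Questions

RQ1: Can a universal image segmentation model, Mask2Former, achieve competitive or even better video instance segmentation results without modifying the architecture or the training pipeline?
RQ2: How do 3D spatio-temporal masking and 3D volume prediction perform on the YouTubeVIS datasets compared with specialized VIS models?
RQ3: How do temporal encodings and 3D attention affect cross-frame consistency and instance tracking?
RQ4: Given its versatility in image segmentation, can Mask2Former be extended to video semantic and panoptic segmentation?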

Key Findings

Backbone type	Method	Backbone	Data	AP	AP50	AP75
CNN	VisTR [15]	R50	-	36.2 ± 0.5	59.8	36.9
CNN	VisTR [15]	R101	-	40.1 ± 0.5	45.0	38.3
CNN	IFC	R50	V	41.2 ± 0.5	65.1	44.6
CNN	IFC	R101	V	42.6 ± 0.5	66.6	46.3
CNN	SeqFormer	R50	V	45.1 ± 0.5	66.9	50.5
CNN	SeqFormer	R50	V + C80k	47.4 ± 0.5	69.8	51.8
CNN	SeqFormer	R101	V + C80k	49.0 ± 0.5	71.1	55.7
CNN	Mask2Former	R50	V	46.4 ± 0.8	68.0	50.0
CNN	Mask2Former	R101	V	49.2 ± 0.7	72.8	54.2
Transformer	SeqFormer [16]	Swin-L	-	59.3 ± 0.5	82.1	66.4
Transformer	Mask2Former	Swin-T	V	51.5 ± 0.7	75.0	56.5
Transformer	Mask2Former	Swin-S	V	54.3 ± 0.7	79.0	58.8
Transformer	Mask2Former	Swin-B	V	59.5 ± 0.7	84.3	67.2
Transformer	Mask2Former	Swin-L	V	60.4 ± 0.5	84.4	67.0
Transformer	Mask2Former (best of 5 runs)	Swin-L	V	60.7 ± 0.5	84.4	66.7

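On YouTubeVIS-2019, Mask2Former with a Swin-L backbone reaches 60.4 AP without COCO augmentation, surpassing prior methods; the best of 5 runs reaches 60.7 AP and 84.4 AP50.
On YouTubeVIS-2021, Mask2Former reaches 52.6 AP, again surpassing the previous state of the art without using extra data.
Mask2Former variants (R50, R101, and Swin-T/S/B/L backbones) consistently outperform comparable VIS methods trained on the same data.
The method achieves state-of-the-art video instance segmentation without modifying the architecture, the loss, or the training pipeline.
Inference processes the whole video sequence at once, keeps the top 10 predictions with no post-processing, and adapts to variable sequence lengths.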
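
The inference step described in the last finding can be illustrated with a short sketch. The review only states that the top 10 predictions are kept without post-processing; the rest is assumed here for illustration: the `select_top_k` helper is hypothetical, class scores are taken to be a softmax with a trailing "no object" column, and masks are binarized at logit 0.

```python
import numpy as np


def select_top_k(class_logits, mask_logits, k=10):
    """Keep the k highest-scoring (query, class) pairs for a whole clip.

    class_logits: (N, K + 1)   per-query class logits; the last column is
                               assumed to be a "no object" class
    mask_logits:  (N, T, H, W) per-query 3D mask logits
    returns: scores (k,), labels (k,), boolean masks (k, T, H, W)
    """
    num_classes = class_logits.shape[1] - 1

    # Numerically stable softmax over classes, then drop "no object".
    z = class_logits - class_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    scores = probs[:, :num_classes].reshape(-1)

    # Rank all (query, class) pairs jointly and keep the best k.
    k = min(k, scores.size)
    top = np.argsort(-scores)[:k]
    query_idx = top // num_classes
    labels = top % num_classes

    # Each kept prediction is one mask volume spanning every frame, so no
    # cross-frame association step is needed afterwards.
    masks = mask_logits[query_idx] > 0
    return scores[top], labels, masks


# Hypothetical sizes: 20 queries, 5 classes, clips of 3 and 7 frames.
rng = np.random.default_rng(0)
for num_frames in (3, 7):
    cls = rng.standard_normal((20, 6))
    msk = rng.standard_normal((20, num_frames, 4, 4))
    scores, labels, masks = select_top_k(cls, msk)
    print(num_frames, scores.shape, labels.shape, masks.shape)
```

Nothing in the helper depends on the number of frames, which is the sense in which this style of inference adapts to variable sequence lengths.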


This review was generated by AI and reviewed by a human editor.