QUICK REVIEW

[Paper Review] RVOS: End-to-End Recurrent Network for Video Object Segmentation

Carles Ventura, Míriam Bellver|RECERCAT (Consorci de Serveis Universitaris de Catalunya)|Mar 13, 2019

Visual Attention and Saliency Detection | References: 36 | Cited by: 37

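One-Sentence Summary

RVOS is an end-to-end recurrent architecture for multi-object video object segmentation that operates in both the spatial and temporal domains, supporting zero-shot and one-shot VOS with fast inference and no post-processing.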

ABSTRACT

Multiple object video object segmentation is a challenging task, specially for the zero-shot case, when no object mask is given at the initial frame and the model has to find the objects to be segmented along the sequence. In our work, we propose a Recurrent network for multiple object Video Object Segmentation (RVOS) that is fully end-to-end trainable. Our model incorporates recurrence on two different domains: (i) the spatial, which allows to discover the different object instances within a frame, and (ii) the temporal, which allows to keep the coherence of the segmented objects along time. We train RVOS for zero-shot video object segmentation and are the first ones to report quantitative results for DAVIS-2017 and YouTube-VOS benchmarks. Further, we adapt RVOS for one-shot video object segmentation by using the masks obtained in previous time steps as inputs to be processed by the recurrent module. Our model reaches comparable results to state-of-the-art techniques in YouTube-VOS benchmark and outperforms all previous video object segmentation methods not using online learning in the DAVIS-2017 benchmark. Moreover, our model achieves faster inference runtimes than previous methods, reaching 44ms/frame on a P100 GPU.

Motivation and Objectives

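Address zero-shot and one-shot video object segmentation without post-processing.
Propose a fully end-to-end recurrent architecture that handles multiple objects in both the spatial and temporal domains.
Adapt between zero-shot and one-shot VOS within a single framework.
Achieve competitive accuracy on the DAVIS-2017 and YouTube-VOS benchmarks without online fine-tuning.
Demonstrate faster inference than previous methods.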

Proposed Method

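Uses an encoder-decoder with a ResNet-101 backbone to extract multi-resolution features from each frame.
Introduces spatial recurrence to predict multiple object masks within a single frame, keeping the object order consistent over time.
Introduces temporal recurrence through ConvLSTM blocks to keep each object's mask consistent across frames.
Following Equation 3 of the paper, conditions the ConvLSTM on the encoder features, the previous frame's mask, and the spatial and temporal hidden states.
Uses a shared assignment mechanism (the Hungarian algorithm with a soft IoU cost) to match predicted masks to ground-truth objects at the first frame.
Supports zero-shot VOS by not relying on an initial mask, and one-shot VOS by feeding the previous frame's mask as an extra input channel during decoding.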
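
The matching step in the list above can be illustrated with a small sketch. This is not the authors' implementation: the function names `soft_iou` and `match_masks` are hypothetical, masks are flattened to plain lists for brevity, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same minimum-cost assignment, but is only practical for a handful of objects). It also assumes there are at least as many predictions as ground-truth objects.

```python
from itertools import permutations


def soft_iou(pred, target, eps=1e-6):
    """Soft IoU between a probability mask and a binary mask (flat lists)."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return inter / (union + eps)


def match_masks(preds, targets):
    """For each ground-truth object, return the index of its matched prediction.

    Brute force over permutations: a stand-in for the Hungarian algorithm.
    Assumes len(preds) >= len(targets).
    """
    # cost[p][j] is low when prediction p overlaps ground-truth object j well.
    cost = [[1.0 - soft_iou(p, t) for t in targets] for p in preds]
    best = min(
        permutations(range(len(preds)), len(targets)),
        key=lambda perm: sum(cost[p][j] for j, p in enumerate(perm)),
    )
    return list(best)


# Two predicted soft masks and two ground-truth masks over four pixels.
preds = [[0.1, 0.2, 0.9, 0.8], [0.9, 0.7, 0.1, 0.0]]
targets = [[1, 1, 0, 0], [0, 0, 1, 1]]
assignment = match_masks(preds, targets)
print(assignment)  # ground-truth 0 pairs with prediction 1, ground-truth 1 with prediction 0
```

Because the summary states that the alignment is done at the first frame, a sketch like this would be called once per sequence, with the resulting assignment reused for later frames.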

Experimental Results

Research Questions

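RQ1: Can a fully end-to-end trainable model segment multiple objects in video, in both the zero-shot and one-shot settings, without post-processing?
RQ2: Does modeling spatial and temporal recurrence jointly improve VOS performance over using either one alone?
RQ3: How does RVOS perform on the standard DAVIS-2017 and YouTube-VOS benchmarks in the zero-shot and one-shot settings, without online learning?
RQ4: How do the runtime characteristics of a fully end-to-end VOS model compare with those of online-learning methods?
RQ5: How does the model handle a number of object instances that varies across frames, and objects that disappear?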

Key Findings

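RVOS achieves competitive results on YouTube-VOS and, for one-shot VOS on DAVIS-2017, outperforms previous methods that do not use online learning.
Spatio-temporal recurrence (RVOS-ST) consistently outperforms the spatial-only (RVOS-S) and temporal-only (RVOS-T) configurations in both the one-shot and zero-shot settings.
In some cases, training with inferred masks (RVOS-ST+) is more robust and performs better than training with ground-truth masks only.
The model runs at about 44 ms/frame on a P100 GPU (roughly 23 frames per second), faster than several comparable online-learning methods.
RVOS handles multiple objects in a single forward pass, avoiding post-processing and keeping object indices consistent across frames.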
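
How a single pass can keep object indices consistent is easier to see as a loop over frames and object slots. The sketch below is illustrative only: `run_recurrence` and `toy_cell` are hypothetical names, `toy_cell` is a placeholder that records which states it received rather than a ConvLSTM, and using `None` for a missing neighbour state is an assumption, since this review does not say how the states are initialized. Encoder features and masks are left out.

```python
def run_recurrence(num_frames, num_slots, cell):
    """Visit every (frame t, object slot i) pair, threading both hidden states."""
    hidden = {}  # (t, i) -> hidden state produced for that frame and slot
    for t in range(num_frames):
        for i in range(num_slots):
            h_spatial = hidden.get((t, i - 1))   # previous object slot, same frame
            h_temporal = hidden.get((t - 1, i))  # same object slot, previous frame
            hidden[(t, i)] = cell(t, i, h_spatial, h_temporal)
    return hidden


def toy_cell(t, i, h_spatial, h_temporal):
    """Placeholder for the recurrent step: records where its inputs came from."""
    return {
        "id": (t, i),
        "spatial_from": h_spatial["id"] if h_spatial else None,
        "temporal_from": h_temporal["id"] if h_temporal else None,
    }


states = run_recurrence(num_frames=2, num_slots=3, cell=toy_cell)
print(states[(1, 2)])
```

In this toy run, slot 2 at frame 1 receives its temporal state from slot 2 at frame 0 and its spatial state from slot 1 at frame 1, so the same slot follows the same object over time without a separate matching step after inference.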

This review was generated by AI and reviewed by human editors.