QUICK REVIEW

[Paper Review] Decoupling Features in Hierarchical Propagation for Video Object Segmentation

Zongxin Yang, Yi Yang|arXiv (Cornell University)|Oct 18, 2022

Visual Attention and Saliency Detection | Cited by 51

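One-Sentence Summary

DeAOT decouples the propagation of visual (object-agnostic) and ID (object-specific) features in hierarchical video object segmentation and uses a lightweight Gated Propagation Module, improving on AOT in both accuracy and real-time efficiency.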

ABSTRACT

This paper focuses on developing a more effective method of hierarchical propagation for semi-supervised Video Object Segmentation (VOS). Based on vision transformers, the recently-developed Associating Objects with Transformers (AOT) approach introduces hierarchical propagation into VOS and has shown promising results. The hierarchical propagation can gradually propagate information from past frames to the current frame and transfer the current frame feature from object-agnostic to object-specific. However, the increase of object-specific information will inevitably lead to the loss of object-agnostic visual information in deep propagation layers. To solve such a problem and further facilitate the learning of visual embeddings, this paper proposes a Decoupling Features in Hierarchical Propagation (DeAOT) approach. Firstly, DeAOT decouples the hierarchical propagation of object-agnostic and object-specific embeddings by handling them in two independent branches. Secondly, to compensate for the additional computation from dual-branch propagation, we propose an efficient module for constructing hierarchical propagation, i.e., Gated Propagation Module, which is carefully designed with single-head attention. Extensive experiments show that DeAOT significantly outperforms AOT in both accuracy and efficiency. On YouTube-VOS, DeAOT can achieve 86.0% at 22.4fps and 82.0% at 53.4fps. Without test-time augmentations, we achieve new state-of-the-art performance on four benchmarks, i.e., YouTube-VOS (86.2%), DAVIS 2017 (86.2%), DAVIS 2016 (92.9%), and VOT 2020 (0.622). Project page: https://github.com/z-x-yang/AOT.

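Motivation and Objectives

Preserve object-agnostic visual information while propagating object-specific ID information in hierarchical VOS propagation.
Introduce a dual-branch propagation framework that decouples visual embeddings from ID embeddings.
Design an efficient propagation module (GPM) that reduces computation while maintaining performance.
Demonstrate state-of-the-art accuracy and real-time speed on multiple VOS benchmarks.
Show generalization across VOS benchmarks and robustness to different backbone networks.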

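Proposed Method

Decouple object-agnostic visual embeddings and object-specific ID embeddings into a Visual Branch and an ID Branch that share attention maps.
Replace the multi-head LSTT blocks with Gated Propagation Modules built on single-head attention and depth-wise convolution.
Modulate propagation with a gating function, GP(U, Q, K, V), combined with depth-wise convolution to capture local context.
Share attention maps between the two branches so that ID propagation benefits from visually guided matching.
Build long-term propagation, short-term propagation, and self-propagation for both branches within the GPM framework.
Provide four DeAOT variants (T, S, B, L) with different memory and layer configurations to trade off speed against accuracy.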
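
The shared-attention and gating ideas listed above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the gate is a sigmoid applied to a gating embedding U, and it omits the depth-wise convolution, the output projection, and the distinction between long-term, short-term, and self-propagation. The function names (attention_map, gated_propagate, dual_branch_step) are hypothetical and chosen for illustration only. The attention map is computed once from the visual queries and keys and then reused to propagate both the visual values and the ID values.

```python
import math

def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix (lists of lists)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def attention_map(q, k):
    """Single-head attention weights softmax(Q K^T / sqrt(d)), one row per query."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]      # K^T, shape d x n_memory
    weights = []
    for row in matmul(q, k_t):
        scaled = [s / scale for s in row]
        peak = max(scaled)                    # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scaled]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def gated_propagate(u, attn, v):
    """Assumed gating form: sigmoid(U) * (A V), applied element-wise."""
    propagated = matmul(attn, v)
    return [[p / (1.0 + math.exp(-g)) for g, p in zip(u_row, p_row)]
            for u_row, p_row in zip(u, propagated)]

def dual_branch_step(q, k, v_vis, v_id, u_vis, u_id):
    """Compute one attention map from visual Q/K and reuse it in both branches."""
    attn = attention_map(q, k)
    out_vis = gated_propagate(u_vis, attn, v_vis)   # visual branch
    out_id = gated_propagate(u_id, attn, v_id)      # ID branch, same attention map
    return out_vis, out_id

# Toy data: 2 current-frame locations attend to 3 memory locations.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v_vis = [[0.2, 0.4], [0.6, 0.8], [1.0, 1.2]]                 # visual values
v_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]   # ID values
u_vis = [[0.0, 0.0], [0.0, 0.0]]                             # gate logits (sigmoid = 0.5)
u_id = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

out_vis, out_id = dual_branch_step(q, k, v_vis, v_id, u_vis, u_id)
print(out_vis)
print(out_id)
```

Because the attention map is computed only once, the second branch costs just one extra weighted sum over the memory, which is one way to read the summary's claim that the dual-branch design stays efficient.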

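Experimental Results

Research Questions

RQ1: Does decoupling visual embeddings and ID embeddings in hierarchical propagation improve the preservation of visual embeddings and overall VOS accuracy?
RQ2: Compared with multi-head LSTT blocks, does single-head gated propagation maintain performance while reducing computation?
RQ3: How do dual-branch propagation and GPM affect results on the YouTube-VOS, DAVIS 2017/2016, and VOT 2020 benchmarks?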

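Key Findings

DeAOT significantly outperforms AOT in both accuracy and runtime speed on YouTube-VOS and other benchmarks.
R50-DeAOT-L reaches 86.0%/85.9% (J/F) at 22.4 fps; SwinB-DeAOT-L reaches 86.2%/86.1% at 11.9–15.4 fps, depending on the variant.
DeAOT-L and SwinB-DeAOT-L achieve top performance on YouTube-VOS 2018/2019, DAVIS 2017, DAVIS 2016, and VOT 2020 without test-time augmentation.
Ablation studies show that both dual-branch propagation and GPM are essential to performance; replacing GPM with LSTT significantly reduces accuracy.
Single-head attention with GPM is competitive in accuracy and delivers a significant speedup over multi-head AOT.
On DAVIS 2016 and VOT 2020, DeAOT variants surpass several state-of-the-art methods in accuracy (J/F/EAO) and real-time metrics.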


This review was generated by AI and reviewed by a human editor.