QUICK REVIEW

[Paper Review] Video Instance Segmentation using Inter-Frame Communication Transformers

Sukjun Hwang, Miran Heo|arXiv (Cornell University)|Jun 7, 2021

Advanced Image and Video Retrieval Techniques | References: 33 | Cited by: 58

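One-Sentence Summary

This paper proposes Inter-frame Communication Transformers (IFC) for video instance segmentation. By sharply reducing the cost of spatio-temporal attention, IFC achieves high accuracy with fast per-clip processing and obtains strong results on the YouTube-VIS benchmarks.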

ABSTRACT

We propose a novel end-to-end solution for video instance segmentation (VIS) based on transformers. Recently, the per-clip pipeline shows superior performance over per-frame methods leveraging richer information from multiple frames. However, previous per-clip models require heavy computation and memory usage to achieve frame-to-frame communications, limiting practicality. In this work, we propose Inter-frame Communication Transformers (IFC), which significantly reduces the overhead for information-passing between frames by efficiently encoding the context within the input clip. Specifically, we propose to utilize concise memory tokens as a mean of conveying information as well as summarizing each frame scene. The features of each frame are enriched and correlated with other frames through exchange of information between the precisely encoded memory tokens. We validate our method on the latest benchmark sets and achieved the state-of-the-art performance (AP 44.6 on YouTube-VIS 2019 val set using the offline inference) while having a considerably fast runtime (89.4 FPS). Our method can also be applied to near-online inference for processing a video in real-time with only a small delay. The code will be made available.

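Motivation and Objectives

Enable efficient per-clip video instance segmentation that handles occlusion and motion blur while avoiding the cost of full spatio-temporal attention.
Develop a memory-token-based inter-frame communication mechanism that enriches the features of each frame within a clip.
Provide an instance-level training and tracking scheme for VIS that maximizes spatio-temporal mask similarity (IoU).
Provide a lightweight, clip-level transformer architecture that supports online, near-online, and offline inference.
Demonstrate a strong speed-accuracy trade-off on the YouTube-VIS benchmarks while remaining scalable to a large number of instances.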

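Proposed Method

Proposes Inter-frame Communication Transformers (IFC), which comprise two transformer stages: Encode-Receive (per-frame processing) and Gather-Communicate (inter-frame communication through memory tokens).
Uses a small set of trainable memory tokens per frame to summarize scene context, enabling cross-frame attention without full spatio-temporal self-attention.
Processes each frame independently in the Encode-Receive stage, then aggregates information across frames through the memory tokens in the Gather-Communicate stage.
Generates a fixed number of object queries for candidate instances and produces conditional convolution weights that yield instance-specific masks applied across all frames of the clip.
Trains with a bipartite matching loss that pairs predictions with ground-truth masks, using mask-based Dice and focal losses to optimize spatio-temporal mask IoU.
Performs clip-level tracking by applying spatio-temporal soft IoU and Hungarian matching between overlapping clips.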
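
The clip-level tracking step can be illustrated with a small sketch. It is not the authors' implementation: the function names, the flat-list mask layout, and the toy values are assumptions made for illustration, and an exhaustive search over permutations stands in for the Hungarian algorithm (it finds the same optimal assignment, but is only practical for a handful of instances).

```python
from itertools import permutations


def soft_iou(a, b, eps=1e-6):
    """Soft IoU of two mask-probability volumes.

    Each volume is a flat list of probabilities in [0, 1] covering all
    pixels of all frames shared by two overlapping clips (assumed layout).
    """
    inter = sum(x * y for x, y in zip(a, b))
    union = sum(x + y - x * y for x, y in zip(a, b))
    return inter / (union + eps)


def match_clips(prev_masks, next_masks):
    """Link instances of the previous clip to instances of the next clip.

    Returns (assignment, scores), where assignment[i] is the index in
    next_masks linked to prev_masks[i]. Assumes both clips hold the same
    number of instances. Brute force replaces Hungarian matching here.
    """
    n = len(prev_masks)
    scores = [[soft_iou(p, q) for q in next_masks] for p in prev_masks]
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return list(best), scores


# Toy overlap region: 2 frames x 2 pixels, flattened to 4 values per instance.
prev_masks = [[0.9, 0.8, 0.1, 0.0], [0.0, 0.1, 0.9, 0.7]]
next_masks = [[0.1, 0.0, 0.8, 0.9], [0.8, 0.9, 0.0, 0.1]]
assignment, scores = match_clips(prev_masks, next_masks)
print(assignment)  # instance 0 links to next-clip instance 1, and 1 to 0
```

In a real pipeline the score matrix would more likely be handed to a dedicated solver such as `scipy.optimize.linear_sum_assignment` (with negated scores, since that routine minimizes cost by default).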

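Experimental Results

Research Questions

RQ1: Can a per-clip transformer model with memory-token communication achieve competitive VIS accuracy while reducing the cost of spatio-temporal attention?
RQ2: How do memory tokens and clip-level conditioning affect inter-frame feature enrichment and instance tracking in VIS?
RQ3: How do the clip length (T) and the number of memory tokens (M) affect VIS accuracy and speed?
RQ4: Can the model support online, near-online, and offline inference on the YouTube-VIS datasets with a strong speed-accuracy trade-off?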
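
To make the cost question behind RQ1 and RQ3 concrete, the following back-of-the-envelope sketch counts query-key pairs per attention layer. The counting model is a simplification (one head, no projections or feed-forward cost), and the values of T, HW, and M are hypothetical, not taken from the paper.

```python
def full_spatiotemporal_pairs(T, HW):
    """Pairs when every token of every frame attends to all T*HW clip tokens."""
    n = T * HW
    return n * n


def decomposed_pairs(T, HW, M):
    """Pairs for a decomposed, memory-token scheme.

    Per-frame attention runs over HW frame tokens plus M memory tokens,
    then the T*M memory tokens of the clip attend to each other.
    """
    per_frame = T * (HW + M) ** 2   # Encode-Receive style step
    across_frames = (T * M) ** 2    # Gather-Communicate style step
    return per_frame + across_frames


# Hypothetical setting: 5-frame clip, 20x36 feature map, 8 memory tokens per frame.
T, HW, M = 5, 20 * 36, 8
full = full_spatiotemporal_pairs(T, HW)
ifc = decomposed_pairs(T, HW, M)
ratio = full / ifc
print(full, ifc, round(ratio, 2))
```

Under these assumptions the cross-frame term is negligible next to the per-frame term, so the saving over full spatio-temporal attention grows roughly in proportion to the clip length T.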

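Key Findings

Achieves state-of-the-art performance with offline inference (AP 44.6 on the YouTube-VIS 2019 validation set).
Runs fast in the offline setting (up to 107.1 FPS with ResNet-50), giving a strong clip-level speed-accuracy balance.
Outperforms competing VIS methods on YouTube-VIS 2019 in the online, near-online, and offline modes while avoiding heavy modules such as deformable convolutions or cascaded networks.
In near-online mode (T=5), reaches 46.5 FPS at an AP of about 41.0, showing practical real-time applicability with a small delay.
On the YouTube-VIS 2021 validation set, obtains competitive AP (roughly in the 35 to 37 range) and related metrics, outperforming several baselines in VIS-centric settings.
Ablations show that memory tokens are essential for inter-frame communication, and that decomposed (per-frame) memory-token interaction outperforms a unified-token scheme.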

This review was generated by AI and reviewed by a human editor.