QUICK REVIEW

[Paper Review] Occluded Video Instance Segmentation

Jiyang Qi | arXiv (Cornell University) | Jan 1, 2024

Multimodal Machine Learning Applications | References: 32 | Cited by: 26

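In Brief

This paper introduces OVIS, a large-scale dataset for occluded video instance segmentation with 296k masks across 25 categories, and proposes a temporal feature calibration module that improves performance on occluded instances. Built on MaskTrack R-CNN and SipMask, the method reaches 15.1 AP on OVIS (with MaskTrack R-CNN) and 35.1 AP on YouTube-VIS (with SipMask), which the authors report as an improvement over prior state-of-the-art methods.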

ABSTRACT

Can our video understanding systems perceive objects when a heavy occlusion exists in a scene? To answer this question, we collect a large-scale dataset called OVIS for occluded video instance segmentation, that is, to simultaneously detect, segment, and track instances in occluded scenes. OVIS consists of 296k high-quality instance masks from 25 semantic categories, where object occlusions usually occur. While our human vision systems can understand those occluded instances by contextual reasoning and association, our experiments suggest that current video understanding systems are not satisfying. On the OVIS dataset, the highest AP achieved by state-of-the-art algorithms is only 14.4, which reveals that we are still at a nascent stage for understanding objects, instances, and videos in a real-world scenario. In experiments, a simple plug-and-play module that performs temporal feature calibration is proposed to complement missing object cues caused by occlusion. Built upon MaskTrack R-CNN and SipMask, we obtain an AP of 15.1 and 14.5 on the OVIS dataset and achieve 32.1 and 35.1 on the YouTube-VIS dataset respectively, a remarkable improvement over the state-of-the-art methods. The OVIS dataset is released at http://songbai.site/ovis , and the project code will be available soon.

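Research Motivation and Objectives

Investigate how video understanding systems perform in heavily occluded scenes.
Collect and release a large-scale, high-quality dataset for occluded video instance segmentation.
Develop a plug-and-play module that strengthens feature representations under occlusion.
Assess the limitations of current video instance segmentation models in real-world occluded scenes.
Improve instance segmentation accuracy in occluded videos through temporal feature calibration.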

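Proposed Method

The authors present OVIS, a new dataset of 296k instance masks from 25 categories in which occlusions occur frequently.
They propose a plug-and-play temporal feature calibration module that recovers object cues lost during occlusion.
The module integrates into existing architectures such as MaskTrack R-CNN and SipMask without major changes to the network structure.
The method exploits temporal consistency, refining feature representations by aggregating information across frames.
It strengthens feature representations by modeling long-range temporal dependencies and contextual associations.
The method is trained end to end and applied at inference to improve segmentation under occlusion.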
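
To make the idea of cross-frame aggregation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: this summary does not specify the module's internals, so the sketch assumes a simple local-correlation scheme in which each query-frame location compares itself with nearby reference-frame locations by dot product, turns the scores into softmax weights, and adds the resulting aligned reference feature to its own feature. The names `calibrate` and `radius` and the nested-list feature layout are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def calibrate(query, reference, radius=1):
    """Fuse reference-frame features into query-frame features.

    query, reference: H x W grids (nested lists) of C-dimensional
    feature vectors. Returns a new H x W grid of fused vectors.
    This is an illustrative simplification, not the paper's module.
    """
    h, w = len(query), len(query[0])
    fused = []
    for y in range(h):
        row = []
        for x in range(w):
            q = query[y][x]
            # Reference features in a (2r+1) x (2r+1) window, clipped at borders.
            neighbors = [
                reference[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            # Correlation: dot product between the query feature and each neighbor.
            scores = [sum(a * b for a, b in zip(q, n)) for n in neighbors]
            weights = softmax(scores)
            # Aligned reference feature: similarity-weighted average of the window.
            aligned = [
                sum(wt * n[c] for wt, n in zip(weights, neighbors))
                for c in range(len(q))
            ]
            # Fuse by element-wise addition.
            row.append([a + b for a, b in zip(q, aligned)])
        fused.append(row)
    return fused
```

Calling `calibrate(query, reference)` on two grids of the same size returns a grid of that size. When the reference frame is all zeros, the output equals the query, since there is nothing to borrow from the other frame.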

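Experimental Results

Research Questions

RQ1: Can current video understanding systems accurately detect and segment objects under heavy occlusion?
RQ2: How does the performance of state-of-the-art video instance segmentation models degrade in occluded scenes?
RQ3: Can a simple plug-and-play module improve performance on occluded instances without modifying the network structure?
RQ4: What role does temporal feature calibration play in recovering cues lost during occlusion?
RQ5: Does the proposed method generalize well across datasets such as OVIS and YouTube-VIS?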

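Key Findings

The highest AP achieved by state-of-the-art methods on OVIS is only 14.4, leaving substantial room for improvement.
With the proposed temporal feature calibration module, AP on OVIS reaches 15.1 with MaskTrack R-CNN and 14.5 with SipMask, which the authors describe as a remarkable improvement over prior methods.
On YouTube-VIS, the method reaches 32.1 AP with MaskTrack R-CNN and 35.1 AP with SipMask, which the authors report as an improvement over prior state-of-the-art results.
These gains come from a simple plug-and-play module, showing that it is effective without complex fine-tuning.
The results show that current models struggle in occluded scenes, underscoring the need for stronger temporal and contextual reasoning.
The OVIS dataset is publicly available at http://songbai.site/ovis to support future research on occluded video understanding.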
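
The AP figures above depend on how a predicted mask sequence is matched to a ground-truth one. The following is a minimal sketch, assuming OVIS follows the YouTube-VIS convention of computing IoU over the whole video (summing intersections and unions across frames) rather than frame by frame. This summary does not state the metric's definition. The name `video_iou` and the pixel-set mask representation are illustrative choices.

```python
def video_iou(pred_masks, gt_masks):
    """Spatio-temporal IoU between two mask sequences.

    Each argument has one entry per frame: a set of (y, x) pixel
    coordinates, or None when the instance is absent (ground truth)
    or not predicted in that frame.
    """
    inter = 0
    union = 0
    for p, g in zip(pred_masks, gt_masks):
        p = p or set()
        g = g or set()
        inter += len(p & g)  # pixels both masks cover in this frame
        union += len(p | g)  # pixels either mask covers in this frame
    return inter / union if union else 0.0
```

For example, a prediction that matches a two-frame instance perfectly in the first frame but misses it entirely in the second (say, because the object is occluded) scores 0.5 when the two ground-truth masks have equal area. Missed frames of this kind are one way occlusion pulls AP down.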

This review was generated by AI and reviewed by a human editor.