QUICK REVIEW

[Paper Review] Sequence Level Semantics Aggregation for Video Object Detection

Haiping Wu, Yuntao Chen | arXiv (Cornell University) | Jul 15, 2019

Advanced Image and Video Retrieval Techniques | References: 36 | Cited by: 39

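One-Sentence Summary

SELSA introduces sequence-level semantic feature aggregation for video object detection, treating a video as a set of semantic neighbors drawn from the entire sequence, and achieves state-of-the-art mAP on ImageNet VID without complicated post-processing.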

ABSTRACT

Video object detection (VID) has been a rising research direction in recent years. A central issue of VID is the appearance degradation of video frames caused by fast motion. This problem is essentially ill-posed for a single frame. Therefore, aggregating features from other frames becomes a natural choice. Existing methods rely heavily on optical flow or recurrent neural networks for feature aggregation. However, these methods place more emphasis on temporally nearby frames. In this work, we argue that aggregating features at the full-sequence level will lead to more discriminative and robust features for video object detection. To achieve this goal, we devise a novel Sequence Level Semantics Aggregation (SELSA) module. We further demonstrate the close relationship between the proposed method and the classic spectral clustering method, providing a novel view for understanding the VID problem. We test the proposed method on the ImageNet VID and the EPIC KITCHENS dataset and achieve new state-of-the-art results. Our method does not need complicated postprocessing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean.

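Motivation and Objectives

Improve VID by exploiting information from the full sequence rather than aggregating only from temporally nearby frames.
Propose the SELSA module, which aggregates ROI features across the whole video according to semantic similarity.
Relate SELSA to spectral clustering, giving VID a clustering-based interpretation.
Demonstrate performance gains with end-to-end training on large-scale datasets (ImageNet VID, EPIC KITCHENS).
Show reduced reliance on post-processing such as Seq-NMS.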

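Proposed Method

Extract ROI proposals from frames sampled across the entire video.
Compute semantic similarity between proposals across frames using a generalized cosine similarity.
Aggregate features over semantically similar proposals in the sequence, using softmax-normalized similarities as weights.
Insert the SELSA module into Faster R-CNN and train the network end to end.
Provide a spectral clustering interpretation: proposals form a graph, and aggregation reduces intra-class variance.
Discuss the relationship to graph convolutional networks and show that the method encourages a block-diagonal matrix T for aggregation.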
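
The similarity and aggregation steps above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes plain cosine similarity in place of the generalized, learned similarity used by SELSA, operates on small Python lists rather than real ROI features, and introduces a hypothetical `temperature` parameter only to sharpen the softmax weights in toy examples. The function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_proposals(features, temperature=1.0):
    """Replace each proposal feature with a similarity-weighted average.

    `features` holds one vector per proposal, pooled from all sampled
    frames of a video, so every proposal can attend to every other one
    regardless of temporal distance.
    `temperature` is an assumption of this sketch, not part of the source.
    """
    aggregated = []
    for query in features:
        # Similarity of this proposal to every proposal in the sequence.
        sims = [cosine_similarity(query, other) / temperature for other in features]
        # Softmax turns similarities into weights that sum to 1.
        weights = softmax(sims)
        # Weighted sum of all proposal features, one dimension at a time.
        aggregated.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(len(query))]
        )
    return aggregated


# Toy input: two similar proposals and one dissimilar proposal.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aggregated = aggregate_proposals(features, temperature=0.1)
```

Because the weights for each proposal sum to 1, every output is a convex combination of the input features. Proposals that are semantically close pull toward one another, which is the intuition behind the claim that aggregation reduces intra-class variance.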

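Experimental Results

Research Questions

RQ1: Can full-sequence semantic aggregation improve VID without relying on optical flow or recurrent temporal models?
RQ2: Does SELSA effectively reduce intra-class feature variance under the diverse appearances found across video frames?
RQ3: Is SELSA compatible with end-to-end training without heavy post-processing such as Seq-NMS?
RQ4: How does SELSA perform against state-of-the-art VID methods on ImageNet VID and EPIC KITCHENS?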

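Key Findings

On ImageNet VID, with ResNet-101 and no video-level post-processing, SELSA reaches 80.25 mAP, surpassing several flow-based methods.
With ResNeXt-101, SELSA reaches 83.11 mAP, surpassing several contemporary methods even without post-processing.
Increasing the number of frames and drawing semantic neighbors from the full sequence yields notable gains on fast-motion objects (e.g., mAP (fast) rises to 61.38).
In ablations, SELSA with semantic aggregation over the whole sequence clearly outperforms the single-frame and intra-frame aggregation variants.
Data augmentation further improves performance: for example, VID data augmentation adds +2.44 mAP with ResNet-101.
Seq-NMS post-processing brings only a small additional gain on top of SELSA, indicating that the module already captures sequence-level information.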


This review was generated by AI and reviewed by human editors.