QUICK REVIEW

[Paper Review] Flow-Guided Feature Aggregation for Video Object Detection

Xizhou Zhu, Yujie Wang|arXiv (Cornell University)|Mar 29, 2017

Advanced Neural Network Applications | References: 39 | Cited by: 105

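One-Sentence Summary

Flow-guided feature aggregation (FGFA) enhances the per-frame CNN features used for video object detection by warping features from nearby frames along motion paths and aggregating them. The framework is trained end-to-end and improves accuracy over single-frame detectors.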

ABSTRACT

Extending state-of-the-art object detectors from image to video is challenging. The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information on box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection. It leverages temporal coherence on feature level instead. It improves the per-frame features by aggregation of nearby features along the motion paths, and thus improves the video recognition accuracy. Our method significantly improves upon strong single-frame baselines in ImageNet VID, especially for more challenging fast moving objects. Our framework is principled, and on par with the best engineered systems winning the ImageNet VID challenges 2016, without additional bells-and-whistles. The proposed method, together with Deep Feature Flow, powered the winning entry of ImageNet VID challenges 2017. The code is available at https://github.com/msracver/Flow-Guided-Feature-Aggregation.

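Research Motivation and Goals

Improve video object detection by exploiting temporal information at the feature level rather than by post-processing detection results.
Develop an end-to-end trainable framework that enhances per-frame features with flow-guided aggregation across nearby frames.
Address the challenges posed by degraded object appearance in video (motion blur, defocus, rare poses).
Demonstrate competitive performance on ImageNet VID without heavily hand-engineered bells and whistles.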

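Proposed Method

Apply a per-frame feature extractor to every video frame.
Use a flow network to estimate optical flow between frames and warp the features of nearby frames to the reference frame.
Embed the warped features and the reference-frame features with a small embedding network so that their similarity can be computed.
Compute an adaptive weight at each spatial location from cosine similarity in the embedding space, and aggregate the warped features as a weighted sum.
Feed the aggregated features to a detection network (based on R-FCN) to detect objects in the reference frame.
Train all components end-to-end, applying temporal dropout during training as regularization over the range of frames aggregated.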
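
The warping, weighting, and aggregation steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (their code is at the repository linked in the abstract). Several things here are assumptions made for illustration: the function names `warp_bilinear`, `cosine_weights`, and `fgfa_aggregate` are hypothetical; the flow fields are passed in rather than estimated by a flow network; the flow is assumed to store, for each reference-frame location, the (dx, dy) offset to the matching location in the neighbor frame; the embedding network is replaced by an identity function by default; and the weights are normalized across frames with a softmax over the cosine similarities.

```python
import numpy as np

def warp_bilinear(feat, flow):
    """Warp neighbor-frame features (C, H, W) to the reference frame.

    flow has shape (2, H, W): flow[0] is dx and flow[1] is dy, the offset
    from each reference location to its match in the neighbor frame
    (an assumed convention). Samples outside the map are clamped to the border.
    """
    _, H, W = feat.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x = np.clip(xs + flow[0], 0, W - 1)
    y = np.clip(ys + flow[1], 0, H - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    y1 = np.minimum(y0 + 1, H - 1)
    wx = x - x0  # horizontal interpolation weight, shape (H, W)
    wy = y - y0  # vertical interpolation weight, shape (H, W)
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy

def cosine_weights(ref_emb, warped_embs, eps=1e-8):
    """Per-location weights from cosine similarity to the reference embedding.

    ref_emb: (D, H, W); warped_embs: (N, D, H, W). Returns (N, H, W) weights
    that sum to 1 over the N frames at every location (softmax, assumed).
    """
    ref_n = ref_emb / (np.linalg.norm(ref_emb, axis=0, keepdims=True) + eps)
    emb_n = warped_embs / (np.linalg.norm(warped_embs, axis=1, keepdims=True) + eps)
    sim = (emb_n * ref_n[None]).sum(axis=1)           # (N, H, W)
    e = np.exp(sim - sim.max(axis=0, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=0, keepdims=True)

def fgfa_aggregate(ref_feat, neighbor_feats, flows, embed=lambda f: f):
    """Aggregate the reference features with flow-warped neighbor features.

    embed stands in for the small embedding network (identity by default).
    """
    warped = [ref_feat] + [warp_bilinear(f, m) for f, m in zip(neighbor_feats, flows)]
    warped = np.stack(warped)                         # (N, C, H, W)
    embs = np.stack([embed(f) for f in warped])       # (N, D, H, W)
    weights = cosine_weights(embed(ref_feat), embs)   # (N, H, W)
    return (warped * weights[:, None]).sum(axis=0)    # (C, H, W)
```

In this sketch the reference frame takes part in the aggregation with zero displacement, so a neighbor whose warped embedding disagrees with the reference at some location simply receives a smaller weight there. The aggregated map has the same shape as a single-frame feature map, which is why it can be passed to the detection network unchanged.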

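Experimental Results

Research Questions

RQ1: Can exploiting temporal information at the feature level significantly improve video object detection accuracy beyond that of single-frame detectors?
RQ2: Does flow-guided feature aggregation yield robust improvements for slow, medium, and fast object motion?
RQ3: How do end-to-end trained optical flow estimation, feature warping, and aggregation affect detection performance compared with box-level post-processing?
RQ4: What are the trade-offs among aggregation range, computational cost, and detection accuracy?

Key Findings

FGFA significantly outperforms strong single-frame baselines on ImageNet VID, achieving higher mean average precision (mAP).
The method also yields clear gains on fast-moving objects, with the fast-motion group seeing a larger mAP improvement.
Adaptive, flow-guided aggregation gathers information from nearby frames and improves detection when object appearance is degraded.
End-to-end training is essential; freezing some components (such as FlowNet) lowers performance.
Combining FGFA with box-level techniques such as Seq-NMS yields better results without heavy engineering.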


This review was generated by AI and reviewed by a human editor.