QUICK REVIEW

[Paper Review] Towards Weakly-Supervised Action Localization

Philippe Weinzaepfel, Xavier Martín|arXiv (Cornell University)|May 18, 2016

Human Pose and Action Recognition | References: 32 | Cited by: 28

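SUMMARY IN BRIEF

This paper proposes a weakly-supervised action localization method that extracts human tubes by combining a state-of-the-art human detector with tracking-by-detection, reaching 95% recall on the UCF-Sports and J-HMDB datasets with fewer than 5 tubes per video. Using improved dense trajectories with multi-fold Multiple Instance Learning (MIL), it obtains 84% mAP on UCF-Sports and 54% mAP on J-HMDB, close to fully-supervised performance, and it introduces DALY, a large-scale dataset with 3.3M frames and 10 actions.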

ABSTRACT

This paper presents a novel approach for weakly-supervised action localization, i.e., that does not require per-frame spatial annotations for training. We first introduce an effective method for extracting human tubes by combining a state-of-the-art human detector with a tracking-by-detection approach. Our tube extraction leverages the large amount of annotated humans available today and outperforms the state of the art by an order of magnitude: with less than 5 tubes per video, we obtain a recall of 95% on the UCF-Sports and J-HMDB datasets. Given these human tubes, we perform weakly-supervised selection based on multi-fold Multiple Instance Learning (MIL) with improved dense trajectories and achieve excellent results. We obtain a mAP of 84% on UCF-Sports, 54% on J-HMDB and 45% on UCF-101, which outperforms the state of the art for weakly-supervised action localization and is close to the performance of the best fully-supervised approaches. The second contribution of this paper is a new realistic dataset for action localization, named DALY (Daily Action Localization in YouTube). It contains high quality temporal and spatial annotations for 10 actions in 31 hours of videos (3.3M frames), which is an order of magnitude larger than standard action localization datasets. On the DALY dataset, our tubes have a spatial recall of 82%, but the detection task is extremely challenging, we obtain 10.8% mAP.

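MOTIVATION AND OBJECTIVES

Develop a weakly-supervised action localization framework that avoids per-frame spatial annotations.
Improve the accuracy of human tube extraction by using existing human detection annotations together with tracking-by-detection.
Reach localization accuracy comparable to fully-supervised methods without frame-level annotations.
Introduce DALY, a large-scale, realistic benchmark for action localization with 31 hours of YouTube video and 10 action classes.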

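PROPOSED METHOD

Human tubes are extracted by combining a state-of-the-art human detector with a tracking-by-detection pipeline, which yields spatio-temporal tube proposals.
The large amount of existing human detection annotation substantially raises tube recall: 95% on UCF-Sports and J-HMDB with fewer than 5 tubes per video.
Improved dense trajectories serve as visual features, and multi-fold Multiple Instance Learning (MIL) performs the weakly-supervised localization.
The MIL framework treats the tubes of each video as candidates and selects among them using only video-level labels.
The method is evaluated on standard benchmarks and on the newly introduced DALY dataset, which contains 3.3M frames and 10 action classes drawn from YouTube videos.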
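The multi-fold MIL selection step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `fit_linear` and `multifold_mil` are hypothetical, the difference-of-means scorer is a simple stand-in for whatever classifier the paper trains, and initialising with the first tube of every video is an arbitrary choice. Each positive video is a bag of tube feature vectors, and the tube chosen in a video is always re-selected by a model that was trained without that video.

```python
import numpy as np

def fit_linear(pos, neg):
    """Stand-in linear scorer (difference of class means).
    Assumption: used here only for illustration, not the paper's classifier."""
    mu_pos, mu_neg = pos.mean(axis=0), neg.mean(axis=0)
    w = mu_pos - mu_neg
    b = -0.5 * float(w @ (mu_pos + mu_neg))
    return w, b

def multifold_mil(pos_bags, neg_feats, n_folds=3, n_iters=5):
    """Select one tube per positive video from video-level labels only.

    pos_bags : list of (n_tubes_i, d) arrays, one bag per positive video.
    neg_feats: (m, d) array of tube features from videos without the action.
    Returns a list with the index of the selected tube in each positive video.
    """
    if n_folds < 2 or len(pos_bags) < 2:
        raise ValueError("need at least 2 folds and 2 positive videos")
    # Assumption: start from the first tube of every video.
    selected = [0] * len(pos_bags)
    folds = np.arange(len(pos_bags)) % n_folds
    for _ in range(n_iters):
        new_selected = list(selected)
        for k in range(n_folds):
            # Train on the tubes currently selected in all other folds.
            train_idx = [i for i in range(len(pos_bags)) if folds[i] != k]
            pos = np.stack([pos_bags[i][selected[i]] for i in train_idx])
            w, b = fit_linear(pos, neg_feats)
            # Re-select the best-scoring tube in each held-out video.
            for i in np.flatnonzero(folds == k):
                new_selected[i] = int(np.argmax(pos_bags[i] @ w + b))
        if new_selected == selected:  # selection is stable, stop early
            break
        selected = new_selected
    return selected
```

Holding out a fold when re-selecting is what distinguishes multi-fold MIL from plain MIL: a model never scores the tubes it was trained on, which reduces the risk of locking onto its own initial guess.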

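EXPERIMENTAL RESULTS

Research Questions

RQ1: Can existing detection annotations and tracking techniques substantially improve human tube extraction and thereby enable weakly-supervised action localization?
RQ2: How accurately can multi-fold MIL with improved dense trajectories localize actions without per-frame annotations?
RQ3: How does the proposed method compare with state-of-the-art weakly-supervised methods on standard benchmarks?
RQ4: Can a realistic dataset built from large-scale YouTube video, such as DALY, serve as a meaningful benchmark for action localization?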

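Key Findings

The proposed tube extraction reaches 95% recall on UCF-Sports and J-HMDB with fewer than 5 tubes per video, outperforming prior work by an order of magnitude.
On UCF-Sports, the method obtains 84% mAP for weakly-supervised action localization, outperforming the prior state of the art for this setting and approaching fully-supervised performance.
On J-HMDB, the method obtains 54% mAP, the best reported weakly-supervised action localization result on this dataset.
On UCF-101, the method obtains 45% mAP despite the complexity of the dataset.
On the newly introduced DALY dataset, the tubes reach a spatial recall of 82%, but detection remains very challenging, with 10.8% mAP.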
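For readers unfamiliar with how tube recall figures like those above are typically computed, here is a small sketch. It rests on assumptions that the summary does not confirm: a tube is represented as a dict from frame index to an `(x1, y1, x2, y2)` box, spatial overlap is the per-frame IoU averaged over the annotated frames, and a ground-truth tube counts as recalled at a threshold of 0.5. The function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def spatial_tube_iou(gt, prop):
    """Mean per-frame IoU over the frames annotated in gt.
    Frames missing from the proposal contribute an IoU of 0."""
    if not gt:
        return 0.0
    total = sum(box_iou(box, prop[f]) if f in prop else 0.0
                for f, box in gt.items())
    return total / len(gt)

def tube_recall(gt_tubes, proposals, thresh=0.5):
    """Fraction of ground-truth tubes matched by at least one proposal.
    Assumption: a match means mean spatial IoU >= thresh."""
    if not gt_tubes:
        return 0.0
    hits = sum(
        any(spatial_tube_iou(gt, p) >= thresh for p in proposals)
        for gt in gt_tubes
    )
    return hits / len(gt_tubes)
```

Under this convention, recall can only increase as more proposals per video are allowed, which is why a recall figure is meaningful only together with the number of tubes per video.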


This review was generated by AI and reviewed by a human editor.