QUICK REVIEW

[Paper Review] The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary

Bernard Ghanem, Juan Carlos Niebles|arXiv (Cornell University)|Aug 11, 2018

Human Pose and Action Recognition | References: 5 | Cited by: 55

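One-Sentence Summary

This paper summarizes the 2018 ActivityNet Challenge, describing its six tasks (three main ActivityNet tasks and three guest tasks) and the top submissions for temporal proposals, localization, and dense captioning in large-scale video.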

ABSTRACT

The 3rd annual installment of the ActivityNet Large-Scale Activity Recognition Challenge, held as a full-day workshop in CVPR 2018, focused on the recognition of daily life, high-level, goal-oriented activities from user-generated videos as those found in internet video portals. The 2018 challenge hosted six diverse tasks which aimed to push the limits of semantic visual understanding of videos as well as bridge visual content with human captions. Three out of the six tasks were based on the ActivityNet dataset, which was introduced in CVPR 2015 and organized hierarchically in a semantic taxonomy. These tasks focused on tracing evidence of activities in time in the form of proposals, class labels, and captions. In this installment of the challenge, we hosted three guest tasks to enrich the understanding of visual information in videos. The guest tasks focused on complementary aspects of the activity recognition problem at large scale and involved three challenging and recently compiled datasets: the Kinetics-600 dataset from Google DeepMind, the AVA dataset from Berkeley and Google, and the Moments in Time dataset from MIT and IBM Research.

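Research Motivation and Goals

Push the limits of semantic visual understanding of daily-life activities in large-scale, user-generated videos.
Connect visual content with human-written text descriptions through a diverse set of tasks and datasets.
Provide standardized evaluation on ActivityNet and the guest datasets using proposal, localization, and captioning metrics.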

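Proposed Method

Define six tasks (three based on ActivityNet and three guest tasks) that evaluate different aspects of video understanding.
Evaluate temporal proposals with the Average Recall vs. Average Number of proposals per video (AR-AN) curve, using the area under that curve to measure proposal quality.
Evaluate temporal localization using mean average precision (mAP) averaged over temporal IoU thresholds.
Evaluate dense captioning of events using average scores based on METEOR, BLEU, and CIDEr.
Include guest tasks on Kinetics-600, AVA, and Moments in Time to broaden the coverage of large-scale video understanding.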
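The proposal metric above rests on temporal IoU (tIoU) between a proposed segment and a ground-truth segment. The following is a minimal sketch of that idea, not the challenge's official evaluation code: the function names are hypothetical, and the threshold grid (0.50 to 0.95 in steps of 0.05) is an assumption that this summary does not state.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU of two (start, end) segments given in seconds."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_threshold(ground_truth, proposals, threshold):
    """Fraction of ground-truth segments covered by any proposal at tIoU >= threshold."""
    if not ground_truth:
        return 0.0
    hits = sum(
        1
        for gt in ground_truth
        if any(temporal_iou(gt, p) >= threshold for p in proposals)
    )
    return hits / len(ground_truth)


def average_recall(ground_truth, proposals, thresholds):
    """Mean recall over a list of tIoU thresholds."""
    recalls = [recall_at_threshold(ground_truth, proposals, t) for t in thresholds]
    return sum(recalls) / len(recalls)


# Assumed threshold grid (hypothetical, not taken from this summary).
THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

ground_truth = [(0.0, 10.0), (20.0, 30.0)]
proposals = [(0.0, 10.0), (20.0, 28.3)]
ar = average_recall(ground_truth, proposals, THRESHOLDS)
```

Computing `average_recall` while keeping only the top N proposals per video, for increasing N, would trace out an AR-AN curve; the area under that curve is the kind of AUC score reported for the proposal task.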

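Experimental Results

Research Questions

RQ1: How can temporal action proposals be generated efficiently while remaining discriminative for the target activities?
RQ2: How well do current methods localize and recognize actions in long, untrimmed videos?
RQ3: How well do models detect, localize, and describe multiple events within a single video (dense captioning)?
RQ4: What insights do the large guest datasets (Kinetics-600, AVA, Moments in Time) offer for broad activity understanding?
RQ5: Which methods perform best across the different tasks and datasets in large-scale activity recognition?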

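Key Findings

Task 1 (temporal action proposals): the top three AUC scores were 71.00, 69.30, and 67.78, from Baidu Vis, Shanghai Jiao Tong University, and YH Technologies, respectively.
Task 2 (temporal action localization): the top three average mAP scores were 38.53, 35.49, and 35.27.
Task 3 (dense-captioning events): the top two average METEOR scores were 8.53 and 8.13.
Task A (trimmed activity recognition): the top three average error rates were 10.99, 11.69, and 12.20 (lower is better).
Task B (spatio-temporal action localization): the top three mAP@0.5IoU scores were 21.08, 21.03, and 19.60 on the Computer Vision Only track, and 20.99, 19.60, and 16.76 on the Full track.
Task C (trimmed event recognition): the top three average accuracy scores were 52.91, 51.26, and 50.06 on the Full track, and 47.72, 45.49, and 45.10 on the Mini track.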
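This summary does not define the "average error" reported for Task A. One plausible reading, stated here as an assumption rather than as the challenge's definition, is the mean of the top-1 and top-5 classification error. A minimal sketch under that assumption, with hypothetical function names and toy scores:

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


def average_error(scores, labels, ks=(1, 5)):
    """Mean of the top-k errors for each k in ks (assumed: top-1 and top-5)."""
    return sum(top_k_error(scores, labels, k) for k in ks) / len(ks)


# Toy example: two clips, six classes (illustrative values only).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # true class 1 is ranked first
    [0.30, 0.25, 0.20, 0.15, 0.06, 0.04],  # true class 2 is ranked third
]
labels = [1, 2]
avg_err = average_error(scores, labels)
```

Under this reading, a model is rewarded both for getting the single best guess right and for keeping the correct class within its top five predictions.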


This review was generated by AI and checked by human editors.