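[Paper Summary] The AVA-Kinetics Localized Human Actions Video Dataset
This paper introduces AVA-Kinetics, a cross-domain dataset that provides AVA-style localized action annotations for a subset of Kinetics-700 videos, and benchmarks action classification with the Video Action Transformer Network using both ground-truth and detected boxes.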
This paper describes the AVA-Kinetics localized human actions video dataset. The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol, and extending the original AVA dataset with these new AVA annotated Kinetics clips. The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames. We describe the annotation process and provide statistics about the new dataset. We also include a baseline evaluation using the Video Action Transformer Network on the AVA-Kinetics dataset, demonstrating improved performance for action classification on the AVA test set. The dataset can be downloaded from https://research.google.com/ava/
Research Motivation and Objectives
- Motivate the creation of a dataset that combines AVA's localized action labeling with Kinetics' video diversity to improve generalization.
- Describe the annotation pipeline and statistics of AVA-Kinetics.
- Provide baseline benchmarks for action classification using Video Action Transformer Network on AVA-Kinetics.
- Analyze how increasing Kinetics-driven data impacts per-class performance and overall mAP.
Proposed Method
- Annotate AVA-style bounding boxes and actions on a frame selected from each Kinetics video clip.
- Use Faster RCNN to detect persons, select a key-frame with highest detection confidence, annotate missing boxes, and create a 2-second clip around the key-frame for labeling by multiple raters.
- Retain only labels verified by a majority of raters (at least 2 of 3).
- Train a Video Action Transformer Network on ground-truth boxes, and separately evaluate it with detected boxes at test time, to assess action classification performance.
- Evaluate correlations between Kinetics and AVA class annotations via Normalized Pointwise Mutual Information (NPMI) and study per-class performance categories (person-object, person-pose, person-person).
- Analyze data-size impact on performance by varying AVA vs AVA-Kinetics training data.
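The annotation steps and the NPMI measure listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the input layout (one list of person-detection scores per frame), and the choice to center the 2-second clip on the key-frame are all assumptions made for the example. The NPMI formula used is the standard one, PMI divided by the negative log of the joint probability.

```python
import math
from collections import Counter


def select_key_frame(frame_detections):
    """Return the index of the frame with the highest person-detection score.

    frame_detections: one list of confidence scores per frame (hypothetical layout).
    Returns None if no frame contains a detection.
    """
    best_index, best_score = None, float("-inf")
    for index, scores in enumerate(frame_detections):
        if scores and max(scores) > best_score:
            best_index, best_score = index, max(scores)
    return best_index


def clip_window(key_time_s, clip_length_s=2.0):
    """Return (start, end) in seconds of a clip around the key-frame.

    Assumption: the clip is centered on the key-frame and clamped at 0.
    """
    half = clip_length_s / 2.0
    return (max(0.0, key_time_s - half), key_time_s + half)


def majority_labels(rater_labels, min_votes=2):
    """Keep labels chosen by at least min_votes raters (2 of 3 by default)."""
    votes = Counter(label for labels in rater_labels for label in set(labels))
    return sorted(label for label, count in votes.items() if count >= min_votes)


def npmi(p_xy, p_x, p_y):
    """Normalized pointwise mutual information, in the range [-1, 1]."""
    if p_xy == 0.0:
        return -1.0  # the two classes never co-occur
    if p_xy == 1.0:
        return 1.0   # avoid dividing by -log(1) = 0
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


if __name__ == "__main__":
    print(select_key_frame([[0.3], [0.9, 0.2], []]))
    print(clip_window(5.0))
    print(majority_labels([["stand", "talk"], ["stand"], ["sit", "talk"]]))
    print(npmi(0.25, 0.5, 0.5))
```

An NPMI near 1 would indicate that a Kinetics class and an AVA class almost always co-occur, a value near 0 that they are independent, and a value near -1 that they rarely appear together.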
Experimental Results
Research Questions
- RQ1: Does annotating Kinetics videos with AVA-style localization and labels create a useful, more diverse training signal for action recognition?
- RQ2: How does training on AVA, Kinetics, or their combination affect action classification performance on AVA and AVA-Kinetics test sets?
- RQ3: What is the relationship between Kinetics-derived data size and gains in mAP across AVA classes?
- RQ4: How do per-class performance trends differ across person-object, person-pose, and person-person interaction categories?
- RQ5: How does action classification performance change when using ground-truth boxes versus detected boxes?
Key Findings
- AVA-Kinetics combines AVA and Kinetics to provide AVA-style localization for Kinetics clips, yielding broader visual diversity with AVA-style labels.
- Using the Video Action Transformer Network, training on AVA-Kinetics improves AVA val mAP by 5.26 points when evaluated with ground-truth boxes.
- Training on AVA-Kinetics generally improves generalization and per-class performance, with notable gains in several classes such as watch, cut, listen, and swim.
- When using detected boxes, the improvements persist but are smaller due to detector imperfections; training on AVA-Kinetics still yields a positive gain on AVA val.
- Per-class analysis shows pose-based actions are easier, while object-interaction actions remain challenging, and Kinetics data particularly helps increase examples for underrepresented classes.
- Figure 8 demonstrates that most classes benefit from increased Kinetics samples, with the sole exception of 'enter' showing a slight decrease.
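The per-class comparison behind these findings amounts to subtracting per-class average precision (AP) between two training setups and averaging AP over classes to get mAP. The sketch below shows that arithmetic only; the function names are hypothetical, and the class names and AP values are toy placeholders, not results from the paper.

```python
def mean_average_precision(per_class_ap):
    """mAP is the unweighted mean of the per-class AP values."""
    return sum(per_class_ap.values()) / len(per_class_ap)


def per_class_gain(baseline_ap, combined_ap):
    """AP difference per class: combined training minus baseline training."""
    return {name: combined_ap[name] - baseline_ap[name] for name in baseline_ap}


if __name__ == "__main__":
    # Toy placeholder values, NOT numbers reported in the paper.
    baseline = {"class_a": 0.5, "class_b": 0.25}
    combined = {"class_a": 0.75, "class_b": 0.25}

    gains = per_class_gain(baseline, combined)
    for name, gain in sorted(gains.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: {gain:+.2f}")
    print("mAP gain:",
          mean_average_precision(combined) - mean_average_precision(baseline))
```

Sorting the gains makes it easy to spot classes that regress, which is the kind of exception the summary notes for 'enter'.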
This summary was generated by AI and reviewed by a human editor.