QUICK REVIEW

[Paper Review] VideoGraph: Recognizing Minutes-Long Human Activities in Videos

Noureldien Hussein, Efstratios Gavves | arXiv (Cornell University) | May 13, 2019

Human Pose and Action Recognition | References: 59 | Cited by: 46

One-Sentence Summary

VideoGraph introduces a soft, data-driven graph representation with learnable nodes and graph embeddings to model minutes-long activities, achieving improvements on Breakfast, Epic-Kitchens, and Charades.

ABSTRACT

Many human activities take minutes to unfold. To represent them, related works opt for statistical pooling, which neglects the temporal structure. Others opt for convolutional methods, as CNN and Non-Local. While successful in learning temporal concepts, they are short of modeling minutes-long temporal dependencies. We propose VideoGraph, a method to achieve the best of two worlds: represent minutes-long human activities and learn their underlying temporal structure. VideoGraph learns a graph-based representation for human activities. The graph, its nodes and edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation. The result is improvements over related works on benchmarks: Epic-Kitchen and Breakfast. Besides, we demonstrate that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.

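Motivation and Objectives

Motivate the recognition of human activities that unfold over minutes rather than seconds.
Develop a graph-inspired representation in which nodes are learnable and edges are probabilistic relations, so that temporal structure is preserved.
Remove the need for node-level annotations by learning the graph nodes directly from data.
Demonstrate effectiveness against strong baselines on Breakfast, Epic-Kitchens, and Charades.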

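Proposed Method

An activity is represented as a soft, undirected graph whose nodes are learned latent concepts and whose edges are learned relations.
A node attention block correlates segment features with the learned nodes to produce node-attentive features, with no node annotations required.
A graph embedding layer learns temporal and node-wise relations, then applies spatial convolution to capture interactions across nodes.
A backbone CNN (I3D or ResNet-152) extracts segment features; 64 segments of 8 frames each are processed per video to form the graph representation.
The classifier is a two-layer fully connected network with BatchNorm and ReLU (softmax for single-label tasks, sigmoid for multi-label tasks).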
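
The node attention step can be illustrated with a minimal sketch. It assumes dot-product similarity between segment features and latent concepts followed by a sigmoid gate, and it pools away the spatial dimensions for brevity; the paper's exact block may differ. The function name `node_attention`, the node count `N`, and the channel size `C` are hypothetical choices for illustration; only the 64 segments per video come from the summary above.

```python
import numpy as np

def node_attention(segment_feats, nodes):
    """Gate each segment feature by its similarity to every latent node.

    segment_feats: (T, C) array, one feature vector per video segment.
    nodes:         (N, C) array of latent concepts (learned in practice,
                   random here for illustration).
    Returns a (T, N, C) array of node-attentive features.
    """
    scores = segment_feats @ nodes.T            # (T, N) dot-product similarity
    alpha = 1.0 / (1.0 + np.exp(-scores))       # sigmoid gate in (0, 1)
    # Broadcast: each segment feature is scaled once per node.
    return alpha[:, :, None] * segment_feats[:, None, :]

rng = np.random.default_rng(0)
T, N, C = 64, 8, 16                             # N and C are assumed values
segment_feats = rng.standard_normal((T, C))
nodes = rng.standard_normal((N, C))

z = node_attention(segment_feats, nodes)
print(z.shape)                                  # (64, 8, 16)
```

Because the gate is computed from the features themselves, no per-node labels are needed: the nodes only have to be trainable parameters updated by the classification loss.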

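Experimental Results

Research Questions

RQ1: Can minutes-long activities be represented by a learnable, data-driven graph without explicit node annotations?
RQ2: Does the graph embedding mechanism capture the temporal transitions and node-to-node relations that matter for long-range activity recognition?
RQ3: With the same backbone, how does VideoGraph compare with state-of-the-art baselines on Breakfast, Epic-Kitchens, and Charades?
RQ4: To what extent does temporal structure contribute more to recognition performance than fine-grained action cues?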

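Key Findings

With the same backbone (I3D), VideoGraph improves over the baselines on Charades, Breakfast, and Epic-Kitchens.
On Charades, I3D + VideoGraph reaches 37.8 mAP, up from 32.9 mAP for I3D alone.
On Breakfast with an I3D backbone, VideoGraph reaches 69.45% accuracy and 63.14% mAP, outperforming several baselines.
On Epic-Kitchens with an I3D backbone, VideoGraph achieves 55.32% mAP, which is competitive with the Timeception and ActionVLAD variants.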
Relative to I3D without VideoGraph (58.61% accuracy, 47.05% mAP), I3D + VideoGraph raises Breakfast accuracy to 69.45% and mAP to 63.14%, gains of 10.84 and 16.09 points respectively.
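The initialization of the latent concepts Y affects performance: Sobol initialization works best on Epic-Kitchens and Charades, while random initialization works best on Breakfast (Table 3).
Visualizations show that the learned latent concepts diverge during training (their pairwise distances increase) and reveal interpretable node relations for activities (Figures 5–7).
VideoGraph is order-aware and encodes temporal structure more effectively than some baselines: its performance drops noticeably when the temporal order is shuffled, unlike orderless methods such as ActionVLAD (Table 4).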
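
The shuffling finding rests on a simple property that can be checked on toy data. The sketch below is not the paper's experiment: it uses mean pooling as a stand-in for an orderless aggregator (it is not ActionVLAD) and a fixed two-tap temporal difference as a stand-in for a learned temporal convolution. The synthetic sinusoidal features and all function names are assumptions made for illustration.

```python
import numpy as np

def orderless_pool(feats):
    """Mean over time: ignores the order of segments."""
    return feats.mean(axis=0)

def order_aware_summary(feats):
    """Mean absolute difference between consecutive segments.

    A fixed two-tap temporal filter, used only as a stand-in for a learned
    temporal convolution; its output depends on which segments are adjacent.
    """
    return np.abs(feats[1:] - feats[:-1]).mean(axis=0)

T, C = 64, 4                                    # 64 segments; C is assumed
t = np.arange(T)[:, None]
phases = np.linspace(0.0, np.pi, C)[None, :]
feats = np.sin(2 * np.pi * t / T + phases)      # smooth signal over time, (T, C)

rng = np.random.default_rng(0)
shuffled = feats[rng.permutation(T)]            # same segments, random order

same_orderless = np.allclose(orderless_pool(feats), orderless_pool(shuffled))
same_order_aware = np.allclose(order_aware_summary(feats),
                               order_aware_summary(shuffled))
print(same_orderless, same_order_aware)         # True False
```

The orderless summary cannot distinguish the shuffled video from the original, so a model built on it has no reason to lose accuracy under shuffling. An order-aware summary changes, which is consistent with the drop reported for VideoGraph.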


This review was generated by AI and reviewed by human editors.