QUICK REVIEW

[Paper Review] Cross-view Action Modeling, Learning and Recognition

Jiang Wang, Xiaohan Nie | arXiv (Cornell University) | May 12, 2014

Human Pose and Action Recognition | References: 23 | Cited by: 56

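Summary

This paper proposes a multiview spatio-temporal AND-OR graph (MST-AOG) model for cross-view action recognition in 2D video. Training uses 3D human skeleton data, while inference needs no 3D input. By hierarchically modeling geometry, appearance, and motion across views, the model achieves state-of-the-art performance, reaching 81.6% accuracy on the cross-view recognition task, and shows strong robustness across subjects and environments.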

ABSTRACT

Existing methods on video-based action recognition are generally view-dependent, i.e., performing recognition from the same views seen in the training data. We present a novel multiview spatio-temporal AND-OR graph (MST-AOG) representation for cross-view action recognition, i.e., the recognition is performed on the video from an unknown and unseen view. As a compositional model, MST-AOG compactly represents the hierarchical combinatorial structures of cross-view actions by explicitly modeling the geometry, appearance and motion variations. This paper proposes effective methods to learn the structure and parameters of MST-AOG. The inference based on MST-AOG enables action recognition from novel views. The training of MST-AOG takes advantage of the 3D human skeleton data obtained from Kinect cameras to avoid annotating enormous multi-view video frames, which is error-prone and time-consuming, but the recognition does not need 3D information and is based on 2D video input. A new Multiview Action3D dataset has been created and will be released. Extensive experiments have demonstrated that this new action representation significantly improves the accuracy and robustness for cross-view action recognition on 2D videos.

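Motivation and objectives

Address the challenge of recognizing actions from novel views in 2D video, where existing methods fail because they learn view-dependent features.
Develop a compositional, hierarchical model that captures variations in geometry, appearance, and motion across views.
Reduce the reliance on costly multi-view video annotation by using 3D skeleton data as a proxy during training.
Achieve robust cross-view, cross-subject, and cross-environment action recognition using only 2D video input at inference time.
Improve generalization by discovering discriminative poses and view-invariant structure through data-driven learning.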

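Proposed method

The MST-AOG model uses a hierarchical AND-OR graph whose nodes represent actions, poses, views, body parts, and features, enabling compositional modeling of spatio-temporal patterns.
Localization is performed at a high (coarse) level to capture low-resolution spatial and temporal features, which improves robustness and reduces the annotation burden.
During training, 3D human skeleton data from Kinect sensors is used to explicitly model 2D view projections and the geometric relationships across views.
A discriminative data mining method automatically discovers frequent and discriminative poses, which form the basis of the action node structure.
The model learns appearance and motion features from multi-view video and 3D skeleton data, so that inference can run on 2D video without 3D input.
Inference traverses the hierarchy and applies probabilistic reasoning to perform cross-view pose detection and action classification.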
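
To make the AND-OR hierarchy more concrete, here is a minimal Python sketch of bottom-up inference over a toy graph. It is not the paper's implementation. It assumes the common AND-OR convention that an AND node sums its children's scores (all parts must be present) and an OR node keeps its best-scoring child (one alternative is chosen). The node classes, the `pose` helper, the action and view names, and all scores are hypothetical placeholders for the learned geometry, appearance, and motion terms.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    """Terminal node: its score is looked up by name (hypothetical feature score)."""
    name: str


@dataclass
class AndNode:
    """Composition: every child must be present, so child scores are summed."""
    name: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class OrNode:
    """Alternatives: exactly one child is selected, so the best child is kept."""
    name: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Leaf, AndNode, OrNode]


def infer(node: Node, scores: Dict[str, float]) -> Tuple[float, List[str]]:
    """Return (best score, names of the selected nodes) for the subtree at node."""
    if isinstance(node, Leaf):
        # Leaves without an observed score default to 0.0.
        return scores.get(node.name, 0.0), [node.name]
    results = [infer(child, scores) for child in node.children]
    if isinstance(node, AndNode):
        total = sum(score for score, _ in results)
        selected = [node.name] + [n for _, path in results for n in path]
        return total, selected
    # OrNode: keep only the highest-scoring alternative.
    best_score, best_path = max(results, key=lambda r: r[0])
    return best_score, [node.name] + best_path


def pose(name: str, views: List[str]) -> OrNode:
    """A pose is an OR over views; each view is an AND of appearance and motion."""
    return OrNode(name, [
        AndNode(f"{name}@{v}", [Leaf(f"{name}@{v}/appearance"),
                                Leaf(f"{name}@{v}/motion")])
        for v in views
    ])


views = ["front", "side"]
root = OrNode("action", [
    AndNode("pick_up", [pose("bend", views), pose("stand", views)]),
    AndNode("sit_down", [pose("bend", views), pose("sit", views)]),
])

# Toy per-leaf scores, standing in for detector responses on one 2D video.
scores = {
    "bend@side/appearance": 0.9, "bend@side/motion": 0.7,
    "bend@front/appearance": 0.2, "bend@front/motion": 0.1,
    "stand@side/appearance": 0.6, "stand@side/motion": 0.5,
    "sit@side/appearance": 0.1, "sit@side/motion": 0.2,
}

best_score, path = infer(root, scores)
print(path[1], round(best_score, 2))
```

In this toy setup the root OR node selects `pick_up`, and the selected path also reports which view won for each pose, which mirrors how a single traversal can yield both the action label and the view estimates. A real model would replace the dictionary lookup with learned scoring functions and would also model temporal ordering between poses.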

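Experimental results

Research questions

RQ1: Can a compositional generative model, trained with 3D skeleton data, effectively represent cross-view action variations in 2D video?
RQ2: How can variations in geometry, appearance, and motion across views be modeled jointly within a hierarchical structure?
RQ3: Can the model generalize to novel views without 3D input at inference time?
RQ4: To what extent do low-resolution features improve robustness in cross-view, cross-subject, and cross-environment settings?
RQ5: How effective is the proposed data-driven pose discovery method at improving recognition accuracy relative to baseline methods?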

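Key findings

On the Multiview Action3D dataset, the MST-AOG model reaches 81.6% recognition accuracy in cross-view testing, significantly outperforming prior methods.
The model is more robust across both subjects and environments: cross-environment accuracy reaches 79.3%, versus only 27.4% for the best baseline.
Using low-resolution features improves recognition accuracy, which demonstrates the model's effectiveness in handling visual variation.
The confusion matrix shows that 'pick up with one hand' and 'pick up with two hands' are the most frequently confused actions, because their motion and appearance are similar.
On the MSR-DailyActivity3D dataset, MST-AOG reaches 73.1% accuracy using only RGB video input, outperforming methods such as Poselet (23.75%) and Action Bank (23%).
The model successfully detects poses and views. Future work will focus on integrating human-object interaction modeling to improve recognition of complex actions.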


This review was generated by AI and reviewed by human editors.