QUICK REVIEW

[Paper Review] First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations

Guillermo Garcia-Hernando, Shanxin Yuan|arXiv (Cornell University)|Apr 8, 2017

Hand Gesture Recognition Systems | References: 74 | Cited by: 18

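One-Sentence Summary

This paper introduces a new first-person action recognition benchmark with RGB-D videos and 3D hand pose annotations obtained from a magnetic motion capture system, enabling the study of egocentric hand-object interactions. Its main contribution is showing that 3D hand pose features substantially improve action recognition accuracy, reaching 78.73% with ground-truth poses and far exceeding methods that rely on appearance features alone; it also shows that robust pose estimation under occlusion is critical to performance.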

ABSTRACT

In this work we study the use of 3D hand poses to recognize first-person dynamic hand actions interacting with 3D objects. Towards this goal, we collected RGB-D video sequences comprised of more than 100K frames of 45 daily hand action categories, involving 26 different objects in several hand configurations. To obtain hand pose annotations, we used our own mo-cap system that automatically infers the 3D location of each of the 21 joints of a hand model via 6 magnetic sensors and inverse kinematics. Additionally, we recorded the 6D object poses and provide 3D object models for a subset of hand-object interaction sequences. To the best of our knowledge, this is the first benchmark that enables the study of first-person hand actions with the use of 3D hand poses. We present an extensive experimental evaluation of RGB-D and pose-based action recognition by 18 baselines/state-of-the-art approaches. The impact of using appearance features, poses, and their combinations are measured, and the different training/testing protocols are evaluated. Finally, we assess how ready the 3D hand pose estimation field is when hands are severely occluded by objects in egocentric views and its influence on action recognition. From the results, we see clear benefits of using hand pose as a cue for action recognition compared to other data modalities. Our dataset and experiments can be of interest to communities of 3D hand pose estimation, 6D object pose, and robotics as well as action recognition.

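Research Motivation and Goals

Address the lack of datasets of dynamic hand-object interactions captured in real-world egocentric settings with accurate 3D hand pose annotations.
Evaluate the impact of 3D hand pose features, compared with appearance features, on first-person action recognition.
Evaluate how state-of-the-art hand pose estimators perform in real-world, occluded egocentric scenes, and how that performance affects action recognition.
Provide a benchmark for joint hand-object pose estimation to support research in 3D hand pose estimation, robotics, and action recognition.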

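Proposed Method

Collected more than 100K RGB-D frames covering 45 daily hand action categories and 26 objects across three scenarios.
Used a custom magnetic motion capture system that estimates the 3D positions of 21 hand joints from six magnetic sensors attached to the hand, combined with inverse kinematics.
Provided ground-truth 6D object poses and 3D mesh models for 4 objects involved in 10 actions, to support joint hand-object analysis.
Designed training and testing protocols, including cross-subject and cross-object splits, to evaluate how well pose estimation algorithms generalize.
Evaluated 18 baseline and state-of-the-art RGB-D and pose-based action recognition approaches on the dataset, using multiple data modalities and fusion strategies.
Quantified the effect of hand pose estimation error on action recognition by replacing ground-truth poses with estimated poses at inference time.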
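The cross-subject protocol mentioned above can be illustrated with a minimal sketch. The record layout, subject IDs, and action labels below are hypothetical placeholders chosen for illustration; they are not the dataset's actual file format or the paper's exact split.

```python
# Hypothetical sequence records: (subject_id, action_label, sequence_id).
# These values are placeholders, not entries from the real dataset.
sequences = [
    (1, "action_a", "seq_01"),
    (1, "action_b", "seq_02"),
    (2, "action_a", "seq_03"),
    (3, "action_b", "seq_04"),
    (3, "action_a", "seq_05"),
]


def cross_subject_split(records, test_subjects):
    """Assign every sequence of a held-out subject to the test set.

    No subject appears in both sets, so test accuracy reflects
    generalization to people the model has never seen.
    """
    held_out = set(test_subjects)
    train = [r for r in records if r[0] not in held_out]
    test = [r for r in records if r[0] in held_out]
    return train, test


train, test = cross_subject_split(sequences, test_subjects=[3])
train_subjects = {r[0] for r in train}
test_subjects = {r[0] for r in test}
```

A cross-object split would follow the same pattern with an object identifier as the grouping key instead of the subject.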

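Experimental Results

Research Questions

RQ1: How effective are 3D hand pose features, compared with RGB-D appearance features, for first-person action recognition?
RQ2: To what extent does object occlusion degrade hand pose estimation accuracy in egocentric views?
RQ3: How does the generalization of hand pose estimators to unseen subjects and unseen objects vary in real egocentric sequences?
RQ4: How large is the action recognition performance gap between ground-truth 3D hand poses and estimated poses?
RQ5: Can temporal modeling with recurrent neural networks mitigate the negative effect of noisy hand pose estimates on action recognition?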

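Key Findings

With ground-truth 3D hand poses, action recognition accuracy reaches 78.73%, substantially better than appearance-only baselines.
Halving the hand pose estimation error more than doubles action recognition performance.
When the estimated poses come from an estimator trained on data without objects, action recognition accuracy drops from 78.73% (ground-truth poses) to 72.06%, underscoring the need to train on hand-object interaction data.
Hand pose estimation error is lowest for the thumb (12.45 mm) and the index finger (15.48 mm), which are also the most informative parts of the hand for action recognition.
The LSTM-based baseline is robust to noisy pose estimates: thanks to temporal modeling, it maintains acceptable accuracy even when pose error is high.
Cross-object generalization is markedly worse than cross-subject generalization, indicating that object shape and grasp configuration are critical for hand pose estimation.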
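Per-finger errors such as the millimetre figures above are typically reported as the mean Euclidean distance between predicted and ground-truth joint positions. The sketch below shows one way to compute it, assuming a hypothetical 21-joint layout (wrist at index 0, then four joints per finger in the order thumb, index, middle, ring, pinky); the dataset's actual joint ordering may differ.

```python
import math

# Assumed finger order; each finger owns four consecutive joints after the wrist.
FINGERS = ["thumb", "index", "middle", "ring", "pinky"]


def per_finger_error_mm(pred, gt):
    """Mean Euclidean joint error per finger.

    pred, gt: lists of frames, each frame a list of 21 (x, y, z)
    tuples in millimetres, using the assumed layout above.
    """
    sums = {name: 0.0 for name in FINGERS}
    counts = {name: 0 for name in FINGERS}
    for pred_frame, gt_frame in zip(pred, gt):
        for i, name in enumerate(FINGERS):
            for j in range(1 + 4 * i, 5 + 4 * i):
                sums[name] += math.dist(pred_frame[j], gt_frame[j])
                counts[name] += 1
    return {name: sums[name] / counts[name] for name in FINGERS}


# Toy data: two frames, ground truth at the origin, and predictions whose
# thumb joints (indices 1 to 4) are displaced by (3, 4, 0) mm.
gt = [[(0.0, 0.0, 0.0)] * 21 for _ in range(2)]
pred = [
    [(3.0, 4.0, 0.0) if 1 <= j <= 4 else (0.0, 0.0, 0.0) for j in range(21)]
    for _ in range(2)
]
errors = per_finger_error_mm(pred, gt)
```

In this toy case only the thumb carries error, so the per-finger breakdown isolates it while the other fingers report zero.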


This review was generated by AI and reviewed by a human editor.