QUICK REVIEW

[Paper Review] PREDICT & CLUSTER: Unsupervised Skeleton Based Action Recognition

Kun Su, Liu Xiulong|arXiv (Cornell University)|Nov 27, 2019

Human Pose and Action Recognition | Cited by 33

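One-Sentence Summary

This paper proposes PREDICT & CLUSTER, an unsupervised skeleton-based action recognition system that learns separable, clusterable features from raw keypoint sequences by training an encoder-decoder recurrent network on a self-supervised prediction task. Without using any action labels, it performs comparably to supervised skeleton-based methods on three benchmarks, and it outperforms prior unsupervised skeleton-based methods as well as unsupervised RGB+D methods on cross-view tests.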

ABSTRACT

We propose a novel system for unsupervised skeleton-based action recognition. Given inputs of body keypoints sequences obtained during various movements, our system associates the sequences with actions. Our system is based on an encoder-decoder recurrent neural network, where the encoder learns a separable feature representation within its hidden states formed by training the model to perform prediction task. We show that according to such unsupervised training the decoder and the encoder self-organize their hidden states into a feature space which clusters similar movements into the same cluster and distinct movements into distant clusters. Current state-of-the-art methods for action recognition are strongly supervised, i.e., rely on providing labels for training. Unsupervised methods have been proposed, however, they require camera and depth inputs (RGB+D) at each time step. In contrast, our system is fully unsupervised, does not require labels of actions at any stage, and can operate with body keypoints input only. Furthermore, the method can perform on various dimensions of body keypoints (2D or 3D) and include additional cues describing movements. We evaluate our system on three extensive action recognition benchmarks with different number of actions and examples. Our results outperform prior unsupervised skeleton-based methods, unsupervised RGB+D based methods on cross-view tests and while being unsupervised have similar performance to supervised skeleton-based action recognition.

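Research Motivation and Objectives

Develop a fully unsupervised skeleton-based action recognition method that needs no action labels at any stage of training.
Recognize actions from 2D or 3D body keypoint sequences alone, without relying on RGB or depth data.
Learn a separable feature space in which similar movements cluster together and distinct movements fall into distant clusters.
Outperform prior unsupervised skeleton-based methods, and outperform unsupervised RGB+D methods in cross-view evaluation.
Show that self-supervised prediction training yields feature representations comparable to those of supervised methods.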

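Proposed Method

The method uses an encoder-decoder RNN architecture: the encoder processes the keypoint sequence and the decoder generates the predicted frames.
Training is driven by a self-supervised prediction task, described here as predicting the future sequence from the past keypoint sequence.
Through this prediction objective, the encoder's hidden states learn a separable feature representation.
The encoder and decoder jointly self-organize their hidden states into a feature space where similar actions cluster together and distinct actions are separated.
The method is agnostic to input dimensionality and accepts both 2D and 3D keypoint sequences.
Additional motion cues can be added to the input without changing the core architecture.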
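The data flow described in the list above can be sketched in a few lines of plain Python. This is a minimal, untrained illustration and not the authors' implementation: the class `TinySeq2Seq`, the helper names, the plain tanh recurrence, the frame-difference "velocity" cue, and the 4-frame/2-frame past/future split are all assumptions made for the example. It only shows how 2D or 3D keypoints are flattened into per-frame vectors, how an extra cue can be appended, and how the encoder's final hidden state serves as the sequence feature while a decoder output is scored against the target frames.

```python
import math
import random

def flatten_frames(seq):
    """Flatten each frame [(x, y[, z]), ...] into one vector, so 2D and 3D inputs look alike."""
    return [[c for joint in frame for c in joint] for frame in seq]

def add_velocity(frames):
    """Append frame-to-frame differences as an extra motion cue (hypothetical choice of cue)."""
    out = []
    for t, f in enumerate(frames):
        prev = frames[t - 1] if t > 0 else f
        out.append(f + [a - b for a, b in zip(f, prev)])
    return out

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

class TinySeq2Seq:
    """Untrained tanh-RNN encoder-decoder; a stand-in for the paper's recurrent model."""

    def __init__(self, in_dim, hid_dim, seed=0):
        rng = random.Random(seed)
        self.hid_dim = hid_dim
        self.w_in = rand_matrix(hid_dim, in_dim, rng)    # input -> hidden (encoder)
        self.w_hh = rand_matrix(hid_dim, hid_dim, rng)   # hidden -> hidden (encoder)
        self.w_dec = rand_matrix(hid_dim, hid_dim, rng)  # hidden -> hidden (decoder)
        self.w_out = rand_matrix(in_dim, hid_dim, rng)   # hidden -> output frame

    def encode(self, frames):
        h = [0.0] * self.hid_dim
        for x in frames:
            a, b = matvec(self.w_in, x), matvec(self.w_hh, h)
            h = [math.tanh(p + q) for p, q in zip(a, b)]
        return h  # final hidden state, used as the feature vector of the sequence

    def decode(self, h, steps):
        outputs = []
        for _ in range(steps):
            h = [math.tanh(v) for v in matvec(self.w_dec, h)]
            outputs.append(matvec(self.w_out, h))
        return outputs

def mse(pred, target):
    """Mean squared error over all coordinates of all frames."""
    n = sum(len(f) for f in target)
    return sum((p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf)) / n

# Toy input: 6 frames, 3 joints, 2D coordinates.
seq = [[(0.1 * t, 0.2 * j) for j in range(3)] for t in range(6)]
frames = add_velocity(flatten_frames(seq))
past, future = frames[:4], frames[4:]

model = TinySeq2Seq(in_dim=len(frames[0]), hid_dim=8)
feature = model.encode(past)
loss = mse(model.decode(feature, len(future)), future)
```

In a real system the weights would be trained to minimize this loss with a gradient-based optimizer, and `feature` would be collected for every sequence and passed to the clustering or classification stage. Switching to 3D keypoints only changes `in_dim`, which is what makes the architecture independent of input dimensionality.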

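Experimental Results

Research Questions

RQ1: Can a self-supervised prediction objective in an RNN encoder-decoder learn separable, action-discriminative features from raw skeleton sequences without labels?
RQ2: How does the unsupervised method compare with state-of-the-art supervised skeleton-based action recognition models?
RQ3: Do the learned features generalize across views without fine-tuning or labeled data?
RQ4: How does the method compare with existing unsupervised RGB+D and skeleton-only methods in clustering quality and accuracy?
RQ5: To what extent can the model exploit additional motion cues without relying on labeled data?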

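Key Findings

Although fully unsupervised, the method performs comparably to supervised skeleton-based action recognition models.
It outperforms prior unsupervised skeleton-based methods on all three evaluation benchmarks.
On cross-view action recognition it outperforms unsupervised RGB+D methods, which suggests the learned features generalize well across viewpoints.
The self-supervised training objective produces a feature space in which similar actions cluster together and distinct actions are separated.
The method works with both 2D and 3D keypoint inputs and can incorporate additional motion cues.
Performance remains strong even though no action labels are used at any stage of training.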
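One simple way to check whether a learned feature space separates actions, as the findings above claim, is nearest-neighbor classification on the encoder features. The sketch below is an illustration under assumptions, not the evaluation protocol confirmed by this review: the function name `nearest_neighbor_accuracy`, the Euclidean distance, and the toy feature vectors and labels are all hypothetical. Labels appear only at evaluation time; the features themselves would have been learned without them.

```python
import math

def nearest_neighbor_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Assign each test feature the label of its closest training feature (1-NN)."""
    correct = 0
    for feat, label in zip(test_feats, test_labels):
        nearest = min(range(len(train_feats)),
                      key=lambda i: math.dist(feat, train_feats[i]))
        correct += train_labels[nearest] == label
    return correct / len(test_feats)

# Toy features: two well-separated clusters standing in for two actions.
train_feats = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]]
train_labels = ["wave", "wave", "sit", "sit"]
test_feats = [[0.05, 0.1], [1.0, 0.9]]
test_labels = ["wave", "sit"]

accuracy = nearest_neighbor_accuracy(train_feats, train_labels, test_feats, test_labels)
```

If the features cluster by action, this score is high without any classifier being trained on top of them, so it measures the feature space itself.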

This review was generated by AI and checked by a human editor.