QUICK REVIEW

[Paper Review] Cooperative Training of Deep Aggregation Networks for RGB-D Action Recognition

Pichao Wang, Wanqing Li|arXiv (Cornell University)|Dec 5, 2017

Human Pose and Action Recognition | Cited by 42

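One-Sentence Summary

This paper proposes c-ConvNet, a cooperative training framework that jointly optimizes RGB and depth features in a single deep neural network for RGB-D action recognition. By combining a softmax loss with intra-modality and cross-modality triplet ranking losses, the method strengthens feature discriminability and reduces modality discrepancy, achieving state-of-the-art performance on three benchmark datasets, including NTU RGB+D and ChaLearn LAP IsoGD.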

ABSTRACT

A novel deep neural network training paradigm that exploits the conjoint information in multiple heterogeneous sources is proposed. Specifically, in a RGB-D based action recognition task, it cooperatively trains a single convolutional neural network (named c-ConvNet) on both RGB visual features and depth features, and deeply aggregates the two kinds of features for action recognition. Differently from the conventional ConvNet that learns the deep separable features for homogeneous modality-based classification with only one softmax loss function, the c-ConvNet enhances the discriminative power of the deeply learned features and weakens the undesired modality discrepancy by jointly optimizing a ranking loss and a softmax loss for both homogeneous and heterogeneous modalities. The ranking loss consists of intra-modality and cross-modality triplet losses, and it reduces both the intra-modality and cross-modality feature variations. Furthermore, the correlations between RGB and depth data are embedded in the c-ConvNet, and can be retrieved by either of the modalities and contribute to the recognition in the case even only one of the modalities is available. The proposed method was extensively evaluated on two large RGB-D action recognition datasets, ChaLearn LAP IsoGD and NTU RGB+D datasets, and one small dataset, SYSU 3D HOI, and achieved state-of-the-art results.

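Research Motivation and Objectives

Address the modality discrepancy between RGB and depth features in action recognition.
Strengthen the discriminative power of deep features learned from heterogeneous modalities.
Enable a single network to learn cooperatively from RGB and depth inputs, without separate processing streams.
Improve recognition accuracy by embedding cross-modality correlations that remain useful even when only one modality is available.
Enable effective fine-tuning on small datasets through dynamic image representations and ImageNet-pretrained models.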

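Proposed Method

The method uses rank pooling to encode RGB and depth video sequences into dynamic images (VDIs and DDIs) that preserve spatio-temporal structure.
A shared c-ConvNet architecture processes both visual dynamic images (VDIs, from RGB) and depth dynamic images (DDIs) in a single network.
The network is trained jointly with a softmax loss for classification and a multi-component ranking loss that reduces feature discrepancy.
The ranking loss comprises intra-modality triplet losses (within RGB or within depth) and cross-modality triplet losses (between RGB and depth), which minimize modality-specific and cross-modality variations.
The overall objective is a weighted combination of the ranking loss and the softmax loss, controlled by the hyperparameter λ.
At inference, a product-score fusion strategy combines the predictions from the four dynamic-image channels (DDIf, VDIf, DDIb, VDIb) to improve final accuracy.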
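
The joint objective described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation. It assumes a hinge triplet loss on squared Euclidean distances, equal weighting of the intra-modality and cross-modality terms, and a total loss of softmax plus λ times the ranking loss. The function names (`softmax_cross_entropy`, `triplet_loss`, `joint_loss`), the dictionary layout of the inputs, and the default values are hypothetical.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def triplet_loss(anchor, positive, negative, margin):
    """Hinge triplet loss on squared Euclidean distances (assumed form)."""
    d_pos = ((anchor - positive) ** 2).sum(axis=1)
    d_neg = ((anchor - negative) ** 2).sum(axis=1)
    return np.maximum(0.0, d_pos - d_neg + margin).mean()


def joint_loss(logits, labels, rgb, depth, margin=0.1, lam=1.0):
    """Softmax loss plus lam times the ranking loss.

    rgb and depth are dicts holding 'anchor', 'pos', and 'neg' feature
    arrays of shape (batch, dim). 'pos' shares the anchor's action class,
    'neg' does not. This layout is an assumption for illustration.
    """
    # Intra-modality: anchor, positive, and negative from the same modality.
    intra = (triplet_loss(rgb["anchor"], rgb["pos"], rgb["neg"], margin)
             + triplet_loss(depth["anchor"], depth["pos"], depth["neg"], margin))
    # Cross-modality: anchor from one modality, positive/negative from the other.
    cross = (triplet_loss(rgb["anchor"], depth["pos"], depth["neg"], margin)
             + triplet_loss(depth["anchor"], rgb["pos"], rgb["neg"], margin))
    return softmax_cross_entropy(logits, labels) + lam * (intra + cross)
```

In this sketch a larger `lam` puts more emphasis on pulling same-class features together across and within modalities relative to plain classification accuracy; how the paper samples triplets is not covered by this review and is left out here.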

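Experimental Results

Research Questions

RQ1 Can a single deep neural network learn effectively from the RGB and depth modalities in a cooperative manner, rather than processing them independently?
RQ2 How can the modality discrepancy between RGB and depth features be minimized during joint training to improve generalization?
RQ3 To what extent can cross-modality correlations be embedded in a shared network so that one modality still supports recognition when the other is missing?
RQ4 Does jointly optimizing a softmax loss and multi-level triplet ranking losses produce more discriminative features than conventional single-loss training?
RQ5 How sensitive is performance to key hyperparameters, such as the margin α in the triplet loss and the weight λ between the intra-modality and cross-modality losses?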

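Key Findings

On the NTU RGB+D dataset (cross-subject setting), the proposed method reaches 89.08% accuracy with product-score fusion, outperforming average fusion and max fusion.
On the ChaLearn LAP IsoGD dataset, the method reaches 44.80% accuracy with product-score fusion, clearly outperforming average fusion (43.48%) and max fusion (42.01%).
On the small SYSU 3D HOI dataset, the method reaches 98.33% accuracy with product-score fusion, demonstrating its effectiveness when data is limited.
The optimal triplet-loss margin α is 0.1 on NTU RGB+D and 0.2 on LAP IsoGD; larger values cause a marked drop in accuracy.
The weight λ, which balances the intra-modality and cross-modality triplet losses, has a moderate effect; higher values (e.g., λ=5) improve performance on difficult datasets such as LAP IsoGD.
The method achieves state-of-the-art results on all three datasets, confirming the effectiveness of cooperative training and joint loss optimization.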
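
The three fusion rules compared in the findings (product, average, max) can be illustrated with a short sketch. It assumes each of the four channels outputs a vector of per-class softmax scores. The function name `fuse_scores` and the example scores are made up for illustration and are not taken from the paper.

```python
import numpy as np


def fuse_scores(channel_scores, method="product"):
    """Fuse per-channel class scores and return (predicted_class, fused_scores).

    channel_scores has shape (channels, classes), one row per
    dynamic-image channel (for example DDIf, VDIf, DDIb, VDIb).
    """
    scores = np.asarray(channel_scores, dtype=float)
    if method == "product":
        fused = scores.prod(axis=0)   # element-wise product across channels
    elif method == "average":
        fused = scores.mean(axis=0)   # element-wise mean across channels
    elif method == "max":
        fused = scores.max(axis=0)    # element-wise maximum across channels
    else:
        raise ValueError(f"unknown fusion method: {method}")
    return int(fused.argmax()), fused


# Hypothetical scores: three channels mildly favor class 0 over class 1,
# while the fourth strongly favors class 2 and nearly rules out class 0.
example = [
    [0.50, 0.40, 0.10],
    [0.50, 0.40, 0.10],
    [0.50, 0.40, 0.10],
    [0.02, 0.38, 0.60],
]
for method in ("product", "average", "max"):
    predicted, _ = fuse_scores(example, method)
    print(method, predicted)
```

With these made-up scores, product fusion penalizes class 0 heavily because one channel assigns it a near-zero score, whereas max fusion follows the single most confident channel. This is one intuition for why the rules can disagree; it does not by itself explain the accuracy gaps reported above.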

This review was generated by AI and reviewed by a human editor.