QUICK REVIEW

[Paper Review] Numerical Coordinate Regression with Convolutional Neural Networks

Aiden Nibali, Zhen He|arXiv (Cornell University)|Jan 23, 2018

Human Pose and Action Recognition | References: 17 | Cited by: 186

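One-Sentence Summary

The paper introduces the differentiable spatial to numerical transform (DSNT), a layer that converts heatmaps to coordinates without adding trainable parameters. DSNT keeps coordinate regression tasks such as pose estimation trainable end to end, offers a better trade-off between inference speed and accuracy, and generally outperforms heatmap matching and fully connected output layers.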

ABSTRACT

We study deep learning approaches to inferring numerical coordinates for points of interest in an input image. Existing convolutional neural network-based solutions to this problem either take a heatmap matching approach or regress to coordinates with a fully connected output layer. Neither of these approaches is ideal, since the former is not entirely differentiable, and the latter lacks inherent spatial generalization. We propose our differentiable spatial to numerical transform (DSNT) to fill this gap. The DSNT layer adds no trainable parameters, is fully differentiable, and exhibits good spatial generalization. Unlike heatmap matching, DSNT works well with low heatmap resolutions, so it can be dropped in as an output layer for a wide range of existing fully convolutional architectures. Consequently, DSNT offers a better trade-off between inference speed and prediction accuracy compared to existing techniques. When used to replace the popular heatmap matching approach used in almost all state-of-the-art methods for pose estimation, DSNT gives better prediction accuracy for all model architectures tested.

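Motivation and Objectives

Address the limitations of heatmap matching and fully connected output layers for coordinate regression with CNNs.
Propose a differentiable, parameter-free DSNT layer that preserves both spatial generalization and end-to-end differentiability.
Evaluate DSNT across different CNN architectures on the MPII Human Pose dataset, measuring both accuracy and inference efficiency.
Explore regularization strategies that encourage meaningful heatmap shapes and improve coordinate predictions.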

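Proposed Method

Define DSNT as a differentiable layer that takes a single-channel normalized heatmap and computes the coordinates as the mean of a discrete bivariate distribution.
Express the coordinates as a 2D expectation over X and Y coordinate grids, which gives sub-pixel precision and allows backpropagation.
Apply a heatmap activation function (softmax, abs, ReLU, or sigmoid) to produce the normalized heatmap; softmax was found to perform best.
Train end to end with a Euclidean coordinate loss instead of a heatmap loss, so that the loss directly targets coordinate accuracy.
Introduce regularization terms (variance, and distribution divergences such as KL/JS) to shape the heatmaps and improve accuracy.
Compare DSNT against heatmap matching and fully connected outputs on ResNet and stacked hourglass architectures across several heatmap resolutions.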
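
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names (`softmax2d`, `coordinate_grids`, `dsnt`, `euclidean_loss`) are hypothetical, and the grid convention (pixel centres placed inside (-1, 1)) is an assumption made for this example. In a real model the same arithmetic would run inside an autodiff framework so that gradients flow back through the heatmap.

```python
import numpy as np

def softmax2d(logits):
    """Normalize a 2D array of raw scores into a heatmap that sums to 1."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def coordinate_grids(height, width):
    """Per-pixel x and y coordinates (assumed convention: centres in (-1, 1))."""
    xs = (2.0 * np.arange(1, width + 1) - (width + 1)) / width
    ys = (2.0 * np.arange(1, height + 1) - (height + 1)) / height
    grid_x, grid_y = np.meshgrid(xs, ys)  # both have shape (height, width)
    return grid_x, grid_y

def dsnt(heatmap):
    """Expected (x, y) location under the normalized heatmap."""
    grid_x, grid_y = coordinate_grids(*heatmap.shape)
    return np.array([(heatmap * grid_x).sum(), (heatmap * grid_y).sum()])

def euclidean_loss(pred, target):
    """Euclidean distance between predicted and target coordinates."""
    return float(np.linalg.norm(pred - target))

# Example: raw scores with a strong peak at row 1, column 3 of a 5x5 map.
logits = np.zeros((5, 5))
logits[1, 3] = 10.0
heatmap = softmax2d(logits)
coords = dsnt(heatmap)  # close to (0.4, -0.4), the centre of that pixel
loss = euclidean_loss(coords, np.array([0.4, -0.4]))
```

Because the output is a weighted average over every pixel, it varies continuously with the heatmap values, which is what makes sub-pixel predictions and backpropagation possible.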

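Experimental Results

Research Questions

RQ1: Can DSNT make coordinate regression trainable end to end while preserving spatial generalization?
RQ2: Does DSNT outperform conventional heatmap matching and fully connected approaches across architectures and heatmap resolutions?
RQ3: Which regularization strategies best improve DSNT's performance and heatmap quality?
RQ4: How do DSNT-based models compare with state-of-the-art pose estimation architectures in accuracy and inference speed?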

Key Findings

Head	Shoulder	Elbow	Wrist	Hip	Knee	Ankle	Total	Time (ms)	Memory
97.8	96.0	90.0	84.3	89.8	85.2	79.7	89.5	18.6 ± 0.5	636 MiB
97.6	95.6	89.6	83.9	89.2	84.8	79.0	89.0	N/A	N/A
97.9	95.1	89.9	85.3	89.4	85.7	81.7	89.7	41.3 ± 0.2	1432 MiB
98.2	96.3	91.2	87.1	90.1	87.4	83.6	90.9	60.5 ± 0.1	1229 MiB
98.5	96.7	92.5	88.7	91.1	88.6	86.0	92.0	194.6 ± 76.8	1476 MiB

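On the MPII Human Pose dataset, DSNT consistently outperforms heatmap matching and fully connected outputs across the architectures tested.
Even at low heatmap resolutions (e.g. 7x7), DSNT is more accurate than heatmap matching, and it remains robust as the resolution increases.
Regularization, particularly Jensen-Shannon distribution regularization, improves accuracy over unregularized DSNT, and the results are robust to the choice of target Gaussian parameters.
DSNT with a ResNet-50 backbone (28px heatmaps) reaches accuracy competitive with larger hourglass models while running significantly faster and using less memory.
Unlike argmax-based methods, DSNT supports sub-pixel coordinate predictions and full backpropagation through the coordinate outputs.
Compared with stacked hourglass models, DSNT-based ResNets offer a more favorable speed/memory trade-off with only a modest drop in accuracy.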
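
To make the Jensen-Shannon regularization finding more concrete, the sketch below computes the divergence between a predicted heatmap and a target Gaussian centred on the ground-truth location. It is an illustration under stated assumptions, not the paper's code: the helper names (`gaussian_target`, `kl_divergence`, `js_divergence`) are hypothetical, `sigma` is assumed to be expressed in the same normalized (-1, 1) units as the pixel centres, and the `eps` term is only a numerical safeguard.

```python
import numpy as np

def gaussian_target(height, width, mean_xy, sigma):
    """Discrete 2D Gaussian over the pixel grid, normalized to sum to 1.

    Assumption: mean_xy and sigma use the same normalized (-1, 1)
    coordinates as the pixel centres computed below.
    """
    xs = (2.0 * np.arange(1, width + 1) - (width + 1)) / width
    ys = (2.0 * np.arange(1, height + 1) - (height + 1)) / height
    grid_x, grid_y = np.meshgrid(xs, ys)
    sq_dist = (grid_x - mean_xy[0]) ** 2 + (grid_y - mean_xy[1]) ** 2
    density = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return density / density.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of the same shape."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded above by log(2)."""
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Example: a heatmap that matches the target incurs no penalty,
# while a flat (uninformative) heatmap is penalized.
target = gaussian_target(7, 7, mean_xy=(0.0, 0.0), sigma=0.3)
uniform = np.full((7, 7), 1.0 / 49.0)
reg_same = js_divergence(target, target)
reg_uniform = js_divergence(uniform, target)
```

In training, a term like this would be added to the Euclidean coordinate loss with a weighting coefficient; the value of that coefficient and of `sigma` are tuning choices not given in this summary.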


This review was generated by AI and reviewed by a human editor.