QUICK REVIEW

[Paper Review] Pose Invariant Embedding for Deep Person Re-identification

Liang Zheng, Yujia Huang|arXiv (Cornell University)|Jan 26, 2017

Video Surveillance and Tracking Methods | References: 23 | Cited by: 178

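One-sentence summary

This paper proposes PIE, a pose invariant embedding built on the PoseBox structure. A PoseBox Fusion (PBF) network fuses the original image, the PoseBox, and the pose estimation confidence so that person re-identification stays robust under pose and detector variations.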

ABSTRACT

Pedestrian misalignment, which mainly arises from detector errors and pose variations, is a critical problem for a robust person re-identification (re-ID) system. With bad alignment, the background noise will significantly compromise the feature learning and matching process. To address this problem, this paper introduces the pose invariant embedding (PIE) as a pedestrian descriptor. First, in order to align pedestrians to a standard pose, the PoseBox structure is introduced, which is generated through pose estimation followed by affine transformations. Second, to reduce the impact of pose estimation errors and information loss during PoseBox construction, we design a PoseBox fusion (PBF) CNN architecture that takes the original image, the PoseBox, and the pose estimation confidence as input. The proposed PIE descriptor is thus defined as the fully connected layer of the PBF network for the retrieval task. Experiments are conducted on the Market-1501, CUHK03, and VIPeR datasets. We show that PoseBox alone yields decent re-ID accuracy and that when integrated in the PBF network, the learned PIE descriptor produces competitive performance compared with the state-of-the-art approaches.

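Research motivation and objectives

Address the misalignment problem in person re-identification caused by pose variations and detector errors.
Propose the PoseBox to normalize pose, and a three-stream PoseBox Fusion network to mitigate pose estimation errors.
Learn a robust PIE descriptor that is competitive with state-of-the-art methods on standard benchmark datasets.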

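Proposed method

Construct the PoseBox from body joints detected by CPM-based pose estimation (Convolutional Pose Machines), and project the result into three types (PoseBox1, PoseBox2, PoseBox3).
Introduce a three-stream PoseBox Fusion (PBF) network whose inputs are the PoseBox, the original image, and a 14-dimensional pose estimation confidence vector; the two image streams use separate CNNs, and their outputs are concatenated with a projected confidence vector before the final fully connected layer.
Define PIE as the fused fully connected (FC) activations: PIE(A, FC7)/PIE(A, FC8) for AlexNet and PIE(R, Pool5)/PIE(R, FC) for ResNet-50.
Train with the sum of three softmax losses, one for each of the three inputs; apply ReLU to the PIE embedding and use Euclidean distance for retrieval.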
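The fusion, loss, and retrieval steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`fuse`, `total_loss`, `rank_gallery`), the toy feature values, and the tiny dimensions are all hypothetical, and the CNN streams themselves are replaced by hand-written feature lists.

```python
import math

def relu(vec):
    """Element-wise ReLU, applied to the embedding before matching."""
    return [max(0.0, x) for x in vec]

def fuse(img_feat, posebox_feat, conf_feat):
    """Concatenate the two image-stream features with the projected confidence vector."""
    return list(img_feat) + list(posebox_feat) + list(conf_feat)

def softmax_cross_entropy(logits, label):
    """Softmax loss for one sample, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def total_loss(logit_sets, label):
    """Sum of the softmax losses, one per set of classifier logits."""
    return sum(softmax_cross_entropy(logits, label) for logits in logit_sets)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(query, gallery):
    """Return gallery indices sorted by ascending Euclidean distance to the query."""
    q = relu(query)
    dists = [euclidean(q, relu(g)) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: dists[i])

# Toy example: 2-dim image features, 2-dim PoseBox features, 1-dim confidence (hypothetical values).
query = fuse([0.9, -0.2], [0.8, 0.1], [0.5])
gallery = [
    fuse([-0.5, 0.7], [0.0, 0.9], [0.4]),    # a different-looking person
    fuse([0.85, -0.1], [0.75, 0.2], [0.6]),  # close to the query
]
ranking = rank_gallery(query, gallery)
print(ranking)
```

In this toy setup the second gallery entry is ranked first because its fused, rectified features lie closest to the query. In a real system the three feature groups would come from the trained PBF network rather than from literals.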

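Experimental results

Research questions

RQ1: Does PoseBox-based normalization improve re-ID performance under misalignment caused by pose variations and detector errors?
RQ2: Does multi-stream fusion that includes the pose estimation confidence outperform single-stream PoseBox or original-image baselines?
RQ3: How does including the arms or head in PoseBox construction affect re-ID accuracy?
RQ4: How does PIE compare with existing methods on the Market-1501, CUHK03, and VIPeR datasets?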

Main findings

Method	Dim	Market-1501 Rank-1	Market-1501 Rank-5	Market-1501 Rank-10	Market-1501 Rank-20	Market-1501 mAP	CUHK03 Rank-1	CUHK03 Rank-5	CUHK03 Rank-10	CUHK03 Rank-20	CUHK03 mAP	VIPeR Rank-1	VIPeR Rank-5	VIPeR Rank-10	VIPeR Rank-20
Baseline1 (R, Pool5)	2,048	73.02	87.44	91.24	94.70	47.62	51.60	79.60	87.70	95.00	23.42	42.31	51.96	63.80	-
Baseline1 (R, FC)	751	70.58	84.95	90.02	93.53	45.84	54.80	84.20	91.70	97.60	15.85	28.80	37.41	47.85	-
PIE (R, Pool5)	4,108	78.65	90.26	93.59	95.69	53.87	57.10	84.60	91.40	96.20	43.01	60.22	71.??	??.??	60.22
PIE (R, FC)	751	75.12	88.27	92.28	94.77	51.57	61.50	89.30	94.50	97.60	23.80	37.88	47.31	56.55	-
PIE (A, FC7)	8,206	64.61	82.07	87.83	91.75	38.95	59.80	85.35	91.85	95.85	21.77	38.04	46.61	56.61	-
PIE (A, FC8)	751	65.68	82.51	87.89	91.63	41.12	62.40	88.00	93.70	96.50	18.10	31.20	38.92	49.40	-
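The Rank-k and mAP columns in the table are standard retrieval metrics. As a rough sketch of how such numbers are obtained from ranked gallery lists, the following Python computes rank-k accuracy and mean average precision. The helper names and the toy labels are hypothetical, and the sketch assumes a simplified protocol: benchmark evaluation code typically also filters certain gallery images (for example same-camera matches), which is omitted here.

```python
def rank_k_hit(ranked_labels, query_label, k):
    """Return 1 if a correct identity appears in the top-k results, else 0."""
    return int(query_label in ranked_labels[:k])

def average_precision(ranked_labels, query_label):
    """Average of the precision values measured at each correct match."""
    hits = 0
    precision_sum = 0.0
    for pos, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / pos
    return precision_sum / hits if hits else 0.0

def evaluate(rankings, ks=(1, 5, 10, 20)):
    """rankings: list of (query_label, ranked_gallery_labels) pairs. Returns percentages."""
    n = len(rankings)
    cmc = {k: 100.0 * sum(rank_k_hit(r, q, k) for q, r in rankings) / n for k in ks}
    mean_ap = 100.0 * sum(average_precision(r, q) for q, r in rankings) / n
    return cmc, mean_ap

# Toy example with two queries (hypothetical identity labels).
rankings = [
    ("a", ["a", "b", "a"]),  # correct match at rank 1
    ("b", ["a", "a", "b"]),  # correct match only at rank 3
]
cmc, mean_ap = evaluate(rankings)
print(cmc, round(mean_ap, 2))
```

Rank-k only asks whether any correct match appears in the first k results, while mAP rewards placing all correct matches early, which is why the two columns can move differently across methods.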

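PIE consistently outperforms strong baselines on the Market-1501, CUHK03, and VIPeR datasets.
On Market-1501, PIE with ResNet-50 reaches 78.65% rank-1 accuracy and 53.87% mAP (the PIE (R, Pool5) variant).
The PIE (Pool5, img) and PIE (Pool5, pb) variants outperform Baseline1 and Baseline2 on all metrics, indicating that the original image and the PoseBox are fused effectively.
PoseBox2 (torso + legs + arms) outperforms PoseBox1 (torso + legs), and PoseBox3 (which adds the head) brings a marginal gain; after PBF fusion, however, these gaps narrow.
PIE with AlexNet and PIE with ResNet-50 achieve results that are competitive with, and in some cases ahead of, existing methods; PIE+KISSME achieves the best performance on some benchmarks.
Ablation studies show that removing either the original image stream or the PoseBox stream lowers performance, which points to the complementary value of fusion and to the reliability signal carried by the confidence vector.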
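To make the size of the improvement concrete, the short Python sketch below computes the gap between PIE and Baseline1 on Market-1501, using only the rank-1 and mAP values reported in the table above. The dictionary layout and the `gain` helper are illustrative assumptions.

```python
# Market-1501 rank-1 and mAP values copied from the results table (ResNet-50 rows).
market = {
    "Baseline1 (R, Pool5)": {"rank1": 73.02, "mAP": 47.62},
    "PIE (R, Pool5)": {"rank1": 78.65, "mAP": 53.87},
    "Baseline1 (R, FC)": {"rank1": 70.58, "mAP": 45.84},
    "PIE (R, FC)": {"rank1": 75.12, "mAP": 51.57},
}

def gain(pie_name, baseline_name):
    """Absolute improvement in percentage points, rounded to two decimals."""
    return {
        metric: round(market[pie_name][metric] - market[baseline_name][metric], 2)
        for metric in ("rank1", "mAP")
    }

pool5_gain = gain("PIE (R, Pool5)", "Baseline1 (R, Pool5)")
fc_gain = gain("PIE (R, FC)", "Baseline1 (R, FC)")
print(pool5_gain, fc_gain)
```

For both the Pool5 and FC features, the rank-1 gain is roughly 4.5 to 5.6 points and the mAP gain roughly 5.7 to 6.3 points over the corresponding baseline.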

This review was generated by AI and reviewed by human editors.