QUICK REVIEW

[Paper Review] End-to-End Deep Learning for Person Search

Tong Xiao, Shuang Li|arXiv (Cornell University)|Apr 7, 2016

Video Surveillance and Tracking Methods|References: 55|Cited by: 156

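One-Sentence Summary

This paper proposes an end-to-end deep learning framework for person search in open-world scenes that jointly localizes and re-identifies persons without relying on annotated candidate bounding boxes. Using a random sampling softmax (cross-entropy) loss to handle sparse and unbalanced labels, the method achieves the best results among the compared baselines on a newly collected large-scale, scene-diversified person search dataset containing 18,184 images, 8,432 persons, and 99,809 annotated bounding boxes.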

ABSTRACT

Existing person re-identification (re-id) benchmarks and algorithms mainly focus on matching cropped pedestrian images between queries and candidates. However, it is different from real-world scenarios where the annotations of pedestrian bounding boxes are unavailable and the target person needs to be found from whole images. To close the gap, we investigate how to localize and match query persons from the scene images without relying on the annotations of candidate boxes. Instead of breaking it down into two separate tasks, pedestrian detection and person re-id, we propose an end-to-end deep learning framework to jointly handle both tasks. A random sampling softmax loss is proposed to effectively train the model under the supervision of sparse and unbalanced labels. On the other hand, existing benchmarks are small in scale and the samples are collected from a few fixed camera views with low scene diversities. To address this issue, we collect a large-scale and scene-diversified person search dataset, which contains 18,184 images, 8,432 persons, and 99,809 annotated bounding boxes. We evaluate our approach and other baselines on the proposed dataset, and study the influence of various factors. Experiments show that our method achieves the best result.
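
To make the task concrete, the matching half of person search can be viewed as ranking every detected box in the gallery images by its similarity to the query person's feature. The sketch below is an illustration only: it assumes a detector has already produced boxes with feature vectors, and it scores them with cosine similarity, which the abstract does not specify. The names `cosine_similarity` and `rank_gallery` and the toy data are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def rank_gallery(query_feature, gallery_detections):
    """Rank detected boxes from all gallery images against one query.

    gallery_detections: list of (image_id, box, feature) tuples, where box is
    (x, y, width, height). In a real system these would come from the
    detection branch of the network; here they are supplied by hand.
    """
    scored = [
        (cosine_similarity(query_feature, feature), image_id, box)
        for image_id, box, feature in gallery_detections
    ]
    # Highest similarity first; sort on the score only.
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Toy data (assumed values, not from the paper).
query = [1.0, 0.0]
detections = [
    ("img_a", (200, 30, 40, 110), [0.0, 1.0]),
    ("img_a", (10, 20, 50, 120), [0.9, 0.1]),
    ("img_b", (5, 5, 45, 100), [-1.0, 0.0]),
]
ranked = rank_gallery(query, detections)
```

The top entry of `ranked` is the box whose feature points in nearly the same direction as the query. No cropped candidate images are passed in, only whole-image detections, which is the distinction the abstract draws from conventional re-id benchmarks.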

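Motivation and Objectives

Bridge the gap between existing person re-id benchmarks, which are designed for cropped pedestrian images, and real-world scenarios, where pedestrian bounding boxes are unavailable and the target must be found in whole scene images.
Develop a unified deep learning framework that performs pedestrian localization and re-identification jointly, rather than splitting them into separate detection and re-id stages.
Address the challenge of training with sparse and unbalanced labels in person search by introducing a random sampling softmax loss.
Build a large-scale, scene-diversified person search dataset to support more realistic and robust evaluation of person search methods.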

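Proposed Method

Proposes an end-to-end deep learning architecture that jointly predicts pedestrian bounding boxes and the features used for re-identification.
Introduces a random sampling softmax (cross-entropy) loss to improve training stability and performance when only a few positive samples are available for each query person.
Trains the model end to end under sparse identity labels; at search time only the query person is specified, and no annotated candidate boxes are required for the scene images.
Uses feature maps from a shared backbone network so that detection and re-id predictions are produced in a unified feature space.
Adopts a multi-task learning objective that optimizes localization and re-identification together during backpropagation.
Designs the loss to randomly sample negative candidates during training, with the aim of preventing model collapse and improving generalization when labels are scarce.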
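
The random sampling idea above can be sketched in a few lines. This review does not give the loss formula, so the block below is a minimal illustration under stated assumptions: the loss always keeps the ground-truth identity, draws a uniform random subset of the other identities as negatives, and computes ordinary softmax cross-entropy over that subset only. The function name `random_sampling_softmax_loss` and the parameter `num_sampled` are hypothetical, and plain Python stands in for a deep learning framework.

```python
import math
import random


def random_sampling_softmax_loss(logits, target, num_sampled, rng):
    """Cross-entropy over the target class plus a random subset of other classes.

    logits:      list of class scores, one per identity
    target:      index of the ground-truth identity
    num_sampled: number of non-target classes to keep (assumed parameter)
    rng:         a random.Random instance, so sampling is reproducible
    Returns (loss, subset), where subset[0] is always the target index.
    """
    # Candidate negatives: every class except the ground truth.
    negatives = [c for c in range(len(logits)) if c != target]
    sampled = rng.sample(negatives, num_sampled)
    subset = [target] + sampled

    # Log-sum-exp over the subset, shifted by the maximum for stability.
    scores = [logits[c] for c in subset]
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))

    # Negative log-probability of the target within the sampled subset.
    return log_norm - logits[target], subset


# Toy usage with assumed sizes: 1000 identities, 99 sampled negatives.
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(1000)]
loss, subset = random_sampling_softmax_loss(logits, target=3, num_sampled=99, rng=rng)
```

Because the normalizer runs over a subset, each update touches only `num_sampled + 1` classes instead of every identity. Setting `num_sampled` to the number of classes minus one recovers the full softmax cross-entropy.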

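Experimental Results

Research Questions

RQ1: Can an end-to-end deep learning model effectively perform joint person localization and re-identification without relying on annotated candidate bounding boxes?
RQ2: How does the proposed random sampling softmax loss improve model performance under sparse and unbalanced supervision?
RQ3: To what extent do dataset scale and scene diversity affect the performance of person search models?
RQ4: How does the proposed method compare with pipeline approaches that separate detection and re-identification?

Key Findings

The proposed end-to-end framework achieves the best results on the newly collected person search dataset, outperforming the baselines it is compared against.
The random sampling softmax loss improves both training convergence speed and model accuracy under sparse and unbalanced labels.
The large-scale, scene-diversified dataset contains 18,184 images, 8,432 persons, and 99,809 annotated bounding boxes, enabling more realistic evaluation of person search systems.
Experiments show that learning localization and re-identification jointly outperforms a pipeline that separates detection and re-id.
Owing to the diversity of the training data, the model is robust to variations in scene complexity and camera viewpoint.
Ablation studies confirm that the proposed loss is essential for handling the imbalance between positive and negative samples during training.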

This review was generated by AI and checked by a human editor.