QUICK REVIEW

[Paper Review] PersonNet: Person Re-identification with Deep Convolutional Neural Networks

Lin Wu, Chunhua Shen | arXiv (Cornell University) | 2016. 01. 27.

Video Surveillance and Tracking Methods | References: 2 | Citations: 210

One-Line Summary

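PersonNet introduces a very deep Siamese CNN with 3x3 filters and a neighborhood difference layer to jointly learn features and a similarity metric for person re-identification, and achieves state-of-the-art performance on several datasets.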

ABSTRACT

In this paper, we propose a deep end-to-end neural network to simultaneously learn high-level features and a corresponding similarity metric for person re-identification. The network takes a pair of raw RGB images as input, and outputs a similarity value indicating whether the two input images depict the same person. A layer of computing neighborhood range differences across two input images is employed to capture local relationship between patches. This operation is to seek a robust feature from input images. By increasing the depth to 10 weight layers and using very small (3$\times$3) convolution filters, our architecture achieves a remarkable improvement on the prior-art configurations. Meanwhile, an adaptive Root-Mean-Square (RMSProp) gradient descent algorithm is integrated into our architecture, which is beneficial to deep nets. Our method consistently outperforms state-of-the-art on two large datasets (CUHK03 and Market-1501), and a medium-sized data set (CUHK01).

Motivation and Objectives

Develop a deep end-to-end network that jointly learns robust features and a similarity metric for person re-identification.
Increase network depth with small 3x3 convolutions to improve discriminative power under cross-view variations.
Incorporate a neighborhood difference layer to model local patch relations and misalignment between camera views.
Adopt RMSProp for adaptive gradient updates to facilitate training of a deep network.
Demonstrate state-of-the-art performance on multiple large re-id benchmarks.
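For readers unfamiliar with RMSProp, the adaptive update mentioned above can be sketched as follows. This is the commonly described form of the rule (a running average of squared gradients scales each step), not code from the paper. The function name `rmsprop_step` and the values of `lr`, `decay`, and `eps` are illustrative assumptions; the review does not state the paper's hyperparameters.

```python
import math


def rmsprop_step(params, grads, cache, lr=0.001, decay=0.9, eps=1e-8):
    """Apply one RMSProp update to a flat list of parameters.

    cache holds the running average of squared gradients, one value per
    parameter. lr, decay, and eps are assumed defaults, not the paper's.
    """
    new_params, new_cache = [], []
    for p, g, c in zip(params, grads, cache):
        # Exponential moving average of the squared gradient.
        c = decay * c + (1.0 - decay) * g * g
        new_cache.append(c)
        # Scale the step by the root of that average (per parameter).
        new_params.append(p - lr * g / (math.sqrt(c) + eps))
    return new_params, new_cache


# Toy usage: minimize f(w) = w^2, whose gradient is 2w.
w, cache = [1.0], [0.0]
for _ in range(500):
    w, cache = rmsprop_step(w, [2.0 * w[0]], cache, lr=0.01)
```

Because each step is divided by the recent gradient magnitude, the effective step size stays roughly constant even when raw gradients shrink or grow across layers, which is the usual argument for using it in deep networks.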

Proposed Method

Use a pair of RGB images as input to a Siamese-like network with tied weights across views.
Stack 3x3 convolutional layers with max-pooling, followed by a neighborhood patch matching layer that computes cross-view differences in local patches.
Include a patch summary layer and subsequent convolutional/max-pooling layers, ending in three fully connected layers for a softmax similarity decision (same/different).
Employ a 10-layer deep architecture with 3x3 receptive fields to increase non-linearity and representation capacity.
Use hyperbolic tangent activations and RMSProp instead of standard SGD for training the deep network.
Apply online sampling of image pairs and data augmentation (translation and horizontal reflection) to balance positive/negative pairs.
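The neighborhood patch matching step can be illustrated with a small sketch. It shows one common formulation of a cross-view neighborhood difference: each value in one view's feature map is compared against the k x k neighborhood around the same position in the other view. The exact PersonNet formulation is not given in this review, so the function name `neighborhood_difference`, the neighborhood size `k`, and the zero padding at the borders are all assumptions.

```python
def neighborhood_difference(f, g, k=5):
    """Return, for each position (y, x), the k x k grid f[y][x] - g[y+dy][x+dx].

    f and g are 2-D lists of equal shape (one feature-map channel per view).
    Positions of g that fall outside the map are treated as 0 (assumed padding).
    """
    if k % 2 != 1:
        raise ValueError("k must be odd")
    h, w = len(f), len(f[0])
    r = k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            patch = []
            for dy in range(-r, r + 1):
                patch_row = []
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    inside = 0 <= yy < h and 0 <= xx < w
                    g_val = g[yy][xx] if inside else 0.0
                    patch_row.append(f[y][x] - g_val)
                patch.append(patch_row)
            row.append(patch)
        out.append(row)
    return out


# Toy usage: two identical 2x2 maps, 3x3 neighborhoods.
view_a = [[1.0, 2.0], [3.0, 4.0]]
diff = neighborhood_difference(view_a, view_a, k=3)
```

When the two views are identical and aligned, the center entry of every patch is zero. A shifted match shows up as a near-zero entry away from the center, which is how a layer of this kind can tolerate small misalignment between camera views.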

Experimental Results

Research Questions

RQ1: Can a deeper CNN with small 3x3 filters improve person re-identification accuracy over prior architectures?
RQ2: Does incorporating a neighborhood patch difference layer help model local cross-view variations and misalignment?
RQ3: What is the impact of RMSProp as an optimization method for training deep networks in re-id tasks?
RQ4: How does the proposed method perform on CUHK03, CUHK01, and Market-1501 compared to state-of-the-art methods?

Key Results

Among the methods compared in the paper, PersonNet achieves the best rank-1 accuracy on CUHK03, CUHK01, and Market-1501.
On CUHK03, Rank-1 is 64.80% and Rank-20 is 98.20%, outperforming prior methods.
On CUHK01, Rank-1 is 71.14%, with Rank-5 90.07%, Rank-10 95.00%, and Rank-20 98.06%.
On Market-1501, Rank-1 is 37.21% and mAP is 18.57%.
The convergence study shows RMSProp provides more stable and faster convergence than SGD for this deep architecture.
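The Rank-k figures above are read as follows: a query counts as a hit at rank k if an image of the same person appears among its k best-matching gallery images. The sketch below computes that fraction from a distance matrix. It is a simplified illustration, not the paper's evaluation protocol: the function name `rank_k_accuracy`, the use of distances (lower is better) instead of the network's similarity scores, and the toy data are all assumptions.

```python
def rank_k_accuracy(distances, query_ids, gallery_ids, k):
    """Fraction of queries whose identity appears among the k nearest gallery items.

    distances[q][j] is the distance between query q and gallery item j
    (lower means more similar).
    """
    hits = 0
    for q, row in enumerate(distances):
        # Gallery indices sorted from nearest to farthest.
        order = sorted(range(len(row)), key=lambda j: row[j])
        top_ids = [gallery_ids[j] for j in order[:k]]
        if query_ids[q] in top_ids:
            hits += 1
    return hits / len(distances)


# Toy usage: three queries against a gallery of three identities.
toy_distances = [
    [0.1, 0.9, 0.5],  # query "a": nearest gallery item is "a"
    [0.8, 0.3, 0.2],  # query "b": nearest is "c", then "b"
    [0.4, 0.6, 0.7],  # query "c": "c" is the farthest item
]
ids = ["a", "b", "c"]
rank1 = rank_k_accuracy(toy_distances, ids, ids, k=1)
rank2 = rank_k_accuracy(toy_distances, ids, ids, k=2)
```

The function returns a fraction; multiplying by 100 gives percentages in the form reported above. Rank-k accuracy can only stay the same or rise as k grows, which matches the progression from Rank-1 to Rank-20 in the CUHK01 numbers.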

This review was generated by AI and reviewed by a human editor.