QUICK REVIEW

[Paper Review] Spatially and Temporally Efficient Non-local Attention Network for Video-based Person Re-Identification

Chih-Ting Liu, Chih-Wei Wu | arXiv (Cornell University) | 2019-08-05

Video Surveillance and Tracking Methods | Citations: 52

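One-line summary

The paper presents NVAN, a non-local video attention network for video-based person re-identification, and STE-NVAN, a variant that reduces computational cost while preserving accuracy. NVAN achieves state-of-the-art performance on MARS and competitive results on DukeV.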

ABSTRACT

Video-based person re-identification (Re-ID) aims at matching video sequences of pedestrians across non-overlapping cameras. It is a practical yet challenging task of how to embed spatial and temporal information of a video into its feature representation. While most existing methods learn the video characteristics by aggregating image-wise features and designing attention mechanisms in Neural Networks, they only explore the correlation between frames at high-level features. In this work, we target at refining the intermediate features as well as high-level features with non-local attention operations and make two contributions. (i) We propose a Non-local Video Attention Network (NVAN) to incorporate video characteristics into the representation at multiple feature levels. (ii) We further introduce a Spatially and Temporally Efficient Non-local Video Attention Network (STE-NVAN) to reduce the computation complexity by exploring spatial and temporal redundancy presented in pedestrian videos. Extensive experiments show that our NVAN outperforms state-of-the-arts by 3.8% in rank-1 accuracy on MARS dataset and confirms our STE-NVAN displays a much superior computation footprint compared to existing methods.

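Motivation and objectives

Incorporate the spatial and temporal characteristics of video into multi-level feature representations for video-based Re-ID.
Refine both intermediate and high-level features with non-local attention to capture global context across frames.
Exploit the spatial and temporal redundancy in pedestrian videos to reduce computation without degrading performance.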

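Proposed method

NVAN inserts non-local attention layers into a ResNet-50 backbone to fuse spatio-temporal context at multiple feature levels.
Restricted random sampling (RRS) selects frames from each video so that training and testing stay cost-efficient.
A Feature Pooling Layer (FPL) produces the final sequence feature using 3D average pooling followed by batch normalization.
Two complexity-reduction strategies are applied: the Spatial Reduction Non-local Layer (groups features into horizontal stripes) and Temporal Hierarchical Reduction (shrinks the temporal dimension by pooling frame features).
Training combines cross-entropy loss with a soft-margin batch-hard triplet loss. An empirical finding: cross-entropy works best on the final feature, and triplet loss on the feature before batch normalization.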
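
The frame-selection step can be illustrated with a short sketch. It assumes the commonly described form of restricted random sampling: the video is split into equally sized chunks and one frame is drawn at random from each. The function name, the `rng` parameter, and the handling of videos shorter than the requested number of frames are illustrative assumptions, not details taken from the paper.

```python
import random


def restricted_random_sampling(num_frames, num_samples, rng=None):
    """Return `num_samples` frame indices, one per near-equal chunk.

    Assumption: chunk i covers indices [i*N//T, (i+1)*N//T), where
    N = num_frames and T = num_samples. If a chunk is empty (video
    shorter than T frames), the nearest valid frame is reused.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    rng = rng or random  # the module itself exposes randrange()

    # Integer chunk boundaries that together cover the whole video.
    bounds = [i * num_frames // num_samples for i in range(num_samples + 1)]

    indices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        if end > start:
            # Draw one frame uniformly at random from this chunk.
            indices.append(rng.randrange(start, end))
        else:
            # Empty chunk: fall back to a valid frame index.
            indices.append(min(start, num_frames - 1))
    return indices


if __name__ == "__main__":
    # Sample 8 frames from a hypothetical 100-frame tracklet.
    print(restricted_random_sampling(100, 8, rng=random.Random(0)))
```

Because each index comes from a different chunk, the selected frames span the whole tracklet in temporal order, while the random draw still varies the exact frames from one epoch to the next.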

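Experimental results

Research questions

RQ1: Can non-local attention layers improve video-based Re-ID by refining intermediate and high-level features across the whole sequence?
RQ2: How does introducing spatio-temporal information at multiple feature levels affect Re-ID performance?
RQ3: How can the computational cost of non-local video attention be cut substantially without a large loss in accuracy?
RQ4: Do spatial reduction and temporal hierarchical reduction work together so that STE-NVAN surpasses existing methods in both efficiency and accuracy?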

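Key results

NVAN outperforms state-of-the-art video-based Re-ID methods by 3.8 percentage points in rank-1 accuracy on MARS.
STE-NVAN reduces FLOPs by 72.7% relative to NVAN while losing only 1.1 percentage points of rank-1 accuracy on MARS.
By adding non-local layers to the ResNet-50 baseline, NVAN improves rank-1 accuracy (R1) and mAP by a clear margin on both the MARS and DukeV datasets.
Spatial reduction and temporal reduction each cut FLOPs substantially with only marginal accuracy loss, and the combined STE-NVAN achieves the best efficiency-accuracy trade-off.
On MARS, NVAN achieves 90.0% R1 and 82.8% mAP, while STE-NVAN achieves 88.9% R1 and 81.2% mAP with fewer FLOPs than NVAN.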
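
To make the trade-off concrete, the sketch below works through the arithmetic using only the MARS figures quoted in this review. The variable and function names are illustrative, and the FLOP ratio is a derived quantity, not a measured wall-clock speedup.

```python
# MARS figures as quoted in this review.
nvan = {"rank1": 90.0, "mAP": 82.8}
ste_nvan = {"rank1": 88.9, "mAP": 81.2}
flop_reduction = 0.727  # STE-NVAN relative to NVAN


def summarize(full, efficient, reduction):
    """Derive accuracy drops (percentage points) and the FLOP ratio."""
    remaining = 1.0 - reduction  # fraction of NVAN's FLOPs still needed
    return {
        "rank1_drop": round(full["rank1"] - efficient["rank1"], 1),
        "mAP_drop": round(full["mAP"] - efficient["mAP"], 1),
        "flops_remaining": round(remaining, 3),
        "flop_ratio": round(1.0 / remaining, 2),
    }


summary = summarize(nvan, ste_nvan, flop_reduction)
print(summary)
```

Under these numbers STE-NVAN needs about 27.3% of NVAN's FLOPs, roughly 3.66 times fewer, in exchange for 1.1 points of rank-1 accuracy and 1.6 points of mAP.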

This review was generated by AI and reviewed by a human editor.