QUICK REVIEW

[Paper Review] Spatially and Temporally Efficient Non-local Attention Network for Video-based Person Re-Identification

Chih-Ting Liu, Chih-Wei Wu | arXiv (Cornell University) | Aug 5, 2019

Video Surveillance and Tracking Methods | References: 21 | Citations: 35

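One-line summary

This paper introduces NVAN to embed spatio-temporal context into multi-level video features, proposes STE-NVAN to cut computation while maintaining accuracy, and achieves state-of-the-art results on MARS.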

ABSTRACT

Video-based person re-identification (Re-ID) aims at matching video sequences of pedestrians across non-overlapping cameras. It is a practical yet challenging task of how to embed spatial and temporal information of a video into its feature representation. While most existing methods learn the video characteristics by aggregating image-wise features and designing attention mechanisms in Neural Networks, they only explore the correlation between frames at high-level features. In this work, we target at refining the intermediate features as well as high-level features with non-local attention operations and make two contributions. (i) We propose a Non-local Video Attention Network (NVAN) to incorporate video characteristics into the representation at multiple feature levels. (ii) We further introduce a Spatially and Temporally Efficient Non-local Video Attention Network (STE-NVAN) to reduce the computation complexity by exploring spatial and temporal redundancy presented in pedestrian videos. Extensive experiments show that our NVAN outperforms state-of-the-arts by 3.8% in rank-1 accuracy on MARS dataset and confirms our STE-NVAN displays a much superior computation footprint compared to existing methods.

Motivation and Objectives

Improve video-based person Re-ID by exploiting spatial and temporal information at multiple feature levels.
Incorporate non-local attention into CNN backbones to refine intermediate and high-level features.
Reduce the computational burden of non-local attention through spatial and temporal reductions.
Demonstrate state-of-the-art performance on large video-based Re-ID benchmarks.
Provide an efficient variant (STE-NVAN) with favorable accuracy-computation trade-offs.

Proposed Method

Introduce Non-local Video Attention Network (NVAN) by inserting non-local attention layers at multiple CNN feature levels to capture spatio-temporal dependencies.
Use Restricted Random Sampling (RRS) to select sequences of frames for efficient training and inference.
Incorporate a Feature Pooling Layer (FPL) performing 3D average pooling followed by batch normalization.
Propose Spatial Reduction Non-local Layer to group features into horizontal stripes, reducing affinity computations from THW to TS (S stripes).
Propose Temporal Reduction with Hierarchical Structure by temporally pooling features to reduce temporal dimension across stages.
Define loss with cross-entropy on final features and soft-margin batch-hard triplet loss on pre-BN features.
Develop STE-NVAN by combining spatial reduction and hierarchical temporal reduction to cut FLOPs while maintaining performance.
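
Two of the steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation. The RRS function assumes the common formulation in which the sequence is split into T contiguous chunks and one frame is drawn per chunk. The affinity count assumes that spatial reduction keeps all T·H·W query positions and pools only the keys into T·S stripe features. The function names and the sizes used in the usage note (T=8, H=16, W=8, S=16) are hypothetical.

```python
import random


def restricted_random_sampling(num_frames, T, rng=None):
    """Pick T frame indices, one from each of T contiguous chunks.

    Assumption: chunks are equal-sized up to integer rounding. When the
    sequence is shorter than T, some chunks are empty and a neighbouring
    frame index is reused.
    """
    if num_frames <= 0 or T <= 0:
        raise ValueError("num_frames and T must be positive")
    rng = rng or random.Random()
    indices = []
    for k in range(T):
        start = (k * num_frames) // T
        end = ((k + 1) * num_frames) // T
        if end <= start:
            # Empty chunk (num_frames < T): reuse the frame at `start`.
            indices.append(min(start, num_frames - 1))
        else:
            indices.append(rng.randrange(start, end))
    return indices


def affinity_entries(T, H, W, S=None):
    """Count entries of the non-local affinity matrix.

    Without reduction, each of the T*H*W positions is compared with all
    T*H*W positions. With S horizontal stripes (assumed to be applied to
    the keys only), each position is compared with T*S stripe features.
    """
    queries = T * H * W
    keys = queries if S is None else T * S
    return queries * keys
```

For example, `restricted_random_sampling(100, 8)` returns eight increasing indices, one from each block of 12 or 13 frames. With the hypothetical sizes above, `affinity_entries(8, 16, 8)` is 1,048,576 and `affinity_entries(8, 16, 8, S=16)` is 131,072, an 8x reduction (H·W / S).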

Experimental Results

Research Questions

RQ1: Can non-local attention be effectively integrated at multiple feature levels to improve video-based Re-ID performance?
RQ2: How can the computational cost of non-local attention be reduced without sacrificing accuracy?
RQ3: What is the impact of frame sampling strategy (RRS) on Re-ID performance?
RQ4: What is the trade-off between accuracy and computation when applying spatial vs. temporal reductions in NVAN?
RQ5: How does STE-NVAN compare to state-of-the-art attention-based video Re-ID methods in terms of FLOPs and accuracy?

Key Findings

Method	Feature	MARS R1 (%)	MARS mAP (%)	DukeV R1 (%)	DukeV mAP (%)	FLOPs
ResNet-50	FPL	87.3	79.1	95.0	92.7	30.4 G
ResNet-50	max-FPL	86.3	76.6	95.4	92.4	30.4 G
NVAN	FPL	90.0	82.8	96.3	94.9	60.0 G
NVAN+Spatial Reduc.	FPL	89.7	82.5	96.3	94.7	30.4 G
NVAN+Temporal Reduc.	FPL	89.2	81.2	95.6	93.7	40.4 G
STE-NVAN	FPL	88.9	81.2	95.2	93.5	16.5 G

NVAN achieves strong improvements on MARS, reaching 90.0% rank-1 accuracy and 82.8% mAP, outperforming prior methods.
NVAN also attains 96.3% R1 and 94.9% mAP on DukeV, showing that the gains carry over to a second benchmark.
Spatial and temporal reduction each lower FLOPs substantially with small accuracy drops: spatial reduction alone cuts NVAN from 60.0 G to 30.4 G while MARS R1/mAP move from 90.0/82.8 to 89.7/82.5, and temporal reduction alone reaches 40.4 G at 89.2/81.2.
STE-NVAN uses 72.7% fewer FLOPs than NVAN and 45.7% fewer than the non-attention ResNet-50 baseline, at the cost of a 1.1-point rank-1 drop relative to NVAN on both MARS (90.0 to 88.9) and DukeV (96.3 to 95.2).
NVAN and STE-NVAN offer favorable accuracy-FLOP trade-offs relative to existing attention-based video Re-ID methods, with STE-NVAN providing the best efficiency.
Extensive ablations show benefits of more sampled frames (T) and more non-local layers, and validate the effectiveness of both spatial and temporal reductions.
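
The efficiency figures can be re-derived from the table with a few lines of arithmetic. The sketch below copies the FLOP and rank-1 values from the table; the helper name `reduction_pct` is made up for illustration. Because the table values are rounded, the result relative to NVAN comes out at about 72.5%, close to the quoted 72.7%, while the result relative to the baseline matches the quoted 45.7%.

```python
# FLOPs in G and rank-1 accuracy in %, copied from the table above.
flops = {
    "ResNet-50": 30.4,
    "NVAN": 60.0,
    "NVAN+Spatial Reduc.": 30.4,
    "NVAN+Temporal Reduc.": 40.4,
    "STE-NVAN": 16.5,
}
mars_r1 = {"NVAN": 90.0, "STE-NVAN": 88.9}
dukev_r1 = {"NVAN": 96.3, "STE-NVAN": 95.2}


def reduction_pct(new, ref):
    """Relative reduction of `new` with respect to `ref`, in percent."""
    return 100.0 * (ref - new) / ref


vs_nvan = reduction_pct(flops["STE-NVAN"], flops["NVAN"])
vs_baseline = reduction_pct(flops["STE-NVAN"], flops["ResNet-50"])
mars_drop = mars_r1["NVAN"] - mars_r1["STE-NVAN"]
dukev_drop = dukev_r1["NVAN"] - dukev_r1["STE-NVAN"]

print(f"FLOP reduction vs NVAN:     {vs_nvan:.1f}%")
print(f"FLOP reduction vs baseline: {vs_baseline:.1f}%")
print(f"Rank-1 drop on MARS:  {mars_drop:.1f} points")
print(f"Rank-1 drop on DukeV: {dukev_drop:.1f} points")
```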

This review was generated by AI and checked by a human editor.