QUICK REVIEW

[Paper Review] Spatially and Temporally Efficient Non-local Attention Network for Video-based Person Re-Identification

Chih‐Ting Liu, Chih-Wei Wu|arXiv (Cornell University)|Aug 5, 2019
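Video Surveillance and Tracking Methods|References: 21|Citations: 35
One-sentence summary

This paper introduces NVAN to embed spatio-temporal context into multi-level video features, proposes STE-NVAN to cut computation while maintaining accuracy, and achieves state-of-the-art results on MARS.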

ABSTRACT

Video-based person re-identification (Re-ID) aims at matching video sequences of pedestrians across non-overlapping cameras. It is a practical yet challenging task of how to embed spatial and temporal information of a video into its feature representation. While most existing methods learn the video characteristics by aggregating image-wise features and designing attention mechanisms in Neural Networks, they only explore the correlation between frames at high-level features. In this work, we target at refining the intermediate features as well as high-level features with non-local attention operations and make two contributions. (i) We propose a Non-local Video Attention Network (NVAN) to incorporate video characteristics into the representation at multiple feature levels. (ii) We further introduce a Spatially and Temporally Efficient Non-local Video Attention Network (STE-NVAN) to reduce the computation complexity by exploring spatial and temporal redundancy presented in pedestrian videos. Extensive experiments show that our NVAN outperforms state-of-the-arts by 3.8% in rank-1 accuracy on MARS dataset and confirms our STE-NVAN displays a much superior computation footprint compared to existing methods.

Research motivation and objectives

  • Motivate robust video-based person re-ID by leveraging spatial and temporal information at multiple feature levels.
  • Incorporate non-local attention into CNN backbones to refine intermediate and high-level features.
  • Reduce the computational burden of non-local attention through spatial and temporal reductions.
  • Demonstrate state-of-the-art performance on large video-based Re-ID benchmarks.
  • Provide an efficient variant (STE-NVAN) with favorable accuracy-computation trade-offs.

Proposed method

  • Introduce Non-local Video Attention Network (NVAN) by inserting non-local attention layers at multiple CNN feature levels to capture spatio-temporal dependencies.
  • Use Restricted Random Sampling (RRS) to select sequences of frames for efficient training and inference.
  • Incorporate a Feature Pooling Layer (FPL) performing 3D average pooling followed by batch normalization.
  • Propose Spatial Reduction Non-local Layer to group features into horizontal stripes, reducing affinity computations from THW to TS (S stripes).
  • Propose Temporal Reduction with Hierarchical Structure by temporally pooling features to reduce temporal dimension across stages.
  • Define loss with cross-entropy on final features and soft-margin batch-hard triplet loss on pre-BN features.
  • Develop STE-NVAN by combining spatial reduction and hierarchical temporal reduction to cut FLOPs while maintaining performance.
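
Two of the steps above can be sketched in plain Python. The first function illustrates Restricted Random Sampling under the usual description of the technique: the tracklet is split into T near-equal chunks, one random frame is drawn per chunk during training, and the first frame of each chunk is used otherwise. The review does not spell out this rule, so treat it as an assumption. The second function counts query-key affinities in one non-local layer, assuming queries stay at full resolution (THW positions) while keys are pooled to S horizontal stripes per frame, so each position attends to TS keys instead of THW. All function names, the 80-frame tracklet, and the 16x8 feature map are hypothetical and chosen only for illustration.

```python
import random


def restricted_random_sampling(num_frames, num_chunks, training=True, rng=None):
    """Return one frame index per chunk (assumed RRS rule, see the text above).

    The tracklet is split into `num_chunks` near-equal chunks. In training a
    random frame is drawn from each chunk; otherwise the first frame is used.
    """
    if num_frames < 1 or num_chunks < 1:
        raise ValueError("num_frames and num_chunks must be positive")
    rng = rng or random
    indices = []
    for k in range(num_chunks):
        start = (k * num_frames) // num_chunks
        end = ((k + 1) * num_frames) // num_chunks
        end = max(end, start + 1)  # keep the chunk non-empty for short tracklets
        indices.append(rng.randrange(start, end) if training else start)
    return indices


def affinity_count(t, h, w, stripes=None):
    """Number of query-key affinities in one non-local layer.

    Assumption: queries stay at full resolution (t*h*w positions), while keys
    are pooled to `stripes` horizontal stripes per frame when `stripes` is set.
    """
    queries = t * h * w
    keys = t * stripes if stripes is not None else t * h * w
    return queries * keys


if __name__ == "__main__":
    # Deterministic (inference-style) sampling of 8 frames from 80.
    print(restricted_random_sampling(80, 8, training=False))
    full = affinity_count(8, 16, 8)
    reduced = affinity_count(8, 16, 8, stripes=4)
    print(full, reduced, full // reduced)
```

Under these assumptions the saving factor is HW/S per layer, which is why stripe pooling matters most at the earlier, higher-resolution feature levels.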

Experimental results

Research questions

  • RQ1: Can non-local attention be effectively integrated at multiple feature levels to improve video-based Re-ID performance?
  • RQ2: How can the computational cost of non-local attention be reduced without sacrificing accuracy?
  • RQ3: What is the impact of frame sampling strategy (RRS) on Re-ID performance?
  • RQ4: What is the trade-off between accuracy and computation when applying spatial vs. temporal reductions in NVAN?
  • RQ5: How does STE-NVAN compare to state-of-the-art attention-based video Re-ID methods in terms of FLOPs and accuracy?

Key findings

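| Method               | Feature | MARS R1 | MARS mAP | DukeV R1 | DukeV mAP | FLOPs  |
|----------------------|---------|---------|----------|----------|-----------|--------|
| ResNet-50            | FPL     | 87.3    | 79.1     | 95.0     | 92.7      | 30.4 G |
| ResNet-50            | max-FPL | 86.3    | 76.6     | 95.4     | 92.4      | 30.4 G |
| NVAN                 | FPL     | 90.0    | 82.8     | 96.3     | 94.9      | 60.0 G |
| NVAN+Spatial Reduc.  | FPL     | 89.7    | 82.5     | 96.3     | 94.7      | 30.4 G |
| NVAN+Temporal Reduc. | FPL     | 89.2    | 81.2     | 95.6     | 93.7      | 40.4 G |
| STE-NVAN             | FPL     | 88.9    | 81.2     | 95.2     | 93.5      | 16.5 G |
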
  • NVAN achieves strong improvements on MARS, reaching 90.0% rank-1 accuracy and 82.8% mAP, outperforming prior methods.
  • NVAN also attains 96.3% R1 and 94.9% mAP on DukeV, showing that the gains carry over to a second benchmark.
  • Spatial reduction and temporal reduction each lower FLOPs at a small cost in accuracy: on MARS, spatial reduction alone moves R1/mAP from 90.0/82.8 to 89.7/82.5, and temporal reduction alone to 89.2/81.2.
  • STE-NVAN reduces FLOPs by 72.7% relative to NVAN and by 45.7% relative to the non-attention baseline, with a small loss in accuracy (e.g., ~0.8-1.1% R1 drop in key cases).
  • NVAN and STE-NVAN offer favorable accuracy-FLOP trade-offs relative to existing attention-based video Re-ID methods, with STE-NVAN providing the best efficiency.
  • Extensive ablations show benefits of more sampled frames (T) and more non-local layers, and validate the effectiveness of both spatial and temporal reductions.
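
As a quick check on the efficiency claims, the relative FLOPs savings can be recomputed from the FLOPs column of the table above. The helper name `flops_reduction` is hypothetical, and the inputs are simply the values as listed in this review, not independently verified figures.

```python
def flops_reduction(reference_gflops, reduced_gflops):
    """Percentage of FLOPs saved relative to `reference_gflops`."""
    return (1.0 - reduced_gflops / reference_gflops) * 100.0


# FLOPs in GFLOPs, copied from the results table in this review.
FLOPS = {"ResNet-50": 30.4, "NVAN": 60.0, "STE-NVAN": 16.5}

if __name__ == "__main__":
    for reference in ("ResNet-50", "NVAN"):
        saved = flops_reduction(FLOPS[reference], FLOPS["STE-NVAN"])
        print(f"STE-NVAN vs {reference}: {saved:.1f}% fewer FLOPs")
```

With the values as listed, the baseline comparison reproduces the quoted 45.7%, while the NVAN comparison comes out at 72.5% rather than 72.7%. The gap is small and may come from rounding in the listed FLOPs.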


This review was generated by AI and checked by a human editor.