[Paper Review] Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification
The paper introduces shape-erased feature learning, realized in the SGIEL framework, which decomposes VI-ReID representations into shape-related and shape-erased parts lying in orthogonal subspaces. This lets the model discover diverse modality-shared cues beyond body shape and improves cross-modal re-identification performance.
Due to the modality gap between visible and infrared images with high visual ambiguity, learning diverse modality-shared semantic concepts for visible-infrared person re-identification (VI-ReID) remains a challenging problem. Body shape is one of the significant modality-shared cues for VI-ReID. To dig more diverse modality-shared cues, we expect that erasing body-shape-related semantic concepts in the learned features can force the ReID model to extract more and other modality-shared features for identification. To this end, we propose a shape-erased feature learning paradigm that decorrelates modality-shared features in two orthogonal subspaces. Jointly learning shape-related features in one subspace and shape-erased features in the orthogonal complement achieves a conditional mutual information maximization between shape-erased features and identity, discarding body shape information, thus enhancing the diversity of the learned representation explicitly. Extensive experiments on SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of our method.
Motivation and Objectives
- Motivate learning diverse modality-shared cues for VI-ReID beyond body shape, given the high visual ambiguity between visible and infrared images.
- Propose a shape-erased feature learning paradigm that decorrelates shape-related and shape-erased features using an orthogonal subspace decomposition.
- Develop the SGIEL framework to jointly optimize shape-related and shape-erased objectives, improving modality-shared representations.
- Leverage body-shape priors from pre-trained parsing to guide shape-related features while encouraging discovery of other discriminative cues.
Proposed Method
- Decompose each modality feature z(i) into shape-related z_sr(i) and shape-erased z_se(i) via a semi-orthogonal projector P, with z_sr(i)=P^T z(i) and z_se(i)=(I−PP^T) z(i).
- Impose an orthogonality regularizer L_ortho to encourage P^T P to approximate the identity in an L1 sense (Eq. 3).
- Maximize the conditional mutual information I(Z_se^(i); Y | X^(s)), which equals I(Z_se^(i); Y) − I(Z_se^(i); Y; X^(s)), by maximizing I(Z_se^(i); Y) (Eq. 4) and minimizing I(Z_se^(i); Y; X^(s)); approximate estimates are obtained via cross-entropy losses and MSE guidance (Eqs. 5, 9, 11).
- Learn Z_sr^(i) to imitate Z^(s) (the body-shape representation) by minimizing L_srmse and L_srkl (Eq. 12).
- Eliminate modality-specific information in Z_se^(i) by cross-modal alignment losses L_sekl and by minimizing cross-entropy losses across modalities (Eq. 16).
- Train a joint objective L_train comprising L_int (identity, triplet, and cross-modal KL losses), L_sr, L_se, L_ortho, and L_sid, with a dynamic re-weighting scheme that sets α_t^sr and α_t^se based on gradient norms (Eqs. 19–20).
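The decomposition and orthogonality regularizer listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the toy dimensions, and the use of a QR factorization to obtain an exactly semi-orthogonal P are all assumptions made here. In the paper P is learned and L_ortho acts only as a soft constraint; the penalty below is one plausible reading of "P^T P approximates the identity in an L1 sense".

```python
import numpy as np

def make_semi_orthogonal(d, k, seed=0):
    """Return a d x k matrix with orthonormal columns, so P^T P = I_k.

    Hypothetical helper: stands in for a learned projector P.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, k)))  # reduced QR, q is d x k
    return q

def decompose(z, P):
    """Split row-wise features z (n x d) into shape-related and shape-erased parts."""
    z_sr = z @ P            # each row is P^T z(i), shape (n, k)
    z_se = z - z_sr @ P.T   # each row is (I - P P^T) z(i), shape (n, d)
    return z_sr, z_se

def ortho_penalty(P):
    """Sum of absolute entries of P^T P - I (assumed form of L_ortho)."""
    k = P.shape[1]
    return np.abs(P.T @ P - np.eye(k)).sum()

# Toy sizes chosen for illustration only.
d, k = 8, 3
P = make_semi_orthogonal(d, k)
z = np.random.default_rng(1).standard_normal((4, d))
z_sr, z_se = decompose(z, P)
```

When P is exactly semi-orthogonal, `z_se @ P` is zero up to floating-point error, which is the sense in which the shape-erased part lies in the orthogonal complement of the shape-related subspace.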
Experimental Results
Research Questions
- RQ1: Can erasing body-shape information in VI-ReID features lead to the deliberate discovery of additional modality-shared cues beyond body shape?
- RQ2: Does an orthogonal decomposition of features into shape-related and shape-erased subspaces improve cross-modal discrimination and reduce modality-specific biases?
- RQ3: How does jointly optimizing shape-related and shape-erased objectives affect VI-ReID performance on standard benchmarks?
- RQ4: What is the impact of body-shape priors and semantic parsing on guiding shape-related features compared to purely data-driven representations?
Key Findings
- SGIEL achieves results competitive with the state of the art on the SYSU-MM01, RegDB, and HITSZ-VCM datasets, outperforming several baselines with comparable parameter budgets.
- Ablation studies show that erasing body shape (shape-erased learning) yields measurable improvements in Rank-1 and mAP over baselines that do not erase shape.
- The orthogonal constraint on the two subspaces is beneficial: removing it severely degrades performance, while a single projector P of size 512 yields the best results in the ablations (Tables 4–5).
- Visualization indicates shape-related objectives focus on body contours, while shape-erased objectives attend to complementary regions, supporting the idea of learning diverse cues (Grad-CAM++ visualizations).
- On SYSU-MM01 (single-shot), the proposed method achieves 75.18 Rank-1 and 70.12 mAP for All Search and 81.20 mAP for Indoor Search in the 1x parameter setting; with concatenation (C) it reaches 77.12 Rank-1 and 72.33 mAP (All Search) and 82.95 mAP (Indoor Search).
- The method is extended to RegDB and HITSZ-VCM and shows competitive performance gains over baselines and several prior methods (Tables 1–3).
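For readers unfamiliar with the Rank-1 and mAP figures quoted above, the sketch below shows how these two retrieval metrics are commonly computed from ranked gallery lists. It is a generic illustration with hypothetical function names and toy data, and it omits dataset-specific protocol details of SYSU-MM01, RegDB, and HITSZ-VCM.

```python
def average_precision(ranked_matches):
    """AP for one query.

    ranked_matches[i] is True when the i-th ranked gallery item
    shares the query's identity.
    """
    hits, precisions = 0, []
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)  # precision at this recall point
    return sum(precisions) / len(precisions) if precisions else 0.0

def rank1_and_map(all_ranked_matches):
    """Rank-1 accuracy and mean AP over a list of per-query match lists."""
    n = len(all_ranked_matches)
    rank1 = sum(1 for m in all_ranked_matches if m and m[0]) / n
    mean_ap = sum(average_precision(m) for m in all_ranked_matches) / n
    return rank1, mean_ap

# Toy example: two queries, three gallery items each.
queries = [
    [True, False, True],   # AP = (1/1 + 2/3) / 2 = 5/6
    [False, True, False],  # AP = 1/2
]
rank1, mean_ap = rank1_and_map(queries)
```

Rank-1 rewards only the top retrieved item, whereas mAP reflects where all correct matches fall in the ranking, which is why the two numbers are reported together.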
This review was generated by AI and checked by a human editor.