[Paper Review] Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification
SSAN automatically learns semantically aligned part-level visual and textual features for text-to-image person ReID, using a Word Attention Module, multi-view non-local relations, and a Compound Ranking loss. It also introduces the new ICFG-PEDES dataset.
Text-to-image person re-identification (ReID) aims to search for images containing a person of interest using textual descriptions. However, due to the significant modality gap and the large intra-class variance in textual descriptions, text-to-image ReID remains a challenging problem. Accordingly, in this paper, we propose a Semantically Self-Aligned Network (SSAN) to handle the above problems. First, we propose a novel method that automatically extracts semantically aligned part-level features from the two modalities. Second, we design a multi-view non-local network that captures the relationships between body parts, thereby establishing better correspondences between body parts and noun phrases. Third, we introduce a Compound Ranking (CR) loss that makes use of textual descriptions for other images of the same identity to provide extra supervision, thereby effectively reducing the intra-class variance in textual features. Finally, to expedite future research in text-to-image ReID, we build a new database named ICFG-PEDES. Extensive experiments demonstrate that SSAN outperforms state-of-the-art approaches by significant margins. Both the new ICFG-PEDES database and the SSAN code are available at https://github.com/zifyloo/SSAN.
Research Motivation and Objectives
- Address the challenge of cross-modal text-to-image ReID with large textual intra-class variance and variable word-body-part mappings.
- Automatically derive part-level textual features aligned to visual regions without external tools.
- Model relationships among body parts to better match noun phrases through multi-view non-local interactions.
- Reduce textual intra-class variance via a Compound Ranking loss that leverages descriptions of other images of the same identity.
- Provide a new, more challenging and identity-centric dataset (ICFG-PEDES) to advance text-to-image ReID research.
Proposed Method
- Extract part-level visual features by uniform partitioning of visual feature maps.
- Process descriptions with Bi-LSTM to obtain word representations.
- Use a Word Attention Module (WAM) to predict word-to-part associations and generate part-level textual features.
- Apply a shared 1x1 convolution to align global visual and textual features in a common space (global branch).
- Introduce Part-specific Feature Learning (PFL) and Part Relation Learning (PRL) in part branches to obtain semantically aligned part features.
- Employ a Multi-View Non-Local Network (MV-NLN) to capture intra-part and inter-part relationships in both modalities and refine part features.
- Propose a Compound Ranking (CR) loss that combines strong and weak supervision terms, with adaptive margins, to exploit descriptions of other images of the same identity as supervision.
- Train with global, PFL, and PRL features using a combination of ID loss and the CR loss; at inference, sum the three cross-modal similarity scores (S_g, S_l, S_n).
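The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: random arrays stand in for the CNN feature map and the Bi-LSTM word features, the Word Attention Module is reduced to a single linear layer with a sigmoid, stripes are max-pooled, and the adaptive weak margin is assumed to scale with the similarity between the two descriptions. All function names, parameter names, and default values (`margin`, `beta`) are hypothetical.

```python
import numpy as np

def part_visual_features(fmap, num_parts):
    """Split a (C, H, W) feature map into horizontal stripes and max-pool each one."""
    stripes = np.array_split(fmap, num_parts, axis=1)       # uniform split along height
    return np.stack([s.max(axis=(1, 2)) for s in stripes])  # (K, C)

def part_textual_features(words, w_attn, b_attn):
    """words: (n, C) word features. Returns (K, C) part-level textual features."""
    # Assumed WAM: one linear layer + sigmoid gives word-to-part scores in (0, 1).
    scores = 1.0 / (1.0 + np.exp(-(words @ w_attn + b_attn)))  # (n, K)
    weighted = scores.T[:, :, None] * words[None, :, :]        # (K, n, C)
    return weighted.max(axis=1)                                # max-pool over words

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def part_similarity(vis_parts, txt_parts):
    """Mean cosine similarity over corresponding parts (an illustrative choice)."""
    return float(np.mean([cosine(v, t) for v, t in zip(vis_parts, txt_parts)]))

def compound_ranking_loss(s_ip_dp, s_ip_dn, s_in_dp, s_ip_dw, s_in_dw,
                          txt_sim, margin=0.2, beta=0.1):
    """Strong term uses the image's own description (dp); the weak term uses a
    description of another image of the same identity (dw).
    txt_sim is the similarity between dp and dw, expected in [-1, 1]."""
    hinge = lambda x: max(x, 0.0)
    strong = hinge(margin - s_ip_dp + s_ip_dn) + hinge(margin - s_ip_dp + s_in_dp)
    # Assumed adaptive rule: the weak margin shrinks as the two descriptions diverge.
    weak_margin = margin * (txt_sim + 1.0) / 2.0
    weak = hinge(weak_margin - s_ip_dw + s_ip_dn) + hinge(weak_margin - s_ip_dw + s_in_dw)
    return strong + beta * weak

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_parts, channels = 6, 8
    fmap = rng.normal(size=(channels, 24, 8))   # stand-in for a CNN feature map
    words = rng.normal(size=(12, channels))     # stand-in for Bi-LSTM outputs
    w_attn = rng.normal(size=(channels, num_parts))
    b_attn = np.zeros(num_parts)
    v = part_visual_features(fmap, num_parts)
    t = part_textual_features(words, w_attn, b_attn)
    print(v.shape, t.shape, part_similarity(v, t))
```

In this sketch the weak term contributes less than the strong term in two ways: it is scaled by `beta`, and its margin is never larger than the strong margin. That matches the idea of treating descriptions of other images of the same identity as softer supervision.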
Experimental Results
Research Questions
- RQ1: Can semantically self-aligned part features be automatically extracted for text-to-image ReID without external text tools?
- RQ2: Does modeling inter-part relationships via MV-NLN improve cross-modal alignment and retrieval performance?
- RQ3: Can a compound ranking loss leveraging descriptions of other images of the same identity reduce textual intra-class variance?
- RQ4: Does the proposed SSAN architecture outperform existing text-to-image ReID methods on standard and newly introduced datasets?
Key Results
- Adding PFL (part-specific feature learning) to the baseline improves Rank-1 accuracy by 4.58 percentage points on CUHK-PEDES and 3.56 points on ICFG-PEDES.
- Adding PRL (part relation learning) yields a further improvement of 1.33 percentage points (CUHK-PEDES) and 0.95 points (ICFG-PEDES).
- Incorporating the CR loss provides an additional gain of 1.62 points (CUHK-PEDES) and 1.21 points (ICFG-PEDES) in Rank-1.
- SSAN outperforms the state of the art on CUHK-PEDES, surpassing ViTAA by 5.4% in Rank-1 accuracy, with competitive performance at other ranks.
- SSAN achieves its strongest results with the full model (Global + PFL + MV-NLN + CR loss), significantly surpassing baselines and prior part-based methods.
- The authors release ICFG-PEDES as an identity-centric, fine-grained dataset with longer captions and more challenging imagery to support future research.
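For readers unfamiliar with the Rank-1 metric quoted above, the following sketch shows how Rank-k accuracy is commonly computed for text-to-image retrieval: a query description counts as a hit if any of its k most similar gallery images shares its identity. The function name and the toy similarity matrix are illustrative assumptions, not taken from the SSAN code or its evaluation protocol.

```python
import numpy as np

def rank_k_accuracy(sim, query_ids, gallery_ids, k=1):
    """sim: (num_queries, num_gallery) text-to-image similarity matrix.
    Returns the fraction of queries with a correct identity in the top k."""
    sim = np.asarray(sim, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    top_k = np.argsort(-sim, axis=1)[:, :k]                 # best k gallery indices per query
    hits = (gallery_ids[top_k] == query_ids[:, None]).any(axis=1)
    return float(hits.mean())

if __name__ == "__main__":
    # Toy example: 3 text queries against 3 gallery images.
    sim = [[0.9, 0.1, 0.3],
           [0.2, 0.8, 0.7],
           [0.1, 0.6, 0.5]]
    query_ids = [1, 2, 3]
    gallery_ids = [1, 3, 2]
    print(rank_k_accuracy(sim, query_ids, gallery_ids, k=1))
    print(rank_k_accuracy(sim, query_ids, gallery_ids, k=2))
```

A gain reported in percentage points is then the plain difference between two such accuracies, expressed on a 0 to 100 scale.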
This review was generated by AI and reviewed by a human editor.