QUICK REVIEW

[Paper Review] Semantically Self-Aligned Network for Text-to-Image Part-aware Person Re-identification

Zefeng Ding, Changxing Ding | arXiv (Cornell University) | July 27, 2021

Video Surveillance and Tracking Methods | References: 63 | Citations: 79

One-line summary

SSAN automatically learns semantically aligned part-level visual and textual features for text-to-image person ReID, using a Word Attention Module, multi-view non-local relations, and a Compound Ranking loss, and it contributes the new ICFG-PEDES dataset.

ABSTRACT

Text-to-image person re-identification (ReID) aims to search for images containing a person of interest using textual descriptions. However, due to the significant modality gap and the large intra-class variance in textual descriptions, text-to-image ReID remains a challenging problem. Accordingly, in this paper, we propose a Semantically Self-Aligned Network (SSAN) to handle the above problems. First, we propose a novel method that automatically extracts semantically aligned part-level features from the two modalities. Second, we design a multi-view non-local network that captures the relationships between body parts, thereby establishing better correspondences between body parts and noun phrases. Third, we introduce a Compound Ranking (CR) loss that makes use of textual descriptions for other images of the same identity to provide extra supervision, thereby effectively reducing the intra-class variance in textual features. Finally, to expedite future research in text-to-image ReID, we build a new database named ICFG-PEDES. Extensive experiments demonstrate that SSAN outperforms state-of-the-art approaches by significant margins. Both the new ICFG-PEDES database and the SSAN code are available at https://github.com/zifyloo/SSAN.

Research motivation and objectives

Address the challenge of cross-modal text-to-image ReID with large textual intra-class variance and variable word-body-part mappings.
Automatically derive part-level textual features aligned to visual regions without external tools.
Model relationships among body parts to better match noun phrases through multi-view non-local interactions.
Reduce textual intra-class variance via a Compound Ranking loss that leverages descriptions of other images of the same identity.
Provide a new, more challenging and identity-centric dataset (ICFG-PEDES) to advance text-to-image ReID research.

Proposed method

Extract part-level visual features by uniform partitioning of visual feature maps.
Process descriptions with Bi-LSTM to obtain word representations.
Use a Word Attention Module (WAM) to predict word-to-part associations and generate part-level textual features.
Apply a shared 1x1 convolution to align global visual and textual features in a common space (global branch).
Introduce Part-specific Feature Learning (PFL) and Part Relation Learning (PRL) in part branches to obtain semantically aligned part features.
Employ Multi-View Non-Local Network (MV-NLN) to capture intra-part and inter-part relationships in both modalities and refine part features.
Propose a Compound Ranking (CR) loss that combines strong and weak supervision terms, with adaptive margins, to exploit descriptions of other images of the same identity as supervision.
Train with global, PFL, and PRL features using a combination of ID loss and the CR loss; at inference, sum the three cross-modal similarity scores (S_g, S_l, S_n) computed from the global, PFL, and PRL features, respectively.
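
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation (their code is at the URL given in the abstract). The function names, the max-pooling choices, the cosine similarity, the one-directional hinge form of the CR loss, and the fixed margin and weight values are all assumptions made for this example; the review only states that the CR loss combines strong and weak terms with adaptive margins, so the fixed margins here are a simplification.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def uniform_part_features(feature_map, num_parts):
    """Split an H x C feature map into equal horizontal stripes, max-pool each.

    feature_map: list of H rows, each a list of C channel values
    (assumed to be already pooled over the width dimension).
    Assumes H is divisible by num_parts.
    """
    stripe = len(feature_map) // num_parts
    parts = []
    for k in range(num_parts):
        rows = feature_map[k * stripe:(k + 1) * stripe]
        parts.append([max(col) for col in zip(*rows)])
    return parts


def part_text_features(word_feats, word_part_scores):
    """Build one textual feature per part from word-to-part scores.

    word_feats: list of N word vectors (e.g. Bi-LSTM outputs), each length C.
    word_part_scores: list of N rows with K scores in [0, 1], standing in
    for the output of a word attention module (hypothetical values).
    Each word is scaled by its score for part k, then max-pooled over words.
    """
    num_parts = len(word_part_scores[0])
    parts = []
    for k in range(num_parts):
        weighted = [[s[k] * x for x in w]
                    for w, s in zip(word_feats, word_part_scores)]
        parts.append([max(col) for col in zip(*weighted)])
    return parts


def total_similarity(image_feats, text_feats):
    """Sum the similarities of the three branches: S_g + S_l + S_n.

    Both arguments are dicts with keys 'global', 'pfl', 'prl' (assumed
    names), each holding one flat feature vector.
    """
    return sum(cosine(image_feats[b], text_feats[b])
               for b in ("global", "pfl", "prl"))


def hinge(margin, s_pos, s_neg):
    """Ranking hinge: zero once the positive beats the negative by margin."""
    return max(margin - s_pos + s_neg, 0.0)


def compound_ranking_loss(s_own, s_same_id, s_neg,
                          margin=0.2, weak_margin=0.1, weak_weight=0.1):
    """Strong term (image vs. its own description) plus a down-weighted
    weak term (image vs. a description of another image of the same
    identity). Fixed margins are a simplification of the adaptive ones.
    """
    strong = hinge(margin, s_own, s_neg)
    weak = hinge(weak_margin, s_same_id, s_neg)
    return strong + weak_weight * weak
```

For example, `uniform_part_features(feature_map, 2)` returns two stripe-level vectors, and `part_text_features` returns one textual vector per part in the same order, so the k-th visual and textual part features can be compared directly.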

Experimental results

Research questions

RQ1: Can semantically self-aligned part features be automatically extracted for text-to-image ReID without external text tools?
RQ2: Does modeling inter-part relationships via MV-NLN improve cross-modal alignment and retrieval performance?
RQ3: Can a compound ranking loss leveraging descriptions of other images of the same identity reduce textual intra-class variance?
RQ4: Does the proposed SSAN architecture outperform existing text-to-image ReID methods on standard and newly introduced datasets?

Key findings

Adding PFL (Part-specific Feature Learning) to the baseline improves Rank-1 accuracy by 4.58 percentage points on CUHK-PEDES and 3.56 points on ICFG-PEDES.
Adding PRL (part relation learning) yields a further improvement of 1.33 percentage points (CUHK-PEDES) and 0.95 points (ICFG-PEDES).
Incorporating the CR loss provides an additional gain of 1.62 points (CUHK-PEDES) and 1.21 points (ICFG-PEDES) in Rank-1.
SSAN outperforms the state-of-the-art on CUHK-PEDES, surpassing ViTAA by 5.4% in Rank-1 accuracy, with competitive performance at other ranks.
SSAN achieves its strongest results with the full model (Global + PFL + MV-NLN + CR loss), significantly surpassing baselines and prior part-based methods.
The authors release ICFG-PEDES as an identity-centric, fine-grained dataset with longer captions and more challenging imagery to support future research.
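
As a quick arithmetic check on the ablation figures above, the short sketch below adds up the reported Rank-1 gains per dataset. It assumes the three gains are cumulative, meaning each component is added on top of the previous configuration, which the wording "further" and "additional" suggests but this review does not state outright. The dictionary layout and variable names are made up for the example.

```python
# Rank-1 gains in percentage points, copied from the findings above.
# Assumption: gains stack in the order PFL -> PRL -> CR loss.
gains = {
    "CUHK-PEDES": {"PFL": 4.58, "PRL": 1.33, "CR loss": 1.62},
    "ICFG-PEDES": {"PFL": 3.56, "PRL": 0.95, "CR loss": 1.21},
}

# Total gain over the baseline per dataset, rounded to two decimals.
totals = {name: round(sum(steps.values()), 2) for name, steps in gains.items()}

for name, total in totals.items():
    print(f"{name}: +{total:.2f} Rank-1 points over the baseline")
```

Under that assumption, the full model gains roughly 7.53 points on CUHK-PEDES and 5.72 points on ICFG-PEDES over the baseline, with PFL contributing the largest share on both datasets.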


This review was generated by AI and reviewed by a human editor.