QUICK REVIEW

[Paper Review] Contextual Non-Local Alignment over Full-Scale Representation for Text-Based Person Search

Chenyang Gao, Guanyu Cai|arXiv (Cornell University)|2021. 01. 08.

Video Surveillance and Tracking Methods | References: 24 | Citations: 61

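One-Line Summary

NAFS introduces full-scale image and text representations that are adaptively aligned across scales through contextual non-local attention, achieving state-of-the-art performance on CUHK-PEDES.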

ABSTRACT

Text-based person search aims at retrieving target person in an image gallery using a descriptive sentence of that person. It is very challenging since modal gap makes effectively extracting discriminative features more difficult. Moreover, the inter-class variance of both pedestrian images and descriptions is small. So comprehensive information is needed to align visual and textual clues across all scales. Most existing methods merely consider the local alignment between images and texts within a single scale (e.g. only global scale or only partial scale) then simply construct alignment at each scale separately. To address this problem, we propose a method that is able to adaptively align image and textual features across all scales, called NAFS (i.e., Non-local Alignment over Full-Scale representations). Firstly, a novel staircase network structure is proposed to extract full-scale image features with better locality. Secondly, a BERT with locality-constrained attention is proposed to obtain representations of descriptions at different scales. Then, instead of separately aligning features at each scale, a novel contextual non-local attention mechanism is applied to simultaneously discover latent alignments across all scales. The experimental results show that our method outperforms the state-of-the-art methods by 5.53% in terms of top-1 and 5.35% in terms of top-5 on text-based person search dataset. The code is available at https://github.com/TencentYoutuResearch/PersonReID-NAFS

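Research Motivation and Objectives

Motivate text-based person search and address the small inter-class variance of both pedestrian images and their descriptions.
Learn image and text representations at multiple scales to capture comprehensive cross-modal cues.
Develop an adaptive alignment mechanism that jointly aligns features across all scales.
Propose a re-ranking method to further refine the retrieval results.
Demonstrate effectiveness through extensive experiments on CUHK-PEDES.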

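Proposed Method

Introduce a staircase CNN backbone to extract full-scale visual representations with better locality.
Use a BERT with locality-constrained attention to obtain multi-scale text representations.
Propose contextual non-local attention to jointly align images and text across all scales.
Apply a cross-scale alignment loss (CSAL) that combines image-to-text and text-to-image similarities.
Introduce re-ranking by visual neighbors (RVN) to improve the final ranking.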
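The idea of aligning across all scales at once, rather than pairing scale 1 with scale 1, scale 2 with scale 2, and so on, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the temperature value, the feature counts, and the random features are all assumptions made for illustration, and the actual attention and CSAL formulations in the paper may differ. The sketch only shows the general pattern in which every image feature (global, region, or patch) attends over every text feature (sentence, phrase, or word), and the two directions are averaged into one score.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_scale_attend(queries, keys, temperature=0.1):
    """Let each query attend over ALL keys of the other modality,
    regardless of which scale each key came from.

    queries: (Nq, D), keys: (Nk, D)
    Returns the attended context (Nq, D) and attention weights (Nq, Nk).
    The temperature is a hypothetical value, not taken from the paper.
    """
    scores = l2_normalize(queries) @ l2_normalize(keys).T
    weights = softmax(scores / temperature, axis=1)
    return weights @ keys, weights

def directed_similarity(queries, keys):
    # Mean cosine similarity between each query and its attended context.
    attended, _ = cross_scale_attend(queries, keys)
    cos = np.sum(l2_normalize(queries) * l2_normalize(attended), axis=1)
    return float(cos.mean())

def matching_score(image_feats, text_feats):
    # Average the image-to-text and text-to-image directions.
    i2t = directed_similarity(image_feats, text_feats)
    t2i = directed_similarity(text_feats, image_feats)
    return 0.5 * (i2t + t2i)

# Hypothetical full-scale features stacked into one matrix per modality:
# image: 1 global + 3 region + 6 patch; text: 1 sentence + 2 phrase + 5 word.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(10, 64))
text_feats = rng.normal(size=(8, 64))

attended, weights = cross_scale_attend(image_feats, text_feats)
score = matching_score(image_feats, text_feats)
```

Because the features of all scales sit in one matrix per modality, a patch-level image feature is free to match a sentence-level text feature if that pairing scores highest, which is the behavior that per-scale alignment rules out.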

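Experimental Results

Research Questions

RQ1: How can full-scale, multi-scale representations of images and text be extracted for text-based person search?
RQ2: Can adaptive cross-scale alignment improve matching beyond fixed-scale pairs?
RQ3: What is the effect of contextual non-local attention on cross-modal alignment across scales?
RQ4: Does re-ranking by visual neighbors further improve retrieval performance?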

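Key Results

NAFS with full-scale representations achieves 59.94% Top-1 and 79.86% Top-5 on CUHK-PEDES, outperforming several baselines.
NAFS with RVN further improves Top-1 to 61.50% and Top-5 to 81.19%.
Joint alignment across scales outperforms separate per-scale alignment (Top-1: 59.94% vs. 57.98%).
Introducing the staircase network, the locality-constrained BERT, and contextual non-local attention yields substantial gains over the baseline.
Experiments show that intermediate-scale information (scale 2) is beneficial in addition to the global and finest-scale features.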
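For readers unfamiliar with the Top-1 and Top-5 figures above, the following is a minimal sketch of how Top-k retrieval accuracy is commonly computed for text-to-image person search. It assumes the usual convention that a text query counts as a hit at k if any of its k highest-scoring gallery images shares the query's identity. The function name and the toy similarity matrix are illustrative and are not taken from the paper or its code.

```python
import numpy as np

def top_k_accuracy(similarity, query_ids, gallery_ids, ks=(1, 5, 10)):
    """similarity: (num_queries, num_gallery) text-to-image scores.
    Returns {k: fraction of queries with a correct identity in the top k}.
    """
    similarity = np.asarray(similarity)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)

    order = np.argsort(-similarity, axis=1)   # best match first
    ranked_ids = gallery_ids[order]           # identities in ranked order
    hits = ranked_ids == query_ids[:, None]   # True where identity matches
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}

# Toy example: 3 text queries against a gallery of 4 images.
sim = np.array([
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.8, 0.7, 0.1],
    [0.6, 0.2, 0.1, 0.5],
])
query_ids = [0, 1, 2]
gallery_ids = [0, 2, 1, 2]

scores = top_k_accuracy(sim, query_ids, gallery_ids, ks=(1, 2))
```

In this toy case only the first query ranks a correct image first, while all three have a correct image within the top two, so Top-1 is 1/3 and Top-2 is 1.0.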

This review was generated by AI and checked by a human editor.