QUICK REVIEW

[Paper Review] Person Search with Natural Language Description

Shuang Li, Tong Xiao|arXiv (Cornell University)|2017. 02. 19.

Multimodal Machine Learning Applications | References: 41 | Citations: 63

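One-Line Summary

This paper introduces CUHK-PEDES, a large-scale person search dataset annotated with natural language descriptions, and proposes a Recurrent Neural Network with Gated Neural Attention (GNA-RNN) that ranks person images by aligning the words of a sentence with visual units.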

ABSTRACT

Searching persons in large-scale image databases with the query of natural language description has important applications in video surveillance. Existing methods mainly focused on searching persons with image-based or attribute-based queries, which have major limitations for a practical usage. In this paper, we study the problem of person search with natural language description. Given the textual description of a person, the algorithm of the person search is required to rank all the samples in the person database then retrieve the most relevant sample corresponding to the queried description. Since there is no person dataset or benchmark with textual description available, we collect a large-scale person description dataset with detailed natural language annotations and person samples from various sources, termed as CUHK Person Description Dataset (CUHK-PEDES). A wide range of possible models and baselines have been evaluated and compared on the person search benchmark. A Recurrent Neural Network with Gated Neural Attention mechanism (GNA-RNN) is proposed to establish the state-of-the-art performance on person search.

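Motivation and Objectives

Enable practical person search from free-form language descriptions, without requiring a query image or predefined attributes.
Build a large-scale dataset (CUHK-PEDES) with rich natural language annotations for person images drawn from re-identification datasets.
Evaluate multiple baselines from the captioning, QA, and embedding paradigms on language-guided person search.
Propose gated neural attention (GNA-RNN) to learn word-image affinities for reliable retrieval.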

Proposed Method

Introduce CUHK-PEDES, with 40,206 images of 13,003 persons and 80,412 sentences describing their appearance.
Develop a visual sub-network that produces 512 visual units from a VGG-16-like backbone.
Use a language sub-network (LSTM) to generate unit-level attentions over visual units for each word.
Incorporate word-level gates to weight the importance of different words in the sentence.
Compute per-word affinity as a weighted sum of visual unit responses; aggregate over words for final affinity.
Train end-to-end with cross-entropy loss on positive/negative sentence-image pairs, using a 1:3 positive:negative ratio.
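
The affinity computation described in the steps above can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: it assumes the unit-level attentions are normalized with a softmax and the word-level gate is squashed with a sigmoid, and the function and argument names (sentence_image_affinity, attention_logits, gate_logits) are hypothetical. In the actual model, the logits would come from the LSTM language sub-network and the visual units from the CNN.

```python
import math


def softmax(scores):
    """Normalize a list of scores into attention weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a scalar into (0, 1); used here as the word-level gate (assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def sentence_image_affinity(visual_units, attention_logits, gate_logits):
    """Aggregate gated per-word affinities into one sentence-image score.

    visual_units:     N visual unit responses for one image
    attention_logits: T lists of N unnormalized attention scores, one list per word
    gate_logits:      T unnormalized gate values, one per word
    """
    affinity = 0.0
    for word_logits, gate_logit in zip(attention_logits, gate_logits):
        attention = softmax(word_logits)
        # Per-word affinity: attention-weighted sum of visual unit responses.
        word_affinity = sum(a * v for a, v in zip(attention, visual_units))
        # The word-level gate scales how much this word contributes.
        affinity += sigmoid(gate_logit) * word_affinity
    return affinity
```

For example, with two visual units [1.0, 0.0], one word with uniform attention logits [0.0, 0.0], and a gate logit of 0.0, the attention is [0.5, 0.5], the per-word affinity is 0.5, the gate is 0.5, and the final affinity is 0.25.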

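Experimental Results

Research Questions

RQ1: Can natural language descriptions outperform attribute-based queries in large-scale person search?
RQ2: Which dataset and model architecture best capture word-image relations for describing persons?
RQ3: Does the gated neural attention mechanism improve sentence-image affinity over baselines such as image captioning and visual-semantic embedding?
RQ4: How do word types and sentence length affect retrieval performance in language-guided person search?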

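Key Findings

CUHK-PEDES provides 40,206 images and 80,412 sentences, enabling a robust benchmark for language-driven person search.
GNA-RNN achieves state-of-the-art results on the proposed dataset, outperforming the captioning, QA, and embedding baselines in top-1 and top-10 accuracy.
Both unit-level attention and word-level gates contribute meaningfully to performance; removing either one degrades results.
Pre-training the visual backbone on person re-identification data substantially improves performance.
Among the unit counts tested, 512 visual units perform best.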
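
The top-1 and top-10 accuracy figures mentioned above can be illustrated with a small sketch. It assumes a query counts as a hit when at least one image of the described person appears among the k highest-affinity gallery images; the function name top_k_accuracy and its arguments are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(affinities, relevant, k):
    """Fraction of queries with at least one correct image in the top k.

    affinities: one list of scores per query, one score per gallery image
    relevant:   one set of correct gallery indices per query
    k:          how many top-ranked images to inspect
    """
    hits = 0
    for scores, correct in zip(affinities, relevant):
        # Rank gallery indices from highest to lowest affinity.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if any(index in correct for index in ranked[:k]):
            hits += 1
    return hits / len(affinities)


# Two toy queries over a three-image gallery.
toy_affinities = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.3]]
toy_relevant = [{0}, {2}]
top1 = top_k_accuracy(toy_affinities, toy_relevant, k=1)
top2 = top_k_accuracy(toy_affinities, toy_relevant, k=2)
```

In this toy case the first query ranks its correct image first, while the second ranks its correct image second, so top1 is 0.5 and top2 is 1.0.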


This review was generated by AI and reviewed by a human editor.