QUICK REVIEW

[Paper Review] A Richly Annotated Dataset for Pedestrian Attribute Recognition

Dangwei Li, Zhang Zhang | arXiv (Cornell University) | 2016-03-23

Video Surveillance and Tracking Methods | References: 22 | Citations: 180

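One-Line Summary

This paper introduces the RAP dataset, which contains 41,585 pedestrian samples annotated with 72 attributes plus viewpoint, occlusion, and body-part information, and uses multi-label benchmarks and evaluation metrics to analyze how environmental factors affect attribute recognition.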

ABSTRACT

In this paper, we aim to improve the dataset foundation for pedestrian attribute recognition in real surveillance scenarios. Recognition of human attributes, such as gender, and clothes types, has great prospects in real applications. However, the development of suitable benchmark datasets for attribute recognition remains lagged behind. Existing human attribute datasets are collected from various sources or an integration of pedestrian re-identification datasets. Such heterogeneous collection poses a big challenge on developing high quality fine-grained attribute recognition algorithms. Furthermore, human attribute recognition are generally severely affected by environmental or contextual factors, such as viewpoints, occlusions and body parts, while existing attribute datasets barely care about them. To tackle these problems, we build a Richly Annotated Pedestrian (RAP) dataset from real multi-camera surveillance scenarios with long term collection, where data samples are annotated with not only fine-grained human attributes but also environmental and contextual factors. RAP has in total 41,585 pedestrian samples, each of which is annotated with 72 attributes as well as viewpoints, occlusions, body parts information. To our knowledge, the RAP dataset is the largest pedestrian attribute dataset, which is expected to greatly promote the study of large-scale attribute recognition systems. Furthermore, we empirically analyze the effects of different environmental and contextual factors on pedestrian attribute recognition. Experimental results demonstrate that viewpoints, occlusions and body parts information could assist attribute recognition a lot in real applications.

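Motivation and Objectives

Build a large-scale, richly annotated pedestrian attribute dataset from real surveillance scenes.
Annotate each sample with 72 fine-grained attributes and contextual factors (viewpoint, occlusion, body parts).
Evaluate baseline and multi-label models to understand how context affects attribute recognition.
Introduce multi-label evaluation metrics that better reflect dependencies among attributes in real-world scenarios.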

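Proposed Method

Collect real surveillance footage from 26 camera scenes over three months.
Annotate 41,585 pedestrian samples with 72 attributes and contextual factors (viewpoint, occlusion, body parts).
Evaluate baselines that pair SVMs with ELF and CNN features (FC6/FC7), along with two multi-label CNN models (ACN, DeepMAR).
Use two feature types (ELF and CaffeNet-based CNN features) to compare single-attribute learning against joint multi-attribute learning.
Propose and apply example-based multi-label evaluation metrics (accuracy, precision, recall, and F1) alongside the traditional mean accuracy (mA).
Investigate how body parts affect attribute recognition by analyzing the head-shoulder, upper-body, and lower-body regions.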
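The two families of metrics named above can be sketched in a few lines. This is a minimal illustration, assuming the commonly used definitions: mA averages, per attribute, the recall on positive and on negative samples; the example-based metrics compare the predicted and ground-truth attribute sets of each sample. The function names, the 0/1 list-of-lists input format, and the convention of scoring an empty denominator as 0 are assumptions made for this sketch, not details taken from the paper.

```python
from typing import Dict, List


def mean_accuracy(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Label-based mA: per attribute, average the positive and negative
    recall, then average over attributes. Empty classes score 0 (assumption)."""
    n_attrs = len(y_true[0])
    total = 0.0
    for j in range(n_attrs):
        pos = sum(1 for t in y_true if t[j] == 1)
        neg = len(y_true) - pos
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        tn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 0)
        pos_recall = tp / pos if pos else 0.0
        neg_recall = tn / neg if neg else 0.0
        total += 0.5 * (pos_recall + neg_recall)
    return total / n_attrs


def example_based_metrics(
    y_true: List[List[int]], y_pred: List[List[int]]
) -> Dict[str, float]:
    """Example-based accuracy, precision, recall, and F1, each averaged
    over samples; F1 is computed from the averaged precision and recall."""
    n = len(y_true)
    acc = prec = rec = 0.0
    for t, p in zip(y_true, y_pred):
        inter = sum(1 for a, b in zip(t, p) if a == 1 and b == 1)
        union = sum(1 for a, b in zip(t, p) if a == 1 or b == 1)
        n_true, n_pred = sum(t), sum(p)
        acc += inter / union if union else 0.0
        prec += inter / n_pred if n_pred else 0.0
        rec += inter / n_true if n_true else 0.0
    acc, prec, rec = acc / n, prec / n, rec / n
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec, "f1": f1}


# Toy data: 2 samples, 3 binary attributes (values are made up).
toy_true = [[1, 0, 1], [0, 1, 0]]
toy_pred = [[1, 0, 0], [0, 1, 1]]
print(mean_accuracy(toy_true, toy_pred))
print(example_based_metrics(toy_true, toy_pred))
```

The label-based score treats every attribute independently, whereas the example-based scores reward predictions that get a sample's whole attribute set right, which is why the latter are better suited to exposing dependencies among attributes.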

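Experimental Results

Research Questions

RQ1: How do viewpoints, occlusions, and body-part visibility affect pedestrian attribute recognition performance?
RQ2: Do multi-label learning approaches (ACN, DeepMAR) outperform single-attribute classifiers on RAP?
RQ3: Do part-based representations help attribute recognition under real surveillance conditions?
RQ4: Which evaluation metrics best capture dependencies among multiple attributes in this setting?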

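Key Findings

To the authors' knowledge, RAP was the largest pedestrian attribute dataset at the time of publication, with 41,585 samples and 72 attributes plus contextual annotations.
Viewpoint, occlusion, and body-part information have a substantial effect on attribute recognition performance.
CNN-based features (FC6/FC7) generally outperform ELF features on this task, and FC6 generalizes well.
Example-based (multi-label) evaluation reveals meaningful dependencies among attributes and shows that joint multi-attribute learning has a substantial advantage over single-attribute SVM approaches.
The part-based analysis suggests that attributes tied to a specific body region benefit from head-shoulder, upper-body, or lower-body features, and that including parts can improve recognition.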
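One simple way to examine how a contextual factor such as viewpoint relates to recognition quality is to split the test samples by that factor and score each group separately. The sketch below does this for a single binary attribute. It is only an illustration: the function name, the viewpoint labels, and the toy predictions are hypothetical and do not reproduce the paper's protocol or results.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_group(
    groups: List[str], y_true: List[int], y_pred: List[int]
) -> Dict[str, float]:
    """Accuracy of one binary attribute, computed separately for each
    value of a contextual factor (for example, the viewpoint tag)."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for g, t, p in zip(groups, y_true, y_pred):
        total[g] += 1
        correct[g] += int(t == p)
    return {g: correct[g] / total[g] for g in total}


# Hypothetical viewpoint tags and predictions for five samples.
viewpoints = ["front", "front", "back", "back", "back"]
labels = [1, 0, 1, 1, 0]
preds = [1, 0, 0, 1, 1]
print(accuracy_by_group(viewpoints, labels, preds))
```

The same grouping works for occlusion type or part visibility by swapping the group tags.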

This review was generated by AI and checked by a human editor.