QUICK REVIEW

[Paper Review] CrowdHuman: A Benchmark for Detecting Human in a Crowd

Shuai Shao, Zijian Zhao|arXiv (Cornell University)|Apr 30, 2018

Video Surveillance and Tracking Methods|References: 23|Citations: 495

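One-Sentence Summary

This paper introduces CrowdHuman, a large, richly annotated dataset for detecting people in crowded scenes. It contains 470k instances, about 22.6 people per image, and three kinds of bounding box per person, and it generalizes strongly to other datasets when used for pretraining.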

ABSTRACT

Human detection has witnessed impressive progress in recent years. However, the occlusion issue of detecting human in highly crowded environments is far from solved. To make matters worse, crowd scenarios are still under-represented in current human detection benchmarks. In this paper, we introduce a new dataset, called CrowdHuman, to better evaluate detectors in crowd scenarios. The CrowdHuman dataset is large, rich-annotated and contains high diversity. There are a total of $470K$ human instances from the train and validation subsets, and $~22.6$ persons per image, with various kinds of occlusions in the dataset. Each human instance is annotated with a head bounding-box, human visible-region bounding-box and human full-body bounding-box. Baseline performance of state-of-the-art detection frameworks on CrowdHuman is presented. The cross-dataset generalization results of CrowdHuman dataset demonstrate state-of-the-art performance on previous dataset including Caltech-USA, CityPersons, and Brainwash without bells and whistles. We hope our dataset will serve as a solid baseline and help promote future research in human detection tasks.

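Motivation and Objectives

Address the problem of detecting people under heavy occlusion in crowded scenes.
Provide a large, diverse dataset that represents crowd occlusion better than existing benchmarks.
Provide three bounding boxes per person (head, visible region, full body) to support occlusion-aware detection.
Demonstrate cross-dataset generalization with CrowdHuman and show its usefulness as pretraining data for other benchmarks.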

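Proposed Method

Collect and annotate diverse crowd scenes from web images (about 60k candidates, about 25k retained), then split them into 15k training, 4,370 validation, and 5,000 test images.
Annotate each person with full-body, visible-region, and head bounding boxes, and double-check annotation quality.
Report detailed statistics on density, occlusion, and pairwise and triple overlaps to characterize crowd difficulty.
Evaluate baseline detectors (FPN and Faster R-CNN, plus RetinaNet) with the mMR and AP metrics, adapting anchor aspect ratios to the full-body, visible-region, and head tasks.
Pretrain on CrowdHuman, then fine-tune on Caltech, CityPersons, COCOPersons, and Brainwash to evaluate generalization.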
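The three-box annotation and the pairwise overlap statistics listed above can be sketched in Python. This is a minimal illustration under stated assumptions, not the dataset's file format or the authors' tooling: the `PersonAnnotation` class, the `(x1, y1, x2, y2)` box layout, the 0.5 IoU threshold, and the helper names are all hypothetical.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

# Assumed box layout for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class PersonAnnotation:
    """One person with the three boxes the review describes (hypothetical schema)."""
    head: Box
    visible: Box
    full: Box


def area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def count_overlapping_pairs(people: List[PersonAnnotation],
                            threshold: float = 0.5) -> int:
    """Count pairs of people whose full-body boxes overlap above the threshold."""
    return sum(1 for p, q in combinations(people, 2)
               if iou(p.full, q.full) > threshold)


def visible_fraction(person: PersonAnnotation) -> float:
    """Visible-box area divided by full-body-box area, a rough occlusion proxy."""
    full_area = area(person.full)
    return area(person.visible) / full_area if full_area > 0 else 0.0


if __name__ == "__main__":
    people = [
        PersonAnnotation(head=(3, 0, 7, 4), visible=(0, 0, 10, 10), full=(0, 0, 10, 20)),
        PersonAnnotation(head=(5, 0, 9, 4), visible=(2, 0, 12, 20), full=(2, 0, 12, 20)),
        PersonAnnotation(head=(103, 0, 107, 4), visible=(100, 0, 110, 20), full=(100, 0, 110, 20)),
    ]
    print("overlapping pairs:", count_overlapping_pairs(people))
    print("visible fraction of first person:", visible_fraction(people[0]))
```

Counting pairs at several thresholds per image would give a density-of-overlap profile of the kind the review mentions. The visible-to-full area ratio is only a coarse stand-in for occlusion, since two boxes of similar area can still cover different regions.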

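Experimental Results

Research Questions

RQ1: How does performance on CrowdHuman compare with existing datasets of crowded scenarios?
RQ2: Can CrowdHuman serve as effective pretraining data that improves detection performance on Caltech, CityPersons, COCOPersons, and Brainwash?
RQ3: What advantages do the three bounding-box annotations (head, visible region, full body) offer for detection in crowds?
RQ4: How well do detectors pretrained on CrowdHuman generalize across pedestrian and head detection benchmarks?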

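Key Findings

CrowdHuman contains about 470k human instances across the train and validation subsets; the training set has about 15,000 images, averaging 22.6 people per image.
Each person is annotated with three bounding boxes: head, visible body, and full body.
The baseline detectors (FPN and RetinaNet) differ markedly in performance, with FPN generally outperforming RetinaNet on these tasks.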
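Cross-dataset pretraining on CrowdHuman is reported to help on Caltech (mMR 8.81 versus 10.08 for the Caltech baseline) and Brainwash (mMR 17.24 versus 19.77); for CityPersons the figures given are mMR 21.18 versus a quoted 14.81. mMR is a miss-rate metric, so lower is better.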
Pretraining on CrowdHuman and then fine-tuning on COCOPersons yields AP 85.02 and mMR 39.79, an improvement over training on COCOPersons alone (AP 83.83, mMR 41.89).
CityPersons results also improve when a CrowdHuman-pretrained model is fine-tuned on CityPersons (for example, mMR 10.67 after fine-tuning from CrowdHuman to CityPersons).
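For readers unfamiliar with the mMR figures quoted above, the sketch below shows one common way to compute a log-average miss rate: sample the miss rate at nine false-positives-per-image (FPPI) values spaced evenly in log space over [10^-2, 10^0] and take their geometric mean. It follows the convention popularized by the Caltech pedestrian benchmark; the function name, the input format, and the choice to use a miss rate of 1.0 when the curve has no point at or below a reference FPPI are assumptions of this example, not details confirmed by the review.

```python
import math
from typing import Sequence


def log_average_miss_rate(fppi: Sequence[float],
                          miss_rate: Sequence[float],
                          num_points: int = 9) -> float:
    """Geometric mean of the miss rate sampled at log-spaced FPPI references.

    `fppi` must be sorted in ascending order, with `miss_rate[i]` the miss
    rate reached at `fppi[i]` (hypothetical input format for this sketch).
    """
    if len(fppi) != len(miss_rate):
        raise ValueError("fppi and miss_rate must have the same length")

    # Reference FPPI values: 10^-2 ... 10^0, evenly spaced in log space.
    refs = [10 ** (-2 + 2 * i / (num_points - 1)) for i in range(num_points)]

    sampled = []
    for ref in refs:
        # Use the miss rate at the largest FPPI not exceeding the reference.
        # If the curve never gets that low, assume the worst case (1.0).
        mr = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr = m
            else:
                break
        # Clamp to avoid log(0) for a perfect detector.
        sampled.append(max(mr, 1e-10))

    return math.exp(sum(math.log(m) for m in sampled) / len(sampled))


if __name__ == "__main__":
    # Toy curve: miss rate falls as more false positives per image are allowed.
    fppi_curve = [0.005, 0.02, 0.1, 0.5, 1.0]
    mr_curve = [0.60, 0.45, 0.30, 0.20, 0.15]
    print("mMR (%):", 100 * log_average_miss_rate(fppi_curve, mr_curve))
```

Because the result is a geometric mean of miss rates, a detector with a flat curve scores exactly its own miss rate, and a lower value means fewer missed people at the same false-positive budget.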


This review was generated by AI and checked by a human editor.