QUICK REVIEW

[Paper Review] Locate, Size and Count: Accurately Resolving People in Dense Crowds via Detection

Deepak Babu Sam, Skand Vishwanath Peri|arXiv (Cornell University)|Jun 18, 2019

Video Surveillance and Tracking Methods|References: 46|Citations: 33

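One-line summary

This paper proposes LSC-CNN, a dense detection framework that localizes, sizes, and counts heads in dense crowds, and that outperforms density regression methods in both localization and counting.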

ABSTRACT

We introduce a detection framework for dense crowd counting and eliminate the need for the prevalent density regression paradigm. Typical counting models predict crowd density for an image as opposed to detecting every person. These regression methods, in general, fail to localize persons accurate enough for most applications other than counting. Hence, we adopt an architecture that locates every person in the crowd, sizes the spotted heads with bounding box and then counts them. Compared to normal object or face detectors, there exist certain unique challenges in designing such a detection system. Some of them are direct consequences of the huge diversity in dense crowds along with the need to predict boxes contiguously. We solve these issues and develop our LSC-CNN model, which can reliably detect heads of people across sparse to dense crowds. LSC-CNN employs a multi-column architecture with top-down feedback processing to better resolve persons and produce refined predictions at multiple resolutions. Interestingly, the proposed training regime requires only point head annotation, but can estimate approximate size information of heads. We show that LSC-CNN not only has superior localization than existing density regressors, but outperforms in counting as well. The code for our approach is available at https://github.com/val-iisc/lsc-cnn.

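Motivation and objectives

Achieve accurate head localization in dense crowds, motivated by the limits of density regression.
Develop a single-stage dense detection framework that copes with extremely high densities and widely varying scales.
Localize heads through bounding-box predictions derived from point annotations.
Provide a training scheme that estimates bounding-box sizes without requiring explicit box annotations.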

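Proposed method

LSC-CNN is proposed as a dense detection architecture that predicts per-pixel class confidences for predefined head box sizes across multiple scales.
A multi-scale feature extractor based on a modified VGG-16 produces feature maps at 1/2, 1/4, 1/8, and 1/16 resolution.
Top-down Feature Modulators (TFMs) fuse the multi-scale features and supply the context needed for accurate localization.
Training uses pseudo ground truth derived from point annotations, with a per-pixel cross-entropy loss over the predefined box classes.
A Grid Winner-Take-All (GWTA) training loss concentrates learning on difficult regions and helps avoid poor local minima. Weighting is applied to balance scales and classes.
Head size is approximated by the nearest-neighbor distance, and pseudo ground-truth box bins are generated across scales.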
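The pseudo ground-truth step above can be illustrated with a minimal sketch. It assumes head annotations are (x, y) points, takes the distance to the nearest other point as the approximate head size, and snaps that size to the closest of a few predefined box sizes. The bin values in BOX_SIZES, the function names, and the fallback for a lone head are hypothetical choices for illustration; the review does not give the actual bins or edge-case handling used by LSC-CNN.

```python
import math

# Hypothetical box sizes in pixels; the review does not list the actual bins.
BOX_SIZES = [4, 8, 16, 32]


def nearest_neighbor_sizes(points):
    """Return, for each (x, y) head point, the distance to its nearest other point."""
    sizes = []
    for i, (xi, yi) in enumerate(points):
        nearest = math.inf  # stays inf when the image has a single annotated head
        for j, (xj, yj) in enumerate(points):
            if i != j:
                nearest = min(nearest, math.hypot(xi - xj, yi - yj))
        sizes.append(nearest)
    return sizes


def assign_box_bins(sizes, box_sizes=BOX_SIZES):
    """Map each approximate head size to the index of the closest predefined box size."""
    bins = []
    for size in sizes:
        if math.isinf(size):
            # Assumption: a lone head gets the largest box, since no neighbor bounds it.
            bins.append(len(box_sizes) - 1)
        else:
            bins.append(min(range(len(box_sizes)), key=lambda k: abs(box_sizes[k] - size)))
    return bins


heads = [(10, 10), (13, 14), (40, 10)]  # toy point annotations
sizes = nearest_neighbor_sizes(heads)
bins = assign_box_bins(sizes)
print(sizes, bins)
```

In this toy example the two close heads fall into the smallest bin and the isolated head into the largest, which reflects the intuition that denser regions contain smaller heads.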

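Experimental results

Research questions

RQ1: Can dense crowd counting be effectively reformulated as a per-pixel head detection problem rather than density regression?
RQ2: How can multi-scale features and top-down context improve head localization and size estimation in extremely dense crowds?
RQ3: Can a head detection model be trained with point head annotations only, without bounding-box annotations?
RQ4: Does a per-pixel box classification approach trained on pseudo ground truth give accurate counts across density ranges?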

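Key findings

LSC-CNN achieves better localization than density regressor approaches.
The model provides head bounding boxes and delivers accurate counts across a range of crowd densities.
Top-down feature modulation helps resolve people at multiple scales and reduces false detections in crowded scenes.
Training with the GWTA loss and point-based pseudo supervision enables end-to-end learning without explicit box annotations.
The method detects at a higher resolution than typical face detectors, which makes it suitable for dense crowds.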
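The GWTA loss mentioned above can be read as a winner-take-all rule over a spatial grid. The following is a minimal sketch of that reading, assuming the per-pixel loss map is split into fixed-size cells and only the cell with the largest summed loss contributes to the update. The function name gwta_mask, the cell size, and the toy loss values are hypothetical, and the authors' implementation may differ in detail.

```python
def gwta_mask(loss_map, cell_h, cell_w):
    """Return a 0/1 mask selecting the grid cell with the highest summed loss.

    loss_map is a list of rows of non-negative per-pixel losses.
    Also returns the summed loss of the winning cell.
    """
    rows, cols = len(loss_map), len(loss_map[0])
    best_cell, best_sum = (0, 0), float("-inf")
    for top in range(0, rows, cell_h):
        for left in range(0, cols, cell_w):
            cell_sum = sum(
                loss_map[r][c]
                for r in range(top, min(top + cell_h, rows))
                for c in range(left, min(left + cell_w, cols))
            )
            if cell_sum > best_sum:
                best_cell, best_sum = (top, left), cell_sum
    top, left = best_cell
    # Only pixels inside the winning cell keep their loss; all others are zeroed.
    mask = [
        [1 if top <= r < top + cell_h and left <= c < left + cell_w else 0
         for c in range(cols)]
        for r in range(rows)
    ]
    return mask, best_sum


# Toy 4x4 per-pixel loss map; the top-right 2x2 cell is the hardest region.
toy_loss = [
    [0.1, 0.1, 0.9, 0.8],
    [0.1, 0.1, 0.7, 0.6],
    [0.2, 0.2, 0.1, 0.1],
    [0.2, 0.2, 0.1, 0.1],
]
mask, winner_loss = gwta_mask(toy_loss, cell_h=2, cell_w=2)
for row in mask:
    print(row)
```

Multiplying the per-pixel loss by this mask before averaging would restrict each training step to the hardest cell, which is one way of concentrating learning on difficult regions.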

This review was generated by AI and checked by a human editor.