QUICK REVIEW

[Paper Review] TransCrowd: weakly-supervised crowd counting with transformers

Dingkang Liang, Xiwu Chen | arXiv (Cornell University) | Apr 19, 2021

Video Surveillance and Tracking Methods | References: 41 | Citations: 25

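One-Line Summary

TransCrowd introduces a pure transformer approach to weakly-supervised crowd counting, reformulating the image-to-count problem as sequence-to-count and achieving state-of-the-art results among count-level methods. It compares two regression heads and shows that the variant using global average pooling converges faster.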

ABSTRACT

The mainstream crowd counting methods usually utilize the convolution neural network (CNN) to regress a density map, requiring point-level annotations. However, annotating each person with a point is an expensive and laborious process. During the testing phase, the point-level annotations are not considered to evaluate the counting accuracy, which means the point-level annotations are redundant. Hence, it is desirable to develop weakly-supervised counting methods that just rely on count-level annotations, a more economical way of labeling. Current weakly-supervised counting methods adopt the CNN to regress a total count of the crowd by an image-to-count paradigm. However, having limited receptive fields for context modeling is an intrinsic limitation of these weakly-supervised CNN-based methods. These methods thus cannot achieve satisfactory performance, with limited applications in the real world. The transformer is a popular sequence-to-sequence prediction model in natural language processing (NLP), which contains a global receptive field. In this paper, we propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from the perspective of sequence-to-count based on transformers. We observe that the proposed TransCrowd can effectively extract the semantic crowd information by using the self-attention mechanism of transformer. To the best of our knowledge, this is the first work to adopt a pure transformer for crowd counting research. Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all the weakly-supervised CNN-based counting methods and gains highly competitive counting performance compared with some popular fully-supervised counting methods.

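Motivation and Objectives

Advance count-level (weakly-supervised) crowd counting to reduce annotation effort relative to point-level density-map supervision.
Exploit the transformer's global receptive field to capture long-range crowd context for counting.
Propose two transformer-based counting architectures, TransCrowd-Token and TransCrowd-GAP, and compare their effectiveness.
Show that a pure transformer model can achieve counting accuracy competitive with, or better than, fully-supervised CNN-based methods on standard datasets.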

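Proposed Method

Convert the input image into a sequence of fixed-size patches and add position embeddings.
Apply a transformer encoder (12 layers; multi-head self-attention with residual connections) to obtain rich global representations of the image patches.
Introduce two regression-head designs: TransCrowd-Token uses a learnable regression token; TransCrowd-GAP applies global average pooling over the visual tokens before regression.
Train with an L1 loss to predict the total crowd count per image.
Pre-train on ImageNet and fine-tune on crowd counting datasets. Images are resized, and standard data augmentation is used during training.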
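
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the 32x32 image, the patch size of 8, and the embedding width of 8 are all assumptions chosen for readability, the weights are random and untrained, and the 12-layer transformer encoder is deliberately omitted so that only the data flow (patch sequence, embedding, two head styles, L1 loss) is visible.

```python
import numpy as np

def image_to_patch_sequence(image, patch):
    """Split an (H, W, C) image into N flattened patches of shape (N, patch*patch*C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (grid_row, grid_col, patch_row, patch_col, channel)
    return x.reshape(gh * gw, patch * patch * c)

def embed_patches(patches, w_embed, pos_embed):
    """Linearly project each flattened patch and add its position embedding."""
    return patches @ w_embed + pos_embed

def head_token(tokens, w_out, b_out):
    """Token-style head: regress the count from the extra token at index 0."""
    return float(tokens[0] @ w_out + b_out)

def head_gap(visual_tokens, w_out, b_out):
    """GAP-style head: average all visual tokens, then regress the count."""
    return float(visual_tokens.mean(axis=0) @ w_out + b_out)

def l1_loss(pred_count, true_count):
    """Absolute error between predicted and annotated total count."""
    return abs(pred_count - true_count)

# Toy usage with random, untrained weights (all sizes are illustrative assumptions).
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches = image_to_patch_sequence(image, patch=8)      # shape (16, 192)
dim = 8
w_embed = rng.normal(size=(patches.shape[1], dim))
pos_embed = rng.normal(size=(patches.shape[0], dim))
visual = embed_patches(patches, w_embed, pos_embed)    # shape (16, 8)

# Token variant: prepend a (here random) regression token to the sequence.
reg_token = rng.normal(size=(1, dim))
with_token = np.vstack([reg_token, visual])            # shape (17, 8)

# A real model would run the transformer encoder here; this sketch skips it,
# so the regression token has not yet attended to any image content.
w_out, b_out = rng.normal(size=dim), 0.0
loss_token = l1_loss(head_token(with_token, w_out, b_out), true_count=120.0)
loss_gap = l1_loss(head_gap(visual, w_out, b_out), true_count=120.0)
```

The only structural difference between the two heads in this sketch is where the regression input comes from: a single dedicated token versus the mean of all visual tokens. Only a scalar count per image is needed as the training target, which is what makes the supervision count-level.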

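Experimental Results

Research Questions

RQ1: Can a pure transformer-based network trained with only count-level supervision, without point-level density supervision, achieve competitive crowd counting performance?
RQ2: Does the regression-head design (regression token vs. globally pooled tokens) affect counting accuracy and convergence speed?
RQ3: How does TransCrowd compare with existing weakly-supervised and fully-supervised methods across different crowd densities on standard benchmarks?
RQ4: What qualitative differences appear in the attention maps of the two regression-head variants, and how do they relate to counting accuracy?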

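Key Findings

TransCrowd-GAP achieves higher counting accuracy and faster convergence than TransCrowd-Token on multiple datasets.
TransCrowd substantially outperforms existing weakly-supervised CNN-based methods and is highly competitive with fully-supervised methods.
On the JHU-Crowd++ test set, TransCrowd-GAP improves on CSRNet by a notable margin in both MAE and MSE, and it outperforms fully-supervised methods on some datasets.
Attention visualizations indicate that TransCrowd-GAP produces more reasonable attention maps than TransCrowd-Token, which contributes to its lower counting error.
The method performs strongly on the large-scale datasets NWPU-Crowd and JHU-Crowd++, probably owing to the transformer's global receptive field.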
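
For readers unfamiliar with the two metrics named in the findings, the sketch below shows how they are typically computed over a test set. It assumes the common crowd-counting convention in which "MSE" denotes the root of the mean squared count error; this review does not define the metrics, so treat that as an assumption. The function names and the sample counts are hypothetical.

```python
import math

def mae(pred_counts, true_counts):
    """Mean absolute error between predicted and annotated counts."""
    n = len(pred_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / n

def mse_crowd(pred_counts, true_counts):
    """Root of the mean squared count error (assumed crowd-counting 'MSE' convention)."""
    n = len(pred_counts)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred_counts, true_counts)) / n)

# Hypothetical per-image counts, for illustration only.
pred = [10.0, 20.0, 30.0]
true = [12.0, 20.0, 26.0]
mae_value = mae(pred, true)
mse_value = mse_crowd(pred, true)
```

MAE reflects average accuracy, while the squared term makes the second metric more sensitive to a few large errors, which is why the two are usually reported together.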


This review was generated by AI and checked by a human editor.