QUICK REVIEW

[Paper Review] TransCrowd: Weakly-Supervised Crowd Counting with Transformers

Dingkang Liang, Xiwu Chen | arXiv (Cornell University) | Apr 19, 2021

Video Surveillance and Tracking Methods | References: 41 | Cited by: 25

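One-Sentence Summary

TransCrowd introduces a pure Transformer approach to weakly-supervised crowd counting, reformulating the image-to-count problem as sequence-to-count and achieving state-of-the-art results among count-level methods.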

ABSTRACT

The mainstream crowd counting methods usually utilize the convolution neural network (CNN) to regress a density map, requiring point-level annotations. However, annotating each person with a point is an expensive and laborious process. During the testing phase, the point-level annotations are not considered to evaluate the counting accuracy, which means the point-level annotations are redundant. Hence, it is desirable to develop weakly-supervised counting methods that just rely on count-level annotations, a more economical way of labeling. Current weakly-supervised counting methods adopt the CNN to regress a total count of the crowd by an image-to-count paradigm. However, having limited receptive fields for context modeling is an intrinsic limitation of these weakly-supervised CNN-based methods. These methods thus cannot achieve satisfactory performance, with limited applications in the real world. The transformer is a popular sequence-to-sequence prediction model in natural language processing (NLP), which contains a global receptive field. In this paper, we propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from the perspective of sequence-to-count based on transformers. We observe that the proposed TransCrowd can effectively extract the semantic crowd information by using the self-attention mechanism of transformer. To the best of our knowledge, this is the first work to adopt a pure transformer for crowd counting research. Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all the weakly-supervised CNN-based counting methods and gains highly competitive counting performance compared with some popular fully-supervised counting methods.

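Motivation and Objectives

Advance count-level (weakly-supervised) crowd counting to reduce the annotation effort required by point-level density-map supervision.
Exploit the Transformer's global receptive field to capture long-range crowd context for counting.
Propose two Transformer-based counting architectures (TransCrowd-Token and TransCrowd-GAP) and compare their effectiveness.
Show that a pure Transformer model can reach counting accuracy on standard datasets that is comparable to, or even better than, that of some fully-supervised CNN methods.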

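Proposed Method

Convert the input image into a sequence of fixed-size patches and embed them together with positional information.
Apply a Transformer encoder (12 layers; multi-head self-attention with residual connections) to obtain rich global representations of the image patches.
Introduce two regression-head designs: TransCrowd-Token uses a learnable regression token; TransCrowd-GAP applies global average pooling over the visual tokens before regression.
Train with an L1 loss to predict the total crowd count of each image.
Pre-train on ImageNet and fine-tune on crowd counting datasets; resize images and train with standard data augmentation.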
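
The data flow of the steps above can be sketched in a few lines. This is a minimal pure-Python illustration of the sequence-to-count idea only, not the authors' implementation: the Transformer encoder is omitted (tokens pass through unchanged), the regression head is reduced to a single linear layer, and the function names, the 4x4 toy image, the patch size, the weights, and the target count are all hypothetical.

```python
from statistics import fmean


def image_to_patch_sequence(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([image[r][c]
                           for r in range(top, top + patch)
                           for c in range(left, left + patch)])
    return tokens


def linear_head(feature, weights, bias=0.0):
    """Toy regression head: one linear layer mapping a feature vector to a count."""
    return sum(f * w for f, w in zip(feature, weights)) + bias


def count_with_token(encoded, weights):
    """Token-style head: regress from the extra regression token at position 0."""
    return linear_head(encoded[0], weights)


def count_with_gap(encoded, weights):
    """GAP-style head: average the visual tokens feature-wise, then regress."""
    pooled = [fmean(column) for column in zip(*encoded)]
    return linear_head(pooled, weights)


def l1_loss(predicted, target):
    """L1 loss on per-image total counts, averaged over the batch."""
    return fmean(abs(p - t) for p, t in zip(predicted, target))


# Toy 4x4 "image" split into 2x2 patches: a sequence of 4 tokens of length 4.
image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
tokens = image_to_patch_sequence(image, patch=2)

# Assumption: the encoder is skipped, so the "encoded" tokens equal the inputs.
weights = [0.25, 0.25, 0.25, 0.25]         # hypothetical head weights
regression_token = [0.0, 0.0, 0.0, 0.0]    # stands in for a learnable token
token_count = count_with_token([regression_token] + tokens, weights)
gap_count = count_with_gap(tokens, weights)
loss = l1_loss([gap_count], [10.0])        # hypothetical ground-truth count
```

The only structural difference between the two heads in this sketch is where the count is read from: the Token variant prepends one extra token and regresses from it, while the GAP variant regresses from the mean of all visual tokens. In a real model the encoder would mix information across tokens before either head is applied.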

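Experimental Results

Research Questions

RQ1: Can a pure Transformer-based network trained with count-level supervision achieve competitive crowd counting performance without point-level density supervision?
RQ2: Does the regression-head design (a regression token vs. global average pooling over the tokens) affect counting accuracy and convergence speed?
RQ3: How does TransCrowd compare with existing weakly-supervised and fully-supervised methods on standard benchmarks and across different crowd densities?
RQ4: What qualitative differences appear in the attention maps of the two regression-head variants, and how do they relate to counting accuracy?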

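Key Findings

TransCrowd-GAP achieves higher counting accuracy and faster convergence than TransCrowd-Token on multiple datasets.
TransCrowd far outperforms existing weakly-supervised CNN-based methods and is highly competitive with fully-supervised methods.
On the JHU-Crowd++ test set, TransCrowd-GAP improves markedly over CSRNet in both MAE and MSE, and on some datasets it even surpasses certain fully-supervised methods.
Attention visualizations show that TransCrowd-GAP produces more reasonable attention maps than TransCrowd-Token, which helps reduce counting error.
The method performs well on large-scale datasets such as NWPU-Crowd and JHU-Crowd++, most likely owing to the Transformer's global receptive field.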
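
For readers unfamiliar with the MAE and MSE metrics mentioned above: in the crowd counting literature, "MSE" conventionally denotes the square root of the mean squared count error. A small sketch follows, assuming that convention; the counts are made up for illustration and are not results from the paper.

```python
from math import sqrt


def mae(predicted, ground_truth):
    """Mean absolute error between predicted and true per-image counts."""
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / len(ground_truth)


def mse(predicted, ground_truth):
    """Root of the mean squared count error (called "MSE" in crowd counting)."""
    squared = sum((p - g) ** 2 for p, g in zip(predicted, ground_truth))
    return sqrt(squared / len(ground_truth))


# Illustrative counts only (hypothetical, not taken from any benchmark).
predicted = [98.0, 205.0, 47.0, 1010.0]
ground_truth = [100.0, 200.0, 50.0, 1000.0]

mae_value = mae(predicted, ground_truth)
mse_value = mse(predicted, ground_truth)
```

Because the errors are squared before averaging, MSE penalizes large misses on dense scenes more heavily than MAE does, which is why the two metrics are usually reported together.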


This review was generated by AI and reviewed by a human editor.