QUICK REVIEW

[Paper Review] CounTR: Transformer-based Generalised Visual Counting

Chang Liu, Yujie Zhong|arXiv (Cornell University)|Aug 29, 2022

Video Surveillance and Tracking Methods | Cited by 31

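One-Sentence Summary

CounTR introduces a Transformer-based architecture for open-world visual counting, from zero-shot to few-shot, that uses exemplar-guided attention and self-supervised pre-training to reach state-of-the-art results on FSC-147.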

ABSTRACT

In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using arbitrary number of "exemplars", i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed as Counting Transformer (CounTR), which explicitly capture the similarity between image patches or with given "exemplars" with the attention mechanism; (2) We adopt a two-stage training regime, that first pre-trains the model with self-supervised learning, and followed by supervised fine-tuning; (3) We propose a simple, scalable pipeline for synthesizing training images with a large number of instances or that from different semantic categories, explicitly forcing the model to make use of the given "exemplars"; (4) We conduct thorough ablation studies on the large-scale counting benchmark, e.g. FSC-147, and demonstrate state-of-the-art performance on both zero and few-shot settings.

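Research Motivation and Objectives

Advance open-world visual object counting that supports arbitrary semantic categories and a variable number of exemplars (from zero-shot to few-shot)
Develop a Transformer-based Counting TRansformer (CounTR) that uses attention to compare image regions with one another and with exemplars
Propose a two-stage training scheme: self-supervised pre-training with a masked autoencoder (MAE) first, then supervised fine-tuning for counting
Introduce a scalable mosaic data synthesis pipeline to mitigate the long-tailed distribution of object counts and strengthen exemplar conditioning
Demonstrate state-of-the-art performance on FSC-147 in both zero-shot and few-shot settings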
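The mosaic synthesis objective can be illustrated with a minimal sketch. It assumes a 2x2 collage of four equally sized tiles and a ground-truth density that keeps only the tiles matching the exemplar's category, so the model cannot simply count every salient object. The function names, the tile layout, and the toy data are hypothetical, and the blending step is omitted.

```python
def zeros_like(grid):
    """Return a grid of zeros with the same shape as `grid` (a list of rows)."""
    return [[0.0] * len(row) for row in grid]


def mosaic_target(tiles, target_category):
    """Build the target density map for a 2x2 collage.

    tiles: four (density_grid, category) pairs, ordered
           top-left, top-right, bottom-left, bottom-right.
    Tiles from other categories contribute zero density (an assumption
    about how exemplar conditioning is enforced).
    """
    kept = [d if c == target_category else zeros_like(d) for d, c in tiles]
    top = [a + b for a, b in zip(kept[0], kept[1])]      # join rows side by side
    bottom = [a + b for a, b in zip(kept[2], kept[3])]
    return top + bottom                                   # stack the two halves


# Toy 2x2 density grids: two apples in one tile, one bird in the other.
apple = [[1.0, 0.0], [0.0, 1.0]]
bird = [[0.0, 1.0], [0.0, 0.0]]
tiles = [(apple, "apple"), (bird, "bird"), (bird, "bird"), (apple, "apple")]

target = mosaic_target(tiles, "apple")
target_count = sum(sum(row) for row in target)
print(target_count)  # 4.0: only the two apple tiles are counted
```

The same collage with `target_category="bird"` would give a count of 2.0, which is how one synthesized image can supervise different exemplar choices.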

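Proposed Method

CounTR: a Transformer-based architecture in which a ViT image encoder outputs feature tokens, while exemplar features are encoded separately and used for cross-attention in the feature interaction module (FIM)
The FIM uses decoder-style Transformer layers that apply self-attention and cross-attention between image patches and exemplar representations, producing fused features for density estimation
A progressive decoder upsamples the FIM output into a 2D density map; the final count is the sum of that density map
Two-stage training: the ViT encoder is first pre-trained with self-supervised MAE (image reconstruction), then fine-tuned with supervision for counting
A scalable mosaic data generation pipeline (collage and blending) creates images with many instances and diverse backgrounds, addressing the long-tailed distribution
Test-time normalization and cropping strategies calibrate predictions and handle very small objects or exemplar placement issues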
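Two pieces of the method above lend themselves to a small sketch: the cross-attention at the core of the FIM, and the rule that the count is the sum of the density map. The code below is a single-head, projection-free illustration in plain Python. It assumes image tokens act as queries and exemplar tokens as keys and values; the summary does not state the direction. The learned projections, the self-attention layers, and the upsampling decoder are all omitted, and the toy tokens and density map are made up.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query attends over all keys."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[d] for w, v in zip(weights, values))
                      for d in range(dim)])
    return fused


def count_from_density(density_map):
    """The predicted count is the sum of all density values."""
    return sum(sum(row) for row in density_map)


# Hypothetical 2-d tokens: four image patches and two exemplars.
image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
exemplar_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(image_tokens, exemplar_tokens, exemplar_tokens)
print(fused[0])  # leans towards the first exemplar, which it resembles

# Hypothetical decoder output for a small image containing two objects.
density_map = [[0.0, 0.5, 0.5], [0.0, 1.0, 0.0]]
print(count_from_density(density_map))  # 2.0
```

A patch that resembles neither exemplar (the all-zero token here) attends to both equally, so its fused feature is their plain average.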

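Experimental Results

Research Questions

RQ1: Can a Transformer-based model generalise counting to arbitrary object categories given zero or a few exemplars?
RQ2: Does self-supervised pre-training improve counting performance in zero-shot and few-shot settings?
RQ3: Can synthetic mosaics of training data mitigate the long-tailed distribution and improve counting in scenes with many instances?
RQ4: Which inference-time strategies effectively calibrate exemplar-guided density outputs?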

Key Findings

Method	Year	Backbone	Exemplars	Val MAE	Val RMSE	Test MAE	Test RMSE
RepRPN-C	2022	ConvNets	0	31.69	100.31	28.32	128.76
RCC	2022	Pre-trained ViT	0	20.39	64.62	21.64	103.47
CounTR (ours)	2022	ViT	0	18.07	71.84	14.71	106.87
FR	2019	ConvNets	3	45.45	112.53	41.64	141.04
FSOD	2020	ConvNets	3	36.36	115.00	32.53	140.65
P-GMN	2018	ConvNets	3	60.56	137.78	62.69	159.67
GMN	2018	ConvNets	3	29.66	89.81	26.52	124.57
MAML	2017	ConvNets	3	25.54	79.44	24.90	112.68
FamNet	2021	ConvNets	3	23.75	69.07	22.08	99.54
BMNet+	2022	ConvNets	3	15.74	58.53	14.62	91.83
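For reference, the two metrics in the table are the mean absolute error (MAE) and root mean squared error (RMSE) between predicted and ground-truth counts per image. The sketch below uses made-up counts, not FSC-147 data, to show why RMSE can be much larger than MAE when a few dense images are badly miscounted.

```python
import math


def mae(pred_counts, true_counts):
    """Mean absolute error over per-image counts."""
    errors = [abs(p - t) for p, t in zip(pred_counts, true_counts)]
    return sum(errors) / len(errors)


def rmse(pred_counts, true_counts):
    """Root mean squared error; large per-image errors dominate."""
    squared = [(p - t) ** 2 for p, t in zip(pred_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical counts for four images, one of them very dense.
true_counts = [10, 25, 8, 300]
pred_counts = [12, 22, 8, 260]

mae_value = mae(pred_counts, true_counts)
rmse_value = rmse(pred_counts, true_counts)
print(mae_value)   # 11.25
print(rmse_value)  # about 20.08, driven by the error of 40 on the dense image
```

Because squaring amplifies the largest errors, a method can rank first on MAE and still trail on RMSE if it struggles on a handful of high-count images.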

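CounTR reports state-of-the-art results on FSC-147 in both zero-shot and few-shot settings. In the zero-shot rows of the table it has the lowest MAE (validation 18.07, test 14.71), while RCC has the lower RMSE (64.62 vs. 71.84 on validation, 103.47 vs. 106.87 on test)
Self-supervised masked autoencoder (MAE) pre-training markedly outperforms supervised fine-tuning alone
Mosaic data synthesis further improves results, especially on images with many instances
Test-time normalization and cropping further improve counting accuracy, particularly in the few-shot setting
CounTR is robust to the number of exemplars: results differ little between 1, 2, and 3 exemplars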
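The summary does not spell out the test-time normalization rule, so the following is only one plausible form. It assumes that each exemplar box should contain roughly one object's worth of density, and that when the average density inside the boxes exceeds a threshold, the total is divided by that average. The function names, the box format, and the threshold value of 1.8 are illustrative assumptions.

```python
def box_sum(density, box):
    """Sum density inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return sum(sum(row[left:right]) for row in density[top:bottom])


def normalized_count(density, exemplar_boxes, threshold=1.8):
    """Rescale the raw count if exemplars are systematically over-counted.

    threshold is an illustrative assumption, not a value from the summary.
    """
    total = sum(sum(row) for row in density)
    if not exemplar_boxes:          # zero-shot: nothing to calibrate against
        return total
    avg = sum(box_sum(density, b) for b in exemplar_boxes) / len(exemplar_boxes)
    return total / avg if avg > threshold else total


# Toy density map where each of four objects was counted twice.
density = [[2.0, 2.0, 0.0, 2.0],
           [0.0, 0.0, 2.0, 0.0]]
boxes = [(0, 0, 1, 1), (0, 1, 1, 2)]   # two exemplar boxes, one object each

raw = normalized_count(density, [])
calibrated = normalized_count(density, boxes)
print(raw)         # 8.0
print(calibrated)  # 4.0
```

A rule of this kind only applies when exemplar boxes are available, which is consistent with the gain being reported mainly for the few-shot setting.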


This review was generated by AI and checked by a human editor.