QUICK REVIEW

[Paper Review] ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Yufei Xu, Jing Zhang|arXiv (Cornell University)|Apr 26, 2022

Human Pose and Action Recognition|Cited by 349

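One-Sentence Summary

ViTPose shows that a plain vision transformer paired with a lightweight decoder is a strong, scalable baseline for human pose estimation: it achieves state-of-the-art results on MS COCO and supports flexible training and transfer learning.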

ABSTRACT

Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given person instance and a lightweight decoder for pose estimation. It can be scaled up from 100M to 1B parameters by taking the advantages of the scalable model capacity and high parallelism of transformers, setting a new Pareto front between throughput and performance. Besides, ViTPose is very flexible regarding the attention type, input resolution, pre-training and finetuning strategy, as well as dealing with multiple pose tasks. We also empirically demonstrate that the knowledge of large ViTPose models can be easily transferred to small ones via a simple knowledge token. Experimental results show that our basic ViTPose model outperforms representative methods on the challenging MS COCO Keypoint Detection benchmark, while the largest model sets a new state-of-the-art. The code and models are available at https://github.com/ViTAE-Transformer/ViTPose.

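Motivation and Objectives

Investigate plain vision transformers for pose estimation without relying on specialized, domain-specific backbones.
Present the simple yet effective ViTPose architecture, which pairs a plain backbone with a lightweight decoder.
Demonstrate ViTPose's scalability, training flexibility, and transferability across datasets and pre-training schemes.
Establish a strong performance baseline on the MS COCO Keypoint dataset and analyze the trade-offs among model size, speed, and accuracy.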

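Proposed Method

Use a plain, non-hierarchical Vision Transformer backbone, pre-trained with masked image modeling (MAE), to extract features for each person instance.
Attach a lightweight decoder that upsamples the features and regresses keypoint heatmaps. Two decoder options are provided: a classic decoder with two deconvolution blocks, or a simpler one that uses upsampling followed by a 3x3 convolution.
Explore scalability by varying the backbone size (ViT-B/L/H and ViTAE-G) and the feature dimension.
Study data flexibility by MAE pre-training on ImageNet-1K or on pose-specific data (COCO, AI Challenger).
Compare attention types (full, window, shifted-window, and pooling-based attention) to balance accuracy against memory usage.
Demonstrate knowledge transfer from large models to small ones through output distillation and a novel token-based distillation method.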
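
To make the backbone-plus-decoder pipeline above more concrete, the following is a minimal sketch of the shape bookkeeping and of reading a keypoint off a predicted heatmap. It is not the authors' implementation. The 256x192 crop, the patch size of 16, the 4x decoder upsampling, and the cell-centre coordinate convention are illustrative assumptions that this summary does not state, and the function names are hypothetical.

```python
from typing import List, Tuple


def feature_grid(height: int, width: int, patch: int) -> Tuple[int, int]:
    """Token grid (rows, cols) a plain ViT produces for a person crop."""
    if height % patch or width % patch:
        raise ValueError("crop size must be divisible by the patch size")
    return height // patch, width // patch


def heatmap_size(grid: Tuple[int, int], upsample: int) -> Tuple[int, int]:
    """Heatmap size (rows, cols) after the decoder upsamples the token grid."""
    return grid[0] * upsample, grid[1] * upsample


def decode_keypoint(
    heatmap: List[List[float]], stride: float
) -> Tuple[float, float, float]:
    """Return (x, y, score) in crop pixels for the peak of one heatmap.

    Assumption: a heatmap cell maps back to the centre of the
    stride x stride pixel block it covers.
    """
    best_score = float("-inf")
    best_row = best_col = 0
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best_score:
                best_score, best_row, best_col = value, r, c
    return (best_col + 0.5) * stride, (best_row + 0.5) * stride, best_score


# Assumed settings: 256x192 crop, 16x16 patches, 4x upsampling in the decoder.
grid = feature_grid(256, 192, 16)       # 16 x 12 token grid
hm_rows, hm_cols = heatmap_size(grid, 4)  # 64 x 48 heatmap per keypoint
stride = 16 / 4                          # crop pixels per heatmap cell

# Toy heatmap with a single peak at row 2, column 3.
toy = [[0.0] * hm_cols for _ in range(hm_rows)]
toy[2][3] = 1.0
x, y, score = decode_keypoint(toy, stride)
```

Under these assumptions the backbone sees 192 tokens per crop and the decoder emits one 64x48 heatmap per keypoint. Either decoder option in the list would produce the same output size, since both only need to upsample the token grid by the same factor.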

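Experimental Results

Research Questions

RQ1: Can a plain vision transformer backbone with a lightweight decoder, without any CNN-based backbone, achieve competitive or even state-of-the-art pose estimation on COCO?
RQ2: How do model size, input resolution, and attention mechanism affect ViTPose's accuracy and throughput?
RQ3: How do pre-training data and fine-tuning strategy affect ViTPose's pose estimation performance?
RQ4: Can token-based distillation effectively transfer knowledge from large ViTPose models to small ones?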

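Key Findings

The largest model, ViTPose-G, reaches 80.9 AP on MS COCO test-dev when trained on MS COCO + AI Challenger data.
ViTPose scales well: performance improves consistently as the model grows (ViT-B to ViT-H to ViTAE-G).
With a strong vision transformer backbone, the simple decoder matches the more complex one (AP drops by less than 0.3).
Pre-training on downstream pose data (COCO + AI Challenger) is as effective as ImageNet-1K pre-training, typically with comparable or even better data efficiency.
Token-based distillation yields a measurable gain when transferring knowledge from large to small ViTPose models (for example, 0.2–0.5 AP).
Multi-dataset training brings further improvement (for example, ViTPose-B rises from 75.8 to 77.1 AP).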


This review was generated by AI and reviewed by human editors.