QUICK REVIEW

[Paper Explained] SMPLer-X: Scaling Up Expressive Human Pose and Shape Estimation

Zhongang Cai, Wanqi Yin|arXiv (Cornell University)|Sep 29, 2023
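Human Pose and Action Recognition | Cited by 29
One-Sentence Summary

This paper develops SMPLer-X, a generalist foundation model for expressive human pose and shape estimation (EHPS). With ViT backbones and up to 4.5M training instances drawn from 32 diverse EHPS datasets, it achieves strong cross-domain performance and state-of-the-art results on multiple benchmarks.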

ABSTRACT

Expressive human pose and shape estimation (EHPS) unifies body, hands, and face motion capture with numerous applications. Despite encouraging progress, current state-of-the-art methods still depend largely on a confined set of training datasets. In this work, we investigate scaling up EHPS towards the first generalist foundation model (dubbed SMPLer-X), with up to ViT-Huge as the backbone and training with up to 4.5M instances from diverse data sources. With big data and the large model, SMPLer-X exhibits strong performance across diverse test benchmarks and excellent transferability to even unseen environments. 1) For the data scaling, we perform a systematic investigation on 32 EHPS datasets, including a wide range of scenarios that a model trained on any single dataset cannot handle. More importantly, capitalizing on insights obtained from the extensive benchmarking process, we optimize our training scheme and select datasets that lead to a significant leap in EHPS capabilities. 2) For the model scaling, we take advantage of vision transformers to study the scaling law of model sizes in EHPS. Moreover, our finetuning strategy turns SMPLer-X into specialist models, allowing them to achieve further performance boosts. Notably, our foundation model SMPLer-X consistently delivers state-of-the-art results on seven benchmarks such as AGORA (107.2 mm NMVE), UBody (57.4 mm PVE), EgoBody (63.6 mm PVE), and EHF (62.3 mm PVE without finetuning). Homepage: https://caizhongang.github.io/projects/SMPLer-X/
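
The abstract reports results in millimetres of PVE (per-vertex error) and NMVE. As a minimal sketch of what such numbers measure, the snippet below averages the Euclidean distance between predicted and ground-truth mesh vertices. It assumes vertices are given in metres and are already aligned (benchmarks typically align the meshes at a root joint first, which is skipped here). The NMVE helper assumes the AGORA convention of dividing the mean vertex error by the detection F1 score. Function names are illustrative and not taken from the SMPLer-X code.

```python
import math
from typing import Sequence, Tuple

Vertex = Tuple[float, float, float]


def per_vertex_error_mm(pred: Sequence[Vertex], gt: Sequence[Vertex]) -> float:
    """Mean Euclidean distance between corresponding vertices, in millimetres.

    Assumes both meshes are in metres, share the same vertex order,
    and have already been aligned (no root alignment is done here).
    """
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("pred and gt must be non-empty and the same length")
    total = sum(math.dist(p, g) for p, g in zip(pred, gt))
    return 1000.0 * total / len(pred)  # metres -> millimetres


def normalized_error(mean_vertex_error_mm: float, f1_score: float) -> float:
    """Assumed AGORA-style normalization: penalize missed detections via F1."""
    if not 0.0 < f1_score <= 1.0:
        raise ValueError("f1_score must be in (0, 1]")
    return mean_vertex_error_mm / f1_score


# Toy two-vertex "mesh": the vertices are off by 3 mm and 4 mm respectively.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.003), (1.0, 0.004, 0.0)]
pve = per_vertex_error_mm(pred, gt)
```

Lower is better for both quantities; the normalized variant grows as the detector misses more people, so it rewards models that are both accurate and reliable at detection.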

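Motivation and Objectives

  • Scale up EHPS to improve generalization across diverse scenarios.
  • Systematically benchmark 32 EHPS datasets to understand data utility and domain gaps.
  • Study how model scaling with Vision Transformer backbones affects EHPS.
  • Show that finetuning SMPLer-X into specialist models yields benchmark-specific gains.
  • Provide dataset guidance and transferability insights for robust EHPS foundation models.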

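Proposed Method

  • Use a ViT-based backbone as the feature extractor.
  • Adopt a simple three-part architecture: a backbone, a neck that crops hand and face ROIs, and regression heads for the individual body parts.
  • Train on a diverse collection of 32 EHPS datasets, with data selection guided by benchmarking insights.
  • Analyze dataset attributes (scale, scene diversity, real vs. synthetic, annotation type) and their effect on generalization.
  • Finetune the generalist model into domain-specific specialist models to improve benchmark performance.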
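
The three-part data flow (backbone, ROI-cropping neck, per-part regression heads) can be illustrated with a toy sketch in plain Python. Everything here is a stand-in: the feature map is a single-channel grid, the "backbone" is an identity function, the hand and face boxes are fixed by hand rather than predicted, and the heads only repeat a pooled value. The output sizes assume SMPL-X parameters in axis-angle form (63 for body pose, 45 per hand, 3 for jaw pose plus 10 expression coefficients for the face), which this summary does not state. It is not the SMPLer-X implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Grid = List[List[float]]         # toy single-channel feature map, H x W
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive

# Assumed SMPL-X parameter sizes (axis-angle); not stated in the summary.
HEAD_DIMS = {"body": 63, "left_hand": 45, "right_hand": 45, "face": 13}


def crop_roi(feat: Grid, box: Box) -> Grid:
    """Neck step: cut a part-specific region out of the shared feature map."""
    top, left, bottom, right = box
    return [row[left:right] for row in feat[top:bottom]]


def mean_pool(feat: Grid) -> float:
    """Average all cells of a feature map into one scalar."""
    cells = [v for row in feat for v in row]
    return sum(cells) / len(cells)


def toy_head(feat: Grid, out_dim: int) -> List[float]:
    """Placeholder regression head: repeats the pooled feature out_dim times."""
    return [mean_pool(feat)] * out_dim


@dataclass
class ToyEHPSModel:
    backbone: Callable[[Grid], Grid]  # stands in for the ViT feature extractor
    part_boxes: Dict[str, Box]        # hypothetical fixed boxes; a real neck would predict them

    def forward(self, image: Grid) -> Dict[str, List[float]]:
        feat = self.backbone(image)
        # The body head sees the full feature map.
        outputs = {"body": toy_head(feat, HEAD_DIMS["body"])}
        # Hand and face heads see only their cropped regions.
        for part, box in self.part_boxes.items():
            outputs[part] = toy_head(crop_roi(feat, box), HEAD_DIMS[part])
        return outputs


image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
model = ToyEHPSModel(
    backbone=lambda img: img,  # identity backbone, for illustration only
    part_boxes={
        "left_hand": (0, 0, 2, 2),
        "right_hand": (0, 2, 2, 4),
        "face": (2, 1, 4, 3),
    },
)
params = model.forward(image)
```

The point of the sketch is the sharing pattern: one backbone pass produces features that every head reuses, and only the hands and face get a cropped, part-specific view.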
Figure 1: Scaling up EHPS. Both data and model scaling are effective in reducing mean errors on primary metrics across key benchmarks: AGORA [50], UBody [39], EgoBody [68], 3DPW [58] and EHF [51]. OSX [39] and H4W [46] are SOTA methods. Area of the circle indicates model size, with

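Experimental Results

Research Questions

  • RQ1: How does scaling up training data across a large number of EHPS datasets affect generalization to different test environments?
  • RQ2: How does increasing model size (ViT backbone) affect EHPS accuracy and robustness?
  • RQ3: Can a single foundation model be effectively specialized for specific EHPS benchmarks through targeted finetuning?
  • RQ4: Do synthetic and pseudo-labeled data transfer meaningfully to real-world EHPS tasks?
  • RQ5: Which dataset selection strategies maximize cross-domain EHPS performance while minimizing domain gaps?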

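Key Findings

  • Both data and model scaling reduce errors on the main EHPS benchmarks: the mean primary-metric error across AGORA, UBody, EgoBody, 3DPW, and EHF drops from over 110 mm to below 70 mm.
  • The foundation model transfers well to unseen environments such as DNA-Rendering and ARCTIC.
  • Finetuning the generalist SMPLer-X into specialist models sets a new SOTA on AGORA and improves performance on EgoBody, UBody, and EHF.
  • Despite the domain gap, synthetic data contributes substantially to EHPS performance, and combining datasets yields robust generalization.
  • Pseudo SMPL-X labels are useful when ground-truth SMPL-X annotations are unavailable, which broadens the range of usable data.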
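
Benchmark-guided dataset selection can be sketched as follows: train one model per candidate dataset, evaluate each on a set of benchmarks, rank the datasets by their mean error, and keep the best k. This is one plausible reading of "data selection guided by benchmarking insights", not a confirmed description of the authors' procedure. The dataset names, benchmark names, and error values below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean
from typing import Dict, List

# errors[train_set][benchmark] = error in mm of a model trained on train_set only.
ErrorTable = Dict[str, Dict[str, float]]


def rank_datasets(errors: ErrorTable) -> List[str]:
    """Order training datasets from lowest to highest mean benchmark error."""
    return sorted(errors, key=lambda name: mean(errors[name].values()))


def select_top_k(errors: ErrorTable, k: int) -> List[str]:
    """Keep the k datasets whose single-dataset models generalize best."""
    return rank_datasets(errors)[:k]


# Hypothetical numbers for illustration only.
toy_errors: ErrorTable = {
    "dataset_a": {"bench_1": 120.0, "bench_2": 100.0},
    "dataset_b": {"bench_1": 90.0, "bench_2": 95.0},
    "dataset_c": {"bench_1": 150.0, "bench_2": 140.0},
}
selected = select_top_k(toy_errors, k=2)
```

Ranking by a cross-benchmark mean favors datasets that generalize broadly over ones that excel on a single benchmark, which matches the stated goal of reducing domain gaps.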
Figure 2: Dataset attribute distributions. a) and d) are image features extracted by HumanBench [57] and OSX [39] pretrained ViT-L backbone. b) Global orientation (represented by rotation matrix) distribution. c) Body pose (represented by 3D skeleton joints) distribution. Both e) scenes and f) Re


This explanation was generated by AI and reviewed by human editors.