QUICK REVIEW

[Paper Review] HUMANISE: Language-conditioned Human Motion Generation in 3D Scenes

Zan Wang, Yixin Chen|arXiv (Cornell University)|Oct 18, 2022

Human Pose and Action Recognition|Cited by 23

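One-sentence summary

HUMANISE introduces a large-scale synthetic dataset of 19.6k motion sequences across 643 indoor scenes, each annotated with a language description, and proposes a scene-and-language conditioned generative model (a cVAE) that generates diverse, semantically grounded 3D human motions.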

ABSTRACT

Learning to generate diverse scene-aware and goal-oriented human motions in 3D scenes remains challenging due to the mediocre characteristics of the existing datasets on Human-Scene Interaction (HSI); they only have limited scale/quality and lack semantics. To fill in the gap, we propose a large-scale and semantic-rich synthetic HSI dataset, denoted as HUMANISE, by aligning the captured human motion sequences with various 3D indoor scenes. We automatically annotate the aligned motions with language descriptions that depict the action and the unique interacting objects in the scene; e.g., sit on the armchair near the desk. HUMANISE thus enables a new generation task, language-conditioned human motion generation in 3D scenes. The proposed task is challenging as it requires joint modeling of the 3D scene, human motion, and natural language. To tackle this task, we present a novel scene-and-language conditioned generative model that can produce 3D human motions of the desirable action interacting with the specified objects. Our experiments demonstrate that our model generates diverse and semantically consistent human motions in 3D scenes.

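Motivation and objectives

Argue that large-scale, semantically rich human-scene interaction (HSI) data is needed for instruction-aware motion generation in 3D scenes.
Build the synthetic HUMANISE dataset by aligning AMASS motions with ScanNet scenes and automatically annotating them with language descriptions.
Define and address the task of language-conditioned human motion generation in 3D scenes.
Propose a scene-and-language conditioned generative model, based on a conditional VAE, that produces diverse, grounded motions.
Show improvements over baselines and potential benefits for downstream HSI tasks.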

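Proposed method

Build a joint scene-and-language conditioned generative model within the cVAE framework to model p(Θ_{1:T} | S, L_{1:D}), the distribution of the motion sequence Θ_{1:T} given the scene S and the language description L_{1:D}.
Encode the 3D scene with Point Transformer and the language with BERT; fuse the two through self-attention to obtain the conditional embedding z_c.
Encode the motion with a bidirectional GRU that predicts the Gaussian parameters of the latent z; decode to SMPL-X body parameters with a Transformer decoder.
Optimize a reconstruction loss, a KL divergence term, and two auxiliary losses: L_o for object grounding and L_a for action-specific generation.
The two auxiliary tasks behind these losses are regressing the center of the target object and classifying the action; they are intended to improve grounding and action fidelity.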
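The objective described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the L1 reconstruction term, the squared-error form of L_o, the cross-entropy form of L_a, the function names, and the weights `w_kl`, `w_o`, and `w_a` are all hypothetical choices made for the example.

```python
import numpy as np

def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), averaged over the batch."""
    kl_per_sample = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=-1)
    return float(np.mean(kl_per_sample))

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))

def cvae_loss(pred_motion, gt_motion, mu, logvar,
              pred_center, gt_center, action_logits, action_labels,
              w_kl=0.1, w_o=0.1, w_a=0.5):
    """Total loss = reconstruction + KL + L_o + L_a.

    The weights are hypothetical placeholders, not values from the paper.
    """
    # Reconstruction: L1 distance between decoded and ground-truth motion
    # parameters (assumed form).
    l_rec = float(np.mean(np.abs(pred_motion - gt_motion)))
    # KL term from the Gaussian parameters predicted by the motion encoder.
    l_kl = kl_to_standard_normal(mu, logvar)
    # L_o: regress the 3D center of the target object (assumed squared error).
    l_o = float(np.mean(np.sum((pred_center - gt_center) ** 2, axis=-1)))
    # L_a: classify the action named in the description (assumed cross-entropy).
    l_a = cross_entropy(action_logits, action_labels)
    total = l_rec + w_kl * l_kl + w_o * l_o + w_a * l_a
    return total, {"rec": l_rec, "kl": l_kl, "L_o": l_o, "L_a": l_a}
```

Returning the individual terms alongside the total makes it straightforward to mimic the ablations in the results table, for example by setting `w_o=0` or `w_a=0`.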

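Experimental results

Research questions

RQ1 Can a large-scale, semantically rich synthetic dataset enable language-conditioned generation of 3D human motions that interact with specific objects in a scene?
RQ2 Can a scene-and-language conditioned cVAE with auxiliary grounding tasks produce motions that are semantically accurate, physically plausible, and consistent with the language and scene constraints?
RQ3 How do the grounding and action auxiliary losses affect object grounding accuracy and action fidelity in the generated motions?
RQ4 Does the proposed approach benefit downstream human-scene interaction tasks beyond generation?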

Main findings

Model	Translation	Orientation	Pose	MPJPE	MPVPE	Goal distance	APD	Quality score	Action score
sit	5.17	3.19	1.77	113.28	112.43	0.903	10.12	2.37 ± 0.85	3.79 ± 1.17
stand up	5.63	3.43	1.69	126.05	124.84	0.802	9.57	2.83 ± 1.23	4.20 ± 0.94
lie down	6.46	3.09	0.76	136.87	136.20	0.196	9.18	2.31 ± 1.08	2.85 ± 1.31
walk	5.84	2.80	1.85	125.05	123.88	1.370	12.83	2.91 ± 1.27	3.88 ± 1.26
w/o self-att.	5.72	2.65	1.85	122.19	120.81	1.500	13.28	2.88 ± 1.14	3.80 ± 1.09
PointNet++ Enc.	5.81	2.64	1.81	124.67	123.69	1.444	12.61	2.80 ± 1.35	3.75 ± 1.27
all actions	4.20	2.91	1.96	98.01	96.53	1.008	11.83	2.57 ± 1.20	3.59 ± 1.38
w/o L_o	4.20	2.89	1.93	98.15	96.69	1.383	15.09	2.42 ± 1.21	3.57 ± 1.38
w/o L_a	4.23	2.91	1.95	98.67	97.11	1.135	12.66	2.17 ± 1.04	2.29 ± 1.43
w/o aux. loss	4.28	2.99	1.92	99.30	97.80	1.361	15.18	1.97 ± 0.98	2.44 ± 1.38
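For readers unfamiliar with the column names, the sketch below shows one common way to compute MPJPE (mean per-joint position error) and APD (average pairwise distance, a diversity measure). The array shapes and function names are assumptions made for illustration, and the paper's exact definitions, units, and normalization may differ. MPVPE follows the same pattern with mesh vertices in place of joints.

```python
import numpy as np

def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between predicted and ground-truth joints.

    Both arrays are assumed to have shape (T, J, 3): frames, joints, xyz.
    """
    return float(np.mean(np.linalg.norm(pred_joints - gt_joints, axis=-1)))

def average_pairwise_distance(samples):
    """Mean distance over all pairs of motions generated for one condition.

    `samples` is assumed to have shape (K, T, J, 3) with K >= 2; a larger
    value indicates more diverse generations.
    """
    k = samples.shape[0]
    if k < 2:
        raise ValueError("need at least two samples to form a pair")
    dists = [mpjpe(samples[i], samples[j])
             for i in range(k) for j in range(i + 1, k)]
    return float(np.mean(dists))
```

Read this way, lower MPJPE means a more accurate reconstruction, while higher APD means the samples drawn for the same scene and description differ more from one another.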

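The model trained on HUMANISE generates diverse, semantically consistent motions conditioned on a 3D scene and a language description.
In the action-agnostic setting (all actions), the auxiliary tasks improve object grounding and action accuracy: removing L_o raises the goal distance from 1.008 to 1.383, and removing L_a lowers the action score from 3.59 to 2.29.
Fusing scene and language features with self-attention yields better grounding and generation quality than simple concatenation (w/o self-att.) or a PointNet++ scene encoder.
In the action-specific setting, the model achieves strong reconstruction and generation metrics for sit, stand up, lie down, and walk.
Pre-training on HUMANISE improves performance on a PROX-based motion synthesis task, which suggests the dataset can benefit downstream HSI tasks.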

This review was generated by AI and reviewed by a human editor.