QUICK REVIEW

[Paper Digest] VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model

Hanqing Wang, Mingyu Liu|arXiv (Cornell University)|Feb 10, 2026

Robot Manipulation and Learning | Cited by: 0

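One-Sentence Summary

Introduces VIDA, a large-scale video-based 3D object affordance dataset, and VideoAfford, a baseline model that uses a multimodal large language model and a latent action encoder to ground 3D affordances from HOI videos, trained with a spatial-aware loss.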

ABSTRACT

3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline: VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function to enable VideoAfford to obtain comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.

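Motivation and Goals

Use dynamic interaction cues from HOI videos to drive 3D affordance grounding for precise robotic manipulation.
Create VIDA, described as the first large-scale video-based 3D affordance dataset, containing 38K HOI videos and 22K annotated point clouds.
Develop VideoAfford as a baseline that transfers HOI video priors to 3D affordance grounding via a video MLLM and action embeddings.
Improve spatial reasoning with a spatial-aware loss that yields coherent 3D affordance masks.
Demonstrate robustness and open-world generalization on in-distribution and out-of-distribution data.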

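Proposed Method

Formulate grounding as predicting a 3D affordance mask from an HOI video and a text instruction.
Use a pretrained 3D point encoder with geometry-guided upsampling to obtain dense point features.
Introduce a spatial-aware Dice loss that weights neighboring points to reinforce spatial continuity.
Introduce a latent action encoder to extract dynamic interaction priors from HOI videos.
Use a Video MLLM (Video-LLaVA) as the video-text reasoning backbone, with a special <AFF> token to inject affordance knowledge.
Apply a lightweight Transformer-based affordance decoder that fuses the affordance embedding with point features via cross-attention to predict the affordance mask.
Train with a joint objective combining BCE, IoU, and spatial losses, plus the standard text loss on the language output.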
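To make the mask-side training objective concrete, here is a minimal pure-Python sketch of a BCE + soft-IoU + neighbor-weighted Dice combination. The digest does not give the paper's exact formulas, so everything here is an assumption for illustration: the function names, the brute-force k-nearest-neighbor smoothing used as the "spatial" term, and the unit loss weights are all hypothetical, and the text loss on the language output is omitted.

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy; pred holds per-point probabilities."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)

def soft_iou_loss(pred, target, eps=1e-7):
    """1 - soft IoU, with products standing in for set intersection."""
    inter = sum(p * t for p, t in zip(pred, target))
    union = sum(p + t - p * t for p, t in zip(pred, target))
    return 1.0 - (inter + eps) / (union + eps)

def knn_indices(points, k):
    """Brute-force k nearest neighbors of each point (excluding itself)."""
    result = []
    for i, a in enumerate(points):
        dists = sorted((math.dist(a, b), j) for j, b in enumerate(points) if j != i)
        result.append([j for _, j in dists[:k]])
    return result

def spatial_dice_loss(pred, target, points, k=2, eps=1e-7):
    """Dice loss on predictions averaged over each point's k neighbors.

    Assumed form: the neighbor averaging penalizes isolated spikes and
    holes, which is one way to encourage spatially continuous masks.
    """
    nbrs = knn_indices(points, k)
    smooth = [(pred[i] + sum(pred[j] for j in nb)) / (1 + len(nb))
              for i, nb in enumerate(nbrs)]
    inter = sum(s * t for s, t in zip(smooth, target))
    denom = sum(smooth) + sum(target)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

def joint_loss(pred, target, points, w_bce=1.0, w_iou=1.0, w_spatial=1.0):
    """Weighted sum of the three mask losses (weights are placeholders)."""
    return (w_bce * bce_loss(pred, target)
            + w_iou * soft_iou_loss(pred, target)
            + w_spatial * spatial_dice_loss(pred, target, points))

# Toy example: four collinear points, the first two are the affordance region.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
target = [1.0, 1.0, 0.0, 0.0]
good = [0.9, 0.9, 0.1, 0.1]
bad = [0.1, 0.1, 0.9, 0.9]
print(joint_loss(good, target, points) < joint_loss(bad, target, points))
```

In a real training loop these terms would be computed on tensors with automatic differentiation; the list-based version only shows how the three terms relate to each other.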

Figure 2: Data Collection Pipeline. We show the whole data collection and verification pipeline here. First, we utilize VLMs to caption each video and extract keywords about action and objects. We then utilize the VLMs to pair the video to an affordance type. Finally, we manually check the results.

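Experimental Results

Research Questions

RQ1: Can the dynamics of HOI videos and the world knowledge in multimodal large language models be used to ground fine-grained 3D object affordances?
RQ2: Does the latent action encoder improve the understanding of dynamic interactions for 3D grounding?
RQ3: Does the spatial-aware loss improve the spatial coherence and localization accuracy of 3D affordance regions?
RQ4: In open-world settings, how well does the method generalize to unseen objects and unseen affordances?

Key Findings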

Method	mIoU ↑	AUC ↑	SIM ↑	MAE ↓
XMF	14.41	71.47	41.10	0.281
PFusion	16.33	78.43	46.28	0.264
IAGNet	20.39	80.22	50.11	0.188
LASO	18.65	78.44	49.46	0.257
GREAT	23.62	81.41	51.25	0.173
Seqafford	23.03	81.17	47.71	0.227
LMAfford3D*	22.74	80.74	47.28	0.234
Ours	28.20	83.64	58.80	0.157
XMF (Unseen)	6.010	53.41	31.53	0.388
PFusion (Unseen)	7.270	56.69	34.05	0.371
IAGNet (Unseen)	7.970	68.97	34.85	0.277
LASO (Unseen)	7.410	69.21	33.77	0.288
GREAT (Unseen)	8.220	70.19	35.08	0.269
Seqafford (Unseen)	8.070	65.53	32.40	0.286
LMAfford3D* (Unseen)	8.110	66.42	33.61	0.278
Ours (Unseen)	10.95	72.86	40.08	0.255
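For readers unfamiliar with the four columns, the sketch below shows definitions of these metrics as they are commonly used in 3D affordance grounding work. Whether the paper uses exactly these variants is an assumption: the threshold grid for mIoU, the 0.5 cutoff for binarizing the ground truth, and all function names are hypothetical. The table appears to report mIoU, AUC, and SIM scaled by 100, while the functions return values in [0, 1].

```python
def iou_at(pred, target, threshold, gt_cut=0.5):
    """IoU between the thresholded prediction and the binarized ground truth."""
    p = [x >= threshold for x in pred]
    t = [x >= gt_cut for x in target]
    inter = sum(a and b for a, b in zip(p, t))
    union = sum(a or b for a, b in zip(p, t))
    return inter / union if union else 1.0

def mean_iou(pred, target, thresholds=None):
    """IoU averaged over a grid of prediction thresholds (assumed grid)."""
    if thresholds is None:
        thresholds = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    return sum(iou_at(pred, target, th) for th in thresholds) / len(thresholds)

def auc(pred, target, gt_cut=0.5):
    """ROC AUC via pairwise ranking: P(score of positive > score of negative)."""
    pos = [p for p, t in zip(pred, target) if t >= gt_cut]
    neg = [p for p, t in zip(pred, target) if t < gt_cut]
    if not pos or not neg:
        return float("nan")  # undefined without both classes
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def similarity(pred, target, eps=1e-12):
    """SIM: histogram intersection after normalizing each map to sum to 1."""
    sp, st = sum(pred) + eps, sum(target) + eps
    return sum(min(p / sp, t / st) for p, t in zip(pred, target))

def mean_absolute_error(pred, target):
    """MAE: mean point-wise absolute difference (lower is better)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

# Toy example on four points.
pred = [0.9, 0.8, 0.2, 0.1]
target = [1.0, 1.0, 0.0, 0.0]
print(mean_iou(pred, target), auc(pred, target),
      similarity(pred, target), mean_absolute_error(pred, target))
```

The pairwise AUC is quadratic in the number of points and is meant only to show what the metric measures; a sort-based implementation would be used on full point clouds.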

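VideoAfford achieves state-of-the-art results on both the Seen and Unseen settings of VIDA.
On Seen, VideoAfford reaches mIoU 28.20, AUC 83.64, SIM 58.80, and MAE 0.157, outperforming all baselines.
On Unseen, VideoAfford reaches mIoU 10.95, AUC 72.86, SIM 40.08, and MAE 0.255, outperforming all baselines.
Ablations indicate that the action encoder and the spatial loss both substantially improve performance (e.g., with both enabled: mIoU 28.20, AUC 83.64, SIM 58.80, MAE 0.157).
Sampling 8 frames balances temporal context against computational cost and outperforms the 2-, 4-, and 16-frame settings.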
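The digest does not say how the 8 frames are chosen from each clip. As one plausible reading, the sketch below picks frames uniformly by taking the center of equal-length segments; the function name and the uniform strategy are assumptions, not the authors' stated procedure.

```python
def sample_frame_indices(num_frames, num_samples=8):
    """Pick num_samples frame indices spread evenly over a clip.

    Hypothetical helper: splits the clip into num_samples equal segments
    and takes the center frame of each. Short clips return every frame.
    """
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    if num_frames <= num_samples:
        return list(range(num_frames))
    return [int((i + 0.5) * num_frames / num_samples) for i in range(num_samples)]

print(sample_frame_indices(80))  # 8 indices from an 80-frame clip
print(sample_frame_indices(5))   # clip shorter than the sample budget
```

Changing `num_samples` to 2, 4, or 16 reproduces the other settings compared in the ablation, with the cost of the video encoder growing with the number of frames.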

Figure 3: VIDA Dataset. Here we illustrate the detailed information of VIDA. a) shows examples of the videos and corresponding affordance point clouds. b) shows the video and point cloud ratios, and c) shows the category distributions of VIDA.

This digest was generated by AI and reviewed by human editors.