QUICK REVIEW

[Paper Review] Hyperbolic Multiview Pretraining for Robotic Manipulation

Jin Yang, Ping Wei|arXiv (Cornell University)|Mar 5, 2026

Robot Manipulation and Learning|Cited by 0

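One-Sentence Summary

HyperMVP pretrains a GeoLink encoder in hyperbolic space to learn multiview 3D representations for robotic manipulation, improving generalization across perturbations and tasks. It also introduces the 3D-MOV dataset and shows substantial gains on COLOSSEUM, RLBench, and real-world scenarios.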

ABSTRACT

3D-aware visual pretraining has proven effective in improving the performance of downstream robotic manipulation tasks. However, existing methods are constrained to Euclidean embedding spaces, whose flat geometry limits their ability to model structural relations among embeddings. As a result, they struggle to learn structured embeddings that are essential for robust spatial perception in robotic applications. To this end, we propose HyperMVP, a self-supervised framework for Hyperbolic MultiView Pretraining. Hyperbolic space offers geometric properties well suited for capturing structural relations. Methodologically, we extend the masked autoencoder paradigm and design a GeoLink encoder to learn multiview hyperbolic representations. The pretrained encoder is then finetuned with visuomotor policies on manipulation tasks. In addition, we introduce 3D-MOV, a large-scale dataset comprising multiple types of 3D point clouds to support pretraining. We evaluate HyperMVP on COLOSSEUM, RLBench, and real-world scenarios, where it consistently outperforms strong baselines across diverse tasks and perturbation settings. Our results highlight the potential of 3D-aware pretraining in a non-Euclidean space for learning robust and generalizable robotic manipulation policies.

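Motivation and Objectives

Advance robust robotic manipulation by leveraging non-Euclidean, structure-aware representations.
Develop a self-supervised pretraining framework that learns multiview hyperbolic embeddings from 3D point clouds.
Introduce 3D-MOV, a large-scale dataset, to study how diverse 3D data affects downstream tasks.
Demonstrate that hyperbolic pretraining improves generalization in both simulated and real-world robotic manipulation.
Enable scalable finetuning of downstream visuomotor policies, with flexible handling of varying input views.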

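Proposed Method

Extends the masked autoencoder (MAE) paradigm with a GeoLink encoder that maps Euclidean patch embeddings into hyperbolic space (Lorentz model).
Renders each 3D point cloud into five orthographic views, with view-specific embeddings and masking.
Lifts the embeddings into hyperbolic space via the exponential map and applies hyperbolic losses to enforce structure: a patch-wise Top-K rank correlation loss and an entailment loss.
The pretraining objective combines these hyperbolic representation constraints with reconstruction losses (intra-view and inter-view reconstruction).
During finetuning, GeoLink is optimized jointly with the Robotic View Transformer (RVT) to learn visuomotor policies, and the setup extends to an arbitrary number of views.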
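
The lifting step mentioned above can be illustrated with a minimal NumPy sketch of the standard exponential map at the origin of the Lorentz model, together with the corresponding geodesic distance. This is not the authors' implementation: the function names, the curvature parameter `c`, and the embedding sizes are assumptions chosen for illustration, and the paper's exact parameterization may differ.

```python
import numpy as np

def exp_map_origin(v, c=1.0, eps=1e-8):
    """Lift Euclidean vectors v (..., d) onto the Lorentz hyperboloid (..., d + 1).

    Assumes curvature -c (c > 0); v is treated as a tangent vector at the origin.
    """
    v = np.asarray(v, dtype=np.float64)
    scaled = np.sqrt(c) * np.linalg.norm(v, axis=-1, keepdims=True)
    # sinh(t) / t tends to 1 as t tends to 0, so guard the division.
    factor = np.where(scaled > eps, np.sinh(scaled) / np.maximum(scaled, eps), 1.0)
    space = factor * v
    # The time component follows from the constraint <x, x>_L = -1 / c.
    time = np.sqrt(1.0 / c + np.sum(space ** 2, axis=-1, keepdims=True))
    return np.concatenate([time, space], axis=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product: -x_0 * y_0 + sum_i x_i * y_i."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def lorentz_distance(x, y, c=1.0):
    """Geodesic distance between points on the hyperboloid of curvature -c."""
    # Clip to the domain of arccosh to absorb floating-point error.
    arg = np.clip(-c * lorentz_inner(x, y), 1.0, None)
    return np.arccosh(arg) / np.sqrt(c)

# Hypothetical patch embeddings: 4 patches with 8 Euclidean dimensions each.
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8)) * 0.5
lifted = exp_map_origin(patches)              # shape (4, 9)
origin = exp_map_origin(np.zeros((1, 8)))     # the hyperboloid origin
radii = lorentz_distance(origin, lifted)      # equals the Euclidean norms of patches
```

In this construction, the distance of a lifted point from the origin equals the Euclidean norm of its original embedding, which is one reason the exponential map at the origin is a common way to move encoder outputs into hyperbolic space.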

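Experimental Results

Research Questions

RQ1: Can multiview hyperbolic representations, going beyond Euclidean space, improve 3D-aware pretraining for robotic manipulation?
RQ2: How does diverse 3D data, at both the object level and the scene level, affect downstream manipulation performance?
RQ3: Do self-supervised hyperbolic pretraining objectives yield representations that are robust across perturbations and tasks?
RQ4: Can the pretrained model be extended to different numbers of input views during finetuning?
RQ5: Do hyperbolic embeddings transfer effectively to real-world robotic manipulation environments?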

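Key Findings

HyperMVP consistently outperforms baseline methods across COLOSSEUM perturbation settings, RLBench, and real-world tests.
Compared with Euclidean baselines and other self-supervised methods, hyperbolic pretraining with GeoLink yields substantial gains.
The 3D-MOV dataset (roughly 200K point clouds and 1M multiview images) supports effective pretraining on diverse scene data.
On RLBench, HyperMVP achieves the highest average success rate across 18 tasks and outperforms RVT trained from scratch.
In real-world experiments, HyperMVP achieves higher success rates and is more robust than RVT under perturbations, particularly on high-precision tasks.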

This review was generated by AI and reviewed by human editors.