QUICK REVIEW

[Paper Review] On the Strengths and Weaknesses of Data for Open-set Embodied Assistance

Pradyumna Tambwekar, Andrew Silva|arXiv (Cornell University)|Mar 5, 2026

Multimodal Machine Learning Applications | Cited by 0

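One-Sentence Summary

The paper trains a multimodal, instruction-tuned embodied model on synthetic Overcooked trajectories to provide open-set corrective assistance, demonstrates generalization to unseen defects and new tasks, and analyzes how diverse assistive data affects performance. The model generalizes better than a GPT-4o baseline, and the paper offers insights into dataset design for embodied assistance.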

ABSTRACT

Embodied foundation models are increasingly performant in real-world domains such as robotics or autonomous driving. These models are often deployed in interactive or assistive settings, where it is important that these assistive models generalize to new users and new tasks. Diverse interactive data generation offers a promising avenue for providing data-efficient generalization capabilities for interactive embodied foundation models. In this paper, we investigate the generalization capabilities of a multimodal foundation model fine-tuned on diverse interactive assistance data in a synthetic domain. We explore generalization along two axes: a) assistance with unseen categories of user behavior and b) providing guidance in new configurations not encountered during training. We study a broad capability called Open-Set Corrective Assistance, in which the model needs to inspect lengthy user behavior and provide assistance through either corrective actions or language-based feedback. This task remains unsolved in prior work, which typically assumes closed corrective categories or relies on external planners, making it a challenging testbed for evaluating the limits of assistive data. To support this task, we generate synthetic assistive datasets in Overcooked and fine-tune a LLaMA-based model to evaluate generalization to novel tasks and user behaviors. Our approach provides key insights into the nature of assistive datasets required to enable open-set assistive intelligence. In particular, we show that performant models benefit from datasets that cover different aspects of assistance, including multimodal grounding, defect inference, and exposure to diverse scenarios.

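Motivation and Objectives

Motivate and enable open-set corrective assistance without a predefined set of corrections.
Use synthetic Overcooked data to study how defect types and task configurations affect generalization.
Evaluate how diverse, cross-modal assistive data affects grounding, reasoning, and action generation.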

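Proposed Method

Fine-tune a LLaMA-3 base model that takes input from a ViT image encoder, yielding a multimodal model that outputs language guidance or corrective actions from trajectory data.
Generate synthetic long-horizon user trajectories in Overcooked with diverse defect wrappers, covering cognitive-planning and visuospatial impairments.
Create grounding datasets (Image-QA, Trajectory-QA, Video-QA) and task-specific datasets (Coaching, Corrections, Defect Delineation) to train open-set assistance.
Derive ground-truth supervision from two sources: corrections given by the next action of a defect-free trajectory, and synthetic coaching snippets generated by GPT-4o under diverse personas.
Evaluate against a GPT-4o baseline on held-out defects and new recipes, including few-shot fine-tuning with 10 samples per new defect.
Run ablation studies to understand how multi-task training and grounding data affect generalization.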
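
The held-out-defect protocol described above can be illustrated with a minimal sketch. It is not the authors' code: the sample schema (a dict with a "defect" key), the function name split_by_defect, and the defect names in the toy data are all hypothetical. Only the idea of holding out whole defect categories and drawing 10 few-shot samples per new defect comes from the summary.

```python
import random
from collections import defaultdict


def split_by_defect(samples, held_out_defects, shots_per_defect=10, seed=0):
    """Split samples into (train, few_shot, evaluation) by defect category.

    Samples whose defect is NOT held out go to train. For each held-out
    defect, `shots_per_defect` samples are drawn for few-shot fine-tuning
    and the remainder is kept for evaluation.
    """
    rng = random.Random(seed)
    train = []
    by_defect = defaultdict(list)
    for sample in samples:
        if sample["defect"] in held_out_defects:
            by_defect[sample["defect"]].append(sample)
        else:
            train.append(sample)

    few_shot, evaluation = [], []
    for defect in sorted(by_defect):  # sorted for reproducibility
        pool = list(by_defect[defect])
        rng.shuffle(pool)
        few_shot.extend(pool[:shots_per_defect])
        evaluation.extend(pool[shots_per_defect:])
    return train, few_shot, evaluation


# Toy data: the defect names below are made up for illustration only.
samples = [
    {"id": f"{defect}-{i}", "defect": defect}
    for defect in ("forgets_ingredient", "wrong_station", "idle_loop")
    for i in range(25)
]
train, few_shot, evaluation = split_by_defect(samples, {"idle_loop"})
print(len(train), len(few_shot), len(evaluation))
```

Splitting by category rather than by individual sample is what makes the evaluation open-set: no trajectory with a held-out defect is seen during the main training run, and zero-shot evaluation simply skips the few-shot subset.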

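Experimental Results

Research Questions

RQ1: Can an embodied foundation model trained on synthetic assistive data generalize to unseen defective user behaviors (open-set defects) and new task configurations (recipes)?
RQ2: Which dataset design features (multimodal grounding, reasoning traces, task decomposition) contribute most to open-set corrective assistance?
RQ3: How does model scale (1B vs. 8B parameters) affect zero-shot and few-shot generalization in open-set scenarios?
RQ4: Does multi-task training, together with joint training on grounding data, improve performance on unseen defects and new tasks?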

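Key Findings

Trained on diverse synthetic assistive data, the proposed model outperforms the GPT-4o baseline on both the coaching and corrections tasks for held-out defects, in zero-shot and few-shot settings.
On held-out defects, the 1B and 8B variants reach coaching scores of 76.60 and 77.80 and corrections scores of 55.70 and 54.60, respectively, all above the baseline (GPT-4o: coaching 21.00, corrections 20.40).
Reasoning traces improve coaching in some settings but can cause mode collapse; zero-shot gains from reasoning are inconsistent, and in some cases coaching performance drops relative to inputs without reasoning.
Task generalization to new recipes improves with model size, suggesting that compositionality on unseen tasks requires stronger multimodal grounding.
Joint multi-task training across coaching, corrections, and defect delineation generally outperforms single-task training, and grounding data (especially Trajectory-QA) helps generalization to new configurations.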
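Joint training with the grounding datasets improves visual compositionality and helps generalization to new task configurations (DT- being the most effective of the grounding datasets).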


This review was generated by AI and reviewed by a human editor.