[Paper Summary] Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents
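Grounded Decoding (GD) couples a frozen large language model (LLM) with domain-specific grounded models to decode open-vocabulary plans for embodied robots, enabling long-horizon tasks without fine-tuning the LLM. GD combines the two models' probabilities at the token level so that the resulting plans are both semantically plausible and physically realizable.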
Recent progress in large language models (LLMs) has demonstrated the ability to learn and leverage Internet-scale knowledge through pre-training with autoregressive models. Unfortunately, applying such models to settings with embodied agents, such as robots, is challenging due to their lack of experience with the physical world, inability to parse non-language observations, and ignorance of rewards or safety constraints that robots may require. On the other hand, language-conditioned robotic policies that learn from interaction data can provide the necessary grounding that allows the agent to be correctly situated in the real world, but such policies are limited by the lack of high-level semantic understanding due to the limited breadth of the interaction data available for training them. Thus, if we want to make use of the semantic knowledge in a language model while still situating it in an embodied setting, we must construct an action sequence that is both likely according to the language model and also realizable according to grounded models of the environment. We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives. We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon embodiment tasks in a robotic setting by leveraging the knowledge of both models. The project's website can be found at grounded-decoding.github.io.
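Research motivation and goals
- Combine the high-level semantic planning of large language models with grounding information from the robot's embodiment and environment.
- Develop a strategy that combines LLM probabilities with grounded model objectives (affordances, safety, preferences) during token-level decoding.
- Demonstrate applicability across multiple domains (simulation and real world) and show efficiency gains over prior methods.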
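Proposed method
- Define a grounding function p_G(w_{1:n} | s) conditioned on the embodiment state s.
- Formulate GD as token-level autoregressive decoding that maximizes p_GD(w_{1:N} | s, l) ∝ ∏_{n=1}^{N} p_LLM(w_n | w_{1:n-1}, l) · p_G(w_{1:n} | s), where l is the language instruction.
- Implement GD with greedy or beam search to select tokens that are likely under both the LLM and the grounded models.
- Learn grounding signals from domain data (token-conditioned value functions, multimodal detectors, rule-based cues) and combine them to cover affordances, safety, and preferences.
- Optionally achieve multimodal grounding through prompting or chain-of-thought techniques, so that vision-language models can be used at decoding time.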
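The decoding rule above can be illustrated with a small greedy sketch. This is not the authors' implementation: the vocabulary, `toy_llm_probs`, `affordance`, `safety`, and the `state` dictionary are all hypothetical stand-ins, and multiplying the individual grounding signals to form p_G is an assumption made here for simplicity.

```python
import math

# Hypothetical toy vocabulary; a real system would use the LLM's tokenizer.
VOCAB = ["pick", "place", "apple", "knife", "<eos>"]


def toy_llm_probs(prefix):
    """Stand-in for p_LLM(w_n | w_{1:n-1}, l): a hand-written next-token table."""
    if not prefix:
        return {"pick": 0.6, "place": 0.3, "apple": 0.05, "knife": 0.04, "<eos>": 0.01}
    if prefix[-1] in ("pick", "place"):
        return {"pick": 0.01, "place": 0.01, "apple": 0.4, "knife": 0.55, "<eos>": 0.03}
    return {"pick": 0.02, "place": 0.02, "apple": 0.03, "knife": 0.03, "<eos>": 0.9}


def affordance(tokens, state):
    """Hypothetical affordance signal: objects absent from the scene score low."""
    last = tokens[-1]
    if last in ("apple", "knife") and last not in state["visible"]:
        return 0.01
    return 1.0


def safety(tokens, state):
    """Hypothetical rule-based safety signal: unsafe objects score low."""
    return 0.05 if tokens[-1] in state["unsafe"] else 1.0


def grounded_score(tokens, state):
    """p_G(w_{1:n} | s), assumed here to be the product of the signals."""
    return affordance(tokens, state) * safety(tokens, state)


def greedy_grounded_decode(state, max_len=5):
    """At each step, pick the token maximizing log p_LLM + log p_G."""
    tokens = []
    for _ in range(max_len):
        llm = toy_llm_probs(tokens)
        scores = {
            w: math.log(llm[w]) + math.log(grounded_score(tokens + [w], state))
            for w in VOCAB
        }
        best = max(scores, key=scores.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens


# With the knife marked unsafe, grounding overrides the LLM's preference for it.
grounded_plan = greedy_grounded_decode({"visible": {"apple", "knife"}, "unsafe": {"knife"}})
# With no safety constraint, decoding follows the LLM alone.
ungrounded_plan = greedy_grounded_decode({"visible": {"apple", "knife"}, "unsafe": set()})
print(grounded_plan, ungrounded_plan)
```

In this toy setup the LLM alone prefers "knife" after "pick", while the safety signal shifts the grounded plan to "apple". The LLM table itself is never modified, which mirrors the idea of keeping the language model frozen.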
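Experimental results
Research questions
- RQ1: How can an open-vocabulary language model be grounded in a robot's embodied state to generate executable long-horizon task plans?
- RQ2: Which grounding signals (affordances, safety, preferences, multimodal detectors) improve long-horizon task success compared with ungrounded decoding?
- RQ3: Without fine-tuning the LLM, can token-level Grounded Decoding scale effectively to open action spaces and multiple domains (simulation and real world)?
- RQ4: How does GD differ from SayCan in planning and execution efficiency and in generalization to unseen tasks?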
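Key findings
- By combining LLM planning with domain grounding, GD achieves strong performance across three embodied domains.
- Beam search improves performance over greedy decoding on long-horizon tasks.
- On the tasks tested, GD is two orders of magnitude more efficient than SayCan while reaching comparable performance.
- Grounding through affordances, safety, and preferences narrows the action space and reduces planning failures compared with ungrounded or LLM-only methods.
- In real-world settings, multimodal grounding combined with chain-of-thought prompting helps disambiguate tasks, improving planning and execution in ambiguous scenes.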
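To see why beam search can outperform greedy decoding on multi-step plans, consider the following minimal sketch. The `STEP_SCORES` table is a hypothetical set of combined per-step scores (LLM probability times grounding score) invented for illustration; it does not reproduce any result from the paper.

```python
import math

# Hypothetical combined per-step scores p_LLM * p_G, keyed by the plan prefix.
STEP_SCORES = {
    (): {"go": 0.6, "open": 0.4},
    ("go",): {"shelf": 0.35, "drawer": 0.25},
    ("open",): {"drawer": 0.9, "shelf": 0.1},
}


def beam_decode(beam_width, length=2):
    """Keep the `beam_width` best prefixes at every step; width 1 is greedy."""
    beams = [((), 0.0)]  # (prefix, cumulative log score)
    for _ in range(length):
        candidates = []
        for prefix, logp in beams:
            for token, p in STEP_SCORES[prefix].items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


greedy_plan, greedy_logp = beam_decode(beam_width=1)
beam_plan, beam_logp = beam_decode(beam_width=2)
print(greedy_plan, beam_plan)
```

Greedy decoding commits to "go" because it scores highest at the first step, then finds only weak continuations. A beam of width 2 keeps "open" alive long enough to discover that "open drawer" has the higher overall score.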
This summary was generated by AI and reviewed by human editors.