QUICK REVIEW

[Paper Review] Affordance-Graphed Task Worlds: Self-Evolving Task Generation for Scalable Embodied Learning

Xiang Liu, Sen Cui|arXiv (Cornell University)|Feb 12, 2026

Reinforcement Learning in Robotics|Cited by 0

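One-Sentence Summary

AGT-World autonomously builds interactive simulation environments from real-world observations, decomposes long-horizon tasks into atomic primitives on a graph, and refines policies using VLM feedback in a self-evolution loop, achieving a 71.6% success rate on 102 autonomously generated scene-task pairs.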

ABSTRACT

Training robotic policies directly in the real world is expensive and unscalable. Although generative simulation enables large-scale data synthesis, current approaches often fail to generate logically coherent long-horizon tasks and struggle with dynamic physical uncertainties due to open-loop execution. To address these challenges, we propose Affordance-Graphed Task Worlds (AGT-World), a unified framework that autonomously constructs interactive simulated environments and corresponding robot task policies based on real-world observations. Unlike methods relying on random proposals or static replication, AGT-World formalizes the task space as a structured graph, enabling the precise, hierarchical decomposition of complex goals into theoretically grounded atomic primitives. Furthermore, we introduce a Self-Evolution mechanism with hybrid feedback to autonomously refine policies, combining Vision-Language Model reasoning and geometric verification. Extensive experiments demonstrate that our method significantly outperforms in success rates and generalization, achieving a self-improving cycle of proposal, execution, and correction for scalable robot learning.

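Motivation and Objectives

Bridge semantic perception and physical simulation by reconstructing interactive scenes that preserve real-world affordances and layout.
Formalize task generation as a graph-based path-planning problem over an affordance-graphed task world (AGT-World).
Introduce a closed-loop self-evolution mechanism that uses Vision-Language Model reasoning and geometric verification to refine task policies.
Demonstrate scalability and generalization through large-scale autonomous scene-task generation and evaluation on complex tasks.
Report empirical gains in success rate and offer insights into long-horizon task planning and policy refinement.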

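Proposed Method

Represent the task space as a structured directed graph G = (V, E), where V = O × A × N+, O is the set of manipulable objects, A is the set of atomic actions, and N+ is the time dimension.
Reconstruct a simulated scene S0 from a single RGB image, preserving semantic affordances and object states, using matched assets in a physics-enabled simulator (OmniGibson).
Decompose complex tasks into simple subtasks through a VLM-driven planning stage, yielding subtask descriptions and the corresponding action flows π(Tk).
Model transitions between tasks with action-transition edges ek that connect the terminal state of Ti to the initial state of Ti+1, ensuring boundary consistency Sinit(k+1) ≈ Sgoal(k)+.
Run a self-evolution loop that, for each subtask, analyzes visual feedback from multiple views to critique and iteratively refine the action flow via a hybrid VLM-based feedback mechanism (m, X).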
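
The method steps above can be made concrete with a small sketch. This is not the authors' implementation: it assumes that scene states can be simplified to object-to-label dictionaries, and it replaces the VLM and geometric verifier with a hypothetical rule-based critic. The names `Node`, `SubTask`, `boundary_consistent`, `self_evolve`, and `toy_critic`, as well as the state labels, are illustrative only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Node:
    """One vertex of V = O x A x N+: (object, atomic action, time step)."""
    obj: str
    action: str
    step: int


@dataclass
class SubTask:
    """A simple subtask T_k with simplified boundary states and an action flow."""
    description: str
    init_state: Dict[str, str]   # stands in for S_init(k): object -> state label
    goal_state: Dict[str, str]   # stands in for S_goal(k)
    flow: List[Node]             # stands in for the action flow pi(T_k)


def boundary_consistent(prev: SubTask, nxt: SubTask) -> bool:
    """Check that objects named in both boundary states agree across the edge."""
    return all(
        nxt.init_state[o] == s
        for o, s in prev.goal_state.items()
        if o in nxt.init_state
    )


Critic = Callable[[List[Node]], Optional[List[Node]]]


def self_evolve(flow: List[Node], critic: Critic,
                max_rounds: int = 3) -> Tuple[List[Node], int]:
    """Ask the critic for revisions until it accepts the flow (returns None)
    or the round budget runs out. Returns the flow and the revisions applied."""
    for revisions in range(max_rounds):
        revised = critic(flow)
        if revised is None:
            return flow, revisions
        flow = revised
    return flow, max_rounds


def toy_critic(flow: List[Node]) -> Optional[List[Node]]:
    """Hypothetical rule-based critic: a non-empty flow must contain a grasp."""
    if not flow or any(n.action == "grasp" for n in flow):
        return None
    shifted = [Node(n.obj, n.action, n.step + 1) for n in flow]
    return [Node(flow[0].obj, "grasp", 1)] + shifted


pick = SubTask("pick up the glass", {"glass": "on_table"},
               {"glass": "in_gripper"}, [Node("glass", "lift", 1)])
place = SubTask("place the glass in the fridge", {"glass": "in_gripper"},
                {"glass": "in_fridge"}, [Node("glass", "place", 1)])

pick.flow, rounds = self_evolve(pick.flow, toy_critic)
chain_ok = boundary_consistent(pick, place)
```

In this toy run the critic inserts a missing grasp step once and then accepts the flow, and the pick-to-place edge passes the boundary check because the glass ends the first subtask in the state the second one expects.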

Figure 1: An introduction of our method. A. Video generation models often produce physically implausible behaviors. We instead employ a physics simulation engine to reconstruct semantic and global-state preserving simulated scenes from real-world images at low cost. B. Randomly generated scenes are

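Experimental Results

Research Questions

RQ1: How can long-horizon robot tasks be decomposed into executable atomic actions while preserving semantic affordances and physical feasibility?
RQ2: Can a graph-based task world enable reliable path planning and compositional reachability from real-world observations to simulated scenes?
RQ3: Does a self-evolution loop guided by vision-language feedback improve the success rate and generalization of autonomous task execution in simulation?
RQ4: How do visual feedback, temporal context, and inter-task transitions affect the reliability of generated tasks and policies?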

Key Findings

Task category	Count	Successes	SR (%)
Articulated objects (open/close)	36	24	66.7
Rigid objects (pick up)	66	49	74.2
Total	102	73	71.6
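
As a quick check on the table, the success rates can be recomputed from the counts. The sketch below uses only the numbers reported above; the dictionary keys and the `success_rate` helper are illustrative names, not part of the paper.

```python
# (count, successes) for the two task categories in the table above
results = {
    "articulated (open/close)": (36, 24),
    "rigid (pick up)": (66, 49),
}


def success_rate(count: int, successes: int) -> float:
    """Success rate in percent, rounded to one decimal place."""
    return round(100.0 * successes / count, 1)


per_category = {name: success_rate(n, s) for name, (n, s) in results.items()}

total_count = sum(n for n, _ in results.values())
total_successes = sum(s for _, s in results.values())
overall = success_rate(total_count, total_successes)
```

The overall figure is pooled over all 102 pairs (73 / 102), so it is weighted by category size rather than being the plain average of the two category rates.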

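The framework achieves an overall success rate of 71.6% on 102 autonomously generated scene-task pairs.
Simple atomic primitives succeed on most tasks, while long-horizon and navigation-heavy subtasks benefit from the error correction provided by self-evolution.
VLM-guided task expansion shows high semantic fidelity to user intent; on the designed tasks, SBERT similarity is 0.376 and Self-BLEU is 0.860.
Multi-view visual input improves planning reliability, and keeping a small temporal context window (p1 = 1) balances performance against inference cost.
Four representative long-horizon tasks demonstrate the ability to compose multiple atomic primitives into complex goals (e.g., transporting a glass to the refrigerator).
The paper presents a theoretical proposition that hierarchical decomposition achieves global reachability under stated completeness and connectivity assumptions.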
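
The reachability claim in the last finding can be illustrated on a toy graph. The sketch below is not the paper's proof or algorithm: it assumes nodes are (object, action, step) tuples, uses plain breadth-first search, and invents the node names and edges for the glass-to-refrigerator example. It only shows why connectivity matters: with the inter-task edge the goal is reachable, without it the object slices are disconnected.

```python
from collections import deque


def merge(*edge_sets):
    """Combine several adjacency dicts into one."""
    merged = {}
    for edges in edge_sets:
        for src, dsts in edges.items():
            merged.setdefault(src, []).extend(dsts)
    return merged


def find_path(edges, start, goal):
    """Breadth-first search; returns a node list from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in edges.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


# Hypothetical nodes: two object slices (glass, fridge) with their own time steps
g1, g2 = ("glass", "grasp", 1), ("glass", "lift", 2)
f1, f2 = ("fridge", "open", 1), ("fridge", "place_inside", 2)

intra_edges = {g1: [g2], f1: [f2]}   # steps within one object slice
inter_edges = {g2: [f1]}             # inter-task edge bridging the two slices

connected = find_path(merge(intra_edges, inter_edges), g1, f2)
disconnected = find_path(intra_edges, g1, f2)
```

Here `connected` is the four-node path through both slices, while `disconnected` is `None` because no edge leaves the glass slice.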

Figure 2: Affordance-Graphed Task Worlds. Any complex long-horizon task is decomposed into multiple simple tasks, connected via inter-task edges that bridge different object slices or reset temporal states.

This review was generated by AI and reviewed by a human editor.