[Paper Summary] SceneFoundry: Generating Interactive Infinite 3D Worlds
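SceneFoundry presents a language-guided diffusion pipeline that generates apartment-scale, functionally well-defined 3D indoor environments for robot training. It combines LLM-based floor-plan guidance, diffusion-based asset placement, and post-hoc optimization for navigability and interactivity.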
The ability to automatically generate large-scale, interactive, and physically realistic 3D environments is crucial for advancing robotic learning and embodied intelligence. However, existing generative approaches often fail to capture the functional complexity of real-world interiors, particularly those containing articulated objects with movable parts essential for manipulation and navigation. This paper presents SceneFoundry, a language-guided diffusion framework that generates apartment-scale 3D worlds with functionally articulated furniture and semantically diverse layouts for robotic training. From natural language prompts, an LLM module controls floor layout generation, while diffusion-based posterior sampling efficiently populates the scene with articulated assets from large-scale 3D repositories. To ensure physical usability, SceneFoundry employs differentiable guidance functions to regulate object quantity, prevent articulation collisions, and maintain sufficient walkable space for robotic navigation. Extensive experiments demonstrate that our framework generates structurally valid, semantically coherent, and functionally interactive environments across diverse scene types and conditions, enabling scalable embodied AI research. Project page: https://anc891203.github.io/SceneFoundry-Demo/
Motivation and Objectives
- Bridge high-level natural language prompts to controllable, apartment-scale 3D scene generation.
- Ensure functional realism by embedding articulated furniture and movable parts.
- Maintain navigability and walkable space for robotic training and embodied AI.
- Provide differentiable, post-hoc guidance to enforce object counts, articulation feasibility, and walkable areas.
Proposed Method
- LLM-based parameter space guidance converts natural language prompts into low-level floor plan parameters for controllable layout generation.
- Diffusion posterior sampling places articulated assets by sampling object parameters in parallel across a 3D scene.
- Differentiable guidance functions constrain generation: Object Quantity Control and Articulated Object Collision Constraint.
- Walkable Area Control post-processing optimizes layout to guarantee navigable space for agents.
- Training integrates a constraint-guided learning objective that includes constraint gradients.
- Evaluation metrics assess controllability and functional plausibility of generated scenes.
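The idea of steering generation with the gradient of a differentiable constraint can be illustrated with a toy object-quantity example. The sketch below is not the paper's formulation: it assumes, purely for illustration, that each candidate object slot has an existence logit, that the "soft count" is the sum of their sigmoids, and that guidance is plain gradient descent on a squared-error loss. The names `soft_count`, `quantity_loss_and_grad`, and `guide` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def soft_count(logits):
    """Differentiable proxy for the number of objects (assumed form):
    the sum of per-slot existence probabilities."""
    return sum(sigmoid(z) for z in logits)


def quantity_loss_and_grad(logits, target):
    """Squared error between the soft count and the target count,
    together with its analytic gradient w.r.t. each logit."""
    diff = soft_count(logits) - target
    loss = diff * diff
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    grad = [2.0 * diff * sigmoid(z) * (1.0 - sigmoid(z)) for z in logits]
    return loss, grad


def guide(logits, target, step_size=0.5, n_steps=200):
    """Repeatedly nudge the logits along the negative constraint gradient.
    In a diffusion sampler a step like this would be interleaved with
    denoising; here it is shown in isolation."""
    logits = list(logits)
    for _ in range(n_steps):
        _, grad = quantity_loss_and_grad(logits, target)
        logits = [z - step_size * g for z, g in zip(logits, grad)]
    return logits


# Hypothetical existence logits for eight candidate object slots.
initial = [2.0, 1.5, 1.0, 0.5, 0.0, -0.5, -1.0, -1.5]
guided = guide(initial, target=3)
print("soft count before:", round(soft_count(initial), 3))
print("soft count after: ", round(soft_count(guided), 3))
```

Because the soft count is smooth in the logits, the constraint gradient is defined everywhere, which is what makes this kind of constraint usable as a guidance signal. A hard count obtained by thresholding would have zero gradient almost everywhere.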
Experimental Results
Research Questions
- RQ1: Can language-guided prompts yield apartment-scale, semantically coherent floor plans suitable for robotic tasks?
- RQ2: How effectively can differentiable constraints enforce object counts and articulated feasibility during diffusion-based layout generation?
- RQ3: Does post-processing walkable area optimization ensure navigable environments without compromising semantic layout quality?
- RQ4: What metrics best capture controllability and functional realism in generated 3D indoor scenes?
Key Findings
- The framework achieves structurally valid, semantically coherent, and functionally interactive apartment-scale scenes.
- Object Quantity Control reliably enforces target object counts, with success rates (SR) of roughly 0.95–0.97 for targets of 5 to 16 objects.
- Articulated Object Collision Constraint reduces functional collisions and improves object reachability compared to baselines.
- Walkable Area Control significantly increases navigability across walkable-area thresholds.
- LLM-guided layout generation attains high node, constraint, and edge similarity to ground-truth layouts.
- Compared to baselines ATISS, DiffuScene, and PhyScene, SceneFoundry shows competitive perceptual quality with enhanced functional plausibility.
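To make the notion of navigability across walkable-area thresholds more concrete, the sketch below shows one plausible way to score it. This is an assumption rather than the paper's metric: the scene is rasterized into an occupancy grid, the walkable ratio is taken as the fraction of all cells reachable from a start cell by 4-connected moves, and the success rate at a threshold is the fraction of scenes whose ratio meets it. The names `walkable_ratio` and `success_rate` and the two toy grids are hypothetical.

```python
from collections import deque


def walkable_ratio(grid, start):
    """Fraction of all grid cells reachable from `start` via 4-connected
    moves. `grid` is a list of equal-length strings: '.' is free floor,
    '#' is occupied by furniture or a wall."""
    rows, cols = len(grid), len(grid[0])
    if grid[start[0]][start[1]] != ".":
        return 0.0
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first flood fill over free cells
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == "." and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return len(seen) / (rows * cols)


def success_rate(ratios, threshold):
    """Share of scenes whose walkable ratio is at least `threshold`."""
    return sum(r >= threshold for r in ratios) / len(ratios)


# Two toy 4x4 layouts: furniture in the middle vs. a blocking partition.
open_room = ["....",
             ".##.",
             ".##.",
             "...."]
blocked_room = ["..#.",
                "..#.",
                "###.",
                "...."]

ratios = [walkable_ratio(open_room, (0, 0)),
          walkable_ratio(blocked_room, (0, 0))]
for threshold in (0.2, 0.5):
    print(threshold, success_rate(ratios, threshold))
```

Counting only cells reachable from the start, rather than all free cells, penalizes layouts where furniture seals off part of the room. That is the failure mode a walkable-area optimization step would be expected to correct.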
This summary was generated by AI and reviewed by human editors.