QUICK REVIEW

[Paper Digest] SceneNAT: Masked Generative Modeling for Language-Guided Indoor Scene Synthesis

Jeongjun Choi, Yeonsoo Park|arXiv (Cornell University)|Jan 12, 2026

Multimodal Machine Learning Applications|Cited by 0

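ONE-SENTENCE SUMMARY

SceneNAT uses a masked non-autoregressive Transformer with a dedicated triplet predictor to synthesize 3D indoor scenes from language instructions, offering better controllability and efficiency than autoregressive or diffusion baselines.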

ABSTRACT

We present SceneNAT, a single-stage masked non-autoregressive Transformer that synthesizes complete 3D indoor scenes from natural language instructions through only a few parallel decoding passes, offering improved performance and efficiency compared to prior state-of-the-art approaches. SceneNAT is trained via masked modeling over fully discretized representations of both semantic and spatial attributes. By applying a masking strategy at both the attribute level and the instance level, the model can better capture intra-object and inter-object structure. To boost relational reasoning, SceneNAT employs a dedicated triplet predictor for modeling the scene's layout and object relationships by mapping a set of learnable relation queries to a sparse set of symbolic triplets (subject, predicate, object). Extensive experiments on the 3D-FRONT dataset demonstrate that SceneNAT achieves superior performance compared to state-of-the-art autoregressive and diffusion baselines in both semantic compliance and spatial arrangement accuracy, while operating with substantially lower computational cost.
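
The abstract's "fully discretized representations" of spatial attributes can be pictured with a uniform-binning sketch. This is only an illustration: the bin count, the value ranges, and the names `quantize`, `dequantize`, `tokenize`, and `SceneObject` are assumptions made here, not details taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import List

NUM_BINS = 64  # hypothetical resolution; the paper's bin count is not stated in this digest


def quantize(value: float, low: float, high: float, num_bins: int = NUM_BINS) -> int:
    """Map a continuous value to an integer bin index, clipping to [low, high]."""
    clipped = min(max(value, low), high)
    ratio = (clipped - low) / (high - low)
    # The upper edge (ratio == 1.0) is folded into the last bin.
    return min(int(ratio * num_bins), num_bins - 1)


def dequantize(index: int, low: float, high: float, num_bins: int = NUM_BINS) -> float:
    """Return the centre of a bin, i.e. the value recovered from a token."""
    return low + (index + 0.5) * (high - low) / num_bins


@dataclass
class SceneObject:
    category_id: int  # semantic attribute, already discrete
    x: float          # position along one axis, assumed range [-4, 4] metres
    yaw: float        # rotation, assumed range [-pi, pi] radians


def tokenize(obj: SceneObject) -> List[int]:
    """Turn one object into a short sequence of discrete tokens."""
    return [
        obj.category_id,
        quantize(obj.x, -4.0, 4.0),
        quantize(obj.yaw, -math.pi, math.pi),
    ]
```

With every attribute expressed as a token, masking a single attribute of an object or masking a whole object both reduce to hiding entries of the same token sequence. The round-trip error of this scheme is at most half a bin width.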

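MOTIVATION AND GOALS

Enable controllable 3D indoor scene generation from natural-language instructions.
Improve efficiency and scalability over autoregressive and diffusion-based methods.
Explicitly model inter-object relations with a triplet predictor to improve layout accuracy.
Learn through masked modeling over discretized semantic and spatial attributes.
Demonstrate state-of-the-art performance on 3D-FRONT and robust generalization to complex instructions.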

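PROPOSED METHOD

Frame scene generation as masked modeling over discretized object-level representations (category, appearance, position, scale, yaw).
Use a non-autoregressive Transformer to predict masked tokens in parallel through iterative refinement.
Introduce a dedicated triplet predictor that parses sparse (subject, predicate, object) relations from the instruction and fuses the relation embeddings through cross-attention.
Train with a reconstruction loss on masked tokens plus a set-based triplet loss that uses Hungarian matching to align predicted triplets with the ground truth.
Adopt a cosine-based dynamic masking schedule that combines object-level and token-level masking with a replace-and-remask strategy for stable training; at inference, decode iteratively and in parallel, following MaskGIT.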
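
The cosine schedule and MaskGIT-style parallel decoding mentioned above can be sketched as follows. This is a toy illustration under stated assumptions: `toy_predictor` is a random stand-in for the Transformer, `MASK` and the vocabulary size of 10 are hypothetical, and the replace-and-remask strategy and text conditioning are not modeled.

```python
import math
import random

MASK = -1  # hypothetical id for the mask token


def cosine_mask_ratio(progress: float) -> float:
    """Fraction of tokens that stay masked at a given progress in [0, 1]."""
    return math.cos(0.5 * math.pi * progress)


def toy_predictor(tokens, rng):
    """Stand-in for the model: one (token, confidence) guess per position.

    A real model would produce a distribution over the vocabulary for
    every position in a single forward pass.
    """
    return [(rng.randrange(10), rng.random()) for _ in tokens]


def iterative_decode(length: int, steps: int, seed: int = 0):
    """Fill a fully masked sequence in `steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(steps):
        guesses = toy_predictor(tokens, rng)
        # Number of positions that must still be masked after this pass.
        keep_masked = math.floor(cosine_mask_ratio((step + 1) / steps) * length)
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        # Commit the most confident guesses; the rest stay masked.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        num_commit = len(masked) - keep_masked
        for i in masked[:num_commit]:
            tokens[i] = guesses[i][0]
    return tokens


scene_tokens = iterative_decode(length=12, steps=4)
```

Because the schedule reaches zero at the final step, every position is committed after `steps` passes no matter how long the sequence is. That is the property behind the "few parallel decoding passes" claim, in contrast to one pass per token for autoregressive decoding.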

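EXPERIMENTAL RESULTS

Research Questions

RQ1: How can masked non-autoregressive generation improve language-guided 3D indoor scene synthesis?
RQ2: Does explicit relational reasoning through the triplet predictor improve how well text-conditioned scenes follow complex spatial relations?
RQ3: Can parallel iterative decoding match or exceed diffusion or autoregressive methods in quality and efficiency?
RQ4: How do the masking strategy and discretization granularity affect scene fidelity and controllability?
RQ5: How well does the model generalize to unseen levels of relational complexity in instructions?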

Key Findings

Room type	iRecall (%) (↑)	FID (↓)	FID^CLIP (↓)	KID ×1e3 (↓)	V_cap^sum (↓)
Bedroom (Ours)	70.45 (1.92)	109.55 (1.36)	6.19 (0.12)	-1.18 (0.16)	69.58 (12.00)
Living room (Ours)	50.01 (2.25)	110.28 (1.18)	5.49 (0.09)	6.18 (1.11)	151.24 (11.14)
Dining room (Ours)	56.29 (2.47)	129.65 (1.68)	7.51 (0.17)	12.26 (0.99)	169.31 (13.22)

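SceneNAT achieves state-of-the-art performance on 3D indoor scene synthesis, outperforming autoregressive and diffusion baselines in semantic alignment and spatial accuracy.
SceneNAT attains higher iRecall across room types while lowering inference cost (up to 24.7× faster than DiffuScene and about 5× faster than InstructScene).
The dedicated triplet predictor provides robust relational reasoning, improving controllability and layout fidelity for complex instructions.
Ablations show that triplet supervision, object-level and token-level masking, and the replace-and-remask strategy are all needed for the best performance.
Zero-shot downstream results are competitive with or better than the baselines, and the layout-to-object task shows strong bidirectional context modeling.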
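
The triplet supervision credited in the ablations relies on a set-based loss, which first needs a one-to-one assignment between predicted and ground-truth triplets. The sketch below illustrates that assignment step only. It swaps in a brute-force search over permutations for the Hungarian algorithm (same optimum, but exponential time) and uses a hypothetical mismatch-count cost, whereas a trained model would more likely score matches from predicted probabilities. The names `triplet_cost` and `match_triplets` are illustrative.

```python
from itertools import permutations


def triplet_cost(pred, target):
    """Count mismatched slots between two (subject, predicate, object) triplets."""
    return sum(p != t for p, t in zip(pred, target))


def match_triplets(preds, targets):
    """Find the minimum-cost one-to-one assignment of predictions to targets.

    Returns (total_cost, assignment), where assignment[j] is the index of
    the prediction matched to targets[j]. Assumes len(preds) >= len(targets);
    predictions left unmatched would be treated as "no relation".
    """
    best_cost, best_assignment = None, None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(triplet_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost, best_assignment


# Made-up example: three relation queries, two ground-truth relations.
example_preds = [
    ("lamp", "left of", "bed"),
    ("chair", "in front of", "desk"),
    ("bed", "next to", "wardrobe"),
]
example_targets = [
    ("chair", "in front of", "desk"),
    ("lamp", "right of", "bed"),
]
example_cost, example_assignment = match_triplets(example_preds, example_targets)
```

For realistic numbers of relation queries, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice; brute force is used here only to keep the sketch dependency-free.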

This digest was generated by AI and reviewed by a human editor.