QUICK REVIEW

[Paper Review] SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Sashuai Zhou, Qiang Zhou|arXiv (Cornell University)|Mar 23, 2026
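Generative Adversarial Networks and Image Synthesis|Cited by 0
ONE-SENTENCE SUMMARY

A verifiable spatial reward model for text-to-image generation that combines prompt decomposition, expert detectors, and vision-language reasoning to improve fine-grained spatial consistency, and that introduces SpatRelBench for fine-grained spatial evaluation.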

ABSTRACT

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.

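RESEARCH MOTIVATION AND GOALS

  • Demonstrate that text-to-image generation needs fine-grained spatial evaluation beyond rewards based on global semantics.
  • Propose SpatialReward, which verifiably evaluates spatial layouts using structured prompts and grounded evidence.
  • Develop SpatRelBench, a benchmark for complex spatial relations, including orientation, 3D layout, and text placement.
  • Show that SpatialReward improves spatial consistency in RL-trained models and agrees with human judgments.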

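PROPOSED METHOD

  • A Prompt Decomposer extracts entities, attributes, and spatial relations from free-form prompts.
  • Expert detectors ground object positions and attributes, supplying the verifiable evidence behind the reward.
  • A vision-language model with chain-of-thought reasoning infers spatial relations from the grounded observations and computes the final reward.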
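
To make the data flow of the three stages concrete, here is a minimal Python sketch under stated assumptions. It is not the authors' implementation: the names `Box`, `Relation`, `relation_holds`, and `spatial_reward` are hypothetical, the decomposed relations and detector boxes are hand-written stand-ins for the Prompt Decomposer and expert-detector outputs, and a simple box-center rule replaces the vision-language reasoning step, which the paper reserves for relations that rules handle poorly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixels; origin at top-left, y grows downward."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


@dataclass(frozen=True)
class Relation:
    """One (subject, predicate, object) triple, as a decomposer might emit."""
    subject: str
    predicate: str  # one of: "left of", "right of", "above", "below"
    obj: str


def relation_holds(rel, detections):
    """Check a relation against detected boxes by comparing box centers.

    A relation whose subject or object was not detected counts as failed.
    """
    if rel.subject not in detections or rel.obj not in detections:
        return False
    sx, sy = detections[rel.subject].center
    ox, oy = detections[rel.obj].center
    checks = {
        "left of": sx < ox,
        "right of": sx > ox,
        "above": sy < oy,  # smaller y is higher in image coordinates
        "below": sy > oy,
    }
    if rel.predicate not in checks:
        raise ValueError(f"unsupported predicate: {rel.predicate!r}")
    return checks[rel.predicate]


def spatial_reward(relations, detections):
    """Fraction of requested relations that the detections satisfy."""
    if not relations:
        return 0.0
    return sum(relation_holds(r, detections) for r in relations) / len(relations)


# Hand-written stand-in for the prompt:
# "a cat to the left of a dog, with a bird above the dog"
relations = [
    Relation("cat", "left of", "dog"),
    Relation("bird", "above", "dog"),
]
# Hand-written stand-in for detector output (the bird landed below the dog).
detections = {
    "cat": Box(10, 100, 110, 200),
    "dog": Box(200, 100, 300, 200),
    "bird": Box(220, 250, 260, 290),
}
reward = spatial_reward(relations, detections)
print(reward)  # 0.5: the cat relation holds, the bird relation does not
```

Because every term of the score traces back to a named relation and a detected box, a failed reward can be inspected relation by relation, which is the sense in which such a reward is verifiable.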

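EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

  • RQ1: Can a verifiable spatial reward model improve fine-grained spatial consistency in text-to-image generation compared with holistic or template-based rewards?
  • RQ2: Does decomposing the prompt and grounding it with detectors yield more accurate spatial reasoning than relying on a vision-language model alone?
  • RQ3: How do spatially aware rewards affect RL-trained text-to-image models across backbones such as Stable Diffusion and FLUX?
  • RQ4: Is there a reliable benchmark for evaluating complex spatial relations in text-to-image outputs?
  • RQ5: Do SpatialReward scores correlate more strongly with human evaluations than scores from other reward models?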

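KEY FINDINGS

  • Integrating SpatialReward into RL training of SD3.5-M and FLUX1-dev consistently improved spatial consistency and generation quality.
  • SpatialReward aligns more closely with human spatial judgments than baseline reward models.
  • Ablations show that expert detection and chain-of-thought reasoning contribute substantially to performance, and that the exclusion constraint adds robustness.
  • SpatRelBench captures fine-grained spatial dimensions (orientation, 3D relations, and text placement) and reveals performance gaps that single-dimension benchmarks do not expose.
  • The human alignment study shows that SpatialReward has the highest correlation with human judgments among the evaluated rewards.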
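
As an illustration of how agreement between reward scores and human ratings can be quantified, the sketch below computes a Spearman rank correlation in plain Python. This review does not state which correlation measure the paper uses, so Spearman is an assumption here, and the scores and ratings are toy values invented for the example, not results from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy values only: one reward score and one human rating per image.
toy_reward_scores = [0.2, 0.9, 0.5, 0.7]
toy_human_ratings = [1, 5, 3, 4]
rho = spearman(toy_reward_scores, toy_human_ratings)
print(rho)  # 1.0: both lists order the four images identically
```

A rank-based measure compares only the ordering of images, so it is unaffected by the reward and the human ratings living on different scales.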


This review was generated by AI and reviewed by a human editor.