QUICK REVIEW

[Paper Review] Re-Align: Structured Reasoning-guided Alignment for In-Context Image Generation and Editing

Runze He, Yiji Cheng|arXiv (Cornell University)|Jan 8, 2026

Multimodal Machine Learning Applications | Cited by 0

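One-Sentence Summary

Re-Align introduces In-Context Chain-of-Thought (IC-CoT) to align structured reasoning with image generation and editing, and pairs it with a surrogate reward and a diversity strategy to improve in-context image generation and editing (ICGE), achieving state-of-the-art results on ICGE benchmarks among models of comparable scale.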

ABSTRACT

In-context image generation and editing (ICGE) enables users to specify visual concepts through interleaved image-text prompts, demanding precise understanding and faithful execution of user intent. Although recent unified multimodal models exhibit promising understanding capabilities, these strengths often fail to transfer effectively to image generation. We introduce Re-Align, a unified framework that bridges the gap between understanding and generation through structured reasoning-guided alignment. At its core lies the In-Context Chain-of-Thought (IC-CoT), a structured reasoning paradigm that decouples semantic guidance and reference association, providing clear textual target and mitigating confusion among reference images. Furthermore, Re-Align introduces an effective RL training scheme that leverages a surrogate reward to measure the alignment between structured reasoning text and the generated image, thereby improving the model's overall performance on ICGE tasks. Extensive experiments verify that Re-Align outperforms competitive methods of comparable model scale and resources on both in-context image generation and editing tasks.

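Motivation and Objectives

Establish a unified framework for in-context image generation and editing (ICGE) that bridges understanding and generation.
Introduce In-Context Chain-of-Thought (IC-CoT) to decouple semantic guidance from reference association.
Develop a surrogate reward and a reasoning-driven diversity strategy to stabilize policy optimization.
Build Re-Align-410K, a high-quality ICGE dataset with IC-CoT annotations.
Demonstrate state-of-the-art performance on ICGE benchmarks while keeping resource requirements competitive.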

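Proposed Method

Propose IC-CoT, which decomposes reasoning into semantic guidance (a predicted caption or description) and reference association (the role of each reference image).
Learn image generation conditioned on IC-CoT via Rectified Flow, following BAGEL-style diffusion-based generation.
Use a surrogate reward s(x, c) based on CLIP image-text similarity, where the generated image x is matched against the caption c extracted from the IC-CoT.
Introduce a reasoning-driven diversity strategy that increases the variance of the reward signal during training.
Adopt Group Relative Policy Optimization (GRPO) to optimize the consistency between the IC-CoT and the generated image, with training split into two stages: supervised fine-tuning (SFT) followed by RL-based alignment.
Automate data construction to produce Re-Align-410K, a dataset of multi-image ICGE tasks with IC-CoT annotations.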
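
The two parts of IC-CoT and the surrogate reward s(x, c) described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ICCoT` class, its field names, and the `surrogate_reward` helper are hypothetical, and plain Python lists stand in for the embeddings that a CLIP image encoder and text encoder would produce. Cosine similarity is the usual way CLIP-style similarity is computed, but this summary does not state the exact scoring used in the paper.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ICCoT:
    """Hypothetical container for one IC-CoT reasoning trace."""

    # Semantic guidance: the predicted caption/description of the output.
    caption: str
    # Reference association: role of each reference image, keyed by its index.
    reference_roles: Dict[int, str] = field(default_factory=dict)


def cosine_similarity(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def surrogate_reward(image_emb: List[float], caption_emb: List[float]) -> float:
    """s(x, c): similarity between the generated image x and the caption c.

    Assumption: both arguments are precomputed embeddings (toy vectors here,
    CLIP outputs in a real pipeline).
    """
    return cosine_similarity(image_emb, caption_emb)


# Made-up example trace; the caption is what the reward would be scored against.
cot = ICCoT(
    caption="a corgi wearing a red scarf on a beach",
    reference_roles={0: "subject: the corgi", 1: "attribute: the red scarf"},
)
```

In a real setup the two vectors would come from a CLIP model's image and text encoders, with `cot.caption` as the text input; a higher score indicates that the generated image matches the reasoning text more closely.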

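Experimental Results

Research Questions

RQ1: How does structured reasoning (IC-CoT) improve the alignment between prompt understanding and image generation in ICGE tasks?
RQ2: Does a surrogate reward based on caption-image alignment improve generation and editing quality under IC-CoT guidance?
RQ3: Can the reasoning-driven diversity strategy stabilize reinforcement learning for ICGE?
RQ4: How does IC-CoT affect generation and editing performance involving subjects, attributes, and scenes?
RQ5: How does Re-Align compare with existing methods of comparable model scale and resources on ICGE benchmarks?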

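Key Findings

Re-Align achieves state-of-the-art performance on ICGE tasks among models of comparable scale.
IC-CoT provides explicit semantic guidance and reference roles, reducing confusion among reference images and improving generation fidelity.
The surrogate reward based on CLIP image-caption alignment improves the agreement between the reasoning text and the generated image, which helps optimization.
Reasoning-driven diversity increases the variance of the reward signal and stabilizes training, improving overall performance.
On the OmniContext and DreamOmni2Bench benchmarks, Re-Align outperforms baselines such as BAGEL, OmniGen2, Echo-4o, Qwen-Image-Edit-2509, and DreamOmni2 on most metrics.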
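
To see why the variance of the reward signal matters, consider the group-normalized advantage commonly used in GRPO, A_i = (r_i - mean(r)) / (std(r) + eps). The sketch below is illustrative only: the function name, the choice of population standard deviation, and the eps value are assumptions, and the reward values are made-up numbers rather than results from the paper.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# All samples in the group score the same: no sample is preferred over another.
identical = group_relative_advantages([0.3, 0.3, 0.3, 0.3])

# Samples with varied scores: some get positive, some negative advantages.
varied = group_relative_advantages([0.1, 0.2, 0.3, 0.4])

print(identical)
print(varied)
```

When every sample in a group receives the same reward, every advantage is zero and the policy update carries no learning signal. A group with varied rewards yields both positive and negative advantages, which is consistent with the finding that increasing reward variance helps training.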

This review was generated by AI and reviewed by human editors.