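[Paper Digest] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing
ThinkRL-Edit decouples reasoning from image synthesis and introduces chain-of-thought reasoning sampling, unbiased chain preference grouping, and checklist-based rewards for reasoning-centric image editing, achieving state-of-the-art results on KRIS-Bench and strong generalization on RISE-Bench.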
Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
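Research Motivation and Goals
- Push instruction-driven image editing toward reasoning improvements that go beyond exploration confined to denoising.
- Decouple visual reasoning from synthesis so that diverse semantic reasoning trajectories can be explored before generation.
- Introduce unbiased multi-reward ranking and fine-grained, checklist-based rewards for stable, interpretable guidance.
- Demonstrate higher instruction faithfulness, visual coherence, and semantic grounding on benchmarks.
Proposed Method
- Decouple the reasoning and generation modules to explore reasoning trajectories before image synthesis.
- Apply Chain-of-Thought (CoT) sampling, with planning and reflection stages, during online sampling.
- Use unbiased chain preference grouping to rank reasoning chains across multiple reward dimensions instead of taking a simple weighted sum.
- Replace interval-based VLM rewards with a binary checklist to produce more precise, lower-variance alignment scores.
- Perform decoupled Und-Gen optimization that updates the understanding and generation modules separately, with the reasoning and reflection stages run at inference time.
- Evaluate on KRIS-Bench and RISE-Bench, using Qwen-Edit as the base model and Qwen3-VL as the reward model.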
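To make the contrast with weighted-sum fusion concrete, here is a minimal Python sketch of one plausible reading of chain-based preference grouping. It is an assumption made for illustration, not the authors' algorithm: samples are kept only if they form a chain in which each one is at least as good as the next on every reward dimension, so no weighting between dimensions is needed. The function names `dominates` and `longest_preference_chain`, and the two-dimensional reward tuples, are hypothetical.

```python
from typing import List, Optional, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every dimension and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def longest_preference_chain(rewards: List[Sequence[float]]) -> List[int]:
    """Return sample indices, best first, forming the longest chain in which
    each sample dominates the next on all reward dimensions (assumed reading)."""
    if not rewards:
        return []
    # Dominance implies a larger reward sum, so this order is a valid
    # topological order of the dominance graph.
    order = sorted(range(len(rewards)), key=lambda i: sum(rewards[i]), reverse=True)
    best_len = {i: 1 for i in order}
    prev: dict = {i: None for i in order}
    for pos, i in enumerate(order):
        for j in order[:pos]:
            if dominates(rewards[j], rewards[i]) and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    # Walk back from the end of the longest chain to its head.
    tail: Optional[int] = max(order, key=lambda i: best_len[i])
    chain = []
    while tail is not None:
        chain.append(tail)
        tail = prev[tail]
    return chain[::-1]


# Hypothetical (instruction_following, visual_quality) rewards for four samples.
rewards = [(0.9, 0.2), (0.6, 0.6), (0.5, 0.5), (0.3, 0.1)]
chain = longest_preference_chain(rewards)
```

In this toy input, sample 0 trades one dimension against the other, so it is incomparable with samples 1 and 2 and is left out of the chain; a weighted sum would instead rank it according to whichever weights were chosen.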
![Figure 1 : Comparisons on reasoning-centric image editing. Although unified multimodal generative models such as Qwen-Edit [ qwen-image ] have substantially improved editing quality, their underlying reasoning remains underexplored, especially for reasoning-centric editing. In contrast, our method d](https://ar5iv.labs.arxiv.org/html/2601.03467/assets/x1.png)
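Experimental Results
Research Questions
- RQ1: Does explicitly decoupling reasoning from generation improve instruction faithfulness in image editing?
- RQ2: Does CoT-based reasoning sampling broaden the exploration of semantic reasoning paths for an edit?
- RQ3: Do unbiased chain preference grouping and checklist rewards provide a more stable, interpretable RL signal for reasoning-centric editing?
- RQ4: How does ThinkRL-Edit perform relative to baselines on reasoning-centric editing benchmarks such as KRIS-Bench and RISE-Bench?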
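As a rough illustration of the checklist rewards that RQ3 refers to, the sketch below scores an edit as the fraction of binary checks it passes. It rests on assumptions: the averaging rule, the `checklist_reward` function, the example questions, and the stub judge are all hypothetical stand-ins, and in practice the judge would be a VLM answering yes or no about the edited image.

```python
from typing import Callable, List


def checklist_reward(questions: List[str], judge: Callable[[str], bool]) -> float:
    """Fraction of yes/no checklist items the judge marks as satisfied.

    `judge` stands in for a VLM call; averaging is an assumed aggregation rule.
    """
    if not questions:
        raise ValueError("checklist must contain at least one question")
    passed = sum(1 for q in questions if judge(q))
    return passed / len(questions)


# Hypothetical checklist for an instruction such as
# "show what the ice cube looks like after an hour in the sun".
questions = [
    "Is the ice cube replaced by a puddle of water?",
    "Is the background unchanged?",
    "Is the lighting consistent with the source image?",
]

# Stub judge for demonstration only: it approves two of the three items.
approved = {questions[0], questions[1]}
reward = checklist_reward(questions, lambda q: q in approved)
```

Because each item is a yes/no decision, the score can only take a few discrete values, which is one intuition for why such a reward may vary less across repeated judgments than a free-form interval score.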
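Key Findings
- Achieves significant gains across all KRIS-Bench attributes, most notably in instruction following.
- Raises the overall KRIS-Bench score from 49.24 to 71.65 (average), with notable gains in the instruction-following and knowledge categories.
- Raises the overall RISE-Bench score from 8.9 to 29.7 and overall reasoning from 37.2 to 61.7, indicating strong generalization to out-of-distribution cases.
- A user study shows a strong preference for ThinkRL-Edit in instruction following, visual consistency, and visual quality.
- Ablation studies confirm the benefits of CoT-based Und-Gen optimization, fine-grained checklist rewards, and unbiased chain preference grouping.
- Across multiple metrics, ThinkRL-Edit outperforms open-source baselines such as OmniGen2, Flux-Kontext, Bagel, Bagel-Think, UniCoT, and Qwen-Edit.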
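The size of the reported gains can be checked with simple arithmetic. The short Python sketch below uses only the before and after scores quoted in the findings above; the helper name `gain` and the dictionary keys are illustrative.

```python
def gain(before: float, after: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = round(after - before, 2)
    relative = round(100 * (after - before) / before, 1)
    return absolute, relative


# Before/after scores as quoted in the findings.
reported = {
    "KRIS-Bench overall": (49.24, 71.65),
    "RISE-Bench overall": (8.9, 29.7),
    "RISE-Bench overall reasoning": (37.2, 61.7),
}

gains = {name: gain(before, after) for name, (before, after) in reported.items()}
# KRIS-Bench overall           -> (22.41, 45.5)
# RISE-Bench overall           -> (20.8, 233.7)
# RISE-Bench overall reasoning -> (24.5, 65.9)
```

The absolute gains are similar across the three scores (roughly 21 to 25 points), while the relative gain on the RISE-Bench overall score is much larger because its starting value is low.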
![Figure 2 : Comparison with prior methods. Prior RL methods for visual generation [ liu2025flow , xue2025dancegrpo ] focus on exploration within the stochastic space of generation, improving synthesis quality but offering limited reasoning capability. To address this issue, we decouple and optimize t](https://ar5iv.labs.arxiv.org/html/2601.03467/assets/x2.png)
This digest was generated by AI and reviewed by a human editor.