QUICK REVIEW

[Paper Review] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Hengjia Li, Liming Jiang|arXiv (Cornell University)|Jan 6, 2026

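Generative Adversarial Networks and Image Synthesis|Citations: 0

One-Line Summary

ThinkRL-Edit decouples reasoning from image synthesis and introduces Chain-of-Thought reasoning sampling, unbiased chain preference grouping, and checklist rewards for reasoning-centric image editing. It achieves state-of-the-art results on KRIS-Bench and generalizes strongly to RISE-Bench.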

ABSTRACT

Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.

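Motivation and Objectives

Improve reasoning in instruction-driven image editing beyond exploration confined to denoising stochasticity.
Decouple visual reasoning into a stage before generation so that diverse semantic reasoning paths can be explored.
Introduce unbiased multi-reward ranking and fine-grained checklist-based rewards to provide stable, interpretable guidance.
Demonstrate superior instruction faithfulness, visual coherence, and semantic grounding across benchmarks.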

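Proposed Method

Decouple the reasoning and generation modules so that reasoning paths are explored before image synthesis.
Apply Chain-of-Thought (CoT) sampling with planning and reflection stages during online sampling.
Use unbiased chain preference grouping, rather than a simple weighted sum, to rank reasoning chains across multiple reward dimensions.
Replace interval-based VLM rewards with a binary checklist to produce more precise, lower-variance alignment scores.
Perform Und-Gen optimization with understanding (reasoning) separated from generation: planning and reflection are carried out in the reasoning stage, and reasoning and generation are updated separately.
Evaluate on KRIS-Bench and RISE-Bench, using Qwen-Edit as the base model and Qwen3-VL to compute rewards.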
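
The checklist reward and the multi-dimension ranking listed above can be illustrated with a small Python sketch. This is not the authors' implementation: the function names (checklist_reward, dominates, preference_pairs), the reward dimension names, and the sample values are all hypothetical, and strict dominance across dimensions is only an assumed stand-in for the paper's grouping strategy, whose details this review does not give. The sketch shows how binary yes/no judgments can be turned into a score, and how samples can be compared without collapsing several rewards into one weighted sum.

```python
from typing import Dict, List, Tuple


def checklist_reward(answers: List[bool]) -> float:
    """Fraction of binary checklist items judged as satisfied (assumed scoring rule)."""
    if not answers:
        raise ValueError("checklist must contain at least one item")
    return sum(answers) / len(answers)


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """True if `a` is at least as good as `b` on every dimension and strictly better on one."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)


def preference_pairs(samples: Dict[str, Dict[str, float]]) -> List[Tuple[str, str]]:
    """Return (winner, loser) pairs in which the winner dominates the loser."""
    names = sorted(samples)
    return [
        (w, l)
        for w in names
        for l in names
        if w != l and dominates(samples[w], samples[l])
    ]


# Hypothetical rollouts: the checklist answers and consistency scores are made up.
samples = {
    "chain_a": {"instruction": checklist_reward([True, True, True, False]), "consistency": 0.9},
    "chain_b": {"instruction": checklist_reward([True, False, False, False]), "consistency": 0.8},
    "chain_c": {"instruction": checklist_reward([True, True, True, True]), "consistency": 0.5},
}

pairs = preference_pairs(samples)
print(pairs)
```

In this toy example only chain_a is preferred over chain_b. chain_c has the best instruction score but the worst consistency, so it is neither preferred nor rejected; a weighted sum would have had to pick a winner based on arbitrary weights.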

Figure 1: Comparisons on reasoning-centric image editing. Although unified multimodal generative models such as Qwen-Edit [qwen-image] have substantially improved editing quality, their underlying reasoning remains underexplored, especially for reasoning-centric editing. In contrast, our method d

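Experimental Results

Research Questions

RQ1: Does explicitly decoupling reasoning from generation improve instruction faithfulness in image editing?
RQ2: Does CoT-based reasoning sampling broaden the exploration of semantic reasoning paths for editing?
RQ3: Do unbiased chain preference grouping and checklist rewards produce more stable and interpretable RL signals for reasoning-centric editing?
RQ4: How does ThinkRL-Edit perform against baselines on reasoning-centric editing benchmarks such as KRIS-Bench and RISE-Bench?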

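Key Findings

On KRIS-Bench, ThinkRL-Edit achieves notable gains across attributes, with the largest improvement in Instruction Following.
On KRIS-Bench, the Overall Score improves from 49.24 to 71.65 (average), with notable gains in Instruction Following and the knowledge categories.
On RISE-Bench, the Overall score rises from 8.9 to 29.7 and Overall Reasoning from 37.2 to 61.7, indicating strong generalization under distribution shift.
In the user study, ThinkRL-Edit is strongly preferred for Instruction Following, Visual Consistency, and Visual Quality.
Ablation results show the benefits of CoT-based Und-Gen optimization, fine-grained checklist rewards, and unbiased chain preference grouping.
Across multiple metrics, ThinkRL-Edit outperforms the open-source baselines (OmniGen2, Flux-Kontext, Bagel, Bagel-Think, UniCoT, Qwen-Edit).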
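
To make the size of the reported improvements explicit, the before/after scores quoted above can be turned into absolute gains. The short Python sketch below uses only the numbers stated in this review; the dictionary keys and the helper name gain are illustrative choices, not terms from the paper.

```python
# (before, after) scores as quoted in the findings above.
reported = {
    "KRIS-Bench Overall Score": (49.24, 71.65),
    "RISE-Bench Overall": (8.9, 29.7),
    "RISE-Bench Overall Reasoning": (37.2, 61.7),
}


def gain(before: float, after: float) -> float:
    """Absolute improvement in points, rounded to two decimals."""
    return round(after - before, 2)


gains = {name: gain(before, after) for name, (before, after) in reported.items()}

for name, points in gains.items():
    print(f"{name}: +{points:.2f} points")
```

The gains come out to 22.41 points on the KRIS-Bench Overall Score, 20.80 on the RISE-Bench Overall score, and 24.50 on RISE-Bench Overall Reasoning.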

Figure 2: Comparison with prior methods. Prior RL methods for visual generation [liu2025flow, xue2025dancegrpo] focus on exploration within the stochastic space of generation, improving synthesis quality but offering limited reasoning capability. To address this issue, we decouple and optimize t

This review was generated by AI and checked by a human editor.