QUICK REVIEW

[Paper Review] ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing

Hengjia Li, Liming Jiang | arXiv (Cornell University) | 2026-01-06

Generative Adversarial Networks and Image Synthesis | Citations: 0

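One-Line Summary

ThinkRL-Edit combines Chain-of-Thought (CoT) reasoning sampling, unbiased chain preference grouping, and checklist-based rewards for reasoning-centric image editing, achieving state-of-the-art results on KRIS-Bench and strong generalization on RISE-Bench.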

ABSTRACT

Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving the quality of image editing, but it faces three key challenges: (1) limited reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation in online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.
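
As a rough illustration of the binary-checklist idea mentioned in the abstract, the sketch below scores an edit as the fraction of yes/no checklist items that pass. The `ChecklistItem` type, the `checklist_reward` function, and the example questions are hypothetical names introduced here for illustration. This review does not describe how the paper builds its checklists or prompts the VLM judge, so the binary answers are supplied by hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChecklistItem:
    question: str  # yes/no question a VLM judge would answer about the edited image
    passed: bool   # the judge's binary answer (hard-coded in this sketch)


def checklist_reward(items: List[ChecklistItem]) -> float:
    """Return the fraction of checklist items that passed, in [0, 1]."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(item.passed for item in items) / len(items)


# Hypothetical checklist for the instruction
# "show the ice cube after an hour in the sun".
items = [
    ChecklistItem("Is the ice cube replaced by a puddle of water?", True),
    ChecklistItem("Is the background unchanged?", True),
    ChecklistItem("Is the lighting consistent with direct sunlight?", False),
    ChecklistItem("Is the edited image free of visible artifacts?", True),
]
reward = checklist_reward(items)
print(reward)
```

Each item is a separate yes/no decision, so a failed item can be traced to a specific requirement, which is one plausible reason such rewards are easier to interpret than a single score on a numeric interval.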

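Motivation and Goals

Promote stronger reasoning in instruction-driven image editing by moving beyond exploration confined to denoising stochasticity.
Decouple visual reasoning from generation so that diverse semantic reasoning paths are explored before an image is synthesized.
Introduce unbiased multi-reward ranking and fine-grained checklist-based rewards to provide stable, interpretable guidance.
Demonstrate higher instruction fidelity, visual consistency, and semantic grounding across benchmarks.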

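Proposed Method

Separate the reasoning module from the generation module so that reasoning paths are explored before image synthesis.
Apply Chain-of-Thought (CoT) sampling with planning and reflection stages during online sampling.
Use unbiased chain preference grouping, instead of a simple weighted sum, to rank reasoning chains across multiple reward dimensions.
Replace interval-based VLM rewards with a binary checklist to produce precise, low-variance alignment scores.
Perform Und-Gen optimization, which updates the understanding and generation modules separately, and use planning and reflection at inference time.
Evaluate on KRIS-Bench and RISE-Bench, with Qwen-Edit as the base model and Qwen3-VL as the reward model.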
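
To make the contrast between a weighted sum and a ranking across reward dimensions concrete, the sketch below keeps only those preference pairs whose ordering holds on every reward dimension. This is one possible reading, assumed for illustration, and not the paper's grouping algorithm, which this review does not specify. The function names, the chain names, and the reward values are all hypothetical.

```python
from itertools import permutations
from typing import Dict, List, Sequence, Tuple


def weighted_sum(rewards: Sequence[float], weights: Sequence[float]) -> float:
    """Baseline: collapse a reward vector into a single scalar."""
    return sum(r * w for r, w in zip(rewards, weights))


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every dimension and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def preference_pairs(samples: Dict[str, Sequence[float]]) -> List[Tuple[str, str]]:
    """Return (preferred, rejected) pairs whose ordering holds on every reward dimension."""
    return [
        (i, j)
        for i, j in permutations(samples, 2)
        if dominates(samples[i], samples[j])
    ]


# Hypothetical rewards per sampled chain:
# (instruction following, visual consistency, visual quality).
samples = {
    "chain_a": (1.00, 0.50, 0.50),
    "chain_b": (0.25, 0.75, 0.75),
    "chain_c": (0.25, 0.50, 0.50),
}

scores = {name: weighted_sum(r, (1.0, 1.0, 1.0)) for name, r in samples.items()}
pairs = preference_pairs(samples)
print(scores)
print(pairs)
```

With equal weights, the weighted sum ranks chain_a above chain_b even though chain_b is better on two of the three dimensions, and a different weight vector could reverse that order. The dominance filter leaves that conflicting pair unranked and keeps only the two pairs in which chain_c loses on every dimension.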

Figure 1: Comparisons on reasoning-centric image editing. Although unified multimodal generative models such as Qwen-Edit [qwen-image] have substantially improved editing quality, their underlying reasoning remains underexplored, especially for reasoning-centric editing. In contrast, our method d

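Experimental Results

Research Questions

RQ1: Can explicitly decoupling reasoning from generation improve instruction fidelity in image editing?
RQ2: Does CoT-based reasoning sampling broaden the exploration of semantic reasoning paths for editing?
RQ3: Do unbiased chain preference grouping and checklist rewards provide a more stable and interpretable RL signal for reasoning-centric editing?
RQ4: How does ThinkRL-Edit perform against baselines on reasoning-centric editing benchmarks such as KRIS-Bench and RISE-Bench?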

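Key Results

On KRIS-Bench, scores improve substantially across attributes, with the largest gain in instruction following.
On KRIS-Bench, the Overall Score (average) rises from 49.24 to 71.65, with marked gains in the instruction-following and knowledge categories.
On RISE-Bench, the Overall score rises from 8.9 to 29.7 and Overall Reasoning from 37.2 to 61.7, indicating strong generalization under distribution shift.
In the user study, ThinkRL-Edit is strongly preferred for instruction adherence, visual consistency, and visual quality.
The results demonstrate the benefits of reasoning-based Und-Gen optimization, fine-grained checklist rewards, and unbiased chain preference grouping.
Across a range of metrics, ThinkRL-Edit outperforms open-source baselines such as OmniGen2, Flux-Kontext, Bagel, Bagel-Think, UniCoT, and Qwen-Edit.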
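
For readers who want the size of the reported gains at a glance, the short calculation below derives absolute and relative improvements from the three score pairs quoted above. Only the six numbers come from this review. The dictionary layout and variable names are illustrative.

```python
# Scores quoted in the results above: (starting score, ThinkRL-Edit score).
reported = {
    "KRIS-Bench Overall": (49.24, 71.65),
    "RISE-Bench Overall": (8.9, 29.7),
    "RISE-Bench Overall Reasoning": (37.2, 61.7),
}

gains = {}
for name, (before, after) in reported.items():
    absolute = round(after - before, 2)                     # gain in points
    relative = round(100 * (after - before) / before, 1)    # gain relative to the starting score, in %
    gains[name] = (absolute, relative)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

The absolute gains are 22.41, 20.8, and 24.5 points respectively. The relative gain on RISE-Bench Overall is by far the largest because its starting score is low.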

Figure 2: Comparison with prior methods. Prior RL methods for visual generation [liu2025flow, xue2025dancegrpo] focus on exploration within the stochastic space of generation, improving synthesis quality but offering limited reasoning capability. To address this issue, we decouple and optimize t

This review was generated by AI and reviewed by a human editor.