QUICK REVIEW

[Paper Review] ImageEdit-R1: Boosting Multi-Agent Image Editing via Reinforcement Learning

Yiran Zhao, Yaoqi Ye|arXiv (Cornell University)|Mar 9, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

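One-line summary

ImageEdit-R1 introduces a three-agent reinforcement learning framework for instruction-based image editing, achieving stronger instruction alignment and output quality across multiple backbones and benchmarks without modifying the underlying editor.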

ABSTRACT

With the rapid advancement of commercial multi-modal models, image editing has garnered significant attention due to its widespread applicability in daily life. Despite impressive progress, existing image editing systems, particularly closed-source or proprietary models, often struggle with complex, indirect, or multi-step user instructions. These limitations hinder their ability to perform nuanced, context-aware edits that align with human intent. In this work, we propose ImageEdit-R1, a multi-agent framework for intelligent image editing that leverages reinforcement learning to coordinate high-level decision-making across a set of specialized, pretrained vision-language and generative agents. Each agent is responsible for distinct capabilities--such as understanding user intent, identifying regions of interest, selecting appropriate editing actions, and synthesizing visual content--while reinforcement learning governs their collaboration to ensure coherent and goal-directed behavior. Unlike existing approaches that rely on monolithic models or hand-crafted pipelines, our method treats image editing as a sequential decision-making problem, enabling dynamic and context-aware editing strategies. Experimental results demonstrate that ImageEdit-R1 consistently outperforms both individual closed-source diffusion models and alternative multi-agent framework baselines across multiple image editing datasets.

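Motivation and objectives

Motivate robust, context-aware image editing that can interpret indirect or multi-step user instructions.
Decompose an editing request into structured elements (action, subject, goal) to enable modular planning.
Coordinate specialized agents through reinforcement learning to produce coherent editing sequences.
Demonstrate effectiveness across diverse editing backbones and standard benchmarks.
Show that RL-based decomposition improves instruction alignment and visual quality without modifying the base editor.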

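Proposed method

A decomposition agent extracts actions, subjects, and goals from the user request and the input image to form a structured editing representation.
A sequencing agent arranges the extracted components into an ordered list of sub-requests for modular execution.
An editing agent, based on a diffusion model, applies the sub-requests in sequence to produce the edited image.
Reinforcement learning (GRPO) trains the decomposition agent with rewards for the correctness of format, actions, subjects, and goals, using F1 for set-based elements.
The RL loop relies on an RL dataset, old-to-new policy updates, and a GRPO loss with KL regularization to keep training stable.
Evaluation uses multi-turn editing benchmarks (PSR, RealEdit, UltraEdit) and LLM-based judges (GPT-4o, Gemini-2.5) to assess instruction alignment and output quality.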
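
The reward and the group-relative step of GRPO described above can be illustrated with a minimal sketch. It is not the authors' implementation. The `Decomposition` type, the equal reward weights, the convention that an unparseable output earns zero, and the choice to treat actions, subjects, and goals all as sets of normalized strings are assumptions made for illustration. The advantage computation follows the usual GRPO recipe of normalizing each reward by the mean and standard deviation of its sampling group.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class Decomposition:
    """Hypothetical parsed output of the decomposition agent."""
    actions: FrozenSet[str]
    subjects: FrozenSet[str]
    goals: FrozenSet[str]


def set_f1(predicted: FrozenSet[str], reference: FrozenSet[str]) -> float:
    """F1 between two sets; two empty sets count as a perfect match."""
    if not predicted and not reference:
        return 1.0
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def decomposition_reward(
    pred: Optional[Decomposition],
    ref: Decomposition,
    weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25),  # assumed weights
) -> float:
    """Weighted sum of format, action, subject, and goal terms.

    `pred` is None when the model output could not be parsed, in which
    case every term (including the format term) is zero.
    """
    if pred is None:
        return 0.0
    w_format, w_action, w_subject, w_goal = weights
    return (
        w_format * 1.0
        + w_action * set_f1(pred.actions, ref.actions)
        + w_subject * set_f1(pred.subjects, ref.subjects)
        + w_goal * set_f1(pred.goals, ref.goals)
    )


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one sampled group (GRPO-style)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = variance ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

In a full training loop these advantages would weight a clipped policy-ratio objective with a KL penalty toward a reference policy; that part is omitted here because the summary does not give its hyperparameters.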

Figure 1: Overview of ImageEdit-R1: (1) The decomposition agent analyzes the user instruction and input image to extract a structured representation of the desired edits, including editing actions, subjects, and goals. (2) The sequencing agent arranges these components into ...
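
The decompose, sequence, edit flow can be sketched as plain function composition. Everything named here is hypothetical: `EditComponent`, `to_sub_request`, `run_pipeline`, and the three stub agents stand in for the pretrained vision-language and diffusion models, and the "image" is modeled as a list of applied prompts so that the example runs without any model. The `single_turn` flag contrasts sending all sub-requests to the editor in one pass with applying them one at a time.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class EditComponent:
    """One (action, subject, goal) triple; a hypothetical schema."""
    action: str
    subject: str
    goal: str


def to_sub_request(component: EditComponent) -> str:
    """Render a component as a textual sub-request for the editor."""
    return f"{component.action} {component.subject} so that {component.goal}"


def run_pipeline(
    image: List[str],
    instruction: str,
    decompose: Callable[[str, List[str]], Sequence[EditComponent]],
    order: Callable[[Sequence[EditComponent]], Sequence[EditComponent]],
    edit: Callable[[List[str], str], List[str]],
    single_turn: bool = True,
) -> List[str]:
    """Chain the three agents; the base editor itself is never modified."""
    components = decompose(instruction, image)
    sub_requests = [to_sub_request(c) for c in order(components)]
    if single_turn:
        # One holistic pass: all sub-requests go to the editor together.
        return edit(image, "; ".join(sub_requests))
    # Multi-turn: feed each intermediate result back into the editor.
    for request in sub_requests:
        image = edit(image, request)
    return image


# Stub agents (assumptions) so the sketch is runnable end to end.
def stub_decompose(instruction: str, image: List[str]) -> List[EditComponent]:
    return [
        EditComponent("add", "a red scarf", "the dog looks dressed for winter"),
        EditComponent("remove", "the leash", "the dog appears unrestrained"),
    ]


def stub_order(components: Sequence[EditComponent]) -> List[EditComponent]:
    # Placeholder ordering rule: removals first, otherwise keep input order.
    return sorted(components, key=lambda c: c.action != "remove")


def stub_edit(image: List[str], prompt: str) -> List[str]:
    # The "image" is just the history of prompts applied to it.
    return image + [prompt]
```

Calling `run_pipeline([], "make the dog winter-ready", stub_decompose, stub_order, stub_edit)` invokes the editor once, while passing `single_turn=False` invokes it once per sub-request.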

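Experimental results

Research questions

RQ1: Can a decompose-plan-edit multi-agent pipeline outperform single-model baselines on complex image editing?
RQ2: Does reinforcement learning on the decomposition step improve alignment with user intent and downstream editing quality?
RQ3: How do different editing backbones respond to the multi-agent RL framework on standard benchmarks?
RQ4: How do single-turn and multi-turn execution affect final editing quality?
RQ5: How well do LLM-based judges correlate with human judgments?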

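Key findings

With Qwen-Image-Edit as the backbone, ImageEdit-R1 consistently outperforms single-model editors and non-RL multi-agent baselines on PSR, RealEdit, and UltraEdit (average gains are reported).
Reinforcement learning on the decomposition agent is essential: the multi-agent framework without RL shows almost no improvement, whereas RL yields a notable average gain.
The single-turn execution strategy outperforms multi-turn execution across benchmarks, suggesting that editing all sub-requests in a single holistic pass yields more consistent results.
Goal conditioning in the reward improves average performance over a configuration without explicit goal supervision.
The human-alignment analysis shows a strong correlation between LLM-based and human judgments, supporting the practicality of LLMs for automatic evaluation of editing quality.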
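
One common way to quantify agreement between LLM-judge scores and human ratings is a rank correlation. The sketch below computes Spearman's coefficient in pure Python, with average ranks for ties. The summary does not say which statistic the paper uses, so the choice of Spearman is an assumption, and the function names are illustrative.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def spearman(judge_scores: Sequence[float], human_scores: Sequence[float]) -> float:
    """Spearman rank correlation between two equally long score lists."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two score lists of equal length >= 2")
    return pearson(average_ranks(judge_scores), average_ranks(human_scores))
```

A value near 1 means the judge orders edited images almost exactly as human raters do; a value near 0 means the two rankings are unrelated.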

Figure 2: Representative examples demonstrating the performance of ImageEdit-R1 compared to baselines on complex editing tasks.

This review was generated by AI and checked by a human editor.