QUICK REVIEW

[Paper Review] InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

Sheng, Zhiqiang, Xumeng Han|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications|Citations: 0

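ONE-LINE SUMMARY

InEdit-Bench is the first benchmark focused on multi-step image editing and intermediate logical pathways. It evaluates 14 models on four task categories (state transition, dynamic process, temporal sequence, and scientific simulation) using a new six-dimension set of evaluation criteria.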

ABSTRACT

Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a multi-step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to enable fine-grained evaluation, we propose a set of assessment criteria to evaluate the logical coherence and visual naturalness of the generated pathways, as well as the model's fidelity to specified path constraints. Our comprehensive evaluation of 14 representative image editing models on InEdit-Bench reveals significant and widespread shortcomings in this domain. By providing a standardized and challenging benchmark, we aim for InEdit-Bench to catalyze research and steer development towards more dynamic, reason-aware, and intelligent multimodal generative models.

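MOTIVATION AND OBJECTIVES

Motivate the evaluation of image editing models on multi-step, dynamic edits rather than on the final output alone.
Capture the intermediate logical pathways needed to assess procedural reasoning and causal understanding in editing tasks.
Provide annotated data and a six-dimension evaluation protocol for benchmarking reasoning-oriented image editing models.
Highlight the gap in current models' ability to perform long-range planning and dynamic reasoning.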

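PROPOSED METHOD

Curate 237 manually annotated test cases spanning 4 task categories and 16 subtasks.
Require models to generate intermediate-pathway images (an N-grid) that reflect the multi-step evolution.
Annotate each prompt with the editing instruction and a summary of the key intermediate steps.
Adopt a six-dimension evaluation framework with three visual-quality metrics and three process-oriented metrics.
Use the LMM-as-a-Judge paradigm, with GPT-4o as the assessor, for automated evaluation.
Evaluate 14 representative models (proprietary and open-source) on InEdit-Bench.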
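
The annotation and judging steps listed above can be pictured with a small sketch. This is not the benchmark's actual schema or prompt: the `EditCase` record, the `build_judge_prompt` and `parse_judge_reply` helpers, the 0 to 100 scale, and the JSON reply format are all assumptions made for illustration, and the dimension identifiers are simply named after the six score columns of the results table. No judge model is called here; the sketch only shows how an instruction plus key-step summary could be turned into a scoring request and how a reply could be validated.

```python
import json
from dataclasses import dataclass, field

# Identifiers are illustrative; they mirror the six score columns of the table.
DIMENSIONS = [
    "appearance_consistency", "perceptual_quality", "semantic_consistency",
    "logical_coherence", "scientific_plausibility", "process_plausibility",
]

@dataclass
class EditCase:  # hypothetical record layout, not the benchmark's schema
    category: str                  # e.g. "state transition"
    instruction: str               # the editing instruction
    key_steps: list = field(default_factory=list)  # summary of key intermediate steps

def build_judge_prompt(case: EditCase) -> str:
    """Assemble a text prompt asking a judge model for one score per dimension."""
    steps = "\n".join(f"{i}. {step}" for i, step in enumerate(case.key_steps, 1))
    return (
        f"Task category: {case.category}\n"
        f"Editing instruction: {case.instruction}\n"
        f"Expected key intermediate steps:\n{steps}\n"
        "Rate the generated image grid from 0 to 100 on each dimension and "
        f"reply with a JSON object with exactly these keys: {', '.join(DIMENSIONS)}."
    )

def parse_judge_reply(reply: str) -> dict:
    """Parse the judge's JSON reply and check that every dimension is present."""
    scores = json.loads(reply)
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"judge reply is missing dimensions: {missing}")
    return {d: float(scores[d]) for d in DIMENSIONS}

# Made-up example case, for illustration only.
example = EditCase(
    category="state transition",
    instruction="Show an ice cube melting into water",
    key_steps=["solid cube", "partly melted cube", "puddle of water"],
)
print(build_judge_prompt(example))
```

Validating the reply before averaging keeps a malformed judge response from silently lowering or raising a model's score.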

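EXPERIMENTAL RESULTS

Research Questions

RQ1: Can multimodal editors generate a coherent intermediate transformation pathway from the initial image to the final image?
RQ2: How well do current models preserve appearance, perceptual realism, and semantic content across multiple editing steps?
RQ3: To what extent do models exhibit logical coherence, scientific plausibility, and process plausibility in multi-step editing?
RQ4: What are the relative strengths and weaknesses of proprietary and open-source editing models on dynamic reasoning tasks?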

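Key Findings

Model	Appearance Consistency	Perceptual Quality	Semantic Consistency	Logical Coherence	Scientific Plausibility	Process Plausibility	Overall Average	Bootstrap 95% CI	Accuracy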
GPT-Image-1	92.24	92.36	72.04	71.06	71.31	88.97	81.33	[79.04, 83.61]	16.75%
Nano-Banana	86.45	92.49	62.93	60.22	73.58	75.74	75.23	[72.40, 77.96]	13.30%
Flux-Kontext-pro	64.66	89.11	33.99	30.17	43.75	47.06	51.46	[48.59, 54.45]	0.99%
Doubao-SeedEdit-3.0-i2i	44.43	69.70	22.54	22.41	34.94	25.00	36.50	[34.04, 39.10]	0.00%
Qwen-Image-Edit	62.32	82.64	27.34	28.94	44.89	51.47	49.60	[46.87, 52.43]	0.49%
Emu1	5.17	48.65	2.46	3.45	5.11	3.68	11.42	[10.36, 12.57]	0.00%
Emu2	33.17	85.30	6.16	15.15	22.44	15.44	29.61	[27.61, 31.81]	0.00%
Bagel	46.18	65.89	28.08	27.34	34.09	42.65	40.70	[37.99, 43.49]	0.00%
Bagel-Think	53.94	76.72	24.01	28.94	34.09	26.47	40.70	[37.99, 43.54]	0.99%
OmniGen	9.24	35.71	5.42	7.76	13.92	13.97	14.34	[12.55, 16.29]	0.00%
OmniGen2	42.36	78.94	21.31	24.75	29.26	30.88	37.92	[35.16, 40.78]	0.49%
Step1X-Edit(v1.0)	15.89	42.66	7.39	8.00	15.06	9.56	16.43	[14.48, 18.53]	0.00%
Step1X-Edit(v1.1)	34.61	54.56	17.00	23.89	31.82	26.47	31.39	[28.72, 34.18]	0.00%
InstructPix2Pix	33.62	74.50	4.46	13.42	13.35	0.00	23.23	[21.75, 24.68]	0.00%
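
The overall average column in the table above is consistent, to within rounding, with an unweighted mean of the six dimension scores. The short check below recomputes it for three rows copied from the table; the function name `overall_average` is an illustrative choice, and the review does not state whether the paper defines the column this way.

```python
# Six dimension scores per model, copied from the table (three rows for brevity).
SCORES = {
    "GPT-Image-1": [92.24, 92.36, 72.04, 71.06, 71.31, 88.97],
    "Nano-Banana": [86.45, 92.49, 62.93, 60.22, 73.58, 75.74],
    "Qwen-Image-Edit": [62.32, 82.64, 27.34, 28.94, 44.89, 51.47],
}
# Overall averages as reported in the same table.
REPORTED = {"GPT-Image-1": 81.33, "Nano-Banana": 75.23, "Qwen-Image-Edit": 49.60}

def overall_average(dimension_scores):
    """Unweighted mean of the per-dimension scores."""
    return sum(dimension_scores) / len(dimension_scores)

for model, scores in SCORES.items():
    computed = overall_average(scores)
    print(f"{model}: computed {computed:.3f}, reported {REPORTED[model]:.2f}")
```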

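The proprietary GPT-Image-1 achieves the highest overall average (81.33), with strong appearance consistency (92.24) and the best semantic consistency of any model (72.04), but its accuracy is limited (16.75%).
Open-source models have lower overall scores but show notable strengths in specific dimensions (e.g., Qwen-Image-Edit posts the third-highest scientific plausibility, 44.89, and process plausibility, 51.47, in the table, although its semantic consistency is only 27.34).
Overall, most models struggle to capture long-range dependencies and to perform multi-step causal reasoning, and accuracy is low across the board: 8 of the 14 models score 0.00%, and no model exceeds 16.75%.
Process plausibility and logical coherence emerge as difficult dimensions, with GPT-Image-1 leading on both (88.97 and 71.06, respectively).
State-transition tasks are especially hard for most models, suggesting that task complexity increases in layers as the required reasoning shifts from continuous to discrete.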
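
The table reports a bootstrap 95% confidence interval next to each overall average. The review does not describe the exact resampling procedure, so the following is only a minimal sketch of one common choice, the percentile bootstrap over per-case scores. The function name `bootstrap_mean_ci`, the number of resamples, and the demo scores are all assumptions; the demo values are randomly generated and are not data from the paper.

```python
import random

def bootstrap_mean_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample the cases with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper

# Made-up per-case scores for illustration only; not data from the paper.
demo_rng = random.Random(42)
demo_scores = [demo_rng.uniform(0, 100) for _ in range(200)]
demo_mean = sum(demo_scores) / len(demo_scores)
low, high = bootstrap_mean_ci(demo_scores)
print(f"mean={demo_mean:.2f}, 95% CI=[{low:.2f}, {high:.2f}]")
```

Fixing the seed makes the interval reproducible. With a few hundred test cases, an interval a few points wide, like those in the table, is what this kind of resampling would typically produce.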

This review was generated by AI and checked by a human editor.