QUICK REVIEW

[Paper Review] InEdit-Bench: Benchmarking Intermediate Logical Pathways for Intelligent Image Editing Models

Sheng, Zhiqiang, Xumeng Han|arXiv (Cornell University)|2026. 03. 04.

Multimodal Machine Learning Applications | Citations: 0

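One-Line Summary

InEdit-Bench is the first benchmark focused on multi-step image editing and intermediate logical pathways. It evaluates 14 models across four task categories (state transition, dynamic process, temporal sequence, and scientific simulation) using six new assessment criteria.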

ABSTRACT

Multimodal generative models have made significant strides in image editing, demonstrating impressive performance on a variety of static tasks. However, their proficiency typically does not extend to complex scenarios requiring dynamic reasoning, leaving them ill-equipped to model the coherent, intermediate logical pathways that constitute a multi-step evolution from an initial state to a final one. This capacity is crucial for unlocking a deeper level of procedural and causal understanding in visual manipulation. To systematically measure this critical limitation, we introduce InEdit-Bench, the first evaluation benchmark dedicated to reasoning over intermediate pathways in image editing. InEdit-Bench comprises meticulously annotated test cases covering four fundamental task categories: state transition, dynamic process, temporal sequence, and scientific simulation. Additionally, to enable fine-grained evaluation, we propose a set of assessment criteria to evaluate the logical coherence and visual naturalness of the generated pathways, as well as the model's fidelity to specified path constraints. Our comprehensive evaluation of 14 representative image editing models on InEdit-Bench reveals significant and widespread shortcomings in this domain. By providing a standardized and challenging benchmark, we aim for InEdit-Bench to catalyze research and steer development towards more dynamic, reason-aware, and intelligent multimodal generative models.

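Motivation and Objectives

Promote evaluation of models beyond the final output in multi-step, dynamic image editing.
Capture intermediate logical pathways in order to assess procedural reasoning and causal understanding in editing tasks.
Provide annotated data and a six-dimension evaluation protocol for benchmarking reasoning-oriented image editing models.
Highlight the gap in current models' ability to perform long-horizon planning and dynamic reasoning.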

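Proposed Method

Curate 237 hand-annotated test instances spanning four task categories and 16 subtasks.
Require models to generate intermediate pathway images (N grid) that reflect the multi-step evolution.
Annotate each prompt with the editing instruction and a concise summary of the key intermediate steps.
Adopt a six-dimension evaluation framework with three visual quality metrics and three process-oriented metrics.
Use GPT-4o as the evaluator under the LMM-as-a-judge paradigm for automatic scoring.
Evaluate 14 representative models (proprietary and open-source) on InEdit-Bench.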
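
The annotation and scoring scheme listed above can be pictured with a small data model. This is a minimal sketch, not the benchmark's actual schema: the class names `EditCase` and `DimensionScores`, every field name, and the example strings are hypothetical. The split into visual and process-oriented dimensions is assumed from the column order of the results table, and the unweighted mean is an assumption that happens to reproduce the reported Overall Average for GPT-Image-1.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Dict, List

# Assumed grouping of the six dimensions (taken from the results table columns).
VISUAL_DIMS = ("appearance_consistency", "perceptual_quality", "semantic_consistency")
PROCESS_DIMS = ("logical_coherence", "scientific_plausibility", "process_plausibility")
ALL_DIMS = VISUAL_DIMS + PROCESS_DIMS


@dataclass
class EditCase:
    """One annotated test case (hypothetical field names)."""
    category: str                 # e.g. "state transition"
    subtask: str                  # one of the 16 subtasks
    instruction: str              # the editing instruction given to the model
    step_summaries: List[str] = field(default_factory=list)  # key intermediate steps


@dataclass
class DimensionScores:
    """Judge scores for one model, one value in [0, 100] per dimension."""
    scores: Dict[str, float]

    def overall(self) -> float:
        # Assumption: the overall score is the unweighted mean of the six dimensions.
        missing = [d for d in ALL_DIMS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        return mean(self.scores[d] for d in ALL_DIMS)


# Dimension scores for GPT-Image-1, copied from the results table.
gpt_image_1 = DimensionScores({
    "appearance_consistency": 92.24,
    "perceptual_quality": 92.36,
    "semantic_consistency": 72.04,
    "logical_coherence": 71.06,
    "scientific_plausibility": 71.31,
    "process_plausibility": 88.97,
})
print(round(gpt_image_1.overall(), 2))
```

Keeping the six dimensions in a plain dictionary makes it easy to report the visual and process-oriented groups separately as well as together.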

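Experimental Results

Research Questions

RQ1: Can multimodal editors generate a coherent intermediate transformation path from the initial image to the final image?
RQ2: How well do models preserve appearance, perceptual realism, and semantic content across editing steps?
RQ3: To what extent do models exhibit logical coherence, scientific plausibility, and process plausibility in multi-step editing?
RQ4: What are the relative strengths and weaknesses of proprietary versus open-source editing models on dynamic reasoning tasks?

Key Results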

Models	Appearance Consistency	Perceptual Quality	Semantic Consistency	Logical Coherence	Scientific Plausibility	Process Plausibility	Overall Average	Bootstrap 95% CI	Accuracy
GPT-Image-1	92.24	92.36	72.04	71.06	71.31	88.97	81.33	[79.04, 83.61]	16.75%
Nano-Banana	86.45	92.49	62.93	60.22	73.58	75.74	75.23	[72.40, 77.96]	13.30%
Flux-Kontext-pro	64.66	89.11	33.99	30.17	43.75	47.06	51.46	[48.59, 54.45]	0.99%
Doubao-SeedEdit-3.0-i2i	44.43	69.70	22.54	22.41	34.94	25.00	36.50	[34.04, 39.10]	0.00%
Qwen-Image-Edit	62.32	82.64	27.34	28.94	44.89	51.47	49.60	[46.87, 52.43]	0.49%
Emu1	5.17	48.65	2.46	3.45	5.11	3.68	11.42	[10.36, 12.57]	0.00%
Emu2	33.17	85.30	6.16	15.15	22.44	15.44	29.61	[27.61, 31.81]	0.00%
Bagel	46.18	65.89	28.08	27.34	34.09	42.65	40.70	[37.99, 43.49]	0.00%
Bagel-Think	53.94	76.72	24.01	28.94	34.09	26.47	40.70	[37.99, 43.54]	0.99%
OmniGen	9.24	35.71	5.42	7.76	13.92	13.97	14.34	[12.55, 16.29]	0.00%
OmniGen2	42.36	78.94	21.31	24.75	29.26	30.88	37.92	[35.16, 40.78]	0.49%
Step1X-Edit(v1.0)	15.89	42.66	7.39	8.00	15.06	9.56	16.43	[14.48, 18.53]	0.00%
Step1X-Edit(v1.1)	34.61	54.56	17.00	23.89	31.82	26.47	31.39	[28.72, 34.18]	0.00%
InstructPix2Pix	33.62	74.50	4.46	13.42	13.35	0.00	23.23	[21.75, 24.68]	0.00%
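
The table reports a bootstrap 95% confidence interval for each overall average, but this review does not describe how it was computed. One common choice is the percentile bootstrap over per-instance scores, sketched below under that assumption. The function name `bootstrap_ci`, its parameters, and the toy score list are hypothetical and are not taken from the paper.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of `values`."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means: List[float] = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    # Read the lower and upper percentiles off the sorted resample means.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Toy per-instance scores (synthetic, for illustration only).
toy_scores = [0.0, 25.0, 50.0, 75.0, 100.0] * 40
low, high = bootstrap_ci(toy_scores)
print(low, high)
```

Under this reading, an interval a few points wide around each overall average would reflect sampling variability over the test instances, not judge noise.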

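The proprietary GPT-Image-1 has the highest overall average (81.33), with the best appearance consistency (92.24) and semantic consistency (72.04) in the table, yet its explicit accuracy is limited (16.75%).
Open-source models have lower aggregate scores but show notable strengths in specific dimensions (for example, Qwen-Image-Edit reaches 44.89 on scientific plausibility and 51.47 on process plausibility, the highest values outside GPT-Image-1 and Nano-Banana).
Most models struggle to capture long-range dependencies and to perform multi-step causal reasoning, and accuracy is low across the board: 8 of the 14 models score 0.00%, and only GPT-Image-1 and Nano-Banana exceed 1%.
Process plausibility and logical coherence emerge as challenging dimensions, with GPT-Image-1 leading on both (88.97 and 71.06, respectively).
State transition tasks are especially hard for most models, which suggests that task difficulty increases hierarchically from continuous reasoning to discrete reasoning.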
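
The accuracy pattern described above can be checked directly against the Accuracy column. The snippet below is a small sketch that only tallies values copied from the results table. The variable names are illustrative.

```python
# Accuracy (%) per model, copied from the results table.
accuracy = {
    "GPT-Image-1": 16.75,
    "Nano-Banana": 13.30,
    "Flux-Kontext-pro": 0.99,
    "Doubao-SeedEdit-3.0-i2i": 0.00,
    "Qwen-Image-Edit": 0.49,
    "Emu1": 0.00,
    "Emu2": 0.00,
    "Bagel": 0.00,
    "Bagel-Think": 0.99,
    "OmniGen": 0.00,
    "OmniGen2": 0.49,
    "Step1X-Edit(v1.0)": 0.00,
    "Step1X-Edit(v1.1)": 0.00,
    "InstructPix2Pix": 0.00,
}

# Models that never produce a fully correct pathway.
zero_accuracy = sorted(m for m, a in accuracy.items() if a == 0.0)
# Models that clear the 1% mark.
above_one_percent = sorted(m for m, a in accuracy.items() if a > 1.0)

print(len(zero_accuracy), "of", len(accuracy), "models score 0.00%")
print("above 1%:", above_one_percent)
```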

This review was generated by AI and reviewed by a human editor.