QUICK REVIEW

[논문 리뷰] MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

Gengluo Li, Chengquan Zhang|arXiv (Cornell University)|2026. 03. 25.

Multimodal Machine Learning Applications인용 수 0

한 줄 요약

MMTIT-Bench는 다국어, 다중 시나리오 엔드-투-엔드 TIMT 벤치마크(14개 언어의 1,400장 이미지)와 CPR-Trans를 도입하여 VLLMs의 번역 정확도와 해석성을 3B 및 7B 모델에서 향상시킨다.

ABSTRACT

End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote the multilingual and multi-scenario TIMT research upon acceptance.

연구 동기 및 목표

다국어 및 다도메인 TIMT 벤치마크의 포괄적 부재를 해결한다.
다양한 언어와 현실 시각 장면에서 엔드-투-엔드 TIMT를 평가한다.
TIMT를 안내하기 위한 인지–지각–추론 데이터 패러다임을 제안한다.
다양한 모델 규모에서 CPR-Trans의 이득을 입증하고 강력한 TIMT 평가에 적합한 데이터셋을 제공한다.

제안 방법

14개 언어 및 다양한 맥락(문서, 메뉴, 포스터, 책, 상품, 장면)을 포괄하는 1,400장의 이미지를 활용한 인간 검증 다국어 다중 시나리오 TIMT 벤치마크(MMTIT-Bench)를 생성한다.
고품질 이중언어 참조(중국어 및 영어)를 산출하기 위한 OCR 기반의 MLLM 보조 텍스트 파싱과 VLLM 주도 다중 모델 투표 번역 파이프라인의 2단계 주석 파이프라인을 사용한다.
장면 인지, 텍스트 지각, 번역 추론을 하나의 다중모달 감독 시퀀스로 융합하는 CPR-Trans 데이터 패러다임을 도입한다.
학습용으로 해석 가능한 추론 감독을 가능하게 하는 구조화된 <think> 및 <answer> 흔적을 생성하는 VLLM 주도 데이터 생성 파이프라인을 활용한다.
데이터 패러다임(Direct Translation, Simple CoT, Thinking Distillation, CPR-Trans)을 비교하고 3B 및 7B 모델에서 TIMT에 미치는 영향을 평가한다.
VLLM 판단자(Gemini 2.5 Flash 및 Qwen3-VL-235B-A22B-Instruct)와 규칙 기반 COMET 지표로 번역을 평가한다.

실험 결과

연구 질문

RQ1다국어, 다중 시나리오 TIMT 벤치마크가 현재 VLLM의 언어 간 및 시각적 맥락에서의 강건성 차이를 드러낼 수 있는가?
RQ2TIMT에 특화된 추론 지향 데이터 패러다임(CPR-Trans)이 기존 CoT 및 OCR 중심 접근법을 넘어 번역 정확도와 해석성을 개선하는가?
RQ3CPR-Trans의 이점은 모델 크기(3B 대 7B)에 따라 어떻게 확장되며 증류 기반 혹은 직접 번역 패러다임과 비교하면 어떤가?
RQ4학습-무료 다회전 CPR-Trans 추론이 TIMT에 유익한가?

주요 결과

MMTIT-Bench는 1,400개의 전문가 검증 샘플로 14개 언어 및 다양한 시각적 상황에서 엔드-투-엔드 TIMT를 견고하게 평가할 수 있게 한다.
CPR-Trans는 모델 규모에 관계없이 기준 패러다임에 비해 번역 향상을 가져오며, 인지, 지각, 추론이 상보적으로 기여함을 탐색적 분석에서 보여준다.
CPR-Trans는 평균적으로 11.2( Gemini 2.5-Flash ) 및 8.2(Qwen3-VL) 만큼 기준 대비 이득을 얻는다.
Thinking 기반 데이터 증류는 노이즈가 많은 추론 흔적으로 인해 CPR-Trans보다 효과가 낮으며, CPR-Trans는 구조화되고 인지적으로 근거 있는 감독을 제공한다.
학습-무료 다회 CPR-Trans 추론은 번역 품질을 향상시키며 이 패러다임이 자연스러운 TIMT 추론 프로세스와 정렬됨을 시사한다.
OCR–번역 계단식 비교 대비, 엔드-투-엔드 VLLMs가 CPR-Trans를 사용할 때 시각적으로 복잡한 장면 및 비디지털 원문에 대해 더 강한 강건성을 보인다.

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.