QUICK REVIEW

[論文レビュー] MMTIT-Bench: A Multilingual and Multi-Scenario Benchmark with Cognition-Perception-Reasoning Guided Text-Image Machine Translation

Gengluo Li, Chengquan Zhang|arXiv (Cornell University)|Mar 25, 2026

Multimodal Machine Learning Applications被引用数 0

ひとこと要約

MMTIT-Bench は複数言語間のエンドツーエンド TIMT ベンチマークを提供（1,400 枚の画像、14 言語）、CPR-Trans という認知–知覚–推論データパラダイムを導入し、3B および 7B モデルにおける VLLMs の翻訳精度と解釈性を向上させる。

ABSTRACT

End-to-end text-image machine translation (TIMT), which directly translates textual content in images across languages, is crucial for real-world multilingual scene understanding. Despite advances in vision-language large models (VLLMs), robustness across diverse visual scenes and low-resource languages remains underexplored due to limited evaluation resources. We present MMTIT-Bench, a human-verified multilingual and multi-scenario benchmark with 1,400 images spanning fourteen non-English and non-Chinese languages and diverse settings such as documents, scenes, and web images, enabling rigorous assessment of end-to-end TIMT. Beyond benchmarking, we study how reasoning-oriented data design improves translation. Although recent VLLMs have begun to incorporate long Chain-of-Thought (CoT) reasoning, effective thinking paradigms for TIMT are still immature: existing designs either cascade parsing and translation in a sequential manner or focus on language-only reasoning, overlooking the visual cognition central to VLLMs. We propose Cognition-Perception-Reasoning for Translation (CPR-Trans), a data paradigm that integrates scene cognition, text perception, and translation reasoning within a unified reasoning process. Using a VLLM-driven data generation pipeline, CPR-Trans provides structured, interpretable supervision that aligns perception with reasoning. Experiments on 3B and 7B models show consistent gains in accuracy and interpretability. We will release MMTIT-Bench to promote the multilingual and multi-scenario TIMT research upon acceptance.

研究の動機と目的

包括的な多言語・多領域 TIMT ベンチマークの欠如に対処する。
多様な言語と実世界の視覚場面でエンドツーエンド TIMT を評価する。
VLLMs を TIMT へと導く認知–知覚–推論データパラダイムを提案する。
複数のモデル規模で CPR-Trans の利得を示し、堅牢な TIMT 評価に適したデータセットを提供する。

提案手法

14 言語と多様な文脈（文書、メニュー、ポスター、書籍、製品、場面）を含む 1,400 枚の画像で人間検証済みの多言語・多シナリオ TIMT ベンチマーク（MMTIT-Bench）を作成する。
MLLM 支援のテキスト解析（OCR）と VLLM 駆動の多モデル投票翻訳パイプラインを用いて高品質なバイリンガルリファレンス（中国語と英語）を作成する、2 段階の注釈パイプラインを使用する。
CPR-Trans データパラダイムを導入し、場景認知、テキスト知覚、翻訳推論を統一された多モーダル監視シーケンスに統合する。
訓練用の解釈可能な推論監視を可能にするよう、<think> および <answer> 追跡を構造化して生成する VLLM 駆動のデータ生成パイプラインを採用する。
データパラダイム（Direct Translation、Simple CoT、Thinking Distillation、CPR-Trans）を比較し、3B および 7B モデルで TIMT への影響を評価する。
VLLM ジャッジ（Gemini 2.5 Flash および Qwen3-VL-235B-A22B-Instruct）とルールベースの COMET 指標の双方で翻訳を評価する。

実験結果

リサーチクエスチョン

RQ1多言語・多シナリオの TIMT ベンチマークは、現在の VLLMs が言語と視覚文脈の多様性におけるロバストネスのギャップを明らかにできるか。
RQ2TIMT に特化した推論志向のデータパラダイム（CPR-Trans）は、従来の CoT や OCR 重視アプローチを超えて翻訳精度と解釈性を改善するか。
RQ3CPR-Trans の利得はモデルサイズ（3B 対 7B）でどのようにスケールし、蒸留ベースまたは直接翻訳パラダイムと比較してどうか。
RQ4訓練不要の複数ターン CPR-Trans 推論は TIMT に有益か。

主な発見

MMTIT-Bench は 14 語言と多様な視覚シナリオを跨いでエンドツーエンド TIMT を堅牢に評価できる、1,400 の専門家検証サンプルを提供する。
CPR-Trans はベースラインパラダイムに対してモデル規模を問わず翻訳を大幅に改善し、アブレーションにより認知・知覚・推論が補完的に寄与することが示される。
CPR-Trans はベースライン比較に対して平均 11.2（Gemini 2.5-Flash）および 8.2（Qwen3-VL）の利得を達成する。
Thinking ベースのデータ蒸留はノイズの多い/不安定な推論追跡のため CPR-Trans より劣る；CPR-Trans は構造化され、認知的に基づく監督を提供する。
訓練不要のマルチターン CPR-Trans 推論は翻訳品質を改善し、このパラダイムが自然な TIMT 推論プロセスと整合していることを示唆する。
従来の OCR–翻訳のカスケードベースと比較して、 CPR-Trans を用いたエンドツーエンドの VLLMs は視覚的に複雑な場面や非デジタル生成テキストに対してより堅牢である。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。