QUICK REVIEW

[Paper Digest] ZeroSense: How Vision Matters in Long Context Compression

Yonghan Gao, Zehong Chen | arXiv (Cornell University) | Mar 12, 2026

Generative Adversarial Networks and Image Synthesis | Cited by 0

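ONE-SENTENCE SUMMARY

This study introduces a decoupled evaluation framework and the ZeroSense benchmark to separate visual-text compression quality from linguistic priors, and shows that visual-text compression quality can diverge from downstream task accuracy.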

ABSTRACT

Recent visual-text compression (VTC) methods, typified by DeepSeek-OCR, report impressive high token compression ratios for long-context modeling tasks by leveraging text-to-image rendering. However, existing evaluation protocols heavily rely on downstream task performance. Such evaluation metrics fail to accurately measure text preservation due to the strong inherent linguistic priors of Multimodal Large Language Models (MLLMs). In this work, we introduce a new evaluation framework that decouples MLLMs' capabilities to faithfully assess VTC quality. Within this framework, we further introduce the ZeroSense Benchmark to ensure low semantic correlation of testing samples. By eliminating contextual dependencies, our benchmark guarantees that the evaluation results are purely reflective of VTC quality, unaffected by the semantic inference capabilities of downstream models. Extensive experiments across multiple datasets demonstrate that VTC quality and downstream task accuracy diverge significantly, highlighting the necessity of our decoupled evaluation framework.

RESEARCH MOTIVATION AND GOALS

Motivate evaluation of visual-text compression (VTC) independent of downstream linguistic priors.
Define a formal, model-agnostic framework to measure text preservation under VTC.
Introduce the ZeroSense Benchmark to create a semantic vacuum for unbiased evaluation.
Quantify the gap between VTC text preservation and downstream task performance across datasets.

PROPOSED METHOD

Formalize VTC evaluation with compression ratio rho(theta) and an objective F(O|I, V_theta).
Propose a decoupled OCR framework to separate prior reasoning, raw OCR, and preserved text (Equation 5).
Introduce a text preservation metric K_quality derived from F(C|I,V_theta) and OCR_raw.
Construct the ZeroSense Benchmark to eliminate semantic correlations (Equation 7).
Provide calibration and baseline strategies to estimate OCR_raw and F_prior using ZeroSense and reference samples.
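
To make the idea of a "semantic vacuum" concrete, here is a minimal sketch in Python. It is not the paper's construction or its metric: the digest does not reproduce Equations 5 or 7, so every name below (`compression_ratio`, `make_low_prior_sample`, `word_accuracy`) is hypothetical. The sketch assumes that the compression ratio is the number of text tokens divided by the number of vision tokens, that a low-prior sample can be approximated by random letter strings, and that preservation can be scored by positional exact match.

```python
import random
import string


def compression_ratio(num_text_tokens: int, num_vision_tokens: int) -> float:
    """Assumed definition: text tokens represented per vision token."""
    if num_vision_tokens <= 0:
        raise ValueError("num_vision_tokens must be positive")
    return num_text_tokens / num_vision_tokens


def make_low_prior_sample(num_words: int, word_len: int = 6, seed: int = 0) -> str:
    """Build space-separated random lowercase strings with no semantic structure.

    Hypothetical stand-in for a ZeroSense-style sample: a language model
    cannot guess the next string from context, so any correct output must
    come from what is visible in the rendered image.
    """
    rng = random.Random(seed)
    words = [
        "".join(rng.choice(string.ascii_lowercase) for _ in range(word_len))
        for _ in range(num_words)
    ]
    return " ".join(words)


def word_accuracy(reference: str, prediction: str) -> float:
    """Fraction of reference words reproduced exactly at the same position."""
    ref_words = reference.split()
    pred_words = prediction.split()
    if not ref_words:
        return 0.0
    hits = sum(1 for r, p in zip(ref_words, pred_words) if r == p)
    return hits / len(ref_words)


# Example: simulate a decoder that misreads the first of eight words.
sample = make_low_prior_sample(num_words=8)
corrupted_words = sample.split()
corrupted_words[0] = "#" * 6  # cannot match a lowercase word
corrupted = " ".join(corrupted_words)

print(compression_ratio(1500, 200))      # 7.5
print(word_accuracy(sample, corrupted))  # 0.875
```

Because the strings carry no meaning, a score on such a sample reflects only what was recovered from the rendered image. That is the property the benchmark is described as targeting.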

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How well does visual-text compression preserve text independent of semantic priors?
RQ2: To what extent do downstream tasks reflect VTC quality versus semantic inference capabilities?
RQ3: Can we quantify the text-preserving ability of a VTC method across datasets with decoupled evaluation?
RQ4: What is the impact of compression ratio on raw OCR capability and prior-based guidance?

KEY FINDINGS

VTC quality and downstream task accuracy diverge significantly across datasets and compression ratios.
On Omni, the decoupled framework yields high text preservation (e.g., 97.1% at 7.5×) while end-to-end accuracy is 89.2%; on Fox, the decoupled metric shows larger gaps at high compression.
F_prior grows with compression (23.8% at 7.5× to 67% at 17.5× on Fox; 31.7%–45.3% on Omni), indicating reliance on semantic priors increases as visual quality degrades.
OCR_raw decays with compression (Omni: 39.5%→17.4%, Fox: 76.1%→46% from 7.5× to 17.5×).
ZeroSense yields a semantic vacuum where inserted tokens have extremely low predictability (probabilities 10^-6 to 10^-7), supporting isolated visual evaluation.
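
The figures quoted above can be turned into percentage-point changes with a few lines of Python. This is simple arithmetic on the numbers as reported in this digest, not a reproduction of the paper's tables. One assumption is made: the Omni F_prior range (31.7% to 45.3%) is taken to correspond to the 7.5× and 17.5× settings, which the digest does not state explicitly.

```python
# Figures quoted in the findings, in percent: (value at 7.5x, value at 17.5x).
# Assumption: the Omni F_prior endpoints map to 7.5x and 17.5x.
reported = {
    ("Fox", "F_prior"): (23.8, 67.0),
    ("Omni", "F_prior"): (31.7, 45.3),
    ("Fox", "OCR_raw"): (76.1, 46.0),
    ("Omni", "OCR_raw"): (39.5, 17.4),
}


def point_change(low_ratio_value: float, high_ratio_value: float) -> float:
    """Percentage-point change when moving from 7.5x to 17.5x compression."""
    return round(high_ratio_value - low_ratio_value, 1)


changes = {key: point_change(*values) for key, values in reported.items()}

# Gap on Omni at 7.5x between decoupled text preservation and end-to-end accuracy.
omni_gap_at_7_5x = round(97.1 - 89.2, 1)

for (dataset, metric), delta in changes.items():
    print(f"{dataset:<4} {metric:<8} {delta:+.1f} points")
print(f"Omni preservation vs. end-to-end gap at 7.5x: {omni_gap_at_7_5x} points")
```

On these numbers, F_prior rises by 43.2 points on Fox while OCR_raw falls by 30.1 points, which matches the stated trend: reliance on semantic priors grows as raw visual recovery degrades.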

This digest was generated by AI and reviewed by human editors.