QUICK REVIEW

[Paper Review] Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs

Xinzhong Wang, Ya Guo|arXiv (Cornell University)|Jan 27, 2026
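Advanced Text Analysis Techniques | Cited by 0
One-Sentence Summary

The paper proposes PIP, a mask-based parallel inference paradigm for key information extraction from VrDs. It replaces each target value with [mask] placeholders so that all output tokens are generated simultaneously, yielding a 5–36× inference speedup with negligible accuracy loss.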

ABSTRACT

Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.
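The reformulation described in the abstract can be illustrated with a minimal sketch. It only shows the data layout: a template whose values are "[mask]" placeholders, and a fill step that writes one predicted token into each slot. The "key: value" line format, the fixed number of slots per field, and the names `build_masked_template`, `fill_template`, and `slots_per_field` are assumptions made for illustration; this summary does not specify the paper's actual prompt format or how the number of mask tokens is chosen.

```python
from typing import List

MASK = "[mask]"


def build_masked_template(fields: List[str], slots_per_field: int) -> str:
    """Build a key-value template whose values are [mask] placeholders.

    Assumption: every field gets the same fixed number of slots.
    """
    lines = []
    for name in fields:
        placeholders = " ".join([MASK] * slots_per_field)
        lines.append(f"{name}: {placeholders}")
    return "\n".join(lines)


def fill_template(template: str, predictions: List[str]) -> str:
    """Replace each [mask], in order, with the token predicted for that slot.

    In a parallel paradigm all predictions would come from one forward
    pass; here they are simply passed in as a list.
    """
    parts = template.split(MASK)
    if len(parts) - 1 != len(predictions):
        raise ValueError("need exactly one prediction per [mask] slot")
    out = [parts[0]]
    for token, tail in zip(predictions, parts[1:]):
        out.append(token)
        out.append(tail)
    return "".join(out)


if __name__ == "__main__":
    template = build_masked_template(["company", "total"], slots_per_field=2)
    print(template)
    # Hypothetical predictions, one per slot, as if produced in a single pass.
    print(fill_template(template, ["ACME", "Ltd", "12", "USD"]))
```

An autoregressive decoder would need one forward pass per generated token to produce the same four values, whereas the masked template exposes all four slots to the model at once.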

Research Motivation and Objectives

  • Identify and address the inefficiency of autoregressive inference in KIE for VrDs.
  • Introduce a mask-based parallel decoding paradigm (PIP) that enables simultaneous extraction of multiple key fields.
  • Develop a two-stage training pipeline (mask pre-training and KV supervised fine-tuning) to enable parallel decoding in MLLMs.
  • Demonstrate that PIP achieves substantial speedups (5–36×) with competitive or improved accuracy across benchmark datasets.

Proposed Method

  • Reformulate KIE by replacing target values with [mask] tokens to allow parallel decoding in a single forward pass.
  • Use bidirectional attention during mask pre-training to learn context for predictions, replacing causal attention.
  • Pre-train on a large image-caption dataset (13M images) to learn parallel inference.
  • Fine-tune on a curated KV extraction dataset with human-in-the-loop verification to reduce hallucination and enable KV supervision.
  • Visualize attention to show tokens attend to distinct image regions corresponding to output fields.
  • Evaluate across multiple base models and datasets to demonstrate speedups and accuracy.
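The difference between causal and bidirectional attention mentioned above can be made concrete with a small sketch. It builds the two visibility patterns as plain boolean matrices, where entry [i][j] is True when position i may attend to position j. The function names are hypothetical, and applying the bidirectional pattern to the whole sequence is an assumption: this summary does not say which token ranges the paper makes bidirectional.

```python
from typing import List

Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """Causal pattern: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> Mask:
    """Bidirectional pattern: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask: Mask, i: int) -> List[int]:
    """Indices that position i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


if __name__ == "__main__":
    n = 5  # e.g. two prompt tokens followed by three [mask] slots
    first_slot = 2
    print(visible_positions(causal_mask(n), first_slot))
    print(visible_positions(bidirectional_mask(n), first_slot))
```

Under the causal pattern the first [mask] slot cannot see the slots after it, which suits left-to-right generation. Under the bidirectional pattern every slot sees the full context, including the other placeholders, which is what predicting all slots in one pass relies on.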

Experimental Results

Research Questions

  • RQ1: Can masking target outputs and decoding in parallel reduce inference latency for KIE without sacrificing accuracy?
  • RQ2: How does mask pre-training plus KV supervised fine-tuning enable effective parallel decoding in MLLMs for VrD KIE?
  • RQ3: What speedups and accuracy trade-offs can be achieved across standard KIE benchmarks (FUNSD, SROIE, CORD, POIE, WildReceipt)?
  • RQ4: Is the PIP paradigm robust across different base model architectures and scales?

Key Findings

  • PIP achieves 5–36× inference speedup with negligible performance degradation compared to autoregressive baselines.
  • PIP improves on the state of the art on SROIE and CORD by substantial margins when combined with KV supervised fine-tuning (e.g., PIP-Qwen2-VL-7B achieves ANLS 97.0 on SROIE and 97.3 on CORD).
  • Mask pre-training with bidirectional attention enables effective parallel decoding for KIE in MLLMs.
  • The approach maintains competitive accuracy across FUNSD, SROIE, CORD, POIE, and WildReceipt while reducing latency dramatically.
  • Memory overhead is modest (up to ~30% increase in input length) with substantial gains in throughput.
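The findings above report accuracy as ANLS (Average Normalized Levenshtein Similarity). As a reference for reading those numbers, here is a minimal sketch of the metric as it is commonly defined: each prediction scores 1 minus its normalized edit distance to the reference, or 0 when that distance reaches a threshold, and the scores are averaged. The 0.5 threshold, the lowercasing and whitespace stripping, and the single-reference form are assumptions based on common practice, not details taken from the paper's evaluation script.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions: List[str], references: List[str],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity in [0, 1].

    Assumptions: one reference per prediction, case-insensitive matching,
    and a cutoff of 0.5 on the normalized distance.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, ref in zip(predictions, references):
        p, r = pred.strip().lower(), ref.strip().lower()
        longest = max(len(p), len(r))
        nl = levenshtein(p, r) / longest if longest else 0.0
        total += (1.0 - nl) if nl < threshold else 0.0
    return total / len(predictions)


if __name__ == "__main__":
    # One near-miss and one complete miss.
    print(anls(["abcd", "xyz"], ["abcx", "abc"]))
```

Scores such as 97.0 correspond to this value multiplied by 100.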

This review was generated by AI and reviewed by human editors.