QUICK REVIEW

[Paper Review] Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs

Xinzhong Wang, Ya Guo|arXiv (Cornell University)|Jan 27, 2026

Advanced Text Analysis Techniques | Cited by 0

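One-Sentence Summary

The paper proposes PIP, a mask-based parallel inference paradigm for visually-rich documents (VrDs) that replaces target values with [mask] tokens so that all of them are generated simultaneously, achieving a 5–36× inference speedup with negligible accuracy loss.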

ABSTRACT

Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.

Motivation and Objectives

Address the inefficiency of autoregressive inference in KIE for VrDs.
Introduce a mask-based parallel decoding paradigm (PIP) that enables simultaneous extraction of multiple key fields.
Develop a two-stage training pipeline (mask pre-training and KV supervised fine-tuning) to enable parallel decoding in MLLMs.
Demonstrate that PIP achieves substantial speedups (5–36×) with competitive or improved accuracy across benchmark datasets.

Proposed Method

Reformulate KIE by replacing target values with [mask] tokens to allow parallel decoding in a single forward pass.
Replace causal attention with bidirectional attention during mask pre-training so that the model learns to use the surrounding context for its predictions.
Pre-train on a large image-caption dataset (13M images) to learn parallel inference.
Fine-tune on a curated KV extraction dataset with human-in-the-loop verification to reduce hallucination and enable KV supervision.
Visualize attention to show that tokens attend to distinct image regions corresponding to their output fields.
Evaluate across multiple base models and datasets to demonstrate speedups and accuracy.
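
The masking idea in the list above can be illustrated with a toy sketch. It is not the authors' implementation: the prompt layout, the fixed number of [mask] slots per field, the "[pad]" filler, and every function name here are assumptions made for illustration, and a lookup table stands in for the MLLM. It only shows the control flow: all [mask] positions are filled from one prediction call instead of one call per token, and bidirectional attention lets a position see tokens on both sides.

```python
from typing import Callable, Dict, List

MASK = "[mask]"  # placeholder token named in the paper summary


def build_masked_prompt(fields: List[str], slots_per_field: int) -> List[str]:
    """Lay out 'field : [mask] ... [mask] ;' for each requested field.

    The layout and the fixed slot count are hypothetical choices.
    """
    tokens: List[str] = []
    for name in fields:
        tokens += [name, ":"] + [MASK] * slots_per_field + [";"]
    return tokens


def attention_allowed(query: int, key: int, bidirectional: bool) -> bool:
    """Causal attention sees only positions <= query; bidirectional sees all."""
    return bidirectional or key <= query


def fill_masks(
    tokens: List[str],
    predict_all: Callable[[List[str], List[int]], List[str]],
) -> List[str]:
    """Fill every [mask] using a single call to predict_all (one 'pass')."""
    mask_positions = [i for i, tok in enumerate(tokens) if tok == MASK]
    predictions = predict_all(tokens, mask_positions)  # the only model call
    filled = list(tokens)
    for pos, tok in zip(mask_positions, predictions):
        filled[pos] = tok
    return filled


# Stand-in for a model: a hypothetical lookup table of "extracted" values.
TOY_ANSWERS: Dict[str, List[str]] = {"company": ["ACME", "Ltd"], "total": ["12.50"]}


def toy_predict_all(tokens: List[str], mask_positions: List[int]) -> List[str]:
    """Predict each slot independently from its field name and slot index."""
    out: List[str] = []
    for pos in mask_positions:
        start = pos
        while tokens[start] != ":":  # walk left to the ':' opening this field
            start -= 1
        field = tokens[start - 1]
        slot = pos - start - 1
        answer = TOY_ANSWERS.get(field, [])
        out.append(answer[slot] if slot < len(answer) else "[pad]")
    return out


if __name__ == "__main__":
    prompt = build_masked_prompt(["company", "total"], slots_per_field=2)
    print(fill_masks(prompt, toy_predict_all))
```

In this sketch, unused slots are filled with "[pad]"; how the actual method decides the length of each value is not stated in this summary.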

Experimental Results

Research Questions

RQ1: Can masking target outputs and decoding in parallel reduce inference latency for KIE without sacrificing accuracy?
RQ2: How does mask pre-training plus KV supervised fine-tuning enable effective parallel decoding in MLLMs for VrD KIE?
RQ3: What speedups and accuracy trade-offs can be achieved across standard KIE benchmarks (FUNSD, SROIE, CORD, POIE, WildReceipt)?
RQ4: Is the PIP paradigm robust across different base model architectures and scales?

Key Findings

PIP achieves 5–36× inference speedup with negligible performance degradation compared to autoregressive baselines.
PIP improves on the state of the art on SROIE and CORD by substantial margins when combined with KV supervised fine-tuning (e.g., PIP-Qwen2-VL-7B achieves ANLS 97.0 on SROIE and 97.3 on CORD).
Mask pre-training with bidirectional attention enables effective parallel decoding for KIE in MLLMs.
The approach maintains competitive accuracy across FUNSD, SROIE, CORD, POIE, and WildReceipt while reducing latency dramatically.
Memory overhead is modest (up to ~30% increase in input length) with substantial gains in throughput.
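
The trade-off in these findings (fewer decoding steps in exchange for a longer input) can be made concrete with a back-of-the-envelope step count. This is a simplified cost model, not the paper's measurement: the field names, token counts, and prompt length below are hypothetical, and the function names are illustrative. Wall-clock speedup will differ from the ratio of pass counts, because a pass over a long prompt does not cost the same as a single cached decoding step.

```python
from typing import Dict


def autoregressive_passes(value_lengths: Dict[str, int]) -> int:
    """One forward pass per generated value token (separators ignored)."""
    return sum(value_lengths.values())


def parallel_passes(value_lengths: Dict[str, int]) -> int:
    """All [mask] slots are predicted together, so one pass if any field exists."""
    return 1 if value_lengths else 0


def input_growth(prompt_len: int, num_mask_tokens: int) -> float:
    """Relative increase in input length caused by appending [mask] tokens."""
    if prompt_len <= 0:
        raise ValueError("prompt_len must be positive")
    return num_mask_tokens / prompt_len


if __name__ == "__main__":
    # Hypothetical receipt: tokens needed to spell out each field value.
    lengths = {"company": 6, "date": 4, "address": 14, "total": 3}
    masks = sum(lengths.values())
    print("autoregressive passes:", autoregressive_passes(lengths))
    print("parallel passes:", parallel_passes(lengths))
    print("input growth:", input_growth(prompt_len=900, num_mask_tokens=masks))
```

Under this model, documents with more or longer fields benefit most from parallel decoding, while the added [mask] tokens account for the growth in input length.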

This review was generated by AI and reviewed by human editors.