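[Paper Digest] Up to 36x Speedup: Mask-based Parallel Inference Paradigm for Key Information Extraction in MLLMs
The paper proposes PIP, a mask-based parallel inference paradigm for key information extraction from VrDs. It replaces target values with [mask] tokens so that all output tokens are generated simultaneously, achieving a 5–36× inference speedup with negligible accuracy loss.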
Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP-models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.
Research Motivation and Objectives
- Motivate and address the inefficiency of autoregressive inference in KIE for VrDs.
- Introduce a mask-based parallel decoding paradigm (PIP) that enables simultaneous extraction of multiple key fields.
- Develop a two-stage training pipeline (mask pre-training and KV supervised fine-tuning) to enable parallel decoding in MLLMs.
- Demonstrate that PIP achieves substantial speedups (5–36×) with competitive or improved accuracy across benchmark datasets.
Proposed Method
- Reformulate KIE by replacing target values with [mask] tokens to allow parallel decoding in a single forward pass.
- Use bidirectional attention during mask pre-training to learn context for predictions, replacing causal attention.
- Pre-train on a large image-caption dataset (13M images) to learn parallel inference.
- Fine-tune on a curated KV extraction dataset with human-in-the-loop verification to reduce hallucination and enable KV supervision.
- Visualize attention to show tokens attend to distinct image regions corresponding to output fields.
- Evaluate across multiple base models and datasets to demonstrate speedups and accuracy.
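The masking idea in the list above can be illustrated with a toy sketch. This is not the paper's implementation: the template layout (`key : [mask] ... ;`), the fixed number of mask slots per field, the `[pad]` filler token, and every function name here are hypothetical choices made for illustration, and a dictionary lookup stands in for the MLLM. The only point it shows is that once every target value is a run of `[mask]` placeholders, one call that predicts all positions at once can fill every field, instead of one call per generated token.

```python
from typing import Callable, Dict, List, Tuple

MASK = "[mask]"
PAD = "[pad]"  # hypothetical filler for unused slots, not taken from the paper


def build_masked_template(fields: List[str], slots_per_field: int) -> List[str]:
    """Lay out 'name : [mask] x slots ;' for every requested field."""
    tokens: List[str] = []
    for name in fields:
        tokens += [name, ":"] + [MASK] * slots_per_field + [";"]
    return tokens


def parallel_fill(tokens: List[str],
                  predict_all: Callable[[List[str]], List[str]]) -> List[str]:
    """Fill every [mask] position from a single call to predict_all."""
    predictions = predict_all(tokens)  # one prediction per input position
    return [predictions[i] if tok == MASK else tok
            for i, tok in enumerate(tokens)]


def parse_fields(filled: List[str], fields: List[str],
                 slots_per_field: int) -> Dict[str, str]:
    """Read the filled slots back into a field -> value mapping."""
    stride = slots_per_field + 3  # name, ':', the slots, ';'
    result: Dict[str, str] = {}
    for k, name in enumerate(fields):
        start = k * stride + 2
        slots = filled[start:start + slots_per_field]
        result[name] = " ".join(t for t in slots if t != PAD)
    return result


def make_toy_predictor(
    document: Dict[str, List[str]],
) -> Tuple[Callable[[List[str]], List[str]], Dict[str, int]]:
    """Stand-in for a model: looks values up in a dict and counts its calls."""
    calls = {"n": 0}

    def predict_all(tokens: List[str]) -> List[str]:
        calls["n"] += 1
        out: List[str] = []
        field, slot = None, 0
        for tok in tokens:
            if tok in document:
                field, slot = tok, 0
                out.append(tok)
            elif tok == MASK and field is not None:
                value = document[field]
                out.append(value[slot] if slot < len(value) else PAD)
                slot += 1
            else:
                out.append(tok)
        return out

    return predict_all, calls


if __name__ == "__main__":
    fields = ["total", "date"]
    predict_all, calls = make_toy_predictor(
        {"total": ["12", ".", "50"], "date": ["2024-01-05"]}
    )
    template = build_masked_template(fields, slots_per_field=4)
    filled = parallel_fill(template, predict_all)
    print(parse_fields(filled, fields, slots_per_field=4))
    print("forward passes used:", calls["n"])
```

In this toy setting an autoregressive decoder would need at least one pass per value token, while the masked template needs one pass regardless of how many fields are requested. The extra `[mask]` slots are also why the input grows in length.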
Experimental Results
Research Questions
- RQ1: Can masking target outputs and decoding in parallel reduce inference latency for KIE without sacrificing accuracy?
- RQ2: How does mask pre-training plus KV supervised fine-tuning enable effective parallel decoding in MLLMs for VrD KIE?
- RQ3: What speedups and accuracy trade-offs can be achieved across standard KIE benchmarks (FUNSD, SROIE, CORD, POIE, WildReceipt)?
- RQ4: Is the PIP paradigm robust across different base model architectures and scales?
Key Findings
- PIP achieves 5–36× inference speedup with negligible performance degradation compared to autoregressive baselines.
- PIP improves on the state of the art on SROIE and CORD by substantial margins when combined with KV supervised fine-tuning (e.g., PIP-Qwen2-VL-7B achieves ANLS 97.0 on SROIE and 97.3 on CORD).
- Mask pre-training with bidirectional attention enables effective parallel decoding for KIE in MLLMs.
- The approach maintains competitive accuracy across FUNSD, SROIE, CORD, POIE, and WildReceipt while reducing latency dramatically.
- Memory overhead is modest (up to ~30% increase in input length) with substantial gains in throughput.
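The ANLS scores quoted above refer to Average Normalized Levenshtein Similarity. For readers unfamiliar with the metric, here is a minimal sketch of its commonly used form: each prediction scores 1 minus its normalized edit distance to the reference, clipped to 0 once that distance reaches a threshold of 0.5, and the scores are averaged. The lowercasing and whitespace stripping, and the use of a single reference per prediction, are assumptions of this sketch; the paper's exact evaluation script may differ.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str], references: Sequence[str],
         threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity on a 0-1 scale."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    total = 0.0
    for pred, ref in zip(predictions, references):
        # Normalization below is an assumption of this sketch.
        p, r = pred.strip().lower(), ref.strip().lower()
        longest = max(len(p), len(r))
        nl = levenshtein(p, r) / longest if longest else 0.0
        total += 1.0 - nl if nl < threshold else 0.0
    return total / len(references)


if __name__ == "__main__":
    score = anls(["12.50", "2024-01-05"], ["12.50", "2024-01-06"])
    print(f"ANLS = {100 * score:.1f}")  # papers usually report the 0-100 scale
```

Because near-miss strings still earn partial credit, ANLS is more forgiving of small OCR-style character errors than exact-match accuracy.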
This digest was generated by AI and reviewed by human editors.