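[Paper Digest] IDP Accelerator: Agentic Document Intelligence from Extraction to Compliance Validation
IDP Accelerator is an open-source, modular framework that enables agentic AI for end-to-end document processing, from multimodal extraction to compliance validation, with a production-ready, cloud-native architecture and human-in-the-loop (HITL) capabilities.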
Understanding and extracting structured insights from unstructured documents remains a foundational challenge in industrial NLP. While Large Language Models (LLMs) enable zero-shot extraction, traditional pipelines often fail to handle multi-document packets, complex reasoning, and strict compliance requirements. We present IDP (Intelligent Document Processing) Accelerator, a framework enabling agentic AI for end-to-end document intelligence with four key components: (1) DocSplit, a novel benchmark dataset and multimodal classifier using BIO tagging to segment complex document packets; (2) a configurable Extraction Module that leverages multimodal LLMs to transform unstructured content into structured data; (3) an Agentic Analytics Module, compliant with the Model Context Protocol (MCP), that provides data access through secure, sandboxed code execution; and (4) a Rule Validation Module that replaces deterministic engines with LLM-driven logic for complex compliance checks. The interactive demonstration enables users to upload document packets, visualize classification results, and explore extracted data through an intuitive web interface. We demonstrate effectiveness across industries, highlighting a production deployment at a leading healthcare provider that achieved 98% classification accuracy, 80% lower processing latency, and 77% lower operational costs relative to legacy baselines. IDP Accelerator is open-sourced with a live demonstration available to the community.
Research Motivation and Goals
- Address the inefficiencies of traditional, template-based document processing in industry-scale settings.
- Provide a modular, production-ready framework that can segment multi-document packets and extract structured data.
- Enable natural-language querying and analytics over processed documents via secure MCP-enabled interfaces.
- Integrate LLM-driven rule validation to handle complex compliance checks beyond deterministic engines.
Proposed Method
- DocSplit for multimodal document packet segmentation using BIO tagging.
- Configurable extraction module leveraging multimodal LLMs to map content to user-defined schemas.
- Agentic Analytics Module with retrieval-augmented generation and MCP integration for enterprise data access.
- LLM-driven Rule Validation Module for complex, configurable compliance checks.
- Test Studio and CLI tooling with built-in HITL for rapid experimentation and iteration.
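To make the BIO-tagging idea concrete, here is a minimal sketch of how a sequence of tags could be turned into document segments. It assumes one tag per page, with `B-<type>` opening a new document, `I-<type>` continuing it, and `O` marking pages that belong to no document. The function name `bio_to_segments` and the labels `invoice` and `w2` are hypothetical illustrations, not names taken from DocSplit.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (doc_type, first_page, last_page), 0-indexed


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Group per-page BIO tags into contiguous document segments.

    Assumption: tags look like "B-invoice", "I-invoice", or "O".
    """
    segments: List[Segment] = []
    current = None  # [doc_type, first_page, last_page] of the open segment

    for page, tag in enumerate(tags):
        if tag == "O":
            # Page outside any document: close the open segment, if any.
            if current is not None:
                segments.append((current[0], current[1], current[2]))
                current = None
            continue

        prefix, _, doc_type = tag.partition("-")
        starts_new = (
            prefix == "B"
            or current is None          # stray I- tag with nothing open
            or current[0] != doc_type   # I- tag whose type does not match
        )
        if starts_new:
            if current is not None:
                segments.append((current[0], current[1], current[2]))
            current = [doc_type, page, page]
        else:
            current[2] = page  # extend the open segment by one page

    if current is not None:
        segments.append((current[0], current[1], current[2]))
    return segments


packet = ["B-invoice", "I-invoice", "B-w2", "B-invoice", "I-invoice", "I-invoice"]
for doc_type, first, last in bio_to_segments(packet):
    print(f"{doc_type}: pages {first}-{last}")
```

The `B-` prefix is what lets two adjacent documents of the same type be separated, which a plain per-page class label cannot express.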
Experimental Results
Research Questions
- RQ1: How can multi-document packets be effectively segmented and classified to enable downstream extraction?
- RQ2: What are the accuracy, latency, and cost trade-offs when using multimodal LLMs for structured information extraction across different modalities?
- RQ3: Can an agentic analytics layer provide meaningful natural-language querying over processed documents while maintaining security and governance?
- RQ4: To what extent can LLM-driven rule validation match or surpass traditional rule engines for enterprise compliance checks?
Key Findings
| Model | OCR | Image | Extraction Score | Latency | Cost | Failed |
|---|---|---|---|---|---|---|
| Claude Sonnet 4.5 | ✓ | ✗ | 0.7914 | 2m 4s | $5.56 | 0 |
| Claude Sonnet 4.5 | ✗ | ✓ | 0.7295 | 1m 47s | $5.49 | 0 |
| Claude Sonnet 4.5 | ✓ | ✓ | 0.7991 | 1m 53s | $7.18 | 0 |
| Claude Opus 4.5 | ✓ | ✗ | 0.7782 | 2m 20s | $7.28 | 0 |
| Claude Opus 4.5 | ✗ | ✓ | 0.7860 | 2m 17s | $7.71 | 0 |
| Claude Opus 4.5 | ✓ | ✓ | 0.7804 | 2m 3s | $10.26 | 0 |
| Claude Haiku 4.5 | ✓ | ✗ | 0.7554 | 1m 31s | $2.83 | 1 |
| Claude Haiku 4.5 | ✗ | ✓ | 0.6680 | 1m 33s | $2.82 | 0 |
| Claude Haiku 4.5 | ✓ | ✓ | 0.7782 | 1m 37s | $3.39 | 1 |
| Qwen3-VL | ✓ | ✗ | 0.7650 | 2m 41s | $2.08 | 0 |
| Qwen3-VL | ✗ | ✓ | 0.7450 | 200m 8s | $1.71 | 4 |
| Qwen3-VL | ✓ | ✓ | 0.7805 | 3m 1s | $1.90 | 4 |
| Gemma-3 | ✓ | ✗ | 0.7636 | 3m 14s | $1.64 | 0 |
| Gemma-3 | ✗ | ✓ | 0.5359 | 200m 17s | $1.36 | 5 |
| Gemma-3 | ✓ | ✓ | 0.7694 | 2m 47s | $1.35 | 4 |
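- Production deployments show high accuracy and substantial efficiency gains across industries (e.g., a healthcare deployment achieving 98% classification accuracy, 80% lower processing latency, and 77% lower operational costs than legacy baselines).
- Configurations with OCR+Image inputs generally achieve higher extraction scores than image-only or OCR-only configurations (Claude Opus 4.5 is the exception: 0.7860 image-only versus 0.7804 OCR+Image). For the larger Claude models (Sonnet 4.5 and Opus 4.5), OCR+Image also runs faster than OCR-only.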
- Open-source models offer cost advantages but may exhibit higher latency or failure rates with image inputs, emphasizing the value of structured output enforcement.
- DocSplit and the evaluation framework (DocSplit benchmark and Stickler) enable rigorous, field-level assessment of extraction and splitting quality.
- The combination of RAG-based analytics and MCP integration facilitates scalable, secure access to document data for downstream applications.
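The cost trade-off in the table can be made explicit with a small calculation. The sketch below takes the OCR+Image rows and ranks the models by extraction score per dollar. This ratio is an illustrative metric chosen here, not one reported in the paper, and `score_per_dollar` is a hypothetical helper name.

```python
# (model, extraction_score, cost_usd, failed) copied from the OCR+Image rows
# of the results table.
ROWS = [
    ("Claude Sonnet 4.5", 0.7991, 7.18, 0),
    ("Claude Opus 4.5", 0.7804, 10.26, 0),
    ("Claude Haiku 4.5", 0.7782, 3.39, 1),
    ("Qwen3-VL", 0.7805, 1.90, 4),
    ("Gemma-3", 0.7694, 1.35, 4),
]


def score_per_dollar(rows):
    """Return (model, score / cost, failed) tuples, best ratio first."""
    ratios = [(model, score / cost, failed) for model, score, cost, failed in rows]
    return sorted(ratios, key=lambda r: r[1], reverse=True)


ranking = score_per_dollar(ROWS)
for model, ratio, failed in ranking:
    print(f"{model:<18} {ratio:.3f} score/$  failed={failed}")
```

The ratio ignores failed documents and latency, so a model that ranks well here (the open-source models each show 4 failures in this configuration) may still be a worse choice in production.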
This digest was generated by AI and reviewed by human editors.