QUICK REVIEW

[Paper Review] Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition

Demiao Lin | arXiv (Cornell University) | Jan 23, 2024

Topic Modeling | Cited by 8

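One-Sentence Summary

The paper shows that a deep-learning-based PDF parser (ChatDOC) yields better RAG performance than a rule-based baseline (PyPDF) across 188 real-world documents, particularly for complex tables and reading order.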

ABSTRACT

With the rapid development of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become a predominant method in the field of professional knowledge-based question answering. Presently, major foundation model companies have opened up Embedding and Chat API interfaces, and frameworks like LangChain have already integrated the RAG process. It appears that the key models and steps in RAG have been resolved, leading to the question: are professional knowledge QA systems now approaching perfection? This article discovers that current primary methods depend on the premise of accessing high-quality text corpora. However, since professional documents are mainly stored in PDFs, the low accuracy of PDF parsing significantly impacts the effectiveness of professional knowledge-based QA. We conducted an empirical RAG experiment across hundreds of questions from the corresponding real-world professional documents. The results show that, ChatDOC, a RAG system equipped with a panoptic and pinpoint PDF parser, retrieves more accurate and complete segments, and thus better answers. Empirical experiments show that ChatDOC is superior to baseline on nearly 47% of questions, ties for 38% of cases, and falls short on only 15% of cases. It shows that we may revolutionize RAG with enhanced PDF structure recognition.

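Motivation and Goals

Show how PDF parsing quality affects RAG over professional documents.
Compare rule-based and deep-learning-based PDF parsing within a RAG pipeline.
Show that structure-aware parsing yields more accurate and more complete retrieval results.
Assess the practical impact on real-world documents and in case studies.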

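Proposed Method

Compares two RAG systems: ChatDOC, built on the ChatDOC PDF Parser, and a Baseline built on PyPDF with RecursiveCharacterTextSplitter.
ChatDOC uses DL-based parsing that includes OCR, document object detection, cross-column and cross-page trimming, reading-order determination, and table and structure recognition.
Content blocks are merged into chunks of roughly 300 tokens, and structure (tables, headings) is preserved within each retrieval unit.
Embeddings use text-embedding-ada-002; retrieved context is limited to ≤3000 tokens; question answering uses GPT-3.5-Turbo.
The dataset contains 188 documents (100 academic papers, 28 financial reports, 60 others) and 302 questions used for evaluation.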
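The chunking and retrieval-budget steps above can be sketched as follows. This is a minimal illustration, not ChatDOC's actual implementation: the `Block` type, the block labels, and the functions `count_tokens`, `merge_blocks`, and `select_within_budget` are hypothetical names, and whitespace-separated words are assumed as a stand-in for model tokens.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    kind: str  # "heading", "paragraph", or "table" (hypothetical labels)
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate model tokens.
    return len(text.split())


def merge_blocks(blocks: List[Block], target: int = 300) -> List[str]:
    """Greedily pack parsed blocks into chunks of about `target` tokens.

    A block is never split, so a table stays whole even when it exceeds
    the target, and a heading always opens a new chunk so that it stays
    attached to the text it introduces.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for block in blocks:
        n = count_tokens(block.text)
        starts_section = block.kind == "heading"
        if current and (starts_section or size + n > target):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(block.text)
        size += n
    if current:
        chunks.append("\n".join(current))
    return chunks


def select_within_budget(ranked_chunks: List[str], budget: int = 3000) -> List[str]:
    """Take chunks in rank order and stop at the first one that would
    push the total past `budget` tokens."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        n = count_tokens(chunk)
        if used + n > budget:
            break
        selected.append(chunk)
        used += n
    return selected
```

In this sketch a 400-token table following a 250-token paragraph becomes its own oversized chunk rather than being cut at the 300-token mark, which is the property a structure-aware splitter is meant to provide. A purely character-based splitter has no notion of table boundaries and may cut one in the middle.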

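Experimental Results

Research Questions

RQ1: Does the quality of PDF parsing and chunking affect RAG answer quality on professional documents?
RQ2: Do deep-learning-based parsers outperform rule-based parsers in RAG over PDFs?
RQ3: How do parsing errors affect extractive versus comprehensive questions?
RQ4: In RAG settings, what practical failure modes and limitations do DL-based parsers have?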

Key Findings

On extractive questions, ChatDOC beats the Baseline in 49% of cases, ties in 42%, and loses in 9% (86 questions).
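On comprehensive questions, ChatDOC wins 47%, ties 37%, and the Baseline wins 17% (216 questions).
Overall, across 302 questions, ChatDOC wins 143, the Baseline wins 44, and 115 are ties.
Case studies show better table handling, correct reading order, and retrieval of complete tables, all of which improve the LLM's understanding.
Limitations include ranking and token-window issues and occasional over-segmentation under headings, which leaves room to improve embedding-based ranking and segmentation.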
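As a quick arithmetic check, the overall tallies above can be converted to percentages and compared with the figures quoted in the abstract. The helper name `share` is hypothetical; the counts are the ones reported in this review.

```python
def share(count: int, total: int) -> int:
    """Return count/total as a percentage rounded to the nearest integer."""
    return round(100 * count / total)


# Overall tallies as reported in the findings above.
outcomes = {"ChatDOC wins": 143, "Tie": 115, "Baseline wins": 44}
total = sum(outcomes.values())

percentages = {name: share(n, total) for name, n in outcomes.items()}

for name, pct in percentages.items():
    print(f"{name}: {outcomes[name]}/{total} = {pct}%")
```

The three counts sum to 302, and the rounded shares come out to 47%, 38%, and 15%, which agrees with the abstract's "nearly 47%", 38%, and 15%.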

This review was generated by AI and checked by human editors.