QUICK REVIEW

[Paper Review] Revolutionizing Retrieval-Augmented Generation with Enhanced PDF Structure Recognition

Demiao Lin | arXiv (Cornell University) | 2024-01-23

Topic Modeling | Citations: 8

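One-Line Summary

This paper shows that a deep-learning-based PDF parser (ChatDOC) improves RAG performance over a rule-based baseline (PyPDF) across 188 real-world documents, with the clearest gains on complex tables and reading order.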

ABSTRACT

With the rapid development of Large Language Models (LLMs), Retrieval-Augmented Generation (RAG) has become a predominant method in the field of professional knowledge-based question answering. Presently, major foundation model companies have opened up Embedding and Chat API interfaces, and frameworks like LangChain have already integrated the RAG process. It appears that the key models and steps in RAG have been resolved, leading to the question: are professional knowledge QA systems now approaching perfection? This article discovers that current primary methods depend on the premise of accessing high-quality text corpora. However, since professional documents are mainly stored in PDFs, the low accuracy of PDF parsing significantly impacts the effectiveness of professional knowledge-based QA. We conducted an empirical RAG experiment across hundreds of questions from the corresponding real-world professional documents. The results show that, ChatDOC, a RAG system equipped with a panoptic and pinpoint PDF parser, retrieves more accurate and complete segments, and thus better answers. Empirical experiments show that ChatDOC is superior to baseline on nearly 47% of questions, ties for 38% of cases, and falls short on only 15% of cases. It shows that we may revolutionize RAG with enhanced PDF structure recognition.

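Research Motivation and Goals

Demonstrate how PDF parsing quality affects RAG on professional documents.
Compare rule-based and deep-learning-based PDF parsing within a RAG pipeline.
Show that structure-aware parsing yields more accurate and more complete retrieval results.
Assess the practical impact on real-world documents and in case studies.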

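Proposed Method

Two RAG systems are compared: ChatDOC, which uses the ChatDOC PDF Parser, and a Baseline built on PyPDF with RecursiveCharacterTextSplitter.
ChatDOC uses deep-learning-based parsing that covers OCR, document object detection, cross-column and cross-page trimming, reading-order determination, and table and document-structure recognition.
Chunks are built from content blocks of up to roughly 300 tokens, so structural units such as tables and titles stay intact in each retrieval unit.
Embeddings use text-embedding-ada-002; retrieved context is capped at 3,000 tokens; question answering is done with GPT-3.5-Turbo.
The dataset comprises 188 documents (100 academic papers, 28 financial reports, and 60 others), with 302 questions for evaluation.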
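
The chunking and retrieval limits described above can be illustrated with a minimal sketch. It is not ChatDOC's implementation: the `Block` type, the greedy merge strategy, and the whitespace-based `count_tokens` are all assumptions made for illustration, and the embedding-based ranking step is taken as given (the chunks are assumed to arrive already ranked).

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str  # hypothetical labels such as "title", "paragraph", "table"
    text: str


def count_tokens(text: str) -> int:
    # Assumption: a whitespace word count stands in for a real tokenizer.
    return len(text.split())


def merge_blocks(blocks, max_tokens=300):
    """Greedily merge consecutive blocks into chunks of at most max_tokens.

    A block is never split, so a table or title always stays whole.
    A single block larger than max_tokens becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for block in blocks:
        n = count_tokens(block.text)
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(block)
        used += n
    if current:
        chunks.append(current)
    return chunks


def select_within_budget(ranked_chunks, budget=3000):
    """Keep chunks in rank order until the next one would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        n = sum(count_tokens(b.text) for b in chunk)
        if used + n > budget:
            break
        selected.append(chunk)
        used += n
    return selected


if __name__ == "__main__":
    blocks = [
        Block("title", "1 Introduction"),
        Block("paragraph", "word " * 200),
        Block("table", "cell " * 150),
        Block("paragraph", "word " * 50),
    ]
    for chunk in merge_blocks(blocks):
        print([b.kind for b in chunk])
```

In this sketch the table is pushed into a new chunk rather than being cut at the 300-token boundary, which is the property the review attributes to structure-preserving chunking. A character-based splitter has no notion of block boundaries and may cut through a table.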

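Experimental Results

Research Questions

RQ1: Does the quality of PDF parsing and chunking affect RAG answer quality on professional documents?
RQ2: Does a deep-learning-based parser outperform a rule-based parser for RAG over PDFs?
RQ3: How do parsing errors affect extractive questions versus comprehensive analysis questions?
RQ4: What are the practical failure modes and limitations of a DL-based parser in a RAG setting?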

Key Results

On extractive questions (86 in total), ChatDOC outperforms the Baseline on 49%, ties on 42%, and loses on 9%.
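On comprehensive analysis questions (216 in total), ChatDOC wins 47%, ties 37%, and the Baseline wins 17%.
Across all 302 questions, ChatDOC wins 143, the Baseline wins 44, and 115 are ties.
Case studies show improved table handling, correct reading order, and retrieval of complete tables, which improves the LLM's comprehension.
Limitations include ranking and token-window issues and occasional mis-segmentation of titles, leaving room to improve embedding-based ranking and segmentation.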
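
As a quick arithmetic check, the overall counts reported above (143 ChatDOC wins, 115 ties, 44 Baseline wins out of 302 questions) can be converted to whole-number percentages and compared with the figures quoted in the abstract (nearly 47%, 38%, and 15%). The function name `shares` is only illustrative.

```python
def shares(counts):
    """Return each outcome's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {name: round(100 * n / total) for name, n in counts.items()}


# Overall counts as reported in the review.
overall = {"ChatDOC wins": 143, "Tie": 115, "Baseline wins": 44}

if __name__ == "__main__":
    print(sum(overall.values()))
    print(shares(overall))
```

The three counts sum to the 302 evaluated questions, and the rounded shares agree with the percentages stated in the abstract.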


This review was generated by AI and checked by a human editor.