QUICK REVIEW

[Paper Review] The Design of an LLM-powered Unstructured Analytics System

Eric Anderson, Jonathan Fritz|arXiv (Cornell University)|2024. 09. 01.

Neural Networks and Applications | Citations: 5

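One-line summary

This paper presents Aryn, an LLM-powered end-to-end unstructured analytics system that pairs a declarative document processing engine (Sycamore) with a query planner (Luna) to run natural language queries over large document collections.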

ABSTRACT

LLMs demonstrate an uncanny ability to process unstructured data, and as such, have the potential to go beyond search and run complex, semantic analyses at scale. We describe the design of an unstructured analytics system, Aryn, and the tenets and use cases that motivate its design. With Aryn, users specify queries in natural language and the system automatically determines a semantic plan and executes it to compute an answer from a large collection of unstructured documents. At the core of Aryn is Sycamore, a declarative document processing engine, that provides a reliable distributed abstraction called DocSets. Sycamore allows users to analyze, enrich, and transform complex documents at scale. Aryn includes Luna, a query planner that translates natural language queries to Sycamore scripts, and DocParse, which takes raw PDFs and document images, and converts them to DocSets for downstream processing. We show how these pieces come together to achieve better accuracy than RAG on analytics queries over real world reports from the National Transportation Safety Board (NTSB). Also, given current limitations of LLMs, we argue that an analytics system must provide explainability to be practical, and show how Aryn's user interface does this to help build trust.

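Research motivation and goals

Motivates the enterprise need for semantic analysis over large unstructured document repositories, going beyond simple search.
Proposes a declarative, plan-based approach to unstructured analytics that combines ETL-like processing with flexible analytics.
Introduces a scalable architecture that supports LLM-based transformations and explanations with human-in-the-loop capabilities.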

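Proposed method

Introduces Aryn, an open-source system that uses Sycamore for document processing and DocSets as its core data abstraction.
Describes the Aryn Partitioner, which converts raw PDFs and images into DocSets using a vision-based segmentation model (Deformable DETR) trained on DocLayNet.
Presents Luna, a planner that translates natural language queries into semantic query plans executed by Sycamore.
Describes a data model in which a document is a hierarchical tree of elements and DocSets support multimodal content and metadata.
Details an execution model based on Spark-like, lazily evaluated distributed pipelines built on Ray, with support for traceability and debugging.
Outlines the LLM-based transformations integrated into Sycamore pipelines (e.g., llm_query, extract_properties, summarize).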
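The data model and lazy execution model described above can be illustrated with a small toy. This is a minimal sketch, not the Sycamore API: ToyDocSet, Element, Document, extract_weather, and is_icing are all hypothetical names, and a keyword lookup stands in for the LLM call that a transform such as extract_properties would make.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, Iterator, List


@dataclass
class Element:
    """A node in a document's element tree (e.g. a section or paragraph)."""
    kind: str
    text: str = ""
    children: List["Element"] = field(default_factory=list)


@dataclass
class Document:
    """A document: a root element plus free-form metadata properties."""
    root: Element
    properties: Dict[str, str] = field(default_factory=dict)


def walk(el: Element) -> Iterator[Element]:
    """Yield every element of the tree in depth-first order."""
    yield el
    for child in el.children:
        yield from walk(child)


class ToyDocSet:
    """Hypothetical stand-in for a DocSet: records transforms, runs them lazily."""

    def __init__(self, docs, steps=()):
        self._docs = list(docs)
        self._steps = list(steps)

    def map(self, fn):
        # Nothing runs here; the step is only appended to the recorded plan.
        return ToyDocSet(self._docs, self._steps + [("map", fn)])

    def filter(self, pred):
        return ToyDocSet(self._docs, self._steps + [("filter", pred)])

    def plan(self):
        """Human-readable list of recorded steps, useful for inspection."""
        return [f"{op}:{fn.__name__}" for op, fn in self._steps]

    def execute(self):
        """Run the recorded steps in order and return the resulting documents."""
        docs = self._docs
        for op, fn in self._steps:
            if op == "map":
                docs = [fn(d) for d in docs]
            else:
                docs = [d for d in docs if fn(d)]
        return docs


def extract_weather(doc: Document) -> Document:
    # Assumption: a keyword check replaces the LLM-based property extraction.
    text = " ".join(e.text for e in walk(doc.root)).lower()
    weather = "icing" if "icing" in text else "unknown"
    return replace(doc, properties={**doc.properties, "weather": weather})


def is_icing(doc: Document) -> bool:
    return doc.properties.get("weather") == "icing"


docs = [
    Document(Element("report", children=[
        Element("paragraph", "Airframe icing was reported before the descent.")
    ]), {"id": "a"}),
    Document(Element("report", children=[
        Element("paragraph", "Clear skies and light wind.")
    ]), {"id": "b"}),
]

pipeline = ToyDocSet(docs).map(extract_weather).filter(is_icing)
result = pipeline.execute()
```

Building the pipeline only records steps, so pipeline.plan() can be inspected before any work is done; execute() is the single point where documents are processed. A real system would distribute that step across workers, which this sketch does not attempt.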

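Experimental results

Research questions

RQ1: How can an end-to-end system use LLMs to perform complex analyses over unstructured document collections while preserving explainability and control?
RQ2: What architectural components are needed to align ETL-like processing with analytics in a multimodal, hierarchical document setting?
RQ3: Can a human-in-the-loop approach improve accuracy and trust in an LLM-based unstructured analytics ecosystem?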

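Key results

Aryn demonstrates end-to-end querying over unstructured data: it generates a semantic plan from a natural language query and executes the plan to compute an answer.
On DocLayNet document layout segmentation, the Deformable DETR-based Partitioner (mAP 0.602, mAR 0.743) outperforms a competing cloud vendor's API (mAP 0.344, mAR 0.466).
Sycamore provides document-level transformations and LLM-based extensions, with lineage for debugging and explainability.
Luna achieves 72% accuracy on a micro-benchmark over earnings reports and NTSB reports, with 13 correct, 3 plausible, and 2 incorrect answers (13 of 18); human-in-the-loop handling of ambiguous cases is also acknowledged.
The system supports combining traditional operators with semantic, LLM-based operators, and exposes plans in JSON format for plan visualization and transparency.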
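To make the idea of a JSON-formatted plan concrete, the sketch below parses a small plan and renders it as readable steps. The schema (id, op, args, inputs), the index name, and the mix of operators are assumptions made for illustration only; the review does not give the actual plan format that Luna emits.

```python
import json

# Hypothetical plan: one semantic (LLM-based) operator followed by
# traditional filter and count operators. Field names are assumptions.
PLAN_JSON = """
[
  {"id": 0, "op": "scan", "args": {"index": "ntsb"}, "inputs": []},
  {"id": 1, "op": "extract_properties", "args": {"fields": ["weather"]}, "inputs": [0]},
  {"id": 2, "op": "filter", "args": {"field": "weather", "equals": "icing"}, "inputs": [1]},
  {"id": 3, "op": "count", "args": {}, "inputs": [2]}
]
"""


def describe_plan(plan_json):
    """Check that each node only reads earlier nodes, then render one line per node."""
    nodes = json.loads(plan_json)
    seen = set()
    lines = []
    for node in nodes:
        missing = [i for i in node["inputs"] if i not in seen]
        if missing:
            raise ValueError(f"node {node['id']} reads undefined inputs {missing}")
        sources = ", ".join(f"#{i}" for i in node["inputs"]) or "-"
        lines.append(f"#{node['id']} {node['op']} <- {sources}")
        seen.add(node["id"])
    return lines


for line in describe_plan(PLAN_JSON):
    print(line)
```

Because the plan is plain data, a user can read it, spot a wrong operator or argument, and correct it before execution, which is the kind of transparency the last result above refers to.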

Figure 2. Output of Aryn Partitioner (including table and cell identification) on a typical PDF NTSB accident report.


This review was generated by AI and reviewed by a human editor.