QUICK REVIEW

[Paper Review] Retrieval Augmented Generation and Representative Vector Summarization for large unstructured textual data in Medical Education

Supun Manathunga, Y. A. Illangasekara | arXiv (Cornell University) | 2023-08-01

Topic Modeling | Citations: 14

One-line summary

The paper presents docGPT, built on LangChain and FAISS, which uses Retrieval Augmented Generation (RAG) and Representative Vector Summarization (RVS) to handle large unstructured medical text, retrieving and summarizing content within token limits.

ABSTRACT

Large Language Models are increasingly being used for various tasks including content generation and as chatbots. Despite their impressive performances in general tasks, LLMs need to be aligned when applying for domain specific tasks to mitigate the problems of hallucination and producing harmful answers. Retrieval Augmented Generation (RAG) allows to easily attach and manipulate a non-parametric knowledgebases to LLMs. Applications of RAG in the field of medical education are discussed in this paper. A combined extractive and abstractive summarization method for large unstructured textual data using representative vectors is proposed.

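Motivation and Objectives

Motivate the use of RAG to mitigate hallucination and domain misalignment in LLMs for medical education.
Introduce a combined extractive-abstractive summarization workflow for handling large documents.
Develop a method for selecting representative text chunks and visualizing the content distribution.
Implement the workflow in docGPT and make the software available as open source.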

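Proposed Method

Extract text from unstructured sources, including PDFs, text documents, spreadsheets, slides, and OCR of images and scans.
Embed the chunks into a 1536-dimensional vector space with text-embedding-ada-002 and store them in FAISS.
Retrieve the k chunks most similar to the query and combine them with the query to prompt the LLM.
Compute the maximum available token limit T and select k chunks such that k*s ≤ T, where s appears to denote the per-chunk token size.
Quantize the vectors with k-means into k clusters and select the chunk closest to each centroid as that cluster's representative.
Run extractive summarization on each representative chunk to generate keywords (three per chunk) and map them across the cluster members; generate a word cloud and a 2D t-SNE visualization for insight into the content distribution.
Generate the final abstractive summary from the mapped representations and derive the key points.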
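
The token-budget, retrieval, and representative-selection steps above can be sketched in a few lines. This is a minimal illustration, not docGPT's code: it swaps FAISS for a brute-force cosine search and uses a small hand-written k-means with a deterministic farthest-point initialization, all in standard-library Python. The function names (max_chunks, top_k, kmeans, representatives), the 750-token chunk size, and the toy 2D vectors standing in for 1536-dimensional embeddings are assumptions made for the example.

```python
import math


def max_chunks(token_limit, chunk_tokens):
    """Largest k such that k * chunk_tokens <= token_limit."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return token_limit // chunk_tokens


def cosine(a, b):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, vectors, k):
    """Indices of the k vectors most similar to the query (brute force)."""
    order = sorted(range(len(vectors)),
                   key=lambda i: cosine(query, vectors[i]), reverse=True)
    return order[:k]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda c: sq_dist(point, centroids[c]))


def kmeans(vectors, k, iters=50):
    """Plain Lloyd iterations with a deterministic farthest-point start."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        labels = [nearest(v, centroids) for v in vectors]
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(v, centroids) for v in vectors]


def representatives(vectors, k):
    """Index of the chunk vector closest to each k-means centroid."""
    centroids, _ = kmeans(vectors, k)
    return [min(range(len(vectors)), key=lambda i: sq_dist(vectors[i], c))
            for c in centroids]


# Toy stand-ins for chunk embeddings: two well-separated groups.
chunk_vectors = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                 [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]

print(max_chunks(15000, 750))              # chunk budget for a hypothetical chunk size
print(top_k([1.0, 0.0], chunk_vectors, 2)) # retrieval-style lookup
print(representatives(chunk_vectors, 2))   # one representative index per cluster
```

In practice the embeddings would come from the embedding model and the nearest-neighbour search from a vector index; the point of the sketch is only that the number of clusters is fixed by the token budget, and that each cluster contributes the single chunk nearest its centroid.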

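Experimental Results

Research Questions

RQ1: How does RAG with a non-parametric knowledge base improve accuracy on clinical medicine and pharmacology queries compared with a base LLM?
RQ2: Can Representative Vector Summarization (RVS) effectively summarize large medical documents within token constraints?
RQ3: How do keyword generation, word clouds, and t-SNE visualization help in understanding the distribution of content within a document?
RQ4: What does a practical implementation of RAG and RVS look like in the docGPT system for medical education?
RQ5: How do the results compare with standard models such as ChatGPT on medical reference tasks?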

Key Findings

docGPT with RAG produced more targeted and accurate answers than base ChatGPT for queries from clinical medicine and pharmacology sources.
RVS enabled selection of representative chunks under token constraints and produced visual distributions (word clouds, t-SNE) showing content coverage.
For Kumar and Clark Clinical Medicine (10th Edition), 19 representative chunks were used under a 15,000 token limit; BNF 82 used 10 representative chunks under a 5,000 token limit.
The approach integrates extraction, summarization, and visualization to support knowledge-intensive medical education tasks.
The implementation is available in docGPT (Python, LangChain) with source at the provided GitHub repository.

Figure 2: Word cloud for Kumar and Clark Clinical Medicine


This review was generated by AI and checked by a human editor.