QUICK REVIEW

[Paper Digest] Retrieval Augmented Generation and Representative Vector Summarization for large unstructured textual data in Medical Education

Supun Manathunga, Y. A. Illangasekara | arXiv (Cornell University) | Aug 1, 2023

Topic Modeling | Cited by 14

ONE-SENTENCE SUMMARY

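This paper proposes a Retrieval Augmented Generation (RAG) approach, combined with Representative Vector Summarization (RVS), for handling large volumes of unstructured medical text. It uses docGPT, built on LangChain and FAISS, to retrieve and summarize content within token limits.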

ABSTRACT

Large Language Models are increasingly being used for various tasks including content generation and as chatbots. Despite their impressive performances in general tasks, LLMs need to be aligned when applying for domain specific tasks to mitigate the problems of hallucination and producing harmful answers. Retrieval Augmented Generation (RAG) allows to easily attach and manipulate a non-parametric knowledgebases to LLMs. Applications of RAG in the field of medical education are discussed in this paper. A combined extractive and abstractive summarization method for large unstructured textual data using representative vectors is proposed.

MOTIVATION AND OBJECTIVES

Motivate the use of RAG to mitigate hallucinations and domain misalignment in LLMs for medical education.
Introduce a combined extractive-abstractive summarization workflow to handle large documents.
Develop a method to select representative text chunks and visualize content distribution.
Implement the workflow in docGPT and provide open-source access to the software.

PROPOSED METHOD

Extract text from unstructured sources, including PDFs, text documents, spreadsheets, and slides, with OCR for images and scans.
Embed the text chunks into a 1536-dimensional vector space with text-embedding-ada-002 and store the vectors in a FAISS index.
Retrieve the k chunks most similar to a query and combine them with the query to form the LLM prompt.
For summarization, compute the maximum affordable token limit T and choose the number of representative chunks k such that k*s ≤ T, where s is the chunk size in tokens.
Quantize the vectors with k-means into k clusters and pick the chunk closest to each centroid as that cluster's representative.
For each representative chunk, perform extractive summarization to generate three keywords and map them onto the other members of its cluster; produce a word cloud and a 2D t-SNE visualization to show how content is distributed.
Create a final abstractive summary from mapped representations and generate key points.
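
The retrieval and representative-selection steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the docGPT implementation: it swaps FAISS for a brute-force search, uses cosine similarity as the similarity measure, seeds k-means with a deterministic farthest-point rule, and runs on toy 2D vectors instead of 1536-dimensional embeddings. All function names (`top_k`, `num_representatives`, `representative_chunks`, and so on) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunk_vecs, k):
    """Indices of the k chunks most similar to the query (brute force)."""
    order = sorted(range(len(chunk_vecs)),
                   key=lambda i: cosine(query_vec, chunk_vecs[i]),
                   reverse=True)
    return order[:k]

def num_representatives(token_limit, chunk_tokens):
    """Largest k with k * chunk_tokens <= token_limit."""
    return token_limit // chunk_tokens

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(vectors, k):
    """Farthest-point seeding. Assumption: the source does not specify seeding."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    return centroids

def kmeans(vectors, k, max_iter=100):
    """Plain Lloyd iterations; returns (centroids, cluster label per vector)."""
    centroids = init_centroids(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # leave an empty cluster's centroid where it was
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels

def representative_chunks(vectors, k):
    """Index of the chunk closest to each centroid, plus all cluster labels."""
    centroids, labels = kmeans(vectors, k)
    reps = [min(range(len(vectors)), key=lambda i: sq_dist(vectors[i], c))
            for c in centroids]
    return reps, labels

# Toy "embeddings": three well-separated groups of chunks.
vectors = [(0, 0), (0.1, 0), (0, 0.1),
           (5, 5), (5.1, 5), (5, 5.1),
           (10, 0), (10.1, 0), (10, 0.1)]
reps, labels = representative_chunks(vectors, 3)
print("representative chunk indices:", sorted(reps))
print("cluster label per chunk:", labels)
```

In a real pipeline the representative indices would select the chunks passed on to keyword extraction and the final abstractive summary, and `labels` would be used to map each representative's keywords onto the other members of its cluster.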

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How does RAG with a non-parametric knowledgebase improve accuracy versus base LLMs for clinical medicine and pharmacology queries?
RQ2: Can Representative Vector Summarization (RVS) effectively summarize large medical documents within token constraints?
RQ3: How do keyword generation, word clouds, and t-SNE visualizations aid in understanding the document content distribution?
RQ4: What is the practical implementation of RAG and RVS in a docGPT system for medical education?
RQ5: How do results compare against standard models like ChatGPT on medical reference tasks?

KEY FINDINGS

docGPT with RAG produced more targeted and accurate answers than base ChatGPT for queries from clinical medicine and pharmacology sources.
RVS enabled selection of representative chunks under token constraints and produced visual distributions (word clouds, t-SNE) showing content coverage.
For Kumar and Clark Clinical Medicine (10th Edition), 19 representative chunks were used under a 15,000-token limit; for BNF 82, 10 representative chunks were used under a 5,000-token limit.
The approach integrates extraction, summarization, and visualization to support knowledge-intensive medical education tasks.
The implementation is available in docGPT (Python, LangChain) with source at the provided GitHub repository.
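
The chunk counts reported above, combined with the method's constraint k*s ≤ T, put an upper bound on the chunk size s for each book. The short calculation below works that bound out. The digest does not state the chunk size actually used, so these figures are ceilings implied by the arithmetic, not values taken from the paper; `max_chunk_tokens` is a hypothetical helper name.

```python
def max_chunk_tokens(token_limit, k):
    """Largest whole-number s satisfying k * s <= token_limit."""
    return token_limit // k

# (token limit T, representative chunks k) as reported in the findings
reported = {
    "Kumar and Clark Clinical Medicine (10th Edition)": (15_000, 19),
    "BNF 82": (5_000, 10),
}

for title, (limit, k) in reported.items():
    s_max = max_chunk_tokens(limit, k)
    print(f"{title}: k={k}, T={limit}, s <= {s_max} tokens per chunk")
```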

Figure 2: Word cloud for Kumar and Clark Clinical Medicine


This digest was generated by AI and reviewed by a human editor.