QUICK REVIEW

[Paper Review] Reinforcement Learning for Optimizing RAG for Domain Chatbots

Mandar Kulkarni, Praveen Tangarajan | arXiv (Cornell University) | 2024-01-10

Topic Modeling | Citations: 18

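One-Line Summary

This paper uses a policy-based RL model, external to the RAG pipeline, to decide whether to fetch FAQ context. Under GPT-4 evaluation it cuts LLM token usage by about 31% while slightly improving accuracy on a domain FAQ chatbot. It also shows that an in-house embedding model outperforms a public model in both retrieval and OOD detection.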

ABSTRACT

With the advent of Large Language Models (LLM), conversational assistants have become prevalent for domain use cases. LLMs acquire the ability to contextual question answering through training, and Retrieval Augmented Generation (RAG) further enables the bot to answer domain-specific questions. This paper describes a RAG-based approach for building a chatbot that answers user's queries using Frequently Asked Questions (FAQ) data. We train an in-house retrieval embedding model using infoNCE loss, and experimental results demonstrate that the in-house model works significantly better than the well-known general-purpose public embedding model, both in terms of retrieval accuracy and Out-of-Domain (OOD) query detection. As an LLM, we use an open API-based paid ChatGPT model. We noticed that a previously retrieved-context could be used to generate an answer for specific patterns/sequences of queries (e.g., follow-up queries). Hence, there is a scope to optimize the number of LLM tokens and cost. Assuming a fixed retrieval model and an LLM, we optimize the number of LLM tokens using Reinforcement Learning (RL). Specifically, we propose a policy-based model external to the RAG, which interacts with the RAG pipeline through policy actions and updates the policy to optimize the cost. The policy model can perform two actions: to fetch FAQ context or skip retrieval. We use the open API-based GPT-4 as the reward model. We then train a policy model using policy gradient on multiple training chat sessions. As a policy model, we experimented with a public gpt-2 model and an in-house BERT model. With the proposed RL-based optimization combined with similarity threshold, we are able to achieve significant cost savings while getting a slightly improved accuracy. Though we demonstrate results for the FAQ chatbot, the proposed RL approach is generic and can be experimented with any existing RAG pipeline.

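Motivation and Objectives

Reduce LLM token cost in a RAG setup to enable efficient domain-specific chatbots.
Show that an in-house embedding model trained with infoNCE loss outperforms public embeddings on domain FAQ retrieval and OOD detection.
Propose and evaluate a policy-gradient RL agent that decides when to fetch FAQ context in order to minimize cost.
Show that combining RL-based context selection with a similarity threshold yields substantial token savings without hurting accuracy.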

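Proposed Method

Train an in-house embedding model with infoNCE loss for domain FAQ retrieval.
Compare the top-1/top-3 retrieval accuracy of the in-house embeddings and a public model on English and Hinglish queries.
Use GPT-4 as the reward evaluator, converting its Good/Bad ratings into numeric rewards for policy-gradient training.
Develop a policy network, external to the RAG pipeline, that selects a FETCH or NO_FETCH action based on a state comprising the previous queries, the actions taken, and the current query.
Train the policy on state-action-reward trajectories using policy gradient with entropy regularization.
Combine the RL policy with a similarity threshold (SimThr) to further reduce token usage.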
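
The training step described above can be illustrated with a minimal sketch of policy gradient (REINFORCE) with an entropy bonus for a two-action FETCH/NO_FETCH policy. Everything concrete here is an assumption made for illustration: the one-hot state features, the logistic policy (the paper uses GPT-2 or BERT as the policy model), the numeric reward values, and the `toy_grade` function that stands in for the GPT-4 Good/Bad evaluator. It also uses the per-step reward rather than a discounted return.

```python
import math
import random

FETCH, NO_FETCH = 0, 1


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def reward_from_grade(grade: str, action: int) -> float:
    """Assumed mapping from a Good/Bad grade to a number (not the paper's values).

    A Good answer without fetching is rewarded most because it saves tokens.
    """
    if grade == "Good":
        return 1.0 if action == NO_FETCH else 0.5
    return -1.0 if action == NO_FETCH else -0.5


def toy_grade(is_follow_up: bool, action: int) -> str:
    """Hypothetical stand-in for the GPT-4 evaluator: skipping retrieval
    only yields a Good answer when the query is a follow-up."""
    return "Good" if (action == FETCH or is_follow_up) else "Bad"


class LogisticPolicy:
    """p(FETCH | x) = sigmoid(w . x); a tiny substitute for a neural policy."""

    def __init__(self, n_features: int):
        self.w = [0.0] * n_features

    def logit(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def p_fetch(self, x) -> float:
        return sigmoid(self.logit(x))

    def sample(self, x, rng: random.Random) -> int:
        return FETCH if rng.random() < self.p_fetch(x) else NO_FETCH

    def update(self, trajectory, lr: float = 0.1, beta: float = 0.01) -> None:
        """One gradient-ascent step on sum(r * log pi(a|x)) + beta * entropy."""
        grad = [0.0] * len(self.w)
        for x, action, reward in trajectory:
            z = self.logit(x)
            p = sigmoid(z)
            dlogpi_dz = (1.0 - p) if action == FETCH else -p
            dentropy_dz = -z * p * (1.0 - p)  # derivative of Bernoulli entropy
            g = reward * dlogpi_dz + beta * dentropy_dz
            for i, xi in enumerate(x):
                grad[i] += g * xi
        self.w = [wi + lr * gi for wi, gi in zip(self.w, grad)]


def train(episodes: int = 500, turns: int = 8, seed: int = 0) -> LogisticPolicy:
    rng = random.Random(seed)
    policy = LogisticPolicy(n_features=2)
    for _ in range(episodes):
        trajectory = []
        for _ in range(turns):
            follow_up = rng.random() < 0.5
            # One-hot state: [is_new_topic, is_follow_up]
            x = [0.0, 1.0] if follow_up else [1.0, 0.0]
            action = policy.sample(x, rng)
            reward = reward_from_grade(toy_grade(follow_up, action), action)
            trajectory.append((x, action, reward))
        policy.update(trajectory)
    return policy


policy = train()
print("p(FETCH | new topic) =", round(policy.p_fetch([1.0, 0.0]), 3))
print("p(FETCH | follow-up) =", round(policy.p_fetch([0.0, 1.0]), 3))
```

In this toy setting the policy should learn to fetch for new topics and to skip retrieval for follow-ups. The entropy coefficient `beta` only keeps the action probabilities from saturating too early.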

Figure 1: Proposed policy agent based architecture for optimizing RAG for domain chatbots.
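
As a rough illustration of how a policy agent could sit outside the RAG pipeline, the sketch below handles one chat turn at a time: a decision function picks FETCH or NO_FETCH, and a similarity threshold drops the retrieved context when the best match scores too low. All names are hypothetical (`handle_turn`, `rule_policy`, `toy_retrieve`, the FAQ entries, the threshold of 0.5). The retriever is a toy word-overlap scorer rather than an embedding model, the rule-based policy stands in for a trained one, and whitespace word counts are only a crude proxy for LLM tokens.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical FAQ store: question -> answer text used as context.
FAQS = {
    "how do i reset my password": "Reset your password from Settings > Security.",
    "what is the refund policy": "Refunds are issued within 7 days.",
}


@dataclass
class Session:
    last_context: Optional[str] = None
    tokens_sent: int = 0
    actions: List[str] = field(default_factory=list)


def _words(text: str) -> set:
    return set(text.lower().replace("?", "").split())


def toy_retrieve(query: str) -> Tuple[str, float]:
    """Return the best FAQ answer and its Jaccard word-overlap score."""
    best_answer, best_score = "", 0.0
    q = _words(query)
    for question, answer in FAQS.items():
        f = _words(question)
        score = len(q & f) / len(q | f)
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer, best_score


def rule_policy(query: str, session: Session) -> str:
    """Stand-in for the trained policy: treat 'And ...' queries as follow-ups."""
    is_follow_up = query.lower().startswith("and ") and session.last_context
    return "NO_FETCH" if is_follow_up else "FETCH"


def handle_turn(
    query: str,
    session: Session,
    decide: Callable[[str, Session], str] = rule_policy,
    retrieve: Callable[[str], Tuple[str, float]] = toy_retrieve,
    sim_thr: float = 0.5,
) -> str:
    """Build the prompt for one turn and record the action taken."""
    action = decide(query, session)
    context = ""
    if action == "FETCH":
        candidate, score = retrieve(query)
        if score >= sim_thr:
            context = candidate
            session.last_context = candidate
        else:
            action = "SKIP_LOW_SIM"  # likely out-of-domain: send no FAQ context
    # On NO_FETCH no new FAQ text is added; the model is assumed to rely on
    # what is already in the conversation.
    prompt = f"{context}\n{query}".strip()
    session.tokens_sent += len(prompt.split())
    session.actions.append(action)
    return prompt


session = Session()
for q in ["How do I reset my password?", "And what if that fails?", "Tell me a joke"]:
    handle_turn(q, session)
print(session.actions, session.tokens_sent)
```

Keeping the decision function and the threshold check outside the retriever and the LLM call matches the premise that both are fixed, so only the small policy needs training.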

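Experimental Results

Research Questions

RQ1: Can an external policy model learn when to fetch FAQ context so as to reduce LLM token usage without hurting answer quality?
RQ2: Does an in-house, domain-tuned embedding model improve retrieval accuracy and OOD detection over public embeddings?
RQ3: How does RL-based context selection interact with a similarity-threshold rule to optimize RAG cost?
RQ4: What is the effect of using GPT-4 as an automatic evaluator for policy training in a RAG setup?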

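Key Results

The in-house embedding model trained with infoNCE achieves higher top-1/top-3 accuracy than the public e5-base-v2 model on both English and Hinglish queries.
The in-house model separates in-domain and OOD queries more clearly, which allows retrieval to be skipped selectively using a similarity threshold.
When combined with the similarity threshold, the RL policy external to the RAG reduces token usage by about 31% on a 91-query test session while slightly improving accuracy.
GPT-4 evaluation ratings can be converted into rewards to drive policy-gradient updates for action selection.
Using GPT-2 as the policy model also yields token savings (about 25%), suggesting that the approach generalizes across policy architectures.
The reward configuration can affect token savings (for example, about 30% under an alternative configuration).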
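
For reference, the infoNCE objective behind the in-house embedding model is commonly written as below. This is the standard in-batch-negatives form, not necessarily the exact variant used in the paper. The symbols are assumptions: $q_i$ is a query embedding, $d_i$ the embedding of its matching FAQ entry, $\mathrm{sim}$ a similarity function such as cosine similarity, $\tau$ a temperature, and $N$ the batch size.

```latex
\mathcal{L}_{\text{infoNCE}}
  = -\frac{1}{N} \sum_{i=1}^{N}
    \log \frac{\exp\left(\mathrm{sim}(q_i, d_i) / \tau\right)}
              {\sum_{j=1}^{N} \exp\left(\mathrm{sim}(q_i, d_j) / \tau\right)}
```

Minimizing this loss pulls each query toward its own FAQ entry and away from the other entries in the batch, which is consistent with the clearer in-domain/OOD separation reported above.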


This review was generated by AI and reviewed by a human editor.