QUICK REVIEW

[Paper Review] HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution

Ehsan Kamalloo, Aref Jafari | arXiv (Cornell University) | 2023-07-31

Topic Modeling | Citations: 27

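One-Line Summary

HAGRID is an open dataset, built on the English subset of MIRACL, for end-to-end generative information seeking with source citations. Answers are generated with GPT-3.5, and human annotators evaluate them for informativeness and attributability.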

ABSTRACT

The rise of large language models (LLMs) had a transformative impact on search, ushering in a new era of search engines that are capable of generating search results in natural language text, imbued with citations for supporting sources. Building generative information-seeking models demands openly accessible datasets, which currently remain lacking. In this paper, we introduce a new dataset, HAGRID (Human-in-the-loop Attributable Generative Retrieval for Information-seeking Dataset) for building end-to-end generative information-seeking models that are capable of retrieving candidate quotes and generating attributed explanations. Unlike recent efforts that focus on human evaluation of black-box proprietary search engines, we built our dataset atop the English subset of MIRACL, a publicly available information retrieval dataset. HAGRID is constructed based on human and LLM collaboration. We first automatically collect attributed explanations that follow an in-context citation style using an LLM, i.e. GPT-3.5. Next, we ask human annotators to evaluate the LLM explanations based on two criteria: informativeness and attributability. HAGRID serves as a catalyst for the development of information-seeking models with better attribution capabilities.

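Motivation and Objectives

Address the need for a publicly available dataset for training generative search models with attribution.
Build a dataset that pairs LLM-generated explanations with human judgments of informativeness and attributability.
Use MIRACL queries, quotes, and relevant passages to generate and evaluate grounded answers.
Promote open research on end-to-end retrieval-augmented generation with explicit source citations.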

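Proposed Method

Build an attribution-friendly generation pipeline that uses MIRACL English queries and their positive (relevant) passages as context.
Use GPT-3.5 to generate answers with in-context citations that reference the supporting quotes.
Have human annotators rate each generated answer for informativeness and attributability.
Release two splits (training and development) under the Apache 2.0 license.
Post-process and filter the LLM outputs so that citations conform to an IEEE-style format.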
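
The generation and filtering steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the prompt wording, the function names (`build_prompt`, `cited_ids`, `keep_answer`), and the filtering rule (keep an answer only if it cites at least one quote and every bracketed number refers to an existing quote) are all assumptions made for the example. The review only states that answers use IEEE-style in-context citations and that outputs are post-processed and filtered.

```python
import re

# IEEE-style in-text citations look like [1], [2], ...
CITATION_RE = re.compile(r"\[(\d+)\]")


def build_prompt(query, quotes):
    """Number the quotes from 1 and ask for an answer that cites them.

    The prompt text is hypothetical; the actual prompt is not given in the review.
    """
    numbered = "\n".join(f"[{i}] {q}" for i, q in enumerate(quotes, start=1))
    return (
        "Answer the question using only the quotes below. "
        "Cite supporting quotes in brackets, e.g. [1].\n\n"
        f"Quotes:\n{numbered}\n\nQuestion: {query}\nAnswer:"
    )


def cited_ids(answer):
    """Return distinct citation numbers in order of first appearance."""
    seen = []
    for match in CITATION_RE.finditer(answer):
        number = int(match.group(1))
        if number not in seen:
            seen.append(number)
    return seen


def keep_answer(answer, num_quotes):
    """Assumed filter: at least one citation, and every id points to a real quote."""
    ids = cited_ids(answer)
    return bool(ids) and all(1 <= i <= num_quotes for i in ids)
```

A generated answer such as "Paris is the capital of France [1]." would pass this filter when at least one quote was supplied, while an answer with no brackets or with an out-of-range number would be dropped before human annotation.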

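Experimental Results

Research Questions

RQ1: How can grounded answers that cite supporting quotes be generated automatically from a given set of passages?
RQ2: To what extent do human annotators judge LLM-generated explanations to be informative and attributable?
RQ3: Can an open, human-in-the-loop dataset facilitate the development of end-to-end retrieval-augmented generation models with attribution?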

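Key Results

Answers were generated for 1,922 training questions and 716 development questions.
GPT-3.5 produced 3,214 training answers and 1,318 development answers (about 1.7–1.8 answers per question).
These answers contain 6,577 citations in the training split and 3,305 in the development split (about 2.0–2.5 citations per answer).
For informativeness, 84% (train) and 90% (dev) of answers were labeled Yes; for attributability, 73% (train) and 71% (dev) were labeled Yes.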
Conversely, roughly 10–16% of the GPT-3.5 answers were judged not informative and more than 20% (27–29%) fell short on attribution, which suggests room for improvement.
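
The per-question and per-answer ratios quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only the counts and Yes-rates reported in this review; the dictionary layout and the `summarize` helper are illustrative assumptions, and 6,577 and 3,305 are treated as citation counts. The "not Yes" percentages are simply the complements of the reported Yes-rates.

```python
# Counts and Yes-rates as reported in this review (not re-derived from the dataset).
counts = {
    "train": {"questions": 1922, "answers": 3214, "citations": 6577,
              "informative_yes": 84, "attributable_yes": 73},
    "dev": {"questions": 716, "answers": 1318, "citations": 3305,
            "informative_yes": 90, "attributable_yes": 71},
}


def summarize(split):
    """Derive ratios and complement percentages for one split."""
    c = counts[split]
    return {
        "answers_per_question": round(c["answers"] / c["questions"], 2),
        "citations_per_answer": round(c["citations"] / c["answers"], 2),
        "not_informative_pct": 100 - c["informative_yes"],
        "not_attributable_pct": 100 - c["attributable_yes"],
    }


for split in ("train", "dev"):
    print(split, summarize(split))
```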


This review was generated by AI and reviewed by a human editor.