QUICK REVIEW

[Paper Review] APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking

Can Jin, Hongwu Peng | arXiv (Cornell University) | 2024-06-20

Topic Modeling | Citations: 14

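One-Line Summary

This paper presents APEER, an automatic prompt engineering algorithm that iteratively refines prompts for LLM-based passage reranking, achieving substantial gains over manual prompts and showing strong transferability across datasets and models.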

ABSTRACT

Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly reranking, underexplored. Directly applying current prompt engineering algorithms to relevance ranking is challenging due to the integration of query and long passage pairs in the input, where the ranking complexity surpasses classification tasks. To reduce human effort and unlock the potential of prompt optimization in reranking, we introduce a novel automatic prompt engineering algorithm named APEER. APEER iteratively generates refined prompts through feedback and preference optimization. Extensive experiments with four LLMs and ten datasets demonstrate the substantial performance improvement of APEER over existing state-of-the-art (SoTA) manual prompts. Furthermore, we find that the prompts generated by APEER exhibit better transferability across diverse tasks and LLMs.

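Motivation and Objectives

Reduce the human effort required to design prompts for zero-shot LLM reranking in information retrieval.
Develop a standalone automatic prompt engineering framework that adapts prompts through feedback and preference optimization.
Demonstrate the effectiveness and transferability of the generated prompts across diverse datasets and model architectures.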

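Proposed Method

Iterative two-step prompt optimization: Feedback Optimization refines the current prompt based on model responses and feedback; Preference Optimization aligns the refined prompt with high-performing prompts using positive and negative demonstrations.
Initialize prompts with a positive set (based on state-of-the-art manual prompts) and a negative set (low-performing prompts) to guide the search.
Build training data from an MS MARCO-style subset by constructing prompts that contain query-passage groups and their relevance orderings.
Evaluate prompts in a listwise reranking setup with a fixed first-stage retriever (BM25) and a range of LLMs.
Assess cross-model and cross-dataset transferability of the generated prompts, and run ablations on Preference Optimization and training data size.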
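
The two-step loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function `optimize_prompt`, the three callables, the rule for updating the positive and negative demonstration sets, and the keyword-counting toy scorer are all assumptions made for this example. In a real setup each callable would wrap LLM calls and the score would be a reranking metric on the training queries.

```python
from typing import Callable, List, Tuple

# All three callables are hypothetical stand-ins; in practice each wraps LLM calls.
ScoreFn = Callable[[str], float]                      # prompt -> reranking metric on training data
RefineFn = Callable[[str], str]                       # Feedback Optimization: p -> p'
AlignFn = Callable[[str, List[str], List[str]], str]  # Preference Optimization: p' -> aligned prompt


def optimize_prompt(
    seed: str,
    score: ScoreFn,
    refine: RefineFn,
    align: AlignFn,
    positives: List[str],
    negatives: List[str],
    rounds: int = 3,
) -> Tuple[float, str]:
    """Return (best_score, best_prompt) after `rounds` refine-then-align iterations."""
    positives, negatives = list(positives), list(negatives)  # do not mutate caller lists
    scored: List[Tuple[float, str]] = [(score(seed), seed)]
    for _ in range(rounds):
        current_score, current = max(scored, key=lambda pair: pair[0])
        refined = refine(current)                       # step 1: feedback
        aligned = align(refined, positives, negatives)  # step 2: preference
        for candidate in (refined, aligned):
            candidate_score = score(candidate)
            scored.append((candidate_score, candidate))
            # Assumed update rule: candidates that beat the current best become
            # positive demonstrations, the rest become negative ones.
            target = positives if candidate_score > current_score else negatives
            target.append(candidate)
    return max(scored, key=lambda pair: pair[0])


# Toy stand-ins so the sketch runs without any LLM (purely illustrative).
def toy_score(prompt: str) -> float:
    keywords = ("rank", "relevance", "query", "passage")
    return float(sum(word in prompt.lower() for word in keywords))


def toy_refine(prompt: str) -> str:
    return prompt + " Judge relevance to the query."


def toy_align(prompt: str, positives: List[str], negatives: List[str]) -> str:
    return prompt + " Rank each passage."


best_score, best_prompt = optimize_prompt(
    "Sort these.", toy_score, toy_refine, toy_align,
    positives=["Rank the passages by relevance to the query."],
    negatives=["Sort these."],
)
```

Keeping every scored candidate, rather than only the latest one, means a poor refinement in one round cannot discard an earlier, better prompt.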

Figure 1: Performance overview of four prompting methods on GPT4, LLaMA3 (AI@Meta, 2024) and Qwen2 (qwe, 2024) models and BEIR datasets (Thakur et al., 2021). The manual prompt is RankGPT (Sun et al., 2023). Modifying the manual prompt with CoT and paraphrasing yields marginal gains.

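Experimental Results

Research Questions

RQ1: How can automatic prompt engineering improve zero-shot LLM reranking for passage relevance ranking beyond manual prompts?
RQ2: Can prompts produced by automatic optimization transfer effectively to other datasets and model architectures?
RQ3: What do feedback and preference optimization each contribute to prompt quality in reranking tasks?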

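Key Findings

APEER consistently improved over manual prompts on MS MARCO-derived tasks (in-domain) for the GPT-4, LLaMA3, and Qwen2 models, and also performed strongly on BEIR datasets (out-of-domain).
Across eight BEIR tasks, APEER improved average nDCG@10 by 5.29 over the manual prompt with GPT-4, with notable gains on the other models as well.
Feedback Optimization provides local prompt improvements, while Preference Optimization aligns prompts with high-quality exemplars; ablations confirm this effect.
Prompts generated by APEER transfer well across models (e.g., prompts trained on GPT-4 can be applied to GPT-3.5- and LLaMA3-based systems) and across datasets, from MS MARCO to BEIR.
APEER with GPT-4 achieved the best overall performance across prompting methods, models, and datasets.
According to the ablations, Preference Optimization contributes substantially to prompt quality, and increasing the training data size generally improves performance, at a cost trade-off.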
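
For readers unfamiliar with the nDCG@10 metric cited above, the following is a small sketch of how it can be computed for a single query. It assumes linear gain (one common variant) and assumes every judged passage already appears in the ranked list, so the ideal ordering is just the same labels sorted. The function names and the relevance labels are made up for illustration and are not taken from the paper.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded labels for four passages, in ranked order.
first_stage_order = [0, 1, 0, 2]  # e.g. the order from a first-stage retriever
reranked_order = [2, 1, 0, 0]     # the same passages after a good rerank

before = ndcg_at_k(first_stage_order)
after = ndcg_at_k(reranked_order)
```

A reranker only permutes the retrieved passages, so under these assumptions the ideal DCG is unchanged and any difference between `before` and `after` comes from the ordering alone.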

Figure 2: Overview of APEER. APEER iteratively refines prompts through two optimization steps. In Feedback Optimization, it refines the current prompt $p$ and creates a refined prompt $p^{\prime}$ based on feedback. In Preference Optimization, it further optimizes $p^{\prime}$ by learning preferenc


This review was generated by AI and reviewed by a human editor.