QUICK REVIEW

[Paper Review] RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

Shreyas Chaudhari, Pranjal Aggarwal | arXiv (Cornell University) | 2024. 04. 12.

Topic Modeling | Citations: 8

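One-Line Summary

This paper analyzes the fundamentals of RLHF for LLMs, focusing on the reward model: how it is trained, where it falls short, and what imperfect rewards imply within the RL framework.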

ABSTRACT

State-of-the-art large language models (LLMs) have become indispensable tools for various tasks. However, training LLMs to serve as effective assistants for humans requires careful consideration. A promising approach is reinforcement learning from human feedback (RLHF), which leverages human feedback to update the model in accordance with human preferences and mitigate issues like toxicity and hallucinations. Yet, an understanding of RLHF for LLMs is largely entangled with initial design choices that popularized the method and current research focuses on augmenting those choices rather than fundamentally improving the framework. In this paper, we analyze RLHF through the lens of reinforcement learning principles to develop an understanding of its fundamentals, dedicating substantial focus to the core component of RLHF -- the reward model. Our study investigates modeling choices, caveats of function approximation, and their implications on RLHF training algorithms, highlighting the underlying assumptions made about the expressivity of reward. Our analysis improves the understanding of the role of reward models and methods for their training, concurrently revealing limitations of the current methodology. We characterize these limitations, including incorrect generalization, model misspecification, and the sparsity of feedback, along with their impact on the performance of a language model. The discussion and analysis are substantiated by a categorical review of current literature, serving as a reference for researchers and practitioners to understand the challenges of RLHF and build upon existing efforts.

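Motivation and Objectives

Motivates RLHF as a way to address the objective mismatch of pretrained language models.
Examines RLHF from a Bayesian perspective to understand reward modeling and how feedback is incorporated.
Analyzes the role and limitations of the reward function in RLHF, along with how reward models are trained.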

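Proposed Method

Formulates text generation as a sequential decision-making process using Markov decision processes (MDPs).
Examines reward formulations, including oracular rewards, human feedback, and the Bradley–Terry model for pairwise (binary) preferences.
Treats reward modeling as a regression problem and analyzes the likelihood of the preference data.
Analyzes how imperfect rewards and function approximation affect RLHF performance.
Provides a literature survey of RLHF components and of alternatives to RL-based fine-tuning.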
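
To make the MDP view of text generation concrete, the following is a minimal sketch under stated assumptions, not the paper's implementation. It assumes the state is the prompt plus all tokens generated so far, the action is the next token, the transition is deterministic (append the token), and the reward is sparse (nonzero only when the episode ends). The names `State`, `rollout`, `toy_policy`, `toy_reward`, and the `<eos>` token are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"  # hypothetical end-of-sequence token (assumption)


@dataclass(frozen=True)
class State:
    """MDP state: the prompt followed by every token generated so far."""
    tokens: Tuple[str, ...]


def step(state: State, action: str) -> State:
    """Deterministic transition: append the chosen token to the context."""
    return State(state.tokens + (action,))


def rollout(
    prompt: Sequence[str],
    policy: Callable[[State], str],
    reward_fn: Callable[[State], float],
    max_len: int = 8,
) -> Tuple[State, List[float]]:
    """Generate one episode and return the final state and per-step rewards.

    The reward is sparse: every step gets 0.0 except the last one,
    where reward_fn scores the completed sequence.
    """
    state = State(tuple(prompt))
    rewards: List[float] = []
    for t in range(max_len):
        action = policy(state)          # the policy picks the next token
        state = step(state, action)
        done = action == EOS or t == max_len - 1
        rewards.append(reward_fn(state) if done else 0.0)
        if done:
            break
    return state, rewards


def toy_policy(state: State) -> str:
    """Illustrative policy: emit 'ok' until the context has 4 tokens, then stop."""
    return EOS if len(state.tokens) >= 4 else "ok"


def toy_reward(state: State) -> float:
    """Illustrative sequence-level reward: count of 'ok' tokens."""
    return float(state.tokens.count("ok"))


if __name__ == "__main__":
    final_state, step_rewards = rollout(["hello", "there"], toy_policy, toy_reward)
    print(final_state.tokens)
    print(step_rewards)
```

In this sketch only the final step carries a learning signal, which is one way to picture the feedback sparsity that the review lists among the limitations.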

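Experimental Results

Research Questions

RQ1: What form does Pr(D_HF | φ) take when estimating a reward model from human feedback?
RQ2: How do reward modeling choices and imperfect rewards affect RLHF training and language model alignment?
RQ3: What are the limitations and generalization problems of a reward model trained on limited human feedback?
RQ4: Where does RLHF fit in the broader context of removing the objective mismatch in pretrained LMs?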
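
For pairwise preference data, one common instance of the likelihood asked about in RQ1 is the Bradley–Terry model named in the method summary. The sketch below is an illustration under that assumption, not the paper's code: it takes scalar reward scores for a chosen and a rejected response (hypothetical inputs that a reward model would produce) and computes the preference probability and the mean negative log-likelihood. The function names are illustrative.

```python
import math
from typing import Sequence, Tuple


def log_sigmoid(d: float) -> float:
    """Numerically stable log(sigmoid(d))."""
    if d >= 0:
        return -math.log1p(math.exp(-d))
    return d - math.log1p(math.exp(d))


def preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry: P(chosen preferred) = sigmoid(r_chosen - r_rejected)."""
    return math.exp(log_sigmoid(r_chosen - r_rejected))


def negative_log_likelihood(pairs: Sequence[Tuple[float, float]]) -> float:
    """Mean negative log-likelihood over (r_chosen, r_rejected) score pairs.

    Assumes the pairs are independent, so the dataset likelihood factorizes.
    """
    total = sum(log_sigmoid(rc - rr) for rc, rr in pairs)
    return -total / len(pairs)


if __name__ == "__main__":
    # Hypothetical reward scores, for illustration only.
    scored_pairs = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0)]
    print(preference_prob(1.2, 0.3))
    print(negative_log_likelihood(scored_pairs))
```

Only the difference between the two scores enters the likelihood, so adding a constant to every reward leaves it unchanged.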

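Key Findings

The reward model is the core of RLHF, and its design choices impose fundamental limits on alignment.
Reward data is typically sparse, which leads to generalization problems and misgeneralization on unseen inputs.
Imperfect rewards, whether sparse or misspecified, can degrade a language model's performance and alignment.
The Bayesian interpretation highlights MAP estimation of the reward model parameters given human feedback data.
The paper documents the limitations of current RLHF practice and surveys alternatives to and extensions of RL that go beyond the reward model.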
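
The MAP estimate mentioned above can be written out as a short derivation. This is a sketch in standard Bayesian notation rather than a reproduction of the paper's equations. It assumes independent preference pairs under a Bradley–Terry likelihood and, for the last line only, a zero-mean Gaussian prior with variance \tau^2. The symbols R_\phi (the reward model), y_w and y_l (preferred and rejected responses), \sigma (the logistic function), and \tau are assumptions introduced for illustration.

```latex
\begin{aligned}
\hat{\phi}_{\mathrm{MAP}}
  &= \arg\max_{\phi} \Pr(\phi \mid D_{HF}) \\
  &= \arg\max_{\phi} \Pr(D_{HF} \mid \phi)\,\Pr(\phi) \\
  &= \arg\min_{\phi} \Big[ -\log \Pr(D_{HF} \mid \phi) - \log \Pr(\phi) \Big] \\
  &= \arg\min_{\phi} \Big[ -\sum_{(x,\, y_w,\, y_l) \in D_{HF}}
       \log \sigma\big( R_\phi(x, y_w) - R_\phi(x, y_l) \big)
       + \frac{1}{2\tau^2} \lVert \phi \rVert_2^2 \Big]
\end{aligned}
```

Under these assumptions the prior acts as an L2 regularizer, and with a flat prior the MAP estimate reduces to the maximum-likelihood estimate.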


This review was generated by AI and reviewed by a human editor.