QUICK REVIEW

[Paper Review] RRHF: Rank Responses to Align Language Models with Human Feedback without tears

Zheng Yuan, Hongyi Yuan | arXiv (Cornell University) | 2023. 04. 11.

Topic Modeling | Citations: 34

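One-Line Summary

RRHF scores multiple sampled responses by the model's log-probabilities and trains those scores to match a human-preference ranking, using a ranking loss together with supervised fine-tuning. It needs only 1 to 2 models, can draw responses from a variety of sources, and achieves PPO-comparable performance with simpler implementation and training requirements.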

ABSTRACT

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). However, PPO is sensitive to hyperparameters and requires multiple models in its standard implementation, making it hard to train and scale up to larger parameter counts. In contrast, we propose a novel learning paradigm called RRHF, which scores sampled responses from different sources via a logarithm of conditional probabilities and learns to align these probabilities with human preferences through ranking loss. RRHF can leverage sampled responses from various sources including the model responses from itself, other large language model responses, and human expert responses to learn to rank them. RRHF only needs 1 to 2 models during tuning and can efficiently align language models with human preferences robustly without complex hyperparameter tuning. Additionally, RRHF can be considered an extension of SFT and reward model training while being simpler than PPO in terms of coding, model counts, and hyperparameters. We evaluate RRHF on the Helpful and Harmless dataset, demonstrating comparable alignment performance with PPO by reward model score and human labeling. Extensive experiments show that the performance of RRHF is highly related to sampling quality which suggests RRHF is a best-of-n learner. Codes available at https://github.com/GanjinZero/RRHF.

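Motivation and Objectives

Align LLMs with human preferences through a simpler alternative to PPO-based RLHF.
Propose RRHF, which ranks multiple responses from diverse sources by their log-probabilities.
Show that RRHF achieves alignment performance comparable to PPO with fewer models and fewer hyperparameters.
Demonstrate RRHF's effectiveness on Anthropic's Helpful and Harmless dataset and analyze the impact of sampling quality.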

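Proposed Method

Sample multiple responses from diverse sources (e.g., the model itself, other LLMs, human experts).
Score each response y_i under the current model pi by its length-normalized log-probability: p_i = (1/|y_i|) * sum_t log P_pi(y_{i,t} | x, y_{i,<t}).
Optimize a ranking loss L_rank that pushes p_j above p_i whenever response j has a higher reward than response i: L_rank = sum_{r_i<r_j} max(0, p_i - p_j).
Add a supervised fine-tuning loss L_ft on the highest-reward response to preserve instruction-following fidelity.
The total loss is L = L_rank + L_ft; the ranking loss has no margin term, and no separate value model or KL term is required.
RRHF can be viewed as an extension of SFT and as a lightweight alternative to PPO that avoids multiple models and complex hyperparameter tuning.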
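
The loss described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: it assumes the per-token log-probabilities of each response have already been computed by the model, the function names (`length_normalized_score`, `ranking_loss`, `sft_loss`, `rrhf_loss`) are hypothetical, and taking L_ft as the summed negative log-likelihood of the best-reward response is an assumption, since the review does not spell out its exact form.

```python
from typing import List, Sequence


def length_normalized_score(token_logprobs: Sequence[float]) -> float:
    """p_i: mean per-token log-probability of one response."""
    return sum(token_logprobs) / len(token_logprobs)


def ranking_loss(scores: Sequence[float], rewards: Sequence[float]) -> float:
    """L_rank = sum over pairs with r_i < r_j of max(0, p_i - p_j)."""
    loss = 0.0
    for p_i, r_i in zip(scores, rewards):
        for p_j, r_j in zip(scores, rewards):
            if r_i < r_j:
                # Penalize only when the lower-reward response scores higher.
                loss += max(0.0, p_i - p_j)
    return loss


def sft_loss(all_token_logprobs: Sequence[Sequence[float]],
             rewards: Sequence[float]) -> float:
    """L_ft (assumed form): summed negative log-likelihood of the best response."""
    best = max(range(len(rewards)), key=lambda k: rewards[k])
    return -sum(all_token_logprobs[best])


def rrhf_loss(all_token_logprobs: Sequence[Sequence[float]],
              rewards: Sequence[float]) -> float:
    """L = L_rank + L_ft, with no margin, value model, or KL term."""
    scores: List[float] = [length_normalized_score(lp) for lp in all_token_logprobs]
    return ranking_loss(scores, rewards) + sft_loss(all_token_logprobs, rewards)


# Made-up example: three responses to one prompt, with toy log-probs and rewards.
example_logprobs = [[-0.5, -0.5], [-1.0, -2.0, -3.0], [-0.25, -0.75]]
example_rewards = [0.1, 0.9, 0.5]
print(rrhf_loss(example_logprobs, example_rewards))  # 9.0
```

In this toy case the highest-reward response has the lowest score (-2.0 versus -0.5 for the other two), so both pairs involving it contribute 1.5 to L_rank, and L_ft adds 6.0. In real training these quantities would be differentiable tensors rather than Python floats.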

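Experimental Results

Research Questions

RQ1: Can RRHF, by ranking responses on their log-probabilities with a minimal number of models, achieve alignment performance comparable to PPO?
RQ2: How does the quality of the sampled responses affect RRHF performance?
RQ3: Can RRHF learn human preference rankings from diverse sources such as the model itself, other LLMs, and humans?
RQ4: Is RRHF simpler than PPO to implement and scale while maintaining similar results?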

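Key Findings

RRHF with diverse sampling (DP or SP) reaches reward levels on par with PPO on the HH dataset.
RRHF performance varies with the quality of the sampled responses and approaches the maximum reward among the samples.
RRHF requires only 1 to 2 models and substantially less code and hyperparameter tuning than PPO.
The ranking loss is essential; removing it degrades performance.
Iterated training (RRHF IP-2) further improves human evaluation results over single-pass RRHF.
Wombat, a model trained with RRHF on samples from ChatGPT, InstructGPT, LLaMA, and Alpaca, can outperform the SFT baseline under similar resources.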


This review was generated by AI and reviewed by a human editor.