QUICK REVIEW

[Paper Review] Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

Rajkumar Ramamurthy, Prithviraj Ammanabrolu|arXiv (Cornell University)|2022. 10. 03.

Topic Modeling · Citations: 54

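One-Line Summary

This paper introduces RL4LMs, an open-source library for training language models with reinforcement learning; GRUE, a benchmark for evaluating RL-based alignment of NLP models with human preferences; and NLPO, an on-policy RL algorithm with token-level masking designed to improve on PPO's stability and performance in LM alignment tasks.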

ABSTRACT

We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question rises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization) that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al. 2017)), based on both automatic and human evaluations.

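Motivation and Goals

Show that RL can be effective for aligning pre-trained LMs with human preferences.
Provide an open-source, modular toolkit for RL-based LM optimization.
Introduce GRUE as a benchmark of RL-based NLP tasks driven by human-preference rewards.
Propose NLPO to tame the large action space in language generation and improve training stability.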

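Proposed Method

Develop RL4LMs, an on-policy RL toolkit compatible with HuggingFace models and stable-baselines-3.
Model language generation as a token-level MDP with either token-level or sequence-level rewards.
Introduce NLPO, a masked variant of PPO that uses top-p masking to shrink the action space during training.
Define a KL-regularized reward that balances the task reward against staying close to the base LM.
Build GRUE, a multi-task benchmark that combines a range of reward-based evaluations with human studies.
Provide extensive ablation experiments and analysis comparing PPO, NLPO, and supervised+RL setups.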
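
The top-p masking step listed above can be illustrated with a small, self-contained sketch. This is not the RL4LMs implementation. It assumes, as one plausible reading, that a "mask" distribution picks the smallest set of top tokens whose cumulative probability reaches p, and that the learner policy's distribution is then restricted to that set and renormalized. The function names (`top_p_mask`, `apply_mask`) and the toy probabilities are hypothetical.

```python
def top_p_mask(mask_probs, p=0.9):
    """Return a keep-mask over the vocabulary: the smallest set of
    highest-probability tokens whose cumulative probability reaches p."""
    order = sorted(range(len(mask_probs)), key=lambda i: mask_probs[i], reverse=True)
    keep = [False] * len(mask_probs)
    cumulative = 0.0
    for i in order:
        keep[i] = True
        cumulative += mask_probs[i]
        if cumulative >= p:
            break
    return keep


def apply_mask(policy_probs, keep):
    """Zero out masked tokens and renormalize what remains."""
    kept = [q if k else 0.0 for q, k in zip(policy_probs, keep)]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("mask removed all probability mass")
    return [q / total for q in kept]


# Toy 5-token vocabulary; all numbers are made up for illustration.
mask_probs = [0.5, 0.3, 0.1, 0.06, 0.04]      # distribution used to build the mask
policy_probs = [0.4, 0.35, 0.15, 0.05, 0.05]  # distribution actions are sampled from
keep = top_p_mask(mask_probs, p=0.75)
masked = apply_mask(policy_probs, keep)
print(keep)
print(masked)
```

With p = 0.75 only the two most likely tokens survive, so sampling during training happens over 2 actions instead of 5. On a real vocabulary the same idea would cut tens of thousands of candidate tokens down to a much smaller set.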

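Experimental Results

Research Questions

RQ1: Across diverse NLP tasks, can RL techniques outperform supervised fine-tuning at aligning LMs with human preferences?
RQ2: In language generation with a large action space, does NLPO offer stability and performance advantages over PPO?
RQ3: To what extent do reward quality, KL regularization toward the base LM, and masking affect RL stability and alignment quality?
RQ4: Do RL approaches improve data efficiency or parameter efficiency compared with purely supervised methods?
RQ5: In RL-based language policy optimization, how well do automated metrics correlate with human judgments?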

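Key Results

Across the evaluated tasks, RL methods are generally better than supervised methods at aligning LMs with human preferences.
NLPO shows greater stability and performance than PPO in both automatic and human evaluations.
The KL penalty and task-specific (top-p) masking help mitigate reward hacking and improve alignment quality.
Supervised warm-up and data-efficient reward learning can yield strong performance even with smaller models.
When the reward model is improved, RL can be more data-efficient than supervised learning, and NLPO combined with supervision can outperform larger supervised models on some tasks.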
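
The KL penalty mentioned above can be sketched in a few lines. This is a minimal illustration, not the paper's exact formulation. It assumes a sequence-level task reward paid only on the final token, and a per-token KL estimate given by the difference between the policy's and the base LM's log-probabilities of the sampled token. The function name `kl_penalized_rewards` and the coefficient `beta` are hypothetical.

```python
import math


def kl_penalized_rewards(logp_policy, logp_ref, final_reward, beta=0.1):
    """Per-token rewards: the task reward on the last token, minus
    beta * (log pi_policy - log pi_ref) on every token."""
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob sequences must have the same length")
    last = len(logp_policy) - 1
    rewards = []
    for t, (lp, lr) in enumerate(zip(logp_policy, logp_ref)):
        task = final_reward if t == last else 0.0
        rewards.append(task - beta * (lp - lr))
    return rewards


# Two-token episode with made-up probabilities: the policy agrees with the
# base LM on token 0 and is twice as confident as the base LM on token 1.
logp_policy = [math.log(0.5), math.log(0.2)]
logp_ref = [math.log(0.5), math.log(0.1)]
rewards = kl_penalized_rewards(logp_policy, logp_ref, final_reward=1.0, beta=0.1)
print(rewards)
```

The further the policy drifts from the base LM on a token, the more reward it gives up there. This is the mechanism by which a KL term discourages degenerate text that merely exploits the reward function.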


This review was generated by AI and checked by a human editor.