QUICK REVIEW

[Paper Review] SimPO: Simple Preference Optimization with a Reference-Free Reward

Yu Meng, Mengzhou Xia|arXiv (Cornell University)|2024. 05. 23.

Constraint Satisfaction and Optimization | Citations: 8

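One-Line Summary

SimPO proposes a simple, reference-free reward based on the average log probability of a sequence, adds a target reward margin, and consistently outperforms DPO across multiple open benchmarks and model families.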

ABSTRACT

Direct Preference Optimization (DPO) is a widely used offline preference optimization algorithm that reparameterizes reward functions in reinforcement learning from human feedback (RLHF) to enhance simplicity and training stability. In this work, we propose SimPO, a simpler yet more effective approach. The effectiveness of SimPO is attributed to a key design: using the average log probability of a sequence as the implicit reward. This reward formulation better aligns with model generation and eliminates the need for a reference model, making it more compute and memory efficient. Additionally, we introduce a target reward margin to the Bradley-Terry objective to encourage a larger margin between the winning and losing responses, further improving the algorithm's performance. We compare SimPO to DPO and its latest variants across various state-of-the-art training setups, including both base and instruction-tuned models such as Mistral, Llama 3, and Gemma 2. We evaluate on extensive chat-based evaluation benchmarks, including AlpacaEval 2, MT-Bench, and Arena-Hard. Our results demonstrate that SimPO consistently and significantly outperforms existing approaches without substantially increasing response length. Specifically, SimPO outperforms DPO by up to 6.4 points on AlpacaEval 2 and by up to 7.5 points on Arena-Hard. Our top-performing model, built on Gemma-2-9B-it, achieves a 72.4% length-controlled win rate on AlpacaEval 2, a 59.1% win rate on Arena-Hard, and ranks 1st on Chatbot Arena among <10B models with real user votes.

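Motivation and Goals

Motivates offline preference optimization as a simpler alternative to the full RLHF pipeline.
Proposes a reward aligned with the generation metric, using the length-normalized average log probability.
Introduces a target reward margin γ to better separate winning and losing responses.
Demonstrates robustness and performance gains on standard benchmarks for both base and instruction-tuned models.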

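Proposed Method

Defines an implicit, reference-free reward r_SimPO(x,y) = (β/|y|) log π_θ(y|x), so that the training objective matches the average log-likelihood that guides generation.
Adds a target margin γ to the Bradley-Terry (BT) objective, encouraging the reward gap to satisfy r(x,y_w) − r(x,y_l) ≥ γ.
Trains on offline preference data with the BT objective, with no separate reward model or reference policy.
Evaluates on base and instruction-tuned models (Llama3-8B-Instruct, Mistral-7B) across benchmarks (AlpacaEval 2, Arena-Hard, MT-Bench).
Compares SimPO with DPO and other offline methods, tuning β (2.0–2.5) and γ (0.5–1.5) for best performance.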
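The reward and margin described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' training code: it assumes the loss takes the usual Bradley-Terry form -log σ(r_w − r_l − γ), and the function names and the per-token log-probabilities are hypothetical values chosen for the example.

```python
import math

def avg_logprob_reward(token_logps, beta):
    """Length-normalized implicit reward: (beta / |y|) * sum of token log-probs."""
    if not token_logps:
        raise ValueError("response must contain at least one token")
    return beta * sum(token_logps) / len(token_logps)

def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma=0.5):
    """Bradley-Terry loss with a target margin: -log sigmoid(r_w - r_l - gamma)."""
    r_w = avg_logprob_reward(chosen_logps, beta)
    r_l = avg_logprob_reward(rejected_logps, beta)
    z = r_w - r_l - gamma
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical per-token log-probabilities under the policy being trained
chosen = [-0.2, -0.4, -0.3]            # winning response, 3 tokens
rejected = [-1.0, -1.5, -0.5, -1.0]    # losing response, 4 tokens
loss = simpo_loss(chosen, rejected, beta=2.0, gamma=0.5)
```

No reference-model log-probabilities appear anywhere in the computation, which is where the compute and memory savings over DPO come from. A real implementation would compute the token log-probabilities from model logits in batched tensor form.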

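Experimental Results

Research Questions

RQ1: Does aligning the training reward with the generation metric (average log-likelihood) improve performance over DPO?
RQ2: What is the effect of removing the reference model and using a length-normalized reward?
RQ3: How does introducing the target reward margin γ affect reward accuracy and generation quality?
RQ4: Do SimPO's gains generalize across base and instruction-tuned models and across multiple benchmarks?

Key Results

SimPO consistently outperforms DPO and related methods on the AlpacaEval 2, Arena-Hard, and MT-Bench benchmarks.
SimPO improves on DPO by up to 6.4 points in length-controlled (LC) win rate on AlpacaEval 2 and by up to 7.5 points on Arena-Hard.
The top-performing model built on Llama3-8B-Instruct achieves a 44.7% LC win rate on AlpacaEval 2 and a 33.8% win rate on Arena-Hard, ahead of several competitors.
Length normalization is critical: removing it produces longer, more repetitive outputs and worse reward alignment.
Increasing the margin γ improves reward accuracy, but setting it too high can lower the win rate, indicating a trade-off between reward calibration and generation quality.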
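To make the two quantities in the last result concrete, the sketch below computes reward accuracy and the share of pairs that already meet a given margin. It assumes reward accuracy means the fraction of preference pairs in which the winning response receives the higher reward; the helper names and the log-probability values are hypothetical. It does not reproduce the reported trade-off, which only shows up after training and evaluation.

```python
def reward(token_logps, beta):
    # Length-normalized reward from the method section: (beta / |y|) * log pi(y|x)
    return beta * sum(token_logps) / len(token_logps)

def reward_accuracy(pairs, beta):
    """Fraction of (chosen, rejected) pairs ranked correctly (assumed definition)."""
    hits = sum(reward(w, beta) > reward(l, beta) for w, l in pairs)
    return hits / len(pairs)

def margin_satisfied(pairs, beta, gamma):
    """Fraction of pairs whose reward gap already reaches the target margin gamma."""
    hits = sum(reward(w, beta) - reward(l, beta) >= gamma for w, l in pairs)
    return hits / len(pairs)

# Hypothetical token log-probabilities for three preference pairs (beta = 2.0)
pairs = [
    ([-0.2, -0.4, -0.3], [-1.0, -1.5, -0.5, -1.0]),  # reward gap about 1.4
    ([-0.5, -0.5], [-0.6, -0.8]),                    # reward gap about 0.4
    ([-1.2, -0.8], [-0.5, -0.7]),                    # reward gap about -0.8 (misranked)
]
acc = reward_accuracy(pairs, beta=2.0)
frac_small = margin_satisfied(pairs, beta=2.0, gamma=0.5)
frac_large = margin_satisfied(pairs, beta=2.0, gamma=1.5)
```

In this toy data two of the three pairs are ranked correctly, yet fewer pairs clear the margin as γ grows, so a larger γ keeps pushing the policy on pairs it already orders correctly.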


This review was generated by AI and checked by a human editor.