QUICK REVIEW

[Paper Review] KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu|arXiv (Cornell University)|2024. 02. 02.

Global trade and economics | Citations: 21

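One-line summary

KTO (Kahneman-Tversky Optimization) is a HALO-based loss that uses a binary desirability signal to directly optimize an objective inspired by human utility. Without paired preference data, it matches or exceeds DPO on models from 1B to 30B parameters.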

ABSTRACT

Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.

Motivation and objectives

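Explain why existing preference-based alignment losses implicitly encode human biases, which motivates the family of human-aware losses (HALOs).
Propose Kahneman-Tversky Optimization (KTO), a HALO that uses a binary signal to directly maximize the utility of generations.
Show that KTO matches or exceeds DPO across model scales (1B–30B) and data settings.
Show that KTO works with relatively imbalanced data and without paired preference data, reducing reliance on costly human preference annotations.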

Method

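Model the loss in a human-aware way using a Kahneman-Tversky value function, which reflects loss aversion and how gains are perceived relative to a reference point.
Derive the KTO loss from the implied reward $r_{\mathrm{KTO}}(x,y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}$ together with a KL-based reference term that keeps the policy close to the reference policy.
Replace the traditional preference likelihood with a utility-based objective, set in a KL-regularized framework, built from a logistic value function $v_{\mathrm{KTO}}$ and a reference point $z_{\mathrm{ref}}$.
Implement KTO with binary desirability labels (desirable/undesirable) and estimate the KL term from mismatched input-output pairs within each batch, which stabilizes training.
Show how $\beta$, $\lambda_D$, $\lambda_U$, and the data composition (desirable vs. undesirable) affect training and data efficiency.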
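The loss outlined in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `kto_loss`, its argument names, and the default values of `beta`, `lambda_d`, and `lambda_u` are hypothetical. The sketch also assumes that sequence-level log-probabilities are already computed, that the KL term is estimated as the batch mean log-ratio over mismatched pairs and clamped at zero, and that each example's loss is the weight times one minus its value.

```python
import math


def sigmoid(z):
    """Logistic function used as the value function in this sketch."""
    return 1.0 / (1.0 + math.exp(-z))


def kto_loss(policy_logps, ref_logps, desirable,
             kl_policy_logps, kl_ref_logps,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Mean KTO-style loss over a batch (illustrative sketch).

    policy_logps / ref_logps: log pi_theta(y|x) and log pi_ref(y|x) per example.
    desirable: booleans, True if the output y is labeled desirable.
    kl_policy_logps / kl_ref_logps: the same quantities for mismatched
        (x, y') pairs, used only to estimate the KL reference term.
    """
    # Reference point z_ref: beta times a batch estimate of the KL term.
    # Assumption: clamp at zero. In a gradient-based trainer this value
    # would be treated as a constant (no gradient flows through it).
    kl_est = sum(p - r for p, r in zip(kl_policy_logps, kl_ref_logps))
    kl_est /= len(kl_policy_logps)
    z_ref = beta * max(0.0, kl_est)

    total = 0.0
    for logp, ref_logp, is_desirable in zip(policy_logps, ref_logps, desirable):
        # Implied reward r_KTO(x, y) = beta * log(pi_theta / pi_ref).
        reward = beta * (logp - ref_logp)
        if is_desirable:
            # Desirable outputs gain value as the reward rises above z_ref.
            value, weight = sigmoid(reward - z_ref), lambda_d
        else:
            # Undesirable outputs gain value as the reward falls below z_ref.
            value, weight = sigmoid(z_ref - reward), lambda_u
        total += weight * (1.0 - value)
    return total / len(policy_logps)
```

When the policy equals the reference model, every reward and the KL estimate are zero, so each example contributes `0.5 * weight`. Raising the probability of a desirable output lowers the loss, and raising the probability of an undesirable one increases it.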

Research questions

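Can KTO match or exceed DPO across model scales from 1B to 30B parameters using only a binary desirability signal?
How does SFT+KTO compare with SFT+DPO, and can KTO succeed without supervised finetuning?
Is KTO robust to data imbalance, and can it make effective use of binary signals that are not paired preferences?
Does KTO, which directly maximizes utility, offer theoretical and empirical advantages over preference-based methods under noisy or atypical feedback?
Under what practical data and hyperparameter conditions does KTO match or outperform DPO?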

Key findings

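KTO matches or exceeds DPO at scales from 1B to 30B parameters.
KTO handles extreme data imbalance, matching DPO even when up to 90% of the desirable examples are removed.
On certain Llama models, KTO can sometimes match or outperform DPO even without supervised finetuning (SFT).
Offline PPO with dummy rewards is competitive with DPO on most models, the exception being the largest model, Llama-30B.
KTO trained on a binary desirability signal performs strongly on several benchmarks, including MMLU, GSM8K, HumanEval, and BBH, and sometimes beats the SFT objective even when the data come from non-preference signals.
The theoretical analysis explains why maximizing human utility via KTO can be preferable to maximizing preference likelihood in open-ended settings.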

Proposed method

Figure 1: The traditional pipeline for LLM alignment starts with supervised finetuning, followed by fitting the LLM to paired preference data using a method such as RLHF or DPO. However, the paired preferences that existing approaches need are hard-to-get. Kahneman-Tversky Optimization (KTO) only ne

Experimental results

Main results

Method	Winrate vs. SFT Target (Mistral-7B, OpenAssistant)
Mistral-7B (unaligned)	0.525 ± 0.037
Mistral-7B + DPO	0.600 ± 0.037
Mistral-7B + KTO (all y per x)	0.652 ± 0.036
Mistral-7B + KTO (one y per x)	0.631 ± 0.036
Mistral-7B-Instruct	0.621 ± 0.031
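This review does not say how the ± values in the table were computed. As a rough aid for reading them, the sketch below shows one common way to attach an interval to a winrate: a normal-approximation interval for a binomial proportion. The function name `winrate_interval`, the counts in the usage note, and the default `z=1.645` (a two-sided 90% level) are assumptions made for illustration, not details taken from the paper.

```python
import math


def winrate_interval(wins, total, z=1.645):
    """Return (winrate, half_width) under a normal approximation.

    Assumption: each comparison is an independent win/loss trial.
    z=1.645 corresponds to a two-sided 90% interval.
    """
    p = wins / total
    half_width = z * math.sqrt(p * (1.0 - p) / total)
    return p, half_width
```

For example, `winrate_interval(50, 100)` gives a winrate of 0.5 with a half-width of about 0.082. Two rows whose intervals overlap heavily, as several in the table do, should not be read as clearly separated.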

Figure 2: The utility that a human gets from the outcome of a random variable, as imputed by the value function implicit in HALOs. Notice that the imputed functions share properties such as loss aversion with the human value functions that Kahneman & Tversky empirically derived (1992).
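For comparison with the imputed functions in Figure 2, the classic value function from Tversky & Kahneman (1992) can be sketched as below. The defaults `alpha=0.88` and `lam=2.25` are the median estimates commonly quoted from that work. The function name `kt_value` is hypothetical, and the sketch assumes a single curvature parameter for both gains and losses.

```python
def kt_value(z, alpha=0.88, lam=2.25):
    """Subjective value of an outcome z measured relative to a reference point.

    Gains (z >= 0) are valued as z**alpha, which is concave for alpha < 1.
    Losses are scaled by lam > 1, which encodes loss aversion.
    """
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** alpha)
```

With these defaults, `kt_value(1.0)` is `1.0` and `kt_value(-1.0)` is `-2.25`: a loss of one unit weighs 2.25 times as much as a gain of the same size. That asymmetry is the loss aversion the caption refers to.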

This review was generated by AI and checked by a human editor.