QUICK REVIEW

[Paper Review] KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu | arXiv (Cornell University) | Feb 2, 2024

Global trade and economics | Cited by 21

One-Sentence Summary

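The paper introduces Kahneman-Tversky Optimization (KTO), a HALO-based loss function that uses binary desirability signals to directly optimize an objective inspired by human utility; it matches or exceeds DPO on models from 1B to 30B parameters and requires no preference data.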

ABSTRACT

Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.

Research Motivation and Goals

Explain why existing preference-based alignment losses implicitly encode human biases, which places them in the family of human-aware losses (HALOs).
Propose Kahneman-Tversky Optimization (KTO) as a HALO that maximizes generation utility directly using binary signals.
Demonstrate KTO’s performance parity or superiority to DPO across model scales (1B–30B) and data regimes.
Show that KTO can operate with imbalanced or non-preference data and reduce reliance on costly human preferences.

Proposed Method

Formulate HALOs as human-aware losses using a Kahneman-Tversky value function to model loss aversion and concavity in gains.
Derive the KTO loss, which uses the implied reward r_KTO(x,y) = β log(πθ(y|x)/πref(y|x)) and a KL-based reference term to balance exploration against staying close to the reference policy.
Replace the traditional preference likelihood with a utility-based objective that incorporates a logistic v_KTO and a reference-point z_ref in a KL-regularized framework.
Implement KTO with binary desirability signals (desirable/undesirable) and estimate the KL term with batch-based mismatched inputs to stabilize training.
Show how hyperparameters β, λ_D, λ_U and data composition (desirable vs undesirable) influence learning and data efficiency.
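
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kto_loss` and `estimate_z_ref`, the default `beta=0.1`, the clamping of the KL estimate at zero, and the choice to fold β into z_ref are all assumptions made for this sketch. It takes summed log-probabilities of each completion under the policy and the reference model, forms r_KTO, applies the logistic value v_KTO relative to z_ref, and weights each example by λ_D or λ_U.

```python
import numpy as np


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def estimate_z_ref(policy_logps_mismatched, ref_logps_mismatched, beta=0.1):
    """Batch estimate of the reference point z_ref (assumed form).

    Inputs are log-probabilities of completions paired with mismatched
    prompts from the same batch. The mean log-ratio is scaled by beta and
    clamped at zero, since a KL divergence cannot be negative.
    """
    log_ratio = np.asarray(policy_logps_mismatched, dtype=float) - np.asarray(
        ref_logps_mismatched, dtype=float
    )
    return max(0.0, beta * float(np.mean(log_ratio)))


def kto_loss(policy_logps, ref_logps, desirable, z_ref,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Sketch of a KTO-style loss over a batch.

    policy_logps, ref_logps: log pi(y|x) per completion, shape (batch,)
    desirable: boolean per completion (True = desirable)
    z_ref: scalar reference point, treated as a constant
    """
    policy_logps = np.asarray(policy_logps, dtype=float)
    ref_logps = np.asarray(ref_logps, dtype=float)
    desirable = np.asarray(desirable, dtype=bool)

    # r_KTO(x, y) = beta * log(pi_theta(y|x) / pi_ref(y|x))
    r = beta * (policy_logps - ref_logps)
    # Logistic value: gains are measured above z_ref, losses below it.
    v = np.where(desirable, sigmoid(r - z_ref), sigmoid(z_ref - r))
    # Per-example weight lambda_D or lambda_U.
    w = np.where(desirable, lambda_d, lambda_u)
    return float(np.mean(w * (1.0 - v)))


# Usage with made-up numbers: one desirable and one undesirable completion.
z = estimate_z_ref([-3.0, -3.5], [-3.2, -3.4])
loss = kto_loss([-1.0, -5.0], [-2.0, -4.0], [True, False], z)
```

In a real training loop the log-probabilities would come from the model and z_ref would be excluded from backpropagation; the sketch only shows the arithmetic of the objective.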

Figure 1: The traditional pipeline for LLM alignment starts with supervised finetuning, followed by fitting the LLM to paired preference data using a method such as RLHF or DPO. However, the paired preferences that existing approaches need are hard-to-get. Kahneman-Tversky Optimization (KTO) only ne

Experimental Results

Research Questions

RQ1: Can KTO match or exceed DPO performance across model scales from 1B to 30B parameters using only binary desirability signals?
RQ2: How does SFT+KTO compare with SFT+DPO, and can KTO succeed without supervised finetuning?
RQ3: Is KTO robust to imbalanced data, and can it leverage non-preference binary signals effectively?
RQ4: Does direct utility maximization via KTO provide theoretical and empirical advantages over preference-based methods under noisy or intransitive feedback?
RQ5: What are the practical data and hyperparameter conditions under which KTO outperforms or matches DPO?

Key Findings

Method	Win rate vs. SFT target (Mistral-7B, OpenAssistant)
Mistral-7B (unaligned)	0.525 ± 0.037
Mistral-7B + DPO	0.600 ± 0.037
Mistral-7B + KTO (all y per x)	0.652 ± 0.036
Mistral-7B + KTO (one y per x)	0.631 ± 0.036
Mistral-7B-Instruct	0.621 ± 0.031

KTO matches or exceeds DPO at model scales from 1B to 30B parameters.
KTO can handle extreme data imbalances, matching DPO with up to 90% fewer desirable examples.
KTO can sometimes match or outperform DPO even without supervised finetuning (SFT) prior to alignment on certain Llama models.
Offline PPO with dummy rewards can perform comparably to DPO for most models except the largest (Llama-30B).
KTO trained with binary desirability signals yields strong performance across multiple benchmarks, including MMLU, GSM8K, HumanEval, and BBH, sometimes surpassing SFT targets even when data comes from non-preference signals.
The paper's theoretical analysis offers an explanation for why maximizing human utility via KTO can outperform maximizing preference likelihood in open-ended settings.
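
The findings on data imbalance raise the practical question of how to set λ_D and λ_U when desirable and undesirable examples are far from evenly split. The summary above does not give a rule, so the following is only one plausible heuristic, stated as an assumption: choose the weights so that the weighted counts λ_D·n_D and λ_U·n_U stand in a chosen ratio. The function name `suggest_kto_weights` and the `target_ratio` parameter are hypothetical and exist only for this sketch.

```python
def suggest_kto_weights(n_desirable, n_undesirable, target_ratio=1.0):
    """Return (lambda_d, lambda_u) such that
    (lambda_d * n_desirable) / (lambda_u * n_undesirable) == target_ratio.

    The weight of the larger class is fixed at 1.0 and the smaller class
    is up-weighted. This is an illustrative heuristic, not a rule taken
    from the text above.
    """
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("both classes need at least one example")
    if target_ratio <= 0:
        raise ValueError("target_ratio must be positive")

    if n_desirable <= n_undesirable:
        lambda_u = 1.0
        lambda_d = target_ratio * n_undesirable / n_desirable
    else:
        lambda_d = 1.0
        lambda_u = n_desirable / (target_ratio * n_undesirable)
    return lambda_d, lambda_u


# Usage: a dataset with ten times more undesirable than desirable examples.
lambda_d, lambda_u = suggest_kto_weights(n_desirable=100, n_undesirable=1000)
```

Raising `target_ratio` above 1 tilts the weighting further toward desirable examples; whether that helps would have to be checked empirically for a given dataset.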

Figure 2: The utility that a human gets from the outcome of a random variable, as imputed by the value function implicit in HALOs. Notice that the imputed functions share properties such as loss aversion with the human value functions that Kahneman & Tversky empirically derived (1992).
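
For readers without the figure, the shape it refers to can be reproduced with a short sketch of the value function from Tversky & Kahneman (1992). The exponent 0.88 and loss-aversion coefficient 2.25 are the median estimates reported in that work; the function name `kt_value` and the use of a single exponent for both gains and losses are simplifications made for this illustration. This is the classical human value function, not the logistic v_KTO used in the KTO loss.

```python
def kt_value(z, alpha=0.88, lam=2.25):
    """Kahneman-Tversky value of an outcome z relative to a reference point at 0.

    Gains are valued as z**alpha (concave for alpha < 1).
    Losses are valued as -lam * (-z)**alpha, so a loss weighs more
    than a gain of the same size when lam > 1 (loss aversion).
    """
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** alpha)


# Usage: compare a gain and a loss of the same magnitude.
gain = kt_value(1.0)
loss = kt_value(-1.0)
loss_hurts_more = abs(loss) > gain
```

The two properties visible in the figure follow directly: `abs(kt_value(-z)) > kt_value(z)` for any positive z (loss aversion), and doubling a gain less than doubles its value (diminishing sensitivity).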

This review was generated by AI and reviewed by a human editor.