QUICK REVIEW

[Paper Review] KTO: Model Alignment as Prospect Theoretic Optimization

Kawin Ethayarajh, Winnie Xu|arXiv (Cornell University)|Feb 2, 2024
Global trade and economics | Cited by 21
One-Sentence Summary

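The paper introduces Kahneman-Tversky Optimization (KTO), a HALO-based loss function that uses binary desirability signals to directly optimize an objective inspired by human utility. It matches or exceeds DPO on models from 1B to 30B parameters and does not require preference data.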

ABSTRACT

Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.

Research Motivation and Objectives

  • Motivate why existing preference-based alignment losses implicitly encode human biases (HALOs).
  • Propose Kahneman-Tversky Optimization (KTO) as a HALO that maximizes generation utility directly using binary signals.
  • Demonstrate that KTO matches or exceeds DPO across model scales (1B–30B) and data regimes.
  • Show that KTO can operate with imbalanced or non-preference data and reduce reliance on costly human preferences.

Proposed Method

  • Formulate HALOs as human-aware losses using a Kahneman-Tversky value function to model loss aversion and concavity in gains.
  • Derive KTO loss that uses r_KTO(x,y)=β log(πθ(y|x)/πref(y|x)) and a KL-based reference term to balance exploration and staying close to the reference policy.
  • Replace the traditional preference likelihood with a utility-based objective that incorporates a logistic v_KTO and a reference-point z_ref in a KL-regularized framework.
  • Implement KTO with binary desirability signals (desirable/undesirable) and estimate the KL term with batch-based mismatched inputs to stabilize training.
  • Show how hyperparameters β, λ_D, λ_U and data composition (desirable vs undesirable) influence learning and data efficiency.
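
The bullets above name the pieces of the loss: the implied reward r_KTO, the reference point z_ref, the logistic value v_KTO, and the weights λ_D and λ_U. The following is a minimal sketch of how they might fit together, assuming the loss takes the form w(y)·(1 − v_KTO(x, y)) with v_KTO = σ(r_KTO − z_ref) for desirable outputs and σ(z_ref − r_KTO) for undesirable ones. The function names, the default β = 0.1, and the clamping of the z_ref estimate at zero are illustrative assumptions, not the authors' reference implementation; consult the paper for the exact formulation.

```python
import math


def sigmoid(t: float) -> float:
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def estimate_z_ref(policy_logps_mismatched, ref_logps_mismatched, beta=0.1):
    """Batch estimate of the reference point z_ref.

    Inputs are log-probabilities of outputs paired with *mismatched* prompts
    from the same batch (assumed setup). The mean scaled log-ratio is clamped
    at zero because a KL divergence cannot be negative (assumption).
    """
    diffs = [beta * (lp - lr)
             for lp, lr in zip(policy_logps_mismatched, ref_logps_mismatched)]
    return max(0.0, sum(diffs) / len(diffs))


def kto_loss(policy_logps, ref_logps, desirable, z_ref,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Mean KTO-style loss over a batch of (x, y, label) examples.

    policy_logps[i], ref_logps[i]: log pi(y_i | x_i) under policy / reference.
    desirable[i]: True if y_i is labelled desirable, False if undesirable.
    z_ref: reference point, treated as a constant (no gradient in practice).
    """
    total = 0.0
    for lp, lr, good in zip(policy_logps, ref_logps, desirable):
        r = beta * (lp - lr)  # r_KTO(x, y) = beta * log(pi_theta / pi_ref)
        if good:
            total += lambda_d * (1.0 - sigmoid(r - z_ref))
        else:
            total += lambda_u * (1.0 - sigmoid(z_ref - r))
    return total / len(policy_logps)


# When the policy equals the reference, r = 0 and z_ref = 0, so every
# example contributes 1 - sigmoid(0) = 0.5 (with both weights at 1).
print(kto_loss([-1.0, -2.0], [-1.0, -2.0], [True, False], z_ref=0.0))  # 0.5
```

In this sketch the loss falls when the policy raises the likelihood of desirable outputs or lowers that of undesirable ones relative to the reference, which is the behavior the binary signal is meant to induce.
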
Figure 1: The traditional pipeline for LLM alignment starts with supervised finetuning, followed by fitting the LLM to paired preference data using a method such as RLHF or DPO. However, the paired preferences that existing approaches need are hard-to-get. Kahneman-Tversky Optimization (KTO) only ne

Experimental Results

Research Questions

  • RQ1: Can KTO match or exceed DPO performance across model scales from 1B to 30B parameters using only binary desirability signals?
  • RQ2: How does KTO perform with SFT+KTO versus SFT+DPO, and can KTO succeed without supervised finetuning?
  • RQ3: Is KTO robust to imbalanced data, and can it leverage non-preference binary signals effectively?
  • RQ4: Does direct utility maximization via KTO provide theoretical and empirical advantages over preference-based methods under noisy or intransitive feedback?
  • RQ5: What are the practical data and hyperparameter conditions under which KTO outperforms or matches DPO?

Key Findings

Method                         | Winrate vs. SFT Target (Mistral-7B, OpenAssistant)
Mistral-7B (unaligned)         | 0.525 ± 0.037
Mistral-7B + DPO               | 0.600 ± 0.037
Mistral-7B + KTO (all y per x) | 0.652 ± 0.036
Mistral-7B + KTO (one y per x) | 0.631 ± 0.036
Mistral-7B-Instruct            | 0.621 ± 0.031

  • KTO matches or exceeds DPO at model scales from 1B to 30B parameters.
  • KTO can handle extreme data imbalances, matching DPO with up to 90% fewer desirable examples.
  • KTO can sometimes match or outperform DPO even without supervised finetuning (SFT) prior to alignment on certain Llama models.
  • Offline PPO with dummy rewards can perform comparably to DPO for most models except the largest (Llama-30B).
  • KTO trained with binary desirability signals yields strong performance across multiple benchmarks, including MMLU, GSM8K, HumanEval, and BBH, sometimes surpassing SFT targets even when data comes from non-preference signals.
  • KTO’s theoretical analysis explains why maximizing human utility via KTO can outperform preference likelihood in open-ended settings.
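
To make the data-imbalance finding concrete, here is a small sketch of how such an experiment might be set up: paired preferences are flattened into binary-labelled examples, most desirable examples are discarded, and λ_D is scaled up to compensate. All function names are hypothetical, and both the pair-to-binary conversion and the `target_ratio` of 1.0 for λ_D·n_D / (λ_U·n_U) are assumptions for illustration rather than the paper's prescribed settings.

```python
import random


def pairs_to_binary(pairs):
    """Flatten (prompt, preferred, dispreferred) triples into
    (prompt, output, is_desirable) examples."""
    examples = []
    for prompt, chosen, rejected in pairs:
        examples.append((prompt, chosen, True))
        examples.append((prompt, rejected, False))
    return examples


def subsample_desirable(examples, keep_fraction, seed=0):
    """Keep only a fraction of the desirable examples (at least one),
    leaving all undesirable examples in place."""
    rng = random.Random(seed)
    desirable = [e for e in examples if e[2]]
    undesirable = [e for e in examples if not e[2]]
    n_keep = max(1, round(len(desirable) * keep_fraction))
    return rng.sample(desirable, n_keep) + undesirable


def rebalanced_lambda_d(examples, lambda_u=1.0, target_ratio=1.0):
    """Choose lambda_D so that lambda_D * n_D / (lambda_U * n_U)
    equals target_ratio (an assumed heuristic)."""
    n_d = sum(1 for e in examples if e[2])
    n_u = len(examples) - n_d
    return target_ratio * lambda_u * n_u / n_d


pairs = [(f"x{i}", f"good{i}", f"bad{i}") for i in range(10)]
imbalanced = subsample_desirable(pairs_to_binary(pairs), keep_fraction=0.1)
print(len(imbalanced))                  # 11: one desirable, ten undesirable
print(rebalanced_lambda_d(imbalanced))  # 10.0
```

With 90% of the desirable examples removed, the single remaining desirable example is weighted ten times as heavily as each undesirable one under these assumptions.
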
Figure 2: The utility that a human gets from the outcome of a random variable, as imputed by the value function implicit in HALOs. Notice that the imputed functions share properties such as loss aversion with the human value functions that Kahneman & Tversky empirically derived (1992).
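
For reference, the empirically derived value function the caption refers to can be sketched as follows, assuming the standard power-law form with the median parameter estimates from Tversky & Kahneman (1992), α = 0.88 and λ = 2.25. The function name `kt_value` is hypothetical, and this is the classical prospect-theory curve rather than the logistic value used inside KTO.

```python
def kt_value(z: float, alpha: float = 0.88, lam: float = 2.25) -> float:
    """Kahneman-Tversky value of an outcome z relative to a reference point of 0.

    Gains are valued as z**alpha (concave for alpha < 1); losses are valued
    as -lam * (-z)**alpha, so a loss weighs lam times more than an equal gain.
    """
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** alpha)


print(kt_value(1.0))   # 1.0
print(kt_value(-1.0))  # -2.25
# Concavity in gains: doubling the gain less than doubles the value.
print(kt_value(2.0) < 2 * kt_value(1.0))  # True
```
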

This review was generated by AI and reviewed by human editors.