[Paper Walkthrough] KTO: Model Alignment as Prospect Theoretic Optimization
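The paper introduces Kahneman-Tversky Optimization (KTO), a loss function in the HALO family that uses a binary desirability signal to directly optimize an objective inspired by human utility. It matches or exceeds DPO on models from 1B to 30B parameters and does not require preference data.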
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
Motivation and Objectives
- Motivate why existing preference-based alignment losses implicitly encode human biases (HALOs).
- Propose Kahneman-Tversky Optimization (KTO) as a HALO that maximizes generation utility directly using binary signals.
- Demonstrate that KTO matches or outperforms DPO across model scales (1B–30B) and data regimes.
- Show that KTO can operate with imbalanced or non-preference data and reduce reliance on costly human preferences.
Proposed Method
- Formulate HALOs as human-aware losses using a Kahneman-Tversky value function to model loss aversion and concavity in gains.
- Derive the KTO loss from the implied reward r_KTO(x,y) = β log(πθ(y|x)/πref(y|x)) and a KL-based reference point z_ref, so that each output is scored as a gain or a loss relative to how far the policy has already moved from the reference policy.
- Replace the traditional preference likelihood with a utility-based objective that incorporates a logistic v_KTO and a reference-point z_ref in a KL-regularized framework.
- Implement KTO with binary desirability signals (desirable/undesirable) and estimate the KL term within each batch from mismatched input-output pairs, which stabilizes training.
- Show how hyperparameters β, λ_D, λ_U and data composition (desirable vs undesirable) influence learning and data efficiency.
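The method bullets above can be sketched in code. What follows is a minimal Python sketch under stated assumptions, not the authors' implementation: it assumes per-sequence log-probabilities under the policy and the reference model have already been computed, and the function name `kto_loss`, its argument names, and the default β = 0.1 are illustrative choices. It further assumes the batch KL estimate is clamped at zero and treated as a constant, which this summary does not state.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def kto_loss(policy_logps, ref_logps, desirable,
             kl_policy_logps, kl_ref_logps,
             beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Batch-mean KTO-style loss from sequence log-probabilities.

    policy_logps[i], ref_logps[i]: log pi_theta(y_i|x_i) and log pi_ref(y_i|x_i).
    desirable[i]: True if y_i is labeled desirable, False if undesirable.
    kl_policy_logps, kl_ref_logps: the same two quantities evaluated on
        mismatched pairs (an input paired with another example's output),
        used only to estimate the reference point.
    """
    # Reference point z_ref: beta times a batch estimate of the KL term.
    # Assumption: clamp at zero, and in an autodiff framework treat this
    # value as a constant (no gradient through it).
    n_kl = len(kl_policy_logps)
    kl_est = sum(p - r for p, r in zip(kl_policy_logps, kl_ref_logps)) / n_kl
    z_ref = beta * max(0.0, kl_est)

    total = 0.0
    for lp, lr, good in zip(policy_logps, ref_logps, desirable):
        r_kto = beta * (lp - lr)  # implied reward r_KTO(x, y)
        if good:
            # Desirable output: value grows as the reward exceeds z_ref.
            total += lambda_d * (1.0 - sigmoid(r_kto - z_ref))
        else:
            # Undesirable output: value grows as the reward falls below z_ref.
            total += lambda_u * (1.0 - sigmoid(z_ref - r_kto))
    return total / len(policy_logps)
```

When the policy and reference log-probabilities are identical, every implied reward is zero and each example contributes λ · (1 − σ(0)) = 0.5λ, which makes a convenient sanity check before training.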

Experimental Results
Research Questions
- RQ1: Can KTO match or exceed DPO performance across model scales from 1B to 30B parameters using only binary desirability signals?
- RQ2: How does KTO perform with SFT+KTO versus SFT+DPO, and can KTO succeed without supervised finetuning?
- RQ3: Is KTO robust to imbalanced data, and can it leverage non-preference binary signals effectively?
- RQ4: Does direct utility maximization via KTO provide theoretical and empirical advantages over preference-based methods under noisy or intransitive feedback?
- RQ5: What are the practical data and hyperparameter conditions under which KTO outperforms or matches DPO?
Key Findings
| Method | Winrate vs. SFT Target (Mistral-7B, OpenAssistant) |
|---|---|
| Mistral-7B (unaligned) | 0.525 ± 0.037 |
| Mistral-7B + DPO | 0.600 ± 0.037 |
| Mistral-7B + KTO (all y per x) | 0.652 ± 0.036 |
| Mistral-7B + KTO (one y per x) | 0.631 ± 0.036 |
| Mistral-7B-Instruct | 0.621 ± 0.031 |
- KTO matches or exceeds DPO at model scales from 1B to 30B parameters.
- KTO can handle extreme data imbalances, matching DPO with up to 90% fewer desirable examples.
- KTO can sometimes match or outperform DPO even without supervised finetuning (SFT) prior to alignment on certain Llama models.
- Offline PPO with dummy rewards can perform comparably to DPO for most models except the largest (Llama-30B).
- KTO trained with binary desirability signals yields strong performance across multiple benchmarks, including MMLU, GSM8K, HumanEval, and BBH, sometimes surpassing SFT targets even when data comes from non-preference signals.
- The paper's theoretical analysis offers an explanation for why maximizing human utility via KTO can outperform maximizing preference likelihood in open-ended settings.
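The imbalance result interacts with the class weights λ_D and λ_U. This summary does not say how they should be set, so the following Python sketch rests on an assumption: that the weighted counts of the two classes are kept roughly equal, i.e. (λ_D · n_D) / (λ_U · n_U) lies in about [1, 4/3]. That range is recalled from the paper's recommendation and should be verified against it; the helper name `kto_class_weights` and its parameters are hypothetical.

```python
def kto_class_weights(n_desirable, n_undesirable, target_ratio=1.0):
    """Return (lambda_d, lambda_u) for an imbalanced binary dataset.

    Assumption: lambda_u is fixed at 1 and lambda_d is solved from
        (lambda_d * n_desirable) / (lambda_u * n_undesirable) == target_ratio,
    with target_ratio expected to sit somewhere in [1, 4/3].
    """
    if n_desirable <= 0 or n_undesirable <= 0:
        raise ValueError("both classes need at least one example")
    lambda_u = 1.0
    lambda_d = target_ratio * lambda_u * n_undesirable / n_desirable
    return lambda_d, lambda_u


# Example: a 1:10 desirable-to-undesirable split upweights desirable examples.
weights = kto_class_weights(n_desirable=100, n_undesirable=1000)
```

Fixing one weight at 1 and solving for the other keeps the overall loss scale easy to reason about; the alternative of downweighting the larger class gives the same ratio.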

This walkthrough was generated by AI and reviewed by a human editor.