QUICK REVIEW

[Paper Review] Continuous-Utility Direct Preference Optimization

Muhammad Ahmed Mohsin, Muhammad Umer|arXiv (Cornell University)|Jan 31, 2026
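Topic Modeling | Cited by 0
One-Sentence Summary

CU-DPO replaces binary preferences with continuous scores, aligning the model to a portfolio of prompt-based cognitive strategies and improving both strategy selection and downstream reasoning.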

ABSTRACT

Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.
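For readers who want the convergence claim in symbols, the block below writes out the standard KL-regularized utility-maximization problem and its well-known closed-form solution. This is a sketch under assumptions: the symbols u (continuous utility), beta (regularization strength), pi_ref (reference policy), and Z (normalizer) are the usual DPO notation, not taken from the paper, and the paper's exact statement of "entropy-regularized" may differ.

```latex
% Assumed notation: u(x,y) is a continuous utility, \beta > 0 the regularization
% strength, \pi_{\mathrm{ref}} the reference policy. Standard result, not the paper's theorem.
\begin{align}
  \pi^{*} &= \arg\max_{\pi}\;
     \mathbb{E}_{x,\, y \sim \pi(\cdot \mid x)}\bigl[u(x,y)\bigr]
     - \beta\, \mathrm{KL}\bigl(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\bigr), \\
  \pi^{*}(y \mid x) &= \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
     \exp\!\bigl(u(x,y)/\beta\bigr),
  \qquad
  Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\bigl(u(x,y)/\beta\bigr).
\end{align}
```

Under this reading, the quantity beta * log(pi(y|x) / pi_ref(y|x)) recovers u(x,y) up to a term that depends only on x, which is why a DPO-style loss can be written directly on log-probability ratios.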

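Motivation and Objectives

  • Move beyond binary preferences to achieve finer-grained alignment of LLM reasoning.
  • Propose a two-stage training pipeline for selecting and executing cognitive strategies.
  • Establish, in theory, the sample efficiency and convergence properties of CU-DPO in the multi-strategy setting.

Proposed Method

  • Define a continuous-utility direct preference learning objective that aggregates scores from K strategies.
  • Prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences.
  • Two-stage training pipeline: (i) strategy selection via best-vs-all comparisons, and (ii) execution refinement via margin-stratified pairs.
  • Convergence analysis shows that DPO converges to the entropy-regularized utility-maximizing policy.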
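The two pairing schemes and the use of continuous scores can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not the authors' implementation: the strategy names, the bucket edges (0.1, 0.3), the temperature `tau`, and in particular the soft-label loss, which is one plausible way to feed a utility margin into a DPO-style objective.

```python
import math
from itertools import combinations
from typing import Dict, List, Tuple

Pair = Tuple[str, str, float]  # (preferred id, dispreferred id, utility margin)

def best_vs_all_pairs(scores: Dict[str, float]) -> List[Pair]:
    """Stage (i) sketch: pair the top-scoring candidate against every other one."""
    best = max(scores, key=scores.get)
    return [(best, other, scores[best] - scores[other])
            for other in scores if other != best]

def margin_stratified_pairs(scores: Dict[str, float],
                            edges: Tuple[float, float] = (0.1, 0.3)) -> Dict[str, List[Pair]]:
    """Stage (ii) sketch: bucket all ordered pairs by utility margin.
    The bucket edges are hypothetical values, not taken from the paper."""
    buckets: Dict[str, List[Pair]] = {"small": [], "medium": [], "large": []}
    for a, b in combinations(scores, 2):
        if scores[a] == scores[b]:
            continue  # ties carry no preference signal
        win, lose = (a, b) if scores[a] > scores[b] else (b, a)
        margin = scores[win] - scores[lose]
        name = "small" if margin < edges[0] else "medium" if margin < edges[1] else "large"
        buckets[name].append((win, lose, margin))
    return buckets

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def soft_label_dpo_loss(logratio_w: float, logratio_l: float, margin: float,
                        beta: float = 0.1, tau: float = 1.0) -> float:
    """Assumed loss: cross-entropy between a soft target sigmoid(margin / tau)
    and the DPO preference probability sigmoid(beta * (logratio_w - logratio_l)).
    Each logratio is log pi_theta(y|x) - log pi_ref(y|x) for one response."""
    z = beta * (logratio_w - logratio_l)
    p = 1.0 / (1.0 + math.exp(-margin / tau))  # soft target from the utility margin
    return -(p * log_sigmoid(z) + (1.0 - p) * log_sigmoid(-z))

# Hypothetical continuous scores for three strategies on one problem.
scores = {"algebraic": 0.9, "geometric": 0.4, "casework": 0.85}
selection_pairs = best_vs_all_pairs(scores)
refinement_buckets = margin_stratified_pairs(scores)
```

In this sketch a very large margin pushes the soft target toward 1, which recovers the usual binary DPO loss, while a small margin asks the model for only a weak preference. Whether CU-DPO uses this exact form is not stated in the summary above.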

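Experimental Results

Research Questions

  • RQ1: Do continuous utility signals improve selection among multiple prompting strategies more than binary preferences do?
  • RQ2: What are the sample complexity and convergence properties when learning with K strategies?
  • RQ3: How does CU-DPO affect in-distribution (ID) and out-of-distribution (OOD) reasoning performance?
  • RQ4: Does the two-stage training pipeline yield substantial gains on mathematical reasoning benchmarks?

Key Findings

  • Strategy selection accuracy rises from 35–46% to 68–78% across seven base models.
  • Downstream reasoning improves by up to 6.6 points on in-distribution datasets.
  • The gains transfer effectively to out-of-distribution tasks.
  • CU-DPO delivers consistent improvements on mathematical reasoning benchmarks.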

This review was generated by AI and reviewed by human editors.