QUICK REVIEW

[Paper Review] Continuous-Utility Direct Preference Optimization

Muhammad Ahmed Mohsin, Muhammad Umer | arXiv (Cornell University) | Jan 31, 2026


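One-Sentence Summary

CU-DPO replaces binary preferences with continuous scores, aligning the model to a portfolio of prompt-based cognitive strategies and improving both strategy selection and downstream reasoning.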

ABSTRACT

Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.
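The convergence claim in the abstract can be written out in symbols. The following is a sketch under the standard DPO setup, not the paper's own statement: it assumes the regularizer is a KL penalty toward a reference policy \pi_{\mathrm{ref}} with temperature \beta > 0, and it writes u(x, y) for the continuous utility of response y to prompt x. All of these symbols are assumptions, since the abstract gives no notation.

```latex
% Regularized utility maximization (assumed form of the objective)
\max_{\pi} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)} \bigl[ u(x, y) \bigr]
  \;-\; \beta \, \mathrm{KL}\bigl( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \bigr)

% Closed-form maximizer of the objective above
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)} \, \pi_{\mathrm{ref}}(y \mid x) \,
  \exp\!\Bigl( \tfrac{1}{\beta} \, u(x, y) \Bigr),
\qquad
Z(x) \;=\; \sum_{y'} \pi_{\mathrm{ref}}(y' \mid x) \,
  \exp\!\Bigl( \tfrac{1}{\beta} \, u(x, y') \Bigr)
```

Under this reading, "converges to the entropy-regularized utility-maximizing policy" would mean that the trained policy approaches \pi^{*}, which reweights the reference policy toward higher-utility responses without collapsing onto a single one.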

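Motivation and Objectives

Achieve finer-grained alignment of LLM reasoning by moving beyond binary preference supervision, which cannot capture partial progress.
Propose a two-stage training pipeline for selecting and executing cognitive strategies.
Establish, in theory, the sample efficiency and convergence properties of CU-DPO in the multi-strategy setting.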

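Proposed Method

Define a continuous-utility direct preference optimization objective that aggregates scores from K strategies.
Prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences.
Train in two stages: (i) strategy selection via best-vs-all comparisons, and (ii) execution refinement via margin-stratified pairs.
Show through convergence analysis that DPO converges to the entropy-regularized utility-maximizing policy.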
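The two pair-construction schemes named in the method list can be illustrated with a minimal Python sketch. This is one plausible reading, not the authors' implementation: the function names, the strategy names, the three stratum labels, and the margin thresholds in `edges` are all hypothetical, and the review does not say how the paper defines its strata.

```python
from itertools import combinations


def best_vs_all_pairs(strategy_scores):
    """Stage (i) sketch: pair the top-scoring strategy against every other one.

    strategy_scores: dict mapping a strategy name to its continuous score.
    Returns a list of (chosen, rejected, margin) tuples.
    """
    best = max(strategy_scores, key=strategy_scores.get)
    return [
        (best, other, strategy_scores[best] - score)
        for other, score in strategy_scores.items()
        if other != best
    ]


def margin_stratified_pairs(execution_scores, edges=(0.2, 0.5)):
    """Stage (ii) sketch: pair executions of one strategy, bucketed by margin.

    execution_scores: dict mapping an execution id to its continuous score.
    edges: hypothetical thresholds separating small/medium/large margins.
    Returns a dict mapping a stratum name to (chosen, rejected, margin) tuples.
    """
    strata = {"small": [], "medium": [], "large": []}
    for a, b in combinations(execution_scores, 2):
        margin = abs(execution_scores[a] - execution_scores[b])
        if margin == 0:
            continue  # a tie carries no preference signal
        if execution_scores[a] > execution_scores[b]:
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        if margin < edges[0]:
            strata["small"].append((chosen, rejected, margin))
        elif margin < edges[1]:
            strata["medium"].append((chosen, rejected, margin))
        else:
            strata["large"].append((chosen, rejected, margin))
    return strata


# Hypothetical scores for one problem (strategy names are made up).
selection_pairs = best_vs_all_pairs(
    {"decompose": 0.9, "work_backward": 0.4, "analogy": 0.6}
)
# Hypothetical scores for three executions of the selected strategy.
execution_strata = margin_stratified_pairs(
    {"run_a": 1.0, "run_b": 0.75, "run_c": 0.25}
)
```

Keeping the margin alongside each pair preserves the continuous signal that a binary label would discard. How CU-DPO actually feeds that margin into its loss is not stated in this review.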

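Experimental Results

Research Questions

RQ1: Do continuous utility signals improve selection among multiple prompt strategies more than binary preferences do?
RQ2: What are the sample complexity and convergence properties when K strategies are used?
RQ3: How does CU-DPO affect in-distribution (ID) and out-of-distribution (OOD) reasoning performance?
RQ4: Does the two-stage training pipeline yield meaningful gains on mathematical reasoning benchmarks?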

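Key Findings

Strategy selection accuracy rises from 35–46% to 68–78% across seven base models.
Downstream reasoning improves by up to 6.6 points on in-distribution datasets.
The gains transfer effectively to out-of-distribution tasks.
CU-DPO delivers consistent improvements on mathematical reasoning benchmarks.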


This review was generated by AI and reviewed by a human editor.