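[Paper Digest] Continuous-Utility Direct Preference Optimization
CU-DPO replaces binary preferences with continuous scores, aligning a model to a portfolio of prompt-based cognitive strategies and improving both strategy selection and downstream reasoning.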
Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.
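Research motivation and goals
- Move beyond binary preferences so that LLM reasoning can be aligned at a finer granularity, including partial progress that a win/lose label cannot express.
- Propose a two-stage training pipeline that first selects and then executes a cognitive strategy.
- Establish, in theory, the sample efficiency and convergence properties of CU-DPO in the multi-strategy setting.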
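Proposed method
- Define a continuous-utility direct preference learning objective that aggregates scores from K strategies.
- Prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences.
- Two-stage training pipeline: (i) strategy selection via best-vs-all comparisons, (ii) execution refinement via margin-stratified pairs.
- A convergence analysis shows that DPO converges to the entropy-regularized utility-maximizing policy.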
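The pair construction behind the two stages can be illustrated with a minimal Python sketch. This is not the authors' implementation: the strategy names, the margin thresholds in `edges`, and the three bucket labels are hypothetical, and `dpo_pair_loss` is the standard DPO logistic loss rather than the paper's exact continuous-utility objective, which this digest does not spell out.

```python
import math
from itertools import combinations


def best_vs_all_pairs(scores):
    """Stage (i) sketch: pit the top-scoring strategy against every other one.

    scores: dict mapping a strategy name to its continuous utility.
    Returns a list of (winner, loser, margin) tuples.
    """
    best = max(scores, key=scores.get)
    return [(best, s, scores[best] - scores[s]) for s in scores if s != best]


def margin_stratified_pairs(scored, edges=(0.1, 0.3)):
    """Stage (ii) sketch: bucket response pairs by their utility margin.

    scored: list of (response_id, utility) for one strategy's executions.
    edges:  hypothetical thresholds separating small/medium/large margins.
    """
    buckets = {"small": [], "medium": [], "large": []}
    for (a, ua), (b, ub) in combinations(scored, 2):
        if ua == ub:
            continue  # a tie carries no preference signal
        win, lose = (a, b) if ua > ub else (b, a)
        margin = abs(ua - ub)
        if margin < edges[0]:
            name = "small"
        elif margin < edges[1]:
            name = "medium"
        else:
            name = "large"
        buckets[name].append((win, lose, margin))
    return buckets


def dpo_pair_loss(logratio_chosen, logratio_rejected, beta=0.1):
    """Standard DPO loss for one pair: -log sigmoid(beta * (chosen - rejected)).

    Each argument is log pi_theta(y|x) - log pi_ref(y|x) for that response.
    Written as a numerically stable softplus(-z).
    """
    z = beta * (logratio_chosen - logratio_rejected)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Hypothetical utilities for three prompting strategies on one problem.
    strategy_scores = {"chain_of_thought": 0.9, "decompose": 0.4, "analogy": 0.7}
    print(best_vs_all_pairs(strategy_scores))

    # Hypothetical utilities for three executions of the chosen strategy.
    executions = [("r1", 0.95), ("r2", 0.90), ("r3", 0.50)]
    print(margin_stratified_pairs(executions))
```

With K strategies, best-vs-all yields K - 1 pairs per problem instead of a single binary label. Stratifying by margin makes it possible to sample easy (large-margin) and hard (small-margin) pairs deliberately, although how the paper actually weights or samples the strata is not stated here.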
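Experimental results
Research questions
- RQ1: Across multiple prompting strategies, does a continuous utility signal improve strategy selection more than binary preferences do?
- RQ2: What are the sample complexity and convergence properties when learning with K strategies?
- RQ3: How does CU-DPO affect in-distribution (ID) and out-of-distribution (OOD) reasoning performance?
- RQ4: Does the two-stage training pipeline deliver meaningful gains on mathematical reasoning benchmarks?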
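Key findings
- Strategy selection accuracy rises from 35–46% to 68–78% across seven base models.
- Downstream reasoning improves by up to 6.6 points on in-distribution datasets.
- The gains transfer effectively to out-of-distribution tasks.
- CU-DPO delivers consistent improvements on mathematical reasoning benchmarks.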
This digest was generated by AI and reviewed by a human editor.