QUICK REVIEW

[Paper Review] Continuous-Utility Direct Preference Optimization

Muhammad Ahmed Mohsin, Muhammad Umer | arXiv (Cornell University) | January 31, 2026

Topic Modeling | Citations: 0

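One-Line Summary

CU-DPO replaces binary preferences with continuous scores to align LLMs to a portfolio of prompt-based cognitive strategies, improving both strategy selection and downstream reasoning performance.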

ABSTRACT

Large language model reasoning is often treated as a monolithic capability, relying on binary preference supervision that fails to capture partial progress or fine-grained reasoning quality. We introduce Continuous Utility Direct Preference Optimization (CU-DPO), a framework that aligns models to a portfolio of prompt-based cognitive strategies by replacing binary labels with continuous scores that capture fine-grained reasoning quality. We prove that learning with K strategies yields a Theta(K log K) improvement in sample complexity over binary preferences, and that DPO converges to the entropy-regularized utility-maximizing policy. To exploit this signal, we propose a two-stage training pipeline: (i) strategy selection, which optimizes the model to choose the best strategy for a given problem via best-vs-all comparisons, and (ii) execution refinement, which trains the model to correctly execute the selected strategy using margin-stratified pairs. On mathematical reasoning benchmarks, CU-DPO improves strategy selection accuracy from 35-46 percent to 68-78 percent across seven base models, yielding consistent downstream reasoning gains of up to 6.6 points on in-distribution datasets with effective transfer to out-of-distribution tasks.

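Motivation and Objectives

Enable finer-grained alignment of LLM reasoning than binary preference supervision allows, since binary labels cannot capture partial progress.
Propose a two-stage training pipeline for selecting and executing cognitive strategies.
Establish theoretically the sample efficiency and convergence properties of CU-DPO under multiple strategies.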

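Proposed Method

Defines a continuous-utility direct preference learning objective that aggregates scores across K strategies.
Proves a Theta(K log K) improvement in sample complexity over binary preferences.
Introduces a two-stage training pipeline: (i) strategy selection via best-vs-all comparisons, (ii) execution refinement with margin-stratified pairs.
Presents a convergence analysis showing that DPO converges to the entropy-regularized utility-maximizing policy.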
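
The two stages of the pipeline differ mainly in how preference pairs are built from continuous scores. The sketch below is one plausible reading of "best-vs-all" and "margin-stratified" pair construction, not the authors' implementation: the `Pair` type, both function names, the strategy names, the scores, and the bucket boundaries in `edges` are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Pair:
    chosen: str      # id of the higher-utility item
    rejected: str    # id of the lower-utility item
    margin: float    # utility gap: score(chosen) - score(rejected)


def best_vs_all_pairs(scores: Dict[str, float]) -> List[Pair]:
    """Stage (i) sketch: pair the top-scoring strategy with every other strategy."""
    best = max(scores, key=scores.get)
    return [
        Pair(best, name, scores[best] - score)
        for name, score in scores.items()
        if name != best
    ]


def margin_stratified_pairs(
    scores: Dict[str, float],
    edges: Tuple[float, float] = (0.3, 0.6),  # hypothetical bucket boundaries
) -> Dict[str, List[Pair]]:
    """Stage (ii) sketch: bucket every strictly ordered pair by its utility gap."""
    buckets: Dict[str, List[Pair]] = {"small": [], "medium": [], "large": []}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    for i, (hi_name, hi_score) in enumerate(ranked):
        for lo_name, lo_score in ranked[i + 1:]:
            margin = hi_score - lo_score
            if margin <= 0:
                continue  # ties carry no preference signal
            if margin < edges[0]:
                key = "small"
            elif margin < edges[1]:
                key = "medium"
            else:
                key = "large"
            buckets[key].append(Pair(hi_name, lo_name, margin))
    return buckets


if __name__ == "__main__":
    # Hypothetical continuous scores in [0, 1] for four prompt strategies.
    strategy_scores = {
        "direct": 0.5,
        "step_by_step": 1.0,
        "decompose": 0.75,
        "analogy": 0.0,
    }
    for pair in best_vs_all_pairs(strategy_scores):
        print(pair)
    for name, pairs in margin_stratified_pairs(strategy_scores).items():
        print(name, len(pairs))
```

With K scored strategies, the best-vs-all helper yields K - 1 pairs per problem, each carrying its utility gap. How the margins enter the training loss is not described in this review, so the sketch stops at pair construction.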

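Experimental Results

Research Questions

RQ1: Can continuous utility signals improve selection among multiple prompt-based strategies compared with binary preferences?
RQ2: What are the sample complexity and convergence properties when K strategies are used?
RQ3: How does CU-DPO affect in-distribution and out-of-distribution reasoning performance?
RQ4: Does the two-stage training pipeline deliver practical gains on mathematical reasoning benchmarks?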

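Key Results

Strategy selection accuracy improved from 35–46% to 68–78% across seven base models.
Downstream reasoning performance on in-distribution datasets improved by up to 6.6 points.
The gains transfer effectively to out-of-distribution tasks.
CU-DPO shows consistent gains across mathematical reasoning benchmarks.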


This review was generated by AI and reviewed by a human editor.