QUICK REVIEW

[Paper Review] DeDPO: Debiased Direct Preference Optimization for Diffusion Models

Khiem Pham, Quang Nguyen|arXiv (Cornell University)|2026. 02. 05.

Recommender Systems and Techniques | Citations: 0

One-Line Summary

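DeDPO integrates a debiased estimator into Direct Preference Optimization so that a model can learn jointly from a small set of human preferences and a large set of synthetic preferences; in diffusion model alignment, it matches or exceeds baselines trained on fully human-labeled data.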

ABSTRACT

Direct Preference Optimization (DPO) has emerged as a predominant alignment method for diffusion models, facilitating off-policy training without explicit reward modeling. However, its reliance on large-scale, high-quality human preference labels presents a severe cost and scalability bottleneck. To overcome this, we propose a semi-supervised framework augmenting limited human data with a large corpus of unlabeled pairs annotated via cost-effective synthetic AI feedback. Our paper introduces Debiased DPO (DeDPO), which uniquely integrates a debiased estimation technique from causal inference into the DPO objective. By explicitly identifying and correcting the systematic bias and noise inherent in synthetic annotators, DeDPO ensures robust learning from imperfect feedback sources, including self-training and Vision-Language Models (VLMs). Experiments demonstrate that DeDPO is robust to the variations in synthetic labeling methods, achieving performance that matches and occasionally exceeds the theoretical upper bound of models trained on fully human-labeled data. This establishes DeDPO as a scalable solution for human-AI alignment using inexpensive synthetic supervision.

Motivation and Objectives

Motivate the need for scalable alignment of text-to-image diffusion models with limited human feedback.
Propose a semi-supervised framework that augments small labeled data with large unlabeled data annotated via synthetic AI feedback.
Introduce DeDPO, a debiased estimator integrated into DPO to correct bias from synthetic labels.
Demonstrate robustness of DeDPO to different synthetic labeling sources and data regimes.

Proposed Method

Recast DPO as a binary classification loss over pairwise image preferences.
Introduce DeDPO loss: L_DeDPO = E_{n_l+n_u} L(G_theta(y), G_hat(y)) + E_{n_l}(L(G_theta(y_l), z_l) - L(G_theta(y_l), G_hat(y_l))).
Prove unbiasedness: E[L_DeDPO] = E[L_DPO], regardless of G_hat accuracy.
Provide a debiasing interpretation: unlabeled data use pseudo-labels; labeled data receive an amplified correction toward ground-truth labels.
Describe synthetic preference generation via self-training (G_hat = G_theta_hat) and via pretrained Vision-Language Models (e.g., Qwen).
Analyze convergence: under mild conditions, the learned theta converges with rate O(1/(n_l+n_u)) plus a term depending on (||G_hat - G*||_4)^4, showing robustness to slow synthetic learners.
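The structure of the DeDPO loss and its unbiasedness claim can be illustrated with a small numerical sketch. This is not the authors' implementation. It assumes that G_theta is the sigmoid of a scalar preference margin (here a plain array named `logits`), that L is binary cross-entropy, that G_hat returns soft pseudo-labels in [0, 1], and that the labeled pairs are a uniform random subset of all pairs. All function names and the toy numbers are hypothetical. Averaging over every possible choice of the labeled subset plays the role of the expectation: the DeDPO average recovers the fully human-labeled loss even though the pseudo-labels are biased, while naively mixing pseudo-labels with human labels does not.

```python
import itertools
import numpy as np


def bce_with_logits(logits, targets):
    """Binary cross-entropy on logits; targets may be hard (0/1) or soft in [0, 1]."""
    # softplus(x) - t * x equals -t*log(sigmoid(x)) - (1 - t)*log(1 - sigmoid(x))
    return np.logaddexp(0.0, logits) - targets * logits


def naive_loss(logits, pseudo, labeled_idx, human):
    """Naive mix: human labels where available, pseudo-labels everywhere else."""
    targets = pseudo.copy()
    targets[labeled_idx] = human
    return bce_with_logits(logits, targets).mean()


def dedpo_loss(logits, pseudo, labeled_idx, human):
    """Pseudo-label loss on all pairs plus a correction on the labeled pairs."""
    # First term: mean over all n_l + n_u pairs against the pseudo-labels.
    pseudo_term = bce_with_logits(logits, pseudo).mean()
    # Second term: mean over the n_l labeled pairs of (human loss - pseudo loss).
    lab = logits[labeled_idx]
    correction = (
        bce_with_logits(lab, human) - bce_with_logits(lab, pseudo[labeled_idx])
    ).mean()
    return pseudo_term + correction


def mean_over_labeled_subsets(loss_fn, logits, pseudo, true_labels, n_labeled):
    """Average loss_fn over every possible choice of the labeled subset."""
    losses = []
    for subset in itertools.combinations(range(len(logits)), n_labeled):
        idx = np.array(subset)
        losses.append(loss_fn(logits, pseudo, idx, true_labels[idx]))
    return float(np.mean(losses))


# Toy data (hypothetical): 6 preference pairs, 2 of them human-labeled.
logits = np.array([1.2, -0.4, 0.3, 2.0, -1.5, 0.7])      # model margins
true_labels = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])   # human preferences z
pseudo = np.array([0.9, 0.8, 0.6, 0.7, 0.5, 0.9])        # biased annotator G_hat

full_human = float(bce_with_logits(logits, true_labels).mean())
dedpo_avg = mean_over_labeled_subsets(dedpo_loss, logits, pseudo, true_labels, 2)
naive_avg = mean_over_labeled_subsets(naive_loss, logits, pseudo, true_labels, 2)

print("full human-label loss:", full_human)
print("DeDPO, averaged over labeled subsets:", dedpo_avg)
print("naive mix, averaged over labeled subsets:", naive_avg)
```

The pseudo-label terms cancel on average because each pair is equally likely to land in the labeled subset, which is also why the labeled correction effectively carries more weight per sample than the pseudo-label term. In an actual diffusion setting the margins would presumably come from the DPO implicit reward rather than a fixed array.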

Figure 1: Our method achieves performance comparable to models trained on high-quality labels, particularly excelling at capturing subtle details within complex prompts. DeDPO successfully renders challenging elements like the astronaut helmet on Abraham Lincoln’s statue, the Statue of Liberty on t

Experimental Results

Research Questions

RQ1: Can DeDPO achieve competitive or superior alignment using a mix of limited human labels and abundant synthetic preferences?
RQ2: Is the proposed debiased loss unbiased and robust to noisy or imperfect synthetic annotations?
RQ3: How do different synthetic annotators (self-training, CLIP, Qwen) impact alignment under DeDPO?
RQ4: What are the convergence properties when synthetic preference models converge slowly?
RQ5: How does scaling labeled/unlabeled data affect diffusion model alignment performance?

Key Results

Training set	Model	Method	# Pref. pairs	# Unpref. pairs	PS (↑)	HPSv2 Avg (↑)	AS (↑)
FiFA-5K	SD1.5	SFT	1250	0	21.64	27.62	5.43
FiFA-5K	SD1.5	DPO [57] + 25%	1250	0	21.76	27.76	5.38
FiFA-5K	SD1.5	DPO [57] + 100%	5000	0	21.88	27.79	5.38
FiFA-5K	SD1.5	DPO [57] + synthetic pref.	1250	3750	21.71	27.39	5.33
FiFA-5K	SD1.5	DeDPO + synthetic pref.	1250	3750	21.91	27.80	5.43
FiFA-5K	SDXL	SFT	1250	0	22.01	27.87	5.60
FiFA-5K	SDXL	DPO [57] + 25%	1250	0	22.57	28.34	5.66
FiFA-5K	SDXL	DPO [57] + 100%	5000	0	22.84	28.76	5.77
FiFA-5K	SDXL	DPO [57] + synthetic pref.	1250	3750	22.61	28.71	5.66
FiFA-5K	SDXL	DeDPO + synthetic pref.	1250	3750	22.83	28.76	5.77
HPDv2	SD1.5	SFT	1250	0	21.48	26.94	5.26
HPDv2	SD1.5	DPO [57] + 25%	1250	0	21.61	27.63	5.38
HPDv2	SD1.5	DPO [57] + 100%	5000	0	21.61	27.60	5.38
HPDv2	SD1.5	DeDPO + synthetic pref.	1250	3750	21.66	27.70	5.40
HPDv2	SDXL	SFT	1250	0	21.60	27.26	5.36
HPDv2	SDXL	DPO [57] + 25%	1250	0	22.48	28.44	5.71
HPDv2	SDXL	DPO [57] + 100%	5000	0	22.53	28.45	5.71
HPDv2	SDXL	DPO [57] + synthetic pref.	1250	3750	22.52	28.53	5.71
HPDv2	SDXL	DeDPO + synthetic pref.	1250	3750	22.55	28.56	5.74

DeDPO with 25% human and 75% synthetic labels matches or exceeds fully supervised DPO on multiple backbones (SD1.5 and SDXL) across standard metrics.
On FiFA-5K with SD1.5, DeDPO achieves 21.91 PickScore (PS) and 27.80 HPSv2 vs. 21.88 and 27.79 for fully supervised DPO; on SDXL the two are essentially tied (22.83 vs. 22.84 PS, 28.76 vs. 28.76 HPSv2).
On HPDv2, DeDPO remains on par with full-human baselines for SD1.5 and slightly improves SDXL metrics, despite using only a quarter of human labels.
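DeDPO consistently outperforms naive semi-supervised DPO, which trains on the same mix of human and synthetic labels without a correction term (e.g., 21.91 vs. 21.71 PS on FiFA-5K with SD1.5), demonstrating robustness to noisy AI feedback.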
The choice of synthetic source matters: Qwen-based preferences yield the best performance, outperforming CLIP and self-training in several settings.
Ablations show DeDPO gains across synthetic sources and is stable as unlabeled data scale, while naive DPO can degrade with more synthetic noise.
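The comparison with fully supervised DPO can be re-derived from the table. The following is a small sketch whose values are copied from the "DPO [57] + 100%" and "DeDPO + synthetic pref." rows; the variable names are illustrative only.

```python
# (PS, HPSv2 Avg, AS) per training set and backbone, copied from the results table.
dpo_full = {
    ("FiFA-5K", "SD1.5"): (21.88, 27.79, 5.38),
    ("FiFA-5K", "SDXL"): (22.84, 28.76, 5.77),
    ("HPDv2", "SD1.5"): (21.61, 27.60, 5.38),
    ("HPDv2", "SDXL"): (22.53, 28.45, 5.71),
}
dedpo = {
    ("FiFA-5K", "SD1.5"): (21.91, 27.80, 5.43),
    ("FiFA-5K", "SDXL"): (22.83, 28.76, 5.77),
    ("HPDv2", "SD1.5"): (21.66, 27.70, 5.40),
    ("HPDv2", "SDXL"): (22.55, 28.56, 5.74),
}

# Per-metric gap: DeDPO (1250 human + 3750 synthetic) minus DPO with 5000 human labels.
deltas = {
    key: tuple(round(d - f, 2) for d, f in zip(dedpo[key], dpo_full[key]))
    for key in dpo_full
}

for (dataset, model), gap in deltas.items():
    print(f"{dataset} {model}: PS {gap[0]:+.2f}, HPSv2 {gap[1]:+.2f}, AS {gap[2]:+.2f}")

# Smallest gap across all settings and metrics.
worst_gap = min(value for gap in deltas.values() for value in gap)
print("worst gap:", worst_gap)
```

On these numbers the only negative gap is 0.01 PS on FiFA-5K with SDXL; every other entry is zero or positive.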

This review was generated by AI and reviewed by a human editor.