QUICK REVIEW

[Paper Review] Curriculum-DPO++: Direct Preference Optimization via Data and Model Curricula for Text-to-Image Generation

Florinel-Alin Croitoru, Vlad Hondru|arXiv (Cornell University)|Feb 13, 2026

Generative Adversarial Networks and Image Synthesis | Citations: 0

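One-Sentence Summary

By combining data-level and model-level curricula, Curriculum-DPO++ strengthens Direct Preference Optimization and introduces a reward-free ranking alternative based on prompt perturbation, improving text-to-image alignment, aesthetics, and human preference.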

ABSTRACT

Direct Preference Optimization (DPO) has been proposed as an effective and efficient alternative to reinforcement learning from human feedback (RLHF). However, neither RLHF nor DPO take into account the fact that learning certain preferences is more difficult than learning other preferences, rendering the optimization process suboptimal. To address this gap in text-to-image generation, we recently proposed Curriculum-DPO, a method that organizes image pairs by difficulty. In this paper, we introduce Curriculum-DPO++, an enhanced method that combines the original data-level curriculum with a novel model-level curriculum. More precisely, we propose to dynamically increase the learning capacity of the denoising network as training advances. We implement this capacity increase via two mechanisms. First, we initialize the model with only a subset of the trainable layers used in the original Curriculum-DPO. As training progresses, we sequentially unfreeze layers until the configuration matches the full baseline architecture. Second, as the fine-tuning is based on Low-Rank Adaptation (LoRA), we implement a progressive schedule for the dimension of the low-rank matrices. Instead of maintaining a fixed capacity, we initialize the low-rank matrices with a dimension significantly smaller than that of the baseline. As training proceeds, we incrementally increase their rank, allowing the capacity to grow until it converges to the same rank value as in Curriculum-DPO. Furthermore, we propose an alternative ranking strategy to the one employed by Curriculum-DPO. Finally, we compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art preference optimization approaches on nine benchmarks, outperforming the competing methods in terms of text alignment, aesthetics and human preference. Our code is available at https://github.com/CroitoruAlin/Curriculum-DPO.
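
The abstract describes growing the LoRA rank during training but does not say how already-learned weights carry over to the larger matrices. The sketch below shows one common way to do this, offered as an assumption rather than the authors' method: new rows of the down-projection `A` get small random values and new columns of the up-projection `B` are zero, so the product `B @ A` (the learned update) is unchanged at the moment of growth. The function names, the plain-list matrix representation, and the 0.01 standard deviation are all hypothetical.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def grow_lora_rank(B, A, new_rank, seed=0):
    """Grow LoRA factors from rank r to new_rank without changing B @ A.

    B: d_out x r (up-projection), A: r x d_in (down-projection).
    Assumption (not from the paper): new rows of A are small random values,
    new columns of B are zeros, so the extra rank contributes nothing yet.
    """
    old_rank = len(A)
    if new_rank < old_rank:
        raise ValueError("new_rank must be >= the current rank")
    d_in = len(A[0])
    extra = new_rank - old_rank
    rng = random.Random(seed)
    # Copy the learned rows of A, then append freshly initialized rows.
    A_new = [row[:] for row in A]
    A_new += [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(extra)]
    # Copy the learned columns of B, then append zero columns.
    B_new = [row[:] + [0.0] * extra for row in B]
    return B_new, A_new


# Usage: grow a rank-2 adapter to rank 4.
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]          # 3 x 2
A = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]  # 2 x 4
B4, A4 = grow_lora_rank(B, A, new_rank=4)
```

Because the added columns of `B` start at zero, the adapted model's outputs are identical immediately before and after the rank increase, while the new directions can still start learning through the gradients that reach `B`.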

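Motivation and Objectives

Address the suboptimal training of preference optimization by recognizing that some preferences are harder to learn than others.
Integrate the data-level curriculum with a new model-level curriculum that increases model capacity step by step during training.
Propose two mechanisms for expanding learning capacity: progressive layer unfreezing and increasing the LoRA rank.
Provide a reward-free alternative that ranks samples via prompt perturbation, and evaluate it on multiple benchmarks.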
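
As a rough illustration of the model-level curriculum named above, the sketch below maps a training step to a number of trainable layers and a LoRA rank. It is a minimal sketch under stated assumptions: the review does not give the actual schedule, so the evenly spaced stages, the linear interpolation, and every name and parameter here (`capacity_schedule`, `start_layers`, `start_rank`, and so on) are hypothetical.

```python
def capacity_schedule(step, total_steps, num_stages,
                      full_layers, full_rank,
                      start_layers=1, start_rank=1):
    """Return (trainable_layers, lora_rank) for a given training step.

    Assumption (not from the paper): training is split into num_stages
    equal-length stages and capacity grows linearly from the starting
    values to the full baseline values, which are reached in the last stage.
    """
    # Index of the current stage, clamped to the last stage.
    stage = min(step * num_stages // total_steps, num_stages - 1)
    # Fraction of the way from the starting capacity to the full capacity.
    frac = stage / (num_stages - 1) if num_stages > 1 else 1.0
    layers = start_layers + round(frac * (full_layers - start_layers))
    rank = start_rank + round(frac * (full_rank - start_rank))
    return layers, rank


# Usage: 4 stages, growing from 2 layers / rank 4 to 8 layers / rank 16.
schedule = [capacity_schedule(s, 100, 4, full_layers=8, full_rank=16,
                              start_layers=2, start_rank=4)
            for s in (0, 25, 50, 99)]
```

In a real training loop, the returned pair would decide which layers have gradients enabled and what rank the adapters should have at that step.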

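Proposed Method

Organize winner/loser image pairs by difficulty using a reward model, and build batches that progress from easy to hard.
Gradually increase model capacity by unfreezing more layers as training progresses.
Fine-tune with Low-Rank Adaptation (LoRA), applying a progressive rank schedule and transferring weights between rank updates.
Define a Consistency-DPO objective that compares the consistency distillation losses of the winning and losing samples through a sigmoid weighting.
Provide a curriculum that needs no reward model, obtaining an implicit difficulty signal by perturbing the prompt embeddings.
Compare Curriculum-DPO++ against Curriculum-DPO and other state-of-the-art methods on nine benchmarks, evaluating text alignment, aesthetics, and human preference.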
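
The data-level curriculum in the first step above can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes each generated image for a prompt already has a scalar reward score, and it assumes that a larger reward gap between winner and loser makes a pair easier. The review does not spell out the difficulty criterion, and `build_curriculum` and its arguments are hypothetical names.

```python
from itertools import combinations


def build_curriculum(rewards, num_stages):
    """Group (winner, loser) index pairs into easy-to-hard stages.

    rewards: one hypothetical reward score per image of the same prompt.
    Assumption (not from the review): a larger reward gap means an easier pair.
    Returns at most num_stages lists of (winner_index, loser_index) tuples.
    """
    pairs = []
    for i, j in combinations(range(len(rewards)), 2):
        if rewards[i] == rewards[j]:
            continue  # a tie gives no preference signal
        winner, loser = (i, j) if rewards[i] > rewards[j] else (j, i)
        pairs.append((rewards[winner] - rewards[loser], winner, loser))
    if not pairs:
        return []
    # Largest gap first, so the ordering runs from easy to hard.
    pairs.sort(key=lambda p: p[0], reverse=True)
    # Split the ordered pairs into contiguous chunks (ceiling division).
    stage_size = -(-len(pairs) // num_stages)
    return [[(w, l) for _, w, l in pairs[s:s + stage_size]]
            for s in range(0, len(pairs), stage_size)]


# Usage: four images for one prompt, split into three stages.
stages = build_curriculum([9, 1, 5, 3], num_stages=3)
```

Training would then consume `stages[0]` first and move on to later stages as it advances, so the hardest pairs (smallest reward gaps) are seen last.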

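Experimental Results

Research Questions

RQ1: Do the data-level curriculum and the accompanying model-level curriculum improve preference optimization for text-to-image generation over the existing Curriculum-DPO and Diffusion-DPO methods?
RQ2: Does gradually increasing learning capacity, by unfreezing layers and growing the LoRA rank, improve text alignment and aesthetics as training difficulty rises?
RQ3: Can a reward-free curriculum based on prompt perturbation rank samples adequately without an external reward model?
RQ4: How much better is Curriculum-DPO++ in terms of alignment, aesthetics, and human preference across multiple reward models and datasets?

Key Findings

Curriculum-DPO++ consistently outperforms Curriculum-DPO and other fine-tuning strategies on the evaluated benchmarks.
The data and model curricula improve text alignment, aesthetics, and human preference across the nine benchmarks.
The reward-free implicit ranking mechanism based on prompt perturbation offers a viable alternative when no auxiliary reward model is available.
In the reported evaluations, Curriculum-DPO++ outperforms state-of-the-art methods such as Diffusion-DPO and DDPO.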

This review was generated by AI and checked by a human editor.