[Paper Review] Latent Adversarial Regularization for Offline Preference Optimization
tldr: GANPO adds latent-space adversarial regularization to offline preference optimization, improving language model alignment and robustness at little additional cost.
Learning from human feedback typically relies on preference optimization that constrains policy updates through token-level regularization. However, preference optimization for language models is particularly challenging because token-space similarity does not imply semantic or behavioral similarity. To address this challenge, we leverage latent-space regularization for language model preference optimization. We introduce GANPO, which achieves latent-space regularization by penalizing divergence between the internal representations of a policy model and a reference model. Given that latent representations are not associated with explicit probability densities, we adopt an adversarial approach inspired by GANs to minimize latent-space divergence. We integrate GANPO as a regularizer into existing offline preference optimization objectives. Experiments across multiple model architectures and tasks show consistent improvements from latent-space regularization. Further, by comparing GANPO-induced inferential biases with those from token-level regularization, we find that GANPO provides more robust structural feedback under distributional shift and noise while maintaining comparable downstream performance with minor computational overhead.
Research Motivation and Objectives
- Motivate improved semantic alignment beyond token-space regularization in offline preference optimization.
- Introduce latent-space regularization that compares the latent representations of the policy and reference models.
- Develop a GAN-based, quad-representation framework to leverage paired preference data.
- Provide a plug-and-play regularizer that can be added to existing OPO objectives and demonstrate robustness under distributional shift and noise.
Proposed Method
- Define latent representations h_theta and h_ref from the policy and reference models.
- Augment offline preference optimization with a latent-space divergence regularizer D(p_theta || p_ref).
- Use a Jensen-Shannon divergence variational form and a relativistic average GAN (RaGAN) discriminator to implement the latent regularizer.
- Adopt a quad-tuple latent representation (h_ref^+, h_ref^-, h_theta^+, h_theta^-) with two discriminators (positive and negative) to model good and bad representations.
- Train discriminators with relativistic binary cross-entropy loss and update the policy to minimize the OPO loss plus adversarial regularization (L_adv).
- Provide a plug-and-play GANPO objective that can be combined with DPO or SimPO without changing the underlying training pipeline.
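The discriminator and policy updates listed above can be illustrated with a small sketch of the standard relativistic average (RaGAN) binary cross-entropy losses. This is not the authors' implementation. The function names `ragan_losses` and `ganpo_policy_loss`, the weight `lam`, and the pairing of reference latents as "real" and policy latents as "fake" within each discriminator are assumptions made for illustration. Plain Python floats stand in for the scores a discriminator would assign to latent representations.

```python
import math
from statistics import fmean


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ragan_losses(real_scores, fake_scores):
    """Relativistic average BCE losses from raw discriminator scores.

    Each score is judged relative to the mean score of the opposite group.
    Returns (discriminator_loss, generator_loss).
    """
    mean_real = fmean(real_scores)
    mean_fake = fmean(fake_scores)
    real_rel = [s - mean_fake for s in real_scores]
    fake_rel = [s - mean_real for s in fake_scores]
    # Note: log(1 - sigmoid(x)) == log_sigmoid(-x)
    d_loss = (-fmean([log_sigmoid(r) for r in real_rel])
              - fmean([log_sigmoid(-f) for f in fake_rel]))
    g_loss = (-fmean([log_sigmoid(f) for f in fake_rel])
              - fmean([log_sigmoid(-r) for r in real_rel]))
    return d_loss, g_loss


def ganpo_policy_loss(opo_loss, scores, lam=0.1):
    """OPO loss plus a weighted adversarial term (hypothetical combination).

    `scores` maps the quad-tuple to discriminator outputs:
    ref_pos / pol_pos come from the positive discriminator,
    ref_neg / pol_neg from the negative discriminator (assumed pairing).
    """
    _, adv_pos = ragan_losses(scores["ref_pos"], scores["pol_pos"])
    _, adv_neg = ragan_losses(scores["ref_neg"], scores["pol_neg"])
    return opo_loss + lam * (adv_pos + adv_neg)


# Example: scores for a batch of two preference pairs (made-up numbers).
scores = {
    "ref_pos": [1.2, 0.8], "pol_pos": [0.1, -0.3],
    "ref_neg": [0.5, 0.4], "pol_neg": [0.6, 0.2],
}
total = ganpo_policy_loss(opo_loss=0.69, scores=scores, lam=0.1)
```

When the discriminator cannot tell the two groups apart (all relative scores are zero), both losses equal 2 ln 2, which is the usual equilibrium value for this loss. In a real training loop the scores would be produced by the two discriminators from the hidden states, and `opo_loss` would be the DPO or SimPO loss.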
Experimental Results
Research Questions
- RQ1: Can latent-space regularization via adversarial discrimination improve offline preference optimization beyond token-space regularization?
- RQ2: Does GANPO provide robustness to stochastic decoding and distributional shift while preserving downstream performance?
- RQ3: How does the choice of discriminator architecture and latent-space representation impact alignment quality?
- RQ4: What is the computational overhead and stability trade-off compared to existing OPO methods?
- RQ5: Does GANPO improve performance on downstream benchmarks beyond preference alignment?
Key Findings
| Model | Method | Discriminator | Win rate (%) | LC win rate (%) | Length |
|---|---|---|---|---|---|
| Gemma2-2B-it | DPO | N/A | 22.76 | 27.79 | 1668 |
| Gemma2-2B-it | GANPO (DPO) | Transformer | 24.17 | 29.69 | 1664 |
| Gemma2-2B-it | SimPO | N/A | 30.66 | 36.03 | 1740 |
| Gemma2-2B-it | GANPO (SimPO) | Transformer | 31.37 | 36.74 | 1745 |
| Llama3-8B-Instruct | DPO | N/A | 33.90 | 32.34 | 2041 |
| Llama3-8B-Instruct | GANPO (DPO) | Transformer | 35.23 | 33.87 | 2043 |
| Llama3-8B-Instruct | SimPO | N/A | 44.09 | 48.31 | 1836 |
| Llama3-8B-Instruct | GANPO (SimPO) | Transformer | 46.11 | 50.48 | 1834 |
- GANPO yields consistent improvements over DPO and SimPO on AlpacaEval-2.0 across model scales in raw and length-controlled win rates.
- GANPO improves robustness to higher sampling temperatures, preserving alignment under stochastic decoding.
- Discriminator-based latent-space supervision remains correlated with oracle quality under high-entropy generation, whereas learned reward models can fail (reward hacking).
- Downstream evaluations show GANPO does not degrade, and can improve, performance on math, reasoning, and factuality benchmarks.
- Transformer-based discriminators outperform simpler architectures, highlighting the benefit of sequence-level latent feedback.
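To make the size of the AlpacaEval-2.0 improvements concrete, the short sketch below computes the gain of each GANPO variant over its baseline. The numbers are copied from the table above. The dictionary layout and the helper name `gains` are illustrative choices, not part of the paper.

```python
# (win rate, LC win rate) in percent, copied from the results table.
results = {
    ("Gemma2-2B-it", "DPO"): {"base": (22.76, 27.79), "ganpo": (24.17, 29.69)},
    ("Gemma2-2B-it", "SimPO"): {"base": (30.66, 36.03), "ganpo": (31.37, 36.74)},
    ("Llama3-8B-Instruct", "DPO"): {"base": (33.90, 32.34), "ganpo": (35.23, 33.87)},
    ("Llama3-8B-Instruct", "SimPO"): {"base": (44.09, 48.31), "ganpo": (46.11, 50.48)},
}


def gains(table):
    """Return GANPO minus baseline, in percentage points, per (model, method)."""
    return {
        key: tuple(round(g - b, 2) for b, g in zip(row["base"], row["ganpo"]))
        for key, row in table.items()
    }


for (model, method), (win, lc_win) in gains(results).items():
    print(f"{model} / {method}: win +{win:.2f}, LC win +{lc_win:.2f}")
```

The gains range from 0.71 to 2.17 percentage points, with the largest on Llama3-8B-Instruct with SimPO. Response lengths differ by at most 5 between each baseline and its GANPO variant.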
This review was generated by AI and checked by a human editor.