QUICK REVIEW

[Paper Review] Democratic Preference Alignment via Sortition-Weighted RLHF

Suvadip Sana, Jinzhou Wu|arXiv (Cornell University)|Feb 4, 2026
Game Theory and Voting Systems | Cited by 1
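One-Sentence Summary

The paper introduces DemPO, a sortition-based framework for preference-based model fine-tuning. It offers two schemes, Hard Panel (training only on a panel sampled to be demographically representative) and Soft Panel (training on all raters, weighted by inclusion probability), so that AI values better match those of a representative public. Both Hard Panel and Soft Panel outperform standard full-pool RLHF across model scales and aggregation methods, and the gains from paneling grow with model capacity.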

ABSTRACT

Whose values should AI systems learn? Preference-based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically over-represent some demographics and under-represent others. We introduce Democratic Preference Optimization, or DemPO, a framework that applies algorithmic sortition, the same mechanism used to construct citizen assemblies, to preference-based fine-tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota-satisfying mini-public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics and a seventy-five-clause constitution independently elicited from a representative United States panel, we evaluate Llama models from one billion to eight billion parameters fine-tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing as model capacity increases. These results demonstrate that enforcing demographic representativeness at the preference collection stage, rather than post hoc correction, yields models whose behavior better reflects values elicited from representative publics.

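Motivation and Objectives

  • Address the bias that convenience-sample rater pools introduce into preference alignment.
  • Introduce algorithmic sortition to build a demographically representative training signal.
  • Propose the Hard Panel and Soft Panel training schemes and state their objectives.
  • Evaluate representativeness-driven training using PRISM data and a constitution elicited from a representative US panel.
  • Analyze how panel-based gains scale with model size, and provide diagnostics.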

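Proposed Method

  • Use LEXIMIN sortition to construct a lottery over quota-feasible panels that match demographic marginals.
  • Define Hard Panel training on a single sampled panel S, normalizing each rater's contribution by N_i.
  • Define Soft Panel weighting, in which each rater i is weighted by their inclusion probability π_i under the sortition lottery.
  • Relate the Soft Panel objective to the equivalent Hard Panel objective with weights w_i.
  • Train models with Direct Preference Optimization (DPO) on multi-turn PRISM data.
  • Evaluate under six aggregation methods (Bradley–Terry, Plackett–Luce, Borda, Copeland, Kemeny-Young, Mallows), using the 75-clause constitution.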
Figure 1 : The DemPO pipeline for democratic preference alignment. A biased, self-selected pool of data labelers is transformed into a demographically representative mini-public via algorithmic sortition subject to population-derived quota constraints. Preferences from this representative panel (Har
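
The link between the two schemes can be checked numerically on a toy example. The sketch below is not the authors' code: the rater pool, the three-panel lottery (a stand-in for a LEXIMIN output), the log-probability margins, the value of beta, and the reading of N_i as rater i's number of comparisons are all assumptions made for illustration. It computes each rater's inclusion probability π_i from the lottery and confirms that the π-weighted (Soft Panel style) loss equals the expectation of the panel-restricted (Hard Panel style) loss, which follows from linearity of expectation.

```python
import math

def dpo_pair_loss(policy_margin, ref_margin, beta=0.1):
    """Standard DPO loss for one comparison: -log sigmoid(beta * (policy - ref)).
    Each margin is log p(chosen) - log p(rejected) under the respective model."""
    z = beta * (policy_margin - ref_margin)
    return math.log1p(math.exp(-z))  # identical to -log(sigmoid(z))

# Hypothetical pool: rater id -> list of (policy_margin, ref_margin) comparisons.
pool = {
    "a": [(1.0, 0.2), (0.3, 0.5)],
    "b": [(-0.4, 0.1)],
    "c": [(2.0, 0.0), (0.1, 0.1), (0.7, -0.2)],
    "d": [(0.5, 0.5)],
}

# Hypothetical lottery: probability of drawing each quota-feasible panel.
lottery = {
    frozenset({"a", "b"}): 0.50,
    frozenset({"a", "c"}): 0.25,
    frozenset({"c", "d"}): 0.25,
}

def rater_loss(rater):
    """Mean loss over a rater's comparisons (assumed meaning of the 1/N_i factor)."""
    pairs = pool[rater]
    return sum(dpo_pair_loss(p, r) for p, r in pairs) / len(pairs)

def inclusion_probabilities(lottery):
    """pi_i = total probability of the panels that contain rater i."""
    pi = {}
    for panel, prob in lottery.items():
        for rater in panel:
            pi[rater] = pi.get(rater, 0.0) + prob
    return pi

def hard_panel_loss(panel):
    """Loss using only the raters in one sampled panel."""
    return sum(rater_loss(r) for r in panel)

def expected_hard_panel_loss(lottery):
    """Average of the panel-restricted loss over the lottery."""
    return sum(prob * hard_panel_loss(panel) for panel, prob in lottery.items())

def soft_panel_loss(pi):
    """Loss over the whole pool, each rater weighted by pi_i."""
    return sum(weight * rater_loss(r) for r, weight in pi.items())

pi = inclusion_probabilities(lottery)
gap = abs(expected_hard_panel_loss(lottery) - soft_panel_loss(pi))
print(pi, gap)
```

In this toy lottery the inclusion probabilities sum to the panel size (2), and the two losses agree up to floating-point error. Whether the paper additionally normalizes by panel size or by the sum of weights is not stated in this summary, so the sketch leaves both objectives unnormalized.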

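Experimental Results

Research Questions

  • RQ1: Does enforcing demographic representativeness at the preference collection stage steer model behavior toward the values expressed by a representative public?
  • RQ2: How do Hard Panel and Soft Panel perform relative to the Full PRISM and US-Rep baselines across model scales?
  • RQ3: Do representativeness-oriented objectives align models with a constitution derived from the input of a representative public?
  • RQ4: How do panel-based gains scale with model size and vary across aggregation methods?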

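Key Findings

  • Hard Panel ranks highest under every aggregation method.
  • Soft Panel improves on the unweighted Full PRISM baseline overall.
  • Hard Panel outperforms US-Rep, and the gain grows with model scale.
  • Soft Panel's gain over Full PRISM increases with model size (1B → 3B → 8B).
  • Judge reliability checks show substantial agreement across rankings (Kendall τ ≈ 0.776, Fleiss' κ ≈ 0.710).
  • Constitution-based evaluation with an automated judge indicates that panel-based training aligns with the values of a representative public.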
Figure 2 : Model ranking under multiple aggregation methods (Llama-3.1-8B). Left: Borda and Copeland scores with 95% bootstrap confidence intervals, and Kemeny consensus summarized as rank-position probabilities under bootstrap resampling. Right: Bradley–Terry and Plackett–Luce log-ability scores wi
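
For readers unfamiliar with the rank-aggregation rules named in the caption, here is a minimal sketch of Borda and Copeland scoring. It is an illustration under stated assumptions, not the paper's evaluation code: the candidate labels "A", "B", "C" and the three rankings are made up, every ranking is assumed to be a complete, tie-free ordering of the same candidates, and the bootstrap confidence intervals mentioned in the caption are omitted.

```python
from itertools import combinations

def borda_scores(rankings):
    """Borda count: in a ranking of m candidates (best first),
    the candidate at position p earns m - 1 - p points."""
    scores = {}
    for ranking in rankings:
        m = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] = scores.get(candidate, 0) + (m - 1 - position)
    return scores

def copeland_scores(rankings):
    """Copeland: 1 point per pairwise majority win, 0.5 per pairwise tie."""
    candidates = sorted(rankings[0])
    scores = {c: 0.0 for c in candidates}
    for x, y in combinations(candidates, 2):
        # Count the rankings that place x ahead of y.
        x_wins = sum(r.index(x) < r.index(y) for r in rankings)
        y_wins = len(rankings) - x_wins
        if x_wins > y_wins:
            scores[x] += 1.0
        elif y_wins > x_wins:
            scores[y] += 1.0
        else:
            scores[x] += 0.5
            scores[y] += 0.5
    return scores

# Made-up rankings from three hypothetical judges, best candidate first.
toy_rankings = [
    ["A", "B", "C"],
    ["A", "C", "B"],
    ["B", "A", "C"],
]

borda = borda_scores(toy_rankings)        # {'A': 5, 'B': 3, 'C': 1}
copeland = copeland_scores(toy_rankings)  # {'A': 2.0, 'B': 1.0, 'C': 0.0}
print(borda, copeland)
```

Borda rewards consistently high placement, while Copeland only counts head-to-head majority wins, so the two can disagree on closer inputs; reporting several rules, as the figure does, guards against a result that depends on one aggregation choice.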

This review was generated by AI and reviewed by a human editor.