QUICK REVIEW

[Paper Review] Democratic Preference Alignment via Sortition-Weighted RLHF

Suvadip Sana, Jinzhou Wu | arXiv (Cornell University) | 2026. 02. 04.

Game Theory and Voting Systems | Citations: 1

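One-Line Summary

This paper introduces DemPO, a sortition-based framework for preference-based fine-tuning that aims to align AI values with a representative public through two training schemes: Hard Panel (training on a demographically representative sampled panel) and Soft Panel (reweighting raters by their inclusion probability). Both Hard Panel and Soft Panel outperform standard full-pool RLHF regardless of model scale and aggregation method, and the panel-based gains grow with model capacity.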

ABSTRACT

Whose values should AI systems learn? Preference based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically over represent some demographics and under represent others. We introduce Democratic Preference Optimization, or DemPO, a framework that applies algorithmic sortition, the same mechanism used to construct citizen assemblies, to preference based fine tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota satisfying mini public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics and a seventy five clause constitution independently elicited from a representative United States panel, we evaluate Llama models from one billion to eight billion parameters fine tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing as model capacity increases. These results demonstrate that enforcing demographic representativeness at the preference collection stage, rather than post hoc correction, yields models whose behavior better reflects values elicited from representative publics.

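Motivation and Objectives

Address the bias that convenience-sample rater pools introduce into preference-based alignment.
Introduce algorithmic sortition to construct a demographically representative training signal.
Present the Hard Panel and Soft Panel training schemes and their objectives.
Evaluate representativeness-centered training using PRISM data and a constitution elicited from a representative United States panel.
Analyze how panel-based gains scale with model size, and provide diagnostics.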

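Proposed Method

Uses LEXIMIN sortition to draw, by lottery, panels that satisfy the quota margins.
Defines Hard Panel training on a single sampled panel S, with per-rater N_i normalization.
Defines Soft Panel weights that weight each rater i by their inclusion probability π_i under the sortition lottery.
Relates the Soft Panel objective, with weights w_i, to the expected Hard Panel objective.
Trains models with Direct Preference Optimization (DPO) on multi-turn PRISM data.
Evaluates under six aggregation methods (Bradley–Terry, Plackett–Luce, Borda, Copeland, Kemeny-Young, Mallows) and a 75-clause constitution.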
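The relationship between the two objectives can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions, not the paper's implementation: it assumes the Hard Panel objective is the sum over panel members of each rater's mean per-comparison loss (the N_i normalization), that the Soft Panel weight is w_i = π_i with no further normalization, and that the per-comparison loss is a DPO-style logistic loss. The function names, the toy lottery, the margins, and beta = 0.1 are all hypothetical. Under these assumptions the equality follows from linearity of expectation, since each rater appears in the sampled panel with probability π_i.

```python
import math

def dpo_pair_loss(margin, beta=0.1):
    """DPO-style loss for one comparison: -log(sigmoid(beta * margin)).
    `margin` stands in for the policy-vs-reference log-ratio gap (toy input)."""
    return math.log1p(math.exp(-beta * margin))

def rater_mean_loss(margins, beta=0.1):
    # Per-rater normalization: divide by N_i, the rater's number of comparisons.
    return sum(dpo_pair_loss(m, beta) for m in margins) / len(margins)

def inclusion_probabilities(lottery):
    # pi_i = total probability of the panels that contain rater i.
    pi = {}
    for panel, prob in lottery:
        for rater in panel:
            pi[rater] = pi.get(rater, 0.0) + prob
    return pi

def hard_panel_objective(panel, margins_by_rater, beta=0.1):
    # Assumed form: sum of normalized rater losses over one sampled panel S.
    return sum(rater_mean_loss(margins_by_rater[i], beta) for i in panel)

def soft_panel_objective(pi, margins_by_rater, beta=0.1):
    # Assumed form: every rater is kept, weighted by w_i = pi_i.
    return sum(pi.get(i, 0.0) * rater_mean_loss(m, beta)
               for i, m in margins_by_rater.items())

# Hypothetical toy data: four raters with different numbers of comparisons.
margins_by_rater = {
    "a": [1.2, -0.4, 0.8],
    "b": [0.3],
    "c": [-1.0, 2.5],
    "d": [0.0, 0.6, -0.2, 1.1],
}
# Hypothetical lottery over size-2 panels; the probabilities sum to 1.
lottery = [(("a", "c"), 0.5), (("b", "c"), 0.25), (("b", "d"), 0.25)]

pi = inclusion_probabilities(lottery)
expected_hard = sum(p * hard_panel_objective(S, margins_by_rater)
                    for S, p in lottery)
soft = soft_panel_objective(pi, margins_by_rater)
print(pi)
print(expected_hard, soft)
```

The two printed objective values agree up to floating-point rounding. If the paper normalizes by panel size or rescales the weights, the identity would hold up to the corresponding constant factor.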

Figure 1 : The DemPO pipeline for democratic preference alignment. A biased, self-selected pool of data labelers is transformed into a demographically representative mini-public via algorithmic sortition subject to population-derived quota constraints. Preferences from this representative panel (Har

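Experimental Results

Research Questions

RQ1: Does enforcing demographic representativeness at the preference collection stage shift model behavior toward the values elicited from a representative public?
RQ2: How do Hard Panel and Soft Panel training compare with the Full PRISM and US-representative (US-Rep) baselines across model scales?
RQ3: Do representativeness-aware objectives align models with a constitution derived from the input of a representative public?
RQ4: How do panel-based gains scale with model size and aggregation method?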

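Key Results

Hard Panel ranks first under every aggregation method.
Soft Panel consistently improves over the unweighted Full PRISM baseline.
Hard Panel outperforms US-Rep, and its gains grow with model size.
Soft Panel's gains over Full PRISM grow with model size (1B→3B→8B).
Judge reliability is substantial, with rankings in close agreement (Kendall τ≈0.776, Fleiss' κ≈0.710).
Constitutional evaluation with automated judges suggests that panel-based training aligns models with the values of a representative public.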
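Two of the six aggregation methods named in this review, Borda and Copeland, are simple enough to sketch directly. The example below is a minimal illustration, assuming each judge returns a full ranking of the models from best to worst. The model labels and the three rankings are hypothetical placeholders, not results from the paper, and the Copeland variant used here (pairwise majority wins minus losses) is one common convention that may differ from the paper's.

```python
from itertools import combinations

def borda_scores(rankings):
    """Borda count: with m candidates, position p (0 = best) earns m - 1 - p points."""
    scores = {}
    for ranking in rankings:
        m = len(ranking)
        for pos, cand in enumerate(ranking):
            scores[cand] = scores.get(cand, 0) + (m - 1 - pos)
    return scores

def copeland_scores(rankings):
    """Copeland score: pairwise majority wins minus pairwise majority losses."""
    candidates = sorted(rankings[0])
    scores = {c: 0 for c in candidates}
    for x, y in combinations(candidates, 2):
        # Count the rankings that place x above y.
        x_wins = sum(r.index(x) < r.index(y) for r in rankings)
        y_wins = len(rankings) - x_wins
        if x_wins > y_wins:
            scores[x] += 1
            scores[y] -= 1
        elif y_wins > x_wins:
            scores[y] += 1
            scores[x] -= 1
        # A tied pair leaves both scores unchanged.
    return scores

# Hypothetical rankings from three judges (placeholder labels, not paper data).
rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_a", "model_c", "model_b"],
    ["model_b", "model_a", "model_c"],
]

borda = borda_scores(rankings)
copeland = copeland_scores(rankings)
print(borda)     # {'model_a': 5, 'model_b': 3, 'model_c': 1}
print(copeland)  # {'model_a': 2, 'model_b': 0, 'model_c': -2}
```

Both methods agree on the order in this toy case. Reporting several aggregation rules, as the review describes, guards against a conclusion that depends on one rule's treatment of close pairwise contests.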

Figure 2 : Model ranking under multiple aggregation methods (Llama-3.1-8B). Left: Borda and Copeland scores with 95% bootstrap confidence intervals, and Kemeny consensus summarized as rank-position probabilities under bootstrap resampling. Right: Bradley–Terry and Plackett–Luce log-ability scores wi

This review was generated by AI and reviewed by a human editor.