QUICK REVIEW

[Paper Review] A Roadmap to Pluralistic Alignment

Taylor Sorensen, Jared Moore|arXiv (Cornell University)|2024. 02. 07.

Ethics and Social Impacts of AI | Citations: 10

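One-Line Summary

This paper defines three forms of pluralism for AI models (Overton, Steerable, Distributional), proposes three classes of pluralistic benchmarks (multi-objective, trade-off steerable, jury-pluralistic), presents empirical evidence that current alignment procedures may reduce distributional pluralism, and lays out a research agenda for pluralistic evaluation and alignment.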

ABSTRACT

With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also formalize and discuss three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.

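Motivation and Objectives

Emphasizes the importance of pluralism in AI alignment, so that systems serve people with diverse values and perspectives.
Formalizes three operational definitions of pluralism in models: Overton, Steerable, and Distributional.
Proposes three classes of pluralistic benchmarks for evaluating models across diverse objectives and populations.
Argues that current alignment techniques may reduce distributional pluralism, and outlines directions for future research.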

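Proposed Method

A formal definition of Overton pluralism (presenting the full spectrum of reasonable responses) and mechanisms for operationalizing it.
A formal definition of Steerable pluralism (conditioning responses on a given attribute or perspective) and a way to measure how faithfully the model follows that steering.
A formal definition of Distributional pluralism (matching the target population's distribution over responses) and metrics for evaluating calibration.
Definitions of three benchmark families: multi-objective benchmarks, trade-off steerable benchmarks, and jury-pluralistic benchmarks.
A discussion of alignment procedures, with empirical observations suggesting that standard alignment such as RLHF may reduce distributional pluralism in post-alignment models.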
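
The calibration idea behind Distributional pluralism can be illustrated with a small sketch. It compares a model's answer distribution on one multiple-choice question with a population's answer distribution using the Jensen-Shannon divergence. The choice of metric, the function names, and all numbers below are assumptions made for illustration; this review does not specify the exact metric or data used in the paper.

```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    Returns 0 for identical distributions and 1 for distributions with
    disjoint support.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of options")

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b is the mixture, so b > 0
        # whenever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    mixture = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


# Hypothetical survey counts for one three-option question.
population = normalize([40, 35, 25])

# Hypothetical model answer distributions (illustrative values only).
spread_model = [0.45, 0.30, 0.25]      # probability mass spread like the population
collapsed_model = [0.98, 0.01, 0.01]   # nearly all mass on a single answer

print("spread model:   ", js_divergence(population, spread_model))
print("collapsed model:", js_divergence(population, collapsed_model))
```

A lower value means the model is closer to the population on that question. Under this sketch, a model that collapses onto one answer scores worse than one whose probabilities track the population, which is the kind of reduction in distributional pluralism the review describes.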

Figure 1 : Three kinds of pluralism in models.

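Experimental Results

Research Questions

RQ1: How can pluralism in AI systems be defined and operationalized beyond average human preferences?
RQ2: What benchmark designs are suitable for measuring a model's pluralism (Overton, Steerable, Distributional)?
RQ3: Do current alignment techniques (such as RLHF) reduce distributional pluralism, and if so, under what conditions?
RQ4: How can Overton, Steerable, and Distributional pluralism be implemented and evaluated in practical LLM applications?
RQ5: What future research is needed to develop pluralistic evaluation and alignment strategies?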

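Key Findings

Three formalizations of pluralism for models: Overton (the full spectrum of reasonable responses), Steerable (steering that stays faithful to a given attribute), and Distributional (a distribution calibrated to a population).
Three proposed benchmark classes: multi-objective benchmarks, trade-off steerable benchmarks, and jury-pluralistic benchmarks that explicitly model diverse human ratings.
Empirical and theoretical indications that standard alignment may reduce distributional pluralism, motivating further research on pluralistic evaluation and alignment approaches.
A discussion of practical limitations and applications for each type of pluralism and each benchmark class.
A roadmap and recommendations for future work toward pluralistic evaluation and alignment.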
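
To make the jury-pluralistic idea more concrete, the sketch below scores model responses from the ratings of several jurors and aggregates them with a pluggable welfare function. The function names, the 1-to-5 rating scale, the two welfare functions (mean and min), and all ratings are hypothetical; the review only states that these benchmarks explicitly model diverse human ratings, so this is one possible reading and not the paper's definition.

```python
from statistics import mean


def jury_score(ratings, welfare=mean):
    """Aggregate one response's per-juror ratings with a welfare function."""
    if not ratings:
        raise ValueError("each response needs at least one juror rating")
    return welfare(ratings)


def benchmark_score(per_item_ratings, welfare=mean):
    """Average the aggregated jury score over all benchmark items."""
    return mean([jury_score(r, welfare) for r in per_item_ratings])


# Hypothetical 1-5 ratings from three jurors on two prompts.
# Model A pleases two jurors and alienates the third.
model_a = [[5, 5, 1], [5, 4, 1]]
# Model B is moderately acceptable to every juror.
model_b = [[4, 4, 3], [4, 3, 3]]

for name, ratings in [("model_a", model_a), ("model_b", model_b)]:
    print(
        name,
        "mean welfare:", benchmark_score(ratings, welfare=mean),
        "min welfare:", benchmark_score(ratings, welfare=min),
    )
```

In this toy data the two models tie under the mean but differ under the min, which shows why keeping individual juror ratings, instead of a single averaged preference, can change which model a benchmark favors.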

Figure 2 : Three kinds of pluralistic benchmarks.
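
For multi-objective benchmarks, one common way to compare models without collapsing the objectives into a single number is Pareto dominance. The sketch below assumes that reading; the objective names, model names, and scores are hypothetical, and the review does not state how the paper compares models on this benchmark class.

```python
def pareto_dominates(a, b):
    """True if score vector a is at least as good as b on every objective
    and strictly better on at least one (higher is better)."""
    if len(a) != len(b):
        raise ValueError("score vectors must cover the same objectives")
    pairs = list(zip(a, b))
    return all(x >= y for x, y in pairs) and any(x > y for x, y in pairs)


def pareto_front(scores):
    """Return the names of models that no other model dominates."""
    return {
        name
        for name, vec in scores.items()
        if not any(
            pareto_dominates(other_vec, vec)
            for other_name, other_vec in scores.items()
            if other_name != name
        )
    }


# Hypothetical scores on two objectives: (helpfulness, harmlessness).
scores = {
    "model_a": (0.9, 0.4),
    "model_b": (0.6, 0.8),
    "model_c": (0.5, 0.3),
}

print(sorted(pareto_front(scores)))
```

Here model_c is worse than model_a on both objectives, while model_a and model_b represent different trade-offs and neither dominates the other, so both remain on the front.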


This review was generated by AI and reviewed by a human editor.