QUICK REVIEW

[Paper Review] Fine-tuning language models to find agreement among humans with diverse preferences

Michiel A. Bakker, Martin J. Chadwick | arXiv (Cornell University) | November 28, 2022

Topic Modeling · Citations: 109

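One-line summary

The authors fine-tune a 70B language model to generate consensus statements that maximize agreement across people with diverse opinions. Human raters prefer its statements over those from prompted-LLM baselines more than 70% of the time, and over the best human-written opinions more than 65% of the time.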

ABSTRACT

Recent work in large language modeling (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. Further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). We find that when we silently constructed consensus statements from only a subset of group members, those who were excluded were more likely to dissent, revealing the sensitivity of the consensus to individual contributions. These results highlight the potential to use LLMs to help groups of humans align their values with one another.

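Motivation and objectives

Investigate whether a group with diverse preferences can reach agreement on policy issues.
Collect diverse human opinions at scale and use them to train the model in a reinforcement-like reranking setup.
Develop a reward model that predicts whether an individual will agree with a consensus statement.
Explore social welfare functions as a way to aggregate individual preferences into a group consensus.
Evaluate sensitivity to opinions that are left out when the consensus statement is generated, and generalization to out-of-distribution issues.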

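Proposed method

Generate candidate consensus statements from the group's opinions using a 70B pretrained LLM (Chinchilla) in a prompt-driven loop.
To stabilize generation, train a supervised fine-tuned (SFT) model on high-quality consensus candidates.
Train a reward model that predicts an individual's agreement with a given consensus statement, conditioned on that individual's written opinion.
Rerank multiple candidate statements by their expected welfare under the chosen social welfare function and select the best one.
Sample the alpha parameter during training to cover the utilitarian-to-Rawlsian spectrum.
Evaluate the consensus statements with human ratings (agreement and quality) and compare them against baselines and human-written opinions.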
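The reranking step above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the welfare function is a power mean indexed by alpha (alpha = 1 gives the utilitarian mean, alpha = 0 the geometric mean, alpha = -inf the Rawlsian minimum), and it assumes the reward model's per-person agreement predictions are already available as positive numbers. The names `welfare` and `rerank` and the toy scores are hypothetical.

```python
import math
from typing import List, Sequence


def welfare(utilities: Sequence[float], alpha: float) -> float:
    """Power-mean welfare over strictly positive utilities (assumed form).

    alpha = 1      -> arithmetic mean (utilitarian)
    alpha = 0      -> geometric mean
    alpha = -inf   -> minimum (Rawlsian)
    """
    if not utilities or any(u <= 0 for u in utilities):
        raise ValueError("utilities must be non-empty and strictly positive")
    n = len(utilities)
    if alpha == -math.inf:
        return min(utilities)
    if alpha == 0:
        return math.exp(sum(math.log(u) for u in utilities) / n)
    return (sum(u ** alpha for u in utilities) / n) ** (1.0 / alpha)


def rerank(
    candidates: Sequence[str],
    predicted: Sequence[Sequence[float]],
    alpha: float,
) -> List[str]:
    """Sort candidates by welfare of their predicted per-person agreement."""
    scored = [(welfare(p, alpha), c) for c, p in zip(candidates, predicted)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]


# Toy, made-up predictions for a group of three people.
candidates = ["A", "B"]
predicted = [
    [0.9, 0.9, 0.2],  # A: two people strongly agree, one is left behind
    [0.6, 0.6, 0.6],  # B: everyone moderately agrees
]

utilitarian_pick = rerank(candidates, predicted, alpha=1.0)[0]
rawlsian_pick = rerank(candidates, predicted, alpha=-math.inf)[0]
```

With these toy numbers the utilitarian setting picks A (higher average agreement) and the Rawlsian setting picks B (higher minimum agreement), which is the trade-off the alpha parameter is meant to span.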

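Experimental results

Research questions

RQ1: Can an LLM generate consensus statements that a group with diverse opinions prefers?
RQ2: Does optimizing a social welfare function (e.g., utilitarian or Rawlsian) affect the quality and divisiveness of the consensus statements?
RQ3: Does the model generalize to out-of-distribution questions not seen during training?
RQ4: How sensitive is the consensus to which opinions are included in the prompt (the effect of excluding opinions)?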

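Key findings

The SFT-Utilitarian model outperforms the baselines on both average group agreement and worst-off (minimum) agreement ratings.
The model's consensus statements are preferred over human-written opinions in aggregate ratings, with a win rate above 65% against the best human-written opinion.
Across both in-distribution and out-of-distribution questions, the SFT-Utilitarian model maintains strong performance, which suggests it generalizes.
Excluding some participants' opinions when generating the consensus lowers group agreement by an average of 0.47 Likert points relative to the predicted value (included vs. excluded comparison).
Quality ratings improve across the pipeline: both SFT and reward modeling increase perceived quality.
In roughly 50% of rounds the position statements were non-divisive; in many of the divisive rounds, the model-generated consensus statements reduced divisiveness (e.g., a 65.6% reduction relative to the initial positions).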
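The included-vs-excluded comparison in the findings above comes down to a difference of mean agreement ratings between two groups of raters. The sketch below shows one simple way such a gap could be computed; it assumes each rating is a Likert score tagged with whether the rater's opinion was part of the prompt. The function name `exclusion_gap` is hypothetical, and the ratings are made-up toy numbers, not data from the paper.

```python
from statistics import mean
from typing import List, Tuple


def exclusion_gap(ratings: List[Tuple[int, bool]]) -> float:
    """Mean agreement of included raters minus mean agreement of excluded raters.

    Each item is (likert_score, was_included_in_prompt).
    """
    included = [score for score, inc in ratings if inc]
    excluded = [score for score, inc in ratings if not inc]
    if not included or not excluded:
        raise ValueError("need at least one included and one excluded rating")
    return mean(included) - mean(excluded)


# Toy, made-up ratings: two included raters, two excluded raters.
toy_ratings = [(7, True), (5, True), (5, False), (4, False)]
gap = exclusion_gap(toy_ratings)
```

A positive gap means raters whose opinions were left out of the prompt agreed less with the resulting statement than raters whose opinions were included.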


This review was generated by AI and reviewed by a human editor.