QUICK REVIEW

[Paper Review] Fine-tuning language models to find agreement among humans with diverse preferences

Michiel A. Bakker, Martin J. Chadwick|arXiv (Cornell University)|Nov 28, 2022

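Topic Modeling | Cited by 109

One-sentence summary

The authors fine-tune a 70B language model to generate consensus statements that maximize group approval across diverse opinions. The resulting statements are preferred over baselines and human opinions (>70% against baselines, >65% against the best human-written opinion).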

ABSTRACT

Recent work in large language modeling (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. Further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). We find that when we silently constructed consensus statements from only a subset of group members, those who were excluded were more likely to dissent, revealing the sensitivity of the consensus to individual contributions. These results highlight the potential to use LLMs to help groups of humans align their values with one another.

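Research motivation and objectives

Investigate whether large language models can help groups with diverse preferences reach agreement on policy questions.
Collect a large set of diverse human opinions and use them to train the model through reinforcement-learning-style reranking.
Develop a reward model that predicts how much each individual agrees with a consensus statement.
Explore the feasibility of using social welfare functions to aggregate individual preferences into a group consensus.
Evaluate generalization to out-of-distribution questions and sensitivity to excluded opinions.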

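Proposed method

In a prompt-driven loop, use a 70B pretrained language model (Chinchilla) to generate candidate consensus statements from the group's opinions.
Build a supervised fine-tuned (SFT) model on high-quality consensus candidates to stabilize generation.
Train a reward model to predict an individual's agreement with a given consensus statement, conditioned on that individual's own opinion.
Rerank multiple candidate statements by their predicted welfare under the chosen social welfare function and select the best one.
Sample the inequality-aversion parameter alpha during training to cover the utilitarian-to-Rawlsian spectrum.
Evaluate consensus statements using human ratings of agreement and quality, and compare them against baselines and human opinions.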
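
The reranking step can be illustrated with a minimal Python sketch. It assumes an isoelastic (power-mean) welfare family in which alpha = 0 gives the utilitarian mean and alpha approaching infinity gives the Rawlsian minimum; the paper's exact parametrization may differ. The names `welfare`, `select_consensus`, and `toy_predict`, along with the toy scores, are hypothetical stand-ins for the trained reward model and real data.

```python
import math
from typing import Callable, Sequence


def welfare(utilities: Sequence[float], alpha: float) -> float:
    """Isoelastic welfare over positive utilities (assumed family).

    alpha = 0   -> arithmetic mean (utilitarian)
    alpha = 1   -> geometric mean
    alpha = inf -> minimum (Rawlsian)
    """
    if not utilities or any(u <= 0 for u in utilities):
        raise ValueError("utilities must be non-empty and strictly positive")
    n = len(utilities)
    if math.isinf(alpha):
        return min(utilities)
    if math.isclose(alpha, 1.0):
        return math.exp(sum(math.log(u) for u in utilities) / n)
    p = 1.0 - alpha
    return (sum(u ** p for u in utilities) / n) ** (1.0 / p)


def select_consensus(
    candidates: Sequence[str],
    opinions: Sequence[str],
    predict_agreement: Callable[[str, str], float],
    alpha: float,
) -> str:
    """Return the candidate with the highest predicted group welfare."""
    def score(candidate: str) -> float:
        # One predicted agreement per participant, conditioned on their opinion.
        predicted = [predict_agreement(op, candidate) for op in opinions]
        return welfare(predicted, alpha)
    return max(candidates, key=score)


# Hypothetical stand-in for the reward model: a lookup of made-up scores.
OPINIONS = ["p1", "p2", "p3"]
CANDIDATES = ["A", "B"]
TOY_SCORES = {
    "A": [7.0, 7.0, 1.0],  # higher mean, but one strong dissenter
    "B": [5.0, 5.0, 4.0],  # lower mean, nobody left far behind
}


def toy_predict(opinion: str, candidate: str) -> float:
    return TOY_SCORES[candidate][OPINIONS.index(opinion)]


if __name__ == "__main__":
    for a in (0.0, 1.0, math.inf):
        print(a, select_consensus(CANDIDATES, OPINIONS, toy_predict, a))
```

With these toy scores, the utilitarian setting favors candidate A, while more inequality-averse settings favor candidate B, which shows why the choice of alpha matters for which statement is selected.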

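Experimental results

Research questions

RQ1: Can a language model generate consensus statements that are preferred by groups with diverse opinions?
RQ2: Does optimizing a social welfare function (e.g., utilitarian, Rawlsian) affect the quality and divisiveness of consensus statements?
RQ3: Does the model generalize to out-of-distribution questions not seen during training?
RQ4: How sensitive is the consensus to the specific opinions included in the prompt (the opinion-exclusion effect)?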
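
The utilitarian and Rawlsian objectives named in RQ2 can be viewed as two ends of one parametric family. The form below is the standard isoelastic (power-mean) welfare function, given here as an assumption for illustration; the paper's exact definition may differ. Here u_i > 0 is participant i's (predicted) agreement and alpha >= 0 is the inequality-aversion parameter.

```latex
\begin{align*}
W_\alpha(u_1,\dots,u_n) &=
\begin{cases}
\left(\dfrac{1}{n}\sum_{i=1}^{n} u_i^{\,1-\alpha}\right)^{\frac{1}{1-\alpha}}, & \alpha \ge 0,\ \alpha \ne 1,\\[2ex]
\left(\prod_{i=1}^{n} u_i\right)^{1/n}, & \alpha = 1,
\end{cases}
\\[1ex]
W_0(u_1,\dots,u_n) &= \frac{1}{n}\sum_{i=1}^{n} u_i \quad \text{(utilitarian)},
\\[1ex]
\lim_{\alpha\to\infty} W_\alpha(u_1,\dots,u_n) &= \min_{i} u_i \quad \text{(Rawlsian)}.
\end{align*}
```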

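Key findings

The SFT-Utilitarian model outperforms the baselines on both mean group agreement and minimum agreement (the worst-off participant).
The model's consensus statements are rated above human-written opinions overall (win rate above 65% on average).
The SFT-Utilitarian model maintains strong performance on both in-distribution and out-of-distribution questions, indicating that it generalizes.
Excluding some participants' opinions during consensus generation lowered predicted group agreement by 0.47 Likert points on average (comparing included and excluded participants).
Quality ratings improve across the pipeline (both SFT and reward modeling raise perceived quality).
In roughly 50% of rounds the position statements were not divisive, but in many divisive rounds the model-generated consensus statements reduced divisiveness (for example, 65.6% less divisive than the initial positions).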
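
The win-rate figures quoted above are pairwise preference rates. As a minimal sketch, assume each participant rates both statements on the same Likert scale and that a tie counts as half a win; this summary does not state the paper's tie handling, so that convention is an assumption. The function name `win_rate` is hypothetical and the ratings are made up for illustration, not data from the paper.

```python
from typing import Sequence


def win_rate(ratings_a: Sequence[float], ratings_b: Sequence[float]) -> float:
    """Share of paired ratings in which A beats B; ties count as 0.5 (assumed)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    points = 0.0
    for a, b in zip(ratings_a, ratings_b):
        if a > b:
            points += 1.0
        elif a == b:
            points += 0.5
    return points / len(ratings_a)


# Made-up Likert ratings from four participants, for illustration only.
model_ratings = [6, 5, 7, 4]
human_ratings = [5, 5, 3, 6]

if __name__ == "__main__":
    print(win_rate(model_ratings, human_ratings))
```

A win rate of 0.5 would mean the two statement sources are rated equally often above one another, so values above 0.5 indicate a preference for the first source.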

This review was generated by AI and reviewed by a human editor.