QUICK REVIEW

[Paper Review] Fine-tuning language models to find agreement among humans with diverse preferences

Michiel A. Bakker, Martin J. Chadwick|arXiv (Cornell University)|Nov 28, 2022

Topic Modeling | Citations: 109

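One-sentence summary

The authors fine-tune a 70B language model to generate consensus statements that maximize group approval across diverse opinions; these statements win more approval than both baseline outputs and human-written opinions (preferred >70% of the time over the baseline and >65% over the best human opinion).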

ABSTRACT

Recent work in large language modeling (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. Further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). We find that when we silently constructed consensus statements from only a subset of group members, those who were excluded were more likely to dissent, revealing the sensitivity of the consensus to individual contributions. These results highlight the potential to use LLMs to help groups of humans align their values with one another.

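Motivation and objectives

Investigate whether LLMs can help groups with diverse preferences find agreement on policy questions.
Collect a large, diverse set of human opinions and use it to train the model with a reinforcement-learning-like reranking scheme.
Develop a reward model that predicts each individual's agreement with a consensus statement.
Examine the use of social welfare functions to aggregate individual preferences into a group-level consensus.
Evaluate generalization to out-of-distribution questions and sensitivity to excluded opinions.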

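Proposed method

Use a 70B pretrained LLM (Chinchilla) in a prompt-driven loop to generate candidate consensus statements from the group's opinions.
Build a supervised fine-tuned (SFT) model from high-quality consensus candidates to stabilize generation.
Train a reward model that predicts an individual's agreement with a given consensus statement, conditioned on that individual's opinion.
Rerank multiple candidate statements by their predicted welfare under the chosen social welfare function, and select the best one.
Sample the inequality-aversion parameter α during training to cover the utilitarian-to-Rawlsian spectrum.
Evaluate consensus statements through human agreement and quality ratings, and compare them against baselines and human-written opinions.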
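
The reranking step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the review does not give the welfare formula, so the generalized-mean family below (α = 0 as the utilitarian mean, α → ∞ as the Rawlsian minimum) is an assumed parametrization, and `welfare`, `rerank`, `toy_reward`, and `TOY_SCORES` are hypothetical names. A lookup table of made-up scores stands in for the trained reward model.

```python
import math
from typing import Callable, List, Sequence, Tuple


def welfare(utilities: Sequence[float], alpha: float) -> float:
    """Generalized-mean social welfare with inequality aversion alpha (assumed form).

    alpha = 0 gives the utilitarian mean, alpha = 1 the geometric mean,
    and alpha = inf the Rawlsian minimum. Utilities must be positive.
    """
    if not utilities or any(u <= 0 for u in utilities):
        raise ValueError("utilities must be non-empty and strictly positive")
    n = len(utilities)
    if math.isinf(alpha):
        return min(utilities)
    if alpha == 1.0:
        return math.exp(sum(math.log(u) for u in utilities) / n)
    power = 1.0 - alpha
    return (sum(u ** power for u in utilities) / n) ** (1.0 / power)


def rerank(
    candidates: Sequence[str],
    opinions: Sequence[str],
    reward_fn: Callable[[str, str], float],
    alpha: float,
) -> List[Tuple[float, str]]:
    """Score each candidate by the welfare of predicted per-person agreement."""
    scored = []
    for candidate in candidates:
        utilities = [reward_fn(opinion, candidate) for opinion in opinions]
        scored.append((welfare(utilities, alpha), candidate))
    # Highest predicted welfare first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical predicted agreement scores; a real system would call a
# trained reward model conditioned on each person's written opinion.
TOY_SCORES = {
    ("p1", "A"): 7.0, ("p2", "A"): 7.0, ("p3", "A"): 1.0,
    ("p1", "B"): 5.0, ("p2", "B"): 5.0, ("p3", "B"): 4.0,
}


def toy_reward(opinion: str, candidate: str) -> float:
    return TOY_SCORES[(opinion, candidate)]


if __name__ == "__main__":
    people = ["p1", "p2", "p3"]
    print(rerank(["A", "B"], people, toy_reward, alpha=0.0)[0][1])       # utilitarian pick
    print(rerank(["A", "B"], people, toy_reward, alpha=math.inf)[0][1])  # Rawlsian pick
```

With these toy scores the utilitarian setting selects candidate A (higher average agreement), while the Rawlsian setting selects candidate B (higher agreement for the least satisfied person), which shows why the choice of α matters for the selected statement.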

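Experimental results

Research questions

RQ1: Can a language model generate consensus statements that are preferred by groups with diverse opinions?
RQ2: Does optimizing a social welfare function (e.g., utilitarian, Rawlsian) affect the quality or divisiveness of consensus statements?
RQ3: Does the model generalize to out-of-distribution questions not seen during training?
RQ4: How sensitive is the consensus to the specific opinions included in the prompt (the opinion-exclusion effect)?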

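Key findings

The SFT-Utilitarian model outperforms the baselines on both mean group agreement and minimum (worst-off) agreement ratings.
The model's consensus statements are preferred over human-written opinions in aggregate ratings (reaching a win rate of >65% on average).
Across in-distribution and out-of-distribution questions, the SFT-Utilitarian model maintains strong performance, indicating that it generalizes.
Excluding some participants' opinions when generating the consensus lowers estimated group agreement by an average of 0.47 Likert-scale points (inclusion vs. exclusion comparison).
Quality ratings improve across the pipeline (SFT and reward modeling both raise perceived quality).
Position statements were non-divisive in roughly 50% of rounds, but the model also produced consensus statements in many divisive rounds and reduced divisiveness (e.g., a 65.6% reduction in divisiveness relative to the initial positions).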
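
The agreement metrics named in these findings can be illustrated with a small sketch. The exact definitions used in the paper are not given in this review, so the ones below are assumptions: mean and minimum agreement are taken over one group's Likert ratings, and the win rate counts the share of participants who rate statement A above statement B, with ties counted as half a win. The function names and the ratings are hypothetical, purely for illustration.

```python
from statistics import mean
from typing import Dict, Sequence


def group_metrics(ratings: Sequence[float]) -> Dict[str, float]:
    """Mean and worst-off (minimum) agreement for one statement."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    return {"mean": mean(ratings), "min": min(ratings)}


def win_rate(ratings_a: Sequence[float], ratings_b: Sequence[float]) -> float:
    """Share of participants preferring A over B; ties count as half (assumed)."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    wins = 0.0
    for a, b in zip(ratings_a, ratings_b):
        if a > b:
            wins += 1.0
        elif a == b:
            wins += 0.5
    return wins / len(ratings_a)


if __name__ == "__main__":
    # Made-up Likert ratings from four participants, not data from the paper.
    model_statement = [5, 6, 3, 4]
    human_opinion = [4, 6, 5, 2]
    print(group_metrics(model_statement))
    print(win_rate(model_statement, human_opinion))
```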


This review was generated by AI and checked by a human editor.