
[Paper summary] Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback

Hannah Rose Kirk, Bertie Vidgen|arXiv (Cornell University)|Mar 9, 2023
Open Source Software Innovations | Cited by 23
One-sentence summary

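This paper proposes a taxonomy of the benefits and risks of personalised large language models (LLMs), together with a three-tiered policy framework for governing personalised alignment within societal bounds that distinguishes individual from societal impacts.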

ABSTRACT

Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years due to integration in product interfaces like ChatGPT or search engines like Bing. This intensifies the need to ensure that models are aligned with human preferences and do not produce unsafe, inaccurate or toxic outputs. While alignment techniques like reinforcement learning with human feedback (RLHF) and red-teaming can mitigate some safety concerns and improve model capabilities, it is unlikely that an aggregate fine-tuning process can adequately represent the full range of users' preferences and values. Different people may legitimately disagree on their preferences for language and conversational norms, as well as on values or ideologies which guide their communication. Personalising LLMs through micro-level preference learning processes may result in models that are better aligned with each user. However, there are several normative challenges in defining the bounds of a societally-acceptable and safe degree of personalisation. In this paper, we ask how, and in what ways, LLMs should be personalised. First, we review literature on current paradigms for aligning LLMs with human feedback, and identify issues including (i) a lack of clarity regarding what alignment means; (ii) a tendency of technology providers to prescribe definitions of inherently subjective preferences and values; and (iii) a 'tyranny of the crowdworker', exacerbated by a lack of documentation in who we are really aligning to. Second, we present a taxonomy of benefits and risks associated with personalised LLMs, for individuals and society at large. Finally, we propose a three-tiered policy framework that allows users to experience the benefits of personalised alignment, while restraining unsafe and undesirable LLM-behaviours within (supra-)national and organisational bounds.

Motivation and objectives

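  • Explain why LLMs need explicit personalisation rather than aggregate alignment alone.
  • Characterise the benefits and risks of personalised LLMs at the individual and societal levels.
  • Develop a three-tiered governance framework that enables safe, bounded personalisation.
  • Highlight the normative and practical challenges of defining alignment and personalisation.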

Proposed approach

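  • Review the existing alignment literature, RLHF and the crowdworker problem to identify gaps in current approaches.
  • Build a taxonomy of benefits and risks drawn from work on AI, LLMs, recommender systems and related internet technologies.
  • Propose a three-tiered policy framework: (supra-)national bounds, provider-driven bounds and end-user customisation.
  • Outline future work, including refining the taxonomy through stakeholder interviews.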
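The nesting of the three tiers can be illustrated as bounds in which outer tiers constrain what inner tiers may allow. The Python below is a minimal sketch under illustrative assumptions, not the authors' method: the `Tier` class, the `apply_bounds` function and the behaviour labels (`incite_violence`, `medical_diagnosis`, `informal_tone`) are hypothetical, and real bounds would not reduce to string labels. It shows only the core property of "personalisation within bounds": an end-user customisation is granted only if no outer tier forbids it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    """Hypothetical model of one tier of bounds: a label and the behaviours it prohibits."""
    name: str
    forbidden: frozenset


def apply_bounds(user_requests, tiers):
    """Split end-user requests into granted ones and ones blocked by an outer tier.

    Tiers are ordered outermost first (e.g. legal bounds, then provider bounds).
    A request is attributed to the first tier that forbids it, so users can only
    personalise within the space left open by every outer tier.
    """
    granted, blocked = [], {}
    for request in user_requests:
        blocker = next((t.name for t in tiers if request in t.forbidden), None)
        if blocker is None:
            granted.append(request)
        else:
            blocked[request] = blocker
    return granted, blocked


# Hypothetical labels, purely illustrative.
legal = Tier("national/supra-national", frozenset({"incite_violence"}))
provider = Tier("provider", frozenset({"medical_diagnosis"}))

granted, blocked = apply_bounds(
    ["informal_tone", "medical_diagnosis", "incite_violence"], [legal, provider]
)
print(granted)  # ['informal_tone']
print(blocked)  # {'medical_diagnosis': 'provider', 'incite_violence': 'national/supra-national'}
```

Ordering the tiers outermost first is a design choice of this sketch: it records which tier is responsible for a refusal, which mirrors the framework's distinction between immutable (supra-)national limits and narrower provider-set limits.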

Results

Research questions

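  • RQ1: What are the potential benefits and risks of personalised LLMs at the individual and societal levels?
  • RQ2: How can personalisation be constrained to stay within safe and acceptable bounds while preserving its benefits?
  • RQ3: Which governance structure best supports explicit personalisation without amplifying harms?
  • RQ4: What normative challenges arise in defining alignment, values and end-user representation for personalised LLMs?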

Main findings

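  • Personalised LLMs can improve efficiency, utility, autonomy and empathy, but they also introduce risks such as added user effort, addiction, bias reinforcement, privacy concerns and anthropomorphism.
  • Risks that arise in individual interactions can accumulate into societal-level consequences such as polarisation, unequal access and labour displacement.
  • The three-tiered framework balances benefits against safety by combining immutable (supra-)national bounds, provider-imposed constraints and end-user adaptation.
  • Alignment should shift from implicit, crowdworker-based approaches towards explicit personalisation grounded in individual end users' contexts.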


This summary was generated by AI and reviewed by human editors.