QUICK REVIEW

[Paper Review] Learning the Value Systems of Societies with Preference-based Multi-objective Reinforcement Learning

Andrés Holgado-Sánchez, Peter Vamplew|arXiv (Cornell University)|Feb 9, 2026

Ethics and Social Impacts of AI | Cited by 0

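One-sentence summary

This paper proposes learning the value systems of a society in Markov Decision Processes (MDPs) by clustering agents and applying preference-based multi-objective reinforcement learning (PbMORL), deriving a value-aligned policy for each group.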

ABSTRACT

Value-aware AI should recognise human values and adapt to the value systems (value-based preferences) of different users. This requires operationalization of values, which can be prone to misspecification. The social nature of values demands their representation to adhere to multiple users while value systems are diverse, yet exhibit patterns among groups. In sequential decision making, efforts have been made towards personalization for different goals or values from demonstrations of diverse agents. However, these approaches demand manually designed features or lack value-based interpretability and/or adaptability to diverse user preferences. We propose algorithms for learning models of value alignment and value systems for a society of agents in Markov Decision Processes (MDPs), based on clustering and preference-based multi-objective reinforcement learning (PbMORL). We jointly learn socially-derived value alignment models (groundings) and a set of value systems that concisely represent different groups of users (clusters) in a society. Each cluster consists of a value system representing the value-based preferences of its members and an approximately Pareto-optimal policy that reflects behaviours aligned with this value system. We evaluate our method against a state-of-the-art PbMORL algorithm and baselines on two MDPs with human values.

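Motivation and objectives

Motivate value-aware AI by capturing diverse value-based preferences and addressing the misspecification of value groundings.
Represent a society through a social grounding and multiple value systems, each corresponding to a cluster of agents.
Develop an online PbMORL method that jointly learns value alignment models (groundings) and value-system clusters, and yields Pareto-efficient policies.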

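Proposed method

Represent values as a set V, with alignment utilities and grounding functions defined over MDP trajectories.
Define a value system as a linear scalarization of the multi-objective reward, with a separate weight vector for each agent.
Formulate a bi-level optimization that learns a societal value system, achieving representativeness and conciseness while maximizing consistency with individual groundings.
Use deep learning, combining a reward-vector network with multiple weight networks, to learn the grounding and the value systems.
Apply the Bradley-Terry model to encode preferences over trajectory pairs, covering both value alignment and value-based preferences.
Use Envelope Q-Learning to learn a set of Pareto-efficient policies conditioned on weight vectors, representing different value systems.
Introduce online human-in-the-loop feedback to push the learned value systems closer to actual agent behaviour.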
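
The linear scalarization and Bradley-Terry steps listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`scalarize`, `bt_preference_prob`, `bt_loss`), the two-value example and all numbers are hypothetical, and the hand-written alignment vectors stand in for what a learned reward-vector network would output per trajectory.

```python
import math

def scalarize(alignment, weights):
    """Linear scalarization: weighted sum of per-value alignment scores."""
    return sum(w * x for w, x in zip(weights, alignment))

def bt_preference_prob(align_a, align_b, weights):
    """Bradley-Terry probability that trajectory A is preferred over B,
    computed from the difference of their scalarized alignments."""
    diff = scalarize(align_a, weights) - scalarize(align_b, weights)
    return 1.0 / (1.0 + math.exp(-diff))

def bt_loss(pairs, labels, weights):
    """Mean cross-entropy between predicted and observed preferences.
    labels[i] is 1.0 if the first trajectory of pairs[i] was preferred."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for (a, b), y in zip(pairs, labels):
        p = bt_preference_prob(a, b, weights)
        total -= y * math.log(p + eps) + (1.0 - y) * math.log(1.0 - p + eps)
    return total / len(pairs)

# Hypothetical alignment vectors of two trajectories with respect to two values.
traj_a = (1.0, 0.0)  # fully aligned with value 1 only
traj_b = (0.0, 1.0)  # fully aligned with value 2 only

# Hypothetical value systems (weight vectors) of two agents.
w_agent1 = (0.8, 0.2)
w_agent2 = (0.2, 0.8)

p1 = bt_preference_prob(traj_a, traj_b, w_agent1)
p2 = bt_preference_prob(traj_a, traj_b, w_agent2)
print(f"agent 1 prefers A with probability {p1:.3f}")
print(f"agent 2 prefers A with probability {p2:.3f}")
```

The same pair of trajectories gets opposite preference probabilities under the two weight vectors, which is the sense in which one shared grounding can serve agents with different value systems. Minimizing `bt_loss` over observed comparisons is one common way to fit such weights; whether the paper uses exactly this loss is not stated in this review.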

Figure 1. FF environment. Approximated Pareto front and clusters learned with PbMORL (top) and SVSL-P (bottom, ours) with a particular seed. Black squares form the ground-truth Pareto front. White dots depict weights whose policies are in the approximated front. Coloured dots indicate the policies r

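Experimental results

Research questions

RQ1: Can a society's value grounding be learned from trajectory preferences and value alignment data?
RQ2: How can we discover a set of value systems that concisely represents different groups while remaining representative of individual preferences?
RQ3: Do the learned value systems yield Pareto-efficient policies aligned with each group's values?
RQ4: Does online human-in-the-loop feedback improve consistency with actual agent behaviour when learning value systems?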

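Key findings

The proposed SVSL-P method learns a social grounding together with multiple value systems, each corresponding to a cluster of agents.
The bi-level optimization balances the representativeness and conciseness of the societal value systems while maximizing grounding consistency.
By optimizing the weighted reward vector, the method produces a Pareto-efficient, value-aligned policy for each cluster.
Online human-in-the-loop (HiL) feedback steers the learned value systems closer to actual agent behaviour, improving alignment.
The method's performance in learning societal value systems is evaluated against baselines and a state-of-the-art PbMORL algorithm.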
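
The link between cluster weights and Pareto-efficient policies can be illustrated with a small sketch. It assumes each candidate policy is summarised by a hypothetical vector of expected alignments with two values; the candidate vectors, the cluster names and the helper functions (`dominates`, `pareto_front`, `best_for_weights`) are illustrative assumptions, not taken from the paper.

```python
def dominates(u, v):
    """True if u is at least as good as v on every value and strictly better on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(points):
    """Keep only the non-dominated alignment vectors."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def best_for_weights(points, weights):
    """Pick the vector that maximises the linearly scalarized alignment."""
    return max(points, key=lambda p: sum(w * x for w, x in zip(weights, p)))

# Hypothetical expected alignments (value 1, value 2) of five candidate policies.
candidates = [(1.0, 0.0), (0.7, 0.6), (0.5, 0.5), (0.0, 1.0), (0.2, 0.3)]
front = pareto_front(candidates)

# Hypothetical value systems (weight vectors) of three clusters.
cluster_weights = {
    "cluster_0": (0.9, 0.1),
    "cluster_1": (0.5, 0.5),
    "cluster_2": (0.1, 0.9),
}
chosen = {name: best_for_weights(front, w) for name, w in cluster_weights.items()}

print("Pareto front:", front)
for name, point in chosen.items():
    print(name, "->", point)
```

With strictly positive weights, any maximiser of the linear scalarization is Pareto-optimal, so selecting a policy per cluster this way never leaves the front. A learned front is only approximate, which is why the abstract speaks of approximately Pareto-optimal policies.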

Figure 2. FF environment. Pareto front and clusters learned with PbMORL with the 10 different seeds. Black squares indicate the known Pareto front of the environment in terms of the alignment with the two values. White dots depict weights whose policies are in the learned front with each method. Col


This review was generated by AI and reviewed by a human editor.