QUICK REVIEW

[Paper Digest] A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation

Runzhe Yang, Xingyuan Sun|arXiv (Cornell University)|Aug 21, 2019

Reinforcement Learning in Robotics | References: 48 | Cited by: 108

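One-Sentence Summary

Proposes envelope Q-learning for multi-objective reinforcement learning with linear preferences, so that a single policy network can adapt across the entire preference space and infer preferences from only a few samples.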

ABSTRACT

We introduce a new algorithm for multi-objective reinforcement learning (MORL) with linear preferences, with the goal of enabling few-shot adaptation to new tasks. In MORL, the aim is to learn policies over multiple competing objectives whose relative importance (preferences) is unknown to the agent. While this alleviates dependence on scalar reward design, the expected return of a policy can change significantly with varying preferences, making it challenging to learn a single model to produce optimal policies under different preference conditions. We propose a generalized version of the Bellman equation to learn a single parametric representation for optimal policies over the space of all possible preferences. After an initial learning phase, our agent can execute the optimal policy under any given preference, or automatically infer an underlying preference with very few samples. Experiments across four different domains demonstrate the effectiveness of our approach.

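Motivation and Objectives

Address the challenge of learning policies in MORL when the linear preference is unknown.
Propose a generalized Bellman framework and convex-envelope updates so that one policy network covers the preference space.
Provide theoretical convergence results for envelope MOQ-learning and show that it scales to deep networks.
Enable few-shot adaptation to new tasks and inference of hidden preferences.
Evaluate on four domains, showing better learning and adaptation than the baselines.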

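Proposed Method

Formulates multi-objective Q-values as an MOQ function Q(s,a,ω) and defines a multi-objective Bellman-style operator with an envelope-based optimality filter H.
Proposes envelope MOQ-learning (Algorithm 1), which updates Q using the convex envelope of the current solution frontier so that it aligns with any given linear preference ω.
Proves that the envelope operator T is a contraction whose fixed point corresponds to the optimal value function under each preference; introduces a multi-objective analogue of the Banach fixed-point theorem.
Represents Q with a single deep network that takes (state, ω) as input and outputs m × |A| values (one m-dimensional Q-vector per action, where m is the number of objectives); it is optimized with the combined loss L = (1−λ)L^A + λL^B, using homotopy optimization to gradually shift the emphasis from fitting rewards (L^A) to aligning with utility (L^B).
Uses hindsight-experience-replay-style preference resampling and mini-batch envelope updates to improve sample efficiency; in the policy-adaptation phase, when needed, infers ω from scalar rewards using policy gradient combined with stochastic search.
Evaluates CR (coverage ratio), AE (adaptation error), and Avg.UT (average utility) on four domains: DST, FTN, Dialog, and Super Mario.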
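The envelope update and the combined loss described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the array layout of `q_next`, the function names, and the concrete forms chosen for L^A (squared error on the Q-vector) and L^B (absolute error on the scalarized utility) are assumptions made here for illustration.

```python
import numpy as np

def envelope_target(reward, q_next, omega, gamma=0.99):
    """Envelope-style target for a single transition (illustrative sketch).

    reward: (m,) vector reward for the transition.
    q_next: (W, A, m) Q-vectors at the next state, for W sampled
            preferences and A actions (hypothetical layout).
    omega:  (m,) preference the target is computed for.
    """
    # Flatten all (preference, action) candidates into one list of m-vectors.
    candidates = q_next.reshape(-1, q_next.shape[-1])  # (W*A, m)
    # Pick the candidate with the highest utility under omega,
    # maximizing over actions AND sampled preferences.
    best = np.argmax(candidates @ omega)
    return reward + gamma * candidates[best]

def combined_loss(q_pred, target, omega, lam):
    """L = (1 - lam) * L^A + lam * L^B for one sample."""
    loss_a = np.sum((target - q_pred) ** 2)            # assumed form of L^A
    loss_b = np.abs(omega @ target - omega @ q_pred)   # assumed form of L^B
    return (1.0 - lam) * loss_a + lam * loss_b
```

In a training loop, `lam` would be raised from 0 toward 1 over time, which is one way to read the homotopy schedule mentioned above.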

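Experimental Results

Research Questions

RQ1: Can a single policy network efficiently cover the entire CCS (convex coverage set) of an MOMDP with linear preferences, and adapt quickly at test time to any given ω?
RQ2: Does envelope Q-learning provide theoretical convergence guarantees and better sample efficiency than scalarized MORL methods?
RQ3: How well does the proposed method scale to higher-dimensional preference spaces and larger state/action spaces?
RQ4: Can the trained model infer hidden preferences from limited samples when adapting to a new task?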
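To make the CCS in RQ1 concrete, the sketch below takes a handful of made-up return vectors for two objectives and finds which of them is optimal for at least one linear preference ω. The return values, the function names, and the grid over preferences are all hypothetical choices for illustration; the grid only approximates the CCS and only handles two objectives.

```python
import numpy as np

def best_return(returns, omega):
    """Return vector with the highest utility omega . r."""
    return returns[np.argmax(returns @ omega)]

def approximate_ccs(returns, n_prefs=101):
    """Indices of return vectors that win for some preference on a grid.

    Only valid for two objectives: preferences are (t, 1 - t), t in [0, 1].
    """
    winners = set()
    for t in np.linspace(0.0, 1.0, n_prefs):
        omega = np.array([t, 1.0 - t])
        winners.add(int(np.argmax(returns @ omega)))
    return sorted(winners)

# Hypothetical returns of five policies on two objectives.
returns = np.array([
    [10.0, 0.0],
    [7.0, 7.0],
    [0.0, 10.0],
    [9.0, 2.0],   # Pareto-optimal, but never best under a linear preference
    [4.0, 4.0],   # dominated by [7, 7]
])
ccs_indices = approximate_ccs(returns)
```

The fourth vector shows why the CCS can be smaller than the Pareto front: no other vector beats it on both objectives, yet some other vector always has higher utility for every linear preference.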

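Key Findings

Compared with the baselines, Envelope MORL achieves the best learning and adaptation performance in all four domains.
In the Dialog task, Envelope MORL significantly outperforms scalarized MORL in average user utility.
In Super Mario with random preferences, Envelope MORL improves average utility by roughly 2x.
The method shows strong adaptation ability and can infer hidden preferences from a small number of trajectories.
Across FTN, DST, Dialog, and Super Mario, Envelope MORL achieves a higher coverage ratio (CR) and lower adaptation error (AE) than the baselines.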
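The idea of inferring a hidden preference from scalar feedback alone can be illustrated with a toy sketch. It uses plain random search over two-objective preferences as a stand-in; it does not include the policy-gradient component, and the return vectors, function names, and sample budget are hypothetical. The "trained agent" is reduced to a lookup that acts greedily for whatever preference it is given.

```python
import numpy as np

# Hypothetical return vectors of three policies on two objectives.
CCS_RETURNS = np.array([[10.0, 0.0], [7.0, 7.0], [0.0, 10.0]])

def rollout_scalar_return(omega_guess, hidden_omega):
    """Act greedily for the guessed preference; observe only a scalar return."""
    chosen = CCS_RETURNS[np.argmax(CCS_RETURNS @ omega_guess)]
    return float(hidden_omega @ chosen)

def infer_preference(hidden_omega, n_samples=100, seed=0):
    """Random search: keep the guessed preference with the best scalar return."""
    rng = np.random.default_rng(seed)
    best_omega, best_ret = None, -np.inf
    for _ in range(n_samples):
        t = rng.uniform()
        omega = np.array([t, 1.0 - t])  # a point on the 2-objective simplex
        ret = rollout_scalar_return(omega, hidden_omega)
        if ret > best_ret:
            best_omega, best_ret = omega, ret
    return best_omega, best_ret

omega_hat, scalar_return = infer_preference(np.array([0.8, 0.2]))
```

The search recovers a preference that induces the same behavior as the hidden one rather than the hidden vector itself, since many preferences select the same policy.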

This digest was generated by AI and reviewed by human editors.