QUICK REVIEW

[Paper Review] Where Do You Think You're Going?: Inferring Beliefs about Dynamics from Behavior

Siddharth Reddy, Anca D. Dragan|arXiv (Cornell University)|May 21, 2018

Reinforcement Learning in Robotics | References: 47 | Cited by: 26

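One-Sentence Summary

This paper proposes a method for inferring human intent by learning the user's internal beliefs about the environment's dynamics from suboptimal behavior, rather than assuming that the behavior is optimal or merely noisy. The user is modeled as acting near-optimally under their own dynamics model, which is estimated by maximizing the likelihood of their actions under a soft Q-value policy; the approach is reported to infer intent better than prior methods in continuous, nonlinear MDPs.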

ABSTRACT

Inferring intent from observed behavior has been studied extensively within the frameworks of Bayesian inverse planning and inverse reinforcement learning. These methods infer a goal or reward function that best explains the actions of the observed agent, typically a human demonstrator. Another agent can use this inferred intent to predict, imitate, or assist the human user. However, a central assumption in inverse reinforcement learning is that the demonstrator is close to optimal. While models of suboptimal behavior exist, they typically assume that suboptimal actions are the result of some type of random noise or a known cognitive bias, like temporal inconsistency. In this paper, we take an alternative approach, and model suboptimal behavior as the result of internal model misspecification: the reason that user actions might deviate from near-optimal actions is that the user has an incorrect set of beliefs about the rules -- the dynamics -- governing how actions affect the environment. Our insight is that while demonstrated actions may be suboptimal in the real world, they may actually be near-optimal with respect to the user's internal model of the dynamics. By estimating these internal beliefs from observed behavior, we arrive at a new method for inferring intent. We demonstrate in simulation and in a user study with 12 participants that this approach enables us to more accurately model human intent, and can be used in a variety of applications, including offering assistance in a shared autonomy framework and inferring human preferences.

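Motivation and Objectives

Address a limitation of inverse reinforcement learning (IRL): its assumption that human behavior is near-optimal breaks down when users act suboptimally because their internal model of the dynamics is wrong.
Model suboptimal human behavior as optimal behavior under a misspecified internal model of the environment's dynamics, rather than as noise or cognitive bias.
Develop a scalable method for inferring the internal dynamics model from behavioral demonstrations in high-dimensional, continuous state spaces.
Improve intent inference, shared autonomy, and preference learning by using the inferred internal dynamics model to predict and assist human behavior.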

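Proposed Method

Estimate the user's internal dynamics model by maximizing the likelihood of the observed actions under a soft Q-value policy, in which each action is chosen with probability proportional to its exponentiated Q-value.
Tie the internal dynamics model to the soft Q-function through the soft Bellman equation, so that the dynamics parameters can be learned from demonstrations end to end by gradient-based optimization.
Parameterize the internal dynamics model with a small set of learnable parameters (at most seven), which keeps optimization efficient even in continuous state spaces.
Train the internal dynamics model on demonstrations from tasks with known reward functions, then transfer the policy from the internal model to the real dynamics to provide assistance.
Apply the method to simulated MDPs and to a real user study with the Lunar Lander game to validate both recovery of the internal dynamics and assistance performance.
Use the learned internal dynamics model to predict the user's desired next state, and enable shared autonomy by transferring the policy from the internal dynamics to the real dynamics.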
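The likelihood-based estimation described in this list can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a tiny tabular MDP, uses plain soft value iteration, and compares two hand-picked candidate dynamics models by likelihood instead of learning parameters by gradient descent. The names `line_dynamics`, `soft_q_iteration`, and `log_likelihood`, the reward table, and the demonstrations are all hypothetical.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) for a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def soft_q_iteration(T, R, gamma=0.9, iters=200):
    """Iterate the soft Bellman backup under candidate dynamics T.

    T[s][a][s2] is a transition probability, R[s][a] is the known reward.
    Backup: Q(s,a) = R(s,a) + gamma * E_{s2 ~ T}[V(s2)], V(s) = logsumexp_a Q(s,a).
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        V = [logsumexp(q) for q in Q]
        Q = [[R[s][a] + gamma * sum(T[s][a][s2] * V[s2] for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q

def log_likelihood(Q, demos):
    """Log-likelihood of (state, action) pairs under pi(a|s) proportional to exp(Q[s][a])."""
    return sum(Q[s][a] - logsumexp(Q[s]) for s, a in demos)

def line_dynamics(flipped):
    """Three states on a line; action 0 moves left and action 1 moves right.

    With flipped=True the directions are swapped, standing in for a user
    who believes the controls are inverted (an illustrative assumption).
    """
    n_s = 3
    T = [[[0.0] * n_s for _ in range(2)] for _ in range(n_s)]
    for s in range(n_s):
        for a in range(2):
            step = 1 if a == 1 else -1
            if flipped:
                step = -step
            s2 = min(max(s + step, 0), n_s - 1)  # walls clip the move
            T[s][a][s2] = 1.0
    return T

# Known reward: 1 while in the rightmost state, 0 elsewhere.
R = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]

# Hypothetical demonstrations: the user keeps pressing "left" (action 0),
# which moves away from the goal under the real dynamics.
demos = [(0, 0), (1, 0), (0, 0), (1, 0)]

ll_true = log_likelihood(soft_q_iteration(line_dynamics(False), R), demos)
ll_flipped = log_likelihood(soft_q_iteration(line_dynamics(True), R), demos)
best_model = "flipped" if ll_flipped > ll_true else "true"
```

In this toy setting the demonstrations are more likely under the inverted-controls model than under the real dynamics, which is the sense in which an internal model can explain behavior that looks suboptimal. With continuous states, the exhaustive comparison would be replaced by optimizing the dynamics parameters, as the list above describes.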

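Experimental Results

Research Questions

RQ1: Can a user's internal model of the environment's dynamics be inferred accurately from samples of their suboptimal behavior in continuous, nonlinear MDPs?
RQ2: Does modeling suboptimal behavior as optimal behavior under a misspecified internal dynamics model yield better intent inference than assuming noise or cognitive bias?
RQ3: Can the inferred internal dynamics model improve assistance in shared autonomy systems, for example by transferring the policy from the internal dynamics to the real dynamics?
RQ4: How well does the method generalize to real human users, particularly in complex, high-dimensional control tasks such as Lunar Lander?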

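Key Findings

In simulated MDPs with continuous state spaces, the method recovered internal dynamics models that explain the demonstrated behavior better than the real-world dynamics do.
In the Lunar Lander user study with 12 participants, the inferred internal dynamics model explained the observed human behavior better than the real dynamics model.
The recovered internal dynamics model enabled effective policy transfer from the internal dynamics to the real dynamics, allowing the system to assist users in completing the game more reliably.
The method is not restricted to linear or discrete models and scales to nonlinear, high-dimensional continuous state spaces, where earlier methods were difficult to apply.
By modeling human behavior as optimal under the user's own beliefs rather than as a deviation from optimality, the method improves intent inference and preference learning.
The results suggest that internal dynamics estimation can serve as a foundation for adaptive assistance, personalized feedback, and intent-aware AI systems.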
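The transfer from internal to real dynamics mentioned in these findings can be sketched for a tabular case. This is one simple reading rather than the study's actual procedure: it assumes the assistant maps the user's action to the next state the user most likely expects under the internal model, then picks the real action most likely to reach that state. The function names and both transition tables are hypothetical.

```python
def intended_next_state(T_internal, s, a_user):
    """Most likely next state the user expects, under their internal model."""
    probs = T_internal[s][a_user]
    return max(range(len(probs)), key=lambda s2: probs[s2])

def assistive_action(T_internal, T_real, s, a_user):
    """Real-world action most likely to reach the user's intended next state."""
    target = intended_next_state(T_internal, s, a_user)
    n_a = len(T_real[s])
    return max(range(n_a), key=lambda a: T_real[s][a][target])

# Real dynamics on a 3-state line: action 0 moves left, action 1 moves right.
# Indexing is T[state][action][next_state].
T_real = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]

# Assumed internal model: the user believes the two actions are swapped.
T_internal = [[row[1], row[0]] for row in T_real]

# In state 1 the user presses action 0, expecting to reach state 2.
corrected = assistive_action(T_internal, T_real, s=1, a_user=0)
```

Here `corrected` is the action that actually moves right, so the assistant carries out what the user meant instead of what they pressed. A continuous-control version would need a learned or analytic inverse of the real dynamics in place of the table lookup.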


This review was generated by AI and reviewed by a human editor.