QUICK REVIEW

[Paper Review] Hyperbolic Discounting and Learning over Multiple Horizons

William Fedus, Carles Gelada|arXiv (Cornell University)|Feb 19, 2019

Mathematical and Theoretical Analysis|References: 67|Cited by: 52

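One-Sentence Summary

This paper proposes a practical way to implement hyperbolic (non-exponential) discounting in reinforcement learning by aggregating many exponentially discounted Q-values learned over multiple time horizons, and it shows that multi-horizon learning also improves performance when used as an auxiliary task.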

ABSTRACT

Reinforcement learning (RL) typically defines a discount factor as part of the Markov Decision Process. The discount factor values future rewards by an exponential scheme that leads to theoretical convergence guarantees of the Bellman equation. However, evidence from psychology, economics and neuroscience suggests that humans and animals instead have hyperbolic time-preferences. In this work we revisit the fundamentals of discounting in RL and bridge this disconnect by implementing an RL agent that acts via hyperbolic discounting. We demonstrate that a simple approach approximates hyperbolic discount functions while still using familiar temporal-difference learning techniques in RL. Additionally, and independent of hyperbolic discounting, we make a surprising discovery that simultaneously learning value functions over multiple time-horizons is an effective auxiliary task which often improves over a strong value-based RL agent, Rainbow.

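Research Motivation and Goals

Question the use of a single exponential discount in reinforcement learning, citing empirical evidence of hyperbolic discounting, and move toward models consistent with observed time preferences.
Show that hyperbolic discounting can be approximated within temporal-difference (TD) learning by integrating over exponential discounts.
Demonstrate a practical deep learning approach that computes hyperbolic Q-values from Q-functions learned over multiple time horizons.
Study the hazard-rate interpretation and the equivalence between a prior over hazard rates and a discount function.
Evaluate whether a multi-horizon auxiliary task can improve baseline RL agents in complex environments.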

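Proposed Method

Formalize the equivalence between hazard rates and discount functions, which justifies discounting as robustness to risk.
Derive the hyperbolic Q-value as an integral of exponentially discounted Q-values over a continuum of gamma values.
Propose a practical approximation that uses a finite set of gamma values with Riemann-sum-style weights.
Use a deep network to learn multiple Q-values that share parameters but are discounted with different gamma values.
Establish conditions on the weighting over exponential discounts under which the approach extends beyond hyperbolic discounting.
Apply the method in Pathworld and the Arcade Learning Environment (ALE) to evaluate performance gains and auxiliary-task benefits.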
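
The integral and Riemann-sum steps above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it relies on the identity 1/(1 + kt) = ∫₀¹ γ^(kt) dγ, and it assumes a uniform midpoint grid of gamma values with equal weights, which may differ from the spacing and weights the authors use. The function names and the toy two-action problem are hypothetical.

```python
def hyperbolic_discount(t, k):
    """Exact hyperbolic discount 1 / (1 + k * t)."""
    return 1.0 / (1.0 + k * t)


def midpoint_gammas(n_gammas):
    """Midpoints of n_gammas equal-width cells on [0, 1] (an assumed grid)."""
    return [(i + 0.5) / n_gammas for i in range(n_gammas)]


def approx_hyperbolic_discount(t, k, n_gammas=1000):
    """Midpoint Riemann sum of the integral of gamma ** (k * t) over [0, 1]."""
    gammas = midpoint_gammas(n_gammas)
    return sum(g ** (k * t) for g in gammas) / n_gammas


def toy_q_values(rewards, delays, k, n_gammas=1000):
    """Closed-form exponentially discounted Q-values for a toy problem.

    Action a pays rewards[a] once, after delays[a] steps. Row i uses the
    discount factor gammas[i] ** k. A learned agent would estimate these
    rows with TD learning instead of computing them in closed form.
    """
    return [
        [r * (g ** k) ** d for r, d in zip(rewards, delays)]
        for g in midpoint_gammas(n_gammas)
    ]


def aggregate_q_values(q_per_gamma):
    """Equal-weight average over the gamma axis: one hyperbolic Q per action."""
    n_gammas = len(q_per_gamma)
    n_actions = len(q_per_gamma[0])
    return [
        sum(row[a] for row in q_per_gamma) / n_gammas
        for a in range(n_actions)
    ]


if __name__ == "__main__":
    k = 0.1
    for t in (0, 1, 10, 100):
        print(t, hyperbolic_discount(t, k), approx_hyperbolic_discount(t, k))
    # Action 0: reward 1 after 1 step. Action 1: reward 2 after 20 steps.
    print(aggregate_q_values(toy_q_values([1.0, 2.0], [1, 20], k)))
```

Each row of `toy_q_values` stands in for one exponentially discounted Q-head; averaging the rows recovers values close to `reward * hyperbolic_discount(delay, k)` for each action.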

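Experimental Results

Research Questions

RQ1: Can hyperbolic and other non-exponential discounts be computed from standard TD learning by aggregating exponentially discounted values?
RQ2: Does learning multiple Q-values over different time horizons serve as a useful auxiliary task that improves on a strong baseline such as Rainbow?
RQ3: When is hyperbolic discounting advantageous under hazard uncertainty or non-trivial intertemporal trade-offs?
RQ4: What is the equivalence between hazard priors and discount functions in a Markov Decision Process (MDP), and how does it guide robust policy learning?
RQ5: How well does an approximation built from a finite set of gamma values capture hyperbolic discounting in high-dimensional RL domains?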

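Key Findings

Hyperbolic discounting can be computed as an integral over exponential discounts, which lets TD methods approximate non-exponential preferences.
In practice, a finite set of exponentially discounted Q-values, combined with appropriate weights, can approximate the hyperbolic Q-value.
Learning multiple Q-values over different time horizons acts as an effective auxiliary task and improves performance on the ALE relative to a strong baseline.
The Pathworld environment shows that hyperbolic discounting is beneficial under hazard uncertainty and non-trivial intertemporal choices.
A hazard prior corresponds to a specific discount function, which gives a principled link between risk modeling and discounting in RL.
When the environment has uncertain hazards and a risk that rewards are never realized, the method yields robust policies.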
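
The link between a hazard prior and a discount function can be made concrete with a short worked derivation. The notation here is an assumption for illustration, since this summary does not define symbols: λ is a constant per-step hazard rate, p(λ) is the agent's prior over it, k is the hyperbolic coefficient, and d(t) is the resulting discount. A known hazard gives the exponential discount e^{-λt} = γ^t with γ = e^{-λ}; an exponential prior with mean k gives the hyperbolic discount.

```latex
\begin{align}
d(t) &= \int_0^\infty p(\lambda)\, e^{-\lambda t}\, d\lambda
      && \text{expected survival to time } t \text{ under the prior} \\
     &= \int_0^\infty \frac{1}{k}\, e^{-\lambda / k}\, e^{-\lambda t}\, d\lambda
      && \text{exponential prior } p(\lambda) = \tfrac{1}{k} e^{-\lambda / k} \\
     &= \frac{1}{k} \cdot \frac{1}{1/k + t}
      = \frac{1}{1 + k t}.
\end{align}
```

Other priors over λ yield other discount functions by the same integral, which is the sense in which a choice of discount encodes a belief about risk.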

This review was generated by AI and reviewed by a human editor.