QUICK REVIEW

[Paper Review] Avoiding Side Effects By Considering Future Tasks

Victoria Krakovna, Laurent Orseau|arXiv (Cornell University)|Jan 1, 2020

Computability, Logic, AI Algorithms | Cited by 8

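One-Sentence Summary

The paper proposes a method for automatically generating an auxiliary reward function that penalizes side effects by rewarding the agent's ability to complete future tasks. By using a baseline policy to filter which future tasks count as achievable, the method avoids interference incentives, and it outperforms penalizing irreversible actions in gridworld environments.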

ABSTRACT

Designing reward functions is difficult: the designer has to specify what to do (what it means to complete the task) as well as what not to do (side effects that should be avoided while completing the task). To alleviate the burden on the reward designer, we propose an algorithm to automatically generate an auxiliary reward function that penalizes side effects. This auxiliary objective rewards the ability to complete possible future tasks, which decreases if the agent causes side effects during the current task. The future task reward can also give the agent an incentive to interfere with events in the environment that make future tasks less achievable, such as irreversible actions by other agents. To avoid this interference incentive, we introduce a baseline policy that represents a default course of action (such as doing nothing), and use it to filter out future tasks that are not achievable by default. We formally define interference incentives and show that the future task approach with a baseline policy avoids these incentives in the deterministic case. Using gridworld environments that test for side effects and interference, we show that our method avoids interference and is more effective for avoiding side effects than the common approach of penalizing irreversible actions.

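Research Motivation and Objectives

Reduce the burden on the reward designer by automating side-effect avoidance in reinforcement learning.
Address the challenge of specifying which actions should be avoided, beyond completing the task itself.
Prevent the agent from interfering with the environment in order to keep future tasks achievable.
Formally define interference incentives and eliminate them in deterministic environments.
Evaluate how well the method avoids side effects and interference in gridworld environments.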

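Proposed Method

The method introduces an auxiliary reward function that incentivizes preserving the ability to complete future tasks.
It uses a baseline policy representing a default course of action (for example, doing nothing) to filter out future tasks that are not achievable by default.
A future task is considered only if it is achievable under the baseline policy, which prevents spurious interference incentives.
The auxiliary reward penalizes actions that reduce the agent's ability to complete these filtered future tasks.
The method is formally shown to avoid interference incentives in deterministic environments.
The method is evaluated in gridworld environments designed to test for side effects and interference behavior.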
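
The filtering idea above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's exact formulation: future tasks are taken to be "reach goal state g" in a deterministic transition graph, tasks are weighted uniformly, the value of a task is gamma raised to the shortest-path distance, and the baseline filter is approximated by taking the minimum of the agent's value and the baseline's value for each task. All names (`TRANSITIONS`, `task_value`, `future_task_reward`) and the two-state vase world are hypothetical.

```python
from collections import deque

# Hypothetical toy world: a vase can be broken but never repaired.
TRANSITIONS = {
    "intact": ["intact", "broken"],  # no-op, or break the vase
    "broken": ["broken"],            # breaking is irreversible
}
GOALS = ["intact", "broken"]  # assumed set of possible future tasks


def shortest_distance(transitions, start, goal):
    """Breadth-first search; returns None if goal is unreachable from start."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, dist = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def task_value(transitions, state, goal, gamma):
    """Discounted value of reaching goal from state (0 if unreachable)."""
    dist = shortest_distance(transitions, state, goal)
    return 0.0 if dist is None else gamma ** dist


def future_task_reward(transitions, agent_state, baseline_state, goals, gamma=0.9):
    """Average future-task value, capped per task by the baseline's value.

    Assumption: min(agent value, baseline value) stands in for the baseline
    filter, so tasks unreachable by default contribute nothing.
    """
    total = 0.0
    for goal in goals:
        v_agent = task_value(transitions, agent_state, goal, gamma)
        v_base = task_value(transitions, baseline_state, goal, gamma)
        total += min(v_agent, v_base)
    return total / len(goals)


# Side effect: the agent breaks the vase while the baseline leaves it intact.
careful = future_task_reward(TRANSITIONS, "intact", "intact", GOALS)
careless = future_task_reward(TRANSITIONS, "broken", "intact", GOALS)

# Interference: the vase breaks by default (baseline state is "broken").
# State pairs are supplied by hand here; the toy graph does not model the
# event that breaks the vase.
interfere = future_task_reward(TRANSITIONS, "intact", "broken", GOALS)
stand_by = future_task_reward(TRANSITIONS, "broken", "broken", GOALS)

print(careful > careless)     # breaking the vase lowers the reward
print(stand_by >= interfere)  # preventing the default outcome is not rewarded
```

In this sketch, breaking the vase removes the "intact" task and lowers the auxiliary reward, while rescuing a vase that would have broken anyway earns nothing extra, because that task is filtered out as unachievable under the baseline.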

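Experimental Results

Research Questions

RQ1: Can an auxiliary reward based on future-task ability effectively reduce side effects without manual reward shaping?
RQ2: Does using a baseline policy eliminate interference incentives in the agent's behavior?
RQ3: How does the method compare with the standard irreversible-action penalty at reducing side effects?
RQ4: Can the method maintain task performance while preventing harmful interference with the environment?
RQ5: Does filtering future tasks through a baseline policy improve robustness and alignment?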

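Key Findings

By filtering out future tasks that are not achievable under the baseline policy, the proposed method successfully avoided interfering with the environment.
The method outperformed the common irreversible-action penalty at reducing side effects in gridworld environments.
Formal analysis shows that the method avoids interference incentives in deterministic environments.
The auxiliary reward based on future-task ability led to more robust and better-aligned agent behavior.
Baseline filtering prevented the agent from manipulating the environment to improve its future-task prospects.
Empirical results indicate that the method maintains high task performance while minimizing unintended side effects.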


This review was generated by AI and checked by human editors.