QUICK REVIEW

[Paper Digest] Action and Perception as Divergence Minimization

Danijar Hafner, Pedro A. Ortega|arXiv (Cornell University)|Sep 3, 2020

Explainable Artificial Intelligence (XAI) | References: 146 | Cited by: 23

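One-Sentence Summary

This paper proposes a unified framework, the Action Perception Divergence (APD), which formulates both perception and action as the joint minimization of the Kullback-Leibler (KL) divergence between the world distribution and a shared, expressive target distribution. By using latent variables, the framework brings representation learning, information gain, empowerment, and skill discovery under a single principle, suggesting that agents with powerful world models can explore and adapt autonomously without task-specific rewards.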

ABSTRACT

To learn directed behaviors in complex environments, intelligent agents need to optimize objective functions. Various objectives are known for designing artificial agents, including task rewards and intrinsic motivation. However, it is unclear how the known objectives relate to each other, which objectives remain yet to be discovered, and which objectives better describe the behavior of humans. We introduce the Action Perception Divergence (APD), an approach for categorizing the space of possible objective functions for embodied agents. We show a spectrum that reaches from narrow to general objectives. While the narrow objectives correspond to domain-specific rewards as typical in reinforcement learning, the general objectives maximize information with the environment through latent variable models of input sequences. Intuitively, these agents use perception to align their beliefs with the world and use actions to align the world with their beliefs. They infer representations that are informative of past inputs, explore future inputs that are informative of their representations, and select actions or skills that maximally influence future inputs. This explains a wide range of unsupervised objectives from a single principle, including representation learning, information gain, empowerment, and skill discovery. Our findings suggest leveraging powerful world models for unsupervised exploration as a path toward highly adaptive agents that seek out large niches in their environments, rendering task rewards optional.

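Motivation and Objectives

Unify the many objectives used in reinforcement learning and representation learning under a single principled framework.
Clarify how known objectives, such as intrinsic motivation, empowerment, and information gain, relate to one another.
Examine whether expressive world models can make task rewards unnecessary for agent behavior.
Provide a general method, based on divergence minimization, for designing new agent objectives.
Connect deep reinforcement learning with active inference and the free energy principle through a scalable, unified formulation.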

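Proposed Method

Formulates perception and action as the joint minimization of the KL divergence between the world distribution and a shared target distribution.
Uses latent variables to represent internal state: past inputs are handled through variational inference, and future inputs through information gain.
Derives the maximization of mutual information between latent variables and inputs as a consequence of minimizing the joint KL divergence.
Decomposes the joint KL divergence into a past term (representation learning) and a future term (exploration).
Applies the framework to derive known objectives, such as contrastive learning, SLAC, and empowerment, from the same principle.
Proposes using expressive world models as targets, so that agents discover large niches through intrinsic exploration.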
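
The link between the joint KL divergence and mutual information can be written out in a few lines. The derivation below is a sketch under assumed notation, which this digest does not itself fix: x denotes inputs, z denotes latent variables, p_φ(x, z) is the actual joint distribution, and τ(x, z) is the target. The labels "complexity" and "information" are used here only for illustration.

```latex
\begin{align}
\mathrm{KL}\big[p_\phi(x,z)\,\|\,\tau(x,z)\big]
  &= \mathbb{E}_{p_\phi(x,z)}\big[\ln p_\phi(z\mid x) + \ln p_\phi(x)
     - \ln \tau(x\mid z) - \ln \tau(z)\big] \\
  &= \underbrace{\mathbb{E}_{p_\phi(x)}\,
       \mathrm{KL}\big[p_\phi(z\mid x)\,\|\,\tau(z)\big]}_{\text{complexity}}
   - \underbrace{\mathbb{E}_{p_\phi(x,z)}\big[\ln \tau(x\mid z)
       - \ln p_\phi(x)\big]}_{\text{information}}, \\
\mathbb{E}_{p_\phi(x,z)}\big[\ln \tau(x\mid z) - \ln p_\phi(x)\big]
  &= I_{p_\phi}(x;z)
   - \mathbb{E}_{p_\phi(z)}\,
       \mathrm{KL}\big[p_\phi(x\mid z)\,\|\,\tau(x\mid z)\big]
   \;\le\; I_{p_\phi}(x;z).
\end{align}
```

Read this way, lowering the joint divergence raises the information term, which is a lower bound on the mutual information between inputs and latents. The bound is tight when the target conditional τ(x | z) matches p_φ(x | z), which is one way to see why an expressive target matters.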

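Experimental Results

Research Questions

RQ1: How can perception and action be unified under a single objective function for an agent?
RQ2: What role do latent variables play in connecting representation learning with future-oriented exploration?
RQ3: How do different objectives, such as empowerment, information gain, and contrastive learning, emerge from the same principle?
RQ4: Can expressive world models enable agents to explore and adapt autonomously without task-specific rewards?
RQ5: How does the framework relate to existing theories such as active inference and the free energy principle?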

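Key Findings

The joint KL minimization framework unifies a broad spectrum of objectives, from narrow task rewards to general intrinsic objectives.
Minimizing the divergence to an expressive target leads to maximizing the mutual information between latent variables and sensory inputs.
The past-input term yields representation learning through variational inference, while the future-input term supports exploration based on information gain.
Stochastic actions and skills, by maximizing mutual information with future inputs, yield generalized empowerment and skill discovery.
The framework offers a scalable alternative to traditional active inference, overcoming its computational bottlenecks.
The approach suggests that powerful world models can make task rewards optional, allowing agents to autonomously discover and occupy rich niches in their environments.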
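
The relationship between the joint KL divergence and mutual information can be checked numerically on a toy problem. The sketch below is illustrative only and is not code from the paper: the two-by-two probability tables `actual` and `target`, and the helper names `joint_kl`, `marginal`, `mutual_information`, and `decompose`, are all hypothetical. It splits the joint KL into a complexity term minus an information term and compares the information term with the mutual information under the actual distribution.

```python
import math

def joint_kl(p, t):
    """KL[p || t] for joint tables given as dicts {(x, z): probability}."""
    return sum(v * math.log(v / t[k]) for k, v in p.items() if v > 0)

def marginal(p, axis):
    """Marginalize a joint table onto x (axis=0) or z (axis=1)."""
    out = {}
    for k, v in p.items():
        out[k[axis]] = out.get(k[axis], 0.0) + v
    return out

def mutual_information(p):
    """I(x; z) in nats under the joint table p."""
    px, pz = marginal(p, 0), marginal(p, 1)
    return sum(v * math.log(v / (px[x] * pz[z]))
               for (x, z), v in p.items() if v > 0)

def decompose(p, t):
    """Return (complexity, information) with KL[p || t] = complexity - information.

    complexity  = E_{p(x)} KL[p(z|x) || t(z)]
    information = E_p[ln t(x|z) - ln p(x)], a lower bound on I(x; z) under p
    """
    px = marginal(p, 0)
    tz = marginal(t, 1)
    complexity = 0.0
    information = 0.0
    for (x, z), v in p.items():
        if v == 0:
            continue
        complexity += v * math.log((v / px[x]) / tz[z])
        information += v * (math.log(t[(x, z)] / tz[z]) - math.log(px[x]))
    return complexity, information

# Hypothetical toy distributions over binary x and z (assumed, for illustration).
actual = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
target = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.15, (1, 1): 0.35}

kl = joint_kl(actual, target)
complexity, information = decompose(actual, target)
mi = mutual_information(actual)

print("joint KL:", kl)
print("complexity - information:", complexity - information)
print("information term:", information, "mutual information:", mi)
```

With any strictly positive target the two printed divergence values should agree up to floating-point error, and the information term should not exceed the mutual information. Passing `actual` as its own target drives the joint KL to zero and makes the bound tight.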


This digest was generated by AI and reviewed by human editors.