QUICK REVIEW

[Paper Review] Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Yasin Abbasi, Peter L. Bartlett|arXiv (Cornell University)|Dec 5, 2013

Advanced Bandit Algorithms Research|References: 25|Cited by: 51

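One-Sentence Summary

This paper proposes an efficient online learning algorithm for Markov decision processes (MDPs) in which both the transition distributions and the loss functions are chosen by an adversary, achieving O(√(T log |Π|) + log |Π|) regret under a mixing assumption. The approach extends to the episodic online shortest path problem: an efficient solution exists for adversarial graphs with random losses, but when both the graphs and the losses are chosen adversarially, the problem becomes as hard as learning parity with noise.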
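For reference, the regret in this kind of bound is usually defined as the learner's expected cumulative loss minus that of the best fixed policy in the comparison set. The notation below (states x_t, actions a_t, losses ℓ_t, and x_t^π for the state reached when following π) is assumed here for illustration and may differ from the paper's own symbols.

```latex
R_T \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(x_t, a_t)\Big]
\;-\; \min_{\pi \in \Pi} \mathbb{E}\Big[\sum_{t=1}^{T} \ell_t\big(x_t^{\pi}, \pi(x_t^{\pi})\big)\Big],
\qquad
R_T \;=\; O\big(\sqrt{T \log |\Pi|} + \log |\Pi|\big).
```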

ABSTRACT

We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves O(√(T log |Π|) + log |Π|) regret with respect to a comparison set of policies Π. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set Π has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. It was shown that for randomly chosen graphs and adversarial losses, the problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes. Finally, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs.
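The episodic protocol described in the abstract can be made concrete with a small sketch. The code is illustrative only: the graph encoding, the `run_episode` helper, and the two toy policies are assumptions introduced here, not constructions from the paper. It plays two hand-made DAG episodes and totals each fixed policy's loss, which is the benchmark that regret is measured against.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encoding: node -> {successor: edge weight (loss)}.
Graph = Dict[str, Dict[str, float]]
# A policy maps (current node, sorted successors) to the chosen successor.
Policy = Callable[[str, List[str]], str]


def run_episode(graph: Graph, start: str, finish: str,
                policy: Policy) -> Tuple[List[str], float]:
    """Walk from start to finish following policy; return the path and its loss."""
    path, loss, node = [start], 0.0, start
    while node != finish:
        successors = sorted(graph[node])
        nxt = policy(node, successors)
        loss += graph[node][nxt]  # edge weights are revealed after the episode
        path.append(nxt)
        node = nxt
    return path, loss


def first_successor(node: str, successors: List[str]) -> str:
    return successors[0]


def last_successor(node: str, successors: List[str]) -> str:
    return successors[-1]


# Two toy episodes; in the adversarial setting both graph and weights may change.
episodes: List[Graph] = [
    {"s": {"a": 1.0, "b": 2.0}, "a": {"t": 4.0}, "b": {"t": 1.0}, "t": {}},
    {"s": {"a": 1.0, "b": 3.0}, "a": {"t": 1.0}, "b": {"t": 2.0}, "t": {}},
]
policies = {"first": first_successor, "last": last_successor}
totals = {
    name: sum(run_episode(g, "s", "t", p)[1] for g in episodes)
    for name, p in policies.items()
}
print(totals)
```

In this toy run `first` totals 7.0 and `last` totals 8.0, so `first` is the best fixed policy in hindsight even though `last` is the better choice in the first episode.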

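Motivation and Objectives

Address online learning in MDPs where both the transition probabilities and the loss functions are chosen by an adversary.
Design an efficient, low-regret algorithm whose regret does not depend on the size of the state and action spaces.
Analyze the computational complexity of the episodic adversarial online shortest path problem under different adversarial settings.
Characterize the boundary between tractable and intractable cases of online MDP learning.
Establish a connection between adversarial online MDPs and learning parity with noise.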

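Proposed Method

The algorithm relies on a mixing assumption on the MDP, which ensures that value estimates converge quickly over time.
A policy set Π serves as the comparison class; the regret grows with √T and depends on |Π| only logarithmically.
For the episodic shortest path problem, the algorithm accommodates adversarial graph structures with random loss functions.
The method depends on efficient computation of expectations over sample paths, with a comparison set Π of polynomial size.
The regret analysis decouples the final regret bound from the sizes of the state and action spaces.
The algorithm is shown to be efficient when these expectations can be computed efficiently and |Π| is polynomial.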
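To give a feel for why regret can depend on the comparison set only through log |Π|, here is a minimal sketch of the generic exponential-weights (Hedge) scheme over a finite set of policies. This is a standard textbook technique used as a stand-in, not the paper's algorithm: the MDP dynamics and the mixing assumption are not modelled, each entry of `loss_rounds` is a hypothetical placeholder for a policy's expected loss in one round, and all function names are invented for illustration.

```python
import math
import random
from typing import List, Sequence


def hedge_weights(cum_losses: Sequence[float], eta: float) -> List[float]:
    """Exponential-weights distribution for the given cumulative losses."""
    low = min(cum_losses)  # shift by the minimum for numerical stability
    raw = [math.exp(-eta * (c - low)) for c in cum_losses]
    total = sum(raw)
    return [w / total for w in raw]


def run_hedge(loss_rounds: Sequence[Sequence[float]], eta: float) -> float:
    """Play Hedge on per-round policy losses in [0, 1]; return expected regret."""
    cum = [0.0] * len(loss_rounds[0])
    learner_loss = 0.0
    for losses in loss_rounds:
        probs = hedge_weights(cum, eta)
        # Expected loss of sampling a policy from the current distribution.
        learner_loss += sum(p * l for p, l in zip(probs, losses))
        # Full information: every policy's loss is observed after the round.
        cum = [c + l for c, l in zip(cum, losses)]
    return learner_loss - min(cum)


T, n_policies = 2000, 16
rng = random.Random(0)
rounds = [[rng.random() for _ in range(n_policies)] for _ in range(T)]
eta = math.sqrt(8 * math.log(n_policies) / T)  # standard tuning for known T
regret = run_hedge(rounds, eta)
bound = math.sqrt(T * math.log(n_policies) / 2)  # classical Hedge guarantee
print(f"regret={regret:.2f}, bound={bound:.2f}")
```

The per-round cost of this sketch is linear in the number of policies, which is consistent with the requirement above that |Π| be polynomial for the approach to be efficient.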

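Results

Research Questions

RQ1: Can online learning in MDPs be carried out efficiently when both the transitions and the losses are chosen by an adversary?
RQ2: What is the computational complexity of the episodic adversarial online shortest path problem under different adversarial models?
RQ3: When can efficient algorithms be designed for adversarial MDPs with large state and action spaces?
RQ4: When both the graphs and the losses are chosen adversarially, is the adversarial online shortest path problem as hard as learning parity with noise?
RQ5: Can an efficient regret-minimizing algorithm be constructed when the graphs are adversarial but the losses are randomly generated?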

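Key Findings

The proposed algorithm achieves O(√(T log |Π|) + log |Π|) regret with respect to the policy set Π, independent of the sizes of the state and action spaces.
The algorithm is efficient when expectations over sample paths can be computed efficiently and |Π| is polynomial.
The episodic online shortest path problem can be solved efficiently for adversarial graphs with randomly chosen losses.
When both the graphs and the losses are chosen adversarially, designing an efficient algorithm is as hard as learning parity with noise.
An efficient algorithm is also presented whose regret scales linearly with the number of distinct graphs in the adversarial setting.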
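For readers unfamiliar with the hardness benchmark named in the findings, the sketch below generates samples from the learning parity with noise (LPN) problem in its usual formulation: uniformly random bit vectors labelled by their parity with a hidden secret, with each label flipped independently with some probability. The function name and parameters are assumptions for illustration, and the paper's reduction from LPN to the shortest path problem is not reproduced here.

```python
import random
from typing import List, Tuple


def lpn_samples(secret: List[int], n_samples: int, noise_rate: float,
                rng: random.Random) -> List[Tuple[List[int], int]]:
    """Draw (a, label) pairs where label = <a, secret> mod 2, flipped w.p. noise_rate."""
    samples = []
    for _ in range(n_samples):
        a = [rng.randrange(2) for _ in secret]  # uniform random bit vector
        parity = sum(x * s for x, s in zip(a, secret)) % 2
        flip = 1 if rng.random() < noise_rate else 0  # Bernoulli label noise
        samples.append((a, parity ^ flip))
    return samples


secret = [1, 0, 1, 1, 0, 0, 1, 0]
noisy = lpn_samples(secret, 20, 0.1, random.Random(1))
clean = lpn_samples(secret, 20, 0.0, random.Random(1))
```

With `noise_rate` set to 0 the secret can be recovered by Gaussian elimination over GF(2); the difficulty arises only once labels are noisy, which is the regime the hardness result refers to.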

This review was generated by AI and reviewed by human editors.