QUICK REVIEW

[Paper Review] Deep Reinforcement Learning of Marked Temporal Point Processes

Utkarsh Upadhyay, Abir De|arXiv (Cornell University)|May 23, 2018

Innovation Diffusion and Forecasting | Cited by 34

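ONE-SENTENCE SUMMARY

The paper proposes a deep reinforcement learning framework for marked temporal point processes (MTPPs) in which both the agent's actions and the environment's feedback are modeled as asynchronous events in continuous time. By parameterizing the intensity and mark distribution of the policy with deep recurrent networks, the method supports end-to-end training with arbitrary reward functions. In personalized teaching and viral marketing applications built on real-world Duolingo and Twitter data, it outperforms specialized baseline methods.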

ABSTRACT

In a wide variety of applications, humans interact with a complex environment by means of asynchronous stochastic discrete events in continuous time. Can we design online interventions that will help humans achieve certain goals in such asynchronous setting? In this paper, we address the above problem from the perspective of deep reinforcement learning of marked temporal point processes, where both the actions taken by an agent and the feedback it receives from the environment are asynchronous stochastic discrete events characterized using marked temporal point processes. In doing so, we define the agent's policy using the intensity and mark distribution of the corresponding process and then derive a flexible policy gradient method, which embeds the agent's actions and the feedback it receives into real-valued vectors using deep recurrent neural networks. Our method does not make any assumptions on the functional form of the intensity and mark distribution of the feedback and it allows for arbitrarily complex reward functions. We apply our methodology to two different applications in personalized teaching and viral marketing and, using data gathered from Duolingo and Twitter, we show that it may be able to find interventions to help learners and marketers achieve their goals more effectively than alternatives.
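
For orientation, the two standard ingredients behind a policy defined by an intensity and a mark distribution can be written out. This is a sketch built from textbook identities, with notation chosen here for illustration (\lambda^*_\theta, m^*_\theta, \mathcal{A}_T, R); it is not necessarily the paper's exact formulation. The second identity assumes that the feedback distribution depends on the policy parameters \theta only through the actions taken, and it does not require the reward to be differentiable.

```latex
% Log-likelihood of the agent's actions (t_i, z_i), i = 1..N, on [0, T],
% with conditional intensity \lambda^*_\theta and mark distribution m^*_\theta:
\log p_\theta(\mathcal{A}_T)
  = \sum_{i=1}^{N} \Big[ \log \lambda^*_\theta(t_i) + \log m^*_\theta(z_i) \Big]
  - \int_0^T \lambda^*_\theta(s)\, ds

% Score-function (REINFORCE) gradient of the expected reward R:
\nabla_\theta \, \mathbb{E}\big[ R \big]
  = \mathbb{E}\big[ R \, \nabla_\theta \log p_\theta(\mathcal{A}_T) \big]
```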

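RESEARCH MOTIVATION AND OBJECTIVES

Address the challenge of designing online interventions in asynchronous, continuous-time environments where both actions and feedback are stochastic events.
Overcome the limitations of earlier stochastic optimal control methods, which assume fixed functional forms for the intensity and mark distributions.
Support arbitrary, complex reward functions in reinforcement learning without relying on closed-form analytical solutions.
Develop a policy gradient method that operates directly on marked temporal point processes and avoids assumptions about the environment dynamics.
Demonstrate the method's effectiveness in real-world applications such as personalized teaching and viral marketing.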

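PROPOSED METHOD

The agent's policy is defined by a conditional intensity function and a mark distribution, both parameterized by deep recurrent neural networks (RNNs).
Action times are sampled from the policy's intensity function and marks from its mark distribution; if a feedback event occurs before the scheduled action time, the action is resampled.
A new policy gradient method is derived that combines the MTPP likelihood of the agent's actions with the observed reward, enabling end-to-end training by backpropagation.
The method does not impose a specific functional form on the intensity or mark distribution of the feedback, so it can draw on state-of-the-art deep MTPP models.
Policy parameters are optimized with stochastic gradient descent with quadratic regularization; training and evaluation are carried out on separate splits of the feedback sequences.
The framework supports arbitrary reward functions, including complex objectives such as minimizing average rank or maximizing time at the top of a social media feed.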
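
The sampling step described above can be sketched in a few lines. This is a minimal illustration under strong simplifying assumptions, not the paper's implementation: the hidden state is a single scalar updated by a hand-written tanh rule instead of a trained RNN, the intensity is held constant between events (so each waiting time is exponential), and every name and numeric weight (`sample_actions`, `MARKS`, `0.5`, `0.8`) is hypothetical.

```python
import math
import random

MARKS = ("item_a", "item_b")  # hypothetical mark vocabulary


def sample_actions(feedback_times, horizon, seed=0):
    """Sample (time, mark) actions on [0, horizon) given sorted feedback times."""
    rng = random.Random(seed)
    h = 0.0      # scalar stand-in for the RNN hidden state (assumption)
    t = 0.0      # time of the most recent event of either kind
    fb_idx = 0   # index of the next unconsumed feedback event
    actions = []
    while True:
        # Intensity is positive by construction and constant until the next event.
        rate = math.exp(-1.0 + h)
        candidate = t + rng.expovariate(rate)
        next_fb = feedback_times[fb_idx] if fb_idx < len(feedback_times) else math.inf
        if min(candidate, next_fb) >= horizon:
            return actions
        if next_fb <= candidate:
            # Feedback arrived first: discard the candidate, update the state, resample.
            t = next_fb
            fb_idx += 1
            h = math.tanh(0.5 * h + 0.8)
        else:
            # The scheduled action fires: draw its mark from a two-way categorical.
            p_first = 1.0 / (1.0 + math.exp(-h))
            mark = rng.choices(MARKS, weights=(p_first, 1.0 - p_first))[0]
            t = candidate
            actions.append((t, mark))
            h = math.tanh(0.5 * h - 0.8)
```

A call such as `sample_actions([1.5, 4.0], horizon=10.0)` returns a time-ordered list of `(time, mark)` pairs. Because the intensity is constant between events in this sketch, discarding and redrawing the candidate is statistically exact; a time-varying intensity would need inverse-transform or thinning sampling instead.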

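EXPERIMENTAL RESULTS

Research Questions

RQ1: Can the deep reinforcement learning framework effectively model and optimize interventions in continuous-time, asynchronous event environments?
RQ2: How does the proposed method compare with specialized baselines designed for a specific objective, such as minimizing rank in a feed or maximizing time at the top?
RQ3: How well does the method generalize across reward functions and feedback dynamics when the form of the feedback is not assumed to be known or analytically tractable?
RQ4: Can the method learn effective policies when the underlying environment dynamics are unknown or complex?
RQ5: In real-world settings, how does the method compare with heuristics and state-of-the-art algorithms in terms of performance and variance?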

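Key Findings

On reverse-chronological feeds, the proposed method outperforms RedQueen and the method of Karimi et al. at minimizing average rank and maximizing time at the top, even though it makes no assumptions about the feed-ranking algorithm.
In simple settings, the method matches the performance of stochastic optimal control baselines, which admit analytical solutions in these scenarios, despite having no access to the true dynamics.
In complex settings where the reward function is intractable, the method learns effective interventions where earlier methods fail.
The method shows lower performance variance than RedQueen and performs especially well in environments with intense competition from high-priority users.
In a simplified example, the method learns to avoid posting while high-priority users are posting, which indicates strategic awareness of the competitive dynamics.
An open-source TensorFlow implementation of the method has been released to support broader research on MTPP-based reinforcement learning.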
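
The two feed objectives named in the findings can be computed from post timestamps alone. The sketch below assumes a purely reverse-chronological feed, defines rank as the number of competitor posts made since the broadcaster's latest post, and assumes the broadcaster starts at the top (rank 0) at time 0. The function name `feed_metrics` and these conventions are illustrative assumptions, not taken from the paper's code.

```python
def feed_metrics(own_times, other_times, horizon):
    """Return (average rank, time at top) over [0, horizon] for one follower's feed."""
    OWN, OTHER = 0, 1
    events = sorted(
        [(t, OWN) for t in own_times if 0.0 <= t < horizon]
        + [(t, OTHER) for t in other_times if 0.0 <= t < horizon]
    )
    rank = 0             # assumed starting position: top of the feed
    last = 0.0           # time of the previous event
    rank_integral = 0.0  # integral of rank(t) dt
    time_at_top = 0.0    # total time with rank == 0
    for t, kind in events:
        dt = t - last
        rank_integral += rank * dt
        if rank == 0:
            time_at_top += dt
        # An own post resets the rank; a competitor post pushes us down by one.
        rank = 0 if kind == OWN else rank + 1
        last = t
    # Account for the stretch between the final event and the horizon.
    dt = horizon - last
    rank_integral += rank * dt
    if rank == 0:
        time_at_top += dt
    return rank_integral / horizon, time_at_top
```

For example, `feed_metrics([2.0], [1.0, 3.0], horizon=4.0)` gives an average rank of 0.5 and 2.0 time units at the top. Either quantity (negated, in the case of average rank) could serve as the reward in a policy gradient update, since neither needs to be differentiable.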


This review was generated by AI and reviewed by a human editor.