QUICK REVIEW

[Paper Review] Learning Temporal Point Processes via Reinforcement Learning

Shuang Li, Shuai Xiao|arXiv (Cornell University)|Nov 12, 2018

Point processes and geometric inequalities | Cited by 54

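One-sentence summary

This paper casts the learning of temporal point processes as reinforcement learning: event generation is modeled as the actions of a stochastic policy, and the reward function is learned in analytic form in an RKHS, which improves on MLE-based methods.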

ABSTRACT

Social goods, such as healthcare, smart city, and information networks, often produce ordered event data in continuous time. The generative processes of these event data can be very complex, requiring flexible models to capture their dynamics. Temporal point processes offer an elegant framework for modeling event data without discretizing the time. However, the existing maximum-likelihood-estimation (MLE) learning paradigm requires hand-crafting the intensity function beforehand and cannot directly monitor the goodness-of-fit of the estimated model in the process of training. To alleviate the risk of model-misspecification in MLE, we propose to generate samples from the generative model and monitor the quality of the samples in the process of training until the samples and the real data are indistinguishable. We take inspiration from reinforcement learning (RL) and treat the generation of each event as the action taken by a stochastic policy. We parameterize the policy as a flexible recurrent neural network and gradually improve the policy to mimic the observed event distribution. Since the reward function is unknown in this setting, we uncover an analytic and nonparametric form of the reward function using an inverse reinforcement learning formulation. This new RL framework allows us to derive an efficient policy gradient algorithm for learning flexible point process models, and we show that it performs well in both synthetic and real data.

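Research motivation and goals

Motivate the modeling of complex event dynamics in continuous time without discretizing time.
Address the limitations of maximum likelihood estimation by directly monitoring generated samples during training.
Propose a reinforcement learning framework that treats each event as an action and uses inverse reinforcement learning to infer the reward.
Develop a tractable training procedure that uses an RKHS to obtain an analytic reward and policy gradient updates.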

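Proposed method

Model the next event time as an action drawn from a stochastic policy pi_theta(a|s_t), which is parameterized by an RNN with stochastic neurons.
Relate the policy to the intensity function lambda_theta(t|s_t) via: lambda_theta(t|s_t) = pi_theta(t-t_i|s_t_i) / (1 - ∫_{t_i}^{t} pi_theta(τ-t_i|s_t_i)dτ).
Infer the unknown reward function with inverse reinforcement learning by optimizing over the unit ball of an RKHS, which yields an analytic form for the reward.
Recast the IRL problem as minimizing the discrepancy between the expert and learner mean embeddings in the RKHS, which admits a closed-form update (Theorem 1).
Optimize the policy with policy gradients and variance reduction techniques, combining reward-to-go with baselines.
Provide a practical RLPP algorithm with mini-batch training to train the policy.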
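The policy-to-intensity relation above is the standard hazard-rate identity: the density of the next inter-event time divided by the probability that no event has occurred yet. The following is a minimal Python sketch of that identity only, not the paper's implementation. The names `intensity_from_policy`, `policy_pdf`, and `n_steps` are hypothetical, the integral is approximated with a midpoint rule, and the exponential density is an assumption chosen because its intensity is known to be constant.

```python
import math

def intensity_from_policy(policy_pdf, t, t_i, n_steps=10_000):
    """Evaluate lambda(t) = pi(t - t_i) / (1 - integral_{t_i}^{t} pi(tau - t_i) dtau).

    policy_pdf: density of the inter-event time (hypothetical stand-in for pi_theta).
    t_i: time of the most recent event; t: query time with t >= t_i.
    """
    elapsed = t - t_i
    if elapsed < 0:
        raise ValueError("t must not precede the last event time t_i")
    # Midpoint rule for the probability mass already used up on [0, elapsed].
    h = elapsed / n_steps
    mass = sum(policy_pdf((k + 0.5) * h) for k in range(n_steps)) * h
    return policy_pdf(elapsed) / (1.0 - mass)

# Assumed example policy: exponential inter-event density with rate 2.0.
RATE = 2.0

def exponential_pdf(dt):
    return RATE * math.exp(-RATE * dt)

# For an exponential density the intensity equals the rate at every query time.
lam = intensity_from_policy(exponential_pdf, t=1.5, t_i=1.0)
```

With this assumed density, `lam` should match `RATE` up to the integration error, which is a quick sanity check that a sampling policy and an intensity function describe the same process.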

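Experimental results

Research questions

RQ1: Can reinforcement learning provide a flexible alternative to MLE for learning temporal point processes, without hand-crafting the intensity function?
RQ2: Does the RKHS-based analytic reward enable efficient and stable policy learning for point processes?
RQ3: How does the proposed RL framework compare with state-of-the-art methods (e.g., RMTPP, WGANTPP) on synthetic and real data?
RQ4: What effect does a stochastic RNN policy have on modeling complex temporal dependencies?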

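Key findings

In terms of the learned intensity function, RLPP outperforms RMTPP on both synthetic and real datasets and is competitive with or better than WGANTPP.
The RKHS-based reward gives a closed-form expression for the optimal reward, which enables efficient gradient-based policy updates.
Under model misspecification, RLPP remains robust and fits the empirical intensity as well as or better than the baseline methods.
Compared with LGCP and nonparametric Hawkes, RLPP achieves similar or better empirical intensity without discretizing time, with a more favorable runtime.
RLPP has a significant runtime advantage over adversarial baselines such as WGANTPP (about 40x faster) while maintaining performance.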
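To make the closed-form reward in the findings above concrete, here is a minimal Python sketch of one way to estimate it empirically. It assumes the reward at time t is proportional to the difference between the expert and learner kernel mean embeddings, averaged over sequences, and it assumes a Gaussian kernel with a fixed bandwidth. The normalization constant is omitted, all function names are hypothetical, and this is an illustration rather than the authors' implementation.

```python
import math

def gaussian_kernel(s, t, bandwidth=1.0):
    """Gaussian kernel between two time points (kernel choice is an assumption)."""
    return math.exp(-((s - t) ** 2) / (2.0 * bandwidth ** 2))

def mean_embedding(sequences, t, bandwidth=1.0):
    """Average over sequences of the summed kernel values between each event and t."""
    total = sum(gaussian_kernel(s, t, bandwidth) for seq in sequences for s in seq)
    return total / len(sequences)

def reward(t, expert_sequences, learner_sequences, bandwidth=1.0):
    """Unnormalized reward: expert embedding minus learner embedding at time t."""
    return (mean_embedding(expert_sequences, t, bandwidth)
            - mean_embedding(learner_sequences, t, bandwidth))

def reward_to_go(rewards):
    """Suffix sums: entry i is the total reward collected from event i onward."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]

# Toy data (made up for illustration): experts fire near t=1, the learner near t=3.
expert = [[1.0, 1.2], [0.9]]
learner = [[3.0], [3.1, 2.9]]

# Score each event of one sampled learner sequence, then accumulate reward-to-go.
sampled = learner[1]
per_event = [reward(t, expert, learner) for t in sampled]
returns = reward_to_go(per_event)
```

In this toy setup the reward is positive where expert events are dense and negative where only the learner places events, so a policy gradient step weighted by `returns` would push the learner's events toward the expert's. No adversarial discriminator has to be trained to obtain the reward, which is consistent with the runtime advantage reported over WGANTPP.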

This review was generated by AI and reviewed by a human editor.