QUICK REVIEW

[Paper Review] Learning Temporal Point Processes via Reinforcement Learning

Shuang Li, Shuai Xiao | arXiv (Cornell University) | Nov 12, 2018

Point processes and geometric inequalities | 54 citations

TL;DR

This paper casts temporal point process learning as a reinforcement learning problem: each event is an action drawn from a stochastic policy, and the policy is trained against an analytic RKHS-based reward function, which the authors report improves over MLE-based methods.

ABSTRACT

Social goods, such as healthcare, smart city, and information networks, often produce ordered event data in continuous time. The generative processes of these event data can be very complex, requiring flexible models to capture their dynamics. Temporal point processes offer an elegant framework for modeling event data without discretizing the time. However, the existing maximum-likelihood-estimation (MLE) learning paradigm requires hand-crafting the intensity function beforehand and cannot directly monitor the goodness-of-fit of the estimated model in the process of training. To alleviate the risk of model-misspecification in MLE, we propose to generate samples from the generative model and monitor the quality of the samples in the process of training until the samples and the real data are indistinguishable. We take inspiration from reinforcement learning (RL) and treat the generation of each event as the action taken by a stochastic policy. We parameterize the policy as a flexible recurrent neural network and gradually improve the policy to mimic the observed event distribution. Since the reward function is unknown in this setting, we uncover an analytic and nonparametric form of the reward function using an inverse reinforcement learning formulation. This new RL framework allows us to derive an efficient policy gradient algorithm for learning flexible point process models, and we show that it performs well in both synthetic and real data.

Motivation & Objective

Motivate modeling complex event dynamics in continuous time without discretizing time.
Address limitations of maximum likelihood estimation by directly monitoring generated samples during training.
Propose a reinforcement learning framework that treats each event as an action and uses IRL to infer rewards.
Develop a tractable training pipeline using RKHS to obtain an analytic reward and policy gradient updates.

Proposed method

Model the next event time as an action from a stochastic policy pi_theta(a|s_t) parameterized by an RNN with stochastic neurons.
Link the policy to the intensity function via lambda_theta(t|s_{t_i}) = pi_theta(t - t_i | s_{t_i}) / (1 - ∫_{t_i}^{t} pi_theta(τ - t_i | s_{t_i}) dτ), where t_i is the most recent event time before t.
Use inverse reinforcement learning to infer the unknown reward function by optimizing over the unit ball of an RKHS, which yields an analytic form for the reward.
Transform the IRL problem into a discrepancy minimization between expert and learner mean embeddings in RKHS, enabling a closed-form update (Theorem 1).
Optimize the policy with a policy gradient and variance-reduction techniques, using reward-to-go and baselines.
Provide a practical RLPP algorithm with mini-batches to train the policy.
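
The link between the policy density and the intensity function is the standard hazard-rate identity (density divided by survival probability). A minimal numerical sketch follows; it is not the authors' code. The function name `intensity_from_policy`, the grid-based trapezoid integration, and the exponential test density are assumptions made for illustration only.

```python
import numpy as np

def intensity_from_policy(pdf, elapsed):
    """Hazard rate on a grid of elapsed times since the last event.

    pdf:     density of the inter-event time (stands in for pi_theta)
    elapsed: increasing grid of t - t_i values starting at 0
    """
    dens = pdf(elapsed)
    # Cumulative trapezoid rule approximates the integral of the density.
    steps = 0.5 * (dens[1:] + dens[:-1]) * np.diff(elapsed)
    cdf = np.concatenate(([0.0], np.cumsum(steps)))
    # Intensity = density / survival probability.
    return dens / (1.0 - cdf)

# Illustrative check (assumption): an exponential inter-event density
# should give a constant intensity equal to its rate.
rate = 2.0
elapsed = np.linspace(0.0, 2.0, 2001)
lam = intensity_from_policy(lambda x: rate * np.exp(-rate * x), elapsed)
```

With an exponential density the recovered intensity is flat, which is the homogeneous Poisson case; a learned RNN policy would instead produce a history-dependent density and therefore a history-dependent intensity.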

Experimental results

Research questions

RQ1: Can reinforcement learning provide a flexible alternative to MLE for learning temporal point processes without hand-crafting the intensity function?
RQ2: Does an RKHS-based analytic reward enable efficient and stable policy learning for point processes?
RQ3: How does the proposed RL framework compare to state-of-the-art methods (e.g., RMTPP, WGANTPP) on synthetic and real data?
RQ4: What is the impact of using a stochastic RNN policy on modeling complex temporal dependencies?

Key findings

RLPP outperforms RMTPP and is competitive with or better than WGANTPP on synthetic and real datasets, as judged by how closely the learned intensity functions match the data.
The RKHS-based reward yields a closed-form expression for the optimal reward, enabling efficient policy updates via gradient methods.
RLPP remains robust under model misspecification, matching or surpassing baseline methods in fitting empirical intensities.
Compared to LGCP and non-parametric Hawkes, RLPP achieves similar or better empirical intensity without time discretization, with favorable runtime.
RLPP demonstrates substantial runtime advantages over adversarial baselines (e.g., ~40x faster than WGANTPP) while maintaining performance.
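
To make the closed-form reward concrete: maximizing a linear objective over an RKHS unit ball is solved by the (normalized) difference between the expert and learner mean embeddings, so the reward at time t can be evaluated directly from samples. The sketch below is an illustration under assumptions, not the paper's implementation: the Gaussian kernel, the bandwidth, the omission of the normalizing constant, and the names `gaussian_kernel`, `mean_embedding`, and `reward` are all hypothetical choices.

```python
import numpy as np

def gaussian_kernel(x, y, bandwidth=1.0):
    """Pairwise Gaussian kernel between two 1-D arrays of event times."""
    x = np.asarray(x, dtype=float)[:, None]
    y = np.asarray(y, dtype=float)[None, :]
    return np.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))

def mean_embedding(sequences, t, bandwidth=1.0):
    """Average over sequences of sum_i k(t_i, t), evaluated at each t."""
    t = np.atleast_1d(np.asarray(t, dtype=float))
    total = np.zeros(t.shape[0])
    for seq in sequences:
        if len(seq):  # an empty sequence contributes nothing
            total += gaussian_kernel(seq, t, bandwidth).sum(axis=0)
    return total / len(sequences)

def reward(expert_seqs, learner_seqs, t, bandwidth=1.0):
    """Unnormalized reward: expert embedding minus learner embedding."""
    return (mean_embedding(expert_seqs, t, bandwidth)
            - mean_embedding(learner_seqs, t, bandwidth))

# Toy data (assumption): expert events cluster near t=1..2,
# learner events cluster near t=4.
expert = [np.array([1.0, 2.0]), np.array([1.1, 2.1])]
learner = [np.array([4.0]), np.array([4.2, 4.5])]
r = reward(expert, learner, np.array([1.0, 4.2]))
```

The reward is positive where the expert data place events that the learner misses and negative where the learner over-generates, and it vanishes when both sets of sequences coincide. Because no inner optimization over a discriminator is needed, each policy update only requires kernel evaluations on a mini-batch.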

This review was created by AI and reviewed by human editors.