QUICK REVIEW

[Paper Review] Learning Temporal Point Processes via Reinforcement Learning

Shuang Li, Shuai Xiao|arXiv (Cornell University)|2018. 11. 12.

Point processes and geometric inequalities | Citations: 54

One-Line Summary

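This paper casts the learning of temporal point processes as a reinforcement learning problem: each event is generated as the action of a stochastic policy, and the policy is trained against an analytic, RKHS-based reward function. The authors report that the resulting method outperforms MLE-based approaches.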

ABSTRACT

Social goods, such as healthcare, smart city, and information networks, often produce ordered event data in continuous time. The generative processes of these event data can be very complex, requiring flexible models to capture their dynamics. Temporal point processes offer an elegant framework for modeling event data without discretizing the time. However, the existing maximum-likelihood-estimation (MLE) learning paradigm requires hand-crafting the intensity function beforehand and cannot directly monitor the goodness-of-fit of the estimated model in the process of training. To alleviate the risk of model-misspecification in MLE, we propose to generate samples from the generative model and monitor the quality of the samples in the process of training until the samples and the real data are indistinguishable. We take inspiration from reinforcement learning (RL) and treat the generation of each event as the action taken by a stochastic policy. We parameterize the policy as a flexible recurrent neural network and gradually improve the policy to mimic the observed event distribution. Since the reward function is unknown in this setting, we uncover an analytic and nonparametric form of the reward function using an inverse reinforcement learning formulation. This new RL framework allows us to derive an efficient policy gradient algorithm for learning flexible point process models, and we show that it performs well in both synthetic and real data.

Motivation and Objectives

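Model complex event dynamics directly in continuous time, without discretizing the time axis.
Address a limitation of maximum likelihood estimation, which cannot directly monitor goodness of fit during training, by monitoring the generated samples as training proceeds.
Propose a reinforcement learning framework that treats each event as an action and infers the reward via inverse reinforcement learning (IRL).
Develop a practical training pipeline that uses a reproducing kernel Hilbert space (RKHS) to obtain an analytic reward and policy gradient updates.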

Proposed Method

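The time of the next event is modeled as an action drawn from a stochastic policy $\pi_\theta(a \mid s_t)$, which is parameterized by an RNN with stochastic neurons.
The policy is linked to the intensity function by $\lambda_\theta(t \mid s_t) = \dfrac{\pi_\theta(t - t_i \mid s_{t_i})}{1 - \int_{t_i}^{t} \pi_\theta(\tau - t_i \mid s_{t_i})\, d\tau}$.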
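IRL is used to infer the unknown reward function by optimizing over the unit ball of an RKHS, which yields an analytic form for the reward.
The IRL problem is recast as minimizing the discrepancy between the expert's and the learner's mean embeddings in the RKHS, which admits a closed-form update (Theorem 1).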
The policy is optimized by policy gradient, with reward-to-go and a baseline used for variance reduction.
A practical minibatch algorithm, RLPP, is given for training the policy.
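
As a rough illustration of the policy-to-intensity relation above, the following sketch evaluates the ratio numerically for a generic inter-event density. The function names (`intensity_from_policy`, `exponential_density`) and the trapezoidal integration are assumptions made for this example and are not taken from the paper. An exponential density is used only because its intensity is known to be constant, which gives a simple sanity check.

```python
import math


def intensity_from_policy(density, elapsed, n_steps=10_000):
    """Evaluate density(elapsed) / (1 - integral of density over [0, elapsed]).

    `density` is a hypothetical stand-in for the policy's density over the
    time elapsed since the last event, i.e. pi(t - t_i | s_{t_i}).
    """
    if elapsed < 0:
        raise ValueError("elapsed time must be non-negative")

    # Trapezoidal rule for the probability that the next event has
    # already occurred by `elapsed`.
    cdf = 0.0
    if elapsed > 0:
        h = elapsed / n_steps
        cdf = 0.5 * (density(0.0) + density(elapsed))
        for k in range(1, n_steps):
            cdf += density(k * h)
        cdf *= h

    survival = 1.0 - cdf  # probability that no event has occurred yet
    return density(elapsed) / survival


def exponential_density(rate):
    """Density of an exponential inter-event time (illustrative choice)."""
    return lambda tau: rate * math.exp(-rate * tau)


# Sanity check: an exponential density should give a constant intensity
# equal to its rate, up to numerical integration error.
lam = intensity_from_policy(exponential_density(2.0), elapsed=0.7)
print(lam)
```

In this toy case the printed value should be close to the rate 2.0 regardless of the elapsed time, which is the memoryless Poisson special case of the relation.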

Experimental Results

Research Questions

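RQ1: Can reinforcement learning serve as a flexible alternative to MLE for learning temporal point processes, without hand-crafting the intensity function?
RQ2: Does the analytic RKHS-based reward enable efficient and stable policy learning for point processes?
RQ3: How does the proposed RL framework compare with state-of-the-art methods (RMTPP, WGANTPP, and others) on synthetic and real data?
RQ4: What effect does a stochastic RNN policy have on modeling complex temporal dependencies?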

Key Results

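On both synthetic and real datasets, RLPP outperforms RMTPP in terms of the learned intensity function and is competitive with or better than WGANTPP.
The RKHS-based reward gives a closed form for the optimal reward, so the policy can be updated with gradient methods.
RLPP is robust to model misspecification and matches or exceeds the baselines in fitting the empirical intensity.
Compared with LGCP and nonparametric Hawkes models, RLPP achieves similar or better empirical intensity fits without discretizing time, and with a better runtime.
RLPP has a substantial runtime advantage over the adversarial baseline (for example, about 40 times faster than WGANTPP) while maintaining performance.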
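
The closed-form reward noted above can be pictured as a difference of empirical mean embeddings: kernel mass contributed by expert events minus kernel mass contributed by generated events. The sketch below assumes a Gaussian kernel with a hypothetical bandwidth, omits any normalization constant, and uses made-up toy sequences. The helper names are illustrative and not from the paper. It also shows one way per-event rewards could be accumulated into reward-to-go values.

```python
import math


def gaussian_kernel(t, s, bandwidth=1.0):
    """Gaussian kernel on event times (kernel and bandwidth are assumptions)."""
    return math.exp(-((t - s) ** 2) / (2.0 * bandwidth ** 2))


def mean_embedding(sequences, t, kernel=gaussian_kernel):
    """Empirical mean embedding at time t: kernel sums averaged over sequences."""
    total = sum(kernel(event, t) for seq in sequences for event in seq)
    return total / len(sequences)


def reward(t, expert_seqs, learner_seqs, kernel=gaussian_kernel):
    """Unnormalized reward: expert embedding minus learner embedding at t."""
    return (mean_embedding(expert_seqs, t, kernel)
            - mean_embedding(learner_seqs, t, kernel))


def reward_to_go(event_times, expert_seqs, learner_seqs):
    """Suffix sums of per-event rewards along one generated sequence."""
    rewards = [reward(t, expert_seqs, learner_seqs) for t in event_times]
    out, running = [], 0.0
    for r in reversed(rewards):
        running += r
        out.append(running)
    return out[::-1]


# Toy data (fabricated for illustration only): expert events cluster near
# t = 1..3, generated events cluster near t = 5..6.
expert = [[1.0, 2.0, 3.0], [1.1, 2.1]]
learner = [[5.0, 6.0], [5.5]]

r_near_expert = reward(2.0, expert, learner)   # where expert events are dense
r_near_learner = reward(5.5, expert, learner)  # where only generated events fall
print(r_near_expert, r_near_learner)
print(reward_to_go(learner[0], expert, learner))
```

With this toy data the reward is positive where expert events are dense and negative where only generated events fall, so a policy gradient step would push generated events toward the expert's event times. When the two sets of sequences coincide, the reward is zero everywhere.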

This review was generated by AI and reviewed by a human editor.