QUICK REVIEW

[Paper Review] Reinforcement Learning for Dividend Optimization in Partially Observed Regime-Switching Diffusion Model

Zhongqin Gao, Lv Yan|arXiv (Cornell University)|Jan 28, 2026

Stochastic processes and financial applications | Cited by 0

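ONE-SENTENCE SUMMARY

Under partial information, the paper studies the optimal dividend problem in a regime-switching diffusion model, derives a semi-analytical structure for the value function, and learns the policy with an actor-critic algorithm that uses belief-state filtering.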

ABSTRACT

This paper studies the optimal dividend problem with a bounded payout rate in a partially observed regime-switching diffusion model, where, in practice, the market regime is unobserved and key model parameters are unknown. To address this partial-information setting, we propose a continuous-time reinforcement learning (RL) approach within an exploratory (entropy-regularized) stochastic control framework for discounted dividends under regime switching. The associated exploratory Hamilton-Jacobi-Bellman (HJB) system admits semi-analytical characterizations of the value function and the optimal exploratory dividend policy, determined by two unknown functions solving two ordinary differential equations (ODEs) together with positive real roots of the induced quadratic equations. Exploiting this structure, we introduce parametric families for both the value function and the policy, using low-degree polynomial approximations to the ODE solutions. We then develop an actor-critic RL algorithm to learn the optimal exploratory policy through interactions with the market environment: it performs belief-state filtering from observed data and iterates policy evaluation and policy improvement online to refine the policy. Numerical experiments demonstrate strong out-of-sample performance of the learned dividend policies.

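RESEARCH MOTIVATION AND OBJECTIVES

Motivate optimal dividend payout in a regime-switching setting subject to regulatory constraints and model uncertainty.
Formulate the partial-information dividend problem in which the market regime is unobserved and the model parameters are unknown.
Develop an exploratory (entropy-regularized) stochastic control framework for learning the optimal dividend policy.
Provide semi-analytical characterizations of the value function and the policy to guide algorithm design.
Demonstrate the performance of the learned policy through numerical experiments with out-of-sample validation.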

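PROPOSED METHOD

Model the surplus as a regime-switching diffusion with an unobserved regime and a bounded payout rate.
Apply the separation principle: a Wonham filter turns the partial-information problem into a full-information problem on the belief state.
Adopt an entropy-regularized exploratory control framework and derive the exploratory HJB equation.
Obtain semi-analytical representations of the value function and the optimal exploratory dividend policy from the solutions of two ODEs and the associated quadratic equations.
Characterize the optimal policy as a truncated Gibbs distribution that depends on the surplus level and the temperature parameter.
Develop an actor-critic RL algorithm that uses belief-state filtering and alternates between policy evaluation and policy improvement.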
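
The belief-state filtering step can be illustrated with a minimal sketch. It assumes exactly two regimes, a constant volatility, and known parameters, whereas the paper treats key parameters as unknown. The names `wonham_step`, `filter_path`, `mu1`, `mu2`, `sigma`, `q12`, and `q21` are hypothetical and chosen for illustration only; the paper's own discretization may differ.

```python
import numpy as np

def wonham_step(p, dx, u, dt, mu1, mu2, sigma, q12, q21):
    """One Euler step of a two-regime Wonham filter (illustrative sketch).

    p   : current belief P(regime = 1 | observations so far)
    dx  : observed surplus increment over the step
    u   : dividend payout rate applied over the step
    q12 : switching intensity from regime 1 to regime 2 (q21: the reverse)
    """
    # Expected surplus drift under the current belief, net of dividends.
    mean_drift = p * mu1 + (1.0 - p) * mu2 - u
    # Innovation: the part of the increment the belief did not predict.
    innovation = dx - mean_drift * dt
    # Belief drift caused by regime switching.
    switching = (q21 * (1.0 - p) - q12 * p) * dt
    # Bayesian correction: larger when the regimes' drifts differ more.
    correction = p * (1.0 - p) * (mu1 - mu2) / sigma**2 * innovation
    # Clip because the Euler scheme can step slightly outside [0, 1].
    return float(np.clip(p + switching + correction, 0.0, 1.0))

def filter_path(p0, dxs, us, dt, **params):
    """Run the filter along a path of increments and payout rates."""
    beliefs = [p0]
    for dx, u in zip(dxs, us):
        beliefs.append(wonham_step(beliefs[-1], dx, u, dt, **params))
    return np.array(beliefs)
```

With the belief in hand, the controller can treat the pair (surplus, belief) as a fully observed state, which is the role the separation principle plays in the method above.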

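EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How can optimal dividend payout control be achieved when the market regime is not directly observed?
RQ2: Does the exploratory (entropy-regularized) RL framework yield robust policies under regime switching and partial information?
RQ3: What is the semi-analytical structure of the value function and the optimal policy in this setting?
RQ4: How can belief-state filtering be integrated into a continuous-time RL algorithm for dividend optimization?
RQ5: Does the learned policy show stronger out-of-sample performance than benchmark methods?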

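KEY FINDINGS

The exploratory HJB system gives a semi-analytical description of the value function and the optimal policy, determined by two unknown functions that solve two ODEs, together with the roots of the induced quadratic equations.
The optimal exploratory dividend policy takes a truncated Gibbs form and varies with the surplus level and the temperature parameter.
Belief-state (Wonham) filtering conditions on the unobserved regime, turning the learning problem into a full-information control problem.
An actor-critic RL algorithm that updates the policy and value estimates online shows strong out-of-sample performance and reduces variance across paths.
Numerical experiments show that the learned policy performs well beyond the training data and agrees in its mean estimates with a finite-difference benchmark.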
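
To make the truncated Gibbs form concrete, here is a hedged sketch that samples a payout rate from a density proportional to exp(c * u) on the bounded interval [0, a]. The tilt coefficient `c` is left as an input: in entropy-regularized problems one would expect it to be a marginal-value term divided by the temperature, but that expression is an assumption here and is not taken from the paper. The function names are hypothetical, and the simple inverse-CDF formula can overflow when `c * a` is very large.

```python
import numpy as np

def sample_truncated_gibbs(c, a, size, rng):
    """Draw payout rates from the density proportional to exp(c * u) on [0, a]."""
    uniforms = rng.random(size)
    if abs(c) < 1e-12:
        # Zero tilt: the density is flat, so the draw is uniform on [0, a].
        return a * uniforms
    # Inverse of the CDF F(u) = (exp(c*u) - 1) / (exp(c*a) - 1).
    return np.log1p(uniforms * np.expm1(c * a)) / c

def truncated_gibbs_mean(c, a):
    """Closed-form mean of the same density, usable as a sanity check."""
    if abs(c) < 1e-12:
        return a / 2.0
    # a * exp(c*a) / (exp(c*a) - 1) - 1/c, written with expm1 for accuracy.
    return a / (-np.expm1(-c * a)) - 1.0 / c

rng = np.random.default_rng(0)
draws = sample_truncated_gibbs(c=2.0, a=1.0, size=10_000, rng=rng)
```

A positive `c` shifts mass toward the maximum payout rate `a`, a negative `c` toward zero, and as the magnitude of `c` grows (for instance, as the temperature shrinks) the draws concentrate near one end of the interval.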


This review was generated by AI and checked by a human editor.