QUICK REVIEW

[Paper Review] Model-based Deep Reinforcement Learning for Dynamic Portfolio Optimization

P. L. Yu, Joon Sern Lee|arXiv (Cornell University)|Jan 25, 2019

Advanced Bandit Algorithms Research|References: 49|Cited by: 63

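One-Sentence Summary

This paper proposes a model-based deep reinforcement learning architecture for dynamic portfolio optimization, introducing an Infused Prediction Module, a Data Augmentation Module with GANs, and a Behavior Cloning Module to stabilize training and improve risk-adjusted returns.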

ABSTRACT

Dynamic portfolio optimization is the process of sequentially allocating wealth to a collection of assets in some consecutive trading periods, based on investors' return-risk profile. Automating this process with machine learning remains a challenging problem. Here, we design a deep reinforcement learning (RL) architecture with an autonomous trading agent such that, investment decisions and actions are made periodically, based on a global objective, with autonomy. In particular, without relying on a purely model-free RL agent, we train our trading agent using a novel RL architecture consisting of an infused prediction module (IPM), a generative adversarial data augmentation module (DAM) and a behavior cloning module (BCM). Our model-based approach works with both on-policy or off-policy RL algorithms. We further design the back-testing and execution engine which interact with the RL agent in real time. Using historical real financial market data, we simulate trading with practical constraints, and demonstrate that our proposed model is robust, profitable and risk-sensitive, as compared to baseline trading strategies and model-free RL agents from prior work.

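Motivation and Objectives

Advance the use of reinforcement learning for dynamic portfolio optimization under realistic trading constraints.
Develop a model-based RL framework that addresses data efficiency, non-stationarity, and risk management in finance.
Integrate prediction, data augmentation, and imitation components to improve the trading agent's stability and performance.
Evaluate the proposed architecture by comparing baseline and model-based RL methods on historical market data.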

Proposed Method

Introduce an infused prediction module (IPM) that adds future observation predictions to the state used by RL algorithms.
Incorporate a data augmentation module (DAM) using a recurrent GAN with maximum mean discrepancy (MMD) to generate realistic synthetic market data.
Implement a behavior cloning module (BCM) that provides one-step greedy action demonstrations to constrain policy updates.
Adopt a model-based adaptation of DDPG (and discuss applicability to PPO/TRPO) with an actor–critic setup.
Extend the state with predictive features and market index signals, employing an LSTM-based or CNN-based feature extractor for the actor/critic networks.
Train and test the agent on portfolios where decisions are made daily and actions are executed hourly, with transaction costs and slippage to reflect real-world constraints.
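
The maximum mean discrepancy mentioned for the DAM above is a distance between two samples, which makes it a natural way to compare synthetic market data with real data. The following is a minimal sketch of the standard biased MMD² estimate, not the paper's implementation: the Gaussian kernel, the bandwidth value, and the toy Gaussian "return windows" produced by the hypothetical helper `sample_windows` are all assumptions made for illustration. How the paper uses MMD inside the recurrent GAN is not specified in this summary.

```python
import math
import random


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def sample_windows(rng, n, length, mean, std):
    """Hypothetical stand-in for windows of asset returns (toy Gaussian data)."""
    return [[rng.gauss(mean, std) for _ in range(length)] for _ in range(n)]


rng = random.Random(0)
real = sample_windows(rng, 50, 5, 0.0, 0.01)      # stands in for real data
similar = sample_windows(rng, 50, 5, 0.0, 0.01)   # same distribution as "real"
shifted = sample_windows(rng, 50, 5, 0.05, 0.01)  # clearly different mean

# Bandwidth 0.05 is an assumed value on the scale of the toy data.
mmd_similar = mmd_squared(real, similar, bandwidth=0.05)
mmd_shifted = mmd_squared(real, shifted, bandwidth=0.05)
print(mmd_similar, mmd_shifted)
```

A small value indicates that the two samples are hard to tell apart under the chosen kernel, so in this toy setup the shifted sample scores much higher than the sample drawn from the same distribution. The result is sensitive to the bandwidth, which in practice would need to be tuned to the scale of the data.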

Experimental Results

Research Questions

RQ1: Can a model-based RL framework with prediction, augmentation, and imitation components improve dynamic portfolio optimization under transaction costs and market frictions?
RQ2: Do IPM, DAM, and BCM each contribute to improved risk-adjusted performance compared to model-free baselines and traditional strategies?
RQ3: How does integrating future-based predictions and synthetic data affect stability and robustness in non-stationary financial environments?
RQ4: Is the approach extensible to on-policy methods like PPO/TRPO beyond the off-policy DDPG setting?
RQ5: What are the impacts on risk metrics such as drawdown and CVaR when employing the proposed modules?

Key Findings

The proposed architecture improves metrics such as Sharpe ratio, Sortino ratio, maximum drawdown, VaR, and CVaR relative to baselines and model-free RL agents.
IPM provides significant performance gains by incorporating predicted future observations into the RL state.
DAM helps reduce over-fitting and typically leads to portfolios with less volatility through synthetic data augmentation.
BCM contributes to reducing portfolio weight volatility while preserving or enhancing returns in some cases.
The framework demonstrates robustness and profitability under practical trading constraints and non-stationary market conditions.
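
For readers less familiar with the risk metrics named in the findings above, the sketch below computes them from a series of per-period returns. It uses common textbook definitions (per-period, not annualized; historical VaR and CVaR), which may differ from the exact conventions used in the paper. The return series is made-up illustrative data, not a result from the paper.

```python
import math


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    var = sum((r - mean) ** 2 for r in excess) / (len(excess) - 1)
    return mean / math.sqrt(var)


def sortino_ratio(returns, target=0.0):
    """Mean excess return divided by downside deviation.

    Assumes at least one return falls below the target.
    """
    excess = [r - target for r in returns]
    mean = sum(excess) / len(excess)
    downside = math.sqrt(sum(min(r, 0.0) ** 2 for r in excess) / len(excess))
    return mean / downside


def max_drawdown(returns):
    """Largest peak-to-trough fractional loss of compounded wealth."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1.0 + r
        peak = max(peak, wealth)
        worst = max(worst, 1.0 - wealth / peak)
    return worst


def var_cvar(returns, alpha=0.95):
    """Historical VaR and CVaR, reported as positive loss fractions."""
    losses = sorted((-r for r in returns), reverse=True)  # largest loss first
    # Round before ceil to avoid floating-point overshoot in (1 - alpha) * n.
    k = max(1, math.ceil(round((1.0 - alpha) * len(losses), 9)))
    tail = losses[:k]  # the k worst losses
    return tail[-1], sum(tail) / len(tail)


# Made-up per-period returns, for illustration only.
returns = [0.01, -0.02, 0.015, -0.005, 0.02, -0.03, 0.01, 0.005, -0.01, 0.025]

sharpe = sharpe_ratio(returns)
sortino = sortino_ratio(returns)
mdd = max_drawdown(returns)
var80, cvar80 = var_cvar(returns, alpha=0.8)
print(sharpe, sortino, mdd, var80, cvar80)
```

CVaR averages the losses in the tail at or beyond VaR, so it is never smaller than VaR at the same confidence level. This is why it is often preferred when tail risk matters.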

This review was generated by AI and reviewed by a human editor.