QUICK REVIEW

[Paper Review] Reinforcement Learning for Optimal Execution when Liquidity is Time-Varying

Andrea Macrì, Fabrizio Lillo|arXiv (Cornell University)|Feb 19, 2024

Scheduling and Optimization Algorithms|Cited by 5

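One-Sentence Summary

This paper applies Double Deep Q-Learning to optimal execution in an Almgren-Chriss framework with time-varying liquidity, showing that model-agnostic policy learning matches the known solution where one exists and outperforms benchmarks when the dynamics are unknown.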

ABSTRACT

Optimal execution is an important problem faced by any trader. Most solutions are based on the assumption of constant market impact, while liquidity is known to be dynamic. Moreover, models with time-varying liquidity typically assume that it is observable, despite the fact that, in reality, it is latent and hard to measure in real time. In this paper we show that the use of Double Deep Q-learning, a form of Reinforcement Learning based on neural networks, is able to learn optimal trading policies when liquidity is time-varying. Specifically, we consider an Almgren-Chriss framework with temporary and permanent impact parameters following several deterministic and stochastic dynamics. Using extensive numerical experiments, we show that the trained algorithm learns the optimal policy when the analytical solution is available, and overcomes benchmarks and approximated solutions when the solution is not available.

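Motivation and Objectives

Establish the need for robust optimal execution when liquidity is dynamic and latent.
Develop a model-agnostic reinforcement learning framework that learns execution policies without exact knowledge of the impact parameters.
Evaluate Double Deep Q-Learning (DDQL) against analytical solutions and benchmarks under deterministic and stochastic liquidity dynamics.
Show that DDQL recovers a TWAP-like policy in a market with constant impact and improves performance when impact varies.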

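Proposed Method

Use the Almgren-Chriss model as the baseline, with time-varying permanent and temporary impact parameters (deterministic and stochastic dynamics).
Implement Double Deep Q-Learning with two neural networks (Q-main and Q-target) and experience replay for stability.
Define the state as (q_t, t) or (q_t, t, S_{t-1}), where q_t is the remaining inventory and S_{t-1} is the previous mid-price, and define the action as the quantity sold v_t, bounded by the remaining inventory.
Train for M episodes with an epsilon-greedy exploration-exploitation policy, updating toward TD targets with discount factor gamma = 1 (risk-neutral).
Compare DDQL against the analytical solution (where known) and the TWAP benchmark under constant, deterministic time-varying, and stochastic impact.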
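The setup above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear impact form, the function names `liquidation_revenue` and `double_q_target`, the boolean `valid` mask, and all numeric parameter values are hypothetical. The first function simulates one liquidation under temporary and permanent impact paths; the second computes the Double Q-learning TD target, in which Q-main selects the next action and Q-target evaluates it.

```python
import numpy as np


def liquidation_revenue(schedule, kappa, alpha, s0=10.0, sigma=0.0, seed=0):
    """Cash raised by selling schedule[t] shares at step t.

    Assumed dynamics (linear impact, illustrative only):
      execution price = mid - alpha[t] * v                  (temporary impact)
      next mid-price  = mid - kappa[t] * v + sigma * noise  (permanent impact)
    """
    rng = np.random.default_rng(seed)
    mid, cash = s0, 0.0
    for v, k, a in zip(schedule, kappa, alpha):
        cash += v * (mid - a * v)
        mid = mid - k * v + sigma * rng.standard_normal()
    return cash


def double_q_target(reward, next_q_main, next_q_target, done, valid, gamma=1.0):
    """Double Q-learning TD target for a batch of transitions.

    Q-main picks the next action and Q-target scores it. `valid` is a
    boolean mask (an assumption of this sketch) that excludes sell sizes
    larger than the remaining inventory.
    """
    masked = np.where(valid, next_q_main, -np.inf)
    best = np.argmax(masked, axis=1)
    scored = next_q_target[np.arange(len(best)), best]
    return reward + gamma * (1.0 - done) * scored


# Hypothetical constant-impact scenario: compare TWAP with selling at once.
n_steps, q0 = 10, 20.0
kappa = np.full(n_steps, 0.001)  # hypothetical permanent-impact path
alpha = np.full(n_steps, 0.002)  # hypothetical temporary-impact path
twap = np.full(n_steps, q0 / n_steps)
front_loaded = np.r_[q0, np.zeros(n_steps - 1)]
twap_revenue = liquidation_revenue(twap, kappa, alpha)
front_revenue = liquidation_revenue(front_loaded, kappa, alpha)
```

With these constant parameters the TWAP schedule raises more cash than selling the whole position in the first step, which is consistent with the TWAP-like policy reported for the constant-impact case. Replacing `kappa` and `alpha` with time-varying arrays gives the kind of environment in which a learned schedule can differ from TWAP.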

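Experimental Results

Research Questions

RQ1: Can DDQL learn an optimal execution policy when liquidity is time-varying and impact is latent?
RQ2: Does DDQL recover the known optimal policy under constant impact, and does it outperform benchmarks when the impact dynamics are unknown or complex?
RQ3: How does adding features such as price to the state affect DDQL's performance under different liquidity dynamics?
RQ4: To what extent does a model-agnostic DDQL agent adapt to deterministic and stochastic impact paths to produce robust liquidation policies?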

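Key Findings

Under constant impact, DDQL reproduces costs close to TWAP, with a small Delta P&L (e.g., -0.455 bp with Q,T features and 2.5 standard deviations).
Under constant impact, adding the mid-price as a feature does not produce a significant improvement over TWAP (Delta P&L of roughly -0.225 to -0.455 bp).
When impact increases deterministically over time, DDQL with Q,T,S features nearly matches the theoretical optimum, with a Delta P&L of about 2 bp relative to theory.
When impact decreases over time, DDQL with Q,T,S features improves on TWAP and approaches the theoretical optimum; adding the price feature improves results further, though they remain slightly suboptimal.
Overall, DDQL learns robustly without a model, adapts to time-varying liquidity, and outperforms benchmarks when the impact dynamics are unknown.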
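To make the basis-point figures concrete, the sketch below converts a revenue difference into basis points (1 bp = 0.01%). It assumes Delta P&L is the relative difference in liquidation revenue against a benchmark; the paper's exact definition may differ, and the function name `delta_pnl_bp` and the input values are hypothetical.

```python
def delta_pnl_bp(agent_revenue, benchmark_revenue):
    """Relative revenue difference versus a benchmark, in basis points.

    Assumed definition for illustration: 1e4 * (agent - benchmark) / benchmark.
    """
    return 1e4 * (agent_revenue - benchmark_revenue) / benchmark_revenue


# Hypothetical revenues: the agent raises 100.01 against a benchmark of 100.00.
example = delta_pnl_bp(100.01, 100.0)
```

Under this definition a negative value means the agent raised less cash than the benchmark, so a Delta P&L within a fraction of a basis point of zero indicates a schedule that is economically almost identical to the benchmark.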

This review was generated by AI and reviewed by a human editor.