[Paper Explained] A Deep Q-Network for the Beer Game: A Deep Reinforcement Learning algorithm to Solve Inventory Optimization Problems
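This paper proposes a shaped-reward deep Q-network (SRDQN) reinforcement learning algorithm for optimizing inventory decisions in the beer game, a decentralized multi-agent supply chain problem. The method learns a near-optimal policy without knowledge of the demand distribution, outperforms a base-stock policy when teammates follow a realistic model of human behavior, and achieves a 10x speedup through transfer learning across agents.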
The beer game is a widely used in-class game that is played in supply chain management classes to demonstrate the bullwhip effect. The game is a decentralized, multi-agent, cooperative problem that can be modeled as a serial supply chain network in which agents cooperatively attempt to minimize the total cost of the network even though each agent can only observe its own local information. Each agent chooses order quantities to replenish its stock. Under some conditions, a base-stock replenishment policy is known to be optimal. However, in a decentralized supply chain in which some agents (stages) may act irrationally (as they do in the beer game), there is no known optimal policy for an agent wishing to act optimally. We propose a machine learning algorithm, based on deep Q-networks, to optimize the replenishment decisions at a given stage. When playing alongside agents who follow a base-stock policy, our algorithm obtains near-optimal order quantities. It performs much better than a base-stock policy when the other agents use a more realistic model of human ordering behavior. Unlike most other algorithms in the literature, our algorithm does not have any limits on the beer game parameter values. Like any deep learning algorithm, training the algorithm can be computationally intensive, but this can be performed ahead of time; the algorithm executes in real time when the game is played. Moreover, we propose a transfer learning approach so that the training performed for one agent and one set of cost coefficients can be adapted quickly for other agents and costs. Our algorithm can be extended to other decentralized multi-agent cooperative games with partially observed information, which is a common type of situation in real-world supply chain problems.
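To make the base-stock policy mentioned in the abstract concrete: each period the agent orders just enough to raise its inventory position (on-hand stock plus outstanding orders minus backorders) to a fixed target level. The sketch below is illustrative only; the function name, parameter names, and numbers are assumptions and are not taken from the paper.

```python
def base_stock_order(base_stock_level, on_hand, on_order, backorders):
    """Order-up-to rule: raise the inventory position to base_stock_level."""
    inventory_position = on_hand + on_order - backorders
    # Never place a negative order when the position already exceeds the target.
    return max(0, base_stock_level - inventory_position)


# Hypothetical numbers: target level 20, 5 units on hand, 8 in transit, 2 backordered.
# The inventory position is 5 + 8 - 2 = 11, so the rule orders the remaining 9 units.
order = base_stock_order(20, on_hand=5, on_order=8, backorders=2)
```

The rule needs only local information, which is why it serves as a natural baseline for an agent that cannot observe the rest of the chain.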
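Research Motivation and Objectives
- Address the lack of an optimal policy in decentralized supply chains where agents behave irrationally or unpredictably.
- Develop a data-driven reinforcement learning method that learns optimal order quantities without assuming a known demand distribution or cost structure.
- Enable efficient transfer learning so that a trained agent can adapt quickly to new agents or to new environments with different cost coefficients or action spaces.
- Evaluate performance against base-stock policies and against ordering behavior under human behavior models, in both simulated and real-world settings.
- Demonstrate the feasibility of applying deep reinforcement learning to complex real-world supply chain coordination problems.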
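Proposed Method
- The SRDQN algorithm extends the deep Q-network (DQN) with reward shaping to guide learning in the beer game's cooperative multi-agent setting.
- The algorithm approximates the Q-function with a deep neural network that maps state-action pairs to expected cumulative reward.
- The state representation includes inventory level, backorders, and order history; the action space is defined by order quantities.
- Reward shaping encourages cost minimization and stabilizes training, and is particularly effective in sparse-reward settings.
- Transfer learning is implemented by initializing the target agent's policy network with the source agent's pre-trained model, which reduces the number of trainable parameters and speeds up convergence.
- Training uses experience replay and a target network to improve stability; hyperparameters are tuned by grid search.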
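Two of the ingredients listed above, experience replay and a target network, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the linear Q-function stands in for the deep network, and the three-feature state `(inventory, backorders, last_order)`, the action list, the cost coefficients, and all function names are hypothetical. The reward here is the plain negative local cost; the paper's shaping term is not reproduced.

```python
import random
from collections import deque

ACTIONS = [0, 1, 2, 3]  # candidate order quantities (hypothetical action space)


def local_reward(inventory, backorders, holding_cost=2.0, shortage_cost=2.0):
    """Negative one-period cost; coefficient values are illustrative assumptions."""
    return -(holding_cost * inventory + shortage_cost * backorders)


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity, seed=0):
        self.data = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, transition):
        self.data.append(transition)

    def sample(self, batch_size):
        # Sample without replacement; shrink the batch if the buffer is small.
        k = min(batch_size, len(self.data))
        return self.rng.sample(list(self.data), k)


def q_values(weights, state):
    """Linear stand-in for the deep network: one weight row per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def td_target(reward, next_state, done, target_weights, gamma=0.99):
    """One-step target: r + gamma * max_a' Q_target(s', a')."""
    if done:
        return reward
    return reward + gamma * max(q_values(target_weights, next_state))


def sync_target(online_weights):
    """Copy the online weights into an independent target network."""
    return [row[:] for row in online_weights]


# Usage: state = (inventory, backorders, last_order)
online = [[0.0, 0.0, 0.0] for _ in ACTIONS]
target = sync_target(online)
buffer = ReplayBuffer(capacity=1000)
buffer.push(((4, 0, 2), 1, local_reward(4, 0), (3, 0, 1), False))
batch = buffer.sample(32)
targets = [td_target(r, s2, done, target) for (_s, _a, r, s2, done) in batch]
```

Sampling past transitions at random breaks the correlation between consecutive periods, and computing targets from a periodically synced copy of the weights keeps the regression target from moving at every update; both are standard DQN stabilizers.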
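Experimental Results
Research Questions
- RQ1: Can a deep reinforcement learning agent learn a near-optimal inventory policy in the beer game when the other agents follow a base-stock policy or a human behavior model?
- RQ2: How does the SRDQN algorithm compare with a base-stock policy when teammates use irrational or suboptimal ordering strategies?
- RQ3: How much can transfer learning reduce training time when adapting to new agents or cost structures?
- RQ4: How robust is a trained SRDQN agent to changes in cost coefficients, action space, or agent role?
- RQ5: Can the SRDQN agent generalize to different supply chain configurations without being retrained from scratch?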
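Key Findings
- When teammates follow a base-stock policy, the SRDQN agent's cost is only 2.31% higher than in the optimal setting where every agent follows a base-stock policy (BS-BS).
- When teammates use a more realistic model of human behavior (Strm-BS), the SRDQN agent's cost is 11.65% lower than that of a base-stock policy.
- Compared with training from scratch, transfer learning reduces training time by up to 46.89%; even when the source and target agents differ in cost coefficients and action space, performance stays within 12.58% of the BS-BS policy.
- The trained SRDQN agent is robust to changes in the holding and shortage cost coefficients and maintains near-optimal performance throughout the sensitivity analysis.
- During transfer learning, the algorithm converges quickly and stably, with low training noise, and rapidly reaches a near-optimal cost level.
- The SRDQN agent has been deployed on an online beer game platform, where more than 4,000 players have used it more than 17,000 times, demonstrating its applicability in real-world settings.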
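The transfer-learning results above rest on initializing a target agent's network from a trained source agent. A minimal sketch of that idea follows, assuming a tiny two-layer network stored as nested lists. Which layers the paper copies or freezes is not stated in this summary, so reusing the hidden layer, freezing it, and rebuilding only the output layer for a different action space is an assumption, as are all names and sizes.

```python
import copy
import random


def make_network(n_inputs, n_hidden, n_actions, seed=0):
    """Tiny two-layer network stored as nested lists (hypothetical layout)."""
    rng = random.Random(seed)
    return {
        "hidden": [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
                   for _ in range(n_hidden)],
        "output": [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_actions)],
    }


def transfer(source_net, n_target_actions, seed=1):
    """Reuse the source agent's hidden layer; rebuild the output layer for the target."""
    rng = random.Random(seed)
    n_hidden = len(source_net["hidden"])
    target_net = {
        # Copied from the source agent and marked as frozen below.
        "hidden": copy.deepcopy(source_net["hidden"]),
        # Fresh head sized for the target agent's action space.
        "output": [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_target_actions)],
    }
    trainable = {"hidden": False, "output": True}  # only the new head is trained
    return target_net, trainable


def count_trainable(net, trainable):
    """Number of weights that would still be updated during fine-tuning."""
    return sum(len(row) for name, rows in net.items() if trainable[name] for row in rows)


# Usage: a source agent with 5 order quantities, a target agent with 9.
source = make_network(n_inputs=3, n_hidden=8, n_actions=5)
target, trainable = transfer(source, n_target_actions=9)
n_trainable = count_trainable(target, trainable)
```

In this toy setup only the 72 output weights remain trainable out of 96 in total, which illustrates how reusing a pre-trained model can shrink the number of parameters that must be learned for the new agent.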
This summary was generated by AI and reviewed by a human editor.