QUICK REVIEW

[Paper Review] Adaptive Traffic Signal Control: Deep Reinforcement Learning Algorithm with Experience Replay and Target Network

Juntao Gao, Yulong Shen|arXiv (Cornell University)|May 8, 2017

Traffic control and management | References: 12 | Cited by: 148

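One-Sentence Summary

The paper proposes a deep reinforcement learning method that combines a CNN-based feature extractor, experience replay, and a target network to control traffic signals adaptively from raw real-time data, improving training stability and reducing vehicle delay.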

ABSTRACT

Adaptive traffic signal control, which adjusts traffic signal timing according to real-time traffic, has been shown to be an effective method to reduce traffic congestion. Available works on adaptive traffic signal control make responsive traffic signal control decisions based on human-crafted features (e.g. vehicle queue length). However, human-crafted features are abstractions of raw traffic data (e.g., position and speed of vehicles), which ignore some useful traffic information and lead to suboptimal traffic signal controls. In this paper, we propose a deep reinforcement learning algorithm that automatically extracts all useful features (machine-crafted features) from raw real-time traffic data and learns the optimal policy for adaptive traffic signal control. To improve algorithm stability, we adopt experience replay and target network mechanisms. Simulation results show that our algorithm reduces vehicle delay by up to 47% and 86% when compared to another two popular traffic signal control algorithms, longest queue first algorithm and fixed time control algorithm, respectively.

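Research Motivation and Goals

Motivate adaptive traffic signal control as a better way to handle dynamic, real-time traffic than fixed-time or queue-based methods.
Remove the dependence on human-crafted features by learning directly from raw traffic data.
Develop a stable DRL framework built on experience replay and a target network.
Demonstrate effectiveness through simulation comparisons against popular baseline controllers.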

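Proposed Method

Model intersection control as a Markov decision process, with states, actions, and rewards defined from real-time traffic data.
Use a deep convolutional neural network to extract features from the vehicle position and speed matrices and the signal state.
Implement a DQN-like architecture with a separate target network to stabilize learning and experience replay for efficient training.
Train with an ε-greedy policy, minimizing the temporal-difference error with RMSProp and applying soft updates to the target network.
Represent the input as, for each road, a matrix P (vehicle positions) and a matrix V (normalized speeds), plus a vector L that encodes the two-action green-light configuration.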
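The pieces listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the single-lane cell discretization, the function and class names (`encode_road`, `ReplayBuffer`, `epsilon_greedy`, `td_target`, `soft_update`), and all parameter values are hypothetical, and the soft update is assumed to take the common form θ' ← βθ + (1 − β)θ'. The CNN itself is omitted; network weights are stood in for by plain lists of floats.

```python
import random
from collections import deque


def encode_road(vehicles, n_cells, cell_len, v_max):
    """Build one row of P (occupancy) and V (normalized speed) for a road.

    vehicles: list of (distance_to_stop_line, speed) pairs.
    Assumption: one lane, split into n_cells cells of length cell_len.
    """
    P = [0.0] * n_cells
    V = [0.0] * n_cells
    for dist, speed in vehicles:
        cell = int(dist // cell_len)
        if 0 <= cell < n_cells:
            P[cell] = 1.0
            V[cell] = min(speed / v_max, 1.0)  # clip speeds to [0, 1]
    return P, V


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest entries drop out

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive steps.
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def epsilon_greedy(q_values, epsilon):
    """Pick a random action with probability epsilon, else the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def td_target(reward, next_q_from_target_net, gamma):
    """One-step TD target computed from the target network's Q-values."""
    return reward + gamma * max(next_q_from_target_net)


def soft_update(target_weights, online_weights, beta):
    """Move target weights a small step toward the online weights."""
    return [beta * w + (1.0 - beta) * t
            for t, w in zip(target_weights, online_weights)]
```

In a training loop, each simulation step would add one transition to the buffer, sample a minibatch, regress the online network toward `td_target`, and then call `soft_update` with a small `beta` so the target network changes slowly.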

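Experimental Results

Research Questions

RQ1: Can a deep reinforcement learning agent learn effective adaptive traffic signal control directly from raw traffic data, without human-crafted features?
RQ2: Do experience replay and a target network improve the stability and performance of DRL-based traffic signal control?
RQ3: How does the proposed method compare with the fixed-time and longest-queue-first (LQF) baselines under different traffic demands?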

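Key Findings

The DRL agent learns a policy that reduces the sum of vehicle staying times, which converges to a stable, small value after sufficient training.
As training proceeds, the average vehicle delay at each intersection falls, indicating that the learned control policy is fair.
Under higher traffic demand, the DRL method substantially reduces delay relative to the fixed-time and longest-queue-first baselines (by up to 86% versus fixed-time control and up to 47% versus LQF).
The method is robust to changes in demand: as demand grows, delay on busy road segments increases only slightly.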
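To make the percentage figures concrete, the following sketch shows how a relative delay reduction is typically computed. The function name and the input numbers are hypothetical and are not taken from the paper; only the "up to 86%" and "up to 47%" figures come from the source.

```python
def relative_delay_reduction(baseline_delay, method_delay):
    """Fraction by which method_delay is lower than baseline_delay."""
    if baseline_delay <= 0:
        raise ValueError("baseline_delay must be positive")
    return (baseline_delay - method_delay) / baseline_delay


# Hypothetical delays in seconds, NOT values reported in the paper:
# a baseline of 100 s against 14 s corresponds to an 86% reduction.
reduction = relative_delay_reduction(100.0, 14.0)
print(f"{reduction:.0%}")
```

Read this way, an 86% reduction means the remaining delay is about one seventh of the baseline delay, not that delay fell by 86 seconds.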


This review was generated by AI and checked by a human editor.