QUICK REVIEW

[Paper Review] Neural Episodic Control

Alexander Pritzel, Benigno Uría|arXiv (Cornell University)|Mar 6, 2017

Reinforcement Learning in Robotics|References: 41|Cited by: 92

One-Sentence Summary

NEC uses per-action differentiable neural dictionaries to store Q-value estimates from recent experience and update them rapidly, learning substantially faster and more data-efficiently on Atari games than several deep RL baselines.

ABSTRACT

Deep reinforcement learning methods attain super-human performance in a wide range of environments. Such methods are grossly inefficient, often taking orders of magnitudes more data than humans to achieve reasonable performance. We propose Neural Episodic Control: a deep reinforcement learning agent that is able to rapidly assimilate new experiences and act upon them. Our agent uses a semi-tabular representation of the value function: a buffer of past experience containing slowly changing state representations and rapidly updated estimates of the value function. We show across a wide range of environments that our agent learns significantly faster than other state-of-the-art, general purpose deep reinforcement learning agents.

Research Motivation and Goals

Address data inefficiency in deep reinforcement learning by accelerating reward propagation and value estimation.
Leverage a semi-tabular memory that combines slow-changing state representations with fast-updating value estimates.
Enable rapid assimilation of new experiences via an append-only, memory-based Q-function akin to episodic memory.
Investigate how fast memory updates interact with N-step returns and a shared CNN embedding to improve learning speed.

Proposed Method

Introduce a Differentiable Neural Dictionary (DND) per action that stores (key, value) pairs.
Process states with a shared convolutional neural network to produce a key h for lookup in each action’s DND.
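Retrieve Q(s,a) as a kernel-weighted sum of the values stored in action a's DND, with weights computed from the distance between h and its nearest-neighbor keys and normalised to sum to one.
Write new (h, Q^(N)(s,a)) pairs to the corresponding action's DND; if the key is already stored, move its value toward Q^(N)(s,a) with a tabular Q-learning update instead of appending a duplicate.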
Use N-step Q-learning targets Q^(N)(s_t,a) = sum_{j=0}^{N-1} gamma^j r_{t+j} + gamma^N max_{a'} Q(s_{t+N}, a'), with the max taken by querying every action's memory.
Train end-to-end with a differentiable network by minimizing the L2 loss between predicted Q(s,a) and Q^(N)(s,a) over mini-batches from a replay buffer.
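
The lookup, write, and N-step target steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DND`, the helper `n_step_target`, the inverse-distance kernel, and the constants `delta`, `p`, and `alpha` are all choices made for this sketch. It uses exact brute-force neighbour search rather than an approximate index, and it omits gradient-based training of keys, values, and the CNN.

```python
class DND:
    """Toy per-action memory of (key, value) pairs; no gradients, no eviction."""

    def __init__(self, delta=1e-3, p=50, alpha=0.1):
        self.keys = []      # stored embeddings h_i (tuples of floats)
        self.values = []    # stored Q-value estimates Q_i
        self.delta = delta  # kernel smoothing constant (assumed value)
        self.p = p          # neighbours used per lookup (assumed value)
        self.alpha = alpha  # tabular learning rate (assumed value)

    @staticmethod
    def _sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def lookup(self, h):
        """Kernel-weighted average of the values of the p nearest keys."""
        if not self.keys:
            return 0.0  # assumption: an empty memory predicts zero
        nearest = sorted(
            (self._sq_dist(h, k), v) for k, v in zip(self.keys, self.values)
        )[: self.p]
        # Assumed inverse-distance kernel: k(h, h_i) = 1 / (||h - h_i||^2 + delta)
        weights = [1.0 / (d + self.delta) for d, _ in nearest]
        total = sum(weights)
        return sum(w * v for w, (_, v) in zip(weights, nearest)) / total

    def write(self, h, target):
        """Append (h, target), or move an existing key's value toward target."""
        h = tuple(h)
        if h in self.keys:
            i = self.keys.index(h)
            self.values[i] += self.alpha * (target - self.values[i])
        else:
            self.keys.append(h)
            self.values.append(target)


def n_step_target(rewards, gamma, bootstrap_q):
    """Discounted sum of N rewards plus gamma^N * max_a' Q(s_{t+N}, a')."""
    n = len(rewards)
    discounted = sum(gamma ** j * r for j, r in enumerate(rewards))
    return discounted + gamma ** n * bootstrap_q


# Usage: one memory per action; the embeddings below are hand-made stand-ins
# for the output of the shared CNN.
memories = {a: DND() for a in ("left", "right")}
memories["left"].write((0.0, 0.0), 1.0)
memories["left"].write((1.0, 0.0), 3.0)
memories["right"].write((0.5, 0.0), 0.5)

h = (0.5, 0.0)
# The max over actions is taken by querying every action's memory.
bootstrap = max(m.lookup(h) for m in memories.values())
target = n_step_target([1.0, 0.0], gamma=0.5, bootstrap_q=bootstrap)
memories["left"].write(h, target)
```

Keeping one dictionary per action means a lookup never mixes values recorded for different actions, and the append-or-update rule in `write` is what lets a single new experience change the value estimate immediately.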

Experimental Results

Research Questions

RQ1: Can a memory-augmented, semi-tabular value function accelerate data-efficient learning in deep RL environments like Atari?
RQ2: How does adding a fast-updating memory (DND) per action influence reward propagation and learning speed compared to standard DQN/A3C baselines?
RQ3: What is the impact of N-step Q-learning and differentiable memory on final performance and data efficiency across diverse Atari games?
RQ4: Does an append-only, large-scale memory with approximate nearest-neighbor access provide practical benefits over episodic resets in memory?

Key Findings

Frames	Nature DQN	Q*(λ)	Retrace(λ)	Prioritised Replay	A3C	NEC	MFEC
1M	-0.7%	-0.8%	-0.4%	-2.4%	0.4%	16.7%	12.8%
2M	0.0%	0.1%	0.2%	0.0%	0.9%	27.8%	16.7%
4M	2.4%	1.8%	3.3%	2.7%	1.9%	36.0%	26.6%
10M	15.7%	13.0%	17.3%	22.4%	3.6%	54.6%	45.4%
20M	26.8%	26.9%	30.4%	38.6%	7.9%	72.0%	55.9%
40M	52.7%	59.6%	60.5%	89.0%	18.4%	83.3%	61.9%
(Table 1) Median human-normalised scores across games at different frames
Note: values represent human-normalised scores as reported in the paper.
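
For readers unfamiliar with the metric, the sketch below shows how a median human-normalised score is typically computed. It assumes the common Atari convention of 100 × (agent − random) / (human − random) per game; the summary above does not restate the formula, so treat this as an assumption. The game names and raw scores are hypothetical placeholders, not values from the paper.

```python
from statistics import median


def human_normalised(agent, random_play, human):
    """Score as a percentage of the gap between random play and a human."""
    return 100.0 * (agent - random_play) / (human - random_play)


# Hypothetical raw scores: (agent, random play, human) per game.
raw_scores = {
    "game_a": (50.0, 0.0, 100.0),
    "game_b": (30.0, 10.0, 110.0),
    "game_c": (5.0, 5.0, 25.0),
}

per_game = {name: human_normalised(*s) for name, s in raw_scores.items()}
median_score = median(per_game.values())
```

Under this convention 0% corresponds to random play and 100% to the human reference, which is why the early-training entries for some baselines in Table 1 can be negative.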

NEC learns significantly faster in the small data regime across Atari games than DQN, A3C, and several λ-return baselines.
During early learning NEC outperforms all baselines; by 40 million frames, DQN with Prioritised Replay surpasses NEC in median human-normalised score (89.0% vs. 83.3% in Table 1).
NEC achieves human-level performance in about 25% of the tested games within 10 million frames, indicating strong data efficiency.
NEC and MFEC both explore episodic-like value estimation; NEC, however, uses a reward-guided embedding to improve interpolation of values.
NEC generally outperforms MFEC and Prioritised Replay in learning speed and data efficiency, especially before around 5-10 million frames.

This review was generated by AI and reviewed by human editors.