QUICK REVIEW

[Paper Review] Neural Episodic Control

Alexander Pritzel, Benigno Uría | arXiv (Cornell University) | Mar 6, 2017
Reinforcement Learning in Robotics | References: 41 | Cited by: 92
One-Sentence Summary

NEC uses per-action differentiable neural dictionaries to store and rapidly update Q-value estimates from recent experience, learning substantially faster and from less data on Atari games than several deep RL baselines.

ABSTRACT

Deep reinforcement learning methods attain super-human performance in a wide range of environments. Such methods are grossly inefficient, often taking orders of magnitudes more data than humans to achieve reasonable performance. We propose Neural Episodic Control: a deep reinforcement learning agent that is able to rapidly assimilate new experiences and act upon them. Our agent uses a semi-tabular representation of the value function: a buffer of past experience containing slowly changing state representations and rapidly updated estimates of the value function. We show across a wide range of environments that our agent learns significantly faster than other state-of-the-art, general purpose deep reinforcement learning agents.

Motivation and Objectives

  • Address data inefficiency in deep reinforcement learning by accelerating reward propagation and value estimation.
  • Leverage a semi-tabular memory that combines slow-changing state representations with fast-updating value estimates.
  • Enable rapid assimilation of new experiences via an append-only, memory-based Q-function akin to episodic memory.
  • Investigate how fast memory updates interact with N-step returns and a shared CNN embedding to improve learning speed.

Proposed Method

  • Introduce a Differentiable Neural Dictionary (DND) per action that stores (key, value) pairs.
  • Process states with a shared convolutional neural network to produce a key h for lookup in each action’s DND.
  • Retrieve Q(s,a) as a weighted sum of values in the DND using a nearest-neighbor kernel over keys.
  • Write new (h, Q^(N)(s,a)) pairs to the corresponding action’s DND; if the key is already stored, update its value Q_i with a tabular Q-learning step, Q_i ← Q_i + α (Q^(N)(s,a) − Q_i), where α is a learning rate.
  • Use N-step Q-learning targets Q^(N)(s,a)= sum_{j=0}^{N-1} gamma^j r_{t+j} + gamma^N max_a' Q(s_{t+N}, a'), with the max taken by querying all memories.
  • Train end-to-end with a differentiable network by minimizing the L2 loss between predicted Q(s,a) and Q^(N)(s,a) over mini-batches from a replay buffer.
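The lookup, write, and N-step target steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `DND`, the helper `n_step_target`, the inverse-distance kernel form, the default values of `delta`, `p`, and `alpha`, and the choice to return 0.0 from an empty memory are all illustrative. The sketch also leaves out the convolutional embedding, gradient flow through the memory, approximate nearest-neighbour search, and replay-buffer training.

```python
import numpy as np


class DND:
    """Toy per-action Differentiable Neural Dictionary (no gradients)."""

    def __init__(self, key_dim, delta=1e-3, p=50, alpha=0.1):
        self.keys = np.empty((0, key_dim))  # stored embeddings h_i
        self.values = np.empty(0)           # stored value estimates Q_i
        self.delta = delta                  # kernel smoothing constant (assumed)
        self.p = p                          # neighbours used per lookup (assumed)
        self.alpha = alpha                  # tabular learning rate (assumed)

    def lookup(self, h):
        """Q(s, a) as a kernel-weighted average of the p nearest stored values."""
        if len(self.values) == 0:
            return 0.0  # assumption: an empty memory predicts zero
        d2 = np.sum((self.keys - h) ** 2, axis=1)  # squared distances to all keys
        nearest = np.argsort(d2)[: self.p]         # exact search, for simplicity
        k = 1.0 / (d2[nearest] + self.delta)       # inverse-distance kernel
        w = k / k.sum()                            # normalised weights
        return float(w @ self.values[nearest])

    def write(self, h, target):
        """Append (h, target), or nudge the stored value if h is already a key."""
        if len(self.values) > 0:
            d2 = np.sum((self.keys - h) ** 2, axis=1)
            i = int(np.argmin(d2))
            if d2[i] == 0.0:  # exact key match: tabular Q-learning step
                self.values[i] += self.alpha * (target - self.values[i])
                return
        self.keys = np.vstack([self.keys, h])
        self.values = np.append(self.values, target)


def n_step_target(rewards, h_bootstrap, dnds, gamma=0.99):
    """sum_j gamma^j r_{t+j} + gamma^N max_a' Q(s_{t+N}, a')."""
    n = len(rewards)
    discounted = sum(gamma ** j * r for j, r in enumerate(rewards))
    bootstrap = max(d.lookup(h_bootstrap) for d in dnds)  # query every action's memory
    return discounted + gamma ** n * bootstrap


# Usage: two actions, one key written twice to action 0.
dnds = [DND(key_dim=2) for _ in range(2)]
h = np.array([0.0, 1.0])
dnds[0].write(h, 1.0)  # new key: appended
dnds[0].write(h, 2.0)  # same key: 1.0 + 0.1 * (2.0 - 1.0)
q_after_update = dnds[0].lookup(h)
target = n_step_target([1.0, 1.0], h, dnds, gamma=0.5)
```

With a single stored key the normalised kernel weight is 1, so the lookup returns the stored value itself. The second write moves that value only part of the way toward the new target, which is the tabular behaviour the method list describes for existing keys.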

Experimental Results

Research Questions

  • RQ1: Can a memory-augmented, semi-tabular value function accelerate data-efficient learning in deep RL environments like Atari?
  • RQ2: How does adding a fast-updating memory (DND) per action influence reward propagation and learning speed compared to standard DQN/A3C baselines?
  • RQ3: What is the impact of N-step Q-learning and differentiable memory on final performance and data efficiency across diverse Atari games?
  • RQ4: Does an append-only, large-scale memory with approximate nearest-neighbor access provide practical benefits over episodic resets in memory?

Key Findings

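Frames | Nature DQN | Q*(λ) | Retrace(λ) | Prioritised Replay | A3C   | NEC   | MFEC
1M     | -0.7%      | -0.8% | -0.4%      | -2.4%              | 0.4%  | 16.7% | 12.8%
2M     | 0.0%       | 0.1%  | 0.2%       | 0.0%               | 0.9%  | 27.8% | 16.7%
4M     | 2.4%       | 1.8%  | 3.3%       | 2.7%               | 1.9%  | 36.0% | 26.6%
10M    | 15.7%      | 13.0% | 17.3%      | 22.4%              | 3.6%  | 54.6% | 45.4%
20M    | 26.8%      | 26.9% | 30.4%      | 38.6%              | 7.9%  | 72.0% | 55.9%
40M    | 52.7%      | 59.6% | 60.5%      | 89.0%              | 18.4% | 83.3% | 61.9%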
(Table 1) Median human-normalised scores across games at different frames
Note: values represent human-normalised scores as reported in the paper.
  • NEC learns significantly faster in the small data regime across Atari games than DQN, A3C, and several λ-return baselines.
  • In early learning, NEC outperforms all baselines; by 40 million frames, DQN with Prioritised Replay overtakes NEC on the median human-normalised score (89.0% vs 83.3%).
  • NEC achieves human-level performance in about 25% of the tested games within 10 million frames, indicating strong data efficiency.
  • NEC and MFEC both explore episodic-like value estimation; NEC, however, uses a reward-guided embedding to improve interpolation of values.
  • NEC generally outperforms MFEC and Prioritised Replay in learning speed and data efficiency, especially before around 5-10 million frames.
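The crossover described in these findings can be checked directly against the Table 1 figures. The short script below is only a convenience sketch: the variable names are illustrative, and the numbers are the median human-normalised scores copied from the table.

```python
# Median human-normalised scores (%) copied from Table 1.
frames = ["1M", "2M", "4M", "10M", "20M", "40M"]
scores = {
    "Nature DQN":         [-0.7, 0.0, 2.4, 15.7, 26.8, 52.7],
    "Q*(lambda)":         [-0.8, 0.1, 1.8, 13.0, 26.9, 59.6],
    "Retrace(lambda)":    [-0.4, 0.2, 3.3, 17.3, 30.4, 60.5],
    "Prioritised Replay": [-2.4, 0.0, 2.7, 22.4, 38.6, 89.0],
    "A3C":                [0.4, 0.9, 1.9, 3.6, 7.9, 18.4],
    "NEC":                [16.7, 27.8, 36.0, 54.6, 72.0, 83.3],
    "MFEC":               [12.8, 16.7, 26.6, 45.4, 55.9, 61.9],
}

# For each frame budget, find the agent with the highest median score.
leaders = {}
for i, budget in enumerate(frames):
    leaders[budget] = max(scores, key=lambda name: scores[name][i])

for budget in frames:
    print(budget, leaders[budget])
```

NEC leads at every budget from 1M to 20M frames, and Prioritised Replay leads at 40M, which matches the bullet points above.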

This review was generated by AI and reviewed by a human editor.