[Paper Explained] Neural Episodic Control
NEC uses per-action differentiable neural dictionaries to store Q-value estimates from recent experience and update them rapidly, learning substantially faster on Atari games than several deep RL baselines in the low-data regime.
Deep reinforcement learning methods attain super-human performance in a wide range of environments. Such methods are grossly inefficient, often taking orders of magnitude more data than humans to achieve reasonable performance. We propose Neural Episodic Control: a deep reinforcement learning agent that is able to rapidly assimilate new experiences and act upon them. Our agent uses a semi-tabular representation of the value function: a buffer of past experience containing slowly changing state representations and rapidly updated estimates of the value function. We show across a wide range of environments that our agent learns significantly faster than other state-of-the-art, general-purpose deep reinforcement learning agents.
Motivation and Objectives
- Address data inefficiency in deep reinforcement learning by accelerating reward propagation and value estimation.
- Leverage a semi-tabular memory that combines slow-changing state representations with fast-updating value estimates.
- Enable rapid assimilation of new experiences via an append-only, memory-based Q-function akin to episodic memory.
- Investigate how fast memory updates interact with N-step returns and a shared CNN embedding to improve learning speed.
Proposed Method
- Introduce a Differentiable Neural Dictionary (DND) per action that stores (key, value) pairs.
- Process states with a shared convolutional neural network to produce a key h for lookup in each action’s DND.
- Retrieve Q(s,a) as a weighted sum of values in the DND using a nearest-neighbor kernel over keys.
- Write new (h, Q^(N)(s,a)) pairs to the corresponding action’s DND; if the key is already stored, move its value toward the new target with a tabular Q-learning step instead of appending a duplicate.
- Use N-step Q-learning targets Q^(N)(s,a)= sum_{j=0}^{N-1} gamma^j r_{t+j} + gamma^N max_a' Q(s_{t+N}, a'), with the max taken by querying all memories.
- Train end-to-end with a differentiable network by minimizing the L2 loss between predicted Q(s,a) and Q^(N)(s,a) over mini-batches from a replay buffer.
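The lookup, write, and N-step target steps above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: the class name `DND`, the helper `n_step_target`, the parameters `p`, `delta`, `alpha`, and `capacity`, the inverse-squared-distance kernel, and the oldest-first eviction are all choices made here for illustration. It uses exact search over Python lists, takes embeddings as plain tuples rather than CNN outputs, and computes no gradients.

```python
class DND:
    """Toy per-action memory of (key, value) pairs; a sketch, not the paper's code."""

    def __init__(self, capacity=1000, p=50, delta=1e-3, alpha=0.5):
        # All defaults are hypothetical values chosen for this example.
        self.keys, self.values = [], []
        self.capacity, self.p, self.delta, self.alpha = capacity, p, delta, alpha

    @staticmethod
    def _sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def lookup(self, h):
        """Q(s, a) as a kernel-weighted average of the p nearest stored values."""
        if not self.keys:
            return 0.0  # assumption: an empty memory returns 0
        nearest = sorted(
            (self._sq_dist(h, k), v) for k, v in zip(self.keys, self.values)
        )[: self.p]
        # Assumed kernel: inverse squared distance, smoothed by delta.
        weights = [1.0 / (d + self.delta) for d, _ in nearest]
        total = sum(weights)
        return sum(w * v for w, (_, v) in zip(weights, nearest)) / total

    def write(self, h, target):
        """Append a new pair, or nudge an existing key's value toward the target."""
        h = tuple(h)
        if h in self.keys:
            i = self.keys.index(h)
            self.values[i] += self.alpha * (target - self.values[i])
            return
        if len(self.keys) >= self.capacity:
            # Simplification: evict the oldest entry when the memory is full.
            self.keys.pop(0)
            self.values.pop(0)
        self.keys.append(h)
        self.values.append(target)


def n_step_target(rewards, next_q_values, gamma=0.99):
    """sum_{j<N} gamma^j r_{t+j} + gamma^N max_a' Q(s_{t+N}, a')."""
    n = len(rewards)
    discounted = sum(gamma ** j * r for j, r in enumerate(rewards))
    return discounted + gamma ** n * max(next_q_values)


# One memory per action; the max in the target queries every memory.
memories = {action: DND() for action in range(2)}
memories[0].write((0.0, 0.0), 1.0)
memories[0].write((1.0, 0.0), 3.0)
memories[1].write((0.0, 1.0), 0.5)

h_next = (0.5, 0.0)  # stand-in for the CNN embedding of s_{t+N}
next_qs = [memory.lookup(h_next) for memory in memories.values()]
target = n_step_target([1.0, 0.0, 2.0], next_qs, gamma=0.99)
memories[0].write((0.2, 0.1), target)
```

In this sketch a query that lies midway between two stored keys returns roughly the mean of their values, which is the interpolation behaviour the kernel-weighted sum is meant to provide. A practical agent would replace the sorted list scan with an approximate nearest-neighbour index and backpropagate the L2 loss into both the keys and the stored values.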
Experimental Results
Research Questions
- RQ1: Can a memory-augmented, semi-tabular value function accelerate data-efficient learning in deep RL environments like Atari?
- RQ2: How does adding a fast-updating memory (DND) per action influence reward propagation and learning speed compared to standard DQN/A3C baselines?
- RQ3: What is the impact of N-step Q-learning and differentiable memory on final performance and data efficiency across diverse Atari games?
- RQ4: Does an append-only, large-scale memory with approximate nearest-neighbor access provide practical benefits over episodic resets in memory?
Key Findings
| Frames | Nature DQN | Q*(λ) | Retrace(λ) | Prioritised Replay | A3C | NEC | MFEC |
|---|---|---|---|---|---|---|---|
| 1M | -0.7% | -0.8% | -0.4% | -2.4% | 0.4% | 16.7% | 12.8% |
| 2M | 0.0% | 0.1% | 0.2% | 0.0% | 0.9% | 27.8% | 16.7% |
| 4M | 2.4% | 1.8% | 3.3% | 2.7% | 1.9% | 36.0% | 26.6% |
| 10M | 15.7% | 13.0% | 17.3% | 22.4% | 3.6% | 54.6% | 45.4% |
| 20M | 26.8% | 26.9% | 30.4% | 38.6% | 7.9% | 72.0% | 55.9% |
| 40M | 52.7% | 59.6% | 60.5% | 89.0% | 18.4% | 83.3% | 61.9% |

Table 1: Median human-normalised scores across games at different frame counts. Values are human-normalised scores as reported in the paper.

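The summary does not spell out how the scores in Table 1 are normalised. A minimal sketch, assuming the convention commonly used in the Atari literature (the share of the gap between a random policy and a human player that the agent closes), is shown below; the function name `human_normalised_score` and its argument names are hypothetical.

```python
def human_normalised_score(agent, random_agent, human):
    """Percentage of the random-to-human score gap closed by the agent.

    Assumed convention: 0% matches a random policy, 100% matches the human
    reference, and negative values mean the agent scored below random play.
    """
    return 100.0 * (agent - random_agent) / (human - random_agent)


# Hypothetical raw scores, not taken from the paper.
halfway = human_normalised_score(agent=50.0, random_agent=0.0, human=100.0)
below_random = human_normalised_score(agent=5.0, random_agent=10.0, human=110.0)
```

Under this reading, the small negative entries at 1 million frames would correspond to agents whose median score is still slightly below that of random play.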
- NEC learns significantly faster in the small data regime across Atari games than DQN, A3C, and several λ-return baselines.
- NEC has the highest median score of all methods in Table 1 up to 20 million frames; by 40 million frames, DQN with Prioritised Replay overtakes it (89.0% versus 83.3%).
- NEC achieves human-level performance in about 25% of the tested games within 10 million frames, indicating strong data efficiency.
- NEC and MFEC both explore episodic-like value estimation; NEC, however, uses a reward-guided embedding to improve interpolation of values.
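- NEC generally outperforms MFEC and Prioritised Replay in learning speed and data efficiency: its median score is higher than both at every frame budget up to 20 million frames, for example 54.6% versus 45.4% and 22.4% at 10 million frames.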
This explainer was generated by AI and reviewed by a human editor.