QUICK REVIEW

[Paper Review] Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing

Liang Chen, Mohammad Norouzi|arXiv (Cornell University)|Jul 6, 2018

Parallel Computing and Optimization Techniques | Cited by 99

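One-Sentence Summary

MAPO introduces a memory-augmented policy gradient method that uses a memory buffer of high-reward trajectories to reduce gradient variance, achieving strong results on weakly supervised semantic parsing tasks.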

ABSTRACT

We present Memory Augmented Policy Optimization (MAPO), a simple and novel way to leverage a memory buffer of promising trajectories to reduce the variance of policy gradient estimate. MAPO is applicable to deterministic environments with discrete actions, such as structured prediction and combinatorial optimization tasks. We express the expected return objective as a weighted sum of two terms: an expectation over the high-reward trajectories inside the memory buffer, and a separate expectation over trajectories outside the buffer. To make an efficient algorithm of MAPO, we propose: (1) memory weight clipping to accelerate and stabilize training; (2) systematic exploration to discover high-reward trajectories; (3) distributed sampling from inside and outside of the memory buffer to scale up training. MAPO improves the sample efficiency and robustness of policy gradient, especially on tasks with sparse rewards. We evaluate MAPO on weakly supervised program synthesis from natural language (semantic parsing). On the WikiTableQuestions benchmark, we improve the state-of-the-art by 2.6%, achieving an accuracy of 46.3%. On the WikiSQL benchmark, MAPO achieves an accuracy of 74.9% with only weak supervision, outperforming several strong baselines with full supervision. Our source code is available at https://github.com/crazydonkey200/neural-symbolic-machines

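Motivation and Objectives

Motivate and address the high variance of policy gradients in deterministic, discrete-action domains such as program synthesis.
Use a memory buffer of promising trajectories to decompose the objective into an expectation inside the buffer and an expectation outside it.
Propose mechanisms (memory weight clipping, systematic exploration, distributed sampling) to stabilize and scale up training.
Evaluate MAPO on weakly supervised semantic parsing benchmarks to measure gains in sample efficiency and robustness.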

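Proposed Method

Express the expected return as a weighted sum of two terms: an expectation inside the buffer and an expectation outside the buffer.
Let pi_B denote the total probability the policy assigns to trajectories in the memory buffer B; the remaining probability outside the buffer is 1 - pi_B.
Estimate the gradient as a term over buffered trajectories weighted by pi_B plus a second term from samples outside the buffer weighted by 1 - pi_B (Equation 7).
Introduce memory weight clipping, pi_B^c = max(pi_B, alpha), to stabilize training during cold start (Equation 8).
Implement systematic exploration, which uses a Bloom-filter-backed set of fully explored prefixes to discover high-reward trajectories (Algorithm 1).
Adopt distributed actor-learner sampling to parallelize data collection and gradient computation (Algorithm 2).
Depending on the buffer size, compute the inside-buffer expectation exactly by enumeration or approximate it by stratified sampling.
When estimating the outside-buffer expectation, use rejection sampling to draw outside-buffer trajectories from the current policy.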
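
The weighting scheme described above can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not the authors' implementation: the policy is a small dictionary of trajectory probabilities that can be enumerated (a real policy cannot), and the function name mapo_sample_weights and all its parameters are hypothetical. It combines exact enumeration inside the buffer, memory weight clipping, and rejection sampling outside the buffer, and returns the weight that would multiply R(a) * grad log pi(a) for each trajectory.

```python
import random


def mapo_sample_weights(policy_probs, buffer, alpha, n_outside, rng, max_tries=10_000):
    """Toy sketch: per-trajectory weights for a MAPO-style gradient estimate.

    policy_probs: dict {trajectory: probability under the current policy}
                  (hypothetical stand-in for a real, non-enumerable policy).
    buffer:       set of high-reward trajectories (the memory buffer B).
    alpha:        clipping threshold for the memory weight.
    """
    # Total probability the policy assigns to the buffer (pi_B).
    pi_b = sum(policy_probs[t] for t in buffer)
    # Memory weight clipping: pi_B^c = max(pi_B, alpha).
    pi_b_clipped = max(pi_b, alpha)

    weights = {}
    # Inside the buffer: exact enumeration under the renormalized policy.
    for t in buffer:
        inside_prob = policy_probs[t] / pi_b if pi_b > 0 else 1.0 / len(buffer)
        weights[t] = pi_b_clipped * inside_prob

    # Outside the buffer: rejection sampling from the current policy.
    trajectories = list(policy_probs)
    probs = [policy_probs[t] for t in trajectories]
    accepted, tries = [], 0
    while len(accepted) < n_outside and tries < max_tries:
        tries += 1
        t = rng.choices(trajectories, weights=probs, k=1)[0]
        if t not in buffer:  # reject anything already in the buffer
            accepted.append(t)
    for t in accepted:
        weights[t] = weights.get(t, 0.0) + (1.0 - pi_b_clipped) / len(accepted)
    return weights


# Usage: the buffer holds only 10% of the probability mass, but clipping
# with alpha = 0.5 raises its share of the gradient weight to one half.
policy_probs = {"a": 0.05, "b": 0.05, "c": 0.6, "d": 0.3}
weights = mapo_sample_weights(policy_probs, buffer={"a", "b"}, alpha=0.5,
                              n_outside=4, rng=random.Random(0))
```

In this toy setting the two buffered trajectories each receive weight 0.25, while the remaining 0.5 is spread evenly over the accepted outside-buffer samples. Without clipping, the buffer would contribute only 0.1 of the total weight early in training.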

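Experimental Results

Research Questions

RQ1: How can memory-based replay be integrated into policy gradient methods for deterministic, discrete-action domains to reduce gradient variance?
RQ2: Does decomposing the expectation into inside-buffer and outside-buffer terms, coupled with a memory-weighted gradient, improve sample efficiency in weakly supervised program synthesis?
RQ3: Do mechanisms such as memory weight clipping, systematic exploration, and distributed sampling enable scalable and robust training of MAPO on semantic parsing benchmarks?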

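Key Findings

On WikiTableQuestions, a single MAPO run reaches 42.7/43.8 dev/test accuracy; an ensemble reaches 46.3.
On WikiSQL, MAPO reaches 72.6 test accuracy with weak supervision; an ensemble reaches 74.9 on the same benchmark.
Ablations show that removing systematic exploration (SE) or memory weight clipping (MWC) significantly degrades performance.
MAPO outperforms several baselines on both WikiTableQuestions and WikiSQL, including methods trained with full supervision.
Distributed sampling with 30 actors yields roughly a 20x speedup in sampling, demonstrating that training scales.
MAPO is more robust and sample-efficient than standard REINFORCE and other memory-based methods.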

This review was generated by AI and reviewed by human editors.