QUICK REVIEW

[Paper Review] Meta-Learning with Warped Gradient Descent

Sebastian Flennerhag, Andrei A. Rusu | arXiv (Cornell University) | Aug 30, 2019
Domain Adaptation and Few-Shot Learning | References: 64 | Cited by: 66
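One-Sentence Summary

WarpGrad meta-learns warp-layers that precondition gradients, enabling scalable, trajectory-agnostic gradient-based meta-learning across few-shot, standard supervised, continual, and reinforcement learning tasks.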

ABSTRACT

Learning an efficient update rule from data that promotes rapid learning of new tasks from the same distribution remains an open problem in meta-learning. Typically, previous works have approached this issue either by attempting to train a neural network that directly produces updates or by attempting to learn better initialisations or scaling factors for a gradient-based update rule. Both of these approaches pose challenges. On one hand, directly producing an update forgoes a useful inductive bias and can easily lead to non-converging behaviour. On the other hand, approaches that try to control a gradient-based update rule typically resort to computing gradients through the learning process to obtain their meta-gradients, leading to methods that can not scale beyond few-shot task adaptation. In this work, we propose Warped Gradient Descent (WarpGrad), a method that intersects these approaches to mitigate their limitations. WarpGrad meta-learns an efficiently parameterised preconditioning matrix that facilitates gradient descent across the task distribution. Preconditioning arises by interleaving non-linear layers, referred to as warp-layers, between the layers of a task-learner. Warp-layers are meta-learned without backpropagating through the task training process in a manner similar to methods that learn to directly produce updates. WarpGrad is computationally efficient, easy to implement, and can scale to arbitrarily large meta-learning problems. We provide a geometrical interpretation of the approach and evaluate its effectiveness in a variety of settings, including few-shot, standard supervised, continual and reinforcement learning.

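Motivation and Objectives

  • Motivate and address the limitations of existing gradient-based meta-learners in convergence, scalability, and credit assignment.
  • Propose a trajectory-agnostic preconditioning framework that interleaves warp-layers between the layers of a task-learner to precondition its gradients.
  • Give WarpGrad a geometric interpretation via a Riemannian metric, and demonstrate scalable performance in few-shot, multi-shot, continual, and reinforcement learning settings.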

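Proposed Method

  • Interleave warp-layers with the task-learner's layers to form a warped network, which yields data-dependent preconditioning of gradients.
  • Define a general preconditioned update rule U(θ;φ) = θ − αP(θ;φ)∇L(θ), where θ are the task parameters, φ the warp parameters, α the step size, and P is realised through the warp-layers and their Jacobians.
  • Derive a trajectory-agnostic meta-objective L(φ) by optimising over a joint distribution of tasks and intermediate task-learner parameters, which avoids backpropagating through the full adaptation trajectory.
  • Geometric interpretation: warp-layers induce a metric G on the warped space, with G^{-1} acting as the preconditioner; updates in the warped space are first-order equivalent to descent under the Riemannian metric.
  • Present online (Algorithm 1) and offline (Algorithm 2) meta-training procedures that learn the warp parameters φ while optionally learning, or using a prior over, the initial task parameters θ0|τ.
  • Show integration with learned initialisations and priors, which permits several training modes (online/offline, supervised/RL, continual learning).
  • Demonstrate non-linear warp-layers, which capture richer preconditioning beyond block-diagonal structure and exhibit memory capabilities in reinforcement learning tasks.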
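
The update rule and the role of the warp-layer Jacobian can be illustrated on a toy problem. The NumPy sketch below is not the authors' implementation. It assumes a single linear task layer W followed by one frozen linear warp-layer whose matrix OMEGA is hand-picked rather than meta-learned; the data, the step size, and every function name are hypothetical. Under those assumptions it shows that the gradient reaching W is the output error multiplied by the transpose of the warp's Jacobian, and that gradient descent on W alone then adapts the warped model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task (hypothetical data): inputs in R^3, targets in R^2.
X = rng.normal(size=(32, 3))
Y = X @ rng.normal(size=(2, 3)).T

# Frozen linear warp-layer: a hand-picked stand-in for meta-learned phi.
OMEGA = np.array([[2.0, 0.0],
                  [0.5, 1.0]])


def loss(W):
    """Mean squared error of the warped model x -> OMEGA @ (W @ x)."""
    err = X @ W.T @ OMEGA.T - Y
    return 0.5 * np.mean(np.sum(err ** 2, axis=1))


def warped_grad(W):
    """Gradient of the loss with respect to the task weights W only."""
    err = X @ W.T @ OMEGA.T - Y   # output-space error, shape (N, 2)
    delta = err @ OMEGA           # row i is OMEGA^T @ err_i (warp Jacobian applied)
    return delta.T @ X / len(X)   # shape (2, 3), same as W


def numerical_grad(W, eps=1e-5):
    """Central finite differences, used only to check warped_grad."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        g[idx] = (loss(W + step) - loss(W - step)) / (2 * eps)
    return g


def adapt(W0, alpha=0.05, steps=500):
    """Gradient descent on W; OMEGA is never updated during adaptation."""
    W = W0.copy()
    for _ in range(steps):
        W = W - alpha * warped_grad(W)
    return W


W0 = np.zeros((2, 3))
W_adapted = adapt(W0)
print("gradient check:", np.allclose(warped_grad(W0), numerical_grad(W0), atol=1e-6))
print("loss before:", loss(W0), "loss after:", loss(W_adapted))
```

In WarpGrad itself the warp parameters φ would be meta-learned across tasks and the warp-layers may be non-linear; here the warp stays fixed so that only the inner task-adaptation step is visible.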

Experimental Results

Research Questions

  • RQ1: Can WarpGrad retain the inductive bias of gradient-based few-shot learners while avoiding backpropagation through adaptation trajectories?
  • RQ2: To what extent can WarpGrad scale beyond few-shot learning to multi-shot and standard supervised/RL tasks?
  • RQ3: Does WarpGrad generalize to complex meta-learning scenarios such as continual learning and tasks requiring memory?
  • RQ4: Is the learned warp-geometry interpretable as a curvature-based preconditioner facilitating convergence guarantees?

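Key Findings

  • WarpGrad outperforms baseline gradient-based meta-learning methods on standard few-shot benchmarks (mini-ImageNet and tiered-ImageNet).
  • The Warp-MAML and Warp-Leap variants achieve higher accuracy than their unwarped counterparts in few-shot and multi-shot settings, including Omniglot and tiered-ImageNet, and with extended adaptation steps.
  • Non-linear warp-layers allow preconditioning beyond block-diagonal structure, improving performance on complex tasks such as continual learning and RL maze navigation.
  • Offline meta-training with learned warp parameters yields substantial gains (for example, Omniglot test accuracy rises from 76.3% to 84.3%).
  • By embedding preconditioning in gradient-descent-like updates with an implicit Riemannian metric, WarpGrad retains convergence properties, providing stability and scalability across tasks.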

This review was generated by AI and reviewed by human editors.