QUICK REVIEW

[Paper Review] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

Alethea Power, Yuri Burda|arXiv (Cornell University)|Jan 6, 2022

Neural Networks and Applications | Cited by 77

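One-Sentence Summary

This paper studies how neural networks generalize beyond memorization on small algorithmic datasets, identifies a late-generalization phenomenon called grokking, and analyzes data efficiency, optimization time, and the effects of regularization.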

ABSTRACT

In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset.

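Research Motivation and Goals

Study the generalization behavior of neural networks trained on small algorithmic datasets.
Characterize the grokking phenomenon, in which generalization improves long after overfitting has set in.
Measure data efficiency and how dataset size affects the optimization time needed to reach generalization.
Assess how regularization and optimization settings affect grokking.
Visualize the learned embeddings to understand the structure that emerges.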

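Proposed Method

Train a decoder-only Transformer on binary operation tables of the form a ∘ b = c, with abstract symbols as tokens.
Evaluate generalization by measuring validation accuracy after a long optimization budget.
Systematically vary dataset size and optimization settings to observe their effect on grokking.
Test a range of binary operations and analyze how symmetry and group structure affect learning.
Run ablation studies over weight decay, gradient noise, and learning rate to assess data efficiency.
Visualize the output-layer embeddings to interpret the learned structure.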
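
The dataset construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes modular addition as the binary operation, a modulus of 97, and a hypothetical token layout in which ids 0..p-1 are operands, p is the operator symbol, and p+1 is the equals sign. The function names are made up for this example.

```python
import random


def build_modular_addition_table(p):
    """Return every equation a + b = c (mod p) as an (a, b, c) triple."""
    return [(a, b, (a + b) % p) for a in range(p) for b in range(p)]


def split_table(equations, train_fraction, seed=0):
    """Shuffle the table and split it into disjoint training and validation sets."""
    rng = random.Random(seed)
    shuffled = equations[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def encode(equation, p):
    """Map an equation to (input token ids, target token id).

    Assumed layout: operands use ids 0..p-1, the operator is id p,
    and the equals sign is id p + 1.
    """
    a, b, c = equation
    return [a, p, b, p + 1], c


table = build_modular_addition_table(97)
train_set, val_set = split_table(table, train_fraction=0.5)
```

Lowering `train_fraction` shrinks the training set while the validation set absorbs the rest of the table, which is the knob the dataset-size experiments turn.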

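Experimental Results

Research Questions

RQ1: Does grokking occur across a range of binary operations and dataset sizes?
RQ2: How does the optimization time needed to reach generalization scale as the training data fraction decreases?
RQ3: Which regularization or optimization techniques do the most to improve data efficiency and grokking?
RQ4: What structure emerges in the embeddings learned for modular arithmetic tasks?
RQ5: Are there qualitative patterns in the loss and accuracy curves that characterize grokking?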
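
One way to make the delay between fitting the training set and generalizing concrete is to compare the first logged step at which each accuracy curve crosses a threshold. The sketch below is an assumption-laden illustration: the 0.99 threshold, the function names, and the sample curves are all invented for this example and are not values from the paper.

```python
def first_step_reaching(steps, accuracies, threshold):
    """Return the first logged step whose accuracy is >= threshold, or None."""
    for step, acc in zip(steps, accuracies):
        if acc >= threshold:
            return step
    return None


def grokking_gap(steps, train_acc, val_acc, threshold=0.99):
    """Number of steps between training and validation accuracy reaching threshold.

    Returns None if either curve never reaches the threshold.
    """
    train_step = first_step_reaching(steps, train_acc, threshold)
    val_step = first_step_reaching(steps, val_acc, threshold)
    if train_step is None or val_step is None:
        return None
    return val_step - train_step


# Made-up curves for illustration only (not results from the paper).
steps = [100, 1000, 10000, 100000]
train_acc = [0.50, 1.00, 1.00, 1.00]
val_acc = [0.01, 0.02, 0.05, 1.00]
gap = grokking_gap(steps, train_acc, val_acc)
```

A large positive gap corresponds to the pattern the review calls grokking; a gap near zero means both curves rose together.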

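Key Findings

Grokking appears on several binary operations: validation accuracy rises from chance level only long after training accuracy has saturated.
For smaller datasets, the time needed to generalize grows rapidly as the training data fraction decreases, and the validation loss shows double-descent-like behavior.
Weight decay improves data efficiency and generalization markedly more than the other interventions tested.
Some symmetric operations generalize even with little data, while some non-symmetric operations need more data to grok.
The embeddings sometimes show interpretable structure, such as a circular/topological organization for modular arithmetic.
The learning-rate window in which grokking occurs is relatively narrow, and observing the effect requires a large optimization budget.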
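
To illustrate what weight decay does mechanically, here is a minimal sketch of a single update with decoupled weight decay. It deliberately uses a plain gradient step rather than an adaptive optimizer, so it is a simplification and not the training setup of the paper; the function name and the numbers are hypothetical.

```python
def sgd_step_with_decoupled_weight_decay(weights, grads, lr, weight_decay):
    """Apply one gradient step, then shrink each weight toward zero.

    Update rule (simplified, no momentum or adaptive scaling):
        w <- w - lr * g - lr * weight_decay * w
    """
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


# With a zero gradient, the weight still decays toward zero.
updated = sgd_step_with_decoupled_weight_decay(
    weights=[1.0], grads=[0.0], lr=0.1, weight_decay=1.0
)
```

The decay term acts even when the training loss gradient is zero, which is why it keeps shaping the weights after the training set has been fit.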

This review was generated by AI and checked by a human editor.