QUICK REVIEW

[Paper Review] ReLoRA: High-Rank Training Through Low-Rank Updates

Vladislav Lialin, Namrata Shivagunde | arXiv (Cornell University) | Jul 11, 2023

Advanced Neural Network Applications | Cited by 11

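One-Sentence Summary

ReLoRA trains high-rank transformer networks by accumulating multiple low-rank updates across restarts, reaching performance close to full-rank training while reducing GPU RAM usage and speeding up training.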

ABSTRACT

Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer language models with up to 1.3B parameters and demonstrate comparable performance to regular neural network training. ReLoRA saves up to 5.5Gb of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.

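Research Motivation and Objectives

Advance parameter-efficient pre-training for very large transformer models.
Investigate whether a sequence of low-rank updates can produce a high-rank update.
Develop ReLoRA with restarts, a jagged learning-rate schedule, and partial optimizer resets.
Demonstrate ReLoRA on transformers with up to 1.3B parameters and compare it with LoRA and full-rank training.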

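Proposed Method

Begin from a full-rank warm start, i.e. an initial phase of regular full-rank training.
Apply LoRA-style low-rank updates to the linear layers, with rank r=128.
Restart repeatedly, merging each low-rank update back into the base weights so that the total update is a sum of low-rank updates.
Use a jagged cosine learning-rate schedule: on every reset, the learning rate is set to zero and quickly warmed up (50-100 steps) back to the cosine schedule.
Partially reset the optimizer state via magnitude pruning so that stale gradient moments do not steer the updates after a restart.
Keep embeddings and normalization layers full-rank while the linear layers are updated through ReLoRA.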
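
The merge-and-restart cycle described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `merge_and_reinit` and `prune_optimizer_state`, the toy dimensions, the initialization scale, and the use of random factors in place of real gradient steps are all assumptions made for the example. The summary itself only states rank r=128 and a partial optimizer reset via magnitude pruning.

```python
import numpy as np

def merge_and_reinit(W, A, B, rng):
    """Fold the low-rank update B @ A into W, then restart the factors.

    B is reset to zeros (a LoRA-style init, assumed here) so that B @ A == 0
    right after the restart and the merged layer's output is unchanged.
    """
    W = W + B @ A
    A = rng.normal(scale=0.01, size=A.shape)  # hypothetical init scale
    B = np.zeros_like(B)
    return W, A, B

def prune_optimizer_state(state, fraction):
    """Zero the `fraction` of entries with the smallest magnitude (sketch)."""
    k = int(fraction * state.size)
    if k == 0:
        return state.copy()
    # Magnitude of the k-th smallest entry; everything at or below it is zeroed.
    threshold = np.partition(np.abs(state).ravel(), k - 1)[k - 1]
    return np.where(np.abs(state) > threshold, state, 0.0)

rng = np.random.default_rng(0)
d, r, n_restarts = 16, 2, 4  # toy sizes; the summary reports r=128
W0 = rng.normal(size=(d, d))
W = W0.copy()
ranks = []
for _ in range(n_restarts):
    # Stand-in for one training segment: random factors instead of gradient steps.
    A = rng.normal(size=(r, d))
    B = rng.normal(size=(d, r))
    W, A, B = merge_and_reinit(W, A, B, rng)
    # Rank of the accumulated update after this restart.
    ranks.append(int(np.linalg.matrix_rank(W - W0, tol=1e-8)))

print(ranks)
```

In this toy setting each restart contributes a fresh rank-r term, so the rank of the accumulated update `W - W0` grows with the number of restarts even though every individual update is low-rank. Real training segments are not guaranteed to add r independent directions each time.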

Figure 1: Training loss for 250M models. ReLoRA learns a high-rank network through a sequence of low-rank updates. It outperforms networks with the same trainable parameter count and achieves similar performance to training a full network at 100M+ scale. The efficiency of ReLoRA increases with the m

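Experimental Results

Research Questions

RQ1: Can a high-rank network be trained effectively through a sequence of low-rank updates?
RQ2: How do ReLoRA's performance and efficiency compare with LoRA and full-rank training across model sizes?
RQ3: Which training techniques (restarts, optimizer resets, warm start) are critical to ReLoRA's success?
RQ4: Does ReLoRA scale in efficiency and performance to larger transformer models (up to 1.3B parameters)?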

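Key Findings

ReLoRA saves up to 5.5 GB of RAM per GPU and speeds up training by 9-40%, depending on model size and hardware.
ReLoRA's perplexity is close to full-rank training and better than LoRA's: the 1.3B model reaches a final perplexity of 17.27, versus 16.83 for full training.
Singular value analysis shows that the spectrum of ReLoRA's updates resembles that of full-rank training rather than LoRA's spectrum, which is mostly zero (low-rank).
For the 1.3B model, ReLoRA with warm start and restarts outperforms LoRA throughout training and narrows the gap to full-rank training (final perplexity 17.27 vs. 16.83).
The training speedup depends on hardware: on an 8x A100 setup ReLoRA achieves roughly a 9% wall-clock speedup, with larger gains on cheaper hardware.
In this study, online ReLoRA (very frequent resets) did not improve on standard ReLoRA.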
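
The singular value finding above is consistent with a standard linear-algebra bound, written out here for reference. The notation is an assumption introduced for this note, not taken from the summary: N is the number of restarts, and B_i, A_i are the low-rank factors trained between restart i-1 and restart i.

```latex
\operatorname{rank}\!\Big(\sum_{i=1}^{N} B_i A_i\Big)
  \;\le\; \sum_{i=1}^{N} \operatorname{rank}(B_i A_i)
  \;\le\; \min(N r,\, m,\, n),
\qquad B_i \in \mathbb{R}^{m \times r},\quad A_i \in \mathbb{R}^{r \times n}.
```

A single LoRA run corresponds to N=1, so its total update has rank at most r. With restarts the ceiling rises to Nr (capped by the layer dimensions), which is what allows the accumulated update to have a spectrum closer to full-rank training. The bound is only an upper limit; how much of it is used depends on the training dynamics.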

Figure 2: Jagged cosine scheduler used in ReLoRA. As a base for our scheduler we follow a standard cosine decay schedule as in Touvron et al. (2023). On every optimizer reset, we set the learning rate to zero and perform a quick (50-100 steps) learning rate warm-up back to the cosine schedule.
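
The schedule described in the Figure 2 caption could be sketched as below. This is an illustrative reading of the caption, not the paper's code: the function name `jagged_cosine_lr`, its parameters, the linear shape of the re-warm-up, and the example values (20,000 total steps, a reset every 5,000 steps, peak learning rate 1e-3) are assumptions. The initial warm-up at the start of training is omitted for brevity.

```python
import math

def jagged_cosine_lr(step, total_steps, max_lr, reset_every, warmup_steps, min_lr=0.0):
    """Cosine decay with a short linear re-warm-up after every reset (sketch)."""
    progress = min(step / total_steps, 1.0)
    # Standard cosine decay from max_lr down to min_lr.
    base = min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
    since_reset = step % reset_every
    if step >= reset_every and since_reset < warmup_steps:
        # LR is zero at the reset step and climbs linearly back to the cosine curve.
        return base * since_reset / warmup_steps
    return base

# Hypothetical configuration: reset every 5,000 steps, 100-step re-warm-up.
for s in (0, 4999, 5000, 5050, 5100):
    print(s, jagged_cosine_lr(s, total_steps=20_000, max_lr=1e-3,
                              reset_every=5_000, warmup_steps=100))
```

Under these assumptions the learning rate drops to zero at step 5,000, reaches half of the cosine value at step 5,050, and rejoins the cosine curve at step 5,100.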

This review was generated by AI and reviewed by a human editor.