[Paper Explained] Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer
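BSZO is a Bayesian-subspace zeroth-order optimizer that uses Kalman filtering to aggregate gradient information from multiple directions, improving convergence and robustness when fine-tuning large language models while keeping memory usage low.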
Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations. However, existing methods essentially perform updates in a one-dimensional space, and suffer from collapse or substantial performance degradation under low-precision training. We introduce BSZO, an adaptive Bayesian Subspace Zeroth-Order Optimizer, which applies Kalman filtering to combine finite-difference information across multiple perturbation directions within a subspace. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the subspace-projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adapt to noise variations. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/\gamma$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms the baselines across various tasks, achieving up to 6.67% absolute average improvement on OPT-13B while remaining robust under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00×–1.08× of MeZO).
Motivation and Goals
- Motivate memory-efficient fine-tuning of large language models without backpropagation-based gradients.
- Address instability and performance degradation of existing zeroth-order methods under low-precision training.
- Propose BSZO to fuse finite-difference signals across multiple perturbation directions via Bayesian inference.
- Provide theoretical convergence guarantees and empirical validation across RoBERTa, Mistral, and OPT models.
Proposed Method
- Sample k random directions to form a k-dimensional subspace and model the projected gradient as a latent variable.
- Treat each finite-difference measurement as a noisy linear observation of the normalized subspace gradient and update a Gaussian posterior via Kalman filtering.
- Use a residual-based adaptive scheme to dynamically adjust the observation noise variance during training.
- Update parameters by descending along the posterior mean in the subspace, enabling k updates per batch instead of a single direction.
- Cache early perturbation results and reuse them to reduce forward passes, with an optional extra forward pass in the basic version (BSZO-B) to improve landscape capture under reduced precision.
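The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`kalman_update`, `bszo_like_step`), the Gaussian subspace basis `B`, the identity prior, the hyperparameters, and in particular the residual-based noise rule (an exponential moving average of the squared residual minus the predicted variance) are all hypothetical choices made for this example. The sketch also omits the gradient normalization and the perturbation caching mentioned above, and it stores `B` explicitly for clarity, which a memory-efficient implementation would avoid by regenerating perturbations from a random seed as MeZO does.

```python
import numpy as np

def kalman_update(m, P, a, y, sigma2):
    """Scalar-observation Kalman update for y = a @ g + noise, noise ~ N(0, sigma2).

    m, P are the Gaussian posterior mean and covariance of the projected gradient g.
    Returns the new mean, new covariance, the residual, and the predicted variance.
    """
    Pa = P @ a
    predicted_var = a @ Pa              # variance of a @ g under the current posterior
    s = predicted_var + sigma2          # innovation variance
    r = y - a @ m                       # residual (innovation)
    gain = Pa / s                       # Kalman gain
    m_new = m + gain * r
    P_new = P - np.outer(gain, Pa)      # equals P - gain a^T P because P is symmetric
    return m_new, P_new, r, predicted_var

def bszo_like_step(f, theta, rng, k=5, eps=1e-3, lr=0.5, sigma2=1e-2, beta=0.1):
    """One illustrative step: sample a k-dim subspace, filter, descend along the mean."""
    d = theta.size
    B = rng.standard_normal((d, k)) / np.sqrt(d)   # assumed basis: k random directions
    m, P = np.zeros(k), np.eye(k)                   # assumed prior, reset every step
    for i in range(k):
        a = np.zeros(k)
        a[i] = 1.0                                  # observe along the i-th basis direction
        z = B @ a
        # Central finite difference: a noisy observation of a @ (B.T @ grad f)
        y = (f(theta + eps * z) - f(theta - eps * z)) / (2 * eps)
        m, P, r, predicted_var = kalman_update(m, P, a, y, sigma2)
        # Assumed residual-based adaptation of the observation noise variance
        sigma2 = (1 - beta) * sigma2 + beta * max(r * r - predicted_var, 1e-8)
    return theta - lr * (B @ m), sigma2

# Usage on a toy quadratic loss (a stand-in for a model's forward pass)
rng = np.random.default_rng(0)
theta = rng.standard_normal(50)
loss = lambda t: 0.5 * float(t @ t)
sigma2 = 1e-2
for _ in range(200):
    theta, sigma2 = bszo_like_step(loss, theta, rng, sigma2=sigma2)
```

Because `kalman_update` accepts an arbitrary observation vector `a`, the same routine could fuse measurements along directions that are not aligned with the basis, which is where filtering does more than rescale each finite difference.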
Experimental Results
Research Questions
- RQ1: Can zeroth-order LLM fine-tuning be stabilized and made more data-efficient by aggregating information across multiple perturbation directions?
- RQ2: Does Bayesian inference with Kalman filtering over subspace-projected gradients improve convergence rate and robustness under fp16/bf16 precision?
- RQ3: What are the memory and computation trade-offs of BSZO compared to existing zeroth-order and first-order optimization methods?
- RQ4: How does adaptive residual-based noise estimation affect performance across tasks and model sizes?
Key Findings
- BSZO achieves stable, competitive accuracy across RoBERTa, OPT, and Mistral models and outperforms the baselines across various tasks, with up to a 6.67% absolute average improvement on OPT-13B.
- Convergence rate is theoretically improved by a factor of k/γ compared to standard ZO methods.
- BSZO maintains memory usage close to inference-only baselines (1.00×–1.08× MeZO) and is significantly more memory-efficient than HiZOO and MeZO-Adam.
- Under reduced precision, BSZO and BSZO-B remain robust, while several baselines collapse or degrade substantially.
- In decoder-only models, BSZO consistently achieves top or near-top average accuracy, with larger gains as model size increases.
This explainer was generated by AI and reviewed by human editors.