QUICK REVIEW

[Paper Review] Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer

Jian Feng, Zhihong Huang | arXiv (Cornell University) | Jan 4, 2026
Metaheuristic Optimization Algorithms Research | Cited by: 0
One-Sentence Summary

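BSZO is a Bayesian-subspace zeroth-order optimizer that uses Kalman filtering to aggregate gradient information from multiple perturbation directions, improving convergence and robustness in large language model fine-tuning while keeping memory usage low.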

ABSTRACT

Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations. However, existing methods essentially perform updates in a one-dimensional space, and suffer from collapse or substantial performance degradation under low-precision training. We introduce BSZO, an adaptive Bayesian Subspace Zeroth-Order Optimizer, which applies Kalman filtering to combine finite-difference information across multiple perturbation directions within a subspace. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the subspace-projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adapt to noise variations. Theoretical analysis shows that BSZO improves the convergence rate by a factor of k/γ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms the baselines across various tasks, achieving up to 6.67% absolute average improvement on OPT-13B while remaining robust under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00×–1.08× of MeZO).

Motivation and Objectives

  • Motivate memory-efficient fine-tuning of large language models without backpropagation-based gradients.
  • Address instability and performance degradation of existing zeroth-order methods under low-precision training.
  • Propose BSZO to fuse finite-difference signals across multiple perturbation directions via Bayesian inference.
  • Provide theoretical convergence guarantees and empirical validation across RoBERTa, Mistral, and OPT models.

Proposed Method

  • Sample k random directions to form a k-dimensional subspace and model the projected gradient as a latent variable.
  • Treat each finite-difference measurement as a noisy linear observation of the normalized subspace gradient and update a Gaussian posterior via Kalman filtering.
  • Use a residual-based adaptive scheme to dynamically adjust the observation noise variance during training.
  • Update parameters by descending along the posterior mean in the subspace, so each batch uses information from k directions instead of a single one.
  • Cache early perturbation results and reuse them to reduce forward passes, with an optional extra forward pass in the basic version (BSZO-B) to improve landscape capture under reduced precision.
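
The steps listed above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The function name `subspace_kalman_zo_step`, its parameters (`prior_var`, `noise_var`, `adapt`), the forward-difference probe, the axis-aligned observation vectors, and the innovation-matching rule for the noise variance are all assumptions made for this example. The only reuse shown is the cached base loss, so one step costs 1 + k forward passes.

```python
import numpy as np


def subspace_kalman_zo_step(loss, theta, k=4, eps=1e-3, lr=0.1,
                            prior_var=1.0, noise_var=1e-2, adapt=0.1, rng=None):
    """One illustrative zeroth-order step with Kalman fusion in a random subspace.

    Hypothetical sketch: names, defaults, and the noise-adaptation rule are
    assumptions for illustration, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = theta.size

    # k random unit directions (columns of B) span the subspace.
    B = rng.standard_normal((n, k))
    B /= np.linalg.norm(B, axis=0, keepdims=True)

    # Gaussian prior over the subspace-projected gradient g in R^k.
    mu = np.zeros(k)
    P = prior_var * np.eye(k)

    base = loss(theta)  # cached and reused by every probe
    for i in range(k):
        h = np.zeros(k)
        h[i] = 1.0  # assumption: probe i observes coordinate i of g
        # Forward finite difference: a noisy scalar observation of h^T g.
        y = (loss(theta + eps * (B @ h)) - base) / eps

        resid = y - h @ mu   # innovation (residual)
        pred = h @ P @ h     # predicted variance of h^T g
        # Assumed residual-based rule: move noise_var toward resid^2 - pred.
        noise_var = (1 - adapt) * noise_var + adapt * max(resid**2 - pred, 1e-12)

        # Standard Kalman update for a scalar linear observation.
        K = P @ h / (pred + noise_var)
        mu = mu + K * resid
        P = P - np.outer(K, h @ P)

    # Descend along the posterior mean, mapped back to parameter space.
    theta_new = theta - lr * (B @ mu)
    return theta_new, mu, P, noise_var


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    quad = lambda t: 0.5 * float(t @ t)  # toy loss standing in for an LLM objective
    theta, nv = np.ones(50), 1e-2
    for _ in range(200):
        theta, mu, P, nv = subspace_kalman_zo_step(quad, theta, noise_var=nv, rng=rng)
    print(quad(theta))
```

With axis-aligned probes and a diagonal prior, each update reduces to per-coordinate shrinkage of the measurement toward the prior mean. The general matrix form is kept so that non-axis observation vectors could be substituted for `h` without changing the update.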

Experimental Results

Research Questions

  • RQ1: Can zeroth-order LLM fine-tuning be stabilized and made more data-efficient by aggregating information across multiple perturbation directions?
  • RQ2: Does Bayesian inference with Kalman filtering over subspace-projected gradients improve convergence rate and robustness under fp16/bf16 precision?
  • RQ3: What are the memory and computation trade-offs of BSZO compared to existing zeroth-order and first-order optimization methods?
  • RQ4: How does adaptive residual-based noise estimation affect performance across tasks and model sizes?

Key Findings

  • BSZO achieves stable and competitive accuracy across RoBERTa, OPT, and Mistral models, often outperforming baselines on several tasks.
  • Convergence rate is theoretically improved by a factor of k/γ compared to standard ZO methods.
  • BSZO maintains memory usage close to inference-only baselines (1.00×–1.08× MeZO) and is significantly more memory-efficient than HiZOO and MeZO-Adam.
  • Under reduced precision, BSZO and BSZO-B remain robust, while several baselines collapse or degrade substantially.
  • In decoder-only models, BSZO consistently achieves top or near-top average accuracy, with larger gains as model size increases.


This review was generated by AI and reviewed by human editors.