QUICK REVIEW

[Paper Review] Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer

Jian Feng, Zhihong Huang|arXiv (Cornell University)|Jan 4, 2026

Metaheuristic Optimization Algorithms Research|Cited by 0

One-Sentence Summary

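BSZO is a Bayesian subspace zeroth-order optimizer that uses Kalman filtering to aggregate gradient information from multiple perturbation directions, improving convergence and robustness when fine-tuning large language models while keeping memory usage low.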

ABSTRACT

Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations. However, existing methods essentially perform updates in a one-dimensional space, and suffer from collapse or substantial performance degradation under low-precision training. We introduce BSZO, an adaptive Bayesian Subspace Zeroth-Order Optimizer, which applies Kalman filtering to combine finite-difference information across multiple perturbation directions within a subspace. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the subspace-projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adapt to noise variations. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/\gamma$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms the baselines across various tasks, achieving up to 6.67% absolute average improvement on OPT-13B while remaining robust under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00×–1.08× of MeZO).
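
For readers new to zeroth-order optimization, the single-direction update that the abstract contrasts BSZO with can be sketched as follows. This is a generic two-point finite-difference estimator on a toy quadratic, not code from the paper; the function names, step size, and perturbation scale are all illustrative assumptions.

```python
import numpy as np

def zo_directional_estimate(loss_fn, theta, direction, eps=1e-3):
    """Two-point finite difference approximating the directional derivative."""
    plus = loss_fn(theta + eps * direction)
    minus = loss_fn(theta - eps * direction)
    return (plus - minus) / (2.0 * eps)

def zo_sgd_step(loss_fn, theta, rng, lr=0.05, eps=1e-3):
    """One single-direction ZO step; only function evaluations are used."""
    z = rng.standard_normal(theta.shape)      # random perturbation direction
    d = zo_directional_estimate(loss_fn, theta, z, eps)
    return theta - lr * d * z                 # move along z, scaled by the estimate

def quadratic_loss(theta):
    """Toy objective standing in for a model loss (assumption for the demo)."""
    return 0.5 * float(theta @ theta)

def run_zo_demo(steps=200, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.ones(dim)
    for _ in range(steps):
        theta = zo_sgd_step(quadratic_loss, theta, rng)
    return quadratic_loss(theta)
```

Each step here uses information from only one random direction, which is the limitation the abstract describes as updating in a one-dimensional space.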

Research Motivation and Objectives

Motivate memory-efficient fine-tuning of large language models without backpropagation-based gradients.
Address instability and performance degradation of existing zeroth-order methods under low-precision training.
Propose BSZO to fuse finite-difference signals across multiple perturbation directions via Bayesian inference.
Provide theoretical convergence guarantees and empirical validation across RoBERTa, Mistral, and OPT models.

Proposed Method

Sample k random directions to form a k-dimensional subspace and model the projected gradient as a latent variable.
Treat each finite-difference measurement as a noisy linear observation of the normalized subspace gradient and update a Gaussian posterior via Kalman filtering.
Use a residual-based adaptive scheme to dynamically adjust the observation noise variance during training.
Update parameters by descending along the posterior mean in the subspace, enabling k updates per batch instead of a single direction.
Cache early perturbation results and reuse them to reduce forward passes, with an optional extra forward pass in the basic version (BSZO-B) to improve landscape capture under reduced precision.
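
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the orthonormalized random basis, the zero-mean isotropic prior, the one-coordinate-per-measurement observation model, and the residual-based moving-average rule for the noise variance are all choices made here for the example, and the caching of perturbation results is omitted.

```python
import numpy as np

def kalman_update(mean, cov, h, y, noise_var):
    """One scalar-observation Kalman update for the model y = h @ g + noise."""
    innovation = y - h @ mean
    s = h @ cov @ h + noise_var               # predicted variance of the observation
    gain = cov @ h / s
    new_mean = mean + gain * innovation
    new_cov = cov - np.outer(gain, h @ cov)
    return new_mean, new_cov

def subspace_zo_step(loss_fn, theta, rng, noise_var, k=4, eps=1e-3,
                     lr=0.5, prior_var=1.0, beta=0.1):
    """One illustrative step: k finite differences fused into a posterior mean."""
    # Assumption: orthonormal basis of a random k-dimensional subspace.
    basis, _ = np.linalg.qr(rng.standard_normal((theta.size, k)))
    mean, cov = np.zeros(k), prior_var * np.eye(k)
    for i in range(k):
        h = np.zeros(k)
        h[i] = 1.0                            # measurement i observes coordinate i
        d = basis[:, i]
        y = (loss_fn(theta + eps * d) - loss_fn(theta - eps * d)) / (2.0 * eps)
        mean, cov = kalman_update(mean, cov, h, y, noise_var)
        # Assumed adaptation rule: residual-based estimate of the noise
        # variance, smoothed with an exponential moving average.
        residual = y - h @ mean
        r_hat = residual ** 2 + h @ cov @ h
        noise_var = max((1.0 - beta) * noise_var + beta * r_hat, 1e-8)
    # Descend along the posterior mean, mapped back to parameter space.
    return theta - lr * (basis @ mean), noise_var

def run_subspace_demo(steps=100, dim=20, seed=0):
    rng = np.random.default_rng(seed)
    loss_fn = lambda t: 0.5 * float(t @ t)    # toy quadratic, not an LLM loss
    theta, noise_var = np.ones(dim), 1e-2
    for _ in range(steps):
        theta, noise_var = subspace_zo_step(loss_fn, theta, rng, noise_var)
    return loss_fn(theta)
```

Calling `run_subspace_demo()` drives the toy loss down from its starting value of 10. In this sketch the noise variance is carried across steps, so measurements that disagree with the posterior are down-weighted in later updates.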

Experimental Results

Research Questions

RQ1: Can zeroth-order LLM fine-tuning be stabilized and made more data-efficient by aggregating information across multiple perturbation directions?
RQ2: Does Bayesian inference with Kalman filtering over subspace-projected gradients improve convergence rate and robustness under fp16/bf16 precision?
RQ3: What are the memory and computation trade-offs of BSZO compared to existing zeroth-order and first-order optimization methods?
RQ4: How does adaptive residual-based noise estimation affect performance across tasks and model sizes?

Key Findings

BSZO achieves stable and competitive accuracy across RoBERTa, OPT, and Mistral models, often outperforming baselines on several tasks.
Convergence rate is theoretically improved by a factor of k/γ compared to standard ZO methods.
BSZO maintains memory usage close to inference-only baselines (1.00×–1.08× MeZO) and is significantly more memory-efficient than HiZOO and MeZO-Adam.
Under reduced precision, BSZO and BSZO-B remain robust, while several baselines collapse or degrade substantially.
In decoder-only models, BSZO consistently achieves top or near-top average accuracy, with larger gains as model size increases.

This review was generated by AI and reviewed by a human editor.