[Paper Review] Robust and Efficient Zeroth-Order LLM Fine-Tuning via Adaptive Bayesian Subspace Optimizer
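BSZO introduces a Bayesian subspace zeroth-order optimization method that fuses gradient information from multiple directions via Kalman filtering, improving the convergence and robustness of LLM fine-tuning while keeping memory usage low.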
Fine-tuning large language models (LLMs) with zeroth-order (ZO) optimization reduces memory by approximating gradients through function evaluations. However, existing methods essentially perform updates in a one-dimensional space, and suffer from collapse or substantial performance degradation under low-precision training. We introduce BSZO, an adaptive Bayesian Subspace Zeroth-Order Optimizer, which applies Kalman filtering to combine finite-difference information across multiple perturbation directions within a subspace. By treating each finite-difference measurement as a noisy observation, BSZO builds a posterior distribution over the subspace-projected gradient and updates it through Bayesian inference, with a residual-based adaptive mechanism to adapt to noise variations. Theoretical analysis shows that BSZO improves the convergence rate by a factor of $k/\gamma$ compared to standard ZO methods. Experiments on RoBERTa, Mistral, and OPT models show that BSZO outperforms the baselines across various tasks, achieving up to 6.67% absolute average improvement on OPT-13B while remaining robust under fp16/bf16 precision and keeping memory usage close to inference-only baselines (1.00×–1.08× of MeZO).
Motivation and Objectives
- Enable memory-efficient fine-tuning of large language models without backpropagation-based gradients.
- Address instability and performance degradation of existing zeroth-order methods under low-precision training.
- Propose BSZO to fuse finite-difference signals across multiple perturbation directions via Bayesian inference.
- Provide theoretical convergence guarantees and empirical validation across RoBERTa, Mistral, and OPT models.
Proposed Method
- Sample k random directions to form a k-dimensional subspace and model the projected gradient as a latent variable.
- Treat each finite-difference measurement as a noisy linear observation of the normalized subspace gradient and update a Gaussian posterior via Kalman filtering.
- Use a residual-based adaptive scheme to dynamically adjust the observation noise variance during training.
- Update parameters by descending along the posterior mean in the subspace, so that each batch drives an update informed by k directions rather than a single one.
- Cache early perturbation results and reuse them to reduce forward passes, with an optional extra forward pass in the basic version (BSZO-B) to improve landscape capture under reduced precision.
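The steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `kalman_update` and `subspace_zo_step`, the Gaussian basis scaled by `1/sqrt(d)`, the zero-mean isotropic prior, the forward finite difference, and in particular the exponential-moving-average rule for the observation noise are all hypothetical choices made for this example. The review does not give the exact form of BSZO's normalization or of its residual-based adaptation.

```python
import numpy as np


def kalman_update(mean, cov, a, y, noise_var):
    """One scalar-observation Kalman update for a static latent vector.

    Observation model (assumed): y = a @ g + noise, noise ~ N(0, noise_var).
    """
    resid = y - a @ mean                      # innovation (residual)
    s = a @ cov @ a + noise_var               # innovation variance
    gain = cov @ a / s                        # Kalman gain, shape (k,)
    new_mean = mean + gain * resid
    new_cov = cov - np.outer(gain, a @ cov)   # P - K a^T P
    return new_mean, new_cov, resid


def subspace_zo_step(loss_fn, theta, k, rng, noise_var,
                     eps=1e-3, lr=0.2, prior_var=1.0, beta=0.1):
    """One zeroth-order step in a random k-dimensional subspace (sketch)."""
    d = theta.size
    # Assumption: Gaussian directions scaled so each column has norm ~1.
    basis = rng.standard_normal((d, k)) / np.sqrt(d)
    mean = np.zeros(k)                        # prior mean of projected gradient
    cov = prior_var * np.eye(k)               # prior covariance
    base = loss_fn(theta)                     # cached and reused for all k probes
    for i in range(k):
        a = np.zeros(k)
        a[i] = 1.0                            # probe the i-th subspace direction
        y = (loss_fn(theta + eps * (basis @ a)) - base) / eps
        mean, cov, resid = kalman_update(mean, cov, a, y, noise_var)
        # Hypothetical residual-based adaptation: EMA of squared residuals.
        noise_var = max((1.0 - beta) * noise_var + beta * resid ** 2, 1e-8)
    # Descend along the posterior mean, mapped back to parameter space.
    return theta - lr * (basis @ mean), noise_var


def quadratic(theta):
    """Toy objective standing in for a model's loss."""
    return 0.5 * float(theta @ theta)


rng = np.random.default_rng(0)
theta = np.ones(50)
noise_var = 1e-2
initial_loss = quadratic(theta)
for _ in range(200):
    theta, noise_var = subspace_zo_step(quadratic, theta, k=4, rng=rng,
                                        noise_var=noise_var)
final_loss = quadratic(theta)
```

Only forward evaluations of `loss_fn` are used, and the baseline loss at `theta` is computed once per step and shared by all k probes, which mirrors the caching idea in spirit. In a real LLM setting the basis would not be materialized as a dense `d × k` matrix; regenerating perturbations from random seeds, as MeZO-style methods do, is the usual way to keep memory near inference level.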
Experimental Results
Research Questions
- RQ1: Can zeroth-order LLM fine-tuning be stabilized and made more data-efficient by aggregating information across multiple perturbation directions?
- RQ2: Does Bayesian inference with Kalman filtering over subspace-projected gradients improve convergence rate and robustness under fp16/bf16 precision?
- RQ3: What are the memory and computation trade-offs of BSZO compared to existing zeroth-order and first-order optimization methods?
- RQ4: How does adaptive residual-based noise estimation affect performance across tasks and model sizes?
Key Findings
- BSZO achieves stable and competitive accuracy across RoBERTa, OPT, and Mistral models, often outperforming baselines on several tasks.
- Convergence rate is theoretically improved by a factor of k/γ compared to standard ZO methods.
- BSZO maintains memory usage close to inference-only baselines (1.00×–1.08× MeZO) and is significantly more memory-efficient than HiZOO and MeZO-Adam.
- Under reduced precision, BSZO and BSZO-B remain robust, while several baselines collapse or degrade substantially.
- In decoder-only models, BSZO consistently achieves top or near-top average accuracy, with larger gains as model size increases.
This review was generated by AI and checked by a human editor.