QUICK REVIEW

[Paper Review] Mousse: Rectifying the Geometry of Muon with Curvature-Aware Preconditioning

Yechen Zhang, Shuhao Xing|arXiv (Cornell University)|Mar 10, 2026

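Stochastic Gradient Optimization Techniques | Citations: 0

One-Line Summary

Mousse strengthens Muon by applying Newton-Schulz orthogonalization in the whitened space induced by Shampoo, reaching the target loss in about 12% fewer training steps with negligible overhead on language models of 160M to 800M parameters.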

ABSTRACT

Recent advances in spectral optimization, notably Muon, have demonstrated that constraining update steps to the Stiefel manifold can significantly accelerate training and improve generalization. However, Muon implicitly assumes an isotropic optimization landscape, enforcing a uniform spectral update norm across all eigen-directions. We argue that this "egalitarian" constraint is suboptimal for Deep Neural Networks, where the curvature spectrum is known to be highly heavy-tailed and ill-conditioned. In such landscapes, Muon risks amplifying instabilities in high-curvature directions while limiting necessary progress in flat directions. In this work, we propose Mousse (Muon Optimization Utilizing Shampoo's Structural Estimation), a novel optimizer that reconciles the structural stability of spectral methods with the geometric adaptivity of second-order preconditioning. Instead of applying Newton-Schulz orthogonalization directly to the momentum matrix, Mousse operates in a whitened coordinate system induced by Kronecker-factored statistics (derived from Shampoo). Mathematically, we formulate Mousse as the solution to a spectral steepest descent problem constrained by an anisotropic trust region, where the optimal update is derived via the polar decomposition of the whitened gradient. Empirical results across language models ranging from 160M to 800M parameters demonstrate that Mousse consistently outperforms Muon, achieving around a 12% reduction in training steps with negligible computational overhead.
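
The spectral steepest descent formulation mentioned in the abstract can be sketched as follows. This is a reconstruction under stated assumptions, not the paper's exact statement: G denotes the gradient (or momentum) matrix, L and R the Shampoo Kronecker factors, η the trust-region radius, and the exponent 1/4 is assumed from the L^{-1/4} and R^{-1/4} preconditioners named in the method summary of this review.

```latex
% Assumed anisotropic trust region: the operator norm is measured after
% rescaling the step by the (symmetric positive definite) factors L and R.
\begin{align}
\Delta W^\star
  &= \arg\min_{\Delta W} \ \langle G, \Delta W \rangle
     \quad \text{s.t.} \quad
     \bigl\| L^{1/4}\, \Delta W \, R^{1/4} \bigr\|_{\mathrm{op}} \le \eta .
\intertext{Substitute $Z = L^{1/4} \Delta W R^{1/4}$ and define the whitened
gradient $\tilde G = L^{-1/4} G R^{-1/4}$. Because $L$ and $R$ are symmetric,}
\langle G, \Delta W \rangle
  &= \langle \tilde G, Z \rangle ,
     \qquad \| Z \|_{\mathrm{op}} \le \eta .
\intertext{With the thin SVD $\tilde G = U \Sigma V^\top$, the minimizer is the
scaled polar factor of $\tilde G$, mapped back to the original coordinates:}
Z^\star &= -\eta\, U V^\top ,
  \qquad
  \Delta W^\star = -\eta\, L^{-1/4}\, U V^\top R^{-1/4} .
\end{align}
```

Setting L = I and R = I in this sketch recovers the isotropic case, where the step is simply the scaled polar factor of G, which is the Muon-style update the abstract contrasts against.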

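Motivation and Objectives

The mismatch between Muon's isotropic spectral constraint and the highly anisotropic curvature of neural networks motivates the work.
Propose a curvature-aware spectral optimization framework that integrates second-order preconditioning with spectral constraints.
Develop and analyze stabilization techniques and an efficient implementation for the Mousse update.
Demonstrate robustness and efficiency gains on language models from 160M to 800M parameters.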

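Proposed Method

Formulate Muon as a spectral steepest descent problem under an operator-norm (op-norm) constraint.
Whiten the gradient using Shampoo's Kronecker-factored statistics, so that the geometry becomes approximately spherical (isotropic).
Apply Newton-Schulz orthogonalization in the whitened coordinate system to obtain an update on the Stiefel manifold.
Solve the constrained minimization with the preconditioned gradient to obtain an update rule involving the L^{-1/4} and R^{-1/4} preconditioners.
Introduce stabilization techniques: Trace Normalization and Spectral Tempering (α = 0.125) to control conditioning and curvature strength.
Provide a single-sided preconditioner variant that reduces compute and memory overhead.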
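
The steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`inv_root`, `newton_schulz`, `mousse_like_update`), the damping term, the cubic Newton-Schulz variant, and the specific form of trace normalization (rescaling each statistic to mean eigenvalue 1) are all assumptions made for illustration. Spectral Tempering is not modeled, because this review does not say how α = 0.125 enters the update.

```python
import numpy as np


def inv_root(mat, power, eps=1e-12):
    """Return mat^(-power) for a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.maximum(vals, eps)  # guard against tiny or negative eigenvalues
    return (vecs * vals ** (-power)) @ vecs.T


def newton_schulz(g, steps=30):
    """Approximate the polar factor U V^T of g with the cubic Newton-Schulz iteration."""
    # Frobenius scaling keeps all singular values in (0, 1], inside the
    # convergence region of the cubic iteration.
    x = g / (np.linalg.norm(g) + 1e-12)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # iterate on the wide orientation so x @ x.T is the small Gram matrix
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if transposed else x


def mousse_like_update(grad, left_stat, right_stat, lr=0.02, power=0.25, damping=1e-6):
    """Whiten, orthogonalize, and map back (hypothetical sketch of the idea)."""
    m, n = grad.shape
    # Assumed form of trace normalization: rescale to mean eigenvalue 1, then damp.
    left = left_stat / (np.trace(left_stat) / m) + damping * np.eye(m)
    right = right_stat / (np.trace(right_stat) / n) + damping * np.eye(n)
    left_inv = inv_root(left, power)    # L^{-1/4} when power = 0.25
    right_inv = inv_root(right, power)  # R^{-1/4} when power = 0.25
    whitened = left_inv @ grad @ right_inv
    ortho = newton_schulz(whitened)     # orthogonalize in the whitened space
    return -lr * left_inv @ ortho @ right_inv
```

In this sketch `left_stat` and `right_stat` would be running sums of G Gᵀ and Gᵀ G in the style of Shampoo. The inner product between the gradient and the returned step is negative by construction, so the step is a descent direction under these assumptions. The eigendecomposition in `inv_root` is used for clarity only; a practical optimizer would likely refresh the inverse roots infrequently or approximate them.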

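Experimental Results

Research Questions

RQ1: How does direction-wise curvature-aware geometry affect spectral updates compared with Muon's isotropic constraint?
RQ2: Does whitening with Shampoo statistics enable an effective Newton-Schulz update within the curvature-aware framework?
RQ3: Which stabilization techniques are essential to make curvature-aware spectral optimization robust in deep networks?
RQ4: How much empirical improvement in convergence speed and sample efficiency does Mousse provide when training large language models?
RQ5: Is the single-sided preconditioner practical without sacrificing performance or stability?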

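Key Findings

On the 800M-parameter model, Mousse reduces the number of training steps needed to reach the target loss by about 12% compared with Muon.
Mousse maintains wall-clock training time nearly equal to Muon's, with negligible overhead.
Mousse achieves lower final validation loss across model sizes from 160M to 800M.
Trace Normalization and Spectral Tempering are essential for stability and effective curvature correction.
The single-sided preconditioner performs on par with the full Kronecker-based approach while reducing compute and memory cost.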
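
To make the memory claim for the single-sided variant concrete, the following sketch counts the entries stored for the Kronecker-factor statistics of one weight matrix. The helper name `preconditioner_entries` is hypothetical, and the choice to keep the factor on the smaller dimension is an assumption; this review does not state which side the paper retains.

```python
def preconditioner_entries(rows, cols, single_sided=False):
    """Count stored entries of Kronecker-factor statistics for a rows x cols weight."""
    left = rows * rows    # L accumulates G @ G.T, shape (rows, rows)
    right = cols * cols   # R accumulates G.T @ G, shape (cols, cols)
    if single_sided:
        # Assumption: keep only the factor on the smaller dimension.
        return min(left, right)
    return left + right


# Example shape: a 1024 x 4096 projection matrix.
full = preconditioner_entries(1024, 4096)
single = preconditioner_entries(1024, 4096, single_sided=True)
```

For strongly rectangular layers the larger factor dominates the two-sided cost, which suggests why dropping one side could save a substantial share of memory under this assumption.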


This review was generated by AI and checked by a human editor.