QUICK REVIEW

[Paper review] Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

Hyunji Jung, Sungbin Shin|arXiv (Cornell University)|Feb 3, 2026

Parallel Computing and Optimization Techniques | Citations: 0

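One-line summary

This paper identifies basis misalignment as the root cause of the harm done by gradient staleness in asynchronous pipeline parallelism, and proposes basis rotation, driven by a Hessian-based eigenbasis estimate, to realign the optimization space, restore scalability, and accelerate convergence.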

ABSTRACT

Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. In this work, we investigate this inconsistency and bridge the gap by rectifying delayed gradients through basis rotation, restoring scalable asynchronous training while maintaining performance. Specifically, we observe that the deleterious effects of delayed gradients are exacerbated when the Hessian eigenbasis is misaligned with the standard coordinate basis. We demonstrate that this misalignment prevents coordinate-wise adaptive schemes, such as Adam, from effectively leveraging curvature-aware adaptivity. This failure leads to significant oscillations in the optimization trajectory and, consequently, slower convergence. We substantiate these findings through both rigorous theoretical analysis and empirical evaluation. To address this challenge, we propose the use of basis rotation, demonstrating that it effectively mitigates the alignment issue and significantly accelerates convergence in asynchronous settings. For example, our training of a 1B-parameter LLM with basis rotation achieves the same training loss in 76.8% fewer iterations compared to the best-performing asynchronous pipeline parallel training baseline.

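Motivation and objectives

Investigate how pipeline depth and gradient delay affect convergence in asynchronous pipeline parallelism.
Identify basis misalignment as the key mechanism behind the delay sensitivity of Adam-type optimizers.
Introduce basis rotation, which aligns the Hessian eigenbasis with the standard basis, to mitigate the effects of delay.
Provide an efficient eigenbasis estimation strategy suited to large Transformers.
Demonstrate improved convergence and scalability in language-model pretraining experiments with up to 1B parameters.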

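Proposed method

Analyze how delay affects convergence under basis alignment and under basis misalignment.
Propose basis rotation, which uses a rotation matrix U (V in the symmetric case) to rotate updates into a basis-aligned space.
Approximate the Hessian with the Kronecker-factored empirical Fisher and estimate its eigenbasis.
Implement two eigenbasis estimation strategies (S = 2nd moment and S = 1st moment) using symmetric and asymmetric rotation geometries.
Provide a practical algorithm, Adam with Basis Rotation (Algorithm 1), together with an eigenbasis estimation procedure (Algorithm 2).
Show that rotating into the basis-aligned space restores curvature-aware adaptivity and suppresses delay-induced oscillations.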
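
The idea in the list above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1 or 2: the function names (`estimate_eigenbases`, `rotated_adam_step`, `init_state`) are hypothetical, the Kronecker factors are taken to be averages of G Gᵀ and Gᵀ G over recent gradients (a common Shampoo/SOAP-style choice assumed here), and the eigenbases are held fixed rather than refreshed during training.

```python
import numpy as np

def estimate_eigenbases(grads):
    """Assumed estimator: eigenvectors of the two Kronecker factors
    of the empirical Fisher, averaged over a list of gradient matrices."""
    left = sum(g @ g.T for g in grads) / len(grads)    # (rows x rows) factor
    right = sum(g.T @ g for g in grads) / len(grads)   # (cols x cols) factor
    _, U = np.linalg.eigh(left)    # columns of U are orthonormal eigenvectors
    _, V = np.linalg.eigh(right)
    return U, V

def init_state(shape):
    """Adam moments, stored in the rotated basis."""
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}

def rotated_adam_step(W, G, state, U, V, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step taken in the rotated basis, then mapped back."""
    G_rot = U.T @ G @ V                      # rotate the gradient
    state["t"] += 1
    state["m"] = b1 * state["m"] + (1 - b1) * G_rot
    state["v"] = b2 * state["v"] + (1 - b2) * G_rot ** 2
    m_hat = state["m"] / (1 - b1 ** state["t"])   # bias correction
    v_hat = state["v"] / (1 - b2 ** state["t"])
    step_rot = m_hat / (np.sqrt(v_hat) + eps)     # coordinate-wise scaling
    return W - lr * (U @ step_rot @ V.T)          # rotate the step back

# Usage on random data (illustrative only).
rng = np.random.default_rng(0)
recent_grads = [rng.normal(size=(4, 3)) for _ in range(8)]
U, V = estimate_eigenbases(recent_grads)
W = rng.normal(size=(4, 3))
W_next = rotated_adam_step(W, recent_grads[-1], init_state(W.shape), U, V)
```

With U and V set to identity matrices the step reduces to ordinary Adam, so the rotation is the only change this sketch makes to the optimizer.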

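Experimental results

Research questions

RQ1: How do the Hessian geometry of the Transformer loss and gradient delay interact in asynchronous pipeline parallelism?
RQ2: Does misalignment between the Hessian eigenbasis and the standard coordinate basis further degrade the adaptivity of Adam-type optimizers under delay?
RQ3: Can basis rotation realign the optimization space, mitigate delay, and restore scalable asynchronous training?
RQ4: What is a practical and scalable way to estimate the Hessian eigenbasis for large models?
RQ5: How much empirical improvement in convergence and scalability does basis rotation bring to pretraining a 1B-parameter LLM?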
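
RQ2 can be made concrete on a toy 2-D quadratic. The sketch below is an illustration, not an experiment from the paper. It uses Jacobi (diagonal) preconditioning as a simplified stand-in for Adam's coordinate-wise scaling, and compares the condition number of the preconditioned Hessian when the eigenbasis is aligned with the coordinate axes, rotated by 45 degrees, and rotated back. The helper names and the eigenvalues 100 and 1 are assumptions chosen for the example.

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def jacobi_condition_number(H):
    """Condition number of D^{-1/2} H D^{-1/2}, where D = diag(H)."""
    d = 1.0 / np.sqrt(np.diag(H))
    return np.linalg.cond(d[:, None] * H * d[None, :])

eigvals = np.array([100.0, 1.0])
Q = rotation(np.pi / 4)

H_aligned = np.diag(eigvals)            # eigenbasis == coordinate basis
H_misaligned = Q @ H_aligned @ Q.T      # eigenbasis rotated by 45 degrees
H_rotated_back = Q.T @ H_misaligned @ Q # basis rotation undoes the misalignment

cond_aligned = jacobi_condition_number(H_aligned)
cond_misaligned = jacobi_condition_number(H_misaligned)
cond_rotated_back = jacobi_condition_number(H_rotated_back)
print(cond_aligned, cond_misaligned, cond_rotated_back)
```

In this toy case the diagonal scaling removes the ill-conditioning entirely when the bases are aligned (condition number about 1) and not at all at 45 degrees (condition number about 100), which is the kind of lost curvature-awareness the question asks about.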

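Key findings

As pipeline depth increases, gradient delay significantly degrades convergence in asynchronous pipeline parallelism.
Under delay, basis misalignment reduces the effectiveness of Adam's coordinate-wise adaptivity, causing oscillations and slow convergence.
Basis rotation realigns the Hessian eigenbasis with the standard basis, mitigating the effects of delay and accelerating convergence.
With basis rotation, a 1B-parameter LLM reaches the same training loss in 76.8% fewer iterations than the best-performing asynchronous baseline.
With 32 pipeline stages, basis rotation reaches the same loss in up to 81.6% fewer iterations than the baseline in the reported experiments.
Basis rotation remains robust to delay even with less accurate eigenbasis estimates, or without weight stashing under memory constraints.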
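
The first finding, that longer delays hurt convergence, can be reproduced in miniature. The sketch below is a toy model rather than the paper's setup: plain gradient descent on a 1-D quadratic where each update uses the gradient evaluated `delay` steps in the past, with `delay` standing in for the staleness that grows with pipeline depth. The function name and the learning rate of 0.5 are assumptions for illustration.

```python
import numpy as np

def delayed_gd(delay, lr=0.5, curvature=1.0, steps=200, x0=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x**2 with stale gradients.

    Update rule: x[t+1] = x[t] - lr * grad(x[t - delay]).
    """
    history = [x0] * (delay + 1)          # pad so a stale iterate always exists
    for _ in range(steps):
        stale_x = history[-(delay + 1)]   # iterate from `delay` steps ago
        grad = curvature * stale_x
        history.append(history[-1] - lr * grad)
    return np.array(history[delay:])      # trajectory x[0], ..., x[steps]

fresh = delayed_gd(delay=0)   # no staleness
stale = delayed_gd(delay=4)   # same learning rate, gradients 4 steps old

print(abs(fresh[-1]))                 # shrinks toward 0
print(np.max(np.abs(stale[-20:])))    # oscillation amplitude keeps growing
```

With the same learning rate, the fresh-gradient run contracts geometrically while the delayed run oscillates with growing amplitude, so a step size that is safe without delay can become unstable once gradients are stale.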


This review was generated by AI and checked by a human editor.