QUICK REVIEW

[Paper Review] Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

Hyunji Jung, Sungbin Shin|arXiv (Cornell University)|2026. 02. 03.

Parallel Computing and Optimization Techniques | Citations: 0

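One-Line Summary

The paper identifies basis misalignment as the root cause of staleness-induced degradation in asynchronous pipeline parallelism, and proposes basis rotation, driven by Hessian-based eigenvector estimation, to realign the optimization space, restore scalability, and accelerate convergence.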

ABSTRACT

Asynchronous pipeline parallelism maximizes hardware utilization by eliminating the pipeline bubbles inherent in synchronous execution, offering a path toward efficient large-scale distributed training. However, this efficiency gain can be compromised by gradient staleness, where the immediate model updates with delayed gradients introduce noise into the optimization process. Crucially, we identify a critical, yet often overlooked, pathology: this delay scales linearly with pipeline depth, fundamentally undermining the very scalability that the method originally intends to provide. In this work, we investigate this inconsistency and bridge the gap by rectifying delayed gradients through basis rotation, restoring scalable asynchronous training while maintaining performance. Specifically, we observe that the deleterious effects of delayed gradients are exacerbated when the Hessian eigenbasis is misaligned with the standard coordinate basis. We demonstrate that this misalignment prevents coordinate-wise adaptive schemes, such as Adam, from effectively leveraging curvature-aware adaptivity. This failure leads to significant oscillations in the optimization trajectory and, consequently, slower convergence. We substantiate these findings through both rigorous theoretical analysis and empirical evaluation. To address this challenge, we propose the use of basis rotation, demonstrating that it effectively mitigates the alignment issue and significantly accelerates convergence in asynchronous settings. For example, our training of a 1B-parameter LLM with basis rotation achieves the same training loss in 76.8% fewer iterations compared to the best-performing asynchronous pipeline parallel training baseline.

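Research Motivation and Goals

Investigate how gradient delay scales with pipeline depth in asynchronous pipeline parallelism.
Identify basis misalignment as the key mechanism by which delay harms Adam-type optimizers.
Introduce basis rotation, which realigns the Hessian eigenvectors with the standard basis, to mitigate the effect of delay.
Provide an efficient eigenvector estimation strategy that is compatible with large-scale Transformers.
Demonstrate improved convergence and scalability when pretraining a 1B-parameter language model.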

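Proposed Method

Analyze how gradient delay affects convergence depending on whether the basis is aligned.
Propose basis rotation, which rotates updates into a basis-aligned space using a rotation matrix U (and, in the bilateral case, V).
Approximate the Hessian with a Kronecker-factored empirical Fisher information matrix and estimate its eigenvectors.
Implement two eigenvector estimation strategies (S = 2nd moment and S = 1st moment) with either bilateral (two-sided) or unilateral (one-sided) rotation geometry.
Provide a practical algorithm for Adam with Basis Rotation (Algorithm 1) and an eigenvector estimation procedure (Algorithm 2).
Rotating into the basis-aligned space restores curvature-aware adaptivity and reduces delay-induced oscillations.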
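
The rotate, adapt, rotate-back idea above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's Algorithm 1 or Algorithm 2: the Kronecker factors are taken to be averages of G Gᵀ and Gᵀ G (as in Shampoo/SOAP-style preconditioners), the rotations are held fixed between steps, and the names `estimate_rotations`, `rotated_adam_step`, and `init_state` are hypothetical.

```python
import numpy as np

def estimate_rotations(grads):
    """Estimate left/right rotations U, V for one weight matrix.

    Assumption: the Kronecker factors of the empirical Fisher are approximated
    by averaging G @ G.T and G.T @ G over a list of gradients; the paper's
    estimator (its "S" variants) may differ.
    """
    m, n = grads[0].shape
    left = np.zeros((m, m))
    right = np.zeros((n, n))
    for G in grads:
        left += G @ G.T
        right += G.T @ G
    left /= len(grads)
    right /= len(grads)
    _, U = np.linalg.eigh(left)    # columns are orthonormal eigenvectors
    _, V = np.linalg.eigh(right)
    return U, V

def init_state(shape):
    """Adam moments, stored in the rotated space."""
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}

def rotated_adam_step(W, G, state, U, V, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step taken in the rotated basis (bilateral case).

    U and V are assumed fixed; if they were refreshed, the stored moments
    would need to be re-expressed in the new basis.
    """
    state["t"] += 1
    G_rot = U.T @ G @ V                                   # rotate the gradient
    state["m"] = beta1 * state["m"] + (1 - beta1) * G_rot
    state["v"] = beta2 * state["v"] + (1 - beta2) * G_rot ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])        # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    step_rot = m_hat / (np.sqrt(v_hat) + eps)             # coordinate-wise adaptivity
    return W - lr * (U @ step_rot @ V.T)                  # rotate the step back

# Usage on random data (illustrative only).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
history = [rng.normal(size=(4, 3)) for _ in range(8)]
U, V = estimate_rotations(history)
W = rotated_adam_step(W, history[-1], init_state(W.shape), U, V)
```

Setting U and V to identity matrices reduces the step to ordinary Adam, which makes the sketch easy to sanity-check. A unilateral variant would apply only one of the two rotations.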

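Experimental Results

Research Questions

RQ1 How do the Hessian geometry of the Transformer loss and gradient delay interact in asynchronous pipeline parallelism?
RQ2 Does misalignment between the Hessian eigenvectors and the standard coordinate basis further degrade Adam-type optimization under delay?
RQ3 Can basis rotation realign the optimization space to mitigate delay and restore scalable asynchronous training?
RQ4 What is a practical, scalable way to estimate Hessian eigenvectors for large models?
RQ5 What empirical gains in convergence and scalability does basis rotation deliver when pretraining a 1B-parameter LLM?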

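Key Results

As pipeline depth increases, gradient delay in asynchronous pipeline parallelism degrades convergence substantially.
Under delay, basis misalignment weakens Adam's coordinate-wise adaptivity, causing oscillations and slower convergence.
Basis rotation realigns the Hessian eigenvectors with the standard basis, mitigating the effect of delay and speeding up convergence.
With basis rotation, a 1B-parameter LLM reaches the same training loss in 76.8% fewer iterations than the best-performing asynchronous baseline.
With 32 pipeline stages, basis rotation reaches the same loss in up to 81.6% fewer iterations than the baseline in the reported experiments.
Basis rotation remains robust to delay even when the estimated eigenvectors are less accurate, and in memory-constrained settings without weight stashing.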
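
The first result above, that larger delays hurt convergence, can be illustrated with a toy example. The sketch below is not the paper's experimental setup: it runs plain gradient descent (not Adam) on a scalar quadratic, uses a fixed `delay` as a stand-in for staleness that grows with pipeline depth, and the function name `delayed_gd` is hypothetical.

```python
def delayed_gd(delay, curvature=1.0, lr=0.5, steps=200, x0=1.0):
    """Gradient descent on f(x) = 0.5 * curvature * x**2 with stale gradients.

    At step t the gradient is evaluated at the iterate from `delay` steps
    earlier (clamped to the initial point), mimicking an update computed
    from outdated weights.
    """
    xs = [x0]
    for t in range(steps):
        stale_x = xs[max(t - delay, 0)]           # iterate seen by the delayed gradient
        xs.append(xs[-1] - lr * curvature * stale_x)
    return xs

# Same step size, different delays (illustrative only).
no_delay = delayed_gd(delay=0)
deep_pipeline = delayed_gd(delay=7)
print(abs(no_delay[-1]), max(abs(x) for x in deep_pipeline))
```

With no delay the iterate shrinks by half at every step. With a delay of 7 the same step size overshoots the minimum well past the starting distance, because each update keeps pushing in a direction that is already out of date. In this toy setting, a deeper pipeline therefore needs a smaller step size or some correction of the stale gradient.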


This review was generated by AI and reviewed by a human editor.