QUICK REVIEW

[Paper Review] DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

Hanlin Tang, Xiangru Lian | arXiv (Cornell University) | May 15, 2019

Stochastic Gradient Optimization Techniques | Citations: 94

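One-Line Summary

DoubleSqueeze analyzes and proves the convergence of parallel error-compensated SGD with two-pass compression (on the workers and on the parameter server), showing linear speedup and improved tolerance to compression bias and noise.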

ABSTRACT

A standard approach in large scale machine learning is distributed stochastic gradient training, which requires the computation of aggregated stochastic gradients over multiple nodes on a network. Communication is a major bottleneck in such applications, and in recent years, compressed stochastic gradient methods such as QSGD (quantized SGD) and sparse SGD have been proposed to reduce communication. It was also shown that error compensation can be combined with compression to achieve better convergence in a scheme that each node compresses its local stochastic gradient and broadcast the result to all other nodes over the network in a single pass. However, such a single pass broadcast approach is not realistic in many practical implementations. For example, under the popular parameter server model for distributed learning, the worker nodes need to send the compressed local gradients to the parameter server, which performs the aggregation. The parameter server has to compress the aggregated stochastic gradient again before sending it back to the worker nodes. In this work, we provide a detailed analysis on this two-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server. We show that the error-compensated stochastic gradient algorithm admits three very nice properties: 1) it is compatible with an arbitrary compression technique; 2) it admits an improved convergence rate than the non error-compensated stochastic gradient methods such as QSGD and sparse SGD; 3) it admits linear speedup with respect to the number of workers. The empirical study is also conducted to validate our theoretical results.

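Motivation and Objectives

Aims to reduce the communication bottleneck in distributed stochastic gradient training.
Extends error compensation to both the workers and the parameter server under a two-pass communication model.
Proves convergence and linear speedup of the proposed DoubleSqueeze algorithm under non-convex losses.
Provides empirical validation supporting the theoretical convergence and the practical bandwidth savings.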

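Proposed Method

Introduces DoubleSqueeze, in which both the workers and the parameter server apply error-compensated compression to the gradients they transmit.
Uses a compression operator Q_ω[·] that may be biased or unbiased, and introduces error vectors δ^{(i)} on worker i and δ on the server to compensate for the information lost to compression.
Shows that the global update can be written as x_{t+1} = x_t - γ∇f(x_t) + γξ_t - γΩ_{t-1} + γΩ_t, where Ω_t captures the compression error and ξ_t the stochastic gradient noise.
Proves that, under suitable assumptions (Lipschitz gradient, bounded variance, bounded compression error), DoubleSqueeze converges with linear speedup in the number of workers n.
The resulting corollary gives a rate with an O(σ/√(nT)) term plus terms that depend on ε and T, indicating faster convergence under parallelism and tolerance to compression error.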
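The two-pass scheme and the rewritten update above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a synchronous setting, uses top-k as the compressor Q on both sides, and runs on a toy quadratic f(x) = 0.5‖x‖² with Gaussian gradient noise. All function names are hypothetical. Here the averaged stochastic gradient g_bar plays the role of ∇f(x_t) - ξ_t, and Omega is the server error plus the mean of the worker errors.

```python
import random

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest (a biased compressor)."""
    keep = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]
    out = [0.0] * len(v)
    for j in keep:
        out[j] = v[j]
    return out

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def doublesqueeze_step(x, grads, worker_err, server_err, lr, k):
    """One synchronous iteration. Returns (new x, new worker errors, new server error)."""
    sent, new_worker_err = [], []
    for g, d in zip(grads, worker_err):
        v = add(g, d)                     # add back what was lost last round
        q = top_k(v, k)                   # first pass: worker -> server
        sent.append(q)
        new_worker_err.append(sub(v, q))  # remember what is lost this round
    v_srv = add(mean(sent), server_err)   # server compensates its own error
    q_srv = top_k(v_srv, k)               # second pass: server -> workers
    new_server_err = sub(v_srv, q_srv)
    x_new = [xi - lr * qi for xi, qi in zip(x, q_srv)]
    return x_new, new_worker_err, new_server_err

def max_identity_gap(steps=20, dim=8, n=4, lr=0.1, k=2, seed=0):
    """Largest deviation from x_{t+1} = x_t - lr*g_bar - lr*Omega_{t-1} + lr*Omega_t
    observed over a short run on the toy quadratic (gradient of f is x itself)."""
    rng = random.Random(seed)
    x = [1.0] * dim
    w_err = [[0.0] * dim for _ in range(n)]
    s_err = [0.0] * dim
    worst = 0.0
    for _ in range(steps):
        grads = [[xi + rng.gauss(0.0, 0.1) for xi in x] for _ in range(n)]
        g_bar = mean(grads)
        omega_prev = add(s_err, mean(w_err))
        x_new, w_err, s_err = doublesqueeze_step(x, grads, w_err, s_err, lr, k)
        omega = add(s_err, mean(w_err))
        predicted = [xi - lr * gb - lr * op + lr * o
                     for xi, gb, op, o in zip(x, g_bar, omega_prev, omega)]
        worst = max(worst, max(abs(a - b) for a, b in zip(x_new, predicted)))
        x = x_new
    return worst
```

Calling max_identity_gap() should return a value at floating-point rounding level, since the identity is purely algebraic: the compressed vector broadcast by the server equals g_bar + Omega_{t-1} - Omega_t, whatever compressor is plugged in.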

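Experimental Results

Research Questions

RQ1: Can error compensation be extended effectively to both the workers and the parameter server in a two-pass compression setting?
RQ2: Does parallel two-pass error-compensated SGD achieve linear speedup in the number of workers?
RQ3: How does DoubleSqueeze compare with non-error-compensated SGD and other compressed SGD methods under non-convex losses?
RQ4: Which compression operators (biased or unbiased) can be used within DoubleSqueeze while preserving convergence?
RQ5: What bandwidth savings and convergence behavior are observed in practice on common models and datasets?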

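Key Findings

DoubleSqueeze converges with linear speedup in the number of workers n.
The method tolerates compression bias and noise better than non-error-compensated SGD and improves convergence in compressed settings.
Compression is applied in both directions, and each iteration requires only n communications, which allows substantial bandwidth savings.
The theoretical results cover non-convex losses and achieve a rate similar to that of SGD even in the presence of compression.
Experiments with ResNet-18 on CIFAR-10 show convergence comparable to uncompressed SGD with faster per-iteration time under limited bandwidth, and better results than non-compensated methods under bandwidth constraints.
With 1-bit and top-k compression, DoubleSqueeze still maintains competitive training and test performance, with substantial speedups under constrained network conditions.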
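One way to see why error compensation tolerates biased compressors such as 1-bit and top-k is to track how far the sum of transmitted vectors drifts from the sum of true gradients. The sketch below is illustrative only: the scaled-sign form of the 1-bit compressor (mean absolute value as the shared scale) is an assumption of this example, not necessarily the exact operator used in the paper's experiments, and all function names are hypothetical.

```python
import random

def one_bit(v):
    """Scaled-sign compressor: one sign per entry plus one shared scale.
    Using the mean absolute value as the scale is an assumption of this sketch."""
    s = sum(abs(x) for x in v) / len(v)
    return [s if x >= 0 else -s for x in v]

def top_k(v, k):
    """Keep the k largest-magnitude entries of v and zero the rest."""
    keep = sorted(range(len(v)), key=lambda j: abs(v[j]), reverse=True)[:k]
    out = [0.0] * len(v)
    for j in keep:
        out[j] = v[j]
    return out

def cumulative_gap(grads, compress, compensate):
    """Return (sum of true gradients - sum of transmitted vectors, final residual)."""
    dim = len(grads[0])
    err = [0.0] * dim
    gap = [0.0] * dim
    for g in grads:
        # With compensation, the previous residual is added before compressing.
        v = [a + b for a, b in zip(g, err)] if compensate else list(g)
        q = compress(v)
        if compensate:
            err = [a - b for a, b in zip(v, q)]
        gap = [c + a - b for c, a, b in zip(gap, g, q)]
    return gap, err

def sample_gradients(steps=50, dim=6, seed=0):
    """Synthetic gradient stream with a nonzero mean (toy data, not from the paper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.5, 1.0) for _ in range(dim)] for _ in range(steps)]
```

With compensate=True the gap telescopes to exactly the final residual, so what has been withheld in total is a single step's compression error. With compensate=False no such cancellation occurs, and the per-step errors of a biased compressor are free to accumulate.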

This review was generated by AI and reviewed by a human editor.