QUICK REVIEW

[Paper Review] Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization

Zhize Li, Dmitry Kovalev|arXiv (Cornell University)|2020. 02. 26.

Stochastic Gradient Optimization Techniques|References: 31|Citations: 37

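One-line summary

This paper introduces accelerated compressed gradient descent (ACGD) for single-machine problems and its distributed counterpart, ADIANA, for federated/distributed optimization, combining gradient compression with acceleration to improve both convergence rate and communication efficiency.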

ABSTRACT

Due to the high communication cost in distributed and federated learning problems, methods relying on compression of communicated messages are becoming increasingly popular. While in other contexts the best performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of iterations, there are no methods which combine the benefits of both gradient compression and acceleration. In this paper, we remedy this situation and propose the first accelerated compressed gradient descent (ACGD) methods. In the single machine regime, we prove that ACGD enjoys the rate $O\Big((1+ω)\sqrt{\frac{L}μ}\log \frac{1}ε\Big)$ for $μ$-strongly convex problems and $O\Big((1+ω)\sqrt{\frac{L}ε}\Big)$ for convex problems, respectively, where $ω$ is the compression parameter. Our results improve upon the existing non-accelerated rates $O\Big((1+ω)\frac{L}μ\log \frac{1}ε\Big)$ and $O\Big((1+ω)\frac{L}ε\Big)$, respectively, and recover the optimal rates of accelerated gradient descent as a special case when no compression ($ω=0$) is applied. We further propose a distributed variant of ACGD (called ADIANA) and prove the convergence rate $\widetilde{O}\Big(ω+\sqrt{\frac{L}μ}+\sqrt{\big(\fracω{n}+\sqrt{\fracω{n}}\big)\frac{ωL}μ}\Big)$, where $n$ is the number of devices/workers and $\widetilde{O}$ hides the logarithmic factor $\log \frac{1}ε$. This improves upon the previous best result $\widetilde{O}\Big(ω+ \frac{L}μ+\frac{ωL}{nμ} \Big)$ achieved by the DIANA method of Mishchenko et al. (2019). Finally, we conduct several experiments on real-world datasets which corroborate our theoretical results and confirm the practical superiority of our accelerated methods.

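Motivation and objectives

Motivate and introduce gradient compression combined with acceleration as a way to reduce the communication bottleneck in distributed/federated optimization.
Develop a theoretical framework and algorithms that achieve accelerated convergence under compressed communication.
Analyze the improvement in iteration and communication-round complexity over non-accelerated compressed baselines.
Demonstrate the practical performance of the proposed accelerated methods on real-world datasets.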

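Proposed method

Defines randomized compression operators that are unbiased and have bounded variance (Definition 1).
Proposes accelerated compressed gradient descent (ACGD) for single-machine smooth optimization (Algorithm 1).
Establishes convergence rates for ACGD in the convex and strongly convex cases; these recover the rates of accelerated GD when no compression is applied (ω=0).
Proposes ADIANA, an accelerated variant of DIANA for distributed optimization, which includes variance reduction to compensate for compression noise (Algorithm 2).
Derives convergence guarantees that improve on DIANA, covering the different regimes of ω and n, and gives parameter choices that attain the stated rates.
Provides experimental validation on standard datasets with several compression operators (random sparsification, random dithering, natural compression).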
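
As a rough illustration of the compression operators in Definition 1, the sketch below implements random-k sparsification in plain Python and estimates its first two moments by simulation. It assumes the usual form of the definition, E[C(x)] = x and E‖C(x) − x‖² ≤ ω‖x‖², under which random-k sparsification has ω = d/k − 1. The function names `rand_k`, `omega_rand_k`, and `empirical_moments` are illustrative and do not come from the paper.

```python
import random


def rand_k(x, k, rng):
    """Random-k sparsification (illustrative): keep k coordinates chosen
    uniformly at random, scale them by d/k, and zero out the rest."""
    d = len(x)
    keep = set(rng.sample(range(d), k))
    scale = d / k
    return [scale * v if i in keep else 0.0 for i, v in enumerate(x)]


def omega_rand_k(d, k):
    """Variance parameter of random-k sparsification: omega = d/k - 1."""
    return d / k - 1.0


def empirical_moments(x, k, trials=20000, seed=0):
    """Monte Carlo estimates of E[C(x)] and E||C(x) - x||^2."""
    rng = random.Random(seed)
    d = len(x)
    mean = [0.0] * d
    second_moment = 0.0
    for _ in range(trials):
        c = rand_k(x, k, rng)
        for i in range(d):
            mean[i] += c[i] / trials
        second_moment += sum((ci - xi) ** 2 for ci, xi in zip(c, x)) / trials
    return mean, second_moment


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0, 0.5]
    mean, err = empirical_moments(x, k=2)
    sq_norm = sum(v * v for v in x)
    print("estimated E[C(x)]:", mean)
    print("estimated E||C(x)-x||^2:", err)
    print("omega * ||x||^2:", omega_rand_k(len(x), 2) * sq_norm)
```

With k = d the operator is the identity and ω = 0, which corresponds to the no-compression case mentioned above. Smaller k sends fewer coordinates per round at the price of a larger ω.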

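Experimental results

Research questions

RQ1: Can gradient compression be combined with acceleration while preserving the accelerated convergence rate?
RQ2: What are the iteration and communication complexities of accelerated compressed gradient methods in the single-machine and distributed/federated settings?
RQ3: How do accelerated compressed methods compare with non-accelerated compressed methods (CGD, DIANA) across the regimes of ω and the number of devices n?
RQ4: Do experiments on real-world datasets confirm the theoretical improvements and demonstrate communication efficiency?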

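Key results

ACGD achieves accelerated convergence rates under compression: O((1+ω)√(L/μ) log(1/ε)) for μ-strongly convex problems and O((1+ω)√(L/ε)) for convex problems.
In the distributed setting, ADIANA achieves a better rate than DIANA: O(ω(1+√(L/(nμ)))) when ω ≥ n, and O(ω+√(L/μ)+√(√(ω/n)·ωL/μ)) when ω < n, both up to a log(1/ε) factor.
When ω ≤ min{n^{1/3}, √(L/μ)}, ADIANA matches uncompressed accelerated gradient descent in the number of communication rounds, so compression comes at no cost in iteration complexity.
Experiments on real-world datasets with three compression schemes show that ADIANA often converges faster, and uses fewer communicated bits, than DIANA and the uncompressed baselines.
Across the compressors tested, ADIANA with random dithering and natural compression shows marked communication-efficiency gains over DIANA and DCGD.
The results empirically support the theoretical improvements and the practicality of accelerated compressed methods for federated/distributed optimization.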
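
To make the comparison with DIANA concrete, the following sketch evaluates the two iteration-complexity expressions quoted in the abstract, dropping constants and the log(1/ε) factor. The function names and the sample values of κ = L/μ, ω, and n are hypothetical, chosen only for illustration; the outputs are order-of-magnitude indicators, not measured results.

```python
import math


def diana_rate(kappa, omega, n):
    """DIANA bound from the abstract, without constants or the log factor:
    omega + L/mu + omega*L/(n*mu), with kappa = L/mu."""
    return omega + kappa + omega * kappa / n


def adiana_rate(kappa, omega, n):
    """ADIANA bound from the abstract, without constants or the log factor:
    omega + sqrt(kappa) + sqrt((omega/n + sqrt(omega/n)) * omega * kappa)."""
    r = omega / n
    return omega + math.sqrt(kappa) + math.sqrt((r + math.sqrt(r)) * omega * kappa)


if __name__ == "__main__":
    kappa = 1e4  # hypothetical condition number L/mu
    for omega, n in [(0, 100), (4, 100), (10, 1000), (200, 100)]:
        print(
            f"omega={omega:>3}, n={n:>4}: "
            f"DIANA ~ {diana_rate(kappa, omega, n):.0f}, "
            f"ADIANA ~ {adiana_rate(kappa, omega, n):.0f}"
        )
```

With ω = 0 the ADIANA expression reduces to √(L/μ), the rate of accelerated gradient descent. In the regime ω ≤ min{n^{1/3}, √(L/μ)}, each of the three terms is at most a small constant times √(L/μ), which is the sense in which compression does not hurt the round complexity.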

This review was generated by AI and checked by a human editor.