QUICK REVIEW

[Paper Review] signSGD with Majority Vote is Communication Efficient And Fault Tolerant

Jeremy Bernstein, Jiawei Zhao | arXiv (Cornell University) | 2018-10-11

Retinal Imaging and Analysis | Citations: 125

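One-Line Summary

This paper introduces signSGD with majority vote, a robust distributed optimization method that communicates one bit per gradient coordinate. It converges under gradient noise, tolerates up to 50% Byzantine workers, and shows empirical speedups on large-scale tasks.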

ABSTRACT

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning---signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in Pytorch. Benchmarking against the state of the art collective communications library (NCCL), our framework---with the parameter server housed entirely on one machine---led to a 25% reduction in time for training resnet50 on Imagenet when using 15 AWS p3.2xlarge machines.
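
The 32x figure in the abstract follows from sending one bit per coordinate instead of a 32-bit float. The sketch below illustrates that arithmetic only. It is not the paper's PyTorch compression code; the helper names `pack_signs` and `unpack_signs` are hypothetical, and mapping a zero gradient to +1 is an assumption made here for simplicity.

```python
import struct

def pack_signs(grad):
    """Pack one bit per coordinate into bytes (1 = non-negative, 0 = negative)."""
    bits = 0
    for i, g in enumerate(grad):
        if g >= 0:  # assumption: zero is encoded as +1
            bits |= 1 << i
    return bits.to_bytes((len(grad) + 7) // 8, "little")

def unpack_signs(payload, n):
    """Recover the n signs (+1 or -1) from the packed bytes."""
    bits = int.from_bytes(payload, "little")
    return [1 if (bits >> i) & 1 else -1 for i in range(n)]

# 64 coordinates with alternating signs.
grad = [0.5 if i % 2 == 0 else -0.5 for i in range(64)]

full_precision = struct.pack("<64f", *grad)  # 64 float32 values
one_bit = pack_signs(grad)

print(len(full_precision), len(one_bit))  # 256 8
print(len(full_precision) // len(one_bit))  # 32
```

The ratio is exactly 32 only when the coordinate count is a multiple of 8; otherwise the last byte is partly padding.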

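Motivation and Objectives

Motivates the need for distributed training of large neural networks that is fast, robust, and communication efficient.
Proposes a simple update based on gradient signs, aggregated by majority vote, to compress communication and improve fault tolerance.
Provides convergence theory under realistic assumptions for mini-batch training and for a Byzantine fault model.
Demonstrates the practical benefits and trade-offs through large-scale experiments on ImageNet and language models.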

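Proposed Method

Workers compute mini-batch gradients and send the sign of their momentum to the parameter server.
The server aggregates the signs by majority vote and broadcasts the majority sign back to the workers.
The update rule is x <- x - eta * (sign(V) + lambda x), where V tallies the signs of the workers' momenta, eta is the learning rate, and lambda is the weight decay.
Provides a theoretical convergence analysis for non-convex objectives under Assumptions 1-4, which include unimodal, symmetric gradient noise.
Shows robustness to blind multiplicative adversaries (Theorem 2) when a fraction alpha < 1/2 of the workers is adversarial.
Presents a PyTorch implementation of Signum with 1-bit tensor compression, benchmarked against NCCL in large-scale experiments.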
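
The steps above can be sketched in a few lines of plain Python. This is a single-process illustration under stated assumptions, not the authors' distributed PyTorch system: the function names, the momentum form `(1 - beta) * g + beta * v`, and the default hyperparameters are illustrative choices, and the workers are simulated as list entries.

```python
def sign(v):
    """Return +1, -1, or 0 for a scalar."""
    return (v > 0) - (v < 0)

def worker_sign(grad, momentum, beta):
    """Update one worker's momentum in place and return its sign vector."""
    for i, g in enumerate(grad):
        momentum[i] = (1 - beta) * g + beta * momentum[i]
    return [sign(m) for m in momentum]

def majority_vote(sign_vectors):
    """Server side: tally the signs per coordinate and keep the sign of the tally.

    A tie (possible with an even worker count) yields 0 for that coordinate.
    """
    return [sign(sum(column)) for column in zip(*sign_vectors)]

def signum_step(x, grads, momenta, eta=0.01, lam=0.0, beta=0.9):
    """One step of x <- x - eta * (sign(V) + lambda * x)."""
    votes = [worker_sign(g, m, beta) for g, m in zip(grads, momenta)]
    v = majority_vote(votes)
    return [xi - eta * (vi + lam * xi) for xi, vi in zip(x, v)]

# Three simulated workers, two parameters; the third worker disagrees
# with the others on the first coordinate and is outvoted.
x = [1.0, -2.0]
grads = [[0.5, -1.0], [0.2, -0.3], [-0.1, -0.4]]
momenta = [[0.0, 0.0] for _ in grads]
new_x = signum_step(x, grads, momenta, eta=0.1)
print(new_x)
```

Each worker sends only a sign vector and receives only a sign vector, so both directions of communication carry one bit per coordinate.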

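Experimental Results

Research Questions

RQ1: Does signSGD with majority vote converge in the non-convex setting under realistic gradient-noise assumptions?
RQ2: How does majority voting affect robustness to Byzantine or adversarial worker behavior?
RQ3: What are the communication and wall-clock benefits on large datasets compared with full-precision SGD and other compression schemes?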

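Key Results

Under the stated assumptions, the theoretical convergence rate of mini-batch signSGD matches that of SGD at O(1/√N).
Majority vote provides Byzantine fault tolerance with up to 50% adversarial workers; convergence slows but remains guaranteed.
In the experiments, ImageNet training on 7–15 AWS p3.2xlarge machines is up to 25% faster in wall-clock time, with slightly worse generalization.
Signum with majority vote beats some QSGD variants on communication cost while keeping convergence competitive.
Real gradient noise is mostly unimodal and symmetric, which supports the method's assumptions and its convergence.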
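
The 50% threshold can be seen in a toy case. The sketch below assumes noise-free honest workers that all report the same true sign vector, and adversaries that simply invert it (one of the special cases named in the abstract). The paper's guarantee concerns stochastic gradients, so this is only an intuition aid, and `vote_with_inverters` is a hypothetical helper.

```python
def sign(v):
    """Return +1, -1, or 0 for a scalar."""
    return (v > 0) - (v < 0)

def majority_vote(sign_vectors):
    """Tally the signs per coordinate and keep the sign of each tally."""
    return [sign(sum(column)) for column in zip(*sign_vectors)]

def vote_with_inverters(true_signs, n_workers, n_adversaries):
    """Honest workers send true_signs; adversaries send the negation."""
    honest = [list(true_signs) for _ in range(n_workers - n_adversaries)]
    inverted = [[-s for s in true_signs] for _ in range(n_adversaries)]
    return majority_vote(honest + inverted)

true_signs = [1, -1, 1, 1]
n_workers = 7

# The vote stays correct while adversaries are a strict minority (0 to 3 of 7)
# and flips once they become the majority (4 or more of 7).
for k in range(n_workers + 1):
    correct = vote_with_inverters(true_signs, n_workers, k) == true_signs
    print(k, correct)
```

With averaging instead of voting, a single worker sending a large enough value could dominate the update; in a vote, each worker contributes at most one unit per coordinate.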

This review was generated by AI and checked by a human editor.