QUICK REVIEW

[Paper Review] QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Dan Alistarh, Demjan Grubic|arXiv (Cornell University)|2016. 10. 07.

Stochastic Gradient Optimization Techniques | Citations: 907

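One-Line Summary

QSGD combines stochastic gradient quantization with Elias coding to cut the communication volume of data-parallel SGD, providing convergence guarantees and practical speedups on deep networks without hurting accuracy.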

ABSTRACT

Parallel implementations of stochastic gradient descent (SGD) have received significant research attention, thanks to excellent scalability properties of this algorithm, and to its efficiency in the context of training deep neural networks. A fundamental barrier for parallelizing large-scale SGD is the fact that the cost of communicating the gradient updates between nodes can be very large. Consequently, lossy compression heuristics have been proposed, by which nodes only communicate quantized gradients. Although effective in practice, these heuristics do not always provably converge, and it is not clear whether they are optimal. In this paper, we propose Quantized SGD (QSGD), a family of compression schemes which allow the compression of gradient updates at each node, while guaranteeing convergence under standard assumptions. QSGD allows the user to trade off compression and convergence time: it can communicate a sublinear number of bits per iteration in the model dimension, and can achieve asymptotically optimal communication cost. We complement our theoretical results with empirical data, showing that QSGD can significantly reduce communication cost, while being competitive with standard uncompressed techniques on a variety of real tasks. In particular, experiments show that gradient quantization applied to training of deep neural networks for image classification and automated speech recognition can lead to significant reductions in communication cost, and end-to-end training time. For instance, on 16 GPUs, we are able to train a ResNet-152 network on ImageNet 1.8x faster to full accuracy. Of note, we show that there exist generic parameter settings under which all known network architectures preserve or slightly improve their full accuracy when using quantization.

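Motivation and Goals

Remove the communication bottleneck in gradient exchange so that data-parallel SGD can scale.
Develop a quantized SGD framework with convergence guarantees for both convex and non-convex objectives.
Provide a practical encoding scheme that compresses quantized gradients efficiently without breaking convergence.
Demonstrate real reductions in end-to-end training time and show applicability to deep neural networks.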

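Proposed Method

Proposes an s-level stochastic gradient quantizer Q_s(v) that keeps the gradient estimate unbiased while controlling its variance.
Encodes the quantized gradients with a lossless Elias-based coding scheme that exploits the distribution of the quantized values.
Controls variance through the bucket size d and scales by the vector norm for stability.
Theoretical bounds: the variance added by quantization is at most min(n/s^2, sqrt(n)/s) * ||v||^2 (so the second moment grows by a factor of at most 1 + min(n/s^2, sqrt(n)/s)), together with a bound on the per-round communication length.
Extends to variants with convergence guarantees, including QSVRG (variance-reduced) and the non-convex setting.
Gives practical implementation notes on bucketing, max normalization, and GPU-friendly coding.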
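
The quantizer described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names (`quantize`, `dequantize`, `quantize_bucketed`) are hypothetical, scaling uses the 2-norm of the vector (the review also mentions max normalization as a practical alternative), and the levels are the uniformly spaced values 0, 1/s, ..., 1. Each coordinate is rounded up or down at random so that its expected value equals the original coordinate.

```python
import math
import random


def quantize(v, s, rng=random):
    """Stochastically quantize v onto s uniform levels (a sketch of Q_s(v)).

    Returns (norm, signs, levels); entry i is reconstructed as
    norm * signs[i] * levels[i] / s, with levels[i] an integer in [0, s].
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return 0.0, [1] * len(v), [0] * len(v)
    signs, levels = [], []
    for x in v:
        scaled = abs(x) / norm * s          # position on the level grid, in [0, s]
        lower = math.floor(scaled)          # nearest level below
        # Round up with probability equal to the fractional part, so the
        # expected level equals `scaled` and the estimate stays unbiased.
        level = lower + (1 if rng.random() < scaled - lower else 0)
        signs.append(1 if x >= 0 else -1)
        levels.append(min(level, s))        # guard against float round-off
    return norm, signs, levels


def dequantize(norm, signs, levels, s):
    """Rebuild the (noisy but unbiased) gradient estimate."""
    return [norm * sg * lv / s for sg, lv in zip(signs, levels)]


def quantize_bucketed(v, s, d, rng=random):
    """Quantize consecutive buckets of d entries, each with its own norm."""
    return [quantize(v[i:i + d], s, rng) for i in range(0, len(v), d)]


if __name__ == "__main__":
    rng = random.Random(0)
    v = [0.3, -1.2, 0.5, 2.0]
    trials = 20000
    mean = [0.0] * len(v)
    for _ in range(trials):
        q = dequantize(*quantize(v, 4, rng), 4)
        mean = [m + x / trials for m, x in zip(mean, q)]
    print(mean)  # the average of many quantized copies approaches v
```

Smaller buckets shrink the n in the variance bound for each bucket, at the cost of sending one extra norm per bucket.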

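Experimental Results

Research Questions

RQ1: How does gradient quantization affect the convergence guarantees of parallel SGD under convex and non-convex objectives?
RQ2: What is the trade-off in QSGD between communication bits per iteration and convergence/variance?
RQ3: Can stochastic quantization with efficient encoding achieve substantial communication reduction on deep neural networks without accuracy loss?
RQ4: How do QSGD variants (including the variance-reduced version) perform in practice compared with full-precision SGD?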

Key Results

Network	Dataset	Params.	Init. Rate	Top-1 (32bit)	Top-1 (QSGD)	Speedup (8 GPUs)
AlexNet	ImageNet	62M	0.07	59.50%	60.05% (4bit)	2.05×
ResNet152	ImageNet	60M	1	77.0%	76.74% (8bit)	1.56×
ResNet50	ImageNet	25M	1	74.68%	74.76% (4bit)	1.26×
ResNet110	CIFAR-10	1M	0.1	93.86%	94.19% (4bit)	1.10×
BN-Inception	ImageNet	11M	3.6	-	-	1.16× (projected)
VGG19	ImageNet	143M	0.1	-	-	2.25× (projected)
LSTM	AN4	13M	0.5	81.13%	81.15% (4bit)	2× (2 GPUs)

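QSGD achieves substantial communication reduction with convergence guarantees, enabling practical speedups on GPUs.
In the dense regime (s = sqrt(n)), per-round communication drops to at most 2.8n + 32 bits, at the cost of at most a 2x increase in variance.
Two extremes: on one end (the sparse regime, s = 1), O(sqrt(n) (log n + O(1))) expected bits per iteration with a variance blowup of up to O(sqrt(n)); on the other, at most 2.8n + 32 bits per iteration with roughly twice as many iterations.
Experiments on ImageNet classifiers and an LSTM speech model show substantial training-time reductions with little or no accuracy loss (e.g., AlexNet on 16 GPUs: 4x less communication, 2.5x faster epochs; ResNet-152 on 16 GPUs: 1.8x faster end to end, to full accuracy).
QSGD variants such as QSVRG retain exponential convergence while keeping per-epoch communication favorable relative to the problem's condition number.
Quantization noise can slightly improve accuracy in some settings, consistent with reported benefits of gradient noise in deep learning.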
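
To make the sparse-regime bit count concrete, the sketch below encodes an s = 1 quantized vector with Elias omega (recursive) coding and compares the result with 32 bits per coordinate. It is an illustration under stated assumptions, not the paper's reference encoder: the layout (gap to the next nonzero, one sign bit, then the level) and all function names (`elias_omega`, `decode_omega`, `encode_sparse`, `decode_sparse`, `demo`) are chosen here for the example, the norm is assumed to travel separately as one 32-bit float, and the input is a synthetic Gaussian vector.

```python
import math
import random


def elias_omega(k):
    """Elias omega code of a positive integer, returned as a bit string."""
    if k < 1:
        raise ValueError("Elias omega is defined for integers >= 1")
    code = "0"                        # terminating bit
    while k > 1:
        binary = bin(k)[2:]           # binary form of k, always starts with '1'
        code = binary + code          # prepend this group
        k = len(binary) - 1           # next group encodes the length minus one
    return code


def decode_omega(bits, pos=0):
    """Decode one omega-coded integer at bits[pos]; return (value, next_pos)."""
    n = 1
    while bits[pos] == "1":
        group = bits[pos:pos + n + 1]
        pos += n + 1
        n = int(group, 2)
    return n, pos + 1                 # skip the terminating '0'


def encode_sparse(signs, levels):
    """Encode each nonzero entry as omega(gap) + sign bit + omega(level)."""
    out, last = [], -1
    for i, level in enumerate(levels):
        if level == 0:
            continue
        out.append(elias_omega(i - last))           # gap to previous nonzero
        out.append("1" if signs[i] < 0 else "0")    # one sign bit
        out.append(elias_omega(level))              # quantization level >= 1
        last = i
    return "".join(out)


def decode_sparse(bits, n):
    """Inverse of encode_sparse for a vector of length n."""
    signs, levels = [1] * n, [0] * n
    pos, last = 0, -1
    while pos < len(bits):
        gap, pos = decode_omega(bits, pos)
        last += gap
        signs[last] = -1 if bits[pos] == "1" else 1
        pos += 1
        levels[last], pos = decode_omega(bits, pos)
    return signs, levels


def demo(n=10_000, seed=0):
    """Return (compressed bits, uncompressed 32-bit baseline) for s = 1."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    signs = [1 if x >= 0 else -1 for x in v]
    # s = 1: coordinate i survives with probability |v_i| / ||v||_2.
    levels = [1 if rng.random() < abs(x) / norm else 0 for x in v]
    bits = encode_sparse(signs, levels)
    return len(bits) + 32, 32 * n     # +32 for the norm sent as one float


if __name__ == "__main__":
    compressed, baseline = demo()
    print(compressed, baseline)
```

With s = 1 only about sqrt(n) coordinates are expected to survive, so most of the message is spent on gaps, which is why a code that is short for small integers suits this regime.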

This review was generated by AI and reviewed by a human editor.