QUICK REVIEW

[Paper Review] Large scale distributed neural network training through online distillation

Rohan Anil, Gabriel Pereyra|arXiv (Cornell University)|2018. 04. 09.

Advanced Neural Network Applications | References: 21 | Citations: 152

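One-Line Summary

This paper introduces codistillation, an online distillation variant that trains multiple models in parallel and encourages their predictions to agree. It allows training to scale beyond the point where standard distributed SGD stops benefiting from more parallelism, and it improves reproducibility at no additional test-time cost.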

ABSTRACT

Techniques such as ensembling and distillation promise model quality improvements when paired with almost any base model. However, due to increased test-time cost (for ensembles) and increased complexity of the training pipeline (for distillation), these techniques are challenging to use in industrial settings. In this paper we explore a variant of distillation which is relatively straightforward to use as it does not require a complicated multi-stage setup or many new hyperparameters. Our first claim is that online distillation enables us to use extra parallelism to fit very large datasets about twice as fast. Crucially, we can still speed up training even after we have already reached the point at which additional parallelism provides no benefit for synchronous or asynchronous stochastic gradient descent. Two neural networks trained on disjoint subsets of the data can share knowledge by encouraging each model to agree with the predictions the other model would have made. These predictions can come from a stale version of the other model so they can be safely computed using weights that only rarely get transmitted. Our second claim is that online distillation is a cost-effective way to make the exact predictions of a model dramatically more reproducible. We support our claims using experiments on the Criteo Display Ad Challenge dataset, ImageNet, and the largest to-date dataset used for neural language modeling, containing $6 \times 10^{11}$ tokens and based on the Common Crawl repository of web data.

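Motivation and Objectives

Motivate the need for scalable training of large neural networks beyond what conventional distributed SGD can deliver.
Propose codistillation, a simple online distillation approach in which multiple models train in parallel and share knowledge.
Demonstrate that codistillation speeds up training and improves accuracy without increasing test-time cost.
Investigate the practical considerations and design choices involved in applying codistillation to real, large-scale datasets.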

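Proposed Method

Train n copies of a model on different subsets of the data, adding a distillation term that encourages each copy to agree with the average prediction of the other copies.
Compute the distillation loss from stale predictions of the other models, which keeps communication requirements low.
Enable distillation only after an initial burn-in period; the base loss is kept and the distillation term is added to it to form the combined objective.
Show that codistillation can be combined with distributed SGD by grouping workers and exchanging checkpoints between groups.
Discuss implementation considerations such as communicating through checkpoints or a prediction server, and tolerance to stale predictions.
Compare codistillation with ensembling and offline distillation, focusing on training-time efficiency and reproducibility.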
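
The combined objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses cross entropy against the averaged peer prediction as the distillation term, and the names `codistillation_loss`, `burn_in_steps`, and `distill_weight` are hypothetical. The peer logits are assumed to come from a stale checkpoint of the other models.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, logits):
    """Cross entropy between a target distribution (one-hot or soft) and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs) if t > 0.0)


def average_peer_prediction(peer_logits):
    """Average the predicted distributions of the other model copies."""
    peer_probs = [softmax(logits) for logits in peer_logits]
    n = len(peer_probs)
    return [sum(column) / n for column in zip(*peer_probs)]


def codistillation_loss(label_onehot, own_logits, stale_peer_logits, step,
                        burn_in_steps=1000, distill_weight=1.0):
    """Base loss plus a distillation term that is switched on after burn-in.

    burn_in_steps and distill_weight are illustrative hyperparameters
    (assumptions, not values taken from the paper).
    """
    base = cross_entropy(label_onehot, own_logits)
    if step < burn_in_steps or not stale_peer_logits:
        # During burn-in each copy trains on the base loss alone.
        return base
    # The teacher is the average prediction of the (possibly stale) peers.
    teacher = average_peer_prediction(stale_peer_logits)
    return base + distill_weight * cross_entropy(teacher, own_logits)
```

Because the teacher distribution is computed from peer weights that are refreshed only occasionally, each model copy needs the other copies' checkpoints rather than their gradients, which is what keeps the communication requirement low.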

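Experimental Results

Research Questions

RQ1: Can online codistillation use additional parallelism to train faster than distributed SGD, without a multi-stage distillation pipeline?
RQ2: Does codistillation maintain or improve final model accuracy and reproducibility compared with conventional distillation and ensembling?
RQ3: How do stale predictions and the communication strategy affect the effectiveness and practicality of codistillation in large-scale settings?
RQ4: Which practical design choices maximize the benefits of codistillation while keeping pipeline complexity minimal?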

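Key Findings

Codistillation enables faster training by putting additional compute to productive use even after conventional SGD shows diminishing returns.
Two-way codistillation reaches the same validation error as the baseline in roughly half the training time and can approach the performance of an ensemble.
Codistillation offers reproducibility benefits similar to those of ensembling, reducing variability in predictions without increasing serving cost.
Codistilling models on different subsets of the data yields larger gains than using the same data, which suggests that knowledge is successfully transferred across data partitions.
Codistillation remains effective with stale predictions, can be integrated with synchronous or asynchronous SGD, and keeps communication cost manageable.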
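
One simple way to quantify the prediction variability mentioned above is the mean absolute difference between the predictions of two independently retrained models on the same examples. The sketch below assumes that metric and scalar predictions (for example, click probabilities); the function name `mean_abs_prediction_diff` is hypothetical, and the review itself does not specify the exact measure used.

```python
def mean_abs_prediction_diff(preds_a, preds_b):
    """Mean absolute difference between two retrains' predictions on the same examples.

    Lower values mean the two training runs agree more closely,
    i.e. the model's predictions are more reproducible.
    """
    if len(preds_a) != len(preds_b):
        raise ValueError("both runs must score the same examples")
    if not preds_a:
        raise ValueError("at least one prediction is required")
    total = sum(abs(a - b) for a, b in zip(preds_a, preds_b))
    return total / len(preds_a)
```

Computing this value for a pair of baseline retrains and for a pair of codistilled retrains, on the same held-out examples, would give a direct comparison of how much each training procedure reduces run-to-run churn.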

This review was generated by AI and reviewed by a human editor.