QUICK REVIEW

[Paper Review] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini | arXiv (Cornell University) | 2017-01-23

Advanced Neural Network Applications | Citations: 268

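One-Line Summary

Introduces a sparsely-gated Mixture-of-Experts (MoE) layer built from up to thousands of experts, scaling model capacity to as many as 137B parameters while keeping computational cost practical. Demonstrated on language modeling and machine translation, the approach gains substantially from the added capacity with only minor losses in computational efficiency.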

ABSTRACT

The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significant algorithmic and performance challenges. In this work, we address these challenges and finally realize the promise of conditional computation, achieving greater than 1000x improvements in model capacity with only minor losses in computational efficiency on modern GPU clusters. We introduce a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks. A trainable gating network determines a sparse combination of these experts to use for each example. We apply the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora. We present model architectures in which a MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers. On large language modeling and machine translation benchmarks, these models achieve significantly better results than state-of-the-art at lower computational cost.

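Motivation and Objectives

Motivates conditional computation as a way to increase model capacity dramatically without a proportional increase in computation.
Proposes and implements a sparsely-gated Mixture-of-Experts layer with up to thousands of experts.
Evaluates MoE-augmented architectures on language modeling and machine translation benchmarks.
Addresses practical training problems in large-scale MoE systems, such as batching, network bandwidth, and load balancing.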

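Proposed Method

Defines an MoE layer consisting of many expert networks and a trainable gating network that selects a sparse subset of experts for each input.
Computes gating weights with either plain softmax gating or noisy top-k gating; the noisy top-k variant keeps only the k largest gate values per example, which is what makes the expert selection sparse.
Trains the gating network and the expert networks jointly by backpropagation, together with mechanisms that encourage balanced load and keep a few experts from dominating.
Addresses performance issues by mixing data parallelism and model parallelism to increase the effective batch size seen by each expert.
Applies the MoE convolutionally between stacked LSTM layers, so that a gating decision is made at each position in the sequence.
Experiments with architectures that insert MoE layers between LSTM layers, including very high-capacity configurations with thousands of experts.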
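
The noisy top-k gating and sparse expert combination described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' implementation: the names `noisy_top_k_gates`, `moe_forward`, `w_gate`, and `w_noise` are hypothetical, weights are plain nested lists indexed `[input_dim][expert]`, and passing `rng=None` switches the noise term off so the result is deterministic.

```python
import math
import random


def softplus(z):
    # Numerically stable log(1 + exp(z)).
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def noisy_top_k_gates(x, w_gate, w_noise, k, rng=None):
    """Return one gate weight per expert; at most k of them are non-zero.

    Assumed layout (hypothetical): w_gate[d][i] and w_noise[d][i] map
    input dimension d to expert i. rng is a random.Random or None.
    """
    n_experts = len(w_gate[0])
    if not 1 <= k <= n_experts:
        raise ValueError("k must be between 1 and the number of experts")
    logits = []
    for i in range(n_experts):
        clean = sum(x[d] * w_gate[d][i] for d in range(len(x)))
        if rng is None:
            logits.append(clean)  # noise disabled
            continue
        raw_noise = sum(x[d] * w_noise[d][i] for d in range(len(x)))
        # Gaussian noise scaled by a learned, input-dependent softplus term.
        logits.append(clean + rng.gauss(0.0, 1.0) * softplus(raw_noise))
    # Keep the k largest logits; the others act as -inf, so their gate is 0.
    top = sorted(range(n_experts), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    gates = [0.0] * n_experts
    for i in top:
        gates[i] = exps[i] / total  # softmax over the surviving logits only
    return gates


def moe_forward(x, experts, gates):
    """Gate-weighted sum of expert outputs, skipping experts whose gate is 0."""
    out = None
    for gate, expert in zip(gates, experts):
        if gate == 0.0:
            continue  # conditional computation: this expert is never evaluated
        y = expert(x)
        if out is None:
            out = [gate * v for v in y]
        else:
            out = [o + gate * v for o, v in zip(out, y)]
    return out
```

Each expert here is any callable that maps an input list to an output list of a common length. Because experts with a zero gate are skipped entirely, the per-example cost depends on k rather than on the total number of experts, which is the property the review attributes to the sparse gate.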

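Experimental Results

Research Questions

RQ1: How can conditional computation expand the capacity of a neural network while preserving computational efficiency?
RQ2: Which gating strategy (softmax versus noisy top-k) and which architectural placement perform best?
RQ3: Can MoE-based models achieve state-of-the-art results on large-scale language modeling and machine translation benchmarks within a realistic computational budget?
RQ4: What practical training and deployment problems arise, such as batching, bandwidth, and load balancing, and how can they be mitigated?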
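
One way to make the load-balancing concern in RQ4 concrete is the importance-based auxiliary loss from the paper: sum each expert's gate values over a batch, then penalize the squared coefficient of variation of those sums. The sketch below is illustrative only. The name `importance_loss`, the default `weight`, and the use of population variance are assumptions, and the paper's separate load-based loss is not covered.

```python
def importance_loss(gate_rows, weight=0.1):
    """Scaled squared coefficient of variation of per-expert importance.

    gate_rows holds one list of gate weights per example in the batch.
    weight is a hand-tuned scaling factor (the default is a placeholder).
    """
    n_experts = len(gate_rows[0])
    # Importance of an expert = batch-wise sum of its gate values.
    importance = [sum(row[i] for row in gate_rows) for i in range(n_experts)]
    mean = sum(importance) / n_experts
    if mean == 0.0:
        raise ValueError("all gate values are zero; importance is undefined")
    # Population variance is an assumption made for this sketch.
    variance = sum((v - mean) ** 2 for v in importance) / n_experts
    return weight * variance / (mean ** 2)
```

The loss is zero when every expert receives the same total gate weight and grows as the gate concentrates on a few experts, so adding it to the training objective pushes the gating network toward using all experts.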

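Key Findings

MoE models with thousands of experts achieved significantly better results than the state of the art on large language modeling benchmarks, at lower computational cost.
On the 1B-word language modeling setup, large MoE capacity reduced perplexity by up to 24%.
On the Google News 100B-word corpus, MoE models scaled to as many as 137B parameters, and added capacity yielded larger perplexity gains when more training data was available.
In machine translation, MoE-augmented GNMT-style models improved BLEU over strong baselines on several language pairs, and some configurations trained in less time.
In multilingual translation experiments, the model improved substantially over the multilingual baseline, with better perplexity and BLEU on most language pairs.
The paper suggests that, with further hardware scaling, MoE-based conditional computation could make training trillion-parameter models feasible.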

This review was generated by AI and reviewed by a human editor.