QUICK REVIEW

[Paper Review] From Sparse to Soft Mixtures of Experts

Joan Puigcerver, Carlos Riquelme|arXiv (Cornell University)|Aug 2, 2023

Domain Adaptation and Few-Shot Learning | Cited by 23

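One-line summary

Soft MoE introduces a fully differentiable sparse Transformer that passes each expert a soft mixture of all input tokens, achieving training stability and scalable capacity at low inference cost and outperforming ViTs and conventional MoEs on vision tasks.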

ABSTRACT

Sparse mixture of expert architectures (MoEs) scale model capacity without significant increases in training or inference costs. Despite their success, MoEs suffer from a number of issues: training instability, token dropping, inability to scale the number of experts, or ineffective finetuning. In this work, we propose Soft MoE, a fully-differentiable sparse Transformer that addresses these challenges, while maintaining the benefits of MoEs. Soft MoE performs an implicit soft assignment by passing different weighted combinations of all input tokens to each expert. As in other MoEs, experts in Soft MoE only process a subset of the (combined) tokens, enabling larger model capacity (and performance) at lower inference cost. In the context of visual recognition, Soft MoE greatly outperforms dense Transformers (ViTs) and popular MoEs (Tokens Choice and Experts Choice). Furthermore, Soft MoE scales well: Soft MoE Huge/14 with 128 experts in 16 MoE layers has over 40x more parameters than ViT Huge/14, with only 2% increased inference time, and substantially better quality.

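Motivation and objectives

Scale Transformer models without excessive compute or memory cost.
Resolve the training instability and token dropping of conventional sparse MoEs while keeping the benefits of expert specialization.
Propose a fully differentiable soft routing mechanism that makes thousands of experts feasible.
Demonstrate the approach on image classification and compare it with ViTs and existing MoEs.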

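Proposed method

Define Soft MoE as a fully differentiable layer in which routing is achieved by soft assignment.
Compute dispatch weights and combine weights from the same token-slot logits: a softmax over tokens (per slot) gives the dispatch weights, and a softmax over slots (per token) gives the combine weights (cf. Eqs. (1) to (3) of the paper).
Process each input slot with its corresponding expert (typically an MLP).
To stabilize training, L2-normalize the inputs per token and the router parameters per slot.
Replace the dense MLP blocks of a Transformer with Soft MoE blocks, and control the compute cost through the total number of slots.
A simple JAX implementation is provided, and the full code is available in the Google Research GitHub repository (vmoe).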
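
The routing steps listed above can be sketched in JAX as follows. This is a minimal illustration under stated assumptions, not the authors' vmoe code: the function names `l2_normalize`, `soft_moe_weights`, and `soft_moe_layer`, the bias-free two-layer GELU experts, the scalar `scale`, and all tensor shapes are hypothetical choices made for the example.

```python
import jax
import jax.numpy as jnp


def l2_normalize(x, axis, eps=1e-6):
    # Rescale x to (approximately) unit L2 norm along `axis`.
    norm = jnp.sqrt(jnp.square(x).sum(axis=axis, keepdims=True))
    return x / (norm + eps)


def soft_moe_weights(x, phi, scale):
    # x: (m, d) input tokens; phi: (d, n, p) per-slot parameters
    # for n experts with p slots each; scale: scalar (assumed learnable).
    x_n = l2_normalize(x, axis=1)              # normalize each token
    phi_n = scale * l2_normalize(phi, axis=0)  # normalize each slot vector
    logits = jnp.einsum("md,dnp->mnp", x_n, phi_n)
    # Dispatch weights: softmax over the m tokens, separately per slot.
    dispatch = jax.nn.softmax(logits, axis=0)
    # Combine weights: softmax over all n*p slots, separately per token.
    combine = jax.nn.softmax(logits, axis=(1, 2))
    return dispatch, combine


def soft_moe_layer(x, phi, scale, w1, w2):
    # w1: (n, d, h) and w2: (n, h, d) are per-expert MLP weights
    # (biases omitted; this expert shape is an assumption of the sketch).
    dispatch, combine = soft_moe_weights(x, phi, scale)
    # Each slot is a weighted average of all input tokens.
    slots = jnp.einsum("md,mnp->npd", x, dispatch)
    # Expert i processes only its own p slots.
    hidden = jax.nn.gelu(jnp.einsum("npd,ndh->nph", slots, w1))
    slot_out = jnp.einsum("nph,nhd->npd", hidden, w2)
    # Each output token is a weighted average of all slot outputs.
    return jnp.einsum("npd,mnp->md", slot_out, combine)


# Toy usage: 8 tokens of width 16, 4 experts with 1 slot each.
m, d, n, p, h = 8, 16, 4, 1, 32
k1, k2, k3, k4 = jax.random.split(jax.random.PRNGKey(0), 4)
x = jax.random.normal(k1, (m, d))
phi = jax.random.normal(k2, (d, n, p))
w1 = 0.1 * jax.random.normal(k3, (n, d, h))
w2 = 0.1 * jax.random.normal(k4, (n, h, d))
y = soft_moe_layer(x, phi, 1.0, w1, w2)  # y has shape (m, d)
```

In this sketch the expert MLPs run on n * p slots regardless of the number of input tokens, which is why the total slot count sets the expert compute. No token is dropped, because every token contributes to every slot with a nonzero weight.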

Figure 1 : Main differences between Sparse and Soft MoE layers. While the router in Sparse MoE layers (left) learns to assign individual input tokens to each of the available slots, in Soft MoE layers (right) each slot is the result of a (different) weighted average of all the input tokens. Learning

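Experimental results

Research questions

RQ1: Can Soft MoE match or exceed the accuracy of dense ViTs and existing sparse MoEs across training and inference budgets?
RQ2: Does soft routing mitigate conventional MoE problems at scale, such as token dropping and expert imbalance?
RQ3: How does Soft MoE scale as the number of experts and the number of slots per expert grow, and which configuration works best for vision tasks?
RQ4: Do the benefits of Soft MoE extend to downstream tasks such as image-text contrastive learning?
RQ5: What are the trade-offs in training time, FLOPs, and wall-clock time relative to dense and sparse baselines?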

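Key findings

Soft MoE dominates both dense ViTs and popular sparse MoEs on the Pareto frontier of training cost versus performance across several model sizes.
Soft MoE Base/16 matches the performance of ViT-Huge/14 at 10.5x lower inference cost and 5.7x lower wall-clock time.
With 128 experts in 16 MoE layers, Soft MoE Huge/14 has over 40x more parameters than ViT Huge/14, yet inference time rises by only about 2% and quality is substantially better.
In long training runs, Soft MoE models outperform Vision Transformers at a comparable compute budget, with marked gains for smaller backbones and competitive or better results at larger scales.
The Soft MoE B/16 and L/16 variants achieve strong upstream and fine-tuning results while delivering large inference-time speedups over ViT baselines (for example, Soft MoE L/16 outperforms the dense H/14 and is also faster).
Soft MoE scales with the number of experts: one slot per expert, with hundreds to thousands of experts, improves performance without an excessive increase in cost.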

Figure 2 : The Soft MoE routing algorithm. Soft MoE first computes scores or logits for every pair of input token and slot, based on some learnable per-slot parameters. These logits are then normalized per slot (columns) and every slot computes a linear combination of all the input tokens based on t

This review was generated by AI and checked by a human editor.