QUICK REVIEW

[Paper Review] SLIDE: In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

Beidi Chen, Tharun Medini|arXiv (Cornell University)|Mar 7, 2019

Advanced Neural Network Applications|References: 43|Citations: 44

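One-line summary

SLIDE is a CPU-based approach that uses locality-sensitive hashing to sparsify computation and exploits multi-core parallelism. On large fully connected networks it reaches comparable accuracy while beating GPU-accelerated TensorFlow in wall-clock time, with reported speedups ranging from 3.5x (over TF on a Tesla V100) to more than 10x (over TF on the same CPU).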

ABSTRACT

Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters to maintain enough capacity to memorize these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, specialized hardware is expensive and hard to generalize to a multitude of tasks. The progress on the algorithmic front has failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine) that uniquely blends smart randomized algorithms, with multi-core parallelism and workload optimization. Using just a CPU, SLIDE drastically reduces the computations during both training and inference outperforming an optimized implementation of Tensorflow (TF) on the best available GPU. Our evaluations on industry-scale recommendation datasets, with large fully connected architectures, show that training with SLIDE on a 44 core CPU is more than 3.5 times (1 hour vs. 3.5 hours) faster than the same network trained using TF on Tesla V100 at any given accuracy level. On the same CPU hardware, SLIDE is over 10x faster than TF. We provide codes and scripts for reproducibility.

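Motivation and objectives

Explore algorithmic alternatives to hardware acceleration for large-scale deep learning.
Propose a practical CPU-based system (SLIDE) that cuts computation by exploiting adaptive sparsity.
Demonstrate that smart algorithms can outperform GPU-accelerated baselines on industry-scale datasets.
Provide reproducible code and benchmarks to validate the approach.
Analyze the performance characteristics and bottlenecks of the proposed system.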

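Proposed method

Apply locality-sensitive hashing (LSH) to sparsify neuron activations, enabling sub-linear selection of candidate neurons.
In each layer, use K LSH hash functions to produce a sparse set of active neurons for the forward pass.
Run sparse backpropagation that updates only the active connections, using asynchronous SGD (HOGWILD-style).
Exploit batch-level independence with multi-core OpenMP parallelism, achieving near-linear scaling.
Speed up CPU execution with memory- and cache-aware optimizations such as HugePages and SIMD.
Compare SLIDE against TF-GPU and TF-CPU on large fully connected networks using the Delicious-200K and Amazon-670K datasets.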
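
The LSH-based neuron selection and sparse update described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes SimHash (signed random projections) as the hash family, several hash tables each keyed by a K-bit signature, a ReLU activation, and hypothetical function names (`make_planes`, `build_tables`, `active_neurons`, `sparse_forward`, `sparse_update`). It is single-threaded, so the HOGWILD-style asynchronous updates and OpenMP parallelism are left out.

```python
import random

def make_planes(num_tables, k_bits, dim, rng):
    # Assumption: SimHash with k_bits random hyperplanes per table.
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k_bits)]
            for _ in range(num_tables)]

def simhash(planes, vec):
    # K-bit signature: one sign bit per random hyperplane.
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def build_tables(all_planes, weights):
    # Index every neuron by the signature of its weight vector, once per table.
    tables = []
    for planes in all_planes:
        table = {}
        for neuron_id, w in enumerate(weights):
            table.setdefault(simhash(planes, w), []).append(neuron_id)
        tables.append(table)
    return tables

def active_neurons(all_planes, tables, x):
    # Union of the buckets the input lands in; no scan over all neurons.
    active = set()
    for planes, table in zip(all_planes, tables):
        active.update(table.get(simhash(planes, x), []))
    return sorted(active)

def sparse_forward(weights, x, active):
    # Compute (ReLU) activations for the active neurons only.
    return {j: max(0.0, sum(w * v for w, v in zip(weights[j], x)))
            for j in active}

def sparse_update(weights, x, grads, lr):
    # SGD step that touches only the weight rows of neurons present in grads.
    # grads maps neuron id -> gradient w.r.t. that neuron's pre-activation.
    for j, g in grads.items():
        weights[j] = [w - lr * g * v for w, v in zip(weights[j], x)]

# Toy usage with illustrative sizes (not taken from the paper).
rng = random.Random(0)
dim, n_neurons = 16, 200
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_neurons)]
planes = make_planes(num_tables=4, k_bits=6, dim=dim, rng=rng)
tables = build_tables(planes, weights)
x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
active = active_neurons(planes, tables, x)
activations = sparse_forward(weights, x, active)
print(f"{len(active)} of {n_neurons} neurons selected")
```

Because a neuron's bucket depends on its weights, a real system would also have to refresh the tables as training changes those weights; the sketch omits that step.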

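Experimental results

Research questions

RQ1: Can algorithmic sparsification via LSH on CPU cores train large neural networks faster than hardware acceleration?
RQ2: How does adaptive neuron sampling affect convergence and accuracy compared with full softmax and sampled softmax baselines?
RQ3: How does SLIDE scale in wall-clock time and core utilization as the number of CPU cores and the dataset size grow?
RQ4: What are the practical bottlenecks (memory, bandwidth), and how can they be mitigated in a CPU-based DL system?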

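Key findings

SLIDE on a 44-core CPU beats TF on a Tesla V100 GPU in wall-clock time at comparable accuracy.
SLIDE is about 1.8x faster than TF-GPU on Delicious-200K and about 2.7x faster on Amazon-670K.
SLIDE is more than 10x faster than TF-CPU and converges with similar iteration-versus-accuracy behavior.
As the core count grows, memory-bound inefficiency decreases for SLIDE but increases for TF-CPU, so SLIDE uses the CPU more efficiently.
Adaptive sampling via LSH greatly reduces the number of active neurons and updates, yielding large speedups with minimal loss of accuracy.
SLIDE reaches near-peak core utilization (about 80–85%) at 8–32 threads, which is more efficient than TF-CPU.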


This review was generated by AI and checked by a human editor.