QUICK REVIEW

[Paper Review] Deep Learning on FPGAs: Past, Present, and Future

Griffin Lacey, Graham W. Taylor|arXiv (Cornell University)|Feb 13, 2016

CCD and CMOS Imaging Sensors | References: 29 | Citations: 154

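One-sentence summary

This paper surveys deep learning on FPGAs, discusses the use of high-level OpenCL tools, evaluates CNN/MLP implementations and design flows, and highlights future directions and the potential for power-efficient acceleration.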

ABSTRACT

The rapid growth of data size and accessibility in recent years has instigated a shift of philosophy in algorithm design for artificial intelligence. Instead of engineering algorithms by hand, the ability to learn composable systems automatically from massive amounts of data has led to ground-breaking performance in important domains such as computer vision, speech recognition, and natural language processing. The most popular class of techniques used in these domains is called deep learning, and is seeing significant attention from industry. However, these models require incredible amounts of data and compute power to train, and are limited by the need for better hardware acceleration to accommodate scaling beyond current data and model sizes. While the current solution has been to use clusters of graphics processing units (GPU) as general purpose processors (GPGPU), the use of field programmable gate arrays (FPGA) provide an interesting alternative. Current trends in design tools for FPGAs have made them more compatible with the high-level software practices typically practiced in the deep learning community, making FPGAs more accessible to those who build and deploy models. Since FPGA architectures are flexible, this could also allow researchers the ability to explore model-level optimizations beyond what is possible on fixed architectures such as GPUs. As well, FPGAs tend to provide high performance per watt of power consumption, which is of particular importance for application scientists interested in large scale server-based deployment or resource-limited embedded applications. This review takes a look at deep learning and FPGAs from a hardware acceleration perspective, identifying trends and innovations that make these technologies a natural fit, and motivates a discussion on how FPGAs may best serve the needs of the deep learning community moving forward.

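Research motivation and objectives

Motivates the need for hardware acceleration beyond GPUs for deep learning (DL).
Characterizes the role of FPGAs as flexible, power-efficient accelerators for DL workloads.
Evaluates current FPGA-based CNN/MLP implementations and their design trade-offs.
Discusses the adoption of high-level abstraction tools and OpenCL as a bridge between the DL and FPGA communities.
Recommends future directions for scaling DL on FPGAs and for improving tools and workflows.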

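Approach

A review of CNN and MLP architectures and their suitability for FPGA acceleration.
A discussion of FPGA characteristics, including reconfigurability, memory hierarchy, and pipeline parallelism.
An analysis of high-level synthesis and OpenCL as a path to making FPGAs accessible to DL researchers.
A description of an FPGA-centric design flow for DL models that integrates with OpenCL-based workflows.
A comparison of training and inference on FPGA hardware, with an examination of the associated performance implications.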
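The pipeline parallelism mentioned in the list above can be illustrated with a simple timing model. This is a minimal sketch of an idealized, stall-free pipeline and is not taken from the paper. The function names, the three stages, and their cycle counts are all hypothetical values chosen for illustration.

```python
from typing import Sequence


def sequential_cycles(num_items: int, stage_cycles: Sequence[int]) -> int:
    """Cycles when every item finishes all stages before the next item starts."""
    return num_items * sum(stage_cycles)


def pipelined_cycles(num_items: int, stage_cycles: Sequence[int]) -> int:
    """Cycles for an idealized pipeline with no stalls.

    The first item takes the full pipeline latency; each later item
    completes one slowest-stage interval after the previous one.
    """
    if num_items <= 0:
        return 0
    return sum(stage_cycles) + (num_items - 1) * max(stage_cycles)


# Hypothetical per-stage costs in clock cycles
# (load, multiply-accumulate, activation); not measured values.
STAGES = (4, 6, 2)

if __name__ == "__main__":
    n = 100
    seq = sequential_cycles(n, STAGES)
    pipe = pipelined_cycles(n, STAGES)
    print(f"sequential: {seq} cycles, pipelined: {pipe} cycles, "
          f"speedup: {seq / pipe:.2f}x")
```

Under these assumptions, steady-state throughput is set by the slowest stage rather than by the sum of all stages, which is why balancing stage latencies matters when mapping a model onto a pipelined datapath.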

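Experimental results

Research questions

RQ1: What makes FPGAs an attractive platform for DL acceleration compared with general-purpose processors (GPPs) and GPUs?
RQ2: How do DL models (especially CNNs and MLPs) map onto FPGA architectures, and what performance/power trade-offs result?
RQ3: What tools and design flows need to be developed to broaden FPGA adoption in the DL community?
RQ4: What are the near-term and long-term directions for scaling DL across multi-FPGA and power-constrained environments?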

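Key findings

State-of-the-art CNN implementations on FPGAs achieve tens to roughly a hundred images per second within a power budget of a few tens of watts (e.g., 25 W on a Stratix V platform, 134 images/s on ImageNet 1K).
OpenCL enables cross-hardware programming across FPGAs, GPUs, and CPUs, which promotes adoption of DL workflows on FPGAs despite platform-specific limitations.
High-level design tools and OpenCL support make FPGAs more usable for researchers and DL practitioners, bridging software-like DL workflows and reconfigurable hardware.
FPGAs offer pipeline parallelism and customizable architectures, and can outperform fixed-architecture GPUs in performance per watt on certain DL primitives and streaming workloads.
Future directions include larger memory, multi-FPGA configurations, and more abstracted design tools that reduce compile-time bottlenecks and speed up iteration.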
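The throughput and power figures quoted in the first finding can be converted into energy-efficiency numbers with simple arithmetic. The sketch below uses only the 134 images/s and 25 W figures stated in this review. The function names are illustrative, and no GPU figures are assumed, since the review does not give any.

```python
def images_per_joule(images_per_second: float, power_watts: float) -> float:
    """Throughput per watt; (images/s) / (J/s) is the same as images per joule."""
    return images_per_second / power_watts


def joules_per_image(images_per_second: float, power_watts: float) -> float:
    """Energy spent on one image at the given throughput and power draw."""
    return power_watts / images_per_second


if __name__ == "__main__":
    # Figures quoted in the review for the Stratix V example.
    throughput = 134.0  # images per second on ImageNet 1K
    power = 25.0        # watts

    print(f"{images_per_joule(throughput, power):.2f} images per joule")
    print(f"{joules_per_image(throughput, power):.3f} joules per image")
```

Applying the same two functions to any other platform's throughput and power gives a like-for-like performance-per-watt comparison, provided both figures are measured on the same workload.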


This review was generated by AI and checked by a human editor.