QUICK REVIEW

[Paper Review] Memory-aware fusing and tiling of neural networks for accelerated edge inference

Jackson Farley|arXiv (Cornell University)|May 1, 2021

Advanced Neural Network Applications|References: 12|Citations: 3

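One-line summary

This paper proposes a memory-aware fusing and tiling technique that accelerates neural network inference on edge devices by fusing and tiling two groups of convolutional layers independently. Applied to YOLOv2, it runs in less than half the memory and achieves a speedup of up to 2.78x under severe memory constraints. It also introduces a memory predictor and a search algorithm that automatically return a configuration whose latency is within 6% of the best latency found by manual search.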

ABSTRACT

A rising research challenge is running costly machine learning (ML) networks locally on resource-constrained edge devices. ML networks with large convolutional layers can easily exceed available memory, increasing latency due to excessive swapping. Previous memory reduction techniques such as pruning and quantization reduce model accuracy and often require retraining. Alternatively, distributed methods partition the convolutions into equivalent smaller sub-computations, but the implementations introduce communication costs and require a network of devices. However, a distributed partitioning approach can also be used to run in a reduced memory footprint on a single device by subdividing the network into smaller operations. This report extends prior work on distributed partitioning using tiling and fusing of convolutional layers into a memory-aware execution on a single device. Our approach extends prior fusing strategies to allow for two groups of convolutional layers that are fused and tiled independently. This approach reduces overhead via data reuse, and reduces the memory footprint further. We also propose a memory usage predictor coupled with a search algorithm to provide fusing and tiling configurations for an arbitrary set of convolutional layers. When applied to the YOLOv2 object detection network, results show that our approach can run in less than half the memory, and with a speedup of up to 2.78 under severe memory constraints. Additionally, our algorithm will return a configuration with a latency that is within 6% of the best latency measured in a manual search.

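Motivation and objectives

Address the challenge of running large, memory-intensive neural networks on resource-constrained edge devices.
Reduce the memory footprint without sacrificing model accuracy, avoiding techniques such as pruning and quantization that reduce accuracy and often require retraining.
Enable efficient inference on a single device by applying the tiling and fusing techniques used in distributed processing to a single-device execution model.
Develop an automatic configuration search method that balances memory efficiency and inference latency.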

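Proposed method

Extend prior fusing strategies so that two separate groups of convolutional layers can be fused and tiled independently.
Use tiling to split large convolution operations into smaller computations that fit in the limited memory available on the device.
Use a memory usage predictor to estimate the memory footprint of different fusing and tiling configurations and to guide the search.
Use a search algorithm to find a configuration with the best trade-off between memory usage and inference latency.
Reuse data across tiled operations to reduce redundant memory accesses and computation.
Apply the framework to YOLOv2 to demonstrate its effectiveness on a real-world object detection model.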
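
The fusing and tiling idea in the list above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `conv2d` and `fused_tiled` are hypothetical, and the code assumes a single channel, stride 1, no padding, and square kernels. Each output tile is computed from an input patch enlarged by a halo, and the patch passes through every fused layer before the next tile starts, so no full intermediate feature map is ever stored.

```python
def conv2d(x, k):
    """Plain 'valid' 2-D cross-correlation on nested lists (stride 1, no padding)."""
    kh, kw = len(k), len(k[0])
    oh, ow = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
             for j in range(ow)] for i in range(oh)]


def fused_tiled(x, kernels, tile):
    """Run a stack of conv layers tile by tile (hypothetical helper).

    Returns the output and the largest input patch (in elements) held at once.
    Assumes square kernels, so the halo is the same in both dimensions.
    """
    halo = sum(len(k) - 1 for k in kernels)  # extra rows/cols each tile must read
    oh, ow = len(x) - halo, len(x[0]) - halo
    out = [[0] * ow for _ in range(oh)]
    peak = 0
    for r in range(0, oh, tile):
        for c in range(0, ow, tile):
            th, tw = min(tile, oh - r), min(tile, ow - c)
            # Input patch = output tile grown by the accumulated halo.
            patch = [row[c:c + tw + halo] for row in x[r:r + th + halo]]
            peak = max(peak, len(patch) * len(patch[0]))
            for k in kernels:  # fused: intermediates exist only per tile
                patch = conv2d(patch, k)
            for i in range(th):
                out[r + i][c:c + tw] = patch[i]
    return out, peak


# Small integer example so the comparison below is exact.
x = [[(7 * i + 3 * j) % 5 for j in range(10)] for i in range(10)]
k1 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
k2 = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

reference = conv2d(conv2d(x, k1), k2)            # layer by layer, full maps
tiled, peak_patch = fused_tiled(x, [k1, k2], tile=4)
print(tiled == reference, peak_patch)
```

The halo is where the trade-off comes from: smaller tiles hold less data at once, but neighbouring tiles read and recompute overlapping regions, which is the overhead that data reuse is meant to reduce.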

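Experimental results

Research questions

RQ1: Does fusing and tiling two groups of convolutional layers independently reduce memory usage more effectively than fusing and tiling a single group?
RQ2: How effective is the proposed memory predictor when searching for low-memory, high-performance configurations?
RQ3: How close does the latency of automatically found configurations come to that of manually optimized ones?
RQ4: Under tight memory constraints, how much memory reduction and speedup can be achieved on a real-world model (YOLOv2)?
RQ5: Can accuracy degradation be avoided without retraining or quantization?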

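Key findings

Even under severe memory constraints, the proposed approach ran YOLOv2 in less than half the memory used by the original model.
On a memory-constrained edge device, it achieved a speedup of up to 2.78x over baseline inference.
The automatic search algorithm found a configuration whose latency was within 6% of the best result from a manual search.
Applying tiling and fusing strategies commonly used in distributed systems enabled efficient inference on a single device.
The memory predictor made it possible to explore the configuration space effectively, without random trial and error and without retraining.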
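
A predictor-guided search of the kind mentioned in the findings above can be sketched as follows. This review does not describe the paper's actual predictor or search, so everything here is an assumption: the names `group_stats`, `search`, `LAYERS`, `SIZE`, and `BYTES` are hypothetical, the memory model is a crude upper bound (group weights plus the largest tile input and output, plus the full feature map at the boundary between the two groups), and multiply-accumulate count stands in for latency.

```python
from itertools import product

SIZE = 64   # square feature-map edge, same for every layer (assumption)
BYTES = 4   # float32 activations and weights (assumption)
# Hypothetical layers: (in_channels, out_channels, kernel), 'same' padding, stride 1.
LAYERS = [(3, 16, 3), (16, 32, 3), (32, 32, 3), (32, 64, 3)]


def group_stats(layers, n):
    """Estimate (peak_bytes, mac_count) for one fused group tiled n x n."""
    if not layers:
        return 0, 0
    weights = sum(k * k * cin * cout for cin, cout, k in layers) * BYTES
    edge = -(-SIZE // n)  # ceil(SIZE / n): output tile edge of the group
    peak = macs = 0
    # Walk backwards: each layer's input tile is its output tile plus a halo.
    for cin, cout, k in reversed(layers):
        out_edge, edge = edge, edge + (k - 1 if n > 1 else 0)
        peak = max(peak, (edge * edge * cin + out_edge * out_edge * cout) * BYTES)
        macs += n * n * out_edge * out_edge * cout * cin * k * k
    return weights + peak, macs


def search(layers, budget, tilings=(1, 2, 3, 4, 5)):
    """Return (macs, bytes, split, n1, n2) with the fewest MACs that fits the budget."""
    best = None
    for s, n1, n2 in product(range(1, len(layers) + 1), tilings, tilings):
        g1, g2 = layers[:s], layers[s:]
        if not g2 and n2 != 1:
            continue  # single-group case: n2 is irrelevant, visit it once
        m1, c1 = group_stats(g1, n1)
        m2, c2 = group_stats(g2, n2)
        # The feature map between the two groups is stored in full.
        boundary = SIZE * SIZE * g1[-1][1] * BYTES if g2 else 0
        mem = max(m1, m2) + boundary
        if mem <= budget and (best is None or c1 + c2 < best[0]):
            best = (c1 + c2, mem, s, n1, n2)
    return best


baseline = search(LAYERS, 10**9)       # effectively unconstrained
tight = search(LAYERS, 800_000)        # roughly half the untiled estimate
print(baseline, tight)
```

In this toy model, `search` returns `None` when no configuration fits the budget. A real system would replace the MAC count with measured or modelled latency, since swapping and cache behaviour dominate under memory pressure.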

This review was generated by AI and checked by a human editor.