QUICK REVIEW

[Paper Review] MCUNet: Tiny Deep Learning on IoT Devices

Ji Lin, Wei-Ming Chen|arXiv (Cornell University)|Jul 20, 2020

Advanced Neural Network Applications|References: 51|Citations: 255

One-line summary

MCUNet co-designs TinyNAS and TinyEngine to enable ImageNet-scale deep learning on off-the-shelf microcontrollers, reaching 70.7% top-1 accuracy and fast wake-word inference under tight memory budgets.

ABSTRACT

Machine learning on tiny IoT devices based on microcontroller units (MCU) is appealing but challenging: the memory of microcontrollers is 2-3 orders of magnitude smaller even than mobile phones. We propose MCUNet, a framework that jointly designs the efficient neural architecture (TinyNAS) and the lightweight inference engine (TinyEngine), enabling ImageNet-scale inference on microcontrollers. TinyNAS adopts a two-stage neural architecture search approach that first optimizes the search space to fit the resource constraints, then specializes the network architecture in the optimized search space. TinyNAS can automatically handle diverse constraints (i.e., device, latency, energy, memory) under low search costs. TinyNAS is co-designed with TinyEngine, a memory-efficient inference library to expand the search space and fit a larger model. TinyEngine adapts the memory scheduling according to the overall network topology rather than layer-wise optimization, reducing the memory usage by 4.8x, and accelerating the inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN. MCUNet is the first to achieve >70% ImageNet top-1 accuracy on an off-the-shelf commercial microcontroller, using 3.5x less SRAM and 5.7x less Flash compared to quantized MobileNetV2 and ResNet-18. On visual and audio wake words tasks, MCUNet achieves state-of-the-art accuracy and runs 2.4-3.4x faster than MobileNetV2 and ProxylessNAS-based solutions with 3.7-4.1x smaller peak SRAM. Our study suggests that the era of always-on tiny machine learning on IoT devices has arrived. Code and models can be found here: https://tinyml.mit.edu.

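Motivation and objectives

Motivate and realize ImageNet-scale deep learning on microcontrollers with extremely limited SRAM and Flash.
Develop a system-algorithm co-design framework that combines neural architecture search with inference scheduling to minimize peak memory and maximize accuracy.
Automate search-space optimization so that it fits diverse constraints of tiny hardware.
Provide a memory-efficient inference library that expands the space of models a tiny device can run.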

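Proposed method

TinyNAS runs a two-stage NAS: it first optimizes the search space automatically by analyzing the FLOPs distribution of constraint-satisfying networks across 108 search-space configurations, then performs one-shot NAS within the selected space using weight sharing and evolutionary search.
TinyEngine uses code generation to eliminate runtime overhead, and applies model-adaptive memory scheduling, kernel optimizations, and in-place depthwise convolution to reduce peak memory and raise throughput.
The framework co-designs TinyNAS and TinyEngine so that a larger model capacity becomes runnable within the MCU memory budget.
Models are quantized to int8 for deployment, and 4-bit quantization is explored as a way to fit larger models within the memory limit.
Evaluation covers ImageNet, Visual Wake Words, and Speech Commands across several MCUs (e.g., STM32F746, F412, H743).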
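The first TinyNAS stage described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the width and resolution grids, the toy cost model in `sample_network`, the 320 kB budget, and the choice of mean FLOPs as the ranking statistic are all assumptions made here for the example. Only the overall idea (sample networks per configuration, keep those that satisfy the memory constraint, compare their FLOPs distributions across 108 configurations) comes from the review.

```python
import random
from statistics import mean

# Assumed grid: 9 width multipliers x 12 input resolutions = 108 configurations.
WIDTHS = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
RESOLUTIONS = list(range(48, 224 + 1, 16))  # 48, 64, ..., 224


def sample_network(width, resolution, rng):
    """Toy stand-in for sampling one sub-network; returns (peak_sram_kb, mflops).

    The formulas are hypothetical and only mimic the trend that wider,
    higher-resolution networks cost more memory and compute.
    """
    depth = rng.randint(8, 20)       # hypothetical number of blocks
    expand = rng.choice([3, 4, 6])   # hypothetical expansion ratio
    channels = 16 * width            # hypothetical early-stage channel count
    peak_sram_kb = (resolution / 2) ** 2 * channels * (1 + expand) / 1024
    mflops = depth * expand * (resolution / 4) ** 2 * channels ** 2 / 1e6
    return peak_sram_kb, mflops


def score_search_space(width, resolution, sram_budget_kb, n_samples, rng):
    """Mean FLOPs of the sampled networks that fit the SRAM budget (0.0 if none fit)."""
    feasible_flops = []
    for _ in range(n_samples):
        sram_kb, mflops = sample_network(width, resolution, rng)
        if sram_kb <= sram_budget_kb:
            feasible_flops.append(mflops)
    return mean(feasible_flops) if feasible_flops else 0.0


def pick_search_space(sram_budget_kb=320, n_samples=200, seed=0):
    """Score every (width, resolution) configuration and return the best one."""
    rng = random.Random(seed)
    scores = {
        (w, r): score_search_space(w, r, sram_budget_kb, n_samples, rng)
        for w in WIDTHS
        for r in RESOLUTIONS
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    best, scores = pick_search_space()
    print("configurations scored:", len(scores))
    print("selected (width, resolution):", best)
```

The second stage (one-shot NAS with weight sharing and evolutionary search) would then run only inside the selected configuration, which is what keeps the overall search cost low.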

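Experimental results

Research questions

RQ1 Can system-level co-design of the neural architecture and the inference runtime enable ImageNet-scale models on memory-constrained MCUs?
RQ2 How much memory and latency can be saved by co-designing the architecture search with a memory-aware inference engine?
RQ3 Under tight SRAM/Flash budgets, how do search-space optimization and memory scheduling affect final accuracy?
RQ4 Is int8 (or lower-bit) quantization sufficient to reach competitive accuracy for MCU-constrained models?
RQ5 How does MCUNet perform on wake-word and object-detection tasks compared with prior TinyML baselines?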
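To make RQ4 concrete, the following is a minimal sketch of generic symmetric per-tensor int8 quantization. The review does not describe the exact scheme the authors use, so the functions `quantize_int8` and `dequantize` and the rounding and clipping choices are illustrative assumptions. The point is only that each value is stored in 1 byte instead of the 4 bytes of a float32, at the cost of a bounded rounding error.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: value ~= scale * q, with q in [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer level, then clip to the int8 range used here.
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate real values."""
    return [scale * x for x in q]


if __name__ == "__main__":
    weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(a - b) for a, b in zip(weights, restored))
    print("levels:", q)
    print("worst-case error:", worst, "(bounded by scale / 2 =", scale / 2, ")")
```

Going below 8 bits shrinks storage further but widens the step size `scale`, which is why lower-bit quantization is framed as a trade-off between model size and accuracy.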

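Key findings

MCUNet achieves 70.7% ImageNet top-1 accuracy on a commercial off-the-shelf microcontroller.
TinyEngine reduces peak memory by 3.4× and speeds up inference by 1.7–3.3× compared with TF-Lite Micro and CMSIS-NN.
Combining TinyEngine with TinyNAS raises top-1 accuracy to 61.8% under a tight memory budget, compared with 47.4–56.4% for the MobileNetV2 and ProxylessNAS baselines when only the inference library is optimized.
On wake-word datasets (VWW and Speech Commands), MCUNet runs 2.4–3.4× faster than the baselines with 3.7–4.1× smaller peak SRAM.
On Pascal VOC under a 512 kB SRAM budget, MCUNet reaches 51.4% mAP, versus 31.6% for MobileNetV2 + CMSIS-NN under the memory constraint.
Compared with 8-bit ResNet-18 and MobileNetV2, MCUNet uses about 3.5× less SRAM and about 5.7× less Flash while also improving ImageNet accuracy.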
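One contributor to the peak-memory savings is the in-place depthwise convolution listed under the proposed method. The sketch below shows why it helps, under assumptions made for this example: stride 1, same-size output, 3x3 kernels, and int8 activations at 1 byte per value. Because each depthwise channel depends only on its own input channel, the output can overwrite the input one channel at a time using a single channel-sized temporary buffer, so the activation footprint drops from roughly 2N channels to N + 1. The function names are hypothetical, and the reported 3.4× figure also reflects other optimizations, so it should not be attributed to this step alone.

```python
def conv2d_same(channel, kernel):
    """3x3 cross-correlation of one channel with zero padding and stride 1."""
    h, w = len(channel), len(channel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += channel[yy][xx] * kernel[ky][kx]
            out[y][x] = acc
    return out


def depthwise_in_place(activations, kernels):
    """Overwrite each input channel with its output via one channel-sized temp buffer."""
    for c, kernel in enumerate(kernels):
        temp = conv2d_same(activations[c], kernel)  # the only extra buffer
        for y, row in enumerate(temp):
            activations[c][y][:] = row              # write back over the input


def depthwise_peak_bytes(height, width, channels, in_place, bytes_per_value=1):
    """Activation bytes held at once: 2N channels normally, N + 1 when in place."""
    per_channel = height * width * bytes_per_value
    if in_place:
        return (channels + 1) * per_channel
    return 2 * channels * per_channel


if __name__ == "__main__":
    normal = depthwise_peak_bytes(56, 56, 96, in_place=False)
    inplace = depthwise_peak_bytes(56, 56, 96, in_place=True)
    print("separate buffers:", normal, "bytes")
    print("in place:", inplace, "bytes")
```

The saving approaches 2× for layers with many channels, which matters most for the early high-resolution layers where activations dominate SRAM usage.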


This review was generated by AI and checked by a human editor.