QUICK REVIEW

[Paper Review] On-Device Training Under 256KB Memory

Ji Lin, Ligeng Zhu|arXiv (Cornell University)|Jun 30, 2022

Advanced Neural Network Applications|Cited by 71

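One-Line Summary

This paper presents an algorithm-system co-design that enables training real CNNs on microcontrollers with only 256KB of SRAM and 1MB of Flash. Using Quantization-Aware Scaling, Sparse Update, and the Tiny Training Engine, it achieves large memory and speed improvements while maintaining accuracy.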

ABSTRACT

On-device training enables the model to adapt to new data collected from the sensors by fine-tuning a pre-trained model. Users can benefit from customized AI models without having to transfer the data to the cloud, protecting the privacy. However, the training memory consumption is prohibitive for IoT devices that have tiny memory resources. We propose an algorithm-system co-design framework to make on-device training possible with only 256KB of memory. On-device training faces two unique challenges: (1) the quantized graphs of neural networks are hard to optimize due to low bit-precision and the lack of normalization; (2) the limited hardware resource does not allow full back-propagation. To cope with the optimization difficulty, we propose Quantization-Aware Scaling to calibrate the gradient scales and stabilize 8-bit quantized training. To reduce the memory footprint, we propose Sparse Update to skip the gradient computation of less important layers and sub-tensors. The algorithm innovation is implemented by a lightweight training system, Tiny Training Engine, which prunes the backward computation graph to support sparse updates and offload the runtime auto-differentiation to compile time. Our framework is the first solution to enable tiny on-device training of convolutional neural networks under 256KB SRAM and 1MB Flash without auxiliary memory, using less than 1/1000 of the memory of PyTorch and TensorFlow while matching the accuracy on tinyML application VWW. Our study enables IoT devices not only to perform inference but also to continuously adapt to new data for on-device lifelong learning. A video demo can be found here: https://youtu.be/0pUFZYdoMY8.

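Motivation and Objectives

Enable on-device training on ultra-small edge devices with a tight memory budget (256KB SRAM).
Develop quantization-aware optimization that stabilizes 8-bit training without extra memory.
Reduce memory usage through sparse gradient updates and an automated scheme for selecting what to update.
Design a lightweight, compiler-based training system that prunes and reorders the backward graph for efficiency.
Demonstrate deployment on real hardware and show accuracy comparable to cloud training on TinyML tasks.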

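Proposed Method

Update the real quantized graph directly (int8 forward and backward passes) to fit the tight memory budget.
Quantization-Aware Scaling (QAS) automatically scales gradients to stabilize 8-bit training without extra memory.
Sparse Update selectively updates layers and sub-tensors based on contribution analysis so that the memory constraint is met.
Automated contribution analysis chooses which biases and weights to update, and at what granularity.
Tiny Training Engine (TTE) uses code generation to compile a static backward graph, prunes unused nodes, and enables in-place updates and operator fusion.
Graph pruning and operator reordering reduce peak memory and accelerate training on microcontrollers.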
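
The gradient scaling idea behind QAS can be illustrated with a small numerical sketch. It assumes a single per-tensor scale `s` with `w_fp ≈ s * w_q`, ignores rounding, and uses made-up values; the helper names (`l2_norm`, `qas_scale_gradient`) are hypothetical and this is not the authors' implementation. Under these assumptions the weight-to-gradient norm ratio of the quantized graph is off by a factor of `s**-2` relative to floating point, and multiplying the gradient by `s**-2` restores it.

```python
import math
import random


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def qas_scale_gradient(grad_q, scale):
    """Rescale a quantized-graph gradient by scale**-2 (per-tensor scale assumed)."""
    return [g * scale ** -2 for g in grad_q]


random.seed(0)
w_fp = [random.gauss(0.0, 1.0) for _ in range(64)]   # floating-point weights (toy data)
g_fp = [random.gauss(0.0, 0.01) for _ in range(64)]  # their gradients (toy data)
s = 0.02                                             # hypothetical quantization scale

# Quantized-graph view: w_q = w_fp / s, so by the chain rule dL/dw_q = dL/dw_fp * s.
w_q = [w / s for w in w_fp]
g_q = [g * s for g in g_fp]

ratio_fp = l2_norm(w_fp) / l2_norm(g_fp)                         # reference ratio
ratio_q = l2_norm(w_q) / l2_norm(g_q)                            # distorted by s**-2
ratio_qas = l2_norm(w_q) / l2_norm(qas_scale_gradient(g_q, s))   # restored

print(f"fp ratio:       {ratio_fp:.3f}")
print(f"int8 ratio:     {ratio_q:.3f}")
print(f"int8+QAS ratio: {ratio_qas:.3f}")
```

Because the correction only multiplies each gradient by a constant derived from scales the quantized model already stores, it needs no additional buffers, which is consistent with the "no extra memory" claim above.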

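Experimental Results

Research Questions

RQ1: Can on-device CNN training be made feasible on a device with 256KB of SRAM and 1MB of Flash?
RQ2: Can QAS narrow the accuracy gap between int8 training and floating-point training at no additional memory cost?
RQ3: Can Sparse Update and compile-time graph optimization satisfy a strict memory budget while preserving transfer-learning performance?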
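
RQ3 can be made concrete as a selection problem: choose which tensors to update so that the estimated accuracy contribution is high while the extra training memory stays within a budget. The sketch below uses a simple greedy heuristic (contribution per KB) as a stand-in; it is not the search procedure used in the paper, and every layer name, contribution value, and memory cost is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # hypothetical layer or sub-tensor identifier
    contribution: float  # hypothetical accuracy gain from updating it
    memory_kb: float     # hypothetical extra training memory it requires


def select_updates(candidates, budget_kb):
    """Greedily pick candidates by contribution per KB while the budget allows."""
    chosen, used = [], 0.0
    ranked = sorted(candidates, key=lambda c: c.contribution / c.memory_kb, reverse=True)
    for c in ranked:
        if used + c.memory_kb <= budget_kb:
            chosen.append(c.name)
            used += c.memory_kb
    return chosen, used


candidates = [
    Candidate("block10.bias", 0.8, 1.0),
    Candidate("block10.conv1.weight[:50%]", 1.5, 20.0),
    Candidate("block12.conv1.weight", 2.0, 60.0),
    Candidate("block4.conv1.weight", 0.3, 90.0),
    Candidate("block12.bias", 0.9, 1.0),
]

chosen, used = select_updates(candidates, budget_kb=100.0)
print(chosen, used)
```

In this toy setup the cheap bias updates are picked first, followed by the weight tensors that fit, and the expensive early-layer weight is skipped because it would exceed the budget.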

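Key Findings

Memory usage is reduced by more than 1000x compared with PyTorch and TensorFlow, making training possible within 256KB SRAM and 1MB Flash.
QAS enables fully quantized int8 training that matches floating-point training accuracy on multiple datasets.
Sparse Update combined with contribution analysis improves downstream accuracy while using far less memory than updating all layers.
TTE's graph generation and pruning yield 7-9x memory savings, and operator reordering brings the peak-memory reduction to 20-21x.
On-device training on a Cortex-M7 MCU (STM32F746) reaches competitive accuracy (e.g., 89.1% top-1 on VWW) and is much faster per iteration (23-25x faster than a full update with TF-Lite Micro).
The framework enables on-device lifelong learning and privacy-preserving personalization for tinyML applications.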
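
The backward-graph pruning behind the TTE memory savings can be illustrated with a toy sequential model. The sketch below is a simplified assumption, not TTE itself: it treats the network as a chain of linear or convolution layers, ignores nonlinearities, and uses a hypothetical function `prune_backward`. It keeps a weight-gradient op only for layers whose weights are updated, a bias-gradient op only for layers whose biases are updated, and stops propagating input gradients below the earliest updated layer.

```python
def prune_backward(num_layers, weight_updates, bias_updates):
    """Return (backward ops, saved activations) for a sequential model.

    weight_updates / bias_updates are sets of layer indices (0 = first layer).
    """
    trainable = weight_updates | bias_updates
    if not trainable:
        return [], []
    first = min(trainable)
    ops = []
    for i in reversed(range(num_layers)):
        if i < first:
            break  # no updated layer below this point, so no gradients are needed
        if i in weight_updates:
            ops.append(f"grad_weight[{i}]")  # needs the saved input activation of layer i
        if i in bias_updates:
            ops.append(f"grad_bias[{i}]")    # needs only the output gradient
        if i > first:
            ops.append(f"grad_input[{i}]")   # pass the gradient to the layer below
    saved_activations = sorted(weight_updates)
    return ops, saved_activations


all_layers = set(range(6))
full_ops, full_saved = prune_backward(6, all_layers, all_layers)
sparse_ops, sparse_saved = prune_backward(6, weight_updates={4}, bias_updates={3, 5})

print(len(full_ops), full_saved)
print(sparse_ops, sparse_saved)
```

In this example the full update needs 17 backward ops and the input activations of all six layers, while the sparse configuration needs five ops and a single saved activation. Deciding this at compile time, rather than at runtime, is what lets the unused nodes be dropped from the generated code.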

This review was generated by AI and checked by a human editor.