QUICK REVIEW

[論文レビュー] TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge

Young D. Kwon, Rui Li|arXiv (Cornell University)|Jul 19, 2023

Advanced Neural Network Applications被引用数 8

ひとこと要約

TinyTrain はタスク適応的スパース更新と少数ショット事前学習により、極端なエッジデバイス上での高速でメモリ・計算効率の高い on-device DNN トレーニングを実現し、以前の手法よりはるかに低いオーバーヘッドで高い精度を達成します。

ABSTRACT

On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCUs), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss (>10%). In this paper, we propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel to update based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy, while reducing the backward-pass memory and computation cost by up to 1,098x and 7.68x, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5x faster and 3.5x more energy-efficient training over status-quo approaches, and 2.23x smaller memory footprint than SOTA methods, while remaining within the 1 MB memory envelope of MCU-grade platforms.

研究の動機と目的

極端に制約のあるエッジデバイス上での on-device トレーニングのデータ不足に対処する。
各目標タスクに適応するメモリ・計算効率の高いスパース更新ポリシーを開発する。
データ不足シナリオでの適応性能を高めるために few-shot 学習レジームで事前学習を行う。
デプロイ時のオーバーヘッドを最小化するために動的な層/チャネル選択を有効にする。
実デバイス測定を用いて MCU 等級のハードウェアでの実用的実現性を示す。

提案手法

少数-shot 適応に適した堅牢なグローバル表現を得るためのオフライン事前学習とメタ学習。
Fisher 情報と正規化コスト項を組み合わせた多目的基準を用いて訓練する層/チャネルを選択するタスク適応的スパース更新。
デバイス予算内でターゲットタスクごとに再計算される動的なオンライン層/チャネル選択。
オフラインスコアリングとオンライン選択の両方で活性化に対する Fisher 情報をチャネル/層の重要性の代理指標として使用。
デバイス適応前のサンプル効率を向上させるための few-shot 学習 (FSL) 事前学習段階。

実験結果

リサーチクエスチョン

RQ1極端なエッジデバイス上でのトレーニングを実現可能にしつつ、クロスドメイン・少数ショットタスクでの精度を維持できるか。
RQ2動的なタスク適応的スパース更新ポリシーは、厳しいメモリと計算予算下で静的スパース更新や完全微調整より優れているか。
RQ3メタ学習ベースの事前学習は、データ不足のシナリオで複数のアーキテクチャにわたり適応性能をどれだけ改善するか。
RQ4MCU ライクデバイス上での TinyTrain の実行時コスト（メモリ、MAC、待ち時間、エネルギー）はどの程度か。

主な発見

モデル	手法	トラフィック	Omniglot	Aircraft	Flower	CUB	DTD	QDraw	Fungi	COCO	Avg.
MCUNet	None	35.5	42.3	42.1	73.8	48.4	60.1	40.9	30.9	26.8	44.5
MCUNet	FullTrain	82.0	72.7	75.3	90.7	66.4	74.6	64.0	40.4	36.0	66.9
MCUNet	LastLayer	55.3	47.5	56.7	83.9	54.0	72.0	50.3	36.4	35.2	54.6
MCUNet	TinyTL	78.9	73.6	74.4	88.6	60.9	73.3	67.2	41.1	36.9	66.1
MCUNet	SparseUpdate	72.8	67.4	69.0	88.3	67.1	73.2	61.9	41.5	37.5	64.3
MCUNet	TinyTrain (Ours)	79.3	73.8	78.8	93.3	69.9	76.0	67.3	45.5	39.4	69.3
Mobile	None	39.9	44.4	48.4	81.5	61.1	70.3	45.5	38.6	35.8	51.1
Mobile	FullTrain	75.5	69.1	68.9	84.4	61.8	71.3	60.6	37.7	35.1	62.7
Mobile	LastLayer	58.2	55.1	59.6	86.3	61.8	72.2	53.3	39.8	36.7	58.1
Mobile	TinyTL	71.3	69.0	68.1	85.9	57.2	70.9	62.5	38.2	36.3	62.1
Mobile	SparseUpdate	77.3	69.1	72.4	87.3	62.5	71.1	61.8	38.8	35.8	64.0
Mobile	TinyTrain (Ours)	77.4	68.1	74.1	91.6	64.3	74.9	60.6	40.8	39.1	65.6
Proxyless	None	42.6	50.5	41.4	80.5	53.2	69.1	47.3	36.4	38.6	51.1
Proxyless	FullTrain	78.4	73.3	71.4	86.3	64.5	71.7	63.8	38.9	37.2	65.0
Proxyless	LastLayer	57.1	58.8	52.7	85.5	56.1	72.9	53.0	38.6	38.7	57.0
Proxyless	NASNet	72.5	73.6	70.3	86.2	57.4	71.0	65.8	38.6	37.6	63.7
Proxyless	TinyTL	72.5	73.6	70.3	86.2	57.4	71.0	65.8	38.6	37.6	63.7
Proxyless	SparseUpdate	76.0	72.4	71.2	87.8	62.1	71.7	64.1	39.6	37.1	64.7
Proxyless	TinyTrain (Ours)	79.0	71.9	76.7	92.7	67.4	76.0	65.9	43.4	41.6	68.3

TinyTrain は九つのクロスドメインデータセットで、全層微調整と比較して精度を 3.6–5.0 ポイント向上させる。
逆伝播のメモリと計算コストを、それぞれ FullTrain と比較して最大で 2,286 倍および 7.68 倍削減。
TinyTrain は最先端の SparseUpdate 手法より 2.6–7.7% の精度向上と 2.4–3.1× のメモリ削減、計算は 1.5–1.8×低減。
Raspberry Pi Zero 2 と Jetson Nano で、オンライン層/チャネル選択を 20–35 秒で実行（総トレーニング時間の 3.4–3.8%）。
エンドツーエンドの on-device トレーニングは約 10 分で完了し、Pi Zero 2 の 2 時間の FullTrain より桁違いに高速。
TinyTrain は MCU級プラットフォームのメモリ 1 MB 台の範囲を維持しつつ、競争力のある精度を提供。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。