QUICK REVIEW

[Paper Digest] TinyTrain: Resource-Aware Task-Adaptive Sparse Training of DNNs at the Data-Scarce Edge

Young D. Kwon, Rui Li|arXiv (Cornell University)|Jul 19, 2023

Advanced Neural Network Applications|Cited by 8

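One-Sentence Summary

TinyTrain combines task-adaptive sparse updates with few-shot pre-training to enable fast, memory- and compute-efficient on-device DNN training on extremely constrained edge devices, achieving markedly higher accuracy than prior methods at lower overhead.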

ABSTRACT

On-device training is essential for user personalisation and privacy. With the pervasiveness of IoT devices and microcontroller units (MCUs), this task becomes more challenging due to the constrained memory and compute resources, and the limited availability of labelled user data. Nonetheless, prior works neglect the data scarcity issue, require excessively long training time (e.g. a few hours), or induce substantial accuracy loss (>10%). In this paper, we propose TinyTrain, an on-device training approach that drastically reduces training time by selectively updating parts of the model and explicitly coping with data scarcity. TinyTrain introduces a task-adaptive sparse-update method that dynamically selects the layer/channel to update based on a multi-objective criterion that jointly captures user data, the memory, and the compute capabilities of the target device, leading to high accuracy on unseen tasks with reduced computation and memory footprint. TinyTrain outperforms vanilla fine-tuning of the entire network by 3.6-5.0% in accuracy, while reducing the backward-pass memory and computation cost by up to 1,098x and 7.68x, respectively. Targeting broadly used real-world edge devices, TinyTrain achieves 9.5x faster and 3.5x more energy-efficient training over status-quo approaches, and 2.23x smaller memory footprint than SOTA methods, while remaining within the 1 MB memory envelope of MCU-grade platforms.

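Research Motivation and Goals

Address data scarcity in on-device training on extremely resource-constrained edge devices.
Develop a memory- and compute-efficient sparse-update strategy that adapts to the target task.
Pre-train under a few-shot learning paradigm to improve adaptation performance.
Select layers/channels dynamically at deployment time while keeping the added overhead low.
Demonstrate practical feasibility through real-device measurements on MCU-grade and edge-class hardware.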

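Proposed Method

Offline pre-training and meta-training produce a robust global representation suited to few-shot adaptation.
Task-adaptive sparse update: the layers/channels to train are chosen by a multi-objective criterion that combines Fisher information with normalized cost terms.
Dynamic online layer/channel selection recomputes the sparse-update policy for each target task within the device's budget.
Fisher information on activations serves as the proxy for channel/layer importance in both the offline scoring and online selection stages.
A few-shot learning (FSL) pre-training stage improves sample efficiency before on-device adaptation.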
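
To make the selection step above more concrete, here is a minimal sketch of how a Fisher-based importance score could be traded off against normalized cost terms and a device budget. This digest does not give the exact criterion, so everything specific here is an assumption for illustration: the `LayerStats` fields, the score (importance divided by the product of normalized parameter count and normalized MACs), the per-channel Fisher potential formula, and the greedy budgeted selection. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class LayerStats:
    """Hypothetical per-layer statistics gathered on the target task."""
    name: str
    fisher: float     # importance proxy, e.g. summed channel Fisher potentials
    params: int       # number of trainable parameters in the layer
    macs: int         # multiply-accumulates needed to update the layer
    memory_kb: float  # extra memory needed to update the layer


def channel_fisher(acts: Sequence[Sequence[float]],
                   grads: Sequence[Sequence[float]]) -> float:
    """Assumed Fisher potential of one channel.

    acts[n][d] and grads[n][d] are the activation and its gradient for
    sample n at position d. Computes (1 / 2N) * sum_n (sum_d a_nd * g_nd)^2.
    """
    n = len(acts)
    total = 0.0
    for a_n, g_n in zip(acts, grads):
        inner = sum(a * g for a, g in zip(a_n, g_n))
        total += inner ** 2
    return total / (2 * n)


def score_layers(layers: List[LayerStats]) -> Dict[str, float]:
    """Importance divided by normalized cost (assumed form of the criterion)."""
    max_params = max(layer.params for layer in layers)
    max_macs = max(layer.macs for layer in layers)
    return {
        layer.name: layer.fisher
        / ((layer.params / max_params) * (layer.macs / max_macs))
        for layer in layers
    }


def select_layers(layers: List[LayerStats],
                  memory_budget_kb: float,
                  macs_budget: int) -> List[str]:
    """Greedily pick the highest-scoring layers that still fit both budgets."""
    scores = score_layers(layers)
    chosen: List[str] = []
    used_mem, used_macs = 0.0, 0
    for layer in sorted(layers, key=lambda l: scores[l.name], reverse=True):
        fits_mem = used_mem + layer.memory_kb <= memory_budget_kb
        fits_macs = used_macs + layer.macs <= macs_budget
        if fits_mem and fits_macs:
            chosen.append(layer.name)
            used_mem += layer.memory_kb
            used_macs += layer.macs
    return chosen


if __name__ == "__main__":
    # Made-up numbers, purely for illustration.
    layers = [
        LayerStats("block1.conv", 0.2, 1000, 4_000_000, 300),
        LayerStats("block5.conv", 0.6, 4000, 2_000_000, 200),
        LayerStats("block9.conv", 0.9, 8000, 1_000_000, 150),
        LayerStats("classifier", 0.5, 2000, 500_000, 50),
    ]
    print(select_layers(layers, memory_budget_kb=400, macs_budget=3_000_000))
    # prints ['classifier', 'block9.conv']
```

In this toy example the cheap classifier and a late block are chosen, while layers that are important but would exceed the memory or MACs budget are skipped. A real deployment would obtain the activations and gradients from a forward and backward pass over the few labelled samples of the target task.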

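Experimental Results

Research Questions

RQ1: Is on-device training feasible on extreme edge devices while preserving accuracy on cross-domain, few-shot tasks?
RQ2: Under strict memory and compute budgets, does a dynamic task-adaptive sparse-update policy outperform static sparse updates and full fine-tuning?
RQ3: How much does meta-learning-based pre-training improve adaptation performance across architectures in data-scarce settings?
RQ4: What are TinyTrain's actual runtime costs (memory, MACs, latency, energy) on MCU-class devices?

Key Findings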

Model	Method	Traffic	Omniglot	Aircraft	Flower	CUB	DTD	QDraw	Fungi	COCO	Avg.
MCUNet	None	35.5	42.3	42.1	73.8	48.4	60.1	40.9	30.9	26.8	44.5
MCUNet	FullTrain	82.0	72.7	75.3	90.7	66.4	74.6	64.0	40.4	36.0	66.9
MCUNet	LastLayer	55.3	47.5	56.7	83.9	54.0	72.0	50.3	36.4	35.2	54.6
MCUNet	TinyTL	78.9	73.6	74.4	88.6	60.9	73.3	67.2	41.1	36.9	66.1
MCUNet	SparseUpdate	72.8	67.4	69.0	88.3	67.1	73.2	61.9	41.5	37.5	64.3
MCUNet	TinyTrain (Ours)	79.3	73.8	78.8	93.3	69.9	76.0	67.3	45.5	39.4	69.3
MobileNetV2	None	39.9	44.4	48.4	81.5	61.1	70.3	45.5	38.6	35.8	51.1
MobileNetV2	FullTrain	75.5	69.1	68.9	84.4	61.8	71.3	60.6	37.7	35.1	62.7
MobileNetV2	LastLayer	58.2	55.1	59.6	86.3	61.8	72.2	53.3	39.8	36.7	58.1
MobileNetV2	TinyTL	71.3	69.0	68.1	85.9	57.2	70.9	62.5	38.2	36.3	62.1
MobileNetV2	SparseUpdate	77.3	69.1	72.4	87.3	62.5	71.1	61.8	38.8	35.8	64.0
MobileNetV2	TinyTrain (Ours)	77.4	68.1	74.1	91.6	64.3	74.9	60.6	40.8	39.1	65.6
ProxylessNASNet	None	42.6	50.5	41.4	80.5	53.2	69.1	47.3	36.4	38.6	51.1
ProxylessNASNet	FullTrain	78.4	73.3	71.4	86.3	64.5	71.7	63.8	38.9	37.2	65.0
ProxylessNASNet	LastLayer	57.1	58.8	52.7	85.5	56.1	72.9	53.0	38.6	38.7	57.0
ProxylessNASNet	TinyTL	72.5	73.6	70.3	86.2	57.4	71.0	65.8	38.6	37.6	63.7
ProxylessNASNet	SparseUpdate	76.0	72.4	71.2	87.8	62.1	71.7	64.1	39.6	37.1	64.7
ProxylessNASNet	TinyTrain (Ours)	79.0	71.9	76.7	92.7	67.4	76.0	65.9	43.4	41.6	68.3

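Across nine cross-domain datasets, TinyTrain improves accuracy by 3.6-5.0 percentage points over fine-tuning the entire network.
Compared with FullTrain, backward-pass memory and computation costs are reduced by up to 1,098× and 7.68×, respectively.
Compared with the SOTA SparseUpdate method, TinyTrain improves accuracy by 2.6-7.7%, while using 2.4-3.1× less memory and 1.5-1.8× less computation.
On the Raspberry Pi Zero 2 and Jetson Nano, TinyTrain's online layer/channel selection takes 20-35 seconds, or 3.4-3.8% of total training time.
End-to-end on-device training completes in roughly 10 minutes, an order of magnitude faster than the roughly two hours FullTrain needs on the Raspberry Pi Zero 2.
TinyTrain keeps its memory footprint within the roughly 1 MB envelope of MCU-grade platforms while maintaining competitive accuracy.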

This digest was generated by AI and reviewed by human editors.