QUICK REVIEW

[Paper Review] Performance Analysis of Edge and In-Sensor AI Processors: A Comparative Review

Luigi Capogrosso, Pietro Bonazzi|arXiv (Cornell University)|Feb 18, 2026
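Advanced Memory and Neural Computing | Cited by 0
One-Sentence Summary

This paper surveys ultra-low-power edge AI processors and benchmarks PicoSAM2 on the GAP9, STM32N6, and Sony IMX500, comparing latency, MAC/cycle, MAC/J, and energy-delay product (EDP).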

ABSTRACT

This review examines the rapidly evolving landscape of ultra-low-power edge processors, covering heterogeneous Systems-on-Chips (SoCs), neural accelerators, near-sensor and in-sensor architectures, and emerging dataflow and memory-centric designs. We categorize commercially available and research-grade platforms according to their compute paradigms, power envelopes, and memory hierarchies, and analyze their suitability for always-on and latency-critical Artificial Intelligence (AI) workloads. To complement the architectural overview with empirical evidence, we benchmark a 336 million Multiply-Accumulate (MAC) segmentation model (PicoSAM2) on three representative processors: GAP9, leveraging a multi-core RISC-V architecture augmented with hardware accelerators; the STM32N6, which pairs an advanced ARM Cortex-M55 core with a dedicated neural architecture accelerator; and the Sony IMX500, representing in-sensor stacked-Complementary Metal-Oxide-Semiconductor (CMOS) compute. Collectively, these platforms span MCU-class, embedded neural accelerator, and in-sensor paradigms. The evaluation reports latency, inference efficiency, energy efficiency, and energy-delay product. The results show a clear divergence in hardware behavior, with the IMX500 achieving the highest utilization (86.2 MAC/cycle) and the lowest energy-delay product, highlighting the growing significance and technological maturity of in-sensor processing. GAP9 offers the best energy efficiency within microcontroller-class power budgets, and the STM32N6 provides the lowest raw latency at a significantly higher energy cost. Together, the review and benchmarks provide a unified view of the current design directions and practical trade-offs that are shaping the next generation of ultra-low-power and in-sensor AI processors.

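Motivation and Objectives

  • Latency, privacy, and sensing requirements at the edge motivate energy-efficient on-device AI.
  • Map the landscape of ultra-low-power edge processors, spanning MCU-class devices, embedded accelerators, and in-sensor compute.
  • Provide empirical benchmarks that expose the design trade-offs of different architectures under a real workload.
  • Offer guidance on architecture selection for always-on and low-latency AI workloads.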

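Proposed Method

  • Survey commercial and research-grade edge AI platforms by compute paradigm, power envelope, and memory hierarchy.
  • Benchmark the 336 MMAC PicoSAM2 segmentation model on three processors (GAP9, STM32N6, IMX500) using cycle-accurate profiling and power measurements.
  • Evaluate four hardware-centric metrics: latency per inference, MAC/cycle, MAC/J, and energy-delay product (EDP).
  • Report qualitative and quantitative insights into utilization, dataflow efficiency, and memory bottlenecks across architectures.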
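
The four metrics listed above can be sketched in a few lines. This is a minimal illustration that assumes the common textbook definitions (cycles = latency × clock frequency, energy = average power × latency, EDP = energy × latency); the paper's exact formulas are not reproduced in this summary. The `Measurement` class, the `metrics` function, and the example numbers (20 ms, 400 MHz, 100 mW) are hypothetical and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    """One profiled inference run (hypothetical container, not from the paper)."""
    macs: float         # MAC operations per inference
    latency_s: float    # latency per inference, in seconds
    clock_hz: float     # clock frequency of the compute engine, in Hz
    avg_power_w: float  # average power during inference, in watts


def metrics(m: Measurement) -> dict:
    """Derive the four hardware-centric metrics under the assumed definitions."""
    cycles = m.latency_s * m.clock_hz       # clock cycles spent on one inference
    energy_j = m.avg_power_w * m.latency_s  # energy per inference, in joules
    return {
        "latency_ms": m.latency_s * 1e3,
        "mac_per_cycle": m.macs / cycles,
        "mac_per_joule": m.macs / energy_j,
        "edp_js": energy_j * m.latency_s,   # energy-delay product, in J*s
    }


# Hypothetical run: 336 MMAC model, 20 ms latency, 400 MHz clock, 100 mW average power.
example = Measurement(macs=336e6, latency_s=0.020, clock_hz=400e6, avg_power_w=0.100)
result = metrics(example)
for name, value in result.items():
    print(f"{name}: {value:.4g}")
```

Because EDP multiplies energy by latency, a platform can reach the lowest EDP without having the lowest latency, provided its energy per inference is low enough.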
Figure 1: Peak performance in TOPS vs. power consumption of publicly announced AI accelerators and processors. Data are from [ 10 , 11 , 12 , 13 ] .

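Experimental Results

Research Questions

  • RQ1: What are the practical performance and energy-efficiency characteristics of heterogeneous ultra-low-power edge processors when running a representative segmentation model?
  • RQ2: How do in-sensor, MCU-class, and embedded neural accelerator platforms compare in utilization, latency, energy per inference, and EDP?
  • RQ3: Which architectural factors (memory hierarchy, dataflow, data movement) most affect on-device AI performance under a sub-200 mW budget?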

Key Findings

| HW Platform | Peak Perf. (TOPS) | Power (W) | Precision | HW Architecture | Efficiency (TOPS/W) |
|---|---|---|---|---|---|
| Netcast | 1.00E+01 | 0.001 | int8 | Dataflow ASIC | 1.00E+04 |
| Ergo | 4.00E+00 | 0.073 | int8 | Tensor ASIC | 5.48E+01 |
| Ethos N77 | 4.10E+00 | 0.800 | int8 | Tensor ASIC | 5.13E+00 |
| MX3 | 5.00E+00 | 1.000 | fp16 | Manycore ASIC | 5.00E+00 |
| Tianjic | 1.21E+00 | 0.950 | int8 | Neuromorphic | 1.27E+00 |
| AML200 | 2.00E+00 | 0.100 | int8 | Analog In-Memory | 2.00E+01 |
| GAP9 | 1.51E-01 | 0.0640 | int8 | RISC-V Manycore | 2.36E+00 |
| AIStorm | 2.50E+00 | 0.225 | int8 | Analog Compute-in-Sensor | 1.11E+01 |
| Gyrfalcon | 2.80E+00 | 0.224 | int8 | Manycore ASIC | 1.25E+01 |
| AML100 | 4.00E-01 | 0.020 | int8 | Analog In-Memory | 2.00E+01 |
| STM32N6 | 6.00E-01 | 0.200 | int8 | ARM Cortex-M55 + NPU | 3.00E+00 |
| Cortex-M85 (STM32V8/RA8) | 1.30E-01 | 0.250 | int8 | ARM Cortex-M85 | 5.20E-01 |
| NDP101 | 2.00E-01 | 0.010 | int4 | RISC-V + HW Acc | 2.00E+01 |
| NDP200 | 6.20E-03 | 0.010 | int8 | RISC-V + HW Acc | 6.20E-01 |
| NDP250 | 3.00E-02 | 0.100 | int8 | RISC-V + HW Acc | 3.00E-01 |
| IMX500 | 7.952E-02 | 0.016 | int8 | Manycore ASIC | 4.97E+00 |
| Max 78000 | 5.60E-02 | 0.028 | int8 | Tensor Accelerator MCU | 2.00E+00 |
| GAP8 | 2.27E-02 | 0.100 | int8 | RISC-V Manycore | 2.27E-01 |
| Eyeriss | 6.72E-02 | 0.278 | int16 | Dataflow ASIC | 2.42E-01 |
| ShiDianNao | 1.94E-01 | 0.320 | int16 | Dataflow ASIC | 6.06E-01 |
| DianNao | 4.52E-01 | 0.485 | int16 | Dataflow ASIC | 9.32E-01 |
| PuDianNao | 1.06E+00 | 0.596 | int16 | Dataflow ASIC | 1.78E+00 |
| EIE | 1.02E-01 | 0.600 | int16 | Dataflow ASIC (Sparse) | 1.70E-01 |
| K210 | 2.50E-01 | 0.300 | int8 | RISC-V Dual Core + KPU | 8.33E-01 |
| Kendrite K210 | 2.30E-01 | 0.300 | int8 | RISC-V Dual Core + KPU | 7.67E-01 |
| TrueNorth | 1.89E+00 | 0.500 | int8 | Neuromorphic | 3.78E+00 |
| KL520 NPU | 3.00E-01 | 0.500 | int8 | Tensor ASIC | 6.00E-01 |
| xcore.ai | 5.12E-02 | 1.000 | int8 | DSP-like Multicore | 5.12E-02 |
| KL720 | 1.40E+00 | 1.556 | int8 | Tensor ASIC | 9.00E-01 |
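
The efficiency column in the table above appears to be peak throughput divided by power. The sketch below checks that reading on five rows copied from the table, then ranks those at or below 200 mW. The 200 mW cutoff echoes the budget named in RQ3, but applying it to this column of announced power figures is an assumption made here for illustration, as are the names `ROWS`, `efficiency_tops_per_watt`, and `BUDGET_W`.

```python
# (platform, peak TOPS, power in W, listed efficiency in TOPS/W), copied from the table
ROWS = [
    ("GAP9", 1.51e-01, 0.0640, 2.36e+00),
    ("STM32N6", 6.00e-01, 0.200, 3.00e+00),
    ("IMX500", 7.952e-02, 0.016, 4.97e+00),
    ("AML100", 4.00e-01, 0.020, 2.00e+01),
    ("KL720", 1.40e+00, 1.556, 9.00e-01),
]


def efficiency_tops_per_watt(peak_tops: float, power_w: float) -> float:
    """Peak throughput divided by power draw."""
    return peak_tops / power_w


# Relative gap between the recomputed and the listed efficiency, per platform
rel_errors = {
    name: abs(efficiency_tops_per_watt(tops, watts) - listed) / listed
    for name, tops, watts, listed in ROWS
}

# Platforms at or below the assumed 200 mW budget, highest recomputed efficiency first
BUDGET_W = 0.200
within_budget = sorted(
    (row for row in ROWS if row[2] <= BUDGET_W),
    key=lambda row: efficiency_tops_per_watt(row[1], row[2]),
    reverse=True,
)
ranked_names = [row[0] for row in within_budget]

for name, err in rel_errors.items():
    print(f"{name}: relative gap {err:.2%}")
print("within budget:", ranked_names)
```

These are peak, vendor-announced figures, so this ranking need not match the measured PicoSAM2 results reported below.
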
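  • The IMX500 achieves the highest compute density among the tested platforms, at 86.2 MAC/cycle, and the lowest energy-delay product.
  • GAP9 remains competitive in MAC/J within an MCU-class power budget, underscoring its energy efficiency at low clock frequencies.
  • The STM32N6 delivers the lowest raw latency (13.7 ms), but at a significantly higher energy cost.
  • The IMX500 shows superior energy efficiency (1359.6 MMAC/J), driven by its in-sensor compute design.
  • GAP9 remains competitive for battery-constrained, MCU-class deployments; the STM32N6 is latency-oriented at a higher energy cost; the IMX500 demonstrates the advantages of in-sensor processing.
  • The benchmarks highlight the distinct design trade-offs among edge, near-sensor, and in-sensor architectures.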
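
As a back-of-the-envelope reading of the figures above, the reported ratios can be inverted to see what they imply for a single inference. The sketch assumes that each reported figure refers to one full pass of the 336 MMAC model and that MAC/cycle and MAC/J are plain ratios over that pass. The derived values are illustrative arithmetic, not numbers stated in the paper.

```python
MACS_PER_INFERENCE = 336e6      # PicoSAM2 workload size given in the abstract

# Figures reported in this review
IMX500_MAC_PER_CYCLE = 86.2     # utilization
IMX500_MMAC_PER_JOULE = 1359.6  # energy efficiency
STM32N6_LATENCY_S = 13.7e-3     # lowest raw latency

# Cycles and energy implied for one IMX500 inference (assumed plain ratios)
implied_cycles = MACS_PER_INFERENCE / IMX500_MAC_PER_CYCLE
implied_energy_j = MACS_PER_INFERENCE / (IMX500_MMAC_PER_JOULE * 1e6)

# Effective throughput implied by the STM32N6 latency
stm32n6_mac_per_s = MACS_PER_INFERENCE / STM32N6_LATENCY_S

print(f"IMX500 implied cycles per inference: {implied_cycles:,.0f}")
print(f"IMX500 implied energy per inference: {implied_energy_j:.3f} J")
print(f"STM32N6 implied throughput: {stm32n6_mac_per_s / 1e9:.1f} GMAC/s")
```

Under these assumptions the IMX500 figures correspond to about 3.9 million cycles and roughly 0.25 J per inference, and the STM32N6 latency to about 24.5 GMAC/s of effective throughput.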
Figure 2: Benchmarking results of PicoSAM2 [ 25 ] , comparing its energy efficiency, latency, inference efficiency, and energy–delay product (EDP) on GAP9, STM32N6, and IMX500. The results highlight the advantages of in-sensor compute for improved energy efficiency and latency.

This review was generated by AI and checked by a human editor.