QUICK REVIEW

[Paper Review] Performance Analysis of Edge and In-Sensor AI Processors: A Comparative Review

Luigi Capogrosso, Pietro Bonazzi|arXiv (Cornell University)|Feb 18, 2026

Advanced Memory and Neural Computing | Citations: 0

One-line summary

The paper surveys ultra-low-power edge AI processors and benchmarks PicoSAM2 on GAP9, STM32N6, and Sony IMX500, comparing latency, MAC/cycle, MAC/J, and Energy–Delay Product (EDP).

ABSTRACT

This review examines the rapidly evolving landscape of ultra-low-power edge processors, covering heterogeneous Systems-on-Chips (SoCs), neural accelerators, near-sensor and in-sensor architectures, and emerging dataflow and memory-centric designs. We categorize commercially available and research-grade platforms according to their compute paradigms, power envelopes, and memory hierarchies, and analyze their suitability for always-on and latency-critical Artificial Intelligence (AI) workloads. To complement the architectural overview with empirical evidence, we benchmark a 336 million Multiply-Accumulate (MAC) segmentation model (PicoSAM2) on three representative processors: GAP9, leveraging a multi-core RISC-V architecture augmented with hardware accelerators; the STM32N6, which pairs an advanced ARM Cortex-M55 core with a dedicated neural architecture accelerator; and the Sony IMX500, representing in-sensor stacked-Complementary Metal-Oxide-Semiconductor (CMOS) compute. Collectively, these platforms span MCU-class, embedded neural accelerator, and in-sensor paradigms. The evaluation reports latency, inference efficiency, energy efficiency, and energy-delay product. The results show a clear divergence in hardware behavior, with the IMX500 achieving the highest utilization (86.2 MAC/cycle) and the lowest energy-delay product, highlighting the growing significance and technological maturity of in-sensor processing. GAP9 offers the best energy efficiency within microcontroller-class power budgets, and the STM32N6 provides the lowest raw latency at a significantly higher energy cost. Together, the review and benchmarks provide a unified view of the current design directions and practical trade-offs that are shaping the next generation of ultra-low-power and in-sensor AI processors.

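Motivation and objectives

Motivates the need for energy-efficient on-device AI, driven by latency, privacy, and sensor requirements at the edge.
Characterizes the landscape of ultra-low-power edge processors, including MCU-class devices, embedded accelerators, and in-sensor compute.
Provides empirical benchmarks that expose design trade-offs under a realistic workload.
Offers guidance on architecture selection for always-on and latency-critical AI workloads.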

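Proposed method

Surveys commercial and research-grade edge AI platforms by compute paradigm, power envelope, and memory hierarchy.
Benchmarks the 336 MMAC PicoSAM2 segmentation model on three processors (GAP9, STM32N6, IMX500) using cycle-accurate profiling and power measurements.
Evaluates four hardware-centric metrics: latency per inference, MAC/cycle, MAC/J, and Energy–Delay Product (EDP).
Reports qualitative and quantitative insights into utilization, dataflow efficiency, and memory bottlenecks across architectures.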
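
The four metrics listed above can be related to raw measurements in a few lines. The sketch below is a minimal illustration in Python. It assumes the common definitions MAC/cycle = MACs / (latency × clock frequency), MAC/J = MACs / energy per inference, and EDP = energy × latency. This review does not state the exact formulas the authors used, so treat these as assumptions. The names `InferenceMetrics` and `compute_metrics`, and the latency, clock, and energy values in the usage example, are hypothetical placeholders and not measured results.

```python
from dataclasses import dataclass

# Model size quoted in the abstract: 336 million multiply-accumulates.
PICOSAM2_MACS = 336e6


@dataclass
class InferenceMetrics:
    latency_s: float      # seconds per inference
    mac_per_cycle: float  # utilization: MACs retired per clock cycle
    mac_per_joule: float  # energy efficiency
    edp_js: float         # energy-delay product in joule-seconds


def compute_metrics(macs, latency_s, clock_hz, energy_j):
    """Derive the four metrics from one measured inference.

    Assumed definitions (not confirmed by the source):
      MAC/cycle = macs / (latency * clock), MAC/J = macs / energy,
      EDP = energy * latency.
    """
    if min(macs, latency_s, clock_hz, energy_j) <= 0:
        raise ValueError("all inputs must be positive")
    cycles = latency_s * clock_hz
    return InferenceMetrics(
        latency_s=latency_s,
        mac_per_cycle=macs / cycles,
        mac_per_joule=macs / energy_j,
        edp_js=energy_j * latency_s,
    )


if __name__ == "__main__":
    # Hypothetical inputs chosen only to exercise the formulas.
    m = compute_metrics(PICOSAM2_MACS, latency_s=0.010,
                        clock_hz=400e6, energy_j=0.002)
    print(f"MAC/cycle: {m.mac_per_cycle:.1f}")
    print(f"MMAC/J:    {m.mac_per_joule / 1e6:.1f}")
    print(f"EDP (J*s): {m.edp_js:.2e}")
```

Because EDP multiplies energy by latency, a platform can have the lowest latency and still lose on EDP if its energy per inference is much higher, which is the kind of trade-off the benchmark is designed to expose.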

Figure 1: Peak performance in TOPS vs. power consumption of publicly announced AI accelerators and processors. Data are from [10, 11, 12, 13].

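Experimental results

Research questions

RQ1: What practical performance and energy-efficiency characteristics do heterogeneous ultra-low-power edge processors exhibit when running a representative segmentation model?
RQ2: How do in-sensor, MCU-class, and embedded neural accelerator platforms compare in terms of utilization, latency, energy per inference, and EDP?
RQ3: Which architectural factors (memory hierarchy, dataflow, data movement) most affect on-device AI performance within a sub-200 mW budget?

Key findings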

HW Platform	Peak Perf. (TOPS)	Power (W)	Precision	HW Architecture	Efficiency (TOPS/W)
Netcast	1.00E+01	0.001	int8	Dataflow ASIC	1.00E+04
Ergo	4.00E+00	0.073	int8	Tensor ASIC	5.48E+01
Ethos N77	4.10E+00	0.800	int8	Tensor ASIC	5.13E+00
MX3	5.00E+00	1.000	fp16	Manycore ASIC	5.00E+00
Tianjic	1.21E+00	0.950	int8	Neuromorphic	1.27E+00
AML200	2.00E+00	0.100	int8	Analog In-Memory	2.00E+01
GAP9	1.51E-01	0.0640	int8	RISC-V Manycore	2.36E+00
AIStorm	2.50E+00	0.225	int8	Analog Compute-in-Sensor	1.11E+01
Gyrfalcon	2.80E+00	0.224	int8	Manycore ASIC	1.25E+01
AML100	4.00E-01	0.020	int8	Analog In-Memory	2.00E+01
STM32N6	6.00E-01	0.200	int8	ARM Cortex-M55 + NPU	3.00E+00
Cortex-M85 (STM32V8/RA8)	1.30E-01	0.250	int8	ARM Cortex-M85	5.20E-01
NDP101	2.00E-01	0.010	int4	RISC-V + HW Acc	2.00E+01
NDP200	6.20E-03	0.010	int8	RISC-V + HW Acc	6.20E-01
NDP250	3.00E-02	0.100	int8	RISC-V + HW Acc	3.00E-01
IMX500	7.952E-02	0.016	int8	Manycore ASIC	4.97E+00
Max 78000	5.60E-02	0.028	int8	Tensor Accelerator MCU	2.00E+00
GAP8	2.27E-02	0.100	int8	RISC-V Manycore	2.27E-01
Eyeriss	6.72E-02	0.278	int16	Dataflow ASIC	2.42E-01
ShiDianNao	1.94E-01	0.320	int16	Dataflow ASIC	6.06E-01
DianNao	4.52E-01	0.485	int16	Dataflow ASIC	9.32E-01
PuDianNao	1.06E+00	0.596	int16	Dataflow ASIC	1.78E+00
EIE	1.02E-01	0.600	int16	Dataflow ASIC (Sparse)	1.70E-01
K210	2.50E-01	0.300	int8	RISC-V Dual Core + KPU	8.33E-01
Kendryte K210	2.30E-01	0.300	int8	RISC-V Dual Core + KPU	7.67E-01
TrueNorth	1.89E+00	0.500	int8	Neuromorphic	3.78E+00
KL520 NPU	3.00E-01	0.500	int8	Tensor ASIC	6.00E-01
xcore.ai	5.12E-02	1.000	int8	DSP-like Multicore	5.12E-02
KL720	1.40E+00	1.556	int8	Tensor ASIC	9.00E-01
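
The efficiency column in the table above is consistent with peak performance divided by power. As a quick sanity check, the sketch below recomputes TOPS/W for a few rows copied from the table and lists which of them fit a given power budget. The helper names `tops_per_watt` and `within_budget` and the choice of rows are illustrative assumptions. The peak TOPS figures are publicly announced numbers, so the result says nothing about measured throughput on a real workload.

```python
# (platform, peak TOPS, power in W), copied from the table above.
ROWS = [
    ("GAP9", 1.51e-01, 0.0640),
    ("STM32N6", 6.00e-01, 0.200),
    ("IMX500", 7.952e-02, 0.016),
    ("KL720", 1.40e+00, 1.556),
]


def tops_per_watt(peak_tops, power_w):
    """Peak efficiency as peak performance divided by power."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return peak_tops / power_w


def within_budget(rows, budget_w):
    """Return platform names whose listed power is at or below budget_w."""
    return [name for name, _, power_w in rows if power_w <= budget_w]


if __name__ == "__main__":
    for name, tops, power_w in ROWS:
        print(f"{name:8s} {tops_per_watt(tops, power_w):6.2f} TOPS/W")
    # 0.2 W matches the sub-200 mW budget named in RQ3.
    print("Within 200 mW:", within_budget(ROWS, 0.2))
```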

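The IMX500 achieves the highest compute density at 86.2 MAC/cycle and the lowest Energy–Delay Product among the tested platforms.
GAP9 delivers competitive MAC/J within an MCU-class power budget, underscoring its energy efficiency at low clock frequencies.
The STM32N6 delivers the lowest raw latency (13.7 ms), but at a significantly higher energy cost.
Owing to its in-sensor compute design, the IMX500 shows strong energy efficiency of 1359.6 MMAC/J relative to the other platforms.
GAP9 remains competitive for battery-constrained MCU-class deployments, the STM32N6 is latency-oriented with higher energy use, and the IMX500 demonstrates the benefits of in-sensor processing.
The benchmarks highlight clear design trade-offs among edge, near-sensor, and in-sensor architectures.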

Figure 2: Benchmarking results of PicoSAM2 [25], comparing its energy efficiency, latency, inference efficiency, and energy–delay product (EDP) on GAP9, STM32N6, and IMX500. The results highlight the advantages of in-sensor compute for improved energy efficiency and latency.

This review was generated by AI and checked by a human editor.