QUICK REVIEW

[Paper Review] Lightweight Transformer Architectures for Edge Devices in Real-Time Applications

Hema Hariharan Samson|arXiv (Cornell University)|Jan 5, 2026

Advanced Neural Network Applications | Citations: 0

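One-Line Summary

A comprehensive survey of lightweight transformer architectures for edge deployment. It details compression, quantization, pruning, and distillation techniques, and provides benchmarks on NLP and vision tasks along with guidance for hardware-aware deployment.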

ABSTRACT

The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in model compression, quantization, pruning, and knowledge distillation techniques. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT, providing detailed performance benchmarks on standard datasets such as GLUE, SQuAD, ImageNet-1K, and COCO. Our analysis encompasses current industry adoption patterns across major hardware platforms (NVIDIA Jetson, Qualcomm Snapdragon, Apple Neural Engine, ARM architectures), deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML), and optimization strategies. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x, enabling deployment on devices with as little as 2-5W power consumption. We identify sparse attention mechanisms, mixed-precision quantization (INT8/FP16), and hardware-aware neural architecture search as the most effective optimization strategies. Novel findings include memory-bandwidth bottleneck analysis revealing 15-40M parameter models achieve optimal hardware utilization (60-75% efficiency), quantization sweet spots for different model types, and comprehensive energy efficiency profiling across edge platforms. We establish real-time performance boundaries and provide a practical 6-step deployment pipeline achieving 8-12x size reduction with less than 2% accuracy degradation.

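Motivation and Objectives

Facilitate the deployment of transformer models for real-time AI applications on resource-constrained edge devices.
Analyze and compare lightweight transformer variants and their compression and optimization techniques.
Provide benchmarks on standard datasets and evaluate hardware platforms, deployment frameworks, and optimization tools.
Identify practical optimization strategies and production guidelines for edge deployment.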

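Proposed Method

A systematic review of lightweight transformer architectures designed for edge deployment.
Consolidation of benchmarks on NLP (GLUE, SQuAD) and vision (ImageNet-1K, COCO) tasks.
Analysis of hardware platforms (NVIDIA Jetson, Snapdragon, Apple Neural Engine, ARM) and deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML).
Evaluation of optimization techniques, including model compression, quantization, pruning, distillation, and hardware-aware NAS.
Extraction of practical deployment best practices and real-world case studies.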

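Experimental Results

Research Questions

RQ1: Which lightweight transformer architectures are best suited to real-time on-device inference?
RQ2: How do compression, quantization, pruning, and distillation affect accuracy, size, and latency on edge hardware?
RQ3: Which deployment frameworks and hardware platforms best support edge transformer inference?
RQ4: What best practices and guidelines achieve real-time performance on edge devices with minimal accuracy loss?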

Key Findings

Model	Parameters (M)	GLUE Score	SQuAD F1	Latency (ms)
BERT-base	110	79.5	88.5	580
DistilBERT	66	77.0	79.8	230
TinyBERT-4	14.5	77.0	82.1	62
TinyBERT-6	67	79.4	87.5	95
MobileBERT	25.3	77.7	90.3	62
MobileBERT	15.1	75.8	84.2	40
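To make the table easier to compare against the headline ranges, the short sketch below computes each model's parameter reduction and latency speedup relative to BERT-base. The numbers are copied from the table; the function name `ratios_vs_baseline` and the row label for the second MobileBERT entry are illustrative choices, not from the paper. Note that a parameter-count ratio is only a proxy for on-disk size reduction, since it ignores any quantization.

```python
# Rows copied from the table above: (model, parameters in millions, latency in ms).
# The second MobileBERT row is labelled by its parameter count only to keep
# dictionary keys unique; the source table gives both rows the same name.
ROWS = [
    ("BERT-base", 110.0, 580.0),
    ("DistilBERT", 66.0, 230.0),
    ("TinyBERT-4", 14.5, 62.0),
    ("TinyBERT-6", 67.0, 95.0),
    ("MobileBERT", 25.3, 62.0),
    ("MobileBERT (15.1M row)", 15.1, 40.0),
]


def ratios_vs_baseline(rows, baseline="BERT-base"):
    """Return {model: (parameter_reduction, latency_speedup)} relative to the baseline."""
    base_params, base_latency = next(
        (params, latency) for name, params, latency in rows if name == baseline
    )
    result = {}
    for name, params, latency in rows:
        if name == baseline:
            continue
        result[name] = (
            round(base_params / params, 2),     # how many times fewer parameters
            round(base_latency / latency, 2),   # how many times lower latency
        )
    return result


if __name__ == "__main__":
    for model, (param_x, speed_x) in ratios_vs_baseline(ROWS).items():
        print(f"{model}: {param_x}x fewer parameters, {speed_x}x faster")
```

The ratios are relative to the BERT-base row only; they say nothing about the accuracy cost, which has to be read from the GLUE and SQuAD columns.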

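Lightweight transformers can retain 75–96% of full-model accuracy while reducing model size by 4–10× and latency by 3–9×.
Two-stage distillation (general + task-specific) provides the largest single improvement; a teacher/student parameter ratio of 4–6× is optimal.
Mixed-precision quantization (FP16 for sensitive layers, INT8 for dense transformations) gives a good balance of accuracy and efficiency; vision models are more robust to quantization than NLP models.
Hardware-aware neural architecture search that targets measured on-device latency produces models 20–30% faster than FLOP-optimized designs.
Edge transformer performance is often bound by memory bandwidth; the optimal range for mobile use is roughly 15–40M parameters (60–75% hardware efficiency).
EfficientFormer, MobileBERT, TinyBERT, and MobileViT show near-Pareto-optimal performance on vision and NLP tasks on mobile devices.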
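The memory-bandwidth and quantization findings can be connected with a back-of-the-envelope estimate. The sketch below is a simplified roofline-style lower bound, not the paper's analysis: it assumes every weight is read from memory exactly once per forward pass and ignores activations, caching, and compute time. The 25 GB/s bandwidth figure and the function names are hypothetical, chosen only for illustration.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_megabytes(params_millions, precision):
    """Weight storage in MB (1 MB = 1e6 bytes) for a model of the given size."""
    return params_millions * BYTES_PER_PARAM[precision]


def min_weight_read_ms(params_millions, precision, bandwidth_gb_s):
    """Lower bound on per-inference time (ms) if all weights are streamed once.

    bandwidth_gb_s is the sustained memory bandwidth in GB/s (1 GB = 1e9 bytes).
    Real latency is higher; this only captures the weight-traffic floor.
    """
    total_bytes = params_millions * 1e6 * BYTES_PER_PARAM[precision]
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3


if __name__ == "__main__":
    ASSUMED_BANDWIDTH_GB_S = 25.0  # hypothetical device, not a figure from the paper
    for params_m in (110.0, 25.3):  # BERT-base and MobileBERT sizes from the table
        for precision in ("FP32", "FP16", "INT8"):
            mb = weight_megabytes(params_m, precision)
            ms = min_weight_read_ms(params_m, precision, ASSUMED_BANDWIDTH_GB_S)
            print(f"{params_m}M @ {precision}: {mb:.1f} MB, >= {ms:.2f} ms")
```

Under these assumptions, halving the bytes per parameter halves the weight-traffic floor, which is one way to see why precision and parameter count both matter on bandwidth-limited devices. The estimate is a floor only and should not be compared directly with the measured latencies in the table.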

This review was generated by AI and checked by a human editor.