QUICK REVIEW

[Paper Review] Lightweight Transformer Architectures for Edge Devices in Real-Time Applications

Hema Hariharan Samson | arXiv (Cornell University) | 2026-01-05

Advanced Neural Network Applications | Citations: 0

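One-Line Summary

A comprehensive survey of lightweight transformer architectures for edge deployment. It covers compression, quantization, pruning, and distillation techniques in detail, and provides benchmarks across NLP and vision tasks along with guidance on hardware-aware deployment.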

ABSTRACT

The deployment of transformer-based models on resource-constrained edge devices represents a critical challenge in enabling real-time artificial intelligence applications. This comprehensive survey examines lightweight transformer architectures specifically designed for edge deployment, analyzing recent advances in model compression, quantization, pruning, and knowledge distillation techniques. We systematically review prominent lightweight variants including MobileBERT, TinyBERT, DistilBERT, EfficientFormer, EdgeFormer, and MobileViT, providing detailed performance benchmarks on standard datasets such as GLUE, SQuAD, ImageNet-1K, and COCO. Our analysis encompasses current industry adoption patterns across major hardware platforms (NVIDIA Jetson, Qualcomm Snapdragon, Apple Neural Engine, ARM architectures), deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML), and optimization strategies. Experimental results demonstrate that modern lightweight transformers can achieve 75-96% of full-model accuracy while reducing model size by 4-10x and inference latency by 3-9x, enabling deployment on devices with as little as 2-5W power consumption. We identify sparse attention mechanisms, mixed-precision quantization (INT8/FP16), and hardware-aware neural architecture search as the most effective optimization strategies. Novel findings include memory-bandwidth bottleneck analysis revealing 15-40M parameter models achieve optimal hardware utilization (60-75% efficiency), quantization sweet spots for different model types, and comprehensive energy efficiency profiling across edge platforms. We establish real-time performance boundaries and provide a practical 6-step deployment pipeline achieving 8-12x size reduction with less than 2% accuracy degradation.

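Motivation and Objectives

Facilitate the deployment of transformer models for real-time AI applications on resource-constrained edge devices.
Analyze and compare lightweight transformer variants and their compression and optimization techniques.
Provide benchmarks on standard datasets and evaluate hardware platforms, deployment frameworks, and optimization tools.
Identify effective optimization strategies and practical deployment guidelines for the edge.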

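Methodology

Systematic review of lightweight transformer architectures designed for edge deployment.
Synthesis of benchmarks on NLP (GLUE, SQuAD) and vision (ImageNet-1K, COCO) tasks.
Analysis of hardware platforms (NVIDIA Jetson, Snapdragon, Apple Neural Engine, ARM) and deployment frameworks (TensorFlow Lite, ONNX Runtime, PyTorch Mobile, CoreML).
Evaluation of optimization techniques, including model compression, quantization, pruning, distillation, and hardware-aware NAS.
Extraction of practical deployment best practices and real-world case studies.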
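
Of the optimization techniques listed above, quantization is the easiest to illustrate in a few lines. The sketch below shows symmetric per-tensor INT8 quantization in plain Python. It is a generic textbook scheme, not necessarily the one evaluated in the paper, and the function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Assumption: one shared scale for the whole tensor, chosen so the
    largest absolute value maps to 127.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range used here.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [-1.0, 0.0, 0.3, 1.0]
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    for w, r in zip(weights, restored):
        print(f"original {w:+.4f}  restored {r:+.4f}  error {abs(w - r):.5f}")
```

Each restored value differs from the original by at most half a quantization step, which is why layers with a wide dynamic range tend to be the ones that lose the most accuracy under INT8.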

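Experimental Results

Research Questions

RQ1: Which lightweight transformer architectures are most effective for real-time inference at the edge?
RQ2: How do compression, quantization, pruning, and distillation affect accuracy, size, and latency on edge hardware?
RQ3: Which deployment frameworks and hardware platforms best support edge transformer inference?
RQ4: What best practices and guidelines achieve real-time performance on edge devices with minimal accuracy loss?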

Key Results

Model	Parameters (M)	GLUE Score	SQuAD F1	Latency (ms)
BERT-base	110	79.5	88.5	580
DistilBERT	66	77.0	79.8	230
TinyBERT-4	14.5	77.0	82.1	62
TinyBERT-6	67	79.4	87.5	95
MobileBERT	25.3	77.7	90.3	62
MobileBERT	15.1	75.8	84.2	40
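
To make the table easier to compare against the headline ranges, the following sketch computes each model's size reduction, speedup, and retained GLUE score relative to BERT-base. It uses only the values in the table and treats parameter count as a proxy for model size, which is an assumption. The function name `relative_to_baseline` is hypothetical, and the parameter-count suffixes on the two MobileBERT rows are added here only to tell them apart.

```python
# Rows copied from the table above: (label, parameters in millions, GLUE score, latency in ms).
ROWS = [
    ("BERT-base", 110.0, 79.5, 580.0),
    ("DistilBERT", 66.0, 77.0, 230.0),
    ("TinyBERT-4", 14.5, 77.0, 62.0),
    ("TinyBERT-6", 67.0, 79.4, 95.0),
    ("MobileBERT (25.3M)", 25.3, 77.7, 62.0),  # suffix added for disambiguation
    ("MobileBERT (15.1M)", 15.1, 75.8, 40.0),  # suffix added for disambiguation
]


def relative_to_baseline(rows, baseline="BERT-base"):
    """Return size reduction, speedup, and GLUE retention versus the baseline row."""
    _, base_params, base_glue, base_latency = next(r for r in rows if r[0] == baseline)
    metrics = {}
    for name, params, glue, latency in rows:
        metrics[name] = {
            "size_reduction": base_params / params,    # parameter count as a size proxy
            "speedup": base_latency / latency,
            "glue_retained_pct": 100.0 * glue / base_glue,
        }
    return metrics


if __name__ == "__main__":
    for name, m in relative_to_baseline(ROWS).items():
        print(
            f"{name:20s} size {m['size_reduction']:.2f}x  "
            f"speed {m['speedup']:.2f}x  GLUE {m['glue_retained_pct']:.1f}%"
        )
```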

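Lightweight transformers can retain 75-96% of full-model accuracy while reducing model size by 4-10× and latency by 3-9×.
Two-stage distillation (general knowledge distillation followed by task-specific distillation) provides the largest single improvement, and the optimal teacher-to-student parameter ratio is 4-6×.
Mixed-precision quantization, with FP16 for sensitive layers and INT8 for dense transformations, offers the best accuracy-efficiency trade-off; vision models tolerate quantization better than NLP models.
Hardware-aware neural architecture search that targets measured on-device latency yields models 20-30% faster than FLOP-optimal designs.
Memory bandwidth often limits the performance of edge transformers; the optimal range for mobile hardware utilization is roughly 15-40M parameters (60-75% efficiency).
EfficientFormer, MobileBERT, TinyBERT, and MobileViT form strong Pareto fronts for vision and NLP tasks on mobile hardware.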
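
The memory-bandwidth and precision findings above can be related with simple arithmetic. The sketch below estimates the weight footprint of a model at a given precision and a lower bound on the time needed to stream those weights from memory once. The bandwidth figure is a hypothetical placeholder, not a number from the paper, and the estimate ignores activations, caches, and compute time.

```python
BITS_PER_WEIGHT = {"FP32": 32, "FP16": 16, "INT8": 8}

# Hypothetical memory bandwidth for illustration only; substitute the
# measured figure for the target device.
ASSUMED_BANDWIDTH_GB_PER_S = 25.0


def weight_megabytes(params_millions, precision):
    """Weight storage in MB (1 MB = 1e6 bytes) for a given precision."""
    bits = BITS_PER_WEIGHT[precision]
    return params_millions * 1e6 * bits / 8 / 1e6


def min_weight_stream_ms(params_millions, precision, bandwidth_gb_per_s):
    """Lower bound on the time to read every weight from memory once, in ms."""
    total_bytes = weight_megabytes(params_millions, precision) * 1e6
    return total_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


if __name__ == "__main__":
    # 15M and 40M bracket the parameter range reported above; 110M is BERT-base.
    for params in (15.0, 40.0, 110.0):
        for precision in ("FP32", "FP16", "INT8"):
            mb = weight_megabytes(params, precision)
            ms = min_weight_stream_ms(params, precision, ASSUMED_BANDWIDTH_GB_PER_S)
            print(f"{params:6.1f}M params  {precision}  {mb:7.1f} MB  >= {ms:6.2f} ms per pass")
```

Because the bound scales linearly with bytes per weight, moving from FP32 to INT8 cuts both the footprint and the minimum streaming time by 4×, independent of the bandwidth value assumed.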


This review was generated by AI and reviewed by a human editor.