[Paper Review] Fast YOLO: A Fast You Only Look Once System for Real-time Embedded Object Detection in Video
Fast YOLO accelerates YOLOv2 for embedded video by combining evolutionary network optimization with motion-adaptive inference, achieving a ~3.3x speedup and ~38% fewer deep inferences (≈18 FPS on an Nvidia Jetson TX1) with ~2.8x fewer parameters and a ~2% IOU drop.
Object detection is considered one of the most challenging problems in the field of computer vision, as it involves the combination of object classification and object localization within a scene. Recently, deep neural networks (DNNs) have been demonstrated to achieve superior object detection performance compared to other approaches, with YOLOv2 (an improved You Only Look Once model) being one of the state-of-the-art DNN-based object detection methods in terms of both speed and accuracy. Although YOLOv2 can achieve real-time performance on a powerful GPU, it remains very challenging to leverage this approach for real-time object detection in video on embedded computing devices with limited computational power and limited memory. In this paper, we propose a new framework called Fast YOLO, a fast You Only Look Once framework which accelerates YOLOv2 to perform object detection in video on embedded devices in real time. First, we leverage the evolutionary deep intelligence framework to evolve the YOLOv2 network architecture and produce an optimized architecture (referred to as O-YOLOv2 here) that has 2.8X fewer parameters with just a ~2% IOU drop. To further reduce power consumption on embedded devices while maintaining performance, a motion-adaptive inference method is introduced into the proposed Fast YOLO framework to reduce the frequency of deep inference with O-YOLOv2 based on temporal motion characteristics. Experimental results show that the proposed Fast YOLO framework can reduce the number of deep inferences by an average of 38.13% and achieve an average speedup of ~3.3X for object detection in video compared to the original YOLOv2, leading Fast YOLO to run at an average of ~18 FPS on an Nvidia Jetson TX1 embedded system.
Motivation & Objective
- Reduce computational and memory demands of YOLOv2 for embedded devices while maintaining detection performance.
- Automatically optimize network architecture to be ~2.8x smaller with minimal IOU loss.
- Introduce motion-adaptive inference to decrease deep inferences and power consumption in video processing.
Proposed method
- Use evolutionary deep intelligence to synthesize an optimized architecture (O-YOLOv2) with ~2.8x fewer parameters and ~2% IOU drop.
- Construct an image stack (I_t, I_ref) and apply a 1x1 convolution to generate a motion probability map.
- Apply a motion-adaptive inference module to decide whether to perform deep inference on a frame.
- If deep inference is needed, run O-YOLOv2 to update class probability maps and update I_ref and reference maps; otherwise reuse the reference maps.
- Evaluate optimized model on Pascal VOC 2007 to compare parameter counts and IOU against YOLOv2; evaluate video runtime on Nvidia Jetson TX1 to assess FPS and deep-inference frequency.
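The control flow described in the steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 1x1 convolution weights, the `tanh` squashing, the mean-based decision rule, the `threshold` value, and the names `motion_probability_map` and `MotionAdaptiveDetector` are all assumptions made for the example, and `detector` is a stand-in for O-YOLOv2. Frames are assumed to be single-channel float arrays in [0, 1].

```python
import numpy as np


def motion_probability_map(frame, ref, weights=(1.0, -1.0), bias=0.0, gain=10.0):
    """Apply a 1x1 convolution to the stack (I_t, I_ref) and squash to [0, 1).

    The weights (a simple frame difference) and the tanh squashing are
    hypothetical choices for illustration only.
    """
    stack = np.stack([frame, ref], axis=-1).astype(np.float64)  # H x W x 2
    # A 1x1 convolution is a per-pixel dot product over the channel axis.
    response = stack @ np.asarray(weights, dtype=np.float64) + bias
    return np.tanh(gain * np.abs(response))


class MotionAdaptiveDetector:
    """Runs the deep detector only when enough motion is observed."""

    def __init__(self, detector, threshold=0.05):
        self.detector = detector      # stand-in for O-YOLOv2 (any callable)
        self.threshold = threshold    # assumed decision threshold
        self.ref_frame = None         # I_ref
        self.ref_maps = None          # reference class probability maps
        self.deep_inferences = 0

    def process(self, frame):
        # The first frame always needs deep inference: no reference exists yet.
        needs_inference = self.ref_frame is None
        if not needs_inference:
            motion = motion_probability_map(frame, self.ref_frame)
            needs_inference = motion.mean() > self.threshold
        if needs_inference:
            # Run the deep detector, then update I_ref and the reference maps.
            self.ref_maps = self.detector(frame)
            self.ref_frame = frame
            self.deep_inferences += 1
        # Otherwise the reference maps are reused unchanged.
        return self.ref_maps
```

With a static scene, `process` returns the cached maps and `deep_inferences` stops growing; the fraction of frames that skip the detector corresponds to the reduction in deep inferences that the paper reports.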
Experimental results
Research questions
- RQ1: Can evolutionary synthesis yield a compact yet effective YOLOv2-based network (O-YOLOv2) suitable for embedded devices?
- RQ2: Does motion-adaptive inference reduce the number of deep inferences and power consumption while maintaining detection performance in video streams?
- RQ3: What is the resulting speedup and resource usage when deploying Fast YOLO on an embedded platform compared to YOLOv2?
- RQ4: How does O-YOLOv2 compare to YOLOv2 in terms of parameters and IOU on standard benchmarks?
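Since RQ4 is measured in IOU (intersection over union), a short sketch of the standard metric for two axis-aligned boxes may help. The function name `box_iou` and the `(x_min, y_min, x_max, y_max)` box format are assumptions for this example; the review does not specify the paper's exact evaluation code.

```python
def box_iou(a, b):
    """IOU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, giving an IOU of 1/7.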
Key findings
- O-YOLOv2 has ~2.8x fewer parameters than YOLOv2 with only a ~2% IOU drop (65.10% for O-YOLOv2 vs 67.2% for YOLOv2).
- Fast YOLO reduces deep inferences by 38.13% on average and achieves a ~3.3x speedup over YOLOv2 on a Jetson TX1 (≈18 FPS).
- Fast YOLO reduces the average per-frame runtime from 184 ms (YOLOv2) to 56 ms.
- On Pascal VOC 2007, O-YOLOv2 maintains competitive detection performance with substantially fewer parameters.
- The framework combines an optimized architecture with motion-aware inference, reducing power consumption and enabling real-time embedded video detection.
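The reported runtime, speedup, and frame-rate figures are mutually consistent, which a few lines of arithmetic confirm. The variable names are illustrative; the only inputs are the 184 ms and 56 ms per-frame runtimes quoted above.

```python
yolo_ms = 184.0  # average per-frame runtime of YOLOv2 (from the review)
fast_ms = 56.0   # average per-frame runtime of Fast YOLO (from the review)

speedup = yolo_ms / fast_ms   # ratio of per-frame runtimes
fast_fps = 1000.0 / fast_ms   # frames per second implied by 56 ms per frame

print(f"speedup ≈ {speedup:.1f}x, throughput ≈ {fast_fps:.0f} FPS")
# prints: speedup ≈ 3.3x, throughput ≈ 18 FPS
```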
This review was created by AI and reviewed by human editors.