[Paper Explained] ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation
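ENet is a lightweight encoder-decoder network designed for real-time semantic segmentation. It uses far fewer parameters and FLOPs than existing models, achieves competitive and in some cases better accuracy on the Cityscapes, CamVid, and SUN datasets, and can run on embedded hardware.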
The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ less FLOPs, has 79$\times$ less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.
Motivation and Goals
- Address the need for real-time pixel-wise semantic segmentation on low-power/mobile devices.
- Develop an efficient encoder-decoder network with a small memory footprint and fast inference.
- Explore design choices that preserve spatial information while maintaining speed.
- Benchmark ENet on Cityscapes, CamVid, and SUN, including embedded hardware performance.
Proposed Method
- Introduce ENet architecture with bottleneck blocks and an encoder-decoder structure.
- Use early downsampling with parallel pooling to preserve information flow and improve speed.
- Employ dilated and asymmetric convolutions to enlarge receptive field without excessive computation.
- Replace ReLU with PReLU non-linearities to improve information flow, especially in early layers.
- Apply Spatial Dropout and avoid bias terms in projections to reduce memory/compute.
- Adopt kernel fusion considerations and avoid extensive post-processing to enable end-to-end fast inference.
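The design choices listed above can be made concrete with a little arithmetic. The following is a minimal Python sketch, not the authors' implementation: all function names are hypothetical, the dilation rates in the usage lines are example values, and the initial-block filter counts (a 13-filter stride-2 convolution concatenated with a max-pooled copy of the 3-channel input) are taken from the ENet paper rather than from this summary.

```python
from typing import Tuple


def effective_kernel(k: int, dilation: int) -> int:
    """Side length of the area covered by one k x k dilated convolution."""
    return k + (k - 1) * (dilation - 1)


def weights_per_channel_pair(n: int, asymmetric: bool) -> int:
    """Weights linking one input channel to one output channel.

    A full n x n kernel has n * n weights; an n x 1 kernel followed by a
    1 x n kernel (asymmetric convolution) has 2 * n.
    """
    return 2 * n if asymmetric else n * n


def prelu(x: float, a: float) -> float:
    """PReLU: identity for positive inputs, learned slope `a` otherwise."""
    return x if x > 0 else a * x


def initial_block_shape(height: int, width: int,
                        in_channels: int = 3,
                        conv_filters: int = 13) -> Tuple[int, int, int]:
    """Output (channels, height, width) of the parallel-downsampling block.

    Assumption: the stride-2 convolution branch and the 2 x 2 max-pooling
    branch both halve the resolution, and their outputs are concatenated
    along the channel axis.
    """
    return (conv_filters + in_channels, height // 2, width // 2)


if __name__ == "__main__":
    for rate in (1, 2, 4):  # example dilation rates
        print("dilation", rate, "covers", effective_kernel(3, rate), "pixels per side")
    print("5x5 weights:", weights_per_channel_pair(5, asymmetric=False))
    print("5x1 + 1x5 weights:", weights_per_channel_pair(5, asymmetric=True))
    print("initial block on 512x512:", initial_block_shape(512, 512))
```

The point of the sketch is that dilation widens the covered area without adding weights, while the asymmetric factorization cuts the weight count of a large kernel from quadratic to linear in its side length.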
Experimental Results
Research Questions
- RQ1: Can ENet achieve real-time semantic segmentation on embedded hardware while maintaining competitive accuracy on standard benchmarks?
- RQ2: What architectural choices (downsampling strategy, dilated/asymmetric convolutions, non-linearities) best balance speed and accuracy for ENet?
- RQ3: How does ENet perform on Cityscapes, CamVid, and SUN compared to SegNet and other baselines?
- RQ4: What are the hardware requirements and potential software limitations affecting ENet’s practical deployment?
Key Findings
- ENet achieves substantially lower FLOPs (3.83 GFLOPs) and parameters (0.37M) than SegNet (286.03 GFLOPs, 29.46M parameters), with a model size of about 0.7 MB (fp16).
- On TX1 embedded hardware, ENet runs at 21.1 fps (480×320) and 14.6 fps (640×360), far exceeding SegNet’s speed on the same platform.
- On Titan X, ENet maintains real-time performance with competitive accuracy (Cityscapes: class IoU 58.3 vs SegNet 56.1; Cityscapes category IoU 80.4 vs 79.8).
- Cityscapes test results show ENet achieving higher class IoU and competitive category IoU compared with SegNet, while being the fastest model in the Cityscapes benchmark at the time.
- CamVid results indicate that ENet outperforms several baselines on multiple classes, with a competitive mean IoU.
- SUN RGB-D results show that ENet’s global average and class average accuracy are lower than SegNet’s, but ENet still offers a meaningful real-time performance advantage on RGB data.
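As a quick consistency check, the ratios quoted in the abstract can be approximately recovered from the figures listed above. This is a small Python sketch that uses only numbers reported in this summary; the variable names are illustrative, and the fp16 size estimate assumes 2 bytes per parameter with no other storage overhead.

```python
# Figures as reported in this summary.
SEGNET_GFLOPS, ENET_GFLOPS = 286.03, 3.83
SEGNET_PARAMS_M, ENET_PARAMS_M = 29.46, 0.37
TX1_FPS = {"480x320": 21.1, "640x360": 14.6}

# How many times fewer operations and parameters ENet needs than SegNet.
flops_ratio = SEGNET_GFLOPS / ENET_GFLOPS
param_ratio = SEGNET_PARAMS_M / ENET_PARAMS_M

# Assumption: fp16 storage costs 2 bytes per parameter, nothing else.
fp16_size_mb = ENET_PARAMS_M * 1e6 * 2 / 1e6

# Frames per second converted to per-frame latency in milliseconds.
latency_ms = {res: 1000.0 / fps for res, fps in TX1_FPS.items()}

print(f"FLOPs ratio: {flops_ratio:.1f}x")
print(f"Parameter ratio: {param_ratio:.1f}x")
print(f"Estimated fp16 model size: {fp16_size_mb:.2f} MB")
for res, ms in latency_ms.items():
    print(f"TX1 latency at {res}: {ms:.1f} ms per frame")
```

The ratios come out at roughly 74.7 and 79.6, in line with the abstract's 75$\times$ and 79$\times$ once rounding of the reported figures is taken into account, and the fp16 estimate of about 0.74 MB agrees with the quoted model size of about 0.7 MB.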
This summary was generated by AI and reviewed by a human editor.