QUICK REVIEW

[Paper Review] ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Adam Paszke, Abhishek Chaurasia | arXiv (Cornell University) | Jun 7, 2016

Advanced Neural Network Applications | Cited by 1,258

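One-Sentence Summary

ENet is a lightweight encoder-decoder network designed for real-time semantic segmentation. It uses far fewer parameters and FLOPs than existing models, achieves competitive or even better accuracy on the Cityscapes, CamVid, and SUN datasets, and can run on embedded hardware.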

ABSTRACT

The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ less FLOPs, has 79$\times$ less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.

Research Motivation and Objectives

Address the need for real-time pixel-wise semantic segmentation on low-power/mobile devices.
Develop an efficient encoder-decoder network with a small memory footprint and fast inference.
Explore design choices that preserve spatial information while maintaining speed.
Benchmark ENet on Cityscapes, CamVid, and SUN, including embedded hardware performance.

Proposed Method

Introduce the ENet architecture, an encoder-decoder structure built from bottleneck blocks.
Use early downsampling with parallel pooling to preserve information flow and improve speed.
Employ dilated and asymmetric convolutions to enlarge receptive field without excessive computation.
Replace ReLU with PReLU non-linearities to improve information flow, especially in early layers.
Regularize with Spatial Dropout, and omit bias terms in projections to reduce memory operations and compute.
Account for kernel fusion opportunities and avoid extensive post-processing so that inference stays fast end to end.
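
The savings behind the dilated and asymmetric convolutions listed above can be checked with simple arithmetic. The sketch below is illustrative only: the helper names, the channel width of 32, and the dilation rates are assumptions chosen for the example, not values taken from this summary. It compares the weight count of a full 5×5 convolution with a 5×1 convolution followed by a 1×5 convolution, and shows how dilation widens the span of a 3-tap kernel without adding weights.

```python
def conv_weights(kh, kw, c_in, c_out):
    """Weight count of a bias-free 2D convolution with a kh x kw kernel."""
    return kh * kw * c_in * c_out


def effective_kernel(k, dilation):
    """Span in pixels covered by a k-tap kernel whose taps sit `dilation` apart."""
    return dilation * (k - 1) + 1


channels = 32  # hypothetical channel width, chosen only for illustration

# Full 5x5 convolution versus an asymmetric pair (5x1 followed by 1x5).
full_5x5 = conv_weights(5, 5, channels, channels)
asymmetric = (conv_weights(5, 1, channels, channels)
              + conv_weights(1, 5, channels, channels))
print(full_5x5, asymmetric)  # 25600 10240

# A 3x3 kernel keeps 9 weights per channel pair at every dilation rate,
# but the region it spans grows with the rate (rates here are assumed).
spans = {d: effective_kernel(3, d) for d in (1, 2, 4, 8, 16)}
print(spans)  # {1: 3, 2: 5, 4: 9, 8: 17, 16: 33}
```

In this example the asymmetric pair needs 40% of the weights of the full 5×5 kernel while covering the same 5×5 window, which is the kind of trade the method relies on.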

Experimental Results

Research Questions

RQ1: Can ENet achieve real-time semantic segmentation on embedded hardware while maintaining competitive accuracy on standard benchmarks?
RQ2: What architectural choices (downsampling strategy, dilated/asymmetric convolutions, non-linearities) best balance speed and accuracy for ENet?
RQ3: How does ENet perform on Cityscapes, CamVid, and SUN compared to SegNet and other baselines?
RQ4: What are the hardware requirements and potential software limitations affecting ENet’s practical deployment?

Key Findings

ENet requires far fewer FLOPs (3.83 GFLOPs) and parameters (0.37M) than SegNet (286.03 GFLOPs, 29.46M parameters), with a model size of about 0.7 MB (fp16).
On TX1 embedded hardware, ENet runs at 21.1 fps (480×320) and 14.6 fps (640×360), far exceeding SegNet’s speed on the same platform.
On a Titan X, ENet maintains real-time performance with competitive accuracy on Cityscapes (class IoU 58.3 vs SegNet's 56.1; category IoU 80.4 vs 79.8).
On the Cityscapes test set, ENet achieves higher class IoU than SegNet and competitive category IoU, while being the fastest model in the Cityscapes benchmark at the time.
CamVid results indicate ENet outperforms several baselines on multiple classes, with a competitive mean IoU.
On SUN RGB-D, ENet's global average and class average accuracy are lower than SegNet's, but it still offers meaningful real-time performance advantages on RGB data.
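
The headline ratios can be reproduced from the figures listed above. The following sketch uses only the numbers reported in this review; the variable and function names are made up for the example, and the fp16 size estimate assumes 2 bytes per parameter with no other storage overhead.

```python
# Figures reported in the findings above.
segnet_gflops, enet_gflops = 286.03, 3.83
segnet_params_m, enet_params_m = 29.46, 0.37  # millions of parameters

flops_ratio = segnet_gflops / enet_gflops      # roughly 75x fewer FLOPs
param_ratio = segnet_params_m / enet_params_m  # roughly 79x fewer parameters

# Assumption: fp16 storage costs 2 bytes per parameter, nothing else.
fp16_size_mb = enet_params_m * 1e6 * 2 / 1e6


def frame_time_ms(fps):
    """Per-frame latency in milliseconds implied by a frame rate."""
    return 1000.0 / fps


print(f"FLOPs ratio: {flops_ratio:.1f}x")             # FLOPs ratio: 74.7x
print(f"Parameter ratio: {param_ratio:.1f}x")         # Parameter ratio: 79.6x
print(f"fp16 model size: {fp16_size_mb:.2f} MB")      # fp16 model size: 0.74 MB
print(f"TX1 480x320: {frame_time_ms(21.1):.1f} ms")   # TX1 480x320: 47.4 ms
print(f"TX1 640x360: {frame_time_ms(14.6):.1f} ms")   # TX1 640x360: 68.5 ms
```

These values line up with the 75× and 79× figures quoted in the abstract and with the stated model size of about 0.7 MB.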

This review was generated by AI and checked by human editors.