QUICK REVIEW

[Paper Review] ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation

Adam Paszke, Abhishek Chaurasia|arXiv (Cornell University)|Jun 7, 2016
Advanced Neural Network Applications | Cited by 1,258
One-Sentence Summary

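ENet is a lightweight encoder-decoder network designed for real-time semantic segmentation. It uses far fewer parameters and FLOPs than existing models, achieves competitive or even superior accuracy on the Cityscapes, CamVid, and SUN datasets, and can run on embedded hardware.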

ABSTRACT

The ability to perform pixel-wise semantic segmentation in real-time is of paramount importance in mobile applications. Recent deep neural networks aimed at this task have the disadvantage of requiring a large number of floating point operations and have long run-times that hinder their usability. In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18$\times$ faster, requires 75$\times$ less FLOPs, has 79$\times$ less parameters, and provides similar or better accuracy to existing models. We have tested it on CamVid, Cityscapes and SUN datasets and report on comparisons with existing state-of-the-art methods, and the trade-offs between accuracy and processing time of a network. We present performance measurements of the proposed architecture on embedded systems and suggest possible software improvements that could make ENet even faster.

Motivation and Objectives

  • Address the need for real-time pixel-wise semantic segmentation on low-power/mobile devices.
  • Develop an efficient encoder-decoder network with a small memory footprint and fast inference.
  • Explore design choices that preserve spatial information while maintaining speed.
  • Benchmark ENet on Cityscapes, CamVid, and SUN, including embedded hardware performance.

Proposed Method

  • Introduce ENet architecture with bottleneck blocks and an encoder-decoder structure.
  • Use early downsampling with parallel pooling to preserve information flow and improve speed.
  • Employ dilated and asymmetric convolutions to enlarge receptive field without excessive computation.
  • Replace ReLU with PReLU non-linearities to improve information flow, especially in early layers.
  • Apply Spatial Dropout and avoid bias terms in projections to reduce memory/compute.
  • Adopt kernel fusion considerations and avoid extensive post-processing to enable end-to-end fast inference.
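
The savings behind several of these design choices can be checked with simple counting. The following is a minimal Python sketch, not the authors' code: the helper names, the 32-channel layer width, and the kernel sizes are illustrative assumptions, and the 13-filter convolution branch in the downsampling example is recalled from the ENet paper rather than stated in this summary.

```python
def conv_weights(c_in, c_out, k_h, k_w):
    """Weights in a bias-free 2-D convolution with a k_h x k_w kernel."""
    return c_in * c_out * k_h * k_w


def asymmetric_conv_weights(channels, k):
    """A k x 1 convolution followed by a 1 x k convolution (same width throughout)."""
    return conv_weights(channels, channels, k, 1) + conv_weights(channels, channels, 1, k)


def dilated_extent(k, dilation):
    """Side length of the input window spanned by a k x k kernel at this dilation."""
    return dilation * (k - 1) + 1


def parallel_downsample_shape(height, width, c_in, c_conv):
    """Output (channels, height, width) when a stride-2 convolution with c_conv
    filters is concatenated with a 2 x 2 max-pool of the c_in input channels."""
    return (c_in + c_conv, height // 2, width // 2)


def prelu(x, slope):
    """PReLU on a scalar: identity for x >= 0, learned slope otherwise."""
    return x if x >= 0 else slope * x


if __name__ == "__main__":
    channels = 32  # hypothetical layer width, chosen only for illustration
    print("5x5 full conv weights: ", conv_weights(channels, channels, 5, 5))
    print("5x1 + 1x5 conv weights:", asymmetric_conv_weights(channels, 5))
    print("3x3 kernel, dilation 4 spans", dilated_extent(3, 4), "pixels per side")
    print("downsampled shape:", parallel_downsample_shape(512, 512, c_in=3, c_conv=13))
    print("prelu(-2.0, slope=0.25) =", prelu(-2.0, 0.25))
```

Under these assumed sizes, the asymmetric pair needs 10,240 weights against 25,600 for the full 5×5 kernel, and a 3×3 kernel at dilation 4 covers a 9×9 window with no extra weights.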

Experimental Results

Research Questions

  • RQ1: Can ENet achieve real-time semantic segmentation on embedded hardware while maintaining competitive accuracy on standard benchmarks?
  • RQ2: What architectural choices (downsampling strategy, dilated/asymmetric convolutions, non-linearities) best balance speed and accuracy for ENet?
  • RQ3: How does ENet perform on Cityscapes, CamVid, and SUN compared to SegNet and other baselines?
  • RQ4: What are the hardware requirements and potential software limitations affecting ENet’s practical deployment?

Key Findings

  • ENet achieves substantially lower FLOPs (3.83 GFLOPs) and parameters (0.37M) than SegNet (286.03 GFLOPs, 29.46M parameters), with a model size of about 0.7 MB (fp16).
  • On TX1 embedded hardware, ENet runs at 21.1 fps (480×320) and 14.6 fps (640×360), far exceeding SegNet’s speed on the same platform.
  • On Titan X, ENet maintains real-time performance, and its Cityscapes accuracy is competitive (class IoU 58.3 vs. SegNet's 56.1; category IoU 80.4 vs. 79.8).
  • Cityscapes test results show ENet achieving higher class IoU and competitive category IoU compared with SegNet, while being the fastest model in the Cityscapes benchmark at the time.
  • CamVid results indicate ENet outperforms several baselines on multiple classes, with competitive mean IoU.
  • SUN RGB-D results show ENet’s global average and class average accuracy are lower than SegNet’s, but ENet still offers meaningful real-time performance advantages for RGB data.
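
The headline ratios can be sanity-checked from the figures listed above. This is a small Python sketch of that arithmetic only; the variable names are illustrative, and the assumption that fp16 storage costs 2 bytes per parameter (with 1 MB taken as 10^6 bytes) is mine rather than the summary's.

```python
# Figures copied from the findings above.
segnet = {"gflops": 286.03, "params_millions": 29.46}
enet = {"gflops": 3.83, "params_millions": 0.37}
tx1_fps = {"480x320": 21.1, "640x360": 14.6}

# How many times fewer FLOPs and parameters ENet needs than SegNet.
flops_ratio = segnet["gflops"] / enet["gflops"]
param_ratio = segnet["params_millions"] / enet["params_millions"]

# Assumption: fp16 stores each parameter in 2 bytes; 1 MB = 10**6 bytes.
BYTES_PER_FP16 = 2
fp16_size_mb = enet["params_millions"] * 1e6 * BYTES_PER_FP16 / 1e6

# Frames per second converted to per-frame latency in milliseconds.
tx1_latency_ms = {res: 1000.0 / fps for res, fps in tx1_fps.items()}

if __name__ == "__main__":
    print(f"FLOPs ratio:      {flops_ratio:.1f}x")
    print(f"Parameter ratio:  {param_ratio:.1f}x")
    print(f"fp16 model size:  {fp16_size_mb:.2f} MB")
    for res, ms in tx1_latency_ms.items():
        print(f"TX1 latency at {res}: {ms:.1f} ms per frame")
```

The computed ratios come out at roughly 74.7 and 79.6, in line with the 75× and 79× quoted in the abstract, and the parameter count alone accounts for about 0.74 MB in fp16, consistent with the reported model size of about 0.7 MB.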


This review was generated by AI and reviewed by a human editor.