QUICK REVIEW

[Paper Review] Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning

Soheil Bahrampour, Naveen Ramakrishnan|arXiv (Cornell University)|Nov 19, 2015

Advanced Neural Network Applications | References: 7 | Cited by: 103

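Summary in Brief

This paper compares Caffe, Neon, Theano, and Torch on extensibility, hardware utilization, and speed across several deep learning architectures, on both multi-threaded CPU and GPU (NVIDIA Titan X). The main findings: Torch performs best on CPU and also on GPU for large convolutional and fully connected networks; Theano leads on GPU for LSTM training and deployment; and Caffe is the easiest framework for evaluating the performance of standard architectures.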

ABSTRACT

Deep learning methods have resulted in significant performance improvements in several application domains and as such several software frameworks have been developed to facilitate their implementation. This paper presents a comparative study of four deep learning frameworks, namely Caffe, Neon, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed. The study is performed on several types of deep learning architectures and we evaluate the performance of the above frameworks when employed on a single machine for both (multi-threaded) CPU and GPU (Nvidia Titan X) settings. The speed performance metrics used here include the gradient computation time, which is important during the training phase of deep networks, and the forward time, which is important from the deployment perspective of trained networks. For convolutional networks, we also report how each of these frameworks support various convolutional algorithms and their corresponding performance. From our experiments, we observe that Theano and Torch are the most easily extensible frameworks. We observe that Torch is best suited for any deep architecture on CPU, followed by Theano. It also achieves the best performance on the GPU for large convolutional and fully connected networks, followed closely by Neon. Theano achieves the best performance on GPU for training and deployment of LSTM networks. Finally Caffe is the easiest for evaluating the performance of standard deep architectures.

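Research Motivation and Objectives

Evaluate and compare four mainstream deep learning frameworks (Caffe, Neon, Theano, and Torch) on extensibility, hardware utilization, and speed.
Evaluate framework performance across several deep learning architectures, including convolutional and recurrent networks.
Measure gradient computation time (training efficiency) and forward time (deployment performance) as the key metrics, in both CPU and GPU settings.
Analyze how each framework supports different convolutional algorithms and how fast those algorithms run.
Identify the most suitable framework for specific use cases, such as CPU inference, GPU training, or evaluation of standard architectures.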

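Proposed Method

Benchmark each framework on a single machine in both a multi-threaded CPU setting and a GPU (NVIDIA Titan X) setting.
Evaluate performance on several deep learning architectures, including convolutional, fully connected, and LSTM networks.
Use gradient computation time as the key metric for training efficiency and forward time as the measure of deployment performance.
Evaluate each framework's support for different convolutional algorithms and the speed of those algorithms.
Use standardized datasets and network configurations so that comparisons across frameworks are consistent.
Focus on training (gradient computation) and inference (forward pass) workloads to reflect real-world usage patterns.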
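The two speed metrics above can be illustrated with a minimal timing sketch. This is not the authors' benchmark code: it assumes NumPy on CPU, a hypothetical two-layer fully connected network, and arbitrary layer sizes, and the helper names (`forward`, `gradients`, `time_call`) are made up for illustration. It only shows how forward time and gradient computation time can be measured separately, with warm-up runs excluded from the average.

```python
import time

import numpy as np


def forward(params, x):
    """Forward pass of a two-layer fully connected net; returns output and cache."""
    w1, b1, w2, b2 = params
    h_pre = x @ w1 + b1
    h = np.maximum(h_pre, 0.0)  # ReLU activation
    out = h @ w2 + b2
    return out, (h_pre, h)


def gradients(params, x, y):
    """Forward plus backward pass for a mean-squared-error loss."""
    w1, b1, w2, b2 = params
    out, (h_pre, h) = forward(params, x)
    d_out = 2.0 * (out - y) / out.size  # d(loss)/d(out) for mean((out - y)^2)
    g_w2 = h.T @ d_out
    g_b2 = d_out.sum(axis=0)
    d_h = (d_out @ w2.T) * (h_pre > 0)  # back through the ReLU
    g_w1 = x.T @ d_h
    g_b1 = d_h.sum(axis=0)
    return g_w1, g_b1, g_w2, g_b2


def time_call(fn, *args, warmup=2, repeats=10):
    """Average wall-clock seconds per call, after a few untimed warm-up runs."""
    for _ in range(warmup):
        fn(*args)
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats


rng = np.random.default_rng(0)
batch, n_in, n_hidden, n_out = 64, 256, 512, 10  # hypothetical sizes
params = (
    rng.standard_normal((n_in, n_hidden)) * 0.01,
    np.zeros(n_hidden),
    rng.standard_normal((n_hidden, n_out)) * 0.01,
    np.zeros(n_out),
)
x = rng.standard_normal((batch, n_in))
y = rng.standard_normal((batch, n_out))

forward_time = time_call(forward, params, x)
gradient_time = time_call(gradients, params, x, y)
print(f"forward time:  {forward_time * 1e3:.3f} ms per batch")
print(f"gradient time: {gradient_time * 1e3:.3f} ms per batch")
```

The gradient measurement here includes the forward pass it depends on, which is one reasonable convention; whether the paper reports it the same way is not stated in this summary.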

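Experimental Results

Research Questions

RQ1: Which framework is the most extensible for custom deep learning architectures?
RQ2: How well does each framework utilize the hardware for training and inference on CPU and GPU?
RQ3: Which framework achieves the fastest gradient computation time on large deep neural networks?
RQ4: Which framework offers the best forward-pass performance when deploying models?
RQ5: How well does each framework support the various convolutional algorithms, and how do those algorithms perform?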
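To make RQ5 concrete: the same convolution layer can be computed by different algorithms that give the same result at different speeds. This summary does not list which algorithms the paper compares, so the sketch below simply picks two well-known ones, a direct sliding-window computation and an FFT-based one, as an illustration. It assumes NumPy, a single-channel 2-D input, and "valid" output size; the function names are hypothetical.

```python
import numpy as np


def conv2d_direct(image, kernel):
    """'Valid' 2-D cross-correlation computed with explicit sliding windows."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.empty((ih - kh + 1, iw - kw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out


def conv2d_fft(image, kernel):
    """Same result via the convolution theorem (pointwise product of spectra)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    # Cross-correlation equals convolution with the kernel flipped on both axes.
    flipped = kernel[::-1, ::-1]
    shape = (ih + kh - 1, iw + kw - 1)  # size of the full linear convolution
    spectrum = np.fft.rfft2(image, s=shape) * np.fft.rfft2(flipped, s=shape)
    full = np.fft.irfft2(spectrum, s=shape)
    # Keep only the region where the kernel fully overlaps the image.
    return full[kh - 1:ih, kw - 1:iw]


rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((5, 5))
print(np.allclose(conv2d_direct(image, kernel), conv2d_fft(image, kernel)))
```

Which variant is faster depends on kernel size, input size, and hardware, which is why per-algorithm timings are informative when comparing frameworks.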

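Key Findings

Theano and Torch are the most easily extensible frameworks, making it relatively simple to modify and support custom network structures.
On CPU, Torch is best suited for every deep architecture evaluated, followed by Theano.
On GPU, Torch achieves the best performance for large convolutional and fully connected networks, followed closely by Neon.
Theano achieves the best GPU performance for both training and deployment of LSTM networks.
Caffe is the easiest framework for evaluating the performance of standard deep architectures.
Rankings vary by network type; no single framework leads across all metrics and architectures.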


This review was generated by AI and reviewed by human editors.