QUICK REVIEW

[Paper Review] GPU-acceleration for Large-scale Tree Boosting

Huan Zhang, Si Si|arXiv (Cornell University)|Jun 26, 2017

Machine Learning and Data Classification | References: 12 | Cited by: 61

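One-Sentence Summary

The paper proposes a GPU-based histogram method for accelerating decision tree construction in GBDT and random forests. It achieves large speedups over CPU-based histogram and exact-split methods while maintaining accuracy.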

ABSTRACT

In this paper, we present a novel massively parallel algorithm for accelerating the decision tree building procedure on GPUs (Graphics Processing Units), which is a crucial step in Gradient Boosted Decision Tree (GBDT) and random forests training. Previous GPU based tree building algorithms are based on parallel multi-scan or radix sort to find the exact tree split, and thus suffer from scalability and performance issues. We show that using a histogram based algorithm to approximately find the best split is more efficient and scalable on GPU. By identifying the difference between classical GPU-based image histogram construction and the feature histogram construction in decision tree training, we develop a fast feature histogram building kernel on GPU with carefully designed computational and memory access sequence to reduce atomic update conflict and maximize GPU utilization. Our algorithm can be used as a drop-in replacement for histogram construction in popular tree boosting systems to improve their scalability. As an example, to train GBDT on epsilon dataset, our method using a main-stream GPU is 7-8 times faster than histogram based algorithm on CPU in LightGBM and 25 times faster than the exact-split finding algorithm in XGBoost on a dual-socket 28-core Xeon server, while achieving similar prediction accuracy.
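
The histogram-based split search that the abstract contrasts with exact-split finding can be sketched on the CPU as follows. This is a minimal illustration, not the paper's GPU kernel: the function names are hypothetical, and the gain formula is the standard second-order GBDT gain with an L2 term `lam`, which is assumed here because the abstract does not state it.

```python
from typing import List, Tuple


def build_histogram(bins: List[int], grads: List[float], hess: List[float],
                    num_bins: int) -> Tuple[List[float], List[float]]:
    """Accumulate gradient and hessian sums per bin for a single feature."""
    g_hist = [0.0] * num_bins
    h_hist = [0.0] * num_bins
    for b, g, h in zip(bins, grads, hess):
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist


def best_split(g_hist: List[float], h_hist: List[float],
               lam: float = 1.0) -> Tuple[float, int]:
    """Scan bin boundaries and return (best_gain, threshold_bin).

    Samples whose bin index is <= threshold_bin go to the left child.
    Returns (0.0, -1) when no boundary yields a positive gain.
    The gain formula is an assumption (standard second-order GBDT gain).
    """
    g_total, h_total = sum(g_hist), sum(h_hist)
    parent_score = g_total ** 2 / (h_total + lam)
    best_gain, best_bin = 0.0, -1
    g_left = h_left = 0.0
    for b in range(len(g_hist) - 1):
        g_left += g_hist[b]
        h_left += h_hist[b]
        g_right = g_total - g_left
        h_right = h_total - h_left
        gain = (g_left ** 2 / (h_left + lam)
                + g_right ** 2 / (h_right + lam)
                - parent_score)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin


# Toy data: 8 samples of one feature, already quantized into 4 bins.
bins = [0, 0, 1, 1, 2, 2, 3, 3]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
hess = [1.0] * 8
g_hist, h_hist = build_histogram(bins, grads, hess, num_bins=4)
gain, threshold_bin = best_split(g_hist, h_hist)
```

The scan costs one pass over the bins rather than over the sorted samples, so the split search is cheap once the histogram exists. Histogram construction is then the dominant cost, which is the step the paper moves to the GPU.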

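Motivation and Objectives

Motivation: finding leaf splits is computationally expensive, which creates a need for scalable GPU acceleration of decision tree ensembles.
Propose a histogram-based GPU algorithm that approximates the best split and improves scalability.
Integrate the GPU histogram method into LightGBM and benchmark it against CPU and GPU baselines.
Demonstrate speedups and memory efficiency on large-scale datasets across different GPU architectures.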

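Proposed Method

Develop a histogram-based method on the GPU that uses feature histograms to approximate leaf splits in GBDT.
Build multiple histograms at each step to reduce atomic update conflicts and maximize GPU utilization.
Pack features into small tuples stored in a 4-byte representation so that histograms fit in local memory and global memory accesses are minimized.
Use a small number of bins (e.g., 64) to increase parallelism and reduce the memory footprint without sacrificing accuracy.
Provide a drop-in GPU histogram implementation, integrated into LightGBM, to support large-scale training.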
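
The feature-packing step can be illustrated with a small serial sketch. It assumes four 8-bit bin indices per 32-bit word, a layout chosen here for illustration since the summary only mentions "small tuples" in a 4-byte representation; all function names are hypothetical. On a GPU the accumulation would use atomic updates to local memory, whereas here it is an ordinary loop.

```python
from typing import List

FEATURES_PER_WORD = 4  # assumption: four 8-bit bin indices per 32-bit word


def pack_row(bin_indices: List[int]) -> List[int]:
    """Pack one row's bin indices (each 0..255) into 32-bit words."""
    words = []
    for start in range(0, len(bin_indices), FEATURES_PER_WORD):
        word = 0
        chunk = bin_indices[start:start + FEATURES_PER_WORD]
        for offset, b in enumerate(chunk):
            if not 0 <= b <= 0xFF:
                raise ValueError("bin index must fit in one byte")
            word |= b << (8 * offset)
        words.append(word)
    return words


def unpack_word(word: int) -> List[int]:
    """Recover the four bin indices stored in one packed word."""
    return [(word >> (8 * offset)) & 0xFF
            for offset in range(FEATURES_PER_WORD)]


def histograms_from_packed(packed_rows: List[List[int]], grads: List[float],
                           num_features: int,
                           num_bins: int) -> List[List[float]]:
    """Build one gradient histogram per feature, reading each word once."""
    hists = [[0.0] * num_bins for _ in range(num_features)]
    for words, g in zip(packed_rows, grads):
        for w_idx, word in enumerate(words):
            for offset, b in enumerate(unpack_word(word)):
                feature = w_idx * FEATURES_PER_WORD + offset
                if feature < num_features:  # skip padding in the last word
                    hists[feature][b] += g
    return hists


rows = [[1, 2, 3, 4, 5], [1, 0, 3, 0, 5]]  # bin indices for 5 features
grads = [0.5, 1.5]
packed = [pack_row(r) for r in rows]
hists = histograms_from_packed(packed, grads, num_features=5, num_bins=64)
```

One 4-byte read serves four features at once, which is the kind of memory-access saving the method description refers to. With 64 bins, each per-feature histogram is also small enough that several can plausibly be kept in fast on-chip memory at the same time.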

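Experimental Results

Research Questions

RQ1: For large-scale GBDT training, can histogram-based split finding on the GPU outperform exact-split methods on both GPU and CPU?
RQ2: What memory and thread-conflict considerations arise when building many feature histograms in parallel on the GPU?
RQ3: How does reducing the number of bins affect training speed and model accuracy across diverse datasets?
RQ4: Can the GPU histogram method scale to multi-GPU configurations and to datasets that exceed what CPUs can handle?

Key Findings

GPU histogram-based tree construction is substantially faster than the CPU histogram method (7-8x faster on the epsilon dataset with 63-bin histograms).
The GPU histogram method outperforms exact-split methods on both CPU and GPU, reaching a training speedup of roughly 25x on some datasets.
Despite reduced precision and a smaller bin count, the method retains prediction metrics (AUC, NDCG) comparable to CPU-based methods.
GPU memory usage is low (at most 1 GB on an 8 GB GPU for all datasets), which makes it possible to train on datasets larger than Higgs on a single GPU.
Using a small number of bins (e.g., 64) increases throughput on the tested datasets without sacrificing accuracy.
The exact-split GPU method is constrained by memory and scales worse than the histogram-based GPU method.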
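
As a rough illustration of how the 63-bin, reduced-precision setting might be requested from LightGBM's Python package, the sketch below builds a parameter dictionary. The parameter names `device_type`, `max_bin`, and `gpu_use_dp` are documented LightGBM parameters; the helper name, the binary objective, and the default values are assumptions made for this example and are not the paper's benchmark configuration.

```python
def gpu_histogram_params(max_bin: int = 63,
                         double_precision: bool = False) -> dict:
    """Return LightGBM parameters for GPU histogram training (illustrative)."""
    return {
        "objective": "binary",           # assumption: binary classification
        "device_type": "gpu",            # use the GPU histogram implementation
        "max_bin": max_bin,              # small bin count, as in the findings
        "gpu_use_dp": double_precision,  # False selects single precision
    }


params = gpu_histogram_params()

# With a GPU-enabled LightGBM build installed, training would look like:
#   import lightgbm as lgb
#   booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
# where X and y are the caller's feature matrix and labels.
```

Whether a given bin count and precision preserve accuracy depends on the dataset, so it is worth comparing validation metrics against a CPU run before adopting these values.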


This review was generated by AI and checked by human editors.