QUICK REVIEW

[Paper Review] An investigation into machine learning approaches for forecasting spatio-temporal demand in ride-hailing service

Ismaïl Saadi, Melvin Wong|arXiv (Cornell University)|Mar 7, 2017

Transportation and Mobility Innovations | References: 13 | Cited by: 29

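One-Sentence Summary

This study proposes and evaluates several machine learning models, specifically gradient-boosted trees, random forests, neural networks, and ensemble decision trees, for forecasting short-term spatio-temporal demand in ride-hailing services, using real DiDi Chuxing data from January 2016. The gradient-boosted tree model achieved the lowest error (RMSE = 16.41), outperforming the other models while limiting over-fitting, and shows strong predictive capability for balancing supply and demand in urban transportation systems.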

ABSTRACT

In this paper, we present machine learning approaches for characterizing and forecasting the short-term demand for on-demand ride-hailing services. We propose a spatio-temporal estimation of the demand as a function of variable effects related to traffic, pricing and weather conditions. With respect to the methodology, a single decision tree, bootstrap-aggregated (bagged) decision trees, random forest, boosted decision trees, and an artificial neural network for regression have been adapted and systematically compared using various statistics, e.g. R-square, Root Mean Square Error (RMSE), and slope. To better assess the quality of the models, they have been tested on a real case study using the data of DiDi Chuxing, the main on-demand ride-hailing service provider in China. In the current study, 199,584 time-slots describing the spatio-temporal ride-hailing demand were extracted with an aggregated time interval of 10 minutes. All the methods are trained and validated on the basis of two independent samples from this dataset. The results revealed that boosted decision trees provide the best prediction accuracy (RMSE=16.41), while avoiding the risk of over-fitting, followed by artificial neural network (20.09), random forest (23.50), bagged decision trees (24.29) and single decision tree (33.55).
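The three statistics named in the abstract can be computed from paired actual and predicted demand values. The sketch below is illustrative only: it assumes R-square is the coefficient of determination (1 minus residual over total sum of squares) and that "slope" is the least-squares slope of predicted against actual values. The paper may define either differently, and the function name `regression_stats` is hypothetical.

```python
import math


def regression_stats(actual, predicted):
    """Return (r_squared, rmse, slope) for paired observations.

    Assumed definitions (not confirmed by the source):
      r_squared = 1 - SS_res / SS_tot
      slope     = least-squares slope of predicted regressed on actual
    """
    n = len(actual)
    if n != len(predicted) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")

    mean_actual = sum(actual) / n
    mean_predicted = sum(predicted) / n

    # Residual and total sums of squares.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values are constant; R-square and slope are undefined")

    rmse = math.sqrt(ss_res / n)
    r_squared = 1.0 - ss_res / ss_tot

    # Covariance over variance gives the fitted slope; 1.0 means no
    # systematic under- or over-prediction as demand grows.
    covariance = sum(
        (a - mean_actual) * (p - mean_predicted) for a, p in zip(actual, predicted)
    )
    slope = covariance / ss_tot
    return r_squared, rmse, slope


# Toy values, not taken from the DiDi dataset.
r2, rmse, slope = regression_stats([10, 20, 30, 40], [12, 18, 33, 37])
```

A slope below 1 together with a high R-square would suggest that a model tracks demand well but compresses the peaks, which is why the three statistics are reported together rather than RMSE alone.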

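Research Motivation and Objectives

Develop and compare machine learning models for forecasting short-term spatio-temporal demand in on-demand ride-hailing services.
Identify the model that predicts demand fluctuations across time and geographic areas most accurately and efficiently.
Assess how external variables (such as traffic, pricing, and weather) affect ride-hailing demand patterns.
Provide a scalable, non-parametric modeling framework for high-dimensional, complex, and skewed demand data.
Help ride-hailing platforms proactively manage supply-demand imbalances during peak and off-peak periods.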

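Proposed Method

The study uses 199,584 ten-minute time slots from DiDi Chuxing's service in January 2016, aggregated by district.
Feature selection uses the RReliefF method to identify the most relevant predictors, including traffic conditions, pricing, and weather conditions.
Five regression-based machine learning models are evaluated: a single decision tree, bagged decision trees, random forest, gradient-boosted decision trees (GBDT), and an artificial neural network (ANN).
Model performance is assessed with standard regression metrics: the coefficient of determination (R-squared), root mean square error (RMSE), and the slope between predicted and actual values.
Models are trained and validated on two independent data samples to check robustness and generalization.
Computational efficiency and over-fitting risk are also assessed; GBDT and ANN show a good trade-off between running time and generalization.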
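The aggregation step above can be sketched as follows. At 10-minute resolution a day has 144 slots, and 199,584 is exactly 144 × 1,386, which is consistent with 1,386 complete district-days, although this summary does not state how many districts or days were used. The record layout (a district identifier plus a timestamp string), the 1-based slot numbering, and the function names are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from datetime import datetime

SLOT_MINUTES = 10
SLOTS_PER_DAY = 24 * 60 // SLOT_MINUTES  # 144 slots per day


def slot_index(ts):
    """1-based index of the 10-minute slot that contains timestamp ts."""
    return (ts.hour * 60 + ts.minute) // SLOT_MINUTES + 1


def aggregate_demand(orders):
    """Count orders per (district, date, slot).

    orders: iterable of (district_id, "YYYY-MM-DD HH:MM:SS") pairs.
    This layout is hypothetical, chosen only for this sketch.
    """
    counts = Counter()
    for district_id, stamp in orders:
        ts = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        counts[(district_id, ts.date().isoformat(), slot_index(ts))] += 1
    return counts


# Toy records, not real DiDi data.
orders = [
    ("D01", "2016-01-01 00:03:10"),
    ("D01", "2016-01-01 00:09:59"),
    ("D01", "2016-01-01 00:10:00"),
    ("D02", "2016-01-01 23:59:59"),
]
demand = aggregate_demand(orders)
```

Each key of `demand` corresponds to one row of the modeling table; predictors such as traffic, pricing, and weather would be joined onto the same (district, date, slot) key.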

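Experimental Results

Research Questions

RQ1: Which machine learning model achieves the highest predictive accuracy for short-term spatio-temporal ride-hailing demand?
RQ2: How do external factors such as traffic, pricing, and weather affect demand patterns across districts and time periods?
RQ3: How do ensemble models compare with single models and neural networks in accuracy and computational efficiency?
RQ4: Can non-parametric models handle the high dimensionality and left-skewed distribution of real-world ride-hailing demand data?
RQ5: Why was SVM excluded from this study even though it has been used in similar forecasting tasks?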

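Key Findings

Gradient-boosted decision trees (GBDT) achieved the best predictive accuracy, with an RMSE of 16.41, outperforming all other models.
The artificial neural network (ANN) ranked second with an RMSE of 20.09, indicating strong but not the best predictive performance.
Random forest (RMSE = 23.50) and bagged decision trees (RMSE = 24.29) had higher errors, indicating weaker generalization on complex demand patterns.
The single decision tree performed worst (RMSE = 33.55), indicating severe over-fitting and poor generalization on this dataset.
Support vector machines (SVM) were dropped because of excessive computational cost, with running time growing exponentially with dataset size.
The RReliefF feature selection method identified the relevant predictors, improving model performance and interpretability.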
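To give some intuition for why boosted trees can improve on a single tree, here is a minimal pure-Python sketch of least-squares boosting with one-split regression stumps on a single feature. It is not the authors' implementation: the toy demand curve, the stump learner, the number of rounds, and the learning rate are all assumptions, and it compares training error only, so it says nothing about the generalization results reported above.

```python
import math


def fit_stump(xs, targets):
    """Best one-split regression stump on a single feature (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left)
        sse += sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, n_rounds=200, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return base, stumps, learning_rate


def boosted_predict(model, x):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def rmse(ys, preds):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))


# Toy two-peak daily demand curve by hour; not real data.
hours = list(range(24))
demand = [5 + 20 * math.sin(math.pi * h / 12) ** 2 for h in hours]

single = fit_stump(hours, demand)
single_rmse = rmse(demand, [stump_predict(single, h) for h in hours])

model = fit_boosted(hours, demand)
boosted_rmse = rmse(demand, [boosted_predict(model, h) for h in hours])
```

On this toy curve a single split cannot capture two peaks, while the sum of many small stumps can approximate them step by step. Whether that advantage carries over to held-out data depends on the number of rounds and the learning rate, which is where the over-fitting check mentioned in the findings comes in.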

This review was generated by AI and reviewed by human editors.