[Paper Review] Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction
DMVST-Net jointly models spatial, temporal, and semantic relations to predict taxi demand, improving over state-of-the-art baselines on large-scale Guangzhou data.
Taxi demand prediction is an important building block to enabling intelligent transportation systems in a smart city. An accurate prediction model can help the city pre-allocate resources to meet travel demand and to reduce empty taxis on streets which waste energy and worsen the traffic congestion. With the increasing popularity of taxi requesting services such as Uber and Didi Chuxing (in China), we are able to collect large-scale taxi demand data continuously. How to utilize such big data to improve the demand prediction is an interesting and critical real-world problem. Traditional demand prediction methods mostly rely on time series forecasting techniques, which fail to model the complex non-linear spatial and temporal relations. Recent advances in deep learning have shown superior performance on traditionally challenging tasks such as image classification by learning the complex features and correlations from large-scale data. This breakthrough has inspired researchers to explore deep learning techniques on traffic prediction problems. However, existing methods on traffic prediction have only considered spatial relation (e.g., using CNN) or temporal relation (e.g., using LSTM) independently. We propose a Deep Multi-View Spatial-Temporal Network (DMVST-Net) framework to model both spatial and temporal relations. Specifically, our proposed model consists of three views: temporal view (modeling correlations between future demand values with near time points via LSTM), spatial view (modeling local spatial correlation via local CNN), and semantic view (modeling correlations among regions sharing similar temporal patterns). Experiments on large-scale real taxi demand data demonstrate effectiveness of our approach over state-of-the-art methods.
Motivation & Objective
- Predict taxi demand accurately so that cities can pre-allocate resources and reduce the number of empty taxis on the streets.
- Propose a unified framework that integrates spatial, temporal, and semantic correlations for demand forecasting.
- Leverage local spatial modeling, sequential temporal dynamics, and semantic region similarities to improve predictions.
Proposed method
- Three-view DMVST-Net framework combining a local CNN for near-region spatial dependencies, an LSTM for temporal sequence modeling, and a semantic graph embedding for region similarities.
- Local CNN operates on S×S neighborhood images centered on each target region, capturing local spatial patterns with parameters shared across regions.
- Temporal view uses LSTM to model sequential demand with context features concatenated as input.
- Semantic view builds a region similarity graph using Dynamic Time Warping on weekly demand patterns, embeds nodes via LINE, and integrates embeddings into prediction.
- Final prediction combines LSTM output with semantic embeddings through a fully connected network ending with a sigmoid output, followed by denormalization to actual demand values.
- Loss combines mean squared error (MSE) with a weighted squared relative error term (a squared analogue of MAPE), balancing accuracy on large demand values against relative accuracy on small ones.
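The semantic-view similarity and the combined loss from the list above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The function names, the exponential mapping from DTW distance to edge weight, the default `alpha` and `gamma` values, and the `eps` guard against division by zero are all assumptions made for the example.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D demand patterns.

    Uses the classic O(len(a) * len(b)) dynamic program with an
    absolute-difference local cost.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[n, m])


def region_similarity(a, b, alpha=1.0):
    """Edge weight between two regions (assumed form: exp(-alpha * DTW))."""
    return float(np.exp(-alpha * dtw_distance(a, b)))


def combined_loss(y_true, y_pred, gamma=0.01, eps=1e-8):
    """Mean of squared error plus gamma times squared relative error.

    gamma (hypothetical default) weights the relative term; eps avoids
    division by zero when a true demand value is 0.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    squared = err ** 2
    relative = (err / (y_true + eps)) ** 2
    return float(np.mean(squared + gamma * relative))
```

Two patterns that differ only by a time stretch, such as `[1, 2, 3]` and `[1, 2, 2, 3]`, get a DTW distance of 0 and therefore the maximum similarity of 1. The resulting weights would populate the region graph that is then embedded (with LINE in the paper).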
Experimental results
Research questions
- RQ1: Can a unified deep framework capture local spatial, temporal, and semantic region correlations for taxi demand prediction?
- RQ2: Does incorporating semantic similarity among distant regions improve forecasting beyond local spatial and temporal models?
- RQ3: How do variants (temporal-only, with spatial, with semantic, with local CNN) compare in predictive accuracy?
- RQ4: What are the robustness characteristics of DMVST-Net across days of the week and varying LSTM sequence lengths?
Key findings
- DMVST-Net achieves the best overall performance with MAPE 0.1616 and RMSE 9.642, outperforming all baselines.
- Baseline methods include HA, ARIMA, OLSR, Ridge, Lasso, MLP, XGBoost, and ST-ResNet; DMVST-Net improves MAPE by 12.17% and RMSE by 3.70% relative to the best baseline.
- Ablation studies show temporal+semantic and temporal+local-spatial variants improve over temporal-only; the full multi-view model yields the best results (MAPE 0.1616, RMSE 9.642).
- The local CNN (LCNN) outperforms a simple neighbor-aggregate approach, underscoring the value of nonlinear local spatial modeling.
- Semantic view via region embeddings provides additional gains, reducing MAPE further when combined with temporal view.
- DMVST-Net is robust across days of the week: weekends are harder to predict, yet it shows the smallest weekend-to-weekday error increase.
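For readers who want to reproduce figures of this kind, the sketch below shows how MAPE, RMSE, and a relative improvement over a baseline are commonly computed. The function names are hypothetical and the numbers in the usage lines are toy values, not results from the paper. It also assumes that samples with zero true demand have been filtered out before computing MAPE.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.15 means 15%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumes every y_true is nonzero (filter low-demand samples first).
    return float(np.mean(np.abs(y_pred - y_true) / y_true))


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the demand values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def relative_improvement(baseline_error, model_error):
    """Fraction by which the model lowers the baseline's error."""
    return (baseline_error - model_error) / baseline_error


# Toy values for illustration only.
toy_true = [10.0, 20.0]
toy_pred = [12.0, 18.0]
toy_mape = mape(toy_true, toy_pred)
toy_rmse = rmse(toy_true, toy_pred)
```

Under this definition, a reported relative improvement of 12.17% in MAPE means the model's MAPE is 12.17% lower than the best baseline's MAPE, not 12.17 percentage points lower.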
This review was created by AI and reviewed by human editors.