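[Paper Explained] WeatherBench Probability: A benchmark dataset for probabilistic medium-range weather forecasting along with deep learning baseline models
This paper extends WeatherBench to probabilistic forecasting. It evaluates Monte Carlo (MC) dropout, parametric, and categorical neural network approaches against the operational IFS ensemble using probabilistic verification metrics.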
WeatherBench is a benchmark dataset for medium-range weather forecasting of geopotential, temperature and precipitation, consisting of preprocessed data, predefined evaluation metrics and a number of baseline models. WeatherBench Probability extends this to probabilistic forecasting by adding a set of established probabilistic verification metrics (continuous ranked probability score, spread-skill ratio and rank histograms) and a state-of-the-art operational baseline using the ECMWF IFS ensemble forecast. In addition, we test three different probabilistic machine learning methods -- Monte Carlo dropout, parametric prediction and categorical prediction, in which the probability distribution is discretized. We find that plain Monte Carlo dropout severely underestimates uncertainty. The parametric and categorical models both produce fairly reliable forecasts of similar quality. The parametric models have fewer degrees of freedom while the categorical model is more flexible when it comes to predicting non-Gaussian distributions. None of the models are able to match the skill of the operational IFS model. We hope that this benchmark will enable other researchers to evaluate their probabilistic approaches.
Research Motivation and Objectives
- Extend the WeatherBench benchmark to probabilistic forecasting for medium-range weather prediction.
- Introduce probabilistic verification metrics to assess forecast reliability and sharpness.
- Evaluate deep learning-based probabilistic models (MC dropout, parametric, categorical) against an operational ensemble baseline.
Proposed Method
- Use a deep ResNet-based architecture with 114 input channels derived from ERA5 data (variables across 7 levels, plus surface fields).
- Generate probabilistic forecasts via three approaches: Monte Carlo dropout (varying dropout rates and creating ensembles), parametric prediction (Gaussian for Z500, T850, T2M with CRPS loss), and categorical prediction (discretized bins with softmax and cross-entropy).
- Evaluate forecasts at 3-day lead time using probabilistic metrics: CRPS, spread-skill ratio, and rank histograms, plus deterministic RMSE for ensemble mean.
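The CRPS appears twice in the method above: as the training loss of the parametric (Gaussian) model and as a verification metric for ensembles. The following is a minimal NumPy sketch of both forms, not the authors' implementation. It uses the standard closed-form CRPS of a Gaussian and the standard sample-based ensemble estimator; the function names `crps_gaussian` and `crps_ensemble` are chosen here for illustration.

```python
import math
import numpy as np

# Element-wise error function built from the standard library.
_erf = np.vectorize(math.erf, otypes=[float])

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of a Gaussian forecast N(mu, sigma^2) against observation y."""
    mu, sigma, y = (np.asarray(a, dtype=float) for a in (mu, sigma, y))
    z = (y - mu) / sigma
    pdf = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)   # standard normal density at z
    cdf = 0.5 * (1.0 + _erf(z / np.sqrt(2.0)))         # standard normal CDF at z
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / np.sqrt(np.pi))

def crps_ensemble(members, y):
    """Sample-based CRPS: E|X - y| - 0.5 * E|X - X'|.

    members has shape (n_members, ...); y has shape (...).
    """
    members = np.asarray(members, dtype=float)
    y = np.asarray(y, dtype=float)
    abs_error = np.abs(members - y).mean(axis=0)
    # Mean absolute difference over all member pairs (including i == j).
    pairwise = np.abs(members[:, None] - members[None, :]).mean(axis=(0, 1))
    return abs_error - 0.5 * pairwise
```

Both functions return a score in the units of the forecast variable, and lower is better. As `sigma` shrinks toward zero, the Gaussian form approaches the absolute error `|y - mu|`, which is why CRPS is often described as a probabilistic generalization of the mean absolute error.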
Experimental Results
Research Questions
- RQ1: How do probabilistic neural network approaches compare to the operational IFS ensemble in 3-day forecasts across geopotential, temperature, and precipitation?
- RQ2: What are the reliability and calibration characteristics (via spread-skill, CRPS, and rank histograms) of MC dropout, parametric, and categorical probabilistic forecasts?
- RQ3: Do parametric or categorical models offer advantages in modeling non-Gaussian distributions for weather variables like precipitation?
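RQ3 turns on how a categorical model represents a distribution: the target range is split into bins, the network emits one logit per bin, a softmax turns the logits into bin probabilities, and training minimizes cross-entropy against the bin containing the observed value. The sketch below illustrates that mechanism in NumPy under stated assumptions. The bin edges (`edges`, ten bins between -10 and 10) and the example logits are hypothetical and are not the values used in the paper.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def to_bin(y, edges):
    """Index of the bin containing y; values outside the range go to the end bins."""
    return np.clip(np.digitize(y, edges) - 1, 0, len(edges) - 2)

def cross_entropy(probs, target_bin):
    """Mean negative log-probability assigned to the observed bin."""
    n = probs.shape[0]
    return float(-np.log(probs[np.arange(n), target_bin] + 1e-12).mean())

# Hypothetical discretization: 10 bins of width 2 (assumption, for illustration only).
edges = np.linspace(-10.0, 10.0, 11)

# Hypothetical logits with two separate peaks: a bimodal forecast that a
# single Gaussian (mean and standard deviation) cannot represent.
logits = np.array([[0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.0, 0.0, 0.0]])
probs = softmax(logits)
loss = cross_entropy(probs, to_bin(np.array([-7.0]), edges))
```

Because each bin probability is free, the categorical output can be skewed or multimodal. The cost is one output per bin instead of the two parameters of the Gaussian model.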
Key Findings
| Model | Z500 RMSE | Z500 Spread | Z500 CRPS | T850 RMSE | T850 Spread | T850 CRPS | T2M RMSE | T2M Spread | T2M CRPS | TP RMSE | TP Spread | TP CRPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MC Dropout Dr=0.1 | 312.96 | 1.80 | 1.52 | 155.70 | 1.03 | 0.77 | 0.57 | [missing] | [missing] | [missing] | [missing] | [missing] |
| Parametric | 315.30 | 1.82 | 1.55 | 142.67 | 0.90 | 0.70 | 0. - | 0. - | 0. - | [missing] | [missing] | [missing] |
| Categorical | 327.48 | 1.80 | 1.49 | 142.59 | 0.87 | 0.65 | 0.47 | [missing] | [missing] | [missing] | [missing] | [missing] |
| TIGGE (3/5 days) | 145/297 | 1.20/1.73 | 1.26/1.57 | 2.02/2.15 | 1.05/1.00 | 0.93/0.96 | 0.69/0.80 | 0.84/0.85 | 0.65/0.0 | 0.58/0.70 | 0.41/0.47 | [missing] |
| Deterministic | 313.70 | 1.79 | 1.53 | 194.90 | 1.24 | 0.96 | 0. - | 0. - | 0. - | [missing] | [missing] | [missing] |
- MC dropout yields the lowest ensemble mean RMSE and CRPS at a dropout rate of 0.1, but is severely underdispersive (low spread).
- Parametric and categorical models achieve similar verification scores but make different trade-offs: parametric models are simpler with fewer degrees of freedom, while categorical models better handle non-Gaussian distributions (notably for precipitation).
- None of the probabilistic neural network approaches matches the skill of the operational TIGGE/IFS ensemble; precipitation remains challenging for data-driven methods.
- Deterministic DL baselines without post-processing perform worse than probabilistic methods in most metrics.
- The operational TIGGE ensemble generally outperforms data-driven methods across variables, with some exceptions for precipitation where RMSE is less informative.
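The underdispersion reported for MC dropout is what the spread-skill ratio and the rank histogram are designed to expose. The sketch below shows one common way to compute both on synthetic data. It is an illustration under assumptions, not the paper's evaluation code: the data are random draws, the function names are chosen here, and the spread is taken as the square root of the mean ensemble variance, which may differ in detail from the benchmark's definition.

```python
import numpy as np

def spread_skill_ratio(ens, obs):
    """Ratio of ensemble spread to the RMSE of the ensemble mean.

    ens has shape (n_samples, n_members); obs has shape (n_samples,).
    Values near 1 indicate a reliable ensemble; values below 1, underdispersion.
    """
    spread = np.sqrt(ens.var(axis=1, ddof=1).mean())
    rmse = np.sqrt(((ens.mean(axis=1) - obs) ** 2).mean())
    return spread / rmse

def rank_histogram(ens, obs):
    """Count how often the observation falls at each rank among the members."""
    ranks = (ens < obs[:, None]).sum(axis=1)          # rank in 0..n_members
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

# Synthetic example (assumption): the truth is drawn from N(center, 1).
rng = np.random.default_rng(0)
n_samples, n_members = 20000, 20
center = rng.normal(size=n_samples)
obs = center + rng.normal(size=n_samples)

# A calibrated ensemble samples the same distribution as the truth;
# an underdispersive one uses a standard deviation that is too small.
calibrated = center[:, None] + rng.normal(size=(n_samples, n_members))
underdispersive = center[:, None] + 0.3 * rng.normal(size=(n_samples, n_members))

ratio_calibrated = spread_skill_ratio(calibrated, obs)
ratio_under = spread_skill_ratio(underdispersive, obs)
hist_calibrated = rank_histogram(calibrated, obs)
hist_under = rank_histogram(underdispersive, obs)
```

In this setup the calibrated ensemble gives a ratio close to 1 and a roughly flat histogram, while the underdispersive ensemble gives a ratio well below 1 and a U-shaped histogram, because the observation frequently falls outside the range of all members.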
This explainer was generated by AI and reviewed by human editors.