QUICK REVIEW

[Paper Review] Obtaining Calibrated Probabilities from Boosting

Alexandru Niculescu-Mizil, Rich Caruana | arXiv (Cornell University) | Jul 4, 2012

Explainable Artificial Intelligence (XAI) | 17 references | 155 citations

TL;DR

This paper investigates why boosting algorithms, particularly AdaBoost, output poorly calibrated probabilities, and evaluates three calibration techniques (Platt Scaling, Isotonic Regression, and Logistic Correction) along with boosting under log-loss. It finds that Logistic Correction and log-loss boosting work well for weak base learners such as decision stumps but poorly for more complex ones such as full decision trees, whereas Platt Scaling and Isotonic Regression significantly improve the predicted probabilities.

ABSTRACT

Boosted decision trees typically yield good accuracy, precision, and ROC area. However, because the outputs from boosting are not well calibrated posterior probabilities, boosting yields poor squared error and cross-entropy. We empirically demonstrate why AdaBoost predicts distorted probabilities and examine three calibration methods for correcting this distortion: Platt Scaling, Isotonic Regression, and Logistic Correction. We also experiment with boosting using log-loss instead of the usual exponential loss. Experiments show that Logistic Correction and boosting with log-loss work well when boosting weak models such as decision stumps, but yield poor performance when boosting more complex models such as full decision trees. Platt Scaling and Isotonic Regression, however, significantly improve the probabilities predicted by

Motivation & Objective

To address the issue of poorly calibrated probabilities in boosted decision trees, which leads to poor performance in squared error and cross-entropy metrics.
To investigate why AdaBoost produces distorted probability estimates despite strong accuracy and ROC performance.
To evaluate the effectiveness of three calibration techniques—Platt Scaling, Isotonic Regression, and Logistic Correction—in correcting probability miscalibration.
To examine whether using log-loss instead of exponential loss in boosting improves probability calibration.
To determine the conditions under which each calibration method performs best, particularly in relation to the complexity of the base learner.

Proposed method

Empirically analyze the root cause of probability distortion in AdaBoost by examining the behavior of its output scores.
Apply Platt Scaling, a parametric method that fits a sigmoid function to map raw scores to calibrated probabilities.
Apply Isotonic Regression, a non-parametric method that fits a piecewise constant, non-decreasing function to the scores for calibration.
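Apply Logistic Correction, a closed-form method based on the interpretation of AdaBoost as an additive logistic regression model: the boosted score F(x) is mapped to a probability via p = 1 / (1 + exp(-2 F(x))), with no additional parameters fitted.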
Modify the boosting algorithm to use log-loss instead of exponential loss during training to improve inherent probability calibration.
Evaluate all methods on multiple datasets using metrics such as Brier score and log-loss to assess calibration quality.
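
The three calibration maps listed above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the function names, the toy calibration set, and the plain gradient-descent fit for Platt Scaling are all assumptions made for the example (Platt's original procedure also smooths the targets and uses a different optimizer). Scores are assumed to be raw boosted outputs F(x) on a held-out calibration set, with labels in {0, 1}.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def logistic_correction(score):
    """Closed-form map p = 1 / (1 + exp(-2 * F(x))); nothing is fitted."""
    return sigmoid(2.0 * score)

def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * s + b) by gradient descent on mean log-loss.

    Simplification (assumption): target smoothing and the second-order
    optimizer from Platt's original procedure are omitted.
    """
    a, b, n = 0.0, 0.0, len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def fit_isotonic(scores, labels):
    """Pool Adjacent Violators: returns (upper_score, value) steps of a
    non-decreasing, piecewise-constant map. Tied scores get no special care."""
    blocks = []  # each block: [sum_of_labels, count, largest_score]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, hi = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = hi
    return [(hi, total / count) for total, count, hi in blocks]

def predict_isotonic(step_fn, score):
    """Return the value of the first step whose upper score covers `score`."""
    for hi, value in step_fn:
        if score <= hi:
            return value
    return step_fn[-1][1]

# Hypothetical held-out calibration set (not data from the paper).
cal_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
cal_labels = [0, 0, 1, 0, 1, 1]

a, b = fit_platt(cal_scores, cal_labels)
iso_steps = fit_isotonic(cal_scores, cal_labels)

for s in (-1.5, 0.0, 1.5):
    print(s, logistic_correction(s), sigmoid(a * s + b), predict_isotonic(iso_steps, s))
```

Note the structural difference: Logistic Correction has no free parameters, Platt Scaling fits two, and Isotonic Regression fits one value per pooled block, so it needs more calibration data to avoid overfitting. In practice a library routine such as scikit-learn's CalibratedClassifierCV (with method "sigmoid" or "isotonic") would normally be used instead of hand-written fits.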

Experimental results

Research questions

RQ1: Why do boosted models like AdaBoost produce poorly calibrated probabilities despite strong discrimination performance?
RQ2: How effective are Platt Scaling, Isotonic Regression, and Logistic Correction in calibrating the probability outputs of boosting algorithms?
RQ3: Does replacing the exponential loss with log-loss in the boosting framework improve the intrinsic calibration of the model's outputs?
RQ4: How does the complexity of the base learner (e.g., decision stump vs. full decision tree) affect the performance of different calibration techniques?
RQ5: Under what conditions do calibration methods like Platt Scaling and Isotonic Regression outperform Logistic Correction and log-loss boosting?
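
For reference, the two losses contrasted in RQ3 and the origin of the Logistic Correction formula can be written out as follows. These are the standard textbook forms for labels in {-1, +1} and an additive model F(x), stated here as background and not quoted from the paper; the factor of 2 in the log-loss is one common scaling convention.

```latex
\begin{align*}
L_{\exp}(F) &= \sum_{i=1}^{N} \exp\bigl(-y_i F(x_i)\bigr),
  \qquad y_i \in \{-1, +1\} \\
L_{\log}(F) &= \sum_{i=1}^{N} \log\bigl(1 + \exp(-2\, y_i F(x_i))\bigr) \\
F^{*}(x) &= \arg\min_{F}\; \mathbb{E}\bigl[e^{-yF(x)} \mid x\bigr]
  = \tfrac{1}{2} \log \frac{P(y = 1 \mid x)}{P(y = -1 \mid x)} \\
\Longrightarrow\quad P(y = 1 \mid x) &= \frac{1}{1 + e^{-2F^{*}(x)}}
\end{align*}
```

The last line is the map that Logistic Correction applies to the boosted score. It holds for the population minimizer of the exponential loss, which may be part of why the correction can behave differently when the fitted F(x) is far from that ideal.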

Key findings

Platt Scaling and Isotonic Regression significantly improve the calibration of probability estimates produced by boosting, especially when using weak learners such as decision stumps.
Logistic Correction and log-loss boosting perform well when boosting weak models but degrade in performance when applied to more complex models like full decision trees.
The original AdaBoost algorithm with exponential loss produces severely miscalibrated probabilities, resulting in high Brier scores and poor log-loss performance.
Isotonic Regression generally outperforms Platt Scaling in terms of calibration quality, particularly on datasets with non-linear decision boundaries.
The choice of calibration method should be guided by the complexity of the base estimator, with stronger models requiring more robust calibration techniques.
Empirical results show that post-processing with Isotonic Regression can reduce Brier scores by up to 50% compared to uncalibrated AdaBoost outputs.
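
The gap between discrimination and calibration that these findings rely on can be illustrated with a small sketch. The numbers below are hypothetical and chosen only for the example, not taken from the paper: two sets of predictions share the same ranking and the same accuracy at a 0.5 threshold, yet differ in Brier score (mean squared error) and cross-entropy because one is a monotone distortion of the other.

```python
import math

def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)

def cross_entropy(probs, labels, eps=1e-12):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

def accuracy(probs, labels, threshold=0.5):
    """Fraction of examples classified correctly at the given threshold."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)

# Hypothetical labels and predictions (illustrative only).
labels = [1, 1, 1, 0, 1, 0, 0, 0]
probs_a = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]

# probs_b is a monotone distortion of probs_a (compressed toward 0.5),
# so the ranking and the thresholded decisions are unchanged.
probs_b = [0.5 + 0.2 * (p - 0.5) for p in probs_a]

for name, probs in (("a", probs_a), ("b", probs_b)):
    print(name, accuracy(probs, labels), brier_score(probs, labels),
          cross_entropy(probs, labels))
```

Because the distortion is monotone, any rank-based metric such as ROC area is identical for both prediction sets. Only metrics that read the probabilities at face value separate them, which is why a calibration step can help squared error and cross-entropy without changing the ordering of the predictions.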

This review was created by AI and reviewed by human editors.