QUICK REVIEW

[Paper Review] Why do tree-based models still outperform deep learning on tabular data?

Léo Grinsztajn, Edouard Oyallon|arXiv (Cornell University)|Jul 18, 2022

Big Data and Digital Economy | Cited by 133

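One-line summary

This paper presents a large-scale benchmark on tabular data spanning 45 datasets, in which tree-based models (e.g., XGBoost, Random Forest) outperform deep learning, and analyzes the inductive biases that explain why.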

ABSTRACT

While deep learning has enabled tremendous progress on text and image datasets, its superiority on tabular data is not clear. We contribute extensive benchmarks of standard and novel deep learning methods as well as tree-based models such as XGBoost and Random Forests, across a large number of datasets and hyperparameter combinations. We define a standard set of 45 datasets from varied domains with clear characteristics of tabular data and a benchmarking methodology accounting for both fitting models and finding good hyperparameters. Results show that tree-based models remain state-of-the-art on medium-sized data ($\sim$10K samples) even without accounting for their superior speed. To understand this gap, we conduct an empirical investigation into the differing inductive biases of tree-based models and Neural Networks (NNs). This leads to a series of challenges which should guide researchers aiming to build tabular-specific NNs: 1. be robust to uninformative features, 2. preserve the orientation of the data, and 3. be able to easily learn irregular functions. To stimulate research on tabular architectures, we contribute a standard benchmark and raw data for baselines: every point of a 20 000 compute hours hyperparameter search for each learner.

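Motivation and objectives

Establish a standardized benchmark for tabular data with careful dataset selection and hyperparameter tuning.
Compare deep learning models and tree-based models across diverse tabular datasets.
Share the raw benchmark results to enable reproducible, budget-aware comparisons.
Empirically investigate the inductive biases that favor tree-based models on tabular data.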

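Proposed method

Define 45 heterogeneous tabular datasets from OpenML using strict inclusion criteria.
Apply a benchmarking procedure that accounts for the variance of hyperparameter search, using roughly 400 random-search iterations per dataset.
Evaluate tree-based models (RandomForest, GradientBoosting, XGBoost) and deep models (MLP, ResNet, FT-Transformer, SAINT).
Measure performance with test accuracy for classification and R² for regression.
Share the code and the raw results of the 20,000 compute-hour search to enable reuse and further validation.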
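
The budget-aware part of this procedure can be illustrated with a short sketch. Assuming each random-search draw yields a validation score and a test score, the snippet below shuffles the order of the draws many times and, for every budget k, reports the average test score of the configuration that was best on validation among the first k draws. The function name `expected_best_test_score`, the number of shuffles, and the toy scores are illustrative assumptions, not the authors' code.

```python
import numpy as np


def expected_best_test_score(val_scores, test_scores, n_shuffles=200, seed=0):
    """Average test score of the validation-best model after k search draws.

    val_scores[i] and test_scores[i] belong to the i-th random-search draw.
    Entry k-1 of the result is the test score of the draw with the highest
    validation score among the first k draws, averaged over random orderings.
    """
    val_scores = np.asarray(val_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(val_scores)
    curves = np.empty((n_shuffles, n))
    for s in range(n_shuffles):
        order = rng.permutation(n)
        val, test = val_scores[order], test_scores[order]
        best = 0  # index of the best-on-validation draw seen so far
        for k in range(n):
            if val[k] > val[best]:
                best = k
            curves[s, k] = test[best]
    return curves.mean(axis=0)


# Toy scores (hypothetical): three draws of one learner on one dataset.
curve = expected_best_test_score([0.1, 0.5, 0.9], [0.2, 0.4, 0.8])
```

Plotting such a curve per learner against the number of draws gives a comparison that depends on the tuning budget rather than on a single tuned configuration.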

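Experimental results

Research questions

RQ1: When hyperparameters are carefully tuned, do tree-based models outperform deep learning models on a broad, representative collection of tabular datasets?
RQ2: Which inherent inductive biases of tree-based models and neural networks explain the performance gap on tabular data?
RQ3: How do data transformations (e.g., smoothing, uninformative features, rotation) affect the gap between tree-based models and neural networks?
RQ4: Is there a standard benchmarking methodology that enables fair comparison of tabular learning methods across datasets and budgets?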
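
As an illustration of the smoothing transformation mentioned in RQ3, the following sketch smooths a target with a Gaussian kernel (a Nadaraya-Watson estimator). This is one plausible way to make a target function more regular; the function name `gaussian_kernel_smooth`, the length scales, and the toy data are assumptions for illustration, and the paper's exact procedure may differ.

```python
import numpy as np


def gaussian_kernel_smooth(X, y, length_scale):
    """Replace each target by a Gaussian-weighted average of all targets."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared Euclidean distance between every pair of rows of X.
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    weights = np.exp(-sq_dists / (2.0 * length_scale ** 2))
    # Row-normalized weighted average of the targets.
    return weights @ y / weights.sum(axis=1)


# Toy data (hypothetical): an irregular, alternating target on a 1-D grid.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0]])
y_toy = np.array([0.0, 1.0, 0.0, 1.0])
y_sharp = gaussian_kernel_smooth(X_toy, y_toy, length_scale=1e-3)  # nearly unchanged
y_flat = gaussian_kernel_smooth(X_toy, y_toy, length_scale=1e3)    # nearly the mean
```

A larger length scale removes more of the irregular structure, so training on progressively smoothed targets is one way to probe how much a model relies on that structure.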

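Key findings

Tree-based models remain state-of-the-art on medium-sized tabular data (about 10K samples), even when hyperparameter tuning is taken into account.
Neural networks struggle to learn irregular target functions and are hindered by their rotation invariance on tabular data.
Uninformative features disproportionately affect MLP-like architectures, widening the performance gap with tree-based models.
Removing uninformative features narrows the gap for NNs, while adding such features widens it.
The rotation invariance of neural networks prevents them from exploiting the original orientation of the data; embeddings that break this invariance can improve NN performance.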
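
The two feature-side transformations behind these findings can be sketched with NumPy alone. The snippet below draws a random orthogonal matrix (a rotation, possibly combined with a reflection) to mix the features, and appends Gaussian noise columns as uninformative features. The helper names `random_rotation` and `add_uninformative_features`, the standard-normal noise, and the toy matrix are assumptions for illustration, not the authors' implementation.

```python
import numpy as np


def random_rotation(n_features, rng):
    """Draw a random orthogonal matrix via QR decomposition of a Gaussian matrix."""
    gaussian = rng.standard_normal((n_features, n_features))
    q, r = np.linalg.qr(gaussian)
    # Fix column signs so the draw is uniform over orthogonal matrices.
    return q * np.sign(np.diag(r))


def add_uninformative_features(X, n_extra, rng):
    """Append n_extra standard-normal columns drawn independently of everything else."""
    noise = rng.standard_normal((X.shape[0], n_extra))
    return np.hstack([X, noise])


rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))  # toy feature matrix (hypothetical)
rotation = random_rotation(X.shape[1], rng)
X_rotated = X @ rotation           # every new column mixes all original features
X_noisy = add_uninformative_features(X, n_extra=3, rng=rng)
```

A rotation-invariant learner should score about the same on `X` and `X_rotated`, whereas a learner that splits on individual columns, such as a decision tree, loses the axis-aligned structure after rotation.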

This review was generated by AI and checked by a human editor.