QUICK REVIEW

[Paper Review] When Do Neural Nets Outperform Boosted Trees on Tabular Data?

Duncan C. McElfresh, Sujay Khandagale|arXiv (Cornell University)|May 4, 2023

Explainable Artificial Intelligence (XAI) | Citations: 71

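One-Sentence Summary

This paper runs a large-scale comparison of 19 algorithms on 176 tabular datasets and shows that the NN-versus-GBDT debate is often overemphasized: on many datasets, a simple GBDT baseline or a lightly tuned GBDT matches or beats NN performance. TabPFN often excels on small datasets, while GBDTs tend to win on large or irregular datasets. The authors release a benchmark suite called TabZilla.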

ABSTRACT

Tabular data is one of the most commonly used types of data in machine learning. Despite recent advances in neural nets (NNs) for tabular data, there is still an active discussion on whether or not NNs generally outperform gradient-boosted decision trees (GBDTs) on tabular data, with several recent works arguing either that GBDTs consistently outperform NNs on tabular data, or vice versa. In this work, we take a step back and question the importance of this debate. To this end, we conduct the largest tabular data analysis to date, comparing 19 algorithms across 176 datasets, and we find that the 'NN vs. GBDT' debate is overemphasized: for a surprisingly high number of datasets, either the performance difference between GBDTs and NNs is negligible, or light hyperparameter tuning on a GBDT is more important than choosing between NNs and GBDTs. A remarkable exception is the recently-proposed prior-data fitted network, TabPFN: although it is effectively limited to training sets of size 3000, we find that it outperforms all other algorithms on average, even when randomly sampling 3000 training datapoints. Next, we analyze dozens of metafeatures to determine what properties of a dataset make NNs or GBDTs better-suited to perform well. For example, we find that GBDTs are much better than NNs at handling skewed or heavy-tailed feature distributions and other forms of dataset irregularities. Our insights act as a guide for practitioners to determine which techniques may work best on their dataset. Finally, with the goal of accelerating tabular data research, we release the TabZilla Benchmark Suite: a collection of the 36 'hardest' of the datasets we study. Our benchmark suite, codebase, and all raw results are available at https://github.com/naszilla/tabzilla.

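Motivation and Objectives

Question the emphasis placed on NN-versus-GBDT performance in the tabular data setting.
Evaluate whether algorithm selection or hyperparameter tuning yields the larger performance gain across diverse datasets.
Identify the dataset properties (metafeatures) that predict whether NNs or GBDTs perform better.
Provide practical guidance on selecting and tuning methods for tabular data.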

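Proposed Method

Evaluate 19 algorithms (GBDTs, NNs, TabPFN, and baselines) on 176 tabular datasets drawn from OpenML suites.
Tune hyperparameters with Optuna, using at most 30 configurations per dataset and at most 10 hours per run.
Use 10-fold cross-validation on each dataset and report test accuracy and log loss as the primary metrics.
Compute a large set of 965 metafeatures with PyMFE to analyze dataset characteristics.
Assess significance with the Friedman test and the Wilcoxon signed-rank test, applying the Holm-Bonferroni correction.
Release the TabZilla Benchmark Suite of 36 hard datasets together with open-source code and results.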
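
The Holm-Bonferroni correction mentioned in the list above can be illustrated with a minimal sketch. This is not the authors' code: the function name `holm_bonferroni` and the example p-values are hypothetical, and in practice the p-values would come from a pairwise test such as `scipy.stats.wilcoxon`.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the null hypothesis is rejected.

    Holm's step-down procedure: sort the p-values in ascending order and
    compare the k-th smallest (k = 0, 1, ...) against alpha / (m - k).
    Stop at the first p-value that fails; it and all larger ones are kept.
    """
    m = len(p_values)
    # Indices of the p-values, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - k):
            reject[idx] = True
        else:
            break  # step-down: no further hypotheses can be rejected
    return reject


# Hypothetical p-values from four pairwise algorithm comparisons.
example_p = [0.01, 0.04, 0.03, 0.005]
print(holm_bonferroni(example_p))  # [True, False, False, True]
```

The step-down thresholds become less strict as hypotheses are rejected, which makes the procedure less conservative than a plain Bonferroni correction while still controlling the family-wise error rate.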

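Experimental Results

Research Questions

RQ1: How do the algorithm families (GBDTs versus NNs) perform relative to each other on a large, diverse collection of tabular datasets?
RQ2: Do dataset size, irregularity, and other metafeatures predict whether NNs or GBDTs work better?
RQ3: Do simple baselines or light hyperparameter tuning of a strong model often matter more than choosing an algorithm across families?
RQ4: Which dataset properties best explain the success or failure of particular methods, and how does this inform practical choices on new datasets?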
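
As a rough illustration of the kind of "irregularity" metafeature these questions refer to, the sketch below computes the skewness of a single feature column in plain Python. It is a stand-in under stated assumptions, not the PyMFE implementation used in the paper; the function name `feature_skewness` and the sample columns are hypothetical.

```python
def feature_skewness(values):
    """Biased (population) skewness: third central moment / variance ** 1.5.

    Values near 0 indicate a roughly symmetric column; large positive
    values indicate a long right tail.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # constant column: define skewness as 0
    return m3 / m2 ** 1.5


symmetric_column = [1, 2, 3, 4, 5]
heavy_tailed_column = [1, 1, 1, 1, 100]
print(feature_skewness(symmetric_column))
print(feature_skewness(heavy_tailed_column))
```

A per-dataset summary (for example, the maximum or mean skewness over all features) could then serve as one input when relating dataset properties to which algorithm family performs better.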

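Key Findings

No single algorithm dominates across all 176 datasets. CatBoost often ranks first, but other methods win on some datasets.
TabPFN achieves the best average performance and, surprisingly, very short training times. On small datasets (≤1250 samples), TabPFN sometimes outperforms the other methods, and its inference is fast.
On the subset of 98 datasets that excludes extreme memory and runtime failures, TabPFN outperforms all other algorithms on average, with statistical significance.
Tuning the hyperparameters of a strong baseline (e.g., CatBoost) yields a larger improvement than switching between GBDTs and NNs on about one third of the datasets.
GBDTs tend to outperform NNs on larger and more irregular datasets (e.g., heavy-tailed or skewed feature distributions).
Guidance for practitioners: start with a simple baseline, then lightly tune CatBoost, and use metafeatures to guide algorithm selection on new data.
The TabZilla Benchmark Suite of 36 hard datasets is released to accelerate tabular data research, and the code and results are publicly available.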
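
The comparison between tuning gains and family-switching gains in the findings above can be sketched as follows. This is one possible way to operationalize the idea, not the paper's exact procedure: the function name, the dictionary keys, the definition of the "family gap", and the accuracy values are all hypothetical.

```python
def fraction_tuning_beats_switching(results):
    """Fraction of datasets where tuning a GBDT helps more than switching family.

    `results` maps a dataset name to a dict of test accuracies with the
    (hypothetical) keys 'gbdt_default', 'gbdt_tuned', and 'nn_best'.
    """
    wins = 0
    for scores in results.values():
        # Gain from lightly tuning the GBDT baseline.
        tuning_gain = scores["gbdt_tuned"] - scores["gbdt_default"]
        # Assumed definition: gap between the best NN and the default GBDT.
        family_gap = abs(scores["nn_best"] - scores["gbdt_default"])
        if tuning_gain > family_gap:
            wins += 1
    return wins / len(results)


# Made-up accuracies for three datasets, for illustration only.
toy_results = {
    "dataset_a": {"gbdt_default": 0.80, "gbdt_tuned": 0.85, "nn_best": 0.82},
    "dataset_b": {"gbdt_default": 0.90, "gbdt_tuned": 0.91, "nn_best": 0.95},
    "dataset_c": {"gbdt_default": 0.70, "gbdt_tuned": 0.71, "nn_best": 0.60},
}
print(fraction_tuning_beats_switching(toy_results))
```

In this toy example only `dataset_a` has a tuning gain larger than the family gap, so the function returns one third.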

This review was generated by AI and checked by a human editor.