QUICK REVIEW

[Paper Review] TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second

Noah Hollmann, Samuel Müller|arXiv (Cornell University)|Jul 5, 2022

Machine Learning and Data Classification | Citations: 94

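One-line summary

TabPFN is a pre-trained Transformer that classifies small tabular datasets in under a second with no hyperparameter tuning. It approximates the Bayesian posterior predictive distribution through in-context learning and matches state-of-the-art AutoML systems on purely numerical datasets.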

ABSTRACT

We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN performs in-context learning (ICL), it learns to make predictions using sequences of labeled examples (x, f(x)) given in the input, without requiring further parameter updates. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On the 18 datasets in the OpenML-CC18 suite that contain up to 1 000 training data points, up to 100 purely numerical features without missing values, and up to 10 classes, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 230$\times$ speedup. This increases to a 5 700$\times$ speedup when using a GPU. We also validate these results on an additional 67 small numerical datasets from OpenML. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at https://github.com/automl/TabPFN.

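Motivation and objectives

Show that a single pre-trained Transformer can solve small tabular classification tasks in under a second, with no dataset-specific tuning.
Develop a Prior-Data Fitted Network (PFN) that is trained offline to approximate Bayesian inference under a prior over tabular data.
Build a causally motivated prior (SCMs and BNNs) that models the diverse mechanisms by which tabular data is generated.
Show that TabPFN outperforms boosted trees and is competitive with AutoML systems on the numerical datasets of OpenML-CC18.
Release open-source code, the pre-trained TabPFN, and a demo so that the community can reproduce and verify the results.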
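
The quantity a PFN is trained to approximate can be written out explicitly. The block below follows the standard PFN formulation as a sketch; the symbols $\phi$ (a data-generating mechanism drawn from the prior), $D$ (the training set), and $q_\theta$ (the network with weights $\theta$) are notation introduced here for illustration and do not appear in this review.

```latex
% Posterior predictive distribution (PPD) for a test input x given a training set D
p(y \mid x, D) \;\propto\; \int_{\Phi} p(y \mid x, \phi)\, p(D \mid \phi)\, p(\phi)\, d\phi

% Training loss: expected cross-entropy on a held-out point of a synthetic dataset drawn from the prior
\mathcal{L}(\theta) = \mathbb{E}_{D \cup \{(x, y)\} \sim p(\mathcal{D})}
    \left[ -\log q_\theta(y \mid x, D) \right]
```

Minimizing the second expression over many synthetic datasets is what allows a single forward pass of $q_\theta$ to stand in for the integral in the first.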

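Proposed method

Train a 12-layer Transformer as a PFN that approximates the posterior predictive distribution (PPD) under a new prior for tabular data.
Construct the prior as a mixture of SCMs (Structural Causal Models) and BNNs (Bayesian Neural Networks), covering diverse data-generating processes while favoring simple, causal ones.
Train offline on synthetic datasets sampled from the prior by minimizing cross-entropy on held-out synthetic points; prediction on a new dataset then takes a single forward pass.
At inference time, feed in the training data and the test features as a set and obtain PPD predictions in one forward pass.
Support variable-length training sets, keep predictions invariant to the order of the samples, and zero-pad to handle varying numbers of features.
Optionally, ensemble 32 forward passes over transformed versions of the data for more stable predictions.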
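
The inference-time steps above (zero-padding to a fixed feature width, one pass over training and test rows together, and averaging over transformed passes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_forward` is a hypothetical stand-in for the trained Transformer (a softmax over negative squared distances to class centroids), `MAX_FEATURES = 100` is an assumed input width, and the only transformation shown is a rotation of feature columns and class labels, chosen here for simplicity.

```python
import math

MAX_FEATURES = 100  # assumed fixed input width; shorter rows are zero-padded


def zero_pad(row, width=MAX_FEATURES):
    """Right-pad a feature vector with zeros up to a fixed width."""
    if len(row) > width:
        raise ValueError("too many features for this sketch")
    return list(row) + [0.0] * (width - len(row))


def rotate(seq, k):
    """Cyclically shift a sequence left by k positions."""
    k %= len(seq)
    return list(seq[k:]) + list(seq[:k])


def toy_forward(X_train, y_train, X_test, n_classes):
    """Hypothetical stand-in for the Transformer's single forward pass.

    Returns one probability vector per test row: a softmax over negative
    squared distances to the per-class centroids of the training rows.
    """
    dim = len(X_train[0])
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for x, y in zip(X_train, y_train):
        counts[y] += 1
        for j, v in enumerate(x):
            sums[y][j] += v
    probs = []
    for x in X_test:
        logits = []
        for c in range(n_classes):
            if counts[c] == 0:  # class absent from the training rows
                logits.append(float("-inf"))
                continue
            logits.append(-sum((v - s / counts[c]) ** 2
                               for v, s in zip(x, sums[c])))
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        probs.append([e / z for e in exps])
    return probs


def predict_proba(X_train, y_train, X_test, n_classes, n_passes=32):
    """Average class probabilities over several transformed forward passes."""
    avg = [[0.0] * n_classes for _ in X_test]
    for k in range(n_passes):
        # Assumed transformation: rotate feature columns and class labels by k.
        Xtr = [zero_pad(rotate(x, k)) for x in X_train]
        Xte = [zero_pad(rotate(x, k)) for x in X_test]
        ytr = [(y + k) % n_classes for y in y_train]
        out = toy_forward(Xtr, ytr, Xte, n_classes)
        for i, p in enumerate(out):
            for c in range(n_classes):
                # Undo the label rotation before accumulating.
                avg[i][c] += p[(c + k) % n_classes] / n_passes
    return avg


X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y_train = [0, 0, 1, 1]
X_test = [[0.1, 0.1], [1.0, 1.0]]
probs = predict_proba(X_train, y_train, X_test, n_classes=2)
labels = [max(range(2), key=p.__getitem__) for p in probs]
```

The centroid stand-in happens to be unaffected by these rotations, so averaging changes nothing here. A Transformer is generally not invariant to column and label order, which is why averaging over such passes can stabilize its output.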

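Experimental results

Research questions

RQ1: Can a single pre-trained Transformer learn Bayesian-style posterior predictive inference on small tabular datasets without per-dataset fine-tuning?
RQ2: Does a prior based on SCMs and BNNs favor simpler, causal explanations and improve predictive performance on small tabular data?
RQ3: How does TabPFN compare with boosted trees and AutoML systems in accuracy and speed on strictly small, numerical tabular datasets?
RQ4: What are TabPFN's limitations with categorical features and missing values, and can ensembling or adjusting the prior mitigate them?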

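Key findings

On the numerical OpenML-CC18 datasets (up to 1,000 training points and up to 100 features), TabPFN reaches accuracy competitive with state-of-the-art AutoML systems in under a second per dataset.
TabPFN is far faster than CPU-based AutoML pipelines (up to about 230x), and the speedup on small datasets reaches about 5,700x when a GPU is used.
TabPFN generally does worse on datasets with categorical features or missing values, but ensembling it with other methods yields further gains.
TabPFN benefits from an inductive bias toward simple, causal explanations, as shown by a qualitative analysis of its predictions and by robustness checks.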
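
Because the findings hold only inside a narrow setting, it can help to check a dataset against that setting before drawing on them. The sketch below encodes the limits stated in the abstract (at most 1,000 training points, 100 purely numerical features, 10 classes, and no missing values). The function name `fits_evaluated_setting` and the representation of a missing value as NaN are assumptions made for illustration.

```python
import math

MAX_TRAIN_ROWS = 1000  # limits taken from the abstract
MAX_FEATURES = 100
MAX_CLASSES = 10


def fits_evaluated_setting(X_train, y_train):
    """Return (ok, reasons) for a dataset given as lists of rows and labels."""
    reasons = []
    if len(X_train) > MAX_TRAIN_ROWS:
        reasons.append("more than 1,000 training points")
    n_features = len(X_train[0]) if X_train else 0
    if n_features > MAX_FEATURES:
        reasons.append("more than 100 features")
    if len(set(y_train)) > MAX_CLASSES:
        reasons.append("more than 10 classes")
    has_non_numeric = False
    has_missing = False
    for row in X_train:
        for v in row:
            if isinstance(v, bool) or not isinstance(v, (int, float)):
                has_non_numeric = True
            elif math.isnan(v):  # assumption: missing values appear as NaN
                has_missing = True
    if has_non_numeric:
        reasons.append("non-numerical feature values")
    if has_missing:
        reasons.append("missing values")
    return (not reasons, reasons)


ok_small, _ = fits_evaluated_setting([[0.5, 1.0], [1.5, 2.0]], [0, 1])
ok_nan, why_nan = fits_evaluated_setting([[0.5, float("nan")]], [0])
ok_cat, why_cat = fits_evaluated_setting([[0.5, "red"]], [0])
```

A dataset that fails the check is not necessarily unusable with TabPFN; it simply falls outside the range where the reported comparisons were made.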


This review was generated by AI and checked by a human editor.