QUICK REVIEW

[Paper Review] Is rotation forest the best classifier for problems with continuous features?

Anthony Bagnall, Flynn, M.|arXiv (Cornell University)|Sep 18, 2018

Time Series Analysis and Forecasting | References: 45 | Citations: 30

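One-Line Summary

This paper evaluates rotation forest as a default classifier for datasets with real-valued features and shows, through an extensive experimental comparison, that it significantly outperforms alternatives such as random forest, SVM, and neural networks in terms of classification error, AUC, and log loss. To improve scalability, the authors propose a contract-based version that enables faster training with minimal loss of accuracy. They conclude that, when computational resources allow, rotation forest should be the default algorithm for problems with continuous features.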

ABSTRACT

In short, our experiments suggest that yes, on average, rotation forest is better than the most common alternatives when all the attributes are real-valued. Rotation forest is a tree-based ensemble that performs transforms on subsets of attributes prior to constructing each tree. We present an empirical comparison of classifiers for problems with only real-valued features. We evaluate classifiers from three families of algorithms: support vector machines; tree-based ensembles; and neural networks tuned with a large grid search. We compare classifiers on unseen data based on the quality of the decision rule (using classification error), the ability to rank cases (area under the receiver operating characteristic), and the probability estimates (using negative log likelihood). We conclude that, in answer to the question posed in the title, yes, rotation forest is significantly more accurate on average than competing techniques when compared on three distinct sets of datasets. Further, we assess the impact of the design features of rotation forest through an ablative study that transforms random forest into rotation forest. We identify the major limitation of rotation forest as its scalability, particularly in number of attributes. To overcome this problem we develop a model to predict the train time of the algorithm and hence propose a contract version of rotation forest where a run time cap is imposed a priori. We demonstrate that on large problems rotation forest can be made an order of magnitude faster without significant loss of accuracy. We also show that there is no real benefit (on average) from tuning rotation forest. We maintain that without any domain knowledge to indicate an algorithm preference, rotation forest should be the default algorithm of choice for problems with continuous attributes.
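
To make the phrase "performs transforms on subsets of attributes prior to constructing each tree" concrete, the following is a minimal sketch of how one such transform might be built. It assumes PCA as the transform (as in the original rotation forest formulation) and is deliberately simplified: PCA is fitted on all rows, and the resampling steps used by full implementations are omitted. The function name `build_rotation_matrix` and its parameters are illustrative, not taken from the paper's code.

```python
import numpy as np


def build_rotation_matrix(X, n_subsets, rng):
    """Build one rotation matrix from PCA fitted on disjoint random feature subsets.

    Simplifying assumption: PCA uses every row of X (no bootstrap or class sampling).
    """
    n_features = X.shape[1]
    # Randomly partition the feature indices into disjoint subsets.
    permuted = rng.permutation(n_features)
    subsets = np.array_split(permuted, n_subsets)

    rotation = np.zeros((n_features, n_features))
    for subset in subsets:
        if len(subset) == 0:
            continue
        # PCA via eigen-decomposition of the covariance of this subset's columns.
        centered = X[:, subset] - X[:, subset].mean(axis=0)
        cov = np.atleast_2d(np.cov(centered, rowvar=False))
        _, components = np.linalg.eigh(cov)  # columns are orthonormal axes
        # Write the loadings into the block belonging to this subset.
        rotation[np.ix_(subset, subset)] = components
    return rotation


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
R = build_rotation_matrix(X, n_subsets=3, rng=rng)
X_rotated = X @ R  # one tree of the ensemble would be trained on this rotated view
```

Each tree in the ensemble would draw its own partition and therefore see a different rotation of the same data, which is where the diversity between trees comes from in this sketch.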

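Research Motivation and Objectives

Determine whether rotation forest is the best classifier for problems with only real-valued features.
Evaluate the performance of rotation forest against the main classifier families: SVMs, tree-based ensembles, and neural networks.
Assess, through an ablation study, how the design features of rotation forest affect its performance.
Improve the scalability of rotation forest, whose limitations are most pronounced on high-dimensional data, by developing a contract-based training mechanism.
Recommend rotation forest as the default algorithm when no domain-specific knowledge is available.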

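Proposed Method

Experimentally compare rotation forest with 10 classifiers drawn from three families: SVMs (RBF and quadratic kernels), tree-based ensembles (random forest, gradient boosting), and neural networks (one or two hidden layers).
For each classifier, run a large grid search covering roughly 1,000 parameter combinations and select the best model by 10-fold cross-validation on the training data.
Evaluate models on unseen test data with four metrics: classification error, balanced error, area under the receiver operating characteristic curve (AUC), and negative log-likelihood.
Conduct an ablation study that transforms random forest into rotation forest, cleanly separating the effects of rotation and feature-subset selection.
Devise a contract-based training mechanism that caps training time in advance, and build a model that predicts training time to support early stopping.
Improve the accessibility of rotation forest by implementing it from scratch and releasing a basic scikit-learn-compatible version.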
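
The four evaluation metrics listed above can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' evaluation code: it assumes a binary problem with labels 0 and 1 for AUC, uses the natural logarithm for the negative log-likelihood, and all function names are hypothetical.

```python
import math


def classification_error(y_true, y_pred):
    """Fraction of cases whose predicted label differs from the true label."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def balanced_error(y_true, y_pred):
    """One minus the mean per-class recall, so every class counts equally."""
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return 1.0 - sum(recalls) / len(recalls)


def binary_auc(y_true, scores):
    """Probability that a positive case is ranked above a negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def negative_log_likelihood(y_true, probs, eps=1e-15):
    """Mean negative natural log of the probability assigned to the true class."""
    total = 0.0
    for t, p in zip(y_true, probs):
        total -= math.log(min(max(p[t], eps), 1.0))  # clip to avoid log(0)
    return total / len(y_true)


# Toy example: four cases, probabilities given as [P(class 0), P(class 1)].
y_true = [0, 0, 1, 1]
probs = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.1, 0.9]]
y_pred = [int(p[1] >= 0.5) for p in probs]
scores = [p[1] for p in probs]

err = classification_error(y_true, y_pred)
bal_err = balanced_error(y_true, y_pred)
auc = binary_auc(y_true, scores)
nll = negative_log_likelihood(y_true, probs)
```

The three views are complementary: error judges the hard decision, AUC judges only the ranking of cases, and negative log-likelihood penalizes confident but wrong probability estimates.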

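Experimental Results

Research Questions

RQ1: On real-valued datasets, is rotation forest significantly more accurate on average than other classifiers?
RQ2: Which design features of rotation forest contribute most to its performance gains?
RQ3: Can a contract-based training mechanism make rotation forest more usable on large problems without sacrificing accuracy?
RQ4: Does hyperparameter tuning benefit rotation forest, or is it robust at its default settings?
RQ5: Should rotation forest be adopted as the default classifier for new real-valued classification problems?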

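Key Findings

Across three benchmark dataset collections (containing more than 200 real-valued problems), rotation forest significantly outperformed all competing classifiers on average, with its advantage especially clear in AUC and log loss.
The ablation study confirmed that feature rotation and subset selection underpin rotation forest's advantage over random forest.
Hyperparameter tuning of rotation forest yielded no benefit on average, showing that it is robust at its default hyperparameter settings.
On large problems, the contract version of rotation forest reduced training time by up to a factor of 10 with minimal loss of accuracy, making it practical for high-dimensional data.
On small problems the contract has little effect; on large problems accuracy improves as the contract time is extended, particularly on time-series-like data.
Despite its strong performance, rotation forest remains little used because it is poorly integrated into the main toolkits and ships with unsuitable default settings (e.g., 10 trees). The authors address this with a new implementation.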
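
The idea of a training contract can be illustrated with a time-budgeted ensemble loop. The sketch below is not the authors' mechanism: the paper builds a dedicated model to predict training time, whereas this example substitutes a crude running average of per-tree build time. `train_with_contract`, `build_tree`, and the fake clock are hypothetical names introduced only for illustration.

```python
import time


def train_with_contract(build_tree, max_trees, time_budget_s, clock=time.perf_counter):
    """Add trees until max_trees is reached or the next tree would overrun the budget.

    build_tree is a hypothetical zero-argument callable returning one fitted tree.
    """
    start = clock()
    trees = []
    mean_tree_time = 0.0
    while len(trees) < max_trees:
        elapsed = clock() - start
        # Stop early if the next tree is predicted to break the contract.
        if trees and elapsed + mean_tree_time > time_budget_s:
            break
        tree_start = clock()
        trees.append(build_tree())
        tree_time = clock() - tree_start
        # Running average of build time stands in for a real train-time model.
        mean_tree_time += (tree_time - mean_tree_time) / len(trees)
    return trees


class FakeClock:
    """Deterministic clock so the example does not depend on wall time."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


fake = FakeClock()


def slow_tree():
    fake.now += 1.0  # pretend each tree takes one second to build
    return "tree"


capped = train_with_contract(slow_tree, max_trees=200, time_budget_s=5.0, clock=fake)
uncapped = train_with_contract(slow_tree, max_trees=3, time_budget_s=100.0, clock=fake)
```

Note that this sketch always builds at least one tree, even when the budget is too small for it; a stricter contract would need a time estimate before the first tree is built, which is the role a predictive train-time model plays.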


This review was generated by AI and checked by a human editor.