QUICK REVIEW

[Paper Review] The Heterogeneous Ensembles of Standard Classification Algorithms (HESCA): the Whole is Greater than the Sum of its Parts

James Large, Jason Lines|arXiv (Cornell University)|Oct 25, 2017

Imbalanced Data Classification Techniques | References: 18 | Citations: 36

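Brief summary

This paper proposes HESCA, a heterogeneous ensemble of standard classifiers (e.g., decision trees, SVMs, and neural networks). The method combines models from different algorithm families using error estimates formed on the training data. HESCA performs significantly better than its individual component classifiers and a tuned SVM, and the extended HESCA+ is competitive with many complex time-series-specific algorithms. The approach offers a fast, robust benchmark, particularly for small datasets and multi-class problems.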

ABSTRACT

Building classification models is an intrinsically practical exercise that requires many design decisions prior to deployment. We aim to provide some guidance in this decision making process. Specifically, given a classification problem with real valued attributes, we consider which classifier or family of classifiers should one use. Strong contenders are tree based homogeneous ensembles, support vector machines or deep neural networks. All three families of model could claim to be state-of-the-art, and yet it is not clear when one is preferable to the others. Our extensive experiments with over 200 data sets from two distinct archives demonstrate that, rather than choose a single family and expend computing resources on optimising that model, it is significantly better to build simpler versions of classifiers from each family and ensemble. We show that the Heterogeneous Ensembles of Standard Classification Algorithms (HESCA), which ensembles based on error estimates formed on the train data, is significantly better (in terms of error, balanced error, negative log likelihood and area under the ROC curve) than its individual components, picking the component that is best on train data, and a support vector machine tuned over 1089 different parameter configurations. We demonstrate HESCA+, which contains a deep neural network, a support vector machine and two decision tree forests, is significantly better than its components, picking the best component, and HESCA. We analyse the results further and find that HESCA and HESCA+ are of particular value when the train set size is relatively small and the problem has multiple classes. HESCA is a fast approach that is, on average, as good as state-of-the-art classifiers, whereas HESCA+ is significantly better than average and represents a strong benchmark for future research.

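Motivation and objectives

Address the practical problem of choosing the best family of classification algorithms for a new problem under computational constraints.
Investigate whether ensembling minimally tuned classifiers from different algorithm families improves performance compared with optimising a single model.
Evaluate whether simple error-based weighting derived from the training data is effective for combining an ensemble, compared with more complex schemes.
Establish HESCA as a reliable, fast, general-purpose classification benchmark, particularly in low-data and multi-class settings.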

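Proposed method

Train diverse base classifiers from different algorithm families (e.g., decision trees, SVMs, neural networks) on the same training data.
Estimate each base classifier's error on the training set using cross-validation or a similar procedure.
Weight each base classifier's predictions according to its estimated training error: the lower the error, the higher the weight.
Combine the weighted predictions of all base classifiers to form the final ensemble prediction.
Use the same weighting scheme for both HESCA and HESCA+. HESCA+ contains a deep neural network, a support vector machine and two decision tree forests.
Evaluate the ensemble on unseen test data using standard metrics: classification error, balanced error, negative log likelihood and area under the ROC curve (AUROC).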
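
The weighting and combination steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `alpha` exponent, and the example classifier names and numbers are all assumptions made for the example. The review itself only states that a lower estimated training error gives a higher weight.

```python
from typing import Dict, List, Sequence


def error_weights(cv_errors: Dict[str, float], alpha: float = 1.0) -> Dict[str, float]:
    """Turn train-set error estimates into normalised weights.

    Lower estimated error gives a higher weight. `alpha` is a hypothetical
    knob that sharpens the differences between classifiers when > 1.
    """
    raw = {name: (1.0 - err) ** alpha for name, err in cv_errors.items()}
    total = sum(raw.values())
    if total <= 0.0:
        raise ValueError("at least one classifier needs an estimated error below 1")
    return {name: w / total for name, w in raw.items()}


def combine(probs: Dict[str, Sequence[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted sum of each classifier's class-probability vector for one test case."""
    n_classes = len(next(iter(probs.values())))
    combined = [0.0] * n_classes
    for name, p in probs.items():
        for k in range(n_classes):
            combined[k] += weights[name] * p[k]
    total = sum(combined)
    return [c / total for c in combined]


def predict(probs: Dict[str, Sequence[float]], weights: Dict[str, float]) -> int:
    """Index of the class with the largest combined probability."""
    combined = combine(probs, weights)
    return max(range(len(combined)), key=lambda k: combined[k])


# Hypothetical cross-validated error estimates formed on the training data.
cv_errors = {"tree": 0.30, "svm": 0.10, "mlp": 0.20}
weights = error_weights(cv_errors)

# Hypothetical class probabilities from each classifier for one test case.
case_probs = {
    "tree": [0.9, 0.1, 0.0],
    "svm": [0.2, 0.7, 0.1],
    "mlp": [0.3, 0.6, 0.1],
}
print(predict(case_probs, weights))
```

In this made-up case the ensemble picks class 1, although the tree alone would pick class 0: the two classifiers with lower estimated error outweigh it.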

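Experimental results

Research questions

RQ1: Does ensembling several minimally tuned classifiers from different algorithm families perform significantly better than the best single classifier?
RQ2: Can ensemble components be weighted effectively using only error estimates from the training data, and how does this compare with more complex combination schemes?
RQ3: Is it more effective to ensemble simple versions of several classifier families than to tune a single classifier?
RQ4: Does HESCA outperform a heavily optimised SVM, tuned over 1089 parameter configurations, across diverse datasets?
RQ5: Does the ensemble provide meaningful insight into the performance of tuned base models, and does it generalise well across different data types?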
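
To give a sense of the tuning budget in RQ4: 1089 = 33 × 33, so one way to reach that count is a full grid over two hyperparameters with 33 candidate values each. The sketch below builds such a grid. The parameter names `C` and `gamma`, the powers-of-two range and the fold count are assumptions for illustration, since this review does not say how the 1089 configurations were defined.

```python
from itertools import product

# Hypothetical candidate values: 2^-16, 2^-15, ..., 2^16 (33 values).
candidates = [2.0 ** e for e in range(-16, 17)]

# Full grid over two hypothetical SVM hyperparameters.
grid = [{"C": c, "gamma": g} for c, g in product(candidates, candidates)]


def total_fits(n_configs: int, n_folds: int) -> int:
    """Number of model fits needed to cross-validate every configuration."""
    return n_configs * n_folds


print(len(grid))                           # 1089
print(total_fits(len(grid), n_folds=10))   # 10890
```

Under these assumptions, tuning one SVM costs thousands of model fits per dataset, which is the kind of expense the ensemble of simpler classifiers is meant to avoid.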

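Key findings

Across 206 datasets, HESCA performed significantly better than its individual components, than the single classifier that performed best on the training data, and than an SVM tuned over 1089 parameter configurations.
HESCA+ (which contains a deep neural network, an SVM and two decision tree forests) performed significantly better than any of its components and, on average, outperformed HESCA.
HESCA and HESCA+ were particularly effective on datasets with fewer than 1,000 training instances and multiple classes, where the performance gains were most pronounced.
The simple error-based weighting used in HESCA performed as well as or better than more complex ensemble combination schemes such as Confusion Entropy.
On the UCR-UEA time series archive, HESCA+ matched the performance of 11 of 18 state-of-the-art time-series-specific algorithms despite making no use of temporal structure.
HESCA is as accurate as state-of-the-art classifiers while using orders of magnitude less computation, making it a practical and reliable benchmark.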
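
For readers who want to reproduce comparisons of this kind, the four metrics named in this review can be computed as in the sketch below. It uses common textbook definitions and is not taken from the paper: the natural logarithm in the likelihood, the clipping constant `eps` and the restriction of AUROC to the binary case are assumptions.

```python
import math
from typing import Sequence


def error_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of cases that are misclassified."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_error(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """One minus the mean per-class recall, so every class counts equally."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return 1.0 - sum(recalls) / len(recalls)


def negative_log_likelihood(y_true: Sequence[int],
                            probs: Sequence[Sequence[float]],
                            eps: float = 1e-15) -> float:
    """Mean negative log of the probability assigned to the true class."""
    total = sum(math.log(max(p[t], eps)) for t, p in zip(y_true, probs))
    return -total / len(y_true)


def binary_auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC for 0/1 labels: chance that a random positive outscores a
    random negative, with ties counted as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy imbalanced example: always predicting class 0 looks fine on plain
# error but poor on balanced error.
y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]
print(error_rate(y_true, y_pred))      # 0.25
print(balanced_error(y_true, y_pred))  # 0.5
```

Reporting several metrics matters because they can disagree: plain error rewards majority-class guessing, while balanced error, negative log likelihood and AUROC penalise it in different ways.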

This review was generated by AI and checked by a human editor.