QUICK REVIEW

[Paper Review] Is rotation forest the best classifier for problems with continuous features?

Anthony Bagnall, Flynn, M. | arXiv (Cornell University) | Sep 18, 2018

Time Series Analysis and Forecasting | References: 45 | Citations: 30

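One-Sentence Summary

This paper evaluates rotation forest as a default classifier for real-valued datasets, showing through an extensive empirical comparison that it is significantly better than alternatives such as random forest, support vector machines, and neural networks in terms of classification error, AUC, and log loss. The authors propose a contract version to improve scalability, which trains faster with minimal loss of accuracy, and conclude that rotation forest should be the first-choice algorithm for problems with continuous features when computational resources permit.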

ABSTRACT

In short, our experiments suggest that yes, on average, rotation forest is better than the most common alternatives when all the attributes are real-valued. Rotation forest is a tree-based ensemble that performs transforms on subsets of attributes prior to constructing each tree. We present an empirical comparison of classifiers for problems with only real-valued features. We evaluate classifiers from three families of algorithms: support vector machines; tree-based ensembles; and neural networks tuned with a large grid search. We compare classifiers on unseen data based on the quality of the decision rule (using classification error), the ability to rank cases (area under the receiver operating characteristic), and the probability estimates (using negative log likelihood). We conclude that, in answer to the question posed in the title, yes, rotation forest is significantly more accurate on average than competing techniques when compared on three distinct sets of datasets. Further, we assess the impact of the design features of rotation forest through an ablative study that transforms random forest into rotation forest. We identify the major limitation of rotation forest as its scalability, particularly in number of attributes. To overcome this problem we develop a model to predict the train time of the algorithm and hence propose a contract version of rotation forest where a run time cap is imposed a priori. We demonstrate that on large problems rotation forest can be made an order of magnitude faster without significant loss of accuracy. We also show that there is no real benefit (on average) from tuning rotation forest. We maintain that without any domain knowledge to indicate an algorithm preference, rotation forest should be the default algorithm of choice for problems with continuous attributes.
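
The transform step mentioned in the abstract can be sketched as follows. This is a minimal illustration, assuming the transform is principal component analysis (PCA) applied to random, disjoint feature subsets, as in the original rotation forest formulation. The function name `rotation_matrix`, the plain bootstrap of rows, and the toy data are assumptions made for this example and are not taken from the paper's code.

```python
import numpy as np

def rotation_matrix(X, n_subsets, rng):
    """Build one rotation matrix from PCA on random, disjoint feature subsets.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    n_features = X.shape[1]
    # Randomly partition the feature indices into roughly equal subsets.
    subsets = np.array_split(rng.permutation(n_features), n_subsets)
    R = np.zeros((n_features, n_features))
    for idx in subsets:
        # Assumption: a plain bootstrap of rows gives each tree a different transform.
        rows = rng.integers(0, X.shape[0], size=X.shape[0])
        block = X[rows][:, idx]
        # PCA via eigen-decomposition of the subset's covariance matrix.
        # All components are kept, so nothing is discarded.
        cov = np.atleast_2d(np.cov(block, rowvar=False))
        _, components = np.linalg.eigh(cov)
        # Place the subset's components in the rows/columns of its own features.
        R[np.ix_(idx, idx)] = components
    return R

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))          # toy data: 50 cases, 6 real-valued features
R = rotation_matrix(X, n_subsets=3, rng=rng)
X_rotated = X @ R                     # one decision tree would be trained on this
```

Because every subset keeps all of its principal components, the assembled matrix is orthogonal: it rotates the feature space without discarding information. In an ensemble, a fresh rotation would be drawn for each tree.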

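Motivation and Objectives

Determine whether rotation forest is the best classifier for problems with only real-valued features.
Evaluate rotation forest against the main classifier families: support vector machines, tree-based ensembles, and neural networks.
Assess the impact of rotation forest's design components through an ablation study.
Address rotation forest's poor scalability on high-dimensional data by developing a contract-based training mechanism.
Advocate rotation forest as the default algorithm when no domain-specific knowledge is available.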

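Proposed Method

Empirically compare 10 classifiers from three families: support vector machines (RBF and quadratic kernels), tree-based ensembles (random forest, gradient boosting), and neural networks (1–2 hidden layers).
Tune each classifier with a large grid search (about 1,000 hyperparameter combinations), selecting the best model by 10-fold cross-validation on the training data.
Evaluate models on unseen test data using four metrics: classification error, balanced error, area under the ROC curve (AUC), and negative log likelihood.
Isolate the effects of rotation and feature-subset selection through an ablation study that transforms random forest into rotation forest.
Develop a contract version of rotation forest that caps training time in advance, together with a model that predicts training time and guides early stopping.
Implement and release a basic, scikit-learn-compatible version of rotation forest to improve its accessibility.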
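
The four evaluation metrics listed above can be illustrated with a small standard-library sketch. It assumes a binary problem with labels 0 and 1, scores that are predicted probabilities of class 1, and the natural logarithm for the likelihood. The function names and toy values are made up for this example, and the paper's multi-class handling may differ.

```python
import math

def error_rate(y_true, y_pred):
    # Fraction of cases whose predicted label is wrong.
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_error(y_true, y_pred):
    # Mean of the per-class error rates, so small classes count equally.
    classes = sorted(set(y_true))
    per_class = []
    for c in classes:
        preds = [p for t, p in zip(y_true, y_pred) if t == c]
        per_class.append(sum(p != c for p in preds) / len(preds))
    return sum(per_class) / len(classes)

def auc(y_true, scores):
    # Probability that a random positive outranks a random negative (ties count half).
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def negative_log_likelihood(y_true, scores, eps=1e-12):
    # Mean negative log of the probability assigned to the true class.
    total = 0.0
    for t, s in zip(y_true, scores):
        p_true = s if t == 1 else 1.0 - s
        total -= math.log(max(p_true, eps))
    return total / len(y_true)

# Toy test set: three positives, one negative (hypothetical values).
y_true = [1, 1, 1, 0]
scores = [0.9, 0.6, 0.4, 0.5]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

err = error_rate(y_true, y_pred)
bal = balanced_error(y_true, y_pred)
area = auc(y_true, scores)
nll = negative_log_likelihood(y_true, scores)
```

On this toy set the plain error is 0.5 while the balanced error is about 0.67, because the single negative case is misclassified and it carries half of the balanced score. The example shows why the two can diverge on imbalanced data.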

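Experimental Results

Research Questions

RQ1: On real-valued datasets, is rotation forest significantly more accurate on average than other classifiers?
RQ2: Which design components of rotation forest contribute most to its performance?
RQ3: Can contract-based training make rotation forest usable on large problems without sacrificing accuracy?
RQ4: Does hyperparameter tuning benefit rotation forest, or is it robust to its default settings?
RQ5: Should rotation forest be the default classifier for new real-valued classification problems?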

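Key Findings

Across three benchmark collections comprising more than 200 real-valued problems, rotation forest is significantly better on average than every classifier it is compared with, most notably in AUC and log loss.
The ablation study confirms that feature rotation and subset selection are the key drivers of rotation forest's advantage over random forest.
Tuning rotation forest's hyperparameters yields no average performance gain, indicating that it is robust to its default hyperparameters.
On large problems, the contract version of rotation forest cuts training time by an order of magnitude with minimal loss of accuracy, making it suitable for high-dimensional data.
On small problems the contract has little effect; on large problems accuracy improves as the contract time increases, most visibly on time-series data.
Despite its strong performance, rotation forest remains underused because mainstream toolkits do not include it and because of poor default configurations (such as using only 10 trees); the authors address these issues with a new implementation.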
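
The idea of a training-time contract can be sketched generically. The loop below is an assumption-laden illustration, not the paper's timing model: it predicts the cost of the next tree as the mean cost of the trees built so far and stops when that prediction would overrun the budget. The names `build_under_contract` and `fake_train`, and the simulated clock, are hypothetical.

```python
import time

def build_under_contract(train_one_tree, max_trees, budget_seconds,
                         clock=time.perf_counter):
    """Add trees until the ensemble is full or the next tree is predicted to overrun."""
    start = clock()
    trees = []
    while len(trees) < max_trees:
        elapsed = clock() - start
        # Assumption: the next tree costs about the mean of the trees built so far.
        predicted = elapsed / len(trees) if trees else 0.0
        if elapsed + predicted > budget_seconds:
            break
        trees.append(train_one_tree(len(trees)))
    return trees

# Simulated clock so the example is deterministic: each "tree" costs 1 second.
sim = {"now": 0.0}

def fake_train(i):
    sim["now"] += 1.0
    return f"tree-{i}"

trees = build_under_contract(fake_train, max_trees=10, budget_seconds=3.5,
                             clock=lambda: sim["now"])
```

With a 3.5-second budget and one simulated second per tree, the loop builds three trees and then stops, since a fourth is predicted to finish at 4 seconds. A longer contract admits more trees, which fits the finding above that accuracy on large problems improves as the contract time increases.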

This review was generated by AI and reviewed by human editors.