QUICK REVIEW

[Paper Review] The Great Time Series Classification Bake Off: An Experimental Evaluation of Recently Proposed Algorithms. Extended Version

Anthony Bagnall, Aaron Bostrom | arXiv (Cornell University) | Feb 4, 2016

Time Series Analysis and Forecasting · Cited by 40

ONE-SENTENCE SUMMARY

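This paper evaluates 18 recently proposed time series classification (TSC) algorithms in a unified Java framework built on WEKA, running 100 resampling experiments on each of 85 standardised, normalised datasets. The Collective of Transformation Ensembles (COTE) is significantly more accurate than every other algorithm and both benchmarks, averaging 8% higher accuracy than 1-NN dynamic time warping (DTW), which establishes it, under rigorous and reproducible conditions, as the most accurate TSC method to date.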
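The summary stresses that every dataset is normalised. As a reference point, here is a minimal Python sketch of per-series z-normalisation, assuming the common UCR convention of zero mean and unit standard deviation for each series. The use of the population standard deviation and the function name `z_normalise` are assumptions, not details taken from the paper's Java code.

```python
from statistics import fmean, pstdev

def z_normalise(series):
    """Return a copy of `series` with zero mean and unit population standard deviation."""
    mu = fmean(series)
    sigma = pstdev(series)
    if sigma == 0:
        # Constant series: centre it at zero instead of dividing by zero.
        return [0.0 for _ in series]
    return [(x - mu) / sigma for x in series]

raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(z_normalise(raw))  # prints [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

Normalising each series independently removes offset and scale differences, which is why an unnormalised dataset can behave very differently under distance-based classifiers such as 1-NN DTW.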

ABSTRACT

In the last five years there have been a large number of new time series classification algorithms proposed in the literature. These algorithms have been evaluated on subsets of the 47 data sets in the University of California, Riverside time series classification archive. The archive has recently been expanded to 85 data sets, over half of which have been donated by researchers at the University of East Anglia. Aspects of previous evaluations have made comparisons between algorithms difficult. For example, several different programming languages have been used, experiments involved a single train/test split and some used normalised data whilst others did not. The relaunch of the archive provides a timely opportunity to thoroughly evaluate algorithms on a larger number of datasets. We have implemented 18 recently proposed algorithms in a common Java framework and compared them against two standard benchmark classifiers (and each other) by performing 100 resampling experiments on each of the 85 datasets. We use these results to test several hypotheses relating to whether the algorithms are significantly more accurate than the benchmarks and each other. Our results indicate that only 9 of these algorithms are significantly more accurate than both benchmarks and that one classifier, the Collective of Transformation Ensembles, is significantly more accurate than all of the others. All of our experiments and results are reproducible: we release all of our code, results and experimental details and we hope these experiments form the basis for more rigorous testing of new algorithms in the future.
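The 100 resampling experiments per dataset mentioned in the abstract can be pictured as repeated stratified re-splits of the pooled train and test data. The sketch below assumes each resample keeps the original training-set size and per-class counts; that stratification rule and the helper name `stratified_resample` are assumptions for illustration, not a transcription of the authors' code.

```python
import random
from collections import Counter, defaultdict

def stratified_resample(train, test, seed):
    """Pool train and test, then redraw a split with the original per-class training counts.

    `train` and `test` are lists of (series, label) pairs; `seed` makes each resample reproducible.
    """
    rng = random.Random(seed)
    pooled = defaultdict(list)
    for case in train + test:
        pooled[case[1]].append(case)
    train_counts = Counter(label for _, label in train)
    new_train, new_test = [], []
    for label, cases in pooled.items():
        cases = list(cases)        # copy so the pooled lists are not mutated
        rng.shuffle(cases)
        k = train_counts[label]    # 0 for a class that only appears in the test split
        new_train.extend(cases[:k])
        new_test.extend(cases[k:])
    return new_train, new_test
```

Calling it with seeds 0 to 99, e.g. `[stratified_resample(train, test, s) for s in range(100)]`, yields 100 splits; averaging an algorithm's test accuracy over them gives a lower-variance estimate than a single train/test split.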

RESEARCH MOTIVATION AND OBJECTIVES

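To address consistency problems in earlier TSC evaluations, such as reliance on a single train/test split, inconsistent data normalisation and a lack of reproducibility.
To establish a common, standardised framework for evaluating TSC algorithms, using unified Java implementations integrated with the WEKA toolkit.
To provide a fair, reproducible and transparent benchmark for future TSC research by applying consistent preprocessing, resampling and model-selection strategies.
To identify the algorithms that significantly outperform the standard benchmarks (1-NN DTW and 1-NN Euclidean distance) across many types of time series problem.
To release all code, results and experimental details, promoting reproducibility and providing a foundation for future algorithm evaluation.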

PROPOSED METHOD

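All 18 algorithms and the two benchmark classifiers (1-NN DTW and 1-NN Euclidean distance) were implemented in a single Java framework integrated with the WEKA machine learning toolkit.
Each of the 85 time series datasets in the UCR archive was normalised, and 100 resampling experiments were run on each one to obtain robust performance estimates.
Model selection for each algorithm used cross-validation on the training data only, avoiding overfitting to the test set and ensuring a fair comparison.
COTE was implemented as a meta-ensemble that combines several base classifiers, including shapelet-based, elastic-distance-based and random forest classifiers.
Algorithm performance across datasets was compared with non-parametric tests to assess whether the observed differences are statistically significant.
All experimental code, results and configuration files were released publicly to ensure full reproducibility and transparency.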
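To make the benchmarks concrete, here is a minimal sketch of a 1-NN DTW classifier whose warping window is chosen by leave-one-out cross-validation on the training set only, in line with the model-selection principle above. The squared pointwise cost, the Sakoe-Chiba band, the candidate-window list and all function names are assumptions; the paper's actual benchmarks live in its Java/WEKA framework.

```python
import math

def dtw(a, b, window):
    """DTW distance with squared pointwise cost, restricted to a Sakoe-Chiba band of half-width `window`."""
    n, m = len(a), len(b)
    w = max(window, abs(n - m))            # the band must reach the final cell (n, m)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - w), min(m, i + w) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def nn_predict(train, query, window, exclude=None):
    """1-NN prediction under DTW; `exclude` skips one training index (used for leave-one-out)."""
    best_label, best_dist = None, math.inf
    for idx, (series, label) in enumerate(train):
        if idx == exclude:
            continue
        dist = dtw(series, query, window)
        if dist < best_dist:
            best_dist, best_label = dist, label
    return best_label

def select_window(train, candidates):
    """Choose the window with the best leave-one-out accuracy, using the training set only."""
    best_w, best_acc = candidates[0], -1.0
    for w in candidates:
        correct = sum(
            nn_predict(train, series, w, exclude=i) == label
            for i, (series, label) in enumerate(train)
        )
        acc = correct / len(train)
        if acc > best_acc:                 # strict '>' keeps the earliest candidate on ties
            best_w, best_acc = w, acc
    return best_w

train = [([0, 0, 1, 1], "up"), ([0, 1, 1, 1], "up"),
         ([1, 1, 0, 0], "down"), ([1, 0, 0, 0], "down")]
w = select_window(train, candidates=[0, 1, 2])
print(w, nn_predict(train, [0, 0, 0, 1], w))  # prints "0 up"
```

With `window=0` and equal-length series, `dtw` reduces to the squared Euclidean distance, so the same code also covers the 1-NN Euclidean benchmark. Leave-one-out search over many windows is expensive, which is why practical implementations add lower bounds and early abandoning; this sketch omits both.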
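COTE's final prediction combines the votes of its constituent classifiers. A minimal sketch, assuming a proportional scheme in which each member votes for its predicted class with a weight equal to its cross-validated training accuracy; the function name, member names and accuracy values are hypothetical.

```python
from collections import defaultdict

def weighted_vote(predictions, train_accuracies):
    """Return the class with the largest summed weight.

    predictions: member name -> predicted class label
    train_accuracies: member name -> cross-validated training accuracy (used as the vote weight)
    """
    scores = defaultdict(float)
    for member, label in predictions.items():
        scores[label] += train_accuracies[member]
    # Iterate labels in sorted order so ties resolve deterministically to the smallest label.
    return max(sorted(scores), key=lambda label: scores[label])

# Hypothetical constituents and training accuracies, for illustration only.
predictions = {"elastic": "A", "shapelet": "B", "spectral": "B", "acf": "B"}
train_accuracies = {"elastic": 0.95, "shapelet": 0.30, "spectral": 0.30, "acf": 0.30}
print(weighted_vote(predictions, train_accuracies))  # prints "A"
```

Here three weak members outnumber the strong one but carry less total weight (about 0.90 against 0.95), so the ensemble follows the more reliable member rather than a simple majority.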

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

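RQ1: Does the Collective of Transformation Ensembles (COTE) significantly outperform all other evaluated TSC algorithms and the standard benchmarks on the expanded UCR archive?
RQ2: Do recent TSC algorithms consistently outperform traditional methods (such as 1-NN DTW and 1-NN Euclidean distance) across diverse types of time series problem?
RQ3: To what extent do inconsistent normalisation and formatting errors in the UCR archive bias evaluations of algorithm performance?
RQ4: Can a unified, reproducible experimental framework reduce variability and improve the reliability of TSC algorithm comparisons?
RQ5: Which algorithm types (e.g. shapelet-based, ensemble or elastic-distance methods) perform best on particular problem categories (e.g. spectrographs, ECG or simulated data)?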

KEY FINDINGS

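COTE achieved the highest average accuracy over all 85 datasets and was significantly more accurate than every other algorithm and both benchmarks.
Of the 18 algorithms evaluated, only 9 were significantly more accurate than both benchmarks, 1-NN DTW and 1-NN Euclidean distance.
On average, COTE was 8% more accurate than 1-NN DTW, previously regarded as the state-of-the-art baseline.
The shapelet transform and elastic-distance approaches (such as EE) were the strongest components of COTE and contributed to its superior performance.
On spectrograph datasets, vector-based classifiers reached 100% accuracy, while COTE was the most accurate classifier overall across all problem types.
The study confirms that data-quality problems (such as the incorrect normalisation of ECG200 and the unnormalised Coffee data) can substantially bias performance comparisons and inflate reported accuracies.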
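Findings such as "significantly more accurate than both benchmarks" rest on paired comparisons of per-dataset accuracies. As an illustration, here is a pure-Python Wilcoxon signed-rank test using the normal approximation. Choosing this particular test (the summary only says "non-parametric tests"), omitting a tie correction in the variance, and the function name `wilcoxon_signed_rank` are all assumptions.

```python
import math

def wilcoxon_signed_rank(acc_a, acc_b):
    """Two-sided Wilcoxon signed-rank test on paired accuracies (one pair per dataset).

    Uses the normal approximation without tie correction; zero differences are dropped.
    Returns (w_plus, z, p).
    """
    diffs = [a - b for a, b in zip(acc_a, acc_b) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 0.0, 1.0          # no non-zero differences, so no evidence either way
    # Rank |d| in ascending order, giving tied magnitudes their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1   # average of ranks i+1 .. j+1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))
    return w_plus, z, p

acc_a = [0.90, 0.80, 0.85, 0.70, 0.95, 0.60, 0.75, 0.88, 0.92, 0.81]
acc_b = [a - 0.01 * (k + 1) for k, a in enumerate(acc_a)]  # classifier B is always worse
w_plus, z, p = wilcoxon_signed_rank(acc_a, acc_b)
```

When many algorithm pairs are compared over the same datasets, the resulting p-values also need a multiple-testing correction (for example Holm's method), which this sketch leaves out.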

This review was generated by AI and checked by human editors.