QUICK REVIEW

[Paper Review] Outlier Detection as Instance Selection Method for Feature Selection in Time Series Classification

David Cemernek|arXiv (Cornell University)|Nov 16, 2021

Anomaly Detection Techniques and Applications|References: 1|Cited by: 4

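One-Sentence Summary

This paper proposes a novel instance selection method for time series classification that uses outlier detection to prioritize rare, highly discriminative instances during feature selection. By filtering the training data to retain only these rare instances, the method improves classification performance by up to 16% across multiple datasets, indicating that outlier detection is an effective strategy for improving feature selection and model interpretability on imbalanced time series data.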

ABSTRACT

In order to allow machine learning algorithms to extract knowledge from raw data, these data must first be cleaned, transformed, and put into machine-appropriate form. This often very time-consuming phase is referred to as preprocessing. An important step in the preprocessing phase is feature selection, which aims at better performance of prediction models by reducing the number of features of a data set. Within these data sets, instances of different events are often imbalanced, which means that certain normal events are over-represented while other rare events are very limited. Typically, these rare events are of special interest since they have more discriminative power than normal events. The aim of this work was to filter instances provided to feature selection methods for these rare instances, and thus positively influence the feature selection process. In the course of this work, we were able to show that this filtering has a positive effect on the performance of classification models and that outlier detection methods are suitable for this filtering. For some data sets, the resulting increase in performance was only a few percent, but for other data sets, we were able to achieve increases in performance of up to 16 percent. This work should lead to the improvement of the predictive models and the better interpretability of feature selection in the course of the preprocessing phase. In the spirit of open science and to increase transparency within our research field, we have made all our source code and the results of our experiments available in a public repository.

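Research Motivation and Objectives

Enhance feature selection through instance selection in order to improve time series classification performance.
Address class imbalance in time series data, where rare but highly discriminative events are under-represented.
Investigate whether outlier detection can effectively identify and prioritize rare, highly informative instances for feature selection.
Improve the interpretability and robustness of the preprocessing pipeline for time series machine learning.
Promote open science by publicly sharing the code and experimental results.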

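Proposed Method

Before feature selection, the method applies an outlier detection algorithm to the training set to identify and retain only the rare, highly discriminative instances.
Outlier detection serves as a filter that removes over-represented normal instances while keeping rare, potentially informative events.
A standard feature selection procedure is then applied to the filtered dataset to improve model performance.
Several outlier detection algorithms (e.g., Local Outlier Factor, One-Class SVM) are evaluated for their effectiveness in instance selection.
The method is integrated into a modular pipeline that supports multiple feature selectors, classifiers, and evaluation metrics.
The method is evaluated on diverse time series datasets using standard classification benchmarks and performance metrics.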
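
The filter-then-select idea described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each time series has already been converted into a fixed-length feature vector, uses scikit-learn's `LocalOutlierFactor` as the outlier detector and `SelectKBest` with `f_classif` as a stand-in feature selector, and runs on synthetic data. The helper names (`rare_instance_mask`, `fit_selector_on_rare`, `make_toy_data`) and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import LocalOutlierFactor


def rare_instance_mask(X, contamination=0.2, n_neighbors=20):
    """Boolean mask that is True for instances LOF labels as outliers."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    # fit_predict returns -1 for outliers and 1 for inliers.
    return lof.fit_predict(X) == -1


def fit_selector_on_rare(X, y, k=3, **lof_params):
    """Fit a univariate feature selector on the outlier-filtered training set."""
    mask = rare_instance_mask(X, **lof_params)
    if np.unique(y[mask]).size < 2:
        raise ValueError("filtered set must contain at least two classes")
    selector = SelectKBest(score_func=f_classif, k=k).fit(X[mask], y[mask])
    return selector, mask


def make_toy_data(seed=0):
    """Synthetic stand-in (assumption): 180 'normal' rows, 20 dispersed 'rare' rows."""
    rng = np.random.default_rng(seed)
    X_normal = rng.normal(0.0, 1.0, size=(180, 10))
    X_rare = rng.normal(0.0, 3.0, size=(20, 10))
    X_rare[:, :3] += 3.0  # only the first three features carry a mean shift
    X = np.vstack([X_normal, X_rare])
    y = np.array([0] * 180 + [1] * 20)
    return X, y


X, y = make_toy_data()
selector, mask = fit_selector_on_rare(X, y, k=3)
print("instances kept:", int(mask.sum()), "of", len(y))
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```

One possible follow-up, again an assumption rather than something the summary specifies, is to apply `selector.transform` to the training and test sets before fitting a classifier, so that only the feature ranking is influenced by the filtered instances.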

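Experimental Results

Research Questions

RQ1: Can outlier detection effectively identify rare, highly discriminative instances in class-imbalanced time series datasets?
RQ2: Does filtering the training data to retain only these rare instances improve subsequent feature selection and classification performance?
RQ3: How do different outlier detection algorithms compare in their ability to improve instance selection for time series classification?
RQ4: How large a performance gain can this instance selection strategy achieve across diverse datasets?
RQ5: Can the method improve the interpretability and robustness of the feature selection process in time series machine learning?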
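
Questions such as RQ1 and RQ3 could be probed with a small comparison harness like the one below. It is a sketch under stated assumptions, not the evaluation protocol of the paper: it runs scikit-learn's `LocalOutlierFactor` and `OneClassSVM` on synthetic feature vectors, then reports how much the two selections overlap and what share of the rare class each detector keeps. The helper names (`outlier_masks`, `jaccard`, `rare_recall`), the 20 percent outlier fraction, and the toy data are all hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM


def outlier_masks(X, fraction=0.2):
    """Run each detector and return {name: boolean outlier mask}."""
    detectors = {
        "lof": LocalOutlierFactor(n_neighbors=20, contamination=fraction),
        "ocsvm": OneClassSVM(nu=fraction, gamma="scale"),
    }
    # Both detectors label outliers as -1 in fit_predict.
    return {name: det.fit_predict(X) == -1 for name, det in detectors.items()}


def jaccard(a, b):
    """Overlap of two boolean masks: |a AND b| / |a OR b| (0.0 if both empty)."""
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 0.0


def rare_recall(mask, y, rare_label=1):
    """Fraction of rare-class instances that a detector keeps."""
    rare = y == rare_label
    if not rare.any():
        raise ValueError("no instances carry the rare label")
    return float(mask[rare].mean())


# Synthetic stand-in data (assumption): 180 normal rows, 20 dispersed rare rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (180, 10)), rng.normal(0.0, 3.0, (20, 10))])
y = np.array([0] * 180 + [1] * 20)

masks = outlier_masks(X)
for name, m in masks.items():
    print(name, "kept", int(m.sum()), "rare recall", rare_recall(m, y))
print("LOF vs One-Class SVM overlap:", jaccard(masks["lof"], masks["ocsvm"]))
```

Class labels are used here only to score the detectors afterwards; the detectors themselves are fitted without labels.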

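Key Findings

Using outlier detection as an instance selection method can substantially improve classification performance, with gains of up to 16% on some time series datasets.
On other datasets the gain is limited to a few percent, indicating that the method's effectiveness is dataset-dependent.
The method consistently improved model performance across multiple classifiers, including Rotation Forest and DTW1NN.
By focusing on rare, highly informative instances rather than dominant, less discriminative patterns, the method makes feature selection more interpretable.
The results indicate that outlier detection is a viable and effective strategy for filtering training data to improve downstream classification accuracy.
The authors validated the method on diverse time series datasets, supporting its robustness and generalizability.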
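
When reading a figure such as "up to 16 percent", it helps to separate absolute gains (percentage points) from relative gains (percent of the baseline), since the summary does not state which one is meant. The helper below is a small sketch of that arithmetic; the function name `performance_gain` and the example scores are hypothetical and are not results from the paper.

```python
def performance_gain(baseline, filtered):
    """Return (absolute gain in percentage points, relative gain in percent).

    Both scores are expected as fractions, e.g. an accuracy of 0.58.
    """
    if not (0.0 < baseline <= 1.0 and 0.0 <= filtered <= 1.0):
        raise ValueError("scores are expected as fractions, baseline > 0")
    absolute_pp = (filtered - baseline) * 100.0
    relative_pct = (filtered - baseline) / baseline * 100.0
    return absolute_pp, relative_pct


# Made-up scores for illustration only.
abs_pp, rel_pct = performance_gain(baseline=0.50, filtered=0.58)
print(f"absolute: {abs_pp:.1f} pp, relative: {rel_pct:.1f} %")
```

With these made-up scores the same change reads as roughly 8 percentage points in absolute terms and roughly 16 percent in relative terms, which is why the two conventions should not be mixed when comparing datasets.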

This review was generated by AI and reviewed by a human editor.