QUICK REVIEW

[Paper Digest] Feature Selection: A Data Perspective

Jundong Li, Kewei Cheng|arXiv (Cornell University)|Jan 29, 2016

Gene expression and cancer classification | References: 207 | Cited by: 776

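One-sentence summary

A comprehensive survey that revisits feature selection from a data-centric perspective, organizing the field by data type (conventional, structured, heterogeneous, streaming) and by algorithmic approach (similarity based, information theoretical based, sparse learning based, statistical based).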

ABSTRACT

Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine learning problems. The objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities to feature selection. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the era of big data, we revisit feature selection research from a data perspective and review representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for conventional data, we categorize them into four main groups: similarity based, information theoretical based, sparse learning based and statistical based methods. To facilitate and promote the research in this community, we also present an open-source feature selection repository that consists of most of the popular feature selection algorithms (http://featureselection.asu.edu/). Also, we use it as an example to show how to evaluate feature selection algorithms. At the end of the survey, we present a discussion about some open problems and challenges that require more attention in future research.

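Motivation and objectives

Position feature selection as an essential preprocessing step for high-dimensional data, one that improves interpretability, efficiency, and generalization.
Provide a structured taxonomy of feature selection algorithms from a data perspective, covering conventional, structured, heterogeneous, and streaming data.
Identify the challenges and opportunities of the big data era and outline open problems for future research.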

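Proposed approach

Group feature selection methods for conventional data into four categories: similarity based, information theoretical based, sparse learning based, and statistical based.
Extend the discussion to feature selection for structured features (groups, trees, graphs), heterogeneous data (linked, multi-source, multi-view), and streaming data.
Introduce an open-source feature selection repository, scikit-feature, and demonstrate how to use it for evaluation.
Discuss hybrid, deep learning based, and reconstruction based methods as complementary approaches.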
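
To make the filter-style categories above more concrete, here is a minimal sketch of one well-known criterion, the Fisher score, which is commonly grouped with similarity-based methods. It is written from the textbook definition (between-class scatter divided by within-class scatter, per feature) and is not taken from the scikit-feature repository. The function names and the toy data are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_score(column, labels):
    """Between-class scatter over within-class scatter for one feature."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[label].append(value)
    overall_mean = fmean(column)
    between = sum(len(g) * (fmean(g) - overall_mean) ** 2 for g in groups.values())
    within = sum(len(g) * pvariance(g) for g in groups.values())
    if within == 0:
        # No spread inside any class: perfect separation unless the
        # feature is constant everywhere, in which case it carries nothing.
        return float("inf") if between > 0 else 0.0
    return between / within


def rank_features(rows, labels):
    """Return (feature indices from best to worst, per-feature scores)."""
    columns = list(zip(*rows))
    scores = [fisher_score(col, labels) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order, scores


# Toy data (made up): feature 0 tracks the label, feature 1 does not.
rows = [
    (1.0, 5.0),
    (1.2, 3.0),
    (0.9, 4.0),
    (3.0, 4.5),
    (3.1, 3.5),
    (2.9, 4.0),
]
labels = [0, 0, 0, 1, 1, 1]
ranking, scores = rank_features(rows, labels)
print(ranking, scores)
```

Each feature is scored independently here, which keeps the method cheap but means it cannot detect redundancy between features.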

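Experimental results

Research questions

RQ1: Across data types, what are the core categories and criteria for evaluating and selecting features?
RQ2: How do feature selection methods adapt to conventional, structured, heterogeneous, and streaming data?
RQ3: What are the open challenges and future directions for feature selection in the big data setting?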

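Key findings

Feature selection can be categorized by data perspective and by selection strategy (wrapper, filter, embedded); wrapper methods are computationally expensive.
Similarity-based methods preserve the manifold structure of the data and apply in supervised, unsupervised, and semi-supervised settings.
Information-theoretic and sparse learning methods supply criteria for maximizing relevance, minimizing redundancy, or imposing sparsity.
Structured and heterogeneous data call for specialized algorithms that exploit group, tree, or graph structure, or multiple data sources.
Streaming feature selection supports single-pass, dynamic maintenance of the selected features as data and features evolve.
An open-source feature selection repository and evaluation framework is provided to support reproducibility and comparison.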
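
The "maximize relevance, minimize redundancy" idea in the findings above can be illustrated with a small greedy selector in the spirit of mRMR. This is a simplified sketch for discrete features, not the survey's reference implementation: it scores each candidate as its mutual information with the label minus its average mutual information with the features already selected. The function names and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(columns, labels, k):
    """Greedily pick k feature indices: relevant to labels, not redundant."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = []
    remaining = set(range(len(columns)))
    while remaining and len(selected) < k:

        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (made up): feature 1 duplicates feature 0; feature 2 is a
# weaker predictor that carries mostly different information.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
columns = [
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1, 1, 0, 1],
]
selected = mrmr_select(columns, labels, k=2)
print(selected)
```

On this toy data, ranking by relevance alone would keep the two duplicated features, whereas the redundancy term makes the sketch choose features 0 and 2.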


This digest was generated by AI and reviewed by a human editor.