QUICK REVIEW

[Paper Review] Feature Selection: A Data Perspective

Jundong Li, Kewei Cheng|arXiv (Cornell University)|Jan 29, 2016

Gene expression and cancer classification | References: 207 | Citations: 776

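One-Line Summary

A comprehensive, data-centric review of feature selection that organizes the field by data type (conventional, structured, heterogeneous, streaming) and by algorithmic approach (similarity based, information theoretical, sparse learning, statistical).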

ABSTRACT

Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine learning problems. The objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities to feature selection. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the era of big data, we revisit feature selection research from a data perspective and review representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for conventional data, we categorize them into four main groups: similarity based, information theoretical based, sparse learning based and statistical based methods. To facilitate and promote the research in this community, we also present an open-source feature selection repository that consists of most of the popular feature selection algorithms (http://featureselection.asu.edu/). Also, we use it as an example to show how to evaluate feature selection algorithms. At the end of the survey, we present a discussion about some open problems and challenges that require more attention in future research.

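Motivation and Objectives

Motivates feature selection as a preprocessing step that improves interpretability, efficiency, and generalization on high-dimensional data.
Provides a systematic, data-centric taxonomy of feature selection algorithms covering conventional, structured, heterogeneous, and streaming data.
Identifies the challenges and opportunities of the big data era and outlines open problems for future research.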

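Proposed Method

Categorizes feature selection methods for conventional data into four main groups: similarity based, information theoretical based, sparse learning based, and statistical based.
Extends the discussion to feature selection with structured features (group, tree, graph), heterogeneous data (linked, multi-source, multi-view), and streaming data.
Introduces the open-source repository scikit-feature and uses it to demonstrate evaluation practice.
Discusses hybrid, deep learning based, and reconstruction based methods as complementary approaches.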
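
To make the first of the four groups above more concrete, here is a minimal sketch of the Fisher score, a filter criterion commonly grouped with similarity-based methods. It scores each feature by how far the class means are spread relative to the variance inside each class. The function names `fisher_scores` and `top_k_features` and the toy data are hypothetical and are not taken from scikit-feature; the handling of zero within-class variance is an assumption made only for this illustration.

```python
from collections import defaultdict


def fisher_scores(X, y):
    """Return one Fisher score per feature (higher means more discriminative).

    X is a list of rows (lists of floats); y is a list of class labels.
    """
    n_features = len(X[0])

    # Group row indices by class label.
    rows_by_class = defaultdict(list)
    for i, label in enumerate(y):
        rows_by_class[label].append(i)

    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        overall_mean = sum(column) / len(column)

        between = 0.0  # class-size-weighted spread of the class means
        within = 0.0   # class-size-weighted variance inside each class
        for idx in rows_by_class.values():
            values = [column[i] for i in idx]
            n_c = len(values)
            mean_c = sum(values) / n_c
            var_c = sum((v - mean_c) ** 2 for v in values) / n_c
            between += n_c * (mean_c - overall_mean) ** 2
            within += n_c * var_c

        # Assumption for this sketch: a feature with zero within-class
        # variance scores infinity if it separates classes, otherwise 0.
        if within == 0.0:
            scores.append(float("inf") if between > 0 else 0.0)
        else:
            scores.append(between / within)
    return scores


def top_k_features(X, y, k):
    """Return the indices of the k highest-scoring features."""
    scores = fisher_scores(X, y)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 5.2], [3.2, 0.8]]
y = [0, 0, 1, 1]
best = top_k_features(X, y, 1)
```

Because every feature is scored independently, a criterion of this kind is cheap to compute, but it cannot detect redundancy between features.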

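Experimental Results

Research Questions

RQ1: What core categories and evaluation criteria are used to assess and select features across data types?
RQ2: How are feature selection methods adapted to conventional, structured, heterogeneous, and streaming data?
RQ3: What are the open challenges and future directions for feature selection in the big data context?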

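Key Findings

Feature selection methods can be categorized both from a data perspective and by selection strategy (wrapper, filter, embedded); wrapper methods are computationally expensive.
Similarity-based methods preserve the manifold structure of the data and apply in supervised, unsupervised, and semi-supervised settings.
Information-theoretic methods provide criteria that maximize feature relevance and minimize redundancy, while sparse-learning-based methods impose sparsity on the model.
Structured and heterogeneous data call for specialized algorithms that exploit group, tree, or graph structure, or multiple data sources.
Streaming feature selection dynamically maintains a relevant feature set in a single pass as data and features evolve.
An open-source feature selection repository and evaluation framework are provided to promote reproducibility and comparison.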
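
The relevance and redundancy trade-off mentioned in the findings above can be sketched as a greedy selection in the style of mRMR (minimum redundancy, maximum relevance). The sketch below assumes discrete features and estimates mutual information from empirical counts. The function names `mutual_information` and `mrmr` and the toy data are hypothetical, and this is an illustration of the idea rather than the implementation used in the surveyed repository.

```python
from collections import Counter
from math import log2


def mutual_information(a, b):
    """Empirical mutual information I(A;B) in bits for two discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    mi = 0.0
    for (va, vb), c in count_ab.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_a[va] * count_b[vb]))
    return mi


def mrmr(columns, y, k):
    """Greedily pick k feature indices.

    Each step picks the feature whose relevance I(f; y), minus its mean
    redundancy with the already selected features, is largest.
    """
    relevance = [mutual_information(col, y) for col in columns]
    selected = []
    remaining = list(range(len(columns)))
    while remaining and len(selected) < k:

        def score(j):
            if not selected:
                return relevance[j]
            redundancy = sum(
                mutual_information(columns[j], columns[s]) for s in selected
            ) / len(selected)
            return relevance[j] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: feature 1 duplicates feature 0; feature 2 carries new information.
f0 = [0, 0, 1, 1]
f1 = [0, 0, 1, 1]
f2 = [0, 1, 0, 1]
y = [0, 1, 2, 3]  # the label is determined by f0 and f2 together
chosen = mrmr([f0, f1, f2], y, 2)
```

On this toy data all three features are equally relevant on their own, so the redundancy term is what steers the second pick away from the duplicate and toward feature 2.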

This review was generated by AI and checked by a human editor.