QUICK REVIEW

[Paper Review] A Study on Feature Selection Techniques in Educational Data Mining

M. Ramaswami, R. Bhaskaran|ArXiv.org|Dec 19, 2009

Machine Learning and Data Classification | References: 10 | Cited by: 123

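One-Sentence Summary

This study evaluates six filter-based feature selection techniques in educational data mining to identify the best feature subset for predicting student performance. With Naive Bayes as the baseline classifier, the results indicate that reducing feature dimensionality improves predictive accuracy, F-measure, and ROC values while lowering computational cost; the best method is identified by comparative benchmarking across several classifiers.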

ABSTRACT

Educational data mining (EDM) is a new growing research area and the essence of data mining concepts is used in the educational field for the purpose of extracting useful information on the behaviors of students in the learning process. In this EDM, feature selection is to be made for the generation of subset of candidate variables. As the feature selection influences the predictive accuracy of any performance model, it is essential to study elaborately the effectiveness of student performance model in connection with feature selection techniques. In this connection, the present study is devoted not only to investigate the most relevant subset features with minimum cardinality for achieving high predictive performance by adopting various filtered feature selection techniques in data mining but also to evaluate the goodness of subsets with different cardinalities and the quality of six filtered feature selection algorithms in terms of F-measure value and Receiver Operating Characteristics (ROC) value, generated by the NaiveBayes algorithm as baseline classifier method. The comparative study carried out by us on six filter feature selection algorithms reveals the best method, as well as optimal dimensionality of the feature subset. Benchmarking of filter feature selection method is subsequently carried out by deploying different classifier models. The result of the present study effectively supports the well-known fact of increase in the predictive accuracy with the existence of minimum number of features. The expected outcomes show a reduction in computational time and constructional cost in both training and classification phases of the student performance model.

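Research Motivation and Objectives

Identify the relevant feature subset, with minimum cardinality, needed to predict student performance.
Evaluate how effectively six filter-based feature selection algorithms improve model performance.
Assess how feature subset size affects F-measure and ROC values.
Benchmark the best-performing feature selection method across several classifier models.
Reduce computation time and training cost in student performance modeling through optimal feature selection.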

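Proposed Method

Apply filter-based feature selection techniques to extract relevant features from the educational dataset using statistical measures.
Evaluate six specific filter algorithms on their ability to select high-quality feature subsets.
Use Naive Bayes as the baseline classifier to compute F-measure and ROC values for each selected feature subset.
Measure performance with F-measure and the area under the receiver operating characteristic curve (AUC) to assess classification quality.
Test feature subsets of different cardinalities to determine the optimal dimensionality.
Further validate the best-performing feature selection method across several classifier models as a benchmark.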
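
The steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes scikit-learn, uses a synthetic dataset in place of the student records, and substitutes the ANOVA F-score (`f_classif`) for the six filter algorithms, which this summary does not name. The subset sizes and the helper `evaluate_subset_size` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a student-performance table (assumption: the real
# dataset is not available in this summary).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)


def evaluate_subset_size(k):
    """Return (F-measure, ROC AUC) of Naive Bayes on the top-k filtered features."""
    # The filter sits inside the pipeline so it is refit on each training fold,
    # which keeps the held-out fold out of the feature ranking.
    model = make_pipeline(SelectKBest(score_func=f_classif, k=k), GaussianNB())
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    pred = (proba >= 0.5).astype(int)
    return f1_score(y, pred), roc_auc_score(y, proba)


# Sweep a few subset cardinalities (hypothetical choices) and keep the one
# with the highest F-measure.
results = {k: evaluate_subset_size(k) for k in (2, 5, 10, 20)}
best_k = max(results, key=lambda k: results[k][0])

for k, (f1, auc) in results.items():
    print(f"k={k:2d}  F-measure={f1:.3f}  ROC AUC={auc:.3f}")
print("best subset size by F-measure:", best_k)
```

Swapping `f_classif` for another scoring function, such as `chi2` or `mutual_info_classif`, would compare different filters under the same protocol.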

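Experimental Results

Research Questions

RQ1: Which filter-based feature selection technique yields the highest predictive accuracy in the student performance model?
RQ2: How does the cardinality of the selected feature subset affect F-measure and ROC values?
RQ3: What feature subset dimensionality maximizes model performance?
RQ4: How does feature selection reduce computational cost in the training and classification phases?
RQ5: Which feature selection method performs consistently across several classifier models?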
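
For reference, the two metrics behind these questions are commonly defined as below. This assumes the balanced (F1) form of the F-measure, since the summary does not state which weighting the paper uses. Here P is precision, R is recall, and TP, FP, and FN are the true-positive, false-positive, and false-negative counts.

```latex
F = \frac{2 \, P \, R}{P + R}, \qquad
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}

\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \; d(\mathrm{FPR})
```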

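Key Findings

The study identifies the most effective filter-based feature selection method based on the F-measure and ROC performance metrics.
Reducing feature dimensionality improves predictive accuracy, confirming the advantage of a compact but relevant feature set.
An optimal feature subset size is found to improve model performance while minimizing computational overhead.
The best-performing feature selection method shows consistent superiority across several classifier models.
Feature subset optimization significantly reduces computation time and model construction cost.
The results support the established principle that, in educational data mining, fewer but higher-quality features yield better predictive models.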
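
A cross-classifier benchmark of the kind described in these findings might look like the sketch below. It is illustrative only: the four classifiers, the synthetic data, the `f_classif` filter, and the subset size of 5 are all assumptions, since the summary does not list the models or the selected dimensionality used in the paper. The helper `benchmark` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the student-performance data (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=2, random_state=0)

# Hypothetical set of classifiers to benchmark against.
classifiers = {
    "naive_bayes": GaussianNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def benchmark(k):
    """Mean cross-validated (F-measure, ROC AUC) per classifier on the top-k features."""
    scores = {}
    for name, clf in classifiers.items():
        model = make_pipeline(SelectKBest(score_func=f_classif, k=k), clf)
        res = cross_validate(model, X, y, cv=cv, scoring=("f1", "roc_auc"))
        scores[name] = (res["test_f1"].mean(), res["test_roc_auc"].mean())
    return scores


full = benchmark(k=X.shape[1])   # all 20 features, i.e. no reduction
reduced = benchmark(k=5)         # hypothetical reduced subset

for name in classifiers:
    print(f"{name:20s} full F1={full[name][0]:.3f}  reduced F1={reduced[name][0]:.3f}")
```

Comparing the `full` and `reduced` rows per classifier shows whether a smaller subset helps consistently or only for some models on a given dataset.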


This review was generated by AI and checked by human editors.