QUICK REVIEW

[Paper Digest] Maximum Relevance and Minimum Redundancy Feature Selection Methods for a Marketing Machine Learning Platform

Zhenyu Zhao, Radhika Anand|arXiv (Cornell University)|Aug 15, 2019

Topic: Machine Learning and Data Classification | References: 22 | Citations: 23

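One-Sentence Summary

This paper extends and evaluates mRMR feature selection methods for large-scale marketing machine learning, including the variants FCQ, RFCQ, and RFRQ. The extensions add a non-linear redundancy measure (RDC) and a model-based relevance measure (such as random forest importance). FCQ offered the best balance of AUC and running time, and it was deployed in Uber's automated machine learning platform, where it improved model scalability and delivered a 12% incremental adoption rate in a live cross-sell campaign.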

ABSTRACT

In machine learning applications for online product offerings and marketing strategies, there are often hundreds or thousands of features available to build such models. Feature selection is one essential method in such applications for multiple objectives: improving the prediction accuracy by eliminating irrelevant features, accelerating the model training and prediction speed, reducing the monitoring and maintenance workload for feature data pipeline, and providing better model interpretation and diagnosis capability. However, selecting an optimal feature subset from a large feature space is considered as an NP-complete problem. The mRMR (Minimum Redundancy and Maximum Relevance) feature selection framework solves this problem by selecting the relevant features while controlling for the redundancy within the selected features. This paper describes the approach to extend, evaluate, and implement the mRMR feature selection methods for classification problem in a marketing machine learning platform at Uber that automates creation and deployment of targeting and personalization models at scale. This study first extends the existing mRMR methods by introducing a non-linear feature redundancy measure and a model-based feature relevance measure. Then an extensive empirical evaluation is performed for eight different feature selection methods, using one synthetic dataset and three real-world marketing datasets at Uber to cover different use cases. Based on the empirical results, the selected mRMR method is implemented in production for the marketing machine learning platform. A description of the production implementation is provided and an online experiment deployed through the platform is discussed.

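Research Motivation and Objectives

Address the challenge of selecting an optimal feature subset from a large, high-dimensional marketing feature space within an automated machine learning platform.
Improve mRMR by introducing an RDC-based non-linear redundancy measure and a model-based relevance measure (such as random forest feature importance).
Evaluate several mRMR variants on synthetic and real-world marketing datasets for both classification performance and computational efficiency.
Implement and optimize the best-performing method (FCQ) in production, using Scala Spark for scalability and low-latency inference.
Validate the real business impact of the method on user cross-sell targeting through an online A/B experiment.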

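Proposed Method

Proposes a non-linear redundancy measure based on the Randomized Dependence Coefficient (RDC), which captures complex feature dependencies beyond linear correlation.
Introduces a model-based relevance measure by replacing mutual information with the feature importance scores of a trained model (such as a random forest).
Works with several variants within the mRMR framework, including FCQ (F-statistic relevance with Pearson correlation redundancy), RFCQ (random forest relevance with correlation redundancy), and RFRQ (random forest relevance with RDC redundancy).
Uses a greedy, iterative selection procedure that maximizes relevance to the target while minimizing redundancy among the features already selected.
Optimizes the production pipeline by deploying FCQ in Scala Spark, using DataFrames and RDDs to improve performance and memory efficiency.
Applies feature selection after downsampling to reduce the computational load while keeping the data representative.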
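
The greedy selection step can be illustrated with a minimal pure-Python sketch of an FCQ-style quotient score. It assumes relevance is the one-way ANOVA F-statistic of a feature against the class label and redundancy is the mean absolute Pearson correlation with the features already chosen. The function names, the `eps` floor, and the toy data are hypothetical and are not taken from the paper or from Uber's Scala Spark implementation.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5 if va > 0 and vb > 0 else 0.0


def f_statistic(values, labels):
    """One-way ANOVA F-statistic of one feature across the class labels."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    grand = mean(values)
    n, k = len(values), len(groups)
    between = within = 0.0
    for g in groups.values():
        mg = mean(g)
        between += len(g) * (mg - grand) ** 2
        within += sum((v - mg) ** 2 for v in g)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return (between / (k - 1)) / (within / (n - k))


def fcq_select(features, labels, k, eps=1e-6):
    """Greedily pick up to k feature names by relevance / redundancy.

    eps is an assumed floor that avoids division by zero when a candidate
    is uncorrelated with every selected feature.
    """
    relevance = {name: f_statistic(col, labels) for name, col in features.items()}
    selected = [max(relevance, key=relevance.get)]  # most relevant feature first
    remaining = [name for name in features if name != selected[0]]
    while remaining and len(selected) < k:
        def score(name):
            redundancy = mean(
                abs(pearson(features[name], features[s])) for s in selected
            )
            return relevance[name] / max(redundancy, eps)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up toy data: "a_dup" is nearly a rescaled copy of "a".
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [1, 2, 3, 4, 3, 4, 5, 6],
    "a_dup": [2, 4, 6, 9, 6, 8, 10, 13],
    "b": [3, 1, 3, 1, 4, 2, 4, 3],
    "noise": [1, 2, 2, 1, 1, 2, 2, 1],
}
print(fcq_select(features, labels, 2))
```

On this toy data, ranking by relevance alone would place `a_dup` second, while the quotient score prefers `b` because it is only weakly correlated with `a`.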

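Experimental Results

Research Questions

RQ1: In marketing classification tasks, does a non-linear RDC-based redundancy measure improve feature selection performance compared with linear correlation?
RQ2: On real-world marketing datasets, how does model-based relevance (such as random forest importance) compare with mutual information within mRMR?
RQ3: Across multiple marketing use cases, which mRMR variant (FCQ, RFCQ, RFRQ) achieves the best balance between predictive performance (AUC) and computational efficiency?
RQ4: Can the FCQ method be scaled and maintained effectively in a production automated machine learning platform with low-latency requirements?
RQ5: What is the real-world business impact of using the selected feature selection method in a live marketing campaign?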

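Key Findings

The FCQ variant performed robustly across multiple classification models and was highly efficient computationally, making it suitable for large-scale deployment.
The RFCQ and RFRQ variants achieved the best results with random forest models and also performed well with other models, supporting the value of model-based relevance.
FCQ was implemented in Uber's production machine learning platform using Scala Spark, and optimized use of DataFrames and RDDs significantly reduced its running time.
An online experiment with the FCQ-driven model showed a 12% increase in new product adoption over the baseline among the top 60% of high-potential users (p < 0.05).
The top 20% of users by predicted conversion probability had an actual adoption rate 4 times the baseline, confirming the model's effectiveness.
The feature selection pipeline reduced model training and prediction latency, simplified maintenance of the feature pipeline, and improved model interpretability.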
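
A top-fraction comparison of the kind reported in these findings can be computed as the adoption rate among the highest-scored users divided by a baseline rate. The sketch below assumes the baseline is the overall adoption rate across all scored users, which this digest does not specify. The function name and the ten-user sample are made up for illustration and are not data from the experiment.

```python
def top_fraction_lift(scores, adopted, fraction):
    """Adoption rate in the top `fraction` of users by score, divided by
    the overall adoption rate (the assumed baseline)."""
    if len(scores) != len(adopted) or not scores:
        raise ValueError("scores and adopted must be non-empty and equal length")
    base_rate = sum(adopted) / len(adopted)
    if base_rate == 0:
        raise ValueError("baseline adoption rate is zero")
    # Rank users from highest to lowest predicted score.
    ranked = sorted(zip(scores, adopted), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(flag for _, flag in ranked[:top_n]) / top_n
    return top_rate / base_rate


# Made-up example: 10 users, 4 of whom adopted.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
adopted = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(top_fraction_lift(scores, adopted, 0.2))
```

Here the two highest-scored users both adopted, so their rate is 1.0 against an overall rate of 0.4.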

This digest was generated by AI and reviewed by a human editor.