QUICK REVIEW

[Paper Review] Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning

Guillaume Lemaître, Fernando Nogueira|arXiv (Cornell University)|Sep 21, 2016

Imbalanced Data Classification Techniques|References: 13|Cited by: 1,560

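ONE-SENTENCE SUMMARY

Imbalanced-learn is a scikit-learn-compatible Python toolbox that addresses imbalanced datasets through under-sampling, over-sampling, combined sampling, and ensemble methods.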

ABSTRACT

Imbalanced-learn is an open-source python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced dataset frequently encountered in machine learning and pattern recognition. The implemented state-of-the-art methods can be categorized into 4 groups: (i) under-sampling, (ii) over-sampling, (iii) combination of over- and under-sampling, and (iv) ensemble learning methods. The proposed toolbox only depends on numpy, scipy, and scikit-learn and is distributed under MIT license. Furthermore, it is fully compatible with scikit-learn and is part of the scikit-learn-contrib supported project. Documentation, unit tests as well as integration tests are provided to ease usage and contribution. The toolbox is publicly available in GitHub: https://github.com/scikit-learn-contrib/imbalanced-learn.

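MOTIVATION AND OBJECTIVES

Address the prevalence and impact of class imbalance in real-world datasets, in domains such as fraud detection and medical diagnosis.
Provide a Python API that covers a wide range of state-of-the-art techniques for handling imbalance.
Ensure compatibility with scikit-learn and high code quality to encourage adoption and contribution.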

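PROPOSED METHOD

Implements four groups of strategies: under-sampling, over-sampling, a combination of the two, and ensemble learning.
Provides sampler classes with fit, sample, and fit_sample methods, following a scikit-learn-like API.
Supports common imbalance-handling techniques such as SMOTE, its Borderline variants, Tomek links, and several ENN-based cleaning methods.
Uses a scikit-learn-compatible Pipeline class to chain samplers, transformers, and estimators.
Depends only on numpy, scipy, and scikit-learn; is released under the MIT license and is part of scikit-learn-contrib.
Maintains high code quality through unit tests (99% coverage as of release 0.1.8) and continuous integration.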
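
The fit, sample, and fit_sample trio described above can be illustrated with a toy random under-sampler. This is a minimal sketch using only the standard library, not imbalanced-learn's code: the class name `ToyRandomUnderSampler`, its attributes, and the toy data are hypothetical, and the choice to shrink every class to the minority count is an assumption made for the example. Recent imbalanced-learn releases expose `fit_resample` instead; the sketch follows the method names given in this summary.

```python
import random
from collections import Counter


class ToyRandomUnderSampler:
    """Hypothetical sampler mirroring the fit / sample / fit_sample trio.

    Stdlib-only illustration; this is NOT imbalanced-learn's implementation.
    """

    def __init__(self, random_state=None):
        self.random_state = random_state

    def fit(self, X, y):
        # Record the class counts and the target size (the minority count).
        self.class_counts_ = Counter(y)
        self.target_size_ = min(self.class_counts_.values())
        return self

    def sample(self, X, y):
        # Keep target_size_ randomly chosen rows per class, without replacement.
        rng = random.Random(self.random_state)
        kept = []
        for label in sorted(self.class_counts_):
            indices = [i for i, lab in enumerate(y) if lab == label]
            kept.extend(rng.sample(indices, self.target_size_))
        kept.sort()  # preserve the original row order
        return [X[i] for i in kept], [y[i] for i in kept]

    def fit_sample(self, X, y):
        # Convenience shortcut: fit, then sample, on the same data.
        return self.fit(X, y).sample(X, y)


# Toy data (assumed): 8 majority rows (label 0) and 2 minority rows (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_res, y_res = ToyRandomUnderSampler(random_state=0).fit_sample(X, y)
balanced_counts = Counter(y_res)
```

Keeping the statistics in `fit` and the row selection in `sample` is what lets such an object sit inside a pipeline next to transformers and estimators, which is the design point the list above makes.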

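EXPERIMENTAL RESULTS

Research Questions

RQ1: Which imbalance-handling techniques are most effective across different datasets and imbalance ratios?
RQ2: How can a unified Python toolbox simplify the application of under-sampling, over-sampling, and ensemble methods within standard ML workflows?
RQ3: Which API design aligns best with scikit-learn so as to encourage adoption and extension?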

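Key Findings

The toolbox implements four main strategies: under-sampling, over-sampling, a combination of the two, and ensemble learning.
Sampler classes provide fit, sample, and fit_sample methods that mirror the scikit-learn API.
It includes SMOTE and its variants, as well as random over-sampling and several under-sampling and cleaning techniques.
The ensemble methods EasyEnsemble and BalanceCascade offer an alternative to relying on a single balanced dataset.
The project is MIT-licensed, depends on numpy, scipy, and scikit-learn, and is fully compatible with scikit-learn; it ships with documentation, tests, and CI integration.
Code quality and project activity are high: release 0.1.8 reaches 99% test coverage, and development on GitHub is active.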
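
To make the ensemble finding concrete, the following sketch shows the idea behind EasyEnsemble-style sampling: instead of one balanced dataset, several balanced subsets are drawn, each keeping all minority rows plus a fresh random draw from the majority class. It is a simplified, stdlib-only illustration under stated assumptions (binary labels, the hypothetical function name `easy_ensemble_subsets`, made-up toy data), not the toolbox's implementation. Training one classifier per subset and combining their votes is left out.

```python
import random
from collections import Counter


def easy_ensemble_subsets(X, y, n_subsets=3, seed=0):
    """Return n_subsets balanced (X, y) pairs for a binary problem.

    Each subset keeps every minority row and adds an equally sized
    random sample (without replacement) of the majority rows.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    minority = min(counts, key=counts.get)
    minority_idx = [i for i, lab in enumerate(y) if lab == minority]
    majority_idx = [i for i, lab in enumerate(y) if lab != minority]

    subsets = []
    for _ in range(n_subsets):
        drawn = rng.sample(majority_idx, len(minority_idx))
        idx = sorted(minority_idx + drawn)
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets


# Toy data (assumed): 9 majority rows (label 0) and 3 minority rows (label 1).
X = [[float(i)] for i in range(12)]
y = [0] * 9 + [1] * 3

subsets = easy_ensemble_subsets(X, y, n_subsets=3, seed=0)
subset_counts = [dict(Counter(y_sub)) for _, y_sub in subsets]
```

Because each subset sees a different slice of the majority class, less majority information is discarded overall than with a single round of under-sampling.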

This review was generated by AI and reviewed by a human editor.