QUICK REVIEW

[Paper Review] Iterative MapReduce for Large Scale Machine Learning

Joshua Rosen, Neoklis Polyzotis|arXiv (Cornell University)|Mar 13, 2013

Cloud Computing and Resource Management|References: 11|Cited by: 26

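Summary

This paper proposes Iterative MapReduce, an extension of the MapReduce model that natively supports the iterative computation required by large-scale machine learning workloads. By introducing a first-class loop construct and a static optimizer that tunes data partitioning and the aggregation tree structure, the system achieves performance comparable to specialized systems such as Vowpal Wabbit and outperforms standard Hadoop, without assuming either in-memory caching or disk-only I/O.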

ABSTRACT

Large datasets ("Big Data") are becoming ubiquitous because the potential value in deriving insights from data, across a wide range of business and scientific applications, is increasingly recognized. In particular, machine learning - one of the foundational disciplines for data analysis, summarization and inference - on Big Data has become routine at most organizations that operate large clouds, usually based on systems such as Hadoop that support the MapReduce programming paradigm. It is now widely recognized that while MapReduce is highly scalable, it suffers from a critical weakness for machine learning: it does not support iteration. Consequently, one has to program around this limitation, leading to fragile, inefficient code. Further, reliance on the programmer is inherently flawed in a multi-tenanted cloud environment, since the programmer does not have visibility into the state of the system when his or her program executes. Prior work has sought to address this problem by either developing specialized systems aimed at stylized applications, or by augmenting MapReduce with ad hoc support for saving state across iterations (driven by an external loop). In this paper, we advocate support for looping as a first-class construct, and propose an extension of the MapReduce programming paradigm called Iterative MapReduce. We then develop an optimizer for a class of Iterative MapReduce programs that cover most machine learning techniques, provide theoretical justifications for the key optimization steps, and empirically demonstrate that system-optimized programs for significant machine learning tasks are competitive with state-of-the-art specialized solutions.

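Research Motivation and Goals

Address the fundamental limitation of MapReduce in efficiently handling iterative machine learning algorithms.
Enable system-driven optimization in multi-tenant cloud environments, where programmers cannot reliably tune low-level parameters.
Develop a principled static optimizer that selects the optimal runtime configuration for Iterative MapReduce programs.
Demonstrate that automated optimization can match the performance of state-of-the-art specialized machine learning systems such as Vowpal Wabbit.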

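Proposed Method

Extend the MapReduce model with a first-class loop construct so that iterative machine learning algorithms can be expressed natively.
Design a new runtime that supports loop-aware scheduling, data caching, and efficient aggregation across iterations.
Develop a static optimizer that selects the optimal data partitioning and aggregation tree fan-in based on a theoretical analysis of communication and computation costs.
Empirically validate the optimizer's choices on a 120-node cluster using real-world machine learning workloads, measuring response time and cost.
Apply the theoretical model to predict the optimal fan-in and number of machines, and verify the predictions through controlled experiments.
Combine theoretical justification with empirical evaluation to tune system parameters such as data partitioning and aggregation structure.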
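To make the idea of a first-class loop construct concrete, the following is a minimal single-process sketch, not the paper's actual API. The function `iterative_mapreduce` and its parameters (`map_fn`, `reduce_fn`, `update_fn`, `converged`) are hypothetical names chosen for illustration, and the workload (batch gradient descent fitting y = w * x) is an assumed example. The point is that the loop lives inside the construct itself rather than in an external driver, which is what would give a runtime the chance to cache partitions and plan aggregation across iterations.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]   # (x, y) pair; 1-D data keeps the sketch small
Partial = Tuple[float, int]     # (sum of gradients, number of examples)


def iterative_mapreduce(
    partitions: Sequence[Sequence[Example]],
    init_model: float,
    map_fn: Callable[[float, Sequence[Example]], Partial],
    reduce_fn: Callable[[Partial, Partial], Partial],
    update_fn: Callable[[float, Partial], float],
    converged: Callable[[float, float], bool],
    max_iters: int = 1000,
) -> Tuple[float, int]:
    """Hypothetical loop construct: repeat map -> reduce -> update until converged."""
    model = init_model
    for iteration in range(1, max_iters + 1):
        # Map: each partition produces one partial statistic for the current model.
        partials: List[Partial] = [map_fn(model, part) for part in partitions]
        # Reduce: combine partials pairwise (a real runtime would use an aggregation tree).
        aggregate = partials[0]
        for partial in partials[1:]:
            aggregate = reduce_fn(aggregate, partial)
        # Update: produce the next model from the global aggregate.
        new_model = update_fn(model, aggregate)
        if converged(model, new_model):
            return new_model, iteration
        model = new_model
    return model, max_iters


def gradient_map(w: float, part: Sequence[Example]) -> Partial:
    # Gradient of 0.5 * (w*x - y)^2 with respect to w, summed over the partition.
    return sum((w * x - y) * x for x, y in part), len(part)


def gradient_reduce(a: Partial, b: Partial) -> Partial:
    return a[0] + b[0], a[1] + b[1]


def make_update(learning_rate: float) -> Callable[[float, Partial], float]:
    def update(w: float, aggregate: Partial) -> float:
        grad_sum, count = aggregate
        return w - learning_rate * grad_sum / count
    return update


# Toy data generated from y = 2x, split into two partitions.
partitions = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w, iterations = iterative_mapreduce(
    partitions, 0.0, gradient_map, gradient_reduce, make_update(0.1),
    converged=lambda old, new: abs(new - old) < 1e-9,
)
```

In this sketch the learned weight `w` approaches 2.0. The reduce step is associative and commutative, which is the property that lets a runtime regroup it into a tree with whatever fan-in the optimizer picks.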

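Experimental Results

Research Questions

RQ1: Does introducing a first-class loop abstraction into MapReduce significantly improve the efficiency and maintainability of large-scale machine learning workloads compared with ad hoc iterative programming?
RQ2: How does a static optimizer automatically select the optimal data partitioning and aggregation tree configuration for Iterative MapReduce programs?
RQ3: To what extent can system-driven optimization match or exceed the performance of specialized, hand-tuned machine learning systems such as Vowpal Wabbit?
RQ4: Does the theoretically predicted constant optimal fan-in (approximately e) still hold under real-system constraints and overheads?
RQ5: Can the optimizer effectively balance response time and cost in a dynamic, multi-tenant cloud environment?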
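One way to see why a constant optimal fan-in of about e can arise, as RQ4 mentions, is the following short derivation. It rests on a simplified cost model that is an assumption here, not a quotation of the paper's analysis: N leaf partitions are aggregated by a tree of fan-in f, each level takes time proportional to f (a node absorbs its f inputs one after another, at assumed cost c per input), and the tree has log_f N levels.

```latex
T(f) = c\,f\,\log_f N = c\,\ln N\,\frac{f}{\ln f},
\qquad
\frac{dT}{df} = c\,\ln N\,\frac{\ln f - 1}{(\ln f)^2} = 0
\;\Longrightarrow\; f^{*} = e \approx 2.718 .
```

Under this model the minimizer depends on neither N nor c, which is why the prediction is a constant. Any cost that the model leaves out, such as a fixed per-level startup cost, would change the numerator and move the optimum.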

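Key Findings

In most configurations, the empirically optimal fan-in of the aggregation tree is stable at 4 or 5, deviating from the theoretically predicted value of e because of unmodeled initialization overhead.
For a 100 GB dataset, the optimizer correctly identifies that N = 120 CPUs minimizes response time while N = 24 CPUs minimizes cost, consistent with the theoretical predictions.
With system-optimized configurations, Iterative MapReduce outperforms standard Hadoop and achieves performance comparable to the state-of-the-art Vowpal Wabbit system.
The system performs robustly across data sizes and cluster configurations, with theoretical predictions remaining consistent with empirical results.
The static optimizer selected an efficient execution plan in every tested configuration, with no manual tuning by the programmer.
The results confirm that system-driven optimization is feasible and effective in cloud environments where runtime conditions fluctuate unpredictably.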
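The gap between a predicted fan-in near e and an observed optimum of 4 or 5 can be explored with a small numerical sketch. The cost model below is an assumption made for illustration, not the paper's: a tree over `n_leaves` partitions has log(n_leaves)/log(fan_in) levels, and each level costs `fan_in * per_input_cost` plus a fixed `per_level_overhead` standing in for initialization cost. All function names and overhead values are hypothetical and are not measurements from the paper.

```python
import math


def aggregation_time(fan_in: float, n_leaves: int,
                     per_input_cost: float = 1.0,
                     per_level_overhead: float = 0.0) -> float:
    """Assumed model: (number of tree levels) * (time spent per level)."""
    levels = math.log(n_leaves) / math.log(fan_in)
    return levels * (fan_in * per_input_cost + per_level_overhead)


def best_integer_fan_in(n_leaves: int, per_level_overhead: float = 0.0) -> int:
    """Integer fan-in in [2, n_leaves] with the lowest modeled time."""
    return min(
        range(2, n_leaves + 1),
        key=lambda f: aggregation_time(f, n_leaves,
                                       per_level_overhead=per_level_overhead),
    )


def best_real_fan_in(n_leaves: int, per_level_overhead: float = 0.0,
                     step: float = 0.001) -> float:
    """Grid search over real-valued fan-ins between 1.5 and 20."""
    candidates = [1.5 + i * step for i in range(int((20 - 1.5) / step) + 1)]
    return min(
        candidates,
        key=lambda f: aggregation_time(f, n_leaves,
                                       per_level_overhead=per_level_overhead),
    )


# Compare the optimum with and without an illustrative fixed per-level overhead.
no_overhead = best_integer_fan_in(120)
small_overhead = best_integer_fan_in(120, per_level_overhead=1.0)
larger_overhead = best_integer_fan_in(120, per_level_overhead=3.0)
```

With no overhead, the real-valued optimum of this model sits at e and the best integer fan-in is 3. Adding a fixed per-level overhead of one to three input-costs moves the best integer fan-in to 4 and then 5, which is the same direction as the deviation reported above. The exact overhead in the real system is not given in this summary.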


This review was generated by AI and reviewed by human editors.