QUICK REVIEW

[Paper Review] Optimal Learners for Multiclass Problems

Amit Daniely, Shai Shalev‐Shwartz|arXiv (Cornell University)|May 10, 2014

Machine Learning and Algorithms | References: 16 | Cited by: 34

One-Sentence Summary

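This paper proves that optimal multiclass learning requires improper learning, that is, outputting hypotheses outside the hypothesis class, and shows that no empirical risk minimization (ERM) rule can be optimal. It introduces a new dimension $\dim(\mathcal{H})$ that lower-bounds the sample complexity and is conjectured to characterize it up to constant factors, proves that the one-inclusion algorithm achieves near-optimal sample complexity, and constructs computationally efficient optimal learners for generalized linear classifiers whose sample complexity is better than that of ERM.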

ABSTRACT

The fundamental theorem of statistical learning states that for binary classification problems, any Empirical Risk Minimization (ERM) learning rule has close to optimal sample complexity. In this paper we seek for a generic optimal learner for multiclass prediction. We start by proving a surprising result: a generic optimal multiclass learner must be improper, namely, it must have the ability to output hypotheses which do not belong to the hypothesis class, even though it knows that all the labels are generated by some hypothesis from the class. In particular, no ERM learner is optimal. This brings back the fundamental question of "how to learn"? We give a complete answer to this question by giving a new analysis of the one-inclusion multiclass learner of Rubinstein et al. (2006) showing that its sample complexity is essentially optimal. Then, we turn to study the popular hypothesis class of generalized linear classifiers. We derive optimal learners that, unlike the one-inclusion algorithm, are computationally efficient. Furthermore, we show that the sample complexity of these learners is better than the sample complexity of the ERM rule, thus settling in negative an open question due to Collins (2005).

Motivation and Objectives

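To address a fundamental question raised by the limitations of ERM rules in multiclass learning: how can optimal learning be achieved?
To characterize the sample complexity of multiclass hypothesis classes through a new combinatorial dimension $\dim(\mathcal{H})$ that generalizes the VC dimension to the multiclass setting.
To prove that the one-inclusion algorithm achieves near-optimal sample complexity in both the transductive and PAC settings, improving on earlier analyses.
To construct computationally efficient optimal learners for generalized linear classifiers whose sample complexity is better than that of ERM.
To settle, in the negative, an open question due to Collins (2005) by proving that ERM is not optimal for generalized linear classifiers.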
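
The difference between a proper rule such as ERM and an improper rule can be seen on a toy class. The sketch below is only an illustration under stated assumptions: the class `H`, the four-point domain, and the plurality-vote rule are hypothetical and are not the paper's construction. It shows that ERM always returns a member of the class, while a vote over the consistent hypotheses can return a predictor that no member of the class matches.

```python
from collections import Counter

# Hypothetical toy class: each hypothesis is a tuple of labels over the domain {0, 1, 2, 3}.
H = [(0, 2, 2, 1), (0, 2, 1, 2), (0, 1, 2, 2), (1, 2, 2, 2)]
DOMAIN = range(4)


def empirical_error(h, sample):
    """Number of labeled examples (x, y) that hypothesis h gets wrong."""
    return sum(h[x] != y for x, y in sample)


def erm(hypotheses, sample):
    """Proper rule: return a member of the class with minimal empirical error."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample))


def plurality_vote(hypotheses, sample):
    """Improper rule: at each point, predict the most common label among
    the hypotheses that are consistent with the sample."""
    version_space = [h for h in hypotheses if empirical_error(h, sample) == 0]
    return tuple(
        Counter(h[x] for h in version_space).most_common(1)[0][0]
        for x in DOMAIN
    )


sample = [(0, 0)]                      # one labeled example: point 0 has label 0
proper = erm(H, sample)                # always an element of H
improper = plurality_vote(H, sample)   # may fall outside H

print(proper, proper in H)
print(improper, improper in H)
```

Both outputs are consistent with the sample; only the first belongs to `H`. Whether a vote of this kind is optimal for a given class is a separate question that this sketch does not address.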

Proposed Method

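Introduces a new notion of dimension, $\dim(\mathcal{H})$, defined as the size of the largest set that $\mathcal{H}$ shatters under a specific multiclass shattering condition.
Analyzes the one-inclusion multiclass learner through a new sequence $\mu_{\mathcal{H}}(m)$, which determines the best error rate achievable after $m$ samples.
Shows that the error rate of the one-inclusion learner is $\Theta\left(\frac{\mu_{\mathcal{H}}(m)}{m}\right)$, so that in the transductive setting it is within a factor of two of optimal.
Relates the transductive and PAC models through a reduction, extending the optimality guarantee to the PAC model with only logarithmic-factor gaps in $1/\epsilon$ and $1/\delta$.
States a conjecture linking $\mu_{\mathcal{H}}(m)$ to $\dim(\mathcal{H})$: for $m \geq \dim(\mathcal{H})$, $\mu_{\mathcal{H}}(m) = \Theta(\dim(\mathcal{H}))$, which would give a concise characterization of the sample complexity.
Constructs computationally efficient optimal learners for generalized linear classifiers using the new dimension, and proves that their sample complexity is better than that of ERM, settling in the negative an open question due to Collins (2005).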
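
For readers unfamiliar with the one-inclusion construction, the following is a minimal sketch of the hypergraph it is built on, assuming a finite set of hypotheses already restricted to $m$ points and written as label tuples. The function name and the toy input are hypothetical. It covers only the hyperedge construction: two restricted hypotheses share a hyperedge in direction `i` when they agree everywhere except possibly at point `i`. The orientation step the learner uses to choose a label, and the paper's definition of $\mu_{\mathcal{H}}(m)$, are not reproduced here.

```python
from collections import defaultdict


def one_inclusion_hyperedges(restrictions):
    """Build the hyperedges of a one-inclusion hypergraph.

    restrictions: iterable of equal-length label tuples, one per hypothesis
    restricted to the same m points.
    Returns a dict mapping (direction, labels_on_other_points) to the
    frozenset of vertices on that hyperedge. Hyperedges with a single
    vertex are dropped here; conventions on this vary.
    """
    vertices = set(restrictions)
    edges = defaultdict(set)
    for v in vertices:
        for i in range(len(v)):
            # Vertices that agree off coordinate i share this key.
            key = (i, v[:i] + v[i + 1:])
            edges[key].add(v)
    return {k: frozenset(vs) for k, vs in edges.items() if len(vs) >= 2}


# Hypothetical restrictions of a class to two points, with labels in {0, 1, 2}.
restrictions = [(0, 0), (0, 1), (0, 2), (1, 0)]
edges = one_inclusion_hyperedges(restrictions)
for key, members in sorted(edges.items()):
    print(key, sorted(members))
```

In the transductive view, the labels of all points but one are known, so the hypotheses still consistent with them are exactly the vertices of a single hyperedge in the direction of the unlabeled point, and the learner has to commit to one of them.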

Results

Research Questions

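RQ1: Is there a generic optimal learning algorithm for multiclass problems? If so, what are its properties?
RQ2: Can the sample complexity of multiclass learning be characterized by a single combinatorial dimension, analogous to the VC dimension in binary classification?
RQ3: Why does ERM fall short in the multiclass setting? Which structural property of the learning rule makes it insufficient?
RQ4: Can computationally efficient optimal learners be constructed for practical hypothesis classes such as generalized linear models?
RQ5: Does the new dimension $\dim(\mathcal{H})$ give a tighter characterization of the sample complexity than existing notions such as the Natarajan dimension or the graph dimension?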
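
RQ4 refers to generalized linear classifiers. As a reminder of what a hypothesis in that class looks like, here is a minimal sketch of the standard prediction rule $h_w(x) = \arg\max_y \langle w, \Psi(x, y) \rangle$. The block feature map `psi_block`, the weight vector, and the inputs are hypothetical choices for illustration. The sketch covers prediction only and does not implement the paper's learners.

```python
def psi_block(x, y, num_classes):
    """Hypothetical class-sensitive feature map: copy x into the block
    reserved for label y and leave the other blocks at zero."""
    d = len(x)
    features = [0.0] * (d * num_classes)
    features[y * d:(y + 1) * d] = x
    return features


def predict(w, x, num_classes, psi=psi_block):
    """Return the label y maximizing the inner product <w, psi(x, y)>.
    Ties are broken in favor of the smallest label."""
    scores = [
        sum(wi * fi for wi, fi in zip(w, psi(x, y, num_classes)))
        for y in range(num_classes)
    ]
    return max(range(num_classes), key=scores.__getitem__)


# Three classes over 2-dimensional inputs: w holds one 2-dimensional block per class.
w = [1.0, 0.0, 0.0, 1.0, -1.0, -1.0]
print(predict(w, [2.0, 3.0], 3))
print(predict(w, [3.0, 1.0], 3))
print(predict(w, [-1.0, -2.0], 3))
```

With this particular feature map the rule reduces to the familiar one-weight-vector-per-class linear classifier; other choices of $\Psi$ give other members of the family.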

Key Findings

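The paper proves that any optimal multiclass learning rule must be improper, meaning it must be able to output hypotheses outside the hypothesis class, which shows that ERM is inherently suboptimal.
In the transductive setting, the one-inclusion algorithm is within a factor of two of optimal, improving on earlier guarantees that carried a $\log(|\mathcal{Y}|)$ factor.
The new dimension $\dim(\mathcal{H})$ lies between the Natarajan dimension and the graph dimension: $\mathrm{Ndim}(\mathcal{H}) \leq \dim(\mathcal{H}) \leq \mathrm{Gdim}(\mathcal{H})$. It also lower-bounds the sample complexity, giving a bound at least as strong as the one based on the Natarajan dimension.
For generalized linear classifiers, the paper constructs computationally efficient optimal learners whose sample complexity is strictly better than that of ERM, settling an open question due to Collins (2005).
If the conjecture $\mu_{\mathcal{H}}(m) = \Theta(\dim(\mathcal{H}))$ holds, it yields a concise characterization of the sample complexity: $\epsilon_{\mathcal{H}}(m) = \Theta\left(\frac{\dim(\mathcal{H})}{m}\right)$ and $m_{\mathcal{H}}(\epsilon,\delta) = \Theta\left(\frac{\dim(\mathcal{H}) \log(1/\delta)}{\epsilon}\right)$.
The paper shows that the graph dimension cannot characterize the sample complexity of optimal learning, because it can be far larger than the true sample complexity.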
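
The gap between the Natarajan and graph dimensions can be checked by brute force on a small finite class. The sketch below uses the standard definitions of the two notions, which this review does not restate: a set is G-shattered if some labeling $f$ exists such that every subset $T$ is picked out by a hypothesis that agrees with $f$ exactly on $T$, and N-shattered if two labelings $f_1, f_2$ that differ everywhere exist such that every mix of them is realized. The toy class is hypothetical, and the new dimension $\dim(\mathcal{H})$ is not implemented because its shattering condition is not spelled out here.

```python
from itertools import combinations, product


def _patterns(hypotheses, subset):
    """Label patterns the class realizes on the given points."""
    return {tuple(h[x] for x in subset) for h in hypotheses}


def g_shattered(hypotheses, subset, labels):
    """Graph-dimension shattering: one reference labeling f, and all
    2^n agree/disagree patterns against f must be realized."""
    pats = _patterns(hypotheses, subset)
    n = len(subset)
    for f in product(labels, repeat=n):
        agreement = {tuple(p[i] == f[i] for i in range(n)) for p in pats}
        if len(agreement) == 2 ** n:
            return True
    return False


def n_shattered(hypotheses, subset, labels):
    """Natarajan shattering: two labelings f1, f2 that differ at every
    point, and every mix of them must be realized."""
    pats = _patterns(hypotheses, subset)
    n = len(subset)
    for f1 in product(labels, repeat=n):
        for f2 in product(labels, repeat=n):
            if any(a == b for a, b in zip(f1, f2)):
                continue
            if all(
                tuple(f1[i] if bit else f2[i] for i, bit in enumerate(mask)) in pats
                for mask in product((True, False), repeat=n)
            ):
                return True
    return False


def dimension(hypotheses, domain, labels, shattered):
    """Size of the largest shattered subset of the domain."""
    best = 0
    for size in range(1, len(domain) + 1):
        if any(shattered(hypotheses, s, labels) for s in combinations(domain, size)):
            best = size
        else:
            break  # subsets of shattered sets are shattered, so larger sizes fail too
    return best


# Hypothetical class over the domain {0, 1} with labels {0, 1, 2}.
H = [(0, 0), (0, 1), (1, 0), (2, 2)]
ndim = dimension(H, [0, 1], [0, 1, 2], n_shattered)
gdim = dimension(H, [0, 1], [0, 1, 2], g_shattered)
print(ndim, gdim)
```

On this class the two-point domain is G-shattered (take $f = (0, 0)$) but not N-shattered, since no $2 \times 2$ product of labels is contained in `H`. The search is exponential in the subset size and is meant only for toy examples.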

This review was generated by AI and reviewed by human editors.