QUICK REVIEW

[Paper Review] Fast Label Embeddings for Extremely Large Output Spaces

Paul Mineiro, Nikos Karampatziakis|arXiv (Cornell University)|Mar 1, 2015

Text and Document Classification Technologies | References: 5 | Cited by: 2

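One-Sentence Summary

This paper proposes Rembrandt, a fast randomized algorithm for learning low-dimensional label embeddings in extremely large output spaces. It uses randomized SVD techniques to efficiently approximate the top right singular vectors of Π_{X,L} Y, the least-squares predictions of the response matrix Y from the features X. The method reports state-of-the-art results on large-scale text classification datasets and runs exponentially faster than naive algorithms.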

ABSTRACT

Many modern multiclass and multilabel problems are characterized by increasingly large output spaces. For these problems, label embeddings have been shown to be a useful primitive that can improve computational and statistical efficiency. In this work we utilize a correspondence between rank constrained estimation and low dimensional label embeddings that uncovers a fast label embedding algorithm which works in both the multiclass and multilabel settings. The result is a randomized algorithm whose running time is exponentially faster than naive algorithms. We demonstrate our techniques on two large-scale public datasets, from the Large Scale Hierarchical Text Challenge and the Open Directory Project, where we obtain state of the art results.

1 Contributions

We provide a statistical motivation for label embedding by demonstrating that the optimal rank-constrained least squares estimator can be constructed from an optimal unconstrained estimator of an embedding of the labels. Thus, embedding can provide beneficial sample complexity reduction even if computational constraints are not binding.

We identify a natural object to define label similarity: the expected outer product of the conditional label probabilities. In particular, in conjunction with a low-rank constraint, this indicates two label embeddings are similar when their conditional probabilities are linearly dependent across the dataset. This unifies prior work utilizing the confusion matrix for multiclass [1] and the empirical label covariance for multilabel [5].

We apply techniques from randomized linear algebra [3] to develop an efficient and scalable algorithm for constructing the embeddings, essentially via a novel randomized algorithm. Intuitively, this technique implicitly decomposes the prediction matrix of a model which would be prohibitively expensive to form explicitly.

2 Proposed Algorithm

Our proposal is Rembrandt, described in Algorithm 1. We use the top right singular space of Π_{X,L} Y as a label embedding, or equivalently, the top principal components of Yᵀ Π_{X,L} Y. Using randomized techniques, we can

Algorithm 1 Rembrandt: Response EMBedding via RANDomized Techniques
 1: function REMBRANDT(k, X ∈ ℝ^(n×d), Y ∈ ℝ^(n×c))
 2:   (p, q) ← (20, 1)                  ▷ These hyperparameters rarely need adjustment.
 3:   Q ← randn(c, k + p)
 4:   for i ∈ {1, ..., q} do            ▷ Randomized range finder for Yᵀ Π_{X,L} Y
 5:     Z ← arg min ‖YQ − XZ‖_F
 6:     Q ← orthogonalize(YᵀXZ)
 7:   end for                           ▷ NB: total of (q + 1) data passes, including next line
 8:   F ← (YᵀXQ)ᵀ(YᵀXQ)                 ▷ F ∈ ℝ^((k+p)×(k+p)) is "small"
 9:   (V, Σ) ← eig(F, k)
10:   V ← QV                            ▷ V ∈ ℝ^(c×k) is the embedding
11:   return (V, Σ)
12: end function
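The equivalence stated before Algorithm 1, and the reason steps 5 and 6 act as a range finder for Yᵀ Π_{X,L} Y, can be spelled out in a few lines. This is a sketch under one assumption that the excerpt does not state: that Π_{X,L} denotes the orthogonal projection onto the column space of X. The symbol Z* (the least-squares minimizer) is introduced here only for illustration.

```latex
% Assumption: \Pi_{X,L} is the orthogonal projection onto the column space of X.
\begin{align*}
\Pi_{X,L} &= X (X^\top X)^{+} X^\top,
  \qquad \Pi_{X,L}^\top = \Pi_{X,L}, \qquad \Pi_{X,L}^2 = \Pi_{X,L}, \\
(\Pi_{X,L} Y)^\top (\Pi_{X,L} Y) &= Y^\top \Pi_{X,L} Y, \\
Z^\star = \arg\min_{Z} \lVert YQ - XZ \rVert_F
  \;&\Longrightarrow\; X Z^\star = \Pi_{X,L}\, Y Q, \\
Y^\top X Z^\star &= \bigl( Y^\top \Pi_{X,L} Y \bigr)\, Q .
\end{align*}
```

The second line is why the right singular vectors of Π_{X,L} Y coincide with the eigenvectors (principal components) of Yᵀ Π_{X,L} Y. The last line shows that one least-squares solve followed by one multiplication by YᵀX applies that c×c matrix to Q without ever forming it.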

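Research Motivation and Goals

Address the computational and statistical inefficiency of multiclass and multilabel learning in extremely large output spaces.
Give label embeddings a statistical justification by connecting rank-constrained estimation to optimal label embeddings.
Unify prior approaches based on the confusion matrix and on the label covariance through a shared notion of label similarity: the expected outer product of the conditional label probabilities.
Develop a scalable randomized algorithm that avoids explicitly forming the large prediction matrix.
Demonstrate state-of-the-art results on real-world large-scale datasets with minimal hyperparameter tuning.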

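Proposed Method

The method uses a randomized range finder to approximate the top right singular subspace of Π_{X,L} Y, the prediction matrix of a least-squares model that estimates the conditional label probabilities.
Randomized SVD techniques avoid forming the full prediction matrix explicitly, which is the source of the exponential speedup over naive algorithms.
The algorithm runs q randomized subspace iterations to estimate the dominant eigenspace of Yᵀ Π_{X,L} Y, orthogonalizing after each iteration for numerical stability.
It then builds a small (k+p)×(k+p) matrix F = (YᵀXQ)ᵀ(YᵀXQ) and takes its top k eigenvectors via eigendecomposition.
The final label embedding V ∈ ℝ^(c×k) is obtained by multiplying the basis Q by those eigenvectors (V ← QV).
The method needs only (q+1) passes over the data, which makes it efficient for large-scale learning.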
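The steps above can be sketched in NumPy as follows. This is a minimal dense illustration, not the authors' implementation. The function name `rembrandt`, the helper `apply_operator`, and the `seed` argument are hypothetical. One interpretive assumption is made: because the product XQ is not dimensionally defined when d ≠ c, the sketch reads the construction of F as using YᵀXZ with Z taken from one more least-squares solve, which is consistent with the algorithm's note that this step costs a data pass.

```python
import numpy as np


def rembrandt(k, X, Y, p=20, q=1, seed=0):
    """Illustrative sketch: returns (V, sigma) with V of shape (c, k)."""
    rng = np.random.default_rng(seed)
    c = Y.shape[1]
    width = min(c, k + p)  # cannot oversample beyond the number of labels
    Q = rng.standard_normal((c, width))

    def apply_operator(Q):
        # Fit the sketched labels Y @ Q from X by least squares, then map the
        # fitted values back to label space. Assumption: this is how the
        # algorithm applies Y^T Pi Y to Q without forming the n x c matrix Pi Y.
        Z, *_ = np.linalg.lstsq(X, Y @ Q, rcond=None)
        return Y.T @ (X @ Z)

    for _ in range(q):
        # Randomized range finder: one power-style step, then orthogonalize.
        Q, _ = np.linalg.qr(apply_operator(Q))

    B = apply_operator(Q)               # final data pass, shape (c, width)
    F = B.T @ B                         # small (width x width) matrix
    evals, evecs = np.linalg.eigh(F)    # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:k]
    V = Q @ evecs[:, order]             # lift eigenvectors back to label space
    return V, evals[order]
```

A call such as `V, sigma = rembrandt(3, X, Y)` returns an orthonormal c×3 embedding; labels can then be embedded as `Y @ V`. A practical version for large output spaces would presumably use sparse matrices and an iterative or streaming least-squares solver in place of the dense `np.linalg.lstsq` call.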

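Experimental Results

Research Questions

RQ1: Can a randomized algorithm deliver an exponential speedup in label embedding learning without sacrificing statistical accuracy?
RQ2: Is there a unified statistical interpretation of label similarity that covers both the multiclass and multilabel settings?
RQ3: Can low-rank label embeddings still reduce sample complexity even when computational constraints are not binding?
RQ4: How does the expected outer product of the conditional label probabilities serve as a natural measure of label similarity?
RQ5: To what extent can randomized linear algebra techniques implicitly decompose a large prediction matrix without forming it explicitly?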

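Key Findings

The proposed Rembrandt algorithm obtains state-of-the-art results on the Large Scale Hierarchical Text Challenge and Open Directory Project datasets.
By avoiding explicit formation of the large prediction matrix, the method runs exponentially faster than naive label embedding algorithms.
The optimal rank-constrained least squares estimator can be constructed from an optimal unconstrained estimator of an embedding of the labels, which gives label embedding a statistical justification beyond its computational benefits.
Label similarity is defined through linear dependence among conditional label probabilities, unifying prior approaches based on the confusion matrix and the label covariance.
The algorithm needs very little hyperparameter tuning: the authors note that the defaults (p, q) = (20, 1) rarely need adjustment.
Randomized SVD techniques make the embedding computation scalable, requiring only (q+1) passes over the data.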
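The finding that a rank-constrained least squares estimator can be built from an unconstrained estimator of embedded labels can be checked numerically on a small problem. The sketch below is an illustration under stated assumptions, not the paper's procedure: it computes the embedding with an exact SVD of the fitted values rather than the randomized algorithm, and the function names `ols`, `rank_constrained_via_embedding`, and `best_rank_k_fit` are hypothetical.

```python
import numpy as np


def ols(X, T):
    """Unconstrained least-squares weights for predicting targets T from X."""
    W, *_ = np.linalg.lstsq(X, T, rcond=None)
    return W


def rank_constrained_via_embedding(X, Y, k):
    """Build a rank-<=k estimator from a regression onto embedded labels."""
    # Embedding: top-k right singular vectors of the fitted values X @ W_ols.
    # Assumption: exact SVD stands in for the randomized algorithm here.
    _, _, Vt = np.linalg.svd(X @ ols(X, Y), full_matrices=False)
    V = Vt[:k].T                    # (c, k) label embedding
    W_embed = ols(X, Y @ V)         # unconstrained fit to the k embedded labels
    return W_embed @ V.T, V         # decode back to all c labels


def best_rank_k_fit(X, Y, k):
    """Best rank-k approximation of the fitted values (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X @ ols(X, Y), full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]
```

For a generic `X` and `Y`, `X @ rank_constrained_via_embedding(X, Y, k)[0]` agrees with `best_rank_k_fit(X, Y, k)` up to floating-point error. Since the residual of any linear predictor splits into a part orthogonal to the column space of X plus the distance to the fitted values, matching the truncated SVD of the fitted values is what makes the decoded estimator optimal among weight matrices of rank at most k.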

This review was generated by AI and reviewed by human editors.