QUICK REVIEW

[Paper Review] Efficient Algorithms for Large-scale Generalized Eigenvector Computation and Canonical Correlation Analysis

Rong Ge, Chi Jin|arXiv (Cornell University)|Apr 13, 2016

Sparse and Compressive Sensing Techniques|References: 26|Cited by: 40

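ONE-SENTENCE SUMMARY

This paper proposes a family of globally linearly convergent iterative algorithms for large-scale canonical correlation analysis (CCA) and the generalized eigenvector problem. It reduces CCA to the top-$ k $ generalized eigenvector problem and solves that problem through a framework that needs only black-box access to an approximate linear system solver. Instantiated with accelerated gradient descent, the method runs in $ O\big(\frac{zk\sqrt{\kappa}}{\rho}\log(1/\epsilon)\log(k\kappa/\rho)\big) $ time, where $ z $ is the number of nonzero entries, $ \kappa $ is the condition number, and $ \rho $ is the relative eigenvalue gap. According to the authors, it is the first globally linearly convergent algorithm for these problems whose running time is linear in the input size up to a logarithmic factor.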

ABSTRACT

This paper considers the problem of canonical-correlation analysis (CCA) (Hotelling, 1936) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics (Shi and Malik, 2000; Hardoon et al., 2004; Witten et al., 2009). We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved. We obtain our results by reducing CCA to the top-$k$ generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent we obtain a running time of $O\left(\frac{z k \sqrt{\kappa}}{\rho} \log(1/\epsilon) \log\left(k\kappa/\rho\right)\right)$ where $z$ is the total number of nonzero entries, $\kappa$ is the condition number and $\rho$ is the relative eigenvalue gap of the appropriate matrices. Our algorithm is linear in the input size and the number of components $k$ up to a $\log(k)$ factor. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets.

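RESEARCH MOTIVATION AND OBJECTIVES

Develop efficient, scalable algorithms for canonical correlation analysis (CCA) and the generalized eigenvector problem in large-scale settings.
Eliminate the need to form inverse covariance matrices (such as $ \mathbf{S}_{xx}^{-1/2} $), which is prohibitively expensive on large datasets.
Achieve global linear convergence with improved dependence on the condition number and the eigenvalue gap.
Provide a general framework that combines fast linear system solvers with iterative methods for generalized eigenvector computation.
Evaluate the algorithm empirically on small- and large-scale datasets, including MNIST and the URL Reputation data.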
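
The second objective can be made concrete with the standard CCA formulation. The display below is a sketch, and the block-matrix names $ \mathbf{A} $ and $ \mathbf{B} $ are notation assumed here for illustration. The classical route takes the top singular vectors of a whitened cross-covariance matrix $ \mathbf{T} $, which needs inverse square roots, whereas the generalized eigenvector form involves only the covariance blocks themselves:

$$
\mathbf{T} = \mathbf{S}_{xx}^{-1/2}\,\mathbf{S}_{xy}\,\mathbf{S}_{yy}^{-1/2}
\qquad\text{versus}\qquad
\underbrace{\begin{pmatrix} \mathbf{0} & \mathbf{S}_{xy} \\ \mathbf{S}_{xy}^{\top} & \mathbf{0} \end{pmatrix}}_{\mathbf{A}}
\begin{pmatrix} \mathbf{w}_x \\ \mathbf{w}_y \end{pmatrix}
= \lambda\,
\underbrace{\begin{pmatrix} \mathbf{S}_{xx} & \mathbf{0} \\ \mathbf{0} & \mathbf{S}_{yy} \end{pmatrix}}_{\mathbf{B}}
\begin{pmatrix} \mathbf{w}_x \\ \mathbf{w}_y \end{pmatrix}
$$

Substituting $ \mathbf{w}_x = \mathbf{S}_{xx}^{-1/2}\mathbf{u} $ and $ \mathbf{w}_y = \mathbf{S}_{yy}^{-1/2}\mathbf{v} $ turns the right-hand problem into $ \mathbf{T}\mathbf{v} = \lambda\mathbf{u} $, $ \mathbf{T}^{\top}\mathbf{u} = \lambda\mathbf{v} $, so its nonzero eigenvalues are $ \pm\sigma_i $, where $ \sigma_i $ are the singular values of $ \mathbf{T} $ (the canonical correlations).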

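PROPOSED METHOD

Reduce CCA to the top-$ k $ generalized eigenvector problem through a transformation of the covariance matrices.
Use a general algorithmic framework that needs only black-box access to an approximate linear system solver.
Instantiate the framework with accelerated gradient descent to solve the linear systems efficiently.
Exploit sparsity and mini-batching in large-scale settings to keep the computation efficient.
Measure convergence by the principal angle $ \theta_{\mathbf{B}} $ between the iterate and the true canonical subspace, with $ \sin \theta_{\mathbf{B}} $ decreasing monotonically.
In practice, regularize ill-conditioned matrices by adding $ \lambda \mathbf{I} $ to $ \mathbf{S}_{xx} $ and $ \mathbf{S}_{yy} $.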
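
The steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names (`cca_matrices`, `agd_solve`, `b_orthonormalize`, `gen_eig_subspace`, `cca_sketch`) are hypothetical, the matrices are dense, and the extreme eigenvalues of $ \mathbf{B} $ are computed exactly, which a large-scale solver would not do. It builds $ \mathbf{A} = \begin{pmatrix} \mathbf{0} & \mathbf{S}_{xy} \\ \mathbf{S}_{xy}^{\top} & \mathbf{0} \end{pmatrix} $ and $ \mathbf{B} = \mathrm{diag}(\mathbf{S}_{xx}, \mathbf{S}_{yy}) $, then runs a power iteration in which each multiplication by $ \mathbf{B}^{-1} $ is replaced by a warm-started approximate solve. Because the eigenvalues of this pair come in $ \pm $ pairs, the sketch tracks $ 2k $ vectors and then mixes them with a random matrix to extract $ k $ directions per view; that extraction step is one simple choice assumed here.

```python
import numpy as np

def cca_matrices(X, Y, reg=1e-3):
    """Build A, B so that A w = lambda B w encodes CCA; reg*I regularizes Sxx, Syy."""
    n, dx = X.shape
    dy = Y.shape[1]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    Sxx = Xc.T @ Xc / n + reg * np.eye(dx)
    Syy = Yc.T @ Yc / n + reg * np.eye(dy)
    Sxy = Xc.T @ Yc / n
    A = np.block([[np.zeros((dx, dx)), Sxy], [Sxy.T, np.zeros((dy, dy))]])
    B = np.block([[Sxx, np.zeros((dx, dy))], [np.zeros((dy, dx)), Syy]])
    return A, B, Sxx, Syy, Sxy

def agd_solve(B, R, W0, mu, L, steps=100):
    """Approximately solve B W = R by Nesterov's accelerated gradient descent."""
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    W, V = W0.copy(), W0.copy()
    for _ in range(steps):
        W_next = V - (B @ V - R) / L        # gradient step on the quadratic objective
        V = W_next + beta * (W_next - W)    # momentum extrapolation
        W = W_next
    return W

def b_orthonormalize(W, B):
    """Return W C^{-T} with W^T B W = C C^T, so the result is B-orthonormal."""
    C = np.linalg.cholesky(W.T @ B @ W)
    return np.linalg.solve(C, W.T).T

def gen_eig_subspace(A, B, m, iters=300, seed=0):
    """Inexact power iteration for the m generalized eigenvectors of largest |lambda|."""
    evals = np.linalg.eigvalsh(B)           # illustration only: exact mu and L
    mu, L = evals[0], evals[-1]
    rng = np.random.default_rng(seed)
    W = b_orthonormalize(rng.standard_normal((A.shape[0], m)), B)
    for _ in range(iters):
        W0 = W @ (W.T @ A @ W)              # warm start; exact once W has converged
        W = b_orthonormalize(agd_solve(B, A @ W, W0, mu, L), B)
    return W

def cca_sketch(X, Y, k, reg=1e-3, seed=0):
    """Top-k canonical directions for both views plus their correlations."""
    A, B, Sxx, Syy, Sxy = cca_matrices(X, Y, reg)
    dx = X.shape[1]
    W = gen_eig_subspace(A, B, 2 * k, seed=seed)   # eigenvalues come in +/- pairs
    U = np.random.default_rng(seed + 1).standard_normal((2 * k, k))
    Wx = b_orthonormalize(W[:dx] @ U, Sxx)         # assumed extraction step
    Wy = b_orthonormalize(W[dx:] @ U, Syy)
    corr = np.linalg.svd(Wx.T @ Sxy @ Wy, compute_uv=False)
    return Wx, Wy, corr
```

Note that no inverse or inverse square root of a covariance matrix is ever formed: the only access to $ \mathbf{B}^{-1} $ is through `agd_solve`, which could be swapped for any other approximate solver.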

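EXPERIMENTAL RESULTS

Research Questions

RQ1: Can we design a provably globally linearly convergent algorithm for CCA and the generalized eigenvector problem that avoids explicit matrix inversion?
RQ2: For large-scale problems, how does the best achievable running time depend on the number of components $ k $, the condition number $ \kappa $, and the eigenvalue gap $ \rho $?
RQ3: Can the running time be made near-linear in the number of nonzero entries $ z $ and in $ k $ while preserving linear convergence?
RQ4: How do the convergence speed and accuracy of the method compare with existing one-shot and iterative methods on large-scale datasets?
RQ5: Is the algorithm effective in practice on sparse, high-dimensional data such as the URL Reputation and Penn Tree Bank datasets?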

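Key Findings

The proposed algorithm runs in $ O\big(\frac{zk\sqrt{\kappa}}{\rho}\log(1/\epsilon)\log(k\kappa/\rho)\big) $ time, which is near-linear in $ z $ and $ k $ and a significant improvement over traditional SVD-based methods.
The algorithm exhibits global linear convergence: $ \sin \theta_{\mathbf{B}} $ decreases at a linear (geometric) rate in the number of iterations, as confirmed empirically on the MNIST and PTB datasets.
On MNIST, the algorithm converges monotonically to the true canonical subspace: the Pearson correlation coefficient (PCC) approaches 1 and all of the angles $ \theta_x, \theta_y, \theta_{\mathbf{B}} $ approach zero.
On the large-scale URL Reputation dataset, CCALin reaches the same TCC accuracy at lower computational cost than S-AppGrad, PCA-CCA, NW-CCA, and DW-CCA.
Even when $ \theta_x $ and $ \theta_y $ initially lag behind $ \theta_{\mathbf{B}} $, the method still converges linearly, and they eventually converge at least as fast as $ \sin \theta_{\mathbf{B}} $.
The empirical results confirm that the algorithm is practical for large-scale problems, particularly when $ k \ll n $ and the condition number and eigenvalue gap are moderate.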
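
The convergence measure used in these findings can be computed as follows. This is a minimal sketch that assumes $ \theta_{\mathbf{B}} $ is the largest principal angle between two subspaces under the inner product $ \langle \mathbf{u}, \mathbf{v} \rangle_{\mathbf{B}} = \mathbf{u}^{\top}\mathbf{B}\mathbf{v} $; the function name `sin_theta_B` is hypothetical, and the paper's exact definition may differ in detail. It maps both bases into Euclidean coordinates with a Cholesky factor of $ \mathbf{B} $ and measures how much of one subspace falls outside the other.

```python
import numpy as np

def sin_theta_B(W, V, B):
    """Sine of the largest principal angle between span(W) and span(V) in the B-norm.

    W and V are (d, k) arrays with full column rank; B is symmetric positive definite.
    """
    Lc = np.linalg.cholesky(B)            # B = Lc Lc^T, so ||w||_B = ||Lc^T w||_2
    Qw, _ = np.linalg.qr(Lc.T @ W)        # orthonormal basis of the mapped span(W)
    Qv, _ = np.linalg.qr(Lc.T @ V)        # orthonormal basis of the mapped span(V)
    resid = Qw - Qv @ (Qv.T @ Qw)         # part of span(W) outside span(V)
    return np.linalg.norm(resid, 2)       # largest singular value of the residual
```

With $ \mathbf{B} = \mathbf{I} $ this reduces to the usual Euclidean principal angle; two bases of the same subspace give a value near zero regardless of how their columns are scaled or mixed.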


This review was generated by AI and reviewed by human editors.