QUICK REVIEW

[Paper Review] Fast determinantal point processes via distortion-free intermediate sampling

Michał Dereziński|arXiv (Cornell University)|Nov 8, 2018

Random Matrices and Applications | Cited by 23

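One-sentence summary

This paper proposes a new algorithm for exact sampling from determinantal point processes (DPPs), with input-sparsity preprocessing time O(nnz(X) log n + poly(d)) and sampling time poly(d), independent of n. By using a distortion-free regularized DPP (R-DPP) as an intermediate distribution, made possible by a Poisson-based mechanism for controlling subset size, the work is the first to achieve exact DPP sampling with both input-sparsity preprocessing and a per-sample cost that does not depend on n, a significant improvement over earlier methods.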

ABSTRACT

Given a fixed $n\times d$ matrix $\mathbf{X}$, where $n\gg d$, we study the complexity of sampling from a distribution over all subsets of rows where the probability of a subset is proportional to the squared volume of the parallelepiped spanned by the rows (a.k.a. a determinantal point process). In this task, it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time $O(\text{number-of-non-zeros}(\mathbf{X})\cdot\log n)+\text{poly}(d)$, and (2) a sampling step which runs in $\text{poly}(d)$ time, independent of the number of rows $n$. We achieve this by introducing a new regularized determinantal point process (R-DPP), which serves as an intermediate distribution in the sampling procedure by reducing the number of rows from $n$ to $\text{poly}(d)$. Crucially, this intermediate distribution does not distort the probabilities of the target sample. Our key novelty in defining the R-DPP is the use of a Poisson random variable for controlling the probabilities of different subset sizes, leading to new determinantal formulas such as the normalization constant for this distribution. Our algorithm has applications in many diverse areas where determinantal point processes have been used, such as machine learning, stochastic optimization, data summarization and low-rank matrix reconstruction.
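
To make the target distribution concrete: for a subset $S$ of rows, the abstract's "squared volume" is $\det(\mathbf{X}_S\mathbf{X}_S^\top)$, and summing it over all subsets gives the standard normalization $\det(\mathbf{I}+\mathbf{X}^\top\mathbf{X})$. The sketch below enumerates every subset by brute force, so it is only feasible for tiny $n$ and is not the paper's algorithm. The function name `dpp_subset_probabilities` and the random test matrix are illustrative assumptions.

```python
import itertools
import numpy as np

def dpp_subset_probabilities(X):
    """Brute-force Pr(S) proportional to det(X_S X_S^T) over subsets S of rows.

    Subsets with more than d rows have zero volume (rank of X_S is at most d),
    so only sizes 0..min(n, d) are enumerated.
    """
    n, d = X.shape
    weights = {}
    for k in range(min(n, d) + 1):
        for S in itertools.combinations(range(n), k):
            if k == 0:
                weights[S] = 1.0  # the empty set has weight det of a 0x0 matrix = 1
            else:
                rows = X[list(S), :]
                weights[S] = float(np.linalg.det(rows @ rows.T))
    total = sum(weights.values())
    probs = {S: w / total for S, w in weights.items()}
    return probs, total

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))  # n = 6 rows, d = 2 columns
probs, total = dpp_subset_probabilities(X)

# Closed-form normalization constant: det(I + X^T X), a d x d determinant.
closed_form = float(np.linalg.det(np.eye(2) + X.T @ X))
print(total, closed_form)
```

The enumeration costs time exponential in $d$ and polynomial in $n$ of high degree, which is exactly the kind of dependence on $n$ that a fast sampler has to avoid.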

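Motivation and objectives

Address the high computational cost of preprocessing and sampling for determinantal point processes (DPPs), particularly when n is large.
Overcome the limitations of earlier DPP algorithms, which require either Ω(nd²) preprocessing time or Ω(n|S|) sampling time.
Develop a method for exact DPP sampling whose sampling time is independent of n while keeping the preprocessing cost low.
Introduce the regularized DPP (R-DPP) as an intermediate distribution that preserves the target DPP's probabilities without distortion.
Provide efficient DPP sampling for applications such as data summarization, low-rank matrix reconstruction, and stochastic optimization.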

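Proposed method

Introduces the regularized DPP (R-DPP) as an intermediate distribution that reduces the number of rows from n to poly(d) without distorting the target probabilities of the original DPP.
Uses a Poisson random variable to control the subset size in the R-DPP, which yields new determinantal formulas, including a closed-form normalization constant for the R-DPP.
Designs a two-stage sampling procedure: first sample a set of poly(d) rows from the R-DPP, then downsample the result according to the target DPP distribution.
Keeps the intermediate sampling step distortion-free by preserving the relative probabilities of all subsets under the original DPP.
Achieves input-sparsity preprocessing time O(nnz(X) log n + poly(d)) by exploiting sparse matrix operations and low-rank structure.
Guarantees poly(d) sampling time by exploiting the low-dimensional structure of the intermediate R-DPP and efficient determinant computations.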
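
One way to see why a Poisson-distributed size leads to a closed-form normalization constant is the identity $\mathbb{E}\big[\det\big(\mathbf{A} + \tfrac{1}{r}\sum_{i=1}^{K}\tfrac{1}{p_{\sigma_i}}\mathbf{x}_{\sigma_i}\mathbf{x}_{\sigma_i}^\top\big)\big] = \det(\mathbf{A} + \mathbf{X}^\top\mathbf{X})$ for $K\sim\mathrm{Poisson}(r)$ and rows $\sigma_i$ drawn i.i.d. from $p$. The symbols $\mathbf{A}$, $r$ and $p$ are notation assumed here for illustration, since this review does not spell out the R-DPP definition. The sketch below checks the identity numerically on a tiny instance by summing over per-row counts, truncated at `max_count`; the helper name `expected_det_poisson` is hypothetical.

```python
import itertools
import math
import numpy as np

def expected_det_poisson(X, A, p, r, max_count=25):
    """E[det(A + sum_j N_j / (r p_j) * x_j x_j^T)] with N_j ~ Poisson(r p_j).

    Drawing K ~ Poisson(r) and then K i.i.d. rows from p is equivalent to
    drawing independent per-row counts N_j ~ Poisson(r p_j). The infinite sum
    over counts is truncated at max_count, which is accurate for small rates.
    """
    n, d = X.shape
    counts = range(max_count + 1)
    # pmf[j][c] = Pr(N_j = c)
    pmf = [
        [math.exp(-r * p[j]) * (r * p[j]) ** c / math.factorial(c) for c in counts]
        for j in range(n)
    ]
    total = 0.0
    for N in itertools.product(counts, repeat=n):
        prob = math.prod(pmf[j][N[j]] for j in range(n))
        w = np.array(N) / (r * p)          # importance weight of each row
        M = A + (X * w[:, None]).T @ X     # A + sum_j w_j x_j x_j^T
        total += prob * np.linalg.det(M)
    return total

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))            # tiny instance: n = 3, d = 2
A = np.eye(2)
p = np.array([0.5, 0.3, 0.2])              # arbitrary sampling distribution
r = 2.0                                    # expected sample size

lhs = expected_det_poisson(X, A, p, r)
rhs = float(np.linalg.det(A + X.T @ X))
print(lhs, rhs)
```

The identity holds because the determinant is affine in each row weight (each $\mathbf{x}_j\mathbf{x}_j^\top$ has rank one) and the Poisson counts are independent with mean-one weights. A fixed sample size would make the counts dependent and, in general, break the equality.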

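Results

Research questions

RQ1: Can a DPP sampling algorithm be designed whose preprocessing runs in input-sparsity time, specifically O(nnz(X) log n + poly(d))?
RQ2: Is exact DPP sampling possible with a sampling time that is independent of n, namely poly(d)?
RQ3: Can an intermediate distribution be constructed that reduces the number of rows to poly(d) while still preserving the probabilities of the target DPP?
RQ4: How can the normalization constant be derived for a regularized DPP whose size is controlled by a Poisson distribution?
RQ5: What are the theoretical and practical implications of distortion-free intermediate sampling in DPP algorithms?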

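Key findings

The proposed algorithm achieves input-sparsity preprocessing time, O(nnz(X) log n + poly(d)), a first for exact DPP sampling.
Sampling time is reduced to poly(d), independent of n, making this the first exact DPP algorithm with that property.
The intermediate R-DPP distribution is distortion-free: it exactly preserves the probabilities that the original DPP assigns to all subsets.
Controlling the subset size with a Poisson random variable makes new analytical formulas derivable, including a closed-form expression for the R-DPP normalization constant.
The method provides efficient DPP sampling for large-scale applications such as data summarization and low-rank matrix reconstruction, and is particularly suited to the regime n ≫ d.
The algorithm improves on the previous state of the art, which required either Ω(nd²) preprocessing time or Ω(n|S|) sampling time.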
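
The distortion-free property can be checked by hand under an assumed form of the R-DPP. The derivation below is a sketch, not the paper's proof: the rate $r$, the row distribution $p$, the sequence $\sigma$ of length $K$, and the exact form of the stage-1 probability are assumptions introduced here for illustration.

```latex
% Assumed notation: rows x_1,...,x_n of X; sampling distribution p over rows;
% rate r > 0; sequence sigma = (sigma_1,...,sigma_K) with K ~ Poisson(r);
% rescaled rows \tilde{x}_i = x_{sigma_i} / sqrt(r p_{sigma_i}).
\begin{align*}
\text{Stage 1 (assumed R-DPP):}\quad
  \Pr(\sigma) &= \frac{e^{-r} r^{K}}{K!}\prod_{i=1}^{K} p_{\sigma_i}\cdot
  \frac{\det\!\big(\mathbf{I} + \widetilde{\mathbf{X}}_\sigma^\top \widetilde{\mathbf{X}}_\sigma\big)}
       {\det(\mathbf{I} + \mathbf{X}^\top\mathbf{X})} \\
\text{Stage 2 (DPP on sampled rows):}\quad
  \Pr(S \mid \sigma) &= \frac{\det\!\big(\widetilde{\mathbf{X}}_{S}\widetilde{\mathbf{X}}_{S}^\top\big)}
       {\det\!\big(\mathbf{I} + \widetilde{\mathbf{X}}_\sigma^\top \widetilde{\mathbf{X}}_\sigma\big)},
  \qquad S \subseteq \{1,\dots,K\},\ |S| = k \\
\text{Joint:}\quad
  \Pr(\sigma, S) &= \frac{e^{-r} r^{K-k}}{K!}\prod_{i\notin S} p_{\sigma_i}\cdot
  \frac{\det(\mathbf{X}_{\sigma_S}\mathbf{X}_{\sigma_S}^\top)}{\det(\mathbf{I}+\mathbf{X}^\top\mathbf{X})} \\
\text{Marginal, } |T| = k:\quad
  \Pr(\sigma_S = T) &= \frac{\det(\mathbf{X}_T\mathbf{X}_T^\top)}{\det(\mathbf{I}+\mathbf{X}^\top\mathbf{X})}
  \sum_{K\ge k}\frac{e^{-r}r^{K-k}}{K!}\binom{K}{k}\,k!
  = \frac{\det(\mathbf{X}_T\mathbf{X}_T^\top)}{\det(\mathbf{I}+\mathbf{X}^\top\mathbf{X})}
\end{align*}
```

In the joint probability the intermediate determinant cancels, and rescaling the selected rows contributes the factor $\prod_{i\in S} 1/(r\,p_{\sigma_i})$. In the marginal, $\binom{K}{k}k!$ counts the positions and orderings in which the rows of $T$ can appear within $\sigma$, the unselected entries sum out to 1, and $\sum_{K\ge k} e^{-r} r^{K-k}/(K-k)! = 1$. Under these assumptions the output follows the target DPP exactly, for any choice of $r$ and $p$.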


This review was generated by AI and reviewed by human editors.