QUICK REVIEW

[Paper Digest] Construction of Sparse Suffix Trees and LCE Indexes in Optimal Time and Space

Dmitry Kosolobov, Nikita Sivukhin|arXiv (Cornell University)|May 8, 2021

Algorithms and Data Compression | References: 33 | Cited by: 3

One-Sentence Summary

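This paper presents a deterministic algorithm that, for a string of length n, constructs a τ-partitioning set of size O(b) with τ = n/b in O(n) time and O(b) space (for b ≥ n^ε with constant ε > 0), which yields sparse suffix trees and LCE indexes in optimal O(b) space, with the LCE index answering queries in O(n/b) time. The method relies on compacted tries over bit representations of substrings, together with precomputed lookup tables, to reach linear construction time with asymptotically optimal space.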

ABSTRACT

The notions of synchronizing and partitioning sets are recently introduced variants of locally consistent parsings with great potential in problem-solving. In this paper we propose a deterministic algorithm that constructs for a given readonly string of length $n$ over the alphabet $\{0,1,\ldots,n^{\mathcal{O}(1)}\}$ a variant of $\tau$-partitioning set with size $\mathcal{O}(b)$ and $\tau = \frac{n}{b}$ using $\mathcal{O}(b)$ space and $\mathcal{O}(\frac{1}{\varepsilon}n)$ time provided $b \ge n^{\varepsilon}$, for $\varepsilon > 0$. As a corollary, for $b \ge n^{\varepsilon}$ and constant $\varepsilon > 0$, we obtain linear construction algorithms with $\mathcal{O}(b)$ space on top of the string for two major small-space indexes: a sparse suffix tree, which is a compacted trie built on $b$ chosen suffixes of the string, and a longest common extension (LCE) index, which occupies $\mathcal{O}(b)$ space and allows us to compute the longest common prefix for any pair of substrings in $\mathcal{O}(n/b)$ time. For both, the $\mathcal{O}(b)$ construction storage is asymptotically optimal since the tree itself takes $\mathcal{O}(b)$ space and any LCE index with $\mathcal{O}(n/b)$ query time must occupy at least $\mathcal{O}(b)$ space by a known trade-off (at least for $b \ge \Omega(n / \log n)$). In case of arbitrary $b \ge \Omega(\log^2 n)$, we present construction algorithms for the partitioning set, sparse suffix tree, and LCE index with $\mathcal{O}(n\log_b n)$ running time and $\mathcal{O}(b)$ space, thus also improving the state of the art.
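
To make the two objects named in the abstract concrete, here is a brute-force reference sketch in Python. It is not the paper's algorithm: the function names `lce` and `sparse_suffix_array` are hypothetical, the sort copies suffix slices, and neither the time nor the space bound from the abstract is met. It only shows what an LCE query returns and what information (sorted order of the chosen suffixes plus neighbouring LCP values) a sparse suffix tree organizes.

```python
from typing import List, Tuple


def lce(s: str, i: int, j: int) -> int:
    """Length of the longest common prefix of s[i:] and s[j:] (naive scan)."""
    n = len(s)
    k = 0
    while i + k < n and j + k < n and s[i + k] == s[j + k]:
        k += 1
    return k


def sparse_suffix_array(s: str, positions: List[int]) -> Tuple[List[int], List[int]]:
    """Sort the chosen suffixes and compute the LCP of each neighbouring pair.

    Brute force: sorting by suffix slices copies up to O(n) characters per
    suffix, so this is only a reference for checking small inputs.
    """
    order = sorted(positions, key=lambda p: s[p:])
    lcps = [lce(s, a, b) for a, b in zip(order, order[1:])]
    return order, lcps


if __name__ == "__main__":
    text = "abracadabra"
    print(lce(text, 0, 7))
    print(sparse_suffix_array(text, [0, 3, 7]))
```

A reference like this is mainly useful for testing a small-space index against ground truth on short strings.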

Motivation and Goals

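Develop a deterministic algorithm that constructs sparse suffix trees and LCE indexes in optimal time and space.
Address the challenge of building these indexes using only O(b) extra space on top of the read-only string.
Improve on previous deterministic constructions, which require O(n log(n/b)) time or worse.
Achieve optimal O(b) space for both data structures, with O(n/b) query time for the LCE index, matching known lower bounds.
Extend the applicability of locally consistent parsings to deterministic, small-space string indexing.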

Proposed Method

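Use a τ-partitioning set (τ = n/b) to enable efficient indexing in O(b) space.
Build compacted tries over the bit representations of substrings to support fast LCE queries.
Use fusion trees and bit manipulation to build a compacted trie for k substrings in O(k) time.
Precompute a lookup table D over all possible bit blocks (encoding substrings, bit arrays, and compacted tries) so that each character is computed in O(1) time.
Maintain a dynamic bit array with shift operations to mark which positions inside a sliding window of size τ belong to S′.
Use a precomputed table F to initialize the array M_i for each position in O(1) time, enabling efficient range counting.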
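
The last three items combine two generic word-RAM ideas: a bit array updated by shifts as a window slides, and a table precomputed over all small bit blocks so that each block is handled by one lookup. The sketch below illustrates only those two ideas. It is not the paper's tables D or F, whose contents are richer; `WindowBits`, `POPCOUNT`, and `BLOCK_BITS` are hypothetical names, and a Python integer stands in for machine words.

```python
BLOCK_BITS = 8  # hypothetical block width, small enough to tabulate every block

# One entry per possible block value: the number of set bits in that block.
POPCOUNT = [bin(x).count("1") for x in range(1 << BLOCK_BITS)]


class WindowBits:
    """Bit array over the last `tau` positions, stored in a single integer."""

    def __init__(self, tau: int) -> None:
        self.tau = tau
        self.mask = (1 << tau) - 1  # keeps only the tau newest positions
        self.bits = 0

    def push(self, selected: bool) -> None:
        """Slide the window by one position; bit 0 is the newest position."""
        self.bits = ((self.bits << 1) | int(selected)) & self.mask

    def count(self) -> int:
        """Count marked positions in the window, one table lookup per block."""
        total, x = 0, self.bits
        while x:
            total += POPCOUNT[x & ((1 << BLOCK_BITS) - 1)]
            x >>= BLOCK_BITS
        return total


if __name__ == "__main__":
    w = WindowBits(tau=4)
    for flag in (True, False, True, True, False):
        w.push(flag)
    print(bin(w.bits), w.count())
```

The design point is that the cost per step depends on the number of blocks in the window, not on the number of bits, which is the effect a precomputed table over bit blocks is meant to achieve.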

Experimental Results

Research Questions

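RQ1: Is there a deterministic algorithm that constructs a sparse suffix tree in O(n) time and O(b) space for any b ≥ n^ε with constant ε > 0?
RQ2: Can an LCE index achieve O(n/b) query time using only O(b) space and a deterministic construction?
RQ3: How can locally consistent parsings be used to design optimal small-space string indexes?
RQ4: What are the minimum space and time costs of constructing a partitioning set of size O(b)?
RQ5: Can bit-level representations and compacted tries speed up LCE computation without storing full substrings?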

Main Findings

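The algorithm constructs a τ-partitioning set of size O(b), with τ = n/b, in O(n/ε) time and O(b) space for any b ≥ n^ε with ε > 0.
Sparse suffix trees and LCE indexes are constructed in O(n log_b n) time and O(b) space for any b ≥ Ω(log² n), improving on previous deterministic bounds.
For b ≥ n^ε with constant ε > 0, construction takes linear O(n) time, and the resulting LCE index answers queries in O(n/b) time.
The O(b) space usage is asymptotically optimal for both data structures, in line with the known time-space trade-off.
Precomputed lookup tables and bit-level compacted tries give O(1) processing time per position, and hence linear total time.
The method achieves optimal performance without randomization, closing a gap in deterministic small-space indexing.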
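
A short calculation shows that the two running-time bounds above are consistent with each other when $b \ge n^{\varepsilon}$. This is only a sanity check on the stated bounds, not a description of how the paper derives them, and the concrete numbers in the second display are illustrative values chosen here.

$$
\log_b n = \frac{\log n}{\log b} \le \frac{\log n}{\varepsilon \log n} = \frac{1}{\varepsilon}
\quad\Longrightarrow\quad
n \log_b n \le \frac{n}{\varepsilon}.
$$

$$
n = 2^{30},\quad b = 2^{20} = n^{2/3}:\qquad
\tau = \frac{n}{b} = 2^{10} = 1024,\qquad
\log_b n = \frac{30}{20} = 1.5 = \frac{1}{\varepsilon}\ \text{ with } \varepsilon = \tfrac{2}{3}.
$$

In this illustrative setting the index occupies on the order of $2^{20}$ words on top of the string, and an LCE query costs on the order of $\tau = 1024$ steps.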

This digest was generated by AI and reviewed by human editors.