QUICK REVIEW

[Paper Review] Linear Time Construction of Cover Suffix Tree and Applications

Jakub Radoszewski|arXiv (Cornell University)|Jan 1, 2023

Algorithms and Data Compression | Cited by 2

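One-Sentence Summary

This paper presents the first linear-time construction algorithm for the Cover Suffix Tree (CST) of a string over an integer alphabet, based on a new combinatorial characterization that relates consecutive overlapping occurrences of a substring to the runs in the string. The optimal O(n)-time CST construction in turn yields linear-time algorithms for quasiperiodicity problems, such as computing all seeds and shortest α-partial covers, as well as an O(n)-space index that reports the overlapping consecutive occurrences of a length-m pattern in O(m + output) time.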

ABSTRACT

The Cover Suffix Tree (CST) of a string $T$ is the suffix tree of $T$ with additional explicit nodes corresponding to halves of square substrings of $T$. In the CST an explicit node corresponding to a substring $C$ of $T$ is annotated with two numbers: the number of non-overlapping consecutive occurrences of $C$ and the total number of positions in $T$ that are covered by occurrences of $C$ in $T$. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-$n$ string in $O(n \log n)$ time. We show how to compute the CST in $O(n)$ time assuming that $T$ is over an integer alphabet. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that knowing the CST of a length-$n$ string $T$, one can compute a linear-sized representation of all seeds of $T$ as well as all shortest $α$-partial covers and seeds in $T$ for a given $α$ in $O(n)$ time. Thus our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020). Kociumaka et al. (Algorithmica, 2015) proposed an $O(n \log n)$-time algorithm for computing a shortest $α$-partial cover for each $α=1,\ldots,n$; we improve this complexity to $O(n)$. Our results are based on a new characterization of consecutive overlapping occurrences of a substring $S$ of $T$ in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in $T$. This new insight also leads to an $O(n)$-sized index for reporting overlapping consecutive occurrences of a given pattern $P$ of length $m$ in $O(m+output)$ time, where $output$ is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses $O(n \log n)$ space.
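
To make the two node annotations concrete, the following brute-force Python sketch computes, for a given substring C of T, the number of positions covered by occurrences of C, and splits the pairs of consecutive occurrences into overlapping and non-overlapping ones. It is an illustration only, not the paper's algorithm: the function names are hypothetical, and treating two consecutive occurrences at distance d as overlapping exactly when d < |C| is an assumption about the intended definition.

```python
def occurrences(text, pattern):
    """Return all 0-based starting positions of pattern in text (overlaps allowed)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def covered_positions(text, pattern):
    """Count the positions of text that lie inside at least one occurrence of pattern."""
    covered = set()
    for i in occurrences(text, pattern):
        covered.update(range(i, i + len(pattern)))
    return len(covered)


def consecutive_pairs(text, pattern):
    """Return (overlapping, non_overlapping) counts of consecutive occurrence pairs.

    Assumption: a pair of consecutive occurrences at distance d overlaps
    exactly when d < len(pattern).
    """
    occ = occurrences(text, pattern)
    m = len(pattern)
    overlapping = sum(1 for a, b in zip(occ, occ[1:]) if b - a < m)
    non_overlapping = max(len(occ) - 1, 0) - overlapping
    return overlapping, non_overlapping
```

For T = "aabaabaa" and C = "aabaa", the two occurrences start at positions 0 and 3, overlap, and together cover all 8 positions. The naive approach costs O(n) or more per substring, which is the cost the CST annotations avoid by storing these values at explicit nodes.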

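Research Motivation and Objectives

Design a linear-time construction algorithm for the Cover Suffix Tree (CST) of a string over an integer alphabet.
Improve on the O(n log n)-time CST construction of Kociumaka et al. (Algorithmica, 2015) to reach the optimal O(n) time.
Compute quasiperiodicity notions, such as all seeds and shortest α-partial covers, in linear time.
Build an O(n)-space index that reports all overlapping consecutive occurrences of a pattern in O(m + output) time; by comparison, the general index of Navarro and Thankachan (Theor. Comput. Sci., 2016) for bounded-gap consecutive occurrences uses O(n log n) space.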

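Proposed Method

Introduce a new combinatorial characterization that relates the overlapping consecutive occurrences of a substring to the runs (maximal repetitions) in the string.
Show that the set of substrings whose overlapping consecutive occurrences are implied by a single run has a "triangular" structure.
Use two bottom-up traversals, one over the suffix links of the suffix tree and one over the CST, to compute the cv(v) and nov(v) values of the explicit nodes.
Build the O(n)-space index for reporting overlapping consecutive occurrences using weighted ancestor queries and range minimum queries (RMQ).
Rely on perfect hashing and bucket sort to keep the construction linear-time over an integer alphabet.
Preprocess the arrays MB and ML for RMQ so that range queries take O(1) time while pattern occurrences are reported.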
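
Since the method is built on runs, a small brute-force sketch may help fix the notion. A run is taken here to be a maximal fragment of the text whose smallest period p satisfies 2p ≤ length, and which cannot be extended to the left or right without breaking that period. The Python code below enumerates runs in roughly cubic time; it is a hypothetical illustration and unrelated to the linear-time run computations the paper relies on. Fragments are reported as half-open intervals (start, end, period), a convention chosen for this sketch.

```python
def smallest_period(s):
    """Smallest period of a non-empty string s, via the KMP failure function."""
    n = len(s)
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return n - fail[-1]


def runs(text):
    """Return all runs of text as sorted (start, end, period) triples, end exclusive."""
    n = len(text)
    result = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if text[i] != text[i + p]:
                i += 1
                continue
            start = i
            # Extend while the period-p condition text[k] == text[k + p] holds.
            while i + p < n and text[i] == text[i + p]:
                i += 1
            end = i + p  # text[start:end] is a maximal fragment with period p
            # Keep it only if it contains two full periods and p is its smallest period.
            if end - start >= 2 * p and smallest_period(text[start:end]) == p:
                result.append((start, end, p))
    return sorted(result)
```

The link to overlapping occurrences is elementary: if a substring S occurs at positions i and i + d with d < |S|, then the fragment spanning both occurrences has period d and length |S| + d > 2d, so it is a repetition in the text. In "aabaabaa", for example, the occurrences of "aabaa" at positions 0 and 3 lie inside the run with period 3 that spans the whole string.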

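Results

Research Questions

RQ1: Can the Cover Suffix Tree of a string over an integer alphabet be constructed in O(n) time?
RQ2: Does the new run-based characterization of overlapping consecutive occurrences lead to faster computation of quasiperiodicity notions?
RQ3: Can an O(n)-sized index report all overlapping consecutive occurrences of a pattern in O(m + output) time?
RQ4: Can all seeds and shortest α-partial covers be computed in O(n) time using the CST?
RQ5: Can the proposed approach be extended to improve the construction time of the Minimal Augmented Suffix Tree (MAST)?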

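Key Findings

The Cover Suffix Tree of a length-n string over an integer alphabet can be constructed in O(n) time, which is optimal.
Given the CST, a linear-sized representation of all seeds of the string is computed in O(n) time; the resulting algorithm is substantially different from the previous one of Kociumaka et al. (SODA 2012; ACM Trans. Algorithms, 2020).
A shortest α-partial cover for each α = 1, ..., n is computed in O(n) time, improving on the previous O(n log n) bound of Kociumaka et al. (Algorithmica, 2015).
An O(n)-space index reports all overlapping consecutive occurrences of a length-m pattern in O(m + output) time; for a constant-sized alphabet, the index can also be constructed in O(n) time.
The key insight is that the overlapping consecutive occurrences of substrings are fully characterized by the runs in the string, which gives rise to a triangular structure that can be computed efficiently.
The cv(v) and nov(v) values are computed efficiently through two bottom-up traversals, one on the suffix tree and one on the CST, combined with RMQ and weighted ancestor queries.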
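
As a point of reference for the α-partial cover result, the sketch below solves the problem by brute force. It assumes the usual definition in which a substring C is an α-partial cover of T when the occurrences of C cover at least α positions of T; the function names are hypothetical. The code tries candidates in order of increasing length and is meant only to pin down what is being computed, not how the O(n)-time algorithm computes it.

```python
def covered_count(text, pattern):
    """Number of positions of text covered by occurrences of pattern."""
    m = len(pattern)
    covered = set()
    for i in range(len(text) - m + 1):
        if text[i:i + m] == pattern:
            covered.update(range(i, i + m))
    return len(covered)


def shortest_partial_cover(text, alpha):
    """Return a shortest substring whose occurrences cover >= alpha positions.

    Among candidates of the same length, the leftmost one in text is returned.
    Returns None if alpha exceeds len(text).
    """
    n = len(text)
    for length in range(1, n + 1):
        seen = set()
        for i in range(n - length + 1):
            candidate = text[i:i + length]
            if candidate in seen:
                continue  # this substring was already evaluated
            seen.add(candidate)
            if covered_count(text, candidate) >= alpha:
                return candidate
    return None
```

On T = "aabaabaa", the single letter "a" already covers 6 positions, "aaba" is a shortest substring covering 7, and "aabaa" is a shortest substring covering all 8. The brute force repeats a full scan per candidate, whereas the CST stores the covered-position count at each explicit node so that it does not have to be recomputed.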

This review was generated by AI and reviewed by a human editor.