QUICK REVIEW

[Paper Review] Linear Time Construction of Cover Suffix Tree and Applications

Jakub Radoszewski|arXiv (Cornell University)|Jan 1, 2023

Algorithms and Data Compression | Cited by 2

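One-line summary

This paper presents the first linear-time algorithm for constructing the Cover Suffix Tree (CST) of a string over an integer alphabet. It relies on a new combinatorial characterization that relates overlapping consecutive occurrences of a substring to the runs (maximal repetitions) of the string. The result is an optimal O(n)-time CST construction, which in turn yields linear-time algorithms for quasiperiodicity problems such as computing all seeds and shortest α-partial covers. The same insight also gives an O(n)-space index that reports overlapping consecutive occurrences of a length-m pattern in O(m + output) time.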

ABSTRACT

The Cover Suffix Tree (CST) of a string $T$ is the suffix tree of $T$ with additional explicit nodes corresponding to halves of square substrings of $T$. In the CST an explicit node corresponding to a substring $C$ of $T$ is annotated with two numbers: the number of non-overlapping consecutive occurrences of $C$ and the total number of positions in $T$ that are covered by occurrences of $C$ in $T$. Kociumaka et al. (Algorithmica, 2015) have shown how to compute the CST of a length-$n$ string in $O(n \log n)$ time. We show how to compute the CST in $O(n)$ time assuming that $T$ is over an integer alphabet. Kociumaka et al. (Algorithmica, 2015; Theor. Comput. Sci., 2018) have shown that knowing the CST of a length-$n$ string $T$, one can compute a linear-sized representation of all seeds of $T$ as well as all shortest $α$-partial covers and seeds in $T$ for a given $α$ in $O(n)$ time. Thus our result implies linear-time algorithms computing these notions of quasiperiodicity. The resulting algorithm computing seeds is substantially different from the previous one (Kociumaka et al., SODA 2012, ACM Trans. Algorithms, 2020). Kociumaka et al. (Algorithmica, 2015) proposed an $O(n \log n)$-time algorithm for computing a shortest $α$-partial cover for each $α=1,\ldots,n$; we improve this complexity to $O(n)$. Our results are based on a new characterization of consecutive overlapping occurrences of a substring $S$ of $T$ in terms of the set of runs (see Kolpakov and Kucherov, FOCS 1999) in $T$. This new insight also leads to an $O(n)$-sized index for reporting overlapping consecutive occurrences of a given pattern $P$ of length $m$ in $O(m+output)$ time, where $output$ is the number of occurrences reported. In comparison, a general index for reporting bounded-gap consecutive occurrences of Navarro and Thankachan (Theor. Comput. Sci., 2016) uses $O(n \log n)$ space.
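To make the two node annotations concrete, the following brute-force sketch computes them for a single substring $C$. The function names are hypothetical, positions are 0-based, and treating two consecutive occurrences as non-overlapping when their start positions differ by at least $|C|$ is an assumed convention that may differ in detail from the paper's definition. The CST stores such values at its explicit nodes; this sketch recomputes them from scratch and is meant only to illustrate what is being counted.

```python
def occurrences(text, pattern):
    """Return all 0-based start positions of pattern in text (overlaps allowed)."""
    positions = []
    start = text.find(pattern)
    while start != -1:
        positions.append(start)
        start = text.find(pattern, start + 1)
    return positions


def covered_positions(text, pattern):
    """Count the positions of text that lie inside at least one occurrence."""
    covered = set()
    for start in occurrences(text, pattern):
        covered.update(range(start, start + len(pattern)))
    return len(covered)


def nonoverlapping_consecutive_pairs(text, pattern):
    """Count consecutive occurrence pairs whose starts are at least len(pattern) apart.

    Assumed convention for illustration; the paper may count slightly differently.
    """
    occ = occurrences(text, pattern)
    return sum(1 for a, b in zip(occ, occ[1:]) if b - a >= len(pattern))


text = "aabaabaa"
print(covered_positions(text, "aa"), nonoverlapping_consecutive_pairs(text, "aa"))        # 6 2
print(covered_positions(text, "aabaa"), nonoverlapping_consecutive_pairs(text, "aabaa"))  # 8 0
```

In the example, "aa" occurs at positions 0, 3 and 6 without overlaps, while the two occurrences of "aabaa" overlap and together cover the whole string.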

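Motivation and objectives

Design an algorithm that constructs the Cover Suffix Tree (CST) of a string over an integer alphabet in linear time.
Improve on the O(n log n)-time CST construction of Kociumaka et al. (2015) and reach the optimal O(n) time bound.
Enable linear-time computation of quasiperiodicity measures such as all seeds and shortest α-partial covers.
Build an O(n)-sized index that reports overlapping consecutive occurrences of a pattern in O(m + output) time, compared with the O(n log n) space used by the general bounded-gap index of Navarro and Thankachan (2016).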
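As a reference point for what "shortest α-partial cover" means, here is a brute-force sketch. It assumes the usual definition in which a substring is an α-partial cover when its occurrences cover at least α positions of the string. The function names are hypothetical, and the running time is far from linear; the sketch is useful only for checking small examples by hand.

```python
def covered_count(text, factor):
    """Number of positions of text covered by occurrences of factor."""
    covered = [False] * len(text)
    m = len(factor)
    for i in range(len(text) - m + 1):
        if text.startswith(factor, i):
            for j in range(i, i + m):
                covered[j] = True
    return sum(covered)


def shortest_partial_covers(text, alpha):
    """Return all shortest substrings whose occurrences cover >= alpha positions."""
    n = len(text)
    for length in range(1, n + 1):
        # Distinct substrings of the current length, in sorted order.
        candidates = sorted({text[i:i + length] for i in range(n - length + 1)})
        found = [c for c in candidates if covered_count(text, c) >= alpha]
        if found:
            return found
    return []  # only reached when alpha exceeds len(text) or text is empty


print(shortest_partial_covers("aabaabaa", 6))  # ['a']
print(shortest_partial_covers("aabaabaa", 7))  # ['aaba', 'abaa']
print(shortest_partial_covers("aabaabaa", 8))  # ['aabaa']
```

For α equal to the string length, the result is the set of shortest substrings that cover every position, which for "aabaabaa" is the single substring "aabaa".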

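Proposed method

Introduces a new combinatorial characterization that relates overlapping consecutive occurrences of a substring to the runs (maximal repetitions) of the string.
Shows that the set of substrings whose overlapping consecutive occurrences are implied by a single run has a "triangle" structure.
Uses two top-down traversals, one following the suffix links of the suffix tree and one over the CST, to compute the cv(v) and nov(v) values of explicit nodes.
Builds the O(n)-space index for overlapping consecutive occurrences from weighted ancestor queries and range minimum queries (RMQ).
Uses perfect hashing and bucket sort to keep the construction linear-time over an integer alphabet.
Preprocesses the arrays MB and ML for RMQ so that range queries take O(1) time while pattern occurrences are being reported.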
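The link between runs and overlapping consecutive occurrences can be checked on small inputs with a naive sketch. If a substring S occurs at consecutive positions a < b with b - a < |S|, the fragment from a to b + |S| has smallest period b - a and length above twice that period, so it lies inside a run with that period. The code below computes runs by brute force and looks up that witnessing run. All function names are hypothetical, runs are represented as (start, end, period) with an exclusive end (an assumed convention), and nothing here reflects the paper's linear-time machinery or its triangle structure.

```python
def smallest_period(s):
    """Smallest p >= 1 such that s[i] == s[i + p] for every valid i."""
    n = len(s)
    for p in range(1, n + 1):
        if all(s[i] == s[i + p] for i in range(n - p)):
            return p
    return 0  # only reached for the empty string


def naive_runs(text):
    """All runs as (start, end, period), end exclusive. Brute force, not linear time."""
    n = len(text)
    result = []
    for p in range(1, n // 2 + 1):
        i = 0
        while i + p < n:
            if text[i] != text[i + p]:
                i += 1
                continue
            start = i
            while i + p < n and text[i] == text[i + p]:
                i += 1
            end = i + p  # maximal fragment text[start:end] with period p
            if end - start >= 2 * p and smallest_period(text[start:end]) == p:
                result.append((start, end, p))
    return sorted(result)


def overlapping_consecutive_pairs(text, pattern):
    """Pairs of consecutive occurrences whose start positions differ by < len(pattern)."""
    occ, start = [], text.find(pattern)
    while start != -1:
        occ.append(start)
        start = text.find(pattern, start + 1)
    return [(a, b) for a, b in zip(occ, occ[1:]) if b - a < len(pattern)]


def witnessing_run(pattern, pair, runs):
    """Run whose period equals the gap and that contains both occurrences, if any."""
    a, b = pair
    for start, end, period in runs:
        if period == b - a and start <= a and b + len(pattern) <= end:
            return (start, end, period)
    return None


text = "aabaabaa"
runs = naive_runs(text)
print(runs)  # [(0, 2, 1), (0, 8, 3), (3, 5, 1), (6, 8, 1)]
for pair in overlapping_consecutive_pairs(text, "aabaa"):
    print(pair, witnessing_run("aabaa", pair, runs))  # (0, 3) (0, 8, 3)
```

Here the two overlapping occurrences of "aabaa" at positions 0 and 3 are witnessed by the run covering the whole string with period 3.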

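Results

Research questions

RQ1: Can the Cover Suffix Tree of a string over an integer alphabet be constructed in O(n) time?
RQ2: Does the new run-based characterization of overlapping consecutive occurrences enable faster computation of quasiperiodicity measures?
RQ3: Can an O(n)-sized index report overlapping consecutive occurrences of a pattern in O(m + output) time?
RQ4: Can all seeds and shortest α-partial covers be computed in linear time using the CST?
RQ5: Can the proposed technique be extended to improve the construction time of the Minimal Augmented Suffix Tree (MAST)?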

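Key findings

The Cover Suffix Tree of a length-n string over an integer alphabet can be constructed in O(n) time, which is optimal.
Given the CST, a linear-sized representation of all seeds can be computed in O(n) time, so the CST-based route no longer pays for an O(n log n) construction; the resulting seeds algorithm is substantially different from the earlier one of Kociumaka et al. (SODA 2012).
A shortest α-partial cover for every α = 1, ..., n can be computed in O(n) total time, improving the previous O(n log n) bound.
An O(n)-sized index reports overlapping consecutive occurrences of a length-m pattern in O(m + output) time and can be built in O(n) time for a constant-sized alphabet.
The key insight is that overlapping consecutive occurrences of a substring are fully characterized by the runs of the string and form a triangle structure.
With RMQ and weighted ancestor queries, the cv(v) and nov(v) values can be computed efficiently by top-down traversals of the suffix tree and of the CST.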
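For readers unfamiliar with range minimum queries, the sketch below shows one standard way to answer them in O(1) time after preprocessing: a sparse table. This is a swapped-in textbook structure with O(n log n) preprocessing, not the linear-time RMQ preprocessing that an O(n) construction would need, and the class name and sample values are hypothetical; the review does not define the contents of the arrays MB and ML, so they are not modeled here.

```python
class SparseTableRMQ:
    """Static range-minimum structure: O(n log n) build, O(1) query."""

    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        # table[k][i] is the index of the leftmost minimum in values[i : i + 2**k].
        self.table = [list(range(n))]
        length = 1
        while 2 * length <= n:
            prev = self.table[-1]
            row = []
            for i in range(n - 2 * length + 1):
                left, right = prev[i], prev[i + length]
                row.append(left if self.values[left] <= self.values[right] else right)
            self.table.append(row)
            length *= 2

    def argmin(self, lo, hi):
        """Index of the leftmost minimum in values[lo:hi]; requires lo < hi."""
        if not 0 <= lo < hi <= len(self.values):
            raise ValueError("empty or out-of-range query")
        k = (hi - lo).bit_length() - 1  # largest power of two fitting in the range
        left = self.table[k][lo]
        right = self.table[k][hi - (1 << k)]
        return left if self.values[left] <= self.values[right] else right


rmq = SparseTableRMQ([5, 2, 4, 2, 7, 1, 3])
print(rmq.argmin(0, 5))  # 1
print(rmq.argmin(2, 5))  # 3
print(rmq.argmin(0, 7))  # 5
```

Each query combines two precomputed blocks that together span the range, which is why the answer takes constant time regardless of the range length.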

This review was generated by AI and checked by a human editor.