QUICK REVIEW

[Paper Review] Space-Efficient Computation of the LCP Array from the Burrows-Wheeler Transform

Nicola Prezza, Giovanna Rosone|arXiv (Cornell University)|Jan 1, 2019

Algorithms and Data Compression | Cited by 7

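One-Line Summary

This paper presents memory-efficient algorithms that compute the LCP array directly from the Burrows-Wheeler transform (BWT) and that merge the BWTs of string collections, using o(n log σ) bits of working space on top of the input and output. By combining Belazzougui's suffix tree interval enumeration with the LCP construction of Beller et al., the algorithms run in O(n log σ) time while keeping the working space at o(n log σ) bits. This improves on the earlier O(n)-bit working-space solution for small alphabets and generalizes it to string collections.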

ABSTRACT

We show that the Longest Common Prefix Array of a text collection of total size n on alphabet [1, σ] can be computed from the Burrows-Wheeler transformed collection in O(n log σ) time using o(n log σ) bits of working space on top of the input and output. Our result improves (on small alphabets) and generalizes (to string collections) the previous solution from Beller et al., which required O(n) bits of extra working space. We also show how to merge the BWTs of two collections of total size n within the same time and space bounds. The procedure at the core of our algorithms can be used to enumerate suffix tree intervals in succinct space from the BWT, which is of independent interest. An engineered implementation of our first algorithm on DNA alphabet induces the LCP of a large (16 GiB) collection of short (100 bases) reads at a rate of 2.92 megabases per second using in total 1.5 Bytes per base in RAM. Our second algorithm merges the BWTs of two short-reads collections of 8 GiB each at a rate of 1.7 megabases per second and uses 0.625 Bytes per base in RAM. An extension of this algorithm that computes also the LCP array of the merged collection processes the data at a rate of 1.48 megabases per second and uses 1.625 Bytes per base in RAM.

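Motivation and Objectives

Compute the LCP array of a text collection of total length n from its BWT, using working space that is sublinear in the input size (o(n log σ) bits).
Generalize space-efficient LCP computation from a single text to string collections.
Design a space-efficient algorithm for merging the BWTs of two string collections.
Enable succinct-space enumeration of suffix tree intervals from the BWT, which underpins the core LCP algorithm.
Achieve high throughput with minimal RAM usage on large biological datasets such as short-read collections.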

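Proposed Method

Use Belazzougui's algorithm, which enumerates right-maximal substrings and their suffix array (SA) ranges from the BWT using a wavelet tree representation.
Extend the enumeration to include the leaf intervals of the generalized suffix tree, which makes LCP array construction possible for collections.
Adopt a hybrid strategy: for small alphabets (σ < √n / log² n), use O(σ² log² n) working space, which is o(n) in this range because σ² log² n < n / log² n; for larger alphabets, fall back to the O(n)-bit method of Beller et al., which is o(n log σ) there because log σ = Θ(log n).
Reuse the output space of the LCP array and the document array to hold intermediate wavelet tree structures, reducing the total working space to o(n log σ) bits.
Depending on the alphabet size, manage the interval pairs that arise during BWT merging with either a priority queue or a stack, so that space efficiency is preserved.
During enumeration and merging, compute the SA ranges of extended substrings by backward search, using rank operations on the wavelet tree.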
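
The queue-based idea behind the method of Beller et al., combined with the backward-search step described above, can be sketched for a single text as follows. This is a minimal illustration, not the paper's implementation: it assumes the input is the BWT of one text terminated by a unique, lexicographically smallest sentinel, and it replaces the wavelet tree with a naive prefix-count table `occ` (a hypothetical stand-in that uses far more space than the succinct structures the paper relies on). The function name `lcp_from_bwt` and the convention that the first LCP entry is -1 are choices made for this sketch.

```python
from collections import deque


def lcp_from_bwt(bwt):
    """Sketch of a Beller-style LCP construction from the BWT of one text.

    Assumption: the text ends with a unique sentinel that is smaller
    than every other character. Returns a list whose first entry is -1.
    """
    n = len(bwt)
    alphabet = sorted(set(bwt))

    # C[c] = number of text characters strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)

    # occ[c][k] = occurrences of c in bwt[:k]; a naive replacement for
    # the rank query that a wavelet tree would answer in O(log sigma).
    occ = {c: [0] * (n + 1) for c in alphabet}
    for k, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][k + 1] = occ[c][k] + (1 if c == ch else 0)

    lcp = [None] * (n + 1)   # None marks "not yet known"
    lcp[0] = lcp[n] = -1     # boundary sentinels
    queue = deque([(0, n, 0)])  # (lo, hi, depth) with half-open SA range

    while queue:
        lo, hi, depth = queue.popleft()
        for c in alphabet:
            # Backward-search step: SA range of c + (current substring).
            lb = C[c] + occ[c][lo]
            rb = C[c] + occ[c][hi]
            if lb < rb and lcp[rb] is None:
                lcp[rb] = depth
                queue.append((lb, rb, depth + 1))

    return lcp[:n]


if __name__ == "__main__":
    # "annb$aa" is the BWT of "banana$".
    print(lcp_from_bwt("annb$aa"))
```

Because the queue is processed first-in, first-out, shorter substrings are handled before longer ones, so the first value written at a boundary is the length of the longest common prefix there. Intervals whose right boundary is already filled are pruned, which is what bounds the total work.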

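Experimental Results

Research Questions

RQ1: Can the LCP array be induced from the BWT using sublinear working space, specifically o(n log σ) bits?
RQ2: How can the suffix tree interval enumeration procedure be extended to include the leaf nodes of the generalized suffix tree?
RQ3: Can the BWTs of two string collections be merged using o(n log σ) bits of working space while retaining the ability to reconstruct the merged BWT?
RQ4: How does the trade-off between space efficiency and time complexity change across LCP array computation and BWT merging?
RQ5: Can the proposed algorithms be implemented efficiently, with minimal RAM usage, on real biological data such as DNA reads?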
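
To make RQ3 concrete, the sketch below shows how a merged BWT could be reconstructed once the origin of each position is known. It assumes the document array is given as a list of 0/1 flags, where entry i says whether position i of the merged BWT comes from the first or the second collection; that encoding and the name `interleave_bwts` are assumptions for illustration, and the hard part (computing the flags in small space) is not shown.

```python
def interleave_bwts(bwt1, bwt2, da):
    """Rebuild a merged BWT from two BWTs and a 0/1 document array.

    Assumption: da[i] == 0 means the i-th merged character is the next
    unread character of bwt1, and da[i] == 1 means it comes from bwt2.
    """
    if da.count(0) != len(bwt1) or da.count(1) != len(bwt2):
        raise ValueError("document array does not match the input lengths")

    positions = [0, 0]          # next unread index in bwt1 and bwt2
    sources = (bwt1, bwt2)
    merged = []
    for src in da:
        merged.append(sources[src][positions[src]])
        positions[src] += 1
    return "".join(merged)


if __name__ == "__main__":
    # Placeholder strings, not real BWTs: they only show the interleaving.
    print(interleave_bwts("ab", "cd", [0, 1, 1, 0]))
```

The relative order of characters within each input BWT is preserved, so a single left-to-right pass over the flags is enough.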

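Key Findings

The LCP array of a text collection of total length n over the alphabet [1, σ] can be computed in O(n log σ) time using o(n log σ) bits of working space on top of the BWT (input) and the LCP array (output).
The algorithm generalizes the solution of Beller et al., which requires O(n) bits of working space, to string collections, and improves on it for small alphabets by using only o(n log σ) bits.
The BWTs of two collections of total length n can be merged in O(n log σ) time using o(n log σ) bits of working space, with the document array obtained as a by-product.
An engineered implementation for the DNA alphabet induces the LCP array of a 16 GiB collection of 100-base short reads at 2.92 megabases per second, using 1.5 bytes per base of RAM in total.
The BWT merging algorithm merges two collections of 8 GiB each at 1.7 megabases per second, using 0.625 bytes per base of RAM.
The extension that also computes the LCP array of the merged collection runs at 1.48 megabases per second, using 1.625 bytes per base of RAM.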
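
The reported rates can be turned into rough wall-clock and memory figures. The arithmetic below is only a back-of-the-envelope sketch: it assumes each base occupies one byte in the quoted collection sizes (so 16 GiB is about 17.2 billion bases) and that the rates apply to the whole input, neither of which is stated in the review. The helper names are hypothetical.

```python
GIB = 2 ** 30  # bytes per GiB


def estimated_hours(collection_gib, megabases_per_second):
    """Running time in hours, assuming one byte per base."""
    bases = collection_gib * GIB
    seconds = bases / (megabases_per_second * 1_000_000)
    return seconds / 3600


def estimated_ram_gib(collection_gib, bytes_per_base):
    """Total RAM in GiB, assuming one byte per base in the input size."""
    return collection_gib * bytes_per_base


if __name__ == "__main__":
    # LCP construction on the 16 GiB read collection.
    print(estimated_hours(16, 2.92), estimated_ram_gib(16, 1.5))
    # Merging two 8 GiB collections (16 GiB in total).
    print(estimated_hours(16, 1.7), estimated_ram_gib(16, 0.625))
```

Under these assumptions the LCP construction would need about 24 GiB of RAM and the plain merge about 10 GiB.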


This review was generated by AI and checked by a human editor.