QUICK REVIEW

[Paper Review] Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares

Gabriel Bathie, Tomasz Kociumaka|arXiv (Cornell University)|Jan 1, 2023

semigroups and automata theory | Cited by 1

One-Sentence Summary

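This paper presents space-efficient streaming and read-only online algorithms that compute, for every prefix of an input string, the minimum Hamming distance and edit distance to the language of palindromes and the language of squares in the low-distance regime (threshold k). It introduces randomised streaming algorithms that use O(k polylog n) space and time per character for the Hamming distance and O(k² polylog n) for the edit distance, and deterministic read-only algorithms that use O(k polylog n) space and time per character for the Hamming distance and O(k⁴ polylog n) space and amortised time per character for the edit distance. In both models the complexity is poly(k, log n).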

ABSTRACT

We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language $L$, we are given a string $T$ of length $n$, and the task is to compute the minimal distance to $L$ from every prefix of $T$. We focus on the low-distance regime, where one must compute only the distances smaller than a given threshold $k$. In this work, our contribution is twofold:
- First, we show streaming algorithms, which access the input string $T$ only through a single left-to-right scan. Both for palindromes and squares, our algorithms use $O(k \cdot\mathrm{poly}~\log n)$ space and time per character in the Hamming-distance case and $O(k^2 \cdot\mathrm{poly}~\log n)$ space and time per character in the edit-distance case. These algorithms are randomised by necessity, and they err with probability inverse-polynomial in $n$.
- Second, we show deterministic read-only online algorithms, which are also provided with read-only random access to the already processed characters of $T$. Both for palindromes and squares, our algorithms use $O(k \cdot\mathrm{poly}~\log n)$ space and time per character in the Hamming-distance case and $O(k^4 \cdot\mathrm{poly}~\log n)$ space and amortised time per character in the edit-distance case.
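
To make the problem statement concrete, here is a naive quadratic-time baseline for the Hamming-distance variants. It is only a reference sketch, not the paper's small-space algorithm. The function names and the convention of reporting `None` for distances above `k` are assumptions made for this illustration.

```python
from typing import List, Optional, Tuple


def hamming_to_palindrome(s: str) -> int:
    """Each mismatched mirror pair (i, n-1-i) needs exactly one substitution."""
    n = len(s)
    return sum(1 for i in range(n // 2) if s[i] != s[n - 1 - i])


def hamming_to_square(s: str) -> Optional[int]:
    """A square has even length, so odd-length strings have no square at finite Hamming distance."""
    n = len(s)
    if n % 2 == 1:
        return None
    half = n // 2
    return sum(1 for i in range(half) if s[i] != s[i + half])


def prefix_distances(text: str, k: int) -> Tuple[List[Optional[int]], List[Optional[int]]]:
    """For every prefix, report the distance if it is at most k, else None (assumed convention)."""
    def cap(d: Optional[int]) -> Optional[int]:
        return d if d is not None and d <= k else None

    to_pal, to_sq = [], []
    for end in range(1, len(text) + 1):
        prefix = text[:end]  # stores the whole prefix, unlike a small-space algorithm
        to_pal.append(cap(hamming_to_palindrome(prefix)))
        to_sq.append(cap(hamming_to_square(prefix)))
    return to_pal, to_sq
```

This baseline stores the whole input and spends time linear in the prefix length per character, which is exactly the cost the streaming and read-only algorithms are designed to avoid.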

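Motivation and Goals

Solve the online language distance problem for the formal languages of palindromes and squares under a low-distance constraint.
Design space-efficient algorithms that access the input string either through a single left-to-right scan (streaming) or with read-only random access to the already processed characters.
Achieve poly(k, log n) time and space complexity for both the Hamming-distance and edit-distance variants, focusing on distances within the threshold k.
Develop the first poly(k, log n)-space read-only algorithms for finding k-mismatch and k-edit occurrences of a pattern in a text.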
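
For readers unfamiliar with the term, a k-mismatch occurrence of a pattern is a position in the text where the pattern matches with at most k substituted characters. The sketch below is a naive illustration of that definition with early termination. The function name is hypothetical, and this is not the small-space read-only algorithm from the paper.

```python
from typing import List, Tuple


def k_mismatch_occurrences(text: str, pattern: str, k: int) -> List[Tuple[int, int]]:
    """Return (start, mismatches) for every alignment with at most k mismatches."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for j in range(m):
            if text[start + j] != pattern[j]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the threshold
        else:
            # The inner loop finished without exceeding k mismatches.
            hits.append((start, mismatches))
    return hits
```

For example, `k_mismatch_occurrences("abcabd", "abd", 1)` reports the alignments starting at positions 0 (one mismatch) and 3 (exact match).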

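Proposed Methods

Handle the Hamming distance with randomised streaming techniques and sketches, building on tools from approximate pattern matching.
Apply locally consistent string decompositions and edit-distance sketches to reduce edit-distance computation to a Hamming-distance problem.
Use k-mismatch and k-error pattern matching techniques, tracking mismatch information (MI), to detect substrings that may be close to palindromes or squares.
Exploit structural properties of palindromes and squares, such as self-similarity and periodicity, to guide efficient detection and distance computation.
Maintain O(k) sets of mismatch information and apply triangle-inequality bounds to process each character in O(k polylog n) time.
In the read-only model, maintain persistent data structures with O(k) space per level through hierarchical pattern matching over a multi-level string decomposition.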
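
To give a feel for why randomisation helps in the streaming model, the sketch below handles only the simplest special case, distance 0: it flags which prefixes of a stream are exact palindromes using Karp-Rabin-style fingerprints and a constant number of stored values. This is a classical technique swapped in for illustration, not the sketching scheme from the paper. The modulus, the base range, and the function name are assumptions.

```python
import random
from typing import Iterable, List, Optional

MODULUS = (1 << 61) - 1  # a Mersenne prime, chosen here for illustration


def palindromic_prefix_lengths(stream: Iterable[str], seed: Optional[int] = None) -> List[int]:
    """Report lengths of prefixes whose forward and reverse fingerprints agree.

    True palindromes are always reported. A non-palindrome is reported only
    if two distinct polynomials collide at the random base, which happens
    with small probability.
    """
    rng = random.Random(seed)
    base = rng.randrange(256, MODULUS - 1)
    forward = 0   # sum of code[i] * base^i
    backward = 0  # sum of code[i] * base^(n-1-i), i.e. the fingerprint of the reversed prefix
    power = 1     # base^(current length)
    lengths = []
    for length, ch in enumerate(stream, start=1):
        code = ord(ch) + 1
        forward = (forward + code * power) % MODULUS
        backward = (backward * base + code) % MODULUS
        power = (power * base) % MODULUS
        if forward == backward:
            lengths.append(length)
    return lengths
```

The function reads each character once and never revisits earlier input, which mirrors the single left-to-right scan of the streaming model. Supporting up to k mismatches or edits requires the richer sketches the paper builds on.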

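Results

Research Questions

RQ1: Can we design randomised streaming algorithms for the palindrome and square languages that use poly(k, log n) space and O(k polylog n) time per character under the Hamming and edit distances?
RQ2: What are the smallest space and time complexities achievable by deterministic online algorithms with read-only access to the already processed input characters?
RQ3: How can k-mismatch or k-error pattern occurrences that may form palindromes or squares be detected efficiently in real time?
RQ4: Can the structural properties of palindromes and squares be used to reduce the cost of distance computation in the low-distance regime?
RQ5: What are the trade-offs between the randomised streaming model and the deterministic read-only model in terms of space, time, and correctness guarantees?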

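Key Findings

The randomised streaming algorithms use O(k polylog n) space and time per character for the Hamming distance and O(k² polylog n) space and time per character for the edit distance, with inverse-polynomial error probability.
The deterministic read-only algorithms use O(k polylog n) space and time per character for the Hamming distance, and O(k⁴ polylog n) space and amortised time per character for the edit distance.
As a by-product, the paper develops the first poly(k, log n)-space read-only algorithms for finding k-mismatch and k-edit occurrences of a pattern in a text.
For k-LHD-PAL and k-LHD-SQ, the streaming algorithms use Õ(k) time and O(k log n) space, improving on the previous Õ(k²) time complexity.
The read-only algorithms for k-LHD-PAL/SQ run in O(k log n) time per character and use O(k log n) space, matching the best known bounds for similar problems.
For the edit distance, the read-only algorithms achieve Õ(k⁴) space and amortised Õ(k⁴) time per character, a step forward for deterministic online computation.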
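
As a point of comparison for the edit-distance bounds above, the following is a standard interval dynamic program that computes the edit distance from every prefix to the nearest palindrome in quadratic time and space. It is a textbook baseline, not the paper's method. The function name and the convention of reporting `None` for distances above `k` are assumptions.

```python
from typing import List, Optional


def prefix_edit_distance_to_palindrome(text: str, k: int) -> List[Optional[int]]:
    """Entry j is the edit distance of text[:j+1] to a palindrome, or None if above k."""
    n = len(text)
    if n == 0:
        return []
    # dp[i][j] = minimum number of insertions, deletions, and substitutions
    # that turn text[i..j] (inclusive) into a palindrome. Empty and
    # single-character intervals cost 0.
    dp = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
            if text[i] == text[j]:
                dp[i][j] = inner
            else:
                # Drop text[i], drop text[j], or substitute one end to match the other.
                dp[i][j] = 1 + min(dp[i + 1][j], dp[i][j - 1], inner)
    return [d if d <= k else None for d in dp[0]]
```

Row 0 of the table holds the answer for every prefix, so one pass over the full table answers all prefixes at once, at a cost that grows quadratically with the input length rather than polynomially in k and log n.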

This review was generated by AI and reviewed by human editors.