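[Paper Digest] Enumerating Regular Languages with Bounded Delay
This paper presents a bounded-delay enumeration algorithm for regular languages. Each word is produced by modifying the preceding word with an edit script of push and pop operations at the two ends of the word, so the delay does not depend on word length. The paper characterizes the orderable regular languages and gives a PTIME algorithm that partitions any regular language into the minimum number of orderable components. In the stack model, where edits are allowed only at the end of the word, the languages that admit such a finite partition are exactly the slender languages.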
We study the task, for a given language L, of enumerating the (generally infinite) sequence of its words, without repetitions, while bounding the delay between two consecutive words. To allow for delay bounds that do not depend on the current word length, we assume a model where we produce each word by editing the preceding word with a small edit script, rather than writing out the word from scratch. In particular, this witnesses that the language is orderable, i.e., we can write its words as an infinite sequence such that the Levenshtein edit distance between any two consecutive words is bounded by a value that depends only on the language. For instance, (a+b)^* is orderable (with a variant of the Gray code), but a^* + b^* is not. We characterize which regular languages are enumerable in this sense, and show that this can be decided in PTIME in an input deterministic finite automaton (DFA) for the language. In fact, we show that, given a DFA A, we can compute in PTIME automata A₁, …, A_t such that L(A) is partitioned as L(A₁) ⊔ … ⊔ L(A_t) and every L(A_i) is orderable in this sense. Further, we show that the value of t obtained is optimal, i.e., we cannot partition L(A) into less than t orderable languages. In the case where L(A) is orderable (i.e., t = 1), we show that the ordering can be produced by a bounded-delay algorithm: specifically, the algorithm runs in a suitable pointer machine model, and produces a sequence of bounded-length edit scripts to visit the words of L(A) without repetitions, with bounded delay - exponential in |A| - between each script. In fact, we show that we can achieve this while only allowing the edit operations push and pop at the beginning and end of the word, which implies that the word can in fact be maintained in a double-ended queue. 
By contrast, when fixing the distance bound d between consecutive words and the number of classes of the partition, it is NP-hard in the input DFA A to decide if L(A) is orderable in this sense, already for finite languages. Last, we study the model where push-pop edits are only allowed at the end of the word, corresponding to a case where the word is maintained on a stack. We show that these operations are strictly weaker and that the slender languages are precisely those that can be partitioned into finitely many languages that are orderable in this sense. For the slender languages, we can again characterize the minimal number of languages in the partition, and achieve bounded-delay enumeration.
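To make the (a+b)^* example concrete, here is a minimal Python sketch of one Gray-code-style ordering. It is an illustration written for this digest, not necessarily the variant used in the paper: words are grouped by length, each group follows the reflected Gray code, and the direction alternates so that moving to the next length is a single push on the right. The function names are hypothetical. The sketch only witnesses orderability; it builds each length class in full, so it is not a bounded-delay algorithm.

```python
from itertools import count, islice


def gray_words(n):
    """Words of length n over {a, b} in reflected Gray-code order."""
    for i in range(2 ** n):
        g = i ^ (i >> 1)  # i-th Gray code; consecutive codes differ in one bit
        yield "".join("b" if (g >> (n - 1 - j)) & 1 else "a" for j in range(n))


def enumerate_all_words():
    """Yield every word of (a+b)^* once, changing one letter or one end per step."""
    for n in count(0):
        block = list(gray_words(n))
        # Odd lengths run forward (ending at b a^(n-1)); even lengths run
        # backward (starting at b a^(n-1)), so a length change is one push.
        yield from (block if n % 2 == 1 else reversed(block))


def levenshtein(u, v):
    """Textbook dynamic-programming edit distance, used only for checking."""
    prev = list(range(len(v) + 1))
    for i, cu in enumerate(u, 1):
        cur = [i]
        for j, cv in enumerate(v, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cu != cv)))
        prev = cur
    return prev[-1]


first_words = list(islice(enumerate_all_words(), 31))  # all words of length <= 4
```

In this ordering every pair of consecutive words is at Levenshtein distance 1, which can be checked by applying `levenshtein` to adjacent entries of `first_words`. No such ordering exists for a^* + b^*, since long words of a^* and long words of b^* are far apart.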
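Motivation and Objectives
- To address the challenge of efficiently enumerating infinite regular languages with a delay that does not depend on word length.
- To define and characterize "orderable" regular languages, i.e., languages whose words can be arranged in a sequence in which the Levenshtein edit distance between consecutive words is bounded.
- To determine whether every regular language can be partitioned into finitely many orderable sublanguages, and to compute a minimum such partition.
- To design a bounded-delay enumeration algorithm that uses only push and pop operations at the two ends of the word, so that the word can be maintained in a double-ended queue.
- To establish complexity bounds for deciding orderability under various constraints, including NP-hardness when the edit distance and the partition size are fixed.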
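Proposed Method
- Model enumeration as producing a sequence of bounded-length edit scripts, each of which turns one word into the next using only push and pop operations at the two ends of the word.
- Run the enumeration in a pointer machine model, so that the delay depends only on the size of the automaton and not on word length.
- Apply depth-first search (DFS) on the deterministic finite automaton (DFA) to explore all paths, marking states to avoid going around cycles while enumerating the non-loopable words.
- Identify, in linear time, the unique simple cycle and the path from the initial state to it, using a DFS that tracks the recursion stack.
- Build an ultimately periodic sequence of edit scripts: first enumerate the non-loopable words, then move to the cycle, and finally traverse the cycle periodically.
- Bound the edit distance between consecutive words by 2k (k = number of states) by ensuring that each state appears at most twice in any single edit script.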
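The edit-script model can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the operation names (`pushL`, `pushR`, `popL`, `popR`), the helper functions, and the toy language d + a b^* c are all hypothetical choices made for this example. The current word lives in a `collections.deque`, and the script sequence is a finite preamble followed by a period that repeats forever, which is what "ultimately periodic" means here.

```python
from collections import deque
from itertools import chain, cycle, islice


def apply_script(word, script):
    """Apply push/pop operations at either end of the deque, in place."""
    for op, letter in script:
        if op == "pushR":
            word.append(letter)
        elif op == "pushL":
            word.appendleft(letter)
        elif op == "popR":
            word.pop()
        elif op == "popL":
            word.popleft()
        else:
            raise ValueError(f"unknown operation: {op}")


def enumerate_words(preamble, period):
    """Yield the word after each script of an ultimately periodic sequence."""
    word = deque()
    for script in chain(preamble, cycle(period)):
        apply_script(word, script)
        yield "".join(word)  # copied out only so the example can display it


# Hypothetical scripts for the toy language d + a b* c.
PREAMBLE = [
    [("pushR", "d")],                                  # "" -> "d"
    [("popR", None), ("pushR", "a"), ("pushR", "c")],  # "d" -> "ac"
]
PERIOD = [
    [("popR", None), ("pushR", "b"), ("pushR", "c")],  # a b^n c -> a b^(n+1) c
]

words = list(islice(enumerate_words(PREAMBLE, PERIOD), 5))
```

Each script has at most three operations regardless of how long the word has become, so the work per step stays constant. This toy example happens to use only right-end operations, which corresponds to the stack model; the left-end operations are supported for the double-ended case.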
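Results
Research Questions
- RQ1: Which regular languages admit a sequence listing all of their words, without repetition, in which the Levenshtein edit distance between consecutive words is bounded?
- RQ2: Can every regular language be partitioned into finitely many orderable sublanguages, and what is the minimum number of components?
- RQ3: Can regular languages be enumerated with bounded delay when edits are allowed only at the end of the word? If so, under what conditions?
- RQ4: What is the computational complexity of deciding whether a regular language is orderable when the edit distance and the number of classes are fixed?
- RQ5: How can a minimum partition of a regular language into orderable languages be computed efficiently, and how do structural properties such as slenderness characterize these languages?
Main Findings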
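- Orderability of a regular language (the existence of a sequence of all its words with bounded edit distance between consecutive words) is characterized, and can be decided in PTIME from an input DFA. It does not coincide with slenderness: (a+b)^* is orderable but not slender, whereas a^* + b^* is slender but not orderable.
- Given a DFA A, one can compute in PTIME automata A₁, …, A_t such that L(A) is partitioned as L(A₁) ⊔ … ⊔ L(A_t) and every L(A_i) is orderable. The value of t is optimal: L(A) cannot be partitioned into fewer than t orderable languages.
- For 1-slender languages (i.e., t = 1), the paper constructs an ultimately periodic sequence of edit scripts that enumerates all words, with delay linear in the size of the DFA and edit distance at most 2k. In the general orderable case, by contrast, the delay bound stated in the abstract is exponential in |A|.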
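- The bounded-delay enumeration algorithm maintains the current word in a double-ended queue, using only push and pop operations at its two ends.
- With the distance bound d between consecutive words and the number t of classes both fixed, deciding from an input DFA whether its language can be partitioned into t languages that are orderable with distance d is NP-hard, already for finite languages.
- When edits are restricted to the end of the word (the stack model), the operations are strictly weaker. The languages that can be partitioned into finitely many languages orderable in this model are exactly the slender languages; for these, the minimum partition size can be characterized and bounded-delay enumeration achieved.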
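As a reminder of what slenderness means, a language is slender when the number of its words of each length is bounded by a constant. The Python sketch below counts accepted words per length in a DFA by dynamic programming. It is an illustration for this digest, not a procedure from the paper: the function name, the dictionary encoding of the transition function, and the two example automata are assumptions. Counting over a finite range of lengths only suggests the growth pattern and does not prove slenderness.

```python
def words_per_length(delta, start, accepting, max_len):
    """Count accepted words of each length 0..max_len in a DFA.

    delta maps (state, letter) to the next state; missing entries reject.
    """
    paths = {start: 1}  # paths[q] = number of words of the current length reaching q
    counts = []
    for _ in range(max_len + 1):
        counts.append(sum(n for q, n in paths.items() if q in accepting))
        step = {}
        for (q, _letter), r in delta.items():
            if q in paths:
                step[r] = step.get(r, 0) + paths[q]
        paths = step
    return counts


# a^* + b^*: after the first letter, only that same letter may follow.
A_OR_B = {(0, "a"): "A", (0, "b"): "B", ("A", "a"): "A", ("B", "b"): "B"}
slender_counts = words_per_length(A_OR_B, 0, {0, "A", "B"}, 5)

# (a+b)^*: a single accepting state with a self-loop on both letters.
ALL_WORDS = {(0, "a"): 0, (0, "b"): 0}
full_counts = words_per_length(ALL_WORDS, 0, {0}, 5)
```

Here `slender_counts` stays at 2 for every positive length, while `full_counts` doubles with each length. This matches the contrast drawn in the abstract: a^* + b^* is slender yet not orderable as a whole, although it splits into the two orderable languages a^* and b^*.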
This digest was generated by AI and reviewed by a human editor.