QUICK REVIEW

[Paper Review] Data Structure Lower Bounds for Document Indexing Problems

Peyman Afshani, Jesper Sindahl Nielsen|arXiv (Cornell University)|Jan 1, 2016

Algorithms and Data Compression | 43 references | 6 citations

TL;DR

This paper establishes tight, unconditional space-time lower bounds for fundamental document indexing and pattern matching problems—such as two-pattern queries, forbidden-pattern queries, and wildcard pattern indexing—using the pointer machine model. By leveraging combinatorial constructions and measure-based arguments, it proves that known data structures are nearly optimal, with S(n)Q(n) = Ω(n²⁻ᵒ⁽¹⁾) for reporting queries and S(n)Q²(n) = Ω(n²/log⁴n) for counting variants, demonstrating the pointer machine model's power in deriving high-quality lower bounds where other models fail.

ABSTRACT

We study data structure problems related to document indexing and pattern matching queries and our main contribution is to show that the pointer machine model of computation can be extremely useful in proving high and unconditional lower bounds that cannot be obtained in any other known model of computation with the current techniques. Often our lower bounds match the known space-query time trade-off curve and in fact for all the problems considered, there is a very good and reasonable match between our lower bounds and the known upper bounds, at least for some choice of input parameters. The problems that we consider are set intersection queries (both the reporting variant and the semi-group counting variant), indexing a set of documents for two-pattern queries, or forbidden-pattern queries, or queries with wild-cards, and indexing an input set of gapped-patterns (or two-patterns) to find those matching a document given at the query time.

Motivation & Objective

To establish strong, unconditional lower bounds for document indexing and pattern matching data structures, especially where prior conditional bounds fall short.
To demonstrate the superiority of the pointer machine model in deriving high-quality, tight lower bounds that match known upper bounds.
To close the gap between known upper bounds and theoretical limits for problems like two-pattern queries, forbidden-pattern queries, and wildcard pattern indexing.
To analyze the complexity of both reporting and counting variants of set intersection and pattern matching problems in a unified framework.
To explore the limits of linear-space data structures and show that sublinear query time requires super-linear space in many cases.
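For readers unfamiliar with the query types named above, the following is a minimal sketch of what two-pattern (2P) and forbidden-pattern (FP) queries ask for, written as a naive scan over all documents. It is only a statement of the problems, not a data structure from the paper; the function names and sample documents are hypothetical, and FP is assumed to mean "contains the first pattern but not the second".

```python
from typing import List


def two_pattern_query(docs: List[str], p1: str, p2: str) -> List[int]:
    """2P: report indices of documents that contain both p1 and p2."""
    return [i for i, d in enumerate(docs) if p1 in d and p2 in d]


def forbidden_pattern_query(docs: List[str], wanted: str, forbidden: str) -> List[int]:
    """FP (assumed definition): report documents containing `wanted` but not `forbidden`."""
    return [i for i, d in enumerate(docs) if wanted in d and forbidden not in d]


# Hypothetical sample collection, for illustration only.
docs = ["banana", "bandana", "cabana", "anagram"]

print(two_pattern_query(docs, "ban", "ana"))        # [0, 1, 2]
print(forbidden_pattern_query(docs, "ana", "ban"))  # [3]
```

The scan uses no extra space but reads every document on each query. The lower bounds discussed in this review concern how much space an index must use to answer such queries faster.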

Proposed method

Uses the pointer machine model to avoid reliance on random access, enabling unconditional lower bounds.
Applies a measure-based argument where patterns are treated as discrete points and documents as ranges, modeling intersection measures.
Employs a randomized construction with high-probability bounds to derive lower bounds for 2P, FP, 2FP, and SI problems.
Leverages Theorem 2 (from prior work) to relate space, query time, and intersection size via parameters t, v, and g(n).
Constructs hard input instances with specific combinatorial properties: e.g., bounded overlap in pattern matches and controlled document intersections.
Uses binomial coefficient bounds and asymptotic analysis to derive tight Ω(n²⁻ᵒ⁽¹⁾) and Ω(n²/log⁴n) lower bounds for space-query time trade-offs.

Experimental results

Research questions

RQ1: Can we prove unconditional lower bounds for document indexing problems that match known upper bounds?
RQ2: Is the pointer machine model capable of yielding tighter and more informative lower bounds than conditional models like 3SUM or Boolean Matrix Multiplication?
RQ3: What is the minimal space required for a data structure to support 2-pattern queries with sublinear query time?
RQ4: How does the complexity of wildcard pattern indexing (WCI) scale with the number of wild-cards κ, and can we prove tight bounds that depend on κ?
RQ5: Can we establish a separation between the complexity of reporting and counting variants of pattern matching problems?

Key findings

For 2P, FP, 2FP, and set intersection (SI) reporting queries, any pointer machine data structure with query time Q(n) + O(P₁ + P₂ + t) must satisfy S(n)Q(n) = Ω(n²⁻ᵒ⁽¹⁾), proving near-optimality of known structures.
If query time is O((nt)^(1/2−α) + t) for α > 0, then space must be Ω(n^((1+6α)/(1+2α) − o(1))), showing that super-linear space is necessary for faster query times.
For the counting variant in the semi-group model, S(n)Q²(n) = Ω(n²/log⁴n). This trade-off has a different shape from the reporting bound, but a lower bound on its own does not show that counting is easier than reporting.
For wildcard pattern indexing (WCI) with κ wild-cards, the space lower bound is Ω((n/κ) · Θ(log_{Q(n)} n / κ)^(κ−1)), which matches known upper bounds under reasonable assumptions.
The space lower bound for κ-GPI (gapped patterns) is n^Ω(log^(1/(2κ)) n), which is super-polynomial in n for any constant κ.
The paper shows that any data structure answering 2P queries in O((nt)^(1/2−ε) + t) time for ε > 0 must use super-linear space, confirming a long-standing conjecture.
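To make the trade-offs above concrete, here is a small sketch that evaluates the stated bounds numerically. It drops hidden constants and o(1) terms, assumes base-2 logarithms (the base is not given in this review), and uses hypothetical function names, so the outputs indicate orders of magnitude only.

```python
import math
from fractions import Fraction


def space_exponent(alpha) -> Fraction:
    """Exponent (1 + 6a) / (1 + 2a) from the bound for query time O((nt)^(1/2 - a) + t).

    The o(1) term is ignored.
    """
    a = Fraction(alpha)
    return (1 + 6 * a) / (1 + 2 * a)


def reporting_space_floor(n: int, q: float) -> float:
    """Space implied by S(n)Q(n) = Omega(n^2), ignoring constants and the o(1) in the exponent."""
    return n ** 2 / q


def counting_space_floor(n: int, q: float) -> float:
    """Space implied by S(n)Q(n)^2 = Omega(n^2 / log^4 n), ignoring constants (log base 2 assumed)."""
    return n ** 2 / (q ** 2 * math.log2(n) ** 4)


print(space_exponent(0))               # 1  (alpha = 0 gives only a linear-space bound)
print(space_exponent(Fraction(1, 4)))  # 5/3
print(space_exponent(Fraction(1, 2)))  # 2
```

Because 1 + 6α exceeds 1 + 2α for every α > 0, the exponent is always above 1, which is the sense in which faster queries force super-linear space.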

This review was created by AI and reviewed by human editors.