[Paper Review] SSH (Sketch, Shingle, & Hash) for Indexing Massive-Scale Time Series
This paper proposes SSH (Sketch, Shingle, & Hash), a novel data-independent hashing scheme that enables sub-linear similarity search for massive-scale time series using Dynamic Time Warping (DTW). By combining sketching, shingling, and weighted minwise hashing, SSH creates approximate indexes that align nearly perfectly with DTW similarity, achieving up to 20x speedup over the UCR suite while pruning 95% of candidates, even for long queries where branch-and-bound methods fail.
Similarity search on time series is a frequent operation in large-scale data-driven applications. Sophisticated similarity measures are standard for time series matching because the series being compared are usually misaligned. Dynamic Time Warping (DTW) is the most widely used similarity measure for time series because it performs alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch-and-bound pruning strategies are adopted. However, branch-and-bound pruning is only useful for very short queries (low-dimensional time series), and the bounds are quite weak for longer queries. With such loose bounds, branch-and-bound pruning degenerates into a brute-force search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hash), an efficient approximate hashing scheme that is much faster than the state-of-the-art branch-and-bound search technique, the UCR suite. SSH uses a novel combination of sketching, shingling, and hashing techniques to produce (probabilistic) indexes that align (near perfectly) with the DTW similarity measure. The generated indexes are then used to create hash buckets for sub-linear search. Our results show that SSH is very effective for longer time series and prunes around 95% of candidates, leading to a massive speedup in search with DTW. Empirical results on two large-scale benchmark time series datasets show that our proposed method can be around 20 times faster than the state-of-the-art package (UCR suite) without any significant loss in accuracy.
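To make the cost of alignment concrete, here is a minimal sketch of the textbook dynamic-programming form of DTW. It is not the optimized implementation in the UCR suite; the function name `dtw_distance` and the squared-difference local cost are assumptions chosen for illustration.

```python
import numpy as np


def dtw_distance(a, b):
    """Textbook DTW: fill an (n+1) x (m+1) table, so the cost is O(n * m)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local cost of matching a[i-1] with b[j-1] (assumed: squared difference).
            d = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed warping moves.
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(np.sqrt(cost[n, m]))
```

Two copies of the same shape shifted by one step, such as `[0, 0, 1, 2, 1, 0]` and `[0, 1, 2, 1, 0, 0]`, have a DTW distance of zero under this sketch even though their pointwise Euclidean distance is 2. The quadratic table is what makes a full scan expensive for long queries.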
Motivation & Objective
- To address the scalability limitations of branch-and-bound methods for DTW similarity search on long time series.
- To design a data-independent hashing scheme that preserves DTW alignment properties without requiring expensive retraining.
- To enable sub-linear search performance for massive-scale time series workloads with minimal accuracy loss.
- To overcome the curse of dimensionality in similarity search by leveraging randomized, distribution-agnostic indexing.
Proposed method
- Applying random filters of dimension W to generate 1-bit sketches from time series, capturing local temporal patterns.
- Creating higher-order shingles (n-grams) from the sketch bit-strings to form weighted sets representing local structures.
- Using weighted minwise hashing on the shingle sets to generate locality-sensitive hash codes for efficient bucketing.
- Employing a sliding window with shift size δ to generate multiple sketches and improve robustness to temporal shifts.
- Constructing hash buckets from the resulting hash codes to enable sub-linear candidate retrieval.
- Using a three-stage pipeline: sketching (random filtering), shingling (n-gram extraction), and hashing (minwise hashing) for alignment-aware indexing.
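The three stages listed above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the random filter is assumed to be a Gaussian vector, and weighted minwise hashing is approximated by expanding integer shingle counts into repeated elements and applying ordinary minhash, which may differ from the scheme used in the paper.

```python
import hashlib
from collections import Counter, defaultdict

import numpy as np


def sketch(series, filt, delta):
    """Slide the filter over the series in steps of delta; keep the sign of each projection as one bit."""
    series = np.asarray(series, dtype=float)
    w = len(filt)
    return [
        1 if float(np.dot(filt, series[s:s + w])) >= 0 else 0
        for s in range(0, len(series) - w + 1, delta)
    ]


def shingle(bits, n):
    """Count every length-n run of bits (n-gram); the counts form a weighted set."""
    return Counter(tuple(bits[i:i + n]) for i in range(len(bits) - n + 1))


def _hash(seed, item):
    """Deterministic 64-bit hash of (seed, item), used as a pseudo-random ordering."""
    data = repr((seed, item)).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def weighted_minhash(weights, seed):
    """Stand-in for weighted minwise hashing (assumption): a shingle with count c
    becomes c distinct elements (shingle, 0..c-1), then ordinary minhash is applied.
    Assumes the weighted set is non-empty, i.e. the series is long enough."""
    expanded = ((sh, k) for sh, c in weights.items() for k in range(c))
    return min(expanded, key=lambda item: _hash(seed, item))


def ssh_signature(series, filt, delta, n, num_hashes):
    """Sketch, shingle, then hash: one minhash code per seed."""
    weights = shingle(sketch(series, filt, delta), n)
    return tuple(weighted_minhash(weights, seed) for seed in range(num_hashes))


def build_index(database, filt, delta, n, num_hashes):
    """One hash table per seed; bucket key is (table id, hash code)."""
    buckets = defaultdict(set)
    for idx, series in enumerate(database):
        for t, code in enumerate(ssh_signature(series, filt, delta, n, num_hashes)):
            buckets[(t, code)].add(idx)
    return buckets


def candidates(query, buckets, filt, delta, n, num_hashes):
    """Return ids of indexed series that share a bucket with the query in any table."""
    found = set()
    for t, code in enumerate(ssh_signature(query, filt, delta, n, num_hashes)):
        found |= buckets.get((t, code), set())
    return found
```

In this sketch the same filter, δ, and n must be used for indexing and querying. Only the returned candidates would then be compared with exact DTW, which is where the pruning described in the paper comes from.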
Experimental results
Research questions
- RQ1: Can a data-independent hashing scheme be designed to align closely with DTW similarity, even for long time series?
- RQ2: How does the performance of hashing-based pruning compare to branch-and-bound methods on long queries?
- RQ3: What parameter choices (W, δ, n) optimize accuracy and efficiency in the SSH framework?
- RQ4: Can the proposed method scale to datasets with millions of time series without retraining?
- RQ5: Does the method maintain high accuracy while achieving sub-linear search complexity?
Key findings
- SSH achieves up to 20x speedup over the UCR suite, the current state-of-the-art for DTW search, on two large-scale time series benchmarks.
- The method prunes approximately 95% of candidate time series, significantly reducing search cost.
- With well-chosen parameters, top-50 search accuracy stays close to 1.0 on both the ECG and random walk datasets.
- The optimal filter dimension W is 80 for ECG and 30 for random walk, with accuracy peaking and then declining for larger W.
- The optimal shift size δ is 3 for ECG and 5 for random walk, balancing accuracy and preprocessing cost.
- Preprocessing time scales linearly with W and n, confirming the method’s efficiency for large-scale indexing.
This review was created by AI and reviewed by human editors.