QUICK REVIEW

[Paper Review] SSH (Sketch, Shingle, & Hash) for Indexing Massive-Scale Time Series

Luo Chen, Anshumali Shrivastava|arXiv (Cornell University)|Jan 1, 2016

Time Series Analysis and Forecasting|References 48|Cited by 13

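One-Sentence Summary

This paper proposes SSH (Sketch, Shingle, & Hash), a novel data-independent hashing scheme that enables sub-linear similarity search over large-scale time series under Dynamic Time Warping (DTW). By combining sketching, shingling, and weighted minwise hashing, SSH produces approximate indexes that align closely with DTW similarity. It prunes about 95% of candidates and achieves up to 20x speedup, even for long queries where branch-and-bound methods break down.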

ABSTRACT

Similarity search on time series is a frequent operation in large-scale data-driven applications. Sophisticated similarity measures are standard for time series matching, as they are usually misaligned. Dynamic Time Warping or DTW is the most widely used similarity measure for time series because it combines alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch and bound based pruning strategies are adopted. However, branch and bound based pruning are only useful for very short queries (low dimensional time series), and the bounds are quite weak for longer queries. Due to the loose bounds branch and bound pruning strategy boils down to a brute-force search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hashing), an efficient and approximate hashing scheme which is much faster than the state-of-the-art branch and bound searching technique: the UCR suite. SSH uses a novel combination of sketching, shingling and hashing techniques to produce (probabilistic) indexes which align (near perfectly) with DTW similarity measure. The generated indexes are then used to create hash buckets for sub-linear search. Our results show that SSH is very effective for longer time sequence and prunes around 95% candidates, leading to the massive speedup in search with DTW. Empirical results on two large-scale benchmark time series data show that our proposed method can be around 20 times faster than the state-of-the-art package (UCR suite) without any significant loss in accuracy.

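Motivation and Objectives

Address the scalability limits of branch-and-bound methods for DTW similarity search on long time series.
Design a data-independent hashing scheme that preserves DTW's alignment properties without costly retraining.
Achieve sub-linear search performance on very large time series workloads with minimal loss of accuracy.
Overcome the curse of dimensionality in similarity search by using randomized, distribution-independent indexing.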

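Proposed Method

Apply a random filter of dimension W to the time series to produce 1-bit sketches that capture local temporal patterns.
Extract higher-order shingles (n-grams) from the sketched bit string to form a weighted set representing local structure.
Apply weighted minwise hashing to the shingle set to generate locality-sensitive hash codes for efficient bucketing.
Slide the filter across the series with step size δ to produce multiple sketch bits, improving robustness to time shifts.
Build hash buckets from the generated hash codes to support sub-linear candidate retrieval.
Use a three-stage pipeline of sketching (random filtering), shingling (n-gram extraction), and hashing (minwise hashing) to build alignment-sensitive indexes.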
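The three stages above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`make_filter`, `sketch`, `shingle`, `weighted_minhash`, `ssh_signature`) are hypothetical, the filter is assumed to be Gaussian, and the weighted minwise hashing step is replaced by a simpler integer-weight expansion (each shingle with count c becomes c distinct elements, followed by ordinary minhash), which may differ from the weighted scheme used in the paper. The series is assumed to be long enough to yield at least one shingle.

```python
import hashlib
import math
import random
from collections import Counter

def make_filter(w, seed=0):
    """Random filter of dimension W (assumed Gaussian here)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(w)]

def sketch(series, filt, step):
    """Slide the filter with step size delta; keep the sign of each projection as 1 bit."""
    w = len(filt)
    bits = []
    for start in range(0, len(series) - w + 1, step):
        proj = sum(f * x for f, x in zip(filt, series[start:start + w]))
        bits.append(1 if proj >= 0 else 0)
    return bits

def shingle(bits, n):
    """Count the n-grams of the bit string, giving a weighted set {shingle: count}."""
    return Counter(tuple(bits[i:i + n]) for i in range(len(bits) - n + 1))

def _h(seed, k, item):
    """Deterministic 64-bit hash of (seed, hash index, item)."""
    data = repr((seed, k, item)).encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

def weighted_minhash(weighted_set, num_hashes, seed=0):
    """Simplified weighted minhash: expand integer weights, then take plain minhash."""
    expanded = [(s, i) for s, c in weighted_set.items() for i in range(c)]
    return tuple(min(_h(seed, k, e) for e in expanded) for k in range(num_hashes))

def ssh_signature(series, filt, step, n, num_hashes):
    """Sketch -> shingle -> hash, returning a tuple of num_hashes hash values."""
    return weighted_minhash(shingle(sketch(series, filt, step), n), num_hashes)

# Illustrative parameters only (not the tuned values reported for ECG or random walk data).
series = [math.sin(0.1 * t) for t in range(200)]
filt = make_filter(16, seed=1)
signature = ssh_signature(series, filt, step=3, n=4, num_hashes=8)
```

With integer weights, two series collide on a single hash value with probability equal to the weighted Jaccard similarity of their shingle counts, which is why the expansion trick is a reasonable stand-in for illustration.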

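Experimental Results

Research Questions

RQ1: Can a data-independent hashing scheme be designed that still aligns closely with DTW similarity on long time series?
RQ2: How does hashing-based pruning compare with branch-and-bound methods for long queries?
RQ3: How do the parameter choices (W, δ, n) in the SSH framework balance accuracy against efficiency?
RQ4: Can the proposed method scale to datasets with millions of time series without retraining?
RQ5: Does the method maintain high accuracy while achieving sub-linear search complexity?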

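Key Findings

On two large-scale time series benchmarks, SSH is up to 20x faster than the UCR suite, the state-of-the-art package for DTW search.
The method prunes about 95% of candidate time series, substantially reducing search cost.
With the best parameter settings, top-50 search accuracy is near perfect (close to 1.0), with the best results at W=80 on ECG data and W=30 on random walk data.
The best filter dimension W is 80 for ECG data and 30 for random walk data; accuracy first rises as W grows and then falls once W becomes too large.
The best step size δ is 3 for ECG data and 5 for random walk data, a setting that balances accuracy against preprocessing cost.
Preprocessing time scales linearly with W and n, supporting the method's efficiency for large-scale indexing.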
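To make the pruning figure concrete, the following minimal sketch shows how hash buckets can limit DTW computations to a small candidate set. It is an illustration under assumptions rather than the paper's code: the names `build_index`, `candidates`, and `search` are hypothetical, signatures are assumed to be precomputed tuples of hash values, the banding layout (one table per group of `band_size` hash values) is a standard LSH arrangement chosen for illustration, and `dtw` is the textbook dynamic program with squared point cost and no warping window.

```python
from collections import defaultdict

def dtw(a, b):
    """Textbook DTW distance with squared point cost, O(len(a) * len(b)) time."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for x in a:
        cur = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = (x - y) ** 2
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[-1]

def build_index(signatures, band_size):
    """Map each band of a signature to the set of series ids sharing that band."""
    tables = defaultdict(set)
    for sid, sig in signatures.items():
        for b in range(0, len(sig) - band_size + 1, band_size):
            tables[(b, tuple(sig[b:b + band_size]))].add(sid)
    return tables

def candidates(tables, query_sig, band_size):
    """Union of all buckets the query signature falls into."""
    found = set()
    for b in range(0, len(query_sig) - band_size + 1, band_size):
        found |= tables.get((b, tuple(query_sig[b:b + band_size])), set())
    return found

def search(query, query_sig, data, tables, band_size, k):
    """Rank only the bucket candidates by exact DTW and return the top k ids."""
    cands = candidates(tables, query_sig, band_size)
    return sorted(cands, key=lambda sid: dtw(query, data[sid]))[:k]

# Toy data with made-up signatures, purely for illustration.
data = {"a": [1, 2, 3, 4], "b": [1, 2, 2, 3, 5], "c": [9, 9, 9, 9]}
signatures = {"a": (1, 2, 3, 4), "b": (1, 2, 7, 8), "c": (5, 6, 3, 9)}
tables = build_index(signatures, band_size=2)
top = search([1, 2, 3, 4], (1, 2, 3, 4), data, tables, band_size=2, k=2)
```

In this toy example series "c" shares no band with the query, so it is never passed to `dtw`; on a large collection, skipping most series in this way is what yields the speedup.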


This review was generated by AI and checked by human editors.