Mining Massive-Scale Time Series Data using Hashing

[Abstract] Similarity search on time series is a frequent operation in large-scale data-driven applications. Because time series are usually misaligned, sophisticated similarity measures are standard for time series matching. Dynamic Time Warping (DTW) is the most widely used similarity measure for time series because it combines alignment and matching at the same time. However, the alignment makes DTW slow. To speed up the expensive similarity search with DTW, branch and bound based pruning strategies are adopted. However, branch and bound based pruning is only useful for very short queries (low-dimensional time series), and the bounds are quite weak for longer queries. Because of the loose bounds, the branch and bound pruning strategy boils down to a brute-force search. To circumvent this issue, we design SSH (Sketch, Shingle, & Hashing), an efficient and approximate hashing scheme which is much faster than the state-of-the-art branch and bound searching technique: the UCR suite. SSH uses a novel combination of sketching, shingling and hashing techniques to produce (probabilistic) indexes which align (near perfectly) with the DTW similarity measure. The generated indexes are then used to create hash buckets for sub-linear search. Empirical results on two large-scale benchmark time series datasets show that our proposed method prunes around 95% of time series candidates and can be around 20 times faster than the state-of-the-art package (UCR suite) without any significant loss in accuracy.
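
To illustrate why the alignment step makes DTW costly, the following is a minimal sketch of the textbook dynamic-programming formulation of DTW. It is not the SSH method and not the UCR suite; the function name `dtw_distance` and the choice of absolute difference as the point-wise cost are assumptions made for this example only.

```python
import math


def dtw_distance(a, b):
    """Textbook DTW between two numeric sequences (illustrative sketch).

    Assumption: point-wise cost is |x - y|; no warping-window constraint.
    Runs in O(len(a) * len(b)) time, which is the cost that pruning or
    hashing schemes try to avoid paying for every candidate.
    """
    n, m = len(a), len(b)
    # prev[j] holds the best cumulative cost of aligning a[:i-1] with b[:j].
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: advance in a only, advance in b only,
            # or advance in both sequences.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def lockstep_distance(a, b):
    """Point-by-point L1 distance for equal-length sequences (no alignment)."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

For two sequences that have the same shape but are shifted in time, such as `[1, 2, 3, 3]` and `[1, 1, 2, 3]`, the lockstep distance is nonzero while the DTW distance is zero, because DTW is free to stretch one sequence against the other. The quadratic table that makes this possible is what a search procedure must compute for every candidate it cannot prune.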

[Publication date] [Publishing institution] Rice University

[Validity level] Series [Subject classification]

[Keywords] [Timeliness]
