Abstract
Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching.
Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log(w/δ)). In particular, for a delay of δ = ϵw we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ) space data structure for this problem that improves the above time bounds to O(log(w/δ)). In particular, for a delay of δ = ϵw we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.
Original language | English |
---|---|
Title of host publication | Proceedings of the 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023) |
Number of pages | 18 |
Volume | 259 |
Publisher | Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing |
Publication date | 2023 |
Article number | 4 |
ISBN (Electronic) | 978-3-95977-276-1 |
DOIs | |
Publication status | Published - 2023 |
Event | 34th Annual Symposium on Combinatorial Pattern Matching - Marne-la-Vallée, France Duration: 26 Jun 2023 → 28 Jun 2023 |
Conference
Conference | 34th Annual Symposium on Combinatorial Pattern Matching |
---|---|
Country/Territory | France |
City | Marne-la-Vallée |
Period | 26/06/2023 → 28/06/2023 |
Keywords
- String indexing
- Pattern matching
- Sliding window
- Streaming