Algorithms for Strings and Modern Data Models

Tord Joakim Stordalen

Research output: Book/ReportPh.D. thesis

11 Downloads (Pure)

Abstract

We study problems related to pattern matching in strings, and problems in modern data models. Specifically, we study the predecessor problem in a model of computation that captures modern vector processor architectures, the problem of compactly and efficiently counting the co-occurrences of a set of characters in a string, pattern matching in a sliding window over a stream, and the problem of supporting rank and select on degenerate strings.

Predecessor on the Ultra-Wide Word RAM We consider the predecessor problem on the ultra-wide word RAM model of computation, which extends the word RAM model with ultrawords consisting of w2 bits [TAMC, 2015]. The model supports arithmetic and boolean operations on ultrawords, in addition to scattered memory operations that access or modify w (potentially non-contiguous) memory addresses simultaneously. The ultra-wide word RAM model captures (and idealizes) modern vector processor architectures. Our main result is a simple, linear space data structure that supports predecessor in constant time and updates in amortized, expected constant time. This improves the space of the previous constant time solution that uses space in the order of the size of the universe. Our result holds even in a weaker model where ultrawords consist of w1+ϵ bits for any ϵ > 0. It is based on a new implementation of the classic x-fast trie data structure of Willard [Inform. Process. Lett. 17(2), 1983] combined with a new dictionary data structure that supports fast parallel lookups.

The Complexity of the Co-Occurrence Problem Let S be a string of length n over an alphabet Σ and let Q be a subset of Σ of size q ≥ 2. The co-occurrence problem is to construct a compact data structure that supports the following query: given an integer w return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an O(√nq) space data structure that — with some minor additions — supports queries in O(log log n) time [CPM 2021]. Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in O(log log n) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses O(d + q) space in addition to the input. Furthermore, we show that O(d) space is optimal and that O(log log n)-time queries are optimal given optimal space. Secondly, we bound d = O(√nq), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that Ω(√nq) bits of space is necessary in the worst case, meaning that the O(√nq) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.

Sliding Window String Indexing in Streams Given a string S over an alphabet Σ, the string indexing problem is to preprocess S to subsequently support efficient pattern matching queries, that is, given a pattern string P report all the occurrences of P in S. In this paper we study the streaming sliding window string indexing problem. Here the string S arrives as a stream, one character at a time, and the goal is to maintain an index of the last w characters, called the window, for a specified parameter w. At any point in time a pattern matching query for a pattern P may arrive, also streamed one character at a time, and all occurrences of P within the current window must be returned. The streaming sliding window string indexing problem naturally captures scenarios where we want to index the most recent data (i.e. the window) of a stream while supporting efficient pattern matching. Our main result is a simple O(w) space data structure that uses O(log w) time with high probability to process each character from both the input string S and any pattern string P. Reporting each occurrence of P uses additional constant time per reported occurrence. Compared to previous work in similar scenarios this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w+δ) space data structure for this problem that improves the above time bounds to O(log(w/δ)). In particular, for a delay of δ = ϵw we obtain an O(w) space data structure with constant time processing per character. The key idea to achieve our result is a novel and simple hierarchical structure of suffix trees of independent interest, inspired by the classic log-structured merge trees.

Rank and Select on Degenerate Strings A degenerate string is a sequence of subsets of some alphabet; it represents any string obtainable by selecting one character from each set from left to right. Recently, Alanko et al. generalized the rank-select problem to degenerate strings, where given a character c and position i the goal is to find either the ith set containing c or the number of occurrences of c in the first i sets [SEA 2023]. The problem has applications to pangenomics; in another work by Alanko et al. they use it as the basis for a compact representation of de Bruijn Graphs that supports fast membership queries. In this paper we revisit the rank-select problem on degenerate strings, introducing a new, natural parameter and reanalyzing existing reductions to rank-select on regular strings. Plugging in standard data structures, the time bounds for queries are improved exponentially while essentially matching, or improving, the space bounds. Furthermore, we provide a lower bound on space that shows that the reductions lead to succinct data structures in a wide range of cases. Finally, we provide implementations; our most compact structure matches the space of the most compact structure of Alanko et al. while answering queries twice as fast. We also provide an implementation using modern vector processing features; it uses less than one percent more space than the most compact structure of Alanko et al. while supporting queries four to seven times faster, and has competitive query time with all the remaining structures.
Original languageEnglish
PublisherTechnical University of Denmark
Number of pages80
Publication statusPublished - 2023

Fingerprint

Dive into the research topics of 'Algorithms for Strings and Modern Data Models'. Together they form a unique fingerprint.

Cite this