String Indexing and Compression

Max Harry Rishøj Pedersen

Research output: Book/Report › Ph.D. thesis


Abstract

We study the design of efficient algorithms and data structures for variants of combinatorial pattern matching, as well as compression of strings and string indices.

String Indexing for Top-k Close Consecutive Occurrences

We study the string indexing for top-k close consecutive occurrences problem, where the goal is to preprocess a given string S into a compact data structure that efficiently supports the query: given a string P and a positive integer k, report the k closest consecutive occurrences of P in S. Here, a consecutive occurrence is a pair (i, j), i < j, such that P occurs at positions i and j in S and there is no occurrence of P between i and j; its distance is defined as j − i. We present three time-space trade-offs. Let n be the length of S, m the length of P, and ϵ ∈ (0, 1]. Our first result achieves O(n log n) space and optimal query time O(m + k). Our second and third results achieve linear space and query time either O(m + k^{1+ϵ}) or O(m + k log^{1+ϵ} n). We also extend our solutions to several related problems.

Gapped Indexing for Consecutive Occurrences

We study a variant of string indexing where the goal is to compactly represent a string such that, given two patterns P1 and P2 and a gap range [α, β], we can quickly find the consecutive occurrences of P1 and P2 with distances in [α, β], i.e., pairs of subsequent occurrences of the two patterns that are between α and β characters apart. We present data structures that use Õ(n) space and Õ(|P1| + |P2| + n^{2/3}) time for existence and counting queries, and Õ(|P1| + |P2| + n^{2/3} occ^{1/3}) time for reporting queries. We complement this with a conditional lower bound, based on the set intersection problem, showing that any solution using Õ(n) space must use Ω(|P1| + |P2| + √n) query time.

Sliding Window String Indexing

We study the streaming sliding window string indexing problem, where a string S arrives as a stream and the goal is to maintain an index of the last w characters, called the window. At any point in time a pattern matching query for a pattern P may arrive, also streamed, and all occurrences of P within the window must be returned. We present a simple O(w)-space data structure that uses O(log w) time with high probability to process each character from either stream, plus worst-case constant additional time per reported occurrence. Compared to previous work in similar scenarios, this result is the first to achieve an efficient worst-case time per character from the input stream with high probability. We also consider a delayed variant of the problem, where a query may be answered at any point within the next δ characters that arrive from either stream. We present an O(w + δ)-space data structure for this problem that improves the above time bound to O(log(w/δ)). In particular, for a delay of δ = ϵw we obtain an O(w)-space data structure with constant processing time per character.

New Advances in Rightmost Lempel-Ziv

The Lempel-Ziv (LZ) 77 factorization of a string S of length n, over a linearly-sortable alphabet, can be computed in O(n) time. It is unknown whether this time can be achieved for the rightmost LZ parsing, where each referencing phrase points to its rightmost previous occurrence. The currently best solution takes O(n(1 + log σ / √(log n))) time [Belazzougui & Puglisi, SODA 2016]. We show that this problem is much easier to solve for the LZ-End factorization [Kreft & Navarro, DCC 2010], where the rightmost factorization can be obtained in O(n) time for the greedy parsing (with phrases of maximal length), and in O(n + z√(log z)) time for any LZ-End parsing with z phrases. We also make advances towards a linear-time solution for the general case. We show how to find the rightmost occurrences for several non-trivial subsets of the phrases of any LZ-like parsing in O(n) time. As a prime example, we can find the rightmost occurrence of all phrases of length Ω(log^{6.66} n / log^2 σ) in O(n / log_σ n) time and space.
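
To make the rightmost-parsing notion above concrete, the following is a small brute-force sketch, intended only as an illustration of the definitions: it computes a greedy, self-referential LZ77 parsing and, for each referencing phrase, picks the rightmost earlier occurrence of that phrase. It runs in roughly quadratic time rather than the near-linear bounds discussed above, it is not the LZ-End variant, and it is not the thesis's algorithm; all names are illustrative.

```python
# Brute-force illustration of the *rightmost* LZ77 parsing: each referencing
# phrase of a greedy LZ77 factorization points to the rightmost earlier
# occurrence of that phrase. Naive and slow; only meant to pin down the notion.

def match_len(S: str, p: int, i: int) -> int:
    """Length of the longest common prefix of S[p:] and S[i:] (overlap allowed)."""
    l = 0
    while i + l < len(S) and S[p + l] == S[i + l]:
        l += 1
    return l

def rightmost_lz77(S: str):
    """Phrases (start, length, rightmost source) of a greedy LZ77 parsing;
    the source is None for literal phrases consisting of a fresh character."""
    phrases, i = [], 0
    while i < len(S):
        length = max((match_len(S, p, i) for p in range(i)), default=0)
        if length == 0:                      # literal phrase of one character
            phrases.append((i, 1, None))
            i += 1
            continue
        # rightmost position p < i where a copy of the maximal-length phrase starts
        src = max(p for p in range(i) if match_len(S, p, i) >= length)
        phrases.append((i, length, src))
        i += length
    return phrases

if __name__ == "__main__":
    # e.g. [(0, 1, None), (1, 1, None), (2, 4, 0), (6, 2, 3)]
    print(rightmost_lz77("abababba"))
```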
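
Similarly, to pin down the consecutive-occurrence definition used in the first two results above, here is a minimal brute-force sketch with no preprocessing and none of the stated query bounds (the function names and the choice of Python are illustrative assumptions): it lists all occurrences of P in S, pairs neighbouring occurrences, and returns the k pairs with smallest distance j − i.

```python
# Brute-force illustration of "top-k close consecutive occurrences".
# A query here costs O(nm + occ log occ) time; the thesis's indexes answer
# the same query in, e.g., O(m + k) time after preprocessing S.

def occurrences(S: str, P: str) -> list[int]:
    """All starting positions of P in S, in increasing order."""
    return [i for i in range(len(S) - len(P) + 1) if S[i:i + len(P)] == P]

def top_k_close_consecutive(S: str, P: str, k: int) -> list[tuple[int, int]]:
    """The k consecutive occurrences (i, j) of P in S with smallest distance j - i."""
    occ = occurrences(S, P)
    consecutive = list(zip(occ, occ[1:]))           # neighbouring occurrences only
    consecutive.sort(key=lambda ij: ij[1] - ij[0])  # order by distance j - i
    return consecutive[:k]

if __name__ == "__main__":
    print(top_k_close_consecutive("abababbaba", "ab", 2))   # [(0, 2), (2, 4)]
```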

Fast Compression of Deterministic Finite Automata

The delayed deterministic finite automaton [Kumar et al., SIGCOMM 2006] is an effective compressed representation of deterministic finite automata that forms the basis of numerous subsequent compression results. The compression algorithm, and later algorithms based on the same idea, have an inherent quadratic-time bottleneck, as they consider every pair of states to compute the optimal compression. We present a simple, general framework based on locality-sensitive hashing for circumventing this bottleneck. We apply it to speed up several algorithms and experimentally evaluate them on real-world regular expression sets. We obtain an order-of-magnitude improvement in compression time with little to no loss of compression, and in some cases even significantly better compression.
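
As a rough sketch of the locality-sensitive-hashing idea behind this framework (the hash family, parameter values, and function names below are assumptions for illustration, not the thesis's implementation): hash every state by its transition targets on a few randomly sampled alphabet symbols, repeat with several independent samples, and compare exactly only the states that collide in some bucket. States whose transition rows agree on many symbols are likely to collide at least once.

```python
# Toy locality-sensitive-hashing sketch for finding pairs of DFA states with
# many identical outgoing transitions, without comparing all O(|Q|^2) pairs.

import random
from collections import defaultdict
from itertools import combinations

def similar_state_pairs(delta, alphabet, num_tables=8, symbols_per_table=4, seed=0):
    """Candidate pairs of states with similar transition rows.
    delta: dict mapping state -> dict mapping symbol -> target state (total DFA).
    alphabet: list of input symbols."""
    rng = random.Random(seed)
    candidates = set()
    for _ in range(num_tables):
        sample = rng.sample(alphabet, min(symbols_per_table, len(alphabet)))
        buckets = defaultdict(list)
        for q, row in delta.items():
            key = tuple(row[a] for a in sample)   # hash = targets on the sampled symbols
            buckets[key].append(q)
        for bucket in buckets.values():           # only states in the same bucket
            candidates.update(combinations(sorted(bucket), 2))
    return candidates

def shared_transitions(delta, q1, q2, alphabet):
    """Number of symbols on which two states have identical transitions."""
    return sum(delta[q1][a] == delta[q2][a] for a in alphabet)
```

Running shared_transitions only on the returned candidate pairs replaces the all-pairs comparison that causes the quadratic bottleneck; how the candidates then drive the delayed-DFA default-transition construction is outside this sketch.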

Original language: English
Publisher: Technical University of Denmark
Number of pages: 103
Publication status: Published - 2023

Projects

  • Adaptive Compressed Computation

    Pedersen, M. H. R. (PhD Student), Gørtz, I. L. (Main Supervisor), Bille, P. (Supervisor), Puglisi, S. J. (Examiner) & Weimann, O. (Examiner)

    01/02/2020 - 15/01/2024

    Project: PhD
