Projects per year
Abstract
In this dissertation we study the design of efﬁcient algorithms, data structures, and protocols for computation on compressed data and manipulation of long sequences. We consider the following topics: TimeSpace TradeOffs for Lempel–Ziv Compressed Indexing Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel–Ziv 1977 compression scheme. We obtain the following timespace tradeoffs: For constantsized alphabets (i) O(m + occlglgn) time using O(z lg(n/z)lglgz) space, or (ii) O(m(1 + lgε z lg(n/z)) + occ(lglgn + lgε z)) time using O(z lg(n/z)) space,
For integer alphabets polynomially bounded by n(iii) O(m(1 + lgε z lg(n/z)) + occ(lglgn + lgε z)) time using O(z(lg(n/z) + lglgz)) space, or(iv) O(m + occ(lglgn + lgε z)) time using O(z(lg(n/z) + lgε z)) space, where n and m are the length of the input string and query string respectively, z is the number of phrases in the LZ77 parse of the input string, occ is the number of occurrences of the query in the input and ε > 0 is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from O(mlgm) to O(m) at the cost of increasing the space by a factor lglgz. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lgε z lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n1−δ), for constant δ > 0, this becomes O(m). Our index also supports extraction of any substring of length ` in O(` + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak preﬁx search.
Compressed Indexing with Signature Grammars The compressed indexing problem is to preprocess a string S of length n into a compressed representation that supports pattern matching queries. That is, given a string P of length m report all occurrences of P in S. We present a data structure that supports pattern matching queries in O(m + occ(lglgn+lgε z)) time using O(z lg(n/z)) space where z is the size of the LZ77 parse of S and ε > 0 is an arbitrarily small constant, when the alphabet is small or z = O(n1−δ) for any constant δ > 0. We also present two data structures for the general case; one where the space is increased by O(z lglgz), and one where the query time changes from worstcase to expected. These results improve the previously best known solutions. Notably, this is the ﬁrst data structure that decides if P occurs in S in O(m) time using O(z lg(n/z)) space. Our results are mainly obtained by a novel combination of a randomized grammar construction algorithm with well known techniques relating pattern matching to 2D range reporting.
Decompressing LempelZiv Compressed Text We consider the problem of decompressing the Lempel–Ziv 77 representation of a string S of length n using a working space as close as possible to the size z of the input. The folklore solution for the problem runs in O(n) time but requires random access to the whole decompressed text. A better solution is to convert LZ77 into a grammar of size O(z lg(n/z)) and then stream S in linear time. In this paper, we show that O(n) time and O(z) working space can be achieved for constantsize alphabets. On larger alphabets, we describe (i) a tradeoff achieving O(nlgδ σ) time and O(z lg1−δ σ) space for any 0 ≤ δ ≤ 1 where σ is the size of the alphabet, and (ii) a solution achieving O(n) timeand O(z lglgn) space. Our solutions can, more generally, extract any speciﬁed subsequence of S with little overheads on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77compressed text.
Compressed Communication Complexity of Longest Common Preﬁxes We consider the communication complexity of fundamental longest common preﬁx (LCP) problems. In the simplest version, two parties, Alice and Bob, each hold a string, A and B, and we want to determine the length of their longest common preﬁx ` = LCP(A,B) using as few rounds and bits of communication as possible. We show that if the longest common preﬁx of A and B is compressible, then we can signiﬁcantly reduce the number of rounds compared to the optimal uncompressed protocol, while achieving the same (or fewer) bits of communication. Namely, if the longest common preﬁx has an LZ77 parse of z phrases, only O(lgz) rounds and O(lg`) total communication is necessary. We extend the result to the natural case when Bob holds a set of strings B1,...,Bk, and the goal is to ﬁnd the length of the maximal longest preﬁx shared by A and any of B1,...,Bk. Here, we give a protocol with O(lgz) rounds and O(lgz lgk + lg`) total communication. We present our result in the publiccoin model of computation but by a standard technique our results generalize to the privatecoin model. Furthermore, if we view the input strings as integers the problems are the greaterthan problem and the predecessor problem.
Fast Dynamic Arrays We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of n elements supporting access in time O(1) and insertion and deletion in time O(nε) for ε > 0 while using o(n) extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and multiset from the standard library on sequences with up to 108 elements. Our fastest implementation uses much less space than multiset while providing speedups of 40×for access operations compared to multiset and speedups of 10.000×compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations.
For integer alphabets polynomially bounded by n(iii) O(m(1 + lgε z lg(n/z)) + occ(lglgn + lgε z)) time using O(z(lg(n/z) + lglgz)) space, or(iv) O(m + occ(lglgn + lgε z)) time using O(z(lg(n/z) + lgε z)) space, where n and m are the length of the input string and query string respectively, z is the number of phrases in the LZ77 parse of the input string, occ is the number of occurrences of the query in the input and ε > 0 is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from O(mlgm) to O(m) at the cost of increasing the space by a factor lglgz. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lgε z lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n1−δ), for constant δ > 0, this becomes O(m). Our index also supports extraction of any substring of length ` in O(` + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak preﬁx search.
Compressed Indexing with Signature Grammars The compressed indexing problem is to preprocess a string S of length n into a compressed representation that supports pattern matching queries. That is, given a string P of length m report all occurrences of P in S. We present a data structure that supports pattern matching queries in O(m + occ(lglgn+lgε z)) time using O(z lg(n/z)) space where z is the size of the LZ77 parse of S and ε > 0 is an arbitrarily small constant, when the alphabet is small or z = O(n1−δ) for any constant δ > 0. We also present two data structures for the general case; one where the space is increased by O(z lglgz), and one where the query time changes from worstcase to expected. These results improve the previously best known solutions. Notably, this is the ﬁrst data structure that decides if P occurs in S in O(m) time using O(z lg(n/z)) space. Our results are mainly obtained by a novel combination of a randomized grammar construction algorithm with well known techniques relating pattern matching to 2D range reporting.
Decompressing LempelZiv Compressed Text We consider the problem of decompressing the Lempel–Ziv 77 representation of a string S of length n using a working space as close as possible to the size z of the input. The folklore solution for the problem runs in O(n) time but requires random access to the whole decompressed text. A better solution is to convert LZ77 into a grammar of size O(z lg(n/z)) and then stream S in linear time. In this paper, we show that O(n) time and O(z) working space can be achieved for constantsize alphabets. On larger alphabets, we describe (i) a tradeoff achieving O(nlgδ σ) time and O(z lg1−δ σ) space for any 0 ≤ δ ≤ 1 where σ is the size of the alphabet, and (ii) a solution achieving O(n) timeand O(z lglgn) space. Our solutions can, more generally, extract any speciﬁed subsequence of S with little overheads on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77compressed text.
Compressed Communication Complexity of Longest Common Preﬁxes We consider the communication complexity of fundamental longest common preﬁx (LCP) problems. In the simplest version, two parties, Alice and Bob, each hold a string, A and B, and we want to determine the length of their longest common preﬁx ` = LCP(A,B) using as few rounds and bits of communication as possible. We show that if the longest common preﬁx of A and B is compressible, then we can signiﬁcantly reduce the number of rounds compared to the optimal uncompressed protocol, while achieving the same (or fewer) bits of communication. Namely, if the longest common preﬁx has an LZ77 parse of z phrases, only O(lgz) rounds and O(lg`) total communication is necessary. We extend the result to the natural case when Bob holds a set of strings B1,...,Bk, and the goal is to ﬁnd the length of the maximal longest preﬁx shared by A and any of B1,...,Bk. Here, we give a protocol with O(lgz) rounds and O(lgz lgk + lg`) total communication. We present our result in the publiccoin model of computation but by a standard technique our results generalize to the privatecoin model. Furthermore, if we view the input strings as integers the problems are the greaterthan problem and the predecessor problem.
Fast Dynamic Arrays We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of n elements supporting access in time O(1) and insertion and deletion in time O(nε) for ε > 0 while using o(n) extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and multiset from the standard library on sequences with up to 108 elements. Our fastest implementation uses much less space than multiset while providing speedups of 40×for access operations compared to multiset and speedups of 10.000×compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations.
Original language  English 

Publisher  DTU Compute 

Number of pages  100 
Publication status  Published  2018 
Series  DTU Compute PHD2018 

Volume  490 
ISSN  09093192 
Fingerprint Dive into the research topics of 'Compressed and efficient algorithms and data structures for strings'. Together they form a unique fingerprint.
Projects
 1 Finished

Compressed Computation on Structured Data
Ettienne, M. B., Bille, P., Gørtz, I. L., Witt, C., Fischer, J. C. & Munro, J. I.
01/05/2015 → 14/11/2018
Project: PhD