## Compressed and efficient algorithms and data structures for strings

Research output: Research › Ph.D. thesis – Annual report year: 2018

In this dissertation we study the design of efﬁcient algorithms, data structures, and protocols for computation on compressed data and manipulation of long sequences. We consider the following topics:

Time-Space Trade-Offs for Lempel–Ziv Compressed Indexing Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel–Ziv 1977 compression scheme. We obtain the following time-space trade-offs: For constant-sized alphabets

(i) O(m + occlglgn) time using O(z lg(n/z)lglgz) space, or

(ii) O(m(1 + lgε z lg(n/z)) + occ(lglgn + lgε z)) time using O(z lg(n/z)) space,

For integer alphabets polynomially bounded by n(iii) O(m(1 + lgε z lg(n/z)) + occ(lglgn + lgε z)) time using O(z(lg(n/z) + lglgz)) space, or(iv) O(m + occ(lglgn + lgε z)) time using O(z(lg(n/z) + lgε z)) space, where n and m are the length of the input string and query string respectively, z is the number of phrases in the LZ77 parse of the input string, occ is the number of occurrences of the query in the input and ε > 0 is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from O(mlgm) to O(m) at the cost of increasing the space by a factor lglgz. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lgε z lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n1−δ), for constant δ > 0, this becomes O(m). Our index also supports extraction of any substring of length ` in O(` + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak preﬁx search.

Compressed Indexing with Signature Grammars The compressed indexing problem is to preprocess a string S of length n into a compressed representation that supports pattern matching queries. That is, given a string P of length m report all occurrences of P in S. We present a data structure that supports pattern matching queries in O(m + occ(lglgn+lgε z)) time using O(z lg(n/z)) space where z is the size of the LZ77 parse of S and ε > 0 is an arbitrarily small constant, when the alphabet is small or z = O(n1−δ) for any constant δ > 0. We also present two data structures for the general case; one where the space is increased by O(z lglgz), and one where the query time changes from worst-case to expected. These results improve the previously best known solutions. Notably, this is the ﬁrst data structure that decides if P occurs in S in O(m) time using O(z lg(n/z)) space. Our results are mainly obtained by a novel combination of a randomized grammar construction algorithm with well known techniques relating pattern matching to 2D range reporting.

Decompressing Lempel-Ziv Compressed Text We consider the problem of decompressing the Lempel–Ziv 77 representation of a string S of length n using a working space as close as possible to the size z of the input. The folklore solution for the problem runs in O(n) time but requires random access to the whole decompressed text. A better solution is to convert LZ77 into a grammar of size O(z lg(n/z)) and then stream S in linear time. In this paper, we show that O(n) time and O(z) working space can be achieved for constant-size alphabets. On larger alphabets, we describe (i) a trade-off achieving O(nlgδ σ) time and O(z lg1−δ σ) space for any 0 ≤ δ ≤ 1 where σ is the size of the alphabet, and (ii) a solution achieving O(n) timeand O(z lglgn) space. Our solutions can, more generally, extract any speciﬁed subsequence of S with little overheads on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.

Compressed Communication Complexity of Longest Common Preﬁxes We consider the communication complexity of fundamental longest common preﬁx (LCP) problems. In the simplest version, two parties, Alice and Bob, each hold a string, A and B, and we want to determine the length of their longest common preﬁx ` = LCP(A,B) using as few rounds and bits of communication as possible. We show that if the longest common preﬁx of A and B is compressible, then we can signiﬁcantly reduce the number of rounds compared to the optimal uncompressed protocol, while achieving the same (or fewer) bits of communication. Namely, if the longest common preﬁx has an LZ77 parse of z phrases, only O(lgz) rounds and O(lg`) total communication is necessary. We extend the result to the natural case when Bob holds a set of strings B1,...,Bk, and the goal is to ﬁnd the length of the maximal longest preﬁx shared by A and any of B1,...,Bk. Here, we give a protocol with O(lgz) rounds and O(lgz lgk + lg`) total communication. We present our result in the public-coin model of computation but by a standard technique our results generalize to the private-coin model. Furthermore, if we view the input strings as integers the problems are the greater-than problem and the predecessor problem.

Fast Dynamic Arrays We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of n elements supporting access in time O(1) and insertion and deletion in time O(nε) for ε > 0 while using o(n) extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and multiset from the standard library on sequences with up to 108 elements. Our fastest implementation uses much less space than multiset while providing speedups of 40×for access operations compared to multiset and speedups of 10.000×compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations.

Time-Space Trade-Offs for Lempel–Ziv Compressed Indexing Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel–Ziv 1977 compression scheme. We obtain the following time-space trade-offs: For constant-sized alphabets

(i) O(m + occlglgn) time using O(z lg(n/z)lglgz) space, or

(ii) O(m(1 + lgε z lg(n/z)) + occ(lglgn + lgε z)) time using O(z lg(n/z)) space,

For integer alphabets polynomially bounded by n(iii) O(m(1 + lgε z lg(n/z)) + occ(lglgn + lgε z)) time using O(z(lg(n/z) + lglgz)) space, or(iv) O(m + occ(lglgn + lgε z)) time using O(z(lg(n/z) + lgε z)) space, where n and m are the length of the input string and query string respectively, z is the number of phrases in the LZ77 parse of the input string, occ is the number of occurrences of the query in the input and ε > 0 is an arbitrarily small constant. In particular, (i) improves the leading term in the query time of the previous best solution from O(mlgm) to O(m) at the cost of increasing the space by a factor lglgz. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lgε z lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n1−δ), for constant δ > 0, this becomes O(m). Our index also supports extraction of any substring of length ` in O(` + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak preﬁx search.

Compressed Indexing with Signature Grammars The compressed indexing problem is to preprocess a string S of length n into a compressed representation that supports pattern matching queries. That is, given a string P of length m report all occurrences of P in S. We present a data structure that supports pattern matching queries in O(m + occ(lglgn+lgε z)) time using O(z lg(n/z)) space where z is the size of the LZ77 parse of S and ε > 0 is an arbitrarily small constant, when the alphabet is small or z = O(n1−δ) for any constant δ > 0. We also present two data structures for the general case; one where the space is increased by O(z lglgz), and one where the query time changes from worst-case to expected. These results improve the previously best known solutions. Notably, this is the ﬁrst data structure that decides if P occurs in S in O(m) time using O(z lg(n/z)) space. Our results are mainly obtained by a novel combination of a randomized grammar construction algorithm with well known techniques relating pattern matching to 2D range reporting.

Decompressing Lempel-Ziv Compressed Text We consider the problem of decompressing the Lempel–Ziv 77 representation of a string S of length n using a working space as close as possible to the size z of the input. The folklore solution for the problem runs in O(n) time but requires random access to the whole decompressed text. A better solution is to convert LZ77 into a grammar of size O(z lg(n/z)) and then stream S in linear time. In this paper, we show that O(n) time and O(z) working space can be achieved for constant-size alphabets. On larger alphabets, we describe (i) a trade-off achieving O(nlgδ σ) time and O(z lg1−δ σ) space for any 0 ≤ δ ≤ 1 where σ is the size of the alphabet, and (ii) a solution achieving O(n) timeand O(z lglgn) space. Our solutions can, more generally, extract any speciﬁed subsequence of S with little overheads on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.

Compressed Communication Complexity of Longest Common Preﬁxes We consider the communication complexity of fundamental longest common preﬁx (LCP) problems. In the simplest version, two parties, Alice and Bob, each hold a string, A and B, and we want to determine the length of their longest common preﬁx ` = LCP(A,B) using as few rounds and bits of communication as possible. We show that if the longest common preﬁx of A and B is compressible, then we can signiﬁcantly reduce the number of rounds compared to the optimal uncompressed protocol, while achieving the same (or fewer) bits of communication. Namely, if the longest common preﬁx has an LZ77 parse of z phrases, only O(lgz) rounds and O(lg`) total communication is necessary. We extend the result to the natural case when Bob holds a set of strings B1,...,Bk, and the goal is to ﬁnd the length of the maximal longest preﬁx shared by A and any of B1,...,Bk. Here, we give a protocol with O(lgz) rounds and O(lgz lgk + lg`) total communication. We present our result in the public-coin model of computation but by a standard technique our results generalize to the private-coin model. Furthermore, if we view the input strings as integers the problems are the greater-than problem and the predecessor problem.

Fast Dynamic Arrays We present a highly optimized implementation of tiered vectors, a data structure for maintaining a sequence of n elements supporting access in time O(1) and insertion and deletion in time O(nε) for ε > 0 while using o(n) extra space. We consider several different implementation optimizations in C++ and compare their performance to that of vector and multiset from the standard library on sequences with up to 108 elements. Our fastest implementation uses much less space than multiset while providing speedups of 40×for access operations compared to multiset and speedups of 10.000×compared to vector for insertion and deletion operations while being competitive with both data structures for all other operations.

Original language | English |
---|

Number of pages | 100 |
---|---|

State | Submitted - 2018 |

Series | DTU Compute PHD-2018 |
---|---|

Volume | 490 |

ISSN | 0909-3192 |

### Projects

- Compressed Computation on Structured Data
Project: PhD

Download as:

ID: 152799981