TY - GEN

T1 - Time-space trade-offs for Lempel-Ziv compressed indexing

AU - Bille, Philip

AU - Ettienne, Mikko Berggren

AU - Gørtz, Inge Li

AU - Vildhøj, Hjalte Wedel

PY - 2017

Y1 - 2017

N2 - Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let n and z denote the sizes of the input string and the LZ77-compressed string, respectively. We obtain the following time-space trade-offs. Given a pattern string P of length m, we can solve the problem in (i) O(m + occ lg lg n) time using O(z lg(n/z) lg lg z) space, or (ii) O(m(1 + lg^ϵ z/lg(n/z)) + occ(lg lg n + lg^ϵ z)) time using O(z lg(n/z)) space, for any 0 < ϵ < 1. In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lg lg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lg^ϵ z/lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^(1-δ)) for constant δ > 0, this becomes O(m). Our index also supports extraction of any substring of length ℓ in O(ℓ + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.

AB - Given a string S, the compressed indexing problem is to preprocess S into a compressed representation that supports fast substring queries. The goal is to use little space relative to the compressed size of S while supporting fast queries. We present a compressed index based on the Lempel-Ziv 1977 compression scheme. Let n and z denote the sizes of the input string and the LZ77-compressed string, respectively. We obtain the following time-space trade-offs. Given a pattern string P of length m, we can solve the problem in (i) O(m + occ lg lg n) time using O(z lg(n/z) lg lg z) space, or (ii) O(m(1 + lg^ϵ z/lg(n/z)) + occ(lg lg n + lg^ϵ z)) time using O(z lg(n/z)) space, for any 0 < ϵ < 1. In particular, (i) improves the leading term in the query time of the previous best solution from O(m lg m) to O(m) at the cost of increasing the space by a factor lg lg z. Alternatively, (ii) matches the previous best space bound, but has a leading term in the query time of O(m(1 + lg^ϵ z/lg(n/z))). However, for any polynomial compression ratio, i.e., z = O(n^(1-δ)) for constant δ > 0, this becomes O(m). Our index also supports extraction of any substring of length ℓ in O(ℓ + lg(n/z)) time. Technically, our results are obtained by novel extensions and combinations of existing data structures of independent interest, including a new batched variant of weak prefix search.

KW - Information Sources and Analysis

KW - Social Sciences

KW - Compressed indexing

KW - LZ77

KW - Pattern matching

KW - Prefix search

KW - Commerce

KW - Indexing (of information)

KW - Compression scheme

KW - Input string

KW - Leading terms

KW - Pattern strings

KW - Space bounds

KW - Time-space

KW - Economic and social effects

U2 - 10.4230/LIPIcs.CPM.2017.16

DO - 10.4230/LIPIcs.CPM.2017.16

M3 - Article in proceedings

SN - 9783959770392

T3 - Leibniz International Proceedings in Informatics

BT - Proceedings of the 28th Annual Symposium on Combinatorial Pattern Matching

PB - Schloss Dagstuhl - Leibniz-Zentrum für Informatik

T2 - 28th Annual Symposium on Combinatorial Pattern Matching

Y2 - 4 July 2017 through 6 July 2017

ER -