On the Approximation Ratio of Lempel-Ziv Parsing

Travis Gagie, Gonzalo Navarro, Nicola Prezza

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

81 Downloads (Pure)

Abstract

Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that $$z=O(b\log (n/b))$$ z = O ( b log ( n / b ) ) , where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where $$z = \varOmega (b\log n)$$ z = Ω ( b log n ) . Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b with r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. On our way, we prove other relevant bounds between compressibility measures.
Original languageEnglish
Title of host publicationLatin 2018: Theoretical Informatics
Volume10807
PublisherSpringer
Publication date2018
Pages490-503
ISBN (Print)9783319774039
DOIs
Publication statusPublished - 2018
Event13th Latin American Theoretical INformatics Symposium - Cultural Center Borges, Buenos Aires, Argentina
Duration: 16 Apr 201819 Apr 2018
Conference number: 13
https://latin2018.dc.uba.ar/#

Conference

Conference13th Latin American Theoretical INformatics Symposium
Number13
LocationCultural Center Borges
CountryArgentina
CityBuenos Aires
Period16/04/201819/04/2018
Internet address
SeriesLecture Notes in Computer Science
Volume10807
ISSN0302-9743

Keywords

  • Computer Science
  • Algorithm Analysis and Problem Complexity
  • Computer Systems Organization and Communication Networks
  • Mathematics of Computing
  • Data Structures
  • Computer Graphics
  • Artificial Intelligence (incl. Robotics)

Cite this

Gagie, T., Navarro, G., & Prezza, N. (2018). On the Approximation Ratio of Lempel-Ziv Parsing. In Latin 2018: Theoretical Informatics (Vol. 10807, pp. 490-503). Springer. Lecture Notes in Computer Science, Vol.. 10807 https://doi.org/10.1007/978-3-319-77404-6_36