Tight Bounds for Compressing Substring Samples

Philip Bille, Christian Mikkelsen Fuglsang, Inge Li Gørtz

Research output: Contribution to journalConference articleResearchpeer-review

4 Downloads (Pure)

Abstract

We consider the problem of compressing a set of substrings sampled from a string and analyzing the size of the compression. Given a string S of length n, and integers d and m where n m ≥ 2d > 0, let SCS(S, m, d) be the string obtained by sequentially concatenating substrings of length m sampled regularly at intervals of d starting at position 1 in S. We consider the size of the LZ77 parsing of SCS(S, m, d), in relation to the size of the LZ77 parsing of S. This is motivated by genome sequencing, where the mentioned sampling process is an idealization of the short-read DNA sequencing. We show the following upper bound:
|LZ77(SCS(S, m, d))| ≤ |LZ77(S)| + 2 (n −m)/d.
We also give a lower bound showing that this is tight. This improves previous results by Badkobeh et al. [ICTCS 2022], and closes the open problem of whether their bound can be improved. Another natural question is whether assuming that all letters in S are part of a sample, it is always the case that |LZ77(S)| ≤ |LZ77(SCS(S, m, d))|. Surprisingly, we show that there is a family of strings such that |LZ77(SCS(S, m, d))| = |LZ77(S)| − 1.
Original languageEnglish
JournalLeibniz International Proceedings in Informatics, LIPIcs
Volume296
Pages (from-to)9:1-9:14
ISSN1868-8969
DOIs
Publication statusPublished - 2024
Event35th Annual Symposium on Combinatorial Pattern Matching - Fukuoka, Japan
Duration: 25 Jun 202427 Jun 2024

Conference

Conference35th Annual Symposium on Combinatorial Pattern Matching
Country/TerritoryJapan
CityFukuoka
Period25/06/202427/06/2024

Keywords

  • Algorithms
  • Compression
  • Lempel-Ziv

Fingerprint

Dive into the research topics of 'Tight Bounds for Compressing Substring Samples'. Together they form a unique fingerprint.

Cite this