Optimal-time text indexing in BWT-runs bounded space

Travis Gagie, Gonzalo Navarro, Nicola Prezza

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

Abstract

Indexing highly repetitive texts | such as genomic databases, software repositories and versioned text collections | has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FMindex, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness | the number of phrases in the LempelZiv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors | have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FMindex so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time O(m+occ) within O(r log(n=r)) space, on a RAM machine with words of w = (log n) bits. Raising the space to O(rw logσ (n=r)), we support locate in O(mlog(σ)=w+occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n=r)) space that replaces the text and efficiently extracts any text substring, with an O(log(n=r)) additive time penalty over the optimum. Preliminary experiments show that our new structure outperforms the alternatives by orders of magnitude in the space/time tradeoff map.
Original languageEnglish
Title of host publicationProceedings of the Annual Acm-siam Symposium on Discrete Algorithms
Volume133646
PublisherSociety for Industrial and Applied Mathematics
Publication date2018
Pages1459-1477
ISBN (Print)9781611975031
DOIs
Publication statusPublished - 2018
Event29th Annual ACM-SIAM Symposium on Discrete Algorithms - Astor Crowne Plaze, New Orleans French Quarter , New Orleans , United States
Duration: 7 Jan 201810 Jan 2018
Conference number: 29

Conference

Conference29th Annual ACM-SIAM Symposium on Discrete Algorithms
Number29
LocationAstor Crowne Plaze, New Orleans French Quarter
CountryUnited States
CityNew Orleans
Period07/01/201810/01/2018
SeriesProceedings of the Annual Acm-siam Symposium on Discrete Algorithms

Fingerprint Dive into the research topics of 'Optimal-time text indexing in BWT-runs bounded space'. Together they form a unique fingerprint.

Cite this