Abstract
Indexing highly repetitive texts | such as genomic databases, software repositories and versioned text collections | has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FMindex, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness | the number of phrases in the LempelZiv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors | have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FMindex so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time O(m+occ) within O(r log(n=r)) space, on a RAM machine with words of w = (log n) bits. Raising the space to O(rw logσ (n=r)), we support locate in O(mlog(σ)=w+occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n=r)) space that replaces the text and efficiently extracts any text substring, with an O(log(n=r)) additive time penalty over the optimum. Preliminary experiments show that our new structure outperforms the alternatives by orders of magnitude in the space/time tradeoff map.
Original language | English |
---|---|
Title of host publication | Proceedings of the Annual Acm-siam Symposium on Discrete Algorithms |
Volume | 133646 |
Publisher | Society for Industrial and Applied Mathematics |
Publication date | 2018 |
Pages | 1459-1477 |
ISBN (Print) | 9781611975031 |
DOIs | |
Publication status | Published - 2018 |
Event | 29th Annual ACM-SIAM Symposium on Discrete Algorithms - Astor Crowne Plaze, New Orleans French Quarter , New Orleans, United States Duration: 7 Jan 2018 → 10 Jan 2018 Conference number: 29 |
Conference
Conference | 29th Annual ACM-SIAM Symposium on Discrete Algorithms |
---|---|
Number | 29 |
Location | Astor Crowne Plaze, New Orleans French Quarter |
Country/Territory | United States |
City | New Orleans |
Period | 07/01/2018 → 10/01/2018 |
Sponsor | Association for Computing Machinery, Society for Industrial and Applied Mathematics |
Series | Proceedings of the Annual Acm-siam Symposium on Discrete Algorithms |
---|