Abstract
Indexing highly repetitive texts | such as genomic databases, software repositories and versioned text collections | has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transform (BWT). One of the earliest indexes for repetitive collections, the Run-Length FMindex, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in the text (in loglogarithmic time per pattern symbol, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. Since then, a number of other indexes with space bounded by other measures of repetitiveness | the number of phrases in the LempelZiv parse, the size of the smallest grammar generating the text, the size of the smallest automaton recognizing the text factors | have been proposed for efficiently locating, but not directly counting, the occurrences of a pattern. In this paper we close this long-standing problem, showing how to extend the Run-Length FMindex so that it can locate the occ occurrences efficiently within O(r) space (in loglogarithmic time each), and reaching optimal time O(m+occ) within O(r log(n=r)) space, on a RAM machine with words of w = (log n) bits. Raising the space to O(rw logσ (n=r)), we support locate in O(mlog(σ)=w+occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n=r)) space that replaces the text and efficiently extracts any text substring, with an O(log(n=r)) additive time penalty over the optimum. Preliminary experiments show that our new structure outperforms the alternatives by orders of magnitude in the space/time tradeoff map.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the Annual Acm-siam Symposium on Discrete Algorithms |
| Volume | 133646 |
| Publisher | Society for Industrial and Applied Mathematics |
| Publication date | 2018 |
| Pages | 1459-1477 |
| ISBN (Print) | 9781611975031 |
| DOIs | |
| Publication status | Published - 2018 |
| Event | 29th Annual ACM-SIAM Symposium on Discrete Algorithms - Astor Crowne Plaze, New Orleans French Quarter , New Orleans, United States Duration: 7 Jan 2018 → 10 Jan 2018 Conference number: 29 |
Conference
| Conference | 29th Annual ACM-SIAM Symposium on Discrete Algorithms |
|---|---|
| Number | 29 |
| Location | Astor Crowne Plaze, New Orleans French Quarter |
| Country/Territory | United States |
| City | New Orleans |
| Period | 07/01/2018 → 10/01/2018 |
| Sponsor | Association for Computing Machinery, Society for Industrial and Applied Mathematics |
| Series | Proceedings of the Annual Acm-siam Symposium on Discrete Algorithms |
|---|