Topics in combinatorial pattern matching

Hjalte Wedel Vildhøj

Research output: Book/ReportPh.D. thesis

598 Downloads (Pure)

Abstract

This dissertation studies problems in the general theme of combinatorial pattern matching. More specifically, we study the following topics:

Longest Common Extensions. We revisit the longest common extension (LCE) problem, that is, preprocess a string T into a compact data structure that supports fast LCE queries. An LCE query takes a pair (i, j) of indices in T and returns the length of the longest common prefix of the suffixes of T starting at positions i and j. Such queries are also commonly known as longest common prefix (LCP) queries. We study the time-space trade-offs for the problem, that is, the space used for the data structure vs. the worst-case time for answering an LCE query. Let n be the length of T. Given a parameter τ , 1 ≤ τ ≤ n, we show how to achieve either O(n/√τ ) space and O(τ) query time, or O(n/τ) space and O(τ log (|LCE(i, j)|/τ)) query time, where ?LCE(i, j)| denotes the length of the LCE returned by the query. These bounds provide the first smooth trade-offs for the LCE problem and almost match the previously known bounds at the extremes when τ = 1 or τ = n. We apply the result to obtain improved bounds for several applications where the LCE problem is the computational bottleneck, including approximate string matching and computing palindromes. We also present an efficient technique to reduce LCE queries on two strings to one string. Finally, we give a lower bound on the time-space product for LCE data structures in the non-uniform cell probe model showing that our second trade-off is nearly optimal.

Fingerprints in Compressed Strings. The Karp-Rabin fingerprint of a string is a type of hash value that due to its strong properties has been used in many string algorithms. We show how to construct a data structure for a string S of size N compressed by a context-free grammar of size n that supports fingerprint queries. That is, given indices i and j, the answer to a query is the fingerprint of the substring S[i, j]. We present the first O(n) space data structures that answer fingerprint queries without decompressing any characters. For Straight Line Programs (SLP) we get O(log N) query time, and
for Linear SLPs (an SLP derivative that captures LZ78 compression and its variations) we get O (log log N) query time. Hence, our data structures has the same time and space complexity as for random access in SLPs. We utilize the fingerprint data structures to solve the longest common extension problem in query time O(log N log e) and O(log e log log e+log log N) for SLPs and Linear SLPs, respectively. Here, e = |LCE(i, j)| denotes the length of the LCE.

Sparse Text Indexing. We present efficient algorithms for constructing sparse suffix trees, sparse suffix arrays and sparse positions heaps for b arbitrary positions of a text T of length n while using only O(b) words of space during the construction. Our main contribution is to show that the sparse suffix tree (and array) can be constructed in O(n log² b) time. To achieve this we develop a technique, that allows to efficiently answer b longest common prefix queries on suffixes of T, using only O(b) space. Our first solution is Monte-Carlo and outputs the correct tree with high probability. We then give a Las-Vegas algorithm which also uses O(b) space and runs in the same time bounds with high probability when b = O(√n). Furthermore, additional tradeoffs between the space usage and the construction time for the Monte-Carlo algorithm are given. Finally, we show that at the expense of slower pattern queries, it is possible to construct sparse position heaps in O(n + b log b) time and O(b) space.

The Longest Common Substring Problem. Given m documents of total length n, we consider the problem of finding a longest string common to at least d ≥ 2 of the documents. This problem is known as the longest common substring (LCS) problem and has a classic O(n) space and O(n) time solution (Weiner [FOCS’73], Hui [CPM’92]). However, the use of linear space is impractical in many applications. We show several time-space trade-offs for this problem. Our main result is that for any trade-off parameter 1 ≤ τ ≤ n, the LCS problem can be solved in O(τ) space and O(n²/τ) time, thus providing the first smooth deterministic time-space trade-off from constant to linear space. The result uses a new and very simple algorithm, which computes a τ-additive approximation to the LCS
in O(n²/τ) time and O(1) space. We also show a time-space trade-off lower bound for deterministic branching programs, which implies that any deterministic RAM algorithm solving the LCS problem on documents from a sufficiently large alphabet in O(τ) space must use Ω (n√log(n/(τ log n))/ log log(n/(τ log n)) time.

Structural Properties of Suffix Trees. We study structural and combinatorial properties of suffix trees. Given an unlabeled tree T on n nodes and suffix links of its internal nodes, we ask the question “Is T a suffix tree?”, i.e., is there a string S whose suffix tree has the same topological structure as T ? We place no restrictions on S, in particular we do not require that S ends with a unique symbol. This corresponds to considering the more general definition of implicit or extended suffix trees. Such general suffix trees have many applications and are for example needed to allow efficient updates when suffix trees are built online. We prove that T is a suffix tree if and only if it is realized by a string S of length n – 1, and we give a linear-time algorithm for inferring S when the first letter on each edge is known.
Original languageEnglish
Place of PublicationKgs. Lyngby
PublisherTechnical University of Denmark
Number of pages147
Publication statusPublished - 2015
SeriesDTU Compute PHD-2014
Number348
ISSN0909-3192

Fingerprint

Dive into the research topics of 'Topics in combinatorial pattern matching'. Together they form a unique fingerprint.

Cite this