Sparse Regular Expression Matching

Research output: Chapter in Book/Report/Conference proceeding > Article in proceedings > Research > peer-review


A regular expression specifies a set of strings formed by single characters combined with concatenation, union, and Kleene star operators. Given a regular expression R and a string Q, the regular expression matching problem is to decide if Q matches any of the strings specified by R. Regular expressions are a fundamental concept in formal languages, and regular expression matching is a basic primitive for searching and processing data. A standard textbook solution [Thompson, CACM 1968] constructs and simulates a nondeterministic finite automaton (NFA), leading to an O(nm) time algorithm, where n is the length of Q and m is the length of R. Despite considerable research efforts, only polylogarithmic improvements of this bound are known. Recently, conditional lower bounds provided evidence for this lack of progress when Backurs and Indyk [FOCS 2016] proved that, assuming the strong exponential time hypothesis (SETH), regular expression matching cannot be solved in O((nm)^{1−ϵ}) time, for any constant ϵ > 0. Hence, the complexity of regular expression matching is essentially settled in terms of n and m. In this paper, we take a new approach and introduce a density parameter, ∆, that captures the amount of nondeterminism in the NFA simulation on Q. The density is at most nm + 1 but can be significantly smaller. Our main result is a new algorithm that solves regular expression matching in (equation presented) time. This essentially replaces nm with ∆ in the complexity of regular expression matching. We complement our upper bound by a matching conditional lower bound that proves that we cannot solve regular expression matching in time O(∆^{1−ϵ}) for any constant ϵ > 0 assuming SETH. The key technical contribution in the result is a new linear space representation of the classic position automaton that supports fast state-set transition computation in near-linear time in the size of the input and output state sets.
To achieve this, we develop several new insights and techniques of independent interest, including new structural properties of the parse trees of regular expressions, a decomposition of state-set transitions based on parse trees, and a fast batched predecessor data structure.
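
To make the density parameter concrete, the following is a minimal Python sketch of the textbook state-set simulation on a position (Glushkov) automaton. It is not the algorithm of the paper. It assumes, as one reading of the abstract consistent with the stated nm + 1 bound, that ∆ is the total size of the state sets visited while reading Q, counting the initial one-state set. The tuple-based syntax tree and the names `build_position_automaton` and `match_with_density` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def build_position_automaton(ast):
    """Glushkov construction over a tuple AST (hypothetical encoding):
    ("chr", c), ("cat", a, b), ("alt", a, b), ("star", a)."""
    symbols = {}                # position -> character at that position
    follow = defaultdict(set)   # position -> positions that may follow it

    def visit(node):
        kind = node[0]
        if kind == "chr":
            p = len(symbols) + 1        # positions are numbered 1..m
            symbols[p] = node[1]
            return False, {p}, {p}      # (nullable, first, last)
        if kind == "star":
            _, first, last = visit(node[1])
            for p in last:              # loop back from last to first
                follow[p] |= first
            return True, first, last
        null_a, first_a, last_a = visit(node[1])
        null_b, first_b, last_b = visit(node[2])
        if kind == "alt":
            return null_a or null_b, first_a | first_b, last_a | last_b
        # concatenation: every last position of a is followed by first of b
        for p in last_a:
            follow[p] |= first_b
        first = first_a | first_b if null_a else first_a
        last = last_a | last_b if null_b else last_b
        return null_a and null_b, first, last

    nullable, first, last = visit(ast)
    return symbols, first, last, follow, nullable

def match_with_density(ast, text):
    """Naive state-set simulation. Returns (accepted, density), where
    density is the assumed definition: the sum of all state-set sizes."""
    symbols, first, last, follow, nullable = build_position_automaton(ast)
    START = 0                   # the single non-position start state
    states = {START}
    density = 1                 # the initial state set has size 1
    for c in text:
        nxt = set()
        for s in states:
            candidates = first if s == START else follow[s]
            nxt.update(p for p in candidates if symbols[p] == c)
        states = nxt
        density += len(states)
    accepted = (START in states and nullable) or bool(states & last)
    return accepted, density

# R = (a|b)*abb has m = 5 positions.
ABB = ("cat", ("cat", ("cat", ("star", ("alt", ("chr", "a"), ("chr", "b"))),
                       ("chr", "a")), ("chr", "b")), ("chr", "b"))
print(match_with_density(ABB, "aabb"))   # (True, 9); here nm + 1 = 21
```

In this sketch each step scans whole follow sets, so the running time is not bounded by the density. The abstract's contribution is a representation of the position automaton in which a state-set transition costs near-linear time in the sizes of the input and output state sets.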

Original language: English
Title of host publication: Proceedings of the 2024 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA)
Publisher: Society for Industrial and Applied Mathematics
Publication date: 2024
ISBN (Electronic): 978-1-61197-791-2
Publication status: Published - 2024
Event: 2024 Annual ACM-SIAM Symposium on Discrete Algorithms - Alexandria, United States
Duration: 7 Jan 2024 to 10 Jan 2024


Conference: 2024 Annual ACM-SIAM Symposium on Discrete Algorithms
Country/Territory: United States

