The Complexity of the Co-Occurrence Problem

Philip Bille, Inge Li Gørtz, Tord Joakim Stordalen

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

30 Downloads (Orbit)

Abstract

Let S be a string of length n over an alphabet Σ and let Q be a subset of Σ of size q≥2. The co-occurrence problem is to construct a compact data structure that supports the following query: given an integer w return the number of length-w substrings of S that contain each character of Q at least once. This is a natural string problem with applications to, e.g., data mining, natural language processing, and DNA analysis. The state of the art is an O(nq−−√) space data structure that—with some minor additions—supports queries in O(loglogn) time [CPM 2021].
Our contributions are as follows. Firstly, we analyze the problem in terms of a new, natural parameter d, giving a simple data structure that uses O(d) space and supports queries in O(loglogn) time. The preprocessing algorithm does a single pass over S, runs in expected O(n) time, and uses O(d+q) space in addition to the input. Furthermore, we show that O(d) space is optimal and that O(loglogn)-time queries are optimal given optimal space. Secondly, we bound d=O(nq−−√), giving clean bounds in terms of n and q that match the state of the art. Furthermore, we prove that Ω(nq−−√) bits of space is necessary in the worst case, meaning that the O(nq−−√) upper bound is tight to within polylogarithmic factors. All of our results are based on simple and intuitive combinatorial ideas that simplify the state of the art.
Original languageEnglish
Title of host publicationString Processing and Information Retrieval
PublisherSpringer
Publication date2022
Pages38–52
ISBN (Print)978-3-031-20642-9
DOIs
Publication statusPublished - 2022
Event29th International Symposium on String Processing and Information Retrieval - Universidad de Concepción, Concepción, Chile
Duration: 8 Nov 202210 Nov 2022
Conference number: 29

Conference

Conference29th International Symposium on String Processing and Information Retrieval
Number29
LocationUniversidad de Concepción
Country/TerritoryChile
CityConcepción
Period08/11/202210/11/2022
SeriesLecture Notes in Computer Science
Volume13617
ISSN0302-9743

Fingerprint

Dive into the research topics of 'The Complexity of the Co-Occurrence Problem'. Together they form a unique fingerprint.

Cite this