Output-Sensitive Pattern Extraction in Sequences

Roberto Grossi, Giulia Menconi, Nadia Pisanti, Roberto Trani, Søren Juhl Vind

Research output: Chapter in Book/Report/Conference proceedingArticle in proceedingsResearchpeer-review

170 Downloads (Pure)

Abstract

Genomic Analysis, Plagiarism Detection, Data Mining, Intrusion Detection, Spam Fighting and Time Series Analysis are just some examples of applications where extraction of recurring patterns in sequences of objects is one of the main computational challenges. Several notions of patterns exist, and many share the common idea of strictly specifying some parts of the pattern and to don’t care about the remaining parts. Since the number of patterns can be exponential in the length of the sequences, pattern extraction focuses on statistically relevant patterns, where any attempt to further refine or extend them causes a loss of significant information (where the number of occurrences changes). Output-sensitive algorithms have been proposed to enumerate and list these patterns, taking polynomial time O(nc) per pattern for constant c > 1, which is impractical for massive sequences of very large length n.

We address the problem of extracting maximal patterns with at most k don’t care symbols and at least q occurrences. Our contribution is to give the first algorithm that attains a stronger notion of output-sensitivity, borrowed from the analysis of data structures: the cost is proportional to the actual number of occurrences of each pattern, which is at most n and practically much smaller than n in real applications, thus avoiding the aforementioned cost of O(nc) per pattern.
Original languageEnglish
Title of host publicationProceedings of the IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2014)
PublisherSchloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany
Publication date2014
Pages303-314
ISBN (Print)978-3-939897-77-4
DOIs
Publication statusPublished - 2014
Event34th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2014) - India International Centre, New Delhi, India
Duration: 15 Dec 201417 Dec 2014
Conference number: 34
http://www.fsttcs.org/index.php

Conference

Conference34th IARCS Annual Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS 2014)
Number34
LocationIndia International Centre
Country/TerritoryIndia
CityNew Delhi
Period15/12/201417/12/2014
Internet address
SeriesLeibniz International Proceedings in Informatics
Volume29
ISSN1868-8969

Bibliographical note

Will be published Open Access in the Leibniz International Proceedings in Informatics (LIPIcs)

Keywords

  • Pattern Extraction
  • Motif Detection
  • Pattern Discovery
  • Motif Trie

Fingerprint

Dive into the research topics of 'Output-Sensitive Pattern Extraction in Sequences'. Together they form a unique fingerprint.

Cite this