Hierarchical Compression of Highly-Repetitive Data

Project Details

Layman's description

Thanks to advances in DNA sequencing techniques, datasets of DNA genomes have become extremely abundant, and have grown incredibly in size. To give us a quick impression, just a single human genome requires 3GB of space to be stored. Genomic databases contain thousands of genomes from humans, bacteria, viruses and other organisms, that scientists use to identify diseases, study evolution, and more. The computer tools that are needed to perform these jobs now need to deal with massive data sets of growing size, which are costly to store, and computationally expensive to work with.

Compressing the DNA data is a necessary step towards reducing the cost of storing the data, transmitting it over the internet, and computing with the data. We can use standard data compression algorithms for this task, however DNA data are not random sequences: different genomes will contain many similarities between them if they come from the same species, or from species who have evolved similarly from a close evolutionary parent. Taking advantage of the repeating patterns of DNA is crucial to achieve good compression ratios for a genomic database.

One of the currently used compression methods is called relative compression. Given several sequences that need to be compressed, one of them is taken as the reference, and the rest of the sequences are compressed relative to it. In a popular algorithm called Lempel-Ziv, the compression is done by replacing subsequences of the reference sequence in the other sequences by the position and the length of the subsequence in the reference. For example, if we have two sequences "AAACG" and "GTAAAGAAATAA", we can take the first one as reference, and compress the second one with "GT(1, 3)G(1, 3)T(1, 2)", where for example (1, 3) means the subsequence of length 3 at the first position of the reference.

To improve on relative compression, instead of choosing only one reference, hierarchical relative compression clusters the data into groups of similar sequences. Inside each cluster, a reference sequence is chosen, and relative compression is applied, for example using Lempel-Ziv. Then, the multiple reference sequences of the clusters are again compressed, building a hierarchy of references. This can be layered multiple times. Hierarchical compression promises to exploit better the similarities between sequences in a diverse genomic database, and provide better compression rates.

If we develop successful hierarchical compression methods for DNA data, the resulting computer tools will be greatly helpful for scientists studying DNA all around the world. It will reduce the rising data storing costs of genomic research projects, improve the efficiency of the computerized tools for analyzing DNA data, and accelerate network transfers of big amounts of sequences.
StatusActive
Effective start/end date01/10/202330/09/2026

Fingerprint

Explore the research topics touched on by this project. These labels are generated based on the underlying awards/grants. Together they form a unique fingerprint.