The problem of sequence identification or matching is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution is critical. To solve this problem, we propose a lossless compressed data structure for colored de Bruijn graphs, which can be regarded as a map from k-mers to their color sets. The color set of a k-mer is the collection of all the identifiers of the references in which that k-mer can be found. Our solutions exploit the repetitiveness of the color sets when indexing large collections of related genomes, extracting repeating patterns and encoding them once, instead of redundantly replicating their representation. Experimental results show that these representations substantially improve over the space effectiveness of the best previous solutions while impacting only marginally the efficiency of the queries.

Differential compression for colored de Bruijn graphs

Campanelli, Alessio
2024/2025

Abstract

The problem of sequence identification or matching is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution is critical. To solve this problem, we propose a lossless compressed data structure for colored de Bruijn graphs, which can be regarded as a map from k-mers to their color sets. The color set of a k-mer is the collection of all the identifiers of the references in which that k-mer can be found. Our solutions exploit the repetitiveness of the color sets when indexing large collections of related genomes, extracting repeating patterns and encoding them once, instead of redundantly replicating their representation. Experimental results show that these representations substantially improve over the space effectiveness of the best previous solutions while impacting only marginally the efficiency of the queries.
File in questo prodotto:
File Dimensione Formato  
878170-1292320.pdf

accesso aperto

Tipologia: Altro materiale allegato
Dimensione 3.81 MB
Formato Adobe PDF
3.81 MB Adobe PDF Visualizza/Apri

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14247/23303