This thesis introduces a novel compression technique for repetitive tries, which are fundamental data structures for representing large sets of strings in applications like bioinformatics and natural language processing. While tries are efficient for prefix-based queries, their inherent repetitiveness often leads to a large memory footprint. Traditional compression algorithms often fail to exploit this redundancy, and even specialized techniques like the Extended Burrows-Wheeler Transform may not be optimal in such cases. The proposed method aims to find an effective trade-off between high compression and efficient indexing. The core idea is to identify and merge identical subtrees by reducing the trie to a minimal deterministic finite automaton (DFA). However, a fully minimized DFA is difficult to index efficiently. Therefore, this work proposes a partial minimization of the trie that ensures the resulting automaton is p-sortable, a property that allows for efficient indexing. The problem is framed as a String Partitioning Problem, where the sequence of nodes in the trie, ordered co-lexicographically, is partitioned into p subsequences to maximize the merging of equivalent states. This optimization problem is then reduced to finding a Minimum Weight Perfect Bipartite Matching, which can be solved efficiently. The result is a compressed, p-sortable automaton that supports efficient queries. Experimental results demonstrate the effectiveness of this method, particularly for highly repetitive datasets, achieving a balance of compression and indexability that prior methods could not attain.

A New Compression Technique for Repetitive Tries

TONETTO, DAVIDE
2024/2025

Abstract

This thesis introduces a novel compression technique for repetitive tries, which are fundamental data structures for representing large sets of strings in applications like bioinformatics and natural language processing. While tries are efficient for prefix-based queries, their inherent repetitiveness often leads to a large memory footprint. Traditional compression algorithms often fail to exploit this redundancy, and even specialized techniques like the Extended Burrows-Wheeler Transform may not be optimal in such cases. The proposed method aims to find an effective trade-off between high compression and efficient indexing. The core idea is to identify and merge identical subtrees by reducing the trie to a minimal deterministic finite automaton (DFA). However, a fully minimized DFA is difficult to index efficiently. Therefore, this work proposes a partial minimization of the trie that ensures the resulting automaton is p-sortable, a property that allows for efficient indexing. The problem is framed as a String Partitioning Problem, where the sequence of nodes in the trie, ordered co-lexicographically, is partitioned into p subsequences to maximize the merging of equivalent states. This optimization problem is then reduced to finding a Minimum Weight Perfect Bipartite Matching, which can be solved efficiently. The result is a compressed, p-sortable automaton that supports efficient queries. Experimental results demonstrate the effectiveness of this method, particularly for highly repetitive datasets, achieving a balance of compression and indexability that prior methods could not attain.
File in questo prodotto:
File Dimensione Formato  
Tesi_Magistrale_Davide_Tonetto.pdf

embargo fino al 06/11/2026

Dimensione 1.97 MB
Formato Adobe PDF
1.97 MB Adobe PDF

I documenti in UNITESI sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14247/27026