This repository computes insertion, deletion, and substitution rates simultaneously between two input genomes under a non-simple mutation model.
To be added
To be added
The code in this repository does the following:
- It first extracts k-mers from both input genome files
- For each input, the program builds the unitigs from k-mers using cuttlefish. Let the set of unitigs built from the two genomes be
$U_1$ and$U_2$ - For a unitig
$u$ in$U_1$ , we align it to all unitigs in$U_2$ . The alignment can be expensive for long unitigs. To make this feasible, we split long unitigs into smaller unitigs of length 5k - Next, for a unitig
$u$ in$U_1$ , we align it to all unitigs in$U_2$ using edlib. We get the best alignment using alignment score, and count the number of k-mers using single substitution, insertion, and deletion - These counts are used to solve our estimators for the three mutation rates
conda create -n <environment_name> --file requirements.txt -c conda-forge -c bioconda -y
conda activate <environment_name>
python main.py <orig_genome_filename (str)> <mutated_genome_filename (str)> <ksize (int)> --num_threads <num_threads (int)>
The script should compute and print the estimates for mutation rates using two ways.